In this notebook, we'll focus on feature extraction from the two text columns, i.e. title and body, so that the data set will be ready for model training.

In [1]:
import os

project_dir = os.getcwd()
data_dir = os.path.join(project_dir, "data")

In [2]:
import pandas as pd
from tqdm import tqdm

pd.options.display.max_colwidth = 255
tqdm.pandas()

In [3]:
df = pd.read_pickle(f"{data_dir}/preprocessed.pkl")

In [4]:
df.head()

Unnamed: 0,id,title_tokenized,body_tokenized,tags
0,80,"[multiple, queries, one, statement]","[written, database, generation, script, SQL, and, want, execute, Adobe, AIR, application, Create, Table, tRole, roleID, integer, Primary, Key, roleName, varchar, Create, Table, tFile, fileID, integer, Primary, Key, fileName, varchar, fileDescription, ...","[flex, actionscript-3, air]"
1,90,"[Good, branching, and, merging, tutorials, for, TortoiseSVN]","[Are, there, any, really, good, tutorials, explaining, branching, and, merging, with, Apache, Subversion, All, the, better, specific, TortoiseSVN, client]","[svn, tortoisesvn, branch, branching-and-merging]"
2,120,"[Site, Maps]","[Has, anyone, got, experience, creating, providers, got, the, default, XML, file, working, properly, with, Menu, and, SiteMapPath, controls, but, need, way, for, the, users, site, create, and, modify, pages, dynamically, need, tie, page, viewing, perm...","[sql, asp.net, sitemap]"
3,180,"[Function, for, creating, color, wheels]","[This, something, many, times, and, never, quite, found, solution, That, stuck, with, The, problem, come, with, way, generate, colors, that, are, distinguishable, possible, where, parameter]","[algorithm, language-agnostic, colors, color-space]"
4,260,"[Adding, scripting, functionality, applications]","[have, little, game, written, C, #, uses, database, trading, card, game, and, wanted, implement, the, function, the, cards, script, What, mean, that, essentially, have, interface, ICard, which, card, class, implements, public, class, ICard, and, which...","[c#, .net, scripting, compiler-construction]"


### Number of tags (i.e. classes)

In [5]:
from collections import Counter

tag_count = Counter()

def count_tag(tags):
    for tag in tags:
        tag_count[tag] += 1

df["tags"].apply(count_tag)

len(tag_count.values())

38146

As there are over 38,000 tags in the dataset, which is too much for a multi-label classification, I can only keep data with the top 20 tags (which will cover ~30% of the questions)

In [6]:
most_common_tags = [count[0] for count in tag_count.most_common(20)]
df["tags"] = df["tags"].progress_apply(lambda tags: [tag for tag in tags if tag in most_common_tags])

100%|████████████████████████████████████████████████████████████████████| 1264216/1264216 [00:04<00:00, 276116.86it/s]


In [7]:
df[df["tags"].map(lambda tags: len(tags) > 0)].shape

(850988, 4)

In [8]:
print(f"Only {1264216 - 850988:,} rows of data will be dropped while number of classes is reduced from {len(tag_count.values()):,} to 20!")

Only 413,228 rows of data will be dropped while number of classes is reduced from 38,146 to 20!


In [9]:
df = df[df["tags"].map(lambda tags: len(tags) > 0)]

### Feature vector generation using TfidfVectorizer

#### Lowecasing before feature generation

In [11]:
def lowercase(words):
    words_filtered = []
    for word in words:
        words_filtered.append(word.lower())
    return words_filtered

In [12]:
df["body_tokenized"] = df["body_tokenized"].progress_apply(lowercase)

100%|████████████████████████████████████████████████████████████████████████| 850988/850988 [01:37<00:00, 8766.90it/s]


In [13]:
df["title_tokenized"] = df["title_tokenized"].progress_apply(lowercase)

100%|████████████████████████████████████████████████████████████████████████| 850988/850988 [07:01<00:00, 2018.93it/s]


In [14]:
df.head()

Unnamed: 0,id,title_tokenized,body_tokenized,tags
2,120,"[site, maps]","[has, anyone, got, experience, creating, providers, got, the, default, xml, file, working, properly, with, menu, and, sitemappath, controls, but, need, way, for, the, users, site, create, and, modify, pages, dynamically, need, tie, page, viewing, perm...","[sql, asp.net]"
4,260,"[adding, scripting, functionality, applications]","[have, little, game, written, c, #, uses, database, trading, card, game, and, wanted, implement, the, function, the, cards, script, what, mean, that, essentially, have, interface, icard, which, card, class, implements, public, class, icard, and, which...","[c#, .net]"
5,330,"[should, use, nested, classes, this, case]","[working, collection, classes, used, for, video, playback, and, recording, have, one, main, class, which, acts, like, the, public, interface, with, methods, like, play, stop, pause, record, etc, then, have, workhorse, classes, which, the, video, decod...",[c++]
6,470,"[homegrown, consumption, web, services]","[been, writing, few, web, services, for, .net, app, now, ready, consume, them, seen, numerous, examples, where, there, homegrown, code, for, consuming, the, service, opposed, using, the, auto, generated, methods, visual, studio, creates, when, adding,...",[.net]
8,650,"[automatically, update, version, number]","[would, like, the, version, property, application, incremented, for, each, build, but, not, sure, how, enable, this, functionality, visual, studio, have, tried, specify, the, assemblyversion, but, does, get, exactly, what, want, also, using, settings,...",[c#]


#### Feature generation

##### By setting stop_words='english', the stop words will be removed; but ['C', 'c', '#'] will be removed also (if we could retokenize 'C#' it would be the perfect scenario)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# we have already tokenize the text so we need a dummy one to bypass tokenization
def dummy_tokenizer(string): return string

# we will only get the 10,000 most common words for title to limit size of dataset
title_vectorizer = TfidfVectorizer(tokenizer=dummy_tokenizer, 
                                   lowercase=False,
                                   stop_words='english',
                                   max_features=10000)
x_title = title_vectorizer.fit_transform(df["title_tokenized"])



In [16]:
# we will get the 100,000 most common words for body
body_vectorizer = TfidfVectorizer(tokenizer=dummy_tokenizer, 
                                  lowercase=False,
                                  stop_words='english',
                                  max_features=100000)
x_body = body_vectorizer.fit_transform(df["body_tokenized"])

In [17]:
df.iloc[[1]]

Unnamed: 0,id,title_tokenized,body_tokenized,tags
4,260,"[adding, scripting, functionality, applications]","[have, little, game, written, c, #, uses, database, trading, card, game, and, wanted, implement, the, function, the, cards, script, what, mean, that, essentially, have, interface, icard, which, card, class, implements, public, class, icard, and, which...","[c#, .net]"


In [18]:
pd.DataFrame(x_title[:11].toarray(), columns=title_vectorizer.get_feature_names_out()) \
  .iloc[1].sort_values(ascending=False).where(lambda v: v > 0).dropna().head(10)

scripting        0.602903
functionality    0.505980
applications     0.493765
adding           0.369714
Name: 1, dtype: float64

In [19]:
pd.DataFrame(x_body[:11].toarray(), columns=body_vectorizer.get_feature_names_out()) \
  .iloc[1].sort_values(ascending=False).where(lambda v: v > 0).dropna().head(10)

card                0.503108
cards               0.355600
game                0.251770
currentgamestate    0.200055
assembly            0.199405
essentially         0.197748
language            0.172273
database            0.172025
class               0.171883
trading             0.153585
Name: 1, dtype: float64

### Concantenate dataset and train val test split

In [20]:
# give a weight of 2 to title as it should contain more important words than body
x_title = x_title * 2

In [21]:
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split

X = hstack([x_title, x_body])
y = df[["tags"]]

In [22]:
from sklearn.preprocessing import MultiLabelBinarizer

multi_label_binarizer = MultiLabelBinarizer()
y = multi_label_binarizer.fit_transform(y["tags"])

### Splitting into train, val, test set as best practice

In [23]:
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

In [24]:
df[["tags"]][110:120]

Unnamed: 0,tags
223,"[html, css]"
225,[.net]
226,[c]
227,[sql]
228,[asp.net]
229,[java]
231,[ruby-on-rails]
232,"[c++, c]"
233,[c#]
236,"[c#, c++]"


Thus, I cannot set stratify = y in train_test_split (an error occured becauses of instances with index 175, 183). Maybe this should be avoided if I keep only the first tag in questions with multiple tags and split into train, val, test accordingly to secure class balance between the 3 sets

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio, random_state=0)

### Check class distributions in train, test (test + val) sets

In [32]:
import numpy as np 

# Step 1: Sum up the one-hot encoded vectors to get the count for each class
train_class_counts = np.sum(y_train, axis=0)
test_class_counts = np.sum(y_test, axis=0)

# Step 2: Calculate the percentage distribution for each class
total_train_instances = y_train.shape[0]
total_test_instances = y_test.shape[0]

train_class_distribution = train_class_counts / total_train_instances * 100
test_class_distribution = test_class_counts / total_test_instances * 100

# Print the distributions
print("Train Set Class Distribution (%):")
for class_idx, percentage in enumerate(train_class_distribution):
    print(f"Class {class_idx}: {percentage:.2f}%")

print("\nTest Set Class Distribution (%):")
for class_idx, percentage in enumerate(test_class_distribution):
    print(f"Class {class_idx}: {percentage:.2f}%")

Train Set Class Distribution (%):
Class 0: 2.83%
Class 1: 10.67%
Class 2: 2.39%
Class 3: 3.52%
Class 4: 2.73%
Class 5: 11.89%
Class 6: 5.59%
Class 7: 4.97%
Class 8: 6.94%
Class 9: 5.52%
Class 10: 2.53%
Class 11: 13.52%
Class 12: 14.60%
Class 13: 9.21%
Class 14: 4.99%
Class 15: 3.17%
Class 16: 11.61%
Class 17: 7.59%
Class 18: 3.03%
Class 19: 4.18%

Test Set Class Distribution (%):
Class 0: 2.83%
Class 1: 10.60%
Class 2: 2.41%
Class 3: 3.53%
Class 4: 2.72%
Class 5: 11.90%
Class 6: 5.59%
Class 7: 4.98%
Class 8: 6.90%
Class 9: 5.56%
Class 10: 2.53%
Class 11: 13.62%
Class 12: 14.54%
Class 13: 9.30%
Class 14: 5.00%
Class 15: 3.13%
Class 16: 11.62%
Class 17: 7.58%
Class 18: 3.02%
Class 19: 4.30%


In [33]:
X_val, X_test, y_val, y_test = train_test_split(X, y, test_size=test_ratio/(test_ratio + val_ratio), random_state=0)

### Check class distributions in test, val  sets

In [36]:
import numpy as np 

# Step 1: Sum up the one-hot encoded vectors to get the count for each class
val_class_counts = np.sum(y_val, axis=0)
test_class_counts = np.sum(y_test, axis=0)

# Step 2: Calculate the percentage distribution for each class
total_val_instances = y_val.shape[0]
total_test_instances = y_test.shape[0]

val_class_distribution = val_class_counts / total_val_instances * 100
test_class_distribution = test_class_counts / total_test_instances * 100

# Print the distributions
print("Train Set Class Distribution (%):")
for class_idx, percentage in enumerate(val_class_distribution):
    print(f"Class {class_idx}: {percentage:.2f}%")

print("\nTest Set Class Distribution (%):")
for class_idx, percentage in enumerate(test_class_distribution):
    print(f"Class {class_idx}: {percentage:.2f}%")

Train Set Class Distribution (%):
Class 0: 2.82%
Class 1: 10.69%
Class 2: 2.39%
Class 3: 3.49%
Class 4: 2.72%
Class 5: 11.87%
Class 6: 5.56%
Class 7: 4.95%
Class 8: 6.92%
Class 9: 5.51%
Class 10: 2.54%
Class 11: 13.49%
Class 12: 14.57%
Class 13: 9.23%
Class 14: 4.98%
Class 15: 3.18%
Class 16: 11.62%
Class 17: 7.66%
Class 18: 3.04%
Class 19: 4.18%

Test Set Class Distribution (%):
Class 0: 2.84%
Class 1: 10.61%
Class 2: 2.39%
Class 3: 3.56%
Class 4: 2.74%
Class 5: 11.91%
Class 6: 5.62%
Class 7: 4.99%
Class 8: 6.94%
Class 9: 5.54%
Class 10: 2.53%
Class 11: 13.59%
Class 12: 14.60%
Class 13: 9.23%
Class 14: 5.00%
Class 15: 3.15%
Class 16: 11.60%
Class 17: 7.52%
Class 18: 3.02%
Class 19: 4.23%


In [37]:
import joblib

joblib.dump(X_train, f"{data_dir}/x_train.pkl")
joblib.dump(X_test, f"{data_dir}/x_test.pkl")
joblib.dump(X_val, f"{data_dir}/x_val.pkl")
joblib.dump(y_train, f"{data_dir}/y_train.pkl")
joblib.dump(y_test, f"{data_dir}/y_test.pkl")
joblib.dump(y_val, f"{data_dir}/y_val.pkl")
joblib.dump(multi_label_binarizer.classes_, f"{data_dir}/y_classes.pkl")

['C:\\Users\\sotir\\Documents\\git\\satori-case-study\\data/y_classes.pkl']