In this notebook, I focus on data cleaning, such as removing HTML tags from the body text, plus some further text processing, so that the dataset is ready for feature extraction.

In [1]:
import os

project_dir = os.getcwd()
data_dir = os.path.join(project_dir, "data")

In [2]:
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm
import re

tqdm.pandas()
pd.options.display.max_colwidth = 255

In [3]:
df = pd.read_pickle(f"{data_dir}/eda.pkl")

In [4]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count
0,80,SQLStatement.execute() - multiple queries in one statement,"<p>I've written a database generation script in <a href=""http://en.wikipedia.org/wiki/SQL"">SQL</a> and want to execute it in my <a href=""http://en.wikipedia.org/wiki/Adobe_Integrated_Runtime"">Adobe AIR</a> application:</p>\n\n<pre><code>Create Table t...","[flex, actionscript-3, air]",3
1,90,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining <a href=""http://svnbook.red-bean.com/en/1.8/svn.branchmerge.html"" rel=""nofollow"">branching and merging</a> with Apache Subversion? </p>\n\n<p>All the better if it's specific to TortoiseSVN client.</p>\n","[svn, tortoisesvn, branch, branching-and-merging]",4
2,120,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>\n\n<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way ...","[sql, asp.net, sitemap]",3
3,180,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parame...","[algorithm, language-agnostic, colors, color-space]",4
4,260,Adding scripting functionality to .NET applications,"<p>I have a little game written in C#. It uses a database as back-end. It's \na <a href=""http://en.wikipedia.org/wiki/Collectible_card_game"">trading card game</a>, and I wanted to implement the function of the cards as a script.</p>\n\n<p>What I mean ...","[c#, .net, scripting, compiler-construction]",4


### Understand text (body, title) length

In [5]:
min_title_length = df["title"].str.len().min()
max_title_length = df["title"].str.len().max()
min_body_length = df["body"].str.len().min()
max_body_length = df["body"].str.len().max()

In [6]:
print(f"min_title_length: {min_title_length}")
print(f"max_title_length: {max_title_length}")
print(f"min_body_length: {min_body_length}")
print(f"max_body_length: {max_body_length}")

min_title_length: 9
max_title_length: 189
min_body_length: 18
max_body_length: 46489


In [7]:
df[df["title"].str.len() == min_title_length]

Unnamed: 0,id,title,body,tag,tag_count
9695,622900,C# hashes,"<p>I'm new to C#</p>\n\n<blockquote>\n <ol>\n <li>How do i hash files with C#</li>\n <li>What is available ? (md5, crc, sha1, etc)</li>\n <li>Is there an interface i should inherit?</li>\n </ol>\n</blockquote>\n\n<p>Basically i want to checksum m...","[c#, hash]",2


In [8]:
df[df["title"].str.len() == max_title_length]

Unnamed: 0,id,title,body,tag,tag_count
694578,23691000,How to convert Office 365 ï¿½Éï¿½ï¿½ï¿½ï¿½é¼ï¿½Ìï¿½ï¿½[ï¿½Uï¿½[ï¿½Ìï¿½ï¿½[ï¿½ï¿½ï¿½{ï¿½bï¿½Nï¿½Xï¿½ÖÌAï¿½Nï¿½Zï¿½Xï¿½ÉÖï¿½ï¿½ï¿½Kï¿½Cï¿½hï¿½`ï¿½ï¿½ï¿½`ï¿½ï¿½ï¿½[ï¿½gï¿½ï¿½ï¿½Aï¿½ï¿½,"<p>I am reading source of the following url but the title is coming as bunch of ?? marks, how do I convert it to actual language that the web page is presenting.</p>\n\n<p><a href=""http://support.microsoft.com/common/survey.aspx?scid=sw;ja;3703&amp;sh...","[c#, character-encoding]",2


We can see that the longest title is garbled by an encoding error, so its length says little about how long real titles get.
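
To find other rows with the same problem, a check along these lines might help. It is only a sketch: `looks_garbled`, the `threshold` value, and the guess that the damage is U+FFFD written as UTF-8 and read back as Latin-1 are my assumptions, not something verified against the dataset.

```python
REPLACEMENT_CHAR = "\ufffd"
# Assumed damage pattern: U+FFFD encoded as UTF-8, then decoded as Latin-1.
MOJIBAKE_MARKER = REPLACEMENT_CHAR.encode("utf-8").decode("latin-1")


def looks_garbled(text, threshold=0.1):
    """Return True if damaged-character markers cover at least `threshold` of the text."""
    if not text:
        return False
    damaged = text.count(REPLACEMENT_CHAR)
    damaged += text.count(MOJIBAKE_MARKER) * len(MOJIBAKE_MARKER)
    return damaged / len(text) >= threshold


# Hypothetical titles: one mostly made of damage markers, one clean.
garbled_title = "How to convert Office 365 " + MOJIBAKE_MARKER * 20
clean_title = "C# hashes"

print(looks_garbled(garbled_title))
print(looks_garbled(clean_title))
```

On the DataFrame this could be used as `df["title"].apply(looks_garbled)` to list suspicious rows before deciding whether to drop or repair them.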

In [9]:
df[df["body"].str.len() == min_body_length]

Unnamed: 0,id,title,body,tag,tag_count
14480,858790,How to setup TeamCity under IIS?,<p>Any ideas?</p>\n,"[version-control, teamcity]",2


In [10]:
df[df["body"].str.len() == max_body_length]

Unnamed: 0,id,title,body,tag,tag_count
1168127,37657980,Elasticsearch - How to provide custom synonyms when querying?,"<p>I'm developping a search engine for my client which has to use synonym expansion. I can properly setup my index with a synonym token filter and a custom file (synonym.txt). </p>\n\n<p>Example: ipod, i-pod, i pod</p>\n\n<p>However, whenever we want ...",[elasticsearch],1


The prose in this body is not particularly long; most of the length comes from a large block of code that the original poster included for reference. We need to decide whether to keep this kind of content, since long code listings may stray far from what the model assumes about its input text.
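
If we decide to drop the code, one option is to skip everything inside `<pre>` blocks before extracting the text. Below is a minimal sketch using the standard library's `html.parser` instead of BeautifulSoup. The names `ProseOnlyExtractor` and `strip_code_blocks` are hypothetical, and the sketch assumes that long code samples are wrapped in `<pre>` while short inline `<code>` snippets should stay.

```python
from html.parser import HTMLParser


class ProseOnlyExtractor(HTMLParser):
    """Collect the text of an HTML fragment, skipping anything inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._pre_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested in a <pre> block.
        if self._pre_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_code_blocks(html):
    parser = ProseOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical body in the same shape as the rows shown earlier.
sample = (
    "<p>Use <code>web.sitemap</code> here:</p>\n\n"
    "<pre><code>Create Table t (id integer);</code></pre>\n"
    "<p>Thanks</p>"
)
prose = strip_code_blocks(sample)
print(prose)
```

Comparing `len(prose)` with the length of the full extracted text would also show how much of each body is code.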

### 1. Remove HTML tags from body text

In [11]:
df["body"] = df["body"].progress_apply(lambda text: BeautifulSoup(text, "lxml").text)

100%|██████████████████████████████████████████████████████████████████████| 1264216/1264216 [06:37<00:00, 3180.71it/s]


In [12]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4


### 2. Check for NaN values

In [13]:
df.isnull().sum()

id           0
title        0
body         0
tag          0
tag_count    0
dtype: int64

### 3. Remove emojis

In [14]:
# Compile the pattern once instead of on every call.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

def deEmojify(body):
    """Remove every character that falls in the emoji ranges above."""
    return EMOJI_PATTERN.sub("", body)

In [15]:
df['body'] = df['body'].progress_apply(deEmojify)

100%|█████████████████████████████████████████████████████████████████████| 1264216/1264216 [00:24<00:00, 51956.94it/s]


In [16]:
df['title'] = df['title'].progress_apply(deEmojify)

100%|████████████████████████████████████████████████████████████████████| 1264216/1264216 [00:03<00:00, 398626.62it/s]
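
A quick check of what the pattern does and does not remove. This sketch repeats the four ranges from `deEmojify()` under a hypothetical name, `EMOJI_RANGES`, so that it runs on its own. The two sample characters are my own picks: U+1F600 lies inside the emoticons range, while U+1F914 lies outside all four ranges and is therefore left in place.

```python
import re

# Same four ranges as in deEmojify(), repeated so the snippet is self-contained.
EMOJI_RANGES = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

grinning = "\U0001F600"  # inside the emoticons range
thinking = "\U0001F914"  # outside all four ranges

covered = EMOJI_RANGES.sub("", f"works {grinning}")
missed = EMOJI_RANGES.sub("", f"hmm {thinking}")

print(repr(covered))
print(thinking in missed)
```

If such characters turn out to matter, the pattern would need additional ranges. I have not checked how often they occur in this dataset.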


In [17]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4


### 4. Remove newlines and punctuation; tokenize and handle symbols in topics

In [18]:
import nltk
nltk.download("punkt")

# The filter below keeps only alphabetic tokens, so topic names that contain
# symbols or digits have to be listed explicitly, in the form people actually type them.
# Note: the membership test is case-sensitive, so "ASP.NET" does not match "asp.net".
topics_with_symbols = ["c#", "c++", "html5", "asp.net", "objective-c", ".net", "sql-server", "node.js", "asp.net-mvc", "vb.net"]

# Keep alphabetic tokens, a bare "#", and the listed topic names; drop everything else.
df["body_tokenized"] = df["body"].progress_apply(
    lambda text: [word for word in nltk.word_tokenize(text)
                  if word.isalpha() or word in ["#"] + topics_with_symbols])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sotir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|██████████████████████████████████████████████████████████████████████| 1264216/1264216 [20:28<00:00, 1029.09it/s]


In [19]:
df["title_tokenized"] = df["title"].progress_apply(
    lambda text: [word for word in nltk.word_tokenize(text)
                  if word.isalpha() or word in ["#"] + topics_with_symbols])

100%|██████████████████████████████████████████████████████████████████████| 1264216/1264216 [03:10<00:00, 6647.15it/s]
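
One thing to watch: the topic list is lowercase and the membership test is case-sensitive, so a title such as "ASP.NET Site Maps" loses its most informative token. A possible variant compares the lowercased token instead. This is a sketch, not what the cells above ran: `keep_token` and `KEEP` are hypothetical names, and the sample tokens assume the tokenizer emits "ASP.NET" as a single token.

```python
topics_with_symbols = ["c#", "c++", "html5", "asp.net", "objective-c", ".net",
                       "sql-server", "node.js", "asp.net-mvc", "vb.net"]

# A set gives constant-time lookups and is built only once.
KEEP = set(topics_with_symbols) | {"#"}


def keep_token(token):
    """Keep alphabetic tokens and, ignoring case, the listed topic names."""
    return token.isalpha() or token.lower() in KEEP


# Hypothetical tokenizer output for "ASP.NET Site Maps? 40".
tokens = ["ASP.NET", "Site", "Maps", "?", "40"]
filtered = [token for token in tokens if keep_token(token)]
print(filtered)
```

The same predicate could replace the condition inside both lambdas, at the cost of re-running the tokenization.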


In [20]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count,body_tokenized,title_tokenized
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3,"[I, written, a, database, generation, script, in, SQL, and, want, to, execute, it, in, my, Adobe, AIR, application, Create, Table, tRole, roleID, integer, Primary, Key, roleName, varchar, Create, Table, tFile, fileID, integer, Primary, Key, fileName, ...","[multiple, queries, in, one, statement]"
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4,"[Are, there, any, really, good, tutorials, explaining, branching, and, merging, with, Apache, Subversion, All, the, better, if, it, specific, to, TortoiseSVN, client]","[Good, branching, and, merging, tutorials, for, TortoiseSVN]"
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3,"[Has, anyone, got, experience, creating, providers, I, got, the, default, XML, file, working, properly, with, my, Menu, and, SiteMapPath, controls, but, I, need, a, way, for, the, users, of, my, site, to, create, and, modify, pages, dynamically, I, ne...","[Site, Maps]"
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4,"[This, is, something, I, many, times, and, never, quite, found, a, solution, That, stuck, with, me, The, problem, is, to, come, up, with, a, way, to, generate, N, colors, that, are, as, distinguishable, as, possible, where, N, is, a, parameter]","[Function, for, creating, color, wheels]"
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4,"[I, have, a, little, game, written, in, C, #, It, uses, a, database, as, It, a, trading, card, game, and, I, wanted, to, implement, the, function, of, the, cards, as, a, script, What, I, mean, is, that, I, essentially, have, an, interface, ICard, whic...","[Adding, scripting, functionality, to, applications]"


### 5. Retokenize C# (nltk.word_tokenize() splits it into "C" and "#")
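
This step has no code in the notebook yet. As the tokenized output above shows, "C#" comes out as two tokens, "C" and "#". A minimal sketch of how they could be joined again follows; `merge_c_sharp` is a hypothetical helper, and it only handles the C# case, not other split topic names.

```python
def merge_c_sharp(tokens):
    """Join a "C" (or "c") token with a directly following "#" into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        is_c = tokens[i] in ("C", "c")
        followed_by_hash = i + 1 < len(tokens) and tokens[i + 1] == "#"
        if is_c and followed_by_hash:
            merged.append(tokens[i] + "#")
            i += 2  # skip the "#" that was just consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Tokens in the shape produced for "...game written in C#. It uses...".
tokens = ["game", "written", "in", "C", "#", "It", "uses"]
result = merge_c_sharp(tokens)
print(result)
```

It could be applied with `df["body_tokenized"].progress_apply(merge_c_sharp)`, and likewise for the title column.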

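### Remove stopwords (skipped here)

Stopword removal can also be done in the feature engineering step, as an argument to TfidfVectorizer(), so I decided to do it there. The code below is kept for reference; the two cells that would apply it to the full dataset were not run.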

In [21]:
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def remove_stopwords(words):
    # NLTK's English stopwords are lowercase and the comparison is case-sensitive,
    # so capitalized tokens such as "We" are kept.
    return [word for word in words if word not in stop_words]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sotir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Preliminary test

In [22]:
sample_text = "Oh man, this is pretty cool. We will do more such things."
text_tokens = nltk.tokenize.word_tokenize(sample_text)

tokens_without_sw = remove_stopwords(text_tokens)

print("Pre stopword removal: ", text_tokens)
print("Post stopword removal: ", tokens_without_sw) 

Pre stopword removal:  ['Oh', 'man', ',', 'this', 'is', 'pretty', 'cool', '.', 'We', 'will', 'do', 'more', 'such', 'things', '.']
Post stopword removal:  ['Oh', 'man', ',', 'pretty', 'cool', '.', 'We', 'things', '.']
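
Note that "We" survives in the output above, because the comparison is case-sensitive. If that is not wanted, the token can be lowercased for the lookup only. A small sketch under stated assumptions: `remove_stopwords_ci` is a hypothetical name, and `SAMPLE_STOP_WORDS` is a hand-picked subset standing in for the NLTK list so that the snippet needs no download.

```python
# Hypothetical subset of stopwords, enough for the sample sentence.
SAMPLE_STOP_WORDS = {"this", "is", "we", "will", "do", "more", "such"}


def remove_stopwords_ci(words, stop_words):
    """Drop stopwords regardless of case, keeping the original casing of the rest."""
    return [word for word in words if word.lower() not in stop_words]


tokens = ["Oh", "man", ",", "this", "is", "pretty", "cool", ".",
          "We", "will", "do", "more", "such", "things", "."]
result = remove_stopwords_ci(tokens, SAMPLE_STOP_WORDS)
print(result)
```

Punctuation is untouched here, as in the original function; it is filtered out earlier by the tokenization step.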


In [None]:
df["body_tokenized"] = df["body_tokenized"].progress_apply(remove_stopwords)

In [None]:
df["title_tokenized"] = df["title_tokenized"].progress_apply(remove_stopwords)

In [23]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count,body_tokenized,title_tokenized
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3,"[I, written, a, database, generation, script, in, SQL, and, want, to, execute, it, in, my, Adobe, AIR, application, Create, Table, tRole, roleID, integer, Primary, Key, roleName, varchar, Create, Table, tFile, fileID, integer, Primary, Key, fileName, ...","[multiple, queries, in, one, statement]"
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4,"[Are, there, any, really, good, tutorials, explaining, branching, and, merging, with, Apache, Subversion, All, the, better, if, it, specific, to, TortoiseSVN, client]","[Good, branching, and, merging, tutorials, for, TortoiseSVN]"
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3,"[Has, anyone, got, experience, creating, providers, I, got, the, default, XML, file, working, properly, with, my, Menu, and, SiteMapPath, controls, but, I, need, a, way, for, the, users, of, my, site, to, create, and, modify, pages, dynamically, I, ne...","[Site, Maps]"
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4,"[This, is, something, I, many, times, and, never, quite, found, a, solution, That, stuck, with, me, The, problem, is, to, come, up, with, a, way, to, generate, N, colors, that, are, as, distinguishable, as, possible, where, N, is, a, parameter]","[Function, for, creating, color, wheels]"
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4,"[I, have, a, little, game, written, in, C, #, It, uses, a, database, as, It, a, trading, card, game, and, I, wanted, to, implement, the, function, of, the, cards, as, a, script, What, I, mean, is, that, I, essentially, have, an, interface, ICard, whic...","[Adding, scripting, functionality, to, applications]"


#### We can also apply stemming (via SnowballStemmer)

### 6. Remove single letters and other very short words

I added this step before I realized that stopwords can be removed in the feature engineering step. Removing stopwords from the whole dataset would have taken about 10 hours, so to save time I decided to drop very short words instead. The function below removes alphabetic tokens of one or two characters, except "C" and "c"; non-alphabetic tokens such as "#" are kept.

In [24]:
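def remove_single_letter(words):
    """Drop alphabetic tokens of one or two characters, except "C" and "c"."""
    words_filtered = []
    for word in words:
        # Keep longer words, non-alphabetic tokens such as "#", and the letter C (for C#).
        if len(word) > 2 or not word.isalpha() or word in ["C", "c"]:
            words_filtered.append(word)
    return words_filtered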

In [25]:
df["body_tokenized"] = df["body_tokenized"].progress_apply(remove_single_letter)

100%|█████████████████████████████████████████████████████████████████████| 1264216/1264216 [00:42<00:00, 29978.38it/s]


In [26]:
df["title_tokenized"] = df["title_tokenized"].progress_apply(remove_single_letter)

100%|██████████████████████████████████████████████████████████████████████| 1264216/1264216 [02:29<00:00, 8446.04it/s]
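
Because the filter drops every alphabetic token shorter than three characters, short technology names would disappear along with words like "in" and "my". If that turns out to hurt, the exceptions could be made configurable. The sketch below is a hypothetical variant: `remove_short_tokens`, `KEEP_SHORT`, and the extra entries "R", "Go", and "JS" are my assumptions, not part of the pipeline above.

```python
# Hypothetical allowlist of short tokens worth keeping.
KEEP_SHORT = {"C", "c", "R", "Go", "JS"}


def remove_short_tokens(words, keep=KEEP_SHORT, min_length=3):
    """Keep tokens that are long enough, non-alphabetic, or explicitly allowed."""
    return [
        word for word in words
        if len(word) >= min_length or not word.isalpha() or word in keep
    ]


tokens = ["I", "use", "R", "and", "C", "#", "in", "my", "job"]
result = remove_short_tokens(tokens)
print(result)
```

With `keep={"C", "c"}` and the default `min_length`, this behaves like `remove_single_letter()`.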


In [27]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count,body_tokenized,title_tokenized
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3,"[written, database, generation, script, SQL, and, want, execute, Adobe, AIR, application, Create, Table, tRole, roleID, integer, Primary, Key, roleName, varchar, Create, Table, tFile, fileID, integer, Primary, Key, fileName, varchar, fileDescription, ...","[multiple, queries, one, statement]"
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4,"[Are, there, any, really, good, tutorials, explaining, branching, and, merging, with, Apache, Subversion, All, the, better, specific, TortoiseSVN, client]","[Good, branching, and, merging, tutorials, for, TortoiseSVN]"
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3,"[Has, anyone, got, experience, creating, providers, got, the, default, XML, file, working, properly, with, Menu, and, SiteMapPath, controls, but, need, way, for, the, users, site, create, and, modify, pages, dynamically, need, tie, page, viewing, perm...","[Site, Maps]"
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4,"[This, something, many, times, and, never, quite, found, solution, That, stuck, with, The, problem, come, with, way, generate, colors, that, are, distinguishable, possible, where, parameter]","[Function, for, creating, color, wheels]"
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4,"[have, little, game, written, C, #, uses, database, trading, card, game, and, wanted, implement, the, function, the, cards, script, What, mean, that, essentially, have, interface, ICard, which, card, class, implements, public, class, ICard, and, which...","[Adding, scripting, functionality, applications]"


### 7. Save preprocessed dataset

In [28]:
# Checkpoint: rename the label column and save only the columns needed for feature extraction.
df.rename(columns={"tag": "tags"}, inplace=True)
df[["id", "title_tokenized", "body_tokenized", "tags"]].to_pickle(f"{data_dir}/preprocessed.pkl")