In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 1. Downloading libraries

In [1]:
!pip install wordcloud



# 2. Importing libraries

In [2]:
import re
import unicodedata

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

#Downloading the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
import os
os.getcwd()

'/content'

# 3. Loading and selecting data

In [3]:
df = pd.read_csv('/content/drive/My Drive/stack_exchange_dataset_no_nulls_no_htmltags_no_newlines.csv',index_col=0)
df.head()

Unnamed: 0,Title,Body
0,How to check if an uploaded file is an image w...,I'd like to check if an uploaded file is an im...
1,How can I prevent firefox from closing when I ...,"In my favorite editor (vim), I regularly use c..."
2,R Error Invalid type (list) for variable,I am import matlab file and construct a data f...
3,How do I replace special characters in a URL?,"This is probably very simple, but I simply can..."
4,How to modify whois contact details?,function modify(.......){ $mcontact = file_ge...


It turns out that tokenizing more than 420,000 rows for both **Title** and **Body** takes too many computational resources, and further processing can cause the kernel to crash.

Therefore, the plan is to use only the first 100,000 rows for the analysis and possibly keep some of the rest for testing the topic model later on. The `df.copy()` cell that follows still copies the full dataframe, as its shape output shows.
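The row selection described above is not shown in the notebook cells, so here is a minimal sketch of how it could look. The toy dataframe, `n_rows`, `df_subset` and `df_holdout` are hypothetical names introduced for illustration; on the real data `n_rows` would be 100,000.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical values)
df_demo = pd.DataFrame({
    "Title": [f"title {i}" for i in range(10)],
    "Body": [f"body {i}" for i in range(10)],
})

n_rows = 4  # would be 100_000 for the real data

# Positional slicing keeps the first n_rows rows regardless of index labels
df_subset = df_demo.iloc[:n_rows].copy()   # rows used for the analysis
df_holdout = df_demo.iloc[n_rows:].copy()  # remaining rows, kept for later testing

print(df_subset.shape, df_holdout.shape)
```

Using `.copy()` on each slice keeps later column assignments from touching the original dataframe.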

In [16]:
df_text = df.copy()
df_text.shape

(420545, 2)

In [17]:
df_text.head()

Unnamed: 0,Title,Body
0,How to check if an uploaded file is an image w...,I'd like to check if an uploaded file is an im...
1,How can I prevent firefox from closing when I ...,"In my favorite editor (vim), I regularly use c..."
2,R Error Invalid type (list) for variable,I am import matlab file and construct a data f...
3,How do I replace special characters in a URL?,"This is probably very simple, but I simply can..."
4,How to modify whois contact details?,function modify(.......){ $mcontact = file_ge...


## 3. Word tokenization

Tokenization splits a sentence or text document into tokens, which can be words, special characters, punctuation marks and so on. Because it separates punctuation from the words it is attached to, it is more effective than simply using the **split** function.
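To make the difference from **split** concrete, here is a small standard-library sketch. The regular expression in `rough_tokenize` is only a rough stand-in chosen for illustration; it is not the algorithm behind `word_tokenize`, which handles contractions and other cases differently.

```python
import re

def rough_tokenize(text):
    # Rough stand-in for a tokenizer: runs of word characters become one
    # token, and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

title = "How to modify whois contact details?"

print(title.split())          # the last item keeps the "?" attached
print(rough_tokenize(title))  # the "?" is separated into its own token
```

With plain whitespace splitting, `details?` and `details` would count as two different words, which distorts any later frequency count.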

In [18]:
#Applying tokenization to the text
df_text['Title'] = df_text['Title'].apply(word_tokenize)
df_text.head()

Unnamed: 0,Title,Body
0,"[How, to, check, if, an, uploaded, file, is, a...",I'd like to check if an uploaded file is an im...
1,"[How, can, I, prevent, firefox, from, closing,...","In my favorite editor (vim), I regularly use c..."
2,"[R, Error, Invalid, type, (, list, ), for, var...",I am import matlab file and construct a data f...
3,"[How, do, I, replace, special, characters, in,...","This is probably very simple, but I simply can..."
4,"[How, to, modify, whois, contact, details, ?]",function modify(.......){ $mcontact = file_ge...


The tokenization worked as expected, as the first five rows of the resulting dataset show.

## 4. Changing all words to lowercase

In [0]:
def list_to_lower(list_of_words):
    to_lower = [x.lower() for x in list_of_words]
    return to_lower

In [20]:
df_text['Title'] = df_text['Title'].apply(list_to_lower)
df_text['Title'].head()

0    [how, to, check, if, an, uploaded, file, is, a...
1    [how, can, i, prevent, firefox, from, closing,...
2    [r, error, invalid, type, (, list, ), for, var...
3    [how, do, i, replace, special, characters, in,...
4        [how, to, modify, whois, contact, details, ?]
Name: Title, dtype: object

The first 5 instances confirm that the words have been converted to lowercase. 

## 5. Handling stopwords, special characters and symbols

First we extract a list of all English stopwords using the *stopwords* corpus from the *nltk* library.

In [21]:
#Retrieving all stopwords in English
stop_words = stopwords.words('english')
print ("First 10 stopwords")
print (stop_words[:10])

First 10 stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


To check for the above stopwords in our dataframe, we first write a function which takes in a list of tokenized words and returns the same list without the stopwords as shown below. 

In [0]:
def remove_stop_words(list_of_tokenized_words):
    #Building a set from the current stop_words list makes each lookup fast
    stop_set = set(stop_words)
    no_stops = [i for i in list_of_tokenized_words if i not in stop_set]
    return no_stops

In [23]:
df_text['Title'] = df_text['Title'].apply(remove_stop_words)
print ("Printing the first 10 instances of the Title column without stopwords")
print ()
print (df_text['Title'].head(10))

Printing the first 10 instances of the Title column without stopwords

0    [check, uploaded, file, image, without, mime, ...
1           [prevent, firefox, closing, press, ctrl-w]
2      [r, error, invalid, type, (, list, ), variable]
3               [replace, special, characters, url, ?]
4                 [modify, whois, contact, details, ?]
5     [setting, proxy, active, directory, environment]
6                       [draw, barplot, way, coreplot]
7                   [fetch, xml, feed, using, asp.net]
8           [.net, library, generating, javascript, ?]
9    [sql, server, :, procedure, call, ,, inline, c...
Name: Title, dtype: object


The removal of stopwords was largely successful. However, tokens such as *?*, *(* and *'d* remain; they carry little meaning and also behave like stopwords. We can create a list of custom stopwords to remove them.

Let's get word frequencies to figure out what other custom stopwords we need to remove. To do that, we first create a list of all words in the dataset. 

In [0]:
def combine_list_of_words(word_list):
    combined_list = []
    for i in word_list:
        combined_list.extend(i)
    return combined_list

In [25]:
title_words = combine_list_of_words(df_text['Title'])
title_words[:10]

['check',
 'uploaded',
 'file',
 'image',
 'without',
 'mime',
 'type',
 '?',
 'prevent',
 'firefox']

In [26]:
#Counting token frequencies once and reusing the result for both columns
freq_dist = nltk.FreqDist(title_words)
df_title_words = pd.DataFrame({'word': list(freq_dist.keys()),
                               'freq': list(freq_dist.values())})

df_title_words = df_title_words.sort_values("freq",ascending=False)
df_title_words.head(50)

Unnamed: 0,word,freq
7,?,121876
16,(,31584
18,),31522
48,:,30038
51,",",28943
40,using,26068
144,-,21580
2,file,16174
251,$,13210
82,'',12706
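The flatten-and-count steps above can also be sketched with the standard library alone. `nltk.FreqDist` builds on `collections.Counter`, so the idea is the same; the `tokenized_titles` list below is a hypothetical stand-in for `df_text['Title']`.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized titles standing in for df_text['Title']
tokenized_titles = [
    ["check", "uploaded", "file", "?"],
    ["replace", "special", "characters", "url", "?"],
    ["modify", "file", "?"],
]

# Flatten the list of lists lazily, then count each token
all_tokens = chain.from_iterable(tokenized_titles)
token_counts = Counter(all_tokens)

# The two most frequent tokens, highest count first
top_two = token_counts.most_common(2)
print(top_two)
```

`most_common` returns the tokens already sorted by frequency, so no separate sorting step is needed for a quick look at the top entries.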


### Creating a custom list of words, characters and symbols to remove

Looking at the most frequent tokens above, characters like *(*, *:* and *$* still occur quite often but do not add much value to the analysis.

Next we can manually put such characters in a list and append to the original stopwords list, and then remove them from our dataset. We can also add single digits to this removal list.

In [27]:
#Manually adding specific characters and symbols that add little value to the analysis
#Note: tokens were lowercased earlier, so capitalized entries such as 'The' and 'I' never match
add_stop_words = ["I'd",'The','I','(',')','{','}','<','>','[',']','*','?','!','&','...','``','=',':',';','/','@',
                  "'","''",',','"','-','--','$','.','+','%','//','..',"n't","'m","'s",'It','what','how',
                  'why','when','where','lt','gt','na',"'ve",'href=','rel=','using','file','get','use','user']
digits = range(10)

#Integer digits never equal string tokens; only the str versions added below take effect
add_stop_words.extend(digits)
add_stop_words.extend(map(str,digits))
add_stop_words[:10]

["I'd", 'The', 'I', '(', ')', '{', '}', '<', '>', '[']

In [28]:
#Appending custom stopwords to the NLTK stopwords list
stop_words.extend(add_stop_words)
stop_words[-20:]

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9']
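One point worth checking with a list like this is that membership tests compare entries exactly, so case and type both matter. The sketch below uses a hypothetical miniature stop list (`demo_stop_words`) and hypothetical tokens to show the effect, and one possible way to normalize the list.

```python
# Hypothetical miniature stop list mixing the kinds of entries used above
demo_stop_words = ["The", "I", 1, "1", "?"]

# Tokens as they look after tokenizing and lowercasing
tokens = ["the", "i", "1", "url", "?"]

# Exact matching: "The" != "the" and the integer 1 != "1"
kept = [t for t in tokens if t not in demo_stop_words]
print(kept)

# Normalizing every entry to a lowercase string makes the list match the tokens
normalized_stop_words = {str(w).lower() for w in demo_stop_words}
kept_normalized = [t for t in tokens if t not in normalized_stop_words]
print(kept_normalized)
```

In this toy case, `the` and `i` survive the first filter only because their stop list entries are capitalized.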

Now that we have all the stopwords we want, we can remove them from the dataset again. 

In [30]:
df_text['Title'] = df_text['Title'].apply(remove_stop_words)
print ("Printing the first 5 instances of the Title column without stopwords")
print ()
print (df_text['Title'].head())

Printing the first 5 instances of the Title column without stopwords

0    [check, uploaded, image, without, mime, type]
1       [prevent, firefox, closing, press, ctrl-w]
2        [r, error, invalid, type, list, variable]
3              [replace, special, characters, url]
4                [modify, whois, contact, details]
Name: Title, dtype: object


## 6. Lemmatization using *verb* Part of Speech (POS)

Here, we apply lemmatization to reduce different variations of a word to a single base form. For example, *runs*, *ran* and *running* are all changed to the root word *run*. However, the lemmatizer needs the part of speech (POS) as context for the change. In this example, the POS is set to **verb**, so every variation is reduced to its verb form. By default, the lemmatizer treats each word as a noun.

In our case, verb is the most appropriate POS to use, especially for the body column, since many questions describe something the user wants to do or has already tried. This allows us to group all the variations of a word together for topic modeling.

We have not opted for the *PorterStemmer* method, as it tends to reduce words to stems that may not be valid English words.

In [31]:
#Example
lem = nltk.WordNetLemmatizer()
lem.lemmatize('simply',pos='v')

'simply'
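To illustrate why the POS argument matters, here is a toy sketch. `TOY_LEMMAS` is a hand-written table assumed purely for illustration, not WordNet data, and `toy_lemmatize` is a hypothetical helper; the real lemmatizer looks words up in WordNet instead.

```python
# Hand-written toy table (an assumption for illustration, not WordNet data):
# the same surface form can map to different lemmas depending on the POS.
TOY_LEMMAS = {
    ("reading", "v"): "read",
    ("reading", "n"): "reading",
    ("writing", "v"): "write",
    ("writing", "n"): "writing",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the unchanged word when the (word, pos) pair is unknown,
    # in the same spirit as the 'simply' example above
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("reading"))           # treated as a noun, left as is
print(toy_lemmatize("reading", pos="v"))  # treated as a verb, reduced to "read"
print(toy_lemmatize("simply", pos="v"))   # unknown pair, returned unchanged
```

The default of `pos="n"` mirrors the noun default mentioned above, which is why the verb POS has to be passed explicitly to merge forms like *reading* and *read*.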

In [0]:
def list_lemmatize(list_of_tokens):
    lemmatized = [lem.lemmatize(x,pos='v') for x in list_of_tokens]
    return lemmatized

In [33]:
#Applying lemmatization to our dataset
df_text_title = df_text['Title'].apply(list_lemmatize)

print ("Printing 45th instance of the Title column")
print (df_text['Title'][44])
print ()
print ("Printing 45th instance of the Title column after lemmatization")
print (df_text_title[44])

Printing 45th instance of the Title column
['technology', '.net-winforms', 'c++-midi', 'mfc', 'based', 'best', 'reading', 'writing', 'com', 'port', 'data']

Printing 45th instance of the Title column after lemmatization
['technology', '.net-winforms', 'c++-midi', 'mfc', 'base', 'best', 'read', 'write', 'com', 'port', 'data']


Above, we have verified the effect of lemmatization on the 45th instance of the title: words like **based**, **reading** and **writing** have been lemmatized to **base**, **read** and **write**, as anticipated.

In [34]:
df_text['Title'] = df_text_title
df_text['Title'].head()

0    [check, upload, image, without, mime, type]
1       [prevent, firefox, close, press, ctrl-w]
2      [r, error, invalid, type, list, variable]
3             [replace, special, character, url]
4               [modify, whois, contact, detail]
Name: Title, dtype: object