Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then,

A. Convert all text to lowercase letters.

B. Remove all punctuation from the text.

C. Remove stop words.

D. Apply NLTK’s PorterStemmer.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# loading the json lines file into a dataframe
df_comments = pd.read_json('controversial-comments.jsonl', lines = True)

In [3]:
# Check the dataframe to see how the data gets loaded.
df_comments.head()

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...


In [4]:
# Viewing the info on the dataframe
df_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950000 entries, 0 to 949999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   con     950000 non-null  int64 
 1   txt     950000 non-null  object
dtypes: int64(1), object(1)
memory usage: 14.5+ MB


In [5]:
# Draw a random sample of 50,000 comments and check its shape
df_sample = df_comments.sample(n=50000)
df_sample.shape

(50000, 2)

In [6]:
# Build a translation table for str.translate that removes all punctuation
import unicodedata
import sys

# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                           if unicodedata.category(chr(i)).startswith('P'))
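
To see what this table does, here is a minimal sketch that applies it with str.translate, which deletes every character whose code point maps to None. The sample string is made up for illustration, and the sketch uses range(sys.maxunicode + 1) so the final code point is covered as well.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P' (punctuation) to None.
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# Illustrative input only
sample = "Hi!!! I. Love. This. Song..... 10000% Agree!!!! #LoveIT"

# Characters mapped to None are dropped; everything else is kept as-is
cleaned = sample.translate(punctuation)
print(cleaned)

# Symbols such as '+' and '$' are in the 'S' categories, so they are not removed
print("a+b costs $5".translate(punctuation))
```

Because only the 'P' categories are included, math and currency symbols survive this step; if those should go too, the 'S' categories would have to be added to the table.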

In [7]:
# Prepare the list of stop words from the nltk package
from nltk.corpus import stopwords

# Load the stop words
stop_words = stopwords.words('english')

In [63]:
# View the first few stop words in the list
stop_words[0:5]

['i', 'me', 'my', 'myself', 'we']
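
A side note on lookups: testing `word not in stop_words` against a list scans the list for every word, while a set gives constant-time membership tests. The sketch below uses a short, hand-written stand-in for the NLTK list (an assumption, the real list is much longer), and the helper name remove_stop_words is hypothetical. It also shows that a contraction only matches if its apostrophe is still present when the stop words are removed.

```python
# Illustrative stand-in for stopwords.words('english'); the real list is much longer.
stop_words = ['i', 'me', 'my', 'myself', 'we', "don't"]

# A set makes each membership test constant-time instead of a linear scan
stop_word_set = set(stop_words)

def remove_stop_words(text, stop_set):
    """Drop every whitespace-separated token that appears in stop_set."""
    return " ".join(word for word in text.split() if word not in stop_set)

with_apostrophe = remove_stop_words("i don't like my code", stop_word_set)
without_apostrophe = remove_stop_words("i dont like my code", stop_word_set)

print(with_apostrophe)     # "don't" is matched and removed
print(without_apostrophe)  # "dont" no longer matches the stop word
```

If punctuation is stripped before stop words are removed, contractions such as "dont" may slip through, so the order of the two steps matters.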

In [9]:
# using NLTK’s PorterStemmer

from nltk.stem.porter import PorterStemmer
# Create stemmer
porter = PorterStemmer()


In [10]:
# This shows how the stemmer reduces a word to its stem.
# porter.stem('jumps')

In [65]:
# Preprocess a single comment string; applied to the 'txt' column with Series.apply
# Covers all four tasks (A to D) given in the exercise
# A few extra clean-up steps are applied to the result in a later cell
def preprocess_txt(input_txt):
    # A. Convert all text to lowercase letters.
    words = input_txt.lower().split()
    # B. Remove all punctuation from the text.
    words = [word.translate(punctuation) for word in words]
    # C. Remove stop words (and any tokens left empty by step B).
    words = [word for word in words if word and word not in stop_words]
    # D. Apply NLTK’s PorterStemmer.
    return " ".join(porter.stem(word) for word in words)


In [66]:
df_sample['processed_txt'] = df_sample['txt'].apply(preprocess_txt)

In [109]:
# Post-process the processed text series with regular-expression replacements.
# This strips placeholder and markup fragments that contribute nothing to the analysis.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.

df_sample['processed_txt'] = df_sample['processed_txt'].str.replace(r'\[removed\]', "", regex=True)
df_sample['processed_txt'] = df_sample['processed_txt'].str.replace(r'\[deleted\]', "", regex=True)
df_sample['processed_txt'] = df_sample['processed_txt'].str.replace(r'&.*;', "", regex=True)
df_sample['processed_txt'] = df_sample['processed_txt'].str.replace(r'\[', "", regex=True)
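
One thing to watch in the patterns above: `&.*;` is greedy, so it matches from the first `&` to the last `;` in a comment and can delete ordinary text in between. The sketch below compares it with a narrower pattern, `&#?\w+;`, which is an assumption about what an HTML entity looks like here and not part of the original notebook. The sample string is made up.

```python
import re

# Illustrative input containing two HTML entities and ordinary semicolons
text = "fish &amp; chips; salt &gt; pepper; done"

# Greedy: matches from the first '&' through the last ';' on the line
greedy = re.sub(r'&.*;', "", text)

# Narrow: matches one entity at a time, e.g. '&amp;' or '&#39;'
narrow = re.sub(r'&#?\w+;', "", text)

print(repr(greedy))
print(repr(narrow))
```

The same narrow pattern could be passed to Series.str.replace with regex=True if the greedy behaviour turns out to remove too much.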

In [110]:
# Length of each comment before preprocessing
df_sample['len_b4'] = df_sample['txt'].str.len()

In [111]:
# Length of each comment after preprocessing
df_sample['len_aftr'] = df_sample['processed_txt'].str.len()

In [112]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 400139 to 146752
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   con            50000 non-null  int64 
 1   txt            50000 non-null  object
 2   processed_txt  50000 non-null  object
 3   len_b4         50000 non-null  int64 
 4   len_aftr       50000 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 3.5+ MB


In [113]:
df_sample.head()

Unnamed: 0,con,txt,processed_txt,len_b4,len_aftr
400139,0,Lets use military action to stop these people ...,Lets use military action to stop these people ...,144,144
550836,0,I can't fathom why these women would make a sh...,I can't fathom why these women would make a sh...,361,361
284328,0,Plus we heard the SJWs and the alt left folks ...,Plus we heard the SJWs and the alt left folks ...,242,242
777068,0,Can some ELI5: how this can even be a discussi...,Can some ELI5: how this can even be a discussi...,72,72
929858,0,[removed],,9,0


In [114]:
df_sample.shape

(50000, 5)

In [115]:
# Keep only the rows whose text is non-empty after preprocessing
df_sample_fltr = df_sample[df_sample['len_aftr'] > 0]

In [116]:
df_sample_fltr.shape

(45993, 5)

Apply each of the following steps (individually) to the pre-processed data.

A. Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).

B. Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).
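
As a rough picture of what a word-count vector is, the sketch below builds one by hand with collections.Counter on three made-up, already-stemmed documents. It is only an illustration of the idea: CountVectorizer does its own tokenization (by default it ignores single-character tokens) and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Illustrative, already preprocessed documents
docs = ["good movi good plot", "bad plot", "good act"]

# Vocabulary: sorted unique tokens, one column per token
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one count per vocabulary entry
count_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    count_vectors.append([counts[token] for token in vocabulary])

print(vocabulary)
print(count_vectors)
```

Each row has as many entries as there are distinct words in the whole corpus, which is why most entries are zero for short comments.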

In [117]:
# .values returns the values of a column as a NumPy array
# This array can then be passed to the vectorizer to generate the bag of words
# df_sample['processed_txt'].values

In [139]:
# A. Convert each text entry into a word-count vector.
from sklearn.feature_extraction.text import CountVectorizer

# Create a bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(df_sample_fltr['processed_txt'].values)

# bag_of_words.toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = count.get_feature_names_out()
feature_matrix = pd.DataFrame(bag_of_words.toarray(),columns=words)
feature_matrix

Unnamed: 0,00,000,000000000001,0000001,000000606,00000156,00000158,0001,000217,00040,...,сороса,теперь,трампа,украине,феликс,эдмундович,этом,яepublican,яepublicans,ಠ_ಠ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45991,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [119]:
# convert dense to sparse
# from scipy.sparse import csr_matrix
# calculate sparsity
# np.count_nonzero(feature_matrix)
# feature_matrix.size
# sparsity = 1 -  np.count_nonzero(feature_matrix)/feature_matrix.size
# sparsity
# convert to sparse matrix (CSR method)
# feature_matrix_csr = csr_matrix(feature_matrix)
# feature_matrix_csr.shape
# reconstruct dense matrix
# feature_matrix_dense = feature_matrix_csr.todense()
# feature_matrix_dense.shape

0.9992691001705665
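
The figure above is the fraction of entries in the matrix that are zero. A minimal sketch of the same calculation on a tiny hand-written matrix follows; the function name sparsity and the sample rows are illustrative only. For a SciPy sparse matrix, the number of stored entries is available as the nnz attribute, so the dense conversion is not strictly needed to get this number.

```python
def sparsity(matrix):
    """Fraction of zero entries in a rectangular list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return 1 - nonzero / total

# Illustrative 3 x 5 count matrix: 7 of its 15 entries are nonzero
count_vectors = [
    [0, 0, 2, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
]

result = sparsity(count_vectors)
print(result)
```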

In [120]:
# str(df_sample['processed_txt'].values)

In [121]:
# B. Convert each text entry into a part-of-speech tag vector
# Using NLTK's pre trained parts of speech tagger

from nltk import pos_tag
from nltk.tokenize import word_tokenize

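# Note: str() on the array returns its printed representation (brackets, quotes,
# and an elided middle for long arrays), so the column is tagged as one string
# rather than entry by entry

text_tagged = pos_tag(word_tokenize(str(df_sample_fltr['processed_txt'].values)))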

In [122]:
text_tagged[0]

('[', 'NN')
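
The first token is '[' because the whole array was converted to one string, brackets included. A per-entry tag vector could instead be built by tagging each comment separately and counting its tags. The sketch below shows only the counting step: the (token, tag) pairs are hand-written stand-ins for what pos_tag(word_tokenize(text)) would return for each entry, and the names tagged_docs, tag_columns, and tag_vectors are hypothetical.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for per-entry pos_tag output
tagged_docs = [
    [("presid", "NN"), ("right", "JJ")],
    [("get", "VB"), ("frustrat", "NN"), ("reason", "NN")],
]

# One column per tag seen anywhere in the corpus
tag_columns = sorted({tag for doc in tagged_docs for _, tag in doc})

# One row per document: how often each tag occurs in it
tag_vectors = []
for doc in tagged_docs:
    counts = Counter(tag for _, tag in doc)
    tag_vectors.append([counts[tag] for tag in tag_columns])

print(tag_columns)
print(tag_vectors)
```

If only the presence or absence of a tag matters, scikit-learn's MultiLabelBinarizer can produce a one-hot version of the same idea from the per-entry tag lists.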

In [131]:
# C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector 
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df_sample_fltr['processed_txt'].values)

# show the matrix
# tfidf_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = tfidf.get_feature_names_out()
tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(),columns=words)
tfidf_matrix

Unnamed: 0,00,000,000000000001,0000001,000000606,00000156,00000158,0001,000217,00040,...,сороса,теперь,трампа,украине,феликс,эдмундович,этом,яepublican,яepublicans,ಠ_ಠ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45988,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45990,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
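
To make the weights in this matrix less opaque, here is a hand-rolled sketch of the calculation under TfidfVectorizer's default settings as I understand them: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The three documents are made up, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Illustrative, already preprocessed documents
docs = ["good movi good plot", "bad plot", "good act"]
vocabulary = sorted({token for doc in docs for token in doc.split()})
n_docs = len(docs)

# Document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in docs for token in set(doc.split()))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

tfidf_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
    tfidf_vectors.append([v / norm for v in raw])

for row in tfidf_vectors:
    print([round(v, 3) for v in row])
```

In the second document, "bad" receives a higher weight than "plot" because it occurs in fewer documents, which is the effect the idf term is meant to capture.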


In [136]:
# Measure sparsity, then convert between dense and sparse representations
from scipy.sparse import csr_matrix

# calculate sparsity (fraction of zero entries)
sparsity = 1 - np.count_nonzero(tfidf_matrix) / tfidf_matrix.size
sparsity
# convert to sparse matrix (CSR format)
tfidf_matrix_csr = csr_matrix(tfidf_matrix)
# tfidf_matrix_csr.shape
# reconstruct dense matrix
tfidf_matrix_dense = tfidf_matrix_csr.todense()
# tfidf_matrix_dense.shape

In [137]:
#tfidf.vocabulary_