# Ramsey King
# DSC 550 - Data Mining
# September 9, 2021
# Week 2: Handling Categorical Data, Text, Dates & Times
# Exercise 2.2 


You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files.
Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame.

In [9]:
import pandas as pd
import numpy as np

comments_df = pd.read_json('controversial-comments.jsonl', lines=True)
comments_df.head()
# Per the assignment instructions, sample 50,000 of the 950,000 entries to keep the runtime manageable.
# No random_state is set, so each run draws a different sample.

comments_df = comments_df.sample(50000).copy()


A. Convert all text to lowercase letters.

In [10]:
comments_df['txt'] = comments_df['txt'].apply(str.lower)
comments_df['txt'][:5]

185126    my well being depends on my income which depen...
924751                            ah a racist. \n\nfuck off
416556    yes, are you sad about that?  i'm not personal...
719281                                /r/thingsjonsnowknows
293906    it's karma for a lot more than that, let's be ...
Name: txt, dtype: object

B. Remove all punctuation from the text.


In [11]:
import re
# Keep only word characters and whitespace (note that \w also keeps digits and underscores)
comments_df['txt'] = comments_df['txt'].apply(lambda text: re.sub(r'[^\w\s]', '', text))
comments_df['txt'][:5]

185126    my well being depends on my income which depen...
924751                             ah a racist \n\nfuck off
416556    yes are you sad about that  im not personally ...
719281                                  rthingsjonsnowknows
293906    its karma for a lot more than that lets be honest
Name: txt, dtype: object
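
To make the effect of the pattern concrete, here is a minimal standalone sketch that applies the same regular expression to a few short strings. The helper name `strip_punctuation` is hypothetical and is not part of the notebook above.

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character (\w) nor whitespace (\s).
    return re.sub(r'[^\w\s]', '', text)

print(strip_punctuation("it's karma, let's be honest"))  # its karma lets be honest
print(strip_punctuation("/r/thingsjonsnowknows"))        # rthingsjonsnowknows
print(strip_punctuation("snake_case!"))                  # snake_case
```

Apostrophes are removed along with everything else, so contractions collapse into single tokens such as "its" and "lets". The underscore survives because `\w` counts it as a word character.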

C. Remove stop words.


In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
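# nltk.download('stopwords')  # uncomment on the first run to fetch the stop word list
stop_words = set(stopwords.words('english'))  # a set makes each membership check fast

def rem_stop_words(rem_list):
    """Return the tokens that are not English stop words."""
    return [word for word in rem_list if word not in stop_words]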

comments_df['txt_tokenized'] = comments_df['txt'].apply(word_tokenize)
comments_df['txt_tokenized'] = comments_df['txt_tokenized'].apply(rem_stop_words)
comments_df['txt_tokenized'][:5]

185126    [well, depends, income, depends, amount, work,...
924751                                   [ah, racist, fuck]
416556         [yes, sad, im, personally, respect, opinion]
719281                                [rthingsjonsnowknows]
293906                           [karma, lot, lets, honest]
Name: txt_tokenized, dtype: object
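
One thing worth checking in this step is the order of operations. Because punctuation was stripped in part B, a contraction such as "don't" has already become "dont" and may no longer match a stop list entry. The sketch below illustrates the general effect under simplifying assumptions: `STOP_WORDS` is a hypothetical four-word list (NLTK's English list is much longer), and tokens are split on whitespace rather than with `word_tokenize`.

```python
import re

# Hypothetical mini stop list, for illustration only.
STOP_WORDS = {"don't", "i", "the", "to"}

def strip_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]

comment = "i don't want to pay the fee"

# Order used in this notebook: strip punctuation first, then filter.
punct_first = remove_stop_words(strip_punctuation(comment).split())

# Alternative order: filter first, then strip punctuation from what is left.
stops_first = [strip_punctuation(tok) for tok in remove_stop_words(comment.split())]

print(punct_first)  # ['dont', 'want', 'pay', 'fee']
print(stops_first)  # ['want', 'pay', 'fee']
```

Whether the leftover contraction matters depends on the downstream model, so this is a point to verify rather than a required change.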

D. Apply NLTK’s PorterStemmer.

In [13]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def apply_porter_stemmer(tokenized_words):
    return [porter.stem(word) for word in tokenized_words]

comments_df['txt_tokenized_stemmed'] = comments_df['txt_tokenized'].apply(apply_porter_stemmer)
comments_df['txt_tokenized_stemmed'][:5]

185126    [well, depend, incom, depend, amount, work, do...
924751                                   [ah, racist, fuck]
416556              [ye, sad, im, person, respect, opinion]
719281                                 [rthingsjonsnowknow]
293906                            [karma, lot, let, honest]
Name: txt_tokenized_stemmed, dtype: object

Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for
model-building. Apply each of the following steps (individually) to the pre-processed data.

A. Convert each text entry into a word-count vector (see sections 5.3 and 6.8 in the Machine Learning with Python Cookbook).

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
results = count.fit_transform(comments_df['txt'])
print(comments_df.info())
print(type(results))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 185126 to 19050
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   con                    50000 non-null  int64 
 1   txt                    50000 non-null  object
 2   txt_tokenized          50000 non-null  object
 3   txt_tokenized_stemmed  50000 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB
None
<class 'scipy.sparse.csr.csr_matrix'>
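
The sparse matrix above is hard to inspect directly, so here is a small pure-Python sketch of what a word-count vector contains. It is an illustration of the idea, not scikit-learn's implementation. The function `count_vectors` and the two toy documents (a few stemmed tokens borrowed from the earlier output) are assumptions made for this example.

```python
from collections import Counter

def count_vectors(tokenized_docs):
    # The vocabulary is every distinct token, sorted so the column order is stable.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One column per vocabulary term; Counter returns 0 for absent terms.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = [
    ["well", "depend", "incom", "depend"],
    ["karma", "lot", "let", "honest"],
]
vocabulary, matrix = count_vectors(docs)

print(vocabulary)  # ['depend', 'honest', 'incom', 'karma', 'let', 'lot', 'well']
for row in matrix:
    print(row)
# [2, 0, 1, 0, 0, 0, 1]
# [0, 1, 0, 1, 1, 1, 0]
```

Each row is one document and each column one vocabulary term. With 50,000 comments and tens of thousands of terms, almost every cell is zero, which is why a sparse matrix is the practical representation.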


B. Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).

In [15]:
from nltk import pos_tag
from nltk import word_tokenize
from sklearn.preprocessing import MultiLabelBinarizer
# nltk.download('averaged_perceptron_tagger')
tweets = comments_df['txt'].tolist()  # one string per comment
tagged_tweets = []
for tweet in tweets:
    # Tag each token, then keep only the part-of-speech tags
    tweet_tag = pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

# One-hot encode which tags occur in each comment (presence, not counts)
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)


array([[0, 0, 0, ..., 1, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])
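
To show how tag lists turn into rows of 0s and 1s like the array above, here is a minimal pure-Python sketch of multi-label binarization. The function `binarize_tags` is hypothetical, and the tag sequences are hand-written Penn Treebank-style labels rather than real tagger output.

```python
def binarize_tags(tag_lists):
    # One column per distinct tag, in sorted order.
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = []
    for tags in tag_lists:
        present = set(tags)
        # 1 if the tag occurs at least once in this entry, else 0.
        rows.append([int(tag in present) for tag in classes])
    return classes, rows

# Hand-written example tags (assumed for illustration).
tagged = [
    ["PRP", "VBP", "JJ"],
    ["DT", "NN", "VBZ", "JJ"],
    ["NN", "NN"],
]
classes, rows = binarize_tags(tagged)

print(classes)  # ['DT', 'JJ', 'NN', 'PRP', 'VBP', 'VBZ']
for row in rows:
    print(row)
# [0, 1, 0, 1, 1, 0]
# [1, 1, 1, 0, 0, 1]
# [0, 0, 1, 0, 0, 0]
```

The third entry contains "NN" twice but still gets a single 1, so this encoding records which tags are present, not how often they occur.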

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector
(see section 6.9 in the Machine Learning with Python Cookbook).

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
text_data = np.array(tweets)
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
tfidf.vocabulary_

{'my': 27266,
 'well': 44108,
 'being': 4988,
 'depends': 10887,
 'on': 28806,
 'income': 21698,
 'which': 44270,
 'the': 40150,
 'amount': 3047,
 'work': 44739,
 'dont': 12128,
 'depend': 10879,
 'mcdonalds': 25780,
 'cashier': 6994,
 'to': 40767,
 'sustain': 39401,
 'level': 24310,
 'of': 28615,
 'comfort': 8364,
 'its': 22832,
 'more': 26924,
 'selfish': 35838,
 'demand': 10731,
 'support': 39249,
 'people': 30118,
 'know': 23636,
 'who': 44354,
 'should': 36446,
 'themselves': 40210,
 '5060': 1272,
 'hours': 19312,
 'week': 44057,
 'get': 16317,
 'by': 6584,
 'need': 27539,
 'be': 4849,
 'paying': 29943,
 'for': 15352,
 'anyone': 3480,
 'elses': 13062,
 'insurance': 22274,
 'because': 4890,
 'they': 40287,
 'arent': 3718,
 'willing': 44482,
 'put': 32393,
 'in': 21595,
 'furthermore': 15943,
 'fucking': 15827,
 'hilarious': 18858,
 'how': 19342,
 'aca': 2019,
 'your': 45202,
 'opinion': 28934,
 'im': 21369,
 'now': 28258,
 'hook': 19179,
 'these': 40278,
 'peoples': 30133,
 'lives'

For the three techniques in problem (2) above, give an example where each would be useful.

Word count vector - Word counts can help identify the main subject of a text, as well as less obvious secondary subjects within it. They can also feed sentiment analysis. One example would be an airline company that reads comments from Twitter. If the airline tracks how often words with a negative sentiment appear, it can watch for increases in those words and adjust its marketing as necessary to help with customer satisfaction.

Part of speech tag vector - Tagging parts of speech is useful because it describes the role a word plays in its sentence, which gives a model more context than the word alone. Continuing the Twitter airline example, the word "late" without any context would probably be viewed as negative when it applies to an airline company. In the comment "#airlinecompany flight 1234 is running late as usual!", the word "late" does carry a negative sentiment. However, in the comment "#airlinecompany made sure that I made my flight even though I got to the airport really late.", the same word appears in a statement that can be viewed positively.

Term frequency - inverse document frequency - A commonly cited use of term frequency - inverse document frequency is something we use almost every day: web search, such as Google. When we search for a term, the words in the query are compared with the documents the search engine has indexed. TF-IDF gives more weight to words that are frequent within a document but rare across the whole collection, which helps the engine return the documents (websites) most relevant to our terms.
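
As a rough illustration of that ranking idea, here is a pure-Python sketch that computes TF-IDF weights for three toy documents and ranks them against a query. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the default described in scikit-learn's documentation. The functions `tfidf_vectors` and `rank` and the sample documents are hypothetical, and real search engines combine many more signals than this.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        # L2-normalize so long documents do not win on length alone.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

def rank(query_terms, vectors):
    # Score each document by summing its weights for the query terms.
    scores = [sum(vec.get(t, 0.0) for t in query_terms) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)

docs = [
    "flight delayed again flight late".split(),
    "the crew was friendly".split(),
    "late flight but friendly crew".split(),
]
idf, vectors = tfidf_vectors(docs)

print(idf["delayed"] > idf["flight"])  # True: "delayed" appears in fewer documents
print(rank(["delayed"], vectors)[0])   # 0: only the first document mentions it
```

A term that appears in a single document receives a higher idf than one spread across several documents, so a query on the rarer term separates the documents more sharply.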