### 1. You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files. Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then:

In [1]:
# importing pandas
import pandas as pd

In [2]:
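# Reading the JSON Lines file into a list of dicts.
# json.loads parses each line as JSON; eval would execute arbitrary code
# and fails on JSON literals such as true, false and null.
import json

with open('controversial-comments.jsonl') as f:
    lines=[json.loads(line) for line in f]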

In [3]:
# converting the list to data frame
df=pd.DataFrame(lines)
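
As an aside, pandas can usually read a JSON Lines file in one call. The sketch below is a minimal illustration: the two records are made up and held in memory in place of controversial-comments.jsonl, and the name `df_sample` is hypothetical.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form (one JSON object per line),
# standing in for the real controversial-comments.jsonl file.
sample = io.StringIO(
    '{"con": 0, "txt": "You are right Mr. President."}\n'
    '{"con": 1, "txt": "A second, made-up comment."}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df_sample = pd.read_json(sample, lines=True)
print(df_sample)
```

With the real file, passing the path instead of the `StringIO` object should give the same `con` and `txt` columns as the list-based approach.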

In [4]:
df.head()

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...


#### A. Convert all text to lowercase letters.

In [5]:
# converting the text column to lower case with the vectorized string method
# (DataFrame.applymap is deprecated in pandas 2.1 and later)
df['txt']=df['txt'].str.lower()

In [6]:
df.head()

Unnamed: 0,con,txt
0,0,well it's great that he did something about th...
1,0,you are right mr. president.
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...


#### B. Remove all punctuation from the text.

In [7]:
# importing string
import string

In [8]:
# creating a function to remove punctuations
exclusion=set(string.punctuation)
def remove_punc(s):
    return ''.join(ch for ch in s if ch not in exclusion)

In [9]:
# applying the remove_punc function to the data frame
df.txt=df.txt.apply(remove_punc)

In [10]:
df.head()

Unnamed: 0,con,txt
0,0,well its great that he did something about tho...
1,0,you are right mr president
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
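
A possible alternative to the character-by-character filter is `str.translate`, which is typically faster on a column of this size. The sketch below assumes the same ASCII punctuation set; the names `PUNCT_TABLE` and `remove_punc_translate` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_translate(s):
    """Return s with ASCII punctuation removed (same result as remove_punc above)."""
    return s.translate(PUNCT_TABLE)

print(remove_punc_translate("you are right mr. president."))
# string.punctuation covers ASCII only, so a curly apostrophe is left in place.
print(remove_punc_translate("it’s"))
```

Note that neither version strips non-ASCII punctuation such as curly quotes, which may matter for text scraped from the web.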


#### C. Remove stop words.

In [11]:
# importing stop words (run nltk.download('stopwords') once if the corpus is not installed)
import nltk
from nltk.corpus import stopwords

In [12]:
# removing stop words
sw=set(stopwords.words("english"))
df['txt']=df['txt'].apply(lambda x: ' '.join(word for word in x.split() if word not in sw))

In [13]:
df.head()

Unnamed: 0,con,txt
0,0,well great something beliefs office doubt trum...
1,0,right mr president
2,0,given input apart saying wrong argument clearly
3,0,get frustration reason want way foundation com...
4,0,far expert tpp would tend agree lot problems u...
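
The order of steps B and C matters. Tokens such as 'dont' and 'im' appear in the tagged output later in this notebook, which suggests that contractions lose their apostrophe in step B and then no longer match the stop list. The sketch below illustrates the effect with a small hypothetical stop list (`STOP`), not NLTK's actual list.

```python
import string

# Small hypothetical stop list for illustration; the notebook uses NLTK's English list.
STOP = {"i", "do", "not", "don't", "the", "it"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def drop_stop_words(text, stop=STOP):
    """Remove whitespace-separated tokens that appear in the stop list."""
    return " ".join(w for w in text.split() if w not in stop)

text = "i don't understand the math"

# Order used in the notebook: punctuation first, then stop words.
punct_first = drop_stop_words(text.translate(PUNCT_TABLE))
# Alternative order: stop words first, then punctuation.
stop_first = drop_stop_words(text).translate(PUNCT_TABLE)

print(punct_first)  # dont understand math
print(stop_first)   # understand math
```

Whether the leftover contractions matter depends on the downstream model; filtering stop words before stripping punctuation is one way to avoid them.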


#### D. Apply NLTK’s PorterStemmer.

In [14]:
# importing PorterStemmer
from nltk import PorterStemmer

In [15]:
# applying stemming: split each entry into words, stem each word, and rejoin
stemmer=PorterStemmer()
df['txt']=df['txt'].str.split()
df['txt']=df['txt'].apply(lambda x:' '.join([stemmer.stem(y) for y in x]))

In [16]:
df.head()

Unnamed: 0,con,txt
0,0,well great someth belief offic doubt trump wou...
1,0,right mr presid
2,0,given input apart say wrong argument clearli
3,0,get frustrat reason want way foundat complex p...
4,0,far expert tpp would tend agre lot problem und...
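
To see what the stemmer does to individual words, the sketch below stems four words taken from the outputs above. The Porter algorithm strips suffixes by rule, so the stems are not always dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the stop-word output above.
words = ["something", "president", "clearly", "problems"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

The results match the stemmed DataFrame shown above: 'someth', 'presid', 'clearli' and 'problem'.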


### 2. Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.

#### A. Convert each text entry into a word-count vector

In [17]:
# importing count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
# Converting to word count vector
count=CountVectorizer()
words_bag=count.fit_transform(df['txt'])

In [19]:
words_bag

<950000x226147 sparse matrix of type '<class 'numpy.int64'>'
	with 15344177 stored elements in Compressed Sparse Row format>

#### B. Convert each text entry into a part-of-speech tag vector

In [20]:
# importing pos and word tokenize
from nltk import pos_tag_sents
from nltk import word_tokenize

In [21]:
# Converting to POS
pos=pos_tag_sents(df['txt'].apply(word_tokenize).tolist())

In [22]:
pos

[[('well', 'RB'),
  ('great', 'JJ'),
  ('someth', 'JJ'),
  ('belief', 'NN'),
  ('offic', 'JJ'),
  ('doubt', 'NN'),
  ('trump', 'NN'),
  ('would', 'MD'),
  ('fight', 'VB'),
  ('un', 'JJ'),
  ('im', 'NN'),
  ('realli', 'NN'),
  ('realli', 'NN'),
  ('happi', 'NN'),
  ('obama', 'NN'),
  ('someth', 'VBZ'),
  ('couldoh', 'JJ'),
  ('wait', 'NN')],
 [('right', 'JJ'), ('mr', 'NN'), ('presid', 'NN')],
 [('given', 'VBN'),
  ('input', 'JJ'),
  ('apart', 'RB'),
  ('say', 'VBP'),
  ('wrong', 'JJ'),
  ('argument', 'NN'),
  ('clearli', 'NN')],
 [('get', 'VB'),
  ('frustrat', 'JJ'),
  ('reason', 'NN'),
  ('want', 'VBP'),
  ('way', 'NN'),
  ('foundat', 'NN'),
  ('complex', 'NN'),
  ('problem', 'NN'),
  ('advanc', 'NN'),
  ('grade', 'VBD'),
  ('get', 'VB'),
  ('decent', 'JJ'),
  ('grade', 'NN'),
  ('sat', 'VBD'),
  ('type', 'JJ'),
  ('test', 'NN'),
  ('math', 'NN'),
  ('dont', 'NN'),
  ('realli', 'NN'),
  ('understand', 'NN'),
  ('lot', 'NN'),
  ('mathemat', 'FW'),
  ('way', 'NN'),
  ('get', 'VB'),
  ('r
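
The output above is a list of (token, tag) pairs rather than a numeric vector. One possible way to turn it into a vector, sketched below, is to count how often each tag occurs in an entry. This is an assumption about what a "tag vector" should look like, not the only option, and the names `tagged`, `tags` and `pos_vectors` are hypothetical. The input is copied from the second and third entries of the output above, so no tagger model is needed to run it.

```python
from collections import Counter

# Second and third tagged entries, copied from the notebook output above.
tagged = [
    [("right", "JJ"), ("mr", "NN"), ("presid", "NN")],
    [("given", "VBN"), ("input", "JJ"), ("apart", "RB"), ("say", "VBP"),
     ("wrong", "JJ"), ("argument", "NN"), ("clearli", "NN")],
]

# Count the tags in each entry.
tag_counts = [Counter(tag for _, tag in sent) for sent in tagged]

# Fixed, sorted column order shared by all entries.
tags = sorted(set().union(*tag_counts))

# One count per tag per entry; a Counter returns 0 for tags that are absent.
pos_vectors = [[counts[t] for t in tags] for counts in tag_counts]

print(tags)
for row in pos_vectors:
    print(row)
```

Because tagging ran on stemmed text, some tags look unreliable (for example 'someth' is tagged both JJ and VBZ above), so tagging the original sentences may give cleaner features.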

#### C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector 

In [23]:
# importing tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
# converting to tfidf vector
tfidf=TfidfVectorizer()
feature_matrix=tfidf.fit_transform(df['txt'])

In [26]:
print(feature_matrix)

  (0, 216587)	0.20543847701939705
  (0, 43495)	0.520434028084594
  (0, 143531)	0.17069389230218696
  (0, 78830)	0.2378378996139853
  (0, 163099)	0.30522421133569577
  (0, 112005)	0.13715355350480687
  (0, 208894)	0.3010725068607446
  (0, 63161)	0.21620970896384695
  (0, 222143)	0.12608257419537275
  (0, 203167)	0.10681551746595325
  (0, 53030)	0.2242419799540024
  (0, 144412)	0.20031140933114625
  (0, 25872)	0.2440448649167132
  (0, 179925)	0.3375268523054628
  (0, 71213)	0.19244684898061007
  (0, 218385)	0.15376090457657124
  (1, 156520)	0.46601666362367494
  (1, 136191)	0.7724222327426152
  (1, 167511)	0.4315001316221952
  (2, 38062)	0.3583557158027595
  (2, 20674)	0.3322817326300631
  (2, 222414)	0.3001203220949293
  (2, 171986)	0.2190579045027151
  (2, 19826)	0.44374960177741674
  (2, 114171)	0.5500585955332149
  :	:
  (949996, 150626)	0.05706131222417195
  (949996, 19670)	0.07918698026328105
  (949996, 25892)	0.07864753522117626
  (949996, 146237)	0.08762714634480105
  (949996, 16

### Follow-Up Question

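#### Consider building a spam filter that separates spam messages from ham. A word-count vector records how many times each vocabulary word occurs in a message. A part-of-speech tag vector records the grammatical role (noun, verb, adjective, and so on) that the tagger assigns to each token. A TF-IDF vector is a weighting scheme rather than a library: it scales each word's count in a message by how rare the word is across all messages, so words that are distinctive for a message score higher than words that appear everywhere.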