### Coding Challenge #1: Natural Language Processing

This Coding Challenge walks through the steps needed to prepare text data for modelling. It covers several NLP concepts: **a)** Tokenization, **b)** Stopwords, **c)** Stemming/Lemmatization, and **d)** Vectorization.

Working through this challenge will equip you with the knowledge necessary for the first part of the Project Assignment.

**Dataset**: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection







**Step 1**: Explore the dataset to ascertain the following:

**a)** Determine whether there are any missing values. If you find any, handle them (for example, by dropping the affected rows).

**b)** Determine the class breakdown of the messages: 1) how many "Spam" messages are there, and 2) how many "Ham" messages are there?

In [71]:
# Step 1
# Get the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head SMSSpamCollection

--2018-06-11 23:58:54--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/zip]
Saving to: ‘smsspamcollection.zip.2’


2018-06-11 23:58:55 (506 KB/s) - ‘smsspamcollection.zip.2’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives a

In [74]:
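# Read the tab-separated file with pandas.
# The file has no header row: column 1 is the label, column 2 is the message.
import pandas as pd
sms_data = pd.read_table('./SMSSpamCollection', header=None,
                         names=['category', 'content'])
print(sms_data.head())
sms_data.shape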

  category                                            content
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


(5572, 2)

In [75]:
# Check each column for missing values, then count the messages per class
print(sms_data.isnull().any())
print("ham messages:", (sms_data.category == 'ham').sum())
print("spam messages:", (sms_data.category == 'spam').sum())

category    False
content     False
dtype: bool
ham messages: 4825
spam messages: 747
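
The same two checks can also be sketched without pandas, which makes the file format explicit: each line is a label, a tab, and the message text. This is a minimal illustration, not the notebook's method. The names `summarize` and `sample` are made up for this example, the sample lines are copied from the `head` output above, and `csv.QUOTE_NONE` is an assumption chosen so that quote characters inside messages are read literally.

```python
import csv
import io
from collections import Counter

# Three lines copied from the `head SMSSpamCollection` output above.
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
    "08452810075over18's\n"
    "ham\tU dun say so early hor... U c already then say...\n"
)


def summarize(lines):
    """Return (label counts, number of rows with a missing label or message)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    labels = Counter()
    missing = 0
    for row in reader:
        # A well-formed row has exactly two non-empty fields.
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            missing += 1
            continue
        labels[row[0]] += 1
    return labels, missing


labels, missing = summarize(io.StringIO(sample))
print(dict(labels), "rows with missing fields:", missing)
```

To run it against the real file, pass an open file handle instead of the `io.StringIO` wrapper.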


**Step 2:** Pre-process the dataset:

**a)** Remove punctuation

**b)** Remove stopwords

**c)** Tokenize the text

**d)** Stem or Lemmatize the text

In [76]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [77]:
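# NLTK's English stop-word list is all lowercase
stop_words = set(stopwords.words('english'))
ps = nltk.PorterStemmer()

text = sms_data['content']

def preprocess_text(message):
    """Tokenize a message, drop stop words and non-alphabetic tokens, then stem."""
    tokens = word_tokenize(message)
    # Tokens are compared to the stop-word list without lowercasing, so
    # capitalized stop words such as "I" pass the filter (see the output below).
    # isalpha() discards punctuation and any token that contains digits.
    return [ps.stem(w) for w in tokens if w not in stop_words and w.isalpha()]

# Work on a copy so the raw messages stay available for the vectorizers
clean_data = sms_data.copy()
clean_data['content'] = text.apply(preprocess_text)
clean_data.head()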


  category                                            content
0      ham  [Go, jurong, point, avail, bugi, n, great, wor...
1      ham                       [Ok, lar, joke, wif, u, oni]
2     spam  [free, entri, wkli, comp, win, FA, cup, final,...
3      ham      [U, dun, say, earli, hor, U, c, alreadi, say]
4      ham    [nah, I, think, goe, usf, live, around, though]
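
The output above still contains tokens such as "I" because the stop-word comparison is case-sensitive. The sketch below shows the effect of lowercasing before filtering. It is an illustration under stated assumptions rather than a replacement for the NLTK pipeline: `STOP_WORDS` is a tiny hypothetical list, the message is made up, tokenization is a plain whitespace split, and no stemming is applied.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "he", "a", "to", "so", "then", "there", "in"}


def preprocess(message, lowercase=True):
    """Strip punctuation, split on whitespace, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = message.translate(table).split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


msg = "I think he lives there, so I said OK"  # made-up example message
case_sensitive = preprocess(msg, lowercase=False)
case_folded = preprocess(msg, lowercase=True)
print(case_sensitive)
print(case_folded)
```

Lowercasing also merges variants such as "Ok" and "OK" into one vocabulary entry, which shrinks the feature space. Whether that is desirable depends on whether capitalization carries signal for spam detection.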


**Step 3:** Perform vectorization by applying three different techniques. Each technique produces a document-term matrix in which each row represents a text message and each column represents a word or a combination of words. The main difference between the techniques is the value stored in each cell of the matrix.

**1)** Create a document-term matrix based on word counts. You may want to restrict the number of features (columns) to the most frequent terms across the corpus.

**2)** Create a trigram matrix, in which each column is a sequence of adjacent words. In this case, n=3.

**3)** Create a TF-IDF matrix, in which each cell contains a weight that reflects how important a word is to an individual SMS message.
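
To make the first technique concrete, here is a minimal pure-Python sketch of a count-based document-term matrix with a cap on the number of features. The function `count_matrix` and the toy token lists are hypothetical and exist only for this illustration; a library vectorizer would return a sparse matrix rather than nested lists.

```python
from collections import Counter


def count_matrix(tokenized_docs, max_features=None):
    """Return (vocabulary, rows) for a count-based document-term matrix."""
    # Total frequency of each term across the whole corpus.
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    # Keep the most frequent terms, then sort them so column order is stable.
    kept = [term for term, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Toy, already-tokenized messages (made up for this example).
docs = [
    ["free", "entri", "win", "free"],
    ["ok", "lar", "joke"],
    ["win", "free", "cup"],
]
vocab, matrix = count_matrix(docs, max_features=2)
print(vocab)
for row in matrix:
    print(row)
```

With `max_features=2`, only the two most frequent terms survive, so the second message becomes an all-zero row. The same effect explains the many all-zero rows in a capped matrix of short SMS messages.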




In [78]:
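# Technique 1: document-term matrix of raw term counts
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A callable `analyzer` makes the vectorizer use preprocess_text for
# tokenization, stop-word removal, and stemming.
# max_features keeps only the 1000 most frequent terms across the corpus.
count_vector = CountVectorizer(analyzer=preprocess_text, max_features=1000)
count = count_vector.fit_transform(text)
print(count.shape)
pd.DataFrame(count.toarray())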

(5572, 1000)


   0  1  2  3  4  5  6  7  8  9  ...  990  991  992  993  994  995  996  997  998  999
0  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
1  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
2  0  0  0  0  0  0  0  0  1  0  ...    0    0    0    0    0    0    0    0    0    0
3  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
4  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
5  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
6  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
7  0  0  0  1  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
8  0  0  0  1  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0
9  0  0  0  0  0  0  0  0  0  0  ...    0    0    0    0    0    0    0    0    0    0


In [79]:
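# Technique 2: document-trigram matrix
# Each column is a sequence of three adjacent words (n = 3).
# No custom analyzer is passed, so the raw messages go through
# CountVectorizer's default tokenization and lowercasing, not preprocess_text.
trigram_vector = CountVectorizer(ngram_range=(3, 3))
W_counts = trigram_vector.fit_transform(text)
print(W_counts.shape)
# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() on current versions.
print(trigram_vector.get_feature_names())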


(5572, 54461)
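
Each of the 54461 columns above corresponds to one distinct trigram. As a rough sketch of how trigrams arise from a token sequence, the hypothetical helper `word_ngrams` below slides a window of three adjacent tokens over a message. It assumes a plain whitespace split on a lowercased message from the dataset, which differs from `CountVectorizer`'s default tokenization (the default token pattern ignores single-character tokens such as "u").

```python
def word_ngrams(tokens, n=3):
    """Return all n-grams over adjacent tokens as space-joined strings."""
    # A sequence of k tokens yields k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "ok lar joking wif u oni".split()
trigrams = word_ngrams(tokens)
print(trigrams)
```

A message with fewer than three tokens produces no trigrams at all, so very short SMS messages end up as all-zero rows in a trigram matrix.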


In [80]:
# Convert the sparse trigram matrix to a dense DataFrame for display.
# With 54461 columns the dense array is large, so this step can exhaust memory.
W_counts_df = pd.DataFrame(W_counts.toarray())
W_counts_df

   0  1  2  3  4  5  6  7  8  9  ...  54451  54452  54453  54454  54455  54456  54457  54458  54459  54460
0  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
1  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
2  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
3  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
4  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
5  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
6  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
7  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
8  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
9  0  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0      0
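
Nearly every cell shown above is zero, which is why the vectorizers return sparse matrices. A quick back-of-the-envelope calculation suggests what the dense conversion costs. It uses the shape printed by the trigram cell and assumes 8 bytes per cell (64-bit integer counts); the actual dtype in a given environment may differ.

```python
rows, cols = 5572, 54461   # shape printed by the trigram cell
bytes_per_cell = 8         # assumption: 64-bit integer counts

cells = rows * cols
dense_bytes = cells * bytes_per_cell
dense_gib = dense_bytes / 2**30

print(f"{cells:,} cells, about {dense_gib:.2f} GiB when stored densely")
```

Under that assumption the dense array needs more than 2 GiB, so keeping the matrix sparse, or displaying only a small slice of it, is usually the safer choice.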


In [81]:
# Technique 3: TF-IDF matrix
# Each cell holds a weight that reflects how important a term is to an
# individual SMS message relative to the whole corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vector = TfidfVectorizer(analyzer=preprocess_text)
W_tf_idf = tf_idf_vector.fit_transform(text)
print(W_tf_idf.shape)
# On scikit-learn 1.2 or later, call get_feature_names_out() instead.
print(tf_idf_vector.get_feature_names())

(5572, 6076)
['A', 'AD', 'AG', 'AH', 'AL', 'AM', 'AN', 'AS', 'AT', 'AV', 'Ah', 'Al', 'Am', 'An', 'As', 'At', 'Ay', 'B', 'BE', 'BK', 'BT', 'BY', 'Bc', 'Be', 'Bt', 'Bx', 'By', 'C', 'CC', 'CD', 'CL', 'CM', 'CU', 'Ca', 'Co', 'Cs', 'D', 'DA', 'DD', 'DE', 'DO', 'Da', 'De', 'Do', 'Dr', 'E', 'ER', 'EY', 'Ee', 'Eh', 'Em', 'En', 'Er', 'Ew', 'F', 'FA', 'FM', 'Fr', 'G', 'GE', 'GM', 'GN', 'GO', 'Gd', 'Ge', 'Gn', 'Go', 'H', 'HI', 'HL', 'HU', 'Ha', 'He', 'Hi', 'Hm', 'Ho', 'I', 'ID', 'IF', 'IL', 'IM', 'IN', 'IQ', 'IS', 'IT', 'Ic', 'Id', 'If', 'Im', 'In', 'Is', 'It', 'J', 'JD', 'K', 'KR', 'L', 'LE', 'Lk', 'M', 'ME', 'MF', 'MO', 'MR', 'MY', 'Ma', 'Me', 'Mm', 'Mr', 'My', 'N', 'NA', 'NO', 'NY', 'No', 'Nt', 'Nw', 'O', 'OF', 'OH', 'OK', 'ON', 'OR', 'Of', 'Oh', 'Oi', 'Ok', 'On', 'Or', 'Oz', 'P', 'PA', 'PC', 'PO', 'PS', 'Pa', 'Pg', 'Pl', 'Po', 'Q', 'R', 'RV', 'Re', 'Rs', 'S', 'SF', 'SI', 'SN', 'SO', 'SP', 'ST', 'Sh', 'Si', 'So', 'St', 'T', 'TA', 'TC', 'TH', 'TO', 'TS', 'TV', 'TX', 'Ta', 'Tb', 'To', 'Ts', 'U',

In [82]:
# Convert the sparse TF-IDF matrix to a dense DataFrame for display
W_tfidf_df = pd.DataFrame(W_tf_idf.toarray())
W_tfidf_df

          0    1    2    3    4    5    6    7    8    9  ...  6066  6067  6068  6069  6070  6071  6072  6073  6074      6075
0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
1  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
2  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
3  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
4  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
5  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
6  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
7  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
8  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
9  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000
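
To show where such weights come from, the sketch below computes TF-IDF by hand on three toy messages. It is an illustration under stated assumptions, not the library's implementation: it uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which matches the defaults described in the scikit-learn documentation. The function `tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed IDF: rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Toy, already-tokenized messages (made up for this example).
docs = [["free", "win", "free"], ["ok", "lar"], ["win", "cup"]]
vocab, matrix = tf_idf(docs)
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

In the third message, "cup" appears in only one document while "win" appears in two, so "cup" receives the larger weight. Terms absent from a message stay at zero, which is why most cells in the matrix above are 0.0.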
