### Coding Challenge #1: Natural Language Processing

In this Coding Challenge, you will work through the steps needed to prepare text data for modelling. It covers a range of NLP concepts: **a)** Tokenization, **b)** Stopwords, **c)** Stemming/Lemmatization, and **d)** Vectorization.

Working through this challenge will equip you with the necessary knowledge to complete the first part of the Project Assignment.

**Dataset**: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection







**Step 1**: Explore the dataset to ascertain the following:

**a)** Determine whether there are any missing values. If any are found, treat them.

**b)** Ascertain the breakdown of messages by category: 1) how many "Spam" messages are there, and 2) how many "Ham" messages are there?

In [1]:
# Step 1
# Get the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head SMSSpamCollection

--2018-06-11 16:08:39--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/zip]
Saving to: ‘smsspamcollection.zip.1’


2018-06-11 16:08:40 (432 KB/s) - ‘smsspamcollection.zip.1’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives a

In [2]:
# Read with pandas
import pandas as pd
sms_data = pd.read_table('./SMSSpamCollection', header=None,
                         names=['category', 'content'])
sms_data.head()

| | category | content |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
print('Any Missing Values:', sms_data.isnull().any().any())
counts = sms_data.groupby('category').count()
print('ham messages: {}, spam messages: {}'.format(counts.loc['ham'].values[0],
                                                   counts.loc['spam'].values[0]))

Any Missing Values: False
ham messages: 4825, spam messages: 747
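
The two counts are easier to interpret as proportions, since they show how imbalanced the classes are. The following is a minimal sketch that uses only the counts printed above; nothing is recomputed from the dataset, and the variable names are illustrative.

```python
# Counts as reported by the groupby cell above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each class with its share of all messages
for label, n in counts.items():
    print(f"{label}: {n} messages ({shares[label]:.1%} of {total})")
```

Spam makes up a clear minority of the messages, which is worth keeping in mind when a classifier is later evaluated on this data.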


**Step 2:** Massage/pre-process the dataset:

**a)** You will need to eliminate punctuation

**b)** You will have to deal with/remove stopwords

**c)** Tokenize the text

**d)** Stem or Lemmatize the text

In [0]:
import nltk

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /content/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
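def process_text(sentence):
    """Tokenize a message, drop stop words and non-alphabetic tokens, and lemmatize.

    Relies on the module-level `stop_words` and `wnl` defined in the next cell.
    """
    # tokenize
    tokens = nltk.word_tokenize(sentence)

    # keep alphabetic tokens that are not stop words, then lemmatize;
    # the stop-word check is case-sensitive, so capitalized tokens such as "I" are kept
    return [wnl.lemmatize(w) for w in tokens
            if w not in stop_words and w.isalpha()]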

In [7]:
stop_words = set(nltk.corpus.stopwords.words('english'))
wnl = nltk.stem.WordNetLemmatizer()

tokenized = sms_data.copy()
tokenized['content'] = sms_data['content'].apply(process_text)
tokenized.head()

| | category | content |
| --- | --- | --- |
| 0 | ham | [Go, jurong, point, Available, bugis, n, great... |
| 1 | ham | [Ok, lar, Joking, wif, u, oni] |
| 2 | spam | [Free, entry, wkly, comp, win, FA, Cup, final,... |
| 3 | ham | [U, dun, say, early, hor, U, c, already, say] |
| 4 | ham | [Nah, I, think, go, usf, life, around, though] |
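
The output above still contains tokens such as `I` and `U`. NLTK's English stop-word list is lowercase, so a case-sensitive membership test lets capitalized stop words through. The sketch below illustrates the difference with a tiny, hypothetical stop-word set and plain whitespace tokenization (both are assumptions made for the example, not what the notebook uses).

```python
# Hypothetical, tiny stop-word list for illustration only (NLTK's list is much longer)
stop_words = {"i", "he", "to", "a"}

tokens = "Nah I think he goes to usf".split()  # simple whitespace tokenization

# Case-sensitive check, as in process_text: "I" is not in the lowercase list
case_sensitive = [w for w in tokens if w not in stop_words and w.isalpha()]

# Lowercase first, then filter
lowered = [w.lower() for w in tokens]
case_insensitive = [w for w in lowered if w not in stop_words and w.isalpha()]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering is a design choice: it removes more stop words, but it also discards capitalization, which can itself be a signal in spam messages.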


**Step 3:** Perform vectorization. You will apply three different vectorization techniques. Each technique produces a document-term matrix with the same layout: each row represents a text message and each column represents a word or a combination of words. The main difference between the techniques is the value stored in each cell of the matrix.

**1)** Create a document-term matrix based on the count of the words in each document. You may want to restrict the number of features/columns to the top features ordered by term frequency across the corpus

**2)** Create a trigram matrix from sequences of adjacent words (n-grams with n=3)

**3)** Create a TF-IDF matrix wherein the cells contain values (i.e. weights) that depict how important a word is to an individual SMS message




In [0]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [9]:
corpus = [' '.join(sentence) for sentence in tokenized['content']]
corpus[:5]

['Go jurong point Available bugis n great world la e buffet Cine got amore wat',
 'Ok lar Joking wif u oni',
 'Free entry wkly comp win FA Cup final tkts May Text FA receive entry question std txt rate T C apply',
 'U dun say early hor U c already say',
 'Nah I think go usf life around though']

### Document Term Matrix

In [0]:
max_features = 100

In [18]:
vectorizer = CountVectorizer(max_features=max_features)
vectorizer.fit(corpus)
document_term_matrix = vectorizer.transform(corpus)
document_term_matrix.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [0]:
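def sparse_to_dataframe(vectorizer, matrix):
    """Convert a sparse document-term matrix into a DataFrame with one column per term."""
    # vocabulary_ maps each term to its column index; order the terms by that index
    vocab = vectorizer.vocabulary_
    columns = sorted(vocab, key=vocab.get)

    # densifying is acceptable here because max_features keeps the matrix narrow
    return pd.DataFrame(matrix.toarray(), columns=columns)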

In [20]:
term_frequency = sparse_to_dataframe(vectorizer, document_term_matrix)
term_frequency.head()

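| | already | amp | and | are | ask | back | but | call | can | claim | ... | way | we | week | well | what | work | yeah | yes | you | your |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |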
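
To make the count-based matrix concrete, here is a rough pure-Python sketch of the idea on a toy corpus. It mimics two `CountVectorizer` defaults (lowercasing, and keeping only tokens of two or more word characters, which is why single letters such as `u` never appear as columns). The third document and the alphabetical tie-breaking between equally frequent terms are assumptions made for the example.

```python
import re
from collections import Counter

# First two documents come from the cleaned corpus above; the third is hypothetical
corpus = ["Ok lar Joking wif u oni",
          "U dun say early hor U c already say",
          "say ok already"]

# Lowercase and keep tokens with at least two word characters
token_re = re.compile(r"\b\w\w+\b")
docs = [token_re.findall(text.lower()) for text in corpus]

# Keep the max_features most frequent terms across the whole corpus
max_features = 3
totals = Counter(term for doc in docs for term in doc)
top_terms = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
vocab = sorted(top_terms)  # columns in alphabetical order

# One row per document, one column per term, cells hold raw counts
matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

With only a handful of columns, most cells are zero, which is the same sparsity visible in the first rows of `term_frequency`.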


### Document Trigram Matrix

In [14]:
vectorizer = CountVectorizer(max_features=max_features, ngram_range=(3, 3))
vectorizer.fit(corpus)
ngram_matrix = vectorizer.transform(corpus)
ngram_frequency = sparse_to_dataframe(vectorizer, ngram_matrix)
ngram_frequency.head()

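| | account statement show | admirer looking make | anytime network min | await collection sae | bonus caller prize | call claim code | call customer service | call identifier code | call just per | call land line | ... | urgent please call | urgent we trying | urgent your mobile | we trying contact | week nokia tone | week txt nokia | won guaranteed cash | your account statement | your mobile no | your mobile number |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |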
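
Each column above is a run of three adjacent words. As a small illustration of how such word trigrams are formed, the sketch below slides a window of size `n` over a token list. The helper name `word_ngrams` and the example message are hypothetical; `CountVectorizer` handles this step internally when `ngram_range=(3, 3)` is set.

```python
def word_ngrams(tokens, n=3):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical message containing two of the trigrams shown in the columns above
trigrams = word_ngrams("urgent we trying contact you".split())

# A message with fewer than n tokens yields no trigrams at all
short = word_ngrams("ok lar".split())

print(trigrams)
print(short)
```

Short messages therefore produce all-zero rows in the trigram matrix, which is one reason the rows shown above contain only zeros.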


### TF-IDF

In [17]:
vectorizer = TfidfVectorizer(max_features=max_features)
vectorizer.fit(corpus)
tfidf_matrix = vectorizer.transform(corpus)
tfidf = sparse_to_dataframe(vectorizer, tfidf_matrix)
tfidf.head()

Unnamed: 0,already,amp,and,are,ask,back,but,call,can,claim,...,way,we,week,well,what,work,yeah,yes,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.465053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
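
To see where fractional weights like the ones above come from, the following is a pure-Python sketch of the weighting as I understand `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy documents are hypothetical, and the sketch is meant as an illustration of the formula rather than a replacement for the library.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents
docs = [["call", "now"], ["call", "later"], ["call", "now", "now"]]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (rarer terms get larger values)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    tf = Counter(doc)
    raw = [tf[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf_rows = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

In the first toy document, `now` receives a higher weight than `call` because `call` occurs in every document and is therefore less informative about any single message.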
