[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/M11_CC_DJ_NLP_Coding_Challenge__1.ipynb)

### Coding Challenge #1: Natural Language Processing

In this Coding Challenge, you will work through the steps needed to get data organized for modelling purposes. You will be exposed to a range of NLP-related concepts such as **a)** Tokenization, **b)** Stopwords, **c)** Stemming/Lemmatization, and **d)** Vectorization.

Walking through this challenge will equip you with the necessary knowledge to work through the first part of the Project Assignment.

**Dataset**: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection







**Step 1**: Explore the dataset to ascertain the following:

**a)** Determine whether there are any missing values. If missing values are diagnosed, treat them. 

**b)** Ascertain the breakdown/count of messages. 1) How many "Spam" messages are there and 2) How many "Ham" messages are there?

In [1]:
%%time
# Step 1
# Get the data
#!conda install -c anaconda nltk # nltk is part of the Anaconda distribution
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head SMSSpamCollection
!ls -lh SMSSpamCollection

--2018-06-12 09:48:14--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/zip]
Saving to: ‘smsspamcollection.zip’


2018-06-12 09:48:16 (480 KB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives aroun

In [2]:
# Read with pandas
import pandas as pd
sms_data = pd.read_table('./SMSSpamCollection', header=None,
                         names=['category', 'content'])
sms_data.head()

| | category | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
#sms_data
sms_data.shape


(5572, 2)

In [4]:
sms_data.isnull().sum() # No nulls

category    0
content     0
dtype: int64
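
No values are missing here, so nothing needs treating. Had `isnull().sum()` reported gaps, one possible treatment would be to drop rows that have no message text. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with one missing message body
toy = pd.DataFrame({
    'category': ['ham', 'spam', 'ham'],
    'content': ['Ok lar', None, 'See you soon'],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()

# Drop rows whose message text is missing; a label without text is unusable
cleaned = toy.dropna(subset=['content']).reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Dropping is only one choice; filling with an empty string via `fillna('')` would keep the row count unchanged if that matters downstream.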

In [5]:
sms_data[sms_data.category == 'ham'].count() + sms_data[sms_data.category == 'spam'].count()

category    5572
content     5572
dtype: int64

In [6]:
# HAM and SPAM counts
print('ham',sms_data[sms_data.category == 'ham'].count())
print('spam',sms_data[sms_data.category == 'spam'].count())

ham category    4825
content     4825
dtype: int64
spam category    747
content     747
dtype: int64
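
The same breakdown can be read off in one call with `value_counts()`, which also gives class proportions when `normalize=True`. A small sketch, using a hypothetical stand-in frame so that it runs without the download:

```python
import pandas as pd

# Hypothetical stand-in for sms_data (same column names, made-up rows)
toy = pd.DataFrame({
    'category': ['ham', 'ham', 'spam', 'ham'],
    'content': ['Ok lar', 'See you soon', 'Free entry now', 'On my way'],
})

counts = toy['category'].value_counts()                # messages per class
shares = toy['category'].value_counts(normalize=True)  # fraction per class

print(counts)
print(shares)
```

On the real data this should agree with the counts above (4825 ham, 747 spam), which makes the class imbalance easy to see: roughly 13% of the messages are spam.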


**Step 2:** Massage/Pre-process the dataset:

**a)** You will need to eliminate punctuation

**b)** You will have to deal with/remove stopwords

**c)** Tokenize the text

**d)** Stem or Lemmatize the text
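
The four steps above can be sketched as one small function. This is an illustration under stated assumptions, not the notebook's solution: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-written one, and stemming is left to a function supplied by the caller so the sketch runs without any NLTK downloads.

```python
import re
import string

# Tiny hand-written stopword list, an assumption for illustration only;
# nltk.corpus.stopwords.words('english') gives a fuller list.
DEMO_STOPWORDS = {'a', 'in', 'to', 'the', 'until', 'only', 'there'}

def preprocess(text, stop_words=DEMO_STOPWORDS, stem=None):
    """Return the cleaned tokens of one message (hypothetical helper)."""
    # a) remove punctuation
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    # c) tokenize: lowercase, then split on runs of non-word characters
    tokens = [t for t in re.split(r'\W+', no_punct.lower()) if t]
    # b) drop stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # d) optionally stem or lemmatize with a caller-supplied function
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens

print(preprocess('Go until jurong point, crazy.. Available only in bugis'))
```

Passing `stem=nltk.PorterStemmer().stem` would apply Porter stemming to each token. Steps b) and c) are swapped relative to the list because stopword filtering is easier on tokens than on raw text.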

In [9]:
%%time  # I have to be at the top
# Step 2

#import pandas as pd
import string, re
import nltk
nltk.download('all')

from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer, RegexpTokenizer
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to
[n

In [10]:
%%time
# punctuation removal
df_spam = sms_data[sms_data.category == 'spam']
df_ham = sms_data[sms_data.category == 'ham']

CPU times: user 5.25 ms, sys: 1.25 ms, total: 6.5 ms
Wall time: 13.4 ms


In [50]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [13]:
sms_data.shape

(5572, 2)

In [15]:
%%time

table = str.maketrans('','',string.punctuation)
#print(df_ham.content)
stripped = [w.translate(table) for w in sms_data.content]

CPU times: user 27.6 ms, sys: 12.1 ms, total: 39.7 ms
Wall time: 54.7 ms


In [20]:
stripped[0]

'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

In [17]:
# Get the English stopwords
# (named stop_words so the imported `stopwords` module is not shadowed)
stop_words = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

In [18]:
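%%time
# Tokenize each message on runs of non-word characters.
# re.split expects a single string, so apply it to every message in the list
# and drop the empty strings it can leave at the ends.
tokens = [[t for t in re.split(r'\W+', message) if t] for message in stripped]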
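
`re.split` works on one string at a time, so a list of messages has to be tokenized message by message. The sketch below also shows how splitting on `\W+` differs from a plain whitespace split (the two sample messages are taken from the `head` output above):

```python
import re

messages = ['Ok lar... Joking wif u oni...',
            "Nah I don't think he goes to usf"]

# Whitespace split keeps punctuation attached to the words
whitespace_tokens = [m.split() for m in messages]

# \W+ splits on any run of non-word characters, so punctuation disappears
# but contractions such as "don't" are broken in two; empty strings are dropped
regex_tokens = [[t for t in re.split(r'\W+', m) if t] for m in messages]

print(whitespace_tokens[0])
print(regex_tokens[0])
print(regex_tokens[1])
```

Stripping punctuation before tokenizing, as done earlier with `str.maketrans`, avoids the split contraction because the apostrophe is removed first.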

In [48]:
#print(df_ham.shape)

def preprocess_nlp(werdz):
  # Work on a copy so the new column is not written to a slice of sms_data
  werdz = werdz.copy()
  # Strip punctuation from every message with the translation table built earlier
  werdz['no_punc'] = werdz.content.apply(lambda text: text.translate(table))
  return werdz

werds_nopunc = preprocess_nlp(df_spam)

# Note: isnull().count() counts rows per column; use isnull().sum() to count nulls
werds_nopunc.isnull().count()


category    747
content     747
no_punc     747
dtype: int64

In [49]:
#dir(werds_nopunc)
werds_nopunc.shape  # shape is an attribute, not a method

(747, 3)

**Step 3:** Perform Vectorization. You will apply three different vectorization techniques. Each technique generates a similar document-term matrix in which each row represents a text message and each column represents a word or a combination of words. The main difference between the techniques is the value stored in each cell of the matrix.

**1)** Create a document-term matrix based on the count of the words in each message. You may want to restrict the number of features/columns to the top features ordered by term frequency across the corpus.

**2)** Create a trigram vector using combinations of adjacent words. In this case, n=3.

**3)** Create a TF-IDF vector wherein the cells of the matrix contain values (i.e. weights) that depict how important a word is to an individual SMS message.
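
To make the difference in cell values concrete, here is a from-scratch sketch of the three matrices on three made-up, already-tokenized messages. The helper names (`count_matrix`, `trigrams`, `tfidf_matrix`) and the data are hypothetical, and the TF-IDF formula used is the plain term frequency times log(N / document frequency). In practice scikit-learn's `CountVectorizer` (with its `max_features` and `ngram_range` parameters) and `TfidfVectorizer` are the usual route; `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from the plain formula here.

```python
import math
from collections import Counter

# Three made-up, already-tokenized messages (hypothetical data)
docs = [
    ['free', 'entry', 'win', 'cup'],
    ['ok', 'see', 'you', 'soon'],
    ['win', 'free', 'cup', 'free'],
]

def count_matrix(token_lists, max_features=None):
    """Rows = messages, columns = terms, cells = raw counts."""
    totals = Counter(t for tokens in token_lists for t in tokens)
    # Keep the most frequent terms across the corpus (ties broken alphabetically)
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:max_features])
    rows = []
    for tokens in token_lists:
        per_doc = Counter(tokens)
        rows.append([per_doc[term] for term in vocab])
    return vocab, rows

def trigrams(tokens):
    """Join every run of three adjacent tokens into one feature."""
    return [' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf_matrix(token_lists):
    """Cells = term frequency * log(N / document frequency)."""
    vocab, counts = count_matrix(token_lists)
    n_docs = len(token_lists)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    rows = [[row[j] * math.log(n_docs / doc_freq[j]) for j in range(len(vocab))]
            for row in counts]
    return vocab, rows

# 1) count matrix restricted to the 3 most frequent terms
vocab, counts = count_matrix(docs, max_features=3)
# 2) trigram counts: same counting, applied to 3-word features
tri_vocab, tri_counts = count_matrix([trigrams(d) for d in docs])
# 3) TF-IDF weights over the full vocabulary
tfidf_vocab, weights = tfidf_matrix(docs)

print(vocab, counts)
print(tri_vocab)
print(tfidf_vocab)
```

All three matrices share the same layout; only the cell contents change, from raw word counts, to counts of three-word sequences, to weights that shrink for terms appearing in many messages.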




In [None]:
# Step 3