# Natural Language Processing Using NLTK and Classification Models

In this project, natural language processing techniques, including regular expressions and several tools from the NLTK package, were used to clean and process text data. A pipeline covering text processing, numeric feature extraction and model optimization with GridSearchCV was established. The pipeline was applied to the [UCI SMS Spam Collection data set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) and the [Yelp Reviews dataset from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

Three classification models, multinomial naive Bayes, logistic regression and a support vector machine, were used to predict the classes of the messages or reviews. The support vector machine showed better precision and recall on both the **SMS Spam** and **Yelp review** datasets.

## Load Packages and Data
Among the loaded packages, PorterStemmer is used for word stemming, and stopwords provides a list of words that appear commonly in documents and therefore carry little information for differentiating them.

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
%matplotlib inline

Then, load the data.

In [64]:
messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t',names=["label", "message"])
messages.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


The dataset contains two columns. The 'label' column tells us whether a message is ham or spam and is used as the target variable in the classification models. The 'message' column contains the text content of the message. The purpose of this project is to use classification models to predict, from the text content alone, whether new messages are spam.


Since natural language processing focuses primarily on text processing, graphical data exploration is usually not used as intensively as in other machine learning applications. Here I just briefly explore the SMS Spam dataset using the pandas **groupby** and **describe** functions.

In [65]:
messages.groupby('label').describe()

| label | message count | message unique | message top | message freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |


This data summary shows that the dataset contains 4825 ham (legitimate) messages and 747 spam messages. In addition, the most frequent ham message is "Sorry, I'll call later", while the most frequent spam message is much longer and contains key words such as "customer", "service" and "representative" that are more likely to occur in business conversations. In the next step, regular expressions and some NLTK tools will be used to extract features from the text content of the messages to differentiate the two message categories.

## Text Pre-processing

### Punctuation and stop words removal
This project uses the bag-of-words model to convert the words in each message into a numerical feature vector, and scikit-learn provides tools that extract words and build these vectors. Each unique word has its own position (index) in the vector, and the value at that position is the frequency of the word in the document. The word frequency can also be "normalized" by how common the word is across the entire document pool (**Inverse Document Frequency**, idf adjustment). The idf-adjusted frequency may (or may not) help to compare and differentiate documents, depending on the problem. For example, a word that appears in every document can still be informative if its frequency in some documents is much higher than in others and those frequencies follow a pattern related to the document classes; keeping such words then still helps to differentiate the documents.
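To make the idea concrete, here is a minimal sketch in plain Python of raw bag-of-words counts and an idf weight. The three toy messages are hypothetical, and the idf formula is the smoothed variant described in the scikit-learn documentation; the sketch only illustrates the principle and is not the code path used later in this project.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, not taken from the dataset)
docs = ["free entry win", "call me later", "free call"]

# Vocabulary: every unique word gets a fixed index in the feature vector
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Raw bag-of-words vector: how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def smoothed_idf(word):
    """Smoothed idf: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(word in doc.split() for doc in docs)
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

count_vectors = [count_vector(doc) for doc in docs]
idf = {word: smoothed_idf(word) for word in vocab}

print(vocab)
print(count_vectors[0])
print({word: round(value, 3) for word, value in idf.items()})
```

Words that occur in more documents ("call", "free") receive a smaller idf weight than words that occur in only one document.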

Special characters, such as punctuation, were removed using Python's built-in **string** module. Emoticons such as ':)' were extracted using regular expressions and kept in the message, since they carry meaningful information.

Some common words (**stop words**), such as "and", "is" and "a", appear in almost every document and therefore may not provide much information for differentiating documents (this also depends on the problem). These stop words can be removed using the stop word list provided by the NLTK package.
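As a rough illustration of stop word filtering, the sketch below uses a tiny hand-written stop list. Both the list and the helper name remove_stop_words are assumptions made for this example; the project itself relies on the much longer NLTK list.

```python
# Tiny hand-written stop list for illustration only; the project uses
# the longer list from nltk.corpus.stopwords.words('english')
toy_stop_words = {"and", "is", "a", "the", "to"}

def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

tokens = remove_stop_words("This is a test and a demo", toy_stop_words)
print(tokens)
```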

Below is the text_process() function, which removes punctuation characters from messages while keeping emoticons. The NLTK stop word list is stored in a variable called 'stop', which is later passed to the vectorizer to remove stop words.

In [144]:
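def text_process(inputstring):
    """
    input
      inputstring: a text string, with words separated by spaces
    output
      the string with all punctuation removed and any emoticons
      appended at the end (stop words are removed later by the vectorizer)
    """
    # find emoticons such as :) ;-( =D before punctuation is stripped
    emotions = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|p)', inputstring)
    # drop punctuation characters, then append the emoticons without the '-'
    nopunc = ''.join([char for char in inputstring if char not in string.punctuation]) + ' '.join(emotions).replace('-', '')

    return nopunc

# NLTK English stop word list, used later by the vectorizer
stop = stopwords.words('english')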
    

Now let's test the text_process() function using the input string **"This :) is :+ :( a test :-)@!"**, which contains emoticons and punctuation. We can see that text_process kept the emoticons (moved to the end of the string, with the '-' dropped) and removed all other punctuation.

In [116]:
text_process("This :) is :+ :( a test :-)@!")

'This  is   a test :) :( :)'
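To see which character sequences the regular expression treats as emoticons, the pattern can be tried on its own. This is a small sketch that reuses the pattern from text_process(); the helper name find_emoticons and the sample string are made up for illustration.

```python
import re

# Same pattern as in text_process(): eyes, optional '-' nose, mouth
EMOTICON_PATTERN = r'(?::|;|=)(?:-)?(?:\)|\(|D|p)'

def find_emoticons(text):
    """Return emoticons in order of appearance, with the '-' nose removed."""
    return [match.replace('-', '') for match in re.findall(EMOTICON_PATTERN, text)]

sample = "great :-D not sure ;p bad =( odd :+"
found = find_emoticons(sample)
print(found)
```

The sequence ':+' is not matched because '+' is not one of the mouth characters in the pattern, which is why it disappears from the test output above.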

Next, I remove all punctuation characters from the message column by passing text_process to the pandas apply function:

In [117]:
messages['message']=messages['message'].apply(text_process)

In [118]:
messages.head()

| | label | message | length |
|---|---|---|---|
| 0 | ham | Go until jurong point crazy Available only in ... | 111 |
| 1 | ham | Ok lar Joking wif u oni | 29 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 |
| 3 | ham | U dun say so early hor U c already then say | 49 |
| 4 | ham | Nah I dont think he goes to usf he lives aroun... | 61 |


### Word Stemming
Word stemming is the process of transforming a word into its root form, which maps related words to the same stem and can effectively reduce the size of the vocabulary. Here I used the Porter stemmer algorithm implemented in NLTK. I implemented two functions to tokenize the input string: one with stemming, and one that simply splits the input string into a word list without stemming.

In [145]:
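from nltk.stem.porter import PorterStemmer

def tokenizer(text):
    # split on whitespace without stemming
    return text.split()

def tokenizer_porter(text):
    # split on whitespace and reduce each word to its Porter stem
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]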
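The effect of stemming on vocabulary size can be illustrated without NLTK. The sketch below uses a deliberately crude suffix stripper, which is an assumption made for this example and is not the Porter algorithm, to show how several surface forms collapse into one vocabulary entry.

```python
def naive_stem(word):
    """Crude suffix stripper for illustration; this is NOT the Porter algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    # drop a trailing 'e' so that 'live' and 'living' share a stem
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

words = ["lives", "live", "living", "says", "say", "saying"]

vocab_plain = set(words)
vocab_stemmed = {naive_stem(word) for word in words}

print(len(vocab_plain))
print(sorted(vocab_stemmed))
```

Six distinct words are reduced to two stems here. A real stemmer such as Porter applies a more careful set of rules, so its stems will differ from the ones produced by this toy function.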

### Vectorization
The pre-processed text messages are then converted to numeric feature vectors by the TfidfVectorizer object in sklearn. In the next cell, I create two TfidfVectorizer objects: one uses the idf adjustment, and the other uses only the raw word counts.

In [120]:
from sklearn.feature_extraction.text import TfidfVectorizer

TFIDF_transformer=TfidfVectorizer(stop_words=stop,tokenizer=tokenizer).fit(messages['message'])
TFIDF_transformer_NoIDF=TfidfVectorizer(stop_words=stop,tokenizer=tokenizer,use_idf=False,norm=None,smooth_idf=False).fit(messages['message'])

To see the effects of these transformations, let's take one text message from the dataset (the fourth message, at index 3):

In [146]:
message4 = messages['message'][3]
print(message4)

U dun say so early hor U c already then say


and look at its vector representation with and without the idf adjustment:

In [122]:
TFIDF_message4_NoIDF=TFIDF_transformer_NoIDF.transform([message4])
TFIDF_message4=TFIDF_transformer.transform([message4])

print("TFIDF Frequency W/O adjustment: ")
print(TFIDF_message4_NoIDF)
print("\n")
print("TFIDF Frequency With adjustment: ")
print(TFIDF_message4)


TFIDF Frequency W/O adjustment: 
  (0, 1145)	1.0
  (0, 1930)	1.0
  (0, 3044)	1.0
  (0, 3065)	1.0
  (0, 4282)	1.0
  (0, 7328)	2.0
  (0, 8767)	2.0


TFIDF Frequency With adjustment: 
  (0, 8767)	0.319491587697
  (0, 7328)	0.559700094048
  (0, 4282)	0.464527625007
  (0, 3065)	0.335574365269
  (0, 3044)	0.309125465389
  (0, 1930)	0.287037034059
  (0, 1145)	0.279850047024


From the printed lists, we see that there are seven unique words in message 4 once stop words are excluded. In the results without adjustment, the frequencies of some words are exactly twice those of the others. Let's check some of the words:

In [123]:
print(TFIDF_transformer.get_feature_names()[1145])
print(TFIDF_transformer.get_feature_names()[1930])
print(TFIDF_transformer.get_feature_names()[7328])
print(TFIDF_transformer.get_feature_names()[8767])

already
c
say
u


Comparing with the content of message 4, we see that 'U' and 'say' appear twice and the other words appear only once, which is consistent with the numbers in the 'TFIDF Frequency W/O adjustment' output. To summarize, the values in 'TFIDF Frequency W/O adjustment' are the raw word counts in the message, while the values in 'TFIDF Frequency With adjustment' are idf-adjusted (and, with the default norm='l2', scaled to unit length).
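Both observations can be checked by hand. The following sketch recounts the words of message 4 in plain Python and verifies that the idf-adjusted weights printed above form a vector of length one. It assumes that 'so' and 'then' are the only words of this message in the NLTK stop list, which matches the seven entries in the printed output.

```python
import math
from collections import Counter

message = "U dun say so early hor U c already then say"

# Assumption: these are the only two words of the message in the stop list
stop_in_message = {"so", "then"}

# TfidfVectorizer lowercases by default, so 'U' and 'u' are the same token
tokens = [word for word in message.lower().split() if word not in stop_in_message]
raw_counts = Counter(tokens)
print(raw_counts)

# idf-adjusted weights copied from the printed output above
tfidf_weights = [0.319491587697, 0.559700094048, 0.464527625007,
                 0.335574365269, 0.309125465389, 0.287037034059,
                 0.279850047024]

# With norm='l2' the weight vector is scaled to unit Euclidean length
l2_norm = math.sqrt(sum(weight ** 2 for weight in tfidf_weights))
print(round(l2_norm, 6))
```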

Next, let's check the effects of word stemming by setting the 'tokenizer' value of the TfidfVectorizer object. This time the message at index 4 is transformed.

In [124]:
TFIDF_transformer=TfidfVectorizer(stop_words=stop,tokenizer=tokenizer).fit(messages['message'])
TFIDF_transformer_porter=TfidfVectorizer(stop_words=stop,tokenizer=tokenizer_porter).fit(messages['message'])

In [125]:
TFIDF_message4=TFIDF_transformer.transform([messages.loc[4,'message']])
TFIDF_message4_porter=TFIDF_transformer_porter.transform([messages.loc[4,'message']])

print("TFIDF Frequency W/O porter: ")
print(TFIDF_message4)
print("\n")
print("TFIDF Frequency With porter stemming: ")
print(TFIDF_message4_porter)


TFIDF Frequency W/O porter: 
  (0, 8902)	0.400223752528
  (0, 8461)	0.354773054283
  (0, 8442)	0.266683165396
  (0, 5780)	0.405100532009
  (0, 5134)	0.449291680455
  (0, 3852)	0.356888309363
  (0, 2948)	0.228484387562
  (0, 1309)	0.311918709413


TFIDF Frequency With porter stemming: 
  (0, 7628)	0.423556459568
  (0, 7251)	0.37545602397
  (0, 7235)	0.269177568398
  (0, 5023)	0.428717551178
  (0, 4479)	0.336580477749
  (0, 3374)	0.377694596636
  (0, 2612)	0.241804834549
  (0, 1248)	0.330103306957


Now, let's see some of the words in the vocabulary lists of the two TfidfVectorizer objects with and without word stemming:

In [128]:
print(TFIDF_transformer.get_feature_names()[8902])
print(TFIDF_transformer.get_feature_names()[8461])
print(TFIDF_transformer.get_feature_names()[8442])
print(TFIDF_transformer.get_feature_names()[5780])
print(TFIDF_transformer.get_feature_names()[5134])

usf
though
think
nah
lives


In [129]:
print(TFIDF_transformer_porter.get_feature_names()[7628])
print(TFIDF_transformer_porter.get_feature_names()[7251])
print(TFIDF_transformer_porter.get_feature_names()[7235])
print(TFIDF_transformer_porter.get_feature_names()[5023])
print(TFIDF_transformer_porter.get_feature_names()[4479])

usf
though
think
nah
live


Comparing the words in the two vocabulary lists, we can see the effect of word stemming: for example, 'lives' was converted to 'live'. Now, let's compare the vocabulary sizes of the two transformations:

In [130]:
TFIDF_message4_porter.shape

(1, 8333)

In [81]:
TFIDF_message4.shape

(1, 9708)

The transformation with Porter stemming has a smaller vocabulary than the transformation without stemming (8333 versus 9708 terms, about 14% fewer). This is reasonable, since words that were previously counted as different can share a stem and are then treated as the same word.

## Model Training and Evaluation

The dataset was split into training and test sets (70% and 30%). Three classification models were used to predict the message classes: multinomial naive Bayes, logistic regression and a support vector machine.

In [135]:
from sklearn.model_selection import train_test_split
Feature_X=messages['message']
target_y=messages['label'].apply(lambda x: 1 if x=='spam' else 0)
X_train,X_test,y_train,y_test=train_test_split(Feature_X,target_y,test_size=0.3)

In [136]:
X_train.head()

1029       Lol you forgot it eh  Yes Ill bring it in babe
5407    Yup he msg me is tat yijue Then i tot its my g...
3252                            I‘ll leave around four ok
1588    Dont search love let love find U Thats why its...
4921       G says you never answer your texts confirmdeny
Name: message, dtype: object

In [137]:
X_train.shape

(3900L,)

In [138]:
y_train.head()

1029    0
5407    0
3252    0
1588    0
4921    0
Name: label, dtype: int64
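The sizes of the two sets follow directly from test_size=0.3. The sketch below reproduces the arithmetic and shows a hypothetical pure-Python equivalent of a shuffled split on row indices; the fixed seed is an assumption added for reproducibility and is not part of the original cell.

```python
import math
import random

n_samples = 4825 + 747   # ham + spam messages reported by describe()
test_size = 0.3

# The test share is rounded up and the remainder goes to training
n_test = int(math.ceil(test_size * n_samples))
n_train = n_samples - n_test
print("train: %d, test: %d" % (n_train, n_test))

# Hypothetical pure-Python version of a shuffled split on row indices
rng = random.Random(42)          # fixed seed so the split is reproducible
indices = list(range(n_samples))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

The result matches the 3900 training rows shown above and the support of 1672 in the classification reports. In scikit-learn, passing random_state to train_test_split serves the same purpose as the fixed seed here.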

Now, I load the naive Bayes, logistic regression and support vector machine models, together with GridSearchCV, Pipeline, classification_report and accuracy_score, for model optimization and evaluation.

In [142]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report,accuracy_score

### Naive Bayes Classifier
In this part of the project, I used a multinomial naive Bayes classifier with GridSearchCV to optimize the text pre-processing steps. grid_param contains two dictionaries: one uses the idf adjustment (the default settings), and the other uses raw word counts by setting use_idf=False, norm=None and smooth_idf=False. Within each dictionary, GridSearchCV compares the model performance with and without stop word removal and with and without word stemming, and selects the best combination based on the default cross-validation (3-fold in the scikit-learn version used here).

In [90]:
grid_param=[{'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter]},
            {'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter],
             'vect__use_idf':[False],
             'vect__norm':[None],
             'vect__smooth_idf':[False]
            
            }
           ]
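To get a feel for how many candidate settings such a list of dictionaries describes, the grid can be expanded by hand. This is a simplified sketch: expand_grid is a hypothetical helper written for this example, and placeholder strings stand in for the real stop list and tokenizer functions.

```python
from itertools import product

def expand_grid(param_grid):
    """Expand a list of parameter dictionaries into individual candidates."""
    candidates = []
    for grid in param_grid:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

# Placeholder strings stand in for the real stop list and tokenizer functions
toy_grid = [
    {'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter']},
    {'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'vect__smooth_idf': [False]},
]

candidates = expand_grid(toy_grid)
n_folds = 3  # default in the scikit-learn version used here; newer releases use 5
print(len(candidates))
print(len(candidates) * n_folds)
```

Each dictionary contributes 2 x 2 = 4 combinations, so the search evaluates 8 candidates and, with 3 folds, fits 24 models before refitting the best one on the full training set.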




In [91]:
nb_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',MultinomialNB())])
gs=GridSearchCV(nb_tfidf,param_grid=grid_param)
gs.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'..., u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn'], None]}],
       pre_dispatch='2*n_jobs', refit=True, return_

The multinomial naive Bayes model was evaluated using the classification report. The best model reaches 0.95 precision and 0.90 recall on the spam class, with a weighted average F1 score of 0.98.

In [92]:
print(classification_report(y_test, gs.predict(X_test)))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1445
          1       0.95      0.90      0.93       227

avg / total       0.98      0.98      0.98      1672
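For readers unfamiliar with the columns of this report, the sketch below computes precision, recall and F1 for one class in plain Python and reproduces the support-weighted 'avg / total' row from the per-class values above. The label lists are hypothetical and the helper names are made up for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration (1 = spam, 0 = ham)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print((precision, recall, f1))

def weighted_average(values, supports):
    """Average of per-class values, weighted by the class supports."""
    return sum(v * s for v, s in zip(values, supports)) / float(sum(supports))

# Per-class values and supports taken from the report above
avg_precision = weighted_average([0.98, 0.95], [1445, 227])
avg_recall = weighted_average([0.99, 0.90], [1445, 227])
print((round(avg_precision, 2), round(avg_recall, 2)))
```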



### Logistic Regression
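A logistic regression model was applied to the dataset using the same GridSearchCV parameter grid and the same kind of pipeline as the multinomial naive Bayes model in the last section, and the best model was evaluated with a classification report. Note that the recorded output below was produced by a run in which nb_tfidf rather than lr_tfidf was passed to GridSearchCV: the estimator shown is MultinomialNB and the report is identical to the naive Bayes report. The cell below passes lr_tfidf, so it must be re-run to obtain logistic regression results.

In [93]:
lr_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',LogisticRegressionCV())])
# pass the logistic regression pipeline (the recorded run passed nb_tfidf by mistake)
gs_lr=GridSearchCV(lr_tfidf,param_grid=grid_param)
gs_lr.fit(X_train,y_train)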

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'..., u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn'], None]}],
       pre_dispatch='2*n_jobs', refit=True, return_

In [94]:
print(classification_report(y_test, gs_lr.predict(X_test)))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1445
          1       0.95      0.90      0.93       227

avg / total       0.98      0.98      0.98      1672



### Support Vector Machine
Finally, I evaluated the performance of an SVM. Since the models evaluated so far showed very good classification performance, a linear kernel should be enough. Therefore, I used a linear kernel and optimized the C parameter. The SVM performed better than the results reported in the previous two sections: on the spam class, precision rose from 0.95 to 0.99 and recall from 0.90 to 0.92.

In [95]:
grid_param_svc=[{
                'vect__stop_words':[stop,None],
                'vect__tokenizer':[tokenizer,tokenizer_porter],
                'clf__C' : [0.01,0.1,1,10.0]                
                },
            {'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter],
             'vect__use_idf':[False],
             'vect__norm':[None],
             'vect__smooth_idf':[False],
             'clf__C':[0.01,0.1,1,10.0] 
            }
           ]
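The role of C can be illustrated with the standard soft-margin objective for a linear SVM, which adds C times the total hinge loss to a margin term. The sketch below evaluates that objective in plain Python for a fixed weight vector; the data points, weights and function name are hypothetical and chosen only to keep the arithmetic simple.

```python
def linear_svm_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge

# Hypothetical 2-D points and weights, chosen only to make the arithmetic easy
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
w, b = [1.0, 0.0], 0.0

for C in [0.01, 0.1, 1, 10.0]:   # same C values as in grid_param_svc
    print("C=%s objective=%s" % (C, linear_svm_objective(w, b, X, y, C)))
```

Only the second point lies inside the margin, so the total hinge loss is 0.5 and the objective is 0.5 + 0.5 * C. A larger C penalizes margin violations more strongly, while a smaller C favors a wider margin.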

In [96]:
svc_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',SVC(kernel="linear"))])
gs_svc=GridSearchCV(svc_tfidf,param_grid=grid_param_svc)
gs_svc.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'clf__C': [0.01, 0.1, 1, 10.0], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourse...tokenizer_porter at 0x000000001083CD68>], 'vect__use_idf': [False], 'clf__C': [0.01, 0.1, 1, 10.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_

In [97]:
print(classification_report(y_test, gs_svc.predict(X_test)))

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1445
          1       0.99      0.92      0.95       227

avg / total       0.99      0.99      0.99      1672



## Applying Pipeline to Yelp Dataset
Next, I applied the natural language processing pipeline established in this project to the [Yelp Reviews dataset from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

I selected only the reviews with 1 or 5 stars, which turns the task into a binary classification problem: the models classify each review as either 1 star or 5 stars. In addition, only the text content (the 'text' column) was used for model training and prediction; all other columns in the dataset were dropped. The same text_process() procedure used for the **SMS Spam** dataset was applied to this dataset. For model training and evaluation, I used GridSearchCV with the same param_grid and pipelines as for the **SMS Spam** dataset. The same three models, multinomial naive Bayes, logistic regression and a support vector machine with a linear kernel, were used and evaluated.

In [98]:
yelp=pd.read_csv("yelp.csv")

In [99]:
yelp.head()

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [100]:
yelp_df=yelp.loc[(yelp['stars']==1)|(yelp['stars']==5),:]
X_yelp=yelp_df['text'].apply(text_process)
y_yelp=yelp_df['stars']
X_train_yelp,X_test_yelp,y_train_yelp,y_test_yelp=train_test_split(X_yelp,y_yelp,test_size=0.3)

In [101]:
grid_param=[{'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter]},
            {'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter],
             'vect__use_idf':[False],
             'vect__norm':[None],
             'vect__smooth_idf':[False]
            
            }
           ]


### Multinomial Naive Bayes

In [102]:
nb_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',MultinomialNB())])
gs_yelp=GridSearchCV(nb_tfidf,param_grid=grid_param)
gs_yelp.fit(X_train_yelp,y_train_yelp)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'..., u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn'], None]}],
       pre_dispatch='2*n_jobs', refit=True, return_

In [103]:
print(classification_report(y_test_yelp,gs_yelp.predict(X_test_yelp)))

             precision    recall  f1-score   support

          1       0.88      0.67      0.76       224
          5       0.93      0.98      0.95      1002

avg / total       0.92      0.92      0.92      1226



### Logistic Regression 

In [104]:
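lr_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',LogisticRegressionCV())])
# pass the logistic regression pipeline; the recorded output below came from a run
# that passed nb_tfidf, so it repeats the naive Bayes results until the cell is re-run
gs_lr_yelp=GridSearchCV(lr_tfidf,param_grid=grid_param)
gs_lr_yelp.fit(X_train_yelp,y_train_yelp)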

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'..., u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn'], None]}],
       pre_dispatch='2*n_jobs', refit=True, return_

In [105]:
print(classification_report(y_test_yelp,gs_lr_yelp.predict(X_test_yelp)))

             precision    recall  f1-score   support

          1       0.88      0.67      0.76       224
          5       0.93      0.98      0.95      1002

avg / total       0.92      0.92      0.92      1226



### Support Vector Machine with Linear Kernel

In [111]:
grid_param_svc=[{
                'vect__stop_words':[stop,None],
                'vect__tokenizer':[tokenizer,tokenizer_porter],
                'clf__C' : [0.01,0.1,1,10.0]                
                },
            {'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer,tokenizer_porter],
             'vect__use_idf':[False],
             'vect__norm':[None],
             'vect__smooth_idf':[False],
             'clf__C':[0.01,0.1,1,10.0]             
            }
           ]

In [112]:
svc_tfidf=Pipeline([('vect',TfidfVectorizer(preprocessor=None,analyzer='word')),('clf',SVC(kernel="linear"))])
gs_svc_yelp=GridSearchCV(svc_tfidf,param_grid=grid_param_svc)
gs_svc_yelp.fit(X_train_yelp,y_train_yelp)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__tokenizer': [<function tokenizer at 0x00000000131D1048>, <function tokenizer_porter at 0x000000001083CD68>], 'clf__C': [0.01, 0.1, 1, 10.0], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourse...tokenizer_porter at 0x000000001083CD68>], 'vect__use_idf': [False], 'clf__C': [0.01, 0.1, 1, 10.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_

In [113]:
print(classification_report(y_test_yelp,gs_svc_yelp.predict(X_test_yelp)))

             precision    recall  f1-score   support

          1       0.92      0.74      0.82       224
          5       0.94      0.99      0.96      1002

avg / total       0.94      0.94      0.94      1226



## Conclusion
A pipeline covering text processing, numeric feature extraction and model optimization with GridSearchCV was established for natural language processing. The pipeline was applied to the [UCI SMS Spam Collection data set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) and the [Yelp Reviews dataset from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013). Three classification models, multinomial naive Bayes, logistic regression and a support vector machine, were used to predict the classes of the messages or reviews. In the recorded results, the support vector machine showed the best precision and recall on both the **SMS Spam** and **Yelp review** datasets.