# Natural language processing and modelling using scikit-learn pipelines

In this notebook, I take a data set of SMS messages labelled as ham or spam and build a model that predicts spam from the content of the message. The notebook contains the following parts:

**Part 0: Loading data**

**Part 1: Defining methods for preprocessing texts**

**Part 2: Splitting data and applying processing methods on them**

**Part 3: naive bayes method** 

**Part 4: k-fold cross-validation**

**Part 5: Random Forest, Gradient Boosting and modelling by GridSearch**

**Part 6: Pipeline**

**Part 7: Feature union in pipeline**


In [4]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = nltk.corpus.stopwords.words("english")

# Part 0. Loading data

In [5]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)  # import the tab-delimited data set
data.columns = ['label', 'sms']  # assign appropriate column names

In [6]:
data.head(5)

| | label | sms |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


## Part 1: Defining methods for preprocessing texts

In [25]:

def tokenize(text):
    '''Turns a raw message into a list of cleaned tokens:
    1. normalizes all words to lower case
    2. removes punctuation
    3. splits the text into words
    4. removes stop words such as am, is, have, you, ...
    5. lemmatizes the words, for example homeworks --> homework
       (the lemmatizer treats every word as a noun by default)
    '''
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())    # normalize case and remove punctuation
    tokens = word_tokenize(text)    # tokenize text
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]    # lemmatize and remove stop words
    return tokens

def prep_data(text, method=CountVectorizer):
    '''
    Fits a vectorizer on the given texts. It either counts the
    words in each message (CountVectorizer) or weights them by
    their importance in the message and in the whole data set
    (TfidfVectorizer). Returns the feature data frame, its
    values as an array, and the fitted vectorizer.
    '''
    count_vector = method(tokenizer=tokenize)
    count_vector.fit(text)
    doc_array = count_vector.transform(text).toarray()
    # on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
    frequency_matrix_count = pd.DataFrame(doc_array, columns=count_vector.get_feature_names())
    return frequency_matrix_count, frequency_matrix_count.values, count_vector

def vectorize(text, vectorizer):
    '''
    Applies the vectorizer returned by prep_data to another data set.
    This matters because the vectorizer built on the training data
    must also be applied to the test data. If prep_data were run
    twice, once on the training data and once on the test data, the
    two vocabularies would differ, so the columns of the two
    vectorized data frames would not match, and a model trained on
    the training data could not be applied to the test data.
    '''
    doc_array = vectorizer.transform(text).toarray()
    # on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
    frequency_matrix_count = pd.DataFrame(doc_array, columns=vectorizer.get_feature_names())
    return frequency_matrix_count, frequency_matrix_count.values
    
def display_results(y_test, y_pred):
    '''
    Prints the labels, the confusion matrix and the accuracy.
    '''
    labels = np.unique(y_test)    # take the labels from the true values so that no class is dropped
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)

#### example tokenize()

In [11]:
documents = ['Have you seen this book?',
             'I need to do my homeworks ',
            'consider it done!',
            'this book is amazing']
[tokenize(x) for x in documents]

[['seen', 'book'],
 ['need', 'homework'],
 ['consider', 'done'],
 ['book', 'amazing']]

#### example prep_data()

In [12]:
df, vectorized, vectorizer = prep_data(documents)  # uses CountVectorizer
df

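| | amazing | book | consider | done | homework | need | seen |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |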


In [13]:
df, vectorized, vectorizer = prep_data(documents, method=TfidfVectorizer)  # uses TfidfVectorizer
df

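| | amazing | book | consider | done | homework | need | seen |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.61913 | 0.0 | 0.0 | 0.0 | 0.0 | 0.785288 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.707107 | 0.0 |
| 2 | 0.0 | 0.0 | 0.707107 | 0.707107 | 0.0 | 0.0 | 0.0 |
| 3 | 0.785288 | 0.61913 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |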
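
The TF-IDF weights in row 0 can be reproduced by hand. The sketch below assumes the default settings of scikit-learn's `TfidfVectorizer`, namely a smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The variable names are only illustrative.

```python
import numpy as np

n_docs = 4                 # number of example documents
df_book, df_seen = 2, 1    # 'book' occurs in 2 documents, 'seen' in 1

# Smoothed inverse document frequency (assumed default, smooth_idf=True)
idf_book = np.log((1 + n_docs) / (1 + df_book)) + 1
idf_seen = np.log((1 + n_docs) / (1 + df_seen)) + 1

# Document 0 contains each word once, so tf = 1 and the raw weights equal the idf values
row = np.array([idf_book, idf_seen])

# L2-normalize the row so that its Euclidean length is 1
weights = row / np.linalg.norm(row)
print(weights.round(6))
```

The two values agree with the 0.61913 (`book`) and 0.785288 (`seen`) reported for document 0. `book` gets the lower weight because it occurs in two documents rather than one.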


## Part 2: Splitting data and applying processing methods on them

In [15]:
X_train, X_test, y_train, y_test = train_test_split(data['sms'], data['label'], random_state=444)
print('rows in the original data set: {}'.format(data.shape[0]))
print('rows in the training set: {}'.format(X_train.shape[0]))
print('rows in the test set: {}'.format(X_test.shape[0]))

rows in the original data set: 5568
rows in the training set: 4176
rows in the test set: 1392


#### applying methods on training and testing data
We are going to test most of the models with the two vectorizers defined above, so that the results can be compared.

In [17]:
# 1. Count vectorizer method
df, training_data_count, vectorizer = prep_data(X_train)  # fit the vectorizer on the training data and keep it
df, testing_data_count = vectorize(X_test, vectorizer)  # apply the fitted vectorizer to the test data
# 2. Tfidf vectorizer method
df, training_data_tfidf, vectorizer = prep_data(X_train, method=TfidfVectorizer)  # fit the vectorizer on the training data and keep it
df, testing_data_tfidf = vectorize(X_test, vectorizer)  # apply the fitted vectorizer to the test data
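
To illustrate why the vectorizer fitted on the training data is reused for the test data, here is a minimal sketch on toy messages. The messages and the `toy_` names are made up for the illustration, and the default `CountVectorizer` tokenizer is used instead of `tokenize` to keep the block self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["free prize now", "see you at lunch"]   # toy training messages
toy_test = ["free lunch tomorrow"]                    # toy test message

toy_vectorizer = CountVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)   # learns the vocabulary from the training data only
toy_test_matrix = toy_vectorizer.transform(toy_test)         # reuses that vocabulary, nothing is re-learned

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)          # terms in column order
print(toy_vocabulary)
print(toy_test_matrix.toarray())
```

Both matrices have the same columns, so a model trained on one can score the other. The word `tomorrow` never appeared in the training data and is simply ignored at test time.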

## Part 3: *naive bayes* method

In [19]:
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data_count, y_train)
predictions = naive_bayes.predict(testing_data_count)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('countvectorizer: Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

naive_bayes.fit(training_data_tfidf, y_train)
predictions = naive_bayes.predict(testing_data_tfidf)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('TfidfVectorizer: Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))



countvectorizer: Precision: 0.936 / Recall: 0.92 / Accuracy: 0.982
TfidfVectorizer: Precision: 0.992 / Recall: 0.722 / Accuracy: 0.964


## Part 4: k-fold cross-validation

As shown below, we use *n_splits=5*, which means that the training data is split into 5 parts: 4 parts are used for training and 1 for validation, and the score on the validation part is reported. This is repeated 5 times, with a different part held out each time, so we end up with 5 scores that give a good idea of how well the model (here a *Random Forest*) performs.
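
To make the splitting concrete, here is a small sketch that runs `KFold` on ten stand-in rows instead of the real training data. The names `samples`, `k_fold_demo` and `fold_sizes` are only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)            # stand-in for 10 training rows
k_fold_demo = KFold(n_splits=5)    # no shuffling by default

fold_sizes = []
held_out = []
for train_idx, val_idx in k_fold_demo.split(samples):
    fold_sizes.append((len(train_idx), len(val_idx)))
    held_out.extend(val_idx.tolist())
    print("train:", train_idx, "validation:", val_idx)
```

Each of the five rounds trains on 8 rows and validates on the remaining 2, and every row is held out exactly once.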

#### countvectorized data

In [110]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, training_data_count, y_train, cv=k_fold, scoring='accuracy', n_jobs=-1)

array([ 0.97248804,  0.97964072,  0.96646707,  0.97245509,  0.95688623])

#### tfidfvectorized data

In [111]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, training_data_tfidf, y_train, cv=k_fold, scoring='accuracy', n_jobs=-1)

array([ 0.97607656,  0.98203593,  0.96766467,  0.96287425,  0.95928144])

## Part 5: Random Forest, Gradient Boosting and modelling by GridSearch

### - Random Forest; Count vectorized

In [114]:
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}
clf = GridSearchCV(rf, param, cv=5, n_jobs=-1)
clf_fit = clf.fit(training_data_count, y_train)
pd.DataFrame(clf_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]  # metrics of the 5 best parameter combinations in the grid search

| | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | ... | split4_train_score | mean_train_score | std_train_score | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.97488 | 0.985629 | 0.973653 | 0.973653 | 0.97006 | 0.975575 | 0.005279 | 1 | 0.998802 | 0.999102 | ... | 0.999102 | 0.999162 | 0.000224 | 59.540338 | 1.570981 | 0.414299 | 0.048288 | 90.0 | 150 | {'max_depth': 90, 'n_estimators': 150} |
| 11 | 0.980861 | 0.983234 | 0.971257 | 0.97485 | 0.967665 | 0.975575 | 0.005802 | 1 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 0.0 | 80.553275 | 14.606205 | 0.274529 | 0.06393 | | 300 | {'max_depth': None, 'n_estimators': 300} |
| 10 | 0.976077 | 0.984431 | 0.97006 | 0.973653 | 0.97006 | 0.974856 | 0.005303 | 3 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 0.0 | 61.550068 | 1.285677 | 0.416906 | 0.048129 | | 150 | {'max_depth': None, 'n_estimators': 150} |
| 3 | 0.978469 | 0.980838 | 0.965269 | 0.973653 | 0.973653 | 0.974377 | 0.005338 | 4 | 0.994012 | 0.994014 | ... | 0.99581 | 0.994672 | 0.000935 | 4.166853 | 0.064391 | 0.080414 | 0.006322 | 60.0 | 10 | {'max_depth': 60, 'n_estimators': 10} |
| 5 | 0.976077 | 0.982036 | 0.972455 | 0.973653 | 0.967665 | 0.974377 | 0.004707 | 4 | 0.99521 | 0.994612 | ... | 0.996708 | 0.99569 | 0.000724 | 90.971913 | 1.352475 | 0.601395 | 0.047223 | 60.0 | 300 | {'max_depth': 60, 'n_estimators': 300} |
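
As a rough guide to the cost of this search, the sketch below counts the parameter combinations in the Random Forest grid with `ParameterGrid` and multiplies by the number of folds. `param_demo`, `n_folds` and `total_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Random Forest search
param_demo = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

combinations = list(ParameterGrid(param_demo))   # every (n_estimators, max_depth) pair
n_folds = 5                                      # cv=5 in the grid search
total_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", total_fits, "cross-validation fits")
```

With 12 combinations and 5 folds, the search trains 60 models, plus one final refit of the best combination on the whole training set when `refit=True` (the default). That is why the larger forests dominate the running time.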


In [115]:
print(clf_fit.best_estimator_) #best estimator parameters

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=90, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=150, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


In [116]:
predictions = clf_fit.best_estimator_.predict(testing_data_count)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

Precision: 1.0 / Recall: 0.847 / Accuracy: 0.981


### - Random Forest; Tfidf vectorized

In [117]:
''' Random Forest; tfidf'''
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}
clf = GridSearchCV(rf, param, cv=5, n_jobs=-1)
clf_fit = clf.fit(training_data_tfidf, y_train)
pd.DataFrame(clf_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

print(clf_fit.best_estimator_)
predictions = clf_fit.best_estimator_.predict(testing_data_tfidf)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=90, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Precision: 1.0 / Recall: 0.835 / Accuracy: 0.979


### - Gradient Boosting; Count vectorized

In [118]:
''' Gradient Boosting; Count'''
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}
clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
clf_fit = clf.fit(training_data_count, y_train)
pd.DataFrame(clf_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

print(clf_fit.best_estimator_)
predictions = clf_fit.best_estimator_.predict(testing_data_count)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=7,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=150, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
Precision: 0.944 / Recall: 0.858 / Accuracy: 0.976


### - Gradient Boosting; Tfidf vectorized

In [119]:
''' Gradient Boosting; tfidf'''
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}
clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
clf_fit = clf.fit(training_data_tfidf, y_train)
pd.DataFrame(clf_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

print(clf_fit.best_estimator_)
predictions = clf_fit.best_estimator_.predict(testing_data_tfidf)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=7,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=150, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
Precision: 0.948 / Recall: 0.83 / Accuracy: 0.973


## Part 6: Pipeline

A simple pipeline that applies count vectorizing, then a TF-IDF transformation on its output, and finally classification with a *Random Forest*. The first two steps of the pipeline give the same result as *prep_data(X_train, method=TfidfVectorizer)*.
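
The equivalence between `CountVectorizer` followed by `TfidfTransformer` and a single `TfidfVectorizer` can be checked on a few toy sentences. This is a minimal sketch with default settings and without the custom `tokenize` function. The sentences and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = ["have you seen this book", "this book is amazing", "consider it done"]

# Two steps: raw counts first, then TF-IDF weighting of those counts
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# One step: TfidfVectorizer does both at once
one_step = TfidfVectorizer()

two_step_matrix = two_step.fit_transform(docs).toarray()
one_step_matrix = one_step.fit_transform(docs).toarray()

same = np.allclose(two_step_matrix, one_step_matrix)
print(same)
```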

In [173]:
def model_pipeline(): 
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])
    return pipeline

In [174]:
model = model_pipeline()
model.fit(X_train, y_train);
y_pred = model.predict(X_test)
display_results(y_test, y_pred)

Labels: ['ham' 'spam']
Confusion Matrix:
 [[1216    0]
 [  37  139]]
Accuracy: 0.97341954023


In [175]:
predictions = model.predict(X_test)
precision, recall, fscore, support = score(y_test, predictions, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((predictions==y_test).sum() / len(predictions),3)))

Precision: 1.0 / Recall: 0.79 / Accuracy: 0.973
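
The precision, recall and accuracy printed here follow directly from the confusion matrix of the same model. The sketch below recomputes them from the four counts, assuming the scikit-learn layout (rows are true labels, columns are predicted labels, in the order `['ham', 'spam']`). The `_cm` names are illustrative.

```python
import numpy as np

# Confusion matrix reported for the pipeline model
cm = np.array([[1216, 0],
               [37, 139]])

tn, fp = cm[0]   # true ham: predicted ham, predicted spam
fn, tp = cm[1]   # true spam: predicted ham, predicted spam

precision_cm = tp / (tp + fp)        # share of predicted spam that really is spam
recall_cm = tp / (tp + fn)           # share of real spam that was caught
accuracy_cm = (tp + tn) / cm.sum()   # share of all messages classified correctly
print(round(precision_cm, 3), round(recall_cm, 3), round(accuracy_cm, 3))
```

No ham message was flagged as spam, which gives a precision of 1.0, while the 37 missed spam messages account for the lower recall.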


## Part 7: Feature union
Using this functionality, we can run several feature-extraction steps in parallel and concatenate their outputs. As an example, we want to bring the length of the texts into our calculations.

In [22]:
''' we need to wrap the appropriate method in a class based on BaseEstimator and TransformerMixin, as below '''
class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def text_length(self, text):
        return len(text) - text.count(" ")

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.text_length)
        return pd.DataFrame(X_tagged)

In [23]:
def model_pipeline2():
    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('text_length', TextLengthExtractor())
        ])),

        ('clf', RandomForestClassifier())
    ])

    return pipeline
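
To see what a `FeatureUnion` produces, the sketch below combines word counts with one extra length column on two toy messages. `CharCountExtractor` is a hypothetical stand-in for `TextLengthExtractor`, and the default `CountVectorizer` tokenizer is used to keep the block self-contained.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class CharCountExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding the number of non-space characters."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text) - text.count(" ")] for text in X])

toy_docs = ["win a free prize", "see you soon"]

union = FeatureUnion([
    ('counts', CountVectorizer()),      # one column per word in the vocabulary
    ('length', CharCountExtractor()),   # one extra column with the text length
])
features = union.fit_transform(toy_docs)
print(features.shape)
print(features.toarray()[:, -1])
```

The result has one column per vocabulary word plus a final length column, and the classifier at the end of the pipeline receives this concatenated matrix.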

In [26]:
model = model_pipeline2()
model.fit(X_train, y_train);
y_pred = model.predict(X_test)
display_results(y_test, y_pred)

Labels: ['ham' 'spam']
Confusion Matrix:
 [[1216    0]
 [  38  138]]
Accuracy: 0.972701149425


In [27]:
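# recompute the metrics from this model's predictions (y_pred)
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((y_pred==y_test).sum() / len(y_pred),3)))

Precision: 1.0 / Recall: 0.784 / Accuracy: 0.973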


### We can simply add Grid Search to our pipeline model

In [None]:
pipeline = model_pipeline2()
param = {'clf__n_estimators': [10, 150, 300],
         'clf__max_depth': [30, 60, 90, None]}
model = GridSearchCV(pipeline, param, cv=5, n_jobs=-1)
model.fit(X_train, y_train);
y_pred = model.predict(X_test)
display_results(y_test, y_pred)
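
The `clf__` prefix in the grid follows scikit-learn's `<step name>__<parameter name>` convention for reaching the parameters of a step inside a pipeline. The sketch below uses a small stand-in pipeline (`demo_pipeline` is an illustrative name) to show how those names are exposed and set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
param_names = demo_pipeline.get_params().keys()
print('clf__n_estimators' in param_names, 'vect__ngram_range' in param_names)

# set_params accepts the same names; a grid search sets each grid point this way
demo_pipeline.set_params(clf__n_estimators=10, clf__max_depth=30)
print(demo_pipeline.named_steps['clf'].n_estimators)
```

The same convention lets vectorizer settings such as `vect__ngram_range` be tuned in the same grid as the classifier parameters.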