# Scikit-learn - Unit 09 - NLP (Natural Language Processing)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and create an ML pipeline for NLP (Natural Language Processing)


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - NLP (Natural Language Processing)

* We continue from the previous notebook, where we found the algorithm best suited to the data (SGDClassifier). Now we run an extensive hyperparameter optimization to find the pipeline with the best hyperparameter combination.
* Once we find the best pipeline, we will evaluate it and predict on real-time data.
* We need to reload the data, the custom class for hyperparameter optimization and the pipeline.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load data

We will use a dataset of SMS messages, each labelled as spam or not spam (`spam` or `ham`). We load the data from GitHub.
* In this project we are interested in **predicting whether a given message is spam or not**; therefore the ML task is Classification.

url = 'https://raw.githubusercontent.com/ShresthaSudip/SMS_Spam_Detection_DNN_LSTM_BiLSTM/master/SMSSpamCollection'
df = (pd.read_csv(url, sep ='\t',names=["label", "message"])
    .sample(frac=0.6, random_state=0)
    .reset_index(drop=True)
    )
df = df.sample(frac=0.5, random_state=101)
print(df.shape)
df.head()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Split data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we split the data into train and test sets.
* In this case, there are 2 columns in the dataset: `message` contains the text, and `label` tells whether the SMS message is spam or not.
* In the end, we have one Pandas Series for the features (`message`) and another for the target (`label`). Note the brackets subsetting the data, for example: `df['message']`

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.2, random_state=101)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Create the pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png"> We will consider the classic steps in an NLP pipeline, where we first clean the text and then extract the features for the model.
* The pipeline steps will be slightly different from what we have been studying in Classification (Data Cleaning, Feature Engineering, Feature Scaling, Feature Selection and Model), but the purpose is the same: prepare the data for the model.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png"> Overall, here we will consider steps for **(1) cleaning the textual data and (2) representing the text as numbers, also called feature extraction.**
* (1) For text cleaning, we will make the text lowercase and remove punctuation.
    * The practical cleaning tasks differ from dataset to dataset. For example, you may have a dataset with HTML tags, so you need a function to strip them, or you may need to remove diacritics (marks located above or below a letter to reflect a particular pronunciation, as in *resumé*).

* (2) There are also multiple techniques for feature extraction. We will use the ones we covered in Module 2: we **will tokenize the text and then use TF-IDF (Term Frequency-Inverse Document Frequency)**.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use texthero to clean the textual data **by changing the text to lowercase and removing punctuation**.
* If you want to refresh these concepts, you may return to Module 2, where we covered the NLP task.
* We need to create a custom Python class so that the cleaning step can be passed into the pipeline afterwards. We use the same approach for creating custom transformers we saw in the feature-engine lesson: we inherit from BaseEstimator and TransformerMixin and define `fit` and `transform` methods, so the custom transformer can be added correctly to the ML pipeline.

from sklearn.pipeline import Pipeline
import texthero as hero

from sklearn.base import BaseEstimator, TransformerMixin
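class text_cleaning(BaseEstimator, TransformerMixin):
  """Custom transformer: lowercases the text and removes punctuation."""

  def __init__(self):
    # there are no hyperparameters to store
    pass

  def fit(self, X, y=None):
    # nothing is learned from the data, so fit only returns the object
    return self

  def transform(self, X):
    # X is expected to be a Pandas Series of text
    X = hero.preprocessing.lowercase(X)
    X = hero.remove_punctuation(X)
    return X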
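
If texthero is not available in your environment, the same cleaning idea can be sketched with pandas string methods alone. This is only an illustration: the class name `SimpleTextCleaner` is hypothetical, and replacing punctuation with a space and then collapsing repeated whitespace are assumptions about the desired behaviour, so the output may differ in detail from texthero's.

```python
import string

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# translation table mapping every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


class SimpleTextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the texthero-based cleaner."""

    def fit(self, X, y=None):
        # nothing to learn, the transformer is stateless
        return self

    def transform(self, X):
        X = pd.Series(X).astype(str).str.lower()   # lowercase
        X = X.str.translate(PUNCT_TO_SPACE)        # punctuation -> space
        return X.str.split().str.join(" ")         # collapse whitespace


sample = pd.Series(["Hello, World!", "FREE prize!!!"])
cleaned = SimpleTextCleaner().fit_transform(sample)
print(cleaned.tolist())
```

Because the class follows the same `fit`/`transform` contract, it could take the place of the first pipeline step without changing the rest of the pipeline.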


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For feature extraction we use **CountVectorizer** and **TfidfTransformer**; you can find their documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* We need to convert the textual data to a numeric format that the algorithms can learn relationships from, also known as vectors.
  * CountVectorizer: According to its documentation, it converts a collection of text documents to a matrix of token counts. It stores the number of times every word is used in our text data. We are also removing English "stop words".
  * TfidfTransformer (Term Frequency-Inverse Document Frequency Transformer): According to its documentation, it transforms a count matrix to a normalized tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are empirically less informative than features that occur in a small fraction of the data. In addition, this highlights the words that are most distinctive of a document, which makes them better for characterizing it.
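
To make the two steps concrete, here is a minimal sketch on a made-up three-message corpus (the messages are invented for illustration, and stop words are not removed here so that the effect of tf-idf is visible). The word "now" appears in every message, so tf-idf gives it less weight than a rarer word such as "free".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy corpus, invented for this example
corpus = ["free prize call now", "call me now", "lunch now"]

# step 1: token counts, one row per message and one column per word
vect = CountVectorizer()
counts = vect.fit_transform(corpus)
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)  # words in column order
print(vocab)
print(counts.toarray())

# step 2: rescale the counts with tf-idf
tfidf = TfidfTransformer().fit_transform(counts).toarray()
col = vect.vocabulary_  # maps each word to its column index
print("weight of 'now' in message 0: ", tfidf[0, col["now"]])
print("weight of 'free' in message 0:", tfidf[0, col["free"]])
```

Both words occur once in the first message, yet "free" ends up with the larger weight because it occurs in fewer messages. With the default settings each row is also scaled to unit length.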


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png"> Finally, our pipeline will have 4 steps:
* Text cleaning: lowercase the text and remove punctuation
* CountVectorizer: convert text to a matrix of token counts
* TF-IDF: transform a count matrix to a normalized tf or tf-idf representation
* Model

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def PipelineOptimization(model):
  pipeline = Pipeline([
                       
        ( 'text_cleaning', text_cleaning() ),
        ( 'vect', CountVectorizer(stop_words='english') ),
        ( 'tfidf', TfidfTransformer() ),
        ( 'model', model )
    ])
  
  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We load the Python class (HyperparameterOptimizationSearch) that we studied in previous units, which fits a set of algorithms with multiple hyperparameters. A quick recap of what this class does:
* The developer defines a set of algorithms and their respective hyperparameter values.
* The code iterates over each algorithm and fits pipelines using GridSearchCV with its respective hyperparameter values. The result is stored. This is repeated for all algorithms that the user listed.
* Once all pipelines are trained, the developer can retrieve a performance summary and an object that contains all trained pipelines. The developer can then select the best pipeline.

from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
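
The per-split scores that `score_summary` collects come from the `cv_results_` attribute of a fitted GridSearchCV. As a rough sketch of where those numbers live, the example below fits a single grid search and reads `cv_results_` directly. The numeric data from `make_classification` and the one-step pipeline are assumptions made only to keep the example self-contained; they are not the SMS data used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy numeric data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("model", SGDClassifier(random_state=101))])
gs = GridSearchCV(pipe, {"model__tol": [1e-2, 1e-1]}, cv=2, scoring="accuracy")
gs.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays with one entry per hyperparameter combination
results = pd.DataFrame(gs.cv_results_)
summary = (results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(summary)
```

The columns `split0_test_score`, `split1_test_score` and so on hold the score of each fold, which is what the custom class stacks to compute the min, max, mean and standard deviation.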

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Fit multiple pipelines for best algorithms using multiple hyperparameter combination

We update our dictionaries with the algorithms and hyperparameter combinations we want to optimize.

from sklearn.linear_model import SGDClassifier

models_search = {
    "SGDClassifier":SGDClassifier(random_state=101),}


params_search = {
    "SGDClassifier": {'model__tol':[1e-2, 1e-1], },
  }

Next, we fit multiple pipelines using the algorithms we selected, considering multiple combinations of hyperparameters.

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=2)

Let's check the training results summary.
* Note that SGDClassifier performed best. Not only did the performance improve over the default hyperparameters, but SGDClassifier now also performs better than LinearSVC.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

We check for the best model programmatically

best_model = grid_search_summary.iloc[0,0]
best_model

So we can grab the best model parameters

grid_search_pipelines[best_model].best_params_

And grab the best pipeline

best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline Performance

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we evaluate the pipeline as usual with our custom function for classification tasks.

from sklearn.metrics import classification_report, confusion_matrix

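def confusion_matrix_and_report(X, y, pipeline, label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # y_true and y_pred are swapped here, so rows show predictions and
  # columns show actual values; labels=label_map fixes the class order
  # so that it matches the row and column names
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y, labels=label_map),
        columns=["Actual " + sub for sub in label_map],
        index=["Prediction " + sub for sub in label_map]
        ))
  print("\n")

  print('---  Classification Report  ---')
  # labels=label_map keeps target_names aligned with the classes
  print(classification_report(y, prediction, labels=label_map, target_names=label_map), "\n")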


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We pass the arguments we are familiar with:
* Train and Test set
* Best pipeline
* For `label_map`, we get the class names with `.unique()`


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> Note: The model learned the relationships in the train set and predicted everything correctly there. In the test set, there were a few misclassifications, but the performance still looks good, and the **model could generalize to the unseen data** (test set)

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map= df['label'].unique()
                )
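
When reading a confusion matrix, it helps to know which axis holds the actual values. In scikit-learn's `confusion_matrix`, rows correspond to `y_true` and columns to `y_pred`. The small sketch below uses made-up labels (not results from this notebook) to show the orientation, and how swapping the two arguments transposes the table.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# made-up labels for illustration: one ham message is wrongly flagged as spam
y_actual = ["ham", "ham", "ham", "spam", "spam"]
y_predicted = ["ham", "ham", "spam", "spam", "spam"]
labels = ["ham", "spam"]  # fixing the order keeps rows and columns aligned

# rows = actual values, columns = predictions
cm = confusion_matrix(y_true=y_actual, y_pred=y_predicted, labels=labels)
print(pd.DataFrame(cm,
                   index=["Actual " + name for name in labels],
                   columns=["Prediction " + name for name in labels]))

# swapping the arguments gives the transposed table (rows = predictions)
cm_swapped = confusion_matrix(y_true=y_predicted, y_pred=y_actual, labels=labels)
print(cm_swapped)
```

Passing `labels` explicitly matters when the class names come from `.unique()`, which returns them in order of appearance rather than in sorted order.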

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Predict on real time data

Pass a real-time message to check whether or not you should click on the link   :)
* Try new sentences by changing the content of the `real_time_msg` variable

########################################################################
real_time_msg = 'Congratulations, you won the auction. Please click on link below to get your prize'
########################################################################

X_live = pd.Series(data=real_time_msg, name='message')
best_pipeline.predict(X_live)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Bonus: Typical hyperparameters for algorithms listed in this notebook

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We again reinforce that it will take time and experience to learn which hyperparameters to consider when optimizing your pipeline and which values make sense to tune.
* The library documentation is your best friend: it lists the hyperparameters available for a given algorithm.
* The hyperparameters we list here are suggestions that you can use as a reference when you start fine-tuning your ML pipelines.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will write the hyperparameters for the algorithms using the same dictionary structure we saw throughout the notebook, assuming you are arranging everything into a pipeline and the last step is called `model`.
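
The `model__` prefix in these dictionaries is the pipeline step name followed by two underscores and the hyperparameter name. One way to check which names are valid is to list the keys of `get_params()` on the pipeline. The sketch below builds a reduced pipeline without the text cleaning step (an assumption made to keep the example self-contained) and prints the names that belong to the `model` step.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# reduced pipeline for illustration, the last step is called 'model'
pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("model", SGDClassifier(random_state=101)),
])

# every tunable name has the form <step name>__<hyperparameter>
all_params = pipeline.get_params()
model_params = sorted(name for name in all_params if name.startswith("model__"))
print(model_params)
```

The same convention applies to the other steps, so a key such as `vect__stop_words` could also be added to a search dictionary.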

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Support Vector Machine


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Support Vector Machine (or SVM) is an algorithm that can be used for Classification or Regression.
* The idea is to find a hyperplane that separates the data.
  * A hyperplane is a boundary that separates the data points and has N-1 dimensions, where N is the number of variables. For example, if you have 2 variables (2 dimensions), you can plot these variables in an XY plot, like a 2D scatter plot. Your hyperplane in this case is a line. If you have 3 variables (3 dimensions), you can plot these variables in an XYZ plot, like a 3D scatter plot. Your hyperplane in this case is a [plane](https://en.wikipedia.org/wiki/Plane_(geometry)) (note: it is a geometric plane, not an airplane).
  * The hyperplane should have the maximum distance (here called the margin) to the nearest data points of each class. Support vectors (hence the algorithm's name) are the data points closest to the hyperplane; they influence its position and orientation.
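
As a small illustration of these ideas, the sketch below fits a linear SVC on six made-up 2D points (two well-separated groups, invented for this example). With 2 variables the hyperplane is a line, described by `coef_` and `intercept_`, and the fitted model exposes the points that define the margin in `support_vectors_`.

```python
import numpy as np
from sklearn.svm import SVC

# two well-separated groups of 2D points, invented for illustration
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [3, 3], [3, 4], [4, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# the separating line is w[0]*x + w[1]*y + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("w =", w, " b =", b)
print("margin width =", margin_width)

predictions = clf.predict([[0.5, 0.5], [3.5, 3.5]])
print(predictions)
```

Only the support vectors determine the line: moving any of the other points (without crossing the margin) would leave the fitted hyperplane unchanged.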

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
from sklearn.svm import SVC

params_search = {
    "SVC": {'model__tol': [1e-3, 1e-2, 1e-4],
            # 'model__C': [1, 0.5, 1.5],
            # 'model__kernel': ['rbf', 'poly', 'sigmoid'],
            },
}


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Linear Support Vector Machine

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to its documentation, Linear Support Vector Machine (LinearSVC) is similar to SVC with parameter `kernel='linear'`, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
from sklearn.svm import LinearSVC

params_search = {

    "LinearSVC": {#'model__C':[1,0.5,1.5],
                  'model__tol':[1e-3,1e-2,1e-4],
                  },
}

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Linear classifier with SGD (Stochastic Gradient Descent)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to the Scikit-learn documentation, this estimator implements regularized linear models (SVM, logistic regression, etc.) with stochastic gradient descent (SGD) learning.
* SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. It is only a way to train a model.
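
The point that SGD is a training method rather than a model family can be seen in the `loss` hyperparameter: the optimizer stays the same while the loss decides which linear model is trained. A minimal sketch, assuming toy numeric data from `make_classification` instead of the SMS data, is shown below. The logistic regression loss is left out because its name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# toy numeric data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# same optimizer (SGD), different loss functions
fitted = {}
for loss in ("hinge", "modified_huber"):
    fitted[loss] = SGDClassifier(loss=loss, random_state=101).fit(X_toy, y_toy)
    print(loss, "-> train accuracy:", fitted[loss].score(X_toy, y_toy))
```

Here `hinge` (the default) trains a linear SVM, and `modified_huber` is another loss supported by the estimator. In the notebook's dictionary structure this choice could be searched with a `model__loss` key.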

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
from sklearn.linear_model import SGDClassifier

params_search = {
    "SGDClassifier": {'model__tol':[1e-3, 1e-2, 1e-4],
                    #  'model__penalty':['l2', 'l1', 'elasticnet'],
                     # 'model__alpha':[0.0001,0.001],
                      },
}

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Naive Bayes

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to the Scikit-learn [documentation](https://scikit-learn.org/stable/modules/naive_bayes.html), Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
* Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
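
As a minimal sketch of Naive Bayes on text, the example below trains `MultinomialNB` on word counts from four invented messages (they are not taken from the SMS dataset). The `alpha` hyperparameter is the smoothing term: it keeps a word that was never seen in a class from forcing that class's probability to zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# tiny made-up training set for illustration
messages = ["win prize now", "free prize click",
            "see you at lunch", "meeting at noon"]
targets = ["spam", "spam", "ham", "ham"]

nb_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("model", MultinomialNB(alpha=1.0)),  # alpha=1.0 is Laplace smoothing
])
nb_pipeline.fit(messages, targets)

new_message = ["free prize"]
pred = nb_pipeline.predict(new_message)
proba = nb_pipeline.predict_proba(new_message)
print(pred, proba)
```

Both words in the new message appear only in the spam examples, so the spam class gets the higher probability. The columns of `predict_proba` follow the order in `nb_pipeline.classes_`.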

# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
from sklearn.naive_bayes import MultinomialNB

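params_search = {
    # alpha=0.0 means no smoothing; depending on the scikit-learn version it
    # may be clipped to a tiny positive value or emit warnings
    "MultinomialNB": {'model__alpha': [1.0, 0.6, 0.4, 1.3, 0.0]
                      },
}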


---