# Scikit-learn - Unit 09 - NLP (Natural Language Processing)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and create a ML pipeline for NLP (Natural Language Processing)


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - NLP (Natural Language Processing)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Conversational language, unlike text neatly entered into form inputs, is unstructured data that cannot be neatly broken down into elements in a row-column database table; there is a vast quantity of information available within it and waiting to be accessed. 
* The aim of natural language processing, therefore, is to gather, extract and make available all of this information.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> NLP is not a trivial task since its goal is to understand the language, not only process the text/strings/keywords. 
* As we know, language is ambiguous, subjective and subtle.  New words and terms are constantly added/updated and their meaning may change according to the context. 
* These aspects all together make NLP a very interesting and challenging task for ML.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will study NLP (Natural Language Processing) as a supervised learning problem where the feature is text and the target variable is a label associated with that text. The ML task is therefore Classification.
* The workflow will be similar to what we covered for Classification tasks, where we:
    * load the data
    * define the pipeline steps
    * split the data into train and test sets
    * train multiple pipelines using hyperparameter optimization
    * evaluate pipeline performance
* One difference will be defining the pipeline steps, where we will use steps for pre-processing the textual data before the modelling step. Once you have a processed text, you can then use ML algorithms to predict your target variable.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load data

We will use a dataset that records whether a given SMS message is spam or not (labelled spam or ham). We load the data from GitHub.
* In this project we are interested in **predicting whether a given message is spam or not**; therefore, the ML task is Classification.

url = 'https://raw.githubusercontent.com/ShresthaSudip/SMS_Spam_Detection_DNN_LSTM_BiLSTM/master/SMSSpamCollection'
df = (pd.read_csv(url, sep ='\t',names=["label", "message"])
    .sample(frac=0.6, random_state=0)
    .reset_index(drop=True)
    )
df = df.sample(frac=0.5, random_state=101)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note: just as a reinforcement, in an actual project, once you load your textual data, you could explore using the techniques covered in the Text Analysis lesson. 
* We will not do that here since our focus is on the ML process used in NLP.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Split data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we split the data into train and test sets.
* In this case, there are 2 columns in the dataset: `message` contains the text, and `label` tells whether the SMS message is spam or not.
* In the end, we have a Pandas Series for the feature (`message`) and another for the target (`label`). Note the brackets subsetting the data, for example: `df['message']`.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.2, random_state=101)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Create the pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png"> We will consider the classic steps in an NLP pipeline, where we first clean the text and then extract the features for the model.
* The pipeline steps will be slightly different from those we studied for Classification (Data Cleaning, Feature Engineering, Feature Scaling, Feature Selection and Model), but the purpose is the same: prepare the data for the model.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png"> Overall, we will consider steps for **(1) cleaning the textual data and (2) representing the text as numbers, also known as feature extraction.**
* (1) In our case, we will make the text lowercase and remove punctuation for text cleaning.
    * The practical tasks for cleaning the textual data will differ from dataset to dataset. For example, you may have a dataset where you need to strip HTML tags, so you need a function to do that for you; or you may need to remove diacritics (marks located above or below a letter to reflect a particular pronunciation, as in *resumé*).
  
* (2) There are also multiple techniques for feature extraction; we will consider the ones we covered in Module 2. In this case, we **will tokenize the text and then use TF-IDF (Term Frequency-Inverse Document Frequency)**.
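
To make the cleaning step (1) concrete, here is a minimal sketch that lowercases messages and strips punctuation using only pandas and the standard library. The function name `clean_text` and the sample messages are hypothetical, and this is an approximation for illustration, not texthero's implementation; texthero may treat some characters differently (for example, by replacing punctuation with spaces rather than deleting it).

```python
import string

import pandas as pd


def clean_text(messages: pd.Series) -> pd.Series:
    """Lowercase each message and delete ASCII punctuation characters."""
    lowered = messages.str.lower()
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes it.
    table = str.maketrans("", "", string.punctuation)
    return lowered.apply(lambda text: text.translate(table))


# Hypothetical sample messages, not taken from the dataset
sample = pd.Series(["Free entry!!! Text WIN to 80086", "Ok, see you at 5."])
cleaned = clean_text(sample)
print(cleaned.tolist())
```

Digits and whitespace are left untouched, so tokens such as phone numbers remain available to the feature extraction step.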

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are using texthero to clean the textual data **by changing the text to lowercase and removing punctuation**.
* If you want to refresh these concepts, you may refer back to Module 2, where we covered the NLP task.
* We need to create a custom Python class so we can pass it into the pipeline afterwards. We are using the same approach for creating custom transformers that we saw in the feature-engine lesson: we inherit from BaseEstimator and TransformerMixin and create fit and transform methods, so the custom transformer can be added correctly to the ML pipeline.

from sklearn.pipeline import Pipeline
import texthero as hero

from sklearn.base import BaseEstimator, TransformerMixin
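class text_cleaning(BaseEstimator, TransformerMixin):
  """Custom transformer that lowercases text and removes punctuation."""

  def __init__(self):
    # No hyperparameters to store
    pass

  def fit(self, X, y=None):
    # Nothing is learned from the data, so fit just returns the transformer
    return self

  def transform(self, X):
    # X is expected to be a Pandas Series of strings
    X = hero.preprocessing.lowercase(X)
    X = hero.remove_punctuation(X)
    return X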


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For feature extraction we use **CountVectorizer** and **TfidfTransformer**. You can find their documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* We need to convert the textual data to a numeric format that the algorithms can learn relationships from, also known as vectors.
  * CountVectorizer: According to its documentation, it converts a collection of text documents to a matrix of token counts. It stores the number of times every word is used in each document of our text data. We are also removing English "stop words".
  * (TfidfTransformer) Term Frequency-Inverse Document Frequency Transformer: According to its documentation, it transforms a count matrix to a normalized tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are empirically less informative than features that occur in a small fraction of the data. In addition, this highlights the words that are most distinctive of a document and therefore better for characterizing it.
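
The two feature extraction steps can be sketched on a tiny corpus as follows. The three messages are made up for illustration and are not taken from the dataset; the default settings of `TfidfTransformer` are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: each string is one "document"
corpus = [
    "win a free prize now",
    "are you free for lunch now",
    "lunch at noon",
]

# Step 1: token counts, ignoring English stop words
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index; sort tokens by column
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Step 2: reweight the counts with tf-idf (rows are L2-normalized by default)
tfidf = TfidfTransformer().fit_transform(counts)

print(vocabulary)
print(counts.toarray())
print(np.round(tfidf.toarray(), 2))
```

Both matrices have one row per message and one column per vocabulary token. In the tf-idf matrix, a token that appears in several messages receives a lower weight than a token of equal count that appears in only one.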


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png"> Finally, our pipeline will have 4 steps:
* Text cleaning: lowercase the text and remove punctuation
* CountVectorizer: convert text to a matrix of token counts
* TF-IDF: transform a count matrix to a normalized tf or tf-idf representation
* Model

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def PipelineOptimization(model):
  pipeline = Pipeline([
                       
        ( 'text_cleaning', text_cleaning() ),
        ( 'vect', CountVectorizer(stop_words='english') ),
        ( 'tfidf', TfidfTransformer() ),
        ( 'model', model )
    ])
  
  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We load the Python class (HyperparameterOptimizationSearch) that we studied in previous units, which aims to fit a set of algorithms with multiple hyperparameters. A quick refresher on what this class does:
* The developer defines a set of algorithms and their respective hyperparameter values
* The code iterates over each algorithm and fits pipelines using GridSearchCV with its respective hyperparameter values. The result is stored. This is repeated for all algorithms that the user listed.
* Once all pipelines are trained, the developer can retrieve a performance summary and an object that contains all trained pipelines. The developer can then subset the best pipeline.

from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
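
The `score_summary` method builds its statistics from the per-fold `split{i}_test_score` entries that GridSearchCV stores in `cv_results_`. The sketch below shows that idea in isolation. It uses synthetic numeric data from `make_classification` and a hypothetical two-value grid for `alpha`, chosen only for illustration, rather than the text pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to produce some cross-validation results
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

n_splits = 3
gs = GridSearchCV(
    SGDClassifier(random_state=101),
    {"alpha": [1e-4, 1e-2]},  # illustrative values, not a recommendation
    cv=n_splits,
    scoring="accuracy",
)
gs.fit(X, y)

# One row per hyperparameter combination, one column per fold
all_scores = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(n_splits)]
)

for params, scores in zip(gs.cv_results_["params"], all_scores):
    print(params, scores.min(), scores.mean(), scores.max(), scores.std())
```

Each row of `all_scores` corresponds to one candidate in `cv_results_['params']`, which is why the class can pair them with `zip` before computing the min, max, mean and standard deviation.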

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> List algorithms

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Now we list the algorithms we want to use for this task. First, we are considering new estimators from Scikit-learn that tend to offer reasonable performance for NLP tasks.
  * This doesn't mean we couldn't have considered the algorithms we have already seen in the course, like tree-based algorithms. However, the central point is that we use algorithms that tend to be more effective for NLP tasks.
  * For learning purposes, and to speed up training, we will consider only 2 algorithms (SGDClassifier and LinearSVC) from this set of algorithms used for NLP tasks. However, we suggest you try the other algorithms at your own pace.
  * We will not give full details of how these other algorithms work, to avoid overloading you with new information. It will be a matter of time, experience and curiosity for you to keep learning new topics as a data practitioner, including additional families of algorithms. There is a BONUS section at the end of this notebook where we briefly explain the algorithms and present the typical hyperparameters used for the NLP classification task.

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

models_search = {
    #"MultinomialNB":MultinomialNB(),
    "SGDClassifier":SGDClassifier(random_state=101),
   # "SVC": SVC(random_state=101),
    "LinearSVC": LinearSVC(random_state=101),
}


params_search = {
   # "MultinomialNB":{},
    "SGDClassifier": {},
   # "SVC": {},
    "LinearSVC": {},
}


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> We are using the technique we covered in previous units for hyperparameter optimization, where we:
* 1 - Fit multiple pipelines with multiple algorithms using their default hyperparameters, so we can find the algorithms that appear to best fit the data
* 2 - Then fit multiple pipelines for the best algorithms using multiple hyperparameter combinations
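
For step 2, the hyperparameter names in the grid need to be prefixed with the pipeline step name followed by a double underscore (here `model__`), because GridSearchCV is tuning a Pipeline rather than a bare estimator. The sketch below illustrates this on a handful of made-up messages; the grid values for `alpha` and `penalty` are hypothetical and chosen only to show the syntax, and the text cleaning step is left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from the SMS dataset
texts = [
    "win a free prize now", "see you at lunch",
    "free cash prize claim now", "call me after work",
    "claim your free ringtone", "lunch tomorrow at noon",
    "urgent win cash today", "running late see you soon",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("model", SGDClassifier(random_state=101)),
])

# "model__alpha" targets the `alpha` parameter of the step named "model"
param_grid = {
    "model__alpha": [1e-4, 1e-3],    # illustrative values
    "model__penalty": ["l2", "l1"],  # illustrative values
}

grid = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)
print(grid.best_params_)
```

With the dictionaries used in this notebook, a grid like `param_grid` would be the value stored under the `"SGDClassifier"` key of `params_search`, replacing the empty dictionary used for the default-hyperparameter run.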

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit multiple pipelines with multiple algorithms using their default hyperparameters

We start by fitting multiple pipelines using the default hyperparameters.
* We pass in the training data, set the scoring metric to accuracy (we assume our stakeholders are interested in how accurate their system is) and set cv=2 (typically you would set it to 5, but for learning purposes and faster training, we set it to 2).

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=2)

Let's check the training results summary.
* Note that SGDClassifier performed best, although the difference from LinearSVC is slight.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

---