<a href="https://colab.research.google.com/github/smoussa24/Ardunio_Uno_Counter/blob/master/SKLearn_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# McGill AI Scikit-Learn Workshop

January 27, 2020

**Etienne Denis**

Slideshow URL: [git.io/fh91K](https://git.io/fh91K)

Colab: [here](https://colab.research.google.com/drive/1LQuuM9oNuQhX16jyMoD2ekkIvJ4nefHd)

## Why SKLearn
- Implementing scalable and efficient ML models is **hard**
- Python packages ecosystem
- Open source and well documented
- Includes many models


## (Supervised) ML Pipeline

<img src="http://drive.google.com/uc?export=view&id=1rRN74OATkfsAJPrUmT1YxxXeA9lQK5BN" width="1000">




## ML with SKLearn 

<img src="http://drive.google.com/uc?export=view&id=1BhV_PosHmARa0SKJ18OzbY9flD0hMXJI" width="1000">




## NLP Example: 20 News Group Dataset
We will explain the concepts behind each of these steps using the [20 News Group dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

In [0]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all')

# The 20 News Group dataset is a collection of ~20,000 newsgroup documents 
# partitioned into 20 balanced categories ("groups") as shown below:

newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Splitting the data
* SKLearn takes care of partitioning data with the function [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, train_size=0.8, test_size=0.2)
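
As a minimal sketch on made-up toy data, the same function can also be asked for a reproducible, class-balanced split. The names `docs` and `labels` are hypothetical. `random_state` fixes the shuffle, and `stratify` keeps the class ratio similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents, 2 balanced classes
docs = ["doc %d" % i for i in range(10)]
labels = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels,
    test_size=0.2,      # 20% held out for testing
    random_state=551,   # fixed seed: same split on every run
    stratify=labels,    # keep the class ratio in both parts
)
print(len(X_tr), len(X_te))
```

Whether to stratify is a judgment call. It matters most when classes are small or imbalanced.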


## Vectorization and Feature Engineering


In [0]:
# Always look at the data before making feature engineering decisions
# Seems like the internet never changes...
print(X_train[10000])

From: mark@taylor.uucp (Mark A. Davis)
Subject: Re: Blinking Cursor in Xterm???
Organization: Lake Taylor Hospital Computer Services
Lines: 25

barmar@think.com (Barry Margolin) writes:

>In article <1993May18.130845.6859@taylor.uucp> mark@taylor.uucp (Mark A. Davis) writes:
>>You are stuck in a distributed system feedback loop!  What if you are on an
>>Xterminal or running xterm over the net to another machine?  That is when the
>>load problems occur.  If you had a machine with 20 Xterminals attached and
>>they all had blinking cursors in xterm's, this could represent a considerable
>>ethernet bandwidth waste.

>Ethernet certainly has enough bandwidth for a few blinking cursors.  My
>Lisp Machine blinks the cursor twice a second and updates the time in the
>status line every second.  This uses under 1K bytes/second, or about .1% of
>the bandwidth of Ethernet.  A hundred of them on a single ethernet might be
>a problem, but a couple dozen should be fine.

Granted it's nothing to loose 

## Bag of Words

Each word in the selected vocabulary (e.g. `{apple, banana}`) is **one hot encoded**:
```
Apple  ==>> [1,0]
Banana ==>> [0,1]
```

A BoW vectorized “document” encodes the multiplicity of each (vocab) word in the document:
```
d1 = "Apple Banana Banana Apple" ==>> [2,2]
d2 = "Banana Cat Chicken Frog" ==>> [0,1]
d3 = "Banana Apple Banana Apple" ==>> [2,2]
```
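
The counting can be sketched in plain Python, assuming a fixed two-word vocabulary and simple whitespace tokenization. The helper `bow_vector` is hypothetical and not part of SKLearn. Words outside the vocabulary are ignored, and documents that differ only in word order map to the same vector.

```python
from collections import Counter

vocab = ["apple", "banana"]  # fixed vocabulary, in column order

def bow_vector(document, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocab]

docs = ["Apple Banana Banana Apple",
        "Banana Cat Chicken Frog",
        "Banana Apple Banana Apple"]
vectors = [bow_vector(d, vocab) for d in docs]
print(vectors)  # [[2, 2], [0, 1], [2, 2]]
```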



In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
X_test_counts = count_vect.transform(X_test) 


There are many other ways of vectorizing textual features. Most methods result in (lossy) compression of information present in the text. What's lost has a large impact on downstream performance. 

**Think about it: what sort of information is lost with a BoW encoding?**

## Better Features
* Bag of words loses information on ordering and relative frequency
    * tf-idf
    * n-grams
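
To make the two ideas concrete, here is a small plain-Python sketch. The helpers `ngrams` and `smooth_idf` are hypothetical names. The idf formula is the smoothed variant, `ln((1 + n) / (1 + df)) + 1`, which is the one `TfidfTransformer` uses by default.

```python
import math

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smooth_idf(term, tokenized_docs):
    """Inverse document frequency with add-one smoothing."""
    n_docs = len(tokenized_docs)
    df = sum(1 for doc in tokenized_docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

docs = [["banana", "apple", "banana"],
        ["banana", "cat"],
        ["banana", "frog"]]

print(ngrams(docs[0], 2))          # bigrams keep local word order
print(smooth_idf("banana", docs))  # in every document: lowest weight
print(smooth_idf("cat", docs))     # rare term: higher weight
```

N-grams recover some ordering information, while idf down-weights terms that appear in most documents.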

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)


## Normalization
* Scales individual samples (rows of `X_train`) to have unit norm 
* There are many ways of normalizing; SKLearn provides `l1`, `l2` and `max`
* We'll use the default **l2 normalization**
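
As a rough illustration of what l2 normalization does to a single row, here is a plain-Python sketch. The function `l2_normalize` is a hypothetical helper, not SKLearn's implementation.

```python
import math

def l2_normalize(row):
    """Scale a vector so that its Euclidean (l2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0:
        return list(row)  # leave all-zero rows unchanged
    return [x / norm for x in row]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```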

In [0]:
from sklearn.preprocessing import Normalizer

normalizer_transformer = Normalizer().fit(X=X_train_tfidf)
X_train_normalized = normalizer_transformer.transform(X_train_tfidf)
X_test_normalized = normalizer_transformer.transform(X_test_tfidf)


## The Model
* SKLearn provides a wide variety of [models](https://scikit-learn.org/stable/supervised_learning.html)
* No GPU support
* Not a deep learning library

<img src="http://scikit-learn.org/stable/_static/ml_map.png" width="1000">



In [0]:
# We will be using MultinomialNB but there are many other good choices
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_normalized, y_train)

## Prediction and Evaluation

In [0]:
from sklearn import metrics
y_pred = clf.predict(X_test_normalized)
print(metrics.classification_report(y_test, y_pred,
    target_names=newsgroups.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.90      0.73      0.80       178
           comp.graphics       0.85      0.74      0.79       192
 comp.os.ms-windows.misc       0.86      0.84      0.85       208
comp.sys.ibm.pc.hardware       0.71      0.91      0.79       182
   comp.sys.mac.hardware       0.93      0.82      0.87       188
          comp.windows.x       0.95      0.82      0.88       212
            misc.forsale       0.90      0.70      0.79       192
               rec.autos       0.86      0.89      0.87       186
         rec.motorcycles       0.98      0.96      0.97       209
      rec.sport.baseball       0.97      0.95      0.96       201
        rec.sport.hockey       0.91      0.99      0.95       206
               sci.crypt       0.74      0.98      0.84       192
         sci.electronics       0.89      0.72      0.80       195
                 sci.med       0.97      0.93      0.95       197
         

# Pipeline

#### Notice how there is a repetitive structure in our code:

<img src="http://drive.google.com/uc?export=view&id=1D7Fc3Cv5rBl9-XvaVpCdtSGLs4u-knJu" width="1000">

#### This calls for abstraction!


## SKLearn Pipeline
Why?
* abstraction
* k-fold cross validation
* keeping track of hyperparameters

How?
 * Chains a series of transformers followed by a final estimator, passing data from one step to the next
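
The chaining idea can be sketched in plain Python. `MiniPipeline`, `AddOne` and `MajorityClass` are toy classes made up for illustration. SKLearn's real `Pipeline` does considerably more, such as parameter handling and caching.

```python
class MiniPipeline:
    """Toy stand-in for a pipeline (hypothetical, not sklearn's class)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y=None):
        # Fit and apply every transformer in order, then fit the final step
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Re-apply the already fitted transformers, then predict
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


class AddOne:
    """Toy transformer: adds 1 to every value."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MajorityClass:
    """Toy estimator: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


pipe = MiniPipeline([("shift", AddOne()), ("clf", MajorityClass())])
print(pipe.fit([1, 2, 3], [0, 1, 1]).predict([10, 20]))  # [1, 1]
```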

## SKLearn Pipeline
![](http://docs.google.com/drawings/d/e/2PACX-1vQjzopDKBYU5J0vlQma0a3nQ_6KQulscS5P3CYlWEhAIVey3Ofj1U7_f4QRZvE4W0Nzaq7AYdzQLpOD/pub?w=463&h=531)

## Pipeline Makeover

In [0]:
from sklearn.pipeline import Pipeline
pclf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('norm', Normalizer()),
    ('clf', MultinomialNB()),
])

pclf.fit(X_train, y_train)
y_pred = pclf.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
    target_names=newsgroups.target_names))

# That was easy....

                          precision    recall  f1-score   support

             alt.atheism       0.90      0.73      0.80       178
           comp.graphics       0.85      0.74      0.79       192
 comp.os.ms-windows.misc       0.86      0.84      0.85       208
comp.sys.ibm.pc.hardware       0.71      0.91      0.79       182
   comp.sys.mac.hardware       0.93      0.82      0.87       188
          comp.windows.x       0.95      0.82      0.88       212
            misc.forsale       0.90      0.70      0.79       192
               rec.autos       0.86      0.89      0.87       186
         rec.motorcycles       0.98      0.96      0.97       209
      rec.sport.baseball       0.97      0.95      0.96       201
        rec.sport.hockey       0.91      0.99      0.95       206
               sci.crypt       0.74      0.98      0.84       192
         sci.electronics       0.89      0.72      0.80       195
                 sci.med       0.97      0.93      0.95       197
         

In [0]:
# From: https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html
import numpy as np
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

## Randomized Search and Cross Validation

In [0]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Keys follow the "<step name>__<parameter>" convention of Pipeline
params = {"vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
          "tfidf__use_idf": [True, False],
          # uniform(loc, scale) samples from [loc, loc + scale],
          # so alpha is drawn from [0.010, 0.011]
          "clf__alpha": uniform(1e-2, 1e-3)}

seed = 551  # Very important for repeatability in experiments!

random_search = RandomizedSearchCV(pclf, param_distributions=params, cv=2,
                                   verbose=10, random_state=seed, n_iter=1)
random_search.fit(X_train, y_train)



Fitting 2 folds for each of 1 candidates, totalling 2 fits
[CV] clf__alpha=0.010640064047894703, tfidf__use_idf=False, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__alpha=0.010640064047894703, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.8775347912524851, total=   7.6s
[CV] clf__alpha=0.010640064047894703, tfidf__use_idf=False, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.3s remaining:    0.0s


[CV]  clf__alpha=0.010640064047894703, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.8834152171026424, total=   6.9s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   21.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   21.9s finished


RandomizedSearchCV(cv=2, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...malizer(copy=True, norm='l2')), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
          fit_params=None, iid='warn', n_iter=1, n_jobs=None,
          param_distributions={'vect__ngram_range': [(1, 1), (1, 2), (2, 2)], 'tfidf__use_idf': [True, False], 'clf__alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11d990b38>},
          pre_dispatch='2*n_jobs', random_state=551, refit=True,
          return_train_score='warn', scoring=None, verbose=10)
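
The keys of `params` address parameters of individual pipeline steps with a double underscore. As a minimal sketch of that convention, assuming a fresh two-step pipeline named `demo` that reuses the step names `vect` and `clf`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# "<step name>__<parameter>" reaches into a named step
demo.set_params(vect__ngram_range=(1, 2), clf__alpha=0.01)

print(demo.named_steps["vect"].ngram_range)
print(demo.named_steps["clf"].alpha)
```

The search object uses the same mechanism to configure each sampled candidate before fitting it.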

## CV Results and Final Eval

In [0]:
report(random_search.cv_results_)
y_pred = random_search.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
    target_names=newsgroups.target_names))

Model with rank: 1
Mean validation score: 0.880 (std: 0.003)
Parameters: {'clf__alpha': 0.010640064047894703, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}

                          precision    recall  f1-score   support

             alt.atheism       0.92      0.90      0.91       178
           comp.graphics       0.77      0.84      0.80       192
 comp.os.ms-windows.misc       0.87      0.86      0.86       208
comp.sys.ibm.pc.hardware       0.78      0.87      0.82       182
   comp.sys.mac.hardware       0.89      0.88      0.89       188
          comp.windows.x       0.94      0.89      0.91       212
            misc.forsale       0.88      0.88      0.88       192
               rec.autos       0.93      0.92      0.93       186
         rec.motorcycles       0.99      0.96      0.97       209
      rec.sport.baseball       0.99      0.96      0.97       201
        rec.sport.hockey       0.95      0.99      0.97       206
               sci.crypt       0.96      0

# Customizing the Pipeline

## Custom Transformers
Assignment 1 asked you to create custom features. How could we add similar features to our SKLearn pipeline? We'll be working off of [an example in the SKLearn docs](https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html).

We'll have to create our own classes of the same polymorphic type that the [**Pipeline** class](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) expects. Estimators (i.e. ML Models) and Transformers both extend two base classes in SKLearn:

* *Transformers and estimators* both extend the [**BaseEstimator** class](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html)

* *Transformers* also extend the [**TransformerMixin** class](https://sklearn.org/modules/generated/sklearn.base.TransformerMixin.html)

* *Estimators* that classify also extend the [**ClassifierMixin** class](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html) (regressors extend `RegressorMixin` instead)
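
A minimal sketch of what extending these base classes buys us. `LengthTransformer` is a hypothetical example, not one of the transformers built later. `BaseEstimator` supplies `get_params`/`set_params`, and `TransformerMixin` supplies `fit_transform`.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each string to [its length * scale]."""

    def __init__(self, scale=1.0):
        self.scale = scale  # stored under the same name as the argument

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [[len(text) * self.scale] for text in X]

t = LengthTransformer(scale=2.0)
print(t.get_params())                   # provided by BaseEstimator
print(t.fit_transform(["ab", "abc"]))   # provided by TransformerMixin
```

Storing constructor arguments unchanged under the same attribute name is what lets `get_params` and hyperparameter search work.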

Let's take another look at a newsgroup document to inform our new feature choices.


In [0]:
# Notice the separate subject and body. The subject carries concise information
# that is lost by performing tf-idf BoW on the whole post.
print(X_train[0])

From: jason@ab20.larc.nasa.gov (Jason Austin)
Subject: Polls (was Re: Top Ten Excuses for Slick Willie's Record-Setting Disapproval Rati)
Organization: NASA Langley Research Center, Hampton, VA
Lines: 37
Reply-To: Jason C. Austin <j.c.austin@larc.nasa.gov>
NNTP-Posting-Host: ab20.larc.nasa.gov
In-reply-to: libwca@emory.edu's message of 14 Apr 93 20:31:09 GMT

In article <2680@emoryu1.cc.emory.edu> libwca@emory.edu (Bill Anderson) writes:
-> : 	According to a ``CNN Poll'' to key reason for Clinton's low
-> : approval rating is people are angry about him not moving fast enough
-> : on gays in the military.  I just burst out laughing when I heard this;
-> : what planet do these CNN people live on anyway?
-> : --
-> : Jason C. Austin
-> : j.c.austin@larc.nasa.gov       
-> 
-> Dunno, man... that sounds pretty damned unlikely to me, too,
-> although it's certainly one of the reasons I'm pissed off at him.
-> Maybe the sample was taken entirely from my fellow memebers of the
-> Cultural Elit

As seen in the document, we might gain from looking at the subject and body separately.

We'll be adding **three custom features**:
* BoW for the *subject line*
* BoW for the *body* 
* Vector of body *text stats*
    * length of text
    * number of sentences


The pipeline passes data in a sequential manner from transformer to transformer (and finally to an estimator). Let's get a *rough* idea of what the flow looks like in our case:

* Splitting the subject and body
* Calculating the custom *text stats*
* Calculating BoW features for *subject line* and *body*

Finally, we will also have to consider how to merge all of these features.

## Splitting the Subject and Body
* We will separate the documents into a record array (indexed like a dictionary) with `subject` and `body` keys, so that downstream transformers can access the separated data

In [0]:
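# These helpers moved to a private module in scikit-learn 0.22
try:
    from sklearn.datasets._twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
except ImportError:  # scikit-learn < 0.22
    from sklearn.datasets.twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
import numpy as np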
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features
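
The header/body split at the heart of `transform` can be tried on its own. This is a sketch on a made-up `post`. The `.strip()` on the subject is an addition for readability and is not in the class above.

```python
post = ("From: someone@example.com\n"
        "Subject: Re: Blinking Cursor\n"
        "Lines: 2\n"
        "\n"
        "First line of the body.\n"
        "Second line of the body.\n")

# Headers end at the first blank line
headers, _, body = post.partition("\n\n")

subject = ""
for line in headers.split("\n"):
    if line.startswith("Subject:"):
        subject = line[len("Subject:"):].strip()
        break

print(repr(subject))
print(repr(body))
```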

## Selecting the Right Data
* Remember that all data manipulation within the pipeline must be done by a transformer that extends the proper classes, even for something as simple as selecting a dictionary key.

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


## Nested Pipelines
* To simplify working with more complicated flows, **pipelines can be nested** (we'll talk about how later)
* We have the necessary components for subject and body BoW features. *Let's extract these features*

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# Pipeline for pulling features from the post's subject line
subject_pipeline =  Pipeline([
    ('selector', ItemSelector(key='subject')),
    ('tfidf', TfidfVectorizer(min_df=50)),
])

# Pipeline for standard bag-of-words model for body
body_pipeline = Pipeline([
    ('selector', ItemSelector(key='body')),
    ('tfidf', TfidfVectorizer()),
    ('best', TruncatedSVD(n_components=50)),
    # TruncatedSVD performs dimensionality reduction: it projects the
    # tf-idf vectors onto "n_components" latent directions, each a
    # weighted combination of words (it does not select individual words)
])

## Text Stats Feature

In [0]:
class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # 'num_sentences' is a rough proxy: it simply counts periods
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]



In [0]:
from sklearn.feature_extraction import DictVectorizer

# Pipeline for pulling ad hoc features from post's body
body_stats_pipeline =  Pipeline([
    ('selector', ItemSelector(key='body')),
    ('stats', TextStats()),  # returns a list of dicts
    ('vect', DictVectorizer()),  # list of dicts -> feature matrix
])
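
To see what the last step of that pipeline does, here is a small sketch of `DictVectorizer` on hand-written dictionaries. The values in `stats` are made up. Each key becomes one column, ordered alphabetically by default.

```python
from sklearn.feature_extraction import DictVectorizer

stats = [{"length": 120, "num_sentences": 3},
         {"length": 45, "num_sentences": 1}]

vec = DictVectorizer()
X = vec.fit_transform(stats)  # sparse matrix, one column per key

print(vec.feature_names_)  # column order
print(X.toarray())
```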

## Combining Features
* **Remember:** Data flows sequentially in the pipeline. 

How do we go about extracting several features in parallel from the same data?
   * SKLearn provides a [**FeatureUnion** class](https://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.FeatureUnion.html) that allows us to pass the data from the previous step to several nested pipelines.
   * FeatureUnion works by **concatenating** our feature vectors
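
A minimal sketch of the concatenation on a toy matrix, assuming two throwaway branches built with `FunctionTransformer`. The step names `identity` and `row_sum` are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

union = FeatureUnion(
    transformer_list=[
        ("identity", FunctionTransformer()),           # 2 columns, unchanged
        ("row_sum", FunctionTransformer(
            lambda A: A.sum(axis=1, keepdims=True))),  # 1 column
    ],
    transformer_weights={"row_sum": 0.5},  # scales that branch's output
)

combined = union.fit_transform(X)
print(combined.shape)
print(combined)
```

Every branch receives the same input, and the outputs are stacked side by side, so the result here has 2 + 1 = 3 columns.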


In [0]:
class DenseTransformer(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

In [0]:
from sklearn.pipeline import FeatureUnion

custom_features = FeatureUnion(
        transformer_list=[
            # The custom feature transformers
            ('subject', subject_pipeline),
            ('body_bow', body_pipeline),
            ('body_stats', body_stats_pipeline),
        ],
    
        # weight feature components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )

## Putting it all together


In [0]:
from sklearn.svm import SVC

custom_feature_pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', custom_features),
    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear'))

])

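custom_feature_pipeline.fit(X_train, y=y_train)
# Predict from the test documents (not the labels), then compare
# the true labels against the predictions, in that order
y_pred = custom_feature_pipeline.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
    target_names=newsgroups.target_names))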

## Good Practices
* Python can encourage lousy programming
* Make sure your experiments are repeatable
    * random seeds
    * abstraction
    * recording parameters and outputs
* When using Jupyter, watch out for namespace bugs (e.g. stale variables left over from cells run out of order)
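
As a small sketch of the random-seed point, using only the standard library: `sample_alphas` is a hypothetical helper, and the range mirrors the narrow alpha interval searched earlier.

```python
import random

SEED = 551  # same value as the seed used earlier in this workshop

def sample_alphas(seed, n=3):
    """Draw n candidate values from a private, seeded generator."""
    rng = random.Random(seed)  # avoids touching the global random state
    return [rng.uniform(0.010, 0.011) for _ in range(n)]

# Same seed, same draws
print(sample_alphas(SEED) == sample_alphas(SEED))  # True
```

Passing the seed explicitly, rather than relying on global state, also helps with the namespace issues mentioned above.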


## Questions?

<img src="http://drive.google.com/uc?export=view&id=1BhV_PosHmARa0SKJ18OzbY9flD0hMXJI" width="400">




**Stay up to date with all of our events! Sign up for the weekly newsletter** [here](https://docs.google.com/forms/d/e/1FAIpQLScs1ADlyTZHnvbQPIftXhE2yGvqPGzyhhKq87q212ttkOmIaQ/viewform)