# Comp 551 Tutorial 2 :: scikit-learn

Oct 3, 2018

### Things we'll cover today

1) **What is scikit-learn and why should I care about it**

2) **How to go about doing ML (with scikit-learn)**
  - ML as a single pipeline
    - Most common data preprocessing steps
      - train-test split
      - vectorization of textual features (only for textual features)
      - TF-IDF (only for textual features)
      - normalization
      - one-hot encoding of labels (for classification problems only)
    - Training
      - fit (closed form or gradient descent)
      - predict
    - Evaluation
      - Metrics: accuracy, precision, recall, F-score
      - Cross validation
      - Grid Search

1) **What is scikit-learn and why should I care about it**

- Scikit-learn is a free, Python-based ML library that provides well-implemented, off-the-shelf implementations of almost all common ML operations.
- Implementing ML from scratch so that it is scalable, efficient, and error-free is very hard.

*Disclaimer*: The availability of off-the-shelf ML implementations doesn't invalidate the need to know the details of the algorithms.



2) **How to go about doing ML (with scikit-learn)**





2.1) **ML as a single pipeline**
- Almost all *scalable* ML models are developed as a pipeline. Thinking in terms of a pipeline eases the pain of reasoning about complex ML processes.
- Keep this pipeline in mind while developing any ML model.

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/Steps.png)


- P.S.: ML methods that seek a closed-form solution don't follow this pattern.
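
The pipeline idea can also be expressed directly in code. The sketch below uses scikit-learn's `Pipeline` class, which this tutorial does not otherwise use, to chain a vectorizer and a classifier. The toy corpus, the labels, and the step names `"vectorize"` and `"classify"` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, invented purely for illustration.
docs = ["apple banana apple", "banana apple banana",
        "goal match team", "team goal goal"]
labels = [0, 0, 1, 1]  # 0 = fruit, 1 = sport

# Each step is a (name, estimator) pair; every step except the last
# must be a transformer (it needs fit/transform).
pipe = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", LogisticRegression()),
])

pipe.fit(docs, labels)                       # fits each step in order
pred = pipe.predict(["apple apple banana"])  # reuses the fitted vectorizer
print(pred)
```

Bundling the steps this way means the preprocessing fitted on the training data is automatically the one applied at prediction time.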

From here on, we'll explain the concepts behind each of these steps using a real dataset, 20 Newsgroups, as documented [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). So, let's import it now.

In [0]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [0]:
# This is what the features look like
list(newsgroups.data)[1:10]


['From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)\nSubject: Which high-performance VLB video card?\nSummary: Seek recommendations for VLB video card\nNntp-Posting-Host: midway.ecn.uoknor.edu\nOrganization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA\nKeywords: orchid, stealth, vlb\nLines: 21\n\n  My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-performance VLB card\n\n\nPlease post or email.  Thank you!\n\n  - Matt\n\n-- \n    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   \n  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- \n    |   of heaven, because everything he does is right and all his ways  |   \n    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   \n',
 'F

In [0]:
# This is what the targets look like
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

**2.1.1) Most common data-preprocessing steps**
![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/Preprocessing.png)

**2.1.1.1) Train-test split**

In [0]:
## A simple way to split the data is scikit-learn's `train_test_split` function
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, train_size=0.8, test_size=0.2)
print(len(X_train))
# print(len(y_train))
# print(y_train[0])

15076
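
The split above is random, so the selected documents change from run to run. A minimal sketch, assuming you want a reproducible split that also preserves class proportions: pass `random_state` and `stratify`. The toy data below is hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, two balanced classes.
X = list(range(10))
y = [0] * 5 + [1] * 5

# random_state fixes the shuffle; stratify=y keeps the class ratio
# the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_tr), len(X_te))
print(sorted(y_te))
```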


In [0]:
print(X_train[1])

From: jmd@cube.handheld.com (Jim De Arras)
Subject: Re: CLINTON JOINS LIST OF GENOCIDAL SOCIALIST LEADERS
Organization: Hand Held Products, Inc.
Lines: 51
Distribution: world
NNTP-Posting-Host: dale.handheld.com

In article <1993Apr23.153005.8237@starbase.trincoll.edu>  () writes:
> In article <1r6h4vINN844@clem.handheld.com>, jmd@cube.handheld.com (Jim De
> Arras) wrote:
> >   
> > You seem to make two points.  No one ultimately oversees the federal  
agencies  
> > you mention, and since Koresh "apparently" has a different view point from  
your  
> > Baptist upbringing, then he is not worthy of protection from religious  
> > persecution.  As to being the Messiah, is not Christ within us all?
> > 
> > Must be comforting to belong to a government approved religion.
> > 
> > Baptists are a cult, two, BTW, under most of the definitions in the  
dictionary  
> > of "cult".
> > 
> 
> I've yet to meet a group of Baptists who were stockpiling Cambell's soup
> and M-16's/AR-15's and banging

**2.1.1.2) Vectorization of textual features (applicable only for textual features)**

A very simple approach to representing textual features such as documents or sentences as numerical values is to treat each word as an atomic type and as a basis vector of a vector space. For example, imagine a world where only 3 words exist: “Apple”, “Orange”, and “Banana”, and every sentence or document is made of them. They become the basis of a 3-dimensional vector space:

```
Apple  ==>> [1,0,0]
Banana ==>> [0,1,0]
Orange ==>> [0,0,1]
```

This representation is called **one-hot**, as each word is always a vector of zeros with a 1 at the position of that word.

Then a “sentence” or a “document” is simply a linear combination of these vectors, where the coefficient along each dimension is the number of times the corresponding word appears. For example:

```
d3 = "Apple Orange Orange Apple" ==>> [2,0,2]
d4 = "Apple Banana Apple Banana" ==>> [2,2,0]
d1 = "Banana Apple Banana Banana Banana Apple" ==>> [2,4,0]
d2 = "Banana Orange Banana Banana Orange Banana" ==>> [0,4,2]
d5 = "Banana Apple Banana Banana Orange Banana" ==>> [1,4,1]
```
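
The same counts can be reproduced with `CountVectorizer`. This is a small sketch on the five toy documents above; by default the vectorizer lowercases the text and orders its vocabulary alphabetically, which here happens to match the Apple, Banana, Orange order.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Banana Apple Banana Banana Banana Apple",    # d1
    "Banana Orange Banana Banana Orange Banana",  # d2
    "Apple Orange Orange Apple",                  # d3
    "Apple Banana Apple Banana",                  # d4
    "Banana Apple Banana Banana Orange Banana",   # d5
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: one row per document

# Words ordered by their column index in the count matrix.
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
print(counts.toarray())
```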

Since our dataset also has textual features, we'll have to vectorize them. We already did the train-test split above, so we can now transform the text into vectors.



In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [0]:
vectors_train = vectorizer.fit_transform(X_train)

# We do the same for the test data. Remember to reuse the same fitted vectorizer and call only `transform` (why?)
vectors_test = vectorizer.transform(X_test)

In [0]:
print(vectors_train.shape)

(15076, 143453)
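
One way to see why the test data should only be transformed: the vocabulary, and therefore the column layout, is learned during `fit`. The sketch below uses a hypothetical three-document toy corpus to show that refitting on the test data produces columns that no longer line up with the training features.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["apple banana", "banana orange"]  # hypothetical toy data
test_docs = ["orange kiwi"]                     # "kiwi" never appears in training

vec = CountVectorizer().fit(train_docs)

# Correct: reuse the fitted vectorizer. Unknown words are ignored and
# the columns match the training vocabulary (apple, banana, orange).
test_ok = vec.transform(test_docs)
print(test_ok.shape, test_ok.toarray())

# Wrong: a freshly fitted vectorizer builds a different vocabulary
# (kiwi, orange), so a model trained on 3 columns cannot use it.
refit = CountVectorizer().fit_transform(test_docs)
print(refit.shape)
```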




**2.1.1.3) Tf–idf term weighting (only for textual features)**

In a large text corpus, some words are very common (e.g. “the”, “a”, “is” in English). These words rarely convey meaningful information. However, if we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

To re-weight the count features into floating-point values suitable for a classifier, it is very common to use the tf–idf transform.

tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)$$
  where:
- $t$ = term/token/word

- $d$ = document or paragraph

- $\text{tf}(t,d)$ is the number of times a token $t$ appears in a document $d$

- The idf is calculated using the following formula:
  $$\text{idf}(t) = \log\frac{n_{d}}{1+ \text{df}(d,t)}$$
  - where
   - $n_{d}$ is the total number of documents
   - $\text{df}(d,t)$ is the number of documents that contain term $t$
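
As a worked example, the sketch below applies the formula above by hand with NumPy to the five toy fruit documents from section 2.1.1.2. It also checks one detail worth knowing: with its default `smooth_idf=True`, scikit-learn's `TfidfVectorizer` uses a slightly different idf, $\ln\frac{1+n_d}{1+\text{df}} + 1$, and then l2-normalizes each row, so its numbers will not match the hand computation exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Banana Apple Banana Banana Banana Apple",
    "Banana Orange Banana Banana Orange Banana",
    "Apple Orange Orange Apple",
    "Apple Banana Apple Banana",
    "Banana Apple Banana Banana Orange Banana",
]

tf = CountVectorizer().fit_transform(docs).toarray()  # columns: apple, banana, orange
n_d = tf.shape[0]                  # total number of documents
df = (tf > 0).sum(axis=0)          # documents containing each term

idf = np.log(n_d / (1 + df))       # the formula from the text
tfidf = tf * idf                   # tf(t, d) x idf(t)
print(df, idf)

# scikit-learn's default (smoothed) idf, for comparison.
sk = TfidfVectorizer().fit(docs)
print(sk.idf_)
```

Here "apple" and "banana" appear in four of the five documents, so under the tutorial's formula their idf is zero and only "orange" keeps a positive weight.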

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer()
vectors_train_idf = tf_idf_vectorizer.fit_transform(X_train)
# Transform the test data with the same fitted tf-idf vectorizer
vectors_test_idf = tf_idf_vectorizer.transform(X_test)
print(X_train[1:2])

["From: pdc@dcs.ed.ac.uk (Paul Crowley)\nSubject: Re: Secret algorithm [Re: Clipper Chip and crypto key-escrow]\nReply-To: pdc@dcs.ed.ac.uk (Paul Crowley)\nOrganization: Edinburgh University\nLines: 11\n\nQuoting pla@sktb.demon.co.uk in article <8AOHOnj024n@sktb.demon.co.uk>:\n>You have every reason to be scared shitless.  Take a look at the records\n>of McCarthy, Hoover (J. Edgar, not the cleaner - though they both excelled at\n>sucking) and Nixon.\n\nHistory does not record whether J. Edgar Hoover was any good at sucking.\nAs for the cleaners, I'll stick with my 850W Electrolux and damn the\ncarpet.\n  __                                  _____\n\\/ o\\ Paul Crowley   pdc@dcs.ed.ac.uk \\\\ //\n/\\__/ Trust me. I know what I'm doing. \\X/  Fold a fish for Jesus!\n"]


In [0]:
print(vectors_train_idf.shape)
print(vectors_train_idf)

(15076, 143453)
  (0, 63454)	0.0288135123406339
  (0, 42332)	0.3419966030117869
  (0, 82767)	0.17637085176314105
  (0, 45539)	0.16955966078919996
  (0, 56164)	0.04095097076510266
  (0, 119892)	0.20762215355290042
  (0, 123298)	0.01440675617031695
  (0, 111507)	0.10915139419951263
  (0, 51392)	0.16343501905746047
  (0, 109640)	0.30924922353751205
  (0, 128135)	0.17165611334171327
  (0, 53326)	0.03568723079562396
  (0, 132658)	0.04340824223962878
  (0, 97010)	0.02676727411286295
  (0, 105355)	0.025980159325386727
  (0, 71356)	0.05322077009932647
  (0, 37181)	0.11606048663269346
  (0, 99961)	0.01499080819659299
  (0, 45512)	0.1354047003592674
  (0, 122153)	0.090684438174631
  (0, 131875)	0.027879548988000137
  (0, 62876)	0.09287370705699145
  (0, 45460)	0.08970332914063169
  (0, 45130)	0.05959052792161008
  (0, 22153)	0.11399886767059564
  :	:
  (15075, 52850)	0.061604288397687036
  (15075, 95673)	0.40231158365764463
  (15075, 131464)	0.11063485206848919
  (15075, 109607)	0.10132211651736

**2.1.1.4) Normalization**

While not mandatory, normalization often improves the performance of the learner significantly. It ensures that the learner sees data on similar scales. There are many ways of normalizing the data.
This will also be covered in detail in the lecture *Feature Construction and Selection*.

For now we'll stick to scikit-learn's `Normalizer` with its default *l2* norm, which rescales each sample (row) to unit length.

In [0]:
from sklearn.preprocessing import Normalizer

normalizer_train = Normalizer().fit(X=vectors_train)
vectors_train_normalized = normalizer_train.transform(vectors_train)
vectors_test_normalized = normalizer_train.transform(vectors_test)
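
To make the effect concrete, here is a minimal sketch on a hypothetical 3x2 matrix. After `Normalizer` with the default l2 norm, every row has Euclidean length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy matrix: 3 samples, 2 features.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [2.0, 2.0]])

X_norm = Normalizer().fit_transform(X)  # default norm='l2', applied per row
print(X_norm)
print(np.linalg.norm(X_norm, axis=1))   # length of each row after scaling
```

Note that this rescales rows (samples), not columns (features). Per-feature scaling is a different operation.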

**2.1.1.5) One-hot encoding of labels (for classification problems only)**

Integer labels are not always a suitable representation for classification tasks. The scikit-learn classifier used in this tutorial handles the integer labels internally. In most other cases, the programmer needs to represent them in a format that allows handling them explicitly. One popular format is one-hot encoding, which works as explained in section 2.1.1.2. So we only refer to the functions [here](http://scikit-learn.org/stable/modules/preprocessing_targets.html#) that do this for labels in the context of classification.
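
As a small illustration, one of the utilities for this is `LabelBinarizer`. The sketch below one-hot encodes a hypothetical list of three integer classes and maps the result back.

```python
from sklearn.preprocessing import LabelBinarizer

y = [0, 2, 1, 2]  # hypothetical integer class labels

lb = LabelBinarizer()
y_onehot = lb.fit_transform(y)   # one column per class, one row per label
print(lb.classes_)
print(y_onehot)

# The encoding is reversible.
y_back = lb.inverse_transform(y_onehot)
print(y_back)
```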

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/training.png)

In [0]:
## Now we instantiate the classifier. Remember this can be any classifier, even the one you make.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

**2.1.2.1) Fit**

In [0]:
## The scikit-learn API is simple and straightforward; its basic commands include:
## `fit` to learn your parameters
clf.fit(vectors_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

**2.1.2.2) Predict**

In [0]:
## We get the predicted class
print(vectors_test)
y_pred = clf.predict(vectors_test)
## So now we see we have a set of predictions. 
y_pred

  (0, 3778)	1
  (0, 7420)	1
  (0, 15092)	1
  (0, 15282)	1
  (0, 16870)	1
  (0, 19961)	1
  (0, 21129)	1
  (0, 24295)	1
  (0, 27731)	1
  (0, 27765)	1
  (0, 27777)	1
  (0, 27894)	6
  (0, 27895)	1
  (0, 27994)	1
  (0, 28380)	1
  (0, 28495)	1
  (0, 30049)	5
  (0, 30254)	1
  (0, 30268)	1
  (0, 30362)	1
  (0, 30479)	1
  (0, 30599)	1
  (0, 30690)	2
  (0, 30828)	13
  (0, 31378)	1
  :	:
  (3769, 115901)	1
  (3769, 120167)	1
  (3769, 120302)	1
  (3769, 120444)	1
  (3769, 120952)	1
  (3769, 121125)	3
  (3769, 122557)	2
  (3769, 123298)	1
  (3769, 124122)	1
  (3769, 126743)	1
  (3769, 126770)	3
  (3769, 126785)	14
  (3769, 126984)	1
  (3769, 127965)	1
  (3769, 131847)	3
  (3769, 132694)	1
  (3769, 134289)	1
  (3769, 136942)	1
  (3769, 137227)	1
  (3769, 137570)	1
  (3769, 137880)	4
  (3769, 138409)	3
  (3769, 138580)	1
  (3769, 139253)	1
  (3769, 141734)	1


array([18, 12,  0, ..., 13, 11,  4])
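
The same `fit` / `predict` pattern applies to any scikit-learn classifier. A minimal sketch on hypothetical one-dimensional toy data, which also shows `predict_proba` for inspecting the class probabilities behind each prediction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: small values are class 0, large values class 1.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)   # learn the parameters

X_new = [[1.5], [9.5]]
pred = clf.predict(X_new)              # predicted class per row
proba = clf.predict_proba(X_new)       # one probability per class, per row
print(pred)
print(proba.sum(axis=1))               # each row sums to 1
```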

### Evaluation

To see how well your classifier did, compare its predictions with the gold-standard labels. The nice thing about scikit-learn is that it provides several metrics for this. You can even use your own classifier and compare its list of predicted classes against the gold-standard classes.

![alt text](http://cs.mcgill.ca/~sthaku3/other_media/COMP551_tutorial_2/metrics.png)

In [0]:
from sklearn import metrics

The `metrics` module provides a set of useful metrics. For any classification task, you should mainly report these:

- Accuracy : (TP + TN) / (TP + TN + FP + FN)
- Precision : TP / (TP + FP)
- F1 Score : 2TP / (2TP + FP + FN)
- Recall Score : TP / (TP + FN)

Remember, for multi-class classification we use these metrics in `macro` averaging mode: we calculate the metric for each label and take the unweighted mean. You can instead use other averaging modes such as `micro` or `weighted` (`samples` applies only to multilabel problems).
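
To connect the formulas with the `macro` mode, the sketch below computes per-class TP, FP and FN from a confusion matrix for a hypothetical three-class example, applies the formulas above, and averages over classes. The labels are made up for illustration.

```python
import numpy as np
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical gold labels
y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical predictions

cm = metrics.confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted
tp = np.diag(cm)                # correctly predicted, per class
fp = cm.sum(axis=0) - tp        # predicted as the class, but wrong
fn = cm.sum(axis=1) - tp        # belongs to the class, but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Macro average = unweighted mean over classes.
print(precision.mean(), recall.mean(), f1.mean())
```

These means should agree with `metrics.precision_score`, `metrics.recall_score` and `metrics.f1_score` called with `average='macro'`.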

In [0]:
metrics.accuracy_score(y_test, y_pred)

0.903448275862069

In [0]:
metrics.precision_score(y_test, y_pred, average='macro')

0.9046498723495265

In [0]:
metrics.f1_score(y_test, y_pred, average='macro')

0.902861326959733

In [0]:
metrics.recall_score(y_test, y_pred, average='macro')

0.902009043513538

In [0]:
## You can show all of this in a single call
print(metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.92      0.90      0.91       168
          1       0.82      0.84      0.83       196
          2       0.87      0.84      0.85       195
          3       0.82      0.81      0.81       200
          4       0.89      0.89      0.89       213
          5       0.90      0.91      0.91       199
          6       0.82      0.90      0.86       186
          7       0.89      0.88      0.89       187
          8       0.97      0.94      0.95       200
          9       0.93      0.94      0.93       199
         10       0.93      0.97      0.95       191
         11       0.98      0.95      0.97       198
         12       0.84      0.83      0.84       191
         13       0.92      0.93      0.93       224
         14       0.95      0.93      0.94       201
         15       0.91      0.97      0.94       182
         16       0.90      0.95      0.93       180
         17       0.98      0.97      0.97   

In [0]:
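### Cross Validation

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Evaluate a Naive Bayes classifier with 5-fold cross validation on the training set
clf = MultinomialNB(alpha=.01)
scores = cross_val_score(clf, vectors_train, y_train, cv=5)
print(scores)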

[0.88653655 0.89002981 0.88594164 0.87607973 0.89195479]
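
To show what `cross_val_score` does under the hood, the sketch below repeats the 5-fold procedure by hand. It assumes the bundled iris dataset as a stand-in (no download needed) and relies on the fact that, for a classifier and an integer `cv`, scikit-learn splits with an unshuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)  # stand-in dataset, not the newsgroups data

scores = cross_val_score(MultinomialNB(alpha=0.01), X, y, cv=5)

# The same thing by hand: train on 4 folds, score on the held-out fold.
manual = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_clf = MultinomialNB(alpha=0.01).fit(X[train_idx], y[train_idx])
    manual.append(fold_clf.score(X[val_idx], y[val_idx]))  # accuracy on the fold

print(scores)
print(np.array(manual))
```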


### Grid Search

When you are first searching for the best hyperparameters, a good first strategy is to run a grid search over candidate values to see which ones work best for your dataset.

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

tuned_parameters = [{'alpha': [0.01, 1, 0.5, 0.2, 0.1]}]

n_folds = 5

clf = MultinomialNB()
clf = GridSearchCV(clf, tuned_parameters, cv=n_folds, refit=False)
clf.fit(vectors_train, y_train)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
print('scores:',scores)
print('scores_std',scores_std)

scores: [0.88611037 0.84770496 0.86700716 0.87609445 0.87894667]
scores_std [0.00547919 0.00674491 0.00838737 0.00750473 0.00641575]
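
Rather than reading the best setting off the printed scores, the search object can report it directly. A minimal sketch, again assuming the bundled iris dataset as a stand-in and the default `refit=True`, so that the best model is retrained on all the data and can be used for prediction:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)  # stand-in dataset, not the newsgroups data

grid = {"alpha": [0.01, 0.1, 0.2, 0.5, 1]}
search = GridSearchCV(MultinomialNB(), grid, cv=5)  # refit=True by default
search.fit(X, y)

print(search.best_params_)   # hyperparameters with the highest mean CV score
print(search.best_score_)    # that mean CV score

# With refit=True the search object itself behaves like the best model.
pred = search.predict(X[:3])
print(pred)
```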


### Choice of Classifier
![Choosing the right estimator](http://scikit-learn.org/stable/_static/ml_map.png)

### References

1. [Scikit Learn](http://scikit-learn.org/)

2. [Comp-551-tutorial (Winter-2018)](https://colab.research.google.com/drive/1SDc_x307LwqNucOclQinOdlDvawZdeo9#scrollTo=Cpjv6BG1ctcK)

### Useful Links

1. ROC Curve: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
2. https://people.duke.edu/~ccc14/sta-663/BlackBoxOptimization.html