## 1: Face Recognition, but not evil this time

Load the faces dataset with:

```
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
```

The `faces.target` attribute holds the integer class labels and `faces.target_names` the corresponding person names; together with `faces.data` they are enough to train a face recognition classifier.

Use sklearn **gridsearch** (or an equivalent, like random search) to optimize the model for accuracy. Try both an SVM-based classifier and a logistic-regression-based classifier (with a feature pipeline of your choice) to get the best model. You should reach at least 80% accuracy.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.manifold import Isomap
from matplotlib import offsetbox
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, Normalizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

In [3]:
X = faces.data
y = faces.target

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape

(1078, 2914)

In [5]:
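# SVM-based classifier: PCA to 150 whitened components, standard scaling,
# then an RBF SVC tuned by an inner grid search over C and gamma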

param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }

pipe = Pipeline([
    ('pca', PCA(n_components = 150, svd_solver ='randomized',
            whiten = True)),
    ('std', StandardScaler()),
    ('clf', GridSearchCV(SVC(kernel ='rbf', class_weight ='balanced'), param_grid))
])
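In the cell above, the grid search sits inside the pipeline as its last step, so PCA and the scaler are fitted once on the full training set before the inner cross-validation runs. A common alternative is to wrap the whole pipeline in `GridSearchCV`, which refits every step on each fold and addresses parameters as `<step>__<param>`. The sketch below is a minimal illustration on scikit-learn's bundled digits dataset (an assumption, chosen so that it runs without downloading LFW); `n_components=30` and the grid values are illustrative, not tuned.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digit images instead of LFW faces.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

svm_pipe = Pipeline([
    ("std", StandardScaler()),
    ("pca", PCA(n_components=30, random_state=42)),
    ("clf", SVC(kernel="rbf", class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as "<step>__<param>".
svm_grid = {
    "clf__C": [1, 10, 100],
    "clf__gamma": [0.001, 0.01],
}

# Every candidate is cross-validated with the full pipeline refit per fold.
search = GridSearchCV(svm_pipe, svm_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

With this layout, preprocessing parameters such as `pca__n_components` could be added to the same grid, which is not possible when only the final step is searched.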

In [6]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

0.8555555555555555

In [7]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[ 8  1  0  3  0  0  0  0]
 [ 1 44  1  4  0  0  0  1]
 [ 0  1 20  4  0  0  0  0]
 [ 0  3  1 94  0  0  0  0]
 [ 0  0  0  4 15  0  0  2]
 [ 0  1  0  4  0  9  0  1]
 [ 0  1  0  1  0  0  8  0]
 [ 0  0  0  5  0  0  0 33]]
              precision    recall  f1-score   support

           0       0.89      0.67      0.76        12
           1       0.86      0.86      0.86        51
           2       0.91      0.80      0.85        25
           3       0.79      0.96      0.87        98
           4       1.00      0.71      0.83        21
           5       1.00      0.60      0.75        15
           6       1.00      0.80      0.89        10
           7       0.89      0.87      0.88        38

    accuracy                           0.86       270
   macro avg       0.92      0.78      0.84       270
weighted avg       0.87      0.86      0.85       270
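The report above labels the classes 0 to 7. `classification_report` accepts a `target_names` argument, so in this notebook one could pass `faces.target_names` to show the person names instead. The sketch below uses toy labels and placeholder names (both assumptions for illustration) so that it runs without the dataset.

```python
from sklearn.metrics import classification_report

# Toy labels and placeholder names (assumptions for illustration only);
# in the notebook these would be y_test, y_pred and faces.target_names.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
names_demo = ["person_a", "person_b", "person_c"]

# Text version for reading, dict version for programmatic access.
print(classification_report(y_true_demo, y_pred_demo, target_names=names_demo))
report = classification_report(
    y_true_demo, y_pred_demo, target_names=names_demo, output_dict=True
)
```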



In [8]:
# Logistic regression classifier

logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=42)

In [9]:
y_pred = logreg.predict(X_test)
accuracy_score(y_test, y_pred)

0.8185185185185185
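The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch that applies both inside one pipeline follows. It uses the bundled digits dataset as a stand-in for the faces data (an assumption), and `max_iter=1000` is an illustrative value rather than a recommendation.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): digits instead of LFW faces.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling conditions the optimisation problem; max_iter gives the solver
# more room to converge than the default of 100 iterations.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_logreg.fit(X_tr, y_tr)
demo_score = scaled_logreg.score(X_te, y_te)
print(demo_score)
```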

In [10]:
pipe = Pipeline([
    ('standard_scaler', StandardScaler()),
    ('pca', PCA()),
    ('reg', LogisticRegression()),
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8518518518518519

In [11]:
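# Note: y_pred still holds the predictions of the plain `logreg` model from In [9],
# so this report describes that model (accuracy 0.82), not the pipeline from In [10].
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))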

[[ 8  0  2  1  1  0  0  0]
 [ 2 42  2  3  0  0  0  2]
 [ 0  2 18  4  0  0  0  1]
 [ 1  4  3 84  2  2  0  2]
 [ 1  0  0  1 17  0  1  1]
 [ 0  1  0  1  3 10  0  0]
 [ 0  0  0  0  0  0 10  0]
 [ 1  0  0  4  1  0  0 32]]
              precision    recall  f1-score   support

           0       0.62      0.67      0.64        12
           1       0.86      0.82      0.84        51
           2       0.72      0.72      0.72        25
           3       0.86      0.86      0.86        98
           4       0.71      0.81      0.76        21
           5       0.83      0.67      0.74        15
           6       0.91      1.00      0.95        10
           7       0.84      0.84      0.84        38

    accuracy                           0.82       270
   macro avg       0.79      0.80      0.79       270
weighted avg       0.82      0.82      0.82       270



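- **Judging by test-set accuracy, the two classifiers perform similarly: 0.856 for the SVM pipeline, versus 0.819 for plain logistic regression and 0.852 for logistic regression with scaling and PCA.**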

## 2: Bag of Words, Bag of Popcorn

By this point, you are ready for the [Bag of Words, Bag of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition. 

Use NLP feature pre-processing (with scikit-learn, Gensim, spaCy, or Hugging Face) to build the best classifier you can. Use a feature pipeline and grid search for your final model.

A successful project should score 90% or more on a **holdout** dataset you kept for yourself.
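One way to combine text pre-processing, a feature pipeline, and a grid search in scikit-learn is to put a vectorizer in front of the classifier and search over both. The sketch below uses `TfidfVectorizer` with logistic regression on a tiny invented corpus (the reviews, labels, and grid values are all assumptions for illustration); it is not the approach taken in the cells that follow, which use word2vec embeddings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus (assumption): 1 = positive, 0 = negative.
toy_reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "I loved this movie, truly great",
    "wonderful, moving and great fun",
    "a terrible film with an awful script",
    "awful acting and a boring story",
    "I hated this movie, truly terrible",
    "boring, awful and a waste of time",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("reg", LogisticRegression(max_iter=1000)),
])

# Vectorizer and classifier settings are searched together.
text_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "reg__C": [0.1, 1, 10],
}

text_search = GridSearchCV(text_pipe, text_grid, cv=2)
text_search.fit(toy_reviews, toy_labels)

print(text_search.best_params_)
toy_prediction = text_search.predict(["a great and wonderful movie"])
print(toy_prediction)
```

On the real data, the holdout split would be made before fitting, so that the score on it reflects reviews the search never saw.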

In [12]:
train = pd.read_csv('data/labeledTrainData.tsv', sep='\t')
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [13]:
#Recycling the same gensim method from 4.7

import re

replaceDict = dict({
'{':" ", '}':" ", ',':"", '.':" ", '!':" ", '\\':" ", '/':" ", '$':" ", '%':" ",
'^':" ", '?':" ", '\'':" ", '"':" ", '(':" ", ')':" ", '*':" ", '+':" ", '-':" ",
'=':" ", ':':" ", ';':" ", ']':" ", '[':" ", '`':" ", '~':" ",
})

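# Escape the keys so that regex metacharacters such as "." match literally.
rep = dict((re.escape(k), v) for k, v in replaceDict.items())

pattern = re.compile("|".join(rep.keys()))
def replacer(text):
    # Look up the replacement for whichever character was matched.
    return rep[re.escape(text.group(0))]

# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
words = train.review.str.replace(pattern, replacer, regex=True).str.lower().str.split()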
words = pd.DataFrame(words.tolist())
words

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2509,2510,2511,2512,2513,2514,2515,2516,2517,2518
0,with,all,this,stuff,going,down,at,the,moment,with,...,,,,,,,,,,
1,the,classic,war,of,the,worlds,by,timothy,hines,is,...,,,,,,,,,,
2,the,film,starts,with,a,manager,nicholas,bell,giving,welcome,...,,,,,,,,,,
3,it,must,be,assumed,that,those,who,praised,this,film,...,,,,,,,,,,
4,superbly,trashy,and,wondrously,unpretentious,80,s,exploitation,hooray,the,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,it,seems,like,more,consideration,has,gone,into,the,imdb,...,,,,,,,,,,
24996,i,don,t,believe,they,made,this,film,completely,unnecessary,...,,,,,,,,,,
24997,guy,is,a,loser,can,t,get,girls,needs,to,...,,,,,,,,,,
24998,this,30,minute,documentary,buñuel,made,in,the,early,1930,...,,,,,,,,,,
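To make the cleaning step concrete, here is a minimal standard-library sketch of the same idea applied to a single string. It uses a shortened replacement table and a sample sentence (both assumptions for illustration), and the names `replace_demo`, `pattern_demo`, and `clean` are hypothetical.

```python
import re

# A shortened replacement table (assumption: only a few of the
# characters from the notebook's dictionary are listed here).
replace_demo = {",": "", ".": " ", "!": " ", "'": " ", "(": " ", ")": " "}

# Escape each key so characters such as "." and "(" match literally.
pattern_demo = re.compile("|".join(re.escape(k) for k in replace_demo))

def clean(text):
    """Replace mapped punctuation, lowercase, and split on whitespace."""
    replaced = pattern_demo.sub(lambda m: replace_demo[m.group(0)], text)
    return replaced.lower().split()

tokens = clean("I don't believe they made this film!")
print(tokens)
# ['i', 'don', 't', 'believe', 'they', 'made', 'this', 'film']
```

Because the apostrophe is replaced by a space, contractions are split into two tokens, which is why `don` and `t` appear as separate columns in row 24996 of the table above.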


In [14]:
import gensim.downloader as model_api

word_vectors = model_api.load("word2vec-google-news-300")

words.columns = words.columns.astype(str)

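def soft_get(w):
    # Out-of-vocabulary words contribute a zero vector.
    try:
        return word_vectors[w]
    except KeyError:
        return np.zeros(word_vectors.vector_size)
def map_vectors(row):
    # Sum the word vectors of all tokens in a review (NaN padding is skipped).
    try:
        return np.sum(
            row.loc[row.notna()].apply(soft_get)
        )
    except Exception:
        return np.zeros(word_vectors.vector_size)
emb = pd.DataFrame(words.apply(map_vectors, axis=1))
emb.columns = ['C']
# Expand the column of 300-dimensional arrays into one column per dimension.
emb = pd.DataFrame(np.array(emb.C.apply(pd.Series)))
emb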

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,17.283424,11.937847,18.506580,45.120676,-24.435819,3.551744,12.055645,-27.254570,30.983692,19.554092,...,-24.531746,23.092484,-33.328987,9.722412,-26.716827,-10.303997,2.579872,-19.129990,11.317310,-4.080734
1,4.778992,6.744947,3.544289,15.174940,-8.201431,0.314445,4.249725,-9.231554,7.977154,8.119308,...,-12.727188,3.146690,-12.804749,3.808205,-6.146217,-2.835491,-0.495293,-4.693481,4.241936,-0.824364
2,7.668854,18.400442,2.117029,24.500671,-17.646823,-1.491764,3.763025,-24.287819,18.558378,24.320732,...,-23.039650,6.284547,-23.298676,15.287937,-12.791531,-5.257778,2.578659,-12.364639,8.910461,4.143981
3,13.438416,9.676690,3.184448,28.120438,-22.092400,3.097771,14.640491,-28.468567,23.722717,22.710403,...,-25.556366,4.002686,-23.584404,9.610126,-10.111442,-2.726578,3.336586,-10.713562,16.601911,3.251701
4,17.447418,9.209012,8.393372,27.671641,-22.164581,1.606564,11.384158,-23.531187,22.702066,23.904671,...,-24.457840,10.056759,-29.204399,6.255138,-14.498840,-0.895094,6.454372,-19.625992,0.599186,1.965607
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,4.843140,3.083902,0.044777,8.552612,-7.245542,1.479004,4.257339,-4.647678,4.680054,2.775642,...,-5.378468,6.225250,-7.275024,3.712402,-4.262329,-1.908768,4.530167,-3.873884,-0.512589,0.154842
24996,5.863220,2.340229,2.210129,12.314983,-7.772020,1.114616,10.503281,-13.057060,13.101852,12.770050,...,-15.988598,6.903183,-15.212250,2.484120,-5.440849,-1.743176,2.057991,-6.311344,6.623047,-2.365334
24997,2.160698,3.789089,-0.081245,13.001617,-8.620522,-1.689472,2.976837,-6.946411,7.065413,7.312721,...,-7.426193,9.140686,-10.666756,4.078735,-6.753197,0.665543,-0.919044,-7.831268,3.209351,-2.725433
24998,9.051056,3.218176,9.683090,16.531189,-7.922881,3.315762,9.002274,-17.094444,12.638168,7.161171,...,-16.743801,0.952118,-15.382585,2.369106,-11.096809,-1.904205,-0.214203,-5.103699,12.709579,-2.317215
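Because each review embedding is a sum of word vectors, its magnitude grows with review length, which is visible in the differing scales of the rows above. The sketch below illustrates this with toy 3-dimensional vectors (an assumption; the real vectors have 300 dimensions) and shows that row-wise L2 normalisation, as done with `Normalizer` in the next cell, removes the length effect. The `pool` helper is hypothetical, and mean pooling is shown only for comparison.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy 3-dimensional "word vectors" (assumption: real ones have 300 dimensions).
toy_vectors = {
    "good": np.array([1.0, 0.0, 1.0]),
    "film": np.array([0.0, 1.0, 1.0]),
}

def pool(tokens, how="sum"):
    """Sum or average the vectors of known tokens; unknown tokens are skipped."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros(3)
    stacked = np.vstack(known)
    return stacked.sum(axis=0) if how == "sum" else stacked.mean(axis=0)

short = ["good", "film"]
long = short * 10  # same content repeated ten times

sum_short, sum_long = pool(short), pool(long)
mean_short, mean_long = pool(short, "mean"), pool(long, "mean")

# L2-normalising each row removes the length effect from the summed vectors.
normed = Normalizer().fit_transform(np.vstack([sum_short, sum_long]))
print(sum_short, sum_long)
print(np.allclose(normed[0], normed[1]))
```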


In [15]:
emb = emb.fillna(0)  # fill any missing values before normalising
X = Normalizer().fit_transform(emb)
y = train['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [16]:
clf = SVC()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8592

In [17]:
#Feature pipeline

pipe = Pipeline([
    ('standardscaler', StandardScaler()),
    ('pca', PCA()),
    ('reg', LogisticRegression()),
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.859

In [35]:
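# Grid search (exploratory): `pipe` and `param_grid` here are whatever was in memory
# when the cell ran; both are redefined in In [41] before the search is fitted.

grid = GridSearchCV(pipe, param_grid, cv = 4, verbose=1)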

In [36]:
# List every parameter name that can be used as a key in a parameter grid.
for param in grid.get_params().keys():
    print(param)

cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__pca
estimator__std
estimator__clf
estimator__pca__copy
estimator__pca__iterated_power
estimator__pca__n_components
estimator__pca__random_state
estimator__pca__svd_solver
estimator__pca__tol
estimator__pca__whiten
estimator__std__copy
estimator__std__with_mean
estimator__std__with_std
estimator__clf__cv
estimator__clf__error_score
estimator__clf__estimator__C
estimator__clf__estimator__break_ties
estimator__clf__estimator__cache_size
estimator__clf__estimator__class_weight
estimator__clf__estimator__coef0
estimator__clf__estimator__decision_function_shape
estimator__clf__estimator__degree
estimator__clf__estimator__gamma
estimator__clf__estimator__kernel
estimator__clf__estimator__max_iter
estimator__clf__estimator__probability
estimator__clf__estimator__random_state
estimator__clf__estimator__shrinking
estimator__clf__estimator__tol
estimator__clf__estimator__verbose
estimator__clf__estimator
estimator__cl
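The listing above follows scikit-learn's naming rule: nested parameters are joined with a double underscore, and wrapping a pipeline in `GridSearchCV` adds an `estimator__` prefix. A minimal sketch of how to read the tunable names directly from a pipeline follows; the step names match the pipeline defined in In [17], and `demo_pipe` and `tunable` are hypothetical names.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("standardscaler", StandardScaler()),
    ("pca", PCA()),
    ("reg", LogisticRegression()),
])

# Keys containing "__" are the ones that can go into a parameter grid.
tunable = sorted(k for k in demo_pipe.get_params() if "__" in k)
print([k for k in tunable if k.startswith("reg__")][:5])

# set_params accepts the same names, which is what GridSearchCV relies on.
demo_pipe.set_params(pca__n_components=55, reg__C=3)
```

Grid keys passed to `GridSearchCV(pipe, ...)` use the names without the `estimator__` prefix, for example `reg__C`.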

In [41]:
pipe = Pipeline([
    ('standardscaler', StandardScaler()),
    ('pca', PCA()),
    ('reg', LogisticRegression()),
])

param_grid = [
    {
    'pca__n_components': np.arange(50, 60, 5),
    'reg__fit_intercept': [True, False],
    'reg__C': [2, 5, 3]
}]

grid = GridSearchCV(pipe, param_grid, cv = 4, verbose=1)

In [42]:
grid.fit(X_train, y_train)
model = grid.best_estimator_

predict = model.predict(X_test)
print(grid.best_params_)

Fitting 4 folds for each of 12 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:   21.7s finished


{'pca__n_components': 55, 'reg__C': 3, 'reg__fit_intercept': False}
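The "12 candidates, totalling 48 fits" in the log above can be checked by enumerating the grid. The sketch below rebuilds the same grid under the hypothetical name `demo_grid` and counts its combinations with `ParameterGrid`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

demo_grid = [{
    "pca__n_components": np.arange(50, 60, 5),  # yields 50 and 55
    "reg__fit_intercept": [True, False],
    "reg__C": [2, 5, 3],
}]

# 2 x 2 x 3 combinations, each fitted once per fold.
n_candidates = len(ParameterGrid(demo_grid))
n_fits = n_candidates * 4  # cv = 4 folds
print(n_candidates, n_fits)
# 12 48
```

Note that `np.arange(50, 60, 5)` excludes the stop value, so only two PCA sizes are searched.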


In [44]:
model = grid.best_estimator_
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8414
[[2099  382]
 [ 411 2108]]
              precision    recall  f1-score   support

           0       0.84      0.85      0.84      2481
           1       0.85      0.84      0.84      2519

    accuracy                           0.84      5000
   macro avg       0.84      0.84      0.84      5000
weighted avg       0.84      0.84      0.84      5000
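The figures in the report above can be recomputed from the confusion matrix, which helps when checking that a report and a matrix belong to the same predictions. A short sketch, using scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

cm = np.array([[2099, 382],
               [411, 2108]])  # rows: true class, columns: predicted class

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted column
recall = np.diag(cm) / cm.sum(axis=1)     # per true row

print(round(float(accuracy), 4))
print(precision.round(2), recall.round(2))
```

The accuracy works out to 4207 / 5000 = 0.8414, matching the first line of the cell output.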

