# Linear Models (SVM, Logistic Regression)

## Table of Contents


[Requirements](#Requirements)   

[SVM](#SVM)  
[Model overview](#SVM-Model-overview)  
[Pros and Cons](#SVM-Proc-and-Cons)  
[Main params](#SVM-Main-params)  
[Practice](#SVM-Practice)  
[Useful links](#SVM-Useful-links)  
[Task](#SVM-Task)  

[Logistic Regression](#Logistic-Regression)  
[Model overview](#LR-Model-overview)  
[Pros and Cons](#LR-Proc-and-Cons)  
[Main params](#LR-Main-params)  
[Practice](#LR-Practice)  
[Useful links](#LR-Useful-links)  
[Task](#LR-Task)  

[Logistic Regression vs SVM](#LR-vs-SVM)  

## Requirements


1. Python 3.x (or Anaconda3 for Python 3.5, https://www.continuum.io/downloads)
2. Scikit-learn 0.18.x (pip install scikit-learn==0.18.1, http://scikit-learn.org/)
3. NLTK lib latest (http://www.nltk.org/install.html)
4. Pandas latest (http://pandas.pydata.org/)
5. For datasets with more than 1M reviews, minimum hardware requirements: at least 8 GB of RAM

[To the table of contents](#Table-of-Contents)

# SVM  

## sklearn.svm.LinearSVC  

### SVM Model overview  


A Support Vector Machine (SVM) is a supervised machine learning algorithm.

SVMs can be used for regression as well, but they are most commonly used for classification problems.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.

<img src="../pictures/svm_intro.png" alt="logistic" style="width: 100%;"/>


**Support Vectors**

Support vectors are the data points nearest to the hyperplane: the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

**What is a hyperplane?**

As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new test data is added, whichever side of the hyperplane a point lands on decides the class that we assign to it.


**How do we find the right hyperplane?**

Or, in other words, how do we best separate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
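As a rough illustration of the margin, the sketch below computes the distance of each point from a hyperplane $w \cdot x + b = 0$ and takes the smallest one. The weights, bias, and points are made-up values chosen for illustration; they are not taken from the dataset used later in this notebook.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(w, b, points):
    """Margin: distance from the hyperplane to the nearest point."""
    return min(distance_to_hyperplane(w, b, p) for p in points)

# Hypothetical 2-D example: the line x1 + x2 - 4 = 0
w, b = [1.0, 1.0], -4.0
points = [[1.0, 1.0], [0.0, 1.0], [3.0, 3.0], [4.0, 5.0]]

print(margin(w, b, points))
```

An SVM searches over `w` and `b` for the hyperplane that makes this smallest distance as large as possible while keeping the classes on opposite sides.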

<img src="../pictures/svm_margins.png" alt="logistic" style="width: 70%;"/>


**But what happens when there is no clear hyperplane?**  

This is where it can get tricky. Data is rarely as clean as in the simple example above.  
A dataset will often look more like the jumbled balls in the picture below, which represent a dataset that is not linearly separable.  
To classify such a dataset, it is necessary to move away from a 2D view of the data to a 3D view.  

Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This ‘lifting’ of the balls represents the mapping of data into a higher dimension. This is known as kernelling. 

<img src="../pictures/svm_kerneling.png" alt="logistic" style="width: 70%;"/>

Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
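The lifting idea can be sketched with a toy example. The mapping and the data below are assumptions made for illustration: each 2-D point gets a third coordinate equal to its squared distance from the origin, after which a flat plane separates the two classes. Kernel SVMs typically do not compute such a mapping explicitly, and `LinearSVC`, used later in this notebook, does not apply a kernel at all.

```python
def lift(point):
    """Map a 2-D point to 3-D by adding its squared distance from the origin."""
    x1, x2 = point
    return (x1, x2, x1 ** 2 + x2 ** 2)

# Hypothetical toy data: class 0 sits near the origin and class 1 surrounds it,
# so no straight line in 2-D separates the two classes.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# After lifting, the plane z = 1 separates the classes.
threshold = 1.0
inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]

print(all(z < threshold for z in inner_z))
print(all(z > threshold for z in outer_z))
```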


**SVM Uses**
 
SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis.  
It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification.  
SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.


[To the table of contents](#Table-of-Contents)

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA')

### SVM Proc and Cons

**Pros**

    + Simple
    + Often high accuracy
    + Works well on smaller, cleaner datasets
    + Can be more efficient because the decision function uses only a subset of the training points (the support vectors)


**Cons**

    - Not well suited to larger datasets, since training time with SVMs can be high (this does not apply to LinearSVC)
    - Less effective on noisier datasets with overlapping classes



[To the table of contents](#Table-of-Contents)

### SVM Main params

**sklearn.svm.LinearSVC**

**C** : float, default: 1.0  
    Penalty parameter of the error term; must be a positive float.  
    It acts as the inverse of regularization strength: smaller values specify stronger regularization.  
    It controls the trade-off between a smooth decision boundary and classifying the training points correctly.

**loss** : string, ‘hinge’ or ‘squared_hinge’ (default=‘squared_hinge’)  
    Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.

**penalty** : string, ‘l1’ or ‘l2’ (default=‘l2’)  
    Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ penalty leads to coef_ vectors that are sparse.

**tol** : float, optional (default=1e-4)  
    Tolerance for stopping criteria.
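To make the two `loss` options concrete, here is a minimal sketch of the hinge and squared hinge losses for a single example. It assumes labels encoded as -1/+1 and a raw decision score; the function names are illustrative and are not part of scikit-learn.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)

def squared_hinge(y, score):
    """Square of the hinge loss; penalizes large violations more heavily."""
    return hinge(y, score) ** 2

# A positive example scored well inside, just inside, and on the wrong side
for score in (2.0, 0.5, -1.0):
    print(score, hinge(+1, score), squared_hinge(+1, score))
```

Both losses are zero once an example is on the correct side with a score of at least 1; they differ in how quickly the penalty grows for violations.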


[To the table of contents](#Table-of-Contents)

### SVM Practice

In [2]:
import pandas as pd
import re
import pickle

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

import nltk
from nltk.stem import SnowballStemmer
from nltk import word_tokenize as nltk_wtknz

In [3]:
train = pd.read_csv('../data/movie_reviews.csv', sep = ',')
train = train[1:10000]  # keep rows 1..9999 (note: this slice skips row 0)
X = train.text
Y = train.label

In [4]:
print(train.shape)

(9999, 2)


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152610 entries, 0 to 152609
Data columns (total 2 columns):
label    152610 non-null int64
text     152610 non-null object
dtypes: int64(1), object(1)
memory usage: 2.3+ MB


In [6]:
# df.describe()
# df.describe(include=['object'])
# df['label'].value_counts()
train['label'].value_counts(normalize=True)

1    0.587498
0    0.412502
Name: label, dtype: float64

In [7]:
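stemmer = SnowballStemmer("english")  # build once and reuse across calls

def tokenize(text):
    """Keep letters only, split the text into words, and stem each word."""
    text = re.sub("[^a-zA-Z]", " ", text)
    word_list = nltk_wtknz(text)
    return [stemmer.stem(word) for word in word_list]

# Note: the pipelines below do not use this function;
# pass tokenizer=tokenize to TfidfVectorizer to apply it.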

In [8]:
pipeline_svm = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 3))),
                     ('clf_svm', LinearSVC(class_weight='balanced'))])

In [9]:
pipeline_svm.steps[1]

('clf_svm',
 LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
      intercept_scaling=1, loss='squared_hinge', max_iter=1000,
      multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
      verbose=0))

In [19]:
params_svm = dict(clf_svm__C=[100, 10, 1])
grid_search_svm = GridSearchCV(pipeline_svm, param_grid=params_svm, cv=5, scoring='accuracy')
%time grid_search_svm.fit(X,Y)

CPU times: user 1min 57s, sys: 590 ms, total: 1min 58s
Wall time: 1min 58s


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=Tr...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'clf_svm__C': [100, 10, 1]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring='accuracy', verbose=0)

In [20]:
print("# Tuning hyper-parameters")
print("Best parameters set found on development set:")
print()
print(grid_search_svm.best_params_)
print()
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
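# Note: predictions are made on X, the same data the model was fitted on,
# so this report reflects training performance, not held-out performance.
y_true, y_pred = Y, grid_search_svm.predict(X)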
print(classification_report(y_true, y_pred))
print()

# Tuning hyper-parameters
Best parameters set found on development set:

{'clf_svm__C': 10}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6537
          1       1.00      1.00      1.00     13462

avg / total       1.00      1.00      1.00     19999
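Because the report above is computed on the same data the model was fitted on, near-perfect scores are expected and say little about generalization. A minimal sketch of a held-out check follows; the tiny corpus, the split ratio, and the variable names are assumptions made for illustration and stand in for the review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the review data
texts = ["great movie", "loved it", "wonderful acting", "really enjoyable",
         "terrible movie", "hated it", "awful acting", "really boring"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Hold out 25% of the examples; the model never sees them during fitting
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

model = Pipeline([("vectorizer", TfidfVectorizer()),
                  ("clf_svm", LinearSVC(C=1.0))])
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(train_acc, val_acc)
```

The cross-validated score stored in `grid_search_svm.best_score_` is another estimate that does not reuse the fitting data for scoring.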




In [10]:
pipeline_svm = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                     ('clf_svm', LinearSVC(class_weight='balanced', C = 10))])
%time pipeline_svm.fit(X, Y)

CPU times: user 2min 32s, sys: 1.37 s, total: 2min 34s
Wall time: 2min 34s


Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tr...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [11]:
y_true, y_pred = Y, pipeline_svm.predict(X)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     62952
          1       1.00      1.00      1.00     89658

avg / total       1.00      1.00      1.00    152610



In [11]:
# with open('../dumps/m_lin_svc_mix_out2.pkl', 'wb') as f:
#     pickle.dump(pipeline_svm, f)

In [14]:
df = pd.read_csv("../data/test.csv", sep=",")

In [15]:
print(df.shape)

(10660, 2)


In [16]:
df.head()

   label                                               text
0      0  it's so laddish and juvenile , only teenage bo...
1      0  exploitative and largely devoid of the depth o...
2      0  [garbus] discards the potential for pathologic...
3      0  a visually flashy but narratively opaque and e...
4      0  the story is also as unoriginal as they come ,...


In [17]:
# with open('../dumps/m_lin_svc_mix_out.pkl', 'rb') as f:
#     model = pickle.load(f)

In [18]:
y_predicted_svm = pipeline_svm.predict(df['text'])

In [19]:
print(accuracy_score(df['label'], y_predicted_svm))
print(classification_report(df['label'], y_predicted_svm))

0.839212007505
             precision    recall  f1-score   support

          0       0.88      0.79      0.83      5330
          1       0.81      0.89      0.85      5330

avg / total       0.84      0.84      0.84     10660
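The columns of the classification report can be reproduced by hand from the predictions. The sketch below does this for one class on a small made-up label list; the helper name and the data are illustrative assumptions, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels and predictions
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```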



### SVM Useful links

http://cs229.stanford.edu/materials.html  
http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html  
https://www.csie.ntu.edu.tw/~cjlin/liblinear/  
http://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html  


### SVM Task



[To the table of contents](#Table-of-Contents)

In [None]:
### Train an SVM model with different parameters/preprocessing.
### Check accuracy on the test set.

#### Your code here ###

#my_pipeline_svm = Pipeline([...])

#### Your code ends here ###

#my_pipeline_svm.fit(X, Y)
#y_predicted_svm = my_pipeline_svm.predict(df['text'])
#print(accuracy_score(df['label'], y_predicted_svm))
#print(classification_report(df['label'], y_predicted_svm))

# Logistic Regression  

### LR Model overview  

Logistic regression is a linear model for classification rather than regression.  
Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.  
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function (sigmoid function).  

<img src="../pictures/sphx_glr_plot_logistic_001.png" alt="logistic" style="width: 70%;"/>

The decision boundary separates the feature space into two decision regions.  
Linear regression is not advised for classification, in part because it is sensitive to outliers and skewed data sets.

We will focus on **sklearn.linear_model.LogisticRegression** with the "liblinear" solver.  
With this solver, the class implements regularized logistic regression using the "liblinear" library.


[To the table of contents](#Table-of-Contents)

### LR Proc and Cons

**Pros**

    + Simple
    + Scales well to very large datasets
    + Fast

**Cons**

    - Not suited to data with non-linear class boundaries

[To the table of contents](#Table-of-Contents)

# Logistic function and SoftMax


### Logistic function
$$  
    F(x) = \frac{1}{1+e^{-(w_0 + x^T w_1)}} 
$$

<br> where $w_0$ is the intercept of the model, 
<br>$w_1$ is the vector of model coefficients, 
<br>$x$ is the feature vector 
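A direct translation of the logistic function into code may help. The intercept, coefficients, and inputs below are made-up values for illustration.

```python
import math

def logistic(x, w0, w1):
    """F(x) = 1 / (1 + exp(-(w0 + x.w1))) for a feature vector x."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w1))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: intercept -1 and two coefficients
w0, w1 = -1.0, [2.0, 0.5]

print(logistic([0.5, 0.0], w0, w1))  # z = 0, so the output is 0.5
print(logistic([2.0, 2.0], w0, w1))  # z = 4, output close to 1
```

The output is always between 0 and 1, and the decision boundary is the set of points where `z = 0`, that is, where the output equals 0.5.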


### SoftMax
$$  
    F_j(x) = \frac{e^{x^T w_j}}{\sum_{k=1}^{K} e^{x^T w_k}} 
$$

<br> where $w_j$ is the vector of model weights corresponding to class $j$,  
<br>$K$ is the number of classes,  
<br>$x$ is the feature vector
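The SoftMax formula can be sketched in a few lines. The weight vectors and the input are made-up values, and subtracting the largest score before exponentiating is a common numerical safeguard that leaves the result unchanged.

```python
import math

def softmax_probabilities(x, class_weights):
    """Return the SoftMax output for every class, given one weight vector per class."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in class_weights]
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for three classes and one feature vector
class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]

probs = softmax_probabilities(x, class_weights)
print(probs)
print(sum(probs))
```

With two classes, one of which has an all-zero weight vector, the output for the other class reduces to the logistic function.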


### LR Main params

**C** : float, default: 1.0  
    Inverse of regularization strength; must be a positive float.  
    Like in support vector machines, smaller values specify stronger regularization.  

**class_weight** : dict or ‘balanced’, default: None  
    Weights associated with classes in the form {class_label: weight}.  
    If not given, all classes are supposed to have weight one.  
    The ‘balanced’ mode sets weights inversely proportional to class frequencies.

**solver** : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}, default: ‘liblinear’  
    Algorithm to use in the optimization problem.  
    For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ is faster for large ones.  
    For multiclass problems, only ‘newton-cg’, ‘sag’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

**tol** : float, default: 1e-4  
    Tolerance for stopping criteria.

**n_jobs** : int, default: 1  
    Number of CPU cores used during the cross-validation loop. If given a value of -1, all cores are used.
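For `class_weight='balanced'`, scikit-learn derives the weights from class frequencies as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python on a made-up label list; the helper name is illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}

# Hypothetical labels: 6 positive and 4 negative examples
toy_labels = [1] * 6 + [0] * 4
class_weights_toy = balanced_class_weights(toy_labels)
print(class_weights_toy)
```

The rarer class receives the larger weight, so mistakes on it cost more during training.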


[To the table of contents](#Table-of-Contents)

In [20]:
pipeline_lr = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                     ('clf_lr', LogisticRegression())])

In [21]:
pipeline_lr.steps[1]

('clf_lr',
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False))

In [None]:
params_lr = dict(clf_lr__C=[0.01, 0.1, 1, 10])
grid_search_lr = GridSearchCV(pipeline_lr, param_grid=params_lr, cv=5, scoring='accuracy')
%time grid_search_lr.fit(X,Y)

In [None]:
print("# Tuning hyper-parameters")
print("Best parameters set found on development set:")
print()
print(grid_search_lr.best_params_)
print()
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
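# Note: predictions are made on X, the same data the model was fitted on,
# so this report reflects training performance, not held-out performance.
y_true, y_pred = Y, grid_search_lr.predict(X)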
print(classification_report(y_true, y_pred))
print()

In [22]:
pipeline_lr = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                     ('clf_lr', LogisticRegression(C = 1))])
pipeline_lr.fit(X, Y)

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tr...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [23]:
y_predicted_lr = pipeline_lr.predict(df['text'])
print(accuracy_score(df['label'], y_predicted_lr))
print(classification_report(df['label'], y_predicted_lr))

0.809099437148
             precision    recall  f1-score   support

          0       0.89      0.71      0.79      5330
          1       0.76      0.91      0.83      5330

avg / total       0.82      0.81      0.81     10660



### LR Practice

### LR Useful links

http://cs229.stanford.edu/materials.html  
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html  
https://habrahabr.ru/company/ods/blog/323890/  


### LR Task



[To the table of contents](#Table-of-Contents)

In [None]:
### Train a Logistic Regression model with different parameters/preprocessing.
### Check accuracy on the test set.



#### Your code here ###

#my_pipeline_lr = Pipeline([...])

#### Your code ends here ###

#my_pipeline_lr.fit(X, Y)
#y_predicted_lr = my_pipeline_lr.predict(df['text'])
#print(accuracy_score(df['label'], y_predicted_lr))
#print(classification_report(df['label'], y_predicted_lr))

# LR vs SVM  

In practical classification tasks, linear logistic regression and linear SVMs often yield very similar results.  
**Logistic regression** tries to maximize the conditional likelihood of the training data, so every training point influences the fit, which makes it more sensitive to outliers than SVMs.  
**SVMs** mostly care about the points that are closest to the decision boundary (the support vectors). 
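One way to see this difference is to compare the per-example losses of the two models. The sketch below is a simplified illustration that assumes labels encoded as -1/+1 and a raw decision score; the function names are illustrative.

```python
import math

def log_loss(y, score):
    """Logistic loss for a label y in {-1, +1}: log(1 + exp(-y * score))."""
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1}: max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)

# A correctly classified positive example at increasing distance from the boundary
for score in (0.5, 1.0, 3.0, 10.0):
    print(score, log_loss(+1, score), hinge_loss(+1, score))
```

The hinge loss is exactly zero once a point is beyond the margin, so such points no longer affect the solution, while the logistic loss stays positive for every point.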

On the other hand, logistic regression has the advantage of being a simpler model that is easier to implement. Furthermore, logistic regression models can be easily updated with new data, which is attractive when working with streaming data.
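As a minimal sketch of such an update, the code below applies one stochastic gradient step of logistic regression per incoming example. The learning rate, the tiny stream, and the function name are assumptions made for illustration; in scikit-learn, incremental training of linear models is available through `SGDClassifier.partial_fit`.

```python
import math

def sgd_update(w, b, x, y, lr=0.1):
    """One gradient step of logistic regression for a label y in {0, 1}."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
    error = p - y                    # gradient of the log loss w.r.t. z
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b

# Hypothetical stream: the first feature signals class 1, the second class 0
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.2], 1), ([0.1, 1.0], 0)]

w, b = [0.0, 0.0], 0.0
for _ in range(50):                  # replay the stream a few times
    for x, y in stream:
        w, b = sgd_update(w, b, x, y)

print(w, b)
```

Each update touches only the current example, so the model can keep learning as new data arrives without refitting on the full history.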


[To the table of contents](#Table-of-Contents)