
<h2 style="font-size:34px; font-family:Verdana" align="center">Linear Models </h2>


<a id='0'></a>

<a id='1'></a>

## 1. Naive Bayes

Classification is a fundamental issue in machine learning and data mining. In classification, the goal of a learning algorithm is to construct a classifier given a set of training examples with class labels. Typically, an example $E$ is represented by a tuple of attribute values ($x_1, x_2, … , x_n$), where $x_i$ is the value of attribute $X_i$. 

Let $C$ represent the classification variable, and let $c$ be the value of $C$. Here we assume that there are only two classes: + (the positive class) and − (the negative class).

A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes' rule, the probability of an example $E = (x_1, x_2, … , x_n)$ being class $c$ is

$$p(c|E) = {\frac{p(E|c)p(c)}{p(E)}}$$

$E$ is classified as the class $C = +$ if and only if
$$f_b(E) = {\frac{p(C = +|E)}{p(C = -|E)}}\geq1$$


where $f_b(E)$ is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable; that is,
$$p(E|c) = p(x_1, x_2, … , x_n|c) = \prod\limits_{i = 1}^n p(x_i|c)$$
the resulting classifier is then:
$$f_{nb}(E) = {\frac{p(C = +)}{p(C = -)}} \prod\limits_{i = 1}^n {\frac{p(x_i|C = +)}{p(x_i|C = -)}} $$
The function $f_{nb}(E)$ is called a naive Bayesian classifier, or simply naive Bayes (NB). Figure 1 shows an example of naive Bayes. In naive Bayes, each attribute node has no parent except the class node.


<img src='http://i.piccy.info/i9/e3f799752a613ded34b7d17feaf2eca4/1492605985/18456/1138938/NB5.jpg'/>

Naive Bayes is the simplest form of Bayesian network, in which all attributes are independent given the value of the class variable. This is called conditional independence. The conditional independence assumption rarely holds in real-world applications. A straightforward approach to overcoming this limitation of naive Bayes is to extend its structure to represent the dependencies among attributes explicitly.
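
The ratio $f_{nb}(E)$ defined above can be evaluated by hand for a small case. The sketch below is only an illustration: the priors and per-attribute likelihoods are hypothetical numbers, and `naive_bayes_ratio` is a helper name introduced here, not part of any library.

```python
from math import prod


def naive_bayes_ratio(prior_pos, prior_neg, likelihoods_pos, likelihoods_neg):
    """Return f_nb(E) = p(+)/p(-) * prod_i p(x_i|+)/p(x_i|-)."""
    ratio = prior_pos / prior_neg
    ratio *= prod(lp / ln for lp, ln in zip(likelihoods_pos, likelihoods_neg))
    return ratio


# Hypothetical probabilities for an example E with two attributes.
prior_pos, prior_neg = 0.5, 0.5
likelihoods_pos = [0.8, 0.5]    # p(x_1|+), p(x_2|+)
likelihoods_neg = [0.2, 0.25]   # p(x_1|-), p(x_2|-)

f_nb = naive_bayes_ratio(prior_pos, prior_neg, likelihoods_pos, likelihoods_neg)

# E is assigned to the positive class if and only if the ratio is at least 1.
label = '+' if f_nb >= 1 else '-'
print(f_nb, label)
```

In practice the product is usually computed as a sum of logarithms to avoid numerical underflow when there are many attributes.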

<a id='2'></a>

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Import data for train  
train = pd.read_csv('../data/movie_reviews.csv', sep = ',')
train_data = train.text
train_labels = train.label

In [9]:
# Import data for test 
test = pd.read_csv('../data/test.csv', sep = ',')
test_data = test.text
test_labels = test.label

In [3]:
# Create validation dataset
train_data, train_validation_data, train_labels, train_validation_labels = train_test_split(train.text, train.label, test_size=0.2, random_state=42, stratify=train.label)

In [4]:
# Keep only the last 22 words of each review in the training split
train_data = train_data.str.split().apply(lambda words: ' '.join(words[-22:]))

In [5]:
# List of stopwords
STOPWORDS = ['by','does', 'was', 'were', 'the', 'of', 'end', 'and', 'is']    

In [6]:
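# Convert text to a document-term count matrix
# (for inspection only; the pipeline in the next cell fits its own vectorizer)
cvect = CountVectorizer()
counts = cvect.fit_transform(train_data)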

In [7]:
# Train NB
classifier = MultinomialNB(alpha=0.4)
pipeline = Pipeline([('vectorizer', CountVectorizer(binary=True,ngram_range=(1,3),stop_words=STOPWORDS)), ('classifier', classifier)])
model = pipeline.fit(X=train_data, y=train_labels)
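
To make the vectorizer settings used in the pipeline more concrete, here is a minimal sketch of what `CountVectorizer(binary=True, ngram_range=(1,3), stop_words=...)` produces. The two documents and the shortened stopword list are hypothetical toy inputs, not taken from the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (assumption: not from movie_reviews.csv).
docs = ["The movie was great", "great great movie"]
stopwords = ['the', 'was']  # a small subset of the STOPWORDS list

# binary=True records presence/absence instead of raw counts;
# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stopwords)
matrix = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)  # learned n-grams, in column order
rows = matrix.toarray().tolist()             # one 0/1 row per document

print(vocabulary)
print(rows)
```

Stop words are dropped before the n-grams are built, so the first document contributes the bigram `movie great`, and the repeated word in the second document is still encoded as 1 because of `binary=True`.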

In [8]:
# Validation
pred_validation = model.predict(train_validation_data)

print("Accuracy :", metrics.accuracy_score(train_validation_labels, pred_validation))
print("F1-score :", metrics.f1_score(train_validation_labels, pred_validation))

Accuracy : 0.818589869602
F1-score : 0.848388598341


In [10]:
# Test
pred_test = model.predict(test_data)

print("Accuracy :", metrics.accuracy_score(test_labels, pred_test))
print("F1-score :", metrics.f1_score(test_labels, pred_test))

Accuracy : 0.8095684803
F1-score : 0.825180847399
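
For reference, the two metrics reported above can be computed by hand. The sketch below assumes binary labels with 1 as the positive class (the default for `metrics.f1_score`); the label lists are hypothetical and `accuracy_and_f1` is a helper name introduced only for this illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for binary labels from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical labels: 4 of 6 predictions are correct.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)
```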


## 2. Linear Models (SVM, Logistic Regression)

### SVM Model overview  


A Support Vector Machine (SVM) is a supervised machine learning algorithm.

SVMs can be used for both classification and regression, but they are more commonly used for classification.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.

<img src="../pictures/svm_intro.png" alt="logistic" style="width: 100%;"/>


**Support Vectors**

Support vectors are the data points nearest to the hyperplane: the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

**What is a hyperplane?**

As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new test data is added, the side of the hyperplane on which it lands determines the class that we assign to it.
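
The rule "the side of the hyperplane decides the class" can be written as the sign of $w \cdot x + b$. The following is a minimal sketch with a hypothetical weight vector `w` and offset `b` for a two-feature problem; these values are chosen for illustration and are not learned from data.

```python
def decision_value(w, b, x):
    """Return w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 on the non-negative side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical separating line x1 + x2 = 3 in a two-feature space.
w, b = (1.0, 1.0), -3.0

print(classify(w, b, (3.0, 2.0)))   # point above the line
print(classify(w, b, (0.5, 1.0)))   # point below the line
```

The magnitude of `decision_value` grows with the distance from the hyperplane, which matches the intuition that points far from the boundary are classified with more confidence.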


**How do we find the right hyperplane?**


Or, in other words, how do we best segregate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.

<img src="../pictures/svm_margins.png" alt="logistic" style="width: 70%;"/>
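
The margin described above can be computed directly as the smallest distance from any training point to the hyperplane $w \cdot x + b = 0$. This is a sketch under assumed values: the hyperplane and the points are hypothetical, and `distance_to_hyperplane` and `margin` are helper names introduced for the illustration.

```python
from math import sqrt


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to the nearest point of the data set."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


# Hypothetical hyperplane 3*x1 + 4*x2 = 0 and three training points.
w, b = (3.0, 4.0), 0.0
points = [(3.0, 4.0), (-1.0, -1.0), (2.0, 1.0)]

m = margin(w, b, points)
print(m)
```

An SVM searches over `w` and `b` for the separating hyperplane that makes this quantity as large as possible; the points that attain the minimum are the support vectors.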


**But what happens when there is no clear hyperplane?**

This is where it can get tricky. Data is rarely as clean as in our simple example above.  
A dataset will often look more like jumbled balls of two colors, which represent a dataset that is not linearly separable.  
To classify such a dataset, it is necessary to move away from a 2D view of the data to a 3D view.  

Imagine that our two sets of colored balls are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This ‘lifting’ of the balls represents the mapping of data into a higher dimension. This is known as kernelling.

<img src="../pictures/svm_kerneling.png" alt="logistic" style="width: 70%;"/>

Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
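
The "lifting" idea can be illustrated with an explicit mapping from two to three dimensions. In this sketch the points, the mapping $z = x_1^2 + x_2^2$, and the threshold are all assumptions chosen for illustration; kernel SVMs normally work with such mappings implicitly through a kernel function rather than computing the new coordinates.

```python
def lift(point):
    """Map a 2D point (x1, x2) to 3D by adding z = x1^2 + x2^2."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


def classify_lifted(point, threshold=1.0):
    """Separate the lifted points with the plane z = threshold."""
    return 1 if lift(point)[2] > threshold else -1


# Hypothetical data: one class sits near the origin, the other surrounds it,
# so no straight line in 2D separates them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

print([classify_lifted(p) for p in inner])
print([classify_lifted(p) for p in outer])
```

After lifting, the inner points all have a small `z` and the outer points a large one, so a flat plane separates the two classes even though no line did in the original space.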


**SVM Uses**
 
SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis.  
It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification.  
SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.


### SVM Main params

**sklearn.svm.LinearSVC**

**C** : float, default: 1.0  
    Penalty parameter of the error term, which acts as the inverse of regularization strength; must be a positive float.  
    Smaller values specify stronger regularization. C controls the trade-off between a smooth decision boundary and classifying the training points correctly.

**loss** : string, ‘hinge’ or ‘squared_hinge’ (default=’squared_hinge’)  
    Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.

**penalty** : string, ‘l1’ or ‘l2’ (default=’l2’)  
    Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ penalty leads to coef_ vectors that are sparse.

**tol** : float, optional (default=1e-4)  
    Tolerance for stopping criteria.
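
As a rough illustration of how the parameters listed above are passed, the sketch below fits `LinearSVC` with its default values written out explicitly. The toy points and labels are hypothetical, and `random_state=0` is an extra argument added here only to make the run repeatable.

```python
from sklearn.svm import LinearSVC

# Hypothetical, clearly separable toy data with two features.
X = [[-2, -2], [-3, -2], [-2, -3], [2, 2], [3, 2], [2, 3]]
y = [0, 0, 0, 1, 1, 1]

# The four parameters described above, set to their documented defaults.
clf = LinearSVC(C=1.0, loss='squared_hinge', penalty='l2', tol=1e-4, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X).tolist()
print(predictions)
print(clf.coef_, clf.intercept_)  # learned hyperplane: w and b
```

Lowering `C` strengthens the regularization and tends to give a wider margin at the cost of more training errors; raising it does the opposite.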