# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Classification - LR & SVM

###  March 8, 2018

# Remarks
- Your midterm exams will be scored over the break.
- You will have another assignment tomorrow: a proposal for your term project.
- I will not be able to support graduate students this summer, but thanks to those of you who expressed interest.

# LR & SVM
#### (Data and examples for today taken from Bishop's *Pattern Recognition and Machine Learning*.)

### Today we will look in more detail at two related families of classification algorithms. We will need the notions of *precision* and *recall* and the language of *conditional probability*. To review:

### If $A$ and $B$ are *events* (e.g. belonging to a class, buying a computer, etc.), then we denote the *conditional probability of $A$ given $B$* by

## $P(A \mid B)$

### The conditional probability is a relative thing, giving the proportion of $B$ that consists of $A$. Formally:

## $P(A \mid B) \color{red}{\equiv} P(A \cap B) \big / P(B)$

### (This is a definition.)
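
### As a quick check of the definition, here is a minimal sketch on a made-up example (one roll of a fair die; the events `A` and `B` are illustrative and not part of the lecture data):

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {n for n in outcomes if n % 2 == 0}   # event A: the roll is even
B = {n for n in outcomes if n > 3}        # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Definition: P(A | B) = P(A and B) / P(B)
p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)  # 2/3
```

### Of the three outcomes in $B$ (4, 5, 6), two are even, so the proportion of $B$ that consists of $A$ is $2/3$.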

### We saw a very simple example of Naïve Bayes classification. Let's look at that again.

## Naïve Bayes

### In a *naïve Bayes classifier*, the *events* or *conditions* are the observation of the data and the hypothesis that the data belong to a particular class. For that reason, you will often see Bayes' Theorem written like this:

## $ P(H \mid {\bf X}) = \frac{P({\bf X } \mid H) P(H)}{P({\bf X})}$

### where $H$ is the hypothesis that the data $\bf X$ belong to a given class.

### The term on the left is called the *posterior* probability, estimate or distribution, depending on context.

### In terms of formulas, the naïve model assumes that the attributes or features of the data are *independent* of one another once the class is known, so the class-conditional probability factors into a product:

## $ P({\bf X} \mid C_i) = \prod_{k=1}^n P(x_k \mid C_i)$

### (This is a *totally unreasonable assumption*, but very often doesn't cost much at the level of classifier results. Why?)

### Example

<img src="notebook-images/bayes.png" style="width: 600px;"/>

### This is the dataset, $X$. Each row is an observation. Note that in this case, the data is categorical. More advanced methods are needed to update continuous variables, but here we use simple proportions.

### Goal: to classify the datapoint $X=$ (age $\leq 30$, income $= $medium, student $= $yes, credit_rating $= $ fair).

### We are able to calculate directly from aggregate stats the values of $P(x \mid C_i)$ for any given feature $x$ and category $C_i$. Here, we want two classes:

## $C_1: \text{buys_computer="yes"}$
## $C_2: \text{buys_computer="no"}$

### We want to find out which is larger, $P(C_1 \mid X)$ or $P(C_2 \mid X)$.

In [1]:
# Let's store probabilities in dictionaries.
PXC1, PXC2 = dict(), dict()
PXC1['age <= 30'], PXC2['age <= 30'] = 2/9, 3/5
PXC1['income = medium'], PXC2['income = medium'] = 4/9, 2/5
PXC1['student = yes'], PXC2['student = yes'] = 6/9, 1/5
PXC1['credit_rating = fair'], PXC2['credit_rating = fair'] = 6/9, 2/5

In [2]:
# Likelihood P(X | C1): product of the per-feature conditional probabilities
prod = 1
for key in PXC1:
    prod *= PXC1[key]

### The joint probability of $X$ (our data) and category $1$ is...

In [3]:
prod * (9/14)

0.028218694885361547

### For the second category...

In [4]:
# Likelihood P(X | C2): product of the per-feature conditional probabilities
prod = 1
for key in PXC2:
    prod *= PXC2[key]

In [5]:
prod * (5/14)

0.006857142857142858

### Therefore, our $X$ gets classified into $C_1$, "buys computer".
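
### The two numbers above are joint probabilities, not yet posteriors. A minimal sketch of the last step of Bayes' Theorem, assuming these two classes are the only possibilities (so that $P(X)$ is the sum of the two joints); the variable names are illustrative:

```python
# Joint probabilities P(X | C_i) P(C_i), copied from the outputs above.
joint_c1 = 0.028218694885361547   # buys_computer = "yes"
joint_c2 = 0.006857142857142858   # buys_computer = "no"

# With only two classes, P(X) is the sum of the two joint probabilities.
evidence = joint_c1 + joint_c2

# Dividing by P(X) turns each joint probability into a posterior.
posterior_c1 = joint_c1 / evidence
posterior_c2 = joint_c2 / evidence
print(posterior_c1, posterior_c2)
```

### The posterior for $C_1$ comes out at roughly $0.80$. Dividing by $P(X)$ rescales both classes by the same factor, so it never changes which class wins; that is why comparing the joint probabilities was enough.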

# Logistic Regression

### The idea that will carry over from Naïve Bayes is that we are trying to classify by finding the largest conditional probability

## $ P(C_k \mid {\bf X})$

### (There is an important distinction to be made here between *generative* and *discriminative* models. See Bishop for more.)

### As in other examples, such as SVC, we take our model to be a function with knobs on it.

### Remember the *sigmoid* function?

In [7]:
import numpy as np
from bokeh.plotting import figure, output_notebook, show
output_notebook()
sgmd = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-2*np.pi, 2*np.pi, 100)
y = sgmd(x)
p = figure(plot_width=400, plot_height=400)
p.line(x,y)
show(p)

### Note the *domain* and *codomain*. The sigmoid function can take any input and returns a value between $0$ and $1$. This means we can use any discriminant function we like and feed its output into the sigmoid. You can think of the output of the sigmoid as a probability.

### A more readable expression for the *logistic* sigmoid:

## $\sigma(a) = \frac{1}{1 + e^{-a}}$
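
### A short numerical check of the properties that matter here, written as a sketch with NumPy (the helper name `sigmoid` and the test grid are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-10, 10, 201)
s = sigmoid(a)

print(sigmoid(0.0))                      # 0.5: a score of zero means "undecided"
print(np.all((s > 0) & (s < 1)))         # True: outputs stay strictly inside (0, 1)
print(np.allclose(sigmoid(-a), 1 - s))   # True: symmetry, sigma(-a) = 1 - sigma(a)
```

### The symmetry is what lets a single output serve both classes: if $\sigma(a)$ is the probability of one class, then $\sigma(-a)$ is the probability of the other.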

### Given data $\bf X$ and weights $\color{blue}{W}$, we compute the dot product $\color{blue}{W} \cdot \bf X$ and feed the result into the sigmoid function. This is a simple *GLM*, or *Generalized Linear Model*. The model is this:

## $P(C_k \mid {\bf X}) = \sigma(\color{blue}{W} \cdot {\bf X})$

### Remember from last time that the weights $\color{blue}{W}$ are really the optimization variables. In this case, the mathematical optimization can be done via *iterative reweighted least squares* (IRLS). But what on Earth are we optimizing? We are actually *maximizing* the likelihood of the parameters given the data. We find the vector $\color{blue}{W}$ that maximizes this:

## $ P( {\bf t} \mid \color{blue}{W} ) = \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1-t_n}$

### where ${\bf t} = (t_1,\ldots,t_N)$ are the *labels* from the training set (i.e. $t_n \in \{0,1\}$) and $y_n = P(C_1 \mid x_n) = \sigma(\color{blue}{W} \cdot x_n)$ are the probabilities predicted by the model with weights $\color{blue}{W}$.
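
### To make the likelihood concrete, here is a minimal sketch that evaluates it for two candidate weight vectors. It works with the *logarithm* of the product, which turns it into a sum and has the same maximizer. The toy data and the names `W_good` and `W_bad` are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(W, X, t):
    """log P(t | W) = sum_n [ t_n log y_n + (1 - t_n) log(1 - y_n) ]."""
    y = sigmoid(X @ W)   # predicted P(C_1 | x_n) for every row of X
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Hypothetical toy data: four points in the plane, two per class.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
t = np.array([1, 1, 0, 0])

W_good = np.array([1.0, 1.0])    # points toward the class-1 examples
W_bad = np.array([-1.0, -1.0])   # points the wrong way

ll_good = log_likelihood(W_good, X, t)
ll_bad = log_likelihood(W_bad, X, t)
print(ll_good, ll_bad)
```

### The weights that agree with the labels give a log-likelihood much closer to zero (its largest possible value). Fitting the model means searching for the $\color{blue}{W}$ that pushes this number as high as possible.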

### We are left with a model that takes $\bf X$ as input and outputs a value $\sigma(\color{blue}{W} \cdot {\bf X}) \in (0,1)$. In the case of two-class classification, this can be interpreted as the probability that the datapoint $\bf X$ belongs to class $C_1$.

### Let's look at an example.

In [38]:
# For random points
import numpy as np

# Plotting using Bokeh
from bokeh.plotting import figure, show, output_notebook

# Show in notebook.
output_notebook()

# Build a figure
p = figure(title="Bokeh Markers", toolbar_location=None)
p.grid.grid_line_color = None
p.background_fill_color = "#eeeeee"

def mscatter(p, x, y, marker, fill):
    p.scatter(x, y, marker=marker, size=15,
              line_color="navy", fill_color=fill, alpha=0.5)

# Number of datapoints per cluster
N = 100

# First cluster
X1 = np.random.random(N)
Y1 = np.random.random(N)

# Second Cluster
X2 = np.random.random(N) - 1/2
Y2 = np.random.random(N) + 1/2
    
# Throw in clusters
mscatter(p, X1, Y1, "circle", "blue")
mscatter(p, X2, Y2, "square", "orange")

# # A line separating them
# X = np.linspace(-2,2,100)
# Y = 2*X**2
# p.line(X,Y, line_color='red')
show(p)

In [61]:
# Create a logistic regression object
from sklearn import linear_model
logreg = linear_model.LogisticRegression()

# Combine all data
X = np.append(X1,X2)
Y = np.append(Y1,Y2)
features = np.array([X,Y]).transpose()

# Create labels
T = 100*[1] + 100*[0]

# Train the model
logreg.fit(features,T)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
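
### To connect the fitted object back to the formula $\sigma(\color{blue}{W} \cdot {\bf X})$, here is a self-contained sketch that recomputes the probabilities by hand. It regenerates similar data with a fixed seed (an assumption, for reproducibility) and uses the fitted attributes `coef_` and `intercept_`. Note that scikit-learn also fits an intercept term by default, which the formula above leaves out.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# Two overlapping clusters, generated the same way as in the cells above.
X1, Y1 = rng.random(100), rng.random(100)
X2, Y2 = rng.random(100) - 0.5, rng.random(100) + 0.5
features = np.column_stack([np.append(X1, X2), np.append(Y1, Y2)])
T = 100 * [1] + 100 * [0]

model = linear_model.LogisticRegression().fit(features, T)

# Recompute P(C_1 | x) by hand from the learned weights and intercept.
a = features @ model.coef_.ravel() + model.intercept_[0]
by_hand = 1.0 / (1.0 + np.exp(-a))

# Column 1 of predict_proba holds the probability of the class labelled 1.
matches = np.allclose(by_hand, model.predict_proba(features)[:, 1])
print(matches)
```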

### Now we must grab the predicted probabilities. Let's do it on new data.

In [74]:
# First cluster
X1 = np.random.random(100)
Y1 = np.random.random(100)

# Second Cluster
X2 = np.random.random(100) - 1/2
Y2 = np.random.random(100) + 1/2

# Combining data
X = np.append(X1,X2)
Y = np.append(Y1,Y2)
features = np.array([X,Y]).transpose()

# (Leave labels as before)

# Run the model on our new data...

pred = logreg.predict_proba(features)

# This function computes the p-r values for us
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(T,pred[:,1])


## Precision-Recall Curve
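
### As a reminder, *precision* is the fraction of points predicted to be in class 1 that really are, and *recall* is the fraction of true class-1 points that the classifier finds. A minimal sketch of computing one (precision, recall) pair at a single threshold; the helper `precision_recall` and the six data points are made up for illustration:

```python
import numpy as np

def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall of the rule 'predict class 1 when score >= threshold'."""
    labels = np.asarray(labels)
    predicted = np.asarray(scores) >= threshold
    tp = np.sum(predicted & (labels == 1))    # true positives
    fp = np.sum(predicted & (labels == 0))    # false positives
    fn = np.sum(~predicted & (labels == 1))   # false negatives
    # Note: this sketch does not guard against tp + fp == 0 (no positive predictions).
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels and predicted probabilities for six points.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

print(precision_recall(labels, scores, threshold=0.5))  # precision 2/3, recall 2/3
print(precision_recall(labels, scores, threshold=0.2))  # precision 0.6, recall 1.0
```

### Lowering the threshold catches every true positive (recall rises) at the cost of more false alarms (precision falls). A precision-recall curve traces this tradeoff across all thresholds.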

### Now, let's look at a precision-recall curve for our classifier.

In [75]:
p1 = figure(plot_width=400, plot_height=400)
p1.line(recall,precision)
show(p1)

# Support Vector Machines

### Remember the parameter $C$? Let's learn what that's about. This lets us deal with overlapping classes and gives an example of *regularization*. In general, there is a (precisely defined) tradeoff between fitting the training data well and building a model that performs well on other data. Regularization is a way to penalize too much complexity in the model without banishing complexity outright. 

### The parameter $C$ controls how heavily we penalize the *slack variables* that allow for some misclassification on the training set.

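### The optimization problem here, with *linearly separable data*, is to minimize the squared length of the weight vector, subject to every training point lying on the correct side of the decision boundary, on or outside the margin: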

## $\|\color{blue}{W}\|^2$

### If we want to allow for overlapping data, then we can throw in the extra variables $\xi_n$, *one for each data point*(!). They depend on the weights $\color{blue}{W}$; with labels $t_n \in \{-1, +1\}$, $\xi_n = 0$ if $x_n$ lies on or beyond the correct margin boundary (that is, $t_n f_{\color{blue}{W}}(x_n) \geq 1$), and otherwise $\xi_n = |t_n - f_{\color{blue}{W}}(x_n)|$.

<img src="prmlfigs-png/Figure7.3.png" style="width: 600px;"/>

### Then the objective function becomes

## $\sum_{n=1}^N \xi_n + \|\color{blue}{W}\|^2$

### But that could be too great a cost. We want to scale the cost of these slack variables.

## $C\sum_{n=1}^N \xi_n + \|\color{blue}{W}\|^2$

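### In technical terms, $C$ is a regularization parameter. The larger it is, the more heavily margin violations (including misclassifications) are penalized, so the model fits the training set more tightly and is regularized less.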
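
### A minimal sketch of this objective on made-up data. It assumes labels $t_n \in \{-1, +1\}$ and a linear $f_{\color{blue}{W}}(x) = \color{blue}{W} \cdot x + b$. The bias `b`, the helper name, and the four points are illustrative and not from the lecture.

```python
import numpy as np

def soft_margin_objective(W, b, X, t, C):
    """C * sum(xi_n) + ||W||^2, with labels t_n in {-1, +1}."""
    f = X @ W + b
    # Slack is zero for points on or beyond the correct margin boundary,
    # and |t_n - f(x_n)| = 1 - t_n * f(x_n) otherwise.
    xi = np.maximum(0.0, 1.0 - t * f)
    return C * np.sum(xi) + W @ W, xi

# Hypothetical toy data; the last point sits on the wrong side of the boundary.
X = np.array([[2.0, 2.0], [1.0, 1.5], [-2.0, -1.0], [0.5, 0.5]])
t = np.array([1, 1, -1, -1])
W, b = np.array([1.0, 1.0]), 0.0

for C in (0.1, 1.0, 10.0):
    value, xi = soft_margin_objective(W, b, X, t, C)
    print(C, xi, value)
```

### Only the misclassified point has nonzero slack ($\xi = 2$). With small $C$ the $\|\color{blue}{W}\|^2$ term dominates the objective, while with large $C$ the slack term dominates, so the optimizer would work harder to move the boundary and remove the violation.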