<center><img src="img/logo_hse_black.jpg"></center>

<h1><center>Data Analysis</center></h1>
<h2><center>Seminar: SVM</center></h2>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

%matplotlib inline

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)

## SVM Classification

The function `select_model` should take a training set and return a fitted SVM model with the best hyperparameters.

Iterate over the following hyperparameters:
- kernel type (linear, RBF, polynomial with different degrees)
- regularization parameter $C$ ($0.1, 1, 10, 100, 1000, 10000$)

Use 10-fold cross-validation and `GridSearchCV`.
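
As a rough sketch of how such a search could be set up: the toy data from `make_classification`, the polynomial degrees 2 to 4, and the shortened `C` range are assumptions chosen to keep the example fast, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Toy 2-D data standing in for the training set (assumption).
x_demo, y_demo = make_classification(
    n_samples=100, n_features=2, n_redundant=0, random_state=0)

# One dict per kernel family, so `degree` is only varied for the polynomial kernel.
C_values = [0.1, 1, 10]
param_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf'], 'C': C_values},
    {'kernel': ['poly'], 'degree': [2, 3, 4], 'C': C_values},
]

cv = KFold(n_splits=10, shuffle=True, random_state=123)
search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=cv)
search.fit(x_demo, y_demo)

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_  # refitted on the full data by default
```

Here `best_score_` is the mean cross-validated accuracy, which is a more honest estimate than accuracy measured on the data the model was fitted on.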

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

In [None]:
def select_model(x, y):
    """
    Implement some model selection strategy here:
    seek through different kernels and parameters.

    Use a validation scheme to select the best model
    
    Quality metric: accuracy

    Returns:
        SVM classifier implemented by sklearn SVC class.
    """
    best_accuracy = 0
    best_model = None
    
    params = {}
    
    cv = KFold(n_splits=10, shuffle=True, random_state=123)

    model = SVC()
    
    model.fit(x, y)
    
    best_model = model
    yhat = best_model.predict(x)
    best_accuracy = accuracy_score(y, yhat)

    print("Best model %s, with accuracy %f" % (best_model, best_accuracy))
    return best_model

Some helper functions

In [None]:
def plot_data_set(x, y, description=''):
    print("Plotting data set points")
    plt.figure(figsize=(8, 8))

    colors = np.array(['r', 'b'])[y]
    plt.title(description, fontsize='small')
    plt.scatter(x[:, 0], x[:, 1], marker='o', c=colors, s=50)
    
def plot_decision_region(x1_min, x2_min, x1_max, x2_max, clf, n_points=1000):
    print("Plotting decision region")
    x1, x2 = np.meshgrid(np.linspace(x1_min, x1_max, n_points), np.linspace(x2_min, x2_max, n_points))
    z = clf.decision_function(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

    plt.contour(x1, x2, z, levels=[0.0], linestyles='solid', linewidths=2.0)
    plt.contour(x1, x2, z, levels=[-1.0, 1.0], linestyles='dashed', linewidths=1.0)

In [None]:
def generate_linear(size=100, k=1.1, b=0.0, nl=0.1):
    print("Generating 'Linearly-separated' data set")

    x = np.random.random((size, 2))
    y = np.zeros(size, dtype=int)
    noise = np.random.randn(size) * nl
    y[x[:, 1] - (k * x[:, 0] + b) > noise] = 1

    return x, y

x, y = generate_linear()
clf = select_model(x, y)
plot_data_set(x, y)
plot_decision_region(x[:, 0].min(), x[:, 1].min(), x[:, 0].max(), x[:, 1].max(), clf)

In [None]:
def generate_concentric(size=100, r1=1.0, r2=2.0, sigma=0.3):
    print("Generating 'Concentric circles' data set")
    half = size // 2  # integer division: slice bounds and array sizes must be ints
    x = np.zeros((size, 2))
    x[:half, 0] = sigma * np.random.randn(half) + r1
    x[half:, 0] = sigma * np.random.randn(size - half) + r2
    x[:, 1] = (np.random.random(size) - 0.5) * 2 * np.pi
    y = np.hstack([np.zeros(half, dtype=int), np.ones(size - half, dtype=int)])

    z = np.zeros((size, 2))
    z[:, 0] = x[:, 0] * np.cos(x[:, 1])
    z[:, 1] = x[:, 0] * np.sin(x[:, 1])

    return z, y

x, y = generate_concentric()
clf = select_model(x, y)
plot_data_set(x, y)
plot_decision_region(x[:, 0].min(), x[:, 1].min(), x[:, 0].max(), x[:, 1].max(), clf)

In [None]:
def generate_sin(size=200):
    print("Generating 'Sinus-separated' data set")

    x = np.random.random((size, 2))
    x[:, 0] = x[:, 0] * 4 * np.pi
    x[:, 1] = (x[:, 1] - 0.5) * 2
    y = np.zeros(size, dtype=int)
    y[x[:, 1] > np.sin(x[:, 0])] = 1

    return x, y

x, y = generate_sin()
clf = select_model(x, y)
plot_data_set(x, y)
plot_decision_region(x[:, 0].min(), x[:, 1].min(), x[:, 0].max(), x[:, 1].max(), clf)

## SVM Regression

Consider the dataset *titanium.csv*.<br/>

The task is to predict `y` from `x`.

### Data visualization

Normalize the data (only the `x` column) and plot it.

In [None]:
# Your Code Here

### Model learning

Consider 3 kernels:
* Linear
* Polynomial (degree = 3, gamma = 6, coef0 = 1)
* RBF (gamma = 6, coef0 = 1; note that `coef0` has no effect on the RBF kernel)

Set `epsilon=0.01`.

For each kernel:
1. For each `C` in `np.logspace(-2, 2, 10)`, compute and plot the mean absolute error of the model.
2. For the best $C$ of each kernel, plot the initial dataset together with the SVM predictions.

Everything is performed on the training set (no splitting or CV).
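
The sweep over `C` could be organized roughly as follows. This is only a sketch: the synthetic bump-shaped data stands in for the normalized *titanium.csv*, and the `max_iter` cap is an assumption added to keep the example fast; plotting is left out.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

# Synthetic 1-D data standing in for the normalized titanium data (assumption).
rng = np.random.RandomState(0)
x_demo = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y_demo = np.exp(-10 * x_demo.ravel() ** 2) + 0.02 * rng.randn(50)

kernels = {
    'linear': dict(kernel='linear'),
    'poly': dict(kernel='poly', degree=3, gamma=6, coef0=1),
    'rbf': dict(kernel='rbf', gamma=6),  # coef0 has no effect on the RBF kernel
}
C_values = np.logspace(-2, 2, 10)

best_C = {}
for name, kernel_params in kernels.items():
    errors = []
    for C in C_values:
        # max_iter caps the solver iterations (assumption, to bound run time).
        model = SVR(C=C, epsilon=0.01, max_iter=200000, **kernel_params)
        model.fit(x_demo, y_demo)
        errors.append(mean_absolute_error(y_demo, model.predict(x_demo)))
    # The error is measured on the training set, as the task requires.
    best_C[name] = C_values[int(np.argmin(errors))]
    print(name, best_C[name], min(errors))
```

Each `errors` list can then be plotted against `C_values` (a logarithmic x-axis is convenient), and the model refitted with `best_C[name]` for the prediction plot.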

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
## Your Code Here

# Custom kernel

Now we will try to determine the language of a word, using a custom kernel for the task.

We will use two texts: the opening sentences of War and Peace in Spanish and in English. Let's say we don't know what n-grams are and consider the [edit distance](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%B8%D0%B5_%D0%9B%D0%B5%D0%B2%D0%B5%D0%BD%D1%88%D1%82%D0%B5%D0%B9%D0%BD%D0%B0) between strings instead.

In [None]:
def edit_dist(string_1, string_2):
    """
    Calculates the Levenshtein distance between two strings.
    """
    len_1 = len(string_1) + 1
    len_2 = len(string_2) + 1

    d = [0] * (len_1 * len_2)

    for i in range(len_1):
        d[i] = i
    for j in range(len_2):
        d[j * len_1] = j

    for j in range(1, len_2):
        for i in range(1, len_1):
            if string_1[i - 1] == string_2[j - 1]:
                d[i + j * len_1] = d[i - 1 + (j - 1) * len_1]
            else:
                d[i + j * len_1] = min(
                   d[i - 1 + j * len_1] + 1,        # deletion
                   d[i + (j - 1) * len_1] + 1,      # insertion
                   d[i - 1 + (j - 1) * len_1] + 1,  # substitution
                )

    return d[-1]

In [None]:
edit_dist('kitten', 'sitting')

## Load and prepare data
Load *war_and_peace_es.txt* and *war_and_peace_en.txt*.<br/>
Build a single dataframe with one column for the word and one for the class label.

In [None]:
## Your Code Here

## Some data preparations

One issue with custom kernels is that `sklearn.svm.SVC` requires them to accept only numbers.<br/>
In our case these should be the indices of words: for instance, instead of the strings ['treat', 'celebrit', 'prince', ...] the custom kernel should take the indices [9209, 11145, 7735, ...].

Before that:
1. Set `RND_SEED`
2. Shuffle and reindex the dataframe with words (use the `df.sample()` and `df.reset_index()` methods)
3. Limit the dataframe to 1000 words
4. Split it into train and test sets 60/40

In the end, the matrices X_train and X_test should contain the **indices** of words.
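
The preparation steps above might look like the following sketch. The tiny dataframe, its column names `word` and `label`, and the seed value are all hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RND_SEED_DEMO = 0  # hypothetical seed, chosen only for this sketch

# Tiny hypothetical dataframe standing in for the real word list.
df_demo = pd.DataFrame({
    'word': ['prince', 'war', 'peace', 'family', 'the',
             'guerra', 'paz', 'familia', 'el', 'pero'],
    'label': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Shuffle, replace the old index with 0..n-1, and keep at most 1000 rows.
df_demo = df_demo.sample(frac=1, random_state=RND_SEED_DEMO).reset_index(drop=True)
df_demo = df_demo.iloc[:1000]

# The features are the row indices, shaped (n_samples, 1) as sklearn expects.
X = df_demo.index.values.reshape(-1, 1)
y = df_demo['label'].values

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X, y, test_size=0.4, random_state=RND_SEED_DEMO)
```

Because the index is reset after shuffling, a value in `X_train_demo` can be mapped back to its word with `df_demo.loc[i, 'word']`.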

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
RND_SEED = ...

### Implementation

Some guidance [here](http://stackoverflow.com/questions/26962159/how-to-use-a-custom-svm-kernel).

TL;DR:<br/>
A custom kernel should accept two feature matrices, $U$ and $V$ (during training both come from the training set; during prediction one comes from the test set and the other from the training set).

As a result it should return a matrix $G_{ij} = K(U_i, V_j)$.

We should:
1. Implement the function *string_kernel(U, V)*.
2. Visualize the resulting matrix (plt.imshow()).
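
One possible shape for such a kernel is sketched below. The choice $K(u, v) = \exp(-\gamma \, d(u, v)^2)$ with $\gamma = 0.1$ is an assumption (the text does not fix a kernel function), `words_demo` is a hypothetical word list, and `levenshtein` is a compact stand-in for `edit_dist`. A kernel built from edit distance is not guaranteed to be positive semidefinite, so treat this as an experiment rather than a definitive implementation.

```python
import numpy as np

def levenshtein(a, b):
    """Row-by-row Levenshtein distance (stand-in for `edit_dist`)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

# Hypothetical word list; in the notebook this role is played by the dataframe.
words_demo = ['kitten', 'sitting', 'guerra', 'war']

def string_kernel_demo(U, V, gamma=0.1):
    """Return G with G[i, j] = exp(-gamma * d(word(U_i), word(V_j)) ** 2)."""
    G = np.zeros((len(U), len(V)))
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            # Each row holds a single word index (possibly cast to float by sklearn).
            d = levenshtein(words_demo[int(u[0])], words_demo[int(v[0])])
            G[i, j] = np.exp(-gamma * d ** 2)
    return G

idx = np.arange(len(words_demo)).reshape(-1, 1)
G_demo = string_kernel_demo(idx, idx)
```

When both arguments are the same index matrix, the result is symmetric with ones on the diagonal, which is a quick sanity check before passing the function to `SVC(kernel=...)`.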

In [None]:
def string_kernel(U, V):
    # Your Code Here
    pass

G = string_kernel(X_train, X_train)

In [None]:
plt.imshow(G)

## Quality estimation

Evaluate the classification quality for different values of `C`.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Your Code Here