## Support vector machines

**Data** [Gender-annotated dataset of European Parliament talks](https://www.kaggle.com/ellarabi/europarl-annotated-for-speaker-gender-and-age)

**Overarching question** Can we develop a model that correctly predicts speakers' gender based on what they are saying?

## Data management

We connect the variable of interest (gender) to the text each speaker has said.
The metadata is stored as XML, so we need to do a bit of work before we can easily use it.
We also transform the textual data into a feature matrix.

In [None]:
metadata = open('./data/europarl-annotated-for-speaker-gender-and-age/europarl.de-en/europarl.de-en.dat').readlines()
all_texts = open('./data/europarl-annotated-for-speaker-gender-and-age/europarl.de-en/europarl.de-en.en.aligned.tok').readlines()

## check that both files have same number of rows
assert len(metadata) == len(all_texts)

## processing these already takes some time, so let's choose a random set of 10,000 messages right away

import random

random.seed(1)

selected_lines = random.sample( range( len( metadata ) ) , k = 10000 )

print( metadata[0] )

from bs4 import BeautifulSoup

genders = []
selected_texts = []

for line in selected_lines:
    
    ## name the parser explicitly to avoid a BeautifulSoup warning
    md = BeautifulSoup( metadata[ line ], 'html.parser' )
    genders.append( md.line['gender'] )
    
    selected_texts.append( all_texts[ line ] )
    

print( len( genders ) )
print( len( selected_texts ) )

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer()
document_term_matrix = tf_vectorizer.fit_transform( selected_texts )

## Separate the train-test split

This is used later in the analysis to ensure we do not [overfit](https://en.wikipedia.org/wiki/Overfitting) the data when we train the machine learning classifier.
We choose to use 20% of data for testing.

In [None]:
from sklearn.model_selection import train_test_split

label_train, label_test, data_train, data_test = train_test_split( genders, document_term_matrix, test_size = .2 )
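
`train_test_split` returns a train/test pair for each input, in the order the inputs were given, which is why the labels come first above. The following is a minimal sketch on made-up data. The `toy_` names, `random_state = 1` and `stratify` are assumptions added for illustration: `random_state` makes the split repeatable, and `stratify` keeps the label proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

toy_labels = ['female', 'male'] * 5       ## 10 made-up labels
toy_data = [ [i] for i in range( 10 ) ]   ## 10 made-up rows with one feature each

## outputs follow the input order: labels (train, test), then data (train, test)
toy_label_train, toy_label_test, toy_data_train, toy_data_test = train_test_split(
    toy_labels, toy_data, test_size = .2, random_state = 1, stratify = toy_labels
)

print( len( toy_label_train ), len( toy_label_test ) )

## labels and rows stay aligned after the split
print( toy_label_test, toy_data_test )
```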

## Run and evaluate machine learning tasks

We now train the model using the **training data** and measure the accuracy we achieved by examining the **test data**.

In [None]:
from sklearn import svm

model = svm.SVC(kernel='linear') # Linear Kernel, default settings
model.fit( data_train, label_train)

In [None]:
from sklearn import metrics

## check how well we did for testing data
label_test_pred = model.predict( data_test )
print( metrics.accuracy_score( label_test, label_test_pred ) )

In [None]:
# understand predictions: map each feature (word) to its weight in the linear model

predictors = {}

## get_feature_names() was removed from recent scikit-learn versions; use get_feature_names_out()
for i, name in enumerate( tf_vectorizer.get_feature_names_out() ):
    predictors[name] = model.coef_[0, i]


print( predictors )

### Tasks

* Run the code as is and interpret the accuracy. What does that mean?
* Examine different metrics for [classification accuracy](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).
* Fix issues in the text pre-processing: account for stop words and frequent terms, and stem the content in the document-term matrix. Does this have any implications for accuracy?
* `predictors` includes each feature (as a key) and how good the variable was for said problem (as a value). Extract the best predictors from it.
* Count the number of different labels in the dataset of 10,000 comments. What can you observe?
* Modify the code to use a [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) model in addition to the SVM model. Which one seems to work better?

## Advanced magics

* There are many different ways to build models using various supervised machine learning methods.
Each method can be run with different parameters. Choosing them is known as *tuning* the model and can improve the model's performance in terms of accuracy.
* [Grid search](https://scikit-learn.org/stable/modules/grid_search.html) is an approach for trying out different parameters and examining which parameters lead to the best models.
* You can also work on data preprocessing to [scale the data](https://scikit-learn.org/stable/modules/preprocessing.html) or try to clean or remove data more aggressively.

In [None]:
## defining parameters for different models
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

In [None]:
from sklearn.model_selection import GridSearchCV

many_models = GridSearchCV( svm.SVC(), param_grid )
many_models.fit( data_train, label_train )

print( many_models )
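
Printing the `GridSearchCV` object shows only its configuration. After fitting, the attributes `best_params_` and `best_score_` report the winning parameter combination and its mean cross-validated score (accuracy, by default, for a classifier). The following is a minimal sketch on synthetic data. The data from `make_classification`, the `toy_` names and the smaller grid are assumptions chosen to keep the example fast.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

## synthetic two-class data, only for illustration
toy_X, toy_y = make_classification( n_samples = 100, n_features = 5, random_state = 1 )

toy_grid = [
  {'C': [1, 10], 'kernel': ['linear']},
  {'C': [1, 10], 'gamma': [0.01, 0.001], 'kernel': ['rbf']},
]

toy_models = GridSearchCV( svm.SVC(), toy_grid )
toy_models.fit( toy_X, toy_y )

print( toy_models.best_params_ )   ## parameter combination with the best mean score
print( toy_models.best_score_ )    ## mean cross-validated accuracy of that combination
```

The same two attributes are available on `many_models` once its `fit` call has finished.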

* We have used a binary variable (male/female); however, support vector machines can also be used for [multi-category classification](https://scikit-learn.org/stable/modules/svm.html#multi-class-classification) or for continuous variables through [regression models](https://scikit-learn.org/stable/modules/svm.html#regression).

* When doing category classification, the algorithm is sensitive to imbalances between the classes, i.e. if there are more cases belonging to Category 1 than to Category 2.

In [None]:
model = svm.SVC(kernel='linear', class_weight='balanced') # Linear kernel, class weights balanced by label frequency
model.fit( data_train, label_train)
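
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`, so mistakes on the rarer class cost more. The following is a minimal sketch of that formula using `compute_class_weight`. The label counts and the `toy_` names are made up for illustration and do not reflect the real dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

## made-up, imbalanced labels: 8 of one class, 2 of the other
toy_labels = np.array( ['male'] * 8 + ['female'] * 2 )
toy_classes = np.unique( toy_labels )

## 'balanced' gives n_samples / (n_classes * count) for each class
toy_weights = compute_class_weight( class_weight = 'balanced', classes = toy_classes, y = toy_labels )

print( dict( zip( toy_classes, toy_weights ) ) )
```

Here the rare class gets weight 10 / (2 * 2) = 2.5 and the common class gets 10 / (2 * 8) = 0.625.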

### Tasks

* Try different grid search parameters and see if your accuracy metrics improve.
* Does balancing improve accuracy with our data?
* Use the age variable to develop a regression model.