# 01 - Machine Learning
Here we work with our *voting_with_topics.csv* file, which records the vote each person cast on each subject. We want to train a model and see whether, given the text of a new law, it can predict how a person would have voted. We start by exploring the data.

## 1.0 - Imports and Loading of the Data

In [None]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, learning_curve
import sklearn.metrics

%matplotlib inline
%load_ext autoreload
%autoreload 2

# There are a lot of columns in the DataFrame,
# so we raise this option in order to display more of them
pd.options.display.max_columns = 100

We import the fairly heavy DataFrame containing the voting fields and the results. We drop a useless column and create a *Name* field containing both the first and the last name of a person, so that we can then build a model for each unique deputy in the parliament.

In [None]:
path = '../datas/Voting/'
voting_df = pd.read_csv(path+'voting_with_topics.csv')
print('Entries in the DataFrame',voting_df.shape)

# Dropping the useless column (axis must be passed by keyword in recent pandas versions)
voting_df = voting_df.drop('Unnamed: 0', axis=1)

#Putting numerical values into the columns that should have numerical values
print(voting_df.columns.values)

num_cols = ['Decision', ' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance',
           ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']
voting_df[num_cols] = voting_df[num_cols].apply(pd.to_numeric)

#Inserting the full name at the second position
voting_df.insert(2,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])

voting_df.head()

The first reduction of our DataFrame is to remove all the entries whose *Decision* field contains a *4*, a *5*, a *6* or a *7*. These values mean that the person did not take part in the vote, so the entries are not useful for our purpose.

In [None]:
# Keep only the rows whose Decision is not one of the "did not vote" codes
voting_df = voting_df[~voting_df.Decision.isin([4, 5, 6, 7])]
print(voting_df.shape)
print('Top number of entries in the df :\n', voting_df.Name.value_counts())

We now want to slice the *DataFrame* into multiple smaller *DataFrame*s, each containing all the entries for a single person, so that we can apply machine learning to each person separately. The function *split_df* below splits the DataFrame into a dictionary which maps each unique value of a given field to the matching entries.

In [None]:
def split_df(df, field):
    """
        Splits the input df along a given field into a dictionary which maps each unique
        entry of the field to the matching rows of the dataframe
    """
    # First retrieve all the unique entries of the field
    unique_field = df[field].unique()
    print('Number of unique entries in',field,':',len(unique_field))
    # Create a dictionary of DataFrames which stores all the rows relative to a single entry
    df_dict = {elem : df.loc[df[field] == elem] for elem in unique_field}

    return df_dict

voting_dict = split_df(voting_df, 'Name')
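
As a point of comparison, the same per-person split can be sketched with pandas' `groupby`, which visits each group once instead of filtering the whole DataFrame for every unique value. The toy DataFrame and its values below are made up for illustration; only the column names mirror ours.

```python
import pandas as pd

# Toy data (hypothetical values), mirroring the Name / Decision columns
toy_df = pd.DataFrame({
    'Name': ['A One', 'B Two', 'A One', 'C Three', 'B Two'],
    'Decision': [1, 2, 2, 1, 1],
})

# groupby yields (key, sub-DataFrame) pairs, one per unique Name
toy_dict = {name: group for name, group in toy_df.groupby('Name')}

print(sorted(toy_dict.keys()))
print(toy_dict['A One'])
```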

## 1.1 Machine Learning on a single deputee
From now on, we work through an example to see whether a machine learning prediction gives reasonable results. We will work with the data on our former member of the *National Council* and now member of the *Federal Council*, *Guy Parmelin*. Note that the cell below currently selects *Silvia Schenker*; the line for *Guy Parmelin* is commented out. Since passing a law is an iterative process, going from one chamber to the other until the project is accepted, we make the simplifying assumption that a person's vote on a given subject is the **last** vote they cast on it. This considerably reduces the size of the data we are working on, but we still have enough of it.

In [None]:
#df_deputee = voting_dict['Guy Parmelin'].drop_duplicates('text', keep = 'last')
df_deputee = voting_dict['Silvia Schenker'].drop_duplicates('text', keep = 'last')

print(df_deputee.shape)
df_deputee.head()
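
To make the "last vote" assumption concrete, here is a small sketch of what `drop_duplicates(..., keep='last')` does. The texts and decisions are invented for the example, and it assumes the rows are already in chronological order, which is also what the cell above implicitly relies on.

```python
import pandas as pd

# Hypothetical votes: 'law A' is voted on three times, 'law B' twice
toy_votes = pd.DataFrame({
    'text': ['law A', 'law B', 'law A', 'law A', 'law B'],
    'Decision': [1, 2, 2, 1, 1],
})

# For each distinct text, keep only the last row in which it appears
last_votes = toy_votes.drop_duplicates(subset='text', keep='last')
print(last_votes)
```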

### 1.1.1 Preparing the Features
Before moving on to machine learning proper, let us look at the number of votes in each of the 3 possible categories (1: yes, 2: no, 3: abstention).

In [None]:
print(df_deputee.Decision.value_counts())

We see far fewer abstentions than *yes* and *no* votes, which is why we choose to ignore them at first. We then shift the remaining labels from *1*/*2* to *0*/*1* (0: yes, 1: no), the usual convention for a binary classification problem.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when modifying the filtered frame
df_deputee = df_deputee[df_deputee.Decision!=3].copy()
df_deputee['Decision'] = df_deputee['Decision'] - 1
df_deputee.shape

We now format the data, keeping only the relevant columns and separating the features from the labels. The *X* DataFrame contains the topic probabilities we got from the NLP step, *X_text* holds the textual data, which we store for visualising the results later on, and the *Y* vector contains the *Decision* taken by the person, which is what we want to predict. As our prediction algorithm we use the *Random Forest Classifier*, as we did in homework 4 of the course.


In [None]:
pred_field = [' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance',
           ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']

X = df_deputee[pred_field]
X_text = df_deputee[['Name','text']]
Y = df_deputee['Decision']

### 1.1.2 Classification of our data
The classifier we will use is the Random Forest. To evaluate its performance, we use several tools which help us better understand the results we obtain.

1. The [cross_validation module of scikit-learn](http://scikit-learn.org/0.17/modules/generated/sklearn.cross_validation.cross_val_score.html) allows us to test the performance of our classification. The `cross_val_score` function returns the accuracy of the classifier on the test part of each fold, and we report the mean over the folds. The `cv` parameter allows us to choose the number of folds of cross-validation we want to perform (e.g. cv=5 -> 5-fold cross-validation).
2. The [F1 score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) takes the false positives and the false negatives into account. Hence, a model with a high prediction accuracy can score very poorly on the F1 metric if, for instance, it assigns everything to the class that dominates the population (cf. the example "Everybody has cancer" in lecture 07 of the course).
3. [The confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) shows the detail of the classification, and allows us to see the false positives and false negatives. We can compute the F1-score from this matrix.
4. The [feature importances](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) allow us to see which of the features are the most significant for the classification.
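
To illustrate points 2 and 3, the sketch below computes the accuracy and the F1 score by hand from a 2x2 confusion matrix. The counts are invented: they describe a classifier that always predicts the dominant class 0, and class 1 is treated as the positive class (an assumption of this example).

```python
# Rows are the true classes, columns the predicted classes
# (same layout as sklearn.metrics.confusion_matrix).
conf = [[90, 0],   # true 0: 90 predicted as 0, 0 predicted as 1
        [10, 0]]   # true 1: 10 predicted as 0, 0 predicted as 1

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Guard against division by zero when the positive class is never predicted
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) > 0 else 0.0)

print('accuracy:', accuracy)              # 0.9
print('F1 (class 1 as positive):', f1)    # 0.0
```

Note that a micro-averaged F1 score equals the accuracy for single-label problems like this one, so it would not reveal the failure; the F1 score of the minority class does.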


Turning to the code, the first function below, `print_confusion_matrix`, is a helper that displays the confusion matrix more readably than the default printout of a 2D numpy array.

In [None]:
def print_confusion_matrix(conf_matrix):
    """
        Prints in textual way the entries of the confusion matrix conf_matrix and allows us to visualise it in a 
        more elegant way than when displaying it simply with print. It makes the understanding of it more intuitive.
    """
    print('\n\t\t\tDISPLAYING THE CONFUSION MATRIX\n')
    print('\tPredicted : ',end='\t' )
    features = conf_matrix.shape[0]
    for i in range(0,features):
        print(i,end='\t')
    print('TOTAL')
    print('\n Reality : ',end='\t')
    for i in range(0,features):
        print(i,'||',end='\t')
        for j in range(0,features):
            print(conf_matrix[i,j],end='\t')
        print(sum(conf_matrix[i,:]),end='\n\t\t')
    print('\nTOTALS : \t  ||',end='\t')
    for j in range(0,features):
        print(sum(conf_matrix[:,j]),end='\t')
    print(sum(sum(conf_matrix)))

The `cross_validation` function performs *cv_param*-fold cross-validation and outputs the cross-validation score along with the F1-score and the confusion matrix, so that we can understand the shape of our results. By default we perform 20-fold cross-validation, as we want a result that is as stable as possible.

In [None]:
def cross_validation(X,Y,cv_param = 20, max_depth = None): 
    """
        Cross_validation takes as input a dataset X and the labels Y and performs the cv_param-fold cross validation.
        It uses a random forest classifier in order to do so and prints the cross-validation score, the f1-score
        as well as the confusion matrix.
    """
    # 1. Creates the classifier that we will use 
    forest = RandomForestClassifier(max_depth = max_depth)
    # 2. Predicts the output of the classification after cv_param-fold cross-validation
    Y_predicted = cross_val_predict(forest, X, Y, cv=cv_param)
    
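    # 3. Print the results : mean cross-validation accuracy and F1 score
    #    (with average='micro' on single-label data, the F1 score equals the accuracy of Y_predicted)
    print('Cross Validation result :', cross_val_score(forest,X,Y,cv = cv_param).mean(),
        '\nF1 score result :',sklearn.metrics.f1_score(Y, Y_predicted,average='micro'))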
    
    print_confusion_matrix(sklearn.metrics.confusion_matrix(Y,Y_predicted))
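
As a quick sanity check of this workflow, here is a sketch on synthetic data rather than on our votes. The dataset from `make_classification`, the `random_state` values, the 50 trees and the 5 folds are all arbitrary choices made for the example; fixing `random_state` simply makes the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic binary problem with 11 features, like our 11 topic columns
X_toy, Y_toy = make_classification(n_samples=200, n_features=11, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)

# One accuracy per fold, then their mean
scores = cross_val_score(forest, X_toy, Y_toy, cv=5)
# Out-of-fold prediction for every sample
Y_toy_pred = cross_val_predict(forest, X_toy, Y_toy, cv=5)
toy_matrix = confusion_matrix(Y_toy, Y_toy_pred)

print('Mean CV accuracy:', scores.mean())
print(toy_matrix)
```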

The `plot_feature_importances` function ranks each feature of *X* according to the role it plays in the classification of the data we give it. This allows us to see whether a key subset of our features turns out to be markedly better than the rest at determining the outcome of the deputy's vote.

In [None]:
def plot_feature_importances(X, Y, columns,plot_flag = 1, max_depth = None):
    """
        plot_feature_importances computes the importance of each feature in determining the output of our problem. 
        @param plot_flag : a boolean which turns the printing and plotting of the results on or off.
    """
    forest = RandomForestClassifier(max_depth = max_depth)
    forest = forest.fit(X,Y)
    importances = forest.feature_importances_
    std = np.std([tree.feature_importances_ for tree in forest.estimators_],axis=0)
    indices = np.argsort(importances)[::-1]
    # Print the feature ranking
    print("Feature ranking:")

    if(plot_flag):    
        for f in range(X.shape[1]):
            print("%d. feature %d - %s (%f)" % (f + 1, indices[f],  columns[indices[f]], importances[indices[f]]))
            
        # Plot the feature importances of the forest
        plt.figure()
        plt.title("Feature importances")
        plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
        plt.xticks(range(X.shape[1]), indices)
        plt.xlim([-1, X.shape[1]])
        plt.show()
    return indices,importances
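
The ranking step relies on `np.argsort`, which returns the indices that would sort an array in increasing order; reversing them gives a decreasing ranking. A minimal sketch, with made-up importances and three of our topic names:

```python
import numpy as np

names = [' armée', ' budget', ' imposition']
importances = np.array([0.2, 0.5, 0.3])   # hypothetical values

# Indices of the features, from most to least important
indices = np.argsort(importances)[::-1]

for rank, idx in enumerate(indices, start=1):
    print("%d. feature %d - %s (%f)" % (rank, idx, names[idx], importances[idx]))
```

Passing `[names[i] for i in indices]` as the second argument of `plt.xticks` would label the bars with topic names rather than raw indices.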

Learning curves are useful to see whether our model is massively overfitting, and to help tune the parameters of the model. We therefore plot the learning curves here and pick the best model we have for a given deputy.

In [None]:
def plot_learning_curve(estimator,X,Y,title,cv=20):
    """
        Custom plotting routine to plot the training and testing scores against the number of data considered.
        This takes an estimator and a Dataset as inputs, along with their labels, and the k-folds of cross validation
        and simply plots them with their standard deviation added as colour on the plot. 
    """
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, Y, cv=cv, train_sizes=np.linspace(0.2,1,20))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
            label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
            label="Cross-validation score")
    plt.legend(loc="best")
    plt.show()

    
estimator = RandomForestClassifier()    

title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)   

The plot describes the classification into a binary output. The default *RandomForestClassifier* clearly overfits our data, notably because the depth of its trees is unbounded. This is why we iterate over different depths and fix the maximum depth to a value that yields a good result; bounding the depth mitigates overfitting.

We now compute the result of the cross-validation for a maximum depth varying from 1 to 20. Do not dwell on the `plot_fig` function, which only provides a visualisation of the data obtained while iterating over the max_depth of the forest.

In [None]:
cv_score = np.zeros(20)
tr_score = np.zeros(20)

cv_param = 20
for i in range(1,21):
    forest = RandomForestClassifier(max_depth = i)
    cv_score[i-1] = cross_val_score(forest,X,Y,cv = cv_param).mean()
    forest.fit(X,Y)
    tr_score[i-1] = forest.score(X,Y)

def plot_fig(data_1,data_2,title_,xlabel_,ylabel_):
    """
        Custom plotting routine to plot our data the way we want 
        (does not provide anything extremely useful)
    """
    
    fig = plt.figure()
    ax = fig.add_subplot(111)    # The big subplot
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)

    # Turn off axis lines and ticks of the big subplot
    ax.spines['top'].set_color('none')
    ax.spines['bottom'].set_color('none')
    ax.spines['left'].set_color('none')
    ax.spines['right'].set_color('none')
    ax.tick_params(labelcolor='w', top=False, bottom=False, left=False, right=False)

    ax1.plot(np.linspace(1,20,20), data_1, 'o-', color="r",label="CV score")
    ax2.plot(np.linspace(1,20,20), data_2, 'o-', color="g",label="Training score ")
    ax1.legend(loc="best")
    ax2.legend(loc="best")
    ax.set_title(title_)
    ax.set_xlabel(xlabel_)
    ax.set_ylabel(ylabel_)
    plt.show()

plot_fig(cv_score,tr_score,
         "Cross validation score against the depth of the random forest",
         "Max depth of the random forest","Cross validation score")    

In [None]:
estimator = RandomForestClassifier(max_depth = 3)    

title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)    
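
One simple way to turn the depth sweep above into a choice of `max_depth`, sketched here with invented scores, is to take the depth with the highest cross-validation score and look at its gap to the training score. This selection rule is an assumption made for illustration, not the procedure used in this notebook.

```python
import numpy as np

depths = np.arange(1, 6)
cv_toy = np.array([0.70, 0.74, 0.76, 0.75, 0.73])   # hypothetical CV scores
tr_toy = np.array([0.72, 0.78, 0.84, 0.91, 0.97])   # hypothetical training scores

best = np.argmax(cv_toy)            # position of the best CV score
best_depth = depths[best]
gap = tr_toy[best] - cv_toy[best]   # a large gap suggests overfitting

print('Best depth:', best_depth)
print('Train/CV gap at that depth: %.2f' % gap)
```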

### 1.1.3 Results
Having found a depth at which our forest does not overfit, we want to focus on understanding the results we get. We will see whether, given the features we have, our algorithm is able to classify correctly.

In [None]:
cross_validation(X, Y, cv_param=20, max_depth=7)
features_2 = plot_feature_importances(X,Y,pred_field, max_depth=7)

**TODO:** What could be improved:
1. Do not simply take the last iteration of the law: perhaps take the intermediate votes into account
2. Average by party