This notebook performs the same task as DistanceComputtion.ipynb but for the topics, i.e. it computes the distance matrix between voting subjects based on the topic modelling results.

In [None]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, learning_curve
import sklearn.metrics

%matplotlib inline
%load_ext autoreload
%autoreload 2

# The DataFrame has many columns,
# so we raise the display limit to see more of them
pd.options.display.max_columns = 100

In [None]:
path = '../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics.csv')
print('Entries in the DataFrame',voting_df.shape)

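# Drop the useless 'Unnamed: 0' column
# (the keyword form is used because recent pandas versions reject a positional axis argument)
voting_df = voting_df.drop(columns='Unnamed: 0')

# Convert the columns that should hold numerical values to a numeric dtype
#print(voting_df.columns.values)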

num_cols = ['Decision', ' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance',
           ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']
voting_df[num_cols] = voting_df[num_cols].apply(pd.to_numeric)

# Insert the full name as the third column (position index 2)
voting_df.insert(2,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])

voting_df.head(3)

We first drop the duplicate rows so that each voting subject appears once, and keep only its topic modelling results.

In [None]:
voting_df_copy = voting_df.drop_duplicates(['BillTitle'], keep = 'last')
voting_subjects = voting_df_copy['BillTitle'].unique()
topics = [' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance', ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']
print("{n} subjects were voted on in the parliament from 2009 to 2015".format(n=voting_subjects.shape[0]))
voting_df_copy = voting_df_copy.set_index(['BillTitle'])
voting_df_copy = voting_df_copy[topics]
voting_df_copy.head()
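
To illustrate the deduplication step, here is a minimal sketch on a toy DataFrame. The bill titles, names and topic values are invented for the example; only the column names `BillTitle`, ` budget` and ` armée` come from the notebook. It assumes, as the cell above does, that every row of a given bill carries the same topic values, so keeping the last row loses nothing.

```python
import pandas as pd

# Hypothetical data: bill 'A' was voted on by two people, bill 'B' by one.
toy = pd.DataFrame({
    'BillTitle': ['A', 'A', 'B'],
    'Name': ['x', 'y', 'x'],
    ' budget': [0.7, 0.7, 0.2],
    ' armée': [0.3, 0.3, 0.8],
})

# One row per bill, indexed by title, restricted to the topic columns.
per_bill = (toy.drop_duplicates(['BillTitle'], keep='last')
               .set_index('BillTitle')[[' budget', ' armée']])
```

The result has one row per bill title, which is what the distance computation later relies on when it looks rows up by title.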

We then implement the distance function, which is simply the Euclidean distance between two subjects' topic vectors. Each entry of such a vector is the percentage assigned to one topic by the topic modelling.

In [None]:
def distance(p1, p2):
    return np.linalg.norm(p1-p2)
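
As a quick check of this definition, the sketch below applies the same function to two invented three-topic vectors. The values are hypothetical and chosen only so that the result can be verified by hand: the vectors differ by 0.5 in two entries, so the distance is sqrt(0.5² + 0.5²) = sqrt(0.5).

```python
import numpy as np

def distance(p1, p2):
    # Euclidean (L2) norm of the difference vector
    return np.linalg.norm(p1 - p2)

# Hypothetical topic vectors for two subjects
p1 = np.array([0.5, 0.5, 0.0])
p2 = np.array([0.5, 0.0, 0.5])

d_different = distance(p1, p2)   # sqrt(0.5)
d_same = distance(p1, p1)        # 0.0
```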

We then apply it to every pair of subjects in order to compute the distance matrix.

In [None]:
n = voting_subjects.shape[0]
distanceMatrix = np.zeros((n,n))

for i in range(n):
    # Progress message every 10 subjects
    if i % 10 == 0:
        print("Computing distances from subject " + str(i))
    for j in range(n):
        distanceMatrix[i, j] = distance(voting_df_copy.loc[voting_subjects[i]].values,
                                        voting_df_copy.loc[voting_subjects[j]].values)
print("Mean distance : {d}".format(d = np.mean(distanceMatrix)))
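
The double loop above performs n² row lookups, which can be slow for many subjects. One possible alternative, shown here only as a sketch and not as the notebook's method, is to compute all pairwise distances at once with NumPy broadcasting. The helper name `pairwise_euclidean` and the toy array are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    # diff[i, j, :] = X[i] - X[j], shape (n, n, k)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical topic weights: 3 subjects, 2 topics
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 0.0]])
toy_dist = pairwise_euclidean(toy)
```

In the notebook this would presumably be called as `pairwise_euclidean(voting_df_copy.loc[voting_subjects].values)`, assuming the `BillTitle` index is unique after `drop_duplicates`. The intermediate array holds n × n × k floats, so this trades memory for speed.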

We save the matrix. As expected, its diagonal contains only zeros, since the distance between a subject and itself is 0.

In [None]:
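# Label rows and columns with the subject titles before saving
# (pandas is already imported at the top of the notebook)
df = pd.DataFrame(distanceMatrix, index = voting_subjects, columns = voting_subjects)
df.to_csv("distanceMatrixSubjects.csv")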
df.head()
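
The properties mentioned above (zero diagonal, and symmetry, which follows from the definition of the Euclidean distance) can also be checked programmatically after reloading the saved file. The sketch below uses a tiny invented matrix and an in-memory buffer instead of `distanceMatrixSubjects.csv`, so the labels and values are hypothetical.

```python
import io
import numpy as np
import pandas as pd

labels = ['subject A', 'subject B']
toy = pd.DataFrame([[0.0, 0.5],
                    [0.5, 0.0]], index=labels, columns=labels)

# Round-trip through CSV; index_col=0 restores the subject titles as the index
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

is_symmetric = np.allclose(reloaded.values, reloaded.values.T)
has_zero_diagonal = np.allclose(np.diag(reloaded.values), 0.0)
```

Passing `index_col=0` when reading the file back avoids ending up with an extra `Unnamed: 0` column like the one dropped at the start of this notebook.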

Finally, for each subject we determine its dominant topic, i.e. the topic with the highest value in its topic vector.

In [None]:
topic_df = pd.DataFrame(index = voting_subjects)
topic_df['Topic'] = voting_df_copy[topics].idxmax(axis=1)
topic_df.head()
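
To make the behaviour of `idxmax(axis=1)` concrete, here is a small sketch on invented values. The bill labels and numbers are hypothetical; the column names are three of the notebook's topics and keep their leading space.

```python
import pandas as pd

# Hypothetical topic weights for two bills
toy = pd.DataFrame(
    {' armée': [0.1, 0.6], ' budget': [0.7, 0.3], ' dunno': [0.2, 0.1]},
    index=['bill 1', 'bill 2'],
)

# For each row, the label of the column holding the largest value
dominant = toy.idxmax(axis=1)

# Optional clean-up: remove the surrounding spaces from the topic labels
dominant_clean = dominant.str.strip()
```

When two topics tie for the maximum, `idxmax` returns the first one in column order, so the mapping depends on the order of the `topics` list in that case.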

In [None]:
topic_df.to_csv("SubjectTopicMapping.csv")