### Predicting the Mayor of London 2016 results using ward level demographics

In this notebook several classification algorithms (Logistic Regression, Support Vector Machine, K Nearest Neighbor (KNN) and Decision Tree) are applied and compared. The dataset includes as **outcome** the results of the **Mayor of London election in 2016** and as features (predictors/independent variables) some selected socio-demographic predictors aggregated at ward level.  
The demographic dataset was previously used in other notebooks:

**__[Predicting the median house price in London Wards](https://github.com/sebastianBIanalytics/Data_Science_Machine_Learning_Python/blob/master/Predicting%20median%20House%20Price%20London%20-%20Multiple%20Regression.ipynb)__**

**__[Where in London to open a new Luxury Wine Bar](https://github.com/sebastianBIanalytics/Data_Science_Machine_Learning_Python/blob/master/WINE%20BAR%20in%20London%20Final.ipynb)__**

The original source of the predictors can be accessed at the link below:  
**Ward Profiles and Atlas of Greater London Authority (GLA)** provided by **__[DataStore London](https://data.london.gov.uk/dataset/ward-profiles-and-atlas)__**. Although the provided details reflect the London profile only up to 2015, this is the most comprehensive publicly available dataset, covering data from the 2011 Census, the ONS and other government sources.

The election results come from the **__[London Elections Results 2016, Wards, Boroughs, Constituency](https://data.london.gov.uk/download/london-elections-results-2016-wards-boroughs-constituency/01f4ff3a-c562-4d61-977f-c2dfb36694ce/gla-elections-votes-all-2016.xlsx)__**. However, only the votes cast at ward level were included, while the postal votes (reported at Borough level) were excluded.



The necessary packages were imported. 

In [1]:
import os
import sys
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import seaborn as sns

from sklearn import preprocessing
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

### Model Evaluation using Test set
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

%matplotlib inline



In [2]:
# notice: installing seaborn might take a few minutes
!conda install -c anaconda seaborn -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



## Importing the dataset

In [3]:
## setting directory 
os.chdir("C://@@ Default Folder Python Notebooks/Data")

## Importing the dataset 
Election = pd.read_csv('Election_Demographics.csv', encoding='ANSI')
Election.head()

Unnamed: 0.1,Unnamed: 0,ID,Borough,Ward,Geog level,Constituency,New code,Turnout,Ward Level Electorate,% Turnout,...,% with no qualifications - 2011,% with Level 4 qualifications and above - 2011,A-Level Average Point Score Per Student - 2013/14,A-Level Average Point Score Per Entry; 2013/14,Crime rate - 2014/15,Violence against the person rate - 2014/15,% area that is open space - 2014,Cars per household - 2011,Average Public Transport Accessibility score - 2014,Turnout at Mayoral election - 2012
0,0,1,Bexley,Barnehurst,Ward,Bexley & Bromley,E05000064,2758,6886,40%,...,23.1,20.1,757.587952,214.443374,46.461219,15.471698,35.978052,1.254193,3.135916,35.06704
1,1,2,Bexley,Belvedere,Ward,Bexley & Bromley,E05000065,2675,7506,36%,...,23.7,21.9,694.377778,209.123457,61.963541,19.758065,33.133207,1.013248,2.752564,31.933791
2,2,3,Bexley,Blackfen And Lamorbey,Ward,Bexley & Bromley,E05000066,3011,6974,43%,...,22.4,19.3,750.33,212.2325,28.756957,6.915888,9.484078,1.349928,2.051587,35.887557
3,3,4,Bexley,Blendon And Penhill,Ward,Bexley & Bromley,E05000067,3050,6993,44%,...,21.5,19.6,725.517045,207.929545,37.669377,7.636364,13.770616,1.441948,2.065738,38.663117
4,4,5,Bexley,Brampton,Ward,Bexley & Bromley,E05000068,3311,6902,48%,...,22.9,20.7,688.423809,208.227619,26.340457,6.320755,9.101077,1.326364,2.665179,41.213064


In [4]:
Election['Winner'].value_counts()

Sadiq Aman Khan    379
Zac Goldsmith      242
Name: Winner, dtype: int64

The winner in 379 wards was Sadiq Aman Khan, who was elected Mayor of London. The table below shows, per Borough, the percentage of wards won by each candidate, starting with the Boroughs where he won every ward.  

In [5]:
Vote_Borough_prop = pd.crosstab(Election['Borough'], Election['Winner'], 
                           margins=True, normalize='index').sort_values('Sadiq Aman Khan', 
                           ascending=False).round(4)*100
Vote_Borough_prop

| Borough            | Sadiq Aman Khan (%) | Zac Goldsmith (%) |
|--------------------|---------------------|-------------------|
| Barking & Dagenham | 100.0               | 0.0               |
| Islington          | 100.0               | 0.0               |
| Tower Hamlets      | 100.0               | 0.0               |
| Southwark          | 100.0               | 0.0               |
| Newham             | 100.0               | 0.0               |
| Hackney            | 100.0               | 0.0               |
| Lewisham           | 100.0               | 0.0               |
| Haringey           | 100.0               | 0.0               |
| Lambeth            | 100.0               | 0.0               |
| Brent              | 95.24               | 4.76              |
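
To make the `normalize='index'` option used above concrete, here is a minimal sketch on a made-up five-ward table. The borough labels and winners are illustrative assumptions, not values from the election data. Each row of counts is divided by its row total, so the shares within a borough add up to 100.

```python
import pandas as pd

# Hypothetical toy data: one row per ward, not the real election results
toy = pd.DataFrame({
    "Borough": ["A", "A", "A", "B", "B"],
    "Winner": ["Khan", "Khan", "Goldsmith", "Goldsmith", "Goldsmith"],
})

# normalize='index' turns the counts in each row into row proportions
share = pd.crosstab(toy["Borough"], toy["Winner"], normalize="index") * 100
share = share.round(2)
print(share)
```

A candidate who wins no ward in a borough still gets a column entry of 0, which is why the real table shows 0.0 for Zac Goldsmith in several Boroughs.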


In [6]:
Election = Election.drop(columns=['Unnamed: 0', 'Geog level'])

In [7]:
Election.dropna(inplace=True)

In [8]:
Election.shape

(600, 75)

Two redundant columns were dropped and the wards containing missing values (NA) were removed. The final dataset contains 600 wards and 75 variables.  

In [9]:
FeatureS = Election[['Working-age (16-64) - 2015', 'Older people aged 65+ - 2015', 'Median Age - 2013', 
                     'Population density (persons per sq km) - 2013',
       '% BAME - 2011', '% English is First Language of no one in household - 2011',
       'Number of jobs in area - 2013',
       'Median House Price (Ã‚Â£) - 2014', 'Number of properties sold - 2014',
       'Median Household income estimate (2012/13)',
       '% Households Social Rented - 2011',
       '% Households Private Rented - 2011',
       '% dwellings in council tax bands A or B - 2015',
       'Claimant Rate of Housing Benefit (2015)',
       'Rate of JobSeekers Allowance (JSA) Claimants - 2015',
       '% dependent children (0-18) in out-of-work households - 2014',
       'A-Level Average Point Score Per Student - 2013/14',
       'Crime rate - 2014/15', 'Violence against the person rate - 2014/15',
       'Average Public Transport Accessibility score - 2014']] 

In [10]:
FeatureS.shape

(600, 20)

However, only 20 variables (**listed above**) were kept for the analysis. Most of them are stored as integers or floats.

In [11]:
FeatureS.dtypes

Working-age (16-64) - 2015                                        int64
Older people aged 65+ - 2015                                      int64
Median Age - 2013                                                 int64
Population density (persons per sq km) - 2013                   float64
% BAME - 2011                                                   float64
% English is First Language of no one in household - 2011       float64
Number of jobs in area - 2013                                   float64
Median House Price (Ã‚Â£) - 2014                                float64
Number of properties sold - 2014                                  int64
Median Household income estimate (2012/13)                        int64
% Households Social Rented - 2011                               float64
% Households Private Rented - 2011                              float64
% dwellings in council tax bands A or B - 2015                  float64
Claimant Rate of Housing Benefit (2015)                         

### Feature selection

Let's define the feature set, X:

In [12]:
X = FeatureS
X[0:3]

Unnamed: 0,Working-age (16-64) - 2015,Older people aged 65+ - 2015,Median Age - 2013,Population density (persons per sq km) - 2013,% BAME - 2011,% English is First Language of no one in household - 2011,Number of jobs in area - 2013,Median House Price (Ã‚Â£) - 2014,Number of properties sold - 2014,Median Household income estimate (2012/13),% Households Social Rented - 2011,% Households Private Rented - 2011,% dwellings in council tax bands A or B - 2015,Claimant Rate of Housing Benefit (2015),Rate of JobSeekers Allowance (JSA) Claimants - 2015,% dependent children (0-18) in out-of-work households - 2014,A-Level Average Point Score Per Student - 2013/14,Crime rate - 2014/15,Violence against the person rate - 2014/15,Average Public Transport Accessibility score - 2014
0,6600,2050,41,3672.4,9.8,1.3,2200.0,250000.0,228,38200,10.7,8.3,6.741573,5.843453,1.448995,11.020408,757.587952,46.461219,15.471698,3.135916
1,7950,1600,35,3828.1,28.4,5.1,3800.0,179500.0,246,33510,16.1,18.6,23.938224,12.0,2.425614,21.571429,694.377778,61.963541,19.758065,2.752564
2,6700,2050,42,6352.9,7.3,1.2,1100.0,280000.0,182,40780,3.8,8.1,3.044496,2.941527,1.12507,7.2,750.33,28.756957,6.915888,2.051587


In [13]:
y = Election['Winner'].values
y [0:3]

array(['Zac Goldsmith', 'Zac Goldsmith', 'Zac Goldsmith'], dtype=object)

## Normalize Data 

Data standardization gives each feature zero mean and unit variance (technically, the scaler should be fitted after the train/test split, on the training set only).

In [14]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.32290918,  0.88382937,  1.65937172, -0.92516434, -1.5365264 ,
        -1.52236517, -0.32306534, -0.84747715,  0.45642946, -0.07642246,
        -0.87140391, -1.6227474 , -0.73662558, -1.09072393, -0.88124148,
        -0.66398467,  1.2740769 , -0.60514416, -0.53856825, -0.4170586 ],
       [-0.72108024,  0.03095614,  0.1428096 , -0.89304708, -0.56058207,
        -1.01332495, -0.20128314, -1.23227656,  0.66952063, -0.79693974,
        -0.49855664, -0.55953069,  0.54652698,  0.01326089, -0.11122896,
         0.78144715,  0.06402702, -0.31771915, -0.17940434, -0.70198893],
       [-1.27832926,  0.88382937,  1.91213207, -0.37224003, -1.66770172,
        -1.53576097, -0.40679061, -0.68373272, -0.08813688,  0.31993886,
        -1.34781986, -1.64339238, -1.01248821, -1.611094  , -1.13663924,
        -1.1873596 ,  1.13513592, -0.93339484, -1.25547811, -1.22299769],
       [-1.18916942,  1.07335676,  1.65937172, -0.59237837, -1.55226744,
        -1.46878199, -0.37634506, -0.71102346,  

The features were named **X** and the target **y**. The Train and Test datasets were created as 80% and 20% of the dataset, respectively.

# Train/Test dataset

We split our dataset into train and test sets:


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2, random_state=4)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (480, 20) (480,)
Test set: (120, 20) (120,)
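
The standardization note earlier says the scaling should technically happen after the split. A minimal sketch of that order is shown here, assuming synthetic data with the same shape as the ward features. The names `X_demo`, `X_tr` and `X_te` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ward features (assumption: 600 rows, 20 columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(600, 20))
y_demo = rng.integers(0, 2, size=600)

# Split first, so the test rows play no part in estimating the mean and variance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)     # the test rows reuse the training statistics

print(X_tr_std.shape, X_te_std.shape)
```

With this order the training columns have exactly zero mean, while the test columns are only approximately centred, which is expected.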


# Logistic Regression

The first classification method applied is Logistic Regression. With a weighted F1 score of 0.909, a Jaccard index (accuracy) of 0.908 and a Log Loss of 0.295, Logistic Regression is a good option for classifying which candidate wins each ward based on the selected demographic variables.


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

yhat_LR = LR.predict(X_test)         
#yhat_LR

yhat_prob = LR.predict_proba(X_test)
#yhat_prob


In [17]:
### f1_score from sklearn library
from sklearn.metrics import f1_score
f1_score(y_test, yhat_LR, average='weighted') 

0.908967681127068

In [18]:
 ### Jaccard index for accuracy:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat_LR)



0.9083333333333333

In [19]:
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob) 

0.29477524101532304
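
As a reminder of what the log loss measures, the sketch below computes it by hand for a two-class problem: the average negative logarithm of the probability the model assigned to the class that actually occurred. The four labels and probabilities are made up for illustration, and the helper `binary_log_loss` is a hypothetical name, not part of scikit-learn. It does no clipping, so probabilities of exactly 0 or 1 are assumed not to occur.

```python
import math

def binary_log_loss(y_true, p_positive):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, p_positive):
        # probability the model gave to the class that actually occurred
        p_true = p if label == 1 else 1.0 - p
        total += -math.log(p_true)
    return total / len(y_true)

# Hypothetical predictions for four wards (1 = the positive class wins)
y_true_demo = [1, 1, 0, 0]
p_demo = [0.9, 0.6, 0.2, 0.4]

demo_loss = binary_log_loss(y_true_demo, p_demo)
print(round(demo_loss, 4))
```

Confident and correct predictions contribute little to the total, while confident and wrong ones are penalised heavily, so a lower value is better.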

In [20]:
from sklearn.metrics import classification_report, confusion_matrix
# store the result under a new name so the imported function is not overwritten
cm = confusion_matrix(y_test, yhat_LR)
cm

print(classification_report(y_test,yhat_LR))

                 precision    recall  f1-score   support

Sadiq Aman Khan       0.95      0.91      0.93        77
  Zac Goldsmith       0.85      0.91      0.88        43

       accuracy                           0.91       120
      macro avg       0.90      0.91      0.90       120
   weighted avg       0.91      0.91      0.91       120
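
The figures in the report can be traced back to simple counts. The sketch below uses counts inferred from the support, precision and recall values above (an assumption, since the confusion matrix itself is not printed) and recomputes the accuracy, the weighted F1 score and the per-class Jaccard index in plain Python.

```python
# Counts inferred from the classification report above (an assumption):
# tp = wards of this class predicted correctly, fn = missed, fp = wrongly assigned
counts = {
    "Sadiq Aman Khan": {"tp": 70, "fn": 7, "fp": 4},
    "Zac Goldsmith": {"tp": 39, "fn": 4, "fp": 7},
}
n_test = 120

accuracy = sum(c["tp"] for c in counts.values()) / n_test

f1 = {}
jaccard = {}
support = {}
for name, c in counts.items():
    f1[name] = 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"])
    jaccard[name] = c["tp"] / (c["tp"] + c["fp"] + c["fn"])
    support[name] = c["tp"] + c["fn"]

# weighted F1 = per-class F1 averaged with the class supports as weights
weighted_f1 = sum(f1[k] * support[k] for k in counts) / n_test

print("accuracy   :", round(accuracy, 4))
print("weighted F1:", round(weighted_f1, 4))
print("Jaccard    :", {k: round(v, 3) for k, v in jaccard.items()})
```

The accuracy reproduces the 0.908 printed by `jaccard_similarity_score`: for targets that are not multilabel, that function returned plain accuracy, and later scikit-learn releases replaced it with `jaccard_score`. The per-class Jaccard index computed here is a different, stricter quantity.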



# K Nearest Neighbor(KNN)

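The second classification method applied is K Nearest Neighbors (KNN), a classifier rather than a clustering method, fitted with k = 2, 4 and 6 neighbors. The k = 4 model is carried forward because its training and test accuracies are the closest to each other (0.921 and 0.917), although k = 6 reaches a higher test accuracy (0.933).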

In [21]:
from sklearn.neighbors import KNeighborsClassifier

#Train Model and Predict 

k = 2
neigh2 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh2
yhat2 = neigh2.predict(X_test)  
yhat2[0:5]

# Accuracy evaluation

from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh2.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat2))

Train set Accuracy:  0.93125
Test set Accuracy:  0.8833333333333333


In [22]:
#Train Model and Predict 

k = 4
neigh4 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh4
yhat4 = neigh4.predict(X_test)  
yhat4[0:5]

# Accuracy evaluation

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh4.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat4))

Train set Accuracy:  0.9208333333333333
Test set Accuracy:  0.9166666666666666


In [23]:
#Train Model and Predict 

k = 6
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh6
yhat6 = neigh6.predict(X_test)  
yhat6[0:5]

# Accuracy evaluation

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh6.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat6))

Train set Accuracy:  0.925
Test set Accuracy:  0.9333333333333333
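
The three cells above repeat the same steps for each value of k. A compact sketch of the same comparison as a loop is given below, assuming synthetic two-class data from `make_classification` in place of the ward features. The variable names are hypothetical and the printed accuracies will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the ward features (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

results = {}
for k in (2, 4, 6):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (
        accuracy_score(y_tr, model.predict(X_tr)),  # train accuracy
        accuracy_score(y_te, model.predict(X_te)),  # test accuracy
    )

for k, (train_acc, test_acc) in results.items():
    print(f"k={k}: train={train_acc:.3f}  test={test_acc:.3f}")
```

Keeping both accuracies per k in one dictionary makes it easy to compare the gap between train and test as well as the test accuracy itself.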


In [24]:
### I choose KNN 4 as its Train and Test accuracies are the closest to each other (KNN 6 has the higher Test accuracy).
### f1_score from sklearn library
from sklearn.metrics import f1_score
f1_score(y_test, yhat4, average='weighted') 


0.9144507755618866

In [25]:
### Jaccard index for accuracy:
from sklearn.metrics import jaccard_similarity_score

jaccard_similarity_score(y_test, yhat4)



0.9166666666666666

# Decision Tree

The third classification method applied is the **Decision Tree**, with a maximum depth of 10.  


In [26]:
## I tried different maximum depths and 10 is the best one

Vote_Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 10)
Vote_Tree # it shows the default parameters

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [27]:
Vote_Tree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [28]:
predTreeTrain = Vote_Tree.predict(X_train) 

In [29]:
print("Decision Tree's Training Accuracy: ", metrics.accuracy_score(y_train, predTreeTrain)) 

Decision Tree's Training Accuracy:  0.9979166666666667


In [30]:
predTree = Vote_Tree.predict(X_test) 

In [31]:
print("Decision Tree's Test Accuracy: ", metrics.accuracy_score(y_test, predTree)) 

Decision Tree's Test Accuracy:  0.9333333333333333


In [32]:
### Jaccard index for accuracy:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, predTree)



0.9333333333333333

In [33]:
f1_score(y_test, predTree, average='weighted') 

0.9325511989297107
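
The comment in the Decision Tree cell says several maximum depths were tried, but the search itself is not shown. One possible way to run such a search is sketched below, assuming synthetic data and 5-fold cross-validation. Both are my assumptions and not necessarily the procedure used in this notebook. Setting `random_state` on the tree keeps the result reproducible between runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(n_samples=480, n_features=20, random_state=4)

cv_scores = {}
for depth in (2, 4, 6, 8, 10):
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    # mean accuracy over 5 folds, computed on the training data only
    cv_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best depth:", best_depth)
```

Choosing the depth by cross-validation on the training set keeps the test set untouched for the final comparison between methods.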

# Support Vector Machine

Finally, an SVM is applied to classify the winner of the Mayoral election in London, first with an RBF kernel and then with a linear kernel. 

In [34]:
from sklearn import svm
clf1 = svm.SVC(kernel='rbf')
clf1.fit(X_train, y_train) 



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [35]:
yhatSVM = clf1.predict(X_test)
yhatSVM [0:5]

array(['Sadiq Aman Khan', 'Zac Goldsmith', 'Zac Goldsmith',
       'Sadiq Aman Khan', 'Sadiq Aman Khan'], dtype=object)

In [36]:
### f1_score from sklearn library
from sklearn.metrics import f1_score
f1_score(y_test, yhatSVM, average='weighted') 

0.9248007590132826

In [37]:
### Jaccard index for accuracy:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhatSVM)



0.925

In [38]:
# SVM with a linear kernel, for comparison with the RBF kernel above
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, y_train) 
yhatSVM2 = clf2.predict(X_test)
print("Avg F1-score for SVM: %.4f" % f1_score(y_test, yhatSVM2, average='weighted'))
print("Jaccard score for SVM: %.4f" % jaccard_similarity_score(y_test, yhatSVM2))
print("SVM's Test Accuracy: ", metrics.accuracy_score(y_test, yhatSVM2))

Avg F1-score for SVM: 0.8907
Jaccard score for SVM: 0.8917
SVM's Test Accuracy:  0.8916666666666667




# Report
The metrics from different classification methods were included in the table below to compare the methods and decide which is the best classification method.


| Algorithm          | Jaccard | F1-score | Train Acc | Test Acc  |
|--------------------|---------|----------|-----------|-----------|
| KNN-4              | 0.917   | 0.914    | 0.921     |  0.917    |
| Decision Tree (10) | 0.933   | 0.933    | 0.998     |  0.933    |
| SVM (RBF)          | 0.925   | 0.925    |           |           |
| SVM (linear)       | 0.892   | 0.891    |           |  0.892    |
| Logistic Regression| 0.908   | 0.909    | 0.879     |  0.908    |

The scores listed above are printed in the cells below. Based on them, the Decision Tree appears to be the best-performing classification method for the selected list of predictors. 

In [39]:
print("F1 score for Logistic Regression: ", f1_score(y_test, yhat_LR, average='weighted')) 
print("Jaccard Similarity Score for Logistic Regression: ", jaccard_similarity_score(y_test, yhat_LR))
print("Train set Accuracy for Logistic Regression: ", metrics.accuracy_score(y_train, LR.predict(X_train)))
print("Test set Accuracy for Logistic Regression: ", metrics.accuracy_score(y_test, yhat_LR))

F1 score for Logistic Regression:  0.908967681127068
Jaccard Similarity Score for Logistic Regression:  0.9083333333333333
Train set Accuracy for Logistic Regression:  0.8791666666666667
Test set Accuracy for Logistic Regression:  0.9083333333333333




In [40]:
print("Train set Accuracy for KNN 4 is: ", metrics.accuracy_score(y_train, neigh4.predict(X_train)))
print("Test set Accuracy for KNN 4 is: ", metrics.accuracy_score(y_test, yhat4))
print("F1 score for KNN 4 is: ", f1_score(y_test, yhat4, average='weighted') )
print("Jaccard Similarity Score for KNN 4 is: ", jaccard_similarity_score(y_test, yhat4))

Train set Accuracy for KNN 4 is:  0.9208333333333333
Test set Accuracy for KNN 4 is:  0.9166666666666666
F1 score for KNN 4 is:  0.9144507755618866
Jaccard Similarity Score for KNN 4 is:  0.9166666666666666




In [41]:
print("Decision Tree's Training Accuracy: ", metrics.accuracy_score(y_train, predTreeTrain)) 
print("Decision Tree's Test Accuracy: ", metrics.accuracy_score(y_test, predTree)) 
print("Jaccard Similarity Score for Decision Tree: ", jaccard_similarity_score(y_test, predTree))
print("F1 score for Decision Tree: ", f1_score(y_test, predTree, average='weighted'))

Decision Tree's Training Accuracy:  0.9979166666666667
Decision Tree's Test Accuracy:  0.9333333333333333
Jaccard Similarity Score for Decision Tree:  0.9333333333333333
F1 score for Decision Tree:  0.9325511989297107




This notebook is a first draft showing how machine learning could be used to build a model that predicts the results of the London 2016 election from my selection of variables (20 out of ~72). The selection did not include any statistical check of the relationships between variables (multicollinearity), nor an assessment of the statistical significance of any variable. Once that is done, the variables included in the model will be different. However, this draft will allow the model from the second draft to be compared with this one, which includes the following features as predictors: 

* 'Working-age (16-64) - 2015', 
* 'Older people aged 65+ - 2015', 
* 'Median Age - 2013', 
* 'Population density (persons per sq km) - 2013', 
* '% BAME - 2011', 
* '% English is First Language of no one in household - 2011', 
* 'Number of jobs in area - 2013', 
* 'Median House Price (Ã‚Â£) - 2014',
* 'Number of properties sold - 2014', 
* 'Median Household income estimate (2012/13)', 
* '% Households Social Rented - 2011', 
* '% Households Private Rented - 2011', 
* '% dwellings in council tax bands A or B - 2015', 
* 'Claimant Rate of Housing Benefit (2015)', 
* 'Rate of JobSeekers Allowance (JSA) Claimants - 2015', 
* '% dependent children (0-18) in out-of-work households - 2014', 
* 'A-Level Average Point Score Per Student - 2013/14', 
* 'Crime rate - 2014/15', 
* 'Violence against the person rate - 2014/15', 
* 'Average Public Transport Accessibility score - 2014'.         
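
The multicollinearity check mentioned above could start from the pairwise correlations between predictors. The sketch below flags highly correlated pairs on synthetic data, where one column is deliberately built as a near copy of another. The column names and the 0.8 threshold are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np

# Synthetic features (assumption): column 2 is built to be nearly a copy of column 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
X_demo[:, 2] = X_demo[:, 0] + rng.normal(scale=0.1, size=600)
names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

# pairwise Pearson correlations between the columns
corr = np.corrcoef(X_demo, rowvar=False)

threshold = 0.8  # illustrative cut-off, not a rule from the notebook
flagged = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(flagged)
```

Pairs flagged this way are candidates for dropping or combining before refitting the models. Pairwise correlation does not catch every form of multicollinearity, so it is only a first screen.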

If you read this notebook, please email any comments or suggestions to sebastian@bianalytics.org