### Predicting the Mayor of London 2016 results using ward level demographics

In this notebook several classification algorithms are applied: Logistic Regression, Support Vector Machine (SVM), K Nearest Neighbor (KNN) and Decision Tree. The dataset includes as **outcome** the results of the **Mayor of London election in 2016** and as features (predictors/independent variables) selected socio-demographic variables aggregated at ward level.  
The demographic dataset was previously used in other notebooks: 

**__[Predicting the median house price in London Wards](https://github.com/sebastianBIanalytics/Data_Science_Machine_Learning_Python/blob/master/Predicting%20median%20House%20Price%20London%20-%20Multiple%20Regression.ipynb)__**

**__[Where in London to open a new Luxury Wine Bar](https://github.com/sebastianBIanalytics/Data_Science_Machine_Learning_Python/blob/master/WINE%20BAR%20in%20London%20Final.ipynb)__**

The original source of the predictors can be accessed at the link below:  
**Ward Profiles and Atlas of Greater London Authority (GLA)** provided by **__[DataStore London](https://data.london.gov.uk/dataset/ward-profiles-and-atlas)__**. Although the details reflect London's profile only up to 2015, this is the most comprehensive publicly available dataset, covering data from the 2011 Census, the ONS and other governmental sources. 

The election results were taken from **__[London Elections Results 2016, Wards, Boroughs, Constituency](https://data.london.gov.uk/download/london-elections-results-2016-wards-boroughs-constituency/01f4ff3a-c562-4d61-977f-c2dfb36694ce/gla-elections-votes-all-2016.xlsx)__**. However, only the votes recorded at ward level were included, while the postal votes (given at Borough level) were excluded. 



The necessary packages were imported. 

In [37]:
import sys
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker

from sklearn import preprocessing
from sklearn import metrics

from sklearn.tree import DecisionTreeClassifier

### Model Evaluation using Test set 
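# Note: jaccard_similarity_score was removed in scikit-learn 0.23, so this notebook needs an older version
from sklearn.metrics import jaccard_similarity_score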
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

## setting the working directory
import os


In [35]:
# notice: installing seaborn might take a few minutes
!conda install -c anaconda seaborn -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



## Importing the dataset

In [3]:
## setting directory 
os.chdir("C://@@ Default Folder Python Notebooks/Data")

## Importing the dataset 
Election = pd.read_csv('Election_London 2016 Cleaned dataset.csv', encoding='ANSI')
Election.head()

| | ID | Borough | Ward | Constituency | Turnout | Ward Level Electorate | Winner | Sadiq Aman Khan - Labour Party | % Children aged 0-15 | % Working-age 16-64 | ... | % dependent children in out-of-work households 2014 | % Households No adults Employment dependent children 2011 | % Lone parents no employment 2011 | Average GCSE capped point scores 2014 | % No qualifications 2011 | % Level 4 and above qualifications 2011 | Crime rate 2014_15 | % area that is open space - 2014 | Average Public Transport Accessibility score 2014 | Turnout at Mayoral election 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bexley | Barnehurst | Bexley & Bromley | 2758 | 6886 | Zac Goldsmith | 588 | 18.668678 | 62.049783 | ... | 11.020408 | 3.330971 | 38.013699 | 326.298246 | 23.1 | 20.1 | 46.461219 | 35.978052 | 3.135916 | 35.06704 |
| 1 | 2 | Bexley | Belvedere | Bexley & Bromley | 2675 | 7506 | Zac Goldsmith | 957 | 23.174859 | 63.908139 | ... | 21.571429 | 6.147795 | 42.576029 | 306.139264 | 23.7 | 21.9 | 61.963541 | 33.133207 | 2.752564 | 31.933791 |
| 2 | 3 | Bexley | Blackfen And Lamorbey | Bexley & Bromley | 3011 | 6974 | Zac Goldsmith | 613 | 18.347339 | 62.595705 | ... | 7.2 | 1.996152 | 26.720648 | 332.838519 | 22.4 | 19.3 | 28.756957 | 9.484078 | 2.051587 | 35.887557 |
| 3 | 4 | Bexley | Blendon And Penhill | Bexley & Bromley | 3050 | 6993 | Zac Goldsmith | 546 | 17.973648 | 62.571558 | ... | 5.686275 | 2.199101 | 28.358209 | 340.145185 | 21.5 | 19.6 | 37.669377 | 13.770616 | 2.065738 | 38.663117 |
| 4 | 5 | Bexley | Brampton | Bexley & Bromley | 3311 | 6902 | Zac Goldsmith | 707 | 16.559789 | 60.109393 | ... | 5.555556 | 1.946647 | 34.090909 | 325.361682 | 22.9 | 20.7 | 26.340457 | 9.101077 | 2.665179 | 41.213064 |


In [4]:
Election['Winner'].value_counts()

Sadiq Aman Khan    379
Zac Goldsmith      242
Name: Winner, dtype: int64

Sadiq Aman Khan, who was elected Mayor of London, won 379 wards. The table below shows in which Boroughs of London he won all the wards and where he lost some.  

In [5]:
Vote_Borough_prop = pd.crosstab(Election['Borough'], Election['Winner'], 
                           margins=True, normalize='index').sort_values('Sadiq Aman Khan', 
                           ascending=False).round(4)*100
Vote_Borough_prop

| Borough            | Sadiq Aman Khan | Zac Goldsmith |
|--------------------|-----------------|---------------|
| Barking & Dagenham | 100.0           | 0.0           |
| Islington          | 100.0           | 0.0           |
| Tower Hamlets      | 100.0           | 0.0           |
| Southwark          | 100.0           | 0.0           |
| Newham             | 100.0           | 0.0           |
| Hackney            | 100.0           | 0.0           |
| Lewisham           | 100.0           | 0.0           |
| Haringey           | 100.0           | 0.0           |
| Lambeth            | 100.0           | 0.0           |
| Brent              | 95.24           | 4.76          |
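
As a minimal sketch of what `normalize='index'` does in the crosstab above, the toy example below computes row-wise percentages so that each borough's row sums to 100. The boroughs `A` and `B` and the short winner labels are hypothetical and are not taken from the election data.

```python
import pandas as pd

# Hypothetical toy data, not the real election results
toy = pd.DataFrame({
    "Borough": ["A", "A", "B", "B", "B"],
    "Winner": ["Khan", "Khan", "Khan", "Goldsmith", "Goldsmith"],
})

# normalize="index" divides every row by its row total,
# so each value is the share of wards won within that borough
share = (pd.crosstab(toy["Borough"], toy["Winner"], normalize="index") * 100).round(2)
print(share)
```

Rounding after multiplying by 100 (rather than before, as in the cell above) avoids small floating-point artefacts in the displayed percentages.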


In [6]:
Election.dropna(inplace=True)

In [7]:
Election.columns

Index(['ID', 'Borough', 'Ward', 'Constituency', 'Turnout',
       'Ward Level Electorate', 'Winner', 'Sadiq Aman Khan - Labour Party',
       '% Children aged 0-15', '% Working-age 16-64',
       '% Older people aged 65+', 'Median Age 2013', 'Population density 2013',
       '% BAME 2011', '% Not Born in UK 2011', 'General Fertility Rate 2013',
       'Male life expectancy 2009_13', 'Female life expectancy 2009-13 ',
       '% children in reception obese 2011_14',
       '% children year 6 obese- 2011_14',
       'Rate Ambulance Incidents per 1,000 population - 2014',
       'In employment (16-64) 2011', 'Number of jobs in area - 2013',
       'Rate new migrant workers - 2011/12', 'No properties sold 2014',
       'Median Household income 2012/13', '% semi-detached houses 2011',
       '% Households Private Rented 2011', '% dwellings CT bands A or B 2015',
       '% dwellings CT bands F, G or H - 2015',
       'Rate Claimant Housing Benefit 2015', 'Rate JobSeekers Allowance 2015',
    

Variables not needed for the analysis were dropped and wards containing missing values (NA) were excluded. The final dataset contains 600 wards and 75 variables.  

In [8]:
Features = Election[['% Working-age 16-64',
       '% Older people aged 65+', 'Median Age 2013', 
       'Population density 2013',
       '% BAME 2011', '% Not Born in UK 2011', 
       'Median Household income 2012/13', 
       '% dwellings CT bands F, G or H - 2015',
       'Rate Claimant Housing Benefit 2015', 'Rate JobSeekers Allowance 2015',
       '% dependent children in out-of-work households 2014',
       '% No qualifications 2011',
       '% Level 4 and above qualifications 2011', 
        'Crime rate 2014_15',
       'Average Public Transport Accessibility score 2014']] 


In [9]:
Features.shape

(600, 15)

However, only 15 variables (**listed above**) were kept for analysis. All of them are stored as integers or floats.

In [10]:
Features.dtypes

% Working-age 16-64                                    float64
% Older people aged 65+                                float64
Median Age 2013                                          int64
Population density 2013                                float64
% BAME 2011                                            float64
% Not Born in UK 2011                                  float64
Median Household income 2012/13                          int64
% dwellings CT bands F, G or H - 2015                  float64
Rate Claimant Housing Benefit 2015                     float64
Rate JobSeekers Allowance 2015                         float64
% dependent children in out-of-work households 2014    float64
% No qualifications 2011                               float64
% Level 4 and above qualifications 2011                float64
Crime rate 2014_15                                     float64
Average Public Transport Accessibility score 2014      float64
dtype: object

### Feature selection

Let's define the feature set, X:

In [11]:
X = Features
X[0:5]

| | % Working-age 16-64 | % Older people aged 65+ | Median Age 2013 | Population density 2013 | % BAME 2011 | % Not Born in UK 2011 | Median Household income 2012/13 | % dwellings CT bands F, G or H - 2015 | Rate Claimant Housing Benefit 2015 | Rate JobSeekers Allowance 2015 | % dependent children in out-of-work households 2014 | % No qualifications 2011 | % Level 4 and above qualifications 2011 | Crime rate 2014_15 | Average Public Transport Accessibility score 2014 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62.049783 | 19.281539 | 41 | 3672.4 | 9.8 | 8.0 | 38200 | 4.494382 | 5.843453 | 1.448995 | 11.020408 | 23.1 | 20.1 | 46.461219 | 3.135916 |
| 1 | 63.908139 | 12.917002 | 35 | 3828.1 | 28.4 | 20.4 | 33510 | 1.351351 | 12.0 | 2.425614 | 21.571429 | 23.7 | 21.9 | 61.963541 | 2.752564 |
| 2 | 62.595705 | 19.056956 | 42 | 6352.9 | 7.3 | 7.1 | 40780 | 5.854801 | 2.941527 | 1.12507 | 7.2 | 22.4 | 19.3 | 28.756957 | 2.051587 |
| 3 | 62.571558 | 19.454793 | 41 | 5285.7 | 9.5 | 7.8 | 40340 | 13.501144 | 2.982988 | 0.825537 | 5.686275 | 21.5 | 19.6 | 37.669377 | 2.065738 |
| 4 | 60.109393 | 23.330819 | 45 | 5350.0 | 13.1 | 10.4 | 40790 | 6.888361 | 2.855137 | 0.90307 | 5.555556 | 22.9 | 20.7 | 26.340457 | 2.665179 |


In [12]:
y = Election['Winner'].values
y [0:5]

array(['Zac Goldsmith', 'Zac Goldsmith', 'Zac Goldsmith', 'Zac Goldsmith',
       'Zac Goldsmith'], dtype=object)

## Normalize Data 

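Standardization gives each feature zero mean and unit variance. Strictly, the scaler should be fitted on the training set only, after the train/test split, so that no information from the test set leaks into the preprocessing.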

In [13]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.22907358,  1.76105635,  1.65937172, -0.92516434, -1.5365264 ,
        -2.01581355, -0.07642246, -0.61420056, -1.09072393, -0.88124148,
        -0.66398467,  0.8963358 , -1.35505669, -0.60514416, -0.4170586 ],
       [-0.86328008,  0.2726697 ,  0.1428096 , -0.89304708, -0.56058207,
        -1.10971499, -0.79693974, -0.81948879,  0.01326089, -0.11122896,
         0.78144715,  0.99946497, -1.21100531, -0.31771915, -0.70198893],
       [-1.12161585,  1.70853629,  1.91213207, -0.37224003, -1.66770172,
        -2.08157877,  0.31993886, -0.52534429, -1.611094  , -1.13663924,
        -1.1873596 ,  0.77601842, -1.41907953, -0.93339484, -1.22299769],
       [-1.1263688 ,  1.801573  ,  1.65937172, -0.59237837, -1.55226744,
        -2.03042805,  0.25234236, -0.02592051, -1.60365928, -1.37280573,
        -1.39473167,  0.62132465, -1.39507097, -0.7681517 , -1.21248014],
       [-1.61101462,  2.70800577,  2.67041313, -0.57911479, -1.36337499,
        -1.84043964,  0.32147515, -0.45783692, 

The features were named **X** and the target **y**. The train and test datasets were created as 80% and 20% of the dataset, respectively. 

# Train/Test dataset

Okay, we split our dataset into train and test set:


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (480, 15) (480,)
Test set: (120, 15) (120,)
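
The caveat about standardizing after the split can be sketched as follows. This is a minimal illustration on synthetic data with the same shape as the notebook's feature matrix, not a rerun of the analysis; the names `X_demo`, `y_demo`, `X_tr` and `X_te` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 600 x 15 feature matrix (assumption, not the ward data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(600, 15))
y_demo = rng.choice(["Sadiq Aman Khan", "Zac Goldsmith"], size=600)

# Split first, so the test rows play no part in estimating the mean and variance
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which is the intended behaviour.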


# Logistic Regression

The first classification method applied is Logistic Regression. With an F1 score of 0.867, a Jaccard index of 0.867 and a log loss of 0.510 on the test set, Logistic Regression is a reasonable option for classifying which candidate wins each ward based on the selected demographic variables.


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_similarity_score


## Fit an L1-penalised Logistic Regression on the training data
## (smaller C means stronger regularisation, so C=0.01 is a strong penalty)

LR = LogisticRegression(penalty='l1', C=0.01, solver='liblinear').fit(X_train, y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
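
Because the model above uses an L1 penalty with a small `C`, some coefficients may be driven to exactly zero. The sketch below shows one way to check this; it uses synthetic data from `make_classification` as an assumption, so the counts it prints say nothing about the ward features themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 600 rows, 15 features, binary target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Count the non-zero coefficients for a strong and a weak L1 penalty
nonzero_by_C = {}
for C in (0.01, 1.0):
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_demo, y_demo)
    nonzero_by_C[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C}: {nonzero_by_C[C]} of {model.coef_.size} coefficients are non-zero")
```

Running the same check on `LR.coef_` in this notebook would show how many of the 15 demographic predictors the fitted model actually uses.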

In [17]:
## Predicting outcome for training and test dataset 

pred_train_Y = LR.predict(X_train)

pred_test_Y = LR.predict(X_test)

pred_test_Y_prob = LR.predict_proba(X_test)

In [23]:
LR_train_acc = round(accuracy_score(y_train, pred_train_Y),4)
LR_test_acc = round(accuracy_score(y_test, pred_test_Y), 4)

In [24]:
### f1_score from sklearn library

from sklearn.metrics import f1_score
F1_LR = round (f1_score(y_test, pred_test_Y, average='weighted'), 4) 

In [25]:
### Jaccard index for accuracy:
from sklearn.metrics import jaccard_similarity_score
Jaccard_LR = round (jaccard_similarity_score(y_test, pred_test_Y), 4)



In [26]:
from sklearn.metrics import log_loss
log_loss(y_test, pred_test_Y_prob)

0.5095592953551875
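
To make the log loss value easier to interpret, the sketch below computes it by hand as the mean negative log of the probability assigned to the true class, and compares the result with `sklearn.metrics.log_loss`. The four predictions are made-up numbers for illustration; the column order is assumed to follow the alphabetically sorted class labels, as `predict_proba` does via `classes_`.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = np.array(["Sadiq Aman Khan", "Zac Goldsmith"])  # sorted class labels

# Made-up true winners and predicted probabilities (columns follow `labels`)
y_true = np.array(["Sadiq Aman Khan", "Zac Goldsmith", "Sadiq Aman Khan", "Zac Goldsmith"])
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.7, 0.3],  # a confident wrong prediction, which is penalised heavily
])

# Pick, for each row, the probability given to the class that actually won
true_col = np.searchsorted(labels, y_true)
p_true = proba[np.arange(len(y_true)), true_col]

manual = -np.mean(np.log(p_true))
from_sklearn = log_loss(y_true, proba, labels=list(labels))
print(manual, from_sklearn)
```

A model can therefore have good accuracy and still a fairly high log loss if its predicted probabilities stay close to 0.5.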

In [27]:
from sklearn.metrics import classification_report, confusion_matrix
# Store the result under a new name so the confusion_matrix function is not shadowed
conf_mat = confusion_matrix(y_test,pred_test_Y)  

print(classification_report(y_test,pred_test_Y))

                 precision    recall  f1-score   support

Sadiq Aman Khan       0.93      0.82      0.87        65
  Zac Goldsmith       0.81      0.93      0.86        55

       accuracy                           0.87       120
      macro avg       0.87      0.87      0.87       120
   weighted avg       0.87      0.87      0.87       120



In [28]:
print (conf_mat)

[[53 12]
 [ 4 51]]



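Rows are the actual winners and columns the predicted winners, in alphabetical label order. Taking Sadiq Aman Khan as the positive class:

 True positive is 53.

 True negative is 51.

 False positive is 4.

 False negative is 12.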
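
The Jaccard value reported in this notebook comes from `jaccard_similarity_score`, which for a two-class target like this one was equivalent to plain accuracy. The sketch below rebuilds label vectors from the confusion matrix counts above (53 and 12 in the Khan row, 4 and 51 in the Goldsmith row) and compares accuracy with the set-based Jaccard index from `jaccard_score`, the function available in current scikit-learn. The reconstructed vectors are an illustration, not the original `y_test` ordering.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

khan, gold = "Sadiq Aman Khan", "Zac Goldsmith"

# Rebuild labels that reproduce the confusion matrix [[53, 12], [4, 51]]
y_true = np.array([khan] * 65 + [gold] * 55)
y_pred = np.array([khan] * 53 + [gold] * 12 + [khan] * 4 + [gold] * 51)

cm = confusion_matrix(y_true, y_pred, labels=[khan, gold])
acc = accuracy_score(y_true, y_pred)                      # (53 + 51) / 120
jac_khan = jaccard_score(y_true, y_pred, pos_label=khan)  # 53 / (53 + 12 + 4)
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print(cm)
print(round(acc, 4), round(jac_khan, 4), round(f1_weighted, 4))
```

The accuracy and weighted F1 agree with the values printed for Logistic Regression, while the true Jaccard index for the Khan class is lower, so the "Jaccard" column in this notebook is best read as accuracy.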


# K Nearest Neighbor(KNN)

The second classification method applied is K Nearest Neighbors (KNN), which is a classifier rather than a clustering method, with k = 3, 4 and 6 neighbors. Based on the accuracy for the training and test datasets, k = 3 is preferable. 

In [32]:
from sklearn.neighbors import KNeighborsClassifier


In [38]:
#Train Model and Predict 

k = 3
neigh3 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh3
yhat3 = neigh3.predict(X_test)  
yhat3[0:5]

# Accuracy evaluation

print ("Train set Accuracy KNN3: ", metrics.accuracy_score(y_train, neigh3.predict(X_train)))
print("Test set Accuracy KNN3: ", metrics.accuracy_score(y_test, yhat3))

Train set Accuracy KNN3:  0.9479166666666666
Test set Accuracy KNN3:  0.9083333333333333


In [39]:
k = 4
neigh4 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh4
yhat4 = neigh4.predict(X_test)  
yhat4[0:5]

# Accuracy evaluation

print("Train set Accuracy KNN4: ", metrics.accuracy_score(y_train, neigh4.predict(X_train)))
print("Test set Accuracy KNN4: ", metrics.accuracy_score(y_test, yhat4))

Train set Accuracy KNN4:  0.9333333333333333
Test set Accuracy KNN4:  0.9


In [40]:
k = 6
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh6
yhat6 = neigh6.predict(X_test)  
yhat6[0:5]

# Accuracy evaluation

print("Train set Accuracy KNN6: ", metrics.accuracy_score(y_train, neigh6.predict(X_train)))
print("Test set Accuracy KNN6: ", metrics.accuracy_score(y_test, yhat6))

Train set Accuracy KNN6:  0.925
Test set Accuracy KNN6:  0.9
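
The three cells above pick k by looking at test accuracy. An alternative, sketched below under the assumption of synthetic stand-in data, is to choose k by cross-validation on the training set and touch the test set only once at the end. The variable names are hypothetical and the scores printed are not those of the ward data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 600 rows, 15 features, binary target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Mean 5-fold cross-validated accuracy on the training set for each candidate k
cv_acc = {}
for k in (3, 4, 6):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_acc[k] = cross_val_score(knn, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Refit the best k on the full training set and evaluate once on the test set
best_k = max(cv_acc, key=cv_acc.get)
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_acc = final_knn.score(X_te, y_te)

print(cv_acc, best_k, round(test_acc, 4))
```

Selecting k this way keeps the reported test accuracy an unbiased estimate, since the test rows did not influence the choice.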


In [41]:
### I choose KNN 3 as its accuracy on both the Train and Test data is the highest. 
### f1_score from sklearn library

KNN3_F1 = round (f1_score(y_test, yhat3, average='weighted'),4) 

KNN3_jaccard = round (jaccard_similarity_score(y_test, yhat3), 4)

KNN3_train_acc = round (metrics.accuracy_score(y_train, neigh3.predict(X_train)), 4)

KNN3_test_acc = round (metrics.accuracy_score(y_test, yhat3), 4)



# Decision Tree

The third classification method applied will be the **Decision Tree**, at depth level 10.  


In [43]:
## I tried different maximum depths and 10 is the best one

Vote_Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 10)

Vote_Tree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [44]:
predTreeTrain = Vote_Tree.predict(X_train) 

In [45]:
predTree = Vote_Tree.predict(X_test) 

In [46]:
DT_train_acc = round ( metrics.accuracy_score(y_train, predTreeTrain), 4) 
DT_test_acc = round ( metrics.accuracy_score(y_test, predTree), 4) 
DT_jaccard = round (jaccard_similarity_score(y_test, predTree), 4) 
DT_F1 = round (f1_score(y_test, predTree, average='weighted'), 4) 
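
The depth search mentioned in the comment above is not shown in the notebook. A minimal sketch of how it might be done is given below, on synthetic stand-in data and with a fixed `random_state` so that repeated runs give the same tree; the candidate depths are an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 rows, 15 features, binary target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)

# Compare training accuracy with 5-fold cross-validated accuracy per depth
depth_scores = {}
for depth in (2, 4, 6, 8, 10):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    cv_mean = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    depth_scores[depth] = (train_acc, cv_mean)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_mean:.3f}")

best_depth = max(depth_scores, key=lambda d: depth_scores[d][1])
```

A large gap between training and cross-validated accuracy at higher depths is a sign of overfitting, which is worth checking given the very high training accuracy of the depth-10 tree here.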



# Support Vector Machine

Finally, an SVM is applied to classify the winner of the Mayoral election in each London ward. 

In [48]:
from sklearn import svm

clf1 = svm.SVC(kernel='rbf')
clf1.fit(X_train, y_train) 



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [49]:
yhatSVM = clf1.predict(X_test)

In [50]:
# Refit with a linear kernel; the SVM scores reported below come from this model, not the RBF fit above
clf1 = svm.SVC(kernel='linear')
clf1.fit(X_train, y_train) 
yhatSVM = clf1.predict(X_test)

SVM_F1 = round (f1_score(y_test, yhatSVM, average='weighted'), 4) 

SVM_Jaccard = round (jaccard_similarity_score(y_test, yhatSVM), 4) 
SVM_test_acc = round(metrics.accuracy_score(y_test, yhatSVM), 4)





In [51]:
print("1. F1 score for Logistic Regression: ", F1_LR) 
print("2. Jaccard Similarity Score for Logistic Regression: ", Jaccard_LR)
print("3. Train set Accuracy for Logistic Regression: ", LR_train_acc)
print("4. Test set Accuracy for Logistic Regression: ", LR_test_acc)

#####

print("1. F1 score for Decision Tree: ", DT_F1)
print("2. Jaccard Similarity Score for Decision Tree: ", DT_jaccard)
print("3. Decision Tree's Training Accuracy: ", DT_train_acc) 
print("4. Decision Tree's Test Accuracy: ", DT_test_acc) 

#####

print("1. F1 score for SVM: ", SVM_F1)
print("2. Jaccard Similarity Score for SVM: ", SVM_Jaccard)
print("4. Test set Accuracy for SVM: ", SVM_test_acc)


print("1. F1 score for KNN 3 is: ", KNN3_F1)
print("2. Jaccard Similarity Score for KNN 3 is: ", KNN3_jaccard)
print("3. Train set Accuracy for KNN 3 is: ", KNN3_train_acc)
print("4. Test set Accuracy for KNN 3 is: ", KNN3_test_acc)

1. F1 score for Logistic Regression:  0.8668
2. Jaccard Similarity Score for Logistic Regression:  0.8667
3. Train set Accuracy for Logistic Regression:  0.8688
4. Test set Accuracy for Logistic Regression:  0.8667
1. F1 score for Decision Tree:  0.8996
2. Jaccard Similarity Score for Decision Tree:  0.9
3. Decision Tree's Training Accuracy:  1.0
4. Decision Tree's Test Accuracy:  0.9
1. F1 score for SVM:  0.8829
2. Jaccard Similarity Score for SVM:  0.8833
4. Test set Accuracy for SVM:  0.8833
1. F1 score for KNN 3 is:  0.9083
2. Jaccard Similarity Score for KNN 3 is:  0.9083
3. Train set Accuracy for KNN 3 is:  0.9479
4. Test set Accuracy for KNN 3 is:  0.9083


# Report
The metrics from different classification methods were included in the table below to compare the methods and decide which is the best classification method.


| Algorithm          | Jaccard | F1-score | Train Acc | Test Acc  |
|--------------------|---------|----------|-----------|-----------| 
| KNN-3              | 0.908   | 0.908    | 0.948     |  0.908    |
| Decision Tree (10) | 0.900   | 0.900    | 1.000     |  0.900    |
| SVM                | 0.883   | 0.883    |           |  0.883    |
| Logistic Regression| 0.867   | 0.867    | 0.869     |  0.867    |

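These scores were printed in the cell above. Based on them, KNN with k = 3 is the best-performing classification method for the selected predictors, with the highest Jaccard, F1 and test accuracy (0.908 each). The Decision Tree has near-perfect training accuracy but a lower test accuracy, which suggests some overfitting. 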
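
The comparison in the table can also be done programmatically. The sketch below collects the values printed earlier in the notebook into a DataFrame and picks the method with the highest test accuracy; the column and index labels are chosen for the example, and the missing SVM training accuracy is left as NaN.

```python
import pandas as pd

# Scores copied from the printed output of the notebook
scores = pd.DataFrame(
    {
        "Jaccard":   [0.9083, 0.9000, 0.8833, 0.8667],
        "F1-score":  [0.9083, 0.8996, 0.8829, 0.8668],
        "Train Acc": [0.9479, 1.0000, float("nan"), 0.8688],
        "Test Acc":  [0.9083, 0.9000, 0.8833, 0.8667],
    },
    index=["KNN-3", "Decision Tree (10)", "SVM", "Logistic Regression"],
)

# Rank by test accuracy and report the best method
ranked = scores.sort_values("Test Acc", ascending=False)
best = ranked.index[0]
print(ranked.round(3))
print("Best by test accuracy:", best)
```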

This notebook is the 2nd draft of my attempt to show how machine learning can potentially be used to build a model for predicting the results of the 
London 2016 election based on my selection of variables (14 out of ~72). 
This selection takes into account a statistical check of the relationships between variables 
(multicollinearity) but did not assess the statistical significance of any of the variables.  
However, this will allow me to compare the model I will create in the 2nd draft with this one, which includes the following features as predictors: 

* 'Working-age (16-64) - 2015', 
* 'Older people aged 65+ - 2015', 
* 'Median Age - 2013', 
* 'Population density (persons per sq km) - 2013', 
* '% BAME - 2011',
* '% English is First Language of no one in household - 2011',  
* 'Median Household income estimate (2012/13)', 
* '% dwellings in council tax bands A or B - 2015', 
* 'Rate of Claimant  of Housing Benefit (2015)', 
* 'Rate of JobSeekers Allowance (JSA) Claimants - 2015', 
* '% dependent children (0-18) in out-of-work households - 2014', 
* 'Level 4 and above qualifications 2011', 
* 'Crime rate - 2014/15', 
* 'Average Public Transport Accessibility score - 2014'.         

If you read this notebook, please send any comments or suggestions to sebastian@bianalytics.org