# How does London vote?

#### by  Sofia Faqir
***
#### 1. Problem and Background
#### 2. Data gathering
#### 3. How to use the data?
#### 4. Methodology
#### 5. Results
#### 6. Discussion
#### 7. Conclusion  
#### References and Sources

***

<h2 id="Problem">1. Problem and Background</h2>

British people have been called to the polling booths many times in recent years: the Brexit referendum, general elections, early general elections, mayoral elections, local elections and so on. Voter fatigue has been increasing, which makes it all the more important to understand where canvassing and campaigning effort (and money) is best spent.

The Greater London area has 73 parliamentary constituencies, out of 650 for the UK as a whole. Wards are the “building blocks” of constituencies (according to the Boundary Commission for England).
Understanding what drives a constituency to vote for a certain party can be helpful in many ways.
First, the boundaries of constituencies are regularly reviewed and amended to make them fairer and more equal. Fairness is a subjective measure, and proposed boundaries should be tested in a range of innovative ways to make sure the governing party is not pushing through boundaries that are more favourable to it.
Second, campaign optimization can be empowered by data science. This is particularly relevant given the rising costs of campaigns (and the multiplication of votes…).

Here, the question that I will ask is:
Using the results of the 2017 General Election, can data on the venues in a specific constituency help predict how it will vote?

***
<h2 id="Data">2. Data gathering</h2>

The British government has clearly made a big effort in transparency and opened many sources of data that are readily available and easily accessible. 
However, the data needed a fair amount of reworking to make it usable for my objective.

For this project, I will need four sets of data:

1. List of constituencies and the party that they have elected: 

The data on the General Elections of 2017 is available on London Datastore, which "has been created by the Greater London Authority (GLA) as a first step towards freeing London’s data":
https://data.london.gov.uk/download/general-election-results-2017/26ee40ae-becf-4839-bb0c-509024e61bfd/2017%20General%20Election%20Results.xls

2. List of wards belonging to each constituency:

I need ward-level data to increase the granularity when compiling the list of venues in a constituency. Constituencies are fairly large areas, so collecting venues at constituency level would be too coarse.

A better way would be to use Ordnance Survey Open Data, but this was too involved for this project.
I resorted to scraping the website https://www.electoralcalculus.co.uk/, which lists the wards for each constituency, along with other data.

3. The coordinates of each ward.

The coordinates for all the wards in England are recorded here:
http://geoportal.statistics.gov.uk/datasets/07194e4507ae491488471c84b23a90f2_0
It includes the ward code, the ward name, and the longitude and latitude of each ward.

However, the same name can apply to different wards (in different constituencies). Luckily, ward names are unique within a constituency, so the constituency/ward pair is in fact unique. To get the list of ward codes, I used the data below and processed it further in Excel:
https://data.london.gov.uk/download/excel-mapping-template-for-london-boroughs-and-wards/58f59b22-946e-43e9-96fd-c0a4fa27f76a/Mapping-template-for-London-boroughs.xls

4. The venues that surround each ward, and hence the venues in a certain constituency.

We will be calling the FourSquare API for this purpose.

At the end of the data filtering I will have the following data:
* Constituency
* Ward
* Unique pair: Constituency/Ward
* Ward coordinates
* Venues: in the relevant ward, together with coordinates, and venue category.
* Party elected at the last General elections

***
<h2 id="Use">3. How to use the data?</h2>

I will be using machine-learning techniques such as clustering and classification (for example decision trees) on a subset of the data.
I will then check how good my results are, and whether there is a clear voting pattern.

There are a few avenues that I will be exploring, for example:
* Number of venues in any given constituency
* Clustering as a function of the venues in the constituency

***

## 4. Methodology

### 4.1. Data gathering and preprocessing

As previously mentioned, a fair amount of data is readily and freely available on the Internet, from reliable government sources.
It was, however, challenging to sift through all the websites and assemble data in differing formats.

**a. The election results per constituency:**

The data in the csv file was for the whole of England, and included election results within each constituency at party level. 
I limited the selection to the London region and to the winning party, and dropped all quantitative data on the election results.
I obtained the following: 

<img src='df_elect2.PNG'>
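
As an illustration, this filtering step could be sketched in pandas as below. The column names (`Constituency`, `Region`, `Party`, `Votes`) and the sample rows are made up for the example; the real results file is laid out differently and would need its own column mapping.

```python
import pandas as pd

# Toy stand-in for the results file: one row per constituency and party.
# Column names and vote counts are illustrative assumptions, not real data.
results = pd.DataFrame({
    "Constituency": ["Constituency A", "Constituency A",
                     "Constituency B", "Constituency B",
                     "Constituency C"],
    "Region": ["London", "London", "London", "London", "Elsewhere"],
    "Party": ["Labour", "Conservative", "Labour", "Conservative", "Labour"],
    "Votes": [120, 100, 90, 110, 200],
})

# Keep London only.
london = results[results["Region"] == "London"]

# For each constituency, keep the row with the most votes (the winner).
winners = london.loc[london.groupby("Constituency")["Votes"].idxmax()]

# Drop the quantitative columns: only constituency and winning party remain.
df_elect = winners[["Constituency", "Party"]].reset_index(drop=True)
print(df_elect)
```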

**b. The list of wards per constituency:**

This required scraping the website https://www.electoralcalculus.co.uk/.
Here is an extract of the table that I put together:

<img src='Ward per Constit.PNG'>

After checking the data, it turned out that ward names are not sufficient to determine the location, since there are duplicates.

I worked in Excel and used the data in the file below to get the **geocode of each ward**, so that I could look up the longitude/latitude in a file listing the coordinates of all the wards in the UK.
https://londondatastore-upload.s3.amazonaws.com/dataset/excel-mapping-template-for-london-boroughs-and-wards/Mapping-template-london-ward-map-2014.xls

I put together a **csv file** with that information, and also decreased the number of wards in the City of London, which was disproportionately large, to avoid skewing the results too much.

The file is: 'GeoCode_wards_Constit.csv'

<img src='Ward per Constit2.PNG'>

**c. Getting the coordinates of each ward**

The list of coordinates for all the wards, keyed by geocode, is available here:
https://opendata.arcgis.com/datasets/07194e4507ae491488471c84b23a90f2_0.csv

With further processing, the coordinates of each ward per constituency were ready:

<img src='Ldn ward coord.PNG'>
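
The "further processing" is essentially a join on the ward code. A minimal sketch, assuming hypothetical column names (`WardCode`, `Latitude`, `Longitude`) and placeholder codes and coordinates rather than the real files:

```python
import pandas as pd

# Stand-in for 'GeoCode_wards_Constit.csv': the same ward name can appear in
# two constituencies, but each row has its own ward code (placeholders here).
wards = pd.DataFrame({
    "Constituency": ["Constituency A", "Constituency B"],
    "Ward": ["Abbey", "Abbey"],
    "WardCode": ["W001", "W002"],
})

# Stand-in for the national coordinates file, keyed by ward code.
coords = pd.DataFrame({
    "WardCode": ["W001", "W002", "W003"],
    "Latitude": [51.50, 51.60, 51.70],
    "Longitude": [-0.10, -0.20, -0.30],
})

# Left join keeps every London ward; validate guards against duplicate codes.
ward_coords = wards.merge(coords, on="WardCode", how="left",
                          validate="one_to_one")

# Wards without a match would show up as missing coordinates.
missing = int(ward_coords["Latitude"].isna().sum())
print(ward_coords, missing)
```

Joining on the code rather than the name is what avoids the duplicate-name problem described earlier.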

**d. The venues from FourSquare**

I downloaded from FourSquare the venues surrounding each ward, with a limit of 100 venues per ward within a radius of 500 m.

I only kept the venue name, its category and its coordinates.

The resulting table included 12,310 rows and was as follows:

<img src='venues.PNG'>
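
A sketch of what such a request and its parsing could look like. It assumes the legacy v2 `venues/explore` endpoint and a response nested as `response → groups → items → venue`; both are assumptions to check against the current FourSquare documentation. The credentials and the sample payload are placeholders, and no network call is made here.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; verify against the current API documentation.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the request URL for venues around one ward centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (assumed value)
        "ll": f"{lat},{lng}",
        "radius": radius,        # metres, as used in this study
        "limit": limit,          # maximum venues per request
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(payload):
    """Keep only name, category and coordinates for each venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append((venue["name"],
                         categories[0].get("name"),
                         venue["location"]["lat"],
                         venue["location"]["lng"]))
    return rows


# Hand-written payload in the assumed shape, standing in for a real response.
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "Example Pub",
    "categories": [{"name": "Pub"}],
    "location": {"lat": 51.5, "lng": -0.1},
}}]}]}}

url = build_explore_url(51.5, -0.1, "YOUR_ID", "YOUR_SECRET")
venues = parse_venues(sample)
print(url)
print(venues)
```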

***

### 4.2. Data Visualization & Exploratory Data Analysis


**b. Number of venues per constituency - Parameter**

I ran a few basic statistical analyses to gain a better understanding of the data.

<img src='Constit_number_top.PNG' ><img src='Constit_number_tail.PNG'>

First, some constituencies have a much larger number of venues than others, in particular 'Cities of London and Westminster', which covers the centre of London. More generally, "zone 1-3" constituencies have many more venues than the rest, which makes sense but might skew the data. We might want to rerun the analysis excluding the 10 constituencies with the largest number of venues.

Second, looking at the map at the same time, we notice that most Labour constituencies are in those zones.

<img src='ConstitParty_number_top.PNG' >

**Is there a link between the number of venues and the political party?**

I drew a box plot to check this hypothesis:

<img src='numberofvenues_box1.PNG' width = 400>


After removing the **outliers** to make the graph more readable, we get:

<img src='numberofvenues_box2.PNG' width = 400>

This shows a clear difference between the profiles of Labour and Tory constituencies.
A large number of venues strongly indicates a Labour constituency, while a small number of venues can be found in constituencies of both parties.
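
The comparison behind the box plots can be sketched numerically as below. The venue counts are invented, and since the text does not say how outliers were removed, the usual 1.5 × IQR rule is assumed here.

```python
import pandas as pd

# Invented venue counts per constituency, for illustration only.
counts = pd.DataFrame({
    "Constituency": list("ABCDEFGH"),
    "Party": ["Labour"] * 4 + ["Conservative"] * 4,
    "Venues": [120, 150, 200, 900, 40, 60, 80, 100],
})

# Quartiles per party: the numbers a box plot draws.
summary = counts.groupby("Party")["Venues"].describe()[["25%", "50%", "75%"]]


def drop_outliers(group):
    """Drop rows outside 1.5 x IQR of the group's quartiles (assumed rule)."""
    q1, q3 = group["Venues"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = group["Venues"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return group[keep]


trimmed = pd.concat([drop_outliers(g) for _, g in counts.groupby("Party")])
print(summary)
print(trimmed)

# With matplotlib installed, the plot itself could be drawn with:
# counts.boxplot(column="Venues", by="Party", showfliers=False)
```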

**a. Visualizing the data on maps**

To gain a better understanding of the data, it helps to visualize it on maps:

**Map showing the wards, each constituency has a different color**

<img src='map_wards_constit.PNG'>

**Map showing the wards: red is Labour, blue is Tory, yellow is Lib Dem**

<img src='map_wards_party.PNG'>

**c. 10 Top Venues**

First, I looked at the number of venues per venue category. Pub was by far the most frequent category.

|Venue Category | Number of Venues   |
|------|------|
|Pub| 873 |
|Coffee Shop|720|
|Café|655|
|Grocery Store|439|
|Hotel|403|
|Italian Restaurant|382|
|Park|357|
|Indian Restaurant|298|
|Pizza Place|276|
|Gym / Fitness Center|270|


**Preprocessing of data :**

We start by building a table in which:
* each row corresponds to a constituency
* each column corresponds to a venue category
* the value of a cell is (number of venues in that category / total number of venues) in the relevant constituency
* each row sums to 1.
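
One way to build such a table is a row-normalised cross-tabulation. The sketch below uses a tiny invented venue list; the column names are assumptions.

```python
import pandas as pd

# Invented venue list: one row per venue, tagged with its constituency.
venues = pd.DataFrame({
    "Constituency": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Pub", "Pub", "Park", "Café", "Pub"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
freq = pd.crosstab(venues["Constituency"], venues["Venue Category"],
                   normalize="index")
print(freq)
```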

This table will also come in handy when we run the clustering.

We can now get the top 10 venues for each constituency:

<img src='top10venues.PNG'>
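
Ranking the categories of each row of the frequency table gives this "top venues" view. A small sketch on invented frequencies, showing the top 2 instead of the top 10 to keep it short:

```python
import pandas as pd

# Invented frequency table: rows are constituencies, columns are categories.
freq = pd.DataFrame(
    {"Pub": [0.5, 0.2], "Park": [0.3, 0.1], "Café": [0.2, 0.7]},
    index=["A", "B"],
)


def top_categories(row, n=10):
    """Return the n category names with the highest frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()


top = pd.DataFrame(
    [top_categories(row, n=2) for _, row in freq.iterrows()],
    index=freq.index,
    columns=["1st Most Common Venue", "2nd Most Common Venue"],
)
print(top)
```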


**d. Clustering**

I ran a k-means clustering algorithm on the frequency table described previously (one row per constituency, one column per venue category, each row summing to 1).


With k = 5, I got the following results:  
**Legend:**  
Purple: Cluster 0  
Blue: Cluster 1  
Green: Cluster 2  
Orange: Cluster 3  
Red: Cluster 4  

<img src='map_clusters.PNG'>

|Cluster Labels|Conservative|Labour|Lib Dem|
|------|------|------|------|
|0|12|24|2|
|1|3|2|0|
|2|5|21|1|
|3|1|1|0|
|4|0|1|0|
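
A table of this kind can be produced by clustering the frequency table and cross-tabulating the cluster labels against the winning party. The sketch below uses scikit-learn's `KMeans` on synthetic data; the random features and party labels are placeholders, and `n_init` and `random_state` are arbitrary choices, not values taken from this study.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic frequency table: 20 "constituencies" x 6 "venue categories",
# with each row normalised to sum to 1.
raw = rng.random((20, 6))
freq = raw / raw.sum(axis=1, keepdims=True)

# Synthetic winning parties, one per row.
party = pd.Series(rng.choice(["Conservative", "Labour", "Lib Dem"], size=20),
                  name="Party")

# Fit k-means with k = 5, as in the text.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(freq)
labels = pd.Series(kmeans.labels_, name="Cluster Labels")

# Count constituencies per (cluster, party) pair.
table = pd.crosstab(labels, party)
print(table)
```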

Clusters 0 and 2 are by far the most populous clusters, and hence the most relevant.

Looking at these two clusters:
- Tory constituencies are more frequent in cluster 0 (12 of 38 constituencies) than in cluster 2 (5 of 27).
- There is no particular insight on Labour constituencies, which dominate both clusters.

Also, looking at the most frequent venues in each of the clusters we get the following:

|Cluster Labels|1st Most Common Venue|2nd Most Common Venue|3rd Most Common Venue|4th Most Common Venue|5th Most Common Venue|
|------|------|------|------|------|------|
|0|Pub|Coffee Shop|Café|Italian Restaurant|Hotel|
|1|Indian Restaurant|Coffee Shop|Pub|Park|Grocery Store|
|2|Grocery Store|Park|Pub|Coffee Shop|Café|
|3|Indian Restaurant|Park|Business Service|Pub|Grocery Store|
|4|Café|Grocery Store|Indian Restaurant|Restaurant|Gas Station|

Cluster 0: the most common venues (Pub, Coffee Shop, Café, etc.) are related to entertainment, so we can assume that this is the busy part of the city, probably mostly inhabited by young city dwellers, who would probably vote Labour.  

Cluster 2: the most common venues (Grocery Store, Park, Pub) suggest a more family-friendly area, whose population is probably families or older people, potentially affluent.  
This would suggest a Conservative vote, but the results do not show it.



Finally, I also decided to **decrease the number of venue categories** as follows:
* All restaurant categories with fewer than 35 venues are grouped under Other Restaurants
* All gyms are grouped under Gym
* All bars are grouped under Bar
* Dessert shops are grouped together (including chocolate shops etc.)
* Categories that contain only 1 or 2 venues are grouped under Tiny Category.

This decreases the number of categories from 426 to 224.
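
These grouping rules can be sketched as a small function. The matching rules (a category counts as a bar if one of its words is "Bar", and so on) and the order in which the rules apply are assumptions. The dessert grouping is left out because the text does not list which categories it covers.

```python
import pandas as pd


def consolidate(categories, restaurant_min=35, tiny_max=2):
    """Merge rare or near-duplicate venue categories (assumed rules)."""
    counts = categories.value_counts()

    def rename(cat):
        words = cat.split()
        if "Gym" in words:
            return "Gym"
        if "Bar" in words:
            return "Bar"
        if cat.endswith("Restaurant") and counts[cat] < restaurant_min:
            return "Other Restaurants"
        return cat

    merged = categories.map(rename)

    # After merging, anything still holding 1 or 2 venues becomes "Tiny Category".
    merged_counts = merged.value_counts()
    tiny = merged_counts[merged_counts <= tiny_max].index
    return merged.where(~merged.isin(tiny), "Tiny Category")


# Invented sample; the restaurant threshold is lowered to 4 to suit its size.
categories = pd.Series(
    ["Pub"] * 5 + ["Wine Bar"] * 2 + ["Cocktail Bar"] * 2
    + ["Gym / Fitness Center"] * 3 + ["Thai Restaurant"] * 3
    + ["Italian Restaurant"] * 4 + ["Zoo"]
)
result = consolidate(categories, restaurant_min=4)
print(result.value_counts())
```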

***
## 5. Results

I ran the **K-Nearest Neighbours** (K-NN) algorithm.

I divided the data into a train set and a test set, and ran the K-NN algorithm for different values of K.  
K=2 gave the best result, as shown in the graph below of accuracy score for different values of K:  
Train set Accuracy:  0.90  
Test set Accuracy:  0.77  
<img src='KNN-Graph.PNG'>
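
The search over K can be sketched with scikit-learn as below. The features and labels are synthetic, and the 80/20 split, the range of K and the random seeds are assumptions, so the scores printed here say nothing about the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 73 constituencies, 8 venue-frequency features each.
X = rng.random((73, 8))
y = rng.choice(["Conservative", "Labour"], size=73)

# Hold out 20% of the constituencies for testing (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Fit one model per value of K and record its test accuracy.
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, model.predict(X_test))

best_k = max(scores, key=scores.get)
print(scores)
print("Best K:", best_k)
```

With only 73 constituencies, the test set holds about 15 of them, so the accuracy for a given K can shift noticeably from one random split to another.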


***
## 6. Discussion

The venues in a constituency seem to provide some keys to understanding the voting patterns in the Greater London area.
However, no method proved conclusive enough to predict the vote.  
The 2017 General Election was notoriously bad for the Conservative Party in the Greater London area because of Brexit. For example, Kensington switched to Labour, very surprisingly. This might help explain the inconclusive results.  
It would be interesting to run the analysis on an election that is not dominated by a single issue. We could use previous elections to test the results.

Also, it would make sense to add other parameters to explain the vote, such as average age, average salary etc.

This study could be made more granular by running it at ward level, for local or parliamentary elections, as well as for other big questions like Brexit.
It could also be extended to the rest of the UK, since we have seen some patterns in how people voted.


***
## 7. Conclusion

The question being asked is very relevant and topical: how can we better predict how people are going to vote?  
Here we tried to predict it based on the venues in a constituency: their number, the frequency of a given venue category...

The analysis did bring some answers to this question. For example, the larger the number of venues, the more likely the constituency was to vote Labour.  
However, this data is too narrow to explain the vote on its own.

We would probably gain from changing the granularity of the study, going down to the ward level or up to the country level.

***
## References and sources:

London Datastore, by the Greater London Authority (GLA): https://data.london.gov.uk/

Electoral Calculus, by Martin Baxter: https://www.electoralcalculus.co.uk/

Open Geography Portal, Office for National Statistics: http://geoportal.statistics.gov.uk/

Boundary Commission for England: https://boundarycommissionforengland.independent.gov.uk/