# The Battle of the Neighborhoods - The Data (Week 1)

### Data Acquisition

The data for this project will be acquired from three sources (links to all three can be found at the end of this section). Firstly, the project will utilise a crime dataset that counts the number of crimes in each London borough per month, from May 2018 to April 2020, broken down by crime type. The dataset contains the following columns:

 - **MajorText**: The higher-level/general categorisation of the crime
 - **MinorText**: The lower-level/specific categorisation of the crime (within the MajorText category)
 - **LookUp_BoroughName**: The common name for the London borough
 - **Year and month (multiple columns)**: Monthly reported count of each crime type in given borough

The second source of data will be scraped from a Wikipedia page that lists the London boroughs. The table on that page contains the following columns:

 - **Borough**: The name of the London borough
 - **Inner**: Categorisation as either an inner or outer London borough
 - **Status**: Categorisation as a Royal, City, or other borough
 - **Local authority**: The local authority assigned to the borough
 - **Political control**: The political party that controls the borough
 - **Headquarters**: Location of the borough's headquarters
 - **Area (sq mi)**: Area of the borough in square miles
 - **Population (2013 est)**: The estimated population of the borough in 2013
 - **Coordinates**: The latitude and longitude of the borough
 - **Nr. in map**: The number assigned to each borough to represent visually on a map

The third data source is the list of neighborhoods in the borough of 'Richmond upon Thames' as found on Wikipedia. The dataset will be created manually from that list and will include the following columns:

 - **Neighborhood**: Name of the neighborhood in the borough
 - **Borough**: Name of the borough
 - **Latitude**: Latitude of the neighborhood
 - **Longitude**: Longitude of the neighborhood

Links to the data sources:

 - [Crime data (source 1)](https://data.london.gov.uk/dataset/recorded_crime_summary)
 - [List of London boroughs (source 2)](https://en.wikipedia.org/wiki/List_of_London_boroughs)
 - [List of neighborhoods in Richmond upon Thames (source 3)](https://en.wikipedia.org/wiki/London_Borough_of_Richmond_upon_Thames#List_of_neighbourhoods)

### Data Cleaning

Each data source will be prepared separately. For the first source, only the London crime data from the most recent period (April 2019 to April 2020) will be used. Two dataframes will be produced here: one that breaks the crimes down by major and minor category, and one that shows the total number of crimes for each borough:

<img src='cap1.png'>
<img src='cap2.png'>
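
The two dataframes described above could be built along the following lines. This is a minimal sketch rather than the project's actual code: the sample rows are made up, and the `YYYYMM` format of the month column names is an assumption about how the downloaded file labels its monthly columns.

```python
import pandas as pd

# Made-up sample in the same wide layout as the crime dataset.
# Month column names in YYYYMM form are an assumption.
crime = pd.DataFrame({
    "MajorText": ["Burglary", "Burglary", "Theft"],
    "MinorText": ["Burglary - Residential", "Burglary - Residential", "Shoplifting"],
    "LookUp_BoroughName": ["Camden", "Barnet", "Camden"],
    "201903": [5, 7, 9],
    "201904": [10, 20, 30],
    "202004": [1, 2, 3],
})

id_cols = ["MajorText", "MinorText", "LookUp_BoroughName"]

# Keep only the month columns that fall inside the chosen period.
month_cols = [c for c in crime.columns
              if c.isdigit() and "201904" <= c <= "202004"]

# Dataframe 1: major and minor crime types with a total for the period.
detailed = crime[id_cols + month_cols].copy()
detailed["Total"] = detailed[month_cols].sum(axis=1)

# Dataframe 2: total crimes per borough over the same period.
borough_totals = (
    detailed.groupby("LookUp_BoroughName", as_index=False)["Total"].sum()
)
```

Comparing the column names as strings works here only because `YYYYMM` labels sort chronologically; a different naming scheme would need the labels parsed into dates first.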

Next, we will use the Beautiful Soup package to scrape data on each London borough from a table on a Wikipedia page. The initial table is quite cluttered: it includes columns we are not interested in (e.g. the political party in power) and contains tags and notes that we do not require. As such, we will drop the unnecessary columns and remove any unwanted extra text.

<img src='cap3.png'>
<img src='cap4.png'>
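
The cleaning step can be illustrated on cell text that has already been extracted from the table. The sketch below uses only the standard library; the helper names `clean_cell` and `clean_row`, the bracketed `[note 1]` style of the unwanted text, and the choice of columns to keep are assumptions made for illustration, not the project's actual code.

```python
import re


def clean_cell(text):
    """Strip bracketed notes and stray whitespace from one table cell."""
    text = re.sub(r"\[[^\]]*\]", "", text)   # drop markers such as [note 1]
    text = text.replace("\xa0", " ")         # non-breaking spaces from HTML
    return " ".join(text.split())            # collapse runs of whitespace


def clean_row(row, keep):
    """Keep only the wanted columns of a scraped row and clean each value."""
    return {column: clean_cell(row[column]) for column in keep}


# Hypothetical scraped row; the values are placeholders, not real data.
raw_row = {
    "Borough": "Example Borough [note 1]\n",
    "Political control": "Example Party",
    "Population (2013 est)": "123,456",
}

cleaned = clean_row(raw_row, keep=["Borough", "Population (2013 est)"])
```

Numeric columns such as the population would still need their thousands separators removed and a conversion to `int` before any arithmetic.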

Furthermore, because we are interested in the **crime rate** rather than the raw number of crimes, a new column will be added that divides the number of reported crimes by the borough's population and then multiplies the result by 1,000. This gives the **crime rate per 1,000 residents**.

<img src='cap5.png'>
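
As a worked example of the calculation, the sketch below applies the formula to made-up figures. The function name `crime_rate_per_1000` is hypothetical and the numbers are illustrative only.

```python
def crime_rate_per_1000(reported_crimes, population):
    """Return the number of reported crimes per 1,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return reported_crimes / population * 1000


# Illustrative figures, not taken from the dataset:
# 25,000 reported crimes in a borough of 200,000 residents.
rate = crime_rate_per_1000(25_000, 200_000)
print(rate)  # 125.0
```

Normalising by population in this way lets boroughs of very different sizes be compared on the same scale.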

Finally, the third data source (a Wikipedia page containing a list of neighborhoods in the borough of 'Richmond upon Thames') will be used alongside geopy, which supplies the latitude and longitude of each neighborhood, to produce the following:

<img src='cap6.png'>
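
One possible way to assemble this table is sketched below. The function `geocode_neighborhoods` is hypothetical, and it accepts any callable that takes a query string and returns an object with `latitude` and `longitude` attributes (or `None` when nothing is found), so the row-building logic can be tried without network access. The query format is also an assumption.

```python
def geocode_neighborhoods(names, borough, geocode):
    """Build one row per neighborhood using a geocode(query) callable."""
    rows = []
    for name in names:
        # Include the borough in the query to reduce ambiguous matches.
        location = geocode(f"{name}, {borough}, London, UK")
        if location is None:
            continue  # skip names the geocoder cannot resolve
        rows.append({
            "Neighborhood": name,
            "Borough": borough,
            "Latitude": location.latitude,
            "Longitude": location.longitude,
        })
    return rows


# With geopy installed and network access, the callable could be supplied
# by a Nominatim geocoder (the user_agent string is a placeholder):
#
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="richmond-neighborhoods-demo")
# rows = geocode_neighborhoods(
#     ["Twickenham", "Kew"], "Richmond upon Thames", geolocator.geocode
# )
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` to obtain the four columns listed earlier.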

The resulting dataset will then be combined with Foursquare location data to explore the venues within each neighborhood.