## Data Preparation


In [None]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json # library to handle JSON files
import requests # library to handle requests
import csv
import lxml.html as lh
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.datasets import make_blobs # samples_generator module was removed in newer scikit-learn releases


# import k-means from clustering stage
from sklearn.cluster import KMeans

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML

print('Libraries imported.')





### 1- Location

First, let's download the location data and read it into a pandas DataFrame.

>**Data source:** London Datastore, Greater London Authority  
**Source link:** https://data.london.gov.uk/dataset/postcode-directory-for-london  
**Data details:** Postcode, Latitude, Longitude, LSOA*  
(* *A Lower Layer Super Output Area (LSOA) is a geographic area. Lower Layer Super Output Areas are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales.*)

In [None]:
location=pd.read_csv("https://data.london.gov.uk/download/postcode-directory-for-london/fd269535-973a-418f-8847-da405687e2e2/London_postcode-ONS-postcode-Directory-May15.csv")
location.head()

Let's rename the columns we need and drop the rest.

In [None]:
# Give the columns of interest readable names, then keep only those columns
location.rename(columns={'pcd': 'Postcode', 'lsoa11': 'LSOA', 'lat': 'Latitude', 'long': 'Longitude'}, inplace=True)
postcodes = location[['Postcode', 'LSOA', 'Latitude', 'Longitude']]
postcodes.head()
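
If only these four columns are needed, they could also be selected at load time. The sketch below assumes the raw column names `pcd`, `lsoa11`, `lat` and `long` used in the rename above; the inline CSV, its values and its `extra` column are made-up stand-ins for the real file.

```python
import io
import pandas as pd

# Made-up two-row sample standing in for the postcode directory CSV.
sample_csv = io.StringIO(
    "pcd,lsoa11,lat,long,extra\n"
    "AA1 1AA,LSOA_A,51.50,-0.10,x\n"
    "AA1 1AB,LSOA_B,51.51,-0.12,y\n"
)

# Map raw column names to the readable names used in this notebook.
column_names = {"pcd": "Postcode", "lsoa11": "LSOA", "lat": "Latitude", "long": "Longitude"}

# usecols parses only the listed columns, so the unused ones are never loaded.
postcodes_sample = (
    pd.read_csv(sample_csv, usecols=list(column_names))
    .rename(columns=column_names)
)
print(postcodes_sample.columns.tolist())
```

Reading fewer columns mainly helps with memory when the source file is wide; the result is the same as selecting the columns afterwards.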


In [None]:
location.describe()

### 2- Crime

>**Data source:** London Datastore, Greater London Authority  
**Source link:** https://data.london.gov.uk/dataset/recorded_crime_summary  
**Data details:** MPS Borough Level Crime (most recent 24 months), MPS LSOA Level Crime (most recent 24 months)

In [None]:
crime=pd.read_csv("https://data.london.gov.uk/download/recorded_crime_summary/644a9e0d-75a3-4c3a-91ad-03d2a7cb8f8e/MPS%20LSOA%20Level%20Crime%20%28most%20recent%2024%20months%29.csv")
crime.head()

In [None]:
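crime.rename(columns={'LSOA Code': 'LSOA'}, inplace=True)
# Sum the monthly count columns row-wise; numeric_only skips the text columns,
# which would otherwise raise a TypeError on recent pandas versions
crime['Sum_Crime'] = crime.sum(axis=1, numeric_only=True)
crime = crime[['Borough', 'LSOA', 'Sum_Crime']]
crime.head()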
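
As a rough illustration of the row-wise total, here is a toy frame with the same shape as the crime table. The month column names, category labels and counts are hypothetical; only `LSOA`, `Borough` and `Sum_Crime` come from the cell above.

```python
import pandas as pd

# Toy stand-in for the crime table; month columns and values are invented.
toy_crime = pd.DataFrame({
    "LSOA": ["LSOA_A", "LSOA_A", "LSOA_B"],
    "Borough": ["Borough X", "Borough X", "Borough Y"],
    "Major Category": ["Burglary", "Theft", "Burglary"],
    "201901": [1, 4, 0],
    "201902": [2, 3, 5],
})

# Pick the numeric columns explicitly so text columns never enter the sum.
month_cols = toy_crime.select_dtypes(include="number").columns
toy_crime["Sum_Crime"] = toy_crime[month_cols].sum(axis=1)
print(toy_crime[["LSOA", "Sum_Crime"]])
```

Selecting the numeric columns first is an alternative to `numeric_only=True` and makes it easy to inspect exactly which columns are being added up.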

In [None]:
crime.describe()

In [None]:
#crime.groupby(['Borough', 'LSOA'])['Sum_Crime'].sum().reset_index()
crime_2=crime.groupby(['LSOA'])['Sum_Crime'].sum().reset_index()
crime_2.head()
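
The effect of the `groupby` can be sketched on a toy frame: several rows for the same LSOA collapse into one row holding the total. The LSOA labels and counts below are invented for illustration.

```python
import pandas as pd

# Toy data: LSOA_A appears twice, as it would with one row per crime category.
toy = pd.DataFrame({
    "LSOA": ["LSOA_A", "LSOA_A", "LSOA_B"],
    "Sum_Crime": [3, 7, 5],
})

# as_index=False keeps LSOA as a regular column, like reset_index() would.
per_lsoa = toy.groupby("LSOA", as_index=False)["Sum_Crime"].sum()
print(per_lsoa)

# After aggregation each LSOA should appear exactly once.
print(per_lsoa["LSOA"].is_unique)
```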

In [None]:
crime_2.describe()

In [None]:
crime.to_csv('crime.csv')
print("saved")

To calculate crime per 1,000 people, we will pull LSOA-level population data.

>**Data source:** London Datastore, Greater London Authority  
**Source link:** https://data.london.gov.uk/dataset/lsoa-atlas  
**Data details:** LSOA-level population data used to calculate the crime rate per population; current LSOA boundaries (post-2011)

In [None]:
population=pd.read_excel("https://data.london.gov.uk/download/lsoa-atlas/b8e01c3a-f5e3-4417-82b3-02ad271e6ee8/lsoa-data.xls", header=1)
population.head()


In [None]:
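# Keep the LSOA code column and the population column
population = population[['Unnamed: 0', 'Unnamed: 14']]
# Drop the leftover sub-header row
population = population.drop(population.index[0]).reset_index(drop=True)
population.rename(columns={'Unnamed: 0': 'LSOA', 'Unnamed: 14': 'Population'}, inplace=True)
# Make sure Population is numeric so describe() and the later division behave as expected
population['Population'] = pd.to_numeric(population['Population'], errors='coerce')
population.head()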

In [None]:
population.describe()

Let's calculate crime per 1,000 people for each LSOA.

In [None]:
crime = pd.merge(crime, population, how='left', on='LSOA')
crime.head(10)
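
A left merge silently leaves `NaN` in `Population` for any LSOA missing from the population table. One way to check for that, sketched here on invented data, is the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Toy tables: LSOA_C has no population record.
toy_crime = pd.DataFrame({
    "LSOA": ["LSOA_A", "LSOA_B", "LSOA_C"],
    "Sum_Crime": [10, 5, 2],
})
toy_population = pd.DataFrame({
    "LSOA": ["LSOA_A", "LSOA_B"],
    "Population": [1500, 2000],
})

# indicator=True adds a _merge column saying where each row came from;
# validate="many_to_one" raises if an LSOA is duplicated in the population table.
merged = pd.merge(
    toy_crime, toy_population,
    how="left", on="LSOA",
    indicator=True, validate="many_to_one",
)

# Rows present only in the left table have no matching population.
unmatched = merged.loc[merged["_merge"] == "left_only", "LSOA"].tolist()
print(unmatched)
```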

In [None]:
crime.describe()

In [None]:
# Crime rate per 1,000 residents
crime['crime_per_1000'] = crime['Sum_Crime'] / crime['Population'] * 1000
crime_per_pop = crime.drop(columns=['Sum_Crime', 'Population'])
crime_per_pop.head()
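
As a worked example of the rate, 30 recorded crimes in an area of 1,500 residents gives 30 / 1500 * 1000 = 20 crimes per 1,000 people. The sketch below uses invented numbers and also shows one possible way to treat zero or missing populations, which is an assumption rather than something the original cell does.

```python
import numpy as np
import pandas as pd

# Toy data: one normal row, one zero population, one missing population.
toy_rates = pd.DataFrame({
    "LSOA": ["LSOA_A", "LSOA_B", "LSOA_C"],
    "Sum_Crime": [30, 5, 2],
    "Population": [1500, 0, np.nan],
})

# Coerce to numeric and turn zeros into NaN so the division never yields inf.
safe_population = pd.to_numeric(toy_rates["Population"], errors="coerce").replace(0, np.nan)

toy_rates["crime_per_1000"] = toy_rates["Sum_Crime"] / safe_population * 1000
print(toy_rates)
```

Rows with no usable population end up as `NaN` instead of `inf`, so they can be dropped or inspected before any clustering step.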

### 4- Commuting duration to Bank station

>**Data source:** London Datastore, Greater London Authority  
**Source link:** https://data.london.gov.uk/download/mylondon/c2e9ebc1-935b-460c-9361-293398d84fe5/MyLondon_traveltime_to_Bank_station_OA.csv  
**Data details:** Public transport travel time to Bank station in minutes (`public_transport_time_mins`), keyed by `OA11CD`

In [None]:
commute=pd.read_csv("https://data.london.gov.uk/download/mylondon/c2e9ebc1-935b-460c-9361-293398d84fe5/MyLondon_traveltime_to_Bank_station_OA.csv")
commute.head()

In [None]:
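# OA11CD looks like an Output Area code; it is relabelled 'LSOA' here to match the key name of the other tables
commute.rename(columns={'OA11CD': 'LSOA'}, inplace=True)
commute = commute[['LSOA', 'public_transport_time_mins']]
commute.head()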
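
The column name `OA11CD` suggests the travel times are keyed by Output Area rather than by LSOA. If that is the case, the codes will not match the LSOA codes in the crime and population tables, and the times would need to be rolled up to LSOA level first. The sketch below shows one possible approach on invented data; the `toy_lookup` table mapping output areas to LSOAs is hypothetical and would have to come from a real lookup source.

```python
import pandas as pd

# Toy travel times per output area (codes and minutes are invented).
toy_commute = pd.DataFrame({
    "OA11CD": ["OA_1", "OA_2", "OA_3"],
    "public_transport_time_mins": [30.0, 40.0, 55.0],
})

# Hypothetical lookup from output area to the LSOA that contains it.
toy_lookup = pd.DataFrame({
    "OA11CD": ["OA_1", "OA_2", "OA_3"],
    "LSOA": ["LSOA_A", "LSOA_A", "LSOA_B"],
})

# Attach the LSOA to each output area, then average the times per LSOA.
commute_lsoa = (
    toy_commute.merge(toy_lookup, on="OA11CD", how="left")
    .groupby("LSOA", as_index=False)["public_transport_time_mins"]
    .mean()
)
print(commute_lsoa)
```

A plain mean treats every output area equally; a population-weighted mean would be another reasonable choice if output area populations were available.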

In [None]:
commute.describe()