# IBM Data Science Professional Certification - Capstone Project

## The Battle of Neighbourhoods (London)

### Introduction/Business Problem

The purpose of this project is to analyse the boroughs of London (UK) for safety, based on the crime records available in the public domain. This analysis should help those planning to move to London to pursue their professional ambitions. It will give them a data-driven report on where to rent or buy a place to live, assuming that safety is the top priority for anyone moving to a new place.

Once the safest borough is identified by the analysis, the neighbourhoods in that borough will be explored. This will give a newcomer more information for selecting a specific neighbourhood to live in, based on their individual preferences and tastes.

### Data acquisition

The following data is required to solve the business problem described above:
* Crime records for the London boroughs
* Neighbourhood details for London boroughs

The following data sources are used to acquire this information:
* A real-world dataset from Kaggle providing details about crimes in London
* Information about the neighbourhoods of London's boroughs from Wikipedia (the Nominatim geocoder from geopy is used to obtain the coordinates of neighbourhoods within a borough)

## Part 1 - Processing London crime statistics dataset from Kaggle

Location of data: https://www.kaggle.com/jboysen/london-crime

### Importing libraries

In [67]:
import pandas as pd
import numpy as np
import requests
import random
%matplotlib inline 
import matplotlib as mplb
import matplotlib.pyplot as plt
import matplotlib.cm as mpcm
import matplotlib.colors as cols
from bs4 import BeautifulSoup
from pandas import json_normalize
from IPython.display import Image 
from IPython.core.display import HTML
import folium
import geocoder
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

### Converting Kaggle .csv data to a DataFrame and processing it

In [None]:
# Reading the csv file
df_lon = pd.read_csv("london_crime_by_lsoa.csv")
df_lon.head()

In [None]:
df_lon.shape

In [None]:
# Keeping the data of the latest year (2016) and dropping older data
df_lon.drop(df_lon.index[df_lon['year'] != 2016], inplace = True)
df_lon.shape

In [None]:
# Removing rows where the crime count is zero
df_lon = df_lon[df_lon.value != 0]
df_lon.shape

In [None]:
df_lon.head()

In [None]:
# Resetting index
df_lon = df_lon.reset_index(drop=True)
df_lon.head()

In [None]:
df_lon.info()

In [None]:
# Number of non-zero crime records (rows) per borough; this counts rows and does not sum the 'value' column
df_lon['borough'].value_counts()
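
Note that `value_counts()` counts rows rather than adding up the `value` column. A minimal sketch on made-up data (the three-row frame `toy` is hypothetical and not taken from the Kaggle file) shows how the two numbers can differ:

```python
import pandas as pd

# Hypothetical toy data that mimics the 'borough' and 'value' columns
toy = pd.DataFrame({
    "borough": ["Camden", "Camden", "Sutton"],
    "value": [3, 2, 4],
})

# Number of records (rows) per borough
records_per_borough = toy["borough"].value_counts()

# Number of crimes per borough: the sum of the 'value' column
crimes_per_borough = toy.groupby("borough")["value"].sum()

print(records_per_borough)
print(crimes_per_borough)
```

The pivot table built in the next section sums `value`, so the totals used for ranking the boroughs are crime counts, not row counts.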

In [None]:
# Distribution of records across major categories
df_lon['major_category'].value_counts()

In [None]:
df_lon['minor_category'].value_counts()

### Creating a DataFrame to view category-wise crime records for every borough and processing

In [None]:
# Creating a pivot table
df_lon_pt = pd.pivot_table (df_lon, index=['borough'], columns=['major_category'], values=['value'], aggfunc=np.sum, fill_value=0)
df_lon_pt.head()

In [None]:
# Resetting the index
df_lon_pt.reset_index(inplace = True)
df_lon_pt.head()

In [None]:
df_lon_pt.shape

In [None]:
# Adding a total column (summing only the numeric crime columns, not the borough name)
df_lon_pt['Total'] = df_lon_pt.sum(axis=1, numeric_only=True)
df_lon_pt.head()

In [None]:
# Reducing the headers from 2 to 1
df_lon_pt.columns = df_lon_pt.columns.map(''.join)
df_lon_pt.head()
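
The pivot, total, and header-flattening steps above can be sketched on a tiny invented dataset. The frame `toy` and its numbers are assumptions made for illustration; only the column names follow the crime data. In this sketch the total is computed before the index is reset, which keeps the borough name out of the row sum.

```python
import pandas as pd

# Hypothetical toy records with the same column names as the crime data
toy = pd.DataFrame({
    "borough": ["Camden", "Camden", "Sutton", "Sutton"],
    "major_category": ["Burglary", "Drugs", "Burglary", "Burglary"],
    "value": [3, 2, 4, 1],
})

# One row per borough, one column per category, summing the counts
pt = pd.pivot_table(toy, index="borough", columns="major_category",
                    values=["value"], aggfunc="sum", fill_value=0)

# Flatten the two-level header, e.g. ('value', 'Burglary') -> 'Burglary'
pt.columns = [category for _, category in pt.columns]

# Row total over the category columns (the borough is still the index here)
pt["Total"] = pt.sum(axis=1)
pt = pt.reset_index()
print(pt)
```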

In [None]:
# Renaming columns
df_lon_pt.columns = ['Borough','burglary', 'criminal_damage','drugs','other_notifiable_offences', 'robbery','theft_and_handling','violence_against_the_person','total']
df_lon_pt.head()

## Part 2 - Creating a DataFrame for London neighbourhoods from Wikipedia

Location of data: https://en.wikipedia.org/wiki/List_of_London_boroughs

In [None]:
raw_data = requests.get('https://en.wikipedia.org/wiki/List_of_London_boroughs').text
neigh = BeautifulSoup(raw_data, 'html.parser')

In [None]:
# Selecting data from the table available on the wikipedia page
lon_neigh_table = neigh.find_all('table', {'class':'wikitable sortable'})

In [None]:
# Creating a DataFrame from the table
df_lon_neigh = pd.read_html(str(lon_neigh_table[0]), index_col=None, header=0)[0]
df_lon_neigh.head()

In [None]:
df_lon_neigh.shape

In [None]:
# Extracting the second table on the wikipedia page by the same process
df_lon_neigh_col = pd.read_html(str(lon_neigh_table[1]), index_col=None, header=0)[0]
df_lon_neigh_col.columns = ['Borough','Inner','Status','Local authority','Political control','Headquarters','Area (sq mi)','Population (2013 est)[1]','Co-ordinates','Nr. in map']
df_lon_neigh_col

In [None]:
# Merging both tables (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
df_lon_neigh = pd.concat([df_lon_neigh, df_lon_neigh_col], ignore_index=True)
df_lon_neigh.head()

In [None]:
df_lon_neigh.shape

In [None]:
# Checking the full table for consistency
df_lon_neigh

In [None]:
# Cleaning the dataset to match the criminal record DataFrame created earlier
df_lon_neigh = df_lon_neigh.replace('note 1','', regex=True) 
df_lon_neigh = df_lon_neigh.replace('note 2','', regex=True) 
df_lon_neigh = df_lon_neigh.replace('note 3','', regex=True) 
df_lon_neigh = df_lon_neigh.replace('note 4','', regex=True) 
df_lon_neigh = df_lon_neigh.replace('note 5','', regex=True)
df_lon_neigh.iloc[0,0] = 'Barking and Dagenham'
df_lon_neigh.iloc[9,0] = 'Greenwich'
df_lon_neigh.iloc[11,0] = 'Hammersmith and Fulham'
df_lon_neigh
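
As an alternative to removing each note by hand and then fixing individual cells by position, the footnote markers could be stripped with a single regular expression. This is a minimal sketch; it assumes the markers have the form `[note N]`, and the sample names below are hypothetical stand-ins for the scraped column.

```python
import pandas as pd

# Hypothetical sample of borough names; the '[note N]' form is an assumption
names = pd.Series([
    "Barking and Dagenham [note 1]",
    "Greenwich [note 2]",
    "Kingston upon Thames",
])

# Remove a '[note N]' marker together with the whitespace in front of it
cleaned = names.str.replace(r"\s*\[note \d+\]", "", regex=True).str.strip()
print(cleaned.tolist())
```

Because the pattern matches the brackets as well, no leftover characters remain that would need position-based fixes such as `iloc[0,0]`.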

In [None]:
df_lon_pt.shape

In [None]:
df_lon_neigh.shape

In [None]:
# Merging both the DataFrames
df_lon_merged = pd.merge(df_lon_pt, df_lon_neigh, on='Borough')
df_lon_merged.head()
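
An inner merge silently drops any borough whose name differs between the two tables. A minimal sketch of how this could be checked, using two invented miniature tables (`crimes` and `info` are hypothetical): an outer merge with `indicator=True` shows which rows found no partner.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables being merged
crimes = pd.DataFrame({"Borough": ["Camden", "Sutton", "Greenwich"],
                       "total": [10, 4, 7]})
info = pd.DataFrame({"Borough": ["Camden", "Sutton", "Greenwich []"],
                     "Inner": ["Yes", "No", "No"]})

# An outer merge with indicator=True records where each row came from
check = pd.merge(crimes, info, on="Borough", how="outer", indicator=True)

# Rows that did not find a partner in the other table
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Borough", "_merge"]])
```

An empty `unmatched` frame would indicate that the name cleaning succeeded for every borough.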

In [None]:
df_lon_merged.shape

In [None]:
df_lon_merged.columns.tolist()

In [None]:
# Renaming columns
df_lon_merged.columns = ['borough','burglary', 'criminal_damage','drugs','other_notifiable_offences', 'robbery','theft_and_handling','violence_against_the_person','total','inner','status','local_authority','political_control','headquarters','area_sq_mi','population_2013','co-ordinates','nr_in_map']
df_lon_merged.head()

## Part 3 - Data Analysis (exploratory) of the Merged DataFrame

In [None]:
df_lon_merged.shape

In [None]:
df_lon_merged.columns.tolist()

In [None]:
df_lon_merged.info()

In [None]:
df_lon_merged.describe()

In [None]:
# Sorting the DataFrame in increasing number of crimes
df_lon_merged.sort_values(['total'], axis = 0, inplace = True )
df_lon_merged.head()

In [None]:
# Creating a DataFrame with the 10 most crime-ridden boroughs (sorting returns a new DataFrame, which avoids SettingWithCopyWarning)
df_lon_top10 = df_lon_merged.tail(10).sort_values(['total'], ascending = False, axis = 0)
df_lon_top10

In [None]:
# Creating a DataFrame with the 10 least crime-ridden boroughs (sorting returns a new DataFrame, which avoids SettingWithCopyWarning)
df_lon_least10 = df_lon_merged.head(10).sort_values(['total'], axis = 0)
df_lon_least10
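
The same top and bottom selections can also be expressed with `nlargest` and `nsmallest`, which do not depend on the DataFrame having been sorted beforehand. A minimal sketch on invented totals (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical totals; the numbers are invented for illustration
toy = pd.DataFrame({
    "borough": ["A", "B", "C", "D"],
    "total": [40, 10, 30, 20],
})

# Each call returns a new, already sorted DataFrame
top2 = toy.nlargest(2, "total")
least2 = toy.nsmallest(2, "total")
print(top2)
print(least2)
```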

In [None]:
# Plotting the 10 most crime-ridden boroughs
df_lon_crimes_top10 = df_lon_top10[['borough','total']].set_index('borough')
ax = df_lon_crimes_top10.plot(kind='bar', figsize=(25, 10), rot =0)
ax.set_xlabel('London boroughs')
ax.set_ylabel('Count of crimes committed')
ax.set_title('10 top crime ridden London boroughs')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 16
               )
plt.show()

In [None]:
# Plotting the 10 least crime-ridden boroughs
df_lon_crimes_least10 = df_lon_least10[['borough','total']].set_index('borough')
ax = df_lon_crimes_least10.plot(kind='bar', figsize=(25, 10), rot =0)
ax.set_xlabel('London boroughs')
ax.set_ylabel('Count of crimes committed')
ax.set_title('10 least crime ridden London boroughs')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 16
               )
plt.show()

### Conclusions 
* **Newcomers to London are best advised to avoid the 10 most crime-ridden boroughs.**
* **The 10 least crime-ridden boroughs are the ones a newcomer should consider for a place to live.**
* **The City of London is the 33rd principal division of Greater London but is not a London borough (source: https://en.wikipedia.org/wiki/List_of_London_boroughs), so for the purpose of this project Kingston upon Thames is analysed for neighbourhoods instead.**

## Part 4 - Analyzing the least crime ridden borough (Kingston upon Thames) of London for neighbourhoods

Location of data: https://en.wikipedia.org/wiki/List_of_districts_in_the_Royal_Borough_of_Kingston_upon_Thames

In [None]:
# Creating a DataFrame for crimes in Kingston upon Thames (selected by name rather than by row position)
df_kingston = df_lon_least10[df_lon_least10['borough'] == 'Kingston upon Thames']
df_kingston = df_kingston[['borough', 'burglary', 'criminal_damage', 'drugs', 'other_notifiable_offences', 'robbery', 'theft_and_handling','violence_against_the_person']].set_index('borough')
df_kingston.head()

In [None]:
# Plotting the created DataFrame
ax = df_kingston.plot(kind='bar', figsize=(25, 10), rot=0)
ax.set_ylabel('Count of crimes committed')
ax.set_xlabel('Borough')
ax.set_title('Type of crimes committed in Kingston Upon Thames')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 16
               )
plt.show()

In [None]:
df_kingston

In [None]:
# Using 'https://en.wikipedia.org/wiki/List_of_districts_in_the_Royal_Borough_of_Kingston_upon_Thames' to create a DataFrame for neighbourhoods
kingston_neigh = ['Berrylands','Canbury','Chessington','Coombe','Hook','Kingston upon Thames','Kingston Vale','Malden Rushett','Motspur Park','New Malden','Norbiton','Old Malden','Seething Wells','Surbiton','Tolworth']
Borough = ['Kingston upon Thames'] * len(kingston_neigh)
Latitudes = []
Longitudes = []
# Creating the geocoder once and reusing it for every neighbourhood
geolocator = Nominatim(user_agent="London_agent")
for name in kingston_neigh:
    address = '{},London,United Kingdom'.format(name)
    location = geolocator.geocode(address)
    Latitudes.append(location.latitude)
    Longitudes.append(location.longitude)
kut_neigh = {'Neighbourhood': kingston_neigh,'Borough':Borough,'Latitudes': Latitudes,'Longitudes':Longitudes}
df_kingston_neigh = pd.DataFrame(data=kut_neigh, columns=['Neighbourhood', 'Borough', 'Latitudes', 'Longitudes'], index=None)
df_kingston_neigh
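
A geocoder may return no match for an address, in which case reading `.latitude` from the result fails. The sketch below shows one way to collect unresolved names instead of stopping. It is written against a generic `lookup` callable so that it runs offline; `geocode_all` and `fake_lookup` are hypothetical helpers and the coordinates are made up. With geopy, `lookup` could wrap `geolocator.geocode` and return `(location.latitude, location.longitude)`, or `None` when nothing is found.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coords = Tuple[float, float]


def geocode_all(
    names: List[str],
    lookup: Callable[[str], Optional[Coords]],
) -> Tuple[Dict[str, Coords], List[str]]:
    """Return coordinates for resolved names and a list of unresolved names."""
    found: Dict[str, Coords] = {}
    missing: List[str] = []
    for name in names:
        result = lookup('{},London,United Kingdom'.format(name))
        if result is None:
            missing.append(name)  # keep track of the name instead of crashing
        else:
            found[name] = result
    return found, missing


# Stand-in for a real geocoder so the sketch runs offline (made-up coordinates)
def fake_lookup(address: str) -> Optional[Coords]:
    table = {'Surbiton,London,United Kingdom': (51.39, -0.30)}
    return table.get(address)


found, missing = geocode_all(['Surbiton', 'Nowhereville'], fake_lookup)
print(found)
print(missing)
```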

In [None]:
# Creating a map of Kingston upon Thames, centred on the coordinates of one neighbourhood from the DataFrame (say Tolworth)
kingston_map = folium.Map(location=[51.378876, -0.282860], zoom_start=12)
for lati, longi, boro, neig in zip(df_kingston_neigh['Latitudes'], df_kingston_neigh['Longitudes'], df_kingston_neigh['Borough'], df_kingston_neigh['Neighbourhood']):
    lab = '{}, {}'.format(neig, boro)
    lab = folium.Popup(lab, parse_html=True)
    folium.CircleMarker(
        [lati, longi],
        radius=5,
        popup=lab,
        color='red',
        fill=True,
        fill_color='#fffc4a',
        fill_opacity=0.6).add_to(kingston_map)
kingston_map

### Using Foursquare to analyze neighbourhoods within 1 km (1000m)

In [None]:
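# Listing Foursquare credentials (read from environment variables so that secrets are not stored in the notebook;
# the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are illustrative)
import os
client_id = os.environ['FOURSQUARE_CLIENT_ID']
client_secret = os.environ['FOURSQUARE_CLIENT_SECRET']
version = '20180604'
limit = 30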

In [None]:
# Defining a function to get venues for above neighbourhoods
def kingston_venues(locales, lati, longi, radius=1000):
    kingston_venuelist=[]
    for loc, lat, long in zip(locales, lati, longi):
        print(loc)    
        # Passing the query as params lets requests build and encode the URL
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': client_id,
            'client_secret': client_secret,
            'v': version,
            'll': '{},{}'.format(lat, long),
            'radius': radius,
            'limit': limit}

        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        kingston_venuelist.append([(
            loc, 
            lat, 
            long, 
            k['venue']['name'], 
            k['venue']['location']['lat'], 
            k['venue']['location']['lng'],  
            k['venue']['categories'][0]['name']) for k in results])

    kingston_closeby_venues = pd.DataFrame([item for k in kingston_venuelist for item in k])
    kingston_closeby_venues.columns = ['Neighbourhood', 
                  'Latitudes_neighbourhood', 
                  'Longitudes_neighbourhood', 
                  'Venue', 
                  'Latitudes_venue', 
                  'Longitudes_venue', 
                  'Category_venue']
    return kingston_closeby_venues
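
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, which fails if a response has no groups or a venue has no category. A minimal sketch of a more defensive extraction step follows. `extract_venues` is a hypothetical helper, and `sample` is an invented payload shaped only like the fields the function above reads; it is not a real API response.

```python
def extract_venues(payload, neighbourhood):
    """Flatten an explore-style payload into (neighbourhood, name, lat, lng, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue has no category listed
        category = categories[0]['name'] if categories else None
        rows.append((neighbourhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload containing one complete venue and one without a category
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 51.4, 'lng': -0.3},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 51.41, 'lng': -0.31},
               'categories': []}},
]}]}}

rows = extract_venues(sample, 'Surbiton')
empty = extract_venues({'response': {}}, 'Hook')
print(rows)
print(empty)
```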

In [None]:
# Creating a DataFrame using the function defined above
df_kingston_venues = kingston_venues(locales=df_kingston_neigh['Neighbourhood'], lati=df_kingston_neigh['Latitudes'], longi=df_kingston_neigh['Longitudes'])

In [None]:
df_kingston_venues.shape

In [None]:
df_kingston_venues.head()

In [None]:
# Creating a DataFrame of one-hot encoded venue categories and averaging them per neighbourhood
df_kingston_transformed = pd.get_dummies(df_kingston_venues[['Category_venue']], prefix="", prefix_sep="")
df_kingston_transformed['Neighbourhood'] = df_kingston_venues['Neighbourhood'] 
move_column = [df_kingston_transformed.columns[-1]] + list(df_kingston_transformed.columns[:-1])
df_kingston_transformed = df_kingston_transformed[move_column]
df_kingston_grouped = df_kingston_transformed.groupby('Neighbourhood').mean().reset_index()
df_kingston_grouped.head()
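
To make the encoding step concrete, here is a minimal sketch on four invented venues (the frame `venues` is hypothetical). After one-hot encoding, the mean of each indicator column within a neighbourhood is the share of that neighbourhood's venues that fall into the category.

```python
import pandas as pd

# Hypothetical venues: two neighbourhoods, three venue categories
venues = pd.DataFrame({
    'Neighbourhood': ['Hook', 'Hook', 'Hook', 'Surbiton'],
    'Category_venue': ['Pub', 'Pub', 'Park', 'Cafe'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Category_venue'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of each indicator column is the share of venues in that category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

In this toy case two of the three venues in Hook are pubs, so its `Pub` share is 2/3. These shares are the features that the clustering step later works on.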

In [None]:
df_kingston_grouped.shape

In [None]:
# Defining a function that returns the `number` most common venue categories for one neighbourhood row
def kingston_common_venues (item, number):
    item_cat_1 = item.iloc[1:]
    item_cat_2 = item_cat_1.sort_values(ascending=False)
    
    return item_cat_2.index.values[0:number]

In [None]:
# Creating a DataFrame that lists the 15 most common venue categories of each Kingston neighbourhood
number = 15
columns = ['Neighbourhood']
for k in np.arange(number):
    columns.append('Top venue number {}'.format(k+1))
df_kingston_common_venues = pd.DataFrame(columns=columns)
df_kingston_common_venues['Neighbourhood'] = df_kingston_grouped['Neighbourhood']
for i in np.arange(df_kingston_grouped.shape[0]):
    df_kingston_common_venues.iloc[i, 1:] = kingston_common_venues(df_kingston_grouped.iloc[i, :], number)
df_kingston_common_venues
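
The ranking performed by `kingston_common_venues` can be illustrated on a single invented row (the Series `row` and its shares are hypothetical): the neighbourhood name is dropped, the remaining shares are sorted in descending order, and the leading labels are kept.

```python
import pandas as pd

# Hypothetical frequency row: the first entry is the name, the rest are shares
row = pd.Series({'Neighbourhood': 'Hook', 'Park': 0.2, 'Pub': 0.5, 'Hotel': 0.3})

# Drop the name, make the shares numeric, and sort from most to least frequent
shares = row.iloc[1:].astype(float).sort_values(ascending=False)
top_two = shares.index[:2].tolist()
print(top_two)
```

Categories that share the same frequency (for example the many zero-share categories in a small neighbourhood) are ordered arbitrarily by the sort, so the lower-ranked columns should be read with some care.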

### Clustering similar neighbourhoods using K-means and analysing them

In [None]:
# Performing k-means clustering
neigh_k = 6
df_kingston_neigh_clusters = df_kingston_grouped.drop(columns='Neighbourhood')
kingston_kmeans = KMeans(n_clusters=neigh_k, random_state=0).fit(df_kingston_neigh_clusters)
kingston_kmeans.labels_[0:10]
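
The notebook fixes `neigh_k = 6` without stating how the value was chosen. One common way to sanity-check such a choice is to compare the inertia (within-cluster sum of squares) across several values of k. The sketch below does this on a small invented matrix; `freqs` is hypothetical data, and this is not claimed to be the procedure that led to 6.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency matrix: rows are neighbourhoods, columns are categories
freqs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Inertia for a few candidate values of k
inertias = {}
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(freqs)
    inertias[k] = model.inertia_

# Cluster labels for k = 2; the label numbers themselves carry no meaning
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit(freqs).labels_
print(inertias)
print(labels)
```

Inertia always decreases as k grows, so the value of interest is the point after which the decrease becomes small. With only 15 neighbourhoods, a large k leaves several clusters with a single member.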

In [None]:
# Adding cluster labels to the common-venues DataFrame and joining it to the neighbourhood DataFrame
df_kingston_common_venues.insert(0, 'Labels', kingston_kmeans.labels_)
df_kingston_merged = df_kingston_neigh.join(df_kingston_common_venues.set_index('Neighbourhood'), on='Neighbourhood')
df_kingston_merged

In [None]:
df_kingston_merged

In [None]:
df_kingston_merged.shape

In [None]:
# Creating a visualization of the clusters, with the map centred on Berrylands
kingston_clusters_map = folium.Map(location=[51.393781, -0.284802], zoom_start=12)
# One colour per cluster
colors_array = mpcm.rainbow(np.linspace(0, 1, neigh_k))
rainbow = [cols.rgb2hex(i) for i in colors_array]
for lat, lon, poi, cluster in zip(df_kingston_merged['Latitudes'], df_kingston_merged['Longitudes'], df_kingston_merged['Neighbourhood'], df_kingston_merged['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(kingston_clusters_map)
       
kingston_clusters_map

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 0]

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 1]

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 2]

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 3]

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 4]

In [None]:
df_kingston_merged[df_kingston_merged['Labels'] == 5]
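
Instead of inspecting each cluster in a separate cell, the clusters could be summarised in one table. A minimal sketch, assuming a frame with `Labels` and `Top venue number 1` columns like the merged one above (the frame `merged` and its values are invented):

```python
import pandas as pd

# Hypothetical miniature of the merged table; values are invented
merged = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D', 'E'],
    'Labels': [0, 0, 1, 1, 1],
    'Top venue number 1': ['Pub', 'Pub', 'Park', 'Park', 'Hotel'],
})

# For each cluster: how many neighbourhoods it holds and its commonest top venue
summary = merged.groupby('Labels').agg(
    neighbourhoods=('Neighbourhood', 'count'),
    commonest_top_venue=('Top venue number 1', lambda s: s.mode().iloc[0]),
)
print(summary)
```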

## Part 5 - Results and Conclusion

The purpose of this project was to analyse the boroughs of London (UK) for safety, based on the crime records available in the public domain.
Based on the total number of crimes recorded in 2016, **Kingston upon Thames has been identified as the safest borough in London** for anyone planning to move to one of the most iconic cities in the world.

The project's secondary aim was to analyse the neighbourhoods of the borough and cluster them by their venues, to help a newcomer select a neighbourhood based on their individual preferences and tastes. Based on the clustering, **6 clusters are available to choose from, as listed below:**
* Cluster 1 - For individuals preferring pubs and coffee joints.
* Cluster 2 - For individuals preferring theme parks / attractions and pubs.
* Cluster 3 - For individuals preferring parks and restaurants.
* Cluster 4 - For individuals preferring hotels and stables.
* Cluster 5 - For individuals preferring train stations and convenience stores.
* Cluster 6 - For individuals preferring stables and grocery stores.