# The Battle of the Neighborhoods  
   
## Introduction
Pakistan, as a developing country, has limited resources and offers young people limited employment opportunities. As a result, many people migrate to developed countries in search of employment or business opportunities. This contributes to Pakistan's negative net migration rate (reference: https://www.indexmundi.com/pakistan/net_migration_rate.html), which means that more people leave the country each year than enter it. Young entrepreneurs who want to open their own businesses abroad often struggle and spend considerable resources researching the new market before they can make sound business decisions. They need a working method that helps them make better investment decisions without spending large sums on market research at the start.

## 1. Business Problem
One of the most appealing destinations for young entrepreneurs is London (United Kingdom) because of its multicultural and diverse population. London has also been a popular tourist destination for the past few decades, which makes it a dream city for businesses that want to reach a larger audience. Our business problem is therefore based on London: we want to assist an entrepreneur who wants to start a Pakistani restaurant there. Instead of paying agencies for market research or doing it herself, and still taking a large risk with her investment, she wants to know which area of London would be the most suitable for a Pakistani restaurant. The analysis would help her understand the population of different areas of London and the locations of different types of restaurants, so that she can make a better decision about where a Pakistani restaurant is likely to be profitable.  
Any entrepreneur from Pakistan would be interested in this project, since it can help them make better investment decisions and save a good amount of market research costs.

## 2. Data
To solve our business problem, we'll need data about the areas of London and the types of restaurants in those areas. We'll use the 'List of areas of London' from the Wikipedia page:  
https://en.wikipedia.org/wiki/List_of_areas_of_London  
We'll need to select the areas with the largest Asian population (since Pakistanis are classified under the Asian category). To do that, we can use the 'Demography of London' Wikipedia page, which contains a table that breaks down the proportion of each ethnic group by London borough:  
https://en.wikipedia.org/wiki/Demography_of_London  
After selecting the areas with the largest proportion of Asian population, we'll use Python's 'geocoder' package with the ArcGIS geocoder to get the latitude and longitude coordinates for all our locations.
Then, we'll use the Foursquare API to explore and gather the types of venues in our preferred areas. We'll use that data to analyze the venue category 'Asian Restaurants' and perform segmentation and clustering to draw meaningful conclusions about which cluster would be the most suitable for opening a Pakistani restaurant.
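
The coordinate lookup mentioned above could look roughly like the following. This is only a minimal sketch, assuming the `geocoder` package is installed; the helper name `get_latlng`, its parameters, and the ', London, United Kingdom' suffix added to each place name are illustrative choices and not part of the original notebook. The geocoding function is passed in as an argument so that the retry logic can be tried without network access.

```python
import time


def get_latlng(place, geocode=None, max_tries=3, pause=1.0):
    """Return (lat, lng) for a place name, or None if every attempt fails.

    `geocode` is any callable that takes an address string and returns an
    object with a `latlng` attribute. It defaults to geocoder.arcgis.
    """
    if geocode is None:
        import geocoder  # imported lazily so the helper can be tested offline
        geocode = geocoder.arcgis

    address = '{}, London, United Kingdom'.format(place)  # assumed address format
    for _ in range(max_tries):
        result = geocode(address)
        if result.latlng:          # empty or None when the lookup failed
            return tuple(result.latlng)
        time.sleep(pause)          # short pause before retrying
    return None
```

With the default arguments, `get_latlng('Acton')` would send the query 'Acton, London, United Kingdom' to the ArcGIS geocoder and retry up to three times before giving up.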

In [2]:
#Importing necessary libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import lxml

# Getting data about the areas of London
source = requests.get('https://en.wikipedia.org/wiki/List_of_areas_of_London').text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table',{'class':'wikitable sortable'})


In [53]:
# Storing the data from the Wikipedia page in a dataframe
data = []
columns = []

for index, tr in enumerate(table.findAll('tr')): #Using for loop to extract data and append it into 'content'
    content = []
    for td in tr.findAll(['th','td']):
        content.append(td.text.rstrip())
        
    if (index == 0):               #First row consisted of Header row elements, so we use the if-else function here
        columns = content
    else:
        data.append(content)       #From the second row onwards, we append the content in 'data'

df = pd.DataFrame(data = data, columns = columns)  #Creating a pandas Dataframe
df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


### Cleaning Data

Looking at the list of column names, we see that some of them contain non-breaking spaces (`\xa0`) instead of regular spaces, so we rename those columns.

In [37]:
list(df.columns)

['Location',
 'London\xa0borough',
 'Post town',
 'Postcode\xa0district',
 'Dial\xa0code',
 'OS grid ref']

In [55]:
df.rename(columns = {'London\xa0borough':'Borough', 'Postcode\xa0district':'Postcode', 'Dial\xa0code':'Dial code'}, inplace=True)

Then we remove the reference numbers after the Borough names, which are of no use to us.

In [56]:
# Strip a trailing reference marker such as '[8]' or ' [7]', then any space left behind
df['Borough'] = df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('[').rstrip())

In [57]:
df.head()

Unnamed: 0,Location,Borough,Post town,Postcode,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon,CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon,CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
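
As an alternative to chaining `rstrip` calls, the reference markers can be removed with a regular expression, which also drops the space in front of a marker such as ` [7]`. The snippet below is a small sketch on the sample values shown above; the variable names are illustrative.

```python
import pandas as pd

# Borough values as they appear in the scraped table
boroughs = pd.Series([
    'Bexley, Greenwich [7]',
    'Ealing, Hammersmith and Fulham[8]',
    'Croydon[8]',
    'Bexley',
])

# Remove any '[digits]' marker together with whitespace directly before it
cleaned = boroughs.str.replace(r'\s*\[\d+\]', '', regex=True).str.strip()
print(cleaned.tolist())
# ['Bexley, Greenwich', 'Ealing, Hammersmith and Fulham', 'Croydon', 'Bexley']
```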


We repeat the same process to get data about the Demography of London and store it into a dataframe demo_df.

In [41]:
source1 = requests.get('https://en.wikipedia.org/wiki/Demography_of_London').text
soup1 = BeautifulSoup(source1, 'lxml')

table1 = soup1.find('table',{'class':'wikitable sortable'})

In [42]:
# Storing the data from the Wikipedia page in a dataframe

data = []
columns = []

for index, tr in enumerate(table1.findAll('tr')): #Using for loop to extract data and append it into 'content'
    content = []
    for td in tr.findAll(['th','td']):
        content.append(td.text.rstrip())
        
    if (index == 0):               #First row consisted of Header row elements, so we use the if-else function here
        columns = content
    else:
        data.append(content)       #From the second row onwards, we append the content in 'data'

demo_df = pd.DataFrame(data = data, columns = columns)  #Creating a pandas Dataframe
demo_df.head()

Unnamed: 0,Local authority,White,Mixed,Asian,Black,Other
0,Barnet,64.1,4.8,18.5,7.7,4.8
1,Barking and Dagenham,58.3,4.2,15.9,20.0,1.6
2,Bexley,81.9,2.3,6.6,8.5,0.8
3,Brent,36.3,5.1,34.1,18.8,5.8
4,Bromley,84.3,3.5,5.2,6.0,0.9
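
Since the same row-extraction loop is used for both Wikipedia pages, it could be factored into a small helper. The following is a sketch only: the function name `table_to_dataframe` is hypothetical, it uses `strip()` where the notebook uses `rstrip()`, and the inline HTML is a cut-down stand-in for the real page so that the example runs without a network request.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Convert a BeautifulSoup <table> tag into a DataFrame (first row = header)."""
    rows = [
        [cell.get_text().strip() for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]
    return pd.DataFrame(rows[1:], columns=rows[0])


# Cut-down stand-in for the demography table, using two rows shown above
html = """
<table class="wikitable sortable">
  <tr><th>Local authority</th><th>Asian</th></tr>
  <tr><td>Barnet</td><td>18.5</td></tr>
  <tr><td>Brent</td><td>34.1</td></tr>
</table>
"""
sample = table_to_dataframe(BeautifulSoup(html, 'html.parser').find('table'))
print(sample)
```

Every cell comes back as a string, including the percentages, which matters as soon as the table is sorted by a numeric column.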


In [43]:
# Save the demographics table (demo_df), not the areas table (df)
demo_df.to_csv('Demographics of London.csv', index=False)

We sort the data by the 'Asian' column, since we're interested in finding the neighborhoods with the largest Asian proportion.

In [45]:
demo_df_sorted = demo_df.sort_values(by = 'Asian', ascending = False)
demo_df_sorted

Unnamed: 0,Local authority,White,Mixed,Asian,Black,Other
12,Haringey,60.5,6.5,9.5,18.8,4.7
27,Southwark,54.3,6.2,9.4,26.9,3.3
22,Lewisham,53.5,7.4,9.3,27.2,2.6
18,Islington,68.2,6.5,9.2,12.8,3.4
15,Hammersmith and Fulham,68.1,5.5,9.1,11.8,5.5
26,Richmond upon Thames,86.0,3.6,7.3,1.5,1.6
21,Lambeth,57.1,7.6,6.9,25.9,2.4
2,Bexley,81.9,2.3,6.6,8.5,0.8
4,Bromley,84.3,3.5,5.2,6.0,0.9
24,Newham,29.0,4.5,43.5,19.6,3.5


Looking at the data, we can see that the 'Asian' column is not sorted correctly: values such as 43.5 and 42.6 are placed below 9.5 and 9.4. This happens because the scraped values are stored as strings, which are compared character by character rather than numerically. We therefore convert this column to float and sort the values again.

In [52]:
demo_df_sorted['Asian'] = demo_df_sorted['Asian'].astype('float')
demo_df_sorted = demo_df_sorted.sort_values(by = 'Asian', ascending = False)
demo_df_sorted

Unnamed: 0,Local authority,White,Mixed,Asian,Black,Other
24,Newham,29.0,4.5,43.5,19.6,3.5
13,Harrow,42.2,4.0,42.6,8.2,2.9
25,Redbridge,42.5,4.1,41.8,8.9,2.7
29,Tower Hamlets,45.2,4.1,41.1,7.3,2.3
17,Hounslow,51.4,4.1,34.4,6.6,3.6
3,Brent,36.3,5.1,34.1,18.8,5.8
8,Ealing,49.0,4.5,29.7,10.9,6.0
16,Hillingdon,60.6,3.8,25.3,7.3,3.0
30,Waltham Forest,52.2,5.3,21.1,17.3,4.1
0,Barnet,64.1,4.8,18.5,7.7,4.8
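
The difference between the two sort orders can be reproduced on a few values taken from the table above. This is a small sketch; `pd.to_numeric` with `errors='coerce'` is used here as a more defensive alternative to `astype('float')`, because it turns any unparseable cell into NaN instead of raising an error.

```python
import pandas as pd

# A few 'Asian' percentages from the table, stored as strings,
# which is how they come out of the scraped HTML
shares = pd.DataFrame({
    'Local authority': ['Newham', 'Haringey', 'Harrow', 'Bexley'],
    'Asian': ['43.5', '9.5', '42.6', '6.6'],
})

# String comparison: '9.5' ranks above '43.5' because '9' > '4'
as_text = shares.sort_values(by='Asian', ascending=False)
print(as_text['Local authority'].tolist())
# ['Haringey', 'Bexley', 'Newham', 'Harrow']

# Numeric comparison after conversion
shares['Asian'] = pd.to_numeric(shares['Asian'], errors='coerce')
as_number = shares.sort_values(by='Asian', ascending=False)
print(as_number['Local authority'].tolist())
# ['Newham', 'Harrow', 'Haringey', 'Bexley']
```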


Now that we have the above datasets, we can narrow down our search to the top 10 neighborhoods with the largest proportion of Asian population. We can then extract their location coordinates using the geocoder package and use data from the Foursquare API to get the venues in those neighborhoods and start our analysis. This will be done in the second part of the project, next week.
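
The filtering step described above might be sketched as follows. The two small dataframes reuse rows shown earlier and stand in for `demo_df_sorted` and `df`; `top_n` is set to 3 instead of 10 only to keep the toy example readable, and the helper `in_top_borough` is a hypothetical name. Because a location can list several boroughs separated by commas, each name is checked separately.

```python
import pandas as pd

# Stand-ins for the demography and areas tables, using rows shown above
demo = pd.DataFrame({
    'Local authority': ['Newham', 'Harrow', 'Ealing', 'Bexley', 'Bromley'],
    'Asian': [43.5, 42.6, 29.7, 6.6, 5.2],
})
areas = pd.DataFrame({
    'Location': ['Abbey Wood', 'Acton', 'Addington', 'Albany Park'],
    'Borough': ['Bexley, Greenwich', 'Ealing, Hammersmith and Fulham',
                'Croydon', 'Bexley'],
})

top_n = 3  # the text uses 10; 3 keeps the toy example small
top_boroughs = set(demo.nlargest(top_n, 'Asian')['Local authority'])


def in_top_borough(borough_field):
    """True if any comma-separated borough name is in the top set."""
    names = [name.strip() for name in borough_field.split(',')]
    return any(name in top_boroughs for name in names)


selected = areas[areas['Borough'].apply(in_top_borough)]
print(selected['Location'].tolist())
# ['Acton']
```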