# Coursera Capstone Project
## This notebook will be used for the Coursera Capstone final deliverable project

## Problem Definition

Many countries are home to people of different faiths and cultures, which brings a wide variety of cuisines into each country. People often want to seek out a particular cuisine, e.g. Italian or Chinese, try it, and then recommend it (or not) to others. They would like to do this systematically, without missing any cuisine, and keep track of what they have tried. If all the restaurants in a particular city are clustered and presented together with their ratings, customers get an overview of the cuisines on offer and can pick the one that appeals to them.

The same data can also be explored to find the most sought-after cuisine and an appropriate location for opening a restaurant.

## Data Description and Processing

Data processing is at the core of machine learning. It is usually the most time-consuming part of a machine learning project and therefore requires scrupulous effort. If it is performed well and properly documented, the resulting output can offer high-quality insight to stakeholders.

The data used for the analysis is obtained in the following steps.
1. Scrape the Wikipedia table at https://en.wikipedia.org/wiki/List_of_areas_of_London to obtain the areas of London and their postcodes.
2. Use geocoder to find the latitude and longitude of each area and update the table.
3. Use the Foursquare API to obtain venues around each location for the analysis.

Import the necessary libraries, installing them if needed.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

## Data Obtaining and Cleaning
### 1. Web Scraping of Data From Wikipedia Table

In [2]:
!pip install beautifulsoup4
from bs4 import BeautifulSoup
import requests


In [3]:
# URL of the Wikipedia page; download its HTML content
url="https://en.wikipedia.org/wiki/List_of_areas_of_London"
html_content = requests.get(url).text
# parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content,'html.parser')

In [4]:
LondonAreas = soup.find("table", attrs={"class": "wikitable"})

In [5]:
def tableDataText(table):
    """Return the table as a list of rows, each row a list of cell strings."""
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row, add it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every remaining table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row
    return rows
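
The row extraction above relies on BeautifulSoup. As a rough illustration of what it produces, here is a minimal sketch of the same idea using only the standard library's `html.parser` (a swapped-in technique, not the notebook's method). The `TableRows` class and the `SAMPLE` snippet are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical snippet that mimics the shape of a wikitable
SAMPLE = """
<table class="wikitable">
  <tr><th>Location</th><th>Borough</th></tr>
  <tr><td>Abbey Wood</td><td>Bexley, Greenwich<sup>[7]</sup></td></tr>
  <tr><td>Acton</td><td>Ealing</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
table_rows = parser.rows
for row in table_rows:
    print(row)
```

Note how the footnote marker ends up glued to the borough name. Markers of this kind are what the later cleaning step removes.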

In [6]:
list_table = tableDataText(LondonAreas)
columns = ['Location','Borough','Post Town','Postcode','Dial Code','OS grid ref']
# list_table[0] is the header row, so the dataframe is built from the remaining rows
df = pd.DataFrame(list_table[1:], columns=columns)
# drop the Dial Code and OS grid ref columns, which are not needed for the analysis
df = df.drop(['Dial Code','OS grid ref'],axis=1)

Find the shape of the data.

In [7]:
df.shape

(533, 4)

It can be seen that the table now consists of 533 rows and 4 columns (the original 6 columns minus the 2 that were dropped).
Inspecting the table shows that some data cleaning is still needed before the analysis can be performed.

In [8]:
# Remove bracketed reference markers such as [1] from the Borough column
df['Borough'] = df['Borough'].str.split('[').str[0]
# Keep only the text before any opening parenthesis in the Location column
df['Location'] = df['Location'].str.split('(').str[0]
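
To make the effect of this cleaning step concrete, here is a small sketch on a toy dataframe. The row values and the `strip_markers` helper are illustrative assumptions, not data from the scraped table. The extra `.str.strip()` is an addition that removes any whitespace left in front of the bracket.

```python
import pandas as pd

# Toy rows (illustrative values, not taken from the scraped table)
toy = pd.DataFrame({
    "Location": ["Abbey Wood", "Example Park (also Sample Hill)"],
    "Borough": ["Bexley, Greenwich[7]", "Croydon[8]"],
})

def strip_markers(frame):
    """Return a copy with footnote markers and parenthetical notes removed."""
    out = frame.copy()
    # keep the text before the first "[" (footnote marker)
    out["Borough"] = out["Borough"].str.split("[").str[0].str.strip()
    # keep the text before the first "(" (parenthetical note)
    out["Location"] = out["Location"].str.split("(").str[0].str.strip()
    return out

cleaned = strip_markers(toy)
print(cleaned)
```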

In [9]:
df.head()

| | Location | Borough | Post Town | Postcode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3, W4 |
| 2 | Addington | Croydon | CROYDON | CR0 |
| 3 | Addiscombe | Croydon | CROYDON | CR0 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 |


In [10]:
# Only the LONDON post town is analysed, so every other row is dropped
df = df[df['Post Town'] == 'LONDON']
# Rows that list two postcodes are split into one row per postcode in the following cells

In [11]:
df.head()

| | Location | Borough | Post Town | Postcode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3, W4 |
| 6 | Aldgate | City | LONDON | EC3 |
| 7 | Aldwych | Westminster | LONDON | WC2 |
| 9 | Anerley | Bromley | LONDON | SE20 |


In [12]:
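# Split multi-postcode entries: the first postcode stays in Postcode, the second goes to LOC
# (LOC is NaN where there is only one postcode; strip removes the space after the comma)
df['LOC'] = df['Postcode'].str.split(',').str[1].str.strip()
df['Postcode'] = df['Postcode'].str.split(',').str[0]
print(df.shape)
df.head()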

(299, 5)


| | Location | Borough | Post Town | Postcode | LOC |
|---|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 | NaN |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3 | W4 |
| 6 | Aldgate | City | LONDON | EC3 | NaN |
| 7 | Aldwych | Westminster | LONDON | WC2 | NaN |
| 9 | Anerley | Bromley | LONDON | SE20 | NaN |


In [13]:
# Rows with a second postcode (non-NaN LOC) form a second table,
# in which that second postcode becomes the Postcode column
df_final = df.dropna()
df_final = df_final.drop(['Postcode'],axis=1)
df_final = df_final.rename(columns={'LOC':'Postcode'})
print(df_final.shape)
df_final.head()

(45, 4)


| | Location | Borough | Post Town | Postcode |
|---|---|---|---|---|
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W4 |
| 10 | Angel | Islington | LONDON | N1 |
| 15 | Arnos Grove | Enfield | LONDON | N14 |
| 51 | Blackheath Royal Standard | Greenwich | LONDON | SE12 |
| 56 | Bounds Green | Haringey | LONDON | N22 |


In [14]:
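# Drop the helper column, then stack the second-postcode rows under the original table
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = df.drop(['LOC'],axis=1)
concatenated = pd.concat([df, df_final])
concatenated.head()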

| | Location | Borough | Post Town | Postcode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3 |
| 6 | Aldgate | City | LONDON | EC3 |
| 7 | Aldwych | Westminster | LONDON | WC2 |
| 9 | Anerley | Bromley | LONDON | SE20 |
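
The split, drop, and concatenate steps above could also be written in one pass with `DataFrame.explode` (available from pandas 0.25). The sketch below is an alternative under that assumption, not the notebook's method, and the `explode_postcodes` helper and toy rows are hypothetical. One difference to be aware of: the cells above keep at most the first two postcodes of an entry, whereas `explode` keeps every postcode in the list.

```python
import pandas as pd

# Toy rows shaped like the table above
toy = pd.DataFrame({
    "Location": ["Abbey Wood", "Acton", "Aldgate"],
    "Postcode": ["SE2", "W3, W4", "EC3"],
})

def explode_postcodes(frame, column="Postcode"):
    """Return a copy with one row per postcode."""
    out = frame.copy()
    # "W3, W4" -> ["W3", " W4"]
    out[column] = out[column].str.split(",")
    # one row per list element; the other columns are repeated
    out = out.explode(column)
    # remove the space that followed each comma
    out[column] = out[column].str.strip()
    return out.reset_index(drop=True)

long_form = explode_postcodes(toy)
print(long_form)
```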


In [15]:
# Sort by location so that rows for the same area sit next to each other
concatenated = concatenated.sort_values(by ='Location' )
print(concatenated.shape)

(344, 4)


In [16]:
# Build a fresh 0..n-1 index named INDEX (n is the number of rows, 344 here)
concatenated['INDEX'] = np.arange(len(concatenated))
df_Lon = concatenated.set_index('INDEX')

In [17]:
# All remaining rows belong to the LONDON post town, so the Post Town column can be dropped.
# Preview the dataframe before latitudes and longitudes are added.
df_Lon=df_Lon.drop(['Post Town'],axis=1)
df_Lon.head()

| INDEX | Location | Borough | Postcode |
|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | W4 |
| 2 | Acton | Ealing, Hammersmith and Fulham | W3 |
| 3 | Aldgate | City | EC3 |
| 4 | Aldwych | Westminster | WC2 |


### 2. Obtaining Longitudes and Latitudes for Each Location

In [18]:
!pip install geocoder
import geocoder # import geocoder


In [19]:
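# Look up the [latitude, longitude] pair for a postcode with the ArcGIS geocoder
def lat_lan_finder(postCode):
    lat_lng_coords = None
    # retry until a result comes back; note that this loop never exits
    # if the geocoder keeps returning None for a postcode
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, London, United Kingdom'.format(postCode))
        lat_lng_coords = g.latlng
    return lat_lng_coords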
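
The retry loop above keeps going for as long as the geocoder returns nothing. A minimal sketch of a bounded variant follows, assuming a hypothetical `find_lat_lng` helper that receives the lookup as a parameter so it can be tried offline. The `make_flaky_lookup` stub and its coordinates are made up for the demonstration. With the real service, the lookup could be `lambda q: geocoder.arcgis(q).latlng`, which is the call the notebook already uses.

```python
import time

def find_lat_lng(postcode, lookup, max_attempts=3, pause_seconds=0.0):
    """Return [lat, lng] for a postcode, or None after max_attempts failed lookups."""
    query = "{}, London, United Kingdom".format(postcode)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # optional pause before the next attempt
    return None

def make_flaky_lookup(failures, result):
    """Hypothetical stub: returns None `failures` times, then `result`."""
    state = {"calls": 0}
    def lookup(query):
        state["calls"] += 1
        if state["calls"] <= failures:
            return None
        return result
    return lookup, state

# Succeeds on the second attempt
flaky, flaky_state = make_flaky_lookup(failures=1, result=[51.5, -0.1])
found = find_lat_lng("EC3", flaky)
print(found, "after", flaky_state["calls"], "calls")

# Never succeeds: the helper gives up instead of looping forever
dead, dead_state = make_flaky_lookup(failures=10, result=[0.0, 0.0])
missing = find_lat_lng("EC3", dead)
print(missing, "after", dead_state["calls"], "calls")
```

Returning `None` leaves it to the caller to decide how to handle postcodes that could not be geocoded, for example by dropping those rows.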

### Use the above function and find latitudes and longitudes for each postcode

In [24]:
# Geocode each row's postcode once, then split the [lat, lng] pair into two columns
coords = df_Lon['Postcode'].apply(lat_lan_finder)
df_Lon['Latitude'] = coords.apply(lambda c: c[0])
df_Lon['Longitude'] = coords.apply(lambda c: c[1])
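
Several locations share a postcode, so the number of geocoder requests could be reduced further by looking up each distinct postcode only once. The following is a sketch of that idea, assuming a hypothetical `add_coordinates` helper. The `fake_finder` stub returns made-up coordinates so the example runs without network access. In the notebook, `lat_lan_finder` would be passed instead.

```python
import pandas as pd

def add_coordinates(frame, finder, column="Postcode"):
    """Geocode each distinct postcode once and map the result onto every row."""
    lookup = {code: finder(code) for code in frame[column].unique()}
    out = frame.copy()
    out["Latitude"] = out[column].map(lambda code: lookup[code][0])
    out["Longitude"] = out[column].map(lambda code: lookup[code][1])
    return out

# Stub finder with made-up coordinates, used only to show the call pattern
calls = []
def fake_finder(postcode):
    calls.append(postcode)
    return [51.0 + len(postcode), -0.1]

# Toy rows: two locations share the postcode W4
toy = pd.DataFrame({
    "Location": ["Acton", "Chiswick", "Aldgate"],
    "Postcode": ["W4", "W4", "EC3"],
})

with_coords = add_coordinates(toy, fake_finder)
print(with_coords)
print("lookups made:", len(calls))
```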

In [25]:
# Preview the dataframe with the new coordinate columns
df_Lon.head()

| INDEX | Location | Borough | Postcode | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | SE2 | 51.49245 | 0.12127 |
| 1 | Acton | Ealing, Hammersmith and Fulham | W4 | 51.48944 | -0.26194 |
| 2 | Acton | Ealing, Hammersmith and Fulham | W3 | 51.51324 | -0.26746 |
| 3 | Aldgate | City | EC3 | 51.512 | -0.08058 |
| 4 | Aldwych | Westminster | WC2 | 51.51651 | -0.11968 |
