# IBM Applied Data Science Capstone

## Project title: "Indian Restaurant in Paris!"

**Introduction**

Paris is a beautiful, multicultural city of food lovers. It has a vibrant immigrant community, notably from India and other South Asian countries whose cuisines resemble Indian cuisine, and Indian food is popular worldwide. Opening a restaurant in Paris that specializes in Indian cuisine therefore looks like a promising business venture.


**Business Problem**

Paris is a large city made up of 20 arrondissements (administrative districts), populated by people of different ethnicities and nationalities, so identifying the right location is key to growing the business sustainably. This project assumes that a district which already has a good number of Indian restaurants is preferable, because that concentration suggests a larger Indian community in the area. The project explores the 20 districts and identifies the most suitable one for opening an Indian restaurant.

**Step 1: Scrape the list of Paris arrondissements from Wikipedia, extract the latitude and longitude of each district from a CSV file, and create a merged Pandas DataFrame**

Note: Because the geocoder was not working, the geospatial data was downloaded from https://www.data.gouv.fr/en/datasets/arrondissements-1/#_ and loaded into IBM Cloud storage.

In [1]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Area Code,Name,Latitude,Longitude
0,11,Popincourt,48.859059,2.380058
1,13,Gobelins,48.828388,2.362272
2,4,Hôtel-de-Ville,48.854341,2.35763
3,8,Élysée,48.872721,2.312554
4,18,Butte-Montmartre,48.892569,2.348161
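
Because the loading code was removed, the sketch below shows one way a frame like `df_data_1` could be built with `pandas.read_csv`. The inline CSV text simply repeats the five rows printed above. The real file's layout and the IBM Cloud storage access code are not shown in the notebook, so the column layout used here is an assumption.

```python
import io

import pandas as pd

# Assumption: the CSV has the same four columns as the output above.
# These rows are copied from that output; the real file has one row per district.
sample_csv = io.StringIO(
    "Area Code,Name,Latitude,Longitude\n"
    "11,Popincourt,48.859059,2.380058\n"
    "13,Gobelins,48.828388,2.362272\n"
    "4,Hôtel-de-Ville,48.854341,2.35763\n"
    "8,Élysée,48.872721,2.312554\n"
    "18,Butte-Montmartre,48.892569,2.348161\n"
)

# read_csv accepts any file-like object, so the same call works on a real file path.
df_data_1 = pd.read_csv(sample_csv)
print(df_data_1)
```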


In [2]:
import urllib.request
url = "https://en.wikipedia.org/wiki/Arrondissements_of_Paris"
page = urllib.request.urlopen(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")

all_tables = soup.find_all("table")
right_table = soup.find('table', class_='wikitable sortable')

# Only the first two columns are used later, so collect just those
A = []  # arrondissement number
B = []  # arrondissement name

# Data rows in this table have exactly 8 cells; header rows have none
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 8:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))

import pandas as pd
paris_df = pd.DataFrame(A,columns=['Arrondissement'])
paris_df['Name']=B

# df_data_1 is the geospatial DataFrame loaded in cell In [1]
geo = df_data_1

df = pd.merge(paris_df, geo, on="Name")

del df['Arrondissement']
del df['Area Code']

df

Unnamed: 0,Name,Latitude,Longitude
0,Louvre,48.862563,2.336443
1,Bourse,48.868279,2.342803
2,Temple,48.862872,2.360001
3,Hôtel-de-Ville,48.854341,2.35763
4,Panthéon,48.844443,2.350715
5,Luxembourg,48.84913,2.332898
6,Palais-Bourbon,48.856174,2.312188
7,Élysée,48.872721,2.312554
8,Opéra,48.877164,2.337458
9,Entrepôt,48.87613,2.360728
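
An inner merge on `Name` silently drops any district whose scraped name does not match the CSV exactly, for example because of trailing whitespace in the scraped text. The following is a minimal sketch of a check for that, using toy frames with hypothetical contents rather than the real `paris_df` and `geo`.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the geospatial CSV (hypothetical data).
scraped = pd.DataFrame({"Name": ["Louvre\n", "Bourse", "Temple"]})
geo_toy = pd.DataFrame({
    "Name": ["Louvre", "Bourse"],
    "Latitude": [48.862563, 48.868279],
    "Longitude": [2.336443, 2.342803],
})

# Normalise whitespace before merging so "Louvre\n" matches "Louvre".
scraped["Name"] = scraped["Name"].str.strip()

# A left merge with indicator=True keeps every scraped row and records
# in the "_merge" column whether a match was found in the geospatial frame.
checked = pd.merge(scraped, geo_toy, on="Name", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

If `unmatched` is non-empty, the listed names need to be reconciled before the inner merge is trusted.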


**Step 2: Install required packages and fetch the geographical coordinates of Paris city**

In [3]:
# Import all the necessary libraries and modules
import numpy as np

!pip install geopy
from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

print('Libraries imported.')

# Get the geographical coordinates of Paris
address = 'Paris, France'

geolocator = Nominatim(user_agent="paris_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

Libraries imported.
The geographical coordinates of Paris are 48.8566969, 2.3514616.


**Step 3: Create map of Paris city**

In [4]:
# Create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add markers to map
for district, lat, lng in zip(df['Name'], df['Latitude'], df['Longitude']):
    # Use the district name as the popup text
    label = folium.Popup(str(district), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_paris)
    
map_paris

**Step 4: Connect to the Foursquare API and explore the districts**

In [16]:
# The code was removed by Watson Studio for sharing.

In [6]:
import requests

radius = 500
LIMIT = 100

venues = []

for lat, lng, district in zip(df['Latitude'], df['Longitude'], df['Name']):
    
    # Create the request URL for the Foursquare explore endpoint
    # (CLIENT_ID, CLIENT_SECRET and VERSION come from the hidden cell above)
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lng,
        radius, 
        LIMIT)
    
    # Invoke the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # Return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            district,
            lat, 
            lng, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
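
The loop above formats the query string by hand and assumes every venue has at least one category. The sketch below shows one possible way to encode the parameters with the standard library and to skip venues whose category list is empty. The helper names `build_explore_url` and `extract_rows`, the placeholder credentials, the version string, and the hand-written payload are all illustrative assumptions; the payload only mirrors the nesting that the loop reads and is not real Foursquare output.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


def extract_rows(district, lat, lng, payload):
    """Turn one response payload into tuples, skipping venues without a category."""
    rows = []
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # categories[0] would raise IndexError here
        rows.append((district, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows


# Made-up payload with the same nesting the loop above navigates.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 48.876, "lng": 2.361},
               "categories": [{"name": "Indian Restaurant"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 48.877, "lng": 2.360},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20180605", 48.87613, 2.360728)
rows = extract_rows("Entrepôt", 48.87613, 2.360728, fake_payload)
print(url)
print(rows)
```

With `requests`, the same parameter dictionary can instead be passed through the `params` argument of `requests.get`, which performs the encoding itself.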

**Step 5: Convert the resulting venues list into a DataFrame**

In [7]:
# Convert the venues list into a Pandas DataFrame
venues_df = pd.DataFrame(venues)

# Define the column names
venues_df.columns = ['District', 'Latitude', 'Longitude', 'Venue_Name', 'Venue_Latitude', 'Venue_Longitude', 'Venue_Category']

venues_df.shape

# Check how many venues were returned for each District
venues_df.groupby(["District"]).count()

# Print the number of unique venue categories and the venues list
print('There are {} unique categories.'.format(len(venues_df['Venue_Category'].unique())))
venues_df['Venue_Category'].unique()[:50]

# Check if 'Indian Restaurant' is among the venue categories
"Indian Restaurant" in venues_df['Venue_Category'].unique()

There are 198 unique categories.


True

**Step 6: Analyze each district of Paris city and create separate DataFrame for 'Indian Restaurant' venue category**

In [8]:
# One-hot encoding
paris_onehot = pd.get_dummies(venues_df[['Venue_Category']], prefix="", prefix_sep="")

# Add District name column to the dataframe
paris_onehot['District'] = venues_df['District'] 

# Move the District column to the first position
fixed_columns = [paris_onehot.columns[-1]] + list(paris_onehot.columns[:-1])
paris_onehot = paris_onehot[fixed_columns]

paris_onehot.shape

# Group rows by District and take the mean of each one-hot column,
# i.e. the share of each venue category among the district's venues
paris_grouped = paris_onehot.groupby(["District"]).mean().reset_index()

print(paris_grouped.shape)
paris_grouped

len(paris_grouped[paris_grouped["Indian Restaurant"] > 0])

# Create a new DataFrame with only 'District' and the 'Indian Restaurant' venue category
paris_ind = paris_grouped[["District","Indian Restaurant"]]

(20, 199)
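
Taking the mean of the one-hot columns per district gives the share of each category among that district's returned venues. The small sketch below uses made-up venues (not Foursquare data) and checks the result against `pd.crosstab` with row normalisation, which computes the same shares directly.

```python
import pandas as pd

# Hypothetical venues: district A has 4, district B has 2.
toy = pd.DataFrame({
    "District": ["A", "A", "A", "A", "B", "B"],
    "Venue_Category": ["Indian Restaurant", "Café", "Café", "Bar", "Bar", "Café"],
})

# One-hot encode the category, then average each column within a district.
onehot = pd.get_dummies(toy["Venue_Category"], dtype=float)
shares = onehot.groupby(toy["District"]).mean()

# The same shares computed as count / row total.
expected = pd.crosstab(toy["District"], toy["Venue_Category"], normalize="index")

max_diff = (shares - expected).abs().max().max()
print(shares)
print("largest difference from crosstab:", max_diff)
```

Here district A has one Indian restaurant among four venues, so its share is 1/4. A share therefore depends on how many venues were returned for the district as well as on how many of them are Indian restaurants.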


**Step 7: Cluster the districts of Paris city**

In [9]:
# Set number of clusters
kclusters = 3

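# Keep only the numeric feature; the positional axis argument is not accepted by recent pandas versions
paris_cluster = paris_ind.drop(columns=["District"])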

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(paris_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# Create a new DataFrame that holds each district's Indian restaurant frequency and its cluster label
paris_merged = paris_ind.copy()

# Add clustering labels
paris_merged["Cluster Labels"] = kmeans.labels_

paris_merged

Unnamed: 0,District,Indian Restaurant,Cluster Labels
0,Batignolles-Monceau,0.0,0
1,Bourse,0.0,0
2,Butte-Montmartre,0.023256,1
3,Buttes-Chaumont,0.0,0
4,Entrepôt,0.04,2
5,Gobelins,0.0,0
6,Hôtel-de-Ville,0.0,0
7,Louvre,0.0,0
8,Luxembourg,0.0,0
9,Ménilmontant,0.0,0
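
With a single feature, k-means tries to group values that lie close together on the number line by minimising the within-cluster sum of squared distances. The sketch below computes that quantity for two ways of splitting the three non-zero frequencies reported in this notebook (0.023256, 0.030769 and 0.04), with the remaining 17 districts at zero. The helper `sse` is a hypothetical name introduced only for this illustration.

```python
def sse(groups):
    """Within-cluster sum of squared distances to each group's mean."""
    total = 0.0
    for values in groups:
        centre = sum(values) / len(values)
        total += sum((v - centre) ** 2 for v in values)
    return total


zeros = [0.0] * 17  # districts with no Indian restaurant in the sample

# Grouping reported by the notebook: the two lower non-zero values share a cluster.
reported = [zeros, [0.023256, 0.030769], [0.04]]

# Alternative split: the two higher non-zero values share a cluster.
alternative = [zeros, [0.023256], [0.030769, 0.04]]

print("reported grouping SSE:   ", sse(reported))
print("alternative grouping SSE:", sse(alternative))
```

The reported grouping has the smaller sum, which is consistent with 'Butte-Montmartre' and 'Vaugirard' landing in one cluster and 'Entrepôt' in its own.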


In [10]:
df.rename(columns={"Name": "District"}, inplace=True)

In [11]:
# Combine the DataFrame with clusters and the previous DataFrame with Latitude and Longitude of each district
paris_merged = paris_merged.join(df.set_index("District"), on="District")

paris_merged.shape

# Sort the resulting DataFrame by Cluster Labels
paris_merged.sort_values(["Cluster Labels"], inplace=True)
paris_merged

Unnamed: 0,District,Indian Restaurant,Cluster Labels,Latitude,Longitude
0,Batignolles-Monceau,0.0,0,48.887327,2.306777
17,Temple,0.0,0,48.862872,2.360001
16,Reuilly,0.0,0,48.834974,2.421325
15,Popincourt,0.0,0,48.859059,2.380058
14,Passy,0.0,0,48.860392,2.261971
13,Panthéon,0.0,0,48.844443,2.350715
12,Palais-Bourbon,0.0,0,48.856174,2.312188
11,Opéra,0.0,0,48.877164,2.337458
10,Observatoire,0.0,0,48.829245,2.326542
9,Ménilmontant,0.0,0,48.863461,2.401188


**Step 8: Visualize the clusters on map and examine each cluster based on label**

In [12]:
# Create map
map_paris_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map, colored by cluster label (labels run from 0 to kclusters - 1)
for lat, lng, district, cluster in zip(paris_merged['Latitude'], paris_merged['Longitude'], paris_merged['District'], paris_merged['Cluster Labels']):
    label = folium.Popup(str(district) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_paris_clusters)
       
map_paris_clusters

In [13]:
paris_merged.loc[paris_merged['Cluster Labels'] == 0]

Unnamed: 0,District,Indian Restaurant,Cluster Labels,Latitude,Longitude
0,Batignolles-Monceau,0.0,0,48.887327,2.306777
17,Temple,0.0,0,48.862872,2.360001
16,Reuilly,0.0,0,48.834974,2.421325
15,Popincourt,0.0,0,48.859059,2.380058
14,Passy,0.0,0,48.860392,2.261971
13,Panthéon,0.0,0,48.844443,2.350715
12,Palais-Bourbon,0.0,0,48.856174,2.312188
11,Opéra,0.0,0,48.877164,2.337458
10,Observatoire,0.0,0,48.829245,2.326542
9,Ménilmontant,0.0,0,48.863461,2.401188


In [14]:
paris_merged.loc[paris_merged['Cluster Labels'] == 1]

Unnamed: 0,District,Indian Restaurant,Cluster Labels,Latitude,Longitude
2,Butte-Montmartre,0.023256,1,48.892569,2.348161
18,Vaugirard,0.030769,1,48.840085,2.292826


In [15]:
paris_merged.loc[paris_merged['Cluster Labels'] == 2]

Unnamed: 0,District,Indian Restaurant,Cluster Labels,Latitude,Longitude
4,Entrepôt,0.04,2,48.87613,2.360728
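
The three cluster tables can be condensed into a single ranking. The sketch below hard-codes the non-zero frequencies copied from the outputs above and sorts them. It is only a convenience summary of values already shown, not a new analysis.

```python
# Non-zero 'Indian Restaurant' frequencies copied from the cluster tables above.
frequencies = {
    "Butte-Montmartre": 0.023256,
    "Vaugirard": 0.030769,
    "Entrepôt": 0.04,
}

# Sort districts from the highest to the lowest frequency.
ranking = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

for district, share in ranking:
    print("{:<18} {:.2%}".format(district, share))
```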





## Conclusion
In the Foursquare sample (up to 100 venues within 500 m of each district's coordinates), Indian restaurants appear only in 'Entrepôt' (cluster 2) and in 'Butte-Montmartre' and 'Vaugirard' (cluster 1). No Indian restaurants were returned for the districts in cluster 0. 'Entrepôt' has the highest share of Indian restaurants among its sampled venues (0.04), which, under the project's assumption that an existing concentration signals local demand, makes it the most viable district for opening and running an Indian restaurant.

Hence, the recommended district in Paris for opening an Indian restaurant is 'Entrepôt'.