# Capstone Project - The Battle of the Neighborhoods
Applied Data Science Capstone by IBM/Coursera

## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import folium
import re

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we look for the best locations to open a fashion store in Paris. Paris is a hub of the luxury industry: all the major fashion brands, such as Louis Vuitton, Dior and Chanel, have a location in the city.

Paris's neighborhoods fall into two broad categories: districts with heavy tourist traffic and districts frequented mostly by locals. Our criterion is to target locations that combine a strong tourist presence with intense commercial activity, since these offer the best visibility.

## Data <a name="data"></a>

We need information about the neighborhoods of Paris. First, a dataset describing them is available at this <a href="https://opendata.paris.fr/">address</a>; it is open data published by the Paris administration. The district names come from this <a href="https://fr.geneawiki.com/index.php/Liste_des_quartiers_de_Paris">address</a>, and we then merge the two datasets. We also collect the latitude and longitude of the main monuments in Paris, manually, using this <a href="https://www.coordonnees-gps.fr/monuments/notre-dame-de-paris">site</a>.

Next, we extract the venues of each neighborhood from the Foursquare API.

### Data Collection

We load the dataset from `paris_district.csv`.

In [4]:
paris_datas = pd.read_csv('datas/paris_district.csv', sep=";")
paris_datas.head()

Unnamed: 0,N_SQ_QU,C_QU,C_QUINSEE,L_QU,C_AR,N_SQ_AR,PERIMETRE,SURFACE,Geometry X Y,Geometry
0,750000013,13,7510401,Saint-Merri,4,750000004,2346.004687,313040.4,"48.8585213723,2.35166696714","{""type"": ""Polygon"", ""coordinates"": [[[2.352623..."
1,750000016,16,7510404,Notre-Dame,4,750000004,3283.163371,378252.2,"48.8528955862,2.35277501212","{""type"": ""Polygon"", ""coordinates"": [[[2.361313..."
2,750000028,28,7510704,Gros-Caillou,7,750000007,4720.994373,1381893.0,"48.8582999039,2.30154155569","{""type"": ""Polygon"", ""coordinates"": [[[2.309544..."
3,750000007,7,7510203,Mail,2,750000002,2179.153605,278142.6,"48.8680083374,2.34469912743","{""type"": ""Polygon"", ""coordinates"": [[[2.346838..."
4,750000008,8,7510204,Bonne-Nouvelle,2,750000002,2233.97603,281448.2,"48.8671501183,2.35008019041","{""type"": ""Polygon"", ""coordinates"": [[[2.351518..."


The dataset contains information about each neighborhood of Paris: name, district number, neighborhood number, surface, latitude and longitude, etc. <br>
The dataset needs some cleaning, so we create a few helper functions.

In [5]:
# Create postal code with num district 
# ex: num_district = 1 => postal code = 75001 ; num_district = 20 => postal code = 75020
def makePostalCode(num_district):
    if num_district < 10:
        return "7500{}".format(num_district)
    else:
        return "750{}".format(num_district)

# Return the latitude 
def getLat(coords):
    coords = coords.split(",")
    lat = coords[0]
    return lat

# Return the longitude
def getLng(coords):
    coords = coords.split(",")
    lng = coords[1]
    return lng
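
As a minimal sketch, the same helpers can be written with zero-padded formatting and numeric parsing, so that latitude and longitude come back as floats rather than strings. The names `make_postal_code` and `parse_lat_lng` are hypothetical and not part of the notebook; the sample values come from the first row of the dataset.

```python
def make_postal_code(num_district):
    # Same rule as the comment above: 1 -> "75001", 20 -> "75020"
    return "75{:03d}".format(int(num_district))


def parse_lat_lng(coords):
    # "Geometry X Y" holds "latitude,longitude" as a single string
    lat, lng = coords.split(",")
    return float(lat), float(lng)


print(make_postal_code(4))
print(parse_lat_lng("48.8585213723,2.35166696714"))
```

Returning floats avoids a later conversion step if the coordinates are used in numeric computations such as distances.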

Now we can create the postal code, latitude and longitude columns.

In [6]:
paris_datas = paris_datas.copy()
paris_datas["Postal Code"] = paris_datas["C_AR"].map(lambda x : makePostalCode(x))
paris_datas["Latitude"] = paris_datas["Geometry X Y"].map(lambda x : getLat(x))
paris_datas["Longitude"] = paris_datas["Geometry X Y"].map(lambda x : getLng(x))
paris_datas["SURFACE"] = paris_datas["SURFACE"].map(lambda x : int(x)) 

paris_datas.head()

Unnamed: 0,N_SQ_QU,C_QU,C_QUINSEE,L_QU,C_AR,N_SQ_AR,PERIMETRE,SURFACE,Geometry X Y,Geometry,Postal Code,Latitude,Longitude
0,750000013,13,7510401,Saint-Merri,4,750000004,2346.004687,313040,"48.8585213723,2.35166696714","{""type"": ""Polygon"", ""coordinates"": [[[2.352623...",75004,48.8585213723,2.35166696714
1,750000016,16,7510404,Notre-Dame,4,750000004,3283.163371,378252,"48.8528955862,2.35277501212","{""type"": ""Polygon"", ""coordinates"": [[[2.361313...",75004,48.8528955862,2.35277501212
2,750000028,28,7510704,Gros-Caillou,7,750000007,4720.994373,1381893,"48.8582999039,2.30154155569","{""type"": ""Polygon"", ""coordinates"": [[[2.309544...",75007,48.8582999039,2.30154155569
3,750000007,7,7510203,Mail,2,750000002,2179.153605,278142,"48.8680083374,2.34469912743","{""type"": ""Polygon"", ""coordinates"": [[[2.346838...",75002,48.8680083374,2.34469912743
4,750000008,8,7510204,Bonne-Nouvelle,2,750000002,2233.97603,281448,"48.8671501183,2.35008019041","{""type"": ""Polygon"", ""coordinates"": [[[2.351518...",75002,48.8671501183,2.35008019041


We can now drop the columns that are not useful for this project.

In [7]:
# Drop useless columns
paris_datas.drop(["C_QUINSEE","PERIMETRE","N_SQ_QU","N_SQ_AR","Geometry X Y"] ,axis=1, inplace=True)

# Rename columns
paris_datas.rename(columns={'C_QU': 'Num_Neighborhood', 'L_QU': 'Neighborhood', "C_AR": 'Num District', "SURFACE":"Surface"}, inplace=True)

paris_datas.head()

Unnamed: 0,Num_Neighborhood,Neighborhood,Num District,Surface,Geometry,Postal Code,Latitude,Longitude
0,13,Saint-Merri,4,313040,"{""type"": ""Polygon"", ""coordinates"": [[[2.352623...",75004,48.8585213723,2.35166696714
1,16,Notre-Dame,4,378252,"{""type"": ""Polygon"", ""coordinates"": [[[2.361313...",75004,48.8528955862,2.35277501212
2,28,Gros-Caillou,7,1381893,"{""type"": ""Polygon"", ""coordinates"": [[[2.309544...",75007,48.8582999039,2.30154155569
3,7,Mail,2,278142,"{""type"": ""Polygon"", ""coordinates"": [[[2.346838...",75002,48.8680083374,2.34469912743
4,8,Bonne-Nouvelle,2,281448,"{""type"": ""Polygon"", ""coordinates"": [[[2.351518...",75002,48.8671501183,2.35008019041


The dataset does not contain the district names, so we retrieve them from https://fr.geneawiki.com/index.php/Liste_des_quartiers_de_Paris.
For this task, we use the BeautifulSoup library.

In [8]:
address_district = "https://fr.geneawiki.com/index.php/Liste_des_quartiers_de_Paris"
html_data = requests.get(address_district)

soup = BeautifulSoup(html_data.text, 'html5lib')

table = soup.find_all('table')[1]

datas = list()
for row in table.tbody.find_all('tr'):
    col = row.find_all('td')
    postal_code = col[1].text
    district_name = col[3].text.replace("\n", "")
    datas.append((postal_code,district_name))

paris_district = pd.DataFrame(datas, columns=["Postal Code", "District"])
paris_district.head()

Unnamed: 0,Postal Code,District
0,Code Postal,Quartiers
1,75001,Le Louvre
2,75002,La Bourse
3,75003,Le Temple
4,75004,L'Hôtel-de-Ville


We can merge the two tables on the postal code.

In [9]:
# Merge
paris_merged = paris_datas.merge(paris_district, left_on="Postal Code", right_on="Postal Code")

# Reorder columns
paris_merged = paris_merged[["Postal Code","Num District","District","Num_Neighborhood","Neighborhood","Surface","Latitude","Longitude","Geometry"]]

paris_merged.head(80)

Unnamed: 0,Postal Code,Num District,District,Num_Neighborhood,Neighborhood,Surface,Latitude,Longitude,Geometry
0,75004,4,L'Hôtel-de-Ville,13,Saint-Merri,313040,48.8585213723,2.35166696714,"{""type"": ""Polygon"", ""coordinates"": [[[2.352623..."
1,75004,4,L'Hôtel-de-Ville,16,Notre-Dame,378252,48.8528955862,2.35277501212,"{""type"": ""Polygon"", ""coordinates"": [[[2.361313..."
2,75004,4,L'Hôtel-de-Ville,15,Arsenal,487264,48.851585175,2.36476795387,"{""type"": ""Polygon"", ""coordinates"": [[[2.368512..."
3,75004,4,L'Hôtel-de-Ville,14,Saint-Gervais,422028,48.8557186509,2.35816233385,"{""type"": ""Polygon"", ""coordinates"": [[[2.363764..."
4,75007,7,Le Palais-Bourbon,28,Gros-Caillou,1381893,48.8582999039,2.30154155569,"{""type"": ""Polygon"", ""coordinates"": [[[2.309544..."
...,...,...,...,...,...,...,...,...,...
75,75001,1,Le Louvre,4,Place-Vendôme,269456,48.8670185906,2.32858166493,"{""type"": ""Polygon"", ""coordinates"": [[[2.331944..."
76,75015,15,Vaugirard,59,Grenelle,1478299,48.8501716555,2.2918526427,"{""type"": ""Polygon"", ""coordinates"": [[[2.300883..."
77,75015,15,Vaugirard,58,Necker,1578484,48.8427112503,2.31077745364,"{""type"": ""Polygon"", ""coordinates"": [[[2.306149..."
78,75015,15,Vaugirard,60,Javel,2609009,48.8390604011,2.27807634692,"{""type"": ""Polygon"", ""coordinates"": [[[2.282326..."
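
A quick way to check that a merge like this one did not silently drop or duplicate neighborhoods is to ask pandas to validate the join. The sketch below uses two small hand-made frames whose rows are copied from the tables above; `validate` and `indicator` are standard `DataFrame.merge` parameters, and `how="left"` is an assumption made here so that unmatched neighborhoods stay visible.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["75004", "75004", "75007"],
    "Neighborhood": ["Saint-Merri", "Notre-Dame", "Gros-Caillou"],
})
districts = pd.DataFrame({
    "Postal Code": ["Code Postal", "75004", "75007"],
    "District": ["Quartiers", "L'Hôtel-de-Ville", "Le Palais-Bourbon"],
})

# how="left" keeps every neighborhood; validate raises an error if a
# postal code appears more than once on the district side
merged = neighborhoods.merge(
    districts, on="Postal Code", how="left",
    validate="many_to_one", indicator=True,
)

# Rows flagged "left_only" are neighborhoods without a matching district
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

The scraped header row ("Code Postal", "Quartiers") matches no neighborhood, so it simply does not appear in the merged result.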


Now, we can collect venues for each neighborhood.

In [10]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 1000 # A default Foursquare API limit value
radius = 1000 # note: not passed to getNearbyVenues below, which uses its default radius of 500

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("searching on {} is done !".format(name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print("Process Done")
    return(nearby_venues)
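
The function above indexes straight into the JSON response, so a failed request or a venue without a category raises an exception. A more defensive sketch, assuming the response has the shape the notebook relies on (`response` → `groups[0]` → `items` → `venue`), could look like this. `build_explore_url` and `extract_venues` are hypothetical helper names, and both payloads below are made-up samples, not real API output.

```python
from urllib.parse import urlencode

API_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    # urlencode escapes the values and joins them with "&"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "{}?{}".format(API_URL, urlencode(params))


def extract_venues(payload):
    # Return (name, lat, lng, category) tuples, tolerating missing pieces
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example venue",
               "location": {"lat": 48.85, "lng": 2.35},
               "categories": []}},
]}]}}
print(extract_venues(sample))
print(extract_venues({"meta": {"code": 429}}))
```

Separating URL construction from response parsing also makes the parsing step testable without calling the API.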

In [12]:
paris_venues = getNearbyVenues(names=paris_merged['Neighborhood'],
                                   latitudes=paris_merged['Latitude'],
                                   longitudes=paris_merged['Longitude']
                                  )

searching on Saint-Merri is done !
searching on Notre-Dame is done !
searching on Arsenal is done !
searching on Saint-Gervais is done !
searching on Gros-Caillou is done !
searching on Invalides is done !
searching on Ecole-Militaire is done !
searching on Saint-Thomas-d'Aquin is done !
searching on Mail is done !
searching on Bonne-Nouvelle is done !
searching on Gaillon is done !
searching on Vivienne is done !
searching on Gare is done !
searching on Maison-Blanche is done !
searching on Croulebarbe is done !
searching on Salpêtrière is done !
searching on Clignancourt is done !
searching on Goutte-d'Or is done !
searching on Grandes-Carrières is done !
searching on La Chapelle is done !
searching on Faubourg-Montmartre is done !
searching on Rochechouart is done !
searching on Saint-Georges is done !
searching on Chaussée-d'Antin is done !
searching on Amérique is done !
searching on Pont-de-Flandre is done !
searching on Villette is done !
searching on Combat is done !
searching 

In [13]:
print('There are {} uniques categories.'.format(len(paris_venues['Venue Category'].unique())))
print('There are {} venues.'.format(len(paris_venues['Venue'])))

paris_venues.head()

There are 302 uniques categories.
There are 5055 venues.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Saint-Merri,48.8585213723,2.35166696714,Fleux',48.858763,2.354161,Furniture / Home Store
1,Saint-Merri,48.8585213723,2.35166696714,Place de l'Hôtel de Ville – Esplanade de la Li...,48.85701,2.351656,Plaza
2,Saint-Merri,48.8585213723,2.35166696714,Huygens Cosmetique Naturelle Sur Mesure,48.858938,2.353778,Cosmetics Shop
3,Saint-Merri,48.8585213723,2.35166696714,L'Alsacien,48.858275,2.350381,Alsatian Restaurant
4,Saint-Merri,48.8585213723,2.35166696714,Librairie Flammarion,48.86027,2.352148,Bookstore


In [14]:
paris_coord = [48.856614,2.3522219]
width, height = 1000, 1000

In [15]:
map_paris= folium.Map(location=paris_coord, zoom_start=12, width=width,height=height)
for i in range(0, paris_venues.shape[0]):
    lon = paris_venues.loc[i,"Venue Longitude"]
    lat = paris_venues.loc[i,"Venue Latitude"]
    color = 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_paris)
map_paris

## Methodology <a name="methodology"></a>

Our methodology follows these steps:
* Calculate the venue density per km² for each neighborhood
* Geolocate the monuments of Paris to compare venue density with tourist influence
* Cluster the neighborhoods in order to identify each one's specialization

## Analysis <a name="analysis"></a>

Let's calculate the venue density.

In [16]:
def round_number(nb, digit):
    return int(nb*10**digit) / 10**digit

def m2_to_km2(nb):
    return nb / 10**6

In [17]:
# Count venues by neighborhood
features = ["Neighborhood","Venue"]
count_venues_by_neighborhood = paris_venues[features].groupby(['Neighborhood'], as_index=False).count()

# Merge the count venues with main data about paris's neighborhood
neighborhood_venues_stat = count_venues_by_neighborhood.merge(paris_merged ,left_on = "Neighborhood", right_on="Neighborhood")

In [18]:
# Drop useless columns
neighborhood_venues_stat.drop(["Num_Neighborhood","Num District"],axis=1, inplace=True)
# The count column keeps the name "Venue": it is used under that name below

# convert surface in m² to km²
neighborhood_venues_stat["Surface"] = neighborhood_venues_stat["Surface"].map(lambda x : m2_to_km2(x))
neighborhood_venues_stat["Surface"] = neighborhood_venues_stat["Surface"].map(lambda x : round_number(x,3))

# venues density by km^2
neighborhood_venues_stat["Venues Density"] = neighborhood_venues_stat["Venue"]  / neighborhood_venues_stat["Surface"]
neighborhood_venues_stat["Venues Density"] = neighborhood_venues_stat["Venues Density"].map(lambda x : round_number(x, 3))

neighborhood_venues_stat

Unnamed: 0,Neighborhood,Venue,Postal Code,District,Surface,Latitude,Longitude,Geometry,Venues Density
0,Amérique,13,75019,Les Buttes-Chaumont,1.835,48.8816381673,2.39544016662,"{""type"": ""Polygon"", ""coordinates"": [[[2.409402...",7.084
1,Archives,100,75003,Le Temple,0.367,48.8591924127,2.36320505733,"{""type"": ""Polygon"", ""coordinates"": [[[2.368479...",272.479
2,Arsenal,67,75004,L'Hôtel-de-Ville,0.487,48.851585175,2.36476795387,"{""type"": ""Polygon"", ""coordinates"": [[[2.368512...",137.577
3,Arts-et-Métiers,100,75003,Le Temple,0.318,48.8664702895,2.35708313106,"{""type"": ""Polygon"", ""coordinates"": [[[2.360209...",314.465
4,Auteuil,16,75016,Passy,6.383,48.8506223427,2.25227690754,"{""type"": ""Polygon"", ""coordinates"": [[[2.249224...",2.506
...,...,...,...,...,...,...,...,...,...
75,Sorbonne,100,75005,Le Panthéon,0.433,48.8490447659,2.34574660019,"{""type"": ""Polygon"", ""coordinates"": [[[2.349244...",230.946
76,Ternes,65,75017,Les Batignolles-Monceau,1.465,48.8811775503,2.28996373812,"{""type"": ""Polygon"", ""coordinates"": [[[2.295039...",44.368
77,Val-de-Grâce,46,75005,Le Panthéon,0.703,48.841684288,2.34386092632,"{""type"": ""Polygon"", ""coordinates"": [[[2.345484...",65.433
78,Villette,57,75019,Les Buttes-Chaumont,1.285,48.8876610888,2.37446821213,"{""type"": ""Polygon"", ""coordinates"": [[[2.370498...",44.357
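
As a worked check of the computation above, the sketch below reproduces the Arsenal row (67 venues, 487264 m²) in plain Python. `truncate` mirrors `round_number`, which truncates rather than rounds; `venue_density` is a hypothetical name introduced only for this example.

```python
def truncate(value, digits):
    # Keep `digits` decimals without rounding, like round_number above
    factor = 10 ** digits
    return int(value * factor) / factor


def venue_density(venue_count, surface_m2):
    # Convert m² to km², truncate to 3 decimals, then divide
    surface_km2 = truncate(surface_m2 / 10**6, 3)
    return truncate(venue_count / surface_km2, 3)


print(venue_density(67, 487264))
```

This prints 137.577, the value shown for Arsenal in the table. Note that several neighborhoods report exactly 100 venues, which suggests the count is capped per query; for those rows the density is probably a lower bound.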


We now have the number of venues found for each neighborhood and the venue density.
Let's visualize the result on a folium map.

In [19]:
def decodeGeometry(geometry):
    # Extract every "[x, y]" pair of the polygon from the GeoJSON string
    coords = re.findall(r"\[([0-9]+\.[0-9]+), ([0-9]+\.[0-9]+)\]", geometry)

    # GeoJSON stores each position as [longitude, latitude]; keep that order
    coordinates = []
    for lng, lat in coords:
        coordinates.append([float(lng), float(lat)])
    return coordinates
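
Since the `Geometry` column holds a GeoJSON string, the standard `json` module is an alternative to a regular expression. The sketch below assumes each value is a single `Polygon` whose first ring is the outer boundary; `decode_geometry_json` is a hypothetical name and the coordinates are shortened, illustrative values rather than rows from the dataset.

```python
import json


def decode_geometry_json(geometry):
    # Parse the GeoJSON string and return the outer ring as [lng, lat] pairs
    shape = json.loads(geometry)
    if shape.get("type") != "Polygon":
        raise ValueError("expected a Polygon, got {!r}".format(shape.get("type")))
    return shape["coordinates"][0]


sample = ('{"type": "Polygon", "coordinates": '
          '[[[2.3526, 48.8590], [2.3530, 48.8580], [2.3510, 48.8585], [2.3526, 48.8590]]]}')
ring = decode_geometry_json(sample)
print(len(ring), ring[0])
```

Parsing the JSON keeps the ring structure intact, which matters if a geometry ever contains holes or is not a simple polygon.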

def create_geo_json(dataset):
    geo_json = {
        "type":"FeatureCollection",
        "features" : list()
    }
    
    for i in range(0,dataset.shape[0]):
        district = dataset.loc[i,"District"]
        postal_code = dataset.loc[i,"Postal Code"]
        neighborhood = dataset.loc[i,"Neighborhood"]
        surface = dataset.loc[i,"Surface"]
        density =  dataset.loc[i,"Venues Density"]
        coordinates = decodeGeometry(dataset.loc[i,"Geometry"])
        
        feature = {
            "type":"Feature",
            "id" : neighborhood,
            "properties" :{
                "District" : district,
                "Neighborhood" : neighborhood,
                "Postal Code" : postal_code,
                "Surface" : str(surface),
                "Density" : str(density)
            },
            "geometry" :{
                "type":"Polygon",
                "coordinates":[coordinates]
            }
        }
        geo_json["features"].append(feature)
    return geo_json

In [20]:
paris_map = folium.Map(location=paris_coord, zoom_start=12, width=width,height=height)

geo_district = create_geo_json(neighborhood_venues_stat)

choropleth_surface = folium.Choropleth(
    geo_data=geo_district,
    name="choropleth",
    data=neighborhood_venues_stat,
    key_on="feature.id",
    columns=["Neighborhood","Surface"],
    fill_color="YlOrRd",
    fill_opacity=0.6,
    line_opacity=0.5,
    highlight=True,
    legend_name="Surface(km²)",
).add_to(paris_map)

choropleth_surface.geojson.add_child(folium.features.GeoJsonTooltip(["District",'Neighborhood',"Postal Code", 'Surface'],labels=True))
paris_map

We can see that the smallest neighborhoods are in the city center, while the largest are on the outskirts.

In [21]:
map_paris = folium.Map(location=paris_coord, zoom_start=12, width=width,height=height)
choropleth_density = folium.Choropleth(
    geo_data=geo_district,
    name="choropleth",
    data=neighborhood_venues_stat,
    key_on="feature.id",
    columns=["Neighborhood","Venues Density"],
    fill_color="YlOrRd",
    fill_opacity=0.6,
    line_opacity=0.5,
    highlight=True,
    legend_name="Count venues by km²",
).add_to(map_paris)

monuments = pd.read_csv("paris_monuments.csv", sep=";", encoding="utf8")
for i in range(0,monuments.shape[0]):
    name_monument = monuments.loc[i,"Monuments"]
    long = monuments.loc[i,"Longitude"]
    lat = monuments.loc[i,"Lattitude"]
    folium.CircleMarker([long, lat], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_paris)
choropleth_density.geojson.add_child(folium.features.GeoJsonTooltip(["District",'Neighborhood',"Postal Code", 'Density'],labels=True))
map_paris

FileNotFoundError: [Errno 2] No such file or directory: 'paris_monuments.csv'

The venues are mainly located in the city center, which concentrates the historical neighborhoods with heavy tourist traffic. Most of the monuments are also in the city center, which suggests a correlation between venue density and the presence of monuments in a neighborhood.
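
One way to put a number on this relationship would be to measure how far each neighborhood lies from a reference point, such as a monument, and compare that distance with the venue density. A minimal sketch using the haversine formula follows; `haversine_km` is a hypothetical helper, 6371 km is the usual mean Earth radius, and the two points are the Notre-Dame neighborhood centroid and `paris_coord` from this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance between two (latitude, longitude) points
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


notre_dame = (48.8528955862, 2.35277501212)
paris_center = (48.856614, 2.3522219)
print(round(haversine_km(*notre_dame, *paris_center), 3))
```

With one distance per neighborhood, a rank correlation against the density column would indicate whether the visual impression from the map holds up.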

Now let's analyze the venue categories. First, we group all restaurants under the same label, and we do the same for bars and hotels.

In [None]:
def formatCategory(x):
    if "Restaurant" in x:
        return "Restaurant"
    # Group hotels and hostels under the same label
    if "Hotel" in x or "Hostel" in x:
        return "Hotel"
    if "Bar" in x:
        return "Bar"
    return x

paris_venues["Venue Category"] = paris_venues["Venue Category"].map(lambda x : formatCategory(x))

total_cat = len(paris_venues['Venue Category'].unique())
count_cat = paris_venues['Venue Category'].value_counts()

df = pd.DataFrame([], columns=["Category", "Count"])
df["Category"] = count_cat.index
df["Count"] = count_cat.values
# Share of each category among all venues
df["Part"] = df["Count"] / df["Count"].sum()

print('There are {} uniques categories.'.format(len(paris_venues['Venue Category'].unique())))
print('There are {} venues.'.format(len(paris_venues['Venue'])))

df.head(10)

Restaurants, bars and hotels represent roughly 12-13 percent of all venues. That is too much: if we kept them, the clustering would likely pick the same dominant features for every cluster. We want to target only shops and stores, so it is a good idea to remove these categories from our dataset.
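
The share of each category can also be computed directly with `value_counts(normalize=True)`, which divides each count by the total number of venues. The sketch below runs on a small made-up sample, not on the Foursquare data.

```python
import pandas as pd

categories = pd.Series(
    ["Restaurant", "Restaurant", "Bar", "Hotel", "Clothing Store",
     "Bookstore", "Plaza", "Bakery"],
    name="Venue Category",
)

# Fraction of all venues that falls in each category
share = categories.value_counts(normalize=True)

# Combined weight of the three grouped labels
grouped_share = share.reindex(["Restaurant", "Bar", "Hotel"]).sum()
print(grouped_share)
```

In this toy sample the three labels cover 4 of the 8 venues, so the combined share is 0.5.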

In [None]:
def toDelete(x):
    # Keep only store and shop categories; everything else becomes NaN
    if "Store" in x:
        return x
    if "Shop" in x:
        return x
    return np.nan

paris_venues = paris_venues.copy()

# Delete bar, restaurant, hotel
paris_venues["Venue Category"] = paris_venues["Venue Category"].map(lambda x : toDelete(x))
paris_venues.dropna(axis=0, inplace=True)

# Analyze part categories
total_cat = len(paris_venues['Venue Category'].unique())
count_cat = paris_venues['Venue Category'].value_counts()

df = pd.DataFrame([], columns=["Category", "Count"])
df["Category"] = count_cat.index
df["Count"] = count_cat.values
# Share of each category among the remaining venues
df["Part"] = df["Count"] / df["Count"].sum()

print('There are {} uniques categories.'.format(len(paris_venues['Venue Category'].unique())))
print('There are {} venues.'.format(len(paris_venues['Venue'])))

df.head(10)

In [None]:
paris_venues.reset_index(inplace=True)

In [None]:
from folium import plugins
from folium.plugins import HeatMap

stores_latlons = []
for i in range(0, paris_venues.shape[0]):
    lon = paris_venues.loc[i,"Venue Longitude"]
    lat = paris_venues.loc[i,"Venue Latitude"]
    stores_latlons.append([lat,lon])

map_paris = folium.Map(location=paris_coord, zoom_start=12, width=width,height=height)

for i in range(0,monuments.shape[0]):
    name_monument = monuments.loc[i,"Monuments"]
    long = monuments.loc[i,"Longitude"]
    lat = monuments.loc[i,"Lattitude"]
    folium.Marker(
        [long, lat], popup="<b>{0}</b>".format(name_monument)
    ).add_to(map_paris)

HeatMap(stores_latlons).add_to(map_paris)
map_paris

On this map we can observe 5 sectors with high commercial activity. <br>
The sectors are:  
* Champs Elysées
* Garnier Palace / Vendome / Madeleine Church
* Les Halles
* Le Marais
* Saint-Germain-des-Prés

We have now identified the target commercial sectors in Paris. Next, we cluster the neighborhoods in order to identify each one's specialization.

We first preprocess the data, then determine the most common venue categories for each neighborhood.

In [None]:
# one hot encoding
paris_onehot = pd.get_dummies(paris_venues[["Venue Category"]],  prefix="", prefix_sep="")
paris_onehot

# add neighborhood column back to dataframe
paris_onehot['Neighborhood'] = paris_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [paris_onehot.columns[-1]] + list(paris_onehot.columns[:-1])
paris_onehot = paris_onehot[fixed_columns]

paris_onehot.head()

In [None]:
paris_grouped_venues = paris_onehot.groupby('Neighborhood').mean().reset_index()
paris_grouped_venues
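
To make the two steps above concrete: one-hot encoding turns each venue into a row of 0/1 indicators, and averaging those rows per neighborhood gives the relative frequency of each category. The sketch below uses a tiny made-up sample with neighborhoods "A" and "B"; it groups by the neighborhood column directly instead of re-attaching it to the encoded frame.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Clothing Store", "Clothing Store", "Shoe Store", "Bookstore"],
})

# One indicator column per category (1.0 where the venue has that category)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)

# Mean of the indicators per neighborhood = category frequency
freq = onehot.groupby(venues["Neighborhood"]).mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable before clustering.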

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = paris_grouped_venues['Neighborhood']

for ind in np.arange(paris_grouped_venues.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(paris_grouped_venues.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(80)

We can start the clustering. Our dataset is ready.

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Create clustering dataset
X = paris_grouped_venues.drop(columns='Neighborhood')

In [None]:
# Define number of clusters
kclusters = 4

# Initialization kmeans
kmeans = KMeans(n_clusters=kclusters, random_state = 0)

# Fit
kmeans.fit(X)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
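
The value `kclusters = 4` is fixed by hand. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic data (not the Paris data), so `X_demo` and the candidate range 2 to 6 are assumptions; on the real data one would pass `X` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_demo)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real, noisier data the scores are usually closer together, so the result is a guide rather than a definitive answer.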

In [None]:
paris_data = neighborhood_venues_stat.merge(neighborhoods_venues_sorted,left_on="Neighborhood", right_on="Neighborhood")
paris_data.head()

In [None]:
def create_geo_json_with_cluster_label(dataset):
    geo_json = {
        "type":"FeatureCollection",
        "features" : list()
    }
    
    for i in range(0,dataset.shape[0]):
        district = dataset.loc[i,"District"]
        postal_code = dataset.loc[i,"Postal Code"]
        neighborhood = dataset.loc[i,"Neighborhood"]
        surface = dataset.loc[i,"Surface"]
        density =  dataset.loc[i,"Venues Density"]
        coordinates = decodeGeometry(dataset.loc[i,"Geometry"])
        label = dataset.loc[i,"Cluster Labels"]
        
        feature = {
            "type":"Feature",
            "id" : neighborhood,
            "properties" :{
                "District" : district,
                "Neighborhood" : neighborhood,
                "Postal Code" : postal_code,
                "Surface" : str(surface),
                "Density" : str(density),
                "Cluster Labels" : str(label)
            },
            "geometry" :{
                "type":"Polygon",
                "coordinates":[coordinates]
            }
        }
        geo_json["features"].append(feature)
    return geo_json

In [None]:
m = folium.Map(location=paris_coord, zoom_start=12, width=width,height=height)
geo_district_clustering = create_geo_json_with_cluster_label(paris_data)

In [None]:
choropleth_clustering = folium.Choropleth(
    geo_data=geo_district_clustering,
    name="choropleth",
    data=paris_data,
    key_on="feature.id",
    columns=["Neighborhood","Cluster Labels"],
    fill_color="Paired",
    fill_opacity=0.6,
    line_opacity=0.5,
    highlight=True,
    legend_name="Clustering Labels",
).add_to(m)

choropleth_clustering.geojson.add_child(folium.features.GeoJsonTooltip(["District",'Neighborhood',"Postal Code","Cluster Labels"],labels=True))
m

We can see that all the sectors we targeted previously share the same label, 1.
Let's analyze this cluster.

In [None]:
paris_data.loc[paris_data['Cluster Labels'] == 1].sort_values(by="Venues Density", ascending=False)

Looking at the most common stores in these neighborhoods, the trend is clearly toward clothing (women's stores, men's stores, shoe stores, clothing stores).
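
A compact way to read a cluster's specialization is to average the category frequencies of its neighborhoods and list the largest ones. The sketch below uses made-up frequencies for three hypothetical neighborhoods ("N1" to "N3"); with the notebook's data, the frequency table and `kmeans.labels_` would play the roles of `freq` and `labels`.

```python
import pandas as pd

freq = pd.DataFrame(
    {
        "Clothing Store": [0.6, 0.5, 0.1],
        "Bookstore": [0.1, 0.2, 0.7],
        "Shoe Store": [0.3, 0.3, 0.2],
    },
    index=["N1", "N2", "N3"],
)
labels = pd.Series([1, 1, 0], index=freq.index, name="Cluster Labels")

# Average category frequency per cluster
profile = freq.groupby(labels).mean()

# Two most frequent categories for each cluster label
top = {label: row.nlargest(2).index.tolist() for label, row in profile.iterrows()}
print(top)
```

Here cluster 1 is dominated by clothing and shoe stores, which is the kind of pattern described above for the targeted sectors.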

## Results and Discussion <a name="results"></a>

Our analysis shows that commercial activity is distributed asymmetrically across the city. The most attractive areas are located around the major monuments, along the Seine, in the old city. The center of Paris is also a transport hub: the biggest station, Châtelet-Les Halles, is served by 5 subway lines and 3 express train lines, with direct access to Charles de Gaulle Airport.
We found a lot of hotels, restaurants and bars and removed them from the dataset, which allowed us to focus the analysis on stores and shops. The clustering then identified a specialization for each neighborhood. As a result, the areas best suited to a fashion store are located around the Seine: south of the river we have Saint-Germain-des-Prés, and north of it the Champs-Élysées, Garnier Palace / Vendome, Châtelet / Les Halles and Le Marais. The main advantages of these areas are their visibility, thanks to heavy tourist traffic and the proximity of the major monuments, and their high commercial activity. Opening a store in one of these areas could serve as a showcase for the company.

## Conclusion <a name="conclusion"></a>

Our goal was to identify the best locations to open a fashion store in Paris. Our analysis allowed us to identify 5 areas with high potential. A store there would have excellent visibility among both tourists and locals. These areas are close to the major monuments and to the main bus, train and subway stations.