# <b>Introduction/Business Problem Section</b>

Today, <b>wine bars</b> are very trendy in <b>Paris</b>. We would like to open our own in the French capital, but several questions must be answered before we decide on a specific location in the city.

<ol type="1">
<li> What customer base are we going to target?
    <ul><li> Are we targeting business people during their lunch break or in the evening? </li></ul>
</li>
<li> Where is the best place to open it?
    <ul><li> Some boroughs are more residential than others. <br /> => For example, the 14<sup>th</sup> and 16<sup>th</sup> boroughs are very residential. </li>
    <li> Some boroughs make it easier to open a new business (administrative procedures). <br /> => For example, in the 11<sup>th</sup> borough it is no longer possible to open a new bar. </li>
    <li> Is there a Metro or train station nearby? </li></ul></li>
<li> Who are the competitors?
<ul>
    <li> How many wine bars are already open in each borough? </li>
    <li> Are they doing well? </li>
    <li> What feedback do customers give (satisfaction level)? </li></ul></li>
</ol>

All of these questions can be addressed by extracting relevant data, which is described in the next section.

# <b>Data Section</b>

First, we extract the list of Paris boroughs: https://fr.wikipedia.org/wiki/Liste_des_quartiers_administratifs_de_Paris <br/>
As the page shows, each borough is divided into several quarters, which allows a finer-grained search.

In [2]:
import urllib3
import bs4 as BeautifulSoup
import requests

import pandas as pd
import numpy as np
import geocoder # To get latitude and longitude for our postal codes

from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium

from selenium import webdriver

In [3]:
# It is easier to extract the postal codes from this website than from the wiki page.
url = 'https://fr.geneawiki.com/index.php/Liste_des_quartiers_de_Paris'
http = urllib3.PoolManager()
response = http.request('GET', url)

soup = BeautifulSoup.BeautifulSoup(response.data, 'html.parser')  # Name the parser explicitly



### Find the table which contains all postal codes related to Paris

In [4]:
tables = soup.find_all('table')

# The Paris Boroughs table is the 2nd table in the web page
postalcodes_table = tables[1]

In [5]:
rows = postalcodes_table.find_all('tr')

In [6]:
headers = [] 
for cell in rows[0].find_all('td'):
    headers.append(cell.text.strip()) # remove \n
headers

['Code INSEE 1', 'Code Postal', 'Arrondissements', 'Quartiers']

In [7]:
# Translate the headers into English and normalize them
headers[headers.index('Code INSEE 1')] = 'insee1_code'
headers[headers.index('Code Postal')] = 'postal_code'
headers[headers.index('Arrondissements')] = 'borough'
headers[headers.index('Quartiers')] = 'quarters'
headers

['insee1_code', 'postal_code', 'borough', 'quarters']

In [8]:
data = []
for row in rows[1:]: # Exclude headers line
    line = {}
    for cell_num, cell in enumerate(row.find_all('td')):
        cell = cell.text.strip()
        if cell_num <= 1: # Indexes 0 and 1
            line[headers[cell_num]] = cell 
        elif cell_num == 4:
            line['quarters'] = cell.split('\n')
        else: # indexes 2 and 3 -> Join these 2 values, for example: "I - Le Louvre"
            borough = headers[2]
            if borough not in line:
                line[borough] = []
            line[borough].append(cell)
    data.append(line)

In [9]:
df = pd.DataFrame(data)
df

Unnamed: 0,borough,insee1_code,postal_code,quarters
0,"[I, Le Louvre]",75101,75001,"[01 - Saint-Germain-l'Auxerrois, 02 - Les Hall..."
1,"[II, La Bourse]",75102,75002,"[05 - Gaillon, 06 - Vivienne, 07 - Le Mail, 08..."
2,"[III, Le Temple]",75103,75003,"[09 - Les Arts-et-Métiers, 10 - Les Enfants-Ro..."
3,"[IV, L'Hôtel-de-Ville]",75104,75004,"[13 - Saint-Merri, 14 - Saint-Gervais, 15 - L'..."
4,"[V, Le Panthéon]",75105,75005,"[17 - Saint-Victor, 18 - Le Jardin-des-Plantes..."
5,"[VI, Le Luxembourg]",75106,75006,"[21 - La Monnaie, 22 - L'Odéon, 23 - Notre-Dam..."
6,"[VII, Le Palais-Bourbon]",75107,75007,"[25 - Saint-Thomas-d'Aquin, 26 - Les Invalides..."
7,"[VIII, L'Élysée]",75108,75008,"[29 - Les Champs-Élysées, 30 - Le Faubourg-du-..."
8,"[IX, L'Opéra]",75109,75009,"[33 - Saint-Georges, 34 - La Chaussée-d'Antin,..."
9,"[X, L'Enclos-Saint-Laurent]",75110,75010,"[37 - Saint-Vincent-de-Paul, 38 - La Porte-Sai..."


# Reshape our DataFrame

In [10]:
df['borough'] = df['borough'].str.join(' - ')

In [11]:
quarters_df_tmp = df['quarters'].apply(pd.Series)

In [12]:
quarters_df = []
for col in quarters_df_tmp.columns:
    quarters_df.append(quarters_df_tmp[col])
quarters_df = pd.concat(quarters_df)

In [13]:
postal_codes_df = pd.merge(df, pd.DataFrame(quarters_df), left_index=True, right_index=True)
postal_codes_df.drop('quarters', axis=1, inplace=True) # Not needed anymore. Replaced by next statement
postal_codes_df.rename(columns={0: 'quarters'}, inplace=True)
postal_codes_df.reset_index(drop=True, inplace=True)
postal_codes_df.head(10)

Unnamed: 0,borough,insee1_code,postal_code,quarters
0,I - Le Louvre,75101,75001,01 - Saint-Germain-l'Auxerrois
1,I - Le Louvre,75101,75001,02 - Les Halles
2,I - Le Louvre,75101,75001,03 - Le Palais-Royal
3,I - Le Louvre,75101,75001,04 - La Place-Vendôme
4,II - La Bourse,75102,75002,05 - Gaillon
5,II - La Bourse,75102,75002,06 - Vivienne
6,II - La Bourse,75102,75002,07 - Le Mail
7,II - La Bourse,75102,75002,08 - Bonne-Nouvelle
8,III - Le Temple,75103,75003,09 - Les Arts-et-Métiers
9,III - Le Temple,75103,75003,10 - Les Enfants-Rouges
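The column-by-column concatenation and merge above could also be written with `DataFrame.explode`, available in pandas 0.25 and later. The following is a minimal sketch on a two-row sample shaped like `df`. The sample values are copied from the table above, and the names `sample_df` and `exploded_df` are illustrative only.

```python
import pandas as pd

# Small sample shaped like `df`: one row per borough, quarters stored as a list.
sample_df = pd.DataFrame({
    'borough': ['I - Le Louvre', 'II - La Bourse'],
    'postal_code': ['75001', '75002'],
    'quarters': [
        ["01 - Saint-Germain-l'Auxerrois", '02 - Les Halles'],
        ['05 - Gaillon', '06 - Vivienne'],
    ],
})

# explode() emits one row per list element and repeats the other columns.
exploded_df = sample_df.explode('quarters').reset_index(drop=True)
print(exploded_df)
```

This keeps the rows grouped by borough without an intermediate merge, which may be easier to read than the approach above.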


Next, we need the latitude and longitude of each quarter. This data is available on each quarter's Wikipedia page.<br />We therefore create a browser bot that finds the relevant Wikipedia page for each quarter found previously. The coordinates of each quarter are then read from the 'data-lat' and 'data-lon' attributes of an \<a\> tag on that page.

In [14]:
# geocoder does not give us the coordinates of the quarters.
# Let's scrape Wikipedia to get the coordinates instead.

def coordinates_from(url):
    """
    Find the coordinates from the given URL
    """
    http = urllib3.PoolManager()
    response = http.request('GET', url)

    soup = BeautifulSoup.BeautifulSoup(response.data, 'html.parser')  # Name the parser explicitly

    # Fall back to (0, 0) when no tag carries the coordinate attributes
    latitude, longitude = 0, 0
    for a_tag in soup.find_all('a'):
        if 'data-lat' in a_tag.attrs:
            latitude, longitude = a_tag['data-lat'], a_tag['data-lon']
            break
    return latitude, longitude
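The attribute lookup above can be tried offline, without a network call. The sketch below uses only the standard library's `html.parser` on a hypothetical HTML fragment; the markup is illustrative and not copied from Wikipedia, and the class name `CoordinateParser` is an assumption made for this example.

```python
from html.parser import HTMLParser


class CoordinateParser(HTMLParser):
    """Record data-lat / data-lon from the first tag that carries both."""

    def __init__(self):
        super().__init__()
        self.coordinates = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (self.coordinates is None
                and 'data-lat' in attributes
                and 'data-lon' in attributes):
            self.coordinates = (float(attributes['data-lat']),
                                float(attributes['data-lon']))


# Hypothetical fragment: the markup is illustrative, not copied from Wikipedia.
sample_html = ('<p><a href="#">No coordinates</a>'
               '<a data-lat="48.860112" data-lon="2.340195">Map</a></p>')

parser = CoordinateParser()
parser.feed(sample_html)
print(parser.coordinates)
```

In this sketch `coordinates` stays `None` when nothing matches, which makes a failed lookup easy to tell apart from a real location.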

In [15]:
quarters = []
for quarter in postal_codes_df['quarters']:
    # Transform '01 - Saint-Germain-l'Auxerrois' -> 'Saint Germain l Auxerrois'
    quarters.append(quarter[5:].replace('-', ' ').replace('\'', ' '))

def find_best_url(a_tags, words_to_find):
    """
    Find the best URL among the given tags: [<a href='url'>, ...]
    'words_to_find': list of words that must all appear in the URL
    """
    for a_tag in a_tags:
        has_matched = True
        url_candidate = a_tag.get_attribute('href').lower()
        for word in words_to_find:
            if word.lower() not in url_candidate:
                has_matched = False
                break
        if has_matched:
            return a_tag.get_attribute('href') # All words have been found
    return '' # No URLs have satisfied the pre-requisites. Do not return any.

from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

def url_from(query):
    """
    Find the URL (wiki page) based on the given query.
    For example:
    - query: Saint-Germain-l'Auxerrois
    - Will return: https://fr.wikipedia.org/wiki/Quartier_Saint-Germain-l%27Auxerrois
    We look up the wiki page of each quarter because it contains the quarter's coordinates.
    """
    # Selenium 4 API: the driver path is passed through a Service object
    with webdriver.Firefox(service=Service(r'/home/thomas/geckodriver')) as driver:
        # Add 'wiki' and 'quartier' to the query to improve the results
        driver.get('https://www.google.com/search?q=wiki quartier {}'.format(query))
        a_tags = driver.find_elements(By.PARTIAL_LINK_TEXT, 'fr.wikipedia.org')
        try:
            return find_best_url(a_tags, ['wiki', 'quartier'])
        except IndexError:
            return ''
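The matching rule used by `find_best_url` (keep the first link whose URL contains every expected word, ignoring case) can be illustrated without a browser. The sketch below works on plain strings instead of Selenium elements; the helper name `best_url` and the candidate list are assumptions made for this example.

```python
def best_url(urls, words_to_find):
    """Return the first URL containing every word (case-insensitive), or ''."""
    for url in urls:
        lowered = url.lower()
        if all(word.lower() in lowered for word in words_to_find):
            return url
    return ''


# Illustrative candidates, in the order a search page might list them.
candidates = [
    'https://fr.wikipedia.org/wiki/Paris',
    'https://fr.wikipedia.org/wiki/Quartier_des_Halles',
]

print(best_url(candidates, ['wiki', 'quartier']))
print(repr(best_url(candidates, ['wiki', 'arrondissement'])))
```

The first candidate is skipped because it lacks 'quartier', and the second call finds no match at all, so it yields an empty string.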

In [40]:
wiki_quarter_urls = {}
for quarter_name, quarter_normalized in zip(postal_codes_df['quarters'], quarters):
    quarter_url = url_from(quarter_normalized)
    print('{}: {}'.format(quarter_name, quarter_url))
    wiki_quarter_urls[quarter_name] = (quarter_url)

01 - Saint-Germain-l'Auxerrois: https://fr.wikipedia.org/wiki/Quartier_Saint-Germain-l%27Auxerrois
02 - Les Halles: https://fr.wikipedia.org/wiki/Quartier_des_Halles
03 - Le Palais-Royal: https://fr.wikipedia.org/wiki/Quartier_du_Palais-Royal
04 - La Place-Vendôme: https://fr.wikipedia.org/wiki/Quartier_de_la_Place-Vend%C3%B4me
05 - Gaillon: https://fr.wikipedia.org/wiki/Quartier_Gaillon
06 - Vivienne: https://fr.wikipedia.org/wiki/Quartier_Vivienne
07 - Le Mail: https://fr.wikipedia.org/wiki/Quartier_du_Mail
08 - Bonne-Nouvelle: https://fr.wikipedia.org/wiki/Quartier_de_Bonne-Nouvelle
09 - Les Arts-et-Métiers: https://fr.wikipedia.org/wiki/Quartier_des_Arts-et-M%C3%A9tiers
10 - Les Enfants-Rouges: https://fr.wikipedia.org/wiki/Quartier_des_Enfants-Rouges
11 - Les Archives: https://fr.wikipedia.org/wiki/Quartier_des_Archives
12 - Saint-Avoye: https://fr.wikipedia.org/wiki/Quartier_Sainte-Avoye
13 - Saint-Merri: https://fr.wikipedia.org/wiki/Quartier_Saint-Merri
14 - Saint-Gervais: http

In [43]:
# Let's add the coordinates into our DataFrame
postal_codes_df['latitude'] = np.nan
postal_codes_df['longitude'] = np.nan

# Match each URL to its row by quarter name rather than by position
for quarter_name, quarter_url in wiki_quarter_urls.items():
    latitude, longitude = coordinates_from(quarter_url)
    is_quarter = postal_codes_df['quarters'] == quarter_name
    postal_codes_df.loc[is_quarter, 'latitude'] = float(latitude)
    postal_codes_df.loc[is_quarter, 'longitude'] = float(longitude)
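Because `coordinates_from` falls back to `0, 0` when no tag carries the coordinate attributes, a failed lookup would be stored silently as a point at latitude 0, longitude 0. A minimal sketch for flagging such rows follows; the sample frame is hypothetical, and its second row is invented to stand in for a failed lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row mimics a lookup that found no coordinates.
sample = pd.DataFrame({
    'quarters': ['02 - Les Halles', '99 - Hypothetical quarter'],
    'latitude': [48.862541, 0.0],
    'longitude': [2.344744, 0.0],
})

# Treat the (0, 0) fallback as missing so it cannot pass for a real location.
failed = (sample['latitude'] == 0) & (sample['longitude'] == 0)
sample.loc[failed, ['latitude', 'longitude']] = np.nan

print(sample[sample['latitude'].isna()]['quarters'].tolist())
```

Rows flagged this way could then be looked up again or corrected by hand before the data is saved.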



In [50]:
# Save the result to avoid repeating the scraping later
# postal_codes_df.to_csv('postal_codes_df.csv', index=False)

In [16]:
postal_codes_df = pd.read_csv('postal_codes_df.csv')

In [17]:
postal_codes_df.head(10)

Unnamed: 0,borough,insee1_code,postal_code,quarters,latitude,longitude
0,I - Le Louvre,75101,75001,01 - Saint-Germain-l'Auxerrois,48.860112,2.340195
1,I - Le Louvre,75101,75001,02 - Les Halles,48.862541,2.344744
2,I - Le Louvre,75101,75001,03 - Le Palais-Royal,48.864912,2.337749
3,I - Le Louvre,75101,75001,04 - La Place-Vendôme,48.867495,2.329402
4,II - La Bourse,75102,75002,05 - Gaillon,48.869083,2.332867
5,II - La Bourse,75102,75002,06 - Vivienne,48.869069,2.339176
6,II - La Bourse,75102,75002,07 - Le Mail,48.867982,2.344615
7,II - La Bourse,75102,75002,08 - Bonne-Nouvelle,48.866776,2.350087
8,III - Le Temple,75103,75003,09 - Les Arts-et-Métiers,48.866253,2.356846
9,III - Le Temple,75103,75003,10 - Les Enfants-Rouges,48.864729,2.363155


# Explore locations using Foursquare API

In [24]:
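def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    # CLIENT_ID, CLIENT_SECRET and VERSION (Foursquare credentials and API version)
    # must be defined before this function is called.
    venues_list = []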
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 
              'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    data = [item for venue_list in venues_list for item in venue_list]
    
    return pd.DataFrame(data=data, columns=columns)
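The nested lookups in `getNearbyVenues` can be followed more easily on a small example. The sketch below flattens a hypothetical payload that reproduces only the keys the function reads. The payload itself, the helper name `flatten_items` and the fallback to `None` for a venue without categories are assumptions made for this example, not documented Foursquare behaviour.

```python
# Hypothetical payload: only the keys read by getNearbyVenues are reproduced.
sample_response = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Le Fumoir',
                   'location': {'lat': 48.860424, 'lng': 2.340868},
                   'categories': [{'name': 'Cocktail Bar'}]}},
        {'venue': {'name': 'Hypothetical venue',
                   'location': {'lat': 48.86, 'lng': 2.34},
                   'categories': []}},
    ]}]}
}


def flatten_items(payload):
    """Turn the nested items into (name, lat, lng, category) tuples."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'], venue['location']['lat'],
                     venue['location']['lng'], category))
    return rows


for row in flatten_items(sample_response):
    print(row)
```

Guarding the category lookup avoids losing a whole batch of venues because one of them has an empty category list.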

In [29]:
# paris_venues = getNearbyVenues(postal_codes_df['borough'], latitudes=postal_codes_df['latitude'], longitudes=postal_codes_df['longitude'])
# paris_venues.to_csv('paris_venues.csv', index=False)

In [8]:
# Load the cached results to avoid calling the Foursquare API again (calls are limited per day).
paris_venues = pd.read_csv('paris_venues.csv')
paris_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,I - Le Louvre,48.860112,2.340195,Cour Carrée du Louvre,48.86036,2.338543,Pedestrian Plaza
1,I - Le Louvre,48.860112,2.340195,Place du Louvre,48.859841,2.340822,Plaza
2,I - Le Louvre,48.860112,2.340195,La Vénus de Milo (Vénus de Milo),48.859943,2.337234,Exhibit
3,I - Le Louvre,48.860112,2.340195,Le Fumoir,48.860424,2.340868,Cocktail Bar
4,I - Le Louvre,48.860112,2.340195,Église Saint-Germain-l'Auxerrois (Église Saint...,48.85952,2.341306,Church
5,I - Le Louvre,48.860112,2.340195,La Régalade Saint-Honoré,48.86162,2.341749,French Restaurant
6,I - Le Louvre,48.860112,2.340195,Pont des Arts,48.858565,2.337635,Bridge
7,I - Le Louvre,48.860112,2.340195,Pret A Manger,48.861811,2.341311,Sandwich Place
8,I - Le Louvre,48.860112,2.340195,Coffee Crêpes,48.858841,2.340802,Coffee Shop
9,I - Le Louvre,48.860112,2.340195,Boutique yam'Tcha,48.86171,2.34238,Chinese Restaurant


In [10]:
paris_venues.shape

(5668, 7)

We have extracted 5668 points of interest (shops, restaurants, historical monuments, ...).
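To get a feel for what these points contain, the venues can be counted per category and per neighborhood. The sketch below runs on a four-row sample: the first three rows reuse values from the table above, while the last row is hypothetical. The category label 'Wine Bar' is an assumption about how Foursquare names that category.

```python
import pandas as pd

# Three rows from the table above plus one hypothetical wine bar.
sample_venues = pd.DataFrame({
    'Neighborhood': ['I - Le Louvre', 'I - Le Louvre',
                     'I - Le Louvre', 'II - La Bourse'],
    'Venue': ['Le Fumoir', 'Pont des Arts',
              'Coffee Crêpes', 'Hypothetical wine bar'],
    'Venue Category': ['Cocktail Bar', 'Bridge', 'Coffee Shop', 'Wine Bar'],
})

# Number of venues of each category in the sample.
category_counts = sample_venues['Venue Category'].value_counts()

# Number of wine bars per neighborhood (assumed category label: 'Wine Bar').
wine_bars = sample_venues[sample_venues['Venue Category'] == 'Wine Bar']
wine_bars_per_borough = wine_bars.groupby('Neighborhood').size()

print(category_counts)
print(wine_bars_per_borough)
```

Applied to `paris_venues`, the same two steps would give a first answer to the question of how many wine bars each borough already has.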

<b>Summary of the data extracted</b>: This dataset will help us visualize how the different venues are distributed across the city, so that we can avoid opening our wine bar next to an existing one.<br /> It also gives us an overview of attractive landmarks (for example, 'Pont des Arts', located at 48.858565, 2.337635) that make an area more appealing.
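A straight-line distance is one simple way to judge how close a candidate location is to an existing venue or landmark. The sketch below uses the haversine formula with coordinates taken from the tables above. The function name `haversine_m` is illustrative, and measuring proximity this way is an assumption, since the notebook does not yet state how it will be measured.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Coordinates taken from the tables above.
quarter = (48.860112, 2.340195)        # 01 - Saint-Germain-l'Auxerrois
pont_des_arts = (48.858565, 2.337635)  # Pont des Arts

distance = haversine_m(*quarter, *pont_des_arts)
print(round(distance))
```

The same function could be used to find, for each candidate location, the distance to the nearest existing wine bar.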

We have now extracted our dataset from Foursquare in order to analyze the distribution of restaurants in Paris. <br />
Another website could provide additional data to enrich our current dataset:
    https://opendata.paris.fr/pages/home/