## CAPSTONE PROJECT FOR FINDING OPTIMIZED VENUES IN FRANCE

# Introduction

This report is for the final course of the Data Science Specialization, a nine-course series created by IBM and hosted on the Coursera platform. The problem and the analysis approach are left for the learner to decide, with the requirement of leveraging Foursquare location data to explore or compare neighborhoods or cities of the learner's choice, or to solve a problem for which Foursquare location data is useful.

In this project, the problem is to find the cities, or clusters of cities, in France that offer a user's preferred venues, e.g. bars, plazas and gyms. To achieve this, an analytical approach is used, based on machine learning techniques and data analysis, concretely clustering together with some data visualization.

So, do the surroundings of a city offer the user's preferred venues?
If so, which venue types have the greatest effect on the clustering, both positively and negatively?

The target audience for this project is people who choose a hotel based on
their preferred nearby venues (e.g. tourists).


# Import required libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # skip this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
try:
    from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # skip this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')


# Download and Explore Dataset

The city data is downloaded as a CSV file from https://simplemaps.com/data/fr-cities

In [2]:
df = pd.read_csv('france_geo.csv', sep = ';')

In [3]:
data_df = pd.DataFrame(df)

In [4]:
data_df.head()

Unnamed: 0,city,lat,lng,country,iso2,capital,population
0,Paris,48.866667,2.333333,France,FR,primary,9904000
1,Lyon,45.748457,4.846711,France,FR,admin,1423000
2,Marseille,43.285413,5.37606,France,FR,admin,1400000
3,Lille,50.632971,3.058585,France,FR,admin,1044000
4,Nice,43.713644,7.25952,France,FR,927000,338620


Rename the columns to make them easier to understand

In [5]:
data_df.columns = ['CITY', 'LATITUDE', 'LONGITUDE','COUNTRY','COUNTRY_CODE','CAPITAL','POPULATION']

Drop the columns that are not required

In [6]:
data_df = data_df.drop(['COUNTRY_CODE','CAPITAL'], axis=1)

In [7]:
data_df.head()

Unnamed: 0,CITY,LATITUDE,LONGITUDE,COUNTRY,POPULATION
0,Paris,48.866667,2.333333,France,9904000
1,Lyon,45.748457,4.846711,France,1423000
2,Marseille,43.285413,5.37606,France,1400000
3,Lille,50.632971,3.058585,France,1044000
4,Nice,43.713644,7.25952,France,338620


### Use geopy library to get the latitude and longitude values of France

In [8]:
address = 'France'

# recent geopy versions require a user_agent; 'france_explorer' is an arbitrary placeholder name
geolocator = Nominatim(user_agent='france_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of France are {}, {}.'.format(latitude, longitude))

The geographical coordinates of France are 46.603354, 1.8883335.


#### Create a map of France with cities superimposed on top.

In [9]:
# create map of france using latitude and longitude values
map_france = folium.Map(location=[latitude, longitude], zoom_start=6)

# add markers to map
for lat, lng, country, city in zip(data_df['LATITUDE'], data_df['LONGITUDE'], data_df['COUNTRY'], data_df['CITY']):
    label = '{}, {}'.format(city, country)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_france)
    
map_france

### Define Foursquare Credentials and Version

In [10]:
CLIENT_ID = 'ESYH340ZLYESMFLKUKCHDQ33YNUJINGWDUPRBZC21VVYTFMT' # your Foursquare ID
CLIENT_SECRET = 'EYRI0QRQTSMVWD5AWU1JGD4FXZBNCPOXM4NRO1TKBS3EVHOZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ESYH340ZLYESMFLKUKCHDQ33YNUJINGWDUPRBZC21VVYTFMT
CLIENT_SECRET:EYRI0QRQTSMVWD5AWU1JGD4FXZBNCPOXM4NRO1TKBS3EVHOZ


#### Let's explore the first city in our dataframe.

In [11]:
data_df.loc[0,'CITY']

'Paris'

In [12]:
neighborhood_latitude = data_df.loc[0, 'LATITUDE'] # neighborhood latitude value
neighborhood_longitude = data_df.loc[0, 'LONGITUDE'] # neighborhood longitude value

neighborhood_name = data_df.loc[0, 'CITY'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Paris are 48.866667, 2.333333.


In [13]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

#create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=ESYH340ZLYESMFLKUKCHDQ33YNUJINGWDUPRBZC21VVYTFMT&client_secret=EYRI0QRQTSMVWD5AWU1JGD4FXZBNCPOXM4NRO1TKBS3EVHOZ&v=20180605&ll=48.866667,2.333333&radius=500&limit=100'
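
The same request URL can also be assembled with the standard library's `urllib.parse.urlencode`, which escapes the values and keeps each parameter on its own line. This is only a minimal sketch: `build_explore_url` is a hypothetical helper name, and `'YOUR_ID'` and `'YOUR_SECRET'` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

# endpoint used earlier in this notebook
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper: build the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma inside the ll parameter unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

# placeholder credentials, coordinates of Paris from the dataframe
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 48.866667, 2.333333)
print(example_url)
```

Keeping the parameters in a dictionary makes it easier to change the radius or limit per call without editing a long format string.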

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c388426dd57975fd5e7d21f'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Place Vendôme',
  'headerFullLocation': 'Place Vendôme, Paris',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 244,
  'suggestedBounds': {'ne': {'lat': 48.8711670045, 'lng': 2.340161078526742},
   'sw': {'lat': 48.8621669955, 'lng': 2.326504921473258}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4cbdcb0b7148f04d510aefab',
       'name': 'Pierre Hermé',
       'location': {'address': "39 avenue de l'Opéra",
        'lat': 48.86822151447183,
        'lng': 2.333396617684349,
        'labeledLatLngs': [{'label': 'display',
          'lat': 48.86822151

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
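
To see how the category extraction behaves, it can be tried on a couple of hand-written rows. The sketch below uses plain dictionaries instead of dataframe rows; `first_category_name` is a hypothetical variant of the function above, and the second sample row is made up to show the empty-list case.

```python
def first_category_name(row):
    """Return the first category name of a venue row, or None if it has none."""
    # the list may sit under either key, depending on how the JSON was flattened
    categories = row.get('categories') or row.get('venue.categories') or []
    return categories[0]['name'] if categories else None

sample_rows = [
    {'venue.name': 'Pierre Hermé', 'venue.categories': [{'name': 'Pastry Shop'}]},
    {'name': 'Unlabelled spot', 'categories': []},  # made-up row without categories
]

names = [first_category_name(row) for row in sample_rows]
print(names)
```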

In [16]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Pierre Hermé,Pastry Shop,48.868222,2.333397
1,Le Roch Hotel & Spa Paris,Hotel,48.8662,2.332995
2,Cantine California,Food Truck,48.867401,2.332017
3,Boulangerie Aki,Bakery,48.866211,2.335458
4,Brasserie Réjane,Restaurant,48.865486,2.334824


In [17]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
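
A response can lack the `groups` key, for example when the request fails or the quota is exhausted, and in that case the direct indexing above raises an error. The sketch below shows one possible way to flatten a payload defensively. It is an illustration only: `extract_venue_rows` is a hypothetical helper, and the payloads are hand-made stand-ins for real API responses.

```python
def extract_venue_rows(city, lat, lng, payload):
    """Hypothetical helper: turn one explore payload into flat venue rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append((
            city, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0]['name'] if categories else None,  # tolerate missing categories
        ))
    return rows

# hand-made payload imitating the structure of the explore response
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 48.8682, 'lng': 2.3334},
               'categories': [{'name': 'Pastry Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 48.8662, 'lng': 2.3330},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Paris', 48.866667, 2.333333, mock_payload)
empty_rows = extract_venue_rows('Nowhere', 0.0, 0.0, {'meta': {'code': 400}})
print(len(rows), len(empty_rows))
```

Returning an empty list for a failed request lets the loop over cities continue instead of stopping at the first problematic response.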

### Now run the above function on each city and create a new dataframe called france_venues.

In [None]:
france_venues = getNearbyVenues(names=data_df['CITY'],
                                   latitudes=data_df['LATITUDE'],
                                   longitudes=data_df['LONGITUDE']
                                  )

Paris
Lyon
Marseille
Lille
Nice
Toulouse
Bordeaux
Rouen
Strasbourg
Nantes
Metz
Grenoble
Toulon
Montpellier
Nancy
Saint-Étienne
Melun
Le Havre
Tours
Clermont-Ferrand
Orléans
Mulhouse
Rennes
Reims
Caen
Angers
Dijon
Nîmes


#### Let's check the size of the resulting dataframe

In [None]:
print(france_venues.shape)
france_venues.head()

In [None]:
# str.contains matches substrings, so e.g. 'Wine Bar' also counts as a bar
df_venues_bar = france_venues[france_venues['Venue Category'].str.contains('Bar')].reset_index(drop=True)
df_venues_bar['Venue Type'] = 'Bar'
df_venues_plaza = france_venues[france_venues['Venue Category'].str.contains('Plaza')].reset_index(drop=True)
df_venues_plaza['Venue Type'] = 'Plaza'
df_venues_final = pd.concat([df_venues_bar, df_venues_plaza]).reset_index(drop=True)
df_venues_final.shape
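
Because `str.contains('Bar')` is a substring test, it also matches any category that merely contains those letters. The sketch below contrasts it with a whole-word regular expression on a few illustrative category names. The sample list is an assumption for demonstration, not taken from the project data.

```python
import pandas as pd

# illustrative category names, not the real project data
sample = pd.DataFrame({'Venue Category': ['Bar', 'Wine Bar', 'Salon / Barbershop', 'Plaza']})

# substring match: also catches 'Salon / Barbershop'
substring_hits = sample[sample['Venue Category'].str.contains('Bar')]

# whole-word match: 'Bar' must stand alone as a word
word_hits = sample[sample['Venue Category'].str.contains(r'\bBar\b', regex=True)]

print(substring_hits['Venue Category'].tolist())
print(word_hits['Venue Category'].tolist())
```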

In [None]:
df_venues_final.groupby('Neighborhood')['Venue Type']\
.value_counts()\
.unstack(level=1)\
.plot.bar(stacked=True)


In [None]:
print('There are {} unique categories.'.format(len(france_venues['Venue Category'].unique())))

### Analyze Each City

In [None]:
# one hot encoding
france_onehot = pd.get_dummies(france_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
france_onehot['Neighborhood'] =  france_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [france_onehot.columns[-1]] + list(france_onehot.columns[:-1])
france_onehot = france_onehot[fixed_columns]

france_onehot.head()

### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
france_grouped = france_onehot.groupby('Neighborhood').mean().reset_index()
france_grouped
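
On a toy table the effect of one-hot encoding followed by a grouped mean is easy to check by hand: each value becomes the share of a city's venues that fall into that category. The cities and counts below are made up for illustration.

```python
import pandas as pd

# made-up venues: four in Paris, two in Lyon
toy = pd.DataFrame({
    'Neighborhood': ['Paris', 'Paris', 'Paris', 'Paris', 'Lyon', 'Lyon'],
    'Venue Category': ['Bar', 'Bar', 'Hotel', 'Plaza', 'Hotel', 'Bar'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# the mean of the 0/1 columns is the relative frequency of each category per city
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Paris has two bars out of four venues, so its `Bar` value is 0.5; Lyon has no plaza, so its `Plaza` value is 0.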

#### Let's print each city along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in france_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = france_grouped[france_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each city.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = france_grouped['Neighborhood']

for ind in np.arange(france_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(france_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
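
The `indicators` list with its fallback to `th` is correct for the ten columns used here, but it would produce labels such as `21th` if `num_top_venues` were raised above 20. A small helper can handle the general case. This is a sketch, and `ordinal` is a hypothetical name not used elsewhere in the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # numbers ending in 11, 12 or 13 take 'th' even though they end in 1, 2 or 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column names as in the cell above, for the top 10 venues
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[4], '|', columns[10])
```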

## Cluster Neighborhoods

Run k-means to cluster the cities into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

france_grouped_clustering = france_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(france_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
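
To make clearer what the clustering step does, the sketch below runs plain Lloyd iterations (assign each point to its nearest centroid, then move each centroid to the mean of its points) with NumPy on four made-up frequency vectors. It is a simplified illustration and not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical function, and the starting centroids are passed in explicitly, whereas `KMeans` defaults to k-means++ initialization.

```python
import numpy as np

def lloyd_kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute the means."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # squared distance from every point to every centroid, shape (n_points, k)
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# made-up frequency vectors; the columns could stand for Bar, Hotel, Plaza
points = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.0],
                   [0.1, 0.1, 0.8],
                   [0.0, 0.2, 0.8]])

labels, centroids = lloyd_kmeans(points, centroids=points[[0, 2]])
print(labels)
print(centroids)
```

The two bar-heavy rows end up in one cluster and the two plaza-heavy rows in the other, which is the same kind of grouping the notebook applies to the cities.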

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each city.

In [None]:
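france_merged = data_df.copy()

# kmeans.labels_ follows the row order of france_grouped, so attach the labels there first
venues_with_labels = neighborhoods_venues_sorted.copy()
venues_with_labels.insert(1, 'Cluster Labels', kmeans.labels_)

# join on the city name to add the cluster label and top venues to each city
france_merged = france_merged.join(venues_with_labels.set_index('Neighborhood'), on='CITY')

# cities for which no venues were returned have no label; drop them and restore integer labels
france_merged = france_merged.dropna(subset=['Cluster Labels'])
france_merged['Cluster Labels'] = france_merged['Cluster Labels'].astype(int)

france_merged.head() # check the last columns!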
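
One detail worth checking at this step is row order: `groupby('Neighborhood')` sorts the cities alphabetically, so `kmeans.labels_` follows that order rather than the order of `data_df`. The sketch below, with made-up labels for three cities, shows how joining on the city name gives each city its own label regardless of order.

```python
import pandas as pd

# original order (as in data_df) versus alphabetical order (as produced by groupby)
cities = pd.DataFrame({'CITY': ['Paris', 'Lyon', 'Marseille']})
grouped_order = ['Lyon', 'Marseille', 'Paris']
labels_in_grouped_order = [2, 0, 1]  # made-up labels for illustration

label_table = pd.DataFrame({'Neighborhood': grouped_order,
                            'Cluster Labels': labels_in_grouped_order})

# join by name so that every city receives its own label
aligned = cities.join(label_table.set_index('Neighborhood'), on='CITY')
print(aligned)
```

Assigning the label array directly to `cities` would instead give Paris the label that belongs to Lyon.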

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=6)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(france_merged['LATITUDE'], france_merged['LONGITUDE'], france_merged['CITY'], france_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
france_merged.count()

### CLUSTER 1

In [None]:
cluster1 = france_merged.loc[france_merged['Cluster Labels'] == 0, france_merged.columns[[0] + list(range(4, france_merged.shape[1]))]]
cluster1

### CLUSTER 2

In [None]:
france_merged.loc[france_merged['Cluster Labels'] == 1, france_merged.columns[[0] + list(range(4, france_merged.shape[1]))]]

### CLUSTER 3

In [None]:
france_merged.loc[france_merged['Cluster Labels'] == 2, france_merged.columns[[0] + list(range(6, france_merged.shape[1]))]]

### CLUSTER 4

In [None]:
france_merged.loc[france_merged['Cluster Labels'] == 3, france_merged.columns[[0] + list(range(6, france_merged.shape[1]))]]

### CLUSTER 5

In [None]:
france_merged.loc[france_merged['Cluster Labels'] == 4, france_merged.columns[[0] + list(range(6, france_merged.shape[1]))]]

In [None]:
# cities whose top venues include a hotel
has_hotel = france_merged.eq('Hotel').any(axis=1)
# cities whose top venues include at least one of the user's preferred venue types
has_preferred = france_merged.isin(['Bar', 'Plaza', 'Shopping Mall']).any(axis=1)
tot_cluster = france_merged[has_hotel & has_preferred]
tot_cluster.head(10)
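
Since `DataFrame.eq` compares against a single value, testing for several preferred venue types calls for `isin`. One way to make the user's selection reusable is to wrap the filter in a function, as in the sketch below. `select_cities` is a hypothetical helper and the three-row table is made up for illustration.

```python
import pandas as pd

def select_cities(df, must_have, any_of):
    """Hypothetical helper: rows containing `must_have` and at least one value from `any_of`."""
    has_required = df.eq(must_have).any(axis=1)
    has_preferred = df.isin(any_of).any(axis=1)
    return df[has_required & has_preferred]

# made-up table of top venues per city
top_venues = pd.DataFrame({
    'CITY': ['A', 'B', 'C'],
    '1st Most Common Venue': ['Hotel', 'Hotel', 'Bar'],
    '2nd Most Common Venue': ['Bar', 'Bakery', 'Plaza'],
})

selected = select_cities(top_venues, 'Hotel', ['Bar', 'Plaza', 'Shopping Mall'])
print(selected['CITY'].tolist())
```

Only city A is kept: B has a hotel but none of the preferred venues, and C has preferred venues but no hotel.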

### USER_CLUSTER 1

Cluster based on user selection

In [None]:
tot_cluster.loc[tot_cluster['Cluster Labels'] == 0, tot_cluster.columns[[0] + list(range(6, france_merged.shape[1]))]]

### USER_CLUSTER 2

Cluster based on user selection

In [None]:
tot_cluster.loc[tot_cluster['Cluster Labels'] == 1, tot_cluster.columns[[0] + list(range(6, france_merged.shape[1]))]]

### USER_CLUSTER 3

Cluster based on user selection

In [None]:
tot_cluster.loc[tot_cluster['Cluster Labels'] == 2, tot_cluster.columns[[0] + list(range(6, france_merged.shape[1]))]]

### USER_CLUSTER 4

Cluster based on user selection

In [None]:
tot_cluster.loc[tot_cluster['Cluster Labels'] == 3, tot_cluster.columns[[0] + list(range(6, france_merged.shape[1]))]]

### USER_CLUSTER 5

Cluster based on user selection

In [None]:
tot_cluster.loc[tot_cluster['Cluster Labels'] == 4, tot_cluster.columns[[0] + list(range(6, france_merged.shape[1]))]]

## Create a map of the cities that match the user's filter

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tot_cluster['LATITUDE'], tot_cluster['LONGITUDE'], tot_cluster['CITY'], tot_cluster['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Discussion

It is interesting how the venues differ from one city to another. The main differences appear once the clusters are filtered by the user's inputs, although some venues remain common across the clusters.

As a recommendation, further study is needed to better predict which cluster of cities matches a user's preferred venues, for example when a tourist wants to find cities whose hotels are close to bars, plazas, gyms and similar venues.


# Conclusion

As far as this data shows, some of the clusters are left empty once the user filter is applied.

User clusters 1 and 5 most likely contain the most cities that match the user's hotel and venue preferences. With more data and a more carefully framed selection logic, the output could be made more accurate.

# References

https://developer.foursquare.com/docs/api/venues/

https://simplemaps.com/data/fr-cities

https://www.coursera.org/