<a name='home'></a>
# Applied Data Science Capstone
## Table of Contents
1. [Introduction to the Capstone Project](#intro)
2. [Foursquare API](#api)
3. [Neighbourhood Segmentation and Clustering](#clust)
4. []()
5. []()
6. Labs:  
    6.1 [Capstone Project Notebook](#lab_note)  
    6.2 [Foursquare API](#lab_note)  
    6.3 [k-means Clustering](#lab_clust)  
    6.4 [Segmenting and Clustering Neighborhoods in New York City](#lab_seg)
    


[Home](#home)

---
<a name='intro'></a>
## Introduction

### Week 1 - Introduction to Capstone Project
* Introduction to Capstone Project
* Location Data Providers
* Signing-up for a Watson Studio Account
* Peer-review Assignment: Capstone Project Notebook

### Week 2 - Foursquare API
* Introduction to Foursquare
* Getting Foursquare API Credentials
* Using Foursquare API
* Lab: Foursquare API
* Quiz: Foursquare API

### Week 3 - Neighborhood Segmentation and Clustering
* Clustering
* Lab: Clustering
* Lab: Segmenting and Clustering Neighborhoods in New York City
* Peer-review Assignment: Segmenting and Clustering Neighborhoods in Toronto

### Week 4 - Capstone Project

### Week 5 - Capstone Project (Cont'd)  

[Home](#home)

---
<a name='api'></a>
## Introduction to Foursquare
This section introduces Foursquare and what its data looks like. Foursquare is a technology company that has built a massive dataset of location data. What is interesting about Foursquare is how it built that dataset: it crowd-sourced the data, letting people use its app to add venues and fill in any missing information. Its location data is now among the most comprehensive available, and accurate enough that it powers location data for many popular services.

Given a city such as Toronto, you will segment it into neighborhoods using the geographical coordinates of the center of each neighborhood. Then, using a combination of location data and machine learning, you will group the neighborhoods into clusters.

**Location Data:**
Location data is data describing places and venues: their geographical location, category, working hours, full address, and so on. Given a location in the form of geographical coordinates (latitude and longitude values), one can determine what types of venues exist within a defined radius of that location.
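
To make "venues within a defined radius" concrete, here is a minimal sketch that filters a list of venues by great-circle distance. The venue names and coordinates are made up for illustration, and the haversine formula and the helper names `haversine_m` and `venues_within` are assumptions of this sketch, not part of any provider's API.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def venues_within(venues, lat, lon, radius_m):
    """Return the venues whose coordinates lie within radius_m of (lat, lon)."""
    return [v for v in venues if haversine_m(lat, lon, v["lat"], v["lon"]) <= radius_m]


# Hypothetical venues (names and coordinates are illustrative only)
venues = [
    {"name": "Cafe A", "category": "Coffee Shop", "lat": 40.7150, "lon": -74.0150},
    {"name": "Pizzeria B", "category": "Italian Restaurant", "lat": 40.7160, "lon": -74.0140},
    {"name": "Museum C", "category": "Museum", "lat": 40.7800, "lon": -73.9600},
]

nearby = venues_within(venues, 40.7149, -74.0153, radius_m=500)
print([v["name"] for v in nearby])
```

In practice you pass the radius to the provider's API and let it do the filtering; the sketch only shows what that filter means.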

Location data providers include:
* Foursquare
* Google Places
* Yelp

Providers differ in a number of features:
* **Rate limits:** how many API calls you can make in a defined time frame, such as calls per hour or calls per day.
* **Cost:** how much you pay to use the API to fetch location data.
* **Coverage:** geospatial coverage, in other words how many countries or geographical locations the dataset covers.
* **Accuracy:** how accurate the location data from each provider is.

We will use the **Foursquare** location dataset because it is the most comprehensive. Creating a developer account to use its API is also straightforward, and easier than with other providers. Therefore, let's start learning how to use the Foursquare API to leverage location data.  

[Home](#home)

## Foursquare Search
Communicating with the Foursquare database is easy thanks to its **RESTful API**. You create a uniform resource identifier (URI) and append extra parameters depending on the data you are seeking. Every call request starts with the base URI, `api.foursquare.com/v2`, followed by the group you are requesting data about: venues, users, or tips.  

Every time you make a call request, you have to pass your developer account credentials, which are your Client ID and Client Secret as well as what is called the version of the API, which is simply a date. It is designed to give developers the freedom to adapt to Foursquare API changes on their own schedule. In other words, you request the data to be returned to you in the format that was the latest up to the date defined by the version.
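
As a minimal sketch of how such a request URI is put together, the snippet below builds a venue search URL from the base URI, the credentials, and the version date. The helper name `build_search_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names follow the search URL used in the lab further down.

```python
from urllib.parse import urlencode

BASE_URI = "https://api.foursquare.com/v2"


def build_search_url(client_id, client_secret, version, lat, lon, query, radius, limit):
    """Assemble a venues/search request URI (hypothetical helper)."""
    params = {
        "client_id": client_id,          # developer account credentials
        "client_secret": client_secret,
        "v": version,                    # API version, expressed as a date (YYYYMMDD)
        "ll": f"{lat},{lon}",            # latitude,longitude of the search center
        "query": query,
        "radius": radius,                # metres
        "limit": limit,
    }
    # urlencode percent-encodes special characters (the comma in ll becomes %2C)
    return f"{BASE_URI}/venues/search?{urlencode(params)}"


url = build_search_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
    40.7149, -74.0153, "Italian", 500, 30,
)
print(url)
```

Using `urlencode` rather than string formatting keeps the query valid even when the search term contains spaces or other special characters.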

We make the call to the database, and in return we get a JSON file containing the venues that match our query. This is a regular call, and with a personal developer account we can make up to 99,500 such calls per day.

When we request the details of a specific venue, we get only **two tips and photos per venue**, not the entire list of tips. This type of call is premium, so with a personal account we **can only make 500** such calls per day.

[Home](#home)

---
<a name='clust'></a>
## Neighborhood Segmentation and Clustering
This section covers k-means clustering, a form of unsupervised learning. You will then use clustering and the Foursquare API to segment and cluster the neighborhoods of New York City. You will also learn how to scrape a website and parse HTML code.

### Clustering
K-means groups data in an unsupervised way, based only on the similarity of the data points (here, customers) to each other. There are various types of clustering algorithms, such as partitioning, hierarchical, and density-based clustering. K-means is a type of partitioning clustering: it divides the data into k non-overlapping subsets, or clusters, without any cluster-internal structure or labels, which makes it an unsupervised algorithm. 
Objects within a cluster are very similar, and objects across different clusters are very different or dissimilar.

The distance between samples is used to shape the clusters: k-means tries to minimize the intra-cluster distances and maximize the inter-cluster distances. To measure the distance between two customers, we can use a specific type of Minkowski distance, namely the Euclidean distance.

The same formula applies when customers are described by two features, this time in a two-dimensional space, and the same distance measure extends to multidimensional vectors. The feature set has to be normalized first to get an accurate dissimilarity measure.
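
The effect of normalization is easiest to see on a small example. The sketch below is an illustration under assumptions: the four customers (age, income) are invented, and min-max scaling is just one possible normalization. On the raw values, income dominates the distance; after scaling, both features contribute comparably.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_normalize(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical customers described by (age, income)
customers = [(25, 50_000), (60, 51_000), (27, 56_000), (45, 90_000)]
a, b, c, _ = customers

# Raw features: the income difference swamps the age difference
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Normalized features: age and income are on the same scale
na, nb, nc, _ = min_max_normalize(customers)
scaled_ab, scaled_ac = euclidean(na, nb), euclidean(na, nc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw data, customer b (35 years older) looks closer to a than customer c does; after scaling, the ordering flips.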

The key concept of the k-means algorithm is that it randomly picks a center point for each cluster. This means we must first choose k, the number of clusters. The center points are called the centroids of the clusters, and each must have the same number of features as the customer feature set. After this initialization step, which defines the centroid of each cluster, we assign each customer to the closest center. For this purpose, we calculate the distance of each data point (in our case, each customer) from the centroid points and form a matrix in which each row holds the distances of one customer from each centroid. This is called the distance matrix.

The main objective of k-means clustering is to minimize the distance of data points from the centroid of their own cluster and maximize the distance from other cluster centroids. 

The error is the total distance of each point from its centroid, which can be expressed as the within-cluster sum of squared errors. Intuitively, we try to reduce this error: we shape the clusters so that the total distance of all members of a cluster from its centroid is minimized. To do so, each centroid is moved to the mean of the data points in its cluster. We then calculate the distance of all points from the new centroids again; the points are re-clustered and the centroids move again. This continues until the centroids no longer move. Note that whenever a centroid moves, each point's distance to the centroid needs to be measured again. K-means is thus an iterative algorithm: we repeat steps two to four (computing distances, assigning points, updating centroids) until the algorithm converges. There is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters.
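
The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the toy points, and the choice to initialize centroids by sampling k data points are assumptions of the sketch.

```python
import math
import random


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, sse) for a list of equally long tuples."""
    rng = random.Random(seed)
    # Step 1: initialize the centroids by picking k data points at random
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: distance matrix, one row per point and one column per centroid
        distances = [[euclidean(p, c) for c in centroids] for p in points]
        # Step 3: assign every point to its closest centroid
        labels = [row.index(min(row)) for row in distances]
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # centroids no longer move: converged
            break
        centroids = new_centroids
    # Within-cluster sum of squared errors for the final assignment
    sse = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


# Toy data: two well-separated groups of points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, centroids, round(sse, 3))
```

Because the outcome depends on the initial centroids, a common practice is to run the algorithm several times with different seeds and keep the run with the lowest error.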

[Home](#home)

---
<a name='lab'></a>
## Lab exercise


---
<a name='lab_note'></a>
### Foursquare API
1. Import the libraries  
    import requests # library to handle requests  
    import pandas as pd # library for data analysis  
    import numpy as np # library to handle data in a vectorized manner  
    import random # library for random number generation  
    
    !conda install -c conda-forge geopy --yes  
    from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values  
    
    \# libraries for displaying images  
    from IPython.display import Image  
    from IPython.display import HTML  
    
    \# function to transform a JSON file into a pandas dataframe  
    from pandas import json_normalize # pandas >= 1.0; older versions: from pandas.io.json import json_normalize  
    !conda install -c conda-forge folium=0.5.0 --yes  
    import folium # plotting library  
    print('Folium installed')  
    print('Libraries imported.')
2. Define the Foursquare credentials  
    CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID  
    CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (keep it out of shared notes)  
    VERSION = '20180604'  
    LIMIT = 30  
    print('Your credentials:')  
    print('CLIENT_ID: ' + CLIENT_ID)  
    print('CLIENT_SECRET:' + CLIENT_SECRET)
3. Define the points of interest geo-location  
    address = '102 North End Ave, New York, NY'  
    geolocator = Nominatim(user_agent="foursquare_agent")  
    location = geolocator.geocode(address)  
    latitude = location.latitude  
    longitude = location.longitude  
    print(latitude, longitude)

### Define searches:
1. Italian food within 500 m distance  
    search_query = 'Italian'  
    radius = 500  
    print(search_query + ' .... OK!')  
    
    Define the URL:  
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)  
    url  
    
    Send the GET request:  
    results = requests.get(url).json()  
    results  
    
    Extract the relevant part and transform it into a pandas dataframe:  
    venues = results['response']['venues']  
    dataframe = json_normalize(venues)  
    dataframe.head()  
    
    Define the information of interest:  
    filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']  
    dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy to avoid modifying a view  
    
    def get_category_type(row):  
        try:  
            categories_list = row['categories']  
        except KeyError:  
            categories_list = row['venue.categories']  
    
        if len(categories_list) == 0:  
            return None  
        else:  
            return categories_list[0]['name']  
    
    dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)  
    dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]  
    dataframe_filtered
2. Explore a given venue  
    Define the URL:  
    venue_id = '4fa862b3e4b0ebff2f749f06' # ID of Harry's Italian Pizza Bar  
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)  
    url  
    
    Send the GET request for results:  
    result = requests.get(url).json()  
    print(result['response']['venue'].keys())  
    result['response']['venue']  
    
    Get the venue's overall rating (reuses the result from the request above):  
    print(result['response']['venue'].get('rating', 'This venue has not been rated yet.'))  
    
    Get the number of tips:  
    result['response']['venue']['tips']['count']
3. Search for a Foursquare user

[Home](#home)

---
<a name='lab_clust'></a>
### k-means Clustering
Despite its simplicity, k-means is widely used for clustering in many data science applications, and it is especially useful if you need to quickly discover insights from unlabeled data.
1. Import the libraries
2. Load the data points
3. Define a function that assigns each datapoint to a cluster
4. Define a function that updates the centroid of each cluster
5. Define a function that plots the data points along with the cluster centroids
6. Initialize k-means - plot data points
7. Initialize k-means - randomly define clusters and add them to plot
8. Run k-means

Running k-means with larger data sets:
1. Generate the data with random.seed()
2. Display the scatter plot 
3. Set up k-means  
    * init: Initialization method of the centroids. Value will be: "k-means++". k-means++ selects initial cluster centers for k-means clustering in a smart way to speed up convergence.  
    * n_clusters: The number of clusters to form as well as the number of centroids to generate. Value will be: 4 (since we have 4 centers)  
    * n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
4. Fit the model
5. Visualise the resulting clusters
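
A minimal sketch of these steps with scikit-learn is shown below (plotting omitted). The blob centers, the sample size, `n_init=12`, and the fixed `random_state` are illustrative assumptions, not values taken from these notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate reproducible sample data around 4 assumed centers
X, y = make_blobs(
    n_samples=500,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
    random_state=0,
)

# Set up k-means with the three parameters described above
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0)

# Fit the model
k_means.fit(X)

labels = k_means.labels_            # cluster index assigned to every data point
centers = k_means.cluster_centers_  # coordinates of the 4 centroids
print(centers)
print(k_means.inertia_)             # within-cluster sum of squared distances
```

The fitted `labels` and `centers` are what you would then pass to the plotting step to colour the points and mark the centroids.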

Using k-means for Customer Segmentation
1. Get the data
2. Put it into a dataframe
3. Clean up the data - drop categorical data or transform it
4. Normalise the data
5. Modeling and Labeling
6. Group by labels
7. Profile the clusters by  
    * OLDER, HIGH INCOME, AND INDEBTED
    * MIDDLE AGED, MIDDLE INCOME, AND FINANCIALLY RESPONSIBLE
    * YOUNG, LOW INCOME, AND INDEBTED

[Home](#home)

---
<a name='lab_seg'></a>
### Segmenting and Clustering Neighborhoods in New York City
In this lab, you will 
* convert addresses into their equivalent latitude and longitude values. 
* use the Foursquare API to explore neighborhoods in New York City
* use the explore function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. 
* use the k-means clustering algorithm to complete this task
* use the Folium library to visualize the neighborhoods in New York City and their emerging clusters

[Original file](https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DP0701EN/DP0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb)

1. Import the libraries
2. Download the data set and explore the data
3. Extract the key data set "features" by putting it into a separate dataframe
4. Define the coordinates of New York
5. Create a map of New York
6. Cut it down to Manhattan (slice)
7. Define the coordinates of Manhattan and plot the neighbourhoods on it
8. Switch to Foursquare and define account credentials
9. Define the URL request
10. Get the JSON file
11. Define a function to extract the categories
12. Put the "venues" data into a pd.dataframe
13. Create a function to repeat the process for all neighbourhoods in Manhattan
14. Analyse the data set of venues
15. Analyse each neighbourhood
16. Group by neighbourhood and put into pd.dataframe
17. Cluster the neighbourhood
18. Visualise the result
19. Examine the clusters
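
Steps 15 and 16 are the least obvious part of the workflow, so here is a minimal sketch of one way to do them with pandas: one-hot encode the venue categories, average them per neighbourhood, and read off the most common categories. The venue rows are invented for illustration, and the helper `top_venues` is a hypothetical name.

```python
import pandas as pd

# Hypothetical venue data: one row per venue returned for a neighbourhood
venues = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Chinatown", "Chinatown", "Tribeca", "Tribeca"],
    "Venue Category": ["Chinese Restaurant", "Chinese Restaurant", "Bakery",
                       "Park", "Italian Restaurant"],
})

# Step 15: one-hot encode the categories (one column per category)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Step 16: group by neighbourhood; the mean is the frequency of each category
grouped = onehot.groupby("Neighborhood").mean()


def top_venues(row, n=2):
    """Return the n most frequent venue categories for one neighbourhood."""
    return row.sort_values(ascending=False).index[:n].tolist()


top = {name: top_venues(row) for name, row in grouped.iterrows()}
print(grouped)
print(top)
```

The `grouped` frequency table is the feature matrix that the clustering in step 17 would be fitted on.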





[Home](#home)
