# Applied Data Science Capstone

## *!The Battle of Neighborhoods!*

### 1. Introduction/Business Problem

**A company in Costa Rica sells tours to New York and Toronto. It wants to let its customers visualize the similarities and differences between the two cities so that they can choose a destination based on their travel preferences.**

**This project is therefore aimed at tour operators who want to attract more customers to the cities they offer, by letting those customers compare two major North American cities, in our case Toronto and New York.**

#### 1.1 Stakeholders

1. Companies that sell tours to different places and could recommend cities according to the similarities found with the help of machine learning.

2. People who want to travel but are unsure which destination to choose because they do not have enough information to decide.

### 2. Description of the data to be used to solve the problem

**For this project we will apply what we learned in course 9 and use the Foursquare API to explore both cities and their neighborhoods, focusing on venues that matter to travelers, such as coffee shops, hotels, restaurants, theaters, and other nearby places to visit.**


**We will also use the data that Wikipedia provides about the Toronto neighborhoods, which can be accessed from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.**

**Data for the neighborhoods of New York City will be loaded from https://cocl.us/new_york_dataset.**

**The Toronto dataset must be joined with geospatial coordinates so that we can query the Foursquare API (the New York dataset already includes them). The coordinates will be loaded from http://cocl.us/Geospatial_data with the help of the Python requests module.**

The data must be prepared and cleaned before we can apply the algorithms that we need:

1. We will work with all boroughs that are necessary and provide value.
2. To use the Wikipedia data, we first extract the table with the borough information and convert it into a dataframe so that we can work with the data as we need. We will do this with the help of the BeautifulSoup Python module.
3. We will remove all rows whose Borough value is "Not assigned", since they add no value.
4. Neighborhoods with the value "Not assigned" will be given the same value as their Borough.
5. The similarities and differences will be based on the boroughs, so we will group the borough data and concatenate the neighborhood names within each group.
6. To explore the data with the Foursquare API we need to join it with the coordinates dataset. The merge will be done on the Postcode column.
7. With clean data for both New York and Toronto we can continue with the analysis: visualize the data to check its current distribution, then train our machine learning model to generate the clusters and find the differences and similarities between both cities.
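
Steps 3 to 6 above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows and coordinates below are hypothetical stand-ins for the Wikipedia table and the coordinates file, not the real data, and the column names simply follow the ones used in this report.

```python
import pandas as pd

# Hypothetical rows shaped like the Wikipedia postal code table (not the real data).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Hypothetical coordinates table, one row per postal code.
coords = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Latitude": [43.65, 43.66],
    "Longitude": [-79.36, -79.39],
})

# Step 3: drop rows without an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Step 4: a "Not assigned" neighbourhood takes the name of its borough.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# Step 5: one row per postal code, neighbourhood names joined with a comma.
grouped = (
    clean.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# Step 6: attach the coordinates through the Postcode column.
merged = grouped.merge(coords, on="Postcode")
print(merged)
```

Filling the missing neighbourhoods before grouping keeps the comparison against "Not assigned" simple, because after grouping a cell may already hold several joined names.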


### 3. Methodology

**First, let's import all the dependencies that we will need.**

In [8]:
import requests #request API with python
from bs4 import BeautifulSoup # Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import io #Core tools for working with streams

import json # library to handle JSON files

print('Libraries imported.')

Libraries imported.


In [3]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library
print('folium imported.')


### 3.1 Load the data

**Load the data that we need for Toronto**

In [46]:
# Get the Toronto data from the Wikipedia page and parse it with BeautifulSoup, locating the table by its class
wiki_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text  # fetch the entire page
soup = BeautifulSoup(wiki_text, 'lxml')  # parse the HTML so that we can extract the table
table_dataset = soup.find('table', {'class': 'wikitable'})  # first table with the class "wikitable"

dfs = pd.read_html(io.StringIO(str(table_dataset)))  # read_html returns a list of dataframes
df_toronto = pd.concat(dfs)  # combine the list into a single dataframe
df_toronto.rename(columns={'Postcode': 'Postalcode'}, inplace=True)  # rename the column to match the coordinates data
# -------------------
# Now load the latitude and longitude data, which will later be merged on the Postalcode column

csv_file_content = requests.get("http://cocl.us/Geospatial_data").content  # download the CSV content with the requests module
lat_long_df = pd.read_csv(io.StringIO(csv_file_content.decode('utf-8')))  # convert it into a pandas dataframe
lat_long_df.rename(columns={'Postal Code': 'Postalcode'}, inplace=True)

print("data loaded")

df_toronto.head()

data loaded


| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


----
**Load the data that we need for New York**

In [21]:
# First, download the data.
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

# Next, load the data.
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# The relevant data is under the 'features' key, which is a list of the neighborhoods.
neighborhoods_data = newyork_data['features']

# Transform the JSON data into a pandas dataframe.
# Define the dataframe columns.
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood, then build the dataframe in a single call
# (DataFrame.append was removed in pandas 2.0 and is slow when called in a loop).
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # Coordinates are stored as [longitude, latitude].
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

df_newyork = pd.DataFrame(rows, columns=column_names)

print('Data loaded!')
df_newyork.head()

Data downloaded!
Data loaded!


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
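
To make the structure of the New York file easier to follow, here is a minimal sketch of how one entry of the `features` list maps to one table row. The two-entry sample is hypothetical: it only mimics the keys read by the cell above, and its coordinates are the rounded values from the output table. The helper name `feature_to_row` is an assumption made for this illustration.

```python
import json

# Hypothetical sample shaped like the 'features' list of the New York file.
sample = json.loads("""
{"features": [
  {"properties": {"borough": "Bronx", "name": "Wakefield"},
   "geometry": {"coordinates": [-73.847201, 40.894705]}},
  {"properties": {"borough": "Bronx", "name": "Co-op City"},
   "geometry": {"coordinates": [-73.829939, 40.874294]}}
]}
""")


def feature_to_row(feature):
    """Flatten one feature into a dict with the four columns used in this report."""
    # GeoJSON stores a position as [longitude, latitude], in that order.
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


rows = [feature_to_row(f) for f in sample["features"]]
print(rows[0])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which builds the whole table in one call.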


### 3.2 Clean and format the data

__This process is very important because we need to prepare and format the data before we can analyze it and apply the algorithms that we need.__

**Format the data for Toronto**

In [54]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned'].reset_index(drop=True)

# Rename the column Neighbourhood to be consistent with the New York data
df_toronto.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)

# If a cell has a borough but a Not assigned neighborhood, the neighborhood becomes the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
# This is done with .loc before grouping, because assigning to the rows yielded by iterrows() does not reliably update the dataframe.
not_assigned = df_toronto['Neighborhood'] == 'Not assigned'
df_toronto.loc[not_assigned, 'Neighborhood'] = df_toronto.loc[not_assigned, 'Borough']

# More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row with the neighborhoods separated by a comma.
df_toronto = df_toronto.groupby(['Postalcode', 'Borough'], sort=False)['Neighborhood'].agg(', '.join).reset_index()

# Then we can merge the tables
df_toronto = pd.merge(lat_long_df, df_toronto, on='Postalcode')
df_toronto = df_toronto[['Postalcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]  # change the column order

# Now we have 103 rows and 5 columns
print("Now we have 103 rows and 5 columns. Shape:", df_toronto.shape)

# Show the results
df_toronto.head()

Now we have 103 rows and 5 columns. Shape: (103, 5)


| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
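
Because the merge on `Postalcode` silently drops any postal code that is missing from the coordinates file, it can be worth checking for unmatched keys. The following is a minimal sketch of such a check, not part of the original analysis: the two small frames are hypothetical, and "M9Z" is an invented code used only to show a miss.

```python
import pandas as pd

# Hypothetical neighbourhood table; "M9Z" is an invented code with no coordinates.
neighborhoods = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})

# Hypothetical coordinates table covering only two of the three codes.
coordinates = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighbourhood row, indicator=True adds a "_merge"
# column that says where each row came from, and validate="one_to_one"
# raises an error if a postal code is duplicated on either side.
checked = neighborhoods.merge(
    coordinates,
    on="Postalcode",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Rows marked "left_only" found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates, which is consistent with the 103 rows reported above.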


----
**Format the data for New York**

In [66]:
df_newyork['Borough'].value_counts()

Queens           81
Brooklyn         70
Staten Island    63
Bronx            52
Manhattan        40
Name: Borough, dtype: int64

### 3.3 Analysis and data exploration