# Clustering of Neighborhoods in Washington D.C.

## Table of Contents
1. <a href="#item1">Introduction</a>
2. <a href="#item2">Data</a>
3. <a href="#item3">Methodology</a>
4. <a href="#item4">Results</a>
5. <a href="#item5">Discussion</a>
6. <a href="#item6">Conclusion</a>

## 1. Introduction/Business Problem

### Problem
The problem we address is to determine which zip codes in Washington D.C. share the same features.
In this project, we leverage the Foursquare location data to compare neighborhoods in Washington D.C.
### Goal
The goal of the project is to cluster zip codes that are similar in features. 
### Who may be interested: target audience
This project will be useful for someone who wants to relocate within Washington D.C. The user can choose a zip code that has features similar to the one (s)he likes and that is closest to the point (s)he wants to relocate to.


## 2. Data

In this section, we describe the data and how it is used to solve the business problem at hand.
### Description of the Data
Our data comes from three different sources.

#### First Source
First, we obtained geographical coordinates of all zip codes in the US. We use this webpage: https://gist.github.com/erichurst/7882666 
This data is made of the following columns:

| Zip codes | Latitude | Longitude |
|-----------|----------|-----------|
|           |          |            |

The table gives the GPS coordinates of each zip code.
#### Second Source
The second source consists of all zip codes located in Washington D.C. We use this webpage: https://www.zillow.com/browse/homes/dc/district-of-columbia-county/
The structure of the data is:

| Zip Codes |
|-----------|

The table will give all zip codes in Washington D.C.
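
Combining the first two sources amounts to keeping only the rows of the national coordinate table whose zip code appears in the D.C. list. The snippet below is a minimal sketch of that step using only the standard library. The column names `ZIP`, `LAT`, `LNG`, the helper `dc_zip_coordinates`, and the placeholder row `99999` are assumptions made for illustration; the source does not show how the join was actually done.

```python
import csv
import io

# Toy stand-in for the national zip-code file. The column names are an
# assumption; the 20064 coordinates are the ones reported in this project,
# and the 99999 row is a made-up placeholder for a non-D.C. zip code.
ZIP_CSV = """ZIP,LAT,LNG
20064,38.936354,-76.999167
99999,0.0,0.0
"""


def dc_zip_coordinates(zip_csv_text, dc_zips):
    """Return ({zip: (lat, lng)}, missing) for the zip codes in dc_zips."""
    wanted = {z.strip() for z in dc_zips}
    coords = {}
    for row in csv.DictReader(io.StringIO(zip_csv_text)):
        # Keep zip codes as strings so that leading zeros survive.
        zip_code = row["ZIP"].strip()
        if zip_code in wanted:
            coords[zip_code] = (float(row["LAT"]), float(row["LNG"]))
    # Zip codes from the D.C. list that have no coordinates in the table.
    missing = sorted(wanted - set(coords))
    return coords, missing


coords, missing = dc_zip_coordinates(ZIP_CSV, ["20064", "20001"])
print(coords)
print(missing)
```

Returning the unmatched zip codes separately makes it easy to spot entries of the D.C. list that the coordinate table does not cover.
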
#### Third Source
The third source of data is the Foursquare API. We use it to fetch data about the venues around the coordinates of each zip code. We mostly use the following fields:
1. The venue's name: The name of the place
2. The venue's location latitude: The latitude of the place
3. The venue's location longitude: The longitude of the place
4. The venue's categories: This field gives us the type of the venue (restaurant, bar, fast food)
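
As an illustration of how these four fields can be pulled out of a response, here is a hedged sketch that assumes the legacy Foursquare v2 `venues/explore` endpoint and its nested `response -> groups -> items -> venue` layout. The source does not say which endpoint, radius, or limit was used, so `EXPLORE_URL`, the default values, and both helper functions are assumptions. No network call is made; the sample payload is hand-written from the first row of the project's data.

```python
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # assumed endpoint


def explore_params(lat, lng, client_id, client_secret, radius=500, limit=100):
    """Build the query parameters for one zip code (radius/limit are guesses)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20180605",  # API version date, an arbitrary placeholder
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def venues_from_response(payload):
    """Flatten an explore-style payload into one dict per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or []
            rows.append({
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                # Use the first listed category, or None if there is none.
                "Venue Category": categories[0]["name"] if categories else None,
            })
    return rows


# Hand-written sample in the assumed response layout.
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "&pizza",
    "location": {"lat": 38.932582, "lng": -76.996696},
    "categories": [{"name": "Pizza Place"}],
}}]}]}}

params = explore_params(38.936354, -76.999167, "CLIENT_ID", "CLIENT_SECRET")
rows = venues_from_response(sample)
print(rows)
```

In a real run, `params` would be sent to `EXPLORE_URL` with an HTTP client and the decoded JSON passed to `venues_from_response`.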

### How we will use the Data
We intend to analyze the most common venues around each zip code and perform clustering on this data set. This will give us an idea of which neighborhoods are similar.

### The Preprocessed and Cleaned Data Set

After preprocessing and cleaning, we obtain the following data set. The fields are as follows:

1. Neighborhood: A zip code in Washington D.C.
2. Neighborhood Latitude: The latitude of the zip code
3. Neighborhood Longitude: The longitude of the zip code
4. Venue: The name of the venue
5. Venue Latitude: The latitude of the venue
6. Venue Longitude: The longitude of the venue
7. Venue Category: The type of the venue. This will mainly be used to determine how similar two neighborhoods are.

```python
import pandas as pd
pd.read_csv('venues_df').head()
```

| Unnamed: 0.1 | Unnamed: 0 | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 20064 | 38.936354 | -76.999167 | &pizza | 38.932582 | -76.996696 | Pizza Place |
| 1 | 1 | 20064 | 38.936354 | -76.999167 | Chick-fil-A | 38.935476 | -76.998198 | Food Service |
| 2 | 2 | 20064 | 38.936354 | -76.999167 | Busboys and Poets | 38.932117 | -76.99764 | American Restaurant |
| 3 | 3 | 20064 | 38.936354 | -76.999167 | Starbucks Reserve | 38.932484 | -76.997172 | Coffee Shop |
| 4 | 4 | 20064 | 38.936354 | -76.999167 | BGR Burgers Grilled Right | 38.932647 | -76.99674 | Burger Joint |
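
To compare neighborhoods, each one has to be turned into a numeric vector. One plausible construction, sketched below, is the share of a neighborhood's venues that falls in each category (equivalent to averaging one-hot encoded category rows per neighborhood). This is an assumption about the feature construction, not a reproduction of the project's code; the function `category_frequencies` is hypothetical, and the `20005` rows are made up for illustration.

```python
from collections import Counter, defaultdict

# (Neighborhood, Venue Category) pairs. The 20064 rows come from the table
# above; the 20005 rows are made up for illustration.
rows = [
    ("20064", "Pizza Place"),
    ("20064", "Food Service"),
    ("20064", "American Restaurant"),
    ("20064", "Coffee Shop"),
    ("20064", "Burger Joint"),
    ("20005", "Hotel"),
    ("20005", "Hotel"),
    ("20005", "Coffee Shop"),
    ("20005", "American Restaurant"),
]


def category_frequencies(rows):
    """Return (categories, {neighborhood: frequency vector})."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    # One vector component per category, in a fixed alphabetical order.
    categories = sorted({category for _, category in rows})
    vectors = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        vectors[neighborhood] = [counter[c] / total for c in categories]
    return categories, vectors


categories, vectors = category_frequencies(rows)
print(categories)
print(vectors["20005"])
```

Normalizing by the number of venues keeps neighborhoods with many venues comparable to those with few.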


## 3. Methodology 

### Exploratory Data Analysis


Let us first see which types of venues are the most common in Washington D.C.

```python
pd.read_csv("df_most_common.cvs").head(10)
```

| Unnamed: 0.1 | Unnamed: 0 | Frequency |
|---|---|---|
| 0 | Park | 1.332066 |
| 1 | Coffee Shop | 1.071133 |
| 2 | Sandwich Place | 0.972823 |
| 3 | American Restaurant | 0.844335 |
| 4 | Convenience Store | 0.841558 |
| 5 | Harbor / Marina | 0.752828 |
| 6 | Hotel | 0.734662 |
| 7 | Pizza Place | 0.72681 |
| 8 | Gym | 0.680204 |
| 9 | Boat or Ferry | 0.622205 |


We now determine which types of venues are the most common in each neighborhood.

```python
pd.read_csv("neighborhoods_venues_sorted.csv").head()
```

| Unnamed: 0.1 | Unnamed: 0 | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 20001 | Thai Restaurant | Liquor Store | BBQ Joint | Grocery Store | Bookstore | Market | Building | Spanish Restaurant | Middle Eastern Restaurant | Gas Station |
| 1 | 1 | 20002 | American Restaurant | Bar | Gym | Moving Target | New American Restaurant | Park | Pharmacy | Diner | Convenience Store | Sandwich Place |
| 2 | 2 | 20003 | Pizza Place | Bar | Coffee Shop | Art Gallery | Gym / Fitness Center | Bakery | Sandwich Place | Spa | Mobile Phone Shop | Pet Store |
| 3 | 3 | 20004 | Hotel | Science Museum | History Museum | Exhibit | American Restaurant | Food Truck | Museum | Coffee Shop | Bakery | Sandwich Place |
| 4 | 4 | 20005 | Hotel | Hotel Bar | Coffee Shop | American Restaurant | Salon / Barbershop | Latin American Restaurant | Sandwich Place | Sushi Restaurant | New American Restaurant | Deli / Bodega |
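
A table of this shape can be produced by ranking each neighborhood's category frequencies and labeling the ranks with ordinal column names. The sketch below shows one way to do it; the helpers `ordinal` and `most_common_row`, the alphabetical tie-breaking rule, and the toy frequencies are all assumptions introduced for illustration.

```python
def ordinal(i):
    """Return '1st', '2nd', '3rd', '4th', ... for a positive integer."""
    if 10 <= i % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(i % 10, "th")
    return f"{i}{suffix}"


def most_common_row(neighborhood, freqs, n=3):
    """Build one table row with the n most frequent categories.

    freqs maps category -> frequency. Ties are broken alphabetically,
    which is an assumption made here to keep the output deterministic.
    """
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    row = {"Neighborhood": neighborhood}
    for rank, (category, _) in enumerate(ranked[:n], start=1):
        row[f"{ordinal(rank)} Most Common Venue"] = category
    return row


# Toy frequencies, made up for illustration.
toy_freqs = {"Hotel": 0.5, "Coffee Shop": 0.25, "American Restaurant": 0.25}
row = most_common_row("20005", toy_freqs)
print(row)
```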


### Machine Learning
In this project, we want to find neighborhoods with the same features, that is, to see which groups of zip codes can be put in the same cluster. This is why we decided to use the k-means clustering algorithm. The different venue types serve as the components of a 237-column vector that represents each zip code. After experimenting with different values of k, the number of clusters, we found that 4 clusters is optimal.

![image](map.png)
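
For readers unfamiliar with k-means, the following is a minimal plain-Python sketch of the standard iteration (Lloyd's algorithm): assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. It is not the project's code; in practice a library implementation such as scikit-learn's `KMeans` would normally be used. The toy vectors have 3 components rather than 237, and the function names and the seed are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids, inertia) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Initialize the centroids with k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total squared distance of the points to their centroids.
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


# Toy frequency vectors: two pairs of similar "neighborhoods".
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels, centroids, inertia = k_means(points, k=2)
print(labels)

# Comparing the inertia for several k is one common way to experiment with k.
for k in (1, 2, 3):
    print(k, k_means(points, k)[2])
```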

## 4. Results

In this section, we discuss the results of the project. Throughout this project, we were able to group the neighborhoods of Washington D.C. into four clusters. Our final deliverable lets the user enter the zip code (s)he is moving from and the zip code (s)he is planning to move to. We then recommend the zip code that is nearest to the target zip code and is also of the same type (in the same cluster) as the original zip code.
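
The recommendation rule can be sketched as follows: among the zip codes that share the origin's cluster, pick the one closest to the target. The great-circle (haversine) distance, the exclusion of the origin itself from the candidates, the function names, and the toy coordinates and cluster labels are all assumptions made for illustration; the source does not specify how distance was measured.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lng) pairs."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def recommend(origin_zip, target_zip, coords, cluster_of):
    """Nearest zip to target_zip that is in the same cluster as origin_zip."""
    candidates = [
        z for z in coords
        # Assumption: the origin itself is not a useful recommendation.
        if cluster_of[z] == cluster_of[origin_zip] and z != origin_zip
    ]
    if not candidates:
        return None  # the origin is alone in its cluster
    return min(candidates, key=lambda z: haversine_km(coords[z], coords[target_zip]))


# Toy data with placeholder zip names; coordinates and labels are made up.
coords = {
    "A": (38.90, -77.04),
    "B": (38.92, -77.03),
    "C": (38.95, -77.00),
    "D": (38.91, -77.045),
    "E": (38.93, -77.02),
}
cluster_of = {"A": 0, "B": 1, "C": 0, "D": 0, "E": 2}

print(recommend("C", "B", coords, cluster_of))
print(recommend("E", "B", coords, cluster_of))
```

When the target zip code is itself in the origin's cluster, it is at distance zero and is therefore the one returned.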

Our clusters are as follows:

* Cluster No 1: 20064, 20319, 20373, 20593, 20390, 20003, 20002, 20005, 20004, 20007, 20009, 20008, 20016, 20015, 20018, 20017, 20020, 20019, 20024, 20032, 20052, 20057, 20510
* Cluster No 2: 20036
* Cluster No 3: 20317, 20001, 20011, 20010, 20012, 20037, 20202
* Cluster No 4: 20006

Along the way, we also learned that the most common types of venues in Washington D.C. are:

1. Park
2. Coffee Shop
3. Sandwich Place
4. American Restaurant
5. Convenience Store
6. Harbor / Marina
7. Hotel
8. Pizza Place
9. Gym
10. Boat or Ferry

## 5. Discussion

In this section, we discuss some observations. It turns out that most neighborhoods fall into two groups, while two neighborhoods have unique features and each form a cluster of their own. The first is the one around the World Bank. The second is the neighborhood near Dupont Circle. Of the two larger groups, one happens to be a band between Georgia Avenue and 16th Street. The other, which is by far the largest, is spread all over the city.

The recommendations we can make based on these results require that the user provide the outbound zip code (the one being left) and the inbound zip code (the one being moved to).

## 6. Conclusion

Throughout this project, we obtained a segmentation of Washington D.C. into four clusters. This segmentation was based on the most common venues in each neighborhood. Our project lets the user enter the zip code (s)he is moving from and the zip code (s)he is planning to move to. We then recommend the zip code that is nearest to the target zip code and is also of the same type (in the same cluster) as the original zip code.