# **Capstone Project Report - Educated Relocation**

## John Williams

### **1. Introduction**

#### **1.1 Background**
Moving is challenging no matter the distance, and it is even more so when the destination is a place about which little is known. Nevertheless, people must move for a variety of reasons. In general, they want to move to a neighborhood they will like, which makes the transition to an unfamiliar place easier. We can assume that people will like neighborhoods that resemble ones they have enjoyed living in or visiting in the past.

#### **1.2 Problem**
A client has reached out to us for help with the relocation process. They are planning a move to the Pacific Northwest and would like to find neighborhoods in both Portland, OR and Seattle, WA that they would like. They have enjoyed living in their current neighborhood, Hampden, in Baltimore, MD and have enjoyed spending time in Decatur, GA. In discussion with the client, we learned that they liked these neighborhoods because of the venues they have. Furthermore, they would like to have a park and a grocery store in their neighborhood, and they would like the grocery store to be ‘nice’. In this project we will attempt to identify fewer than ten neighborhoods in either Portland or Seattle that the client would like.

#### **1.3 Interest**
The client is obviously interested in the results. If the project is successful, additional clients may be interested in such services to help direct their search during relocation.

### **2. Data**

#### **2.1 Data Sources**
To determine which neighborhoods the client might like, we first need a list of the neighborhoods in Portland, OR and Seattle, WA. For this we will use two CSV files found on opendata.arcgis.com that list the neighborhoods in Portland, OR and Seattle, WA, respectively. Once we have the list, we will use Foursquare and geolocator to determine the coordinates of each neighborhood. From there we will use the Foursquare Places API to search each neighborhood and generate a list of venues within 750 m of its coordinates. Finally, once we have narrowed down the neighborhoods to fewer than ten, we will again use the Foursquare Places API, this time to look up each grocery store's rating and rank the neighborhoods accordingly.

#### **2.2 Data Cleaning**
The CSV files from opendata.arcgis.com are not the cleanest and require some work to produce an accurate list of neighborhoods. We removed any rows with blank neighborhood names or names that were nonsensical (e.g. “OOO”). We then generated the coordinates for each neighborhood using geolocator and plotted each neighborhood with a 500 m radius around its coordinates to check that we had adequate coverage of the cities. Upon visual inspection, we were missing a few neighborhoods in each city, so these were added manually and their coordinates generated with geolocator. A 500 m radius also did not seem to cover enough of the cities, so the radius was increased to 750 m prior to venue exploration.
![alt text][prt_final]
![alt text][sea_final]
With an accurate list of neighborhoods and their coordinates, it was almost time to move on to venue generation. Before doing this, we added Hampden, Baltimore, MD and Decatur, GA to the list, so that these two neighborhoods are included when we cluster the neighborhoods.
After adding the two known neighborhoods, we used the Foursquare API to get up to 100 venues in each neighborhood. We then cleaned the GET response to extract the category of each venue, as that will ultimately be used in our clustering.
Finally, after clustering and further narrowing of the neighborhoods, we explored the grocery stores in the remaining neighborhoods to retrieve their ratings. We then ranked each neighborhood according to its grocery store rating in order to provide a final list of recommendations to the client.
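As a rough illustration of cleaning a venue response down to its categories, the sketch below flattens a nested JSON-like dictionary into (neighborhood, venue, category) rows. The response layout (`response` → `groups` → `items` → `venue` → `categories`) and the venue names are assumptions for the example, not a statement of what the API returned in this project.

```python
def extract_venues(neighborhood, response):
    """Flatten one venue-search response into (neighborhood, venue, category) rows."""
    rows = []
    groups = response.get("response", {}).get("groups", [])
    for group in groups:
        for item in group.get("items", []):
            venue = item.get("venue", {})
            categories = venue.get("categories", [])
            if not categories:
                continue  # skip venues with no category, since clustering needs one
            rows.append((neighborhood, venue.get("name"), categories[0].get("name")))
    return rows

# Hypothetical response with made-up venues, shaped as assumed above
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe", "categories": [{"name": "Coffee Shop"}]}},
                    {"venue": {"name": "Example Green", "categories": [{"name": "Park"}]}},
                    {"venue": {"name": "Uncategorized Spot", "categories": []}},
                ]
            }
        ]
    }
}
venue_rows = extract_venues("Hampden", sample_response)
print(venue_rows)
```

Only the primary (first) category of each venue is kept here, which is a simplifying choice for the sketch.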

### **3. Methodology**

#### **3.1 Clustering**
After we generated the list of venues in each neighborhood, we used K Means clustering to determine which neighborhoods were most similar to Hampden, Baltimore, MD and Decatur, GA. We chose K Means because we had a medium-sized data frame with several variables. To prepare the data, we built a one-hot matrix with each venue as a row and each venue category as a column. We then took the mean of each category column within each neighborhood, which gives the share of that neighborhood's venues falling in each category. With this, we started clustering.
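Averaging one-hot rows per neighborhood is equivalent to dividing each category's venue count by the neighborhood's total venue count. The sketch below shows that calculation without any data-frame library; the venues listed are made up for illustration.

```python
from collections import Counter, defaultdict

def category_frequencies(venue_rows):
    """venue_rows: list of (neighborhood, category) pairs.

    Returns the sorted category list and, per neighborhood, the share of its
    venues in each category (the mean of the one-hot columns).
    """
    categories = sorted({cat for _, cat in venue_rows})
    counts = defaultdict(Counter)
    for hood, cat in venue_rows:
        counts[hood][cat] += 1
    profiles = {}
    for hood, counter in counts.items():
        total = sum(counter.values())
        profiles[hood] = [counter[cat] / total for cat in categories]
    return categories, profiles

# Made-up venues, for illustration only
sample_rows = [
    ("Hampden", "Coffee Shop"), ("Hampden", "Bar"),
    ("Hampden", "Coffee Shop"), ("Hampden", "Park"),
    ("Decatur", "Bar"), ("Decatur", "Pizza Place"),
]
categories, profiles = category_frequencies(sample_rows)
print(categories)
print(profiles)
```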
To find the optimal number of clusters, we computed the sum of squared errors, called “cost”, for each candidate number of clusters and plotted it against the number of clusters to look for an elbow point. An elbow point is the point at which additional clusters generate only marginal improvement in the sum of squared errors. We examined 2 to 150 clusters and settled on 30. There was no blatantly obvious elbow point, but anything above 30 seemed to provide only marginal benefit to the clustering.
![alt text][elbow]
Once the clusters were set, we looked at the cluster for both Hampden and Decatur to determine the neighborhoods that were most similar based on type of venue category.
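The elbow search described above can be sketched as follows. This is a dependency-free toy version of K Means (Lloyd's algorithm) written for illustration; in practice a library implementation would normally be used, and the points, the seed, and the range of cluster counts here are assumptions rather than the project's data.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_cost(points, k, iterations=50, seed=0):
    """Run Lloyd's algorithm and return the final sum of squared errors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data: three tight groups in two dimensions
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0), (9.1, 0.0), (9.0, 0.1),
]
costs = {k: kmeans_cost(points, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 3))
```

Plotting `costs` against `k` gives the elbow curve. Because the result depends on the random starting centroids, the cost is not guaranteed to fall smoothly as `k` grows, which is one reason an elbow can be hard to read.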

#### **3.2 Narrowing**
After clustering we still had more than 10 potential neighborhoods, so further narrowing was required. To do this, we used the client's preference to have both a park and a grocery store in their future neighborhood. We started by eliminating neighborhoods without a park. We then eliminated neighborhoods without a grocery store. This brought us to 8 neighborhoods, so we moved on to ranking them.
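A minimal sketch of this two-step filter is shown below. It assumes each neighborhood maps to the list of venue category names found in it and that a category matches by exact, case-insensitive name; the neighborhood labels and categories are placeholders, not project data.

```python
def has_category(categories, wanted):
    """True if any category name equals `wanted`, ignoring case."""
    return any(c.lower() == wanted.lower() for c in categories)

def narrow(neighborhood_categories, required=("Park", "Grocery Store")):
    """Keep only neighborhoods that contain every required venue category."""
    return [
        hood
        for hood, cats in neighborhood_categories.items()
        if all(has_category(cats, wanted) for wanted in required)
    ]

# Placeholder neighborhoods and categories, for illustration only
sample = {
    "Neighborhood A": ["Coffee Shop", "Park", "Grocery Store"],
    "Neighborhood B": ["Bar", "Park"],
    "Neighborhood C": ["Grocery Store", "Pizza Place"],
}
remaining = narrow(sample)
print(remaining)
```

Exact matching avoids counting a category such as "Parking" as a park, at the cost of missing more specific park or grocery categories if the data uses them.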

#### **3.3 Ranking**
The client wants not just a grocery store but a nice grocery store in their neighborhood. We used the Foursquare API to explore each grocery store to find its rating. If a rating did not exist, the grocery store was given a rating of “0”. The neighborhoods were then ranked based on the rating of their grocery store. If multiple grocery stores existed in a neighborhood, the neighborhood was ranked based on the one with the highest rating.
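The ranking rule can be sketched as follows: missing ratings count as 0, each neighborhood is scored by its best grocery store, and neighborhoods are sorted by that score. The labels and ratings below are invented for the example.

```python
def rank_neighborhoods(grocery_ratings):
    """grocery_ratings: {neighborhood: [rating or None, ...]}.

    Returns (neighborhood, best_rating) pairs, highest rating first.
    """
    best = {
        hood: max((r if r is not None else 0.0 for r in ratings), default=0.0)
        for hood, ratings in grocery_ratings.items()
    }
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)

# Invented ratings, for illustration only; None means no rating was available
sample_ratings = {
    "Neighborhood A": [7.9, 8.4],
    "Neighborhood B": [9.1],
    "Neighborhood C": [None],
}
ranked = rank_neighborhoods(sample_ratings)
print(ranked)
```

Python's sort is stable, so neighborhoods with equal scores keep their input order; a tie-breaking rule would need to be added if that matters.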

### **4. Results**
We started with 180 potential neighborhoods in Portland, OR and Seattle, WA. After generating the venue categories and counts within each neighborhood, we clustered them into 30 groups. The majority of the groups contained 2 or fewer neighborhoods. Luckily, both Hampden and Decatur were in the same cluster, Cluster 25, so we moved forward with further narrowing of that cluster. The cluster had 33 neighborhoods, which, after excluding Hampden and Decatur, left 31 potential neighborhoods. The venue categories most common in these neighborhoods were coffee shops, bars, and pizza places. However, since our goal was to provide a list of fewer than 10 potential neighborhoods to the client, further narrowing was required.
![alt text][clust_neigh]
![alt text][clust25]
We narrowed based on client preference. We removed all neighborhoods without a park, which dropped the potential neighborhood list to 23. Then we dropped neighborhoods without a grocery store.  This got us to a list of 8 potential neighborhoods.
![alt text][rank]
To better guide the client, we ranked the neighborhoods based on the rating of their highest-rated grocery store. Two neighborhoods did not have a grocery store that had been rated, so they were placed at the bottom of the list. The final neighborhood ranking is listed below.
1.	Boise, Portland, OR
2.	Hosford-Abernethy, Portland, OR
3.	Roosevelt, Seattle, WA
4.	Vernon, Portland, OR
5.	Pearl District, Portland, OR
6.	Woodstock, Portland, OR
7.	Beaumont-Wilshire, Portland, OR
8.	Piedmont, Portland, OR

### **5. Discussion**
Over the course of the project, a few challenges arose. The first was coming up with a list of neighborhoods in Portland and Seattle. Luckily, we were able to find a CSV file for each. However, they did not provide adequate coverage of the city and were missing a few neighborhoods. While the missing neighborhoods were added manually, another approach would have been to cover the city with slightly overlapping sectors of a given radius. This would have allowed adequate coverage without the overlap we saw in high density neighborhood regions. After narrowing to fewer sectors, we could have back calculated which neighborhoods were in these sectors. Nevertheless, we were able to identify neighborhoods that will likely be a good match for the client.
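The sector-based alternative could look roughly like the sketch below, which lays circle centers on a square grid over a bounding box. Circles of radius r on a grid with spacing r·√2 just cover the area, so shrinking the spacing slightly gives a small overlap. The bounding box is arbitrary, the `overlap` factor is a made-up parameter, and the degree-to-meter conversion is a flat approximation that is only reasonable over a city-sized area.

```python
import math

def covering_centers(south, west, north, east, radius_m, overlap=0.9):
    """Return (lat, lon) centers of circles of radius_m that cover a bounding box."""
    meters_per_deg_lat = 111_320.0  # approximate length of one degree of latitude
    mid_lat = math.radians((south + north) / 2)
    meters_per_deg_lon = meters_per_deg_lat * math.cos(mid_lat)
    # Grid spacing: radius * sqrt(2) gives exact cover; overlap < 1 adds a margin
    step_m = radius_m * math.sqrt(2) * overlap
    height_m = (north - south) * meters_per_deg_lat
    width_m = (east - west) * meters_per_deg_lon
    n_rows = max(1, math.ceil(height_m / step_m))
    n_cols = max(1, math.ceil(width_m / step_m))
    centers = []
    for i in range(n_rows):
        for j in range(n_cols):
            y = (i + 0.5) * step_m  # meters north of the southern edge
            x = (j + 0.5) * step_m  # meters east of the western edge
            centers.append((south + y / meters_per_deg_lat,
                            west + x / meters_per_deg_lon))
    return centers

# Arbitrary illustrative bounding box with the 750 m radius used in this project
centers = covering_centers(45.50, -122.70, 45.55, -122.60, radius_m=750)
print(len(centers))
```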
The second challenge was determining which clustering method to use. K Means did not seem to have an obvious elbow point, so we tried DBSCAN. However, it seemed like the points were too dense for the algorithm to determine the subtleties between neighborhoods. We returned to K Means and plotted the sum of squared errors for each number of clusters. The slope seemed to change at around 30 clusters, so this was chosen as our elbow point. Luckily, Hampden and Decatur were in the same cluster and we were able to move forward with our analysis.
Ultimately, we were able to pare the list down to a reasonable size by requiring both a park and a grocery store. This was easy to do, and again we were lucky that only 8 of the potential 31 neighborhoods had both. We were thus able to recommend fewer than 10 neighborhoods on which the client can focus their search when relocating to the Pacific Northwest.

### **6. Conclusion**
In this project we helped a hypothetical client determine which neighborhoods they might like when they move to the Pacific Northwest. Both Portland, OR and Seattle, WA had many neighborhoods to choose from. Through clustering based on popular venue categories in each neighborhood and further narrowing based on specific venue category requirements, we were able to provide a list of 8 neighborhoods toward which they can direct their search.


[elbow]: https://github.com/willij10/Coursera_Capstone/blob/master/Elbow.JPG?raw=true "Elbow Point"
[prt_final]: https://github.com/willij10/Coursera_Capstone/blob/master/Portland%20Neighborhoods.JPG?raw=true "Portland Neighborhood Map"
[sea_final]: https://github.com/willij10/Coursera_Capstone/blob/master/Seattle%20Neighborhoods.JPG?raw=true "Seattle Neighborhood Map"
[clust_neigh]: https://github.com/willij10/Coursera_Capstone/blob/master/Neigh%20per%20clust.JPG?raw=true "Neighborhoods per Cluster"
[clust25]: https://github.com/willij10/Coursera_Capstone/blob/master/Top%20venues%20in%20clust25.JPG?raw=true "Venues in cluster 25"
[rank


