# Battle of the Neighborhoods - Which place should I choose?
### --- Report

### Table of Contents
- A. Introduction/Business Problem
- B. Data
- C. Methodology
- D. Results
- E. Discussion
- F. Conclusion

## A. Introduction/Business Problem

Jack Lee currently lives in **Flushing, Queens, New York City**. He recently received a job offer from a company in **Manhattan** with strong career prospects. To avoid a long commute through heavy traffic, he wants to move to Manhattan. However, he really enjoys his neighborhood in Flushing, mainly because of its many amenities and venues, such as its wide variety of restaurants and bakeries. We would like to help him find neighborhoods in Manhattan that are **similar to his current neighborhood**.

Jack also has a second requirement: he wants to **rent a studio in the neighborhood with the lowest average rental price**, because he has just graduated, has student loans to repay, and needs to be frugal.

#### **So the problem is --- Which neighborhood in Manhattan is similar to Flushing, Queens, NYC and has the lowest average rent for a studio?**

**Who would be interested in this project ---** This project not only gives Jack Lee immediate help in deciding where to live based on his requirements, but can also serve as a template for comparing different neighborhoods or cities and selecting among the results based on chosen criteria.

## B. Data

### 1. Geographic Data

We will analyze and compare Flushing and the neighborhoods in Manhattan. In order to segment the neighborhoods and explore them, we need a dataset that contains Flushing, Queens and the neighborhoods of Manhattan, together with the latitude and longitude coordinates of each neighborhood. The dataset will also be used to create maps.

The link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572 
The files can be downloaded from IBM Cloud by running a `wget` command. After some data wrangling, the data is transformed into a dataframe with 4 columns:
**['Borough', 'Neighborhood', 'Latitude', 'Longitude']**

![geo_data](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/geo_data.png)

### 2. Foursquare Location Data

The Foursquare location data will be used to explore the neighborhoods and segment them into clusters so that we can find the similar neighborhoods.

We use the Foursquare API to get the top 100 venues within a radius of 500 meters of Flushing and of each neighborhood in Manhattan, and transform the data into a dataframe with 7 columns:  
**['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']**

We then analyze each neighborhood by one-hot encoding the venue categories, grouping rows by neighborhood, and taking the mean frequency of occurrence of each 'Venue Category'. Finally, we run k-means to group the neighborhoods into clusters.

### 3. Rent Data

Average rent data for Manhattan, NY by neighborhood comes from an online source: https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/
The dataset includes average rent for "Studio", "1-bedroom", "2-bedrooms", "3-bedrooms" and "All-rentals" in Manhattan by neighborhood. We scrape the web page and transform the data into a dataframe. As we only need the 'Studio Rent' data for analysis, the dataframe has only 2 columns: **['Neighborhood', 'Studio Rent']**.

The 'Studio Rent' data will be used as the criterion for ranking the similar neighborhoods, so that we can return the one with the lowest average 'Studio Rent'.

![rent_data](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/rent_data.png)

## C. Methodology

We will leverage the **Foursquare location data** to explore and compare **Flushing, Queens, NYC** and **the neighborhoods of Manhattan**. We use the **k-means clustering algorithm** to find the similar neighborhoods, that is, the neighborhoods in Manhattan that fall in the same cluster as Flushing, and the **Folium library** to visualize the neighborhoods and their emerging clusters. Finally, we apply the **BeautifulSoup library** to extract the **average rent data** for Manhattan by neighborhood from an online source, and use it to find the similar neighborhood in Manhattan that has the lowest average rent for a studio.

### Which machine learning method was used, and why?
We use the **k-means clustering** method. **K-means** is widely used for clustering in data science applications and is especially useful when you need to quickly discover insights from unlabeled data. That is the situation here, because the neighborhoods carry no predefined similarity labels.
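To make the idea concrete, here is a minimal sketch of k-means on a handful of venue-frequency vectors. It is an illustration only, not the report's own code: the matrix `X`, its four category columns, and the choice of two clusters are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue-category frequencies, one row per neighborhood.
# Columns: [Chinese Restaurant, Bakery, Bar, Park]
X = np.array([
    [0.6, 0.3, 0.1, 0.0],   # restaurant-heavy
    [0.5, 0.4, 0.1, 0.0],   # restaurant-heavy
    [0.0, 0.1, 0.5, 0.4],   # bar/park-heavy
    [0.1, 0.0, 0.4, 0.5],   # bar/park-heavy
])

# fit k-means with two clusters; random_state makes the run repeatable
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# one integer label per row; rows with the same label are "similar"
labels = kmeans.labels_
print(labels)
```

Rows with similar category profiles receive the same label, which is the property the report relies on when it looks for Manhattan neighborhoods that share a cluster with Flushing.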

### The detailed data exploration and analysis steps are shown below:

### 1. Download and Explore Dataset

#### (1) Rent Data. 
- Use the **BeautifulSoup library** to extract the average rent data for Manhattan, NY by neighborhood from the website.

In [1]:
#import urllib.request
#import bs4 as bs
#import pandas as pd
#website_url = urllib.request.urlopen('https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/').read()
#soup = bs.BeautifulSoup(website_url,'lxml')
#mytable = soup.find('table',{'class':'market-trends', 'id':'MarketTrendsAverageRentTable'})
#mytable

- Use **findAll('th')** and **find_all('tr')** to get the neighborhoods from the table headers and the studio rent from the table rows.
- Create a DataFrame consisting of the neighborhoods and their average studio rent.
- Data wrangling: use the **.str.replace() method** to delete ',' and '$', and the **.astype(int) method** to convert the strings to integers.

In [2]:
# get Neighborhood from table headers
#table_headers = [th.getText() for th in mytable.findAll('th')]
#table_headers_neigh = table_headers[7:]
#df_col_neigh = pd.DataFrame({'Neighborhood':table_headers_neigh})

# get Studio Rent from table rows
#table_data = []
#table_rows = mytable.find_all('tr')
#for tr in table_rows:
#    td = tr.find_all('td')
#    row = [i.text for i in td]
#    table_data.append(row)
# dataframe consists of five columns: All rentals, Studio Rent, 1 Bed , 2 Beds, and 3 Beds
#df_data = pd.DataFrame(table_data,columns=['All rentals','Studio Rent','1 Bed','2 Beds','3 Beds'])
#df_data.drop([0,1],inplace=True)
# only keep 'Studio Rent' column
#df_data.drop(columns = ['All rentals','1 Bed','2 Beds','3 Beds'], inplace=True)
#df_data.reset_index(drop=True, inplace=True)
#df_data['Studio Rent'] = df_data['Studio Rent'].str.replace(',', '', regex=False)
# regex=False treats '$' as a literal character rather than the end-of-string anchor
#df_data['Studio Rent'] = df_data['Studio Rent'].str.replace('$', '', regex=False)
#df_data['Studio Rent'] = df_data['Studio Rent'].astype(int)

# Create a DataFrame consisting of the neighborhoods and their average studio rent
#df_rent = pd.concat([df_col_neigh, df_data], axis=1)
#df_rent
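The string clean-up in the cell above can be checked in isolation. The following is a small sketch on a toy series; the helper name `clean_rent` and the sample values are hypothetical and only stand in for the scraped column.

```python
import pandas as pd

def clean_rent(series):
    """Turn strings such as '$1,678' into integers."""
    return (series.str.replace(',', '', regex=False)   # drop thousands separator
                  .str.replace('$', '', regex=False)   # drop literal dollar sign
                  .astype(int))

# Hypothetical scraped values
raw = pd.Series(['$1,678', '$2,450', '$3,005'], name='Studio Rent')
cleaned = clean_rent(raw)
print(cleaned.tolist())
```

Passing `regex=False` matters because `$` is the end-of-string anchor when the pattern is interpreted as a regular expression.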

#### (2) Geographic Data.
- Run a `wget` command to download the data from IBM Cloud.

In [3]:
#!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
#print('NYC Data downloaded!')
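The loop in the next cell iterates over `neighborhoods_data`, which the report does not define. A plausible reading, assuming the downloaded file is a GeoJSON feature collection with a top-level `features` list, is sketched below. The inline sample and its rounded coordinates are hypothetical; only the `properties` and `geometry` keys are taken from the loop itself.

```python
import json

# Hypothetical two-feature excerpt; coordinates are rounded, illustrative values
sample_geojson = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.83, 40.76]},
    "properties": {"name": "Flushing", "borough": "Queens"}},
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.94, 40.85]},
    "properties": {"name": "Washington Heights", "borough": "Manhattan"}}
 ]}
"""

# With the downloaded file this would instead be:
#     with open('newyork_data.json') as f:
#         newyork_data = json.load(f)
newyork_data = json.loads(sample_geojson)

# each feature describes one neighborhood
neighborhoods_data = newyork_data['features']

first = neighborhoods_data[0]
lon, lat = first['geometry']['coordinates']  # GeoJSON stores longitude first
print(first['properties']['name'], lat, lon)
```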

- Define the dataframe columns **['Borough', 'Neighborhood', 'Latitude', 'Longitude']**.
- Use a **for loop** to collect one row per neighborhood, then build the dataframe from the collected rows.

In [4]:
# define the dataframe columns
#column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood (DataFrame.append was removed in pandas 2.0)
#rows = []
#for data in neighborhoods_data:
#    borough = data['properties']['borough'] 
#    neighborhood_name = data['properties']['name']

#    neighborhood_latlon = data['geometry']['coordinates']
#    neighborhood_lat = neighborhood_latlon[1]
#    neighborhood_lon = neighborhood_latlon[0]

#    rows.append({'Borough': borough,
#                 'Neighborhood': neighborhood_name,
#                 'Latitude': neighborhood_lat,
#                 'Longitude': neighborhood_lon})

# build the dataframe from the collected rows
#neighborhoods = pd.DataFrame(rows, columns=column_names)

- **Slice the dataframe** and create a new dataframe of only the **Flushing** and **Manhattan** data.
- Use the **geolocator.geocode() method** to get the geographical coordinates of Manhattan.
- Visualize Flushing and the neighborhoods in Manhattan using the **folium** library.

In [5]:
# Slice the dataframe and create a new dataframe of only the Flushing and Manhattan data
#fl_data = neighborhoods[neighborhoods['Neighborhood'] == 'Flushing']
#manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
#fl_manhattan_data = pd.concat([fl_data,manhattan_data]).reset_index(drop=True)
#fl_manhattan_data

#address = 'Manhattan, NY'

# geopy 2.x requires a user_agent; the value used here is arbitrary
#geolocator = Nominatim(user_agent='ny_explorer')
#location = geolocator.geocode(address)
#latitude = location.latitude
#longitude = location.longitude
#print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

# create map of Manhattan using latitude and longitude values
#map_fl_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
#for lat, lng, label in zip(fl_manhattan_data['Latitude'], fl_manhattan_data['Longitude'], fl_manhattan_data['Neighborhood']):
#    label = folium.Popup(label, parse_html=True)
#    folium.CircleMarker(
#        [lat, lng],
#        radius=5,
#        popup=label,
#        color='blue',
#        fill=True,
#        fill_color='#3186cc',
#        fill_opacity=0.7,
#        parse_html=False).add_to(map_fl_manhattan)  
    
#map_fl_manhattan

#### (3) Foursquare Location Data.
- Define Foursquare Credentials and Version.
- Define function that extracts the category of the venue.

In [6]:
#CLIENT_ID = 'TCYPPRL3BT41PAGLPZBCKOQTRUTJNVISGZMJSYRCVR1NSR2I'
#CLIENT_SECRET = 'AWH4400LSJVSZJKSTVQW0RSQTELTR3KCHKTJVMQ5UESYZJCO'
#VERSION = '20180605'
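The second bullet above mentions a function that extracts the category of a venue, but the report does not show it. One possible shape is sketched here. The name `get_category_type` and the fallback key `venue.categories` are assumptions; the `categories[0]['name']` access mirrors what the venue-collecting function later in the report does.

```python
def get_category_type(row):
    """Return the name of a venue's first category, or None if it has none.

    `row` can be a dict or a pandas Series; both provide .get().
    """
    categories = row.get('categories') or row.get('venue.categories') or []
    return categories[0]['name'] if categories else None

# Hypothetical venue records in the nested shape used by the report's code
with_category = {'name': 'Some Bakery', 'categories': [{'name': 'Bakery'}]}
without_category = {'name': 'Unnamed Venue', 'categories': []}

print(get_category_type(with_category))
print(get_category_type(without_category))
```

Returning `None` for venues without a category avoids an `IndexError` on an empty list.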

### 2. Explore Neighborhoods in New York City

- Create a function that uses the **Foursquare API** to get the top 100 venues within a radius of 500 meters of Flushing and of each neighborhood in Manhattan, and transforms the data into a dataframe with 7 columns: **['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']**

In [7]:
#def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
#    
#    venues_list=[]
#    for name, lat, lng in zip(names, latitudes, longitudes):
#        print(name)
#            
        # create the API request URL
#        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
#            CLIENT_ID, 
#            CLIENT_SECRET, 
#           VERSION, 
#           lat, 
#           lng, 
#           radius, 
#           LIMIT)
            
        # make the GET request
#        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
#        venues_list.append([(
#            name, 
#            lat, 
#            lng, 
#            v['venue']['name'], 
#            v['venue']['location']['lat'], 
#            v['venue']['location']['lng'],  
#            v['venue']['categories'][0]['name']) for v in results])

#    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
#    nearby_venues.columns = ['Neighborhood', 
#                  'Neighborhood Latitude', 
#                  'Neighborhood Longitude', 
#                  'Venue', 
#                  'Venue Latitude', 
#                  'Venue Longitude', 
#                  'Venue Category']
    
#    return(nearby_venues)

- Run the above function on each neighborhood and create a new dataframe called **fl_manhattan_venues**.

In [8]:
#fl_manhattan_venues = getNearbyVenues(names=fl_manhattan_data['Neighborhood'],
#                                   latitudes=fl_manhattan_data['Latitude'],
#                                   longitudes=fl_manhattan_data['Longitude']
#                                  )

### 3. Analyze Each Neighborhood
- Apply one-hot encoding to the venue categories.

In [9]:
# one hot encoding
#fl_manhattan_onehot = pd.get_dummies(fl_manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
#fl_manhattan_onehot['Neighborhood'] = fl_manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
#fixed_columns = [fl_manhattan_onehot.columns[-1]] + list(fl_manhattan_onehot.columns[:-1])
#fl_manhattan_onehot = fl_manhattan_onehot[fixed_columns]

- Group rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [10]:
# fl_manhattan_grouped = fl_manhattan_onehot.groupby('Neighborhood').mean().reset_index()
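The two steps above, one-hot encoding followed by a grouped mean, can be traced on a toy table. This is a sketch with hypothetical venues; 'Neighborhood A' is a placeholder name, and `dtype=int` is added only so the encoded columns print as 0/1.

```python
import pandas as pd

# Hypothetical venues: three in Flushing, two in a placeholder neighborhood
venues = pd.DataFrame({
    'Neighborhood': ['Flushing', 'Flushing', 'Flushing',
                     'Neighborhood A', 'Neighborhood A'],
    'Venue Category': ['Chinese Restaurant', 'Bakery', 'Chinese Restaurant',
                       'Chinese Restaurant', 'Bar'],
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` is a frequency profile that sums to 1. Note that `groupby` sorts its keys, so the rows come out in alphabetical order of neighborhood rather than in the order of the input.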

### 4. Cluster Neighborhoods --- using **k-means clustering**

- Set the number of clusters to 4
- Run k-means clustering 
- Add clustering labels
- Visualize the resulting clusters

In [11]:
# set number of clusters
#kclusters = 4

#fl_manhattan_grouped_clustering = fl_manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
#kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(fl_manhattan_grouped_clustering)

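# add clustering labels, matching on neighborhood name: the grouped dataframe is
# sorted alphabetically, so kmeans.labels_ is not in the row order of fl_manhattan_data
#cluster_labels = pd.DataFrame({'Neighborhood': fl_manhattan_grouped['Neighborhood'], 'Cluster Labels': kmeans.labels_})
#fl_manhattan_merged = fl_manhattan_data.merge(cluster_labels, on='Neighborhood')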

- Visualize the resulting clusters using the **folium** library

In [12]:
# create map
#map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
#x = np.arange(kclusters)
#ys = [i+x+(i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
#markers_colors = []
#for lat, lon, poi, cluster in zip(fl_manhattan_merged['Latitude'], fl_manhattan_merged['Longitude'], fl_manhattan_merged['Neighborhood'], fl_manhattan_merged['Cluster Labels']):
#    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
#    folium.CircleMarker(
#        [lat, lon],
#        radius=5,
#        popup=label,
#        color=rainbow[cluster-1],
#        fill=True,
#        fill_color=rainbow[cluster-1],
#        fill_opacity=0.7).add_to(map_clusters)
       
#map_clusters

### 5. Examine Clusters and Find Similar Neighborhoods in Manhattan

- Examine each cluster
- Create a dataframe of all the Manhattan neighborhoods that fall in the same cluster as Flushing
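The report uses a dataframe named `manhattan_cluster4` in the next step without showing how it is built. One way it could be derived is sketched below, assuming a labelled dataframe with 'Borough', 'Neighborhood' and 'Cluster Labels' columns. The rows and placeholder neighborhood names are hypothetical and do not reproduce the actual clustering result.

```python
import pandas as pd

# Hypothetical labelled data; only the column names follow the report
fl_manhattan_merged = pd.DataFrame({
    'Borough': ['Queens', 'Manhattan', 'Manhattan', 'Manhattan'],
    'Neighborhood': ['Flushing', 'Neighborhood A',
                     'Neighborhood B', 'Neighborhood C'],
    'Cluster Labels': [3, 3, 3, 0],
})

# look up the label assigned to Flushing instead of hard-coding it
is_flushing = fl_manhattan_merged['Neighborhood'] == 'Flushing'
flushing_label = fl_manhattan_merged.loc[is_flushing, 'Cluster Labels'].iloc[0]

# keep the Manhattan rows that share that label
same_cluster = fl_manhattan_merged['Cluster Labels'] == flushing_label
in_manhattan = fl_manhattan_merged['Borough'] == 'Manhattan'
manhattan_cluster4 = fl_manhattan_merged[same_cluster & in_manhattan].reset_index(drop=True)
print(manhattan_cluster4)
```

Looking the label up by name keeps the selection correct even if a rerun of k-means numbers the clusters differently.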

### 6. Rank the Similar Neighborhoods by Studio Rent Price

- Rank the studio rent of the neighborhoods similar to Flushing in ascending order

In [13]:
# join the cluster with the rent data; neighborhoods without rent data are dropped
#cluster4_rent = pd.merge(manhattan_cluster4, df_rent, how='left', on='Neighborhood').dropna()
#cluster4_rent_asc = cluster4_rent.sort_values('Studio Rent', ascending=True).reset_index(drop=True)
#cluster4_rent_asc

- Return the Manhattan neighborhood similar to Flushing that has the lowest studio rent

In [14]:
#print('The similar neighborhood in Manhattan as Flushing with Lowest Studio Rental Price: ', cluster4_rent_asc.iloc[0]['Neighborhood'])
#print('The Average Rental Price in ', cluster4_rent_asc.iloc[0]['Neighborhood'], ': $', cluster4_rent_asc.iloc[0]['Studio Rent'])
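Because the left join followed by `dropna()` silently discards neighborhoods that have no rent data, it can help to list them explicitly. The sketch below does this with the `indicator` option of `merge`. The neighborhood names and rent figures are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical cluster members and rent table
cluster = pd.DataFrame({'Neighborhood': ['Neighborhood A',
                                         'Neighborhood B',
                                         'Neighborhood C']})
rent = pd.DataFrame({'Neighborhood': ['Neighborhood A', 'Neighborhood C'],
                     'Studio Rent': [2100, 1900]})

# indicator=True adds a '_merge' column saying where each row was found
merged = cluster.merge(rent, on='Neighborhood', how='left', indicator=True)

# neighborhoods in the cluster that have no rent data
missing_rent = merged.loc[merged['_merge'] == 'left_only', 'Neighborhood'].tolist()

# rank the remaining neighborhoods from cheapest to most expensive
ranked = (merged[merged['_merge'] == 'both']
          .drop(columns='_merge')
          .astype({'Studio Rent': int})
          .sort_values('Studio Rent', ascending=True)
          .reset_index(drop=True))

print('No rent data for:', missing_rent)
print(ranked)
```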

## D. Results

#### 1. Here is the clustering map of Flushing and Manhattan after running k-means clustering.

![Cluster_map](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/cluster%20map.png)

#### 2. Examining each cluster, we can see that Flushing, Queens is in Cluster 4 (['Cluster Labels'] == 3).

![cluster4_table](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/cluster4_table.png)

#### 3. Let's create a dataframe that includes all the neighborhoods in Manhattan in Cluster 4.

![cluster4_neighborhoods](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/cluster4_neighborhoods.png)

#### 4. Let's check the average Studio Rent of the similar neighborhoods, ranked in ascending order. 
#### We can see that **Washington Heights** has the **lowest Studio Rent in Cluster 4**.
*Note: Because the rent dataset doesn't include rent data for all the neighborhoods, some neighborhoods such as Manhattanville, Upper East Side and Clinton have no rent data for analysis. To simplify the process, we drop those neighborhoods, which means that Jack Lee won't choose from them.*

![cluster4_withrent](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/cluster4_with%20rent.png)

## E. Discussion

As the results above show, **the Manhattan neighborhood that is similar to Flushing and has the lowest studio rent is Washington Heights**, with an average studio rent of $1678.

From the **Cluster 4** table shown below, we can see that both **Washington Heights** and **Flushing** have many kinds of restaurants, such as Chinese restaurants, Asian restaurants and bakeries, which makes **the two similar to each other in terms of venue categories** and places them in the same cluster. The main reason Jack Lee enjoys living in Flushing is probably this diversity of restaurants, especially Chinese restaurants, as he is originally from China.

In the meantime, as Jack has just graduated and has student loans to repay, he needs to be frugal about his living expenses and therefore wants to rent a studio in the neighborhood with the lowest average rent. **According to *rentcafe* (https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/), among the available average studio rent data by neighborhood in Manhattan, the lowest rent is $1678, in Washington Heights**.

![cluster4_table](https://raw.githubusercontent.com/yiyaoma/Coursera_Capstone/master/cluster4_table.png)

## F. Conclusion

### **Conclusion:**

Based on the analysis and results shown above, **we conclude that the Manhattan neighborhood that is similar to Flushing and has the lowest studio rent is Washington Heights, with an average studio rent of 1678 USD.**

**Therefore, we recommend that Jack Lee rent a studio in Washington Heights because it meets his two requirements:**

- (1) Washington Heights has venues and amenities similar to those in Flushing, and

- (2) Washington Heights has the lowest average studio rent among the similar neighborhoods.

### **Further Possible Developments:**

#### Due to some limitations, there are several areas that could be developed or improved in future research:

- **(1) Use DBSCAN (Density-based spatial clustering of applications with noise) as the machine learning method to cluster neighborhoods in further studies.** In this project, we use traditional K-Means clustering to group data in an unsupervised way. However, when applied to tasks with arbitrarily shaped clusters or clusters within clusters, traditional techniques might not achieve good results; that is, elements in the same cluster might not share enough similarity, or the performance may be poor. Additionally, while partitioning-based algorithms such as K-Means are easy to understand and implement in practice, the algorithm has no notion of outliers; that is, all points are assigned to a cluster even if they do not belong in any. Therefore, in future research, we may want to apply the DBSCAN model to analyze clusters.

- **(2) Find a more complete rent dataset that includes the average rental costs of all neighborhoods.** In this project, due to some limitations, we extract rent data from *rentcafe* (https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/), which doesn't include all the neighborhoods in Manhattan. A dataset with full coverage would improve the project in the future.
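As an illustration of item (1), the sketch below runs scikit-learn's `DBSCAN` on a small matrix of venue frequencies. It was not part of this project: the data, the four category columns, and the `eps` and `min_samples` values are all hypothetical, and real data would need these parameters tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical venue-category frequencies, one row per neighborhood.
# Columns: [Chinese Restaurant, Bakery, Bar, Park]
X = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.55, 0.35, 0.10, 0.00],
    [0.50, 0.40, 0.10, 0.00],
    [0.00, 0.10, 0.50, 0.40],
    [0.05, 0.05, 0.50, 0.40],
    [0.00, 0.10, 0.45, 0.45],
    [0.25, 0.25, 0.25, 0.25],   # resembles neither group
])

# eps is the neighborhood radius; min_samples counts the point itself
db = DBSCAN(eps=0.15, min_samples=2).fit(X)

# DBSCAN marks points that belong to no dense region with the label -1
labels = db.labels_
print(labels)
```

Unlike k-means, the number of clusters is not fixed in advance, and the last row is reported as noise instead of being forced into one of the two groups.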