# Capstone Project - The Battle of the Neighborhoods (Week 2)

### Applied Data Science Capstone by IBM/Coursera

## Introduction & Business Problem

The purpose of this project is to determine the best location to open a high-end cafe in Houston. A high-end cafe is defined here as one that caters to the upper-middle and upper income classes. Its goods and services carry a premium that is justified by higher quality than general cafes offer. Baked goods and beverages will be created by distinguished bakers and dessert chefs, and ingredients will be chosen for quality rather than affordability. As the economy thrives and more people develop a taste for premium goods and services, investors will look for locales that cater to wealthier customers. The working assumption is that people with greater-than-average buying power are willing to pay a premium for exclusivity. The Foursquare location data will be vital for identifying dense, wealthy residential neighborhoods that lack similar establishments. The greater the competition, the less appealing a location will be.

## Data

As mentioned earlier, the Foursquare data will be essential for determining the number of cafes (especially ones that cater to the same market as mine) in each Houston neighborhood. This will let me make an educated decision about which neighborhoods to filter out and which are prime candidates based on market availability. The second source is a table of median household income per zip code in Houston. This table also contains the population of each zip code and its latitude and longitude. With it, I can determine which neighborhoods have the most residents able to afford the products and services my establishment will offer, and which neighborhoods to filter out because of lower buying power. By coupling this location and income data with the Foursquare data, I will be able to cluster the zip codes within Houston that are the most promising candidates for starting an upscale cafe business. The URL of this dataset is "http://zipatlas.com/us/tx/houston/zip-code-comparison/median-household-income.htm". As an example of the data, zip code 77010 of Houston, Texas has a population of 76 people and an average household income of $200,000. Its location is 29.754310, -95.361109.

Based on the definition of the problem, the factors that will influence the decision are:

the number of existing cafes in the neighborhood (especially high-end cafes)

The following data sources will be needed to extract or generate the required information:

the number of restaurants in every neighborhood, with their type and location, will be obtained using the Foursquare API

### Exploring Houston's Neighborhoods

Let's start off by importing all of the necessary tools to solve this problem.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlopen
from bs4 import BeautifulSoup

Now let's open the website (source) and begin extraction of data that we want. A simple test is important to verify that our code is working and we are looking at the correct dataset.

In [42]:
data = pd.read_html('http://zipatlas.com/us/tx/houston/zip-code-comparison/median-household-income.htm')
isolate_data=data[-3]
isolate_data

Unnamed: 0,0,1,2,3,4,5,6
0,#,Zip Code,Location,City,Population,Avg. Income/H/hold,National Rank
1,1.,77010,"29.754310, -95.361109","Houston, Texas",76,"$200,000.00",#1
2,2.,77094,"29.769285, -95.681292","Houston, Texas",7779,"$123,244.00",#78
3,3.,77046,"29.733084, -95.430659","Houston, Texas",471,"$105,863.00",#181
4,4.,77059,"29.615219, -95.134960","Houston, Texas",16690,"$104,844.00",#197
5,5.,77005,"29.718435, -95.423555","Houston, Texas",23338,"$104,035.00",#208
6,6.,77024,"29.771991, -95.515453","Houston, Texas",32746,"$82,620.00",#706
7,7.,77068,"30.008830, -95.487234","Houston, Texas",9505,"$77,724.00",#948
8,8.,77095,"29.916055, -95.663077","Houston, Texas",39275,"$76,814.00",#992
9,9.,77062,"29.575781, -95.134334","Houston, Texas",26978,"$75,689.00","#1,066"
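
Selecting the table by position (`data[-3]`) depends on the page layout staying the same. The following is a minimal sketch of a more defensive alternative, assuming the income table can be recognized by the header labels in its first row. The helper `find_income_table` and the toy tables are hypothetical stand-ins and are not part of the original notebook.

```python
import pandas as pd

def find_income_table(tables, required=("Zip Code", "Population")):
    """Return the first table whose first row contains every required header label."""
    for table in tables:
        if table.empty:
            continue
        first_row = set(table.iloc[0].astype(str).str.strip())
        if set(required) <= first_row:
            return table
    raise ValueError("no table with the expected header row was found")

# Toy stand-ins for the list of DataFrames that pd.read_html returns
tables = [
    pd.DataFrame([["Home", "About"]]),
    pd.DataFrame([["#", "Zip Code", "Location", "Population"],
                  ["1.", "77010", "29.754310, -95.361109", "76"]]),
]
isolate_data = find_income_table(tables)
print(isolate_data.shape)
```

With the real page, `tables` would be the result of `pd.read_html(...)`. The rest of the notebook could stay unchanged.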


Let's use the first row of the data as the column headers and then drop that row. We then keep only the columns needed for the analysis, which also removes the first column since it is redundant with the index.

In [43]:
isolate_data=isolate_data.rename(columns=isolate_data.iloc[0]).drop(isolate_data.index[0])

In [44]:
dfwant=isolate_data[['Zip Code','Location','Population','Avg. Income/H/hold']].copy()
dfwant

Unnamed: 0,Zip Code,Location,Population,Avg. Income/H/hold
1,77010,"29.754310, -95.361109",76,"$200,000.00"
2,77094,"29.769285, -95.681292",7779,"$123,244.00"
3,77046,"29.733084, -95.430659",471,"$105,863.00"
4,77059,"29.615219, -95.134960",16690,"$104,844.00"
5,77005,"29.718435, -95.423555",23338,"$104,035.00"
6,77024,"29.771991, -95.515453",32746,"$82,620.00"
7,77068,"30.008830, -95.487234",9505,"$77,724.00"
8,77095,"29.916055, -95.663077",39275,"$76,814.00"
9,77062,"29.575781, -95.134334",26978,"$75,689.00"
10,77056,"29.749035, -95.469021",14031,"$71,926.00"


In [45]:
dfwant['Location']=dfwant['Location'].str.strip()

In [46]:
dfwant[['Latitude','Longitude']] = dfwant.Location.str.split(",",expand=True,)

In [47]:
dfwant=dfwant.drop(columns=['Location'])
dfwant

Unnamed: 0,Zip Code,Population,Avg. Income/H/hold,Latitude,Longitude
1,77010,76,"$200,000.00",29.754310,-95.361109
2,77094,7779,"$123,244.00",29.769285,-95.681292
3,77046,471,"$105,863.00",29.733084,-95.430659
4,77059,16690,"$104,844.00",29.615219,-95.134960
5,77005,23338,"$104,035.00",29.718435,-95.423555
6,77024,32746,"$82,620.00",29.771991,-95.515453
7,77068,9505,"$77,724.00",30.008830,-95.487234
8,77095,39275,"$76,814.00",29.916055,-95.663077
9,77062,26978,"$75,689.00",29.575781,-95.134334
10,77056,14031,"$71,926.00",29.749035,-95.469021


In [48]:
dfwant['Avg. Income/H/hold'] = dfwant['Avg. Income/H/hold'].str.replace(',', '')
dfwant['Avg. Income/H/hold'] = dfwant['Avg. Income/H/hold'].str.replace('$', '', regex=False)  # treat '$' as a literal character, not a regex anchor
dfwant['Avg. Income/H/hold'] = dfwant['Avg. Income/H/hold'].str[:-3]
dfwant['Avg. Income/H/hold'] = dfwant['Avg. Income/H/hold'].astype(int)
dfwant['Zip Code'] = dfwant['Zip Code'].astype(int)
dfwant['Population'] = dfwant['Population'].astype(int)
dfwant['Latitude'] = dfwant['Latitude'].astype(float)
dfwant['Longitude'] = dfwant['Longitude'].astype(float)

In [49]:
dfwant

Unnamed: 0,Zip Code,Population,Avg. Income/H/hold,Latitude,Longitude
1,77010,76,200000,29.754310,-95.361109
2,77094,7779,123244,29.769285,-95.681292
3,77046,471,105863,29.733084,-95.430659
4,77059,16690,104844,29.615219,-95.134960
5,77005,23338,104035,29.718435,-95.423555
6,77024,32746,82620,29.771991,-95.515453
7,77068,9505,77724,30.008830,-95.487234
8,77095,39275,76814,29.916055,-95.663077
9,77062,26978,75689,29.575781,-95.134334
10,77056,14031,71926,29.749035,-95.469021
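
The cleaning steps above can also be collected into a single function, which makes them easier to rerun and test. This is a sketch under the assumption that the raw columns are strings formatted as shown in the table. The function name `clean_income_table` and the two-row sample are illustrative and are not part of the original notebook.

```python
import pandas as pd

def clean_income_table(raw):
    """Return a numeric copy of the table with the columns used in this notebook."""
    out = pd.DataFrame(index=raw.index)
    out["Zip Code"] = raw["Zip Code"].astype(int)
    out["Population"] = raw["Population"].astype(int)
    # "$200,000.00" -> 200000: drop the currency symbol and thousands separators
    income = raw["Avg. Income/H/hold"].str.replace(r"[$,]", "", regex=True)
    out["Avg. Income/H/hold"] = income.astype(float).astype(int)
    # "29.754310, -95.361109" -> two float columns
    coords = raw["Location"].str.split(",", expand=True)
    out["Latitude"] = coords[0].str.strip().astype(float)
    out["Longitude"] = coords[1].str.strip().astype(float)
    return out

# Two rows copied from the table above, kept as strings like the scraped data
sample = pd.DataFrame({
    "Zip Code": ["77010", "77094"],
    "Location": ["29.754310, -95.361109", "29.769285, -95.681292"],
    "Population": ["76", "7779"],
    "Avg. Income/H/hold": ["$200,000.00", "$123,244.00"],
})
cleaned = clean_income_table(sample)
print(cleaned.dtypes)
```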


Isolate the zip codes with an average household income above $100,000, since this is the market I want to target. In this dataset, these are the five highest-income zip codes.

In [51]:
dfrich = dfwant[dfwant['Avg. Income/H/hold'] > 100000]
dfrich

Unnamed: 0,Zip Code,Population,Avg. Income/H/hold,Latitude,Longitude
1,77010,76,200000,29.75431,-95.361109
2,77094,7779,123244,29.769285,-95.681292
3,77046,471,105863,29.733084,-95.430659
4,77059,16690,104844,29.615219,-95.13496
5,77005,23338,104035,29.718435,-95.423555
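
The cell above selects by an income threshold, which happens to return exactly five rows. If the intent is "the top 5" regardless of the threshold, `nlargest` expresses that directly. The sketch below uses six income values copied from the table above and checks that both approaches agree for this data.

```python
import pandas as pd

incomes = pd.DataFrame({
    "Zip Code": [77010, 77094, 77046, 77059, 77005, 77024],
    "Avg. Income/H/hold": [200000, 123244, 105863, 104844, 104035, 82620],
})

# Selection by threshold, as in the notebook
by_threshold = incomes[incomes["Avg. Income/H/hold"] > 100000]

# Selection by rank: the five rows with the largest income
by_rank = incomes.nlargest(5, "Avg. Income/H/hold")

same_selection = by_threshold["Zip Code"].tolist() == by_rank["Zip Code"].tolist()
print(same_selection)
```

The two selections would diverge if the source data changed, for example if a sixth zip code crossed the $100,000 mark.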


### Foursquare

Before querying Foursquare, let's visualize the Houston area and the candidate zip codes on a map.

In [12]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library


In [13]:
address = 'Houston, TX'

geolocator = Nominatim(user_agent="Houston_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Houston are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Houston are 29.7589382, -95.3676974.


In [14]:
# create map of Houston using latitude and longitude values
map_houston = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(dfrich['Latitude'], dfrich['Longitude'], dfrich['Zip Code']):
    # Popup expects text, so convert the integer zip code to a string
    label = folium.Popup(str(label), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_houston)
    
map_houston

## Methodology

Create a function that retrieves all venues near each of the selected Houston zip codes.

In [15]:
import requests

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=1600):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zip Code', 
                  'Zip Code Latitude', 
                  'Zip Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
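
The function above indexes `categories[0]` directly, which raises an `IndexError` for any venue that has no category, and it builds the URL by string formatting. A minimal sketch of a more defensive variant follows. It separates URL construction from response parsing so the parsing can be tested without network access. The helper names `build_explore_url` and `parse_items`, the "Unknown" fallback label, and the sample payload are assumptions made for illustration. The payload only mirrors the fields the notebook already reads.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)

def parse_items(payload):
    """Extract (name, lat, lng, category) tuples; tolerate venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{"name": "Unknown"}]
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], categories[0]["name"]))
    return rows

# Hypothetical payload with the same nesting the notebook reads from the API
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 29.73, "lng": -95.43},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Shop B", "location": {"lat": 29.74, "lng": -95.44},
               "categories": []}},
]}]}}

rows = parse_items(sample_payload)
url = build_explore_url(29.733084, -95.430659, "ID", "SECRET", "20180605", 1600, 100)
print(rows)
```

In the notebook, `parse_items` would be applied to `requests.get(url).json()` for each zip code.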

Enter the Foursquare credentials needed to use the API.

In [69]:
CLIENT_ID='OGAJTDQP2NS0OBOGZDU4BENV15GWD5RGEYI4C0TN4ETTB2OP'
CLIENT_SECRET='LRVL3KUWY4EX2GZZYIQADLESIGCVPJG3EBXGAJ1CYTIR00VJ'
VERSION='20180605'
LIMIT=100
radius=1000  # note: not passed to getNearbyVenues below, so its default radius of 1600 m applies
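
Credentials written directly into a notebook are exposed to anyone who can read the file. One common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined by Foursquare. A plain dictionary stands in for the environment in the demonstration.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (names are assumptions)."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable: {}".format(missing)) from None

# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
CLIENT_ID, CLIENT_SECRET = load_credentials(demo_env)
print(CLIENT_ID)
```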

In [72]:
houston_venues=getNearbyVenues(names=dfrich['Zip Code'],
                              latitudes=dfrich['Latitude'],
                              longitudes=dfrich['Longitude']
                              )

77010
77094
77046
77059
77005


In [73]:
houston_venues.groupby(['Zip Code']).count()

Unnamed: 0_level_0,Zip Code Latitude,Zip Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zip Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
77005,100,100,100,100,100,100
77010,100,100,100,100,100,100
77046,100,100,100,100,100,100
77059,17,17,17,17,17,17
77094,4,4,4,4,4,4
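
Three zip codes return exactly 100 venues, which equals `LIMIT`. This suggests their results were truncated, so their category frequencies are based on at most 100 venues rather than on everything in the search radius. A short sketch that flags such zip codes, using the counts from the table above:

```python
import pandas as pd

LIMIT = 100

# Venue counts per zip code, copied from the output above
counts = pd.Series({77005: 100, 77010: 100, 77046: 100, 77059: 17, 77094: 4},
                   name="Venue")

# Zip codes whose venue count reached the request limit
capped = counts[counts >= LIMIT].index.tolist()
print(capped)
```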


In [74]:
print('There are {} unique categories.'.format(len(houston_venues['Venue Category'].unique())))

There are 121 unique categories.


Let's take a careful look at the amenities located within each zip code.

In [75]:
#one hot encoding
houston_onehot=pd.get_dummies(houston_venues[['Venue Category']], prefix="",prefix_sep="")

# add the Zip Code column back to the dataframe
houston_onehot['Zip Code'] = houston_venues['Zip Code'] 

# move the Zip Code column to the first position
fixed_columns = [houston_onehot.columns[-1]] + list(houston_onehot.columns[:-1])
houston_onehot = houston_onehot[fixed_columns]

houston_onehot.head()

Unnamed: 0,Zip Code,ATM,American Restaurant,Arts & Crafts Store,Asian Restaurant,Auto Dealership,Auto Garage,BBQ Joint,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Trail,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
houston_grouped = houston_onehot.groupby('Zip Code').mean().reset_index()
houston_grouped['Zip Code'] = houston_grouped['Zip Code'].astype(str)
houston_grouped

Unnamed: 0,Zip Code,ATM,American Restaurant,Arts & Crafts Store,Asian Restaurant,Auto Dealership,Auto Garage,BBQ Joint,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Trail,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,77005,0.0,0.05,0.0,0.0,0.0,0.0,0.01,0.01,0.01,...,0.0,0.01,0.01,0.01,0.0,0.0,0.02,0.01,0.0,0.01
1,77010,0.0,0.03,0.01,0.0,0.0,0.0,0.03,0.0,0.0,...,0.01,0.02,0.0,0.0,0.04,0.0,0.02,0.0,0.0,0.0
2,77046,0.0,0.03,0.0,0.02,0.0,0.0,0.01,0.0,0.03,...,0.01,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.01
3,77059,0.058824,0.0,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0
4,77094,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
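
Each value in the grouped table is the share of a zip code's venues that fall into a category, so every row sums to 1. The toy example below illustrates this. The four 77094 rows mirror the categories reported for that zip code, while the two 77059 rows are made up purely for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    "Zip Code": [77094, 77094, 77094, 77094, 77059, 77059],
    "Venue Category": ["Gas Station", "Auto Dealership", "Other Repair Shop",
                       "Construction & Landscaping", "Pizza Place", "Pizza Place"],
})

# One indicator column per category, then the mean per zip code gives the share
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Zip Code", venues["Zip Code"])
freq = onehot.groupby("Zip Code").mean()
print(freq)
```

With only four venues, each category in 77094 has a frequency of 0.25, which matches the value shown for Auto Dealership in the table above.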


In [77]:
num_top_venues = 5

for hood in houston_grouped['Zip Code']:
    print("----"+hood+"----")
    temp = houston_grouped[houston_grouped['Zip Code'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----77005----
                 venue  freq
0          Coffee Shop  0.05
1  American Restaurant  0.05
2   Italian Restaurant  0.05
3         Burger Joint  0.04
4       Ice Cream Shop  0.03


----77010----
                   venue  freq
0                  Hotel  0.09
1            Pizza Place  0.04
2  Vietnamese Restaurant  0.04
3                    Bar  0.04
4                   Park  0.04


----77046----
                     venue  freq
0       Mexican Restaurant  0.07
1             Burger Joint  0.05
2       Italian Restaurant  0.05
3       Seafood Restaurant  0.05
4  New American Restaurant  0.04


----77059----
                venue  freq
0         Pizza Place  0.12
1                 ATM  0.06
2       Grocery Store  0.06
3  Salon / Barbershop  0.06
4      Sandwich Place  0.06


----77094----
                        venue  freq
0  Construction & Landscaping  0.25
1             Auto Dealership  0.25
2                 Gas Station  0.25
3           Other Repair Shop  0.25
4               

In [78]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
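
A small usage example may help show what this function returns. The row below uses four of the frequencies reported for zip code 77005. When several categories share the same frequency, the default sort does not guarantee their relative order, which is a likely reason the per-zip-code listings in this notebook can differ between cells.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the zip code) and sort the remaining frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# A subset of the 77005 row from the grouped table
row = pd.Series({"Zip Code": "77005", "Bakery": 0.01, "Coffee Shop": 0.05,
                 "Burger Joint": 0.04, "Ice Cream Shop": 0.03})
top3 = return_most_common_venues(row, 3)
print(list(top3))
```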

Now that we have all of the information we need, it is time to organize the data in a viewable format. Append the top 5 venues for each candidate location to the original dataframe to have a comprehensive view of the information I have collected.

In [85]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zip Code'] = houston_grouped['Zip Code']

for ind in np.arange(houston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(houston_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Zip Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,77005,American Restaurant,Italian Restaurant,Coffee Shop,Burger Joint,Ice Cream Shop
1,77010,Hotel,Bar,Pizza Place,Vietnamese Restaurant,Park
2,77046,Mexican Restaurant,Seafood Restaurant,Burger Joint,Italian Restaurant,New American Restaurant
3,77059,Pizza Place,Spa,Home Service,Wings Joint,Italian Restaurant
4,77094,Construction & Landscaping,Auto Dealership,Other Repair Shop,Gas Station,Fast Food Restaurant
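
The column labels above rely on a three-item suffix list with a fallback to "th", which works for five columns but would produce labels such as "21th" for larger values. A small sketch of a general ordinal helper follows. The function name `ordinal` is an illustrative choice and is not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 5
columns = ["Zip Code"] + ["{} Most Common Venue".format(ordinal(i))
                          for i in range(1, num_top_venues + 1)]
print(columns)
```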


In [86]:
# set number of clusters
kclusters = 5

houston_grouped_clustering = houston_grouped.drop(columns='Zip Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(houston_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 4, 1, 2, 0], dtype=int32)

Convert the zip code column back to integers so that the two dataframes can be joined on it.

In [87]:
neighborhoods_venues_sorted["Zip Code"]=neighborhoods_venues_sorted["Zip Code"].astype(str).astype(int)

In [88]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

houston_merged = dfrich

# join dfrich with the venue summary to add cluster labels and top venues for each zip code
houston_merged = houston_merged.join(neighborhoods_venues_sorted.set_index('Zip Code'), on='Zip Code')

houston_merged.head() # check the last columns!

Unnamed: 0,Zip Code,Population,Avg. Income/H/hold,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,77010,76,200000,29.75431,-95.361109,4,Hotel,Bar,Pizza Place,Vietnamese Restaurant,Park
2,77094,7779,123244,29.769285,-95.681292,0,Construction & Landscaping,Auto Dealership,Other Repair Shop,Gas Station,Fast Food Restaurant
3,77046,471,105863,29.733084,-95.430659,1,Mexican Restaurant,Seafood Restaurant,Burger Joint,Italian Restaurant,New American Restaurant
4,77059,16690,104844,29.615219,-95.13496,2,Pizza Place,Spa,Home Service,Wings Joint,Italian Restaurant
5,77005,23338,104035,29.718435,-95.423555,3,American Restaurant,Italian Restaurant,Coffee Shop,Burger Joint,Ice Cream Shop


Finally, let's visualize these clusters on the map of Houston. Because there are five zip codes and five clusters, each zip code ends up in its own cluster.

In [89]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(houston_merged['Latitude'], houston_merged['Longitude'], houston_merged['Zip Code'], houston_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results & Discussion

The original data source used in this project contained an extensive list of zip codes in the Houston area ranked by average household income. I targeted the top 5 zip codes based on the personal assumption that a yearly household income of over $100,000 indicates a certain level of spending and a taste for high-end services and products.

At first glance, zip code 77010 looks like the ideal location for an upscale cafe because it has the highest average household income on the entire list. However, for a retail operation, buying power is secondary to customer traffic. While the ability of the surrounding population to afford the goods and services I am offering is important, the number of customers who visit is by far the most important factor. With a population of only 76, 77010 cannot supply a large enough customer base, so it is not an option.

Next, we have to take a careful look at the surrounding venues. The Foursquare data suggest that zip codes 77094 and 77005 are not ideal locations for a cafe. In 77005, coffee shops are already among the most common venues. 77094 lacks appeal because the surrounding area is dominated by service providers that are uncharacteristic of a typical high-income residential area. The atmosphere of a cafe is its lifeblood, and a setting among gas stations, construction and landscaping businesses, auto dealerships, and repair shops would be hard to justify to high-end consumers. That leaves two candidates: 77046 and 77059.

The venues surrounding zip code 77059 raise a similar issue to 77094. Although 77059 has considerably more food providers in the vicinity, their price level and target market are aimed at the general public. The area is dominated by cheap, quick eats that cater to the average consumer. The customers I am targeting with this upscale cafe do not frequent such areas, so my establishment would gain little synergy from the surrounding venues.

Zip code 77046 is the strongest candidate for two reasons. The first is the surrounding venues. According to the Foursquare data, the five most common venue categories in this area are all food providers: Mexican, seafood, Italian, and New American restaurants, plus burger joints. This variety leaves a natural opening for a cafe. An upscale dessert cafe in a neighborhood dominated by mid-range to upscale restaurants would give families a place to spend their time and money, whether for leisure or for a formal occasion. Consumers would find a variety of options for any day and any occasion within the 1,600 m search radius used for the venue query. The second is the location relative to local landmarks. This zip code sits among three of Houston's most well-off areas: the Galleria, River Oaks, and Rice University. That placement lets a cafe draw on the combined population of all three areas, which include many consumers with substantial buying power.

## Conclusion

The purpose of this project was to identify areas around Houston with a low number of cafes (particularly upscale establishments) in order to help stakeholders narrow down the search for the optimal location for a new high-end bakery/cafe. By sorting Houston zip codes by average household income, we first narrowed an extensive list of potential locations to those that justify further analysis. We then used Foursquare data to characterize each of these locations, primarily through the five most common venue categories in each zip code. Finally, the locations were clustered and plotted on a map of Houston to support final decision making by stakeholders.

The final decision on an optimal cafe location will be made by stakeholders based on the unique characteristics of each zip code. That process will take into consideration additional factors such as the appeal of each region (proximity to high-end neighborhoods), levels of foot traffic, and the social and economic dynamics suggested by the most common venues. The recommendation of this project is that the establishment be built in zip code 77046.