<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4:  Using Yelp cost estimates for estimating neighborhood affluency

<i>
                
                Submitted by Shannon Bingham and Roy Kim
</i>

 
## Problem Statement
This tool will estimate the affluence of a neighborhood based on the number of `$` of businesses and services (according to Yelp) in a given neighborhood. (`$`, `$$`, `$$$`, `$$$$`) This tool will expect to get, as an input, a list of zip codes or names of neighborhoods and will estimate the wealth of the locality. While traditional methods typically estimate wealth of a locality based on demographic characteristics (e.g. income or unemployment rate), the novelty of this approach is in its use of big data related to commercial activity and cost of product and services as an indicator for affluency.

## Notebook Description
_This notebook is used to summarize the data retrieved from the Yelp API by summarizing the data (number of `$`, stars, reviews, etc.)_

In [26]:
# Glob is a library that helps manage the files for multiple access
import glob
import pandas as pd

In [27]:
# Initializing DataFrame to hold zip_summary data
df_zip_summary = pd.DataFrame()

In [28]:
# Creating a definition to summarize data from each zip code in yelp retrieval
def summary(dataframe, zip_code):
    # Dropping any businesses without a price marker
    dataframe.dropna(subset=['price'], inplace=True)
    
    n_d1=0
    n_d2=0
    n_d3=0
    n_d4=0
    n_s1=0
    n_s1plus=0
    n_s2=0
    n_s2plus=0
    n_s3=0
    n_s3plus=0
    n_s4=0
    n_s4plus=0
    n_s5=0
    n_review=0
    zipcode=0
    n_business=0
    
    for index, row in dataframe.iterrows():
        if row['price'] == '$':
            n_d1+=1
        elif row['price'] == '$$':
            n_d2+=1
        elif row['price'] == '$$$':
            n_d3+=1
        else:
            n_d4+=1

        if row['rating'] == 1:
            n_s1+=1
        elif row['rating'] == 1.5:
            n_s1plus+=1
        if row['rating'] == 2:
            n_s2+=1
        elif row['rating'] == 2.5:
            n_s2plus+=1
        if row['rating'] == 3:
            n_s3+=1
        elif row['rating'] == 3.5:
            n_s3plus+=1
        if row['rating'] == 4:
            n_s4+=1
        elif row['rating'] == 4.5:
            n_s4plus+=1
        else:
            n_s5+=1

        n_review += row['review_count']
        n_business += 1
        
    dicto = {
        'n_d1' : n_d1,
        'n_d2' : n_d2,
        'n_d3' : n_d3,
        'n_d4' : n_d4,
        'n_s1' : n_s1,
        'n_s1plus' : n_s1plus,
        'n_s2' : n_s2,
        'n_s2plus' : n_s2plus,
        'n_s3' : n_s3plus,
        'n_s3plus' : n_s3plus,
        'n_s4' : n_s4,
        'n_s4plus' : n_s4plus,
        'n_s5' : n_s5,
        'n_review' : n_review,
        'n_business' : n_business,
        'zipcode' : zip_code
    }
    
    return pd.Series(dicto)

In [29]:
# Using glob to glob together all the zip code .csvs
files = glob.glob('../data/full_yelp_zipcodes/yelp_api_zip*.csv')

In [30]:
# Using glob, we are able to create separate dataframes to put 
#     into the summary function created above
# For the second parameter of the function (zipcode), I had to play around
#     with the string f to get just the zipcode
for f in files:
    df_zip_summary = df_zip_summary.append(summary(pd.read_csv(f), f[len(f)-9:len(f)-4]), ignore_index=True)

In [31]:
# Checking shape to see if data is outputted correctly, should be 50 rows and 20 columns
df_zip_summary.shape

(100, 16)

In [32]:
df_zip_summary.sort_values('n_business')

Unnamed: 0,n_business,n_d1,n_d2,n_d3,n_d4,n_review,n_s1,n_s1plus,n_s2,n_s2plus,n_s3,n_s3plus,n_s4,n_s4plus,n_s5,zipcode
26,8.0,5.0,3.0,0.0,0.0,123.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,4.0,53006
70,9.0,4.0,5.0,0.0,0.0,144.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,4.0,4.0,54003
72,29.0,17.0,12.0,0.0,0.0,386.0,0.0,0.0,1.0,0.0,6.0,6.0,10.0,4.0,15.0,54228
66,29.0,13.0,16.0,0.0,0.0,390.0,0.0,1.0,1.0,2.0,5.0,5.0,7.0,7.0,15.0,54822
46,31.0,11.0,19.0,1.0,0.0,351.0,0.0,0.0,2.0,3.0,3.0,3.0,14.0,5.0,12.0,54166
86,35.0,14.0,21.0,0.0,0.0,525.0,0.0,2.0,0.0,1.0,5.0,5.0,13.0,8.0,14.0,53091
7,35.0,11.0,23.0,0.0,1.0,886.0,0.0,0.0,1.0,3.0,8.0,8.0,11.0,10.0,14.0,54521
87,36.0,20.0,16.0,0.0,0.0,531.0,0.0,0.0,0.0,4.0,4.0,4.0,12.0,7.0,17.0,53522
76,40.0,14.0,24.0,2.0,0.0,589.0,0.0,2.0,1.0,3.0,6.0,6.0,11.0,8.0,21.0,54449
96,40.0,20.0,19.0,1.0,0.0,633.0,0.0,1.0,4.0,1.0,5.0,5.0,8.0,11.0,21.0,53916


In [25]:
# Saving this to a .csv file for later use
df_zip_summary.to_csv('../data/all_wi_yelp.csv', index=False)

### The .csv contains all the relevant Yelp data!
_Next step would be to combine this data with the data from city-data.com to create a final data file for analysis._