<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 70px">

# Client Project: Estimating Neighborhood Affluence with Yelp

<i>
                
                Submitted by Shannon Bingham and Roy Kim
</i>

 
## Problem Statement
The goal of this project is to estimate the affluence of a neighborhood from the Yelp price tiers (`$`, `$$`, `$$$`, `$$$$`) of its businesses and services. The project takes a list of zip codes as input and estimates the wealth of each locality. Traditional methods typically estimate a locality's wealth from demographic characteristics (e.g. income or unemployment rate); the novelty of this approach is its use of big data on commercial activity and the cost of products and services as an indicator of affluence.

## Notebook Description
This notebook contains Python code that sends requests to the Yelp API and, from the JSON responses, returns:
- selected business data 
- summary price and rating data

## Data

#### Data Source
| File Name | Description | Source |
| :------------ | :------------ | :------------ |
| API |Yelp business data for one zip code |  [www.yelp.com](https://api.yelp.com/v3/businesses/search) | 

#### Data Dictionary
| Output | Description | File Name |
| :------------ | :-------------- | :-------------- |
| name | Business name | yelp_api_zip{zipcode}.csv |
| price | Relative price charged (`$`, `$$`, `$$$`, `$$$$`) | yelp_api_zip{zipcode}.csv |
| rating | Rating assigned by users (1-5, incl. halves) | yelp_api_zip{zipcode}.csv |
| review_count | Number of reviews by users | yelp_api_zip{zipcode}.csv |
| categories | Business categories (list of `alias`/`title` pairs) | yelp_api_zip{zipcode}.csv |
| zipcode | 5-digit zip code | yelp_api_zip{zipcode}.csv, yelp_summary_zip{zipcode}.csv |
| n_business | Number of businesses | yelp_summary_zip{zipcode}.csv |
| `n_d1` to `n_d4` | Number of businesses at each price tier, `$` to `$$$$` (the digit is the number of `$`) | yelp_summary_zip{zipcode}.csv |
| n_review | Total number of user reviews | yelp_summary_zip{zipcode}.csv |
| `n_s1` to `n_s5` | Total review count of businesses at each rating from 1 to 5 stars, incl. halves (the digit is the number of stars; a `plus` suffix, e.g. `n_s1plus`, marks the half-star level above it) | yelp_summary_zip{zipcode}.csv |


## Set up environment.

In [1]:
# Install yelp-python.
# !pip install yelp

In [39]:
# Import libraries.
import json
import pprint
import random
import time

import pandas as pd
import requests

# Local helper script that selects random zip codes.
from select_zips import random_zips

# Set random seed.
random.seed(42)

## Request business data from yelp website using API.

Note that this notebook is set up to use the free version of the API. The `api_key` has been removed from the notebook; to run this code, supply a new key (directions are at [www.yelp.com/developers](https://www.yelp.com/developers)). Because of limitations of the API, this code processes a single zip code at a time.
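
One way to supply the key without writing it into the notebook is to read it from an environment variable. The sketch below is only an illustration: the variable name `YELP_API_KEY` and the helper `build_headers` are hypothetical and not part of the original notebook. The header format matches the `Bearer` header built later in this notebook.

```python
import os


def build_headers(env_var="YELP_API_KEY"):
    """Build the Authorization header from an environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {api_key}"}
```

With the variable set in the shell before the notebook starts, `headers = build_headers()` could stand in for the hard-coded `api_key` assignment.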

In [41]:
# The list of Wisconsin Zip codes is provided here
wi_zip_codes = pd.read_csv('../data/WIzips.csv').iloc[:,0].tolist()

In [None]:
# Using the Python script to obtain 50 random zip codes from WI
# with number of businesses between 50 and 1000
zip_list = random_zips(wi_zip_codes, 50)

In [3]:
# Set location to select. 
zipcode = '54521'

# Set starting point.
offset = 0

In [4]:
# Function: get_details
# Gets the selected details of each business in the json data.
# Keys that are missing or null for a business are skipped.
# Returns the details as a list of dictionaries, one per business.
def get_details(the_json, keys):

    # Initialize list.
    get_data = []

    # Loop through the businesses, keeping the selected dictionary keys.
    for business in the_json['businesses']:
        get_data.append({k: business[k]
                         for k in keys
                         if business.get(k) is not None})

    # Return details.
    return get_data
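
To illustrate what `get_details` returns, the sketch below runs the same selection logic on a made-up response fragment. The business names and the `id` values are invented for the example. Note that a business without a `price` simply has no `price` key in its dictionary.

```python
def get_details(the_json, keys):
    """Same selection logic as the notebook's get_details."""
    return [
        {k: business[k] for k in keys if business.get(k) is not None}
        for business in the_json["businesses"]
    ]


# Made-up response fragment; the second business has no price.
sample_json = {
    "businesses": [
        {"name": "Cafe A", "price": "$", "rating": 4.0, "id": "x1"},
        {"name": "Shop B", "rating": 3.5, "id": "x2"},
    ]
}

details = get_details(sample_json, ["name", "price", "rating"])
print(details)
# [{'name': 'Cafe A', 'price': '$', 'rating': 4.0}, {'name': 'Shop B', 'rating': 3.5}]
```

The missing keys become nulls once the list is loaded into a dataframe with a fixed column list, which is what the cleaning step later relies on.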

In [5]:
# Set up API call variables.
api_key = ""

headers = {'Authorization': 'Bearer %s' % api_key}
url = 'https://api.yelp.com/v3/businesses/search'

# Set details (dictionary keys) to select.  
select_keys = ['name', 'price', 'rating', 'review_count', 
           'categories']

# Initialize list to hold all selected business data.
api_zip_data = []

# Calculate end of range for request processing due to yelp limit.
end = offset + 1000

# Print progress message.
print(url)
print('Request processing starting')

# Make maximum number of requests.
for o in range(offset, end, 50):
    
    # Set parameters for API call.
    params = {
        'limit': 50, 
        'location': zipcode.replace(' ', '+'),
        'is_closed': False,
        'offset': o
    }

    # Make request.   
    response = requests.get(url, headers=headers, params=params)
    
    # Process response.
    if response.status_code == 200:     # successful request
        
        # Save response.
        the_json = response.json()
        
        # Print progress message.
        if o == 0:
            print('Total records for zip code {} is {}'.format(
                zipcode, the_json['total']))
        print(f'Retrieving records {o}-{o+49} ...')
        
        # Get the business details from response.
        api_zip_data.extend(get_details(the_json, select_keys))
        
        # Stop once the last page of records has been retrieved.
        if o + params['limit'] >= the_json['total']:
            break
        
    else:                               # unsuccessful request
        print('Processing ended unexpectedly.') 
        print('Request.get response is ', response.status_code)
        break
        
    # Wait.
    time.sleep(3)
         
# Print progress message.
print('Request processing ended')

https://api.yelp.com/v3/businesses/search
Request processing starting
Total records for zip code 54521 is 56
Retrieving records 0-49 ...
Retrieving records 50-99 ...
Request processing ended
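
The paging logic can be checked without calling the API. The sketch below computes which offsets are needed for a given total, assuming 50 records per request and the 1,000-record ceiling that the loop above uses. The helper name `page_offsets` is hypothetical.

```python
def page_offsets(total, limit=50, cap=1000):
    """Return the offsets needed to fetch `total` records, `limit` at a
    time, without requesting more than `cap` records overall."""
    return list(range(0, min(total, cap), limit))


print(page_offsets(56))         # [0, 50]
print(len(page_offsets(5000)))  # 20
```

For the 56 records reported for zip code 54521, two requests (offsets 0 and 50) are enough.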


## Clean data.

In [6]:
# Load business records to a dataframe.
api_zip = pd.DataFrame(api_zip_data, columns = select_keys)

# Verify load.
api_zip.shape

(56, 5)

In [7]:
# Insert zip code.
api_zip.insert(loc=0, column='zipcode', value=zipcode)

# Verify update.
api_zip.head()

|  | zipcode | name | price | rating | review_count | categories |
| :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ |
| 0 | 54521 | Eddie B's | `$$` | 4.0 | 101 | [{'alias': 'tradamerican', 'title': 'American ... |
| 1 | 54521 | Eagle River Roasters | `$$` | 4.5 | 43 | [{'alias': 'coffeeroasteries', 'title': 'Coffe... |
| 2 | 54521 | BuckShot's Saloon & Eatery | `$$` | 4.0 | 36 | [{'alias': 'burgers', 'title': 'Burgers'}, {'a... |
| 3 | 54521 | Leif's Cafe | `$` | 4.0 | 75 | [{'alias': 'breakfast_brunch', 'title': 'Break... |
| 4 | 54521 | Pirates Hideaway | `$$` | 4.5 | 14 | [{'alias': 'boatcharters', 'title': 'Boat Char... |
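
Each `categories` entry is a list of dictionaries with `alias` and `title` keys. If only the titles are needed, they could be pulled out as in the sketch below. It uses a small made-up dataframe, and the `category_titles` column name is hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the categories column above.
toy = pd.DataFrame({
    "name": ["Place A", "Place B"],
    "categories": [
        [{"alias": "burgers", "title": "Burgers"},
         {"alias": "bars", "title": "Bars"}],
        [],
    ],
})

# Keep only the human-readable title of each category.
toy["category_titles"] = toy["categories"].apply(
    lambda cats: [c["title"] for c in cats]
)

print(toy["category_titles"].tolist())  # [['Burgers', 'Bars'], []]
```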


In [8]:
# Drop duplicates.
api_zip.drop_duplicates(subset=['name', 'price', 'rating', 'review_count'],
                                          inplace=True)

# Verify drop.
api_zip.shape

(56, 6)

In [9]:
# Drop observations with null price.

# Count number of nulls.
print(sum(api_zip.isnull().sum()))

# Drop.
api_zip.dropna(subset = ['price'], inplace=True)

# Verify drop.
api_zip.info()

21
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35 entries, 0 to 49
Data columns (total 6 columns):
zipcode         35 non-null object
name            35 non-null object
price           35 non-null object
rating          35 non-null float64
review_count    35 non-null int64
categories      35 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 1.9+ KB
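
The cell above prints one grand total of nulls, which does not show which column they come from. A per-column count makes that visible. The sketch below uses a made-up dataframe, not the Yelp data.

```python
import pandas as pd

# Made-up data with one missing price and one missing rating.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "price": ["$", None, "$$"],
    "rating": [4.0, 3.5, None],
})

null_counts = toy.isnull().sum()   # nulls per column
total_nulls = null_counts.sum()    # grand total, as printed in the cell above
print(null_counts)
print(total_nulls)                 # 2

# Dropping on 'price' removes only the rows where price is missing.
cleaned = toy.dropna(subset=["price"])
print(len(cleaned))                # 2
```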


In [10]:
# Drop observations with null rating.

# Count number of nulls.
print(sum(api_zip.isnull().sum()))

# Drop.
api_zip.dropna(subset = ['rating'], inplace=True)

# Verify drop.
api_zip.info()

0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35 entries, 0 to 49
Data columns (total 6 columns):
zipcode         35 non-null object
name            35 non-null object
price           35 non-null object
rating          35 non-null float64
review_count    35 non-null int64
categories      35 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 1.9+ KB


## Create summary data per zip code.

In [11]:
# Initialize summary data observation.
sum_data  = {'zipcode'           : [zipcode], 
             'n_business'        : [0] ,
             'n_d1'              : [0] ,
             'n_d2'              : [0] , 
             'n_d3'              : [0] , 
             'n_d4'              : [0] ,
             'n_review'          : [0] , 
             'n_s1'              : [0] ,
             'n_s1plus'          : [0] ,
             'n_s2'              : [0] ,
             'n_s2plus'          : [0] ,
             'n_s3'              : [0] ,
             'n_s3plus'          : [0] ,
             'n_s4'              : [0] ,
             'n_s4plus'          : [0] , 
             'n_s5'              : [0] ,
            }

# Create summary dataframe.
sum_zip = pd.DataFrame(data=sum_data)

# Verify dataframe.
sum_zip.head()

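|  | zipcode | n_business | n_d1 | n_d2 | n_d3 | n_d4 | n_review | n_s1 | n_s1plus | n_s2 | n_s2plus | n_s3 | n_s3plus | n_s4 | n_s4plus | n_s5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 54521 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |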


In [12]:
# Calculate the total number of businesses.
sum_zip['n_business'] = len(api_zip)

# Verify column.
sum_zip['n_business']

0    35
Name: n_business, dtype: int64

In [13]:
# Get values of price.
api_zip.price.value_counts().sort_index()

$       11
$$      23
$$$$     1
Name: price, dtype: int64

In [14]:
# Add number of $ to the dataframe.
sum_zip['n_d1'] = sum(api_zip['price'] == '$')
sum_zip['n_d2'] = sum(api_zip['price'] == '$$')
sum_zip['n_d3'] = sum(api_zip['price'] == '$$$')
sum_zip['n_d4'] = sum(api_zip['price'] == '$$$$')

# Verify counts.
sum_zip

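|  | zipcode | n_business | n_d1 | n_d2 | n_d3 | n_d4 | n_review | n_s1 | n_s1plus | n_s2 | n_s2plus | n_s3 | n_s3plus | n_s4 | n_s4plus | n_s5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 54521 | 35 | 11 | 23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |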
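
The four price-tier counts could also be taken from a single `value_counts()` call. The sketch below shows this on a made-up series of prices. Reindexing against the full list of tiers makes sure an absent tier (such as `$$$` in this zip code) is counted as 0 rather than left out.

```python
import pandas as pd

# Made-up prices; no business is at the '$$$' tier.
prices = pd.Series(["$", "$$", "$$", "$$$$", "$"])

tiers = ["$", "$$", "$$$", "$$$$"]
counts = prices.value_counts().reindex(tiers, fill_value=0)

# Name each count after the number of '$' characters in its tier.
tier_counts = {f"n_d{len(t)}": int(counts[t]) for t in tiers}
print(tier_counts)  # {'n_d1': 2, 'n_d2': 2, 'n_d3': 0, 'n_d4': 1}
```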


In [15]:
# Calculate total number of reviews.
sum_zip['n_review']      = sum(api_zip['review_count'])

# Verify update.
sum_zip

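|  | zipcode | n_business | n_d1 | n_d2 | n_d3 | n_d4 | n_review | n_s1 | n_s1plus | n_s2 | n_s2plus | n_s3 | n_s3plus | n_s4 | n_s4plus | n_s5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 54521 | 35 | 11 | 23 | 0 | 1 | 887 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |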


In [16]:
# Get values of rating.
api_zip.rating.value_counts().sort_index()

2.0     1
2.5     3
3.0     2
3.5     8
4.0    11
4.5    10
Name: rating, dtype: int64

In [17]:
# Add the total review count at each rating level to the dataframe.
sum_zip['n_s1']      = api_zip[api_zip['rating'] == 1]  ['review_count'].sum() 
sum_zip['n_s1plus']  = api_zip[api_zip['rating'] == 1.5]['review_count'].sum()
sum_zip['n_s2']      = api_zip[api_zip['rating'] == 2]  ['review_count'].sum() 
sum_zip['n_s2plus']  = api_zip[api_zip['rating'] == 2.5]['review_count'].sum()
sum_zip['n_s3']      = api_zip[api_zip['rating'] == 3]  ['review_count'].sum() 
sum_zip['n_s3plus']  = api_zip[api_zip['rating'] == 3.5]['review_count'].sum()
sum_zip['n_s4']      = api_zip[api_zip['rating'] == 4]  ['review_count'].sum() 
sum_zip['n_s4plus']  = api_zip[api_zip['rating'] == 4.5]['review_count'].sum()
sum_zip['n_s5']      = api_zip[api_zip['rating'] == 5]  ['review_count'].sum() 

# Verify totals.
sum_zip

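|  | zipcode | n_business | n_d1 | n_d2 | n_d3 | n_d4 | n_review | n_s1 | n_s1plus | n_s2 | n_s2plus | n_s3 | n_s3plus | n_s4 | n_s4plus | n_s5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 54521 | 35 | 11 | 23 | 0 | 1 | 887 | 0 | 0 | 5 | 53 | 74 | 216 | 370 | 169 | 0 |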
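
The nine per-rating values are sums of `review_count`, so they add up to `n_review` (5 + 53 + 74 + 216 + 370 + 169 = 887 here). The same totals could be produced with one `groupby`, as in the sketch below on made-up data. The helper `column_name` is hypothetical.

```python
import pandas as pd

# Made-up ratings and review counts.
toy = pd.DataFrame({
    "rating": [4.0, 4.5, 4.0, 2.5],
    "review_count": [10, 5, 7, 3],
})

# All rating levels from 1 to 5 stars in half-star steps.
levels = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

# Sum review counts per rating; levels with no businesses become 0.
totals = (toy.groupby("rating")["review_count"].sum()
             .reindex(levels, fill_value=0))


def column_name(level):
    """Map 4.0 to 'n_s4' and 4.5 to 'n_s4plus'."""
    whole = int(level)
    return f"n_s{whole}plus" if level != whole else f"n_s{whole}"


summary = {column_name(lv): int(totals.loc[lv]) for lv in levels}
print(summary["n_s4"], summary["n_s4plus"], summary["n_s2plus"])  # 17 5 3
```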


## Save files.

In [18]:
# Set file locations.
api_zip_csv = (f'./data/yelp_api_zip{zipcode}.csv')
sum_zip_csv  = (f'./data/yelp_summary_zip{zipcode}.csv')

# Save.
api_zip.to_csv(api_zip_csv, encoding='utf-8', index=False)
sum_zip.to_csv(sum_zip_csv, encoding='utf-8', index=False)
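
Because the notebook handles one zip code per run, the summary files accumulate one per zip code. A later step might stack them into a single table. The sketch below is one possible way, assuming the `yelp_summary_zip{zipcode}.csv` naming used above. The helper `combine_summaries` is hypothetical.

```python
import glob
import os

import pandas as pd


def combine_summaries(data_dir="./data"):
    """Stack every per-zip summary file in data_dir into one dataframe."""
    pattern = os.path.join(data_dir, "yelp_summary_zip*.csv")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No summary files match {pattern}")
    # Read zipcode as text so leading zeros are preserved.
    frames = [pd.read_csv(f, dtype={"zipcode": str}) for f in files]
    return pd.concat(frames, ignore_index=True)
```

Reading `zipcode` as a string matters for zip codes that start with 0, which would otherwise be parsed as integers and lose the leading digit.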