#### <img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4:  Using Yelp cost estimates for estimating neighborhood affluency

<i>
                
                Submitted by Shannon Bingham and Roy Kim
</i>

 
## Problem Statement

This tool will estimate the affluence of a neighborhood based on the number of $ of businesses and services (according to Yelp) in a given neighborhood. ($, $$, $$$) This tool will expect to get, as an input, a list of zip codes or names of neighborhoods and will estimate the wealth of the locality. While traditional methods typically estimate wealth of a locality based on demographic characteristics (e.g. income or unemployment rate), the novelty of this approach is in its use of big data related to commercial activity and cost of product and services as an indicator for affluency.


## Set up environment.

In [1]:
# Import libraries.
import requests
import time
import pandas as pd

## Read in positioning data.

In [2]:
# Locate the file.
next_post_csv = './data/next_post_20181216-231653.csv'
# next_post_csv = './data/next_post_init.csv'

# Load the data.  
next_post_df = pd.read_csv(next_post_csv, index_col='url')

# Take a look. 
next_post_df.head()

In [3]:
def get_details(the_json, keys):
# Get the details of each reddit post in the json data.  
# Return the details in a list. 

    # Set details (dictionary keys) to select.  
    select_keys = ['title', 'subreddit', 'author', 'domain', 
                   'num_comments', 'permalink']

    # Initialize list.
    posts = []

    # Loop through the posts for selected dictionary key values.
    for i in range(len(the_json['data']['children'])):
        
        posts.extend([{k : the_json['data']['children'][i]['data'][k] 
                       for k in keys}])
    
    # Return details.
    return posts

## Request posts from reddit website using API.

In [12]:
# Initialize lists.
urls = ['https://www.reddit.com/r/news.json']
        'https://www.reddit.com/r/UpliftingNews.json']

all_posts = []

# Set up header for API.
headers = {'User-agent': 'Bingham v1.0'}

# Loop through urls.
for url in urls:

    # Set number of requests to make.
    n_requests = 40

    # Set details (dictionary keys) to select from reddit posts.  
    select_keys = ['title', 'name', 'subreddit', 'author', 'domain', 
                   'num_comments', 'permalink']
    
    # Print progress message.
    print(url)
    print('Request processing starting')
    
    after = 'init'

    # Loop for each request (note that each request returns up to 25 posts).
    for n in range(n_requests):

        # Print progress message.
        print('Making request #', n)

        # Set variable to indicate the id of the next post.
        after = next_post_df.loc[url,'after']
        
        if after == 'init':
            params = {}
        else:
            params = {'after': after}

        # Make request.      
        res = requests.get(url, params=params, headers=headers)

        # Process data.
        if res.status_code == 200:     # successful request
              
            # Get the details from all the returned posts.
            the_json = res.json()
            all_posts.extend(get_details(the_json, select_keys))
            
            # Prepare for next request.
            after = the_json['data']['after']
            next_post_df.loc[url,'after'] = after
            
            print("Last post in request:", after)
        
        else:                          # unsuccessful request
            print('Processing ended unexpectedly.') 
            print('Request.get response is ', res.status_code)
            print('url is ', url)
            break

        # Wait.
        time.sleep(3)
         
    # Print progress message.
    print('Request processing ended')

https://www.reddit.com/r/UpliftingNews.json
Request processing starting
Making request # 0
Last post in request: t3_a70j15
Making request # 1
Last post in request: t3_a65964
Making request # 2
Last post in request: t3_a652sf
Making request # 3
Last post in request: t3_a5b6mf
Making request # 4
Last post in request: t3_a5cb5k
Making request # 5
Last post in request: t3_a5101s
Making request # 6
Last post in request: t3_a4s11e
Making request # 7
Last post in request: t3_a4j9xl
Making request # 8
Last post in request: t3_a49t5y
Making request # 9
Last post in request: t3_a3r903
Making request # 10
Last post in request: t3_a3hrsy
Making request # 11
Last post in request: t3_a2rrnz
Making request # 12
Last post in request: t3_a2j1qe
Making request # 13
Last post in request: t3_a1qa0b
Making request # 14
Last post in request: t3_a17fgx
Making request # 15
Last post in request: t3_a0x903
Making request # 16
Last post in request: t3_a059g2
Making request # 17
Last post in request: t3_9z3qaj
Ma

In [13]:
len(all_posts)


966

In [14]:
# Check for duplicate titles.
len(set([p['title']for p in all_posts]))

482

In [15]:
# Load posts to a dataframe.
all_posts_df = pd.DataFrame(all_posts, columns = select_keys)

# Verify load.
all_posts_df.shape

(966, 7)

In [9]:
# Take a look.
all_posts_df.tail(50)

Unnamed: 0,title,name,subreddit,author,domain,num_comments,permalink
902,"Stethoscopes loaded with bacteria, including s...",t3_a6i0ro,news,jq1984_is_me,cbsnews.com,11,/r/news/comments/a6i0ro/stethoscopes_loaded_wi...
903,Israel destroys house of Palestinian charged w...,t3_a6gwwq,news,jq1984_is_me,reuters.com,43,/r/news/comments/a6gwwq/israel_destroys_house_...
904,Sandy Hook School Students Sent Home After Rec...,t3_a65xzu,news,mattpsu79,nbcconnecticut.com,424,/r/news/comments/a65xzu/sandy_hook_school_stud...
905,"Big data hints at how, when and where mental d...",t3_a6exk4,news,glowalmond,sciencenews.org,26,/r/news/comments/a6exk4/big_data_hints_at_how_...
906,"Special forces solider charged with murder; ""a...",t3_a69ix8,news,richmanding0,bbc.com,292,/r/news/comments/a69ix8/special_forces_solider...
907,Nurse Denied Life Insurance Because She Carrie...,t3_a64ca7,news,Drumlin,npr.org,1784,/r/news/comments/a64ca7/nurse_denied_life_insu...
908,Texas judge who approved plea deal for alleged...,t3_a654l5,news,constellationdust,usatoday.com,849,/r/news/comments/a654l5/texas_judge_who_approv...
909,Dept. of Education to cancel $150 million in s...,t3_a64nxv,news,ThouHastLostAn8th,nbcnews.com,1419,/r/news/comments/a64nxv/dept_of_education_to_c...
910,Three Men Indicted In Conspiracy to Kill Whist...,t3_a683s9,news,ItsEntirelyPossible,justice.gov,62,/r/news/comments/a683s9/three_men_indicted_in_...
911,J&amp;J shares plunge as much as 8.6% after re...,t3_a668f3,news,EnoughPM2020,cnbc.com,246,/r/news/comments/a668f3/jj_shares_plunge_as_mu...


In [10]:
# Take a look at next post data.
# next_post_df

## Save files.

In [16]:
timestamp = time.strftime("%Y%m%d-%H%M%S")

# Set file locations.
all_posts_csv = (f'./data/all_posts_{timestamp}.csv')

# Save.
next_post_df.to_csv(next_post_csv, encoding='utf-8')
all_posts_df.to_csv(all_posts_csv, encoding='utf-8', index=False)