# Exploring Public Sentiments of Chicago Neighborhoods

What is the relationship between the reputation and the characteristics of a neighborhood? 
Below I explore this question in the context of Chicago, IL, using social media (Twitter) data and a dataset of socioeconomic data per neighborhood available from the City of Chicago data portal (https://data.cityofchicago.org). For this analysis, a "neighborhood" is defined as one of Chicago's 77 community areas, which are officially recognized by the City of Chicago. I model neighborhood reputation as an index summarizing the sentiment of tweets mentioning a given neighborhood (more details provided as I work my way through the question).

I proceed in the following steps:
* Set up (import modules and keys that I'll need along the way)
* Construct dataset
    * Download relevant Twitter data
    * Apply sentiment analysis to tweets
    * Generate neighborhood reputation index scores for each neighborhood
    * Add socioeconomic data per neighborhood
* Develop descriptive statistics
* Develop inferential statistics
* Summarize findings

# Set up

Importing relevant modules and keys

In [1]:
import time
#import datetime

import pandas as pd
pd.set_option('display.max_colwidth', None)  # show as much of each tweet as possible; None removes the width limit (pandas >= 1.0; older versions used -1)

import nltk
# in command line, pip install twython 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# in terminal: easy_install pip
# in terminal: pip install tweepy
# in terminal: pip install --upgrade pip [per suggestion shown inside terminal]
import tweepy

%matplotlib inline
#import matplotlib.pyplot as plt

In [7]:
# These four codes constitute your authorization to extract data from Twitter.
# The values below are placeholders; substitute your own credentials and keep them out of shared notebooks.
consumer_token = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Creation of the actual interface, using authentication
# api = tweepy.API(auth)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Construct dataset

<h3>Downloading, cleaning up, and storing Twitter data in a csv file</h3>

Dataset Criteria:   
1. Tweet was posted within the last 6-9 days.
  * This time range is a limit set by Twitter's search API; see https://dev.twitter.com/rest/public/search
2. Tweet mentions one of Chicago's 77 community areas.
  * Note that two adjacent community areas in Chicago have "Englewood" in their names, and there are also places named "Englewood" in other states, such as NJ.
3. Tweet originates within 15 miles of Chicago, IL.
  * This is determined from the geocoding of tweets.
  * The geocode for Chicago comes from the latitude/longitude link on the Chicago Wikipedia page: https://en.wikipedia.org/wiki/Chicago
4. Tweet is in English.
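
Criterion 2 is harder than it looks because one community area name can contain another ("Englewood" inside "West Englewood"). As a minimal sketch, not part of the pipeline below, the hypothetical helper `most_specific_area` assigns a tweet to the longest community area name it mentions, using whole-phrase matching. The function name and the three-name list are assumptions for illustration.

```python
import re

# Illustrative subset of community area names (lower-cased); the full list has 77.
AREA_NAMES = ["englewood", "west englewood", "hyde park"]

def most_specific_area(text, area_names=AREA_NAMES):
    """Return the longest area name found in `text` as a whole phrase, or None."""
    lowered = text.lower()
    matches = [
        name for name in area_names
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    ]
    # Prefer the longest match so "west englewood" wins over "englewood".
    return max(matches, key=len) if matches else None

print(most_specific_area("Block party in West Englewood tonight"))
print(most_specific_area("Englewood farmers market opens Saturday"))
```

This only separates overlapping names within Chicago; excluding places named Englewood in other states still relies on the geographic filter in criterion 3.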


In [17]:
# importing table with socioeconomic stats per neighborhood, but for now only use its column with community area names
SES = pd.read_csv('https://raw.githubusercontent.com/yarikan/final_project_in_progress/master/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv?token=AQAvwPu3y_8vQZSH9YaxBCH2d7QLk3KOks5YWXU0wA%3D%3D')
SES = SES.dropna()  # drop rows with missing values; this removes the last row, which holds the totals for the whole city
#SES
#SES.columns
SES['COMMUNITY AREA NAME']

0     Rogers Park           
1     West Ridge            
2     Uptown                
3     Lincoln Square        
4     North Center          
5     Lake View             
6     Lincoln Park          
7     Near North Side       
8     Edison Park           
9     Norwood Park          
10    Jefferson Park        
11    Forest Glen           
12    North Park            
13    Albany Park           
14    Portage Park          
15    Irving Park           
16    Dunning               
17    Montclaire            
18    Belmont Cragin        
19    Hermosa               
20    Avondale              
21    Logan Square          
22    Humboldt park         
23    West Town             
24    Austin                
25    West Garfield Park    
26    East Garfield Park    
27    Near West Side        
28    North Lawndale        
29    South Lawndale        
           ...              
47    Calumet Heights       
48    Roseland              
49    Pullman               
50    South De

In [8]:
# check the status of my search request limit
api.rate_limit_status('search')

{'rate_limit_context': {'access_token': '<redacted>'},
 'resources': {'search': {'/search/tweets': {'limit': 180,
    'remaining': 180,
    'reset': 1481679991}}}}
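
The `reset` value in this response is a Unix timestamp (seconds since 1970-01-01 UTC). A small sketch, using only the standard library, converts it into a readable time; the variable names are illustrative.

```python
from datetime import datetime, timezone

reset_epoch = 1481679991  # 'reset' value from the response above
reset_time = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
print(reset_time.isoformat())  # 2016-12-14T01:46:31+00:00
```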

In [23]:
# establish lists or "buckets" into which each relevant data point is going to be appended:
community_area_name_list = []
user_screen_name_list = []
user_location_list = []  
user_followers_count_list = []
tweet_coordinates_list = []
tweet_date_time_list = [] 
tweet_content_list = [] 
tweet_num_retweet_list = [] 
tweet_num_liked_list = []

# Caution: this loop takes a very long time (a 60-second pause per community area, plus rate-limit waits)

for name in SES['COMMUNITY AREA NAME']:
    time.sleep(60)  # suspends execution of the current thread for this many seconds
    for tweet in tweepy.Cursor(api.search,
                           q = name,
                           lang = 'en', 
#                           geocode = '41.836944,-87.684722,15mi').items(10):
                           geocode = '41.836944,-87.684722,15mi').items():
        community_area_name_list.append(name.lower())
        user_screen_name_list.append(tweet.user.screen_name)
        user_location_list.append(tweet.user.location)
        user_followers_count_list.append(tweet.user.followers_count)
        tweet_coordinates_list.append(tweet.coordinates)
        tweet_date_time_list.append(tweet.created_at)
        tweet_content_list.append(tweet.text)
        tweet_num_retweet_list.append(tweet.retweet_count)
        tweet_num_liked_list.append(tweet.favorite_count)

df = pd.DataFrame({
        "Community Area Mentioned": community_area_name_list,
        "User Screen Name": user_screen_name_list,
        "User Location": user_location_list,
        "User Number of Followers":user_followers_count_list,
        "Tweet Coordinates": tweet_coordinates_list,
        "Tweet Day and Time": tweet_date_time_list,
        "Tweet Number of Times Retweeted": tweet_num_retweet_list,
        "Tweet Number of Times Liked": tweet_num_liked_list,
        "Tweet Content": tweet_content_list})

Rate limit reached. Sleeping for: 724
Rate limit reached. Sleeping for: 580
Rate limit reached. Sleeping for: 634
Rate limit reached. Sleeping for: 714
Rate limit reached. Sleeping for: 712
Rate limit reached. Sleeping for: 557
Rate limit reached. Sleeping for: 634
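
The nine parallel lists above must stay in step with one another. One alternative, sketched here under the assumption that each tweet object exposes the same attributes read in the loop, is to build one dictionary per tweet. The helper `tweet_to_record` and the stand-in tweet are hypothetical, and no Twitter call is made.

```python
from types import SimpleNamespace

def tweet_to_record(name, tweet):
    """Flatten one tweet object into a dict keyed by the column names used above."""
    return {
        "Community Area Mentioned": name.lower(),
        "User Screen Name": tweet.user.screen_name,
        "User Location": tweet.user.location,
        "User Number of Followers": tweet.user.followers_count,
        "Tweet Coordinates": tweet.coordinates,
        "Tweet Day and Time": tweet.created_at,
        "Tweet Number of Times Retweeted": tweet.retweet_count,
        "Tweet Number of Times Liked": tweet.favorite_count,
        "Tweet Content": tweet.text,
    }

# Stand-in object with made-up values, mimicking the attributes read in the loop.
fake_tweet = SimpleNamespace(
    user=SimpleNamespace(screen_name="example_user", location="Chicago", followers_count=10),
    coordinates=None,
    created_at="2016-12-13 18:48:38",
    text="Lovely day in Rogers Park",
    retweet_count=2,
    favorite_count=0,
)

records = [tweet_to_record("Rogers Park", fake_tweet)]
print(records[0]["Community Area Mentioned"])
```

Passing such a list of dictionaries to `pd.DataFrame(records)` would give the same nine columns as the dictionary-of-lists construction.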


In [26]:
df.shape
#df[0:3]
#df.to_csv('df_raw_data.csv')

(18158, 9)

<h3>Applying sentiment analysis to tweets</h3>

In [43]:
df = pd.read_csv('df_raw_data.csv')
df.columns

Index(['Unnamed: 0', 'Community Area Mentioned', 'Tweet Content',
       'Tweet Coordinates', 'Tweet Day and Time',
       'Tweet Number of Times Liked', 'Tweet Number of Times Retweeted',
       'User Location', 'User Number of Followers', 'User Screen Name'],
      dtype='object')

In [44]:
analyzer = SentimentIntensityAnalyzer()
df["Sentiment"] = df['Tweet Content'].apply(lambda tweet: analyzer.polarity_scores(tweet))

In [45]:
df[0:1]

Unnamed: 0.1,Unnamed: 0,Community Area Mentioned,Tweet Content,Tweet Coordinates,Tweet Day and Time,Tweet Number of Times Liked,Tweet Number of Times Retweeted,User Location,User Number of Followers,User Screen Name,Sentiment
0,0.0,rogers park,RT @ChiTribBiz: This Rogers Park building is the latest to attract investor attention as the neighborhood undergoes “a renaissance” https:/…,,2016-12-13 18:48:38,0.0,2.0,,443.0,giovannibacci36,"{'neu': 0.884, 'compound': 0.3612, 'pos': 0.116, 'neg': 0.0}"


In [46]:
df.Sentiment[0]
#df['Sentiment'][0]

# The following examples demonstrate the value of the compound score:
# VADER is smart, handsome, and funny. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316} 
# VADER is smart, handsome, and funny! {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439} 
# VADER is very smart, handsome, and funny. {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545} 
# VADER is VERY SMART, handsome, and FUNNY. {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227} 
# VADER is VERY SMART, handsome, and FUNNY!!! {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342} 
# VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!! {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469} 

{'compound': 0.3612, 'neg': 0.0, 'neu': 0.884, 'pos': 0.116}

In [47]:
# print(type(df['Sentiment'][0]))
type(df.Sentiment[0])

dict

In [48]:
# convert to string
#df['Sentiment'] = df['Sentiment'].apply(str)
df.Sentiment = df.Sentiment.apply(str)

type(df.Sentiment[0])

str

In [49]:
# proceed to extract sentiment compound scores as additional column
df_temp = df.Sentiment.str.split(',').apply(pd.Series)
#df_temp[0:5]
df_temp[1][0:5]

0     'compound': 0.3612
1     'compound': 0.3612
2     'compound': 0.34  
3     'compound': 0.5574
4     'compound': 0.5574
Name: 1, dtype: object

In [50]:
left_split = df_temp[1].str.split(': ').apply(pd.Series)
#left_split[1][0:5]
compound = left_split[1].apply(float)
df['Compound Sentiment Score'] = compound
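
Splitting the string form of the dictionary works only while `'compound'` happens to be the second key. A sketch of an order-independent alternative reads the key directly from each dictionary, before any conversion to string. The two sample rows reuse scores shown earlier in this notebook, and `toy` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentiment": [
        {"neu": 0.884, "compound": 0.3612, "pos": 0.116, "neg": 0.0},
        {"neg": 0.0, "neu": 0.825, "pos": 0.175, "compound": 0.34},  # different key order
    ]
})

# Look the score up by key, so the position of 'compound' does not matter.
toy["Compound Sentiment Score"] = toy["Sentiment"].apply(lambda scores: scores["compound"])
print(toy["Compound Sentiment Score"].tolist())
```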

In [52]:
#df[0:5]
df.to_csv('df_wCompound.csv')

<h3>Generating neighborhood reputation index scores for each neighborhood</h3>

In [151]:
df = pd.read_csv('df_wCompound.csv')
df.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Community Area Mentioned',
       'Tweet Content', 'Tweet Coordinates', 'Tweet Day and Time',
       'Tweet Number of Times Liked', 'Tweet Number of Times Retweeted',
       'User Location', 'User Number of Followers', 'User Screen Name',
       'Sentiment', 'Compound Sentiment Score'],
      dtype='object')

In [152]:
del df['Unnamed: 0']
del df['Unnamed: 0.1']
df['Count 1'] = 1
df.shape

(18159, 12)

In [153]:
df[0:5]

Unnamed: 0,Community Area Mentioned,Tweet Content,Tweet Coordinates,Tweet Day and Time,Tweet Number of Times Liked,Tweet Number of Times Retweeted,User Location,User Number of Followers,User Screen Name,Sentiment,Compound Sentiment Score,Count 1
0,rogers park,RT @ChiTribBiz: This Rogers Park building is the latest to attract investor attention as the neighborhood undergoes “a renaissance” https:/…,,2016-12-13 18:48:38,0.0,2.0,,443.0,giovannibacci36,"{'neu': 0.884, 'compound': 0.3612, 'pos': 0.116, 'neg': 0.0}",0.3612,1
1,rogers park,RT @ChiTribBiz: This Rogers Park building is the latest to attract investor attention as the neighborhood undergoes “a renaissance” https:/…,,2016-12-13 18:47:24,0.0,2.0,Chicago,568.0,_ChicagoEDBHome,"{'neu': 0.884, 'compound': 0.3612, 'pos': 0.116, 'neg': 0.0}",0.3612,1
2,rogers park,Spirit Bascom Ventures Pays $18.9 Million for Sheridan Court Apts. in Chicago's Rogers Park: Spirit Bascom Ventures… https://t.co/xwXL7OYoQD,,2016-12-13 18:20:42,0.0,0.0,Chicago IL,1011.0,RENewsChicago,"{'neu': 0.825, 'compound': 0.34, 'pos': 0.175, 'neg': 0.0}",0.34,1
3,rogers park,Public Good’s platform empowers people to Take Action when they have read about an issue and make positive change.\n\nhttps://t.co/9d5vpw3Jhy,,2016-12-13 17:21:06,0.0,0.0,"Chicago, IL.",418.0,WilliamRMorton,"{'neu': 0.841, 'compound': 0.5574, 'pos': 0.159, 'neg': 0.0}",0.5574,1
4,rogers park,Public Good’s platform empowers people to Take Action when they have read about an issue and make positive change.\n\nhttps://t.co/eU7hXH9JyD,,2016-12-13 17:21:06,0.0,0.0,"Chicago, IL",109.0,Arthur_Avenue,"{'neu': 0.841, 'compound': 0.5574, 'pos': 0.159, 'neg': 0.0}",0.5574,1


I assume that reputation entails both a sentiment (positive, negative) and its diffusion. Therefore I will compute a neighborhood reputation index using the sentiment data, as well as information on the number of "likes" and "retweets" for each tweet and followers for the author of each tweet.

In [154]:
# I begin by computing the diffusion of each tweet.
# I assume that likers, retweeters, and followers do not overlap although in reality they likely do.

# diffusion to passive readers (Followers)
df['User to Follower Diffusion'] = df['Compound Sentiment Score']*df['User Number of Followers']

# diffusion to more active readers, demonstrated by liking a tweet
df['User to Liker Diffusion'] = df['Compound Sentiment Score']*df['Tweet Number of Times Liked']

# diffusion to even more active readers, demonstrated by retweeting a tweet
df['User to Retweeter Diffusion'] = df['Compound Sentiment Score']*df['Tweet Number of Times Retweeted']

# total diffusion
df['Tweet Total Diffusion'] = df['User to Follower Diffusion'] + df['User to Liker Diffusion'] + df['User to Retweeter Diffusion']
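
As a worked example of these formulas, take the first tweet shown earlier: compound score 0.3612, 443 followers, 0 likes, and 2 retweets. The sketch below reproduces the arithmetic in plain Python; the function name `total_diffusion` is illustrative.

```python
def total_diffusion(compound, followers, likes, retweets):
    """Sum of sentiment-weighted reach across followers, likers, and retweeters."""
    return compound * followers + compound * likes + compound * retweets

first_tweet = total_diffusion(compound=0.3612, followers=443, likes=0, retweets=2)
print(round(first_tweet, 4))  # 160.734
```

In this example the follower term (0.3612 × 443) accounts for nearly all of the total, since the follower count is much larger than the like and retweet counts.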

In [155]:
#df[0:5]

In [156]:
# Now I establish the number of positive vs. negative sentiment tweets per neighborhood. 

#tweet_by_neighb = df.groupby('Community Area Mentioned')
#tweet_total_by_neighb_num = tweet_by_neighb['Community Area Mentioned'].agg('count')
#tweet_total_by_neighb_num

tweet_by_neighb = df.groupby('Community Area Mentioned')

# total number of positive tweets per neighborhood:
total_tweet_by_neighb_pos_num = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] > 0]['Count 1'].sum())
# total number of negative tweets per neighborhood:
total_tweet_by_neighb_neg_num = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] < 0]['Count 1'].sum())
# total number of positive and negative tweets per neighborhood:
total_tweet_by_neighb_num = total_tweet_by_neighb_pos_num + total_tweet_by_neighb_neg_num
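
The same counts can be obtained without a helper column of ones. The sketch below, on a made-up five-row frame, sums boolean masks within each group; it is an alternative formulation for illustration, not the code used to produce the results in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Community Area Mentioned": ["austin", "austin", "austin", "uptown", "uptown"],
    "Compound Sentiment Score": [0.5, -0.2, 0.0, 0.3, 0.1],
})

score = toy["Compound Sentiment Score"]
area = toy["Community Area Mentioned"]

# Boolean masks sum to counts: True counts as 1, False as 0.
pos_counts = (score > 0).groupby(area).sum()
neg_counts = (score < 0).groupby(area).sum()
total_counts = pos_counts + neg_counts  # neutral tweets (score == 0) are excluded

print(total_counts.to_dict())
```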

In [157]:
# Now I compute summary measures of the sentiment component of a neighborhood's reputation.
# Because I assume that a reputation involves a positive or negative sentiment, 
# I exclude all tweets that score as neutral (compound score = 0, review using df[df['Compound Sentiment Score'] == 0]).

tweet_by_neighb = df.groupby('Community Area Mentioned')

# Sum of compound scores over positive tweets and over negative tweets, per neighborhood.
# No loop is needed: each groupby-apply already covers every neighborhood.
total_pos_tweet_by_neighb = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] > 0]['Compound Sentiment Score'].sum())
total_neg_tweet_by_neighb = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] < 0]['Compound Sentiment Score'].sum())

In [158]:
neighb_level_df = pd.DataFrame({'Total Number of Positive Tweets':total_tweet_by_neighb_pos_num,
                                'Total Number of Negative Tweets':total_tweet_by_neighb_neg_num,
                                'Total Number of Positive and Negative Tweets': total_tweet_by_neighb_num,
                                'Total Positive Tweet Sentiment':total_pos_tweet_by_neighb,
                                'Total Negative Tweet Sentiment':total_neg_tweet_by_neighb})

In [159]:
#neighb_level_df.iloc[0]  # access a specific row in the df
neighb_level_df = neighb_level_df.drop(neighb_level_df.index[[0]])
neighb_level_df.shape

(72, 5)

In [160]:
neighb_level_df

Community Area Mentioned,Total Negative Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets,Total Positive Tweet Sentiment
albany park,-5.6676,16,18,34,9.7646
archer heights,0.0000,0,1,1,0.5574
ashburn,-10.6303,28,15,43,5.5188
auburn gresham,-5.9133,17,2,19,1.1233
austin,-14.1287,34,36,70,18.5554
avalon park,0.0000,0,0,0,0.0000
avondale,-7.2062,18,15,33,7.6270
belmont cragin,0.0000,0,0,0,0.0000
beverly,-35.3824,115,344,459,174.6288
bridgeport,-16.1919,36,87,123,44.0589


<h4>Computing weighted and unweighted reputation scores per neighborhood</h4>

<u>Unweighted Reputation Score:</u> The simple mean of a neighborhood's average positive sentiment and its average negative sentiment. This assumes that a positive tweet is equal in influence to a negative tweet (apart from the direction of opinion).

<u>Weighted Reputation Score:</u> Humans have a well-documented bias toward the negative: "even when of equal intensity, things of a more negative nature (e.g. unpleasant thoughts, emotions, or social interactions; harmful/traumatic events) have a greater effect on one's psychological state and processes than do neutral or positive things" (https://en.wikipedia.org/wiki/Negativity_bias). According to a Harvard Business Review article (Zenger, Jack, and Joseph Folkman. 2013. “The Ideal Praise-to-Criticism Ratio.” Harvard Business Review. March 15. https://hbr.org/2013/03/the-ideal-praise-to-criticism), the ideal praise-to-criticism ratio is 5-6 positive comments for every negative one. The weighted reputation score below therefore counts a neighborhood's average negative sentiment five times for every one count of its average positive sentiment.
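
To make the two definitions concrete, here is a small sketch of both scores as a plain Python function. The name `reputation_scores` and the `neg_weight` parameter are illustrative; the inputs are the average negative and positive sentiment reported for Near North Side in the results table of this section.

```python
def reputation_scores(avg_neg, avg_pos, neg_weight=5):
    """Return (unweighted, weighted) reputation scores for one neighborhood."""
    unweighted = (avg_neg + avg_pos) / 2
    # Negative sentiment counts neg_weight times, positive sentiment once.
    weighted = (neg_weight * avg_neg + avg_pos) / (neg_weight + 1)
    return unweighted, weighted

unweighted, weighted = reputation_scores(avg_neg=-0.802, avg_pos=0.491075)
print(unweighted, weighted)
```

With these inputs the function gives roughly -0.1555 and -0.5865, in line with the two reputation scores listed for that neighborhood.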

In [161]:
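# Average sentiment per negative / positive tweet in each neighborhood.
# No loop is needed: these are element-wise operations on whole Series.
# A neighborhood with no tweets of a given sign yields NaN here (0 divided by 0).
avg_neg = total_neg_tweet_by_neighb / total_tweet_by_neighb_neg_num
avg_pos = total_pos_tweet_by_neighb / total_tweet_by_neighb_pos_num
reputation_index_1 = (avg_neg+avg_pos) / 2        # unweighted
reputation_index_2 = ((5*avg_neg)+avg_pos) / 6    # negative sentiment weighted 5:1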
    
neighb_level_df['Average Negative Sentiment'] = avg_neg
neighb_level_df['Average Positive Sentiment'] = avg_pos
neighb_level_df['Unweighted Reputation Score'] = reputation_index_1
neighb_level_df['Weighted Reputation Score'] = reputation_index_2

In [162]:
neighb_level_df.fillna(0, inplace=True)
neighb_level_df.sort_values(by='Unweighted Reputation Score')

Community Area Mentioned,Total Negative Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets,Total Positive Tweet Sentiment,Average Negative Sentiment,Average Positive Sentiment,Unweighted Reputation Score,Weighted Reputation Score
near north side,-0.8020,1,16,17,7.8572,-0.802000,0.491075,-0.155463,-0.586487
oakland,-18.7194,32,22,54,6.1594,-0.584981,0.279973,-0.152504,-0.440822
chicago lawn,-7.2072,12,19,31,5.9058,-0.600600,0.310832,-0.144884,-0.448695
riverdale,-134.1097,210,20,230,8.7716,-0.638618,0.438580,-0.100019,-0.459085
mount greenwood,-15.6672,25,14,39,6.5758,-0.626688,0.469700,-0.078494,-0.443957
west garfield park,-5.0693,7,8,15,4.6487,-0.724186,0.581088,-0.071549,-0.506640
burnside,-3.4787,5,2,7,1.1148,-0.695740,0.557400,-0.069170,-0.486883
hyde park,-79.1596,122,239,361,124.0035,-0.648849,0.518843,-0.065003,-0.454234
chatham,-72.7402,114,27,141,14.6224,-0.638072,0.541570,-0.048251,-0.441465
new city,-68.2218,120,224,344,106.1651,-0.568515,0.473951,-0.047282,-0.394771


In [163]:
neighb_level_df.shape
neighb_level_df.to_csv('neighb_level_df.csv')

<h3>Adding socioeconomic data per neighborhood</h3>
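
A minimal sketch of this step, assuming the two tables are matched on lower-cased community area names (the form used as the index of `neighb_level_df`). The toy frames, the `HARDSHIP INDEX` column name, and all numbers below are placeholders rather than real data.

```python
import pandas as pd

# Toy stand-ins for the real tables; values are invented for illustration.
ses_toy = pd.DataFrame({
    "COMMUNITY AREA NAME": ["Rogers Park", "Hyde Park"],
    "HARDSHIP INDEX": [39.0, 14.0],
})
reputation_toy = pd.DataFrame(
    {"Unweighted Reputation Score": [0.05, -0.07]},
    index=pd.Index(["hyde park", "rogers park"], name="Community Area Mentioned"),
)

# Normalize names the same way the tweet data was keyed (stripped, lower-case).
ses_toy["Community Area Mentioned"] = ses_toy["COMMUNITY AREA NAME"].str.strip().str.lower()

# Left join keeps every neighborhood that has a reputation score.
merged = reputation_toy.join(ses_toy.set_index("Community Area Mentioned"), how="left")
print(merged)
```

Names that differ by more than case or surrounding whitespace between the two sources would not match under this approach and would need to be reconciled by hand.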

# Develop descriptive statistics

# Develop inferential statistics

# Summarize findings