# CHIsentiment: Exploring Public Sentiments of Chicago Neighborhoods

What is the relationship between the reputation and the characteristics of a neighborhood? 
Below I explore this question in the context of Chicago, IL, using social media (Twitter) data and a dataset of socioeconomic data per neighborhood available from the City of Chicago data portal (https://data.cityofchicago.org). For this exploration, a "neighborhood" is defined as one of Chicago's 77 community areas, which are officially recognized by the City of Chicago. I model neighborhood reputation as an index summarizing the sentiment of tweets mentioning a given neighborhood (more details provided as I work my way through the question).

I proceed in the following steps:
* Set up (import modules and keys that I'll need along the way)
* Construct dataset
    * Download relevant Twitter data
    * Apply sentiment analysis to tweets
    * Generate neighborhood reputation index scores for each neighborhood
    * Put socioeconomic and Twitter data per neighborhood into one table
* Develop descriptive statistics
* Develop inferential statistics
* Conclusions

# Set up

Importing relevant modules and keys

In [3]:
import time

import pandas as pd
pd.set_option('display.max_colwidth', -1)  # show as much of each tweet as possible (newer pandas versions expect None instead of -1)

%matplotlib inline
#import matplotlib.pyplot as plt

In [None]:
# in terminal: easy_install pip
# in terminal: pip install tweepy
# in terminal: pip install --upgrade pip [per suggestion shown inside terminal]
import tweepy

import nltk
# in command line, pip install twython
# one-time setup for the VADER lexicon: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [7]:
# These four codes constitute your authorization to extract data from Twitter
consumer_token = 'please'
consumer_secret = 'use'
access_token = 'your'
access_token_secret = 'own'

# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Creation of the actual interface, using authentication
# api = tweepy.API(auth)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Construct dataset

<h3>Downloading, cleaning up, and storing Twitter data in a csv file</h3>

Dataset Criteria:   
1. tweet was posted within the last 6-9 days.
  * This time range limit is set by Twitter, see https://dev.twitter.com/rest/public/search
2. tweet mentions one of Chicago's 77 community areas
  * Note that two adjacent community areas inside Chicago have "Englewood" in their names. There are also places named "Englewood" in other states, such as NJ.
3. tweet originates within 15 miles of Chicago, IL
  * Determine this by using geocoding of tweets
  * Identify geocode for Chicago by clicking on latitude/longitude info on Chicago wikipedia page: https://en.wikipedia.org/wiki/Chicago
4. tweet is in English
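
The "Englewood" caveat in criterion 2 can be checked mechanically. The following is a minimal sketch, assuming a plain substring match on lowercased names; `overlapping_names` is a hypothetical helper that is not part of the notebook, and the short sample list stands in for the full column of 77 community area names.

```python
def overlapping_names(names):
    """Map each name to the other names that contain it as a substring."""
    lowered = [n.lower() for n in names]
    overlaps = {}
    for short in lowered:
        longer = [other for other in lowered if other != short and short in other]
        if longer:
            overlaps[short] = longer
    return overlaps


# Illustrative sample only; the real list would come from SES['COMMUNITY AREA NAME'].
sample = ["Englewood", "West Englewood", "Rogers Park", "Uptown"]
print(overlapping_names(sample))
```

Any name reported here would also match searches intended for the longer name, so tweets for those areas may be counted under both.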


In [17]:
# import table with column of community area names
SES = pd.read_csv('Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv')
SES = SES.dropna()  # do this so that the last row is dropped, which shows totals for whole city
#SES
#SES.columns
SES['COMMUNITY AREA NAME']

0     Rogers Park           
1     West Ridge            
2     Uptown                
3     Lincoln Square        
4     North Center          
5     Lake View             
6     Lincoln Park          
7     Near North Side       
8     Edison Park           
9     Norwood Park          
10    Jefferson Park        
11    Forest Glen           
12    North Park            
13    Albany Park           
14    Portage Park          
15    Irving Park           
16    Dunning               
17    Montclaire            
18    Belmont Cragin        
19    Hermosa               
20    Avondale              
21    Logan Square          
22    Humboldt park         
23    West Town             
24    Austin                
25    West Garfield Park    
26    East Garfield Park    
27    Near West Side        
28    North Lawndale        
29    South Lawndale        
           ...              
47    Calumet Heights       
48    Roseland              
49    Pullman               
50    South De

In [8]:
# check the status of my search request limit
api.rate_limit_status('search')

{'rate_limit_context': {'access_token': '2382930698-0eCycGIeqv4SUmOvSINQbkhnb2v9hTPDlSpcb8q'},
 'resources': {'search': {'/search/tweets': {'limit': 180,
    'remaining': 180,
    'reset': 1481679991}}}}

In [23]:
# establish lists or "buckets" into which each relevant data point is going to be appended:
community_area_name_list = []
user_screen_name_list = []
user_location_list = []  
user_followers_count_list = []
tweet_coordinates_list = []
tweet_date_time_list = [] 
tweet_content_list = [] 
tweet_num_retweet_list = [] 
tweet_num_liked_list = []

# caution, the following takes a very long time

for name in SES['COMMUNITY AREA NAME']:
    time.sleep(60)  # suspends execution of the current thread for this many seconds
    for tweet in tweepy.Cursor(api.search,
                           q = name,
                           lang = 'en', 
#                           geocode = '41.836944,-87.684722,15mi').items(10):
                           geocode = '41.836944,-87.684722,15mi').items():
        community_area_name_list.append(name.lower())
        user_screen_name_list.append(tweet.user.screen_name)
        user_location_list.append(tweet.user.location)
        user_followers_count_list.append(tweet.user.followers_count)
        tweet_coordinates_list.append(tweet.coordinates)
        tweet_date_time_list.append(tweet.created_at)
        tweet_content_list.append(tweet.text)
        tweet_num_retweet_list.append(tweet.retweet_count)
        tweet_num_liked_list.append(tweet.favorite_count)

df = pd.DataFrame({
        "Community Area Mentioned": community_area_name_list,
        "User Screen Name": user_screen_name_list,
        "User Location": user_location_list,
        "User Number of Followers":user_followers_count_list,
        "Tweet Coordinates": tweet_coordinates_list,
        "Tweet Day and Time": tweet_date_time_list,
        "Tweet Number of Times Retweeted": tweet_num_retweet_list,
        "Tweet Number of Times Liked": tweet_num_liked_list,
        "Tweet Content": tweet_content_list})

Rate limit reached. Sleeping for: 724
Rate limit reached. Sleeping for: 580
Rate limit reached. Sleeping for: 634
Rate limit reached. Sleeping for: 714
Rate limit reached. Sleeping for: 712
Rate limit reached. Sleeping for: 557
Rate limit reached. Sleeping for: 634


In [26]:
df.shape
#df[0:3]
#df.to_csv('df_raw_data.csv')

(18158, 9)

<h3>Applying sentiment analysis to tweets</h3>

In [43]:
df = pd.read_csv('df_raw_data.csv')
df.columns

Index(['Unnamed: 0', 'Community Area Mentioned', 'Tweet Content',
       'Tweet Coordinates', 'Tweet Day and Time',
       'Tweet Number of Times Liked', 'Tweet Number of Times Retweeted',
       'User Location', 'User Number of Followers', 'User Screen Name'],
      dtype='object')

In [44]:
analyzer = SentimentIntensityAnalyzer()
df["Sentiment"] = df['Tweet Content'].apply(lambda tweet: analyzer.polarity_scores(tweet))

In [45]:
df[0:1]

Unnamed: 0.1,Unnamed: 0,Community Area Mentioned,Tweet Content,Tweet Coordinates,Tweet Day and Time,Tweet Number of Times Liked,Tweet Number of Times Retweeted,User Location,User Number of Followers,User Screen Name,Sentiment
0,0.0,rogers park,RT @ChiTribBiz: This Rogers Park building is the latest to attract investor attention as the neighborhood undergoes “a renaissance” https:/…,,2016-12-13 18:48:38,0.0,2.0,,443.0,giovannibacci36,"{'neu': 0.884, 'compound': 0.3612, 'pos': 0.116, 'neg': 0.0}"


In [46]:
df.Sentiment[0]
#df['Sentiment'][0]

# The following examples demonstrate the value of the compound score:
# VADER is smart, handsome, and funny. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316} 
# VADER is smart, handsome, and funny! {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439} 
# VADER is very smart, handsome, and funny. {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545} 
# VADER is VERY SMART, handsome, and FUNNY. {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227} 
# VADER is VERY SMART, handsome, and FUNNY!!! {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342} 
# VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!! {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469} 

{'compound': 0.3612, 'neg': 0.0, 'neu': 0.884, 'pos': 0.116}
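
To give a sense of where the compound score comes from: as I understand the published VADER implementation, the summed word valences are squashed into the range -1 to 1 with a normalization constant of 15. The sketch below reproduces that step under this assumption; `normalize_valence` is my own name for it, and the input of 1.5 is a hypothetical valence sum chosen for illustration.

```python
import math


def normalize_valence(valence_sum, alpha=15):
    """Squash a raw valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp defensively so the result never leaves the documented range.
    return max(-1.0, min(1.0, score))


print(round(normalize_valence(1.5), 4))
```

Under this formula, stronger wording or punctuation raises the raw sum but yields progressively smaller gains in the compound score, which matches the pattern in the examples above.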

In [47]:
# prepare sentiment score for putting into a column of its own

# print(type(df['Sentiment'][0]))
type(df.Sentiment[0])

dict

In [48]:
# convert to string
#df['Sentiment'] = df['Sentiment'].apply(str)
df.Sentiment = df.Sentiment.apply(str)

type(df.Sentiment[0])

str

In [49]:
# now extract sentiment compound scores as additional column

df_temp = df.Sentiment.str.split(',').apply(pd.Series)
#df_temp[0:5]
df_temp[1][0:5]

0     'compound': 0.3612
1     'compound': 0.3612
2     'compound': 0.34  
3     'compound': 0.5574
4     'compound': 0.5574
Name: 1, dtype: object

In [50]:
left_split = df_temp[1].str.split(': ').apply(pd.Series)
#left_split[1][0:5]
compound = left_split[1].apply(float)
df['Compound Sentiment Score'] = compound
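
The split-based extraction above relies on `'compound'` being the second entry in the string form of the sentiment dictionary, which depends on the Python version's dictionary ordering. A more robust alternative, sketched here under the assumption that the column holds the stringified dictionaries shown earlier, is to parse the string back into a dictionary and look up the key by name. `compound_from_repr` is a hypothetical helper, not part of the notebook.

```python
import ast


def compound_from_repr(sentiment_repr):
    """Parse a stringified VADER score dict and return its compound value."""
    scores = ast.literal_eval(sentiment_repr)
    return float(scores['compound'])


# The same scores in two different key orders give the same result.
order_a = "{'neu': 0.884, 'compound': 0.3612, 'pos': 0.116, 'neg': 0.0}"
order_b = "{'compound': 0.3612, 'neg': 0.0, 'neu': 0.884, 'pos': 0.116}"
print(compound_from_repr(order_a), compound_from_repr(order_b))
```

With a helper like this, the column could be filled via `df['Sentiment'].apply(compound_from_repr)`; if the column still holds dictionaries rather than strings, a direct lookup of `'compound'` would do.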

In [52]:
#df[0:5]
df.to_csv('df_wCompound.csv')

<h3>Generating neighborhood reputation index scores for each neighborhood</h3>

In [41]:
df = pd.read_csv('df_wCompound.csv')
df.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Community Area Mentioned',
       'Tweet Content', 'Tweet Coordinates', 'Tweet Day and Time',
       'Tweet Number of Times Liked', 'Tweet Number of Times Retweeted',
       'User Location', 'User Number of Followers', 'User Screen Name',
       'Sentiment', 'Compound Sentiment Score'],
      dtype='object')

In [42]:
del df['Unnamed: 0']
del df['Unnamed: 0.1']
df['Count 1'] = 1
df.shape

(18159, 12)

In [43]:
#df[0:5]

I assume that reputation entails both a sentiment (positive, negative) and its diffusion. Therefore I will compute a neighborhood reputation index using the sentiment data, as well as information on the number of "likes" and "retweets" for each tweet and followers for the author of each tweet.

In [44]:
# I begin by computing the diffusion of each tweet.
# I assume that likers, retweeters, and followers do not overlap although in reality they likely do.

# diffusion to passive readers (Followers)
df['User to Follower Diffusion'] = df['Compound Sentiment Score']*df['User Number of Followers']

# diffusion to more active readers, demonstrated by liking a tweet
df['User to Liker Diffusion'] = df['Compound Sentiment Score']*df['Tweet Number of Times Liked']

# diffusion to even more active readers, demonstrated by retweeting a tweet
df['User to Retweeter Diffusion'] = df['Compound Sentiment Score']*df['Tweet Number of Times Retweeted']

# total diffusion
df['Tweet Total Diffusion'] = df['User to Follower Diffusion'] + df['User to Liker Diffusion'] + df['User to Retweeter Diffusion']

In [45]:
# Now I establish the number of positive vs. negative sentiment tweets per neighborhood. 

tweet_by_neighb = df.groupby('Community Area Mentioned')

# total number of positive tweets per neighborhood:
total_tweet_by_neighb_pos_num = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] > 0]['Count 1'].sum())
# total number of negative tweets per neighborhood:
total_tweet_by_neighb_neg_num = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] < 0]['Count 1'].sum())
# total number of positive and negative tweets per neighborhood:
total_tweet_by_neighb_num = total_tweet_by_neighb_pos_num + total_tweet_by_neighb_neg_num

In [46]:
# Now I compute summary measures of the sentiment component of a neighborhood's reputation.
# Because I assume that a reputation involves a positive or negative sentiment, 
# I exclude all tweets that score as neutral (compound score = 0; review using df[df['Compound Sentiment Score'] == 0]).

tweet_by_neighb = df.groupby('Community Area Mentioned')

# groupby.apply already evaluates the lambda once per neighborhood, so no explicit loop is needed
sum_pos_tweet_by_neighb = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] > 0]['Compound Sentiment Score'].sum())
sum_neg_tweet_by_neighb = tweet_by_neighb.apply(lambda x: x[x['Compound Sentiment Score'] < 0]['Compound Sentiment Score'].sum())

In [47]:
neighb_level_df = pd.DataFrame({'Total Number of Positive Tweets':total_tweet_by_neighb_pos_num,
                                'Total Number of Negative Tweets':total_tweet_by_neighb_neg_num,
                                'Total Number of Positive and Negative Tweets': total_tweet_by_neighb_num,
                                'Sum Positive Tweet Sentiment':sum_pos_tweet_by_neighb,
                                'Sum Negative Tweet Sentiment':sum_neg_tweet_by_neighb})

In [48]:
neighb_level_df = neighb_level_df.drop(neighb_level_df.index[[0]])  # removes one row with no community area name

In [49]:
neighb_level_df

Unnamed: 0_level_0,Sum Negative Tweet Sentiment,Sum Positive Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets
Community Area Mentioned,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
albany park,-5.6676,9.7646,16,18,34
archer heights,0.0000,0.5574,0,1,1
ashburn,-10.6303,5.5188,28,15,43
auburn gresham,-5.9133,1.1233,17,2,19
austin,-14.1287,18.5554,34,36,70
avalon park,0.0000,0.0000,0,0,0
avondale,-7.2062,7.6270,18,15,33
belmont cragin,0.0000,0.0000,0,0,0
beverly,-35.3824,174.6288,115,344,459
bridgeport,-16.1919,44.0589,36,87,123


<h4>Computing weighted and evenly weighted reputation scores per neighborhood</h4>

<u>Evenly Weighted Reputation Score:</u> The average is computed with equal weights, assuming that a positive tweet is equal in influence to a negative tweet (apart from the direction of opinion). This score is stored below in the 'Unweighted Reputation Score' column.

<u>Weighted Reputation Score:</u> Humans have a famous bias for the negative - "even when of equal intensity, things of a more negative nature (e.g. unpleasant thoughts, emotions, or social interactions; harmful/traumatic events) have a greater effect on one's psychological state and processes than do neutral or positive things" (https://en.wikipedia.org/wiki/Negativity_bias). 

To model this negativity bias, I use a ratio from an HBR article (Zenger, Jack, and Joseph Folkman. 2013. “The Ideal Praise-to-Criticism Ratio.” Harvard Business Review. March 15. https://hbr.org/2013/03/the-ideal-praise-to-criticism), which suggests that the ideal praise-to-criticism ratio in the context of team performance assessments is 5 positive comments for every negative one. Based on this idea, the weighted reputation score below gives a neighborhood's average negative sentiment five times the weight of its average positive sentiment, as if each negative tweet counted five times.
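
The two scores described above can be written out as small functions. This is a sketch for illustration only: the function names and the `neg_weight` parameter are mine, and the input values are the average negative and positive sentiment reported for Near North Side in the results table further down.

```python
def evenly_weighted_score(avg_neg, avg_pos):
    """Plain mean of the average negative and average positive sentiment."""
    return (avg_neg + avg_pos) / 2


def negativity_weighted_score(avg_neg, avg_pos, neg_weight=5):
    """Weighted mean in which negative sentiment counts neg_weight times."""
    return (neg_weight * avg_neg + avg_pos) / (neg_weight + 1)


# Average sentiments for one neighborhood (Near North Side in the table below).
avg_neg, avg_pos = -0.802, 0.491075
print(evenly_weighted_score(avg_neg, avg_pos))
print(negativity_weighted_score(avg_neg, avg_pos))
```

Because the negative average carries five sixths of the weight, the weighted score is pulled strongly toward the negative average even when positive tweets outnumber negative ones.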

In [51]:
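# These are Series indexed by community area, so the arithmetic below is applied
# to all neighborhoods at once; no explicit loop is needed.
avg_neg = sum_neg_tweet_by_neighb / total_tweet_by_neighb_neg_num
avg_pos = sum_pos_tweet_by_neighb / total_tweet_by_neighb_pos_num
reputation_index_1 = (avg_neg + avg_pos) / 2        # evenly weighted
reputation_index_2 = ((5 * avg_neg) + avg_pos) / 6  # negative sentiment weighted 5x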
    
neighb_level_df['Average Negative Sentiment'] = avg_neg
neighb_level_df['Average Positive Sentiment'] = avg_pos
neighb_level_df['Unweighted Reputation Score'] = reputation_index_1
neighb_level_df['Weighted Reputation Score'] = reputation_index_2

In [52]:
neighb_level_df.fillna(0, inplace=True)
neighb_level_df.sort_values(by='Unweighted Reputation Score')

Unnamed: 0_level_0,Sum Negative Tweet Sentiment,Sum Positive Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets,Average Negative Sentiment,Average Positive Sentiment,Unweighted Reputation Score,Weighted Reputation Score
Community Area Mentioned,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
near north side,-0.8020,7.8572,1,16,17,-0.802000,0.491075,-0.155463,-0.586487
oakland,-18.7194,6.1594,32,22,54,-0.584981,0.279973,-0.152504,-0.440822
chicago lawn,-7.2072,5.9058,12,19,31,-0.600600,0.310832,-0.144884,-0.448695
riverdale,-134.1097,8.7716,210,20,230,-0.638618,0.438580,-0.100019,-0.459085
mount greenwood,-15.6672,6.5758,25,14,39,-0.626688,0.469700,-0.078494,-0.443957
west garfield park,-5.0693,4.6487,7,8,15,-0.724186,0.581088,-0.071549,-0.506640
burnside,-3.4787,1.1148,5,2,7,-0.695740,0.557400,-0.069170,-0.486883
hyde park,-79.1596,124.0035,122,239,361,-0.648849,0.518843,-0.065003,-0.454234
chatham,-72.7402,14.6224,114,27,141,-0.638072,0.541570,-0.048251,-0.441465
new city,-68.2218,106.1651,120,224,344,-0.568515,0.473951,-0.047282,-0.394771


In [54]:
neighb_level_df.shape
neighb_level_df.to_csv('df_neighb_level.csv')

<h3>Putting socioeconomic and Twitter data per neighborhood into one table</h3>

In [55]:
df_neighb_level = pd.read_csv('df_neighb_level.csv')
df_SES = pd.read_csv('Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv')
df_SES = df_SES.dropna()  # do this so that the last row is dropped, which totals for the whole city

In [56]:
print(df_neighb_level.shape)
print(df_SES.shape)
# the neighborhood-level df is missing some community areas; I assume this is because they were not mentioned in any tweets

(72, 10)
(77, 9)


In [57]:
# prepare for merging the two dataframes on df_SES['COMMUNITY AREA NAME'] and df_neighb_level['Community Area Mentioned']
df_SES = df_SES.sort_values(by='COMMUNITY AREA NAME')  # not required for the merge, but keeps both tables in the same order
df_SES['COMMUNITY AREA NAME'] = df_SES['COMMUNITY AREA NAME'].str.lower()  # ensure both are in lowercase
df_neighb_level = df_neighb_level.rename(columns={'Community Area Mentioned': 'COMMUNITY AREA NAME'})  # ensure both columns are named the same

In [58]:
print(df_neighb_level.columns)
print(df_SES.columns)

Index(['COMMUNITY AREA NAME', 'Sum Negative Tweet Sentiment',
       'Sum Positive Tweet Sentiment', 'Total Number of Negative Tweets',
       'Total Number of Positive Tweets',
       'Total Number of Positive and Negative Tweets',
       'Average Negative Sentiment', 'Average Positive Sentiment',
       'Unweighted Reputation Score', 'Weighted Reputation Score'],
      dtype='object')
Index(['Community Area Number', 'COMMUNITY AREA NAME',
       'PERCENT OF HOUSING CROWDED', 'PERCENT HOUSEHOLDS BELOW POVERTY',
       'PERCENT AGED 16+ UNEMPLOYED',
       'PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA',
       'PERCENT AGED UNDER 18 OR OVER 64', 'PER CAPITA INCOME ',
       'HARDSHIP INDEX'],
      dtype='object')


In [59]:
df_merged = df_SES.merge(df_neighb_level, 
                         how='left', 
                         on='COMMUNITY AREA NAME')
df_merged = df_merged.sort_values(by='Community Area Number')
df_merged = df_merged.fillna('NA')
df_merged

Unnamed: 0,Community Area Number,COMMUNITY AREA NAME,PERCENT OF HOUSING CROWDED,PERCENT HOUSEHOLDS BELOW POVERTY,PERCENT AGED 16+ UNEMPLOYED,PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA,PERCENT AGED UNDER 18 OR OVER 64,PER CAPITA INCOME,HARDSHIP INDEX,Sum Negative Tweet Sentiment,Sum Positive Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets,Average Negative Sentiment,Average Positive Sentiment,Unweighted Reputation Score,Weighted Reputation Score
60,1.0,rogers park,7.7,23.6,8.7,18.2,27.5,23939,39.0,-8.3172,92.845,25,168,193,-0.332688,0.552649,0.10998,-0.185132
74,2.0,west ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0,-4.8305,14.478,7,24,31,-0.690071,0.60325,-0.0434107,-0.474518
66,3.0,uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0,-57.6967,95.481,147,236,383,-0.392495,0.404581,0.00604298,-0.259649
39,4.0,lincoln square,3.4,10.9,8.2,13.4,25.5,37524,17.0,-2.3563,88.356,7,157,164,-0.336614,0.562777,0.113081,-0.186716
51,5.0,north center,0.3,7.5,5.2,4.5,26.2,57123,6.0,-10.4607,18.5448,21,40,61,-0.498129,0.46362,-0.0172543,-0.337837
37,6.0,lake view,1.1,11.4,4.7,2.6,17.0,60058,5.0,-6.675,16.2563,15,28,43,-0.445,0.580582,0.0677911,-0.27407
38,7.0,lincoln park,0.8,12.3,5.1,3.6,21.5,71551,2.0,-94.7576,310.297,182,582,764,-0.520646,0.533156,0.0062551,-0.345012
47,8.0,near north side,1.9,12.9,7.0,2.5,22.6,88669,1.0,-0.802,7.8572,1,16,17,-0.802,0.491075,-0.155463,-0.586487
22,9.0,edison park,1.1,3.3,6.5,7.4,35.3,40959,8.0,-0.5958,8.7805,2,17,19,-0.2979,0.5165,0.1093,-0.162167
54,10.0,norwood park,2.0,5.4,9.0,11.5,39.5,32875,21.0,-7.7546,4.1417,21,7,28,-0.369267,0.591671,0.111202,-0.20911


In [60]:
# show me the neighborhoods that are not mentioned even once among the tweets downloaded for this round of analysis
df_merged[df_merged['Unweighted Reputation Score'] == 'NA']

Unnamed: 0,Community Area Number,COMMUNITY AREA NAME,PERCENT OF HOUSING CROWDED,PERCENT HOUSEHOLDS BELOW POVERTY,PERCENT AGED 16+ UNEMPLOYED,PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA,PERCENT AGED UNDER 18 OR OVER 64,PER CAPITA INCOME,HARDSHIP INDEX,Sum Negative Tweet Sentiment,Sum Positive Tweet Sentiment,Total Number of Negative Tweets,Total Number of Positive Tweets,Total Number of Positive and Negative Tweets,Average Negative Sentiment,Average Positive Sentiment,Unweighted Reputation Score,Weighted Reputation Score
64,30.0,south lawndale,15.2,30.7,15.8,54.8,33.8,10402,96.0,,,,,,,,,
2,34.0,armour square,5.7,40.1,16.7,34.5,38.3,16148,82.0,,,,,,,,,
13,48.0,calumet heights,2.1,11.5,20.0,11.0,44.0,28887,38.0,,,,,,,,,
69,62.0,west elsdon,11.1,15.6,16.7,37.0,37.7,15754,69.0,,,,,,,,,
67,73.0,washington height,1.1,16.9,20.8,13.7,42.6,19713,48.0,,,,,,,,,


In [61]:
df_merged.shape

(77, 18)

In [62]:
df_merged.to_csv('df_merged.csv')

# Develop descriptive statistics

In [63]:
import numpy as np
import matplotlib.pyplot as plt

In [75]:
df_neighb_level = pd.read_csv('df_merged.csv')
del df_neighb_level['Unnamed: 0']
#df_neighb_level
df_tweet_level = pd.read_csv('df_wCompound.csv')  # this still contains the neutral tweets; they are removed in the next cell
del df_tweet_level['Unnamed: 0']
del df_tweet_level['Unnamed: 0.1']
#df_tweet_level

In [76]:
# remove all tweets that score neutral in terms of compound sentiment score
df_tweet_level = df_tweet_level[df_tweet_level['Compound Sentiment Score'] != 0]
#df_tweet_level[df_tweet_level['Compound Sentiment Score'] == 0]

In [77]:
print(df_tweet_level.columns)
print(df_neighb_level.columns)

Index(['Community Area Mentioned', 'Tweet Content', 'Tweet Coordinates',
       'Tweet Day and Time', 'Tweet Number of Times Liked',
       'Tweet Number of Times Retweeted', 'User Location',
       'User Number of Followers', 'User Screen Name', 'Sentiment',
       'Compound Sentiment Score'],
      dtype='object')
Index(['Community Area Number', 'COMMUNITY AREA NAME',
       'PERCENT OF HOUSING CROWDED', 'PERCENT HOUSEHOLDS BELOW POVERTY',
       'PERCENT AGED 16+ UNEMPLOYED',
       'PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA',
       'PERCENT AGED UNDER 18 OR OVER 64', 'PER CAPITA INCOME ',
       'HARDSHIP INDEX', 'Sum Negative Tweet Sentiment',
       'Sum Positive Tweet Sentiment', 'Total Number of Negative Tweets',
       'Total Number of Positive Tweets',
       'Total Number of Positive and Negative Tweets',
       'Average Negative Sentiment', 'Average Positive Sentiment',
       'Unweighted Reputation Score', 'Weighted Reputation Score'],
      dtype='object')


In [80]:
# show basic descriptive statistics per neighborhood and per variable
a = df_tweet_level.groupby('Community Area Mentioned')
# can also use a.mean() and a.std() to examine just mean or standard deviation

b = df_neighb_level.groupby('COMMUNITY AREA NAME')
# b.mean() and b.std()

a.describe()
b.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Compound Sentiment Score,Tweet Number of Times Liked,Tweet Number of Times Retweeted,User Number of Followers
Community Area Mentioned,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
albany park,count,34.000000,34.000000,34.000000,34.000000
albany park,mean,0.120500,0.352941,0.058824,661.647059
albany park,std,0.474977,0.645842,0.342997,873.805605
albany park,min,-0.735100,0.000000,0.000000,30.000000
albany park,25%,-0.340000,0.000000,0.000000,157.000000
albany park,50%,0.283200,0.000000,0.000000,266.500000
albany park,75%,0.560600,0.750000,0.000000,999.000000
albany park,max,0.803400,2.000000,2.000000,4512.000000
archer heights,count,1.000000,1.000000,1.000000,1.000000
archer heights,mean,0.557400,1.000000,0.000000,46.000000


# Develop inferential statistics

# Conclusions