# Sentiment Analysis on SB 827
California, particularly its coastal cities, has a severe housing crisis. Recently, an ambitious bill was put forward in the California State Senate that would have overridden local zoning laws to permit the construction of dense housing near frequent public transit stops. Unfortunately, the bill died in committee before any real polling could be done to see whether the broader community wanted it. I thought I'd run some sentiment analysis on tweets tagged #SB827 to gauge support for the bill. Overall, 73% of the tweets in the set seemed to be supportive of the bill, with 72% of tweets from the West Coast and 75% of tweets from elsewhere expressing support. It seems likely that there is enough broad support that a similar bill could succeed in the future.

In [2]:
#import relevant packages
import tweepy
import sys
import jsonpickle
import os
import json
import numpy as np
import math
import pandas as pd
import textblob
import csv

## Let's Go Grab a Bunch of Messy Data
The goal of this section is to use Tweepy to grab a bunch of the tweets with the #SB827 hashtag from the last month.

In [4]:
#add all of the authentication information
#also I should try to figure out how to make this non-legible later
consumer_key = ""
consumer_secret = ''
access_token = ""
access_token_secret = ''
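
One common way to keep the keys out of the notebook is to read them from environment variables. Here is a minimal sketch of that idea. The four variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are my own inventions for illustration, not anything Tweepy requires.

```python
import os

# Hypothetical environment variable names; any names work as long as they
# match what is exported in the shell before the notebook is started.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=None):
    """Return a dict of credentials, raising KeyError if any variable is missing."""
    if environ is None:
        environ = os.environ
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

With something like this in place, the four assignments above could become `creds = load_credentials()`, and the values would be passed to Tweepy as `creds['consumer_key']` and so on.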

In [34]:
#Create authentication object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#Set access info
auth.set_access_token(access_token, access_token_secret)
#Make API
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

In [31]:
#Pull some tweets that hashtag SB 827

housing_query = "#SB827"
language = "en" #assumption: this variable was never defined in the original cell
housing_tweets = api.search(q=housing_query, lang=language, count=100)

for tweet in housing_tweets:
    print tweet.user.screen_name, "Tweeted:", tweet.text

mtsw Tweeted: RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
Sightline Tweeted: RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
CaseyJGiven Tweeted: RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
thomas_lord Tweeted: @IDoTheThinking @suldrew @daguilarcanabal In the words of the great shootist Kim Carsons[*], TYT.  Take Your Time.… https://t.co/fnkJS77g7u
hamilt0n Tweeted: RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
ZaxxonGalaxian Tweeted: RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
CarlMuhlstein Tweeted: RT @dillonliam:

Alright, so that technique can't actually get us all that many tweets. Let's try a solution from Bhaskar Karambelkar's blog to collect the tweets more efficiently.

In [40]:
maxTweets = 50000 #No particular reason for this
tweetsperQ = 100 #maximum possible tweets per query
fName = 'SB827tweets.txt'
sinceId = None #Alter this later if you need a specific starting point
max_id = -1L

tweetCount = 0
print ("Downloading max {} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount<maxTweets:
        try:
            if(max_id<=0):
                if(not sinceId):
                    new_tweets = api.search(q=housing_query, count=tweetsperQ)
                else:
                    new_tweets = api.search(q=housing_query, count=tweetsperQ, since_id=sinceId)
            else:
                if(not sinceId):
                    new_tweets = api.search(q=housing_query, count=tweetsperQ, max_id=str(max_id-1))
                else:
                    new_tweets = api.search(q=housing_query, count=tweetsperQ, max_id=str(max_id-1), since_id=sinceId)
            if(not new_tweets):
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')
            tweetCount += len(new_tweets)
            print("Downloaded {} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            print("Oh dear: "+str(e))
            break
print("Downloaded {} tweets".format(tweetCount))

Downloading max 50000 tweets
Downloaded 100 tweets
Downloaded 200 tweets
Downloaded 300 tweets
Downloaded 400 tweets
Downloaded 500 tweets
Downloaded 600 tweets
Downloaded 700 tweets
Downloaded 800 tweets
Downloaded 900 tweets
Downloaded 1000 tweets
Downloaded 1100 tweets
Downloaded 1200 tweets
Downloaded 1300 tweets
Downloaded 1400 tweets
Downloaded 1500 tweets
Downloaded 1600 tweets
Downloaded 1700 tweets
Downloaded 1800 tweets
Downloaded 1900 tweets
Downloaded 2000 tweets
Downloaded 2100 tweets
Downloaded 2200 tweets
Downloaded 2278 tweets
Downloaded 2378 tweets
Downloaded 2478 tweets
Downloaded 2578 tweets
Downloaded 2678 tweets
Downloaded 2778 tweets
Downloaded 2878 tweets
Downloaded 2978 tweets
Downloaded 3078 tweets
Downloaded 3178 tweets
Downloaded 3278 tweets
Downloaded 3378 tweets
No more tweets found
Downloaded 3378 tweets
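
The paging loop above boils down to one idea: remember the ID of the oldest tweet seen so far and ask only for tweets older than that. Here is a stripped-down Python 3 sketch of that cursor logic. It runs against a fake search function that I made up (`make_fake_search`) rather than the real Twitter API, so the dicts with an `id` key are stand-ins for tweets.

```python
def collect_pages(search, page_size, max_items):
    """Walk backwards through results using a max_id cursor."""
    collected = []
    max_id = None  # None means "start from the newest item"
    while len(collected) < max_items:
        page = search(count=page_size, max_id=max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # Ask for strictly older items next time so nothing is fetched twice.
        max_id = page[-1]["id"] - 1
    return collected


def make_fake_search(ids):
    """Build a stand-in search function over a fixed set of ids, newest first."""
    ordered = sorted(ids, reverse=True)

    def search(count, max_id=None):
        eligible = [i for i in ordered if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return search


fake_search = make_fake_search(range(1, 251))
tweets = collect_pages(fake_search, page_size=100, max_items=50000)
print(len(tweets))
```

Like the loop above, this checks the limit only between pages, so it can overshoot `max_items` by up to one page.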


## Explore the tweets
The goal of this section is to load the saved tweets, figure out which fields are likely to be worth keeping around, and then move the tweets into a dataframe and start looking at how they were posted.

In [9]:
#Load in a single tweet (the handle from the download cell was closed, so reopen the file)
with open(fName, 'r') as f:
    SenateBillParsed = json.loads(f.readline())

In [18]:
#Examine the first level keys within that tweet
SenateBillParsed

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Fri Apr 27 21:28:03 +0000 2018',
 u'entities': {u'hashtags': [{u'indices': [16, 22], u'text': u'SB827'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': [{u'id': 46818898,
    u'id_str': u'46818898',
    u'indices': [3, 14],
    u'name': u'Nolan Gray \U0001f3d7\U0001f310',
    u'screen_name': u'mnolangray'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'id': 989979514246782982,
 u'id_str': u'989979514246782982',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'is_quote_status': False,
 u'lang': u'en',
 u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
 u'place': None,
 u'retweet_count': 27,
 u'retweeted': False,
 u'retweeted_status': {u'contributors': None,
  u'coordinates': None,
  u'created_at': u'Thu Apr 26 17:12:32 +0000 2018',
  u'entities': {u'hash

In [28]:
#print some of the information that might be particularly interesting to keep around
print SenateBillParsed['user']['screen_name']
print SenateBillParsed['user']['time_zone']
print SenateBillParsed['text']
print SenateBillParsed['created_at']
print SenateBillParsed['user']['followers_count']
print SenateBillParsed['user']['location']

colorfulharp233
Eastern Time (US & Canada)
RT @mnolangray: #SB827's defeat was a bummer. But there are dozens of other major YIMBY initiatives underway all over the country. My lates…
Fri Apr 27 21:28:03 +0000 2018
162



In [36]:
#Look at the sub-keys under the user key to see if anything else might be helpful
SenateBillParsed['user'].keys()

[u'follow_request_sent',
 u'has_extended_profile',
 u'profile_use_background_image',
 u'id',
 u'default_profile',
 u'verified',
 u'profile_text_color',
 u'profile_image_url_https',
 u'profile_sidebar_fill_color',
 u'is_translator',
 u'geo_enabled',
 u'entities',
 u'followers_count',
 u'profile_sidebar_border_color',
 u'location',
 u'default_profile_image',
 u'id_str',
 u'is_translation_enabled',
 u'utc_offset',
 u'statuses_count',
 u'description',
 u'friends_count',
 u'profile_link_color',
 u'profile_image_url',
 u'notifications',
 u'profile_background_image_url_https',
 u'profile_background_color',
 u'profile_banner_url',
 u'profile_background_image_url',
 u'name',
 u'lang',
 u'profile_background_tile',
 u'favourites_count',
 u'screen_name',
 u'url',
 u'created_at',
 u'contributors_enabled',
 u'time_zone',
 u'protected',
 u'translator_type',
 u'following',
 u'listed_count']

In [77]:
columnslist = ['screen_name','time_zone','followers_count','location','text']
SBdf = pd.DataFrame(columns = columnslist)
fName = 'SB827tweets.txt'
with open(fName, 'r') as f: #read-only access is enough, and the file gets closed afterwards
    for line in f:
        new_tweet = json.loads(line)
        d = {'screen_name':[new_tweet['user']['screen_name']],
            'time_zone':[new_tweet['user']['time_zone']],
            'followers_count':[new_tweet['user']['followers_count']],
            'location':[new_tweet['user']['location']],
            'text':[new_tweet['text']]}
        new_tweet_df = pd.DataFrame(data=d)
        SBdf = SBdf.append(new_tweet_df, ignore_index=True)
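
Appending to a dataframe one row at a time copies the whole frame on every pass. A lighter alternative is to flatten each tweet into a plain dict first and build the dataframe once at the end. The sketch below shows only the flattening step, using the standard library. `flatten_tweet` and `USER_FIELDS` are names I made up, and the sample tweet is invented.

```python
import json

# The user-level fields kept in the dataframe above.
USER_FIELDS = ["screen_name", "time_zone", "followers_count", "location"]


def flatten_tweet(line):
    """Turn one JSON line from the tweet file into a flat dict of the kept fields."""
    tweet = json.loads(line)
    user = tweet.get("user", {})
    row = {field: user.get(field) for field in USER_FIELDS}
    row["text"] = tweet.get("text")
    return row


# Invented example line, shaped like the records in SB827tweets.txt.
sample_line = json.dumps({
    "text": "Example tweet about #SB827",
    "user": {"screen_name": "example_user", "time_zone": None,
             "followers_count": 10, "location": "Oakland, CA"},
})
rows = [flatten_tweet(sample_line)]
print(rows[0])
```

A list of such dicts can then be handed to `pd.DataFrame(rows)` in a single call.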

In [78]:
SBdf.size

16890

In [79]:
SBdf.columns

Index([u'followers_count', u'location', u'screen_name', u'text', u'time_zone'], dtype='object')

In [80]:
SBdf.time_zone.unique() #Examining the range of time zones that the tweets come from.

array([None, u'Eastern Time (US & Canada)', u'Pacific Time (US & Canada)',
       u'Central Time (US & Canada)', u'Alaska', u'Arizona',
       u'Atlantic Time (Canada)', u'Adelaide', u'America/New_York',
       u'Quito', u'Tijuana', u'Tehran', u'America/Los_Angeles',
       u'Amsterdam', u'Europe/London', u'Mountain Time (US & Canada)',
       u'Hawaii', u'London', u'Paris', u'Stockholm', u'Berlin', u'Beijing',
       u'Helsinki', u'Bern', u'Chennai', u'Dublin', u'Vilnius',
       u'Pretoria', u'Azores', u'Tokyo', u'International Date Line West',
       u'Central America', u'Kabul', u'Cairo', u'Madrid', u'PDT',
       u'Brasilia', u'Casablanca', u'Perth', u'Karachi', u'Brisbane',
       u'Santiago', u'Rome', u'Midway Island', u'Georgetown', u'Hong Kong',
       u'Lima', u'Sydney', u'America/Chicago'], dtype=object)

In [81]:
SBdf.location.unique() #Examine the locations associated with the accounts

array([u'Beverly Hills, CA', u'', u'San Francisco', u'Arlington, VA',
       u'Free Imperial City of San Francisco', u'San Francisco, CA',
       u'Athens, Georgia', u'Beverly Hills, Stockholm', u'Brooklyn, NY',
       u'Cascadia', u'Washington, D.C.', u'Oakland, Calif.', u'.',
       u'Los Angeles', u'San Diego, CA', u'DC & CA', u'Empire State',
       u'East Bay', u'Los Angeles, CA', u'San Jose, California',
       u'San Diego', u'Washington D.C.', u'Westwood Park, San Francisco',
       u'San Francisco, CA USA', u'USA, USA', u'California',
       u'Los Osos, CA', u'Berkeley, CA', u'West of Oz', u'Charlotte',
       u'New York, NY', u'SFO', u'Oakland, California', u'Oakland, CA',
       u'Lower Haight, San Francisco', u'California, USA',
       u'San Francisco via Adelaide', u'Santa Monica, CA',
       u'Redding, Calif.', u'Washington, DC', u'Utah',
       u'State College, PA', u'Vancouver', u'DC', u'Sacramento, CA',
       u'\xdcT: 41.991823,-70.716601', u'Oakland & California',
   

Save the dataframe before progressing.

In [83]:
dffile = 'SB827df.pkl'
SBdf.to_pickle(dffile)

## Label The Tweets

In order to start classifying the tweets, it is first necessary to label a bunch of them as being in favor of, in opposition to, or neutral toward the bill. This labeling process assumes that a retweet of a known supporter or opponent carries the position of the person being retweeted, even when the wording of the tweet itself is neutral.
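
To make the retweet assumption concrete, here is a small sketch of how a label could be suggested from the account being retweeted. This is not what I actually ran (I labeled by hand). The helper names and the `known_stances` table are hypothetical, and deciding who counts as a known supporter or opponent remains a judgment call.

```python
import re

# Matches the "RT @user:" prefix that retweets in this data set start with.
RETWEET_PREFIX = re.compile(r"^RT @(\w+):")


def retweeted_author(text):
    """Return the screen name being retweeted, or None for an original tweet."""
    match = RETWEET_PREFIX.match(text)
    return match.group(1) if match else None


def suggest_label(text, known_stances):
    """Suggest 1 (support) or 0 (oppose) based on the retweeted author, else None."""
    author = retweeted_author(text)
    if author is None:
        return None
    return known_stances.get(author.lower())


# Hypothetical lookup table, keyed by lower-cased screen name.
# Scott Wiener authored the bill, so his retweets are treated as supportive.
known_stances = {"scott_wiener": 1}

print(suggest_label("RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 allows ...", known_stances))
```

Anything that comes back as `None` would still need a human to read it.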

In [183]:
#randomize the order of the tweets
SBdf = SBdf.sample(frac=1).reset_index(drop=True)

In [3]:
dffile = 'SB827df_2.pkl'
SBdf = pd.read_pickle(dffile)

In [4]:
print SBdf.shape
num_tweets = SBdf.shape[0]
Approves = np.zeros((num_tweets,1))

(3378, 5)


In [188]:
#Split off a block of tweets to label
Train = SBdf.text[0:num_tweets*3//10-1]

In [189]:
def label_maker(tweetlist):
    tweetlist = tweetlist.reset_index()
    tot_tweet = tweetlist.shape[0]
    concur = np.zeros((tot_tweet,1))
    for i in range(0,tot_tweet):
        print tweetlist.loc[i].text
        concur[i]=input("Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral")
    return concur

In [201]:
#Label a chunk of the tweets. Initially I intended to label all of the tweets in this set,
#but it turns out that labeling hundreds of tweets is, in fact, time consuming.
#So I settled for only labeling 300 tweets.

Train_approve = np.zeros([Train.shape[0],1])
Train_approve[0:5]=label_maker(Train[0:5])

Absolutely. And for all the fighting in SF, the campaign to pass #SB827 was lost in #LosAngeles as I describe in my… https://t.co/JrHEyYJlnl
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral0
RT @Tracktwentynine: Is it sad that the most transformative housing bill in the country died tonight, the first time anything like it was p…
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 allows “unlimited luxury condo high rises.” That’s false (it’s 4-5 story buildings) &amp; an odd c…
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
#Cities, if you were really upset about what #SB827 was going to do to your local community, now is your chance. Pa… https://t.co/D54J0f9Rcd
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
Housing Is A Human Right &amp; many other housing justice and tenants groups successfully fought to defeat the pro-gent… https://t.co/8N7gItB41o
Does th

In [345]:
np.save('Train_approve.npy',Train_approve) #Saved the labels

In [231]:
#Save the tweets in the order that's associated with the labels
dffile = 'SB827df_2.pkl'
SBdf.to_pickle(dffile)

In [240]:
Train_approve[295:300]=label_maker(Train[295:300])

RT @dillonliam: As GOP Sen. @TedGaines speaks in support, it’s also remarkable that there appears there could be more Republican support fo…
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
RT @dillonliam: It’s officially dead #SB827. Four votes in favor.
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
After co-author Nancy Skinner speaks in favor of #sb827 Sen McGuire, chair of the next committee to which the bill… https://t.co/PdtDEbSJYx
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral2
#SB827 😢😥😓
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1
RT @drvox: #SB827 has died in committee because everyone sucks &amp; everyone prefers the horrific, unsustainable, unjust status quo to a solut…
Does this tweet approve of SB827? 0 for no, 1 for yes, 2 for neutral1


## Train a Classifier

Now we're going to break the labeled tweets into Training, Cross Validation, and Test Sets in order to attempt to train some classifiers on them.

In [None]:
dffile = 'SB827df_2.pkl'
SBdf = pd.read_pickle(dffile) #Load dataframe created in prior section

In [5]:
Train_approve = np.load('Train_approve.npy') #Load labels created in prior section

In [6]:
SBdf

Unnamed: 0,followers_count,location,screen_name,text,time_zone
0,2989,"San Francisco, CA",beyondchron,"Absolutely. And for all the fighting in SF, th...",Pacific Time (US & Canada)
1,2356,"Salt Lake City, Utah",MRC_SLC,RT @Tracktwentynine: Is it sad that the most t...,Mountain Time (US & Canada)
2,1197,"San Francisco, CA",marcus_ismael,RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 a...,Pacific Time (US & Canada)
3,510,"Los Angeles, CA",Asmarcd,"#Cities, if you were really upset about what #...",Pacific Time (US & Canada)
4,661,"California, USA",HousingHumanRt,Housing Is A Human Right &amp; many other hous...,
5,797,"San Diego, CA",msmayarosas,RT @cayimby: Profound thanks to Sen. @Scott_Wi...,
6,1201,"South LA, Boyle Hts, & beyond",sahrasulaiman,RT @dillonliam: It’s officially dead #SB827. F...,Arizona
7,3797,San Francisco,sutrofog,RT @mikevladimer: @MarkLeno @AaronPeskin I gre...,Pacific Time (US & Canada)
8,1677,sutherland Shire,donaldh66287394,RT @AlexSteffen: The single most powerful solu...,
9,911,"Washington, DC",vanessabcalder,"""The larger issue is that too many Democrats h...",


In [34]:
#Split text of labeled tweets into training, cross validation, and test sets
Train = SBdf.text[0:200]
CrossVal = SBdf.text[200:260]
Test = SBdf.text[260:300]
Other = SBdf.text[300:SBdf.shape[0]]

In [39]:
#Split labels into training, cross validation, and test sets
Train_Label = Train_approve[0:200]
CrossVal_Label = Train_approve[200:260]
Test_Label = Train_approve[260:300]

In [40]:
#First let's clean up the text further and get rid of special characters
def clean_up_text(df):
    df = df.str.replace(r"http\S+", "")
    df = df.str.replace(r"http", "")
    df = df.str.replace(r"@\S+", "")
    df = df.str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ")
    df = df.str.replace(r"@", "at")
    df = df.str.lower()
    return df
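
To see what the cleaning does to a single tweet, here is the same sequence of substitutions applied to one plain string with the standard `re` module. `clean_one_tweet` is a name I made up for this sketch, and the example tweet is invented.

```python
import re


def clean_one_tweet(text):
    """Apply the same substitutions as clean_up_text to a single string."""
    text = re.sub(r"http\S+", "", text)   # drop links
    text = re.sub(r"http", "", text)      # drop any leftover bare "http"
    text = re.sub(r"@\S+", "", text)      # drop @mentions
    text = re.sub(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", text)  # other symbols become spaces
    text = re.sub(r"@", "at", text)       # a lone "@" becomes the word "at"
    return text.lower()


example = "RT @Scott_Wiener: #SB827 is 4-5 stories! https://t.co/abc"
print(clean_one_tweet(example).split())
```

Note that the hashtag keeps its text but loses the `#`, and the mention disappears along with its trailing colon.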

In [41]:
def format_for_textblob(Tweets, Labels, fname):
    Tweets = Tweets.reset_index(drop=True)
    Tweets = clean_up_text(Tweets)
    tot_tweets = Tweets.shape[0]
    tot_label = Labels.shape[0]
    if tot_tweets==tot_label:
        with open(fname,'wb') as csvf: #binary mode, as the Python 2 csv module expects
            tweetwriter = csv.writer(csvf)
            for i in range(tot_tweets):
                if Labels[i]==1:
                    label = 'pos'
                else:
                    label = 'neg'
                tweet = Tweets.loc[i]
                new_tweet = str(tweet.encode('utf-8')) #this will deal with any remaining weird characters
                tweetwriter.writerow([new_tweet, label])
        print ("Saved file")
    else:
        print "Bad News: Lists weren't of the same size"

In [42]:
format_for_textblob(Train, Train_Label, 'train_with_labels.csv')

Saved file


In [43]:
format_for_textblob(CrossVal, CrossVal_Label, 'crossval_with_labels.csv')

Saved file


A Naive Bayes classifier will be trained on the labeled data and used to predict labels for the rest of the tweets. This type of classifier is really basic and easy to interpret. It assumes no dependence between the features (in this case, words) and tries to predict the probability that a tweet is in the positive or negative class based purely on the frequency of the words it contains. Its simplicity and interpretability make it a good first model. The implementation through textblob conveniently requires minimal preprocessing.
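
For intuition, here is a tiny from-scratch version of the idea in Python 3. It is a simplified sketch and not the textblob implementation: it only scores the words that are present in a tweet, it uses add-one smoothing, and the four training sentences are invented.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = {}      # label -> Counter of how many documents contain each word
    vocabulary = set()
    for text, label in examples:
        words = set(text.lower().split())
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary |= words
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    words = set(text.lower().split()) & vocabulary   # ignore words never seen in training
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)
        for word in words:
            # Add-one smoothing keeps a word unseen in one class from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
toy_examples = [
    ("great bill build more housing", "pos"),
    ("sad the bill died", "pos"),
    ("against this bill", "neg"),
    ("glad the bill was defeated", "neg"),
]
toy_model = train_naive_bayes(toy_examples)
print(classify("build housing", toy_model))
```

Because each word contributes independently, a single word that happens to show up mostly in one class can swing the result, which is worth keeping in mind with only 200 training tweets.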

In [44]:
from textblob.classifiers import NaiveBayesClassifier

In [45]:
with open('train_with_labels.csv','r') as fp:
    cl = NaiveBayesClassifier(fp, format='csv')

In [46]:
with open('crossval_with_labels.csv','r') as cross:
    print cl.accuracy(cross, format ='csv')

0.733333333333


In [47]:
cl.show_informative_features(20)

Most Informative Features
           contains(one) = True              neg : pos    =      6.6 : 1.0
       contains(against) = True              neg : pos    =      5.8 : 1.0
         contains(about) = True              pos : neg    =      5.4 : 1.0
         contains(voted) = True              pos : neg    =      5.0 : 1.0
           contains(sen) = True              neg : pos    =      5.0 : 1.0
          contains(says) = True              neg : pos    =      4.5 : 1.0
           contains(san) = True              neg : pos    =      4.5 : 1.0
      contains(building) = True              neg : pos    =      4.5 : 1.0
     contains(francisco) = True              neg : pos    =      4.5 : 1.0
         contains(bills) = True              neg : pos    =      4.5 : 1.0
           contains(pro) = True              neg : pos    =      4.5 : 1.0
           contains(out) = True              pos : neg    =      4.5 : 1.0
            contains(at) = True              neg : pos    =      4.2 : 1.0

In [50]:
def confusion_matrix_stats(Tweets, labels, cla):
    from textblob import TextBlob
    Tweets = Tweets.reset_index(drop=True)
    Tweets = clean_up_text(Tweets)
    tot_tweets = Tweets.shape[0]
    true_pos = 0
    false_pos = 0
    true_neg = 0
    false_neg = 0
    for i in range(tot_tweets):
        tweet = Tweets.loc[i]
        label = labels[i]
        new_tweet = str(tweet.encode('utf-8'))
        blob_tweet = TextBlob(new_tweet, classifier = cla)
        blob_label = blob_tweet.classify()
        if blob_label =='pos':
            if label == 1:
                true_pos += 1
            else:
                false_pos += 1
        elif label ==1:
            false_neg += 1
        else:
            true_neg += 1
    return (true_pos, false_pos, true_neg, false_neg)

In [51]:
(tp, fp, tn, fn) = confusion_matrix_stats(CrossVal, CrossVal_Label, cl)
print tp
print fp
print tn
print fn

36
13
8
3


So that's a modest first attempt. The classifier has okay accuracy and deals fairly well with how imbalanced the classes are, but based on the informative features list it overrates innocuous words. Additionally, because there are more false positives than false negatives, this classifier will likely overstate the amount of support for SB 827.
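
The four counts above can be turned into the usual summary rates. The Python 3 sketch below does only that arithmetic, and `summarize` is a helper name I made up. With these counts the classifier calls 49 of the 60 cross-validation tweets positive, while only 39 were labeled positive by hand, which is where the overstatement comes from.

```python
def summarize(true_pos, false_pos, true_neg, false_neg):
    """Compute summary rates from the four confusion matrix counts."""
    total = true_pos + false_pos + true_neg + false_neg
    return {
        "accuracy": (true_pos + true_neg) / total,
        "precision": true_pos / (true_pos + false_pos),
        "recall": true_pos / (true_pos + false_neg),
        "actual_positive_rate": (true_pos + false_neg) / total,
        "predicted_positive_rate": (true_pos + false_pos) / total,
    }


# Counts from the Naive Bayes run on the cross-validation set.
stats = summarize(true_pos=36, false_pos=13, true_neg=8, false_neg=3)
for name, value in sorted(stats.items()):
    print("{}: {:.3f}".format(name, value))
```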

I also tried some of the other classifiers within the TextBlob package (such as the Decision Tree Classifier below), but ultimately these didn't seem to perform any better.

In [52]:
from textblob.classifiers import DecisionTreeClassifier
with open('train_with_labels.csv','r') as fp:
    treecl = DecisionTreeClassifier(fp, format='csv')

In [54]:
with open('crossval_with_labels.csv','r') as cross:
    print treecl.accuracy(cross, format='csv')

0.683333333333


In [55]:
(tp, fp, tn, fn) = confusion_matrix_stats(CrossVal, CrossVal_Label, treecl)
print tp
print fp
print tn
print fn

34
14
7
5


With more time it could have been useful to use Word2Vec to get a semantic embedding of the tweets and then train a classifier (with the assistance of the Scikit-learn package) from there, but I stuck with the textblob Naive Bayes Classifier for the purposes of this project. After all, I'm just trying to get a vague gauge of the popularity of a dead senate bill. This isn't a model that will be pushed out to a broader use case.

## Classify and Analyze

Now that there's a trained classifier, it will be used to classify the other tweets, followed by some very basic analysis of the levels of support for the bill.

In [56]:
def classify_other_tweets(Tweets, cla):
    from textblob import TextBlob
    Tweets = Tweets.reset_index(drop=True)
    tot_tweets = Tweets.shape[0]
    labels = np.zeros([tot_tweets,1])
    for i in range(tot_tweets):
        tweet = Tweets.loc[i]
        new_tweet = str(tweet.encode('utf-8'))
        blob_tweet = TextBlob(new_tweet, classifier = cla)
        blob_label = blob_tweet.classify()
        if blob_label =='pos':
            num_label = 1
        else:
            num_label = 0
        labels[i]=num_label
    return labels

In [57]:
Other_clean=clean_up_text(Other)
Other_label = classify_other_tweets(Other_clean,cl)

In [58]:
print Train_Label.shape
print CrossVal_Label.shape
print Test_Label.shape
print Other_label.shape
Total_label = np.append(Train_Label,CrossVal_Label,axis=0)
Total_label = np.append(Total_label,Test_Label, axis = 0)
Total_label = np.append(Total_label,Other_label, axis=0)

(200, 1)
(60, 1)
(40, 1)
(3078, 1)


In [59]:
print Total_label.shape

(3378, 1)


In [60]:
np.save('Full_Tweet_Labels.npy',Total_label)

In [61]:
SBdf['label']=Total_label

In [62]:
SBdf

Unnamed: 0,followers_count,location,screen_name,text,time_zone,label
0,2989,"San Francisco, CA",beyondchron,"Absolutely. And for all the fighting in SF, th...",Pacific Time (US & Canada),0.0
1,2356,"Salt Lake City, Utah",MRC_SLC,RT @Tracktwentynine: Is it sad that the most t...,Mountain Time (US & Canada),1.0
2,1197,"San Francisco, CA",marcus_ismael,RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 a...,Pacific Time (US & Canada),1.0
3,510,"Los Angeles, CA",Asmarcd,"#Cities, if you were really upset about what #...",Pacific Time (US & Canada),1.0
4,661,"California, USA",HousingHumanRt,Housing Is A Human Right &amp; many other hous...,,0.0
5,797,"San Diego, CA",msmayarosas,RT @cayimby: Profound thanks to Sen. @Scott_Wi...,,1.0
6,1201,"South LA, Boyle Hts, & beyond",sahrasulaiman,RT @dillonliam: It’s officially dead #SB827. F...,Arizona,2.0
7,3797,San Francisco,sutrofog,RT @mikevladimer: @MarkLeno @AaronPeskin I gre...,Pacific Time (US & Canada),1.0
8,1677,sutherland Shire,donaldh66287394,RT @AlexSteffen: The single most powerful solu...,,1.0
9,911,"Washington, DC",vanessabcalder,"""The larger issue is that too many Democrats h...",,1.0


In [63]:
dffile = 'SB827df_with_labels.pkl'
SBdf.to_pickle(dffile)

In [2]:
dffile = 'SB827df_with_labels.pkl'
SBdf = pd.read_pickle(dffile)

In [64]:
SBdf.location.unique()

array([u'San Francisco, CA', u'Salt Lake City, Utah', u'Los Angeles, CA',
       u'California, USA', u'San Diego, CA',
       u'South LA, Boyle Hts, & beyond', u'San Francisco',
       u'sutherland Shire', u'Washington, DC', u'Sacramento, CA',
       u'Berkeley, CA', u'BUF \u27a1\ufe0f LA', u'Oakland, CA',
       u'The Space Force Core', u'from Cleveland, in Oakland', u'Atlanta',
       u'LA LA Land', u'Vilnius', u'Bandis.Stockholm.se',
       u'\xdcT: 34.041042,-118.191894', u'', u'Brooklyn, NY',
       u'L.A.via WashHts/HarlemNYC', u'San Francisco, California',
       u'Mile High, CO', u'Silicon Valley | SF | 39.5K Ft',
       u'Oakland, Calif.', u'upstate New York', u'NW LDN',
       u'Los Angeles- Boston ', u'Minnesota, USA', u'Santa Monica, CA',
       u'California', u'Seattle, WA', u'Phoenix, AZ', u'New Brunswick, NJ',
       u'Olathe, KS', u'Southeast U.S.', u'Austin, TX', u'mourning',
       u'DC by way of Upstate NY', u'Leimert Park, Los Angeles, CA',
       u'Chicago, IL', u'

There's a mixture of Californian locals and people from far-flung regions of the globe in this dataframe. My suspicion would be that the non-locals are overwhelmingly in support of this measure (otherwise why would they bother tracking it) whereas the Californians may be more divided. So I'm going to try to sort this out.

In [65]:
def west_coast_california(df): #pretends being on the west coast is equivalent to being Californian
    Time_Zones = df.time_zone
    People = df.shape[0]
    Californian = np.zeros([People,1])
    for i in range(People):
        if Time_Zones.loc[i]=="Pacific Time (US & Canada)":
            Californian[i]=1
        else:
            Californian[i]=0
    return Californian        

The `west_coast_california` function is pretty rudimentary. It could be improved by first checking whether the location contains words that indicate a place in California, and falling back to the time zone only if no location is provided.
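
A rough sketch of that improvement might look like the following. I did not run this for the analysis. The list of hints is hypothetical and deliberately short, so it will miss plenty of real Californian locations (for example "South LA, Boyle Hts, & beyond") and may wrongly match others.

```python
import re

# Hypothetical, deliberately short list of hints; a real list would need many more places.
CALIFORNIA_HINTS = re.compile(
    r"\b(california|calif|ca|san francisco|los angeles|oakland|berkeley|"
    r"san diego|san jose|sacramento|santa monica)\b",
    re.IGNORECASE,
)


def looks_californian(location, time_zone):
    """Prefer the free-text location; fall back to the time zone when it is empty."""
    if location and location.strip():
        return bool(CALIFORNIA_HINTS.search(location))
    return time_zone == "Pacific Time (US & Canada)"


print(looks_californian("San Francisco, CA", "Pacific Time (US & Canada)"))
print(looks_californian("Salt Lake City, Utah", "Mountain Time (US & Canada)"))
print(looks_californian("", "Pacific Time (US & Canada)"))
```

On the dataframe it could be applied row by row, for instance with `SBdf.apply(lambda row: looks_californian(row.location, row.time_zone), axis=1)`.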

In [66]:
californian_array = west_coast_california(SBdf)

In [67]:
SBdf['californian']=californian_array

In [68]:
SBdf

Unnamed: 0,followers_count,location,screen_name,text,time_zone,label,californian
0,2989,"San Francisco, CA",beyondchron,"Absolutely. And for all the fighting in SF, th...",Pacific Time (US & Canada),0.0,1.0
1,2356,"Salt Lake City, Utah",MRC_SLC,RT @Tracktwentynine: Is it sad that the most t...,Mountain Time (US & Canada),1.0,0.0
2,1197,"San Francisco, CA",marcus_ismael,RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 a...,Pacific Time (US & Canada),1.0,1.0
3,510,"Los Angeles, CA",Asmarcd,"#Cities, if you were really upset about what #...",Pacific Time (US & Canada),1.0,1.0
4,661,"California, USA",HousingHumanRt,Housing Is A Human Right &amp; many other hous...,,0.0,0.0
5,797,"San Diego, CA",msmayarosas,RT @cayimby: Profound thanks to Sen. @Scott_Wi...,,1.0,0.0
6,1201,"South LA, Boyle Hts, & beyond",sahrasulaiman,RT @dillonliam: It’s officially dead #SB827. F...,Arizona,2.0,0.0
7,3797,San Francisco,sutrofog,RT @mikevladimer: @MarkLeno @AaronPeskin I gre...,Pacific Time (US & Canada),1.0,1.0
8,1677,sutherland Shire,donaldh66287394,RT @AlexSteffen: The single most powerful solu...,,1.0,0.0
9,911,"Washington, DC",vanessabcalder,"""The larger issue is that too many Democrats h...",,1.0,0.0


In [69]:
sum(SBdf.californian)

1750.0

In [70]:
sum(SBdf.label)

2532.0

In [71]:
dffile = 'SB827df_with_californians.pkl'
SBdf.to_pickle(dffile)

In [2]:
dffile = 'SB827df_with_californians.pkl'
SBdf = pd.read_pickle(dffile)

In [72]:
SBdf

Unnamed: 0,followers_count,location,screen_name,text,time_zone,label,californian
0,2989,"San Francisco, CA",beyondchron,"Absolutely. And for all the fighting in SF, th...",Pacific Time (US & Canada),0.0,1.0
1,2356,"Salt Lake City, Utah",MRC_SLC,RT @Tracktwentynine: Is it sad that the most t...,Mountain Time (US & Canada),1.0,0.0
2,1197,"San Francisco, CA",marcus_ismael,RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 a...,Pacific Time (US & Canada),1.0,1.0
3,510,"Los Angeles, CA",Asmarcd,"#Cities, if you were really upset about what #...",Pacific Time (US & Canada),1.0,1.0
4,661,"California, USA",HousingHumanRt,Housing Is A Human Right &amp; many other hous...,,0.0,0.0
5,797,"San Diego, CA",msmayarosas,RT @cayimby: Profound thanks to Sen. @Scott_Wi...,,1.0,0.0
6,1201,"South LA, Boyle Hts, & beyond",sahrasulaiman,RT @dillonliam: It’s officially dead #SB827. F...,Arizona,2.0,0.0
7,3797,San Francisco,sutrofog,RT @mikevladimer: @MarkLeno @AaronPeskin I gre...,Pacific Time (US & Canada),1.0,1.0
8,1677,sutherland Shire,donaldh66287394,RT @AlexSteffen: The single most powerful solu...,,1.0,0.0
9,911,"Washington, DC",vanessabcalder,"""The larger issue is that too many Democrats h...",,1.0,0.0


In [73]:
#recategorize neutral as non-support
SBdf['label']=SBdf.label.replace(2.0,0.0)

In [74]:
SBdf

Unnamed: 0,followers_count,location,screen_name,text,time_zone,label,californian
0,2989,"San Francisco, CA",beyondchron,"Absolutely. And for all the fighting in SF, th...",Pacific Time (US & Canada),0.0,1.0
1,2356,"Salt Lake City, Utah",MRC_SLC,RT @Tracktwentynine: Is it sad that the most t...,Mountain Time (US & Canada),1.0,0.0
2,1197,"San Francisco, CA",marcus_ismael,RT @Scott_Wiener: 2/4 Jane Kim claims #SB827 a...,Pacific Time (US & Canada),1.0,1.0
3,510,"Los Angeles, CA",Asmarcd,"#Cities, if you were really upset about what #...",Pacific Time (US & Canada),1.0,1.0
4,661,"California, USA",HousingHumanRt,Housing Is A Human Right &amp; many other hous...,,0.0,0.0
5,797,"San Diego, CA",msmayarosas,RT @cayimby: Profound thanks to Sen. @Scott_Wi...,,1.0,0.0
6,1201,"South LA, Boyle Hts, & beyond",sahrasulaiman,RT @dillonliam: It’s officially dead #SB827. F...,Arizona,0.0,0.0
7,3797,San Francisco,sutrofog,RT @mikevladimer: @MarkLeno @AaronPeskin I gre...,Pacific Time (US & Canada),1.0,1.0
8,1677,sutherland Shire,donaldh66287394,RT @AlexSteffen: The single most powerful solu...,,1.0,0.0
9,911,"Washington, DC",vanessabcalder,"""The larger issue is that too many Democrats h...",,1.0,0.0


In [75]:
total_support = sum(SBdf.label)/SBdf.shape[0]
total_support

0.73357015985790408

In [76]:
Californian_support = sum(SBdf.label * SBdf.californian)/sum(SBdf.californian)
Californian_support

0.71828571428571431

In [77]:
Non_Californian_support = sum(SBdf.label * (1-SBdf.californian))/sum(1-SBdf.californian)
Non_Californian_support

0.75

In [78]:
total_support*.8 #this may give us some rough sense of where the true support might be with a better classifier

0.58685612788632324

## Conclusions

SB 827 seemed to have broad support on Twitter, with 73% of tweets supporting the bill. This likely overestimates its offline support, both because of the false positives produced by the simple classifier and because the demographics of Twitter skew toward the young, well-educated urbanites who stand to benefit the most from its implementation. Nonetheless, it does suggest that there is significant support for this type of policy and that we may see similar bills emerge in the not-too-distant future.