The first step in this project was to gather Twitter data and quantify the sentiment of each tweet. The Twitter API only allowed me to retrieve a maximum of 7 days of history, which wasn't enough data, so the tweets had to be scraped instead. Luckily, while searching around I found a package called Get Old Tweets that helped me do this. I also used another package called TextBlob to quantify the sentiment of each tweet.

In [1]:
import datetime
import pandas as pd
import requests
from blk import cfg
# https://github.com/Jefferson-Henrique/GetOldTweets-python
from get_old_tweets import got3 as got
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer
from requests_oauthlib import OAuth1
from pandas.io.json import json_normalize

### Create a function to quantify sentiment
Creating a custom sentiment classification algorithm would be a huge investment of time, so I decided to use a package that was ready out of the box. This package provides different sentiment metrics: polarity, classification, probability of positive sentiment, and probability of negative sentiment. After some initial exploration, the probability of positive sentiment seemed to be the most accurate metric.

In [2]:
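def sentiment(df):
    # Blobber builds the Naive Bayes analyzer once and reuses it for every tweet.
    tb = Blobber(analyzer=NaiveBayesAnalyzer())
    # Polarity comes from TextBlob's default analyzer.
    df['polarity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
    # Score each tweet once with Naive Bayes and reuse the result for all three columns.
    scores = [tb(x).sentiment for x in df['text']]
    df['classification'] = [s.classification for s in scores]
    df['p_pos'] = [s.p_pos for s in scores]
    df['p_neg'] = [s.p_neg for s in scores]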

### Create a function that pulls historical twitter data and computes sentiment
Once the data is scraped, it is stored in a pandas DataFrame and run through the sentiment function.

In [3]:
def query_hist(query, start_date, end_date):
    tweet_criteria = got.manager.TweetCriteria().setQuerySearch(query).setSince(start_date).setUntil(end_date)
    tweets = got.manager.TweetManager.getTweets(tweet_criteria)

    # Collect one record per tweet and build the DataFrame in a single call.
    # This also returns an empty DataFrame (instead of failing) when no tweets match,
    # and avoids row-by-row DataFrame.append, which was removed in pandas 2.0.
    fields = ['text', 'id', 'username', 'retweets', 'favorites',
              'mentions', 'hashtags', 'geo', 'permalink']
    rows = [{f: getattr(t, f) for f in fields} for t in tweets]
    # Index the rows by tweet timestamp.
    index = pd.Index([t.date for t in tweets], name='index')
    df = pd.DataFrame(rows, index=index, columns=fields)
    sentiment(df)

    return df

### Sample Twitter history data pull
Below is a sample query and its output. This isn't the exact data I used, as I used data for all of 2016.

In [4]:
query = '$eem'
start_date = '2017-01-03'
end_date = '2017-01-05'
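
Since the dates are passed as `YYYY-MM-DD` strings, a longer period such as a full year could be covered by generating a series of shorter windows and querying each one. The sketch below is only one possible way to do that. `month_windows` is a hypothetical helper, not part of the original notebook, and whether the end date of a window is included in the results depends on the scraper.

```python
import datetime


def month_windows(year):
    """Yield (start_date, end_date) strings covering each month of `year`.

    Each end date is the first day of the following month.
    """
    for month in range(1, 13):
        start = datetime.date(year, month, 1)
        # December rolls over to January of the next year.
        end = datetime.date(year + month // 12, month % 12 + 1, 1)
        yield start.isoformat(), end.isoformat()


windows = list(month_windows(2016))
print(windows[0])   # ('2016-01-01', '2016-02-01')
print(windows[-1])  # ('2016-12-01', '2017-01-01')
```

Each pair could then be passed as `start_date` and `end_date`, with the resulting DataFrames combined afterwards.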

In [5]:
df = query_hist(query, start_date, end_date)
df.head()

| index | permalink | mentions | retweets | geo | hashtags | text | username | favorites | id | polarity | classification | p_pos | p_neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-04 15:54:50 | https://twitter.com/ETFFunds/status/8167950507... | | 0 | | #ETF #ETFs #Stocks | Stock Market Strategies: Why You Should Buy Em... | ETFFunds | 1 | 816795050784919552 | 0.0 | pos | 0.619072 | 0.380928 |
| 2017-01-04 15:48:02 | https://twitter.com/1MinuteStock/status/816793... | | 0 | | #HOLD | Recommendation #HOLD for $ EEM with UB rating ... | 1MinuteStock | 0 | 816793339257376769 | 0.0 | neg | 0.428831 | 0.571169 |
| 2017-01-04 15:39:42 | https://twitter.com/moniology/status/816791245... | | 0 | | | Kind've liking Mexico here. Possible multiple-... | moniology | 0 | 816791245112614916 | 0.3 | pos | 0.6164 | 0.3836 |
| 2017-01-04 14:46:31 | https://twitter.com/DS_Investools/status/81677... | | 6 | | #Investools | 1/4/17 - View today's Market Forecast here: ht... | DS_Investools | 17 | 816777858655907841 | 0.0 | pos | 0.993135 | 0.006865 |
| 2017-01-04 14:41:29 | https://twitter.com/MikeZaccardi/status/816776... | | 2 | | | Everything up in 17. Just like 2016. Through 2... | MikeZaccardi | 3 | 816776594547179523 | 0.0 | neg | 0.451618 | 0.548382 |
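
In the sample rows above, `p_pos` and `p_neg` appear to be complementary, and `classification` appears to follow whichever probability is larger. The short check below confirms this for the five rows shown, using values copied from the output. `label_from_probability` is a hypothetical helper written for this illustration, not a TextBlob function.

```python
# (classification, p_pos, p_neg) copied from the sample output rows.
rows = [
    ('pos', 0.619072, 0.380928),
    ('neg', 0.428831, 0.571169),
    ('pos', 0.6164, 0.3836),
    ('pos', 0.993135, 0.006865),
    ('neg', 0.451618, 0.548382),
]


def label_from_probability(p_pos, threshold=0.5):
    """Return 'pos' when the positive probability exceeds the threshold, else 'neg'."""
    return 'pos' if p_pos > threshold else 'neg'


for classification, p_pos, p_neg in rows:
    # The two probabilities sum to 1 (up to rounding in the displayed values).
    assert abs(p_pos + p_neg - 1.0) < 1e-6
    # The label matches the side with the higher probability.
    assert label_from_probability(p_pos) == classification
```

If this holds in general, `p_pos` alone carries the information in all three Naive Bayes columns, which fits with using it as the main sentiment feature.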


### Twitter API data
I was also interested in seeing if I could add features not present in the scraped Twitter data. Using the Twitter API, we can pull features for each username, such as follower count, which can be used as a proxy for influence. I created a function that allowed me to conveniently query the Twitter API.

In [7]:
def username_lookup(un_list):
    url = 'https://api.twitter.com/1.1/users/lookup.json'
    auth = OAuth1(cfg.API_KEY, cfg.API_SECRET, cfg.ACCESS_TOKEN, cfg.ACCESS_TOKEN_SECRET)
    frames = []

    # users/lookup accepts at most 100 screen names per request, so query in batches.
    for start in range(0, len(un_list), 100):
        working_list = un_list[start:start + 100]
        # requests URL-encodes the comma-separated list of names.
        r = requests.get(url, params={'screen_name': ','.join(working_list)}, auth=auth)
        users = r.json()
        # A failed request returns an error object rather than a list of users; skip it.
        if isinstance(users, list):
            frames.extend(json_normalize(user) for user in users)

    # sort=True orders the columns alphabetically when users have different fields.
    return pd.concat(frames, sort=True) if frames else pd.DataFrame()
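
The batching step in the function above can be looked at in isolation. The sketch below only builds the request URLs, without sending anything, to show how a list of usernames is split into groups of at most 100 and joined into a URL-encoded, comma-separated `screen_name` parameter. `lookup_urls` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlencode


def lookup_urls(usernames, batch_size=100):
    """Build one users/lookup URL per batch of at most `batch_size` usernames."""
    base = 'https://api.twitter.com/1.1/users/lookup.json'
    urls = []
    for start in range(0, len(usernames), batch_size):
        batch = usernames[start:start + batch_size]
        # urlencode escapes the commas between names as %2C.
        urls.append(base + '?' + urlencode({'screen_name': ','.join(batch)}))
    return urls


names = ['user%d' % n for n in range(250)]
urls = lookup_urls(names)
print(len(urls))  # 3

print(lookup_urls(['ETFFunds', '1MinuteStock'])[0])
# https://api.twitter.com/1.1/users/lookup.json?screen_name=ETFFunds%2C1MinuteStock
```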

### Twitter API query
With the function above, we can pass a list of Twitter usernames and obtain all of the user features the API provides. I then took all unique usernames from the data scrape above and passed them to the username_lookup function.

In [8]:
un_list = df['username'].unique()
un_df = username_lookup(un_list)
un_df.head()

| | contributors_enabled | created_at | default_profile | default_profile_image | description | entities.description.urls | entities.url.urls | favourites_count | follow_request_sent | followers_count | ... | status.retweeted_status.truncated | status.source | status.text | status.truncated | statuses_count | time_zone | translator_type | url | utc_offset | verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | Thu Apr 24 08:33:51 +0000 2014 | False | False | New! Just Launched ! All About Exchange Traded... | [] | `[{'expanded_url': 'http://TopETFFunds.com', 'i...` | 3 | False | 356 | ... | | `<a href="http://12stocks.com/energy" rel="nofo...` | Resilient Global Growth Supports Equities.. ht... | False | 75040 | Atlantic Time (Canada) | none | http://t.co/95gnxY4OKK | -14400 | False |
| 0 | False | Tue Jan 27 03:39:41 +0000 2015 | False | False | Flash, 1 minute stock insights about trending ... | [] | `[{'expanded_url': 'http://unicornbay.com', 'in...` | 21 | False | 1969 | ... | | `<a href="https://unicornbay.com" rel="nofollow...` | Do you know that #Book Value for $TIF is $23.3... | False | 172121 | Pacific Time (US & Canada) | none | https://t.co/kUdxipjwKL | -28800 | False |
| 0 | False | Wed Sep 15 05:20:19 +0000 2010 | False | False | | [] | | 306 | False | 522 | ... | | `<a href="http://stocktwits.com" rel="nofollow"...` | Watch what happens when 100 breaks to the upsi... | False | 6557 | Pacific Time (US & Canada) | none | | -28800 | False |
| 0 | False | Sat Oct 30 18:07:33 +0000 2010 | False | False | I work at @Investools from TD Ameritrade Holdi... | `[{'expanded_url': 'http://bit.ly/1Ml2kKW', 'in...` | `[{'expanded_url': 'https://www.youtube.com/use...` | 8059 | False | 2967 | ... | False | `<a href="http://twitter.com/download/iphone" r...` | RT @benmillr: We're looking for somebody with ... | False | 7166 | Mountain Time (US & Canada) | none | https://t.co/qV1yLnks5T | -25200 | False |
| 0 | False | Thu Feb 26 22:35:08 +0000 2009 | False | False | MISO Power and Energy Trader. Chart Czar. Weat... | [] | `[{'expanded_url': 'http://www.seeitmarket.com/...` | 4434 | False | 2616 | ... | | `<a href="https://about.twitter.com/products/tw...` | @commoditywx whoa! The never-say-die blowtorch... | False | 55318 | Eastern Time (US & Canada) | none | http://t.co/RkGuTCu6Nl | -18000 | False |
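
The dotted column names in the output above, such as `status.text` and `entities.url.urls`, come from `json_normalize` flattening the nested JSON that the API returns for each user. The sketch below mimics that flattening with plain dictionaries to show where the names come from. `flatten` is a simplified stand-in written for this illustration, and the `user` record is made up rather than real API output.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as single values.
            flat[full_key] = value
    return flat


# Hypothetical user record, shaped loosely like a users/lookup entry.
user = {
    'screen_name': 'example_user',
    'followers_count': 123,
    'status': {'text': 'hello', 'entities': {'hashtags': []}},
}
print(flatten(user))
# {'screen_name': 'example_user', 'followers_count': 123, 'status.text': 'hello', 'status.entities.hashtags': []}
```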


### List of username features provided by Twitter API

In [11]:
for c in un_df.columns:
    print(c)

contributors_enabled
created_at
default_profile
default_profile_image
description
entities.description.urls
entities.url.urls
favourites_count
follow_request_sent
followers_count
following
friends_count
geo_enabled
has_extended_profile
id
id_str
is_translation_enabled
is_translator
lang
listed_count
location
name
notifications
profile_background_color
profile_background_image_url
profile_background_image_url_https
profile_background_tile
profile_banner_url
profile_image_url
profile_image_url_https
profile_link_color
profile_sidebar_border_color
profile_sidebar_fill_color
profile_text_color
profile_use_background_image
protected
screen_name
status.contributors
status.coordinates
status.created_at
status.entities.hashtags
status.entities.media
status.entities.symbols
status.entities.urls
status.entities.user_mentions
status.extended_entities.media
status.favorite_count
status.favorited
status.geo
status.id
status.id_str
status.in_reply_to_screen_name
status.in_reply_to_status_id
status.i