# Capstone Project 2: Twitter Public Opinion Network Analysis for Bungie

This project's goal is to collect and analyze the available Twitter data for Bungie and their main game, Destiny 2.

The data are tweets about Bungie and Destiny 2, plus the timelines (the tweets made by an account) of the official Bungie and Destiny 2 accounts.

This notebook is the machine learning analysis for the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
import math
from datetime import datetime
import seaborn as sns
sns.set()

%load_ext watermark
%watermark

2019-04-18T10:16:31

CPython 2.7.15
IPython 5.8.0

compiler   : MSC v.1500 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
CPU cores  : 4
interpreter: 64bit


In [2]:
# These are the datetime columns
# Pandas can convert them all as-is
datetime_cols = ['created_at', 'quoted_status.created_at', 'retweeted_status.created_at']

In [3]:
destiny_2_tweet_min_df = pd.read_csv("../data/destiny_2_tweets_min.csv",
                                     encoding='utf-8', index_col=0, parse_dates=datetime_cols)
bungie_tweet_min_df = pd.read_csv("../data/bungie_tweets_min.csv",
                                  encoding='utf-8', index_col=0, parse_dates=datetime_cols)
bungie_timeline_min_df = pd.read_csv("../data/bungie_timeline_min.csv",
                                     encoding='utf-8', index_col=0, parse_dates=datetime_cols)
destiny_2_timeline_min_df = pd.read_csv("../data/destiny_2_timeline_min.csv",
                                        encoding='utf-8', index_col=0, parse_dates=datetime_cols)

## Models

#### Natural Language Processing

This should give the tweets a sentiment rating, which can serve as a generated feature for other models.

I debated between the NLTK and spaCy packages for sentiment analysis and picked NLTK, since it has a sentiment package.

https://www.nltk.org/api/nltk.sentiment.html?highlight=sentiment#module-nltk.sentiment

https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/

Take note of TF-IDF? It might be good for topic generation.

In [11]:
import nltk
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\sebia\AppData\Roaming\nltk_data...


True

The VADER sentiment analysis tools come from a paper (linked below). They were made for social media analysis and can handle lots of unusual text.

Documentation: https://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.vader

Paper: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

I want to try the functions polarity_scores, score_valence, and sentiment_valence.

In [13]:
vader = SentimentIntensityAnalyzer()

In [17]:
destiny_2_tweet_min_df.loc[0, 'text']

u"RT @vinnydafreshest: Alright. I guess I'll watch what Bungie has to say about Destiny 2 on Twitch."

Wow. That is really easy to use. Let's do that for all of the tweets when we get the data ready for machine learning.

In [18]:
vader.polarity_scores(destiny_2_tweet_min_df.loc[0, 'text'])

{'compound': 0.25, 'neg': 0.0, 'neu': 0.875, 'pos': 0.125}

values() does not return the scores in the order the dictionary is displayed in (the display appears to be sorted by key).

In [38]:
vader.polarity_scores(destiny_2_tweet_min_df.loc[0, 'text']).values()

[0.0, 0.875, 0.125, 0.25]

In [45]:
vader.polarity_scores(destiny_2_tweet_min_df.loc[516, 'text'])

{'compound': -0.1779, 'neg': 0.171, 'neu': 0.683, 'pos': 0.147}

Okay. values() puts them in the order neg, neu, pos, compound.

In [46]:
vader.polarity_scores(destiny_2_tweet_min_df.loc[516, 'text']).values()

[0.171, 0.683, 0.147, -0.1779]
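
Since the order of values() depends on how the dictionary happens to be stored, a safer sketch is to read the scores by key. The helper name ordered_scores and the hard-coded sample dictionary are illustrative assumptions, not part of NLTK; the sample just copies the polarity_scores output for tweet 516.

```python
# Hypothetical helper: read VADER scores by key instead of relying on dict ordering
SCORE_KEYS = ('neg', 'neu', 'pos', 'compound')

def ordered_scores(scores, keys=SCORE_KEYS):
    """Return the score values as a list, in the order given by keys."""
    return [scores[key] for key in keys]

# Sample copied from the polarity_scores output for tweet 516
sample = {'compound': -0.1779, 'neg': 0.171, 'neu': 0.683, 'pos': 0.147}
print(ordered_scores(sample))
```

Reading by key gives the same column order no matter which Python version builds the dictionary.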

#### Linear Regression

Try to estimate retweet counts, then see which features are important. I won't attempt to predict favorites, since I think they are a less accurate metric of engagement. Retweeting more directly shows a tweet to others, and is thus more important in terms of social media reach.

Predict on follower count, inverse / exponential decay of time since last official tweet, sentiment rating, and possibly tweet posting frequency per user.

Let's set up the data. I will need to redo some steps from the exploratory data analysis, mostly the time_since_last_official_tweet loop.

In [21]:
normal_tweet_df = pd.concat([destiny_2_tweet_min_df, bungie_tweet_min_df]).reset_index(drop=True)
official_tweet_df = pd.concat([bungie_timeline_min_df, destiny_2_timeline_min_df]).reset_index(drop=True)
print(len(normal_tweet_df))

46168


Remove the official accounts' tweets from the normal tweet dataframe.

In [22]:
for user_id in official_tweet_df['user.id'].unique():
    normal_tweet_df = normal_tweet_df[normal_tweet_df['user.id'] != user_id]
print(len(normal_tweet_df))

46076


In [23]:
def set_time_since_official_tweet(original_df, official_df, time_col, new_col_name):
    """
    This function makes a column for the original dataframe
    This column will show the time since the most recent tweet in the official dataframe
    """
    
    # Make sure the original and official dfs are time-ordered
    original_df = original_df.sort_values(time_col)
    official_df = official_df.sort_values(time_col)
        
    # Loop
    for og_idx, tweet in original_df.iterrows():
        
        # There might be some tweets in the original df that are older than the oldest in the official df
        # In that case, just set the time diff to be its posting time - the oldest time in official_df
        # (look the time up by column name rather than by column position)
        if (tweet[time_col] < official_df[time_col].iloc[0]):
            original_df.loc[og_idx, new_col_name] = tweet[time_col] - official_df[time_col].iloc[0]
            continue
        
        # This variable will hold the time difference between this tweet and an official tweet
        # Start from the most recent official tweet; the loop below always overwrites this,
        # because this tweet is no older than the oldest official tweet
        min_time_since_official_tweet = tweet[time_col] - official_df[time_col].iloc[-1]
        
        # Find the closest tweet in official_df with a time less than this one
        for off_idx, official_tweet in official_df.iterrows():
            
            # Check if we've found an official tweet posted after the tweet in question
            if (official_tweet[time_col] > tweet[time_col]):                
                # The tweet before this official one is what we want
                break
                
            # Only set this if we didn't overstep
            min_time_since_official_tweet = tweet[time_col] - official_tweet[time_col]
            
        # Set the time difference
        original_df.loc[og_idx, new_col_name] = min_time_since_official_tweet
            
    return original_df

In [24]:
%%time
normal_tweet_df = set_time_since_official_tweet(normal_tweet_df,
                                                official_tweet_df,
                                                'created_at',
                                                'time_since_official_tweet')

Wall time: 8min 56s
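
The nested iterrows loops took about nine minutes. One possible speed-up, sketched here with the standard library's bisect module on plain lists of datetimes, is to binary-search the sorted official times for each tweet. The function name and the sample times are assumptions for illustration. The sketch is meant to mirror the rule in set_time_since_official_tweet (latest official tweet at or before the tweet, falling back to the oldest official tweet), and I have not timed it on the real data.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def time_since_latest_official(tweet_times, official_times):
    """For each tweet time, return the time since the latest official tweet
    posted at or before it. Tweets older than every official tweet get a
    negative difference relative to the oldest official tweet."""
    official_sorted = sorted(official_times)
    diffs = []
    for tweet_time in tweet_times:
        # Number of official tweets posted at or before this tweet
        pos = bisect_right(official_sorted, tweet_time)
        if pos > 0:
            reference = official_sorted[pos - 1]
        else:
            reference = official_sorted[0]
        diffs.append(tweet_time - reference)
    return diffs

# Made-up sample times, only to show the behaviour
official = [datetime(2019, 3, 2, 12, 0), datetime(2019, 3, 1, 12, 0)]
tweets = [datetime(2019, 3, 1, 10, 0),   # before any official tweet
          datetime(2019, 3, 1, 13, 0),   # one hour after the first
          datetime(2019, 3, 3, 12, 0)]   # one day after the second
print(time_since_latest_official(tweets, official))
```

With the real dataframes, the inputs would be the created_at columns converted to lists, and the result could be assigned back as a new column.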


Columns I want to keep

I'm debating whether favorite count should be in there. I don't think so, since it is another metric of engagement: I shouldn't predict with it, I should be predicting it.

Consider another time col? Will see how well the time since an official tweet works (as linear, as inverse, and as exponential decay).

In [81]:
x_cols = ['user.followers_count', 'user.friends_count', 'user.statuses_count',
          'user.verified', 'is_quote_status', 'text', 'time_since_official_tweet']

In [82]:
y = normal_tweet_df['retweet_count']

In [83]:
# Copy so that later column assignments don't write into a slice of normal_tweet_df
X = normal_tweet_df[x_cols].copy()

In [84]:
X.head()

| | user.followers_count | user.friends_count | user.statuses_count | user.verified | is_quote_status | text | time_since_official_tweet |
|---|---|---|---|---|---|---|---|
| 21438 | 648.0 | 664.0 | 50382.0 | False | 0.0 | RT @MassivelyOP: The MMO Bookclub devotes Marc... | -1 days +18:11:31 |
| 21437 | 1724.0 | 500.0 | 120258.0 | False | 0.0 | RT @forsungwoon_: 190303 My Moment Fansign\r\r... | -1 days +18:18:37 |
| 21436 | 213109.0 | 1072.0 | 17796.0 | False | 0.0 | What did you guys think of the new Destiny 2 V... | -1 days +18:50:10 |
| 21435 | 10.0 | 18.0 | 3686.0 | False | 0.0 | RT @MrRoflWaffles: What did you guys think of ... | -1 days +18:50:35 |
| 21434 | 123.0 | 317.0 | 64673.0 | False | 0.0 | RT @forsungwoon_: 190303 My Moment Fansign\r\r... | -1 days +18:51:48 |


Change time_since_official_tweet to be in whole minutes.

In [85]:
X.loc[:, 'time_since_official_tweet'] = X['time_since_official_tweet'].astype('timedelta64[m]')
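
How astype('timedelta64[m]') behaves can differ between pandas versions, so here is a hedged alternative that goes through total_seconds() instead. The sample values are made up, and the floor is my assumption, chosen so the result is whole minutes rounded down.

```python
import numpy as np
import pandas as pd

def timedelta_to_minutes(deltas):
    """Convert a timedelta Series to whole minutes as floats (rounded down)."""
    return np.floor(deltas.dt.total_seconds() / 60.0)

# Made-up sample: -5h 48m 29s (a tweet older than the oldest official tweet) and +90s
sample_deltas = pd.Series(pd.to_timedelta([-20909, 90], unit='s'))
sample_minutes = timedelta_to_minutes(sample_deltas)
print(sample_minutes.tolist())
```

Dropping the floor keeps fractional minutes, which may be preferable for regression.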

Get the sentiment values from the tweet text.

In [86]:
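def get_vader_sentiment_scores(df, text_col, new_col_names):
    """
    For a dataframe, get the sentiment scores for the text_col using VADER
    Assign results to the new columns with new_col_names
    new_col_names must be in the order neg, neu, pos, compound
    """
    assert len(new_col_names) == 4
    
    # Read the scores by key so the column order never depends on dict ordering
    score_keys = ['neg', 'neu', 'pos', 'compound']
    
    # Set up new columns
    for col_name in new_col_names:
        df.loc[:, col_name] = 0.0
    
    # Score every text in one pass; far faster than assigning row by row with iterrows()
    all_scores = [vader.polarity_scores(text) for text in df[text_col]]
    df.loc[:, new_col_names] = [[scores[key] for key in score_keys] for scores in all_scores]
    #for idx, row in df.iterrows():
        #df.loc[idx, new_col_names] = vader.polarity_scores(row[text_col]).values()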
    
    return df

The version with the for loop and iterrows took roughly an hour and a half.

The version with the list comprehension took 8 seconds.

In [87]:
%%time
X = get_vader_sentiment_scores(X, 'text', ['neg_score', 'neu_score', 'pos_score', 'comp_score'])

Wall time: 8.26 s


In [88]:
X.head()

| | user.followers_count | user.friends_count | user.statuses_count | user.verified | is_quote_status | text | time_since_official_tweet | neg_score | neu_score | pos_score | comp_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21438 | 648.0 | 664.0 | 50382.0 | False | 0.0 | RT @MassivelyOP: The MMO Bookclub devotes Marc... | -349.0 | 0.0 | 0.86 | 0.14 | 0.3818 |
| 21437 | 1724.0 | 500.0 | 120258.0 | False | 0.0 | RT @forsungwoon_: 190303 My Moment Fansign\r\r... | -342.0 | 0.0 | 0.764 | 0.236 | 0.7574 |
| 21436 | 213109.0 | 1072.0 | 17796.0 | False | 0.0 | What did you guys think of the new Destiny 2 V... | -310.0 | 0.0 | 0.89 | 0.11 | 0.3527 |
| 21435 | 10.0 | 18.0 | 3686.0 | False | 0.0 | RT @MrRoflWaffles: What did you guys think of ... | -310.0 | 0.0 | 0.895 | 0.105 | 0.3527 |
| 21434 | 123.0 | 317.0 | 64673.0 | False | 0.0 | RT @forsungwoon_: 190303 My Moment Fansign\r\r... | -309.0 | 0.0 | 0.764 | 0.236 | 0.7574 |


Sanity checks for the sentiment scores.

The negative scores seem mostly accurate. "FIGHT!!" is not really a negative tweet, since this is a game with violence, and "No problem" should be neutral.

In [93]:
X.sort_values('neg_score', ascending=False).head(5)['text']

29327                                     @Bungie Cancer
39449                                    @Bungie FIGHT!!
24587                  @LuCKyy_and_BW @Bungie STOP LYING
7521     @TheRadicalJawa No wonder Destiny 2 sucked ass.
41881                  @NikTheGamerDK @Bungie No problem
Name: text, dtype: object

The positive scores look accurate

In [94]:
X.sort_values('pos_score', ascending=False).head(5)['text']

43357    @DestinyTheGame @Bungie Yes yes yes yes YES!
13358                            god i love destiny 2
44910               Brilliant https://t.co/wOPuzbMUQe
36698                                 @Bungie Perfect
9315                   Destiny 2 is pretty wonderful!
Name: text, dtype: object

The most neutral scores seem accurate. These tweets are mostly full of proper nouns. If I had more time, I'd do some named entity recognition to filter those out.

In [99]:
X.sort_values('neu_score', ascending=False)['text'].head(5).values

array([ u'RT @DestinyTheGame: Season of the Drifter is underway and the latest Bungie ViDoc outlines what to expect all season long.\r\r\r\n\r\r\r\n\U0001f4a0 https://t.co/\u2026',
       u'RT @chibbyranjo: More Destiny 2 right now! Say hi! https://t.co/JAMDkkgXPi',
       u'@Bungie Full Reaper set, just need Titan mark for full invader set. https://t.co/abtL1X081f',
       u'Finally got my Legend of Acrius upgraded in Destiny 2.',
       u'@quickhakker @eurogamer Not really. Most have evolved into similar shapes. And bungie designed the ps4 controller.'], dtype=object)

Scale the data. The time since the last official tweet needs care, since some values are negative (tweets older than the oldest official tweet).

Put the data into linear regression. Try the stock LR with the three options for time_since_official_tweet (linear, inverse, and exponential decay).

Hyperparameter tuning.
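
A rough sketch of the three candidate forms for the time feature, using only NumPy. Everything here is an assumption to be revisited: the function name, clipping negative times to zero, the 1 / (1 + t) form of the inverse, and the time constant tau (in minutes) for the exponential decay.

```python
import numpy as np

def time_feature_variants(minutes, tau=60.0):
    """Return linear, inverse, and exponential-decay versions of a time feature.

    minutes: time since the last official tweet, in minutes
    tau: hypothetical decay constant in minutes (a tuning choice, not a known value)
    """
    minutes = np.asarray(minutes, dtype=float)
    # Assumption: treat tweets older than the oldest official tweet as zero minutes
    clipped = np.clip(minutes, 0.0, None)
    return {
        'linear': clipped,
        'inverse': 1.0 / (1.0 + clipped),   # the +1 avoids dividing by zero
        'exp_decay': np.exp(-clipped / tau),
    }

# Made-up sample values, including one negative time
variants = time_feature_variants([-349.0, 0.0, 60.0])
for name, values in variants.items():
    print(name, values)
```

Each variant could then be swapped in as the time column before scaling and fitting, to compare how well the regression does with each.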

#### Classification Algorithm (Maybe)

See what kinds of users there are? Put in more data, and see what separates users with huge reach from the average, no-reach Twitter user.