# Thinned Data Statistical Analysis

We started collecting about 20x fewer tweets in the middle of 06/25 in order to reduce file size and clumsiness. The original data only captures about 1% of Twitter traffic. We want to test and see if the reduced files are statistically different from the larger data set.

In order to do so, we have a few options for statistical tests. Two common tests that we want to run are:
1. **ANOVA Test** - Analysis of variance test. This test analyzes the difference between the means of more than two groups.
2. **Independent Samples t-test** - This test analyzes the difference between the population means of two groups.

The difference between tests exists in how we define a "group" of data. It makes sense to define two groups: full data vs. reduced data which suggests a t-test. However, we could create more groups for an ANOVA test by separating by day. Therefore, group 1 could be 06/24 full data, group 2 could be 06/26 reduced data, group 3 - 06/27 reduced data, etc.

We want to test a few dependent variables in order to really check for statistical difference. I propose the following set of dependent variables:
- ratio of tweets vs. (retweets, quotes, replies)
- ratio of retweets vs. (tweets, quotes, replies)
- ratio of quotes vs. (tweets, retweets, replies)
- ratio of replies vs. (tweets, retweets, quotes)
- ratio of common stop words vs. all other words
- ratio of selected keywords (Trump, Biden) vs. all other words ratio

Our thinned tweets do not include the tweet's timestamp. Therefore, I plan to split each day of data into data points by grouping by a certain size. Ideally I would like > 30 tweets in a chunk to assume normality by the Central Limit Theorem.

We can assume independent samples because our data will be examined from different days. Therefore, they will be different tweets. We also have a random sample due to how Twitter allows us to collect data.

For the t-test, our null and alternative hypotheses are as follows:
$$H_0 : \mu_1 = \mu_2$$
$$H_1 : \mu_1 \neq \mu_2$$
The null states that the population means are equal, and the alternative states that the population means are different.

## Imports

In [202]:
# These paths should be updated so that no files need to be downloaded...
import os
import sys
shared_path = '/Users/sarah/Downloads/TwitterResearch2020'
sys.path.append(shared_path)
import thinned_tweet_obj

In [134]:
import json
import pickle
import subprocess
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
"""
From general_utilities.py

Basic reading and writing of pickled objects or json objects to/from a file. These functions will not
check to see if the file exists (for reading) and will overwrite (when writing). 
"""
def read_pkl(fname):
    """
    Read a pickled object from a file with path name fname. Returns the object after closing the file.
    :param fname: Path name of the file containing the pickle
    :return: The object contained in the pickle file
    """
    fout = open(fname, 'rb')
    ret_o = pickle.load(fout)
    fout.close()
    return ret_o

## Normality investigation


In [36]:
# Data from June, 26 is the first full day of reduced tweets
tweets_0626 = read_pkl('combined_tweets-2020-06-26.pkl')
len(tweets_0626)

154000

In [60]:
# Data from June, 24 is the larger data set of tweets
tweets_0624 = read_pkl('/Users/sarah/Downloads/combined_tweets-2020-06-24.pkl')
len(tweets_0624)

4784443

In [114]:
def append_ratio(mean_tracker, numerator, denom):
    '''
    @param mean_tracker: List of means
    @param numerator: Numerator for the new mean to calculate
    @param denom: Denominator for the new mean to calculate
    '''
    mean_tracker.append(numerator / denom)
    
def get_tweet_type_distribution(obj_lst):
    '''
    Divide a day's worth of combined_tweets into `total_groups` and find the relevant ratios per group 
    to investigate the distribution of tweet types.
    
    @param obj_lst: List of combined, thinned tweet objects read from the pkl files
    @return: Three lists of length approximately 'total_groups' containing the population means for each 
    group. The first list contains the ratio of retweets, the second contains the ratio of replies, and 
    the third contains the ratio of quotes.
    '''
    lst_len = len(obj_lst)
    # One group per minute of the day (we might want to pick a smarter value than this)
    total_groups = 60 * 24
    group_size = lst_len // total_groups
    # 'xxx_mean_tracker' tracks the ratio of xxx vs. [(tweets, retweets, quotes, replies) - xxx]
    retweet_mean_tracker = []
    reply_mean_tracker = []
    quote_mean_tracker = []
    
    tweet_count = 0
    retweet_count = 0
    reply_count = 0
    quote_count = 0
    for thin_obj in obj_lst:
        if tweet_count == group_size:
            append_ratio(retweet_mean_tracker, retweet_count, tweet_count)
            append_ratio(reply_mean_tracker, reply_count, tweet_count)
            append_ratio(quote_mean_tracker, quote_count, tweet_count)
            
            tweet_count = 0
            retweet_count = 0
            reply_count = 0
            quote_count = 0
        
        tweet_count += 1
        if thin_obj.is_retweet: retweet_count += 1
        if thin_obj.is_reply(): reply_count += 1
        if thin_obj.quote_status: quote_count += 1
            
    return retweet_mean_tracker, reply_mean_tracker, quote_mean_tracker

In [128]:
def get_dist_stats(obj_lst):
        '''
    Print the percentage of retweets, replies, and quotes per one day of combined_tweets.
    
    @param obj_lst: List of combined, thinned tweet objects read from the pkl files
    @return: Three lists of length approximately 'total_groups' containing the population means for each 
    group. The first list contains the ratio of retweets, the second contains the ratio of replies, and 
    the third contains the ratio of quotes.
    '''
    retweet_means, reply_means, quote_means = get_tweet_type_distribution(obj_lst)
    print("percentage retweets: ", np.array(retweet_means).mean())
    print("percentage replies: ", np.array(reply_means).mean())
    print("percentage quotes: ", np.array(quote_means).mean())
    return retweet_means, reply_means, quote_means

In [129]:
print("----- Thinned data stats -----")
retweet_means_thin, reply_means_thin, quote_means_thin = get_dist_stats(tweets_0626)

print("----- Full data stats -----")
retweet_means_full, reply_means_full, quote_means_full = get_dist_stats(tweets_0624)

----- Thinned data stats -----
percentage retweets:  0.6245257029991164
percentage replies:  1.0
percentage quotes:  0.2569520245334997
----- Full data stats -----
percentage retweets:  0.6217230249515018
percentage replies:  1.0
percentage quotes:  0.25006124991638234


**Note:** There appears to be an error with the `is_reply` field.

In the future, we may want to consider different ways of constructing groups within a day.

## Perform t-test

In [137]:
# https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

def t_test(data_lst_thin, data_lst_full):
    '''
    Runs a t-test to compare the population means between the reduced and larger combined_tweet data.
    
    @param data_lst_thin: List of population means from the reduced tweet data.
    @param data_lst_full: List of population means from the larger tweet data.
    '''
    N = min(len(data_lst_thin), len(data_lst_full))
    data_lst_thin = np.array(data_lst_thin[:N])
    data_lst_full = np.array(data_lst_full[:N])
    var_thin = data_lst_thin.var(ddof=1)
    var_full = data_lst_full.var(ddof=1)
    st_dev = np.sqrt((var_thin + var_full) / 2)
    t_stat = (data_lst_thin.mean() - data_lst_full.mean()) / (st_dev * np.sqrt(2 / N))
    # Degrees of freedom
    df = 2 * N - 2
    # p-value after comparison with the Student t distribution
    p = 1 - stats.t.cdf(t_stat, df=df)
    
    print("t_stat = " + str(t_stat))
    # Reject the null if the p-value is < alpha (0.05)
    print("p_val = " + str(2 * p))

In [139]:
print("----- retweets t-test -----")
t_test(retweet_means_thin, retweet_means_full)
print("----- replies t-test -----")
t_test(quote_means_thin, reply_means_full)
print("----- quotes t-test -----")
t_test(quote_means_thin, quote_means_full)

----- retweets t-test -----
t_stat = 1.1778701790635677
p_val = 0.2389457525069174
----- replies t-test -----
t_stat = -419.45586433151715
p_val = 2.0
----- quotes t-test -----
t_stat = 3.5594451365156825
p_val = 0.0003776509718413923


We have to throw replies out of our analysis because of the error in the `is_reply` field. Otherwise, we can conclude that the ratio of retweets are not statistically different between the reduced and larger data. We do reject the null for the ratio of quotes, instead, concluding that the ratio of quotes *are* statistically different between the reduced and larger data.

Now, let's try re-running the t-test's with fewer subgroups.

In [251]:
def reduce_subgroups_2x(lst):
    lst_merged = []
    i = 0
    s = 0
    for val in lst:
        if i == 1:
            s += val
            lst_merged.append(s / 2)
            s = 0
            i = 0
        else:
            s += val
            i += 1
            
    return lst_merged

def reduce_subgroups(lst, factor):
    for i in range(factor):
        lst = reduce_subgroups_2x(lst)
        
    return lst

def run_t_test_reduced(factor):
    print("Number of groups:", len(reduce_subgroups(retweet_means_thin, factor)))
    print()
    print("----- retweets t-test -----")
    t_test(reduce_subgroups(retweet_means_thin, factor), reduce_subgroups(retweet_means_full, factor))
    print("----- replies t-test -----")
    t_test(reduce_subgroups(quote_means_thin, factor), reduce_subgroups(reply_means_full, factor))
    print("----- quotes t-test -----")
    t_test(reduce_subgroups(quote_means_thin, factor), reduce_subgroups(quote_means_full, factor))

In [259]:
run_t_test_reduced(2)
print()
run_t_test_reduced(3)
print()
run_t_test_reduced(5)

Number of groups: 363

----- retweets t-test -----
t_stat = 0.7022445391933957
p_val = 0.4827542601214765
----- replies t-test -----
t_stat = -251.25316023753558
p_val = 2.0
----- quotes t-test -----
t_stat = 2.049535521920288
p_val = 0.04077258635862613

Number of groups: 181

----- retweets t-test -----
t_stat = 0.5171379470537968
p_val = 0.6053792140263239
----- replies t-test -----
t_stat = -186.4929970888013
p_val = 2.0
----- quotes t-test -----
t_stat = 1.5058637869152451
p_val = 0.13298394062677543

Number of groups: 45

----- retweets t-test -----
t_stat = 0.2907302998655875
p_val = 0.7719416746731556
----- replies t-test -----
t_stat = -106.22773289858971
p_val = 2.0
----- quotes t-test -----
t_stat = 0.8432925927446651
p_val = 0.4013512144672402


Notice that with groups = 181, we can accept the null in all cases.

### Ratio of words in tweet

Let's look at the actual words in a tweet to determine if the data is statistically different. First, we will examine the ratio of stopwords. Stopwords are English words that occur frequently such as 'the,' 'and,' 'a,' etc. We would like to see a similar ratio of stopwords between the reduced and larger data because this can help us conclude that the actual tweets are statistically similar.

We will also examine how many times a Trump or Biden related word appears as well as the ratio of tweets that mention Trump or Biden. While the number of times a Trump or Biden word may not be a great indicator of statistical similarity, the number of tweets that mention either candidate may be useful. This measure will allow us to glance at the virality of each candidate and will help us determine if the data that we wish to study about the tweets is statistically similar.

In [158]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [159]:
# Common english words (ex. 'the,' 'and,' 'a,' etc.)
stop_words = set(stopwords.words('english'))

In [212]:
def get_word_ratios(obj_lst):
    '''
    Divide a day's worth of combined_tweets into `total_groups` and find the relevant ratios per group 
    to investigate the distribution of tweet types.
    
    @param obj_lst: List of combined, thinned tweet objects read from the pkl files
    @return: Three lists of length approximately 'total_groups' containing the population means for each 
    group. The first list contains the ratio of stopwords, the second contains the ratio of trump-related words, and 
    the third contains the ratio of biden-related words.
    '''
    lst_len = len(obj_lst)
    total_groups = 60 * 24
    group_size = lst_len // total_groups
    stopword_mean_tracker = []
    trump_words_mean_tracker = []
    trump_tweet_mean_tracker = []
    biden_words_mean_tracker = []
    biden_tweet_mean_tracker = []

    stopword_counter = 0
    trump_counter = 0
    trump_related_tweet_counter = 0
    biden_counter = 0
    biden_related_tweet_counter = 0
    tweet_word_count = 0
    tweet_count = 0
    for thin_obj in obj_lst:
        if tweet_count == group_size:
            append_ratio(stopword_mean_tracker, stopword_counter, tweet_word_count)
            append_ratio(trump_words_mean_tracker, trump_counter, tweet_word_count)
            append_ratio(biden_words_mean_tracker, biden_counter, tweet_word_count)
            append_ratio(trump_tweet_mean_tracker, trump_related_tweet_counter, tweet_count)
            append_ratio(biden_tweet_mean_tracker, biden_related_tweet_counter, tweet_count)
            
            tweet_count = 0
            tweet_word_count = 0
            stopword_counter = 0
            trump_counter = 0
            trump_related_tweet_counter = 0
            biden_counter = 0
            biden_related_tweet_counter = 0
            
        tweet_count += 1
        if thin_obj.is_retweet: thin_obj = thin_obj.retweet
        word_tokens = word_tokenize(thin_obj.text)
        tweet_word_count += len(word_tokens)
        
        trump_related_tweet_flag = False
        biden_related_tweet_flag = False
        for w in word_tokens:
            w = w.lower()
            if w in stop_words: stopword_counter += 1
            if w in ['trump', 'realdonaldtrump', '@realdonaldtrump']: 
                trump_counter += 1
                if not trump_related_tweet_flag:
                    trump_related_tweet_flag = True
                    trump_related_tweet_counter += 1
            if w in ['biden', 'joebiden', '@joebiden']: 
                biden_counter += 1
                if not biden_related_tweet_flag:
                    biden_related_tweet_flag = True
                    biden_related_tweet_counter += 1
                
    return stopword_mean_tracker, trump_words_mean_tracker, trump_tweet_mean_tracker, biden_words_mean_tracker, biden_tweet_mean_tracker


In [215]:
def get_word_stats(obj_lst):
    stopword_mean_tracker, trump_mean_tracker, trump_tweet_mean_tracker, biden_mean_tracker, biden_tweet_mean_tracker = get_word_ratios(obj_lst)
    print("percentage of words that are stopwords: ", np.array(stopword_mean_tracker).mean())
    print()
    print("percentage of words that are trump related: ", np.array(trump_mean_tracker).mean())
    print("percentage of tweets that mention trump: ", np.array(trump_tweet_mean_tracker).mean())
    print()
    print("percentage of words that are biden related: ", np.array(biden_mean_tracker).mean())
    print("percentage of tweets that mention biden: ", np.array(biden_tweet_mean_tracker).mean())
    return stopword_mean_tracker, trump_mean_tracker, trump_tweet_mean_tracker, biden_mean_tracker, biden_tweet_mean_tracker

In [220]:
print("----- Thinned data stats -----")
stopword_means_thin, trump_means_thin, trump_mentions_means_thin, biden_means_thin, biden_mentions_means_thin = get_word_stats(tweets_0626)

print("----- Full data stats -----")
stopword_means_full, trump_means_full, trump_mentions_means_full, biden_means_full, biden_mentions_means_full = get_word_stats(tweets_0624)


----- Thinned data stats -----
percentage of words that are stopwords:  0.27571611620491726

percentage of words that are trump related:  0.0077525441164554764
percentage of tweets that mention trump:  0.15056655751338427

percentage of words that are biden related:  0.001802221817679685
percentage of tweets that mention biden:  0.03489006705130204
----- Full data stats -----
percentage of words that are stopwords:  0.2882371591545534

percentage of words that are trump related:  0.006473343776062667
percentage of tweets that mention trump:  0.12229915044484581

percentage of words that are biden related:  0.0014778323257913111
percentage of tweets that mention biden:  0.028117474413004218


In [221]:
print("----- stopwords t-test -----")
t_test(stopword_means_thin, stopword_means_full)
print()
print("----- trump words t-test -----")
t_test(trump_means_thin, trump_means_full)
print("----- tweets that mention trump t-test -----")
t_test(trump_mentions_means_thin, trump_mentions_means_full)
print()
print("----- biden words t-test -----")
t_test(biden_means_thin, biden_means_full)
print("----- tweets that mention biden t-test -----")
t_test(biden_mentions_means_thin, biden_mentions_means_full)

----- stopwords t-test -----
t_stat = -10.29123494922213
p_val = 2.0

----- trump words t-test -----
t_stat = 7.423273895486341
p_val = 1.496580637194711e-13
----- tweets that mention trump t-test -----
t_stat = 8.589418849462756
p_val = 0.0

----- biden words t-test -----
t_stat = 6.084520959546995
p_val = 1.3231316042805474e-09
----- tweets that mention biden t-test -----
t_stat = 6.831629876800464
p_val = 1.0204725953144589e-11


We can conclude that the ratio of stopwords are not statistically different between the reduced and larger data. This may be a good metric to determine that the tweets themselves are not statistically different. 

However, notice that we reject the null for the ratio of times trump related words appear and tweets that mention trump, concluding that these *are* statistically different between the reduced and larger data. We also conclude that the ratio of times biden related words appear and the ratio of tweets that mention biden are statistically different between the reduced and larger data. 

If we reduce the amount of groups (increasing the number of tweets per group), we do not reject the null and can then claim that the data is not statistically different. This raises a question on how we wish to determine subgroups.

In [260]:
def run_words_t_test_reduced(factor):
    print("Number of groups:", len(reduce_subgroups(trump_means_thin, factor)))
    print()
    print("----- stopwords t-test -----")
    t_test(reduce_subgroups(stopword_means_thin, factor), reduce_subgroups(stopword_means_full, factor))
    print()
    print("----- trump words t-test -----")
    t_test(reduce_subgroups(trump_means_thin, factor), reduce_subgroups(trump_means_full, factor))
    print("----- tweets that mention trump t-test -----")
    t_test(reduce_subgroups(trump_mentions_means_thin, factor), reduce_subgroups(trump_mentions_means_full, factor))
    print()
    print("----- biden words t-test -----")
    t_test(reduce_subgroups(biden_means_thin, factor), reduce_subgroups(biden_means_full, factor))
    print("----- tweets that mention biden t-test -----")
    t_test(reduce_subgroups(biden_mentions_means_thin, factor), reduce_subgroups(biden_mentions_means_full, factor))

In [264]:
run_words_t_test_reduced(3)
print()
run_words_t_test_reduced(4)
print()
run_words_t_test_reduced(5)

Number of groups: 181

----- stopwords t-test -----
t_stat = -3.8327474871784104
p_val = 1.999850344544159

----- trump words t-test -----
t_stat = 2.755516145136178
p_val = 0.006159620609314409
----- tweets that mention trump t-test -----
t_stat = 3.1831926196199594
p_val = 0.0015841691150355608

----- biden words t-test -----
t_stat = 2.580276291112289
p_val = 0.010269836394463328
----- tweets that mention biden t-test -----
t_stat = 2.9257240900210904
p_val = 0.0036557635217886464

Number of groups: 90

----- stopwords t-test -----
t_stat = -2.7420306880134615
p_val = 1.9932701271338433

----- trump words t-test -----
t_stat = 1.9821914288366276
p_val = 0.0489967400614415
----- tweets that mention trump t-test -----
t_stat = 2.2857990179052283
p_val = 0.023444685488855033

----- biden words t-test -----
t_stat = 1.900846992229877
p_val = 0.058938029453291296
----- tweets that mention biden t-test -----
t_stat = 2.1542933064411227
p_val = 0.032561541012167794

Number of groups: 45

-

Notice that with groups = 45, we can accept the null in all cases.

## Conclusion

After performing several t-tests to compare the reduced and larger data sets of combined, thinned tweets, we can make several conclusions. First, we can accept the null hypothesis and conclude that the data sets are not statistically different for the ratio of retweets. We must reject this conclusion for the ratio of quotes. We cannot make any conclusion for the ratio of replies due to an error with the `is_reply` field. It would be worthwhile to rerun the t-tests with a more thoughtful choice for subgroups.

The ratio of stopwords are not statistically different between the reduced and larger data. This may be a good metric to determine that the tweets themselves are not statistically different since stopwords are common words between English statements. 

However, using the current subgroup size of 1440, we reject the null for all other scenarios, concluding that the tweets *are* statistically different between the reduced and larger data. If we reduce the amount of groups (increasing the number of tweets per group), we do not reject the null and can then claim that the data is not statistically different. This raises a question on how we wish to determine subgroups.

It is a little surprisingly to see the lack of mentions to Trump and Biden. Even more so, Biden is mentioned 5x as frequently as Trump. This raises questions on how we can determine whether a tweet is about Trump or Biden. We should establish a robust criteria and re-run the above t-tests. 

**Future considerations:**
- Determine a smarter criteria for subgroup size
- Establish a robust criteria to determine a Trump or Biden related tweet
- Model the retweet vs. reply vs. quote as a multinomial distribution