## Russian Bot Discussion

The name of our project is: Russian Bot Discussion
    
Email Addresses: David Kes: (ddkes@dons.usfca.edu) Stephen Hsu: (sjhsu@dons.usfca.edu)
            
Link to project: https://github.com/stephenjhsu/msan622viz

## Background and Motivation

With the furor over Russian probing of the Presidential election, and with the notorious Facebook / Cambridge Analytica scandal still topping daily domestic headlines, it became clear to us that “fake news” and Russian bots remain prevalent yet vaguely understood concepts. What exactly are these bots saying? How are people being fooled by these Tweets? How exactly are they influencing people and spreading propaganda? Given our backgrounds in natural language processing and data visualization, and our interest in the combination of technology and politics, it was only natural to examine the Russian bot Tweet data with Python and Plotly. The following is our process of taking the NBC data found at https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731 and preparing it for visualization, topic modeling, sentiment analysis, and more.

In [1]:
import pandas as pd
import numpy as np

#nlp
import spacy
import re
from textblob import TextBlob

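# scikit-learn's stop list lives in feature_extraction.text; aliased so the NLTK import below does not shadow it
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stopwords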

#LDA / topical modeling
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

from nltk.corpus import stopwords
stop = stopwords.words('english')

import warnings
warnings.filterwarnings('ignore')

In [2]:
def add_datepart(df, fldname, drop=True, time=True):
    """add_datepart converts a column of df from a datetime64 to many columns containing
    the information from the date. This applies changes inplace.
    Parameters:
    -----------
    df: A pandas data frame. df gains several new columns.
    fldname: A string that is the name of the date column you wish to expand.
        If it is not a datetime64 series, it will be converted to one with pd.to_datetime.
    drop: If true then the original date column will be removed.
    time: If true, the time features Hour and Minute will be added.
    Examples:
    ---------
    >>> df = pd.DataFrame({ 'A' : pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000'], infer_datetime_format=False) })
    >>> df
        A
    0   2000-03-11
    1   2000-03-12
    2   2000-03-13
    >>> add_datepart(df, 'A')
    >>> df
        AYear AMonth ADay ADayofweek ADayofyear AHour AMinute AElapsed
    0   2000  3      11   5          71         0     0       952732800
    1   2000  3      12   6          72         0     0       952819200
    2   2000  3      13   0          73         0     0       952905600
    """
    fld = df[fldname]
    if not np.issubdtype(fld.dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear']
    if time: attr = attr + ['Hour', 'Minute']
    for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)
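
To make the effect of the helper concrete, here is a minimal, self-contained sketch on a toy frame. It repeats the same `.dt` attribute lookups rather than calling `add_datepart` itself, and the frame `demo` and its two timestamps are made up for illustration. Because `created_str` does not end in "date", the prefix is left unchanged and the new columns are named `created_strYear`, `created_strMonth`, and so on.

```python
import pandas as pd

# Toy data (hypothetical): two timestamps in a column named like the tweets' date field.
demo = pd.DataFrame({"created_str": ["2016-11-08 21:15:00", "2017-01-20 12:00:00"]})
stamps = pd.to_datetime(demo["created_str"])

# Same attribute lookups that add_datepart performs via getattr(fld.dt, n.lower()).
for part in ["Year", "Month", "Day", "Dayofweek", "Dayofyear", "Hour", "Minute"]:
    demo["created_str" + part] = getattr(stamps.dt, part.lower())

# Seconds since the Unix epoch, computed with timedelta arithmetic instead of an integer cast.
demo["created_strElapsed"] = (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(demo.drop(columns="created_str").iloc[0].to_dict())
```

The timedelta-based elapsed column is a swapped-in alternative to `fld.astype(np.int64) // 10 ** 9`; both are meant to yield whole seconds since the epoch.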

In [3]:
#read in the data
tweets = pd.read_csv('../../finalprojdata/tweets.csv')
users = pd.read_csv('../../finalprojdata/users.csv')

In [4]:
users = users.drop_duplicates('id')

In [5]:
len(users)

394

In [6]:
tweets.drop(['tweet_id','retweeted_status_id', 'in_reply_to_status_id', 'created_at', 'expanded_urls'], axis=1, inplace=True)

In [7]:
fulltweets = tweets.merge(users, how='left', left_on='user_id', right_on='id')
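
A left merge keeps every tweet, including tweets whose `user_id` has no match in `users`; those rows simply get nulls in the user columns. The sketch below, on made-up frames `tweets_demo` and `users_demo`, shows one way to count such rows with the `indicator=True` option of `merge`. It is an illustrative check, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy frames standing in for tweets.csv and users.csv.
tweets_demo = pd.DataFrame({"user_id": [1, 2, 3], "text": ["a", "b", "c"]})
users_demo = pd.DataFrame({"id": [1, 2], "time_zone": ["Moscow", None]})

# indicator=True adds a '_merge' column saying where each row came from.
merged = tweets_demo.merge(
    users_demo, how="left", left_on="user_id", right_on="id", indicator=True
)

# Rows marked 'left_only' are tweets with no matching user record.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```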

In [8]:
fulltweets = fulltweets[pd.notnull(fulltweets['user_id'])]
fulltweets = fulltweets[pd.notnull(fulltweets['created_str'])]
fulltweets = fulltweets[pd.notnull(fulltweets['friends_count'])]
fulltweets = fulltweets[pd.notnull(fulltweets['time_zone'])]
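
The four chained filters above could also be written as a single `dropna(subset=...)` call. The following sketch uses a made-up four-row frame to suggest that the two forms keep the same rows; the column names match the notebook, the values do not.

```python
import pandas as pd

# Hypothetical toy frame: each of rows 1-3 is missing one required field.
demo = pd.DataFrame({
    "user_id": [1.0, None, 3.0, 4.0],
    "created_str": ["2016-01-01", "2016-01-02", None, "2016-01-04"],
    "friends_count": [10.0, 5.0, 7.0, 2.0],
    "time_zone": ["Moscow", "Moscow", "Moscow", None],
})
required = ["user_id", "created_str", "friends_count", "time_zone"]

# Chained boolean filters, one column at a time.
chained = demo
for col in required:
    chained = chained[pd.notnull(chained[col])]

# Single call that drops any row with a null in one of the required columns.
single = demo.dropna(subset=required)

print(len(single), chained.equals(single))
```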

In [9]:
len(fulltweets)

185160

In [10]:
fulltweets.isnull().sum()

user_id                  0
user_key                 0
created_str              0
retweet_count       134149
retweeted           134149
favorite_count      134149
text                     0
source              134149
hashtags                 0
posted                   0
mentions                 0
id                       0
location             20155
name                     0
followers_count          0
statuses_count           0
time_zone                0
verified                 0
lang                     0
screen_name              0
description          11813
created_at               0
favourites_count         0
friends_count            0
listed_count             0
dtype: int64

In [11]:
#fix time
add_datepart(fulltweets, 'created_str')

In [12]:
## Create a sentiment column
def analyze_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(tweet)
    return analysis.sentiment.polarity

In [13]:
## Create a subjectivity column
def analyze_sentiment2(tweet):
    '''
    Utility function to measure the subjectivity of a tweet
    using textblob.
    '''
    analysis = TextBlob(tweet)
    return analysis.sentiment.subjectivity

In [14]:
tweetsonly2 = fulltweets.text.copy().astype(str)
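# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
tweetsonly2 = tweetsonly2.str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
tweetsonly2 = tweetsonly2.str.replace(r'[\r\n\t_]', ' ', regex=True)  # line breaks, tabs, underscores -> space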
tweetsonly2 = tweetsonly2.str.strip()

fulltweets2 = fulltweets.copy()
fulltweets2.text = tweetsonly2
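
To show what this cleaning does to a single tweet, here is a small sketch using the standard `re` module with the same two patterns. The function name `clean_text` and the sample tweet are hypothetical.

```python
import re

def clean_text(raw: str) -> str:
    """Strip punctuation, then turn line breaks, tabs and underscores into spaces."""
    no_punct = re.sub(r"[^\w\s]", "", raw)
    spaced = re.sub(r"[\r\n\t_]", " ", no_punct)
    return spaced.strip()

# Made-up tweet with a mention, a line break, a link and a hashtag.
sample = "RT @some_user: Breaking!\nRead more: https://t.co/abc #news"
print(clean_text(sample))
# prints: RT some user Breaking Read more httpstcoabc news
```

Note the side effect on links: with the punctuation removed, a URL collapses into a single token such as `httpstcoabc`, which then survives into later steps as an ordinary word.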

In [15]:
stop += ['rt']
fulltweets2.text = fulltweets2.text.apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))
# Note on naming: 'Sentiment' holds TextBlob polarity, while 'Polarity' holds TextBlob subjectivity.
fulltweets2['Sentiment'] = np.array([analyze_sentiment(str(tweet)) for tweet in fulltweets2.text.values])
fulltweets2['Polarity'] = np.array([analyze_sentiment2(str(tweet)) for tweet in fulltweets2.text.values])
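
The stop-word filter in this cell can be illustrated on one sentence. The sketch below uses a tiny made-up stop list, `stop_demo`, in place of NLTK's English list, and stores it as a set so that each membership test is a constant-time lookup rather than a scan of a list.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list plus 'rt'.
stop_demo = {"the", "is", "a", "rt", "of"}

def drop_stopwords(text: str, stop_words: set) -> str:
    """Keep only the words whose lowercase form is not in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(drop_stopwords("RT The state of the union is strong", stop_demo))
# prints: state union strong
```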


In [16]:
fulltweets2.text = fulltweets2.text.apply(lambda x: ' '.join([word.lower() for word in x.split() if len(word) > 3]))
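
The length filter above lowercases every word and keeps only words longer than three characters. A short sketch on a made-up sentence shows the trade-off: filler words disappear, but so do short tokens that may carry meaning, such as three-letter acronyms.

```python
def keep_long_words(text: str, min_len: int = 4) -> str:
    """Lowercase the text and keep words with at least min_len characters."""
    return " ".join(w.lower() for w in text.split() if len(w) >= min_len)

# Hypothetical example sentence.
print(keep_long_words("Trump says NBC is fake news"))
# prints: trump says fake news
```

Here `len(w) >= 4` is the same condition as `len(word) > 3` in the cell above; the parameter name `min_len` is introduced only for this illustration.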

In [17]:
fulltweets2.to_csv('/Users/shsu/Downloads/fulltweets2.csv')