# CMSC320 Final Project
## Kinsey Smith, Sarah Bullard, Yiwen Shen

<img src="trump_twitter_image.jpg">

### Introduction

Our project is surrounding the Twitter account of the United States' current president, Donald Trump (@realDonaldTrump). We focused on the sentiment of the tweets of this account versus each individual tweet's replies. Our intention was to find out the difference between the sentiment of the tweet and the sentiment of its replies, and how it would reflect our current political climate. Our hypothesis was that Donald Trump's account would have more negative replies to his positive tweets, since the current political climate does not favor Donald Trump.

Sentiment analysis is a way of classifying a text as having a positive, negative, or neutral sentiment using text analysis.  However, because of the complex and sarcastic components of the English language, sentiment analysis is not a sole way of categorizing something as positive or negative. Because of this, we needed other factors to tell the sentiment of a reply. In order to do this, we created a feature vector and used that to classify the sentiment of the reply and used SVM machine learning in order to have the machine classify it for us.

The features in our feature vector are as follows: 

Our first feature was the original sentiment analysis, because although it is not reliably conclusive on its own, it can tell us something about the mood of the sentence. 

The second feature we worked on focused on the user who posted the reply to the specific tweet. We checked whether or not the user was following other accounts that aligned with Donald Trump's or BLM's views, including politicians of either party.

The third feature we worked on also focused on the user who posted the reply to the specific tweet. We compared the user's hashtags for the last year to known Trump-positive and BLM-positive hashtags, and noted numerically the number of hashtags that were similar for each user. 

This notebook will be organized into four parts: Data Extraction, Data Manipulation, Data Analysis, and Data Visualization. Each part will show how we manipulated the Twitter's API in order to get the tweets that we need and come to the conclusion that we have. 

In [2]:
# All of the imports that we need for the project.

!pip install tweepy 
!pip install unidecode
!pip install unicodecsv
!pip install textblob

import tweepy
from tweepy import Cursor
import textblob
import numpy as np
import pandas as pd
import collections
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
import unicodecsv
from unidecode import unidecode

Collecting textblob
  Downloading textblob-0.15.0-py2.py3-none-any.whl (631kB)
[K    100% |████████████████████████████████| 634kB 907kB/s eta 0:00:01
[?25hCollecting nltk>=3.1 (from textblob)
  Downloading nltk-3.2.5.tar.gz (1.2MB)
[K    100% |████████████████████████████████| 1.2MB 561kB/s eta 0:00:01
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... [?25ldone
[?25h  Stored in directory: /home/jovyan/.cache/pip/wheels/18/9c/1f/276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk, textblob
Successfully installed nltk-3.2.5 textblob-0.15.0




### Accessing Twitter's Data

In order to access Twitter's API, we had to create applications and personally get authentication tokens. Even though anyone that has a Twitter account is allowed access to Twitter's data as long as they fill out an Application form, we cannot give out these confidential tokens on this public notebook becuase it is a privacy risk. In order to get past this hurdle, we created a function that would pull from our own files on our own machines for these tokens. In the cell below is a copy of credentials.py, without the confidential information. See below the code we used to access Twitter's API in order to get the data. 

#### An example of credentials.py 
#This is a file that holds confidential information about a Twitter user and their authentication tokens. Please do not read further if you are not authorized.
  
CONSUMER_KEY = ' '

CONSUMER_SECRET = ' '

ACCESS_TOKEN = ' '

ACCESS_SECRET = ' '


In [1]:
from credentials import *
#A function that takes these credentials and sets up the API.
def api_setup():
    #application authentication allows more data retrival 
    auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api
# Extracting the tweets
extract_tweets = api_setup()

In [None]:
# Donald Trump Replies
tweet_ids_donald = []
for page in tweepy.Cursor(extract_tweets.user_timeline,screen_name="realDonaldTrump").pages(20):
    for item in page:
        tweet_ids_donald.append(item.id_str)

In [None]:
query = tweepy.Cursor(extract_tweets.search,q="to:realDonaldTrump").items(5000)

In [None]:
query2 = tweepy.Cursor(extract_tweets.search,q="to:Blklivesmatter").items(20000)

In [None]:
for tweet in query:
    if replies_donald.get(tweet.in_reply_to_status_id_str) != None:
        #add it to the csv

In [None]:
for tweet in query2:
    if replies_blm.get(tweet.in_reply_to_status_id_str) != None:
        #add it to the csv

### Adding the data to their individual CSVs

Once we accessed the data from Twitter's API, we started querying. However, since Twitter has a rule of a maximum of 15 queries in 15 minutes, and we were querying for replies of more than 15 tweets, we had to find another way to access the data when working with it. Twitter also has a rule where you can only access tweets 2 weeks before the current date, so we set that as our time limit for both tweets and their replies to measure the current poltical climate. We added the tweets and their replies dated from <b> DATE OF TWEETS AND REPLIES </b> to individual CSVs to work with them. We slowly added them to these CSVs to circumvent the 15-queries-maximum  rule Tweets.

<b> Getting Donald Trump Tweets. </b>

This piece of code referenced from nicolewhite will find all trump's tweets for a year and add it to a csv called trump_tweets.csv. 

In [6]:
#https://gist.github.com/nicolewhite/167828e51d8f2b6fad75
users = ["realDonaldTrump"]
with open('trump_tweets.csv', 'wb') as file:
    writer = unicodecsv.writer(file, delimiter = ',', quotechar = '"')
    # Write header row.
    writer.writerow(["name",
                    "username",
                    "followers_count",
                    "listed_count",
                    "following",
                    "favorites",
                    "verified",
                    "default_profile",
                    "location",
                    "time_zone",
                    "statuses_count",
                    "description",
                    "geo_enabled",
                    "contributors_enabled",
                    "tweet_id",
                    "tweet_time",
                    "tweet_text",
                    "tweet_lat",
                    "tweet_long",
                    "tweet_source",
                    "tweet_in_reply_to_screen_name",
                    "tweet_direct_reply",
                    "tweet_retweet_status",
                    "tweet_retweet_count",
                    "tweet_favorite_count",
                    "tweet_hashtags",
                    "tweet_hashtags_count",
                    "tweet_urls",
                    "tweet_urls_count",
                    "tweet_user_mentions",
                    "tweet_user_mentions_count",
                    "tweet_media_type",
                    "tweet_contributors"])

    for user in users:
        user_obj = api.get_user(user)

        # Gather info specific to the current user.
        user_info = [user_obj.name,
                     user_obj.screen_name,
                     user_obj.followers_count,
                     user_obj.listed_count,
                     user_obj.friends_count,
                     user_obj.favourites_count,
                     user_obj.verified,
                     user_obj.default_profile,
                     user_obj.location,
                     user_obj.time_zone,
                     user_obj.statuses_count,
                     user_obj.description,
                     user_obj.geo_enabled,
                     user_obj.contributors_enabled]

        # Get 5000 most recent tweets for the current user.
        # api.search, q="to:"+user  api.user_timeline, screen_name = user 

        
        for tweet in Cursor(api.user_timeline, screen_name = user).items(5000):

                    # Latitude and longitude stored as array of floats within a dictionary.
                    lat = tweet.coordinates['coordinates'][1] if tweet.coordinates != None else None
                    long = tweet.coordinates['coordinates'][0] if tweet.coordinates != None else None
                    # If tweet is not in reply to a screen name, it is not a direct reply.
                    direct_reply = True if tweet.in_reply_to_screen_name != "" else False
                    # Retweets start with "RT ..."
                    retweet_status = True if tweet.text[0:3] == "RT " else False

                    # Get info specific to the current tweet of the current user.
                    tweet_info = [tweet.id,
                                  tweet.created_at,
                                  unidecode(tweet.text),
                                  lat,
                                  long,
                                  tweet.source,
                                  tweet.in_reply_to_screen_name,
                                  direct_reply,
                                  retweet_status,
                                  tweet.retweet_count,
                                  tweet.favorite_count]

                    # Below entities are stored as variable-length dictionaries, if present.
                    hashtags = []
                    hashtags_data = tweet.entities.get('hashtags', None)
                    if(hashtags_data != None):
                        for i in range(len(hashtags_data)):
                            hashtags.append(unidecode(hashtags_data[i]['text']))

                    urls = []
                    urls_data = tweet.entities.get('urls', None)
                    if(urls_data != None):
                        for i in range(len(urls_data)):
                            urls.append(unidecode(urls_data[i]['url']))

                    user_mentions = []
                    user_mentions_data = tweet.entities.get('user_mentions', None)
                    if(user_mentions_data != None):
                        for i in range(len(user_mentions_data)):
                            user_mentions.append(unidecode(user_mentions_data[i]['screen_name']))

                    media = []
                    media_data = tweet.entities.get('media', None)
                    if(media_data != None):
                        for i in range(len(media_data)):
                            media.append(unidecode(media_data[i]['type']))

                    contributors = []
                    if(tweet.contributors != None):
                        for contributor in tweet.contributors:
                            contributors.append(unidecode(contributor['screen_name']))

                    more_tweet_info = [', '.join(hashtags),
                                       len(hashtags),
                                       ', '.join(urls),
                                       len(urls),
                                       ', '.join(user_mentions),
                                       len(user_mentions),
                                       ', '.join(media),
                                       ', '.join(contributors)]

                    # Write data to CSV.
                    writer.writerow(user_info + tweet_info + more_tweet_info)

        # Show progress.
        print("Wrote tweets by %s to CSV." % user)

### Collecting the Data in a Functional Way

Now that the data is in a csv and collected, we can put it into a dataframe to work with. 

We also deleted unnecessary columns in the tweet json, like followers_count and following, which don't matter in regards to what we are doing. Since the Twitter json structure doesn't include the tweet id itself and (if it is a reply) the id of the tweet it is replying to, we had to put that information as well as the text of the tweet (to match them) into a separate CSV and compare them to add both sets of information to the dataframe.

In [7]:
# The Donald Tweets
data_donald = pd.read_csv("trump_tweets.csv")
del data_donald["name"]
del data_donald["followers_count"]
del data_donald["listed_count"]
del data_donald["following"]
del data_donald["favorites"]
del data_donald["verified"]
del data_donald["default_profile"]
del data_donald["statuses_count"]
del data_donald["description"]
del data_donald["geo_enabled"]
del data_donald["contributors_enabled"]
del data_donald["tweet_lat"]
del data_donald["tweet_long"]
del data_donald["tweet_source"]
del data_donald["tweet_in_reply_to_screen_name"]
del data_donald["tweet_direct_reply"]
del data_donald["tweet_retweet_count"]
del data_donald["tweet_favorite_count"]
del data_donald["tweet_hashtags_count"]
del data_donald["tweet_urls"]
del data_donald["tweet_urls_count"]
del data_donald["tweet_user_mentions_count"]
data_donald=data_donald[1:15]
data_donald.head()

Unnamed: 0,username,location,time_zone,tweet_id,tweet_time,tweet_text,tweet_retweet_status,tweet_hashtags,tweet_user_mentions,tweet_media_type,tweet_contributors
1,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),939642796289470464,2017-12-09 23:47:55,Great Army - Navy Game. Army wins 14 to 13 and...,False,,,,
2,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),939634404267380736,2017-12-09 23:14:34,.@daveweigel of the Washington Post just admit...,False,,daveweigel,,
3,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),939616077356642304,2017-12-09 22:01:44,.@DaveWeigel @WashingtonPost put out a phony p...,False,,"daveweigel, washingtonpost",,
4,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),939564681743814661,2017-12-09 18:37:31,"Have a great game today, @USArmy and @USNavy -...",False,,"USArmy, USNavy",,
5,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),939560154269405184,2017-12-09 18:19:31,It was my great honor to celebrate the opening...,False,,,,


<b> Getting replies to Donald Trump Tweets. </b>

The following code will collect replies to trump's tweet to a csv file. Due to twitter's rate limit on search api we only collect 20 replies. With the reply_id we will be able to see which Trump's tweet above the user is replying to.

In [None]:
tweet_id_list = data_donald['tweet_id'].tolist()

with open('trump_replies.csv', 'wb') as file:
    writer = unicodecsv.writer(file, delimiter = ',', quotechar = '"')
    writer.writerow([
                    "username",
                    "id",
                    "reply_id",
                    "date",
                    "tweet"
                    ])
                     

    for otweet_id in tweet_id_list:
        
        i=0
        print(otweet_id)
        for tweet in tweepy.Cursor(extractor.search, q="to:realDonaldTrump", since_id= otweet_id).items():
                
                
                if str(tweet.in_reply_to_status_id) == str(otweet_id):
                    i=i+1
                    print(i)
                    if(i==20):
                        break

                    info = [tweet.user.screen_name,
                             tweet.id,
                             tweet.in_reply_to_status_id,
                             tweet.created_at,
                             unidecode(tweet.text),
                            ]


                    writer.writerow(info)

            



Below is the resulting csv. As we notice due to the constraint of the search api, the replies we got are all very recent as in 12-11 compared with the 12-9 in the original tweets. This is due to the search api searching from most recent and we have no comntrol of. This may affect the result as the recent tweets may be affected by recent events. This can be solbved by uisng a better search engine that dosen't have rate limit and can sample search result

In [9]:
# Replies of Trump
data_donald_replies = pd.read_csv("trump_replies.csv")

data_donald_replies

Unnamed: 0,username,id,reply_id,date,tweet
0,ZoltanCaptain,940075314398486528,939642796289470464,2017-12-11 04:26:35,@realDonaldTrump Resign before you are impeach...
1,nodictatorinusa,940073512282218496,939642796289470464,2017-12-11 04:19:25,@realDonaldTrump https://t.co/tuQNvmQYEZ
2,1MikeMerica,940073344891699200,939642796289470464,2017-12-11 04:18:45,@realDonaldTrump Now go play roulette in Russia
3,rondavt74,940072314086887424,939642796289470464,2017-12-11 04:14:40,@realDonaldTrump https://t.co/sIPqjPoX3s
4,TMHansen0528,940071620823040001,939642796289470464,2017-12-11 04:11:54,@realDonaldTrump You know nothing about football!
5,msbakerdw,940071578154418177,939642796289470464,2017-12-11 04:11:44,@realDonaldTrump HOW ABOUT THEM COWBOYS
6,ThomasH78937573,940068345604333570,939642796289470464,2017-12-11 03:58:53,@realDonaldTrump Yippee and you also have a tr...
7,MRMitchnet,940067661358104576,939642796289470464,2017-12-11 03:56:10,@realDonaldTrump @Don_Vito_08 Everything is be...
8,jshoe2111,940067075728424961,939642796289470464,2017-12-11 03:53:51,@realDonaldTrump Was with an Army guy yesterda...
9,rosiegee818,940065699598184451,939642796289470464,2017-12-11 03:48:23,@realDonaldTrump Yah i said bcus u &amp; ur ru...


<b> Feature vectors </b>

The follwing code will find what the users from data_donald_replies are following and will put that into feature vectors in a cloumn called score. Start at 0, if they’re following people to the right wing hanldes +1, if they are following people to the left -1. Again,due to api's rate limit this took very long time and we have to store the result in csv

In [None]:
right_wing_handles_string = ",@AnnCoulter,@StefanMolyneux,@MarkSteynOnline,@ezralevant,@nntaleb,@Lauren_Southern,@RealJamesWoods,@IngrahamAngle,@benshapiro,@charliekirk11,@jihadwatchRS,@scrowder,@RubinReport,@Nigel_Farage,@michellemalkin,@PrisonPlanet,@ScottAdamsSays,@andrewklavan,@Gavin_McInnes,@Cernovich,@TuckerCarlson,@mitchellvii,@NolteNC,@JamesOKeefeIII,@DLoesch,@JackPosobiec,@BuckSexton,@KatiePavlich,@marklevinshow,@seanhannity,@guypbenson,@JimDeMint,@BrentBozell,@larryelder,@BillOReilly,@limbaugh"
right_wing_handles = right_wing_handles_string.split(",@")   
left_wing_handles_string = ",@People4Bernie,@BuzzfeedBen,@jonlovett,@ChrisMurphyCT,@HeerJeet,@deray,@chrislhayes,@ezraklein,@jbouie,@cjane87,@julietlapidos,@meaganmday,@emmaogreen,@ObsoleteDogma,@billmaher,@BillNye,@iamjohnoliver,@kmcnuggets,@StephenAtHome,@donnabrazile,@PPact,@iraglass,@chelseahandler,@kamalaharris,@elizabethwarren,@sensherrodbrown"
left_wing_handles = left_wing_handles_string.split(",@")  

right_wing_handles.pop(0)
left_wing_handles.pop(0)

#getting user's following list
fl_dict = {}
for u_name in username_list:
    fl =[]
    users = tweepy.Cursor(extractor.friends, screen_name=u_name, count = 200).items()
    for u in users:
        fl.append(u.screen_name)
     
    fl_dict[u_name] = fl
    
#for each user, find how many they are following that are in right wing handles and that for left wing as well
#the score will be right minus left
score_list=[]
for user in fl_dict:
        user_set = set(fl_dict[user])
        score= len(list(user_set & right_wing_set)) - len(list( user_set & left_wing_set))


#the data is then saved to csv called
data_donald_replies['following_feature_vector'] = score_list
data_donald_replies.to_csv("trump_replies_with_following_feature.csv", sep=',', index = False)

Reading from the replies csv  with following feature. As we can see, the following_feature_vector are mostly negative meaning that they follow more right wing handles than right wing ones. 

In [8]:
data_donald_replies_with_following_feature = pd.read_csv("trump_replies_with_following_feature.csv")
data_donald_replies_with_following_feature

Unnamed: 0,username,id,reply_id,date,tweet,following_feature_vector
0,ZoltanCaptain,940075314398486528,939642796289470464,2017-12-11 04:26:35,@realDonaldTrump Resign before you are impeach...,-8
1,nodictatorinusa,940073512282218496,939642796289470464,2017-12-11 04:19:25,@realDonaldTrump https://t.co/tuQNvmQYEZ,-7
2,1MikeMerica,940073344891699200,939642796289470464,2017-12-11 04:18:45,@realDonaldTrump Now go play roulette in Russia,-1
3,rondavt74,940072314086887424,939642796289470464,2017-12-11 04:14:40,@realDonaldTrump https://t.co/sIPqjPoX3s,-3
4,TMHansen0528,940071620823040001,939642796289470464,2017-12-11 04:11:54,@realDonaldTrump You know nothing about football!,-3
5,msbakerdw,940071578154418177,939642796289470464,2017-12-11 04:11:44,@realDonaldTrump HOW ABOUT THEM COWBOYS,1
6,ThomasH78937573,940068345604333570,939642796289470464,2017-12-11 03:58:53,@realDonaldTrump Yippee and you also have a tr...,-10
7,MRMitchnet,940067661358104576,939642796289470464,2017-12-11 03:56:10,@realDonaldTrump @Don_Vito_08 Everything is be...,-3
8,jshoe2111,940067075728424961,939642796289470464,2017-12-11 03:53:51,@realDonaldTrump Was with an Army guy yesterda...,-4
9,rosiegee818,940065699598184451,939642796289470464,2017-12-11 03:48:23,@realDonaldTrump Yah i said bcus u &amp; ur ru...,3


#### Sentiment Analysis

As explained before, sentiment analysis is a basic way of analyzing a text to tell if it is of positive, negative, or neutral sentiment. NLTK is the library we will use to do this section. This will be one of the features for the feature vector. 

## 784 words at this point. 

In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [12]:
sia = SentimentIntensityAnalyzer()

In [14]:
#Getting the sentiment analysis for each type of tweet
sentiment_nltk_donald = []
for _,x in data_donald.iterrows():
    sentiment_nltk_donald.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))
sentiment_nltk_donald_replies = []
for _,x in data_donald_replies.iterrows():
    sentiment_nltk_donald_replies.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))

In [15]:
i = 0
while i < 6:
    print(sentiment_nltk_donald_replies[i])
    i = i + 1

('@realDonaldTrump @WhiteHouse  https://t.co/Po44ehzuU4', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})
('@realDonaldTrump @POTUS Moron!', {'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'compound': -0.5411})
("@realDonaldTrump I'm not convinced that the majority of Americans believe that. Heck, he helped you win by what he... https://t.co/OhZPoFZoLC", {'neg': 0.094, 'neu': 0.748, 'pos': 0.158, 'compound': 0.3699})
('@realDonaldTrump Awesome job!  Keep it up!', {'neg': 0.0, 'neu': 0.516, 'pos': 0.484, 'compound': 0.6892})
('@realDonaldTrump No you did not you moron', {'neg': 0.224, 'neu': 0.509, 'pos': 0.267, 'compound': 0.1098})
('@realDonaldTrump Coherent, grammatical, appropriate capitalization ... obviously ghostwritten.', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})


## 30 nonredundant lines of code. 