# CMSC320 Final Project
## Kinsey Smith, Sarah Bullard, Yiwen Shen

In [4]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='blm_twitter_image.jpg'></td><td><img src='Twitter_bird_logo_2012.svg.png'></td><td><img src='trump_twitter_image.jpg'></td></tr></table>"))

### Introduction
Our project examines the Twitter accounts of the Black Lives Matter movement (@Blklivesmatter) and Donald Trump (@realDonaldTrump). We compared the sentiment of each account's tweets with the sentiment of the replies to each individual tweet. Our question was which account shows a starker difference between its tweets and their replies; in other words, whether positive tweets by BLM or by Donald Trump draw more negative replies. Our hypothesis was that Donald Trump's account would have more negative replies to his positive tweets.

Sentiment analysis classifies a text as having positive, negative, or neutral sentiment using text analysis. However, because English is complex and often sarcastic, sentiment analysis alone is not a reliable way to categorize a reply as positive or negative, so we needed other factors as well. We therefore built a feature vector for each reply and used a support vector machine (SVM) to classify the reply's sentiment from it.

The features in our feature vector are as follows. 

Our first feature was the original sentiment analysis, because although it is not reliably conclusive on its own, it can tell us something about the mood of the sentence. 

The second feature we worked on focused on the user who posted the reply to the specific tweet. We checked whether or not the user was following other accounts that aligned with Donald Trump's or BLM's views, including politicians of either party.

The third feature we worked on also focused on the user who posted the reply to the specific tweet. We compared the hashtags the user posted over the last year to known Trump-positive and BLM-positive hashtags, and counted the matches for each user.
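
The hashtag feature can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two reference sets and the function name `hashtag_overlap` are hypothetical, and the real hashtag lists are not shown in this notebook.

```python
# Hypothetical reference sets, for illustration only.
TRUMP_POSITIVE = {"maga", "americafirst"}
BLM_POSITIVE = {"blacklivesmatter", "blm"}

def hashtag_overlap(user_hashtags, reference):
    """Count how many of a user's hashtags appear in a reference set.

    Matching ignores case and a leading '#'. A hashtag used several
    times is counted each time it appears in user_hashtags.
    """
    return sum(1 for tag in user_hashtags
               if tag.lower().lstrip("#") in reference)

# Example: hashtags collected from one (made-up) user.
user_tags = ["MAGA", "whitelivesmatter", "#AmericaFirst"]
trump_count = hashtag_overlap(user_tags, TRUMP_POSITIVE)
blm_count = hashtag_overlap(user_tags, BLM_POSITIVE)
print(trump_count, blm_count)
```

The two counts would then become two entries in the reply's feature vector, next to the sentiment score.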


In [5]:
import tweepy
import textblob
import numpy as np
import pandas as pd
import collections
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

### Accessing Twitter's Data

In order to access Twitter's API, we had to create applications and obtain our own authentication tokens. Because these tokens are confidential, we cannot publish them in this public notebook. To get past this hurdle, we wrote a function that reads the tokens from a file kept on each of our own machines. The cell below is a copy of credentials.py, without the confidential information.

Once we did that, we started querying. However, Twitter allows a maximum of 15 queries in 15 minutes, and we were querying for the replies to more than 15 tweets, so we could not fetch everything in one run. Instead, we saved the tweets of both accounts and their replies, dated from December 1st, 2016 to December 8th, 2017, to individual CSVs, adding to them gradually to stay within the 15-query limit. The code we used to access Twitter's API is below.
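
One way to stay inside such a limit is to pause whenever the quota for the current window is used up. The helper below is only a sketch under that assumption; the name `throttled` and its parameters are hypothetical and are not part of tweepy or of our project code.

```python
import time

def throttled(calls, max_calls=15, window_seconds=15 * 60,
              sleep=time.sleep, clock=time.monotonic):
    """Run zero-argument callables in order, at most max_calls per window.

    When the quota for the current window is used up, wait for the rest
    of the window before continuing. sleep and clock are injectable so
    the helper can be tested without actually waiting.
    """
    results = []
    window_start = clock()
    used = 0
    for call in calls:
        if used >= max_calls:
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            used = 0
        results.append(call())
        used += 1
    return results

# Two stand-in "queries"; fewer than max_calls, so nothing sleeps here.
results = throttled([lambda: "page 1", lambda: "page 2"])
print(results)
```

tweepy can also do the waiting itself if `tweepy.API` is created with `wait_on_rate_limit=True`, which may be simpler than a hand-written helper.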

#### An example of credentials.py 
#This is a file that holds confidential information about a Twitter user and their authentication tokens. Please do not read further if you are not authorized.
  
CONSUMER_KEY = ' '
CONSUMER_SECRET = ' '
ACCESS_TOKEN = ' '
ACCESS_SECRET = ' '

In [1]:
from credentials import *
#A function that takes these credentials and sets up the API.
def api_setup():
    authentication = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    authentication.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(authentication)
    return api
# Extracting the tweets
extract_tweets = api_setup()

In [None]:
# IDs of Donald Trump's own tweets, used later to match replies to them
tweet_ids_donald = []
for page in tweepy.Cursor(extract_tweets.user_timeline,screen_name="realDonaldTrump").pages(20):
    for item in page:
        tweet_ids_donald.append(item.id_str)

In [None]:
query = tweepy.Cursor(extract_tweets.search,q="to:realDonaldTrump").items(5000)

In [None]:
query2 = tweepy.Cursor(extract_tweets.search,q="to:Blklivesmatter").items(20000)

In [None]:
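for tweet in query:
    # replies_donald is assumed to be a dict keyed by the id of each Trump tweet we collected
    if replies_donald.get(tweet.in_reply_to_status_id_str) is not None:
        pass  # add it to the csv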

In [None]:
for tweet in query2:
    # replies_blm is assumed to be a dict keyed by the id of each BLM tweet we collected
    if replies_blm.get(tweet.in_reply_to_status_id_str) is not None:
        pass  # add it to the csv

### Adding the data to their individual CSVs

<b> Put comments here about how you added the data to the CSVs and what purpose it had. </b>

In [6]:
# Put code here about how you added the data to the CSVs and what purpose it had.
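
Since the cell above is still a placeholder, here is one possible way to add rows to a CSV gradually across several runs. It is a hypothetical sketch, not the code we actually ran: the function name `append_rows` and the column names in `FIELDS` are assumptions chosen for illustration.

```python
import csv
import os

# Assumed column layout, for illustration only.
FIELDS = ["tweet_id", "in_reply_to_id", "username", "tweet_text"]

def append_rows(path, rows, fields=FIELDS):
    """Append dict rows to a CSV file and return how many were written.

    The header is written only when the file is new or empty, and rows
    whose tweet_id is already stored are skipped, so the function can be
    called again after each batch of queries.
    """
    has_data = os.path.exists(path) and os.path.getsize(path) > 0
    seen = set()
    if has_data:
        with open(path, newline="", encoding="utf-8") as f:
            seen = {row["tweet_id"] for row in csv.DictReader(f)}
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if not has_data:
            writer.writeheader()
        for row in rows:
            if row["tweet_id"] in seen:
                continue  # already saved in an earlier run
            writer.writerow(row)
            seen.add(row["tweet_id"])
            written += 1
    return written
```

Skipping ids that are already stored means an interrupted or repeated query does not create duplicate rows.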

### Collecting the Data in a Functional Way

Now that the data is collected in CSVs, we can load it into dataframes to work with.

Each dataframe includes only tweets, and their replies, from the past year, since the most controversial period for both accounts falls within that time frame. Specifically, Donald Trump became President of the United States in January 2017 and was controversial in the fall leading up to the election. Although the Black Lives Matter movement was created in 2013 and became nationally recognized in 2014 after the deaths of Michael Brown and Eric Garner, tweets from before Donald Trump became as controversial as he is now fall outside the scope of our comparison.

We also deleted columns we don't need for this analysis, like followers_count and following. Since the exported tweet data doesn't include the id of the tweet itself or (if it is a reply) the id of the tweet it is replying to, we put that information, along with the text of the tweet (to match rows), into a separate CSV and compared the two to add both sets of information to the dataframe.
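
The matching step could look roughly like this. It is a sketch with made-up rows; the column names `tweet_id` and `in_reply_to_id` are assumptions, since the separate id CSV is not shown in this notebook.

```python
import pandas as pd

# Stand-in for the main tweet data (no id columns).
tweets = pd.DataFrame({
    "username": ["user_a", "user_b"],
    "tweet_text": ["first example reply", "second example reply"],
})

# Stand-in for the separate CSV holding ids plus the text used for matching.
ids = pd.DataFrame({
    "tweet_text": ["first example reply", "second example reply", "unmatched text"],
    "tweet_id": ["101", "102", "103"],
    "in_reply_to_id": ["1", "1", "2"],
})

# Keep one id row per text so the merge cannot duplicate tweets.
ids = ids.drop_duplicates(subset="tweet_text")

# A left merge keeps every tweet, even one with no matching id row.
merged = tweets.merge(ids, on="tweet_text", how="left")
print(merged)
```

Matching on text is fragile when two tweets share the same wording, which is why the sketch drops duplicate texts before merging.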

In [7]:
# The Donald Tweets
data_donald = pd.read_csv("tweets_trump.csv")
# Drop the columns we don't use in the analysis.
data_donald = data_donald.drop(columns=[
    "name", "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_in_reply_to_screen_name", "tweet_direct_reply",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_donald.head()

Unnamed: 0,username,location,time_zone,tweet_time,tweet_text,tweet_retweet_status,tweet_hashtags,tweet_user_mentions,tweet_media_type,tweet_contributors
0,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),2017-12-07 21:10:28,"Across the battlefields, oceans, and harrowing...",False,,,,
1,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),2017-12-07 20:52:49,"Today, as we Remember Pearl Harbor, it was an ...",False,,,,
2,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),2017-12-07 20:04:20,"Today, the U.S. flag flies at half-staff at th...",False,,WhiteHouse,,
3,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),2017-12-07 16:16:19,"Today, our entire nation pauses to REMEMBER PE...",False,,,,
4,realDonaldTrump,"Washington, DC",Eastern Time (US & Canada),2017-12-07 15:04:54,"National Pearl Harbor Remembrance Day - ""A day...",False,,,,


In [8]:
# Replies of Trump
data_donald_replies = pd.read_csv("tweets_trump_replies.csv")
# Drop the columns we don't use in the analysis.
data_donald_replies = data_donald_replies.drop(columns=[
    "name", "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_in_reply_to_screen_name", "tweet_direct_reply",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_donald_replies.head()

Unnamed: 0,username,location,time_zone,tweet_time,tweet_text,tweet_retweet_status,tweet_hashtags,tweet_user_mentions,tweet_media_type,tweet_contributors
0,kucha688,"Toronto, Ontario, Canada",,2017-12-08 06:26:09,@realDonaldTrump @WhiteHouse https://t.co/Po4...,False,,"realDonaldTrump, WhiteHouse",photo,
1,Janez40,,,2017-12-08 06:26:09,@realDonaldTrump @POTUS Moron!,False,,"realDonaldTrump, POTUS",,
2,MassingillCindy,,Pacific Time (US & Canada),2017-12-08 06:26:08,@realDonaldTrump I'm not convinced that the ma...,False,,realDonaldTrump,,
3,Christo52218608,"Washington, USA",,2017-12-08 06:26:07,@realDonaldTrump Awesome job! Keep it up!,False,,realDonaldTrump,,
4,KlimpCarolyn,tucson,,2017-12-08 06:26:04,@realDonaldTrump No you did not you moron,False,,realDonaldTrump,,


In [9]:
# BLM tweets
data_blm = pd.read_csv("tweets_blm.csv")
# Drop the columns we don't use in the analysis.
data_blm = data_blm.drop(columns=[
    "name", "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_in_reply_to_screen_name", "tweet_direct_reply",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_blm.head()

Unnamed: 0,username,location,time_zone,tweet_time,tweet_text,tweet_retweet_status,tweet_hashtags,tweet_user_mentions,tweet_media_type,tweet_contributors
0,Blklivesmatter,worldwide,,2017-12-08 00:56:59,RT @KofiAdemola: #HandsOffJerusalemChi #FreePa...,True,"HandsOffJerusalemChi, FreePalestine","KofiAdemola, BLMChi, Blklivesmatter",,
1,Blklivesmatter,worldwide,,2017-12-08 00:51:59,RT @KofiAdemola: HandsOffJerusalemChi #FreePal...,True,FreePalestine,"KofiAdemola, BLMChi, Blklivesmatter",,
2,Blklivesmatter,worldwide,,2017-12-07 22:51:43,RT @BLMChi: Join us next Thursday!!! https://t...,True,,BLMChi,photo,
3,Blklivesmatter,worldwide,,2017-12-07 22:46:42,RT @BLMLA: A MUST READ on @Blklivesmatter! Fea...,True,,"BLMLA, Blklivesmatter, OsopePatrisse, DocMelly...",,
4,Blklivesmatter,worldwide,,2017-12-07 22:41:41,"RT @DocMellyMel: One of the most in-depth, tho...",True,,"DocMellyMel, Blklivesmatter",,


In [10]:
# Replies of BLM
data_blm_replies = pd.read_csv("tweets_blm_replies.csv")
# Drop the columns we don't use; "name" and "tweet_direct_reply" are kept here.
data_blm_replies = data_blm_replies.drop(columns=[
    "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_in_reply_to_screen_name",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_blm_replies.head()

Unnamed: 0,name,username,location,time_zone,tweet_time,tweet_text,tweet_direct_reply,tweet_retweet_status,tweet_hashtags,tweet_user_mentions,tweet_media_type,tweet_contributors
0,Krista,flowersandxtc,,,2017-12-08 05:10:29,@Blklivesmatter will advocate for BLM every mo...,True,False,,Blklivesmatter,,
1,Krista,flowersandxtc,,,2017-12-08 05:01:13,@Blklivesmatter My entire life I've faced a ra...,True,False,,Blklivesmatter,,
2,Cody,CRWW12,,,2017-12-08 03:24:42,@Blklivesmatter #whitelivesmatter #MAGA #FUCKYOU,True,False,"whitelivesmatter, MAGA, FUCKYOU",Blklivesmatter,,
3,vicks,anghiari34,,Hawaii,2017-12-08 01:20:51,@Blklivesmatter @womensmarch @womensmediacnt...,True,False,,"Blklivesmatter, womensmarch, womensmediacntr, ...",,
4,Ray Stern,raystern,"Phoenix, AZ",Arizona,2017-12-08 00:11:08,@Blklivesmatter Ex-Mesa cop Philip Brailsford ...,True,False,,Blklivesmatter,,


#### Sentiment Analysis

As explained before, sentiment analysis is a basic way of analyzing a text to tell whether it has positive, negative, or neutral sentiment. We use NLTK's VADER analyzer (SentimentIntensityAnalyzer) for this section. Its score will be one of the features in the feature vector.
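
VADER returns a `compound` score between -1 and 1 for each text. One common way to turn it into a label is to use cutoffs of ±0.05, the convention suggested by the VADER authors. The helper below is a sketch of that idea; the function name and the choice of threshold are assumptions, not something this project fixed.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'.

    The 0.05 cutoff is the commonly used convention and is an
    assumption here, not a value tuned for this project.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores in the style produced by sia.polarity_scores(...)["compound"].
for score in (-0.5411, 0.0, 0.6892):
    print(score, label_from_compound(score))
```

A label like this is only a starting point, since sarcasm and negation can push the compound score in the wrong direction.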


In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [12]:
sia = SentimentIntensityAnalyzer()

In [14]:
#Getting the sentiment analysis for each type of tweet
sentiment_nltk_donald = []
for _,x in data_donald.iterrows():
    sentiment_nltk_donald.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))
sentiment_nltk_blm = []
for _,x in data_blm.iterrows():
    sentiment_nltk_blm.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))
sentiment_nltk_donald_replies = []
for _,x in data_donald_replies.iterrows():
    sentiment_nltk_donald_replies.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))
sentiment_nltk_blm_replies = []
for _,x in data_blm_replies.iterrows():
    sentiment_nltk_blm_replies.append((x["tweet_text"],sia.polarity_scores(x['tweet_text'])))

In [15]:
# Show the first six replies with their sentiment scores
for entry in sentiment_nltk_donald_replies[:6]:
    print(entry)

('@realDonaldTrump @WhiteHouse  https://t.co/Po44ehzuU4', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})
('@realDonaldTrump @POTUS Moron!', {'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'compound': -0.5411})
("@realDonaldTrump I'm not convinced that the majority of Americans believe that. Heck, he helped you win by what he... https://t.co/OhZPoFZoLC", {'neg': 0.094, 'neu': 0.748, 'pos': 0.158, 'compound': 0.3699})
('@realDonaldTrump Awesome job!  Keep it up!', {'neg': 0.0, 'neu': 0.516, 'pos': 0.484, 'compound': 0.6892})
('@realDonaldTrump No you did not you moron', {'neg': 0.224, 'neu': 0.509, 'pos': 0.267, 'compound': 0.1098})
('@realDonaldTrump Coherent, grammatical, appropriate capitalization ... obviously ghostwritten.', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})

