# CMSC320 Final Project
## Kinsey Smith, Sarah Bullard, Yiwen Shen

<img src="trump_twitter_image.jpg">

### Introduction

Our project centers on the Twitter account of the current president of the United States, Donald Trump (@realDonaldTrump). We focused on the sentiment of the tweets from this account versus the sentiment of each tweet's replies. Our goal was to measure the difference between the sentiment of a tweet and the sentiment of its replies, and to see how that difference reflects the current political climate. Our hypothesis was that replies to Donald Trump's positive tweets would skew negative, since the current political climate does not favor him.

Sentiment analysis classifies a text as positive, negative, or neutral using text analysis. However, because English is complex and often sarcastic, sentiment analysis alone is not a reliable way to categorize a reply as positive or negative. We therefore needed additional signals to determine the sentiment of a reply. To get them, we built a feature vector for each reply and used a support vector machine (SVM) to classify the reply's sentiment from that vector.

The features in our feature vector are as follows: 

Our first feature was the original sentiment analysis score: although it is not conclusive on its own, it can tell us something about the mood of the sentence.

The second feature focused on the user who posted the reply. We checked whether the user followed accounts aligned with Donald Trump's views or with those of Black Lives Matter (BLM), including politicians of either party.

The third feature also focused on the user who posted the reply. We compared the hashtags the user had used over the last year to known Trump-positive and BLM-positive hashtags, and recorded how many matched for each user.
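As a rough illustration of how the three features above could be combined, here is a minimal sketch. The reference sets, the `normalize` and `build_feature_vector` helpers, and the layout of the vector are all assumptions made for this example; they are not the lists or code used in the project.

```python
from typing import Iterable, List, Set

# Hypothetical reference sets, for illustration only.
TRUMP_ACCOUNTS: Set[str] = {"realdonaldtrump", "potus"}
BLM_ACCOUNTS: Set[str] = {"blklivesmatter"}
TRUMP_HASHTAGS: Set[str] = {"maga"}
BLM_HASHTAGS: Set[str] = {"blacklivesmatter", "blm"}


def normalize(names: Iterable[str]) -> List[str]:
    """Lowercase each name and strip a leading '#' or '@'."""
    return [name.lstrip("#@").lower() for name in names]


def build_feature_vector(compound: float,
                         following: Iterable[str],
                         hashtags: Iterable[str]) -> List[float]:
    """Return [sentiment, follows_trump, follows_blm, trump_tags, blm_tags]."""
    followed = set(normalize(following))
    tags = normalize(hashtags)  # keep duplicates so repeated use is counted
    return [
        compound,                                          # feature 1: sentiment score
        float(bool(followed & TRUMP_ACCOUNTS)),            # feature 2: follows a Trump-aligned account
        float(bool(followed & BLM_ACCOUNTS)),              # feature 2: follows a BLM-aligned account
        float(sum(tag in TRUMP_HASHTAGS for tag in tags)), # feature 3: matching Trump hashtags
        float(sum(tag in BLM_HASHTAGS for tag in tags)),   # feature 3: matching BLM hashtags
    ]


vector = build_feature_vector(
    compound=-0.5411,
    following=["@Blklivesmatter", "@nytimes"],
    hashtags=["#BLM", "#BlackLivesMatter", "#Mondays"],
)
print(vector)
# → [-0.5411, 0.0, 1.0, 0.0, 2.0]
```

A list of such vectors, one per reply, is the kind of input a classifier such as an SVM expects, together with one label per reply.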

This notebook is organized into four parts: Data Extraction, Data Manipulation, Data Analysis, and Data Visualization. Each part shows how we used Twitter's API to get the tweets we needed and how we reached our conclusions.

In [2]:
# All of the imports that we need for the project.
import tweepy
import textblob
import numpy as np
import pandas as pd
import collections
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

### Accessing Twitter's Data

To access Twitter's API, we had to create applications and obtain our own authentication tokens. Although anyone with a Twitter account can get access to Twitter's data by filling out an application form, we cannot publish these confidential tokens in this public notebook because doing so would be a security risk. To get around this, we created a function that reads the tokens from a file, credentials.py, kept on our own machines. Below is a copy of credentials.py without the confidential information, followed by the code we used to access Twitter's API and get the data.

#### An example of credentials.py 
#This is a file that holds confidential information about a Twitter user and their authentication tokens. Please do not read further if you are not authorized.
  
CONSUMER_KEY = ' '

CONSUMER_SECRET = ' '

ACCESS_TOKEN = ' '

ACCESS_SECRET = ' '


In [1]:
from credentials import *
#A function that takes these credentials and sets up the API.
def api_setup():
    authentication = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    authentication.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(authentication)
    return api
# Extracting the tweets
extract_tweets = api_setup()

In [None]:
# Donald Trump Replies
tweet_ids_donald = []
for page in tweepy.Cursor(extract_tweets.user_timeline,screen_name="realDonaldTrump").pages(20):
    for item in page:
        tweet_ids_donald.append(item.id_str)

In [None]:
query = tweepy.Cursor(extract_tweets.search,q="to:realDonaldTrump").items(5000)

In [None]:
query2 = tweepy.Cursor(extract_tweets.search,q="to:Blklivesmatter").items(20000)

In [None]:
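# NOTE: replies_donald is not defined in this notebook; it is assumed to be a
# dict keyed by the id_str of each original Trump tweet.
for tweet in query:
    if replies_donald.get(tweet.in_reply_to_status_id_str) is not None:
        pass  # placeholder: add it to the csv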

In [None]:
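# NOTE: replies_blm is not defined in this notebook; it is assumed to be a
# dict keyed by the id_str of each original BLM tweet.
for tweet in query2:
    if replies_blm.get(tweet.in_reply_to_status_id_str) is not None:
        pass  # placeholder: add it to the csv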

### Adding the data to their individual CSVs

Once we could access Twitter's API, we started querying. However, Twitter rate-limits its API (a maximum of 15 queries per 15 minutes in our case), and we were querying for the replies to more than 15 tweets, so we had to find another way to access the data while working with it. Twitter also only lets you retrieve tweets from about 2 weeks before the current date, so we used that as our time window for both tweets and their replies to measure the current political climate. We saved the tweets and their replies dated from <b> DATE OF TWEETS AND REPLIES </b> to individual CSVs so we could work with them offline. We added to these CSVs gradually to stay within the 15-query limit.

<b> Put comments here about how you added the data to the CSVs and what purpose it had. </b>

In [6]:
# Put code here about how you added the data to the CSVs and what purpose it had.
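
The incremental CSV collection described above could be sketched roughly as follows. This is not the code we ran: `append_rows` and the `FIELDNAMES` column subset are hypothetical names chosen for the example, and each row is assumed to already be a plain dict built from a tweet.

```python
import csv
import os
from typing import Dict, Iterable, List

# Hypothetical column subset; the project's CSVs have many more columns.
FIELDNAMES: List[str] = ["tweet_id", "username", "tweet_time", "tweet_text"]


def append_rows(path: str, rows: Iterable[Dict[str, str]]) -> int:
    """Append rows to a CSV, writing the header only when the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)
            written += 1
    return written
```

Calling `append_rows` once per batch of results means a rate-limit pause between batches never loses data already collected. Storing `tweet_id` as a string keeps all of its digits. As an alternative to pacing the queries by hand, tweepy's `API` constructor accepts `wait_on_rate_limit=True`, which makes the client sleep until the rate-limit window resets.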

### Collecting the Data in a Functional Way

Now that the data is collected in CSVs, we can load it into dataframes to work with.

We also deleted columns we do not need for this analysis, like followers_count and following. Since the tweet data we exported did not include the tweet ID itself or (for a reply) the ID of the tweet it is replying to, we put that information, along with the text of the tweet (to match on), into a separate CSV and compared the two files to add both sets of information to the dataframe.
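
The matching step described above can be sketched with a pandas merge on the tweet text. The two in-memory CSVs and the `in_reply_to_status_id` column name are made up for this example; the real files have more columns and rows.

```python
import io

import pandas as pd

# Hypothetical miniature versions of the two CSVs, for illustration only.
tweets_csv = io.StringIO(
    "username,tweet_text\n"
    "user_a,first reply\n"
    "user_b,second reply\n"
)
ids_csv = io.StringIO(
    "tweet_text,tweet_id,in_reply_to_status_id\n"
    "first reply,939849900000000001,939680422493073408\n"
    "second reply,939849900000000002,939680422493073408\n"
)

tweets = pd.read_csv(tweets_csv)
# Read the IDs as strings: 18-digit IDs lose precision if pandas falls back
# to floats (for example when some values are missing).
ids = pd.read_csv(ids_csv, dtype={"tweet_id": str, "in_reply_to_status_id": str})

# A left join keeps every tweet, even if no ID row matches its text.
merged = tweets.merge(ids, on="tweet_text", how="left")
print(merged[["username", "tweet_id", "in_reply_to_status_id"]])
```

Matching on text assumes each tweet text is unique; if two replies share the same text, the merge produces extra rows, so checking for duplicates first would be a sensible precaution.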

In [3]:
# The Donald Tweets
data_donald = pd.read_csv("trump3.csv")
# Drop the columns we do not use in the analysis.
data_donald = data_donald.drop(columns=[
    "name", "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_in_reply_to_screen_name", "tweet_direct_reply",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_donald.head()

| Unnamed: 0 | username | location | time_zone | tweet_id | tweet_time | tweet_text | tweet_retweet_status | tweet_hashtags | tweet_user_mentions | tweet_media_type | tweet_contributors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | realDonaldTrump | Washington, DC | Eastern Time (US & Canada) | 939680422493073408 | 2017-12-10 02:17:25 | No American should be separated from their lov... | False |  |  |  |  |
| 1 | realDonaldTrump | Washington, DC | Eastern Time (US & Canada) | 939642796289470464 | 2017-12-09 23:47:55 | Great Army - Navy Game. Army wins 14 to 13 and... | False |  |  |  |  |
| 2 | realDonaldTrump | Washington, DC | Eastern Time (US & Canada) | 939634404267380736 | 2017-12-09 23:14:34 | .@daveweigel of the Washington Post just admit... | False |  | daveweigel |  |  |
| 3 | realDonaldTrump | Washington, DC | Eastern Time (US & Canada) | 939616077356642304 | 2017-12-09 22:01:44 | .@DaveWeigel @WashingtonPost put out a phony p... | False |  | daveweigel, washingtonpost |  |  |
| 4 | realDonaldTrump | Washington, DC | Eastern Time (US & Canada) | 939564681743814661 | 2017-12-09 18:37:31 | Have a great game today, @USArmy and @USNavy -... | False |  | USArmy, USNavy |  |  |


In [8]:
# Replies of Trump
data_donald_replies = pd.read_csv("trump_replies_7.csv")
# Drop the columns we do not use in the analysis.
data_donald_replies = data_donald_replies.drop(columns=[
    "name", "followers_count", "listed_count", "following", "favorites",
    "verified", "default_profile", "statuses_count", "description",
    "geo_enabled", "contributors_enabled", "tweet_lat", "tweet_long",
    "tweet_source", "tweet_direct_reply", "tweet_retweet_status",
    "tweet_retweet_count", "tweet_favorite_count", "tweet_hashtags_count",
    "tweet_urls", "tweet_urls_count", "tweet_user_mentions_count",
])
data_donald_replies.head()

| Unnamed: 0 | username | location | time_zone | tweet_id | tweet_time | tweet_text | tweet_in_reply_to_screen_name | tweet_hashtags | tweet_user_mentions | tweet_media_type | tweet_contributors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ToskovVeselin |  |  | 9.398499e+17 | 2017-12-10 13:35:28 | @realDonaldTrump https://t.co/42XcWGph5Y | realDonaldTrump |  | realDonaldTrump | photo |  |
| 1 | Charles5077 | Tennessee, USA | Pacific Time (US & Canada) | 9.398499e+17 | 2017-12-10 13:35:28 | @realDonaldTrump For now | realDonaldTrump |  | realDonaldTrump |  |  |
| 2 | imagerm69 | Texas, USA |  | 9.398499e+17 | 2017-12-10 13:35:28 | @realDonaldTrump Humpty Dumpty news had a grea... | realDonaldTrump |  | realDonaldTrump |  |  |
| 3 | dergalev | Chelyabinsk, Russia | Asia/Yekaterinburg | 9.398499e+17 | 2017-12-10 13:35:28 | @realDonaldTrump Even when you're watching TV ... | realDonaldTrump | FakeNewsMedia | realDonaldTrump |  |  |
| 4 | KevinLoudin | Cumming, GA | Eastern Time (US & Canada) | 9.398499e+17 | 2017-12-10 13:35:28 | @realDonaldTrump Thanks Obama! | realDonaldTrump |  | realDonaldTrump |  |  |


#### Sentiment Analysis

As explained above, sentiment analysis is a basic way of classifying a text as positive, negative, or neutral. We use NLTK's VADER sentiment analyzer for this section. The resulting score is one of the features in the feature vector.


In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [12]:
sia = SentimentIntensityAnalyzer()

In [14]:
#Getting the sentiment analysis for each type of tweet
# Each entry is a (tweet_text, polarity_scores) tuple.
sentiment_nltk_donald = [
    (text, sia.polarity_scores(text)) for text in data_donald["tweet_text"]
]
sentiment_nltk_donald_replies = [
    (text, sia.polarity_scores(text)) for text in data_donald_replies["tweet_text"]
]

In [15]:
# Print the first six replies with their scores.
for entry in sentiment_nltk_donald_replies[:6]:
    print(entry)

('@realDonaldTrump @WhiteHouse  https://t.co/Po44ehzuU4', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})
('@realDonaldTrump @POTUS Moron!', {'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'compound': -0.5411})
("@realDonaldTrump I'm not convinced that the majority of Americans believe that. Heck, he helped you win by what he... https://t.co/OhZPoFZoLC", {'neg': 0.094, 'neu': 0.748, 'pos': 0.158, 'compound': 0.3699})
('@realDonaldTrump Awesome job!  Keep it up!', {'neg': 0.0, 'neu': 0.516, 'pos': 0.484, 'compound': 0.6892})
('@realDonaldTrump No you did not you moron', {'neg': 0.224, 'neu': 0.509, 'pos': 0.267, 'compound': 0.1098})
('@realDonaldTrump Coherent, grammatical, appropriate capitalization ... obviously ghostwritten.', {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})
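
To turn the scores above into a positive, negative, or neutral label, one option is to threshold the `compound` value. The sketch below uses the cut-offs of ±0.05 that are conventional for VADER; they are an assumption here, not a setting taken from this project, and `label_sentiment` is a hypothetical helper.

```python
from typing import Dict

# Conventional VADER cut-offs; an assumption, not the project's setting.
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def label_sentiment(scores: Dict[str, float]) -> str:
    """Map a VADER polarity_scores dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


# Score dicts copied from the output above.
examples = [
    {"neg": 0.636, "neu": 0.364, "pos": 0.0, "compound": -0.5411},
    {"neg": 0.0, "neu": 0.516, "pos": 0.484, "compound": 0.6892},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.224, "neu": 0.509, "pos": 0.267, "compound": 0.1098},
]
print([label_sentiment(scores) for scores in examples])
# → ['negative', 'positive', 'neutral', 'positive']
```

The last example is the reply "No you did not you moron", which this rule labels positive. Cases like this are why the sentiment score is only one feature among several.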

