# Hate Speech on Twitter
## Deliverable 04
## Amir ElTabakh
## 3/22/2022

Since we are working with the Elevated Product, I can only access tweets posted within the last week, so I query for `#covid19` to get a fair sample of tweets. Only about 6,600 tweets were posted with that hashtag in that period.

The entire pipeline consists of the following:
- Scraping tweets
- Converting the JSON objects to a DataFrame
- Selecting only the important columns
- De-truncating the tweets: the API as used here only returns the first 140 characters of each tweet, but a tweet may be up to 280 characters long. To get the remaining text I have to call the API again, passing in the tweet ID, and retrieve the whole tweet.
- Cleaning all the tweets: removing punctuation, hyperlinks, and stopwords, and converting the text to lowercase. This improves the performance of any text processing techniques.
- Converting the timestamp of all tweets from UTC to EST
- Performing sentiment analysis: computing positivity [0, 1], negativity [0, 1], neutrality [0, 1], and a metric called compound [-1, 1], which aggregates positivity, negativity, and neutrality. I also gather polarity [-1, 1] and subjectivity [0, 1].
- Gathering the coordinates of each tweet from the `user.location` column provided by the API. These coordinates are helpful for geographic mapping.
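
Many accounts share the same free-text `user.location` value, so one way to limit geocoding requests is to look up each distinct string once and reuse the result. The sketch below assumes a caller-supplied `geocode` callable (hypothetical here; in practice it could wrap a geopy geocoder) that returns a `(latitude, longitude)` tuple or `None`. The stub `fake_geocode` and its coordinates are illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coords = Optional[Tuple[float, float]]


def geocode_unique(locations: Iterable[Optional[str]],
                   geocode: Callable[[str], Coords]) -> List[Coords]:
    """Return coordinates for each location, calling `geocode` once per distinct string."""
    cache: Dict[str, Coords] = {}
    results: List[Coords] = []
    for raw in locations:
        # Normalise so "New York, NY" and " new york, ny " share one lookup.
        key = (raw or "").strip().lower()
        if not key:
            results.append(None)  # empty or missing location: nothing to look up
            continue
        if key not in cache:
            cache[key] = geocode(key)
        results.append(cache[key])
    return results


# Stand-in for a real geocoder (hypothetical; coordinates are illustrative).
calls: List[str] = []


def fake_geocode(name: str) -> Coords:
    calls.append(name)
    return {"new york, ny": (40.71, -74.01)}.get(name)


coords = geocode_unique(["New York, NY", " new york, ny ", "", None, "Mars"], fake_geocode)
```

Here five input rows trigger only two lookups; how much this saves on real data depends on how often location strings repeat.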

For 6,600 tweets the entire pipeline takes 2 hours, 36 minutes, and 28 seconds. This is a problem: the pipeline is not scalable.
- For 2 million tweets the pipeline will require 31.5 days of continuous running.
- For 10 million tweets the pipeline will require 5.2 months of continuous running.
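
These projections are a linear extrapolation of the measured run. The sketch below reproduces the arithmetic under two assumptions: that the sample held exactly 6,600 tweets (the text gives only an approximate count) and that runtime scales linearly with the number of tweets. With those inputs it yields roughly 33 days and 165 days, the same order as the figures above.

```python
SECONDS_PER_DAY = 86_400

# Measured run: 2 hours, 36 minutes, 28 seconds.
SAMPLE_SECONDS = 2 * 3600 + 36 * 60 + 28
SAMPLE_TWEETS = 6_600  # assumed exact; the text says "about 6,600"


def projected_days(n_tweets: int) -> float:
    """Linearly scale the measured runtime to n_tweets and return it in days."""
    seconds_per_tweet = SAMPLE_SECONDS / SAMPLE_TWEETS
    return n_tweets * seconds_per_tweet / SECONDS_PER_DAY


days_2m = projected_days(2_000_000)
days_10m = projected_days(10_000_000)
print(f"2M tweets:  {days_2m:.1f} days")
print(f"10M tweets: {days_10m:.1f} days")
```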

The pipeline itself is generally quite fast. Only two portions consume most of the time: de-truncating the tweets and collecting coordinates for each tweet. Both processes have to connect to an API and make a request for each individual tweet. For now I have removed these two portions from the pipeline. Moving forward I'll be looking into ways to optimize it.
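
One possible way to cut the number of de-truncation requests is to fetch tweets in batches rather than one at a time. The standard v1.1 API accepts `tweet_mode="extended"`, which returns the untruncated text in a `full_text` field, and its status lookup endpoint accepts up to 100 IDs per request (tweepy exposes it as `API.lookup_statuses`). The sketch below covers only the offline parts: splitting IDs into batches and reading the full text from a status payload. The helper names and sample payloads are hypothetical, and the network call itself is left as a comment.

```python
from typing import Any, Dict, Iterator, List, Sequence

BATCH_SIZE = 100  # per-request ID limit of the v1.1 status lookup endpoint


def batched(ids: Sequence[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def full_text(status: Dict[str, Any]) -> str:
    """Return the untruncated text of a status payload.

    Extended-mode payloads carry `full_text`; compatibility-mode payloads
    only carry `text`, so fall back to that.
    """
    return status.get("full_text") or status.get("text", "")


# Hypothetical usage with tweepy (not executed here):
#   for chunk in batched(tweet_ids):
#       for status in api.lookup_statuses(chunk, tweet_mode="extended"):
#           texts[status.id] = full_text(status._json)

batches = list(batched(list(range(250))))
sample = [{"id": 1, "full_text": "a long tweet ..."}, {"id": 2, "text": "short"}]
texts = [full_text(s) for s in sample]
```

At 100 tweets per request, 6,600 tweets would need 66 lookups instead of 6,600. Passing `tweet_mode="extended"` to the search call itself may remove the need for a second pass altogether.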

The pipeline below returns everything except the de-truncated text and the coordinates. Some tweets may therefore be incomplete, which may affect the sentiment analysis and any text analysis done in the future.

In [None]:
# Libraries you will need in order to run the cell below
# You only need to run this cell once on your device

!python -m pip install git+https://github.com/tweepy/tweepy@master # install tweepy
!pip install nltk # natural language library
!pip install textblob # polarity and subjectivity scores
!pip install vaderSentiment # sentiment analysis
!pip install geopy # get coordinates
!pip install geopandas # plot coordinates

In [None]:
###########################################
# Connecting to Twitter API
###########################################

# importing dependencies
import tweepy as tw
import pandas as pd

# Twitter keys, tokens, and secrets are saved in a separate config file on my local device
from config import consumer_key, consumer_secret, access_token, access_secret, bearer_token

# authenticate
auth = tw.OAuthHandler(consumer_key, consumer_secret)
api = tw.API(auth, wait_on_rate_limit=True)


###########################################
# Getting Tweets
###########################################

# get tweets from the API
search_query = "#covid19 -filter:retweets"

# center of united states
latitude = "37.09024"
longitude = "-95.712891"
radius = "791mi"
location = f"{latitude},{longitude},{radius}"
num_of_tweets = 10000

tweets = tw.Cursor(api.search_tweets,
                  q = search_query,
                  lang = "en",
                  geocode = location).items(num_of_tweets) # getting at most num_of_tweets tweets

# store the API responses in a list
tweets_copy = list(tweets)

# report the number actually returned, which may be fewer than requested
print(f"{len(tweets_copy)} tweets have been scraped.")

###########################################
# Saving scraped data as a DataFrame object
###########################################
    
# Let's create a DataFrame for our scraped tweets.
# json_normalize flattens nested fields (e.g. user.location) and builds the
# frame in one pass; DataFrame.append was removed in pandas 2.0.
df = pd.json_normalize([tweet._json for tweet in tweets_copy])
    
list_of_features_to_keep = ['created_at', 'id', 'id_str', 'text', 'truncated', 'entities.hashtags', 'entities.user_mentions',
                            'source', 'user.location', 'user.id', 'user.id_str', 'user.name',
                            'user.screen_name', 'user.description', 'user.url', 
                            'coordinates', 'place', 'contributors', 'is_quote_status', 'user.statuses_count',
                            'user.followers_count', 'user.friends_count', 'user.listed_count', 'user.created_at',
                            'user.favourites_count', 'retweet_count', 'favorite_count', 'lang', 'possibly_sensitive'
                            ]
df = df[list_of_features_to_keep]

# reset index
df = df.reset_index(drop=True)

print(f"DataFrame has been created with {len(df)} scraped tweets.")

###########################################
# Cleaning DataFrame
###########################################
        
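# clean text using regex (remove hyperlinks and punctuation, remove stop words, and convert to lowercase)
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# one-time downloads of the stop-word list and tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')
stopwords = stopwords.words('english')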

tweets_cleaned = []

for tweet in df['text']:
    # remove hyperlinks
    tweet = re.sub(r'https?://\S+', "", tweet)

    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)

    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove punctuation
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    
    # convert to lower case
    tweet = tweet.lower()
    
    # remove stop words
    tweet_tokens = word_tokenize(tweet)
    tweet_without_sw = [word for word in tweet_tokens if word not in stopwords]
    tweet = " ".join(tweet_without_sw)
    
    # append tweet text to list of cleaned tweet texts
    tweets_cleaned.append(tweet)

# convert list to Pandas Series object
tweets_cleaned_series = pd.Series(tweets_cleaned)

# Add text_cleaned as column to df
df.insert(df.columns.get_loc("text") + 1, "text_cleaned", tweets_cleaned_series)

print("All tweets have been cleaned.")


# Convert timestamp from UTC to EST
from pytz import timezone
from datetime import datetime

def utc_to_est(tweet_timestamp):
    '''
    Utility function to convert a Twitter timestamp string from UTC to
    US Eastern time (EST or EDT, depending on the date).
    '''
    eastern = timezone('US/Eastern')
    utc = timezone('UTC')
    created_at = datetime.strptime(tweet_timestamp, '%a %b %d %H:%M:%S +0000 %Y')
    utc_created_at = utc.localize(created_at)
    est_created_at = utc_created_at.astimezone(eastern)

    return est_created_at

# tweet created at
# (column-wise assignment avoids the chained indexing df[col][i] = ...,
# which pandas may apply to a temporary copy)
df['created_at'] = df['created_at'].apply(utc_to_est)

# user account created at
df['user.created_at'] = df['user.created_at'].apply(utc_to_est)
    
print("Timestamps have been converted to EST.")
    
###########################################
# Performing Sentiment Analysis
# - Generating new columns for df
###########################################
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# calculating polarity and subjectivity for all tweets
df[['polarity', 'subjectivity']] = df['text_cleaned'].apply(lambda text: pd.Series(TextBlob(text).sentiment))

# create the analyzer once; constructing it inside the loop reloads the lexicon for every tweet
analyzer = SentimentIntensityAnalyzer()

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, text in df['text_cleaned'].items():
    # using Vader to get sentiment scores
    score = analyzer.polarity_scores(text)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    
    # determine sentiment
    if neg > pos:
        df.loc[index, 'sentiment'] = "negative"
    elif pos > neg:
        df.loc[index, 'sentiment'] = "positive"
    else:
        df.loc[index, 'sentiment'] = "neutral"
    
    # populating neg, neutral, positive, and compound columns
    df.loc[index, 'neg'] = neg
    df.loc[index, 'neu'] = neu
    df.loc[index, 'pos'] = pos
    df.loc[index, 'compound'] = comp
    
print("Sentiment Analysis Complete.")

###########################################
# output DataFrame to csv file
###########################################

filename = "covid19_tweets_01.csv"
df.to_csv(filename)

print(f"File has been written to {filename}")