# Natural Language Processing on Tweets: Sentiment Analysis and Text Classification

In [52]:
from __future__ import print_function
from textblob import TextBlob, Word
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import string
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import operator
from random import shuffle
from collections import Counter
import sys
import re
import json
from prettytable import PrettyTable
from IPython.display import HTML, display

# Set Seaborn defaults
sns.set()
%matplotlib inline
pd.set_option("display.precision", 6)
mpl.rcParams['figure.dpi'] = 100
mpl.rcParams['savefig.dpi'] = 150
mpl.rcParams['figure.autolayout'] = True

# Global variables
data_dir = "../data"
pictures_path = os.path.join("../Pictures", "11.NLP")
tweets_path = "../lib/GetOldTweets-python/out/completed"

In [2]:
# Model classes
class Tweet:
    def __init__(self, tweet_id, tweet_dict):
        self.tweet_id = tweet_id
        self.tweet_dict = tweet_dict
        
    def __eq__(self, other):
        if isinstance(other, Tweet):
            return self.tweet_id == other.tweet_id
        return NotImplemented
    
    def __ne__(self, other):
        x = self.__eq__(other)
        if x is not NotImplemented:
            return not x
        return NotImplemented
    
    def __hash__(self):
        return hash(self.tweet_id)

In [3]:
# Utility functions
def get_relative_percentage(n,m):
    return n*100.0/m

def read_large_file(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data.rstrip('\n')
        
# Extract tweets given a specific hashtag (include also retweeted/quoted tweets)
def get_tweets(hashtag):
    tweets_filename = os.path.join(tweets_path,"tweets_#" + hashtag + "_2013-09-01_2016-12-31.json")
    tweets = set()
    with open(tweets_filename) as fin:
        for line in read_large_file(fin):
            tweet_dict = json.loads(line)
            tweet_id = np.int64(tweet_dict["id_str"])
            t = Tweet(tweet_id, tweet_dict)
            for special in ["retweeted_status","quoted_status"]:
                if special in tweet_dict:
                    tweets.add(Tweet(np.int64(tweet_dict[special]["id_str"]), tweet_dict[special]))
                    t.tweet_dict.pop(special)
            tweets.add(t)
    print("Imported %d tweets from %s" %(len(tweets),tweets_filename))
    return tweets
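
A minimal, self-contained sketch of the de-duplication that `get_tweets` relies on: because `Tweet` compares and hashes by ID, an original tweet that is embedded in several retweets ends up in the set only once. The class below is a trimmed copy of `Tweet`, and the two JSON lines are made-up sample data, not taken from the dataset.

```python
import json

class Tweet(object):
    """Trimmed copy of the model class: identity is the tweet ID."""
    def __init__(self, tweet_id, tweet_dict):
        self.tweet_id = tweet_id
        self.tweet_dict = tweet_dict

    def __eq__(self, other):
        if isinstance(other, Tweet):
            return self.tweet_id == other.tweet_id
        return NotImplemented

    def __hash__(self):
        return hash(self.tweet_id)

# Two hypothetical retweets of the same original tweet (id 1)
lines = [
    '{"id_str": "2", "text": "RT hello", "retweeted_status": {"id_str": "1", "text": "hello"}}',
    '{"id_str": "3", "text": "RT hello", "retweeted_status": {"id_str": "1", "text": "hello"}}',
]

tweets = set()
for line in lines:
    tweet_dict = json.loads(line)
    # Detach the embedded original and store it as a tweet of its own
    original = tweet_dict.pop("retweeted_status", None)
    if original is not None:
        tweets.add(Tweet(int(original["id_str"]), original))
    tweets.add(Tweet(int(tweet_dict["id_str"]), tweet_dict))

ids = sorted(t.tweet_id for t in tweets)
print(ids)
```

The original tweet is added twice but stored once, so the set holds three tweets rather than four.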


Sentiment analysis and text classification of Tweets are common NLP tasks, but they are highly prone to misclassification because the text is often ambiguous and its wording may not match the sentiment the author intends to convey. Sarcastic Tweets are a typical example: their actual sentiment is hard to infer for a classifier trained only on plainly positive, negative, or neutral examples.

With this analysis, I want to achieve the following goals:
1. **Compute sentiment scores for different sets of hashtags**: Pick sets of Tweets for hashtags that are strongly left- or right-leaning (e.g. #ImWithHer and #BLM are left-leaning, i.e. Democrats/liberals, whereas #MAGA and #ALM are right-leaning, i.e. Republicans/conservatives) and compute the average sentiment score for each of them, first using the built-in classifier provided by TextBlob, pre-trained on movie reviews, and then using a custom classifier trained ad hoc on actual Tweets. Show the distribution of sentiment scores with boxplots, where the observations are individual tweets, together with basic stats (e.g. the percentage of tweets with positive/negative sentiment and the top positive/negative tweets);
2. **Political text classification**: Build a corpus of tweet texts for each category (Right and Left, for simplicity) and build a classifier that predicts political orientation (polarity) from the content of a tweet, using bag-of-words/TF-IDF approaches. This will not be a perfect classifier of political leaning, mainly because the content of a tweet that carries a certain hashtag might not entirely reflect an opinion on *that* hashtag.
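
To make the bag-of-words/TF-IDF idea in the second goal concrete, here is a minimal pure-Python sketch. It uses one common weighting variant (relative term frequency times the natural log of inverse document frequency); libraries apply slightly different smoothing, so treat the exact formula as an assumption. The function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # bag-of-words counts for this document
        weights.append({
            term: (float(count) / len(doc)) * math.log(float(n_docs) / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus, already tokenised
docs = [
    "make america great again".split(),
    "stronger together with her".split(),
    "make america stronger".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in fewer documents (here `great`) receive a higher weight than terms shared across documents (here `make`), which is what makes them useful features for a classifier.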

Predicting the political leaning of a user from both the content of their tweets and their role and properties in the network structure requires more elaborate steps and a careful methodology, which is beyond the time I have left for this project.

## 1. Sentiment Analysis
I've chosen to work mainly with the `TextBlob` Python package, which wraps the well-known `NLTK` (Natural Language Toolkit) package, widely adopted for Natural Language Processing tasks.
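
Once every tweet has a polarity score in the range [-1, 1] (the range of TextBlob's `sentiment.polarity`), the basic stats for a hashtag can be summarised as in the sketch below. The function name `summarize_polarity`, the sample scores, and the choice to treat a score of exactly 0 as neutral are assumptions made for illustration.

```python
def summarize_polarity(scores):
    """Mean polarity and share of positive/negative/neutral scores."""
    n = float(len(scores))
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return {
        "mean": sum(scores) / n,
        "pct_positive": positive * 100.0 / n,
        "pct_negative": negative * 100.0 / n,
        # Assumption: a score of exactly 0 counts as neutral
        "pct_neutral": (n - positive - negative) * 100.0 / n,
    }

# Hypothetical polarity scores for five tweets
scores = [0.5, -0.25, 0.0, 0.75, -0.5]
summary = summarize_polarity(scores)
print(summary)
```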

### 1.1 Pre-Processing
Tweets often include noisy data that should ideally be cleaned up prior to any analysis. Luckily, for the sake of sentiment analysis, `TextBlob` comes in handy: it offers most of the required pre-processing tasks out of the box and performs them automatically when calculating the sentiment score. The pre-processing tasks I would consider include (in order):
- Converting the text to lowercase
- Translation of non-English tweets *(ignored because of performance degradation)*
- URL stripping *(this has already been applied by the script used to download the Tweets)*
- Removing non-ASCII characters, symbols, numbers
- Removing User mentions
- Spell-checking and correction *(ignored because of performance degradation)*
- Removing stopwords
- Filtering out short words (e.g. $len(w) \leq 3$)
- Emoticon detection (already nicely handled by default by `TextBlob`)
- Stemming and lemmatisation, to group together the inflected forms of a word so they can be analyzed as a single item
- POS (Part-Of-Speech) tagging, to select only relevant features/tokens (like nouns, verbs, adjectives)

`TextBlob`, luckily, already performs most of the text parsing and pre-processing listed above before computing the sentiment score. However, we still want to perform all of these steps ourselves in order to carry out further text analysis. To run sentiment analysis without losing the information provided by punctuation, I define two functions that process the tweet text as described above: a *baseline cleaning function* (sufficient before the sentiment analysis task) and an *advanced cleaning function* (to be used before the text classification task).

In [7]:
%%time
# Read tweets
hashtags = set(["imwithher", "makeamericagreatagain"])
hashtag_to_tweets = {h:get_tweets(h) for h in hashtags}

Imported 1745238 tweets from ../lib/GetOldTweets-python/out/completed/tweets_#makeamericagreatagain_2013-09-01_2016-12-31.json
Imported 3189115 tweets from ../lib/GetOldTweets-python/out/completed/tweets_#imwithher_2013-09-01_2016-12-31.json
CPU times: user 8min 3s, sys: 12.5 s, total: 8min 16s
Wall time: 8min 16s


Let's first show some basic stats on the tweets, for example the hashtags most commonly used together with the main ones:

In [5]:
%%time
extra_hashtags = {}
for h in hashtag_to_tweets:
    extra_hashtags[h] = {}
    for t in hashtag_to_tweets[h]:
        for hashtag in t.tweet_dict["entities"]["hashtags"]:
            if hashtag.lower() == h:
                continue
            if hashtag.lower() in extra_hashtags[h]:
                extra_hashtags[h][hashtag.lower()] += 1
            else:
                extra_hashtags[h][hashtag.lower()] = 1

CPU times: user 1.5 s, sys: 56 ms, total: 1.56 s
Wall time: 1.47 s
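
The tally above can also be written with `collections.Counter`. This sketch assumes, as the loop above does, that each entry of `entities.hashtags` is a plain string; the function name `count_co_hashtags` and the sample tweets are hypothetical.

```python
from collections import Counter

def count_co_hashtags(tweet_dicts, main_hashtag):
    """Count hashtags that appear alongside main_hashtag (case-insensitive)."""
    counts = Counter()
    for tweet_dict in tweet_dicts:
        for tag in tweet_dict["entities"]["hashtags"]:
            tag = tag.lower()
            if tag != main_hashtag:
                counts[tag] += 1
    return counts

# Hypothetical sample data
sample = [
    {"entities": {"hashtags": ["ImWithHer", "Vote"]}},
    {"entities": {"hashtags": ["imwithher", "vote", "Debate"]}},
]
top = count_co_hashtags(sample, "imwithher").most_common(20)
print(top)
```

`most_common(20)` returns the pairs already sorted by frequency, so no separate `sorted` call is needed.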


In [6]:
for h in extra_hashtags:
    top_20 = sorted(extra_hashtags[h].items(), key=operator.itemgetter(1), reverse=True)[:20]
    t = PrettyTable(['Supporting Hashtags (%s)' %h, 'Frequency'])
    for el in top_20:
        t.add_row(el)
    print(t)

+-------------------------------------+-----------+
| Supporting Hashtags (jesuischarlie) | Frequency |
+-------------------------------------+-----------+
|             charliehebdo            |   44384   |
|               montréal              |   25791   |
|                polqc                |   17424   |
|             islamisation            |   16507   |
|                paris                |   15503   |
|               letter4u              |    8421   |
|              letter4all             |    8399   |
|              musulmans              |    7564   |
|               hadiths               |    7471   |
|             jesuisahmed             |    6188   |
|                france               |    5449   |
|          noussommescharlie          |    4599   |
|              jesuisjuif             |    4119   |
|             prisedotage             |    3100   |
|             jesuisparis             |    2906   |
|                islam                |    2899   |
|           

Let's now define all the functions needed for the pre-processing step:

In [70]:
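def remove_pattern(input_txt, pattern):
    # Remove every match of the pattern in a single pass. Re-substituting each
    # match as its own regex would also strip prefixes of longer matches
    # (e.g. "@bob" inside "@bobby") and break on regex metacharacters.
    return re.sub(pattern, '', input_txt)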

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

# Build these once: re-creating them for every word makes the cleaning extremely slow
PUNCTUATION_TABLE = {ord(c): None for c in string.punctuation}
STOP_WORDS = set(stopwords.words('english'))

def clean_tweet_baseline(tweet):
    # Note: the tweet dict is modified in place and also returned
    # Convert to lowercase
    tweet["text"] = tweet["text"].lower()
    
    # Convert non-English texts to English (disabled: translation takes up a lot of execution time)
    # if tweet["lang"] != "en":
    #     try:
    #         tweet["text"] = TextBlob(tweet["text"]).translate(from_lang=tweet["lang"], to="en").string
    #     except:
    #         pass
        
    # Remove handles (user mentions)
    tweet["text"] = remove_pattern(tweet["text"], r"@[\w]*")
    
    # Remove hashtag symbol as prefix for hashtags
    tweet["text"] = tweet["text"].replace('#','')
    
    # Fix classic slang/internet abbreviations
    abbreviations = [(r'\bthats\b','that is'),(r'\bive\b','i have'),(r'\bim\b','i am'),(r'\bya\b','yeah'),
                     (r'\bcant\b','can not'),(r'\bwont\b','will not'),(r'\bid\b','i would'),(r'wtf','what the fuck'),(r'\bwth\b','what the hell'),
                     (r'\br\b','are'),(r'\bu\b','you'),(r'\bk\b','OK'),(r'\bsux\b','sucks'),(r'\bno+\b','no'),(r'\bcoo+\b','cool'),
                     (r'\blol\b','lot of laughs')]
    for abb in abbreviations:
        tweet['text'] = re.sub(abb[0], abb[1], tweet["text"])
        
    # Strip non ASCII characters
    tweet["text"] = "".join((c for c in tweet["text"] if 0 < ord(c) < 127))
    
    # Remove numbers
    tweet["text"] = re.sub(r'\d+', '', tweet["text"])
    
    # Remove HTML entities (like '&amp;')
    tweet["text"] = re.sub(r'&\w+;', '', tweet["text"])
    
    # Remove short words (len <= 3, ignoring punctuation)
    tweet["text"] = " ".join([w for w in tweet["text"].split() if len(w.translate(PUNCTUATION_TABLE)) > 3])
    
    # Remove stop words
    tweet["text"] = " ".join([w for w in tweet["text"].split() if w.translate(PUNCTUATION_TABLE) not in STOP_WORDS])
    
    tweet["text"] = tweet["text"].strip()
    return tweet
    
def clean_tweet_advanced(tweet):
    # Singularize words (remove plurals)
    tweet["text"] = " ".join([w.singularize() for w in TextBlob(tweet["text"]).words])
    
    # Filter by relevant POS tags and lemmatize
    pos_tags = [(w,get_wordnet_pos(pos_tag)) for w,pos_tag in TextBlob(tweet["text"]).pos_tags if get_wordnet_pos(pos_tag) != '']
    tweet["text"] = " ".join([w.lemmatize(pos_tag) for w,pos_tag in pos_tags])
    
    tweet["text"] = tweet["text"].strip()
    return tweet
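
For the mention-removal step on its own, here is a minimal sketch, assuming a handle is `@` followed by word characters (the same pattern that is passed to `remove_pattern` above). The name `strip_mentions` and the sample text are hypothetical; unlike the cleaner above, this helper also collapses the whitespace left behind.

```python
import re

# "@" followed by any run of word characters (letters, digits, underscore)
MENTION_RE = re.compile(r"@\w*")

def strip_mentions(text):
    """Remove @handles and collapse the whitespace they leave behind."""
    return " ".join(MENTION_RE.sub("", text).split())

cleaned = strip_mentions("thanks @bob and @bobby_99 for the support")
print(cleaned)
```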

To keep execution times manageable, I'm forced to skip translation of non-English tweets, since the network round-trips required by the translator make it very slow. Furthermore, only tweets that express a logically complete and reasonable thought are likely to be useful for our analysis, so short tweets can be discarded. Let's then filter the tweets and keep only those with English text and at least 10 words:

In [48]:
for h in hashtag_to_tweets:
    len_before = len(hashtag_to_tweets[h])
    print("%s: Number of tweets before filtering: %d" %(h,len_before))
    hashtag_to_tweets[h] = {t for t in hashtag_to_tweets[h] if t.tweet_dict["lang"] == "en" and len(TextBlob(t.tweet_dict["text"]).words) >= 10}
    len_after = len(hashtag_to_tweets[h])
    print("%s: Number of tweets after filtering: %d (%.2f%% of %d)" %(h,len_after,get_relative_percentage(len_after,len_before),len_before))

makeamericagreatagain: Number of tweets before filtering: 1745238
makeamericagreatagain: Number of tweets after filtering: 1017104 (58.28% of 1745238)


And finally clean the tweets:

In [73]:
%%time
hashtag_to_cleaned_tweets = {}
for h in hashtag_to_tweets:
    hashtag_to_cleaned_tweets[h] = set()
    count = 0
    for t in hashtag_to_tweets[h]:
        t.tweet_dict = clean_tweet_baseline(t.tweet_dict)
        hashtag_to_cleaned_tweets[h].add(t)
        count += 1
        if count % 1000 == 0:
            print("Cleaned %d tweets" %count)

# Note: a set comprehension over clean_tweet_baseline(t.tweet_dict) is not an equivalent one-liner,
# because it would try to build a set of dicts (unhashable) instead of a set of Tweet objects

Cleaned 1000 tweets
Cleaned 2000 tweets
Cleaned 3000 tweets
Cleaned 4000 tweets
Cleaned 5000 tweets


KeyboardInterrupt: 