# KnightHack 4 ~ Climate Change Tweets Sentiment Analysis
## A Survey of Different Models to do Sentiment Analysis

by: [John Muchovej](john.muchovej.com)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
from pathlib import Path
data = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data[filename] = Path(dirname) / filename

# Any results you write to the current directory are saved as output.
data

In [None]:
from IPython.display import (
    Markdown as md,
    Latex,
    HTML,
)
from tqdm.auto import tqdm

# Cursory Analysis

We'll start out by loading up the Twitter Sentiment Data and doing a bit of exploration to get a feel for what's going on with the data.

In [None]:
tweets = pd.read_csv(data["twitter_sentiment_data.csv"])

In [None]:
display(tweets.shape)

`pd.DataFrame.shape` returns a tuple of (# rows, # columns). This tells us that we have ~44K Tweets (rows) and each Tweet has 3 features (columns).

In [None]:
value_counts = tweets["sentiment"].value_counts()
value_counts.name = "Raw Number"

value_normd = tweets["sentiment"].value_counts(normalize=True)
value_normd.name = "Percentage"

display(pd.concat([value_counts, value_normd], axis=1))

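`value_counts`, called on a single column (a `pd.Series`) such as `tweets["sentiment"]`, returns each unique value alongside how many times it appears, sorted from most to least frequent. Passing `normalize=True` returns fractions of the total instead of raw counts.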
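
As a rough illustration of what `value_counts` computes, here is the same tally in plain Python using `collections.Counter`. The `labels` list is made-up toy data, not the dataset, and this is only a sketch of the idea rather than how pandas implements it.

```python
from collections import Counter

# Toy labels (hypothetical, not taken from the dataset).
labels = [1, 1, 2, 0, 1, -1, 2, 1]

# Raw counts per unique value, most frequent first (like value_counts()).
counts = Counter(labels).most_common()

# Fractions of the total (like value_counts(normalize=True)).
total = len(labels)
fractions = [(value, n / total) for value, n in counts]

print(counts)
print(fractions)
```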

In [None]:
display(tweets.head())

`pd.DataFrame.head` gives us the first 5 (by default) rows of the `tweets` DataFrame. This gives us a preview of the kinds of data we have in `tweets`.

# EDA (Exploratory Data Analysis)

**Before we pick up on our analysis, let's make a copy of the `pd.DataFrame` so we can feed `tweets` into our models later.**

We're going to start an Exploratory Data Analysis **(EDA)**. The first step of any Machine Learning project is to develop an understanding of your data, as that will help with model selection later on.

In [None]:
from copy import deepcopy
eda = deepcopy(tweets)
# tqdm.pandas()

First up, I have a strong aversion to keeping track of numeric keys. So let's replace all numeric values with the appropriate **labels**, given by the dataset.

In [None]:
sentiment_num2name = {
    -1: "Anti",
     0: "Neutral",
     1: "Pro",
     2: "News",
}
eda["sentiment"] = eda["sentiment"].apply(lambda num: sentiment_num2name[num])
eda.head()

In [None]:
from matplotlib import pyplot as plt
from matplotlib import style

import seaborn as sns

sns.set(font_scale=1.5)
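# "seaborn-poster" was renamed "seaborn-v0_8-poster" in newer Matplotlib releases
poster = "seaborn-v0_8-poster" if "seaborn-v0_8-poster" in style.available else "seaborn-poster"
style.use(poster)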

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 10), dpi=100)

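sns.countplot(x=eda["sentiment"], ax=axes[0])

# value_counts() sorts by frequency, so take the labels from its index
# to keep each pie slice matched with the right sentiment.
counts = eda["sentiment"].value_counts()
labels = list(counts.index)

axes[1].pie(counts,
            labels=labels,
            autopct="%1.0f%%",
            startangle=90,
            explode=tuple([0.1] * len(labels)))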

fig.suptitle("Distribution of Tweets", fontsize=20)
plt.show()

Next, Twitter uses hashtags almost like a summarization feature (at least in the sense of highlighting core ideas), so let's look at some of the top hashtags for each class of `sentiment`. We'll then plot them as bar charts to visualize their prominence.
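
To make the extraction step concrete, here is a minimal sketch on two made-up Tweets. It uses the `#(\w+)` pattern and lower-casing from this notebook, with `collections.Counter` standing in for `nltk.FreqDist`. The `toy_tweets` list is purely illustrative.

```python
import re
from collections import Counter

# Made-up Tweets for illustration only.
toy_tweets = [
    "Watch #BeforeTheFlood tonight! #climate",
    "RT: #Climate policy matters #ParisAgreement",
]

# Capture the word characters that follow each '#', then lower-case them
# so "#Climate" and "#climate" are counted together.
hashtags = [
    tag.lower()
    for tweet in toy_tweets
    for tag in re.findall(r"#(\w+)", tweet)
]

top = Counter(hashtags).most_common(2)
print(top)
```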

In [None]:
import re
import nltk
import itertools

In [None]:
top15 = {}

by_sentiment = eda.groupby("sentiment")
for sentiment, group in tqdm(by_sentiment):
    hashtags = group["message"].apply(lambda tweet: re.findall(r"#(\w+)", tweet))
    hashtags = itertools.chain(*hashtags)
    hashtags = [ht.lower() for ht in hashtags]
    
    frequency = nltk.FreqDist(hashtags)
    
    df_hashtags = pd.DataFrame({
        "hashtags": list(frequency.keys()),
        "counts": list(frequency.values()),
    })
    top15_htags = df_hashtags.nlargest(15, columns=["counts"])
    
    top15[sentiment] = top15_htags.reset_index(drop=True)

display(pd.concat(top15, axis=1).head(n=10))

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(28, 20))
counter = 0

for sentiment, top in top15.items():
    sns.barplot(data=top, y="hashtags", x="counts", palette="Blues_d", ax=axes[counter // 2, counter % 2])
    axes[counter // 2, counter % 2].set_title(f"Most frequent Hashtags by {sentiment} (Visually)", fontsize=25)
    counter += 1
plt.show()

### Observations:

- The most popular hashtags are, broadly, **climate** and **climatechange**, which is expected given the topic. However, the top 3 also include hashtags relating to **trump** and his campaign slogan, **maga**.
- The **BeforeTheFlood** hashtag refers to a 2016 documentary in which Leonardo DiCaprio met with scientists, activists, and world leaders to discuss the dangers of climate change and possible solutions.
- **COP22**, **ParisAgreement**, and **Trump** in the **Pro** `sentiment` are likely related to the formal process Trump's administration began to exit the Paris Agreement, under which nearly 200 nations pledged to reduce greenhouse gas emissions, assist developing nations, and assist [poor] nations struggling with the consequences of a warming Earth.
- Interestingly, **auspol** (short for Australian Politics) made the shortlist of the **Pro** `sentiment`. This is likely attributable to a published assessment quantifying the role of climate change in Australian bushfires and their increased risk of occurring.

In [None]:
def cleaner(tweet):
    tweet = tweet.lower()
    
    to_del = [
        r"@[\w]*",  # strip account mentions
        r"http(s?):\/\/.*\/\w*",  # strip URLs
        r"#\w*",  # strip hashtags
        r"\d+",  # delete numeric values
        "\ufffd",  # remove the "character not present" diamond (U+FFFD)
    ]
    for key in to_del:
        tweet = re.sub(key, "", tweet)
    
    # strip punctuation and special characters
    tweet = re.sub(r"[,.;':@#?!\&/$]+\ *", " ", tweet)
    # strip excess white-space
    tweet = re.sub(r"\s\s+", " ", tweet)
    
    return tweet.lstrip(" ")

In [None]:
eda["message"] = eda["message"].apply(cleaner)
eda.head()

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords, wordnet  

In [None]:
def lemmatizer(df):
    df["length"] = df["message"].str.len()
    df["tokenized"] = df["message"].apply(word_tokenize)
    df["parts-of-speech"] = df["tokenized"].apply(nltk.tag.pos_tag)
    
    def str2wordnet(tag):
        conversion = {"J": wordnet.ADJ, "V": wordnet.VERB, "N": wordnet.NOUN, "R": wordnet.ADV}
        try:
            return conversion[tag[0].upper()]
        except KeyError:
            return wordnet.NOUN
    
    wnl = WordNetLemmatizer()
    df["parts-of-speech"] = df["parts-of-speech"].apply(
        lambda tokens: [(word, str2wordnet(tag)) for word, tag in tokens]
    )
    df["lemmatized"] = df["parts-of-speech"].apply(
        lambda tokens: [wnl.lemmatize(word, tag) for word, tag in tokens]
    )
    df["lemmatized"] = df["lemmatized"].apply(lambda tokens: " ".join(map(str, tokens)))
    
    return df

In [None]:
eda = lemmatizer(eda)
eda.head()

In [None]:
plt.figure(figsize=(15, 15))
sns.boxplot(x="sentiment", y="length", data=eda, palette=("Blues_d"))
plt.title("Tweet Length Distribution for each Sentiment")
plt.show()

## The Buzzwords

Below, we'll compute the frequency of words for each `sentiment`. Following that, we'll build `WordCloud`s to visualize these words.

`WordCloud`s convey importance mainly through size, so the smaller a word, the less frequently it appears.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
frequency = {}

by_sentiment = eda.groupby("sentiment")
for sentiment, group in tqdm(by_sentiment):
    cv = CountVectorizer(stop_words="english")
    words = cv.fit_transform(group["lemmatized"])
    
    n_words = words.sum(axis=0)
    word_freq = [(word, n_words[0, idx]) for word, idx in cv.vocabulary_.items()]
    word_freq = sorted(word_freq, key=lambda x: x[1], reverse=True)
    
    freq = pd.DataFrame(word_freq, columns=["word", "freq"])
    
    frequency[sentiment] = freq.head(n=25)

to_view = pd.concat(frequency, axis=1).head(n=25)
display(to_view)

Now that we've computed the frequency, let's generate and plot the WordClouds for each `sentiment`.

In [None]:
# Map each word to its count so the clouds can scale words by frequency.
words = {
    sentiment: dict(zip(frequency[sentiment]["word"], frequency[sentiment]["freq"]))
    for sentiment in sentiment_num2name.values()
}

cmaps = {
    "Anti": ("Reds", 110),
    "Pro" : ("Greens", 73),
    "News": ("Blues", 0),
    "Neutral": ("Oranges", 10),
}

from wordcloud import WordCloud

wordclouds = {}
for sentiment, (cmap, rand) in tqdm(cmaps.items()):
    wordclouds[sentiment] = WordCloud(
        width=800, height=500, random_state=rand,
        max_font_size=110, background_color="white",
        colormap=cmap
    ).generate_from_frequencies(words[sentiment])
    
fig, axes = plt.subplots(2, 2, figsize=(28, 20))
counter = 0

for sentiment, wordcloud in wordclouds.items():
    axes[counter // 2, counter % 2].imshow(wordcloud)
    axes[counter // 2, counter % 2].set_title(sentiment, fontsize=25)
    counter += 1
    
for ax in fig.axes:
    plt.sca(ax)
    plt.axis("off")

plt.show()

### Observations:

- The top 3 buzzwords are **climate**, **change**, and **rt** (retweet). This seems to indicate that a lot of the same information is being shared/viewed, and it applies across all `sentiment`s. While we can't conclude that's a result of the "filter bubble", it certainly seems like that might be a latent (hidden) cause.
- Interestingly, **trump** occurs across all cases. This may not be surprising given his presidency during the timeframe the Tweets were recorded, but it likely warrants further investigation, especially along the axes of **Neutral** and **Pro**.
- Words like **real**, **believe**, **think**, and **fight** occur quite frequently in the **Pro** `sentiment`. Interestingly, both the **Pro** and **Anti** `sentiment`s seem to be saying **science** and **scientist**, which seems to indicate that both sides believe they're quoting accurate, reproduced research.
- Take a look at the table above and you'll see that **http** actually shows up in the **Pro** `sentiment` quite frequently. This would imply that links are often shared alongside those Tweets. Contrast that with the other `sentiment`s, particularly **News**. Why might this be the case?

## Some Crude Entity Extraction

Entity extraction is an entire field of NLP. We're going to use `spacy`, a pretty great NLP library, to extract the following:
- People
- Geopolitical Regions
- Organizations

For this particular dataset, we're looking to these factors as there's probably some causal relationship between them. Importantly, this might not tell us how these Tweeters would land on the spectrum of support, but it can tell us the most highly focused organizations, geopolitical regions, and influencers/people in advocating for/against "Human-driven Climate Change".

In [None]:
import spacy
# The "en" shortcut was removed in spaCy 3; load the small English model by name.
spacy_en = spacy.load("en_core_web_sm")

In [None]:
def crude_entities(messages):
    # Run each Tweet through the spaCy pipeline to get parsed documents.
    as_docs = messages.apply(spacy_en)
    
    def by_label(doc, label):
        # Keep the text of every entity tagged with the requested label.
        return [ent.text for ent in doc.ents if ent.label_ == label]
    
    def get_top(label, n=10):
        per_tweet = as_docs.apply(lambda doc: by_label(doc, label))
        flattened = itertools.chain(*per_tweet.values)
        
        top_n = Counter(flattened).most_common(n)
        
        # Return the top entities (not the raw per-Tweet lists); a Series lets
        # columns with fewer than n entries still line up in the DataFrame.
        return pd.Series([entity for entity, _ in top_n], dtype="object")
    
    entities = pd.DataFrame({
        "people": get_top("PERSON", n=10),
        "geopolitics": get_top("GPE", n=10),
        "organizations": get_top("ORG", n=10),
    })
    
    return entities

In [None]:
from collections import Counter

In [None]:
entities = {}

by_sentiment = eda.groupby("sentiment")

for sentiment, group in tqdm(by_sentiment):
    entities[sentiment] = crude_entities(group["lemmatized"])
    
display(pd.concat(entities, axis=1).head(n=10))

# Modeling!

Time to build our sentiment classifiers! We'll be "vectorizing" our text data before passing it through to our model. We need to vectorize our data for similar reasons to why we have ASCII and Unicode. Machines don't understand letters and words, but numeric values are their bread-and-butter.
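
To give a feel for what "vectorizing" means, here is a minimal bag-of-words sketch in plain Python. The `docs` list and the `vectorize` helper are hypothetical and only for illustration. The models below use `TfidfVectorizer`, which goes further by re-weighting these counts.

```python
# Toy documents (hypothetical) to show how text becomes numbers.
docs = ["climate change is real", "climate policy change now"]

# Build a vocabulary: each distinct word gets a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(all_words)}

def vectorize(doc):
    """Return a count vector with one slot per vocabulary word."""
    vec = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] += 1
    return vec

vectors = [vectorize(doc) for doc in docs]
print(vocab)
print(vectors)
```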

We'll start out by looking at 5 models:
- Random Forests
- Naïve Bayes
- K-Nearest Neighbors
- Logistic Regression
- Support Vector Machines (Linear SVC)

In [None]:
# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Building classification models
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

## Your professors don't give you test answers, there's a reason

As with every Supervised Learning task, we need to split our data into (at least) Training and Validation sets. Typically, data will be given to you as separate `Training` and `Testing` sets; but in our case, we have one massive CSV, so we need to make that split ourselves.

These splits allow us to train our model, but also give us the ability to test its performance on data it _shouldn't have seen_. (When held-out data does sneak into training, that's a problem known as "data leakage"; try to avoid it.)
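
As a sketch of how two successive 25% hold-outs carve up the data, the snippet below splits 1,000 stand-in indices in plain Python. The `split` helper is a simplified, hypothetical stand-in for `train_test_split`, and the 1,000 figure is arbitrary.

```python
import random

# Hypothetical stand-in for row indices.
indices = list(range(1000))
random.Random(1337).shuffle(indices)

def split(items, test_size):
    """Hold out the last `test_size` fraction of an already-shuffled list."""
    cut = int(len(items) * (1 - test_size))
    return items[:cut], items[cut:]

train_val, test = split(indices, 0.25)  # 75% / 25% of everything
train, valid = split(train_val, 0.25)   # 75% / 25% of the remaining 75%

print(len(train), len(valid), len(test))

# No index appears in more than one set, so nothing "leaks" between them.
assert not (set(train) & set(valid))
assert not (set(train) & set(test))
assert not (set(valid) & set(test))
```

In other words, roughly 56% of the rows end up in training, 19% in validation, and 25% in testing.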

In [None]:
X_all = tweets["message"]
y_all = tweets["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=1337)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=1337)

**What is TFIDF?** It stands for Term Frequency-Inverse Document Frequency. Essentially, it assigns each word a score that _tries_ to highlight words of greater interest by weighing how often a word appears within a document against how many documents it appears in. The `TfidfVectorizer` will tokenize the documents, learn the vocabulary and "inverse document frequency weightings", and allow you to encode new documents.
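
Here is a minimal hand-rolled sketch of the TF-IDF idea on a toy corpus. It uses the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization. The tokenization is a plain whitespace split, which is an assumption and differs from `TfidfVectorizer`'s default tokenizer, so treat it as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

# Toy corpus, not from the dataset.
docs = [
    "climate change is real",
    "climate change is a hoax",
    "vote for climate action",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rarer words get larger weights.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(tokens):
    term_freq = Counter(tokens)
    weights = {word: count * idf(word) for word, count in term_freq.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))  # L2 norm
    return {word: v / norm for word, v in weights.items()}

scores = tfidf_vector(tokenized[0])
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In the first document, **real** (unique to that document) scores highest, while **climate** (present in every document) scores lowest.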

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(X_train)

The following functions, `train`, `grade`, and `train_and_grade`, are helper functions to make life easier and practice DRY (Don't Repeat Yourself).
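
The `grade` helper divides each row of the confusion matrix by its total, so each cell reads as "of the Tweets truly in this class, what fraction received this prediction" (the diagonal is the per-class recall). A minimal sketch of that normalization on made-up labels:

```python
# Toy ground-truth and predicted labels (hypothetical values).
truth = ["Pro", "Pro", "Pro", "Anti", "Anti", "News"]
preds = ["Pro", "Pro", "Anti", "Pro", "Anti", "News"]
classes = sorted(set(truth))

# Rows are ground truth, columns are predictions.
cm = [[0] * len(classes) for _ in classes]
for t, p in zip(truth, preds):
    cm[classes.index(t)][classes.index(p)] += 1

# Divide each row by its total so every row sums to 1.
cm_normd = [[cell / sum(row) for cell in row] for row in cm]

print(classes)
print(cm_normd)
```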

In [None]:
def train(tfidf, model, train_data, train_labels, test_data):
    model.fit(tfidf.transform(train_data), train_labels)
    preds = model.predict(tfidf.transform(test_data))
    
    return preds

In [None]:
def grade(model, preds, test_labels):
    print(metrics.classification_report(test_labels, preds))
    
    cm = confusion_matrix(test_labels, preds)
    cm_normd = cm / cm.sum(axis=1).reshape(-1, 1)
    
    heatmap_kwargs = dict(
        cmap="YlGnBu",
        xticklabels=model.classes_,
        yticklabels=model.classes_,
        vmin=0.,
        vmax=1.,
        annot=True,
        annot_kws={"size": 10},
    )
    
    sns.heatmap(cm_normd, **heatmap_kwargs)
    
    plt.title(f"{model.__class__.__name__} Classification")
    plt.ylabel("Ground-truth labels")
    plt.xlabel("Predicted labels")
    plt.show()

In [None]:
def train_and_grade(tfidf, model, train_data, train_labels, test_data, test_labels):
    preds = train(tfidf, model, train_data, train_labels, test_data)
    grade(model, preds, test_labels)

## Random Forests

Random Forests are a tree-based Machine Learning algorithm that leverages the power of multiple Decision Trees. Decision Trees work essentially like `if-elif-else` control flow, where each split is chosen to maximize a purity metric such as "information gain". The Forest component is pretty simple: you take a bunch of Decision Trees, each trained on a random sample of the data and features, "plant them together" to build a Forest, and let them vote on the final prediction.

A visual representation of Random Forests:

![](https://kevintshoemaker.github.io/NRES-746/rf.png)

Retrieved from [here](https://kevintshoemaker.github.io/NRES-746).
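
To make "information gain" and the voting step concrete, here is a small sketch in plain Python. It is a simplification under stated assumptions: it uses entropy and a hard majority vote, whereas scikit-learn's `RandomForestClassifier` defaults to Gini impurity and averages the trees' class probabilities. The labels and votes are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["Pro"] * 4 + ["Anti"] * 4

# A split that separates the classes perfectly gains the most information.
gain_good = information_gain(parent, ["Pro"] * 4, ["Anti"] * 4)

# A split that leaves both children as mixed as the parent gains nothing.
mixed = ["Pro", "Pro", "Anti", "Anti"]
gain_bad = information_gain(parent, mixed, mixed)

# A forest then combines its trees' predictions, here by majority vote.
tree_votes = ["Pro", "Anti", "Pro"]
forest_prediction = Counter(tree_votes).most_common(1)[0][0]

print(gain_good, gain_bad, forest_prediction)
```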

In [None]:
rf = RandomForestClassifier(max_depth=5, n_estimators=100)
train_and_grade(tfidf, rf, X_train, y_train, X_valid, y_valid)

### Observations:

- From the Confusion Matrix above, you can see that the model strictly predicts the **Pro** `sentiment`. This is likely due to the class imbalance in the data, but since we haven't tested that, we can't conclude it yet.
- Looking at the Precision/Recall/F1 Score, for the **Anti**, **Neutral**, and **News** `sentiment`s, you'll see they're all 0.
- Tree-based methods are prone to favoring the majority class on imbalanced data like ours. We could re-sample so the training data has a more uniform spread of each `sentiment` to test if that's truly the problem with our `RandomForestClassifier`.
- Our overall F1 score is 0.52, which, if you recall from our earlier visualizations, matches the percentage of **Pro** `sentiment` Tweets.

## Naïve Bayes

Naïve Bayes leverages Bayes' Theorem to make classifications. The "naïve" part is the assumption that the features (here, words) are statistically independent of one another given the class.

$$P(A | B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Let's break this down:
- $P(A|B)$ is the posterior – the probability of $A$, given we've observed $B$
- $P(A)$ is the prior – the probability of $A$ before observing anything
- $P(B|A)$ is the likelihood – the probability of $B$, given we've observed $A$
- $P(B)$ is the evidence – the overall probability of $B$

So, in summary, we're taking a known quantity, $P(B|A)$, combining it with the probability that $A$ even happens, then "re-normalizing" it in terms of $B$.
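
A quick worked example with made-up numbers may help. Let $A$ be "the Tweet is **Pro**" and $B$ be "the Tweet contains the word *fight*". All three probabilities below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical probabilities, not estimated from the dataset.
p_pro = 0.5                   # P(A): a Tweet is "Pro"
p_word_given_pro = 0.2        # P(B|A): "fight" appears in a Pro Tweet
p_word_given_not_pro = 0.05   # P(B|not A): "fight" appears in any other Tweet

# P(B) via the law of total probability.
p_word = p_word_given_pro * p_pro + p_word_given_not_pro * (1 - p_pro)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_pro_given_word = p_word_given_pro * p_pro / p_word

print(round(p_pro_given_word, 3))  # 0.8
```

Seeing the word moves our belief that the Tweet is **Pro** from 50% to 80% under these assumed numbers.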

### Naïve Bayes' 3 Classification Methods
- **Gaussian**: often used in classification tasks with continuous features and *assumes* they follow a Normal Distribution (the "bell curve")
- **Bernoulli**: a binary-feature model – this is useful if your features are binary (e.g. `True`/`False` for "is the word in the document")
- **Multinomial**: used for discrete counts. In our case, instead of looking at "is the word in the document" (a Bernoulli view), we instead count the frequency of the word in the document.
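
The difference between the Bernoulli and Multinomial views comes down to how a document is turned into features. A minimal sketch, with a toy vocabulary and a made-up Tweet (both assumptions for illustration):

```python
from collections import Counter

vocabulary = ["climate", "change", "hoax"]  # toy vocabulary
tweet = "climate climate change"            # made-up Tweet

counts = Counter(tweet.split())

# Multinomial view: how many times does each word occur?
multinomial_features = [counts[word] for word in vocabulary]

# Bernoulli view: does each word occur at all?
bernoulli_features = [int(counts[word] > 0) for word in vocabulary]

print(multinomial_features, bernoulli_features)
```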

In [None]:
nb = MultinomialNB()
train_and_grade(tfidf, nb, X_train, y_train, X_valid, y_valid)

### Observations:

- An improvement over Random Forests, but it still performs pretty poorly.
- It still classifies most Tweets with the **Pro** `sentiment`.
- Precision, Recall, and F1 Scores, though, have significantly improved across the other `sentiment`s.
- While Naïve Bayes performs better, its performance is likely hampered by the balance of data we have.

## K-Nearest Neighbors

KNN uses "feature similarity" to predict the values of new data points. Basically, it finds the $K$ training points nearest to the given data point and assigns the class most common among them.

You can measure nearness with a variety of distances, e.g. Euclidean, Manhattan (good for Continuous), and Hamming (good for Categorical).

![](https://adrianromano.com/wp-content/uploads/2019/02/A-typical-example-of-a-KNN-classification-for-a-two-class-problem-ie-the-pink-and.png)
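
The whole algorithm fits in a few lines. Below is a minimal sketch with toy 2-D points. The `knn_predict` helper and the data are hypothetical, and real inputs would be the TF-IDF vectors rather than coordinates.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Label `query` with the most common label among its k nearest points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two well-separated toy clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["Anti", "Anti", "Anti", "Pro", "Pro", "Pro"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5), distance=manhattan))
```

Note that `metric="minkowski"` with `p=2`, as used in the next cell, is the Euclidean distance.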

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
train_and_grade(tfidf, knn, X_train, y_train, X_valid, y_valid)

### Observations:

- KNN improves over both Naïve Bayes and Random Forests.
- It still leans **Pro** on classification, but you'll notice that it actually has greater diversity in classification, overall.

## [Multinomial] Logistic Regression (Classification)

Multinomial Logistic Regression is a generalization of Logistic Regression so that it can handle multiple classes. Typically, Logistic Regression does well when the classes in question are linearly separable. Like Naïve Bayes and Random Forests, it's very sensitive to the class balance.
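
In the multinomial formulation, the model produces one linear score per class and a softmax turns those scores into probabilities. Here is a minimal sketch of that last step. The `scores` are hypothetical stand-ins for what a fitted model would compute, not values from this notebook.

```python
import math

def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Anti", "Neutral", "Pro", "News"]
scores = [0.1, 0.3, 2.0, 1.2]  # hypothetical linear scores, one per class

probs = softmax(scores)
prediction = classes[probs.index(max(probs))]

print(prediction, [round(p, 3) for p in probs])
```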

In [None]:
logreg = LogisticRegression(C=1, class_weight="balanced", max_iter=1000)
train_and_grade(tfidf, logreg, X_train, y_train, X_valid, y_valid)

### Observations:

- Logistic Regression does quite well, especially compared to the previous models.
- The Precision, Recall, and F1-scores of all non-**Pro** classes are still trending upwards, which is good.

## Support Vector Machines (Linear SVC)

With SVMs, we plot our data in $n$-dimensional space ($n$ is the number of features) so that each feature is a coordinate along its own axis. The goal of SVMs is to create the best decision boundary between the classes (this gets hard to visualize past $n=3$). This decision boundary is also called the hyperplane.

SVM typically uses extreme points/vectors to help in creating the Hyperplane. These vectors are called "Support Vectors". Peep the image below for an idea of what's going on.

![](https://static.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm.png)
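
As a rough sketch of the geometry, the snippet below uses a hand-picked 2-D hyperplane (the `w` and `b` values are assumptions, not learned from data) to classify points by which side they fall on and to find the point closest to the boundary, which is the role Support Vectors play.

```python
import math

# Hypothetical hyperplane w . x + b = 0 in 2-D (hand-picked, not fitted).
w = (1.0, 1.0)
b = -5.0

def decision(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(x):
    """Perpendicular distance from x to the hyperplane."""
    return abs(decision(x)) / math.sqrt(sum(wi * wi for wi in w))

points = [(1, 1), (2, 2), (3, 3), (4, 4)]
sides = [1 if decision(p) > 0 else -1 for p in points]

# The points closest to the boundary are the ones that pin down its position.
closest = min(points, key=distance_to_hyperplane)

print(sides, closest)
```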

In [None]:
svm_lsvc = LinearSVC(class_weight="balanced")
train_and_grade(tfidf, svm_lsvc, X_train, y_train, X_valid, y_valid)

### Observations:

- SVM is able to classify Tweets quite successfully.
- Based on the CM above, you can see there are pretty clear boundaries across all the `sentiment`s.
- Interestingly, the SVM seems more confused about what should be classified as **Pro** than even Logistic Regression.
- That trade-off on **Pro** Tweets, though, still leads to gains in correctly classifying the majority of our data.