# Text-as-data methods

In this tutorial, we will try out both traditional and more recent NLP methods by looking into sentiment analysis. The dataset that we will use comes from an NLP shared task (https://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools) and is available here: https://github.com/leelaylay/TweetSemEval

In shared tasks, many researchers participate to build models on a given dataset and are ranked based on the performance of these models. This particular shared task is detailed in the following report: [SemEval-2017 Task 4: Sentiment Analysis in Twitter](https://aclanthology.org/S17-2088/).

In [None]:
import pandas as pd

In [None]:
datapath = 'https://raw.githubusercontent.com/leelaylay/TweetSemEval/master/dataset/train/twitter-2013train-A.txt'
data = pd.read_csv(datapath, sep = '\t', names = ['id', 'sentiment', 'text'])
data

### Let's explore the data a bit

In [None]:
data.groupby('sentiment').size().plot(kind = 'bar')

In [None]:
# check for duplicates: if there are fewer unique texts than rows, some tweets are repeated
data['text'].nunique(), len(data)

In [None]:
data['text'].head().values

In [None]:
# check for social media specific content, such as urls, @mention, retweets, emojis etc
from nltk.tokenize import TweetTokenizer

tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))

In [None]:
tweet = data['text'][4]
print(tt.tokenize(tweet))

In [None]:
# a function to count the number of urls in a tweet
def url_count(row):
    count = 0
    tokens = tt.tokenize(row['text'])
    for token in tokens:
        if token.startswith('http'):
            count += 1
    return count

url_count(data.iloc[4])

In [None]:
data['url_count'] = data.apply(url_count, axis = 1)
data.head(5)

In [None]:
data.groupby('url_count').size()

### Ex 1.1: Get the count of @mentions in the tweets and find the top 5 most frequent @mentions in all tweets

## Let's create a bag-of-words representation

For this, we will use scikit-learn, the standard Python library for machine learning, at least for non-deep-learning models. While deep learning (DL) models are much more popular now, scikit-learn is still widely used for splitting datasets and evaluating models.

To get a Bag-of-Words (BoW) representation, we will use CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() # initialize the vectorizer

In [None]:
# get all the tweets
sentences = data['text'].values

# create the representation
vectorizer.fit(sentences)

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vectorizer.vocabulary_['does']

In [None]:
bag_of_words = vectorizer.transform(sentences)
bag_of_words.shape

In [None]:
bag_of_words.toarray() # a lot of this is empty

## Let's train a classifier to detect sentiment using different approaches

Since this is a supervised problem, we need to split our data into training and test sets.
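A minimal sketch of what such a split does, on made-up data (the toy_ names are illustrative). The stratify argument, which keeps the class proportions the same in both parts, is an optional extra shown here for illustration.

```python
from sklearn.model_selection import train_test_split

# ten made-up texts with balanced labels
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = ["positive"] * 5 + ["negative"] * 5

# hold out 20% of the data for testing; random_state fixes the shuffle
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_test)
```

The held-out texts never appear in the training part, which is what makes the test score an estimate of performance on unseen data.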

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
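train, test = train_test_split(data, test_size = 0.3, random_state = 0) # fixed seed, so the split is reproducible
train, test = train.copy(), test.copy() # explicit copies, so that adding columns later does not raise a SettingWithCopyWarning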

In [None]:
X_train = train['text'].values
y_train = train['sentiment'].values
X_test = test['text'].values
y_test = test['sentiment'].values

### Approach 1: BoW representation

### Step 1: Build the BoW Representation

In [None]:
vectorizer = CountVectorizer()
X_train_v = vectorizer.fit_transform(X_train)

**fit**: Learn a vocabulary dictionary of all tokens in the raw documents.

**transform**: Transform documents to document-term matrix.
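One way to see the difference between the two calls is a toy example (made-up texts, illustrative names): the vocabulary is learned from the training texts only, and words that appear only in the test texts are simply ignored by transform.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_texts = ["great phone", "terrible phone"]
toy_test_texts = ["great camera"]  # "camera" never occurs in the training texts

toy_fit_vectorizer = CountVectorizer()
# fit_transform: learn the vocabulary AND build the training matrix
toy_train_matrix = toy_fit_vectorizer.fit_transform(toy_train_texts)
# transform: reuse the learned vocabulary, so the columns line up with training
toy_test_matrix = toy_fit_vectorizer.transform(toy_test_texts)

print(sorted(toy_fit_vectorizer.vocabulary_))
print(toy_test_matrix.toarray())
```

Calling fit_transform on the test texts instead would build a different vocabulary, so the columns would no longer match what the classifier was trained on.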

In [None]:
X_test_v = vectorizer.transform(X_test) # why transform and not fit_transform?

CountVectorizer has other parameters as well, such as lowercase (whether to convert the text to lowercase before tokenizing) and stop_words (whether to remove stopwords).

One of the most important parameters for this use case is ngram_range, which sets the minimum and maximum length of the n-grams that will be modeled.

### Step 2: Classify

We will use a linear Support Vector Machine (SVM) classifier.

In [None]:
from sklearn.svm import LinearSVC
sv = LinearSVC(random_state=0, tol=1e-5)

In [None]:
# fit the classifier, i.e., learn the parameters
# (LinearSVC accepts the sparse matrix directly, so no memory-hungry toarray() is needed)
clf = sv.fit(X_train_v, y_train)

In [None]:
# predict the labels of the test set (again directly on the sparse matrix)
y_pred = clf.predict(X_test_v)

Let's have a cursory look at the results

In [None]:
pd.DataFrame({'text': X_test[:5], 'predicted': y_pred[:5], 'actual': y_test[:5]})

In [None]:
anti_biden_tweet = "Ugh, this was true yesterday and it's also true now: Biden is an idiot"

In [None]:
clf.predict(vectorizer.transform([anti_biden_tweet]))

## Evaluate how well the classifier is doing

scikit-learn also implements many evaluation metrics, such as precision, recall, F1 and the confusion matrix.
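As a small illustration on made-up labels (toy_ names are illustrative), the sketch below computes accuracy and macro-averaged F1. Macro averaging takes the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones.

```python
from sklearn.metrics import accuracy_score, f1_score

# made-up gold labels and predictions; one positive tweet is mistaken for neutral
toy_true = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
toy_pred = ["positive", "neutral", "negative", "neutral", "neutral", "neutral"]

# fraction of tweets labeled correctly
toy_accuracy = accuracy_score(toy_true, toy_pred)
# unweighted mean of the F1 scores of the three classes
toy_macro_f1 = f1_score(toy_true, toy_pred, average="macro")

print(f"accuracy: {toy_accuracy:.3f}")
print(f"macro F1: {toy_macro_f1:.3f}")
```

When classes are imbalanced, as sentiment labels often are, accuracy alone can look good even if a minority class is mostly missed, which is why per-class metrics are worth checking.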

In [None]:
from sklearn.metrics import classification_report # reports precision, recall and F1 for each class

print(classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

In [None]:
# you can also visualize this as a heatmap

import seaborn as sns
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d') # annot shows the counts in each cell

In [None]:
# or with sklearn directly
from sklearn.metrics import ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_).plot()

In [None]:
# let's store these predictions for later
test['bow_sentiment'] = y_pred

### Approach 2: Word Embeddings

In [None]:
#! pip install fasttext

In [None]:
import fasttext
import fasttext.util

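# Load the pre-trained English model (cc.en.300.bin, trained on Common Crawl and Wikipedia)
# Uncomment the next line to download the model first (the file is several GB)
#fasttext.util.download_model('en', if_exists='ignore')  # English
model = fasttext.load_model('cc.en.300.bin')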

In [None]:
# Get embeddings for a single text
text = "This is a test sentence."
embedding = model.get_sentence_vector(text)
print(embedding)

# Get embeddings for multiple texts
texts = ["This is a test sentence.", "Another sentence for testing."]
embeddings = [model.get_sentence_vector(text) for text in texts]
print(embeddings)

Instead of the BoW representation, we can use the fastText representation in our classifier. It is dense and captures more of the meaning of words, so that similar words get similar vectors.
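As a rough intuition, a sentence vector of this kind can be thought of as an average of word vectors. The sketch below uses a made-up two-dimensional vocabulary (toy_word_vectors and toy_sentence_vector are hypothetical names, not part of fastText) and averages length-normalized word vectors. The real model additionally builds vectors for unseen words from character n-grams, which this toy version skips.

```python
import numpy as np

# made-up 2-dimensional "embeddings"; real fastText vectors have 300 dimensions
toy_word_vectors = {
    "good": np.array([3.0, 4.0]),
    "movie": np.array([0.0, 2.0]),
}

def toy_sentence_vector(sentence, word_vectors):
    """Average the unit-length vectors of the known words in a sentence."""
    unit_vectors = []
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # toy simplification: unknown words are skipped
        norm = np.linalg.norm(vec)
        if norm > 0:
            unit_vectors.append(vec / norm)
    if not unit_vectors:
        # no known words: return a zero vector of the right size
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    return np.mean(unit_vectors, axis=0)

toy_sent_vec = toy_sentence_vector("good movie", toy_word_vectors)
print(toy_sent_vec)
```

Because every sentence maps to a vector of the same fixed size, the result can be passed to the same kind of classifier as the BoW features.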

In [None]:
X_train = [model.get_sentence_vector(text) for text in train['text'].values]
X_test = [model.get_sentence_vector(text) for text in test['text'].values]

In [None]:
sv = LinearSVC(random_state=0, tol=1e-5)
clf = sv.fit(X_train, y_train)

In [None]:
clf.predict([model.get_sentence_vector(anti_biden_tweet)])

In [None]:
# predict the labels of the test set
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# let's save the labels for later
test['fasttext_sentiment'] = y_pred

### Approach 3: Deep Learning Models like BERT

In [None]:
#! pip install simpletransformers

In [2]:
#!pip install torch

In [3]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

We encode our sentiment labels as integers (label encoding, not one-hot encoding), because a numeric representation is easier for the model to deal with.
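A minimal sketch on made-up labels (toy_ names are illustrative) of what LabelEncoder does: each class name is mapped to one integer code, assigned in alphabetical order of the class names, and the mapping can be reversed.

```python
from sklearn.preprocessing import LabelEncoder

toy_sentiments = ["positive", "negative", "neutral", "negative"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_sentiments)  # one integer per label

# position in classes_ is the integer code of that class
toy_classes = toy_le.classes_.tolist()
# map the integer codes back to the original strings
toy_roundtrip = toy_le.inverse_transform(toy_codes).tolist()

print(toy_classes)
print(toy_codes.tolist())
print(toy_roundtrip)
```

The integer codes are arbitrary identifiers, so their numeric order should not be read as a ranking of the sentiments.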

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train['sentiment'])
train['labels'] = le.transform(train['sentiment'])
test['labels'] = le.transform(test['sentiment'])

In [None]:
# let's see an example of the encoded sentiment values
test['labels'][0:5]

In [None]:
# we can also reverse this to get the word labels back
le.inverse_transform(test['labels'][0:5])

In [None]:
# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3,
                                output_dir='output_st',
                                overwrite_output_dir=True)

# Create a ClassificationModel
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased",
    args=model_args, use_cuda = False, num_labels=3
)

In [None]:
import os

In [None]:
os.getcwd()

In [None]:
train_df = train[['text', 'labels']]
test_df = test[['text', 'labels']]

In [None]:
train_df

In [None]:
# Train the model
model.train_model(train_df)

In [None]:
predictions, raw_outputs = model.predict([anti_biden_tweet])
le.inverse_transform(predictions)

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df)
result

In [None]:
# if you want to store the predictions for later. This is also how you do inference on unlabeled data
preds, probs = model.predict(list(test_df['text'].values))
preds = le.inverse_transform(preds)

print(classification_report(test['sentiment'], preds))

In [None]:
# let's store these labels for later
test['distilbert_sentiment'] = preds

### Approach 4: Zero-shot Natural Language Inference

In [None]:
# ! pip install transformers

In [None]:
# zero-shot NLI classification
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli')

In [None]:
classifier(anti_biden_tweet, ['positive', 'negative', 'neutral']) # for sentiment

In [None]:
labels = ['positive', 'negative', 'neutral']
scores = classifier(anti_biden_tweet, labels)

In [None]:
scores['labels'][0] # the label with the highest score

In [None]:
# let's do this for the full test set
from tqdm import tqdm # neat library for keeping track of how much of your data has been processed

In [None]:
bart_scores = []
for text in tqdm(test['text'].values):
    bart_scores.append(classifier(text, labels)['labels'][0])

In [None]:
print(classification_report(test['sentiment'], bart_scores))

In [None]:
test['bart_sentiment'] = bart_scores

### Approach 5: Off-the-shelf techniques (VADER)

We will use VADER (Valence Aware Dictionary and sEntiment Reasoner): https://github.com/cjhutto/vaderSentiment. It is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

In [None]:
import nltk # VADER is a part of the popular NLP toolkit called nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

VADER returns four scores: positive, negative, neutral and compound.

In [None]:
sid.polarity_scores(anti_biden_tweet)

Often, the compound score is thresholded to get categorical or ordinal labels. The paper suggests the following:

**negative**: -1 to -0.1

**neutral**: -0.1 to 0.1

**positive**: 0.1 to 1

In [None]:
# keep only the compound score of each tweet
test['vader_sentiment_score'] = [sid.polarity_scores(i)['compound'] for i in test['text']]

In [None]:
test

In [None]:
def vader_apply(row):
    score = row['vader_sentiment_score']
    # apply the thresholds above; the end points -1 and 1 count as negative and positive
    if score < -0.1:
        return 'negative'
    elif score > 0.1:
        return 'positive'
    else:
        return 'neutral'
test['vader_sentiment'] = test.apply(vader_apply, axis = 1)
test.head()

In [None]:
print(classification_report(test['sentiment'], test['vader_sentiment']))

Now save all the labels in CSV format, so that they can be used at a later point.

In [None]:
test.to_csv('labeled_test_set.csv')

### Ex 1.2: Collect some posts from your favorite subreddit. Use any two of these approaches to find the sentiment of the posts. Compare the sentiment estimates from different approaches.

### More NLP pointers:

Nice set of tutorials for deep learning: https://www.aiforhumanists.com/tutorials/

NLP crash course: https://lena-voita.github.io/nlp_course.html