# Sentiment Tagging with Machine Learning

In this notebook, we use the NLTK library to build machine learning models that classify tweets by sentiment. We use the [Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset from Kaggle, preprocess it, and train models that label each tweet as positive, negative, or neutral.

- **Author:** [Sergio Cuéllar](https://www.linkedin.com/in/sergiocuellaralmagro/)
- **Date:** March 2023
- **Dataset:** [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment)
- **Python Version:** 3.10.10

## Objectives

The objective of this notebook is to classify the sentiment of tweets using a machine learning model built with the NLTK library. The notebook follows these steps:
1. **Setup:** Importing libraries and setting up the environment.
2. **Data Loading:** Loading the dataset into a Pandas DataFrame.
3. **Data Processing:** Preprocessing the data to be used by the model (tokenization, etc.).
4. **Model Training:** Training the models using the processed data. We will be comparing the performance of the following models:
    - VADER Sentiment Analyzer
    - Naive Bayes Classifier
    - Support Vector Machine (SVM) Classifier
5. **Model Evaluation:** Evaluating the previously mentioned models, comparing their performance, and choosing the best one for the final classification.
6. **Model Export:** Exporting the model to be used in the deployment phase.

## Setup

We will be importing the following libraries:
- **Pandas:** For data manipulation and analysis.
- **NLTK:** For natural language processing. From this library, we will be importing the VADER sentiment analyzer, the stopwords, the tokenizer, the WordNet lemmatizer and the classifiers.
- **Sklearn:** For splitting the data, evaluating the models and providing the SVM implementation.
- **Joblib:** For exporting the trained model.

In [82]:
import pandas as pd
import nltk
import joblib
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier, SklearnClassifier, accuracy
from nltk.stem import WordNetLemmatizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [83]:
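# Download the VADER lexicon and the WordNet corpus (used by the lemmatizer).
# word_tokenize and stopwords also rely on the 'punkt' and 'stopwords' resources, which are assumed to be installed already.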
nltk.download('vader_lexicon')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\esser\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\esser\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Data Loading

In [84]:
data = pd.read_csv('data/Tweets.csv')
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## Data Processing

To prepare for the analysis, we need to tokenize the text into individual words and remove any stop words or punctuation that are not helpful for the sentiment analysis. We will be using the tokenizer and stopwords from the NLTK library.

In [85]:
stop_words = set(stopwords.words('english')) # Use a new name so the imported stopwords module is not overwritten

In [86]:
def preprocess(text):
    lower_text = text.lower()
    tokens = word_tokenize(lower_text) # Tokenize the text
    # token.isalpha() removes punctuation and numbers; the stop_words check removes stopwords
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(filtered_tokens) # Join the tokens back into a string and return it

def remove_airline_mentions(text):
    airlines = ['americanair', 'southwestair', 'jetblue', 'virginamerica', 'usairways']
    filtered_text = [word for word in text.split() if word not in airlines] # Split the text into a list of words, and remove the airline mentions
    return ' '.join(filtered_text)

data['text'] = data['text'].apply(preprocess) # Apply the preprocess function to the text column
data['text'] = data['text'].apply(remove_airline_mentions) # Apply the remove_airline_mentions function to the text column
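
As a rough illustration of what these two functions do to a single tweet, the sketch below mimics the pipeline with the standard library only. The regular-expression tokenizer and the tiny `STOP_WORDS` set are stand-ins assumed for this example, not NLTK's `word_tokenize` or its English stopword list, so results on other tweets may differ from the notebook's.

```python
import re

# Tiny illustrative subset; the notebook uses NLTK's full English stopword list.
STOP_WORDS = {"what", "you", "the", "to", "a", "is", "it"}

# Same airline handles as in remove_airline_mentions.
AIRLINE_HANDLES = {"americanair", "southwestair", "jetblue", "virginamerica", "usairways"}


def simple_preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and airline handles."""
    tokens = re.findall(r"[a-z]+", text.lower())  # stand-in for word_tokenize + isalpha()
    kept = [t for t in tokens if t not in STOP_WORDS and t not in AIRLINE_HANDLES]
    return " ".join(kept)


print(simple_preprocess("@VirginAmerica What @dhepburn said."))  # dhepburn said
```

The result matches the first row of the preprocessed table: the handle and the stopword are dropped, and only the remaining words are kept.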

In [87]:
# Take a look at the first 5 rows of the data after preprocessing
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,dhepburn said,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,plus added commercials experience tacky,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,today must mean need take another trip,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,really aggressive blast obnoxious entertainmen...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,really big bad thing,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## Model Training

### VADER Sentiment Analyzer

Now, we will be using the VADER sentiment analyzer to classify the tweets. We will be using the `polarity_scores` method to get the sentiment of each tweet. This method returns a dictionary with the following keys:
- **neg:** Proportion of the text rated as negative.
- **neu:** Proportion of the text rated as neutral.
- **pos:** Proportion of the text rated as positive.
- **compound:** Compound sentiment. This metric is the sum of all the lexicon ratings, normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).

We will be using the compound metric to classify the tweets. If the compound score is 0.05 or higher, the tweet will be classified as positive. If the compound score is -0.05 or lower, the tweet will be classified as negative. Otherwise, the tweet will be classified as neutral.
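
The compound score and the labelling rule can be sketched without VADER itself. The `normalize` function below follows the formula that, to my understanding, VADER uses to squash the summed ratings (including the constant `alpha=15`); treat that constant as an assumption rather than a guarantee. `label_from_compound` is a hypothetical helper name that uses the same inclusive thresholds as the notebook's `get_sentiment`.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded sum of lexicon ratings into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label; scores exactly on a threshold are not neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for raw in (-4.0, 0.0, 0.1, 4.0):
    compound = normalize(raw)
    print(raw, round(compound, 3), label_from_compound(compound))
```

Because the thresholds are inclusive, a compound score of exactly 0.05 is labelled positive and exactly -0.05 is labelled negative.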

It's important to note that we don't need to actually train this model. The VADER sentiment analyzer is a ready-made tool based on a lexicon and a set of rules, and we are just using it to classify the tweets. Therefore, it is not necessary to split the dataset into training and test sets.

We will be using the `airline_sentiment` column as the target variable. This column was pre-classified by humans. We will be comparing the results of the VADER sentiment analyzer with the human classification.

In [88]:
vader = SentimentIntensityAnalyzer()

def get_sentiment(text):
    scores = vader.polarity_scores(text)
    compound_score = scores['compound']
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'
    
data['vader_sentiment'] = data['text'].apply(get_sentiment) # Applies the VADER sentiment analysis function to the preprocessed text column

In [89]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,vader_sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,dhepburn said,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),neutral
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,plus added commercials experience tacky,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),neutral
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,today must mean need take another trip,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),neutral
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,really aggressive blast obnoxious entertainmen...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),negative
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,really big bad thing,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),negative


In [90]:
# Calculate the accuracy of the VADER sentiment analysis
vader_accuracy = (data['airline_sentiment'] == data['vader_sentiment']).sum() / len(data)
print('The accuracy of the VADER sentiment analysis is: {:.2f}%'.format(vader_accuracy * 100))

The accuracy of the VADER sentiment analysis is: 44.47%


### Naive Bayes Classifier

Next, we will be using the Naive Bayes classifier from the NLTK library to classify the tweets. We will train the model with the `NaiveBayesClassifier` class, and split the dataset into training and test sets with the `train_test_split` function from the `sklearn.model_selection` module. We will be using 80% of the dataset for training and 20% for testing.
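
To make the idea behind the classifier concrete, here is a minimal Naive Bayes sketch over word-presence features, written with the standard library only. It is a simplified illustration, not NLTK's implementation: the add-one smoothing, the handling of unseen words, the four toy training sentences and the helper names `train_naive_bayes` and `classify` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Word-presence features, in the same dictionary format the notebook uses."""
    return {word: True for word in set(text.split())}


def train_naive_bayes(labeled_features):
    """Count how many documents of each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_features:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        for word in features:
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing over the two outcomes (word present / absent).
            score += math.log((word_counts[label][word] + 1) / (doc_count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy_training_data = [
    ("great flight thanks", "positive"),
    ("love crew", "positive"),
    ("flight delayed hours", "negative"),
    ("lost bag terrible service", "negative"),
]
model = train_naive_bayes([(extract_features(t), s) for t, s in toy_training_data])

print(classify(model, extract_features("terrible delayed flight")))  # negative
print(classify(model, extract_features("love great crew")))          # positive
```

The "naive" part is the assumption that words are independent given the label, which is what allows the per-word log likelihoods to be added up.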

In [91]:
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['airline_sentiment'], test_size=0.2, random_state=42)

In [92]:
# To train the Naive Bayes classifier, we need to extract the features (words) from the text data.
# We will use the NLTK NaiveBayesClassifier, which requires the features to be in a dictionary format.

def extract_features(text):
    words = set(text.split()) # Split the text into words and remove duplicates
    # Build a dictionary that maps each word to True. Only the presence of a word is recorded, not its frequency.
    features = {word : True for word in words}
    return features

# Extract the features from the training data
X_train_features = [(extract_features(text), sentiment) for text, sentiment in zip(X_train, y_train)] # Returns a list of tuples, where each tuple is a dictionary of features and the sentiment label
X_test_features = [(extract_features(text), sentiment) for text, sentiment in zip(X_test, y_test)] # Does the same thing for the test data

In [93]:
# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(X_train_features)

# Calculate the accuracy of the Naive Bayes classifier using the test data
print('The accuracy of the Naive Bayes classifier is: {:.2f}%'.format(accuracy(classifier, X_test_features) * 100))


The accuracy of the Naive Bayes classifier is: 78.28%


In [94]:
# Apply the Naive Bayes classifier to the dataframe in a new column
data['nb_sentiment'] = data['text'].apply(lambda text: classifier.classify(extract_features(text)))

data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,vader_sentiment,nb_sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,dhepburn said,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),neutral,negative
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,plus added commercials experience tacky,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),neutral,positive
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,today must mean need take another trip,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),neutral,negative
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,really aggressive blast obnoxious entertainmen...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),negative,negative
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,really big bad thing,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),negative,negative


### Support Vector Machine (SVM) Classifier

Finally, we will be using a Support Vector Machine (SVM) classifier to classify the tweets. We will be using the sklearn library's `SVC` class to train the model, and the NLTK library's `SklearnClassifier` class to wrap the sklearn model into an NLTK classifier. As before, we will be splitting the dataset into training and test sets.
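
An SVM works on numeric vectors, so the feature dictionaries have to be converted before training. To my understanding, `SklearnClassifier` takes care of this internally with scikit-learn's `DictVectorizer`. The sketch below shows the idea with plain Python; `build_vocabulary` and `vectorize` are hypothetical helper names, not part of either library.

```python
def build_vocabulary(feature_dicts):
    """Collect every feature name and fix an order, one vector position per name."""
    return sorted({name for features in feature_dicts for name in features})


def vectorize(features, vocabulary):
    """Turn a presence dictionary into a fixed-length list of 0.0 / 1.0 values."""
    return [1.0 if features.get(name) else 0.0 for name in vocabulary]


feature_dicts = [
    {"late": True, "flight": True},
    {"great": True, "flight": True},
]
vocabulary = build_vocabulary(feature_dicts)
vectors = [vectorize(features, vocabulary) for features in feature_dicts]

print(vocabulary)  # ['flight', 'great', 'late']
print(vectors)     # [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

Each distinct word becomes one dimension, which is why the size of the vocabulary directly affects training time.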

Before training the model, we need to extract the features from the tweets. For this model, we will be lemmatizing the tweets and, as before, vectorizing them into a dictionary. We will be using the `lemmatize` method of NLTK's `WordNetLemmatizer` class. Lemmatizing is the process of reducing words to their base or dictionary form, known as the lemma. For example, the lemma of the word 'running' is 'run', and the lemma of 'feet' is 'foot'. Note that `lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so, called as it is here, it maps 'feet' to 'foot' but leaves verb forms such as 'running' unchanged. Lemmatizing helps to reduce the dimensionality of the feature space and, therefore, the training time.
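
The effect on the feature space can be seen with a toy example. The lookup table `TOY_LEMMAS` below is a hypothetical stand-in for `WordNetLemmatizer`, used only so that the sketch runs without downloading WordNet; the real lemmatizer covers far more words.

```python
# Hypothetical lookup table standing in for WordNetLemmatizer (noun forms only).
TOY_LEMMAS = {"flights": "flight", "bags": "bag", "feet": "foot"}


def toy_lemmatize(word):
    """Return the lemma if the toy table knows the word, else the word itself."""
    return TOY_LEMMAS.get(word, word)


def presence_features(words):
    """Same dictionary format as the notebook's feature extractors."""
    return {word: True for word in words}


tokens = "flight flights bag bags delayed".split()
raw_features = presence_features(tokens)
lemmatized_features = presence_features(toy_lemmatize(word) for word in tokens)

print(len(raw_features), len(lemmatized_features))  # 5 3
```

Inflected forms collapse onto a single key, so the classifier sees fewer, denser features.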

In [95]:
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['airline_sentiment'], test_size=0.2, random_state=42)

In [96]:
def svm_extract_features(text):
    words = set(text.split()) # Split the text into a list of words and remove duplicates

    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Vectorize the text into a dictionary.
    features = {word : True for word in lemmatized_words}

    return features

# Extract the features from the training data
X_train_features = [(svm_extract_features(text), sentiment) for text, sentiment in zip(X_train, y_train)] # Returns a list of tuples, where each tuple is a dictionary of features and the sentiment label
X_test_features = [(svm_extract_features(text), sentiment) for text, sentiment in zip(X_test, y_test)] # Does the same thing for the test data

In [97]:
# Train the SVM classifier
svm_classifier = SklearnClassifier(SVC(kernel='linear')).train(X_train_features)

In [98]:
# Calculate the accuracy of the SVM classifier using the test data
print('The accuracy of the SVM classifier is: {:.2f}%'.format(accuracy(svm_classifier, X_test_features) * 100))

The accuracy of the SVM classifier is: 78.62%


In [99]:
# Apply the SVM classifier to the dataframe in a new column
data['svm_sentiment'] = data['text'].apply(lambda text: svm_classifier.classify(svm_extract_features(text)))

data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,vader_sentiment,nb_sentiment,svm_sentiment
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,dhepburn said,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),neutral,negative,neutral
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,plus added commercials experience tacky,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),neutral,positive,positive
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,today must mean need take another trip,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),neutral,negative,neutral
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,really aggressive blast obnoxious entertainmen...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),negative,negative,negative
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,really big bad thing,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),negative,negative,negative


## Model Evaluation

In this section of the notebook, we will be evaluating the performance of the three models described above. We will be using the `classification_report` function from the `sklearn.metrics` module to get the precision, recall, f1-score and support for each class. After comparing the results, we will be choosing the best model for the final classification and deployment.

In [100]:
# Generate a classification report for the VADER sentiment analysis
vader_classification_report = classification_report(data['airline_sentiment'], data['vader_sentiment'])
print(vader_classification_report)

              precision    recall  f1-score   support

    negative       0.90      0.36      0.52      9178
     neutral       0.37      0.34      0.35      3099
    positive       0.26      0.90      0.41      2363

    accuracy                           0.44     14640
   macro avg       0.51      0.53      0.43     14640
weighted avg       0.68      0.44      0.46     14640



In [101]:
# Generate a classification report for the Naive Bayes classifier
nb_classification_report = classification_report(data['airline_sentiment'], data['nb_sentiment'])
print(nb_classification_report)

              precision    recall  f1-score   support

    negative       0.84      0.97      0.90      9178
     neutral       0.87      0.53      0.66      3099
    positive       0.85      0.79      0.82      2363

    accuracy                           0.84     14640
   macro avg       0.85      0.76      0.79     14640
weighted avg       0.85      0.84      0.83     14640



In [102]:
# Generate a classification report for the SVM classifier
svm_classification_report = classification_report(data['airline_sentiment'], data['svm_sentiment'])
print(svm_classification_report)

              precision    recall  f1-score   support

    negative       0.94      0.95      0.95      9178
     neutral       0.85      0.81      0.83      3099
    positive       0.90      0.89      0.90      2363

    accuracy                           0.91     14640
   macro avg       0.90      0.89      0.89     14640
weighted avg       0.91      0.91      0.91     14640



Accuracy, the number of correct predictions divided by the total number of predictions, is the most intuitive of these metrics, but it is only a good metric when the classes are balanced. In this case, the classes are not balanced. According to the support column of the reports, the dataset contains 14,640 tweets, of which 62.7% are negative (9,178), 21.2% are neutral (3,099) and 16.1% are positive (2,363). Therefore, we should not use accuracy as our main metric. Instead, we should use the f1-score, which is the harmonic mean of precision and recall and is better suited to imbalanced classes.
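
The class balance and the f1-score can be checked with a few lines of arithmetic. The sketch below takes the support values printed in the classification reports; `f1_score` here is a small helper written for this example, not the scikit-learn function of the same name.

```python
# Support values copied from the classification reports.
support = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(support.values())

# Share of each class, as a percentage rounded to one decimal place.
shares = {label: round(100 * count / total, 1) for label, count in support.items()}
print(shares)  # {'negative': 62.7, 'neutral': 21.2, 'positive': 16.1}

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(support.values()) / total
print(round(majority_baseline, 3))  # 0.627


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.5, 1.0), 3))  # 0.667
```

A model that labels every tweet as negative would already reach about 62.7% accuracy, which is why accuracy alone says little on this dataset.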

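Taking this into account, we can see that the SVM classifier has the best performance. It has the highest f1-score for all the classes, and the highest accuracy. Note that these reports are computed over the whole dataset, which includes the tweets the Naive Bayes and SVM classifiers were trained on, so their scores here are optimistic. On the held-out test set the two classifiers reached 78.28% and 78.62% accuracy respectively, so the SVM still comes out ahead, although by a much smaller margin. Therefore, we will be using this model for the final classification.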

## Model Export

In [105]:
# Export the SVM classifier to a joblib file.
joblib.dump(svm_classifier, 'svm_classifier.joblib')

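# Export the SVM feature extractor to a joblib file.
# Note: functions are pickled by reference (module and name only), so whatever loads this file must still be able to import or define svm_extract_features.
joblib.dump(svm_extract_features, 'svm_extract_features.joblib')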

