Problem 2: Use the Twitter US Airline Sentiment dataset (available on Kaggle), which contains tweets labeled with the user's sentiment toward an airline (positive, negative, or neutral). The data is stored in a Pandas DataFrame with columns such as airline and text (the tweet content).

In [19]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [20]:
# Download required nltk resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt_tab')


In [21]:
df = pd.read_csv("Airline-Sentiment-2-w-AA.csv")

(a) Preprocess each tweet:

In [22]:
if 'airline_sentiment' in df.columns:
    df.rename(columns={'airline_sentiment': 'sentiment'}, inplace=True)

In [23]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

contractions_dict = {
    "won't": "will not", "can't": "can not", "don't": "do not", "doesn't": "does not",
    "i'm": "i am", "it's": "it is", "they're": "they are", "we're": "we are",
    "that's": "that is", "there's": "there is", "what's": "what is", "couldn't": "could not",
    "didn't": "did not", "haven't": "have not", "isn't": "is not", "aren't": "are not",
    "you've": "you have", "you'll": "you will", "i've": "i have", "i'll": "i will"
}

In [24]:
def expand_contractions(text):
    for contraction, full_form in contractions_dict.items():
        text = re.sub(r"\b" + re.escape(contraction) + r"\b", full_form, text)
    return text
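Tweets often contain typographic apostrophes (such as U+2019) rather than the ASCII apostrophe used in the dictionary keys, and those would not match the patterns above. The following is a minimal sketch, not part of the original solution, that normalizes apostrophes before expanding. The names `CONTRACTIONS`, `normalize_apostrophes`, and `expand_contractions_demo` are illustrative, and the dictionary is a small subset of the one defined earlier.

```python
import re

# Small illustrative subset of the notebook's contractions dictionary.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is"}

def normalize_apostrophes(text):
    """Map typographic apostrophes (U+2019, U+2018) to the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")

def expand_contractions_demo(text):
    """Expand contractions after normalizing apostrophes (expects lowercased input)."""
    text = normalize_apostrophes(text)
    for contraction, full_form in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(contraction) + r"\b", full_form, text)
    return text

print(expand_contractions_demo("i can\u2019t believe it's delayed"))
# i can not believe it is delayed
```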

In [25]:
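def preprocess_tweet(text):
    """Clean a raw tweet and return a list of lemmatized, stop-word-free tokens."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # strip URLs
    text = re.sub(r"@\w+|#\w+", "", text)        # strip @mentions and #hashtags
    text = expand_contractions(text)             # e.g. "can't" -> "can not"
    text = re.sub(r"[^\w\s]", "", text)          # strip punctuation
    text = re.sub(r"[^\x00-\x7F]+", "", text)    # strip non-ASCII characters (emoji etc.)
    tokens = nltk.word_tokenize(text)
    # Drop stop words, then lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens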

df['tokens'] = df['text'].apply(preprocess_tweet)
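To make the order of the cleaning steps concrete, here is a rough sketch that traces one made-up tweet through them using only the standard library. `str.split` and a one-line replacement stand in for `nltk.word_tokenize` and `expand_contractions`, and `demo_stop_words` is a hypothetical miniature list. It also illustrates one side effect worth knowing: NLTK's English stop-word list contains negations such as "not", so expanded contractions lose their negation unless those words are kept.

```python
import re

def clean_tweet_demo(text):
    """Apply the regex cleaning steps in order and split on whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"@\w+|#\w+", "", text)        # mentions and hashtags
    text = text.replace("can't", "can not")      # stand-in for expand_contractions
    text = re.sub(r"[^\w\s]", "", text)          # punctuation
    return text.split()                          # stand-in for nltk.word_tokenize

# Hypothetical miniature stop-word list; the real NLTK list is much longer.
demo_stop_words = {"it", "the", "not"}
negations = {"not", "no", "nor"}

tokens = clean_tweet_demo("@united Can't believe it! http://t.co/abc #fail")
dropped = [t for t in tokens if t not in demo_stop_words]
kept = [t for t in tokens if t not in (demo_stop_words - negations)]

print(tokens)   # ['can', 'not', 'believe', 'it']
print(dropped)  # ['can', 'believe']
print(kept)     # ['can', 'not', 'believe']
```

Whether keeping negations helps on this dataset is an empirical question; the sketch only shows how the filter could be adjusted.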

(b) Load the pre-trained Google News Word2Vec model using gensim:

In [26]:
model = KeyedVectors.load_word2vec_format("../GoogleNews-vectors-negative300.bin.gz", binary=True)

(c) Convert each tweet into a fixed-length vector by averaging the Word2Vec word vectors for all words in the tweet. Ignore words not found in the embeddings:

In [27]:
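def vectorize(tokens, w2v_model):
    """Average the embeddings of all in-vocabulary tokens.

    Returns a zero vector when none of the tokens are in the model.
    """
    vectors = [w2v_model[word] for word in tokens if word in w2v_model]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(w2v_model.vector_size)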

df['vector'] = df['tokens'].apply(lambda x: vectorize(x, model))
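The averaging step can be checked by hand on a toy vocabulary. The sketch below uses hypothetical 2-dimensional vectors in a plain dict (`toy_vectors`) instead of the 300-dimensional model, and `average_vectors` is an illustrative name; the logic mirrors the function above.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for the 300-dimensional model.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "flight": np.array([0.0, 1.0]),
}

def average_vectors(tokens, vectors, dim=2):
    """Mean of the known tokens' vectors; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_vectors(["great", "flight", "zzzz"], toy_vectors))  # [0.5 0.5]
print(average_vectors(["zzzz"], toy_vectors))                     # [0. 0.]
```

The unknown token "zzzz" is skipped rather than counted as a zero vector, so it does not pull the average toward the origin.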

In [28]:
# Drop tweets whose vector is all zeros, i.e. no token was found in the embeddings
df = df[df['vector'].apply(lambda x: np.any(x))]

(d) Split the dataset into training (80%) and testing (20%) sets using train_test_split:

In [29]:
X = np.vstack(df['vector'].values)
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

(e) Train a Logistic Regression classifier on the vectorized training data and report the accuracy on the test set:

In [30]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [31]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.764826876928351
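An accuracy figure is easier to interpret next to a trivial baseline. As a hedged sketch (not part of the original solution), the helper below computes the accuracy of always predicting the most frequent training label. `majority_baseline_accuracy` is an illustrative name, and the labels in the example are made up, not taken from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Made-up labels for illustration only; not taken from the dataset.
train = ["negative", "negative", "negative", "neutral", "positive"]
test = ["negative", "neutral", "negative", "positive"]
print(majority_baseline_accuracy(train, test))  # ('negative', 0.5)
```

In the notebook, calling it as `majority_baseline_accuracy(y_train, y_test)` should give the number to compare the classifier's test accuracy against.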


(f) Write a Python function predict_tweet_sentiment(model, glove_model, tweet) that takes the trained classifier, the GloVe model, and a single tweet (string), and returns the predicted sentiment (positive, negative, or neutral):

In [32]:
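def predict_tweet_sentiment(model_clf, w2v_model, tweet):
    """Predict the sentiment label of a single raw tweet string."""
    tokens = preprocess_tweet(tweet)                    # same cleaning as for training
    vec = vectorize(tokens, w2v_model).reshape(1, -1)   # shape (1, vector_size)
    return model_clf.predict(vec)[0]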
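One edge case is a tweet in which no token is found in the embeddings: the function above would then classify an all-zero vector, a kind of input that was filtered out of the training data. The sketch below shows one possible way to handle that, returning `None` instead of a label. `ToyEmbeddings` and `ToyClassifier` are hypothetical stand-ins that expose only the parts of the interface used here, and `predict_from_tokens` is an illustrative name.

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in exposing the parts of KeyedVectors used here."""
    vector_size = 2
    _table = {"great": np.array([1.0, 0.0]), "late": np.array([0.0, 1.0])}

    def __contains__(self, word):
        return word in self._table

    def __getitem__(self, word):
        return self._table[word]

class ToyClassifier:
    """Hypothetical stand-in with a scikit-learn style predict method."""
    def predict(self, X):
        # Label each row by whichever of its two components is larger.
        return np.where(X[:, 0] >= X[:, 1], "positive", "negative")

def predict_from_tokens(clf, embeddings, tokens):
    """Return a label, or None when no token has an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # nothing to base a prediction on
    vec = np.mean(vectors, axis=0).reshape(1, -1)
    return clf.predict(vec)[0]

print(predict_from_tokens(ToyClassifier(), ToyEmbeddings(), ["great", "crew"]))  # positive
print(predict_from_tokens(ToyClassifier(), ToyEmbeddings(), ["qwerty"]))         # None
```

Returning `None` leaves the decision to the caller. Falling back to the most frequent class would be another reasonable choice.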