# Sentiment Analysis with spaCy and Advanced Models

## Introduction

This notebook demonstrates sentiment analysis using natural language processing (NLP) with the spaCy library and several machine learning models. Each text in the dataset is labeled with one of six emotions (sadness, love, anger, joy, fear, surprise). The classical models are trained on all six labels; the BERT section collapses them into positive and negative sentiment.


In [1]:
!pip install spacy scikit-learn tensorflow keras transformers pandas
!python -m spacy download en_core_web_lg



In [2]:
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [3]:
df_train = pd.read_csv("emotions_train_data.txt", sep=";", names=["text", "emotion"])
df_test = pd.read_csv("emotions_test_data.txt", sep=";", names=["text", "emotion"])

In [4]:
# Load the large English spaCy pipeline used for preprocessing
nlp = spacy.load("en_core_web_lg")

In [5]:
df_train.head()

                                                text  emotion
0  im feeling quite sad and sorry for myself but ...  sadness
1  i feel like i am still looking at a blank canv...  sadness
2                     i feel like a faithful servant     love
3                  i am just feeling cranky and blue    anger
4  i can have for a treat or if i am feeling festive      joy


In [6]:
def preprocess(text):
    """Lemmatize and lowercase a text, dropping punctuation, whitespace, and number-like tokens."""
    doc = nlp(text)
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_punct and not token.is_space and not token.like_num
    ]
    return " ".join(tokens)

In [7]:
df_train['preprocessed_text'] = df_train['text'].apply(preprocess) 

In [8]:
# Split train data into train and validation
X_train, X_val, y_train, y_val = train_test_split(df_train['preprocessed_text'], df_train['emotion'], test_size=0.2, random_state=42)

In [9]:
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train) 
X_val_vect = vectorizer.transform(X_val)
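
To make the weighting step concrete, here is a minimal pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row). The helper name `tfidf_l2` and the three toy sentences are assumptions made for illustration. The sketch splits on whitespace, whereas scikit-learn's default tokenizer also drops single-character tokens such as "i", so the numbers will not match the notebook's matrices exactly.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy corpus, not taken from the dataset.
docs = ["i feel sad", "i feel happy", "i love this"]
vocab, rows = tfidf_l2(docs)
for doc, row in zip(docs, rows):
    weights = {tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

Tokens that occur in every document (here "i") receive the lowest weight, while rarer tokens such as "sad" receive the highest, which is what lets a linear model focus on emotion-bearing words.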

In [10]:
# Train and evaluate models
def evaluate_model(model):
    model.fit(X_train_vect, y_train)
    preds = model.predict(X_val_vect)
    print(f"{type(model).__name__} Accuracy: {accuracy_score(y_val, preds):.3f}")
    print(classification_report(y_val, preds))

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(),
    LinearSVC()
]

for model in models:
    evaluate_model(model)

LogisticRegression Accuracy: 0.537
              precision    recall  f1-score   support

       anger       0.78      0.27      0.41        51
        fear       0.73      0.17      0.28        46
         joy       0.48      0.91      0.63       127
        love       1.00      0.02      0.04        46
     sadness       0.60      0.64      0.62       118
    surprise       0.00      0.00      0.00        12

    accuracy                           0.54       400
   macro avg       0.60      0.34      0.33       400
weighted avg       0.63      0.54      0.47       400



RandomForestClassifier Accuracy: 0.532
              precision    recall  f1-score   support

       anger       0.81      0.25      0.39        51
        fear       0.83      0.33      0.47        46
         joy       0.44      0.89      0.59       127
        love       0.94      0.33      0.48        46
     sadness       0.57      0.43      0.49       118
    surprise       1.00      0.50      0.67        12

    accuracy                           0.53       400
   macro avg       0.77      0.45      0.52       400
weighted avg       0.65      0.53      0.51       400

LinearSVC Accuracy: 0.720
              precision    recall  f1-score   support

       anger       0.80      0.73      0.76        51
        fear       0.77      0.59      0.67        46
         joy       0.65      0.86      0.74       127
        love       0.88      0.46      0.60        46
     sadness       0.74      0.73      0.73       118
    surprise       0.73      0.67      0.70        12

    accuracy
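
The gap between accuracy and the macro average in the reports above comes from class imbalance. As a worked example, the sketch below recomputes the macro and weighted recall for LogisticRegression from the rounded per-class values printed in its report. Because the inputs are already rounded to two decimals, the results only approximate the reported 0.34 and 0.54.

```python
# Per-class recall and support copied from the LogisticRegression report.
recall = {
    "anger": 0.27, "fear": 0.17, "joy": 0.91,
    "love": 0.02, "sadness": 0.64, "surprise": 0.00,
}
support = {
    "anger": 51, "fear": 46, "joy": 127,
    "love": 46, "sadness": 118, "surprise": 12,
}

total = sum(support.values())

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"total support:   {total}")
print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The weighted figure is dominated by joy and sadness, which together make up 245 of the 400 validation rows, so a model that mostly predicts those two classes can reach moderate accuracy while nearly ignoring love and surprise.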



## Deep Learning with BERT

In [11]:
# Combine the train and test sets into a single DataFrame
df = pd.concat([df_train, df_test])

In [12]:
df['emotion'].unique()

array(['sadness', 'love', 'anger', 'joy', 'fear', 'surprise'],
      dtype=object)

In [13]:
# Load BERT model and tokenizer 
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
def emotion_encoder(labels):
    """Map the six emotion labels to binary sentiment (1 = positive, 0 = negative)."""
    mapping = {
        "surprise": 1, "love": 1, "joy": 1,
        "fear": 0, "anger": 0, "sadness": 0,
    }
    # map() returns a new Series, so the original DataFrame column is left unchanged
    return labels.map(mapping).astype("int64")

In [15]:
X = df['text']
y = emotion_encoder(df['emotion'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [16]:
X_train.head()

1994                             i can feel its suffering
423     i enjoyed it for the most part for an entertai...
991     ive had a few rough days since then and in the...
1221    i can say is that despite my occasional jokes ...
506     i dont come from a perfect past i come from a ...
Name: text, dtype: object

In [17]:
# Tokenize text
X_train_tokens = tokenizer(X_train.tolist(), truncation=True, padding=True)
X_test_tokens = tokenizer(X_test.tolist(), truncation=True, padding=True)

In [18]:
X_test_tokens

{'input_ids': [[101, 1045, 3984, 1045, 2514, 17704, 8884, 2000, 2068, 2144, 1045, 4821, 2933, 2006, 8660, 2911, 1999, 3054, 2000, 2397, 2244, 2000, 4019, 2013, 2662, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2514, 17380, 2007, 4422, 1055, 15330, 23617, 1997, 2010, 7344, 8087, 2021, 6343, 2060, 2084, 1996, 6778, 2071, 5621, 14396, 2000, 16360, 2121, 7874, 10992, 1997, 2108, 16709, 14701, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2052, 2293, 2000, 2330, 2039, 1037, 5053, 11090, 2005, 2613, 2308, 2028, 2154, 4873, 2216, 2040, 2079, 2025, 9352, 2031, 3819, 4230, 3096, 2064, 2272, 2302, 3110, 28028, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 2145, 3110, 1037, 18819, 4326, 1999, 2216, 7247, 2100, 12461, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
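
The trailing zeros in the output above are padding. With `padding=True` the tokenizer pads every sequence to the length of the longest one in the call and returns an `attention_mask` that marks real tokens with 1 and padding with 0. The sketch below reproduces that behavior in plain Python; the helper `pad_batch` and the two short ID sequences are assumptions for illustration, not the tokenizer's actual implementation. In `bert-base-uncased`, ID 101 is `[CLS]`, 102 is `[SEP]`, and 0 is `[PAD]`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative sequences: [CLS] ... [SEP] with differing lengths.
batch = pad_batch([[101, 1045, 2514, 102], [101, 1045, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the train and test sets are tokenized in separate calls, each is padded to its own longest sequence, so the two sets can end up with different sequence lengths.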

In [19]:
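# Convert to batched TensorFlow datasets (Keras expects batched input; 16 is an assumed batch size)
batch_size = 16
train_ds = tf.data.Dataset.from_tensor_slices((dict(X_train_tokens), y_train.values)).shuffle(len(y_train)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices((dict(X_test_tokens), y_test.values)).batch(batch_size)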

In [21]:
!pip install keras

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [22]:
import os

In [25]:
# Set TOKENIZERS_PARALLELISM explicitly to silence the fork warning shown above
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [27]:
# Compile the model for fine-tuning with an explicit optimizer, loss, and metric
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
bert.compile(optimizer=optimizer, loss=loss, metrics=metrics)

RecursionError: maximum recursion depth exceeded in comparison
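
The notebook does not show the traceback, so the cause of this error is not confirmed. One plausible explanation is a version mismatch: recent TensorFlow releases ship Keras 3 as `tf.keras`, while the TensorFlow model classes in `transformers` were written against Keras 2. A commonly suggested workaround is to install the backwards-compatible `tf-keras` package and set the `TF_USE_LEGACY_KERAS` environment variable before TensorFlow is first imported. The sketch below only prepares the environment; the helper name `prefer_legacy_keras` is hypothetical, and whether this resolves the error in this notebook is an assumption.

```python
import importlib.util
import os


def prefer_legacy_keras():
    """Ask TensorFlow to back tf.keras with the Keras 2 `tf_keras` package.

    This must run before `import tensorflow`, so in a notebook it belongs
    in the first cell after a kernel restart.
    Returns True if the `tf_keras` package is installed.
    """
    os.environ["TF_USE_LEGACY_KERAS"] = "1"
    return importlib.util.find_spec("tf_keras") is not None


legacy_available = prefer_legacy_keras()
if not legacy_available:
    print("tf_keras not found; run `pip install tf-keras` and restart the kernel.")
```

After restarting the kernel with this setting in place, the imports, model loading, and `bert.compile(...)` cells would need to be rerun from the top.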

In [None]:
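# Evaluate accuracy on the held-out test set once compilation and training succeed.
# Untested expectation: fine-tuned BERT should outperform the TF-IDF models, although
# this is a binary task, so its accuracy is not directly comparable to the six-class results above.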
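
Since the evaluation cell was never run, the following pure-Python sketch shows what the configured loss and metric would compute from the model's raw outputs. `SparseCategoricalCrossentropy(from_logits=True)` applies a softmax to the logits and takes the negative log-probability of the true class, and `SparseCategoricalAccuracy` compares the argmax of the logits with the integer label. The logits and labels below are made-up values for illustration, not model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    return -math.log(softmax(logits)[label])


def accuracy(batch_logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical logits for three texts over the two classes (0 = negative, 1 = positive).
batch_logits = [[2.0, -1.0], [0.5, 1.5], [0.2, 0.1]]
labels = [0, 1, 1]

mean_loss = sum(
    sparse_cross_entropy(row, y) for row, y in zip(batch_logits, labels)
) / len(labels)
print(f"mean loss: {mean_loss:.4f}")
print(f"accuracy:  {accuracy(batch_logits, labels):.4f}")
```

The third example is misclassified by a narrow margin, which illustrates that the loss penalizes low confidence in the correct class while accuracy only counts whether the top prediction was right.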