<a href="https://colab.research.google.com/github/shokoufehnaseri/Text-Mining/blob/main/text_mining_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment analysis of tweet data**

 

## ***1. Introduction***

Sentiment analysis, a key technique in natural language processing (NLP), has gained substantial attention for its ability to analyze and interpret opinions, emotions, and attitudes expressed in textual data. In recent years, the development of transformer models, such as FinBERT, has revolutionized the field of sentiment analysis, particularly in the context of financial data. These models, built on deep learning architectures like BERT, have demonstrated a remarkable ability to understand the nuances of language and context, making them particularly effective for analyzing specialized datasets such as financial news and social media content.

This project compares a modern transformer model, FinBERT, with traditional sentiment analysis methods such as Support Vector Machines (SVM). While SVM has long been a popular choice for text classification, its reliance on manual feature extraction and its limited ability to capture context pose challenges. In contrast, transformer models use a self-attention mechanism that lets them capture context and meaning from large amounts of unstructured data.

The dataset used for this study consists of tweets related to the S&P 500 index, a stock market index tracking the 500 largest publicly traded companies in the United States. Social media platforms like Twitter have become a rich source of real-time information, with users frequently sharing their opinions and reactions to market events. By analyzing these tweets, this project aims to uncover insights into market sentiment and compare the effectiveness of modern and traditional sentiment analysis techniques.



## ***2. Data***

**Data Description:**

The Twitter raw data was downloaded using the Twitter REST API search, specifically the "Tweepy (version 3.8.0)" Python package, which simplifies the interaction between the REST API and developers. The Twitter REST API retrieves data from the past seven days and allows filtering by language. The tweets were filtered for the English (en) language.

Data collection was performed from April 9 to July 16, 2020, using the following Twitter tags as search parameters: #SPX500, #SP500, SPX500, SP500, $SPX, #stocks, $MSFT, $AAPL, $AMZN, $FB, $BBRK.B, $GOOG, $JNJ, $JPM, $V, $PG, $MA, $INTC, $UNH, $BAC, $T, $HD, $XOM, $DIS, $VZ, $KO, $MRK, $CMCSA, $CVX, $PEP, $PFE. Due to the large volume of data, I stored only each tweet's content and creation date.

The file tweets_labelled_09042020_16072020.csv consists of 5,000 tweets selected using random sampling from a total of 943,672. Of these, 1,300 tweets were manually annotated and reviewed by a second independent annotator. The file tweets_remaining_09042020_16072020.csv contains the remaining 938,672 tweets.



**Importing Libraries**

In [3]:
import pandas as pd
from wordcloud import WordCloud,STOPWORDS
import matplotlib.pyplot as plt 
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from nltk import word_tokenize
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

***Loading Dataset*** 

In [4]:
tweet_data = pd.read_csv(r"C:\Users\Shokoufeh\OneDrive\Thesis\thesis_coding\text_mining_project\Text-Mining\tweets_labelled_09042020_16072020.csv", delimiter=";")

**Inspect the Data**

Check for issues like missing values or incorrectly formatted columns.


In [5]:
print(tweet_data.head())

       id                 created_at  \
0   77522  2020-04-15 01:03:46+00:00   
1  661634  2020-06-25 06:20:06+00:00   
2  413231  2020-06-04 15:41:45+00:00   
3  760262  2020-07-03 19:39:35+00:00   
4  830153  2020-07-09 14:39:14+00:00   

                                                text sentiment  
0  RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...  positive  
1  #SriLanka surcharge on fuel removed!\n⛽📉\nThe ...  negative  
2  Net issuance increases to fund fiscal programs...  positive  
3  RT @bentboolean: How much of Amazon's traffic ...  positive  
4  $AMD Ryzen 4000 desktop CPUs looking ‘great’ a...  positive  


In [6]:
print(tweet_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          5000 non-null   int64 
 1   created_at  5000 non-null   object
 2   text        5000 non-null   object
 3   sentiment   1300 non-null   object
dtypes: int64(1), object(3)
memory usage: 156.4+ KB
None


In [7]:
tweet_data['sentiment'].value_counts()

sentiment
positive    528
neutral     424
negative    348
Name: count, dtype: int64

In [8]:
print(tweet_data.isnull().sum())

id               0
created_at       0
text             0
sentiment     3700
dtype: int64


The dataset consists of 5,000 observations. Among these, 1,300 observations are labeled with sentiments as 'positive', 'negative', or 'neutral', while the remaining 3,700 observations are unlabeled, with their sentiment marked as `NaN`.

**Clean the Text Data**

In [9]:
print(tweet_data['text'].head())

0    RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...
1    #SriLanka surcharge on fuel removed!\n⛽📉\nThe ...
2    Net issuance increases to fund fiscal programs...
3    RT @bentboolean: How much of Amazon's traffic ...
4    $AMD Ryzen 4000 desktop CPUs looking ‘great’ a...
Name: text, dtype: object



As observed, raw text data often contains noise and inconsistencies that can impede accurate sentiment analysis. To address this, preprocessing is an essential step to clean, standardize, and structure the data, ensuring its suitability for machine learning algorithms. The following steps are commonly employed to prepare textual data effectively:

*Lowercasing:* All text is converted to lowercase to ensure uniformity and avoid treating the same word differently due to capitalization (e.g., "Happy" vs. "happy").

*Removal of URLs:* Text often contains hyperlinks that do not contribute to the sentiment of the content. These are removed to reduce noise.

*Removal of Mentions:* Mentions, denoted by the @ symbol followed by a username (e.g., @user), are common in social media text. They indicate a reference to another user but usually do not contribute to the sentiment of the text, so they are typically removed to reduce noise. (In the cleaned output shown later in this section, only the @ symbol is stripped and the username itself is kept.)

*Handling Hashtags:* Hashtags are common in social media text. While the # symbol is removed, the associated words are retained, as they may provide context or sentiment-related information.

*Removal of Numeric and Punctuation Data:* Numbers and punctuation marks, unless contextually relevant, are removed to simplify the text.

These steps are applied by the project's `preprocess_tweet` module, which adds a `Text_Cleaned` column to the DataFrame.
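The `preprocess_tweet` module used in the next cells is not reproduced in this notebook. As a rough illustration of the steps listed above, a minimal sketch with regular expressions could look like the following. The function name `clean_tweet` and the exact patterns are assumptions made for this example, not the project's implementation; this version drops mentions entirely, as the list describes.

```python
import re

def clean_tweet(text: str) -> str:
    """Illustrative cleaner following the listed steps (hypothetical helper)."""
    text = text.lower()                                  # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                    # remove mentions
    text = text.replace("#", " ")                        # drop '#', keep the hashtag word
    text = re.sub(r"[^a-z\s]", "", text)                 # remove digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()             # collapse repeated whitespace

print(clean_tweet("RT @user: $AAPL up 3% today! #stocks https://t.co/abc"))
```

Because the URL and mention patterns run before punctuation is stripped, the `://` and `@` markers are still available to match on.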

In [10]:
import preprocess_tweet

In [11]:
tweet_data = preprocess_tweet.Preprocess_Tweets(tweet_data)



In [12]:
tweet_data.head()

Unnamed: 0,id,created_at,text,sentiment,Text_Cleaned
0,77522,2020-04-15 01:03:46+00:00,"RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...",positive,rt robertbeadles yo enter to win monarch token...
1,661634,2020-06-25 06:20:06+00:00,#SriLanka surcharge on fuel removed!\n⛽📉\nThe ...,negative,srilanka surcharge on fuel removed the surchar...
2,413231,2020-06-04 15:41:45+00:00,Net issuance increases to fund fiscal programs...,positive,net issuance increases to fund fiscal programs...
3,760262,2020-07-03 19:39:35+00:00,RT @bentboolean: How much of Amazon's traffic ...,positive,rt bentboolean how much of amazons traffic is ...
4,830153,2020-07-09 14:39:14+00:00,$AMD Ryzen 4000 desktop CPUs looking ‘great’ a...,positive,amd ryzen desktop cpus looking great and on tr...


### **Remove Stop Words**
Remove common words that don't contribute to sentiment.

In [13]:
from nltk.corpus import stopwords

# General English stop words
general_stop_words = set(stopwords.words('english'))

# Define custom stopwords
custom_stopwords = set([
    'SP500', 'S&P', '500', 'index', 'stock', 'market', 'stocks',
    'trading', 'finance', 'investing', 'investor', 'business',
    "billion", 'price', 'StockMarket', 'share',
    'RT', 'http','dollar','dollars', 'percent', 'https', 'www', 'bit.ly', '@username', '#finance',
    'breaking', 'update', 'today', 'yesterday', 'tomorrow',  "aapl", "msft", "amzn", "tsla", "googl", "meta", "nvda", "brk.b", "jnj", "pg", 
    "v", "unh", "hd", "ma", "pep", "bac", "xom", "ko", "abbv", "avgo", "cost", 
    "mcd", "csco", "pfe", "cvx", "adbe", "mrk", "nflx", "dis", "intc", "wmt", 
    "tmo", "orcl", "crm", "nke", "wfc", "acn", "lin", "mdt", "txn", "dhr", "hon", 
    "lly", "vz", "schw", "amgn", "ibm", "t", "qcom", "sbux", "mmm", "gs", "rtx", 
    "ups", "low", "bmy", "cat", "spgi", "isrg", "c", "elv", "lmt", "mo", "bkng", 
    "adp", "amd", "de", "pm", "gild", "syk", "ge", "amt", "ms", "blk", "cci", 
    "cvs", "now", "intu", "ci", "zts", "eqix", "ice", "tgt", "mu", "fis", "ew", 
    "cb", "mmc", "apd", "cl", "so", "pgr", "duke", "pld", "aon", "fisv", "itw", 
    "stz", "regn", "adi", "hum", "exc", "pxd", "snps", "cop", "kdp", "kmb", "rop", 
    "etn", "aep", "eog", "mar", "atvi", "noc", "pru", "oxy", "orly", "d", "chrw", 
    "bax", "adm", "fdx", "aig", "dg", "tsco", "qqq", "fb", "spx", "spy","new", "day", "week", "rt"
])

# Combine general_stop_words with custom stopwords.
# Custom entries are lowercased because remove_stopwords() compares against
# word.lower(); entries such as 'SP500' or 'RT' would otherwise never match.
combined_stopwords = general_stop_words.union(word.lower() for word in custom_stopwords)

# Remove stopwords function
def remove_stopwords(text, stop_words):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the function to the 'Text_Cleaned' column
tweet_data['Text_Cleaned'] = tweet_data['Text_Cleaned'].apply(lambda x: remove_stopwords(x, combined_stopwords))


### **Tokenize the Text**
Tokenization is the process of splitting text into smaller units, called tokens, which are often individual words. This step is essential for text preprocessing as it enables the analysis of each word separately. For example, the sentence "I love programming!" would be tokenized into ['I', 'love', 'programming', '!'].

In Python, the word_tokenize function from the nltk library is commonly used for this purpose. It efficiently breaks a sentence into tokens, taking care of punctuation and special characters, allowing for precise text analysis.

Split the cleaned text into individual words.
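To make the example sentence concrete without downloading NLTK's tokenizer data, the sketch below uses a simple regular expression as a stand-in for `word_tokenize`. It separates words from punctuation but does not reproduce NLTK's handling of contractions and other special cases; the helper name `simple_tokenize` is an assumption for this example.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # A run of word characters is one token; a run of punctuation is another.
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_tokenize("I love programming!"))
```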

In [14]:

tweet_data['tokens'] = tweet_data['Text_Cleaned'].apply(word_tokenize)

The following code calculates the number of words in each entry of the dataset and identifies the minimum and maximum word counts. This analysis provides a better understanding of the text length distribution, offering valuable insights into the variability of the data prior to further processing.

In [15]:
tweet_data['n_word'] = [len(str(row['tokens']).split()) for _, row in tweet_data.iterrows()]

print(min(tweet_data['n_word']), 
max(tweet_data['n_word']))

1 42


In [16]:
tweet_data.head()

Unnamed: 0,id,created_at,text,sentiment,Text_Cleaned,tokens,n_word
0,77522,2020-04-15 01:03:46+00:00,"RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...",positive,robertbeadles yo enter win monarch tokens us c...,"[robertbeadles, yo, enter, win, monarch, token...",14
1,661634,2020-06-25 06:20:06+00:00,#SriLanka surcharge on fuel removed!\n⛽📉\nThe ...,negative,srilanka surcharge fuel removed surcharge rs i...,"[srilanka, surcharge, fuel, removed, surcharge...",26
2,413231,2020-06-04 15:41:45+00:00,Net issuance increases to fund fiscal programs...,positive,net issuance increases fund fiscal programs gt...,"[net, issuance, increases, fund, fiscal, progr...",25
3,760262,2020-07-03 19:39:35+00:00,RT @bentboolean: How much of Amazon's traffic ...,positive,bentboolean much amazons traffic served fastly...,"[bentboolean, much, amazons, traffic, served, ...",14
4,830153,2020-07-09 14:39:14+00:00,$AMD Ryzen 4000 desktop CPUs looking ‘great’ a...,positive,ryzen desktop cpus looking great track launch ...,"[ryzen, desktop, cpus, looking, great, track, ...",10


### **VADER Sentiment Analysis**

In [18]:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# Download VADER lexicon
nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()


# Function to apply VADER sentiment analysis
def analyze_sentiment(text):
    scores = sia.polarity_scores(text)
    # Return compound score and label
    compound_score = scores["compound"]
    if compound_score >= 0.05:
        sentiment_label = "positive"
    elif compound_score <= -0.05:
        sentiment_label = "negative"
    else:
        sentiment_label = "neutral"
    return pd.Series([compound_score, sentiment_label])

# Apply VADER sentiment analysis to the Text_Cleaned column
tweet_data[["vader_score", "vader_sentiment"]] = tweet_data["Text_Cleaned"].apply(analyze_sentiment)




[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Shokoufeh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [19]:
# Display results
print(tweet_data["vader_sentiment"])

0       positive
1       positive
2       positive
3       positive
4       positive
          ...   
4995    positive
4996    positive
4997     neutral
4998    positive
4999    positive
Name: vader_sentiment, Length: 5000, dtype: object


In [20]:
# Split into labeled and unlabeled data
# (.copy() lets new columns be added without a SettingWithCopyWarning)
labeled_data = tweet_data[tweet_data["sentiment"].notna()].copy()
unlabeled_data = tweet_data[tweet_data["sentiment"].isna()].copy()

In [21]:
# Add a new column to indicate whether the sentiments match
labeled_data["comparison"] = labeled_data["sentiment"] == labeled_data["vader_sentiment"]

# Calculate the number of matches
matches = labeled_data["comparison"].sum()

# Calculate total number of rows
total = len(labeled_data)

# Calculate accuracy
accuracy = (matches / total) * 100

print(f"Accuracy: {accuracy:.2f}%")


Accuracy: 68.69%
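A single accuracy figure hides which classes VADER confuses with one another. One way to look at this is a cross-tabulation of manual labels against VADER labels. The sketch below uses a few made-up rows in place of `labeled_data`, so its counts are purely illustrative and are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for labeled_data (hypothetical values, not real annotations)
toy = pd.DataFrame({
    "sentiment":       ["positive", "negative", "neutral", "negative", "positive"],
    "vader_sentiment": ["positive", "positive", "neutral", "negative", "neutral"],
})

# Rows: manual label, columns: VADER label
table = pd.crosstab(toy["sentiment"], toy["vader_sentiment"])
print(table)

# Share of rows on which the two labels agree
agreement = (toy["sentiment"] == toy["vader_sentiment"]).mean()
print(f"Agreement: {agreement:.2%}")
```

Applied to the real data, the same two calls would take `labeled_data["sentiment"]` and `labeled_data["vader_sentiment"]` as inputs.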


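### **Lemmatization or Stemming**
Lemmatization and stemming reduce words to their base forms. Neither is applied in this notebook; the cell below only previews the data after the VADER step.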
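For reference, stemming could be sketched with NLTK's `PorterStemmer`, which needs no corpus download. This is not a step in the pipeline of this notebook, and the sample tokens are chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample tokens
tokens = ["running", "programs", "launch"]
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```

Stems are not always dictionary words, which is one reason lemmatization is sometimes preferred when readability of the tokens matters.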

In [22]:
tweet_data.head()

Unnamed: 0,id,created_at,text,sentiment,Text_Cleaned,tokens,n_word,vader_score,vader_sentiment
0,77522,2020-04-15 01:03:46+00:00,"RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...",positive,robertbeadles yo enter win monarch tokens us c...,"[robertbeadles, yo, enter, win, monarch, token...",14,0.5859,positive
1,661634,2020-06-25 06:20:06+00:00,#SriLanka surcharge on fuel removed!\n⛽📉\nThe ...,negative,srilanka surcharge fuel removed surcharge rs i...,"[srilanka, surcharge, fuel, removed, surcharge...",26,0.2023,positive
2,413231,2020-06-04 15:41:45+00:00,Net issuance increases to fund fiscal programs...,positive,net issuance increases fund fiscal programs gt...,"[net, issuance, increases, fund, fiscal, progr...",25,0.0516,positive
3,760262,2020-07-03 19:39:35+00:00,RT @bentboolean: How much of Amazon's traffic ...,positive,bentboolean much amazons traffic served fastly...,"[bentboolean, much, amazons, traffic, served, ...",14,0.3818,positive
4,830153,2020-07-09 14:39:14+00:00,$AMD Ryzen 4000 desktop CPUs looking ‘great’ a...,positive,ryzen desktop cpus looking great track launch ...,"[ryzen, desktop, cpus, looking, great, track, ...",10,0.6249,positive


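### **Encode Sentiment Labels**
Convert sentiment labels (e.g., "positive", "negative") into numeric values for machine learning models. Note that the encoder below is fitted on `vader_sentiment`, so the targets for all 5,000 tweets are VADER's labels rather than the 1,300 manual annotations; the scores reported for the classifiers that follow therefore measure agreement with VADER. `LabelEncoder` assigns codes in alphabetical order: negative = 0, neutral = 1, positive = 2.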

In [23]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
tweet_data['sentiment_encoded'] = label_encoder.fit_transform(tweet_data['vader_sentiment'])
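A quick way to check which integer stands for which class is to inspect the fitted encoder. The sketch below fits a separate encoder on literal labels instead of the dataset column, so it can be run on its own.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["positive", "negative", "neutral", "positive"])

print(list(encoder.classes_))                      # classes are stored in sorted order
print(list(codes))                                 # integer code for each input label
print(list(encoder.inverse_transform([0, 1, 2])))  # map codes back to label names
```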

### **Prepare Data for Modeling**
Split the data into training (80%) and testing (20%) sets.

In [24]:
from sklearn.model_selection import train_test_split

X = tweet_data['Text_Cleaned']
y = tweet_data['sentiment_encoded']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Vectorize Text Data**
Convert the text into numerical features. TF-IDF is used here, with the vocabulary limited to the 5,000 most frequent terms; Count Vectorization is a common alternative.



In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
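To make the weighting concrete, the sketch below computes TF-IDF by hand for a toy corpus. It uses the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The corpus and helper names are assumptions for this example.

```python
import math
from collections import Counter

docs = ["profit rises", "profit falls", "loss widens"]  # toy corpus
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term: str) -> float:
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc: list[str]) -> dict[str, float]:
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}  # L2-normalized

row = tfidf_row(tokenized[0])
print(row)
```

In the first document, "rises" receives a higher weight than "profit" because it appears in fewer documents.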

### **Ready for Sentiment Analysis**
The vectorized data can now be passed to machine learning models.

A logistic regression model serves as a first baseline:

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)

# Evaluate
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.21      0.34       156
           1       0.70      0.81      0.75       380
           2       0.76      0.85      0.80       464

    accuracy                           0.74      1000
   macro avg       0.77      0.62      0.63      1000
weighted avg       0.75      0.74      0.71      1000



In [27]:
# Create a DataFrame
df = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test})

# Print the DataFrame as a table
print(df)

      Predicted  Actual
1501          1       1
2586          1       1
2653          2       2
1055          2       1
705           2       2
...         ...     ...
4711          2       2
2313          2       2
3214          2       2
2732          2       2
1926          1       2

[1000 rows x 2 columns]



### ***SVM Classifier***

In [29]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score


# Train SVM classifier
svm_clf = SVC(kernel='linear', random_state=42)
svm_clf.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred_svm = svm_clf.predict(X_test_tfidf)
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))


SVM Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.35      0.47       156
           1       0.71      0.85      0.77       380
           2       0.82      0.83      0.83       464

    accuracy                           0.76      1000
   macro avg       0.76      0.68      0.69      1000
weighted avg       0.76      0.76      0.75      1000

SVM Accuracy: 0.762


###  ***Naive Bayes Classifier***

In [30]:
from sklearn.naive_bayes import MultinomialNB

# Train Naive Bayes classifier
nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred_nb = nb_clf.predict(X_test_tfidf)
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb))
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))


Naive Bayes Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.08      0.15       156
           1       0.71      0.59      0.65       380
           2       0.62      0.90      0.74       464

    accuracy                           0.66      1000
   macro avg       0.78      0.52      0.51      1000
weighted avg       0.71      0.66      0.61      1000

Naive Bayes Accuracy: 0.655


### **Pre-trained Models**
For the transformer-based approach, FinBERT (the `yiyanghkust/finbert-tone` checkpoint) is fine-tuned using the `transformers` library from Hugging Face.

In [31]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

In [32]:
# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [33]:
# Load the FinBERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("yiyanghkust/finbert-tone")
model_bert = BertForSequenceClassification.from_pretrained("yiyanghkust/finbert-tone", num_labels=3)
model_bert.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30873, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [34]:
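from transformers import AutoTokenizer
# Note: this replaces the tokenizer loaded in the previous cell. The model comes from
# finbert-tone and the tokenizer from finbert-pretrain, so the two checkpoints are
# assumed to share the same vocabulary.
tokenizer = AutoTokenizer.from_pretrained('yiyanghkust/finbert-pretrain')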

In [35]:
tweet_data.head()

Unnamed: 0,id,created_at,text,sentiment,Text_Cleaned,tokens,n_word,vader_score,vader_sentiment,sentiment_encoded
0,77522,2020-04-15 01:03:46+00:00,"RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon...",positive,robertbeadles yo enter win monarch tokens us c...,"[robertbeadles, yo, enter, win, monarch, token...",14,0.5859,positive,2
1,661634,2020-06-25 06:20:06+00:00,#SriLanka surcharge on fuel removed!\n⛽📉\nThe ...,negative,srilanka surcharge fuel removed surcharge rs i...,"[srilanka, surcharge, fuel, removed, surcharge...",26,0.2023,positive,2
2,413231,2020-06-04 15:41:45+00:00,Net issuance increases to fund fiscal programs...,positive,net issuance increases fund fiscal programs gt...,"[net, issuance, increases, fund, fiscal, progr...",25,0.0516,positive,2
3,760262,2020-07-03 19:39:35+00:00,RT @bentboolean: How much of Amazon's traffic ...,positive,bentboolean much amazons traffic served fastly...,"[bentboolean, much, amazons, traffic, served, ...",14,0.3818,positive,2
4,830153,2020-07-09 14:39:14+00:00,$AMD Ryzen 4000 desktop CPUs looking ‘great’ a...,positive,ryzen desktop cpus looking great track launch ...,"[ryzen, desktop, cpus, looking, great, track, ...",10,0.6249,positive,2


In [37]:
# Preprocess the data
tweets = tweet_data["Text_Cleaned"].values
labels = tweet_data["sentiment_encoded"].values

In [38]:
# Tokenize the tweets
def encode_tweets(tweets, tokenizer, max_length=128):
    # Calling the tokenizer on the whole list pads or truncates every tweet
    # to max_length and returns the stacked tensors in one step.
    encoded = tokenizer(
        list(tweets),
        add_special_tokens=True,
        max_length=max_length,
        truncation=True,
        padding="max_length",
        return_attention_mask=True,
        return_tensors="pt",
    )
    return encoded["input_ids"], encoded["attention_mask"]


In [39]:
# Encode tweets
input_ids, attention_masks = encode_tweets(tweets, tokenizer)

In [40]:
# Split into train and test sets
train_inputs, test_inputs, train_masks, test_masks, train_labels, test_labels = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42
)

In [41]:
# Convert to PyTorch tensors
# The inputs and masks returned by the split are already tensors;
# only the labels (NumPy arrays) need converting.
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)


In [42]:
# Create DataLoaders
batch_size = 16

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)


In [43]:
# Define optimizer and scheduler
optimizer = torch.optim.AdamW(model_bert.parameters(), lr=2e-5, eps=1e-8)
epochs = 4

# StepLR multiplies the learning rate by gamma on every scheduler.step() call,
# which train() makes once at the end of each epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
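With `step_size=1` and `gamma=0.9`, and assuming the scheduler is stepped once per epoch, the learning rate during epoch k (counting from 0) is 2e-5 × 0.9^k. The arithmetic can be checked without PyTorch; the variable names below are chosen for this example.

```python
base_lr = 2e-5
gamma = 0.9
n_epochs = 4

# Learning rate in effect during each epoch when the scheduler steps once per epoch
lrs = [base_lr * gamma ** epoch for epoch in range(n_epochs)]
for epoch, lr in enumerate(lrs, start=1):
    print(f"Epoch {epoch}: lr = {lr:.3e}")
```

Over four epochs the rate falls by roughly a quarter, so the decay is mild compared with the warmup-and-linear-decay schedules often used for fine-tuning BERT models.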

In [44]:
import torch.nn as nn

# Define the loss function
loss_fn = nn.CrossEntropyLoss()


In [45]:
# Training loop
def train():
    model_bert.train()
    for epoch in range(epochs):
        total_loss = 0
        for step, batch in enumerate(train_dataloader):
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device).long()  # Convert labels to LongTensor

            model_bert.zero_grad()
            outputs = model_bert(b_input_ids, attention_mask=b_input_mask)
            logits = outputs.logits

            # Calculate loss
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Backpropagation
            loss.backward()
            optimizer.step()
        
        print(f"Epoch {epoch + 1}/{epochs} | Loss: {total_loss / len(train_dataloader)}")
        scheduler.step()


In [46]:
def evaluate():
    model_bert.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in test_dataloader:
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)

            outputs = model_bert(b_input_ids, attention_mask=b_input_mask)
            logits = outputs.logits
            predictions.append(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.append(b_labels.cpu().numpy())

    predictions = np.concatenate(predictions)
    true_labels = np.concatenate(true_labels)
    print("Accuracy:", accuracy_score(true_labels, predictions))
    print(classification_report(true_labels, predictions, target_names=["Negative", "Neutral", "Positive"]))


In [54]:
# Run training and evaluation
train()
evaluate()

Epoch 1/4 | Loss: 1.0540175390243531
Epoch 2/4 | Loss: 0.586809101819992
Epoch 3/4 | Loss: 0.31523855816572904
Epoch 4/4 | Loss: 0.12900617661699654
Accuracy: 0.784
              precision    recall  f1-score   support

    Negative       0.65      0.64      0.65       161
     Neutral       0.84      0.75      0.79       391
    Positive       0.79      0.87      0.83       448

    accuracy                           0.78      1000
   macro avg       0.76      0.75      0.75      1000
weighted avg       0.79      0.78      0.78      1000

