<div style="text-align: center">

<b><h1>  </h1></b>
<h2> Assignment 4 - Classification </h2>
<h2> Modern Information Retrieval Course </h2>
<h3> Dr. Asgari </h3>
<h3> Group Members </h3>
Parsa Mohammadian - 98102284
<br/>
Sara Azarnoush - 98170668
<br/>
Kahbod Aeini - 98101209 
<br/>
<br/>
Sharif University of Technology
<br/>
Computer Engineering Department
<hr/>
</div>

### Introduction

In this project, we want to classify tweets into positive and negative sentiment using two methods. First, we will use a regression model to predict the sentiment of a tweet. Second, we will use an transformer-based model. Steps of the process are:
- Load the data
- Preprocess the data
  - Tokenize the data
  - Normalize the data
  - Stem the data
  - Rejoin the tokens
- Fit a regression model
  - Vectorize the data
  - Split the data into training and test sets
  - Train the model
  - Evaluate the model
- Fit a transformer-based model
  - Split the data into training and test sets
  - Train the model
  - Evaluate the model

---

### Requirements

---

In [1]:
try:
    from google.colab import drive
    COLAB = True
except:
    COLAB = False
    print('Not in Google Colab')

if COLAB:
  drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from IPython.display import clear_output


In [3]:
%pip install pandas
%pip install nltk
%pip install sklearn
%pip install numpy
%pip install simpletransformers

clear_output()


In [4]:
import pandas as pd
import nltk
import string
import functools
import sklearn as sk
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, precision_score, recall_score


In [5]:
nltk.download('punkt')
nltk.download('stopwords')

clear_output()


### Load Data

First, we load the [sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset which is provided by the documentation of the project into pandas dataframe. The dataset contains about 200MB of labelled data. After that, we drop redundant columns and keep only the tweet and the sentiment. The sentiment is encoded as a number from 0(negative) and 4(positive). We also remapped them to 0 and 1. Finally, since the data is to large for normal computer to process, we used fraction of the data in the rest of the notebook.

---


In [6]:
if COLAB:
    PATH_TO_SENTIMENT140_DATASET = 'drive/MyDrive/training.1600000.processed.noemoticon.csv'
else:
    PATH_TO_SENTIMENT140_DATASET = '../datasets/training.1600000.processed.noemoticon.csv'
CSV_COLUMNS = ['target', 'id', 'date', 'flag', 'user', 'text']
TEST_SIZE = 0.2


df = pd.read_csv(PATH_TO_SENTIMENT140_DATASET)
df.columns = CSV_COLUMNS
df.drop(columns=['id', 'date', 'flag', 'user'], inplace=True)
df['target'] = df['target'].map({0: 0, 4:1})
df.head()


Unnamed: 0,target,text
0,0,is upset that he can't update his Facebook by ...
1,0,@Kenichan I dived many times for the ball. Man...
2,0,my whole body feels itchy and like its on fire
3,0,"@nationwideclass no, it's not behaving at all...."
4,0,@Kwesidei not the whole crew


#### Sample Data

Limit the size of the dataset in order to limit resource and time usage.


In [7]:
df = df.sample(n=10000)


### Preprocess Data

Here we used nltk package and other defined functions to normalize the data as explained in the introduction. 

---

#### Tokenize Text

In [8]:
df['text_tokenized'] = df['text'].apply(lambda x: nltk.word_tokenize(x))


#### Normalize Text

In [9]:
def to_lower(tokens: list) -> list:
    """
    Converts the tokens to lower case.
    """
    return [token.lower() for token in tokens]
    

def contains_any_of(token: list, chars: str) -> bool:
    """
    Returns true if the token contains any of the characters in the given list.
    """
    return any(char in token for char in chars)


def remove_punctuation(tokens: list) -> list:
    """
    Removes punctuation from the given tokens.
    """
    return [token for token in tokens if not contains_any_of(token, string.punctuation+"’‘•")]


def remove_stop_words(tokens: list) -> list:
    """
    Removes stop words from the given tokens.
    """
    remove_stop_words.stop_words = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in remove_stop_words.stop_words]


def normalize(tokens):
    """
    Normalizes the tokens of the lyrics.
    """
    normalization_functions = [to_lower, remove_punctuation, remove_stop_words]
    return functools.reduce(lambda x, f: f(x), normalization_functions, tokens)


df['text_normalized'] = df['text_tokenized'].apply(normalize)


#### Stem Text

In [10]:
stemmer = nltk.stem.SnowballStemmer('english')
df['text_stemmed'] = df['text_normalized'].apply(lambda x: [stemmer.stem(t) for t in x])


#### Join Text

In [11]:
df['text_preprocessed'] = df['text_stemmed'].apply(lambda x: ' '.join(x))


#### Preprocess Effec

Without preprocessing, the accuracy of the model will be lower. Because mentions, hashtags, stopwords and punctuation commonly occur in the tweets, removing them will help the model to predict the sentiment of the tweet.

### Regression Model

Here we used sklearn package in order to vectorize the tweet text, and fit a regression model with it. 

---

#### Vectorize Text

In [12]:
tfidf_vectorizer = sk.feature_extraction.text.TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text_preprocessed']).todense().tolist()
df['tfidf'] = tfidf_matrix


#### Split Data

In [13]:
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(tfidf_matrix, df['target'], test_size=TEST_SIZE)


#### Train Model

In [14]:
logistic_regression = sk.linear_model.LogisticRegression()
logistic_regression.fit(X_train, y_train);


#### Test Model

In [31]:
logistic_regression.score(X_test, y_test)


0.7235

In [30]:
y_pred = logistic_regression.predict(X_test)
print("confusion matrix: \n", confusion_matrix(y_test, y_pred))
print("accuracy score: ", accuracy_score(y_test, y_pred))
print("f1 score: ", f1_score(y_test, y_pred, average="macro"))
print("precision score: ", precision_score(y_test, y_pred, average="macro"))
print("recall score: ", recall_score(y_test, y_pred, average="macro")) 

confusion matrix: 
 [[704 275]
 [278 743]]
accuracy score:  0.7235
f1 score:  0.7233948208806399
precision score:  0.7233833762138933
recall score:  0.7234095235999076


### Transformer Based Model

Here we used simpletransformer package in addition to sklearn package.

---

#### Split Data

In [18]:
train_df, test_df = sk.model_selection.train_test_split(df, test_size=TEST_SIZE)
train_df = train_df[['text_preprocessed', 'target']]
test_df = test_df[['text_preprocessed', 'target']]
train_df.columns = ['text', 'labels']
test_df.columns = ['text', 'labels']


#### Train Model

In [19]:
args = {
    "num_train_epochs": 10
}
model = ClassificationModel('roberta', 'roberta-base', use_cuda=COLAB, args=args)
model.train_model(train_df)

clear_output()


#### Test Model

In [29]:
f1_macro = lambda y_test, y_pred: f1_score(y_test, y_pred, average="macro")
result, _, _ = model.eval_model(test_df, f1=f1_macro, acc=accuracy_score,
                                confusion_matrix=confusion_matrix,
                                precision_score=precision_score,
                                recall_score=recall_score)


  0%|          | 0/2000 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/250 [00:00<?, ?it/s]

In [27]:
print("confusion matrix: \n", result['confusion_matrix'])
print("accuracy score: ", result['acc'])
print("f1 score: ", result['f1'])
print("precision score: ", result['precision_score'])
print("recall score: ", result['recall_score']) 


confusion matrix: 
 [[768 235]
 [274 723]]
accuracy score:  0.7455
f1 score:  0.7453710941163965
precision score:  0.7546972860125261
recall score:  0.7251755265797393


#### Model Size

In [28]:
!du -hs outputs


22G	outputs
