
# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies a sentence as hate speech or non-hate speech
- A label of zero (0) means non-hate speech
- A label of one (1) means hate speech

Apply text pre-processing techniques such as:
- Converting to lowercase
- Stop word removal
- Removal of digits and special characters
- Stemming or lemmatization, but not both
- Count Vectorizer or TF-IDF Vectorizer, but not both

Evaluate your model by:
- Providing your own input
- Creating a confusion matrix
- Calculating the accuracy, precision, recall and F1-score

In [67]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [68]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [69]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [70]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [71]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [72]:
# put your answer here
df_train.head(10)

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
5,"""Ang sinungaling sa umpisa ay sinungaling hang...",1
6,Leni Kiko,0
7,Nahiya si Binay sa Makati kaya dito na lang sa...,1
8,Another reminderHalalan,0
9,[USERNAME] Maybe because VP Leni Sen Kiko and ...,0


2. Check how many rows and columns are in the training dataset using `.info()`

In [73]:
# put your answer here
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


3. Check for NaN values

In [74]:
# put your answer here
df_train.isna()

Unnamed: 0,text,label
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
21768,False,False
21769,False,False
21770,False,False
21771,False,False
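A full boolean frame is hard to scan at this size. As a minimal sketch, chaining `.sum()` gives one count per column instead; the `toy` frame below is made up for illustration and is not the exercise data.

```python
import pandas as pd

# Hypothetical toy frame (not the exercise data) with one missing text value
toy = pd.DataFrame({"text": ["a", None, "c"], "label": [0, 1, 1]})

# isna() marks each cell; sum() counts the True values per column
nan_counts = toy.isna().sum()
print(nan_counts)
```

On the training set the same call would be `df_train.isna().sum()`.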


4. Check for duplicate rows

In [75]:
# put your answer here
df_train.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
21768,False
21769,False
21770,False
21771,False
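The row-by-row flags can be reduced to a single number in the same way. A small sketch on a made-up `toy` frame (an assumption for illustration, not the exercise data):

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first
toy = pd.DataFrame({"text": ["a", "b", "a"], "label": [0, 1, 0]})

# duplicated() flags every repeat after the first occurrence; sum() counts the flags
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

On the training set the equivalent call would be `df_train.duplicated().sum()`.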


5. Check how many rows belong to each class

In [76]:
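# put your answer here
# Note: shape[0] is the total number of rows; per-class counts come from df_train['label'].value_counts()
df_train.shape[0]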

21773
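The number above is the total row count, not a per-class breakdown. One common way to get the rows per class is `value_counts()` on the label column; the sketch below uses a made-up `toy` frame rather than the exercise data.

```python
import pandas as pd

# Hypothetical toy frame with one non-hate (0) and three hate (1) rows
toy = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 1, 1]})

# Absolute number of rows per class
class_counts = toy["label"].value_counts()

# Share of each class, useful for spotting class imbalance
class_share = toy["label"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```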

## B. Text pre-processing

6. Remove duplicate rows

In [77]:
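# put your answer here
# Note: drop_duplicates() returns a new DataFrame; df_train is unchanged unless the result is assigned back
df_train.drop_duplicates()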

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
...,...,...
21768,Marcos Talunan Marcos Magnanakaw,1
21769,Grabe kayo kay binay ??????????,0
21770,[USERNAME] Cnu ba naman ang hindImabibighani s...,0
21771,RT [USERNAME]: Tabi tabi yung mga nagsasabing ...,1
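`drop_duplicates()` returns a new frame and leaves the original untouched, so the call above only displays the result. A short sketch of the difference, on a made-up `toy` frame (not the exercise data):

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first
toy = pd.DataFrame({"text": ["a", "b", "a"], "label": [0, 1, 0]})

toy.drop_duplicates()            # result is discarded, toy still has 3 rows
rows_before = len(toy)

# Assign the result back (and renumber the index) to keep the change
toy = toy.drop_duplicates().reset_index(drop=True)
rows_after = len(toy)

print(rows_before, rows_after)
```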


7. Remove rows with NaN values

In [78]:
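# put your answer here
df_train = df_train.dropna()     # drop rows with missing values (.info() shows there are none here)
df_train.isnull().any().sum()    # confirm that no column still contains NaN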

0

8. Convert all text to lowercase

In [79]:
# put your answer here
df_train['text'] = df_train['text'].str.lower()

9. Remove digits, URLs and special characters

In [80]:
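# put your answer here
# Remove runs of digits
df_train['text'] = df_train['text'].str.replace(r'\d+', '', regex=True)
# Remove URLs (any run of non-space characters starting with "http")
df_train['text'] = df_train['text'].str.replace(r'http\S+', '', regex=True)
# Remove everything that is not a letter or whitespace
df_train['text'] = df_train['text'].str.replace(r'[^a-zA-Z\s]', '', regex=True)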
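To see what these substitutions do to a single string, here is a minimal sketch with the standard `re` module. The `sample` string is invented, and the sketch removes URLs before digits (the cell above does digits first) and adds a whitespace-collapsing step that the notebook does not have.

```python
import re

# Invented sample string, not taken from the dataset
sample = "Check THIS out: https://t.co/abc123 #Halalan2022!!"

text = sample.lower()
text = re.sub(r"http\S+", "", text)        # drop URLs while they are still intact
text = re.sub(r"\d+", "", text)            # drop digits
text = re.sub(r"[^a-z\s]", "", text)       # drop anything that is not a letter or whitespace
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace (extra step)

print(text)
```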


10. Remove stop words

In [81]:
# Define Tagalog stopwords
tagalog_stopwords = [
    "akin", "aking", "ako", "alin", "am", "amin", "aming", "ang", "ano", "anumang",
    "apat", "at", "atin", "ating", "ay", "bababa", "bago", "bakit", "bawat",
    "bilang", "dahil", "dalawa", "dapat", "din", "dito", "doon", "gagawin",
    "gayunman", "ginagawa", "ginawa", "ginawang", "gumawa", "gusto", "habang",
    "hanggang", "hindi", "huwag", "iba", "ibaba", "ibabaw", "ibig", "ikaw",
    "ilagay", "ilalim", "ilan", "inyong", "isa", "isang", "itaas", "ito",
    "iyo", "iyon", "iyong", "ka", "kahit", "kailangan", "kailanman", "kami",
    "kanila", "kanilang", "kanino", "kanya", "kanyang", "kapag", "kapwa",
    "karamihan", "katiyakan", "katulad", "kaya", "kaysa", "ko", "kong",
    "kulang", "kumuha", "kung", "laban", "lahat", "lamang", "likod", "lima",
    "maaari", "maaaring", "maging", "mahusay", "makita", "marami", "marapat",
    "masyado", "may", "mayroon", "mga", "minsan", "mismo", "mula", "muli",
    "na", "nabanggit", "naging", "nagkaroon", "nais", "nakita", "namin",
    "napaka", "narito", "nasaan", "ng", "ngayon", "ni", "nila", "nilang",
    "nito", "niya", "niyang", "noon", "o", "pa", "paano", "pababa",
    "paggawa", "pagitan", "pagkakaroon", "pagkatapos", "palabas",
    "pamamagitan", "panahon", "pangalawa", "para", "paraan", "pareho",
    "pataas", "pero", "pumunta", "pumupunta", "sa", "saan", "sabi",
    "sabihin", "sarili", "sila", "sino", "siya", "tatlo", "tayo",
    "tulad", "tungkol", "una", "walang"
]

# Use a set for fast membership tests; the list itself is kept for the vectorizer's stop_words argument
stopword_set = set(tagalog_stopwords)
df_train['text'] = df_train['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stopword_set)
)

# # Download stopwords
# nltk.download('punkt')
# nltk.download('stopwords')

# # Sample text
# text = "This is a sample sentence, showing off the stopwords removal."

# # Tokenize the text
# words = word_tokenize(text)

# # Get the English stopwords
# stop_words = set(stopwords.words('english'))

# # Remove stopwords
# filtered_words = [word for word in words if word.lower() not in stop_words]

# print(filtered_words)


11. Use Stemming or Lemmatization

In [82]:
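# put your answer here
# word_tokenize and WordNetLemmatizer need these NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # used by newer NLTK releases
nltk.download('wordnet', quiet=True)

# Note: WordNet covers English only, so most Tagalog tokens pass through unchanged
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_tokens)

df_train['text'] = df_train['text'].apply(lemmatize_text)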




## C. Training your model

12. Put all text training data in variable **X_train**

In [83]:
# put your answer here
X = df_train['text']

13. Put all training data labels in variable **y_train**

In [84]:
# put your answer here
y = df_train['label']

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Put the converted data to **X_train_transformed** variable

In [85]:
# Note: the exercise provides separate validation and test sets, so this split only
# holds out 30% of the training rows; X_test and y_test are overwritten in Section D
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(15241,) (15241,) (6532,) (6532,)


In [86]:
# put your answer here
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore the Tagalog stop words and any term that appears in more than half of the documents
vect = TfidfVectorizer(stop_words=tagalog_stopwords, max_df=0.5)

# Fit on the training text and transform it into a TF-IDF matrix
X_train_transformed = vect.fit_transform(X_train)

# Transform the held-out split with the already fitted vectorizer
X_test_transformed = vect.transform(X_test)

15. Create an instance of `MultinomialNB()`

In [87]:
# put your answer here
from sklearn.naive_bayes import MultinomialNB

# alpha=1 applies Laplace (add-one) smoothing to the per-class term counts
model = MultinomialNB(alpha=1)

16. Train the model using `.fit()`

In [88]:
# put your answer here
model.fit(X_train_transformed, y_train)
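To make the role of `alpha` concrete, the following sketch fits `MultinomialNB` on a tiny made-up count matrix and checks the fitted per-class term probabilities against the smoothed estimate (term count + alpha) / (class total + alpha × vocabulary size). The arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, 3 vocabulary terms (made up for illustration)
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimate: sum the term counts per class, then apply additive smoothing
term_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])
manual = (term_counts + alpha) / (
    term_counts.sum(axis=1, keepdims=True) + alpha * X_toy.shape[1]
)

# The fitted model stores the same quantities in log space
matches = np.allclose(np.log(manual), nb.feature_log_prob_)
print(matches)
```

Without smoothing, the third term would have probability zero in class 0, and any document containing it could never be assigned to that class.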

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Put the converted data to **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [89]:
# put your answer here
X_validation = df_validation['text']
X_validation_transformed = vect.transform(X_validation)
y_validation_pred = model.predict(X_validation_transformed)
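The cell above passes the raw validation text to `vect.transform`, while the training text went through the cleaning steps in Section B. One way to keep the splits consistent is a single helper applied to every split. The sketch below is only an illustration: `clean_text` and `demo_stopwords` are hypothetical names, lemmatization is left out for brevity, and re-running the notebook with such a helper would likely change the reported metrics.

```python
import re

def clean_text(text, stopwords=frozenset()):
    """Hypothetical helper: apply the same cleaning steps to any split."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)     # URLs
    text = re.sub(r"\d+", "", text)         # digits
    text = re.sub(r"[^a-z\s]", "", text)    # special characters
    return " ".join(w for w in text.split() if w not in stopwords)

# Small stand-in for the full Tagalog stop word list
demo_stopwords = frozenset({"ang", "ng", "sa"})

train_example = clean_text("Ang GULO ng paligid!!! 2022", demo_stopwords)
validation_example = clean_text("ang gulo NG paligid http://t.co/x1", demo_stopwords)

print(train_example)
print(validation_example)
```

In the notebook this could be applied as `df_validation['text'].apply(lambda t: clean_text(t, set(tagalog_stopwords)))` before calling `vect.transform`.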

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [90]:
# put your answer here
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_validation = df_validation['label']

accuracy = accuracy_score(y_validation, y_validation_pred)
precision = precision_score(y_validation, y_validation_pred)
recall = recall_score(y_validation, y_validation_pred)
f1 = f1_score(y_validation, y_validation_pred)

print(f"Accuracy (Validation): {accuracy}")
print(f"Precision (Validation): {precision}")
print(f"Recall (Validation): {recall}")
print(f"F1 Score (Validation): {f1}")

Accuracy (Validation): 0.8264285714285714
Precision (Validation): 0.8094603597601598
Recall (Validation): 0.8586572438162544
F1 Score (Validation): 0.8333333333333334


19. Create a confusion matrix using the **validation dataset**

In [91]:
# put your answer here
from sklearn.metrics import confusion_matrix

cm_validation = confusion_matrix(y_validation, y_validation_pred)
print("Confusion Matrix (Validation):")
print(cm_validation)


Confusion Matrix (Validation):
[[1099  286]
 [ 200 1215]]
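The four counts in this matrix are enough to reproduce the validation scores from step 18. A minimal sketch, assuming scikit-learn's layout for binary labels (rows are true classes, columns are predicted classes, so the flattened order is TN, FP, FN, TP):

```python
import numpy as np

# Validation confusion matrix copied from the output above
cm = np.array([[1099, 286],
               [200, 1215]])
tn, fp, fn, tp = cm.ravel()

cm_accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
cm_precision = tp / (tp + fp)          # share of predicted hate speech that is truly hate speech
cm_recall = tp / (tp + fn)             # share of true hate speech that was found
cm_f1 = 2 * cm_precision * cm_recall / (cm_precision + cm_recall)

print(cm_accuracy, cm_precision, cm_recall, cm_f1)
```

Read this way, the matrix shows 286 non-hate sentences flagged as hate speech and 200 hate speech sentences that were missed.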


20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text test data in **X_test** variable

- Convert **X_test** to its numerical form.

- Put the converted data to **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [92]:
# put your answer here
X_test = df_test['text']

X_test_transformed = vect.transform(X_test)

y_test_pred = model.predict(X_test_transformed)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_test** variable



In [93]:
# put your answer here
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = df_test['label']

accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)

print(f"Accuracy (Test): {accuracy_test}")
print(f"Precision (Test): {precision_test}")
print(f"Recall (Test): {recall_test}")
print(f"F1 Score (Test): {f1_test}")

Accuracy (Test): 0.8252669039145908
Precision (Test): 0.802939211756847
Recall (Test): 0.8597997138769671
F1 Score (Test): 0.8303972366148532


22. Create a confusion matrix using the **test dataset**

In [94]:
# put your answer here

cm_test = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix (Test):")
print(cm_test)

Confusion Matrix (Test):
[[1117  295]
 [ 196 1202]]


## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [95]:
# put your answer here

new_input = ["i love you"]
new_input_transformed = vect.transform(new_input)
prediction = model.predict(new_input_transformed)

print("Prediction:", prediction)

Prediction: [0]


24. Test the model by providing a hate speech input. The model should predict it as 1

In [96]:
# put your answer here

new_input = ["Ang init ng ulo ko ang gulo gulo ng paligid ang sarap talaga sumigaw ng putang ina"]
new_input_transformed = vect.transform(new_input)
prediction = model.predict(new_input_transformed)

print("Prediction:", prediction)

Prediction: [1]
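A hard 0/1 label hides how confident the model is. As a hedged sketch, `predict_proba` exposes the class probabilities; the example wraps the vectorizer and classifier in a scikit-learn pipeline and trains on four made-up sentences, so `toy_texts`, `toy_labels` and `toy_model` are assumptions and the numbers say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences and labels, for illustration only
toy_texts = ["mahal kita", "salamat po", "galit ako sayo", "ayoko sayo"]
toy_labels = [0, 0, 1, 1]

# A pipeline accepts raw strings directly, so no separate transform call is needed
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1))
toy_model.fit(toy_texts, toy_labels)

# One row per input, one column per class in toy_model.classes_
probs = toy_model.predict_proba(["mahal kita talaga"])
print(toy_model.classes_, probs)
```

With the notebook's own objects, the equivalent call would be `model.predict_proba(vect.transform(new_input))`.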
