<a href="https://colab.research.google.com/github/vannicc/CCMACLRL_EXERCISES_COM222-ML/blob/main/Exercise7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that can classify if a sentence is a hate speech or non-hate speech
- A sentence with a label of zero (0) is classified as non-hate speech
- A sentence with a label of one (1) is classified as a hate speech

Apply text pre-processing techniques such as
- Converting to lowercase
- Stop word Removal
- Removal of digits, special characters
- Stemming or Lemmatization but not both
- Count Vectorizer or TF-IDF Vectorizer but not both

Evaluate your model by:
- Providing input by yourself
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [360]:
import pandas as pd
import re

In [361]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [362]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [363]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [364]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [365]:
# put your answer here
df_train.head(10)

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
5,"""Ang sinungaling sa umpisa ay sinungaling hang...",1
6,Leni Kiko,0
7,Nahiya si Binay sa Makati kaya dito na lang sa...,1
8,Another reminderHalalan,0
9,[USERNAME] Maybe because VP Leni Sen Kiko and ...,0


2. Check how many rows and columns are in the training dataset using `.info()`

In [366]:
# put your answer here
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


3. Check for NaN values

In [367]:
# put your answer here
df_train.isnull().sum()

Unnamed: 0,0
text,0
label,0


4. Check for duplicate rows

In [368]:
# put your answer here
df_train.duplicated().sum()

0

5. Check how many rows belong to each class

In [369]:
# put your answer here
df_train['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,10994
0,10779


## B. Text pre-processing

6. Remove duplicate rows

In [370]:
# put your answer here
df_train.drop_duplicates(subset='text',keep='first')

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
...,...,...
21768,Marcos Talunan Marcos Magnanakaw,1
21769,Grabe kayo kay binay ??????????,0
21770,[USERNAME] Cnu ba naman ang hindImabibighani s...,0
21771,RT [USERNAME]: Tabi tabi yung mga nagsasabing ...,1


7. Remove rows with NaN values

In [371]:
# put your answer here
df_train.dropna(axis = 0, inplace=True)

8. Convert all text to lowercase

In [372]:
# put your answer here
df_train['text'] = df_train['text'].str.lower()

9. Remove digits, URLS and special characters

In [373]:
# put your answer here
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"http\S+|www\.\S+", "", x))
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"\w+@\w+\.com", "", x))
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[.,;:!\?\"'`]", "", x))
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[@#$%^&*\/\+-_=\{\}<>]", "", x))
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"½m|½s|½t|½ï", "", x))

10. Remove stop words

In [374]:
# put your answer here
from nltk.corpus import stopwords

!pip install nltk # install nltk if it's not installed
import nltk # import nltk
nltk.download('stopwords') # download the stopwords resource

STOP_WORDS = set(
    """
akin
aking
ako
alin
am
amin
aming
ang
ano
anumang
apat
at
atin
ating
ay
bababa
bago
bakit
bawat
bilang
dahil
dalawa
dapat
din
dito
doon
gagawin
gayunman
ginagawa
ginawa
ginawang
gumawa
gusto
habang
hanggang
hindi
huwag
iba
ibaba
ibabaw
ibig
ikaw
ilagay
ilalim
ilan
inyong
isa
isang
itaas
ito
iyo
iyon
iyong
ka
kahit
kailangan
kailanman
kami
kanila
kanilang
kanino
kanya
kanyang
kapag
kapwa
karamihan
katiyakan
katulad
kaya
kaysa
ko
kong
kulang
kumuha
kung
laban
lahat
lamang
likod
lima
maaari
maaaring
maging
mahusay
makita
marami
marapat
masyado
may
mayroon
mga
minsan
mismo
mula
muli
na
nabanggit
naging
nagkaroon
nais
nakita
namin
napaka
narito
nasaan
ng
ngayon
ni
nila
nilang
nito
niya
niyang
noon
o
pa
paano
pababa
paggawa
pagitan
pagkakaroon
pagkatapos
palabas
pamamagitan
panahon
pangalawa
para
paraan
pareho
pataas
pero
pumunta
pumupunta
sa
saan
sabi
sabihin
sarili
sila
sino
siya
tatlo
tayo
tulad
tungkol
una
walang
""".split()
)
stop_words = set(stopwords.words('english'))
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words and STOP_WORDS)]))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


11. Use Stemming or Lemmatization

In [375]:
# put your answer here
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split() if word not in (stop_words and STOP_WORDS)]))

## C. Training your model

12. Put all text training data in variable **X_train**

In [376]:
# put your answer here
X_train = df_train['text']

13. Put all training data labels in variable **y_train**

In [377]:
# put your answer here
y_train = df_train['label']

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Put the converted data to **X_train_transformed** variable

In [378]:
# put your answer here
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

15. Create an instance of `MultinomalNB()`

In [379]:
# put your answer here
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

16. Train the model using `.fit()`

In [380]:
# put your answer here
model.fit(X_train_transformed, y_train)

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Put the converted data to **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [381]:
# put your answer here
X_validation = df_validation['text']
X_validation_transformed = vectorizer.transform(X_validation)
y_validation_pred = model.predict(X_validation_transformed)

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [382]:
# put your answer here
y_validation = df_validation['label']
accuracy = accuracy_score(y_validation, y_validation_pred)
precision = precision_score(y_validation, y_validation_pred)
recall = recall_score(y_validation, y_validation_pred)
f1 = f1_score(y_validation, y_validation_pred)

print("Accuracy:", round(100*accuracy,2), '%')
print("Recall:", round(100*recall,2), '%')
print("Precision:", round(100*precision,2), '%')
print("F1-Score:", round(100*f1,2), '%')

Accuracy: 81.64 %
Recall: 85.16 %
Precision: 79.85 %
F1-Score: 82.42 %


19. Create a confusion matrix using the **validation dataset**

In [383]:
# put your answer here
from sklearn.metrics import confusion_matrix
confusion_matrix(df_validation['label'], y_validation_pred)

array([[1081,  304],
       [ 210, 1205]])

20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text validation data in **X_test** variable

- Convert **X_test** to its numerical form.

- Put the converted data to **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [384]:
# put your answer here
X_test = df_test['text']
X_test_transformed = vectorizer.transform(X_test)
y_test_pred = model.predict(X_test_transformed)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_validation** variable



In [385]:
# put your answer here
y_test = df_test['label']
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

print("Accuracy:", round(100*accuracy,2), '%')
print("Recall:", round(100*recall,2), '%')
print("Precision:", round(100*precision,2), '%')
print("F1-Score:", round(100*f1,2), '%')

Accuracy: 81.07 %
Recall: 83.91 %
Precision: 79.26 %
F1-Score: 81.51 %


22. Create a confusion matrix using the **test dataset**

In [386]:
# put your answer here
from sklearn.metrics import confusion_matrix
confusion_matrix(df_test['label'], y_test_pred)

array([[1105,  307],
       [ 225, 1173]])

## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [387]:
# put your answer here
sentence = 'Mahal kita pero mahal ko rin ang sarili ko'
if(model.predict(vectorizer.transform([sentence]))[0] == 0):
  print("Non-hate speech")
else:
  print("Hate speech")

Non-hate speech


24. Test the model by providing a hate speech input. The model should predict it as 1

In [388]:
# put your answer here
sentence = 'Tarantado ka talaga eh no?'
if(model.predict(vectorizer.transform([sentence]))[0] == 0):
  print("Non-hate speech")
else:
  print("Hate speech")

Hate speech


In [389]:
sentence = 'Ang init ng ulo ko ang gulo gulo ng paligid ang sarap talaga sumigaw ng putang ina'
if(model.predict(vectorizer.transform([sentence]))[0] == 0):
  print("Non-hate speech")
else:
  print("Hate speech")

Hate speech
