<a href="https://colab.research.google.com/github/thatvernon-yes/CCMACLRL_EXERCISES_COM222/blob/main/Exercise7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies whether a sentence is hate speech or non-hate speech
- A sentence with a label of zero (0) is classified as non-hate speech
- A sentence with a label of one (1) is classified as hate speech

Apply text pre-processing techniques such as
- Converting to lowercase
- Stop word Removal
- Removal of digits, special characters
- Stemming or Lemmatization but not both
- Count Vectorizer or TF-IDF Vectorizer but not both

Evaluate your model by:
- Providing input by yourself
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [30]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [31]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [32]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [33]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [34]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [35]:
# put your answer here
# Note: .sample(10) draws 10 random rows; df_train.head(10) would return the first 10.
df_train.sample(10)

| | text | label |
|---|---|---|
| 9676 | [USERNAME]:Can BINAY really be relied on as a ... | 1 |
| 16403 | Currently stalking Jejomar Binay. Mapapamura k... | 1 |
| 14228 | [USERNAME] Oo nga pandak nognog yan si binay h... | 1 |
| 13321 | I don't know kung totoo issue mo grace POE JUANCO | 0 |
| 7392 | Si Trillanes wala ng pag asa kaya nabili na si... | 1 |
| 17728 | True ba? Pula Ang Kulay Ng MagnanakawPula Ang ... | 1 |
| 7104 | Ang laki ng nawala sa atin Nakaka sad Tapos ii... | 0 |
| 19670 | so yung nanay ko hindi si inday sara ang vp ny... | 0 |
| 2027 | El Vibora de Manila [USERNAME] Timbre ng amuyo... | 0 |
| 8092 | [USERNAME]and[USERNAME] Confirmed sibuyas here... | 0 |


2. Check how many rows and columns are in the training dataset using `.info()`

In [36]:
# put your answer here
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


3. Check for NaN values

In [37]:
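# put your answer here
# .isnull().any() flags each column that contains at least one NaN;
# .sum() then counts those columns, so 0 means no column has missing values.
df_train.isnull().any().sum()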

0

4. Check for duplicate rows

In [38]:
# put your answer here
df_train.duplicated().sum()

0

5. Check how many rows belong to each class

In [39]:
# put your answer here
# .value_counts() returns the number of rows for each label
# (0 = non-hate speech, 1 = hate speech)
df_train["label"].value_counts()


## B. Text pre-processing

6. Remove duplicate rows

In [40]:
# put your answer here
# drop_duplicates() returns a new DataFrame, so assign the result back
df_train = df_train.drop_duplicates()
df_train

| | text | label |
|---|---|---|
| 0 | Presidential candidate Mar Roxas implies that ... | 1 |
| 1 | Parang may mali na sumunod ang patalastas ng N... | 1 |
| 2 | Bet ko. Pula Ang Kulay Ng Posas | 1 |
| 3 | [USERNAME] kakampink | 0 |
| 4 | Bakit parang tahimik ang mga PINK about Doc Wi... | 1 |
| ... | ... | ... |
| 21768 | Marcos Talunan Marcos Magnanakaw | 1 |
| 21769 | Grabe kayo kay binay ?????????? | 0 |
| 21770 | [USERNAME] Cnu ba naman ang hindImabibighani s... | 0 |
| 21771 | RT [USERNAME]: Tabi tabi yung mga nagsasabing ... | 1 |


7. Remove rows with NaN values

In [41]:
# put your answer here
# dropna() returns a new DataFrame, so assign the result back
df_train = df_train.dropna()
df_train

| | text | label |
|---|---|---|
| 0 | Presidential candidate Mar Roxas implies that ... | 1 |
| 1 | Parang may mali na sumunod ang patalastas ng N... | 1 |
| 2 | Bet ko. Pula Ang Kulay Ng Posas | 1 |
| 3 | [USERNAME] kakampink | 0 |
| 4 | Bakit parang tahimik ang mga PINK about Doc Wi... | 1 |
| ... | ... | ... |
| 21768 | Marcos Talunan Marcos Magnanakaw | 1 |
| 21769 | Grabe kayo kay binay ?????????? | 0 |
| 21770 | [USERNAME] Cnu ba naman ang hindImabibighani s... | 0 |
| 21771 | RT [USERNAME]: Tabi tabi yung mga nagsasabing ... | 1 |


8. Convert all text to lowercase

In [42]:
# put your answer here
df_train["text"] = df_train["text"].str.lower()

9. Remove digits, URLs and special characters

In [43]:
# put your answer here

# removing links
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"http\S+|www\.\S+", "", x))

# removing email addresses
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"\w+@\w+\.com", "", x))

# removing punctuation marks
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[.,;:!\?\"'`]", "", x))

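# removing digits and special characters
# (the hyphen is escaped so it is matched literally instead of forming a range)
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[0-9@#$%^&*/+\-_=,.:;<>?\[\]\\{}]", "", x))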

# removing unnecessary characters
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"½m|½s|½t|½ï", "", x))
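
Inside a regular-expression character class, an unescaped hyphen between two characters defines a range, so the order of characters in patterns like the ones above matters. The sketch below uses a made-up sample string (an assumption, not taken from the dataset) to compare `[+-_]` with `[+\-_]`.

```python
import re

# Hypothetical sample string, not taken from the dataset.
sample = "abc123[x]_y+z-w"

# Unescaped hyphen: "+-_" is the range from "+" (0x2B) to "_" (0x5F),
# which also covers digits, uppercase letters, and square brackets.
as_range = re.sub(r"[+-_]", "", sample)

# Escaped hyphen: only the literal characters "+", "-", and "_" are removed.
as_literals = re.sub(r"[+\-_]", "", sample)

print(as_range)
print(as_literals)
```

The first call strips the digits and brackets as well, while the second leaves them in place.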

10. Remove stop words

In [44]:
# put your answer here
tagalog = [
    "ako", "sa", "akin", "ko", "aking", "sarili", "kami", "atin", "ang", "aming",
    "amin", "ating", "iyong", "iyo", "inyong", "siya", "kanya", "mismo", "ito",
    "nito", "kanyang", "sila", "nila", "kanila", "kanilang", "kung", "ano", "alin",
    "sino", "kanino", "na", "mga", "iyon", "am", "ay", "maging", "naging", "mayroon",
    "may", "nagkaroon", "pagkakaroon", "gumawa", "ginagawa", "ginawa", "paggawa",
    "ibig", "dapat", "maaari", "marapat", "kong", "ikaw", "tayo", "hindi", "namin",
    "gusto", "nais", "niyang", "nilang", "niya", "huwag", "ginawang", "gagawin",
    "maaaring", "sabihin", "narito", "kapag", "ni", "nasaan", "bakit", "paano",
    "kailangan", "walang", "katiyakan", "isang", "at", "pero", "o", "dahil", "bilang",
    "hanggang", "habang", "ng", "pamamagitan", "para", "tungkol", "laban", "pagitan",
    "panahon", "bago", "pagkatapos", "itaas", "ibaba", "mula", "pataas", "pababa",
    "palabas", "ibabaw", "ilalim", "muli", "pa", "minsan", "dito", "doon", "saan",
    "lahat", "anumang", "kapwa", "bawat", "ilan", "karamihan", "iba", "tulad",
    "lamang", "pareho", "kaya", "kaysa", "masyado", "napaka", "isa", "bababa",
    "kulang", "marami", "ngayon", "kailanman", "sabi", "nabanggit", "din", "kumuha",
    "pumunta", "pumupunta", "ilagay", "makita", "nakita", "katulad", "mahusay",
    "likod", "kahit", "paraan", "noon", "gayunman", "dalawa", "tatlo", "apat",
    "lima", "una", "pangalawa", "gawa", "tahimik", "ano", "para", "paraan" , "pareho",
    "pataas", "pero", "pumunta", "pumupunta", "sa", "saan", "sabi", "sila", "sino",
    "siya", "tatlo", "tayo", "tulad", "tungkol", "una", "mo"

]

# A set gives constant-time membership tests and collapses duplicate entries in the list
stop_words = set(tagalog + stopwords.words("english"))
df_train["text"] = df_train["text"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

11. Use Stemming or Lemmatization

In [45]:
# put your answer here
wnl = WordNetLemmatizer()
df_train["text"] = df_train["text"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))
df_train.head()


| | text | label |
|---|---|---|
| 0 | presidential candidate mar roxas imply govt li... | 1 |
| 1 | parang mali sumunod patalastas nescaf coffee b... | 1 |
| 2 | bet pula kulay posas | 1 |
| 3 | username kakampink | 0 |
| 4 | parang pink doc willie ong reaction paper | 1 |


## C. Training your model

12. Put all text training data in variable **X_train**

In [46]:
# Cleaned training text; it is converted to numerical form in the cells below
X_train = df_train["text"]

# TF-IDF vectorizer over unigrams and bigrams, used in the cells below
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

13. Put all training data labels in variable **y_train**

In [47]:
df_train['cleaned_text'] = df_train['text']
df_validation['cleaned_text'] = df_validation['text']
df_test['cleaned_text'] = df_test['text']


In [48]:
X_train = tfidf_vectorizer.fit_transform(df_train['cleaned_text'])
y_train = df_train['label']

X_validation = tfidf_vectorizer.transform(df_validation['cleaned_text'])
y_validation = df_validation['label']

X_test = tfidf_vectorizer.transform(df_test['cleaned_text'])
y_test = df_test['label']
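
The pre-processing in section B is applied to `df_train` only, while the validation and test text is passed to the vectorizer as loaded. One way to keep the splits consistent is to wrap the cleaning steps in a function and apply it to every split. The sketch below is an assumption about how that could look: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for the full list, special characters are replaced with a space rather than an empty string, and lemmatization is left out to keep the example self-contained.

```python
import re

# Tiny stand-in stop-word set (hypothetical); the notebook's full list would go here.
DEMO_STOP_WORDS = {"ang", "ng", "sa", "na", "the", "is"}

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip URLs, drop digits and special characters, remove stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)    # remove links
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)      # remove punctuation, digits, underscores
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("Ang BOTO ng 2022 is here! https://example.com"))
```

Under these assumptions, each split could be cleaned the same way, for example `df_validation["cleaned_text"] = df_validation["text"].apply(clean_text)`.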

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Put the converted data to **X_train_transformed** variable

In [49]:

# TF-IDF vectorizer over unigrams and bigrams, fit on the training text.
# The result is kept as a sparse matrix; calling .toarray() on it would
# allocate a dense array of shape (n_documents, n_features).
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_transformed = tfidf_vectorizer.fit_transform(df_train["text"])
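
To make the TF-IDF weighting concrete, the sketch below computes it by hand for a toy corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), uses unigrams only, and splits on whitespace instead of using the library's tokenizer. The corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["bet pula kulay", "pula kulay posas", "username kakampink"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document, "bet" occurs in only one document and therefore receives a higher weight than "pula", which occurs in two.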

15. Create an instance of `MultinomialNB()`

In [50]:
# Multinomial Naive Bayes Model Training
model = MultinomialNB()


16. Train the model using `.fit()`

In [51]:
# put your answer here
model.fit(X_train, y_train)
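
For intuition about what `fit` and `predict` do here, the following is a from-scratch sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing on raw word counts. It is not scikit-learn's implementation; the class name `TinyMultinomialNB` and the toy sentences and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        class_docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        # Log prior: fraction of training documents in each class.
        self.log_prior = {c: math.log(class_docs[c] / len(labels)) for c in self.classes}
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, word, c):
        num = self.word_counts[c][word] + self.alpha
        den = self.total[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, text):
        # Words never seen in training are ignored.
        words = [w for w in text.split() if w in self.vocab]
        scores = {
            c: self.log_prior[c] + sum(self._log_likelihood(w, c) for w in words)
            for c in self.classes
        }
        return max(scores, key=scores.get)

# Invented toy data: 1 = hate speech, 0 = non-hate speech.
texts = ["magnanakaw talunan", "magnanakaw pula posas", "matalinong botante", "botante kakampink"]
labels = [1, 1, 0, 0]
nb = TinyMultinomialNB().fit(texts, labels)
print(nb.predict("magnanakaw"), nb.predict("botante"))
```

The prediction is the class with the highest sum of log prior and per-word log likelihoods, which is why words that are frequent in one class pull the decision toward it.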

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Put the converted data to **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [52]:
# put your answer here
y_pred_val = model.predict(X_validation)

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [53]:
# put your answer here
print("Validation Accuracy: ", accuracy_score(y_validation, y_pred_val))

Validation Accuracy:  0.8382142857142857


19. Create a confusion matrix using the **validation dataset**

In [55]:
# put your answer here
print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_pred_val))

Confusion Matrix:
[[1097  288]
 [ 165 1250]]
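
Precision, recall, and F1 can be derived from the confusion matrix printed above. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below treats hate speech (label 1) as the positive class; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall, and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Validation confusion matrix copied from the output above.
acc, prec, rec, f1 = binary_metrics([[1097, 288], [165, 1250]])
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

The accuracy computed this way agrees with the validation accuracy of 0.8382 reported earlier, which is a quick check that the matrix was read in the right orientation.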


20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text test data in **X_test** variable

- Convert **X_test** to its numerical form.

- Put the converted data to **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [56]:
# put your answer here
y_pred_test = model.predict(X_test)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_test** variable



In [57]:
# put your answer here
print("Test Accuracy: ", accuracy_score(y_test, y_pred_test))

Test Accuracy:  0.8323843416370107


22. Create a confusion matrix using the **test dataset**

In [58]:
# put your answer here
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

Confusion Matrix:
[[1113  299]
 [ 172 1226]]
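
With hate speech (label 1) as the positive class and scikit-learn's layout of true labels on rows and predicted labels on columns, the matrix above gives TN = 1113, FP = 299, FN = 172, and TP = 1226. The test-set metrics then follow by arithmetic (values rounded to four decimals):

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{1226 + 1113}{2810} \approx 0.8324 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{1226}{1525} \approx 0.8039 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{1226}{1398} \approx 0.8770 \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{2452}{2923} \approx 0.8389
\end{aligned}
$$

The accuracy agrees with the test accuracy of 0.8324 printed earlier.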


## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [63]:
# Test the model on a new non-hate speech sentence
new_text = pd.Series("Matalinong botante")


# Transform the new text using the fitted TF-IDF vectorizer
new_text_transform = tfidf_vectorizer.transform(new_text)

# Make the prediction using the trained Naive Bayes model
prediction = model.predict(new_text_transform)
print(prediction)

# Interpret the prediction result (predict returns one label per input)
if prediction[0] == 1:
    print("The sentence is classified as hate speech.")
else:
    print("The sentence is classified as non-hate speech.")

[0]
The sentence is classified as non-hate speech.


24. Test the model by providing a hate speech input. The model should predict it as 1

In [64]:
# put your answer here
# Test the model on a new hate speech sentence
new_text = pd.Series("magnanakaw na politiko")


# Transform the new text using the fitted TF-IDF vectorizer
new_text_transform = tfidf_vectorizer.transform(new_text)

# Make the prediction using the trained Naive Bayes model
prediction = model.predict(new_text_transform)
print(prediction)

# Interpret the prediction result (predict returns one label per input)
if prediction[0] == 1:
    print("The sentence is classified as hate speech.")
else:
    print("The sentence is classified as non-hate speech.")

[1]
The sentence is classified as hate speech.
