
<center><em>Copyright by KGPTalkie</em></center>
<center><em>For free ML tutorials, visit us at <a href='http://www.kgptalkie.com'>www.kgptalkie.com</a> and <a href='http://www.youtube.com/kgptalkie'>www.youtube.com/kgptalkie</a> </em></center>

# Hate Speech Classification

Hate speech is defined by the Cambridge Dictionary as "public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation".

Steps to classify hate speech:
- Preprocess the text data
- Convert the text to numerical tokens
- Build and train the model
- Test the model
- Save the model and tokenizer for later use

Dataset: https://github.com/laxmimerit/hate_speech_dataset

In [None]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

In [None]:
# !pip install spacy

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D

In [None]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import preprocess_kgptalkie as ps

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/hate_speech_dataset/master/data.csv', index_col = 0)
df.head()

In [None]:
# Class labels: 0 - hate speech | 1 - offensive language | 2 - neither

In [None]:
df['class'].value_counts()

In [None]:
vc = df['class'].value_counts()
index = list(vc.index)
count = min(vc.values)

In [None]:
count, index

In [None]:
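# Undersample every class to the size of the smallest one.
# DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat.
df_bal = pd.concat(
    [df[df['class'] == i].sample(count) for i in index],
    ignore_index=True,
)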

In [None]:
df = df_bal.copy()

In [None]:
df['class'].value_counts()
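
The balancing step above undersamples every class down to the size of the rarest one. A minimal sketch of the same idea using `groupby(...).sample(...)`, assuming pandas 1.1 or newer. The toy frame and the `random_state` value are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the tweet dataset (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'class': [0, 1, 1, 1, 2, 2],
    'tweet': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# The size of the rarest class decides how many rows every class keeps
smallest = toy['class'].value_counts().min()

# Draw `smallest` rows from each class; random_state makes the draw repeatable
balanced = (
    toy.groupby('class')
       .sample(n=smallest, random_state=0)
       .reset_index(drop=True)
)

print(balanced['class'].value_counts().to_dict())
```

Fixing `random_state` keeps the balanced subset identical between runs, which makes later accuracy numbers easier to compare.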

### Preprocessing

In [None]:
def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_rt(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    # Collapse any character repeated three or more times in a row to a single copy
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

In [None]:
x = 'iiii llloooooovvvveeee_#@youuuuuuu'

In [None]:
get_clean(x)
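
The last step of `get_clean` relies on a backreference regex to shorten stretched-out words. A small standalone sketch of just that step, where `collapse_repeats` is a hypothetical helper name used only for illustration:

```python
import re

def collapse_repeats(text):
    """Replace any character repeated three or more times in a row with one copy."""
    # (.) captures one character; \1{2,} matches two or more further copies of it
    return re.sub(r'(.)\1{2,}', r'\1', text)

print(collapse_repeats('iiii llloooooovvvveeee'))  # i love
print(collapse_repeats('good'))                    # good (only two o's, left alone)
print(collapse_repeats('1000'))                    # 10
```

Note that the pattern does not distinguish letters from digits, so a number such as `1000` is shortened as well.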

In [None]:
df['tweet'] = df['tweet'].apply(get_clean)

In [None]:
df.head()

### Text Tokenization

In [None]:
text = df['tweet'].tolist()

In [None]:
text[:2]

In [None]:
token = Tokenizer()
token.fit_on_texts(text)

In [None]:
# help(token)

In [None]:
len(token.word_counts)

In [None]:
print(token.index_word)

In [None]:
x = ['i love you']

In [None]:
token.texts_to_sequences(x)

In [None]:
vocab_size = len(token.word_counts) + 1
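
To see why `vocab_size` is the word count plus one, here is a simplified pure-Python stand-in for what the tokenizer does: words are ranked by frequency starting at index 1, which leaves index 0 free for padding. The helper names are hypothetical, and the real Keras `Tokenizer` also filters punctuation, which this sketch skips:

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Map each text to a list of indices, skipping words never seen during fitting."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

corpus = ['i love you', 'i love cats']
word_index = build_word_index(corpus)
vocab_size = len(word_index) + 1  # +1 because index 0 is kept free for padding

print(word_index)
print(to_sequences(['i love dogs'], word_index))
print(vocab_size)
```

In this sketch the unseen word `dogs` is silently dropped, which mirrors how a tokenizer fitted without an out-of-vocabulary token treats unknown words.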

In [None]:
encoded_text = token.texts_to_sequences(text)

In [None]:
print(encoded_text)

In [None]:
max_length = 120
X = pad_sequences(encoded_text, maxlen=max_length, padding = 'post')

In [None]:
print(X)

In [None]:
X.shape
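
With `padding='post'`, shorter sequences get zeros appended at the end. Since `truncating` is left at its default of `'pre'`, any tweet longer than `max_length` loses tokens from the front. A minimal pure-Python sketch of that behaviour for a single sequence, where `pad_post` is a hypothetical helper and `maxlen` is assumed to be positive:

```python
def pad_post(seq, maxlen, value=0):
    """Pad with `value` at the end; keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]  # mimics truncating='pre': drop items from the front
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 3, 9], 5))           # [5, 3, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```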

### Model Building

In [None]:
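# keras.utils.np_utils is not available in recent Keras releases;
# to_categorical is imported from tensorflow.keras.utils instead.
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam

In [None]:
y = df['class']

In [None]:
y

In [None]:
# One-hot encode the three class ids for categorical cross-entropy
y = to_categorical(df['class'])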

In [None]:
print(y)
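
The one-hot encoding printed above can be reproduced with plain NumPy, which also shows how `argmax` recovers the class ids later during testing. This is a sketch with hypothetical label values, not the notebook's data:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # hypothetical class ids
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype='float32')[labels]

# argmax along the last axis turns one-hot rows back into class ids
recovered = one_hot.argmax(axis=-1)

print(one_hot)
print(recovered)
```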

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

In [None]:
X_train.shape, X_test.shape

In [None]:
vec_size = 300
model = Sequential()
model.add(Embedding(vocab_size, vec_size, input_length=max_length))

model.add(Conv1D(32, 2, activation='relu'))
model.add(MaxPooling1D(2))
model.add(Dropout(0.2))

model.add(Dense(32, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(3, activation='softmax'))
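
It can help to trace how the sequence length changes through the layers above. The arithmetic below assumes the Keras defaults that the notebook leaves untouched (`'valid'` padding, convolution stride 1, pooling stride equal to the pool size). The helper names are hypothetical:

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

max_length = 120
after_conv = conv1d_out_len(max_length, kernel_size=2)   # 119
after_pool = maxpool1d_out_len(after_conv, pool_size=2)  # 59

# Per-sample shapes, in layer order
print('Embedding          ->', (max_length, 300))
print('Conv1D(32, 2)      ->', (after_conv, 32))
print('MaxPooling1D(2)    ->', (after_pool, 32))
print('Dense(32)          ->', (after_pool, 32))  # applied to each time step
print('GlobalMaxPooling1D ->', (32,))
print('Dense(3)           ->', (3,))
```

Because `Dense(32)` sits before `GlobalMaxPooling1D`, it acts on every time step separately, and the global pooling then keeps the maximum of each of the 32 features across the 59 steps.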

In [None]:
model.compile(optimizer=Adam(learning_rate=0.001), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(X_train, y_train, epochs = 2, validation_data=(X_test, y_test), shuffle = True)

### Model Testing

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

In [None]:
y_pred = np.argmax(model.predict(X_test), axis=-1)

In [None]:
plot_confusion_matrix(confusion_matrix(np.argmax(y_test, axis=-1), y_pred))


In [None]:
print(classification_report(np.argmax(y_test, axis=-1), y_pred))

### Testing with Custom Data

In [None]:
x = 'hey bitch whass up'

In [None]:
def get_encoded(x):
  x = get_clean(x)
  x = token.texts_to_sequences([x])
  x = pad_sequences(x, maxlen=max_length, padding = 'post')
  return x

In [None]:
get_encoded(x)

In [None]:
np.argmax(model.predict(get_encoded(x)), axis=-1)
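
The prediction above is a bare class id. A small sketch for turning a softmax output into a readable label, using the class mapping from the label comment near the top of the notebook. `decode` is a hypothetical helper, and the probabilities are made up for illustration rather than taken from the trained model:

```python
# Class ids as listed in the notebook's label comment
CLASS_NAMES = {0: 'hate speech', 1: 'offensive language', 2: 'neither'}

def decode(probabilities):
    """Return the label name and score of the highest-probability class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_NAMES[best], probabilities[best]

# Hypothetical softmax output for one tweet, not a real model prediction
example_probs = [0.10, 0.75, 0.15]
print(decode(example_probs))  # ('offensive language', 0.75)
```

With the trained model, the input would be one row of `model.predict(get_encoded(x))`.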

In [None]:
model.save('model_hate_speech.h5')

In [None]:
import pickle

In [None]:
# The context manager makes sure the file is flushed and closed
with open('token.pkl', 'wb') as f:
    pickle.dump(token, f)
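
To use the saved artifacts later, both the model and the tokenizer need to be loaded back. A minimal sketch, assuming the two files written above sit in the working directory. The function names are hypothetical, and TensorFlow is imported lazily so the tokenizer helper can be used on its own:

```python
import pickle

def load_tokenizer(path='token.pkl'):
    """Read the pickled tokenizer back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

def load_classifier(path='model_hate_speech.h5'):
    """Load the saved Keras model from its HDF5 file."""
    # Imported here so that load_tokenizer works without TensorFlow installed
    from tensorflow.keras.models import load_model
    return load_model(path)
```

In a fresh session, `token = load_tokenizer()` and `model = load_classifier()` would restore the names that `get_encoded` and the prediction cell rely on. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.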

