# Inappropriate Language Classifier

This notebook runs a trained model to predict whether a text contains inappropriate language.

Follow the steps below to run the model.

### 0. Imports
Import the required modules and download the NLTK resources used for preprocessing.

In [None]:
import pickle
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

### 1. Load Model and Tokenizer
The tokenizer (a pickled `CountVectorizer`) converts a text into token counts over a known vocabulary, and the model is used later to predict the class of the text.

In [None]:
# Load Tokenizer
tokenizer_path = "./saved/countvectorizer.pickle"
with open(tokenizer_path, 'rb') as f:
    tokenizer : CountVectorizer = pickle.load(f)

# Load Model
model_path = "./saved/decisiontree/dt_cv.pickle"
with open(model_path, 'rb') as f:
    model = pickle.load(f)
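
To make the role of the vectorizer concrete, here is a minimal sketch on a toy corpus. The corpus and the names `toy_corpus` and `toy_vectorizer` are made up for illustration; the pickled vectorizer loaded above was fitted on different data, and its settings may differ from the defaults used here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the data the saved vectorizer was fitted on)
toy_corpus = ["good morning everyone", "good night", "morning coffee"]

# Fitting learns the vocabulary; each known word gets one column index
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_corpus)

# Columns follow the alphabetical order of the learned vocabulary
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)  # ['coffee', 'everyone', 'good', 'morning', 'night']

# Transforming a text counts how often each known word occurs in it
features = toy_vectorizer.transform(["good good morning"])
print(features.toarray())  # [[0 0 2 1 0]]
```

The result is a sparse matrix with one row per input text, which is the kind of input a scikit-learn classifier can consume.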

### 2. Create the function to preprocess the texts
The texts have to be cleaned and vectorized before being passed to the model.

In [None]:
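lemmatizer = WordNetLemmatizer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def preprocess_data(text: str):
    # Lowercase and strip one surrounding double quote on each side, if present
    text = text.lower().removeprefix("\"").removesuffix("\"")
    words = text.split()
    # Drop English stop words, then lemmatize the remaining words
    words = [word for word in words if word not in stop_words]
    words = [lemmatizer.lemmatize(word) for word in words]
    text = ' '.join(words)
    # Convert the cleaned text into the count features the model expects
    return tokenizer.transform([text])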
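
The cleaning steps above can be tried in isolation with the sketch below. It is an illustration only: `clean_text` is a hypothetical helper, and `toy_stop_words` and `toy_lemmas` are tiny stand-ins for the NLTK stop-word list and the WordNet lemmatizer, so the cell runs without any downloads. It requires Python 3.9 or later for `removeprefix` and `removesuffix`.

```python
from typing import Callable, Iterable

def clean_text(text: str,
               stop_words: Iterable[str],
               lemmatize: Callable[[str], str]) -> str:
    """Lowercase, strip surrounding double quotes, drop stop words, lemmatize."""
    stop_set = set(stop_words)
    text = text.lower().removeprefix('"').removesuffix('"')
    words = [lemmatize(word) for word in text.split() if word not in stop_set]
    return " ".join(words)

# Toy stand-ins (assumptions, much smaller than the real NLTK resources)
toy_stop_words = {"this", "is", "an", "the", "a"}
toy_lemmas = {"sentences": "sentence", "comments": "comment"}

cleaned = clean_text('"This is an appropriate sentence."',
                     toy_stop_words,
                     lambda word: toy_lemmas.get(word, word))
print(cleaned)  # appropriate sentence.
```

Because the text is split on whitespace only, punctuation stays attached to its word at this stage ("sentence." above). Whether it is removed later depends on how the saved vectorizer tokenizes its input.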

### 3. Predict your text
Use this cell to predict the class of your text. Be careful: our model was trained on a large database of social comments, so your text may contain vocabulary that the model has not seen before.
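
One way to see how much of a text is new to the vectorizer is to compare its tokens with the fitted vocabulary. The sketch below is an assumption-laden illustration: `unseen_words` is a hypothetical helper and `toy_vectorizer` is fitted on two made-up comments. With the real objects, the same function could be called with the loaded `tokenizer` and the cleaned text.

```python
from sklearn.feature_extraction.text import CountVectorizer

def unseen_words(vectorizer: CountVectorizer, text: str) -> list:
    """Return the tokens of `text` that are missing from the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()  # same tokenization as transform()
    return [token for token in analyzer(text) if token not in vectorizer.vocabulary_]

# Toy vectorizer (hypothetical data, default settings)
toy_vectorizer = CountVectorizer().fit(["nice comment", "rude comment"])

missing = unseen_words(toy_vectorizer, "a brand new comment")
print(missing)  # ['brand', 'new']

# Unseen words contribute nothing to the feature vector
only_unseen = toy_vectorizer.transform(["brand new"])
print(only_unseen.sum())  # 0
```

A text made up mostly of unseen words is reduced to a nearly empty feature vector, so its prediction should be read with caution.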

In [None]:
sentence = "This is an appropriate sentence."

sentence = preprocess_data(sentence)
prediction = model.predict(sentence)

print("appropriate" if str(prediction) == "[[1 0]]" else "inappropriate")
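
The string comparison above only matches if the prediction prints exactly as `[[1 0]]`; a float array, for example, prints differently. As a hedged alternative, the sketch below compares the values instead. `decode_prediction` is a hypothetical helper, and the mapping of `[1, 0]` to "appropriate" is taken from the cell above rather than verified against the training labels.

```python
import numpy as np

def decode_prediction(prediction) -> str:
    """Map a one-hot prediction row to a label by value, not by string form."""
    row = np.asarray(prediction).reshape(-1)
    # Assumption from the cell above: [1, 0] encodes the "appropriate" class
    return "appropriate" if np.array_equal(row, [1, 0]) else "inappropriate"

label_int = decode_prediction(np.array([[1, 0]]))
label_float = decode_prediction(np.array([[1.0, 0.0]]))
label_other = decode_prediction(np.array([[0, 1]]))

print(label_int)    # appropriate
print(label_float)  # appropriate
print(label_other)  # inappropriate
```

In the notebook this could be used as `print(decode_prediction(prediction))`.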