# Introduction and Imports

![Credit: Pexels](https://images.pexels.com/photos/3862130/pexels-photo-3862130.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=750&w=1260)

In this notebook, I will be using only classical **Machine Learning** methods to get decent prediction scores. There are better and more sophisticated approaches (RNNs, GRUs, fine-tuning BERT, etc.), but you have already seen them in a lot of notebooks.

The main aim of this notebook is to show how quickly and easily you can do text classification with basic machine learning methods, rather than waiting an hour for a model to train!

<p style="color:red">If you like this notebook, please make sure to give it an upvote. It helps a lot and motivates me to make more good-quality content.</p>
<p style="color:blue">If you don't like my work, please leave a comment on what I can do to make it better!</p>
<hr>
<h3 style="color:aqua">Edits:</h3>
<ul>
<li style="color:green">All Classifiers now classify for all 3 categories and not just 2. Good Validation Accuracy is maintained.</li>
</ul>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import random
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, ConfusionMatrixDisplay

warnings.simplefilter("ignore")

In [None]:
def plot_metric(clf, testX, testY, name):
    """
    Small function to plot the confusion matrix of a fitted classifier
    """
    styles = ['bmh', 'classic', 'fivethirtyeight', 'ggplot']

    plt.style.use(random.choice(styles))
    # plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
    ConfusionMatrixDisplay.from_estimator(clf, testX, testY)
    plt.title(f"Confusion Matrix [{name}]")
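
The helper above only draws the confusion matrix. If you also want a ROC-AUC number for a 3-class problem, here is a minimal sketch of one way to get it. It assumes the classifier exposes `predict_proba`, and it uses synthetic data from `make_classification` as a stand-in for the vectorized questions, so the value it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic 3-class data standing in for the vectorized questions (assumption)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_valid = X[:240], X[240:]
y_train, y_valid = y[:240], y[240:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one probability column per class
proba = clf.predict_proba(X_valid)

# multi_class="ovr" averages the one-vs-rest AUC of each class
auc = roc_auc_score(y_valid, proba, multi_class="ovr")
print(f"One-vs-rest ROC-AUC: {auc:.3f}")
```

The same two lines (`predict_proba` followed by `roc_auc_score`) should work with any of the classifiers used later, as long as they provide probability estimates.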

# Data Preprocessing and Some EDA

Read the data. All three classes are kept, including the low-quality edit questions.

In [None]:
data = pd.read_csv("../input/60k-stack-overflow-questions-with-quality-rate/data.csv")
data.head()

Each class is mapped to an integer label: closed low-quality questions (`LQ_CLOSE`) become 0, low-quality edit questions (`LQ_EDIT`) become 1, and the open `HQ` questions become 2.

In [None]:
data = data.drop(['Id', 'Tags', 'CreationDate'], axis=1)
data['Y'] = data['Y'].map({'LQ_CLOSE':0, 'LQ_EDIT': 1, 'HQ':2})
data.head()
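
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. A quick sanity check like the sketch below can catch that. The toy labels (including the misspelled `"hq"`) are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy labels; the lowercase "hq" is a deliberate mismatch for illustration
labels = pd.Series(["LQ_CLOSE", "LQ_EDIT", "HQ", "hq"])
mapped = labels.map({"LQ_CLOSE": 0, "LQ_EDIT": 1, "HQ": 2})

# Values missing from the mapping turn into NaN
n_unmapped = int(mapped.isna().sum())
print(f"Unmapped labels: {n_unmapped}")
```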

In [None]:
labels = ['Open Questions', 'Low Quality Question - Close', 'Low Quality Question - Edit']
values = [len(data[data['Y'] == 2]), len(data[data['Y'] == 0]), len(data[data['Y'] == 1])]
plt.style.use('classic')
plt.figure(figsize=(16, 9))
plt.pie(x=values, labels=labels, autopct="%1.1f%%")
plt.title("Target Value Distribution")
plt.show()

Let's join the title and the body of the text data so that we can use both of them in our classification

In [None]:
data['text'] = data['Title'] + ' ' + data['Body']
data = data.drop(['Title', 'Body'], axis=1)
data.head()

In [None]:
# Clean the data
def clean_text(text):
    text = text.lower()
    # Keep only letters and whitespace (inside [...] parentheses are literal, so they are dropped here)
    text = re.sub(r'[^a-z\s]', '', text)
    return text
data['text'] = data['text'].apply(clean_text)
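
The cleaning step above removes punctuation but keeps the letters inside any markup, so a `<p>` tag would leave a stray `p` behind. Assuming the `Body` column contains HTML tags, here is a rough sketch of a variant that strips them first. `clean_text_html` is a hypothetical name, not part of the notebook, and the regex is a simple heuristic rather than a real HTML parser.

```python
import re

def clean_text_html(text):
    # Hypothetical variant of clean_text: drop anything that looks like an HTML tag
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    # Keep only lowercase letters and whitespace
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>How do I sort a list in Python 3?</p>"
cleaned = clean_text_html(sample)
print(cleaned)
```

Note that this heuristic would also swallow code such as `a < b and c > d`, so treat it as a starting point only.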

## Splitting the Data
Let's now split the dataset into training and validation sets

In [None]:
# Fraction of the data to hold out for validation
split_pcent = 0.20
split = int(split_pcent * len(data))

# Shuffles dataframe
data = data.sample(frac=1).reset_index(drop=True)

# Training Sets
train = data[split:]
trainX = train['text']
trainY = train['Y'].values

# Validation Sets
valid = data[:split]
validX = valid['text']
validY = valid['Y'].values

assert trainX.shape == trainY.shape
assert validX.shape == validY.shape

print(f"Training Data Shape: {trainX.shape}\nValidation Data Shape: {validX.shape}")
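
The manual shuffle-and-slice above works fine. As an alternative, here is a small sketch using scikit-learn's `train_test_split`, which can also keep the class proportions equal in both sets via `stratify`. The toy DataFrame only mimics the notebook's column names, and `random_state=42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in frame with the notebook's column names (toy data, not the real dataset)
toy = pd.DataFrame({
    "text": [f"question {i}" for i in range(30)],
    "Y": [0, 1, 2] * 10,
})

# Hold out 20% for validation, keeping the class balance the same in both parts
train_df, valid_df = train_test_split(
    toy,
    test_size=0.20,
    stratify=toy["Y"],
    random_state=42,
)

print(train_df["Y"].value_counts().to_dict())
print(valid_df["Y"].value_counts().to_dict())
```

Fixing `random_state` also makes the split reproducible between runs, which the plain `data.sample(frac=1)` call does not.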

## Vectorizing the Data
Let's vectorize the text so that it is in a numerical format the classifiers can work with

In [None]:
# Load the vectorizer, fit on training set, transform on validation set
vectorizer = TfidfVectorizer()
trainX = vectorizer.fit_transform(trainX)
validX = vectorizer.transform(validX)
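
To see why the vectorizer is fitted on the training set only, here is a tiny sketch with made-up sentences. The vocabulary is learned from the training documents, so a word that appears only in the validation text (here `dictionary`) is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
train_docs = ["how to sort a list", "how to read a file"]
valid_docs = ["how to sort a dictionary"]

vec = TfidfVectorizer()
train_mat = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
valid_mat = vec.transform(valid_docs)      # reuses them, unseen words are dropped

# The default tokenizer skips one-letter tokens such as "a"
print(sorted(vec.vocabulary_))
print(train_mat.shape, valid_mat.shape)
```

Both matrices end up with the same number of columns, which is what lets a classifier trained on one be scored on the other.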

# Modelling
Let's start with different non-deep learning approaches for this task.

## 1. Logistic Regression
Let's start with good old Logistic Regression!

In [None]:
# Define and fit the classifier on the data
lr_classifier = LogisticRegression(C=1.)
lr_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of Logistic Regression Classifier is: {(lr_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(lr_classifier, validX, validY, "Logistic Regression")

## 2. Multinomial Naive Bayes
Let's now switch to Naive Bayes!

In [None]:
# Define and fit the classifier on the data
nb_classifier = MultinomialNB()
nb_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of Naive Bayes Classifier is: {(nb_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(nb_classifier, validX, validY, "Naive Bayes")

## 3. Random Forest Classifier
Let's now enter the forest with the Random Forest Classifier and see where it takes us!

In [None]:
# Define and fit the classifier on the data
rf_classifier = RandomForestClassifier()
rf_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of Random Forest Classifier is: {(rf_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(rf_classifier, validX, validY, "Random Forest")

## 4. Decision Tree Classifier
Let's now make some decisions using the Decision Tree Classifier

In [None]:
# Define and fit the classifier on the data
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of Decision Tree Clf. is: {(dt_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(dt_classifier, validX, validY, "Decision Tree Classifier")

## 5. KNN Classifier
We now are going to use KNN Classifier for this task.

In [None]:
# Define and fit the classifier on the data
kn_classifier = KNeighborsClassifier()
kn_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of KNN Clf. is: {(kn_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(kn_classifier, validX, validY, "KNN Classifier")

## 6. XGBoost
Finally, let's use the XGBoost Classifier and then we'll compare all the different classifiers so far

In [None]:
# Define and fit the classifier on the data
xg_classifier = XGBClassifier()
xg_classifier.fit(trainX, trainY)

In [None]:
# Print the accuracy score of the classifier
print(f"Validation Accuracy of XGBoost Clf. is: {(xg_classifier.score(validX, validY))*100:.2f}%")

In [None]:
# Also plot the metric
plot_metric(xg_classifier, validX, validY, "XGBoost Classifier")
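
For the comparison itself, one simple option is to collect the validation accuracies in a dictionary and sort them. The sketch below shows the pattern on synthetic data with three of the classifier types used above. The data comes from `make_classification`, so the scores it prints are not the scores of this notebook's models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for trainX/validX (assumption)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_valid = X[:240], X[240:]
y_train, y_valid = y[:240], y[240:]

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record its validation accuracy
scores = {
    name: clf.fit(X_train, y_train).score(X_valid, y_valid)
    for name, clf in classifiers.items()
}

# Sort from best to worst for an easy side-by-side view
comparison = pd.Series(scores, name="accuracy").sort_values(ascending=False)
print(comparison)
```

In the notebook you would swap the dictionary values for the already fitted `lr_classifier`, `nb_classifier`, and so on, and call `.score(validX, validY)` on each.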