# Model 1: Text Classification with `sklearn`

Our goal is to build an ML model that uses the feature to predict the label:
- **Feature:** A bag of words (questions from Quora users; `question_text` in the `train.csv` and `test.csv` datasets)
- **Label (Binary):** Insincere (TRUE or FALSE; `target` in the `train.csv` dataset)

## Step 1: Import libraries

Ensure that all the required libraries have been installed by running `pip install <LIBRARY>` in the terminal.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, f1_score

## Step 2: Load data

In [None]:
# loading train and test datasets
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

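# displaying the first 5 rows of the train dataset
# (display() is used because a notebook cell only renders its last expression)
display(train.head())

# displaying the first 5 rows of the test dataset
display(test.head())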

## Step 3: Perform Text Vectorization

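Text vectorization is the process of converting text into a numerical representation.

For our use case, `question_text` is converted into a matrix of TF-IDF features. TF-IDF weighs how often a word occurs in a question against how common that word is across the whole corpus, so words that appear in nearly every question receive lower weights. The vectorizer is fitted on the train dataset only and then reused to transform the test dataset.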

In [None]:
text_vectorizer = TfidfVectorizer()

train_vector = text_vectorizer.fit_transform(train["question_text"])
test_vector = text_vectorizer.transform(test["question_text"])
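
To make the TF-IDF weighting more concrete, here is a minimal sketch on a toy corpus. The three questions and all `toy_*` names are made up for illustration and are not part of the Quora data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption: not taken from train.csv)
toy_questions = [
    "how do I learn python",
    "how do I learn cooking",
    "why is the sky blue",
]

toy_vectorizer = TfidfVectorizer()
# Sparse matrix with one row per question and one column per vocabulary word
toy_matrix = toy_vectorizer.fit_transform(toy_questions)

# Map every vocabulary word to its inverse document frequency (IDF)
idf_by_word = {
    word: toy_vectorizer.idf_[index]
    for word, index in toy_vectorizer.vocabulary_.items()
}

print(toy_matrix.shape)
# "how" occurs in two questions and "python" in one, so "python" gets the larger IDF
print(idf_by_word["how"], idf_by_word["python"])
```

With the default settings, single-character tokens such as "I" are dropped, so they never enter the vocabulary.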

## Step 4: Split X and Y into training and validation sets

As Quora provided the train and test datasets separately, there is no need to split X and Y into training and testing sets. Instead, X and Y are split into training and validation sets.

The `stratify` argument makes the split preserve the class proportions of `target`, so the training and validation sets contain the same share of insincere questions as the original dataset.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    train_vector,
    train["target"],
    test_size=0.2,
    stratify=train["target"],
    random_state=42)
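
The effect of `stratify` can be checked on a small synthetic example. This is only a sketch: the 90/10 class balance and the `toy_*` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 90 + [1] * 10)

_, _, toy_y_train, toy_y_val = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    stratify=toy_y,   # keep the 10% positive rate in both splits
    random_state=42)

# Share of positives in the full data, the training split, and the validation split
print(toy_y.mean(), toy_y_train.mean(), toy_y_val.mean())
```

Without `stratify`, the share of positives in a small validation set can drift noticeably from the original share, which makes metrics such as the F1 score harder to compare.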

## Step 5: Choose classifiers

The three chosen classifiers are:

- Logistic Regression
- SVM with linear kernel
- Bernoulli Naive Bayes

In [None]:
# setting up the Logistic Regression model
log_model = LogisticRegression()

# setting up the SVM model with linear kernel
svm_model = SVC(kernel='linear')

# setting up the Bernoulli Naive Bayes model
bernoulli_model = BernoulliNB()

## Step 6: Train the chosen classifiers and calculate metrics

The metrics that are used to analyze the performance are:

- F1 Score
- ROC AUC Score

Along with these two scores, the ROC curve and the confusion matrix (displayed as a _heatmap_) are saved as images.

In [None]:
# scaling X
scaler_X = StandardScaler(with_mean=False)
X_train_scaled = scaler_X.fit_transform(X_train)
X_val_scaled = scaler_X.transform(X_val)

# directory where the plots are saved
image_dir = 'C:/Users/Soohyun/Desktop/UW MGTE \'23/YEAR 4/4B/MSCI 546/Term Project/quora-insincere-questions/1_baseline_classification/images'

# creating a function that trains the model, predicts on the validation set, and calculates metrics
def model_and_predict(model, X_train_scaled, y_train, X_val_scaled, y_val):
    # the class name keeps the printed output and the file names distinct for each model
    model_name = type(model).__name__

    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)

    # the ROC metrics need continuous scores rather than hard 0/1 predictions
    if hasattr(model, "decision_function"):
        y_score = model.decision_function(X_val_scaled)
    else:
        y_score = model.predict_proba(X_val_scaled)[:, 1]

    f1 = f1_score(y_val, y_pred)
    fpr, tpr, _ = roc_curve(y_val, y_score)
    auc = roc_auc_score(y_val, y_score)
    conf_matrix = confusion_matrix(y_val, y_pred)

    print("[Text Classification using %s]" % model_name)
    print("   1) F1 score: %.2f" % f1)
    print("   2) ROC AUC score: %.2f" % auc)
    print("   3) ROC Curve -> Saved as \"roc_curve_%s.png\"" % model_name)
    plt.plot(fpr, tpr, label='AUC Score = %.2f' % auc)
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend(loc=4)
    plt.savefig('%s/roc_curve_%s.png' % (image_dir, model_name))
    plt.close()
    print("   4) Confusion Matrix -> Saved as \"confusion_matrix_%s.png\"" % model_name)
    # labelling each cell of the heatmap with its name, count, and share of all samples
    group_names = ['True Negative', 'False Positive', 'False Negative', 'True Positive']
    group_counts = ["{0:0.0f}".format(value) for value in conf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in conf_matrix.flatten() / np.sum(conf_matrix)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(conf_matrix, annot=labels, fmt='')
    plt.title("Confusion Matrix")
    plt.savefig('%s/confusion_matrix_%s.png' % (image_dir, model_name))
    plt.close()

# calling method above for every model
baseline_models = [log_model, svm_model, bernoulli_model]
for model in baseline_models:
    model_and_predict(model, X_train_scaled, y_train, X_val_scaled, y_val)
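
As a rough illustration of how these metrics behave, the sketch below computes them on six hand-written validation labels. The labels, scores, and `toy_*` names are hypothetical. It also shows that ROC AUC generally differs depending on whether it is computed from continuous scores or from hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Hypothetical ground truth and model scores for six questions
toy_true = np.array([0, 0, 0, 0, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.9])

# Hard predictions obtained by thresholding the scores at 0.5
toy_pred = (toy_scores > 0.5).astype(int)

toy_f1 = f1_score(toy_true, toy_pred)

# ROC AUC from continuous scores uses the full ranking of the questions
toy_auc_from_scores = roc_auc_score(toy_true, toy_scores)
# ROC AUC from hard predictions only sees a single operating point
toy_auc_from_labels = roc_auc_score(toy_true, toy_pred)

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

print(toy_f1, toy_auc_from_scores, toy_auc_from_labels)
print(tn, fp, fn, tp)
```

The order TN, FP, FN, TP is the same order used for the heatmap labels in the function above.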

## Step 7: Predict test dataset using the "best" classifier

Now that the performance of each classifier has been observed, the classifier with the "best" metrics is used to predict the labels of the test dataset. The code below uses the Logistic Regression model (`log_model`).

In [None]:
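# the classifiers were trained on scaled features, so the test features are scaled the same way
test_scaled = scaler_X.transform(test_vector)

# thresholding the predicted probability of the insincere class for each question
result = (log_model.predict_proba(test_scaled)[:, 1] > 0.5).astype(int)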
result_df = pd.DataFrame({"qid": test["qid"], "prediction": result})
result_df.to_csv("result.csv", index=False)
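
For reference, here is a small sketch of how `predict_proba` output can be turned into one 0/1 prediction per row and placed in a submission-style table. The four data points, the `qid` values, and the `toy_*` names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset with two samples per class
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(toy_X, toy_y)

# predict_proba returns one row per sample and one column per class,
# with the columns ordered as in toy_model.classes_
toy_proba = toy_model.predict_proba(toy_X)

# Column 1 holds the probability of class 1; thresholding it gives one label per row
toy_labels = (toy_proba[:, 1] > 0.5).astype(int)

toy_submission = pd.DataFrame({"qid": ["a", "b", "c", "d"], "prediction": toy_labels})
print(toy_submission)
```

The threshold of 0.5 is a default choice. A different threshold could be selected on the validation set if it gives a better F1 score.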