# Feedback Classification with Traditional ML

This notebook trains classical machine-learning models to predict three labels from feedback comments:

- **Teacher vs. course**
- **Sentiment**
- **Aspect** (behaviour, teaching skills, relevancy, etc.)

To keep the repository free of binary artifacts, the dataset is versioned as `data/data_feedback.csv`. The first code cell exports it to `data/data_feedback.xlsx` so that the rest of the workflow can load the Excel file, as the assignment requests. Writing and reading `.xlsx` files with pandas requires an Excel engine such as `openpyxl`.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import shap

plt.style.use('seaborn-v0_8')
pd.set_option('display.max_colwidth', 200)

data_dir = Path('data')
csv_path = data_dir / 'data_feedback.csv'
excel_path = data_dir / 'data_feedback.xlsx'

# Export CSV to Excel so downstream cells can load the XLSX file without committing a binary to git
df_csv = pd.read_csv(csv_path)
df_csv.to_excel(excel_path, index=False)
print(f"Excel file created at: {excel_path.resolve()}")

# Load the Excel version requested by the assignment
df = pd.read_excel(excel_path)
df.head()
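
The later cells index the frame by a `comments` column and the three label columns. A minimal sketch of a check that fails early if the export dropped or renamed one of them; `check_columns`, `REQUIRED_COLUMNS`, and the toy frame (including its label values) are hypothetical and used only for illustration.

```python
import pandas as pd

# Columns the later cells index by name
REQUIRED_COLUMNS = ['comments', 'teacher/course', 'sentiment', 'aspect']


def check_columns(frame, required=REQUIRED_COLUMNS):
    """Raise a ValueError listing any required column missing from `frame`."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return frame


# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'comments': ['Explains clearly', 'Syllabus is outdated'],
    'teacher/course': ['teacher', 'course'],
    'sentiment': ['positive', 'negative'],
    'aspect': ['teaching skills', 'relevancy'],
})
check_columns(toy)
```

In the notebook, the same call could be applied to `df` right after `pd.read_excel`.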


## Label distributions

A quick overview of the class balance for each prediction task helps choose evaluation strategies for the small dataset.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, column in zip(axes, ['teacher/course', 'sentiment', 'aspect']):
    sns.countplot(data=df, x=column, ax=ax, palette='crest')
    ax.set_title(f"{column} distribution")
    ax.tick_params(axis='x', rotation=20)
plt.tight_layout()
plt.show()
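
The plots can be complemented with exact counts. This matters for the split used later: `train_test_split` with `stratify=...` raises a `ValueError` when a class has only one row. A minimal sketch, assuming a hypothetical helper `summarize_balance` and toy labels that only stand in for the real data:

```python
import pandas as pd


def summarize_balance(frame, label_cols):
    """Return one row per (label column, class) with its count and share."""
    rows = []
    for col in label_cols:
        counts = frame[col].value_counts()
        total = counts.sum()
        for cls, count in counts.items():
            rows.append({'label': col, 'class': cls,
                         'count': int(count), 'share': count / total})
    return pd.DataFrame(rows)


# Toy labels (illustrative only)
toy = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'positive', 'negative'],
    'aspect': ['behaviour', 'behaviour', 'relevancy', 'relevancy'],
})
summary = summarize_balance(toy, ['sentiment', 'aspect'])

# Classes with fewer than two rows cannot be stratified across train and test
too_rare = summary[summary['count'] < 2]
print(summary)
print(too_rare)
```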


## Utility: train and evaluate a TF-IDF + Logistic Regression model

We reuse the same helper to train a separate classifier for each label. It prints a classification report, plots a confusion matrix, and returns the misclassified test rows so that the errors can be inspected in the later sections.

In [None]:
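def train_and_report(df, text_col, target_col, test_size=0.25, random_state=42):
    """Fit TF-IDF + logistic regression on a stratified split; return the model and its test errors."""
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target_col], random_state=random_state
    )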

    model = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
        ('logreg', LogisticRegression(max_iter=1000, class_weight='balanced'))
    ])

    model.fit(train_df[text_col], train_df[target_col])
    preds = model.predict(test_df[text_col])

    print(f"=== {target_col} ===")
    print(classification_report(test_df[target_col], preds))

    cm = confusion_matrix(test_df[target_col], preds, labels=model.classes_)
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_, ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True')
    ax.set_title(f"Confusion Matrix: {target_col}")
    plt.show()

    # Keep the misclassified test rows; use text_col rather than a hard-coded column name
    errors = test_df.assign(prediction=preds)[lambda d: d[target_col] != d['prediction']]
    return model, errors[[text_col, target_col, 'prediction']]
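
On a small dataset, a single 75/25 split can give noisy scores. One option, not part of the original workflow, is to score the same pipeline with stratified k-fold cross-validation. The sketch below assumes a hypothetical helper `cross_validate_label` and uses toy comments in place of the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def cross_validate_label(frame, text_col, target_col, n_splits=3, random_state=42):
    """Return per-fold macro-F1 scores for a TF-IDF + logistic regression pipeline."""
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
        ('logreg', LogisticRegression(max_iter=1000, class_weight='balanced')),
    ])
    # Stratified folds keep the class proportions similar in every fold
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return cross_val_score(pipeline, frame[text_col], frame[target_col],
                           cv=cv, scoring='f1_macro')


# Toy data (illustrative only): six positive and six negative comments
toy = pd.DataFrame({
    'comments': [
        'great teacher, very clear', 'loved the course content',
        'helpful and patient teacher', 'really useful course material',
        'excellent explanations in class', 'the lectures were engaging',
        'boring lectures and unclear slides', 'the teacher was rude',
        'course content is outdated', 'poor explanations, hard to follow',
        'too much homework, not useful', 'the course was a waste of time',
    ],
    'sentiment': ['positive'] * 6 + ['negative'] * 6,
})
scores = cross_validate_label(toy, 'comments', 'sentiment')
print(f"macro-F1: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The number of folds cannot exceed the size of the smallest class, so `n_splits` may need to be lowered for rare aspect labels.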


## Train separate models

In [None]:
teacher_model, teacher_errors = train_and_report(df, 'comments', 'teacher/course')
sentiment_model, sentiment_errors = train_and_report(df, 'comments', 'sentiment')
aspect_model, aspect_errors = train_and_report(df, 'comments', 'aspect')

teacher_errors.head()


## Error analysis

Inspecting misclassifications highlights where the models struggle. The small dataset makes each error easy to review manually.

In [None]:
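def show_errors(errors, title, n=5):
    # Print the title so each table is identifiable in the cell output
    print(f"{title} ({len(errors)} misclassified)")
    display(errors.head(n))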

show_errors(teacher_errors, 'Teacher/Course errors')
show_errors(sentiment_errors, 'Sentiment errors')
show_errors(aspect_errors, 'Aspect errors')
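
When reviewing errors manually, it can help to look first at the cases where the model was most confident and still wrong. A minimal sketch, assuming a hypothetical helper `errors_by_confidence` and toy data; one evaluation row is mislabelled on purpose so that the sketch has something to rank.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def errors_by_confidence(model, frame, text_col, target_col):
    """Return misclassified rows, sorted by the probability of the wrong prediction."""
    proba = model.predict_proba(frame[text_col])
    out = frame[[text_col, target_col]].copy()
    out['prediction'] = model.classes_[proba.argmax(axis=1)]
    out['confidence'] = proba.max(axis=1)
    wrong = out[out[target_col] != out['prediction']]
    return wrong.sort_values('confidence', ascending=False)


# Toy training data (illustrative only)
train = pd.DataFrame({
    'comments': [
        'great teacher, very clear', 'helpful and patient teacher',
        'excellent explanations in class', 'the lectures were engaging',
        'boring lectures and unclear slides', 'the teacher was rude',
        'poor explanations, hard to follow', 'the course was a waste of time',
    ],
    'sentiment': ['positive'] * 4 + ['negative'] * 4,
})
# Toy evaluation data; the last label is deliberately wrong
evaluation = pd.DataFrame({
    'comments': ['very clear and helpful', 'rude and boring', 'great and engaging teacher'],
    'sentiment': ['positive', 'negative', 'negative'],
})

model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ('logreg', LogisticRegression(max_iter=1000, class_weight='balanced')),
]).fit(train['comments'], train['sentiment'])

ranked = errors_by_confidence(model, evaluation, 'comments', 'sentiment')
print(ranked)
```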


## Explainability with SHAP

We use SHAP values on the teacher/course classifier to understand which n-grams contribute to each prediction. The same approach can be applied to the sentiment or aspect models.

If SHAP is not installed, run `pip install shap` first. Note that `shap` is imported in the first code cell, so that cell also fails without it.


In [None]:
# Fit a SHAP explainer on the linear model
vectorizer = teacher_model.named_steps['tfidf']
classifier = teacher_model.named_steps['logreg']

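# Transform the full corpus for consistent feature mapping.
# The matrix is densified because the dataset is small; this avoids sparse-input handling in SHAP.
X_tfidf = vectorizer.transform(df['comments']).toarray()
feature_names = vectorizer.get_feature_names_out()
explainer = shap.LinearExplainer(classifier, X_tfidf, feature_perturbation='interventional')
shap_values = explainer(X_tfidf)

# Visualize feature impact across the corpus.
# `features` expects the feature matrix; the n-gram labels go in `feature_names`.
shap.summary_plot(shap_values, features=X_tfidf, feature_names=feature_names, show=False)
plt.title('SHAP summary for teacher/course classifier')
plt.show()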

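# Inspect a single comment with a text explainer.
# The pipeline's predict_proba takes raw strings; the Text masker (splitting on
# non-word characters) tells SHAP how to hide tokens.
shap.initjs()
masker = shap.maskers.Text(r"\W+")
shap_text_explainer = shap.Explainer(teacher_model.predict_proba, masker, output_names=list(teacher_model.classes_))
shap_values_text = shap_text_explainer(df['comments'].tolist()[:1])
shap.plots.text(shap_values_text[0])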
