Student Emails
====

Classify student emails into demand categories using bag-of-words features.

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate


from libEmails import read

First we'll try to classify the email subject lines into their demand categories.

The data needs some cleaning first. We set a minimum number of emails per demand type (rarer demand types are discarded) and choose how the training set is resampled:

In [2]:
sampling = "undersample"

In [3]:
min_num_labels = 75

(
    (X_train, categories_train),
    (X_test, categories_test),
    vectorizer,
) = read.read_email_subjects(
    min_num_labels, verbose=True, sampling=sampling, return_vectorizer=True
)

Dropping 66 NaN values of 7500
Following categories will be removed; fewer than the minimum (75) found:
{'Admissions': 30,
 'Arrival': 29,
 'Certificates and Transcripts': 52,
 'Course Information': 33,
 'Degree Classification': 9,
 'Document Request': 53,
 'Graduation': 7,
 'Placement': 14,
 'Plagiarism': 15,
 'Student Finance': 48,
 'Student Visa': 39,
 'Study Skills': 7,
 'Supplementary Year ': 66,
 'Tutorial': 19,
 'Unit Evaluation': 10,
 'Welcome Week': 15,
 'Withdrawal': 25,
 'Year Abroad': 45}
Label counts in training set after resampling (undersample):
{'Absence/Attendance': 66,
 'Appeal': 66,
 'Blackboard': 66,
 'Course Transfer': 66,
 'Exams and Assessments': 66,
 'Extension': 66,
 'Extenuating Circumstances': 66,
 'Miscellaneous': 66,
 'Online learning ': 66,
 'Personal Tutoring ': 66,
 'Registration': 66,
 'Student Support and Wellbeing': 66,
 'Suspension': 66,
 'Timetable': 66,
 'Unit change/choices': 66}
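
The log above reflects two cleaning steps: categories with fewer than `min_num_labels` emails are dropped, and the remaining classes are undersampled to a common size. The internals of `read.read_email_subjects` are not shown here, so the following is only a minimal sketch of the idea on a toy table. The function name `drop_rare_and_undersample`, the column names and the use of pandas are all assumptions made for illustration.

```python
import pandas as pd


def drop_rare_and_undersample(df, label_col, min_count, seed=0):
    """Hypothetical helper: drop rare labels, then undersample to the smallest class."""
    counts = df[label_col].value_counts()
    # Keep only labels that occur at least min_count times
    keep = counts[counts >= min_count].index
    kept = df[df[label_col].isin(keep)]
    # Size of the smallest surviving class
    n = int(kept[label_col].value_counts().min())
    # Draw n rows from every class, without replacement
    return kept.groupby(label_col).sample(n=n, random_state=seed)


# Toy data, not the real email dataset
toy = pd.DataFrame(
    {
        "subject": [f"email {i}" for i in range(10)],
        "label": ["Exam"] * 5 + ["Timetable"] * 3 + ["Visa"] * 2,
    }
)
balanced = drop_rare_and_undersample(toy, "label", min_count=3)
print(balanced["label"].value_counts().to_dict())
```

In the log, the per-class count (66) is below the minimum of 75, presumably because the resampling is applied to the training split only.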


Now we have a bag of words (`X_train`) and some labels (`categories_train`); we can use these to train a classifier:

In [4]:
def classifiers():
    return [
        MultinomialNB(),
        SGDClassifier(
            loss="modified_huber", penalty="l1", max_iter=100, alpha=0.0001
        ),
    ]


subject_clfs = classifiers()
for c in subject_clfs:
    c.fit(X_train, categories_train)
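
For readers unfamiliar with the bag-of-words representation, here is a toy sketch of the same fit-and-predict pattern. It assumes a `CountVectorizer` builds the word-count matrix; what `libEmails` actually uses is not shown in this notebook, and the subject lines and labels below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up subject lines and labels, for illustration only
toy_subjects = [
    "exam timetable query",
    "resit exam date",
    "blackboard login problem",
    "cannot access blackboard",
]
toy_labels = ["Exams", "Exams", "Blackboard", "Blackboard"]

toy_vectorizer = CountVectorizer()
# Sparse matrix: one row per subject line, one column per distinct word
toy_X = toy_vectorizer.fit_transform(toy_subjects)
toy_clf = MultinomialNB().fit(toy_X, toy_labels)

# New text must go through the same fitted vectorizer; unseen words are ignored
toy_pred = toy_clf.predict(toy_vectorizer.transform(["blackboard password"]))
print(toy_X.shape, toy_pred[0])
```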

In [5]:
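def print_info(clf, values, labels):
    """Print a classification report for a fitted classifier, then a cross-validated score."""
    print(f"{type(clf).__name__}:")
    print(metrics.classification_report(labels, clf.predict(values)))
    # Note: cross_validate clones clf and refits the clones on folds of the
    # data passed in, so this score does not come from the model fitted above
    cv_test = cross_validate(
        clf,
        values,
        labels,
        scoring="balanced_accuracy",
    )
    mean, std = cv_test["test_score"].mean(), cv_test["test_score"].std()
    print(f"Balanced accuracy {mean:.4f}+-{std:.4f}\n{'-' * 79}")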


# Print a classification report for the test data
for c in subject_clfs:
    print_info(c, X_test, categories_test)

MultinomialNB:
                               precision    recall  f1-score   support

           Absence/Attendance       0.53      0.85      0.65        55
                       Appeal       0.40      1.00      0.57        29
                   Blackboard       0.23      1.00      0.37        19
              Course Transfer       0.82      0.94      0.88       154
        Exams and Assessments       0.92      0.62      0.74       463
                    Extension       0.73      0.96      0.83        85
    Extenuating Circumstances       0.32      1.00      0.48        23
                Miscellaneous       0.80      0.35      0.49       337
             Online learning        0.64      0.97      0.77        64
           Personal Tutoring        0.79      1.00      0.88        30
                 Registration       0.73      0.89      0.81        65
Student Support and Wellbeing       0.71      0.89      0.79        45
                   Suspension       0.84      0.96      0.89 
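
The test set is heavily imbalanced (compare the `support` column), which is why `print_info` scores with balanced accuracy rather than plain accuracy. Balanced accuracy is the unweighted mean of the per-class recalls. The toy labels below are invented purely to check that identity with scikit-learn.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Invented labels: class "a" is twice as common as class "b"
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]

# Recall per class: a is 3/4, b is 1/2
per_class_recall = recall_score(y_true, y_pred, average=None)
manual = per_class_recall.mean()

# The hand-computed mean agrees with scikit-learn's balanced accuracy
assert np.isclose(manual, balanced_accuracy_score(y_true, y_pred))
print(manual)
```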

We can also attempt to classify the email bodies similarly; we will put the email body and subject line into a single bag of words.

This dataset contains only 200 emails, so we need a smaller minimum count per category.
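
Putting the subject and body into a single bag of words can be sketched as concatenating the two strings before vectorizing. This is an assumption about what `read.read_email_body` does, not a copy of it. The field names `Subject` and `EmailBody` are taken from the log output, and the emails are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up emails, for illustration only
toy_emails = [
    {"Subject": "Resit date", "EmailBody": "When is the resit exam?"},
    {"Subject": "Unit choice", "EmailBody": "Can I change my unit?"},
]

# Join subject and body so that both contribute to the same word counts
combined = [f"{e['Subject']} {e['EmailBody']}" for e in toy_emails]

combined_vectorizer = CountVectorizer()
combined_X = combined_vectorizer.fit_transform(combined)

# "resit" appears once in the subject and once in the body of the first email
resit_column = combined_vectorizer.vocabulary_["resit"]
resit_count = combined_X[0, resit_column]
print(resit_count)
```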

In [6]:
min_num_labels = 13
(X_train, categories_train), (X_test, categories_test) = read.read_email_body(
    min_num_labels, verbose=True, sampling=sampling
)

EmailBody:	Dropping 1 NaN values of 200
Subject:	Dropping 0 NaN values of 200
Following categories will be removed; fewer than the minimum (13) found:
{'Academic': 10,
 'Attendance': 11,
 'Awards/Graduation Certificate': 8,
 'Change of Circumstances': 5,
 'Course Transfer': 8,
 'Extenuating Circumstances': 11,
 'Funding': 1,
 'General': 4,
 'Late Coursework/Penalties': 3,
 'Marks & Feedback': 9,
 'Misc Enquiry': 7,
 'Misc Student Data/Process Errors': 3,
 'Online Learning': 12,
 'Plagiarism & Collusion': 2,
 'Progression': 2,
 'Student Status Letter': 2,
 'Timetables': 5,
 'Transcripts': 3,
 'Tuition / Accommodation Fees': 2,
 'University Email Account': 1,
 'Visa': 2}
Label counts in training set after resampling (undersample):
{'Dissertations/Coursework': 12, 'Exams/Resits': 12, 'Unit Choices': 12}


In [7]:
body_clfs = [MultinomialNB(), SGDClassifier()]
for c in body_clfs:
    c.fit(X_train, categories_train)

In [8]:
for c in body_clfs:
    print_info(c, X_test, categories_test)

MultinomialNB:
                          precision    recall  f1-score   support

Dissertations/Coursework       1.00      1.00      1.00        10
            Exams/Resits       1.00      1.00      1.00         7
            Unit Choices       1.00      1.00      1.00         6

                accuracy                           1.00        23
               macro avg       1.00      1.00      1.00        23
            weighted avg       1.00      1.00      1.00        23

Balanced accuracy 0.8667+-0.1247
-------------------------------------------------------------------------------
SGDClassifier:
                          precision    recall  f1-score   support

Dissertations/Coursework       0.91      1.00      0.95        10
            Exams/Resits       1.00      0.86      0.92         7
            Unit Choices       1.00      1.00      1.00         6

                accuracy                           0.96        23
               macro avg       0.97      0.95      0.96     

We can also run the classifier on arbitrary strings:

In [None]:
s = input()
while s:
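    # Reuse the vectorizer returned alongside the subject-line data so that
    # the feature columns match those the classifiers were trained on
    X: csr_matrix = vectorizer.transform([s])
    for c in subject_clfs:
        (predicted_class,) = c.predict(X)
        # One probability per class, in the order given by c.classes_
        probs = c.predict_proba(X)
        formatted_probs = " ".join(f"{p:.4f}" for p in probs[0])
        print(f"{type(c).__name__}:\n\t{predicted_class}\n\t{formatted_probs}")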
        print("-" * 79)
    s = input()

what time is it
MultinomialNB:
	Exams and Assessments
	0.0616 0.0606 0.0881 0.0799 0.0984 0.0609 0.0608 0.0616 0.0612 0.0616 0.0609 0.0612 0.0610 0.0613 0.0609
-------------------------------------------------------------------------------
SGDClassifier:
	Exams and Assessments
	0.0000 0.0000 0.0000 0.2367 0.5338 0.0000 0.0000 0.1792 0.0000 0.0000 0.0000 0.0502 0.0000 0.0000 0.0000
-------------------------------------------------------------------------------
dropping out of my studies
MultinomialNB:
	Suspension
	0.0540 0.0528 0.0525 0.0665 0.0529 0.0532 0.0530 0.0540 0.0890 0.0539 0.0656 0.0535 0.1924 0.0536 0.0531
-------------------------------------------------------------------------------
SGDClassifier:
	Suspension
	0.0000 0.0000 0.0000 0.4981 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.5019 0.0000 0.0000
-------------------------------------------------------------------------------
blackboard login problem
MultinomialNB:
	Blackboard
	0.0423 0.0414 0.3299 0.0418 0.
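
The rows of probabilities above are hard to read without their labels. scikit-learn classifiers return `predict_proba` columns in the order of their `classes_` attribute, so the two can be zipped together. The sketch below uses a tiny made-up model rather than `subject_clfs`; the names `demo_vec` and `demo_clf` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, for illustration only
demo_vec = CountVectorizer()
demo_X = demo_vec.fit_transform(
    ["exam resit", "exam date", "blackboard login", "blackboard access"]
)
demo_clf = MultinomialNB().fit(
    demo_X, ["Exams", "Exams", "Blackboard", "Blackboard"]
)

probs = demo_clf.predict_proba(demo_vec.transform(["blackboard login problem"]))[0]

# Pair each probability with its class label, most likely first
ranked = sorted(zip(demo_clf.classes_, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label:<12}{p:.4f}")
```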