<a href="https://colab.research.google.com/github/simon-clematide/colab-notebooks-for-teaching/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Text Classification Pipeline with sklearn
(adapted from the scikit-learn example at https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_semi_supervised_newsgroups.html)

In [12]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

np.set_printoptions(precision=3)

# Load the training subset, restricted to the first five 20 Newsgroups categories
categories = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
]

# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html
data = fetch_20newsgroups(
    subset="train",
    categories=categories,
)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names), data.target_names)

# labels are already encoded numerically
print("Example:")
print(data.data[0], data.target[0])

# Hyperparameters
# loss="log_loss" trains a logistic regression model with SGD
sgd_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")
# unigrams and bigrams that occur in at least 5 documents and in at most 80% of them
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

# Supervised Pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(**sgd_params)),
    ]
)


2823 documents
5 categories ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']
Example:
From: hades@coos.dartmouth.edu (Brian V. Hughes)
Subject: Re: New Apple Ergo-Mouse
Reply-To: hades@Dartmouth.Edu
Organization: Dartmouth College, Hanover, NH
Disclaimer: Personally, I really don't care who you think I speak for.
Moderator: Rec.Arts.Comics.Info
Lines: 19

nwcs@utkvx.utk.edu (Schizophrenia means never being alone) writes:

>Does anyone know how to open up the Apple Ergo-Mouse (ADB Mouse II)?
>Mine lives near a cat (true, really...) and picks up her fur.  From what
>I can tell, it looks like Apple welded it shut.

    You must not have tried very hard. I just opend mine in about 2
seconds. Take a look on the bottom, it has a dial that turns to open
much like the older ADB mouses used to have. It's a bit harder to turn
at first but it is quite simple to open.

>Also, does anyone know about installing FPUs in a Mac LC III?  I'

In [13]:
# inspect pipeline
pipeline

In [14]:
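def train_predict_eval(clf, X_train, y_train, X_test, y_test, categories):
    """Fit clf on the training split, print test metrics, and return the predictions."""
    print("Number of training samples:", len(X_train))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Number of test samples:", len(y_pred))
    print(
        "Micro-averaged F1 score on test set: %0.3f"
        % f1_score(y_test, y_pred, average="micro")
    )
    # Per-class precision, recall, and F1
    print(classification_report(y_test, y_pred, target_names=categories))
    # normalize="all": each cell is a fraction of all test samples
    # (rows: true class, columns: predicted class)
    print(confusion_matrix(y_test, y_pred, normalize="all"))
    return y_pred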
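
A note on the metric: when every sample has exactly one label, the micro-averaged F1 score is the same number as accuracy, because each misclassified sample counts once as a false positive and once as a false negative. The sketch below checks this on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not taken from the dataset) and also computes the macro average, which weights all classes equally.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for four classes, made up for illustration.
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 0, 3])

micro = f1_score(y_true_toy, y_pred_toy, average="micro")
macro = f1_score(y_true_toy, y_pred_toy, average="macro")
acc = accuracy_score(y_true_toy, y_pred_toy)

print("micro F1:", micro)
print("accuracy:", acc)   # identical to micro F1 for single-label problems
print("macro F1:", macro) # unweighted mean of the per-class F1 scores
```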

In [15]:
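X, y = data.data, data.target
# By default, train_test_split holds out 25% of the samples for testing.
# No random_state is set, so the split and the scores vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# The label is kept from the scikit-learn example, where "100%" means that
# all labeled training samples are used.
print("Supervised SGDClassifier on 100% of the data:")
y_pred = train_predict_eval(pipeline, X_train, y_train, X_test, y_test, data.target_names)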


Supervised SGDClassifier on 100% of the data:
Number of training samples: 2117
Number of test samples: 706
Micro-averaged F1 score on test set: 0.888
                          precision    recall  f1-score   support

             alt.atheism       0.99      0.99      0.99       116
           comp.graphics       0.89      0.86      0.88       146
 comp.os.ms-windows.misc       0.81      0.92      0.86       136
comp.sys.ibm.pc.hardware       0.88      0.81      0.84       161
   comp.sys.mac.hardware       0.90      0.89      0.89       147

                accuracy                           0.89       706
               macro avg       0.89      0.89      0.89       706
            weighted avg       0.89      0.89      0.89       706

[[0.163 0.001 0.    0.    0.   ]
 [0.    0.178 0.016 0.01  0.003]
 [0.    0.004 0.177 0.008 0.003]
 [0.    0.008 0.02  0.184 0.016]
 [0.001 0.008 0.006 0.007 0.186]]


Further ideas to implement:
 - Plot the confusion matrix as a heatmap: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
 - Add a hyperparameter search