## Exploring text classification models

*Trying out different text classification algorithms on one dataset. Along the way, I use `scikit-learn` naive Bayes, logistic regression and a multilayer perceptron with different kinds of text representation, and finish with a `spacy` model.*

Importing the dataset:

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers'))

newsgroups_dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Converting into pandas dataframe for analysis:

In [2]:
import pandas as pd

df = pd.DataFrame(newsgroups_dataset.data, columns=('text',))

df['y'] = newsgroups_dataset.target
df['y_text'] = [newsgroups_dataset.target_names[y] for y in df['y']]

df.head()

Unnamed: 0,text,y,y_text
0,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey
1,My brother is in the market for a high-perform...,3,comp.sys.ibm.pc.hardware
2,"|>The student of ""regional killings"" alias Dav...",17,talk.politics.mideast
3,In article <1993Apr19.034517.12820@julian.uwo....,3,comp.sys.ibm.pc.hardware
4,1) I have an old Jasmine drive which I cann...,4,comp.sys.mac.hardware


Categories distribution:

In [62]:
df.value_counts(['y_text'])

y_text                  
rec.sport.hockey            999
soc.religion.christian      997
rec.motorcycles             996
rec.sport.baseball          994
sci.crypt                   991
rec.autos                   990
sci.med                     990
comp.windows.x              988
sci.space                   987
comp.os.ms-windows.misc     985
sci.electronics             984
comp.sys.ibm.pc.hardware    982
misc.forsale                975
comp.graphics               973
comp.sys.mac.hardware       963
talk.politics.mideast       940
talk.politics.guns          910
alt.atheism                 799
talk.politics.misc          775
talk.religion.misc          628
dtype: int64

Looks pretty balanced (the largest class is only about 1.6 times the size of the smallest), good. Let's preprocess the texts.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

def lemmatize(text):
  lemmatizer = WordNetLemmatizer()
  return ' '.join([lemmatizer.lemmatize(w) for w in word_tokenize(text)])

def stem(text):
  stemmer = PorterStemmer()
  return ' '.join([stemmer.stem(w) for w in word_tokenize(text)])

df['lemmatized'] = df['text'].apply(lemmatize)
df['stemmed'] = df['text'].apply(stem)

df.head()

Unnamed: 0,text,y,y_text,lemmatized,stemmed
0,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey,I am sure some bashers of Pens fan are pretty ...,i am sure some basher of pen fan are pretti co...
1,My brother is in the market for a high-perform...,3,comp.sys.ibm.pc.hardware,My brother is in the market for a high-perform...,my brother is in the market for a high-perform...
2,"|>The student of ""regional killings"" alias Dav...",17,talk.politics.mideast,| > The student of `` regional killing '' alia...,| > the student of `` region kill '' alia davi...
3,In article <1993Apr19.034517.12820@julian.uwo....,3,comp.sys.ibm.pc.hardware,In article < 1993Apr19.034517.12820 @ julian.u...,in articl < 1993apr19.034517.12820 @ julian.uw...
4,1) I have an old Jasmine drive which I cann...,4,comp.sys.mac.hardware,1 ) I have an old Jasmine drive which I can no...,1 ) i have an old jasmin drive which i can not...


Now let's create some standard classifiers and see how they perform. I'm going to use naive Bayes, logistic regression and a feedforward neural network as the models, and simple count BoW, TF-IDF, averaged word2vec and doc2vec as input representations. As training a neural net takes much more time and involves more hyperparameters, let's omit it for now (and forget about dense vectors for a while).

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
import re
import functools

get_class_str = lambda class_: re.search(r"'(.*)'", str(class_)).groups()[0].split('.')[-1]

def try_models(
    df,
    input_columns,
    vectorizers,
    models,
    model_args = {},
    select_features_fit=lambda X, y: (X, []),
    k_folds = None
  ):
  y = df['y']
  results = []

  for input_column in input_columns:
    X = df[input_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    for Vectorizer in vectorizers:
      vectorizer = Vectorizer(stop_words='english', max_features=100_000)
      X_train_vectors = vectorizer.fit_transform(X_train)
      X_test_vectors = vectorizer.transform(X_test)
      X_train_vectors, f_transformers = select_features_fit(X_train_vectors, y_train)
      X_test_vectors = (
        functools.reduce(lambda vectors, f_transformer: f_transformer.transform(vectors), f_transformers, X_test_vectors)
          if len(f_transformers)
          else X_test_vectors
      )

      for Model in models:
        model = Model(**model_args.get(Model, dict()))

        if not k_folds:
          model.fit(X_train_vectors, y_train)
          y_pred = model.predict(X_test_vectors)
          scores = [
            accuracy_score(y_test, y_pred),
            f1_score(y_test, y_pred, average='macro'),
            f1_score(y_test, y_pred, average='micro'),
            roc_auc_score(y_test, model.predict_proba(X_test_vectors), multi_class="ovr")
          ]
        else:
          cv_scores = cross_validate(
            model, X_train_vectors, y_train, cv=k_folds, scoring=('accuracy', 'f1_macro', 'f1_micro', 'roc_auc_ovr')
          )
          # mean±std over the folds, rounded to 3 decimals
          scores = [
            f"{round(cv_scores[key].mean(), 3)}±{round(cv_scores[key].std(), 3)}"
            for key in ('test_accuracy', 'test_f1_macro', 'test_f1_micro', 'test_roc_auc_ovr')
          ]

        results.append([input_column, get_class_str(Vectorizer), get_class_str(Model)] + scores)

  return pd.DataFrame(
    results,
    columns=['input', 'vectorizer', 'model', 'accuracy', 'f1_macro', 'f1_micro', 'auc ovr']
  ).sort_values(by=['accuracy'], ascending=False)

df_half = df.sample(frac=0.5)

try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[CountVectorizer, TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  }
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
3,text,TfidfVectorizer,LogisticRegression,0.839788,0.831752,0.839788,0.984645
7,lemmatized,TfidfVectorizer,LogisticRegression,0.837666,0.82838,0.837666,0.984089
11,stemmed,TfidfVectorizer,LogisticRegression,0.833952,0.82557,0.833952,0.984987
2,text,TfidfVectorizer,MultinomialNB,0.820159,0.798335,0.820159,0.984802
6,lemmatized,TfidfVectorizer,MultinomialNB,0.819098,0.793229,0.819098,0.984583
1,text,CountVectorizer,LogisticRegression,0.809549,0.807633,0.809549,0.975277
5,lemmatized,CountVectorizer,LogisticRegression,0.802653,0.797108,0.802653,0.973681
10,stemmed,TfidfVectorizer,MultinomialNB,0.797347,0.768995,0.797347,0.985709
0,text,CountVectorizer,MultinomialNB,0.793103,0.779431,0.793103,0.968297
9,stemmed,CountVectorizer,LogisticRegression,0.793103,0.787804,0.793103,0.974815
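
The `f1_micro` column equals `accuracy` in every row of the table. That is expected rather than a bug: when each document has exactly one label, every misclassification counts as one false positive and one false negative, so micro-averaged precision, recall and F1 all collapse to accuracy. A minimal sketch on made-up labels (the `y_true_toy`/`y_pred_toy` values are illustrative, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, f1_score

# toy single-label, 3-class problem with an unbalanced class distribution
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 2, 2]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
# micro: pool TP/FP/FN over all classes before computing F1
toy_f1_micro = f1_score(y_true_toy, y_pred_toy, average='micro')
# macro: compute F1 per class, then take the unweighted mean
toy_f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro')

print(toy_accuracy, toy_f1_micro, toy_f1_macro)
```

So of the two F1 columns only `f1_macro` adds information here: it weights all classes equally, which is why it drops below accuracy when the smaller classes are classified worse.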


As expected, TF-IDF outperforms CountVectorizer, but what if we first select some more useful features? As far as I understand, the IDF part of TF-IDF serves a purpose that's pretty similar to what the chi-squared test does (reducing the value of words which don't matter much).

In [6]:
from sklearn.feature_selection import SelectPercentile, chi2

try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[CountVectorizer, TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  },
  select_features_fit=lambda X, y: (p := SelectPercentile(chi2, percentile=50)) and (p.fit_transform(X, y), [p])
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
7,lemmatized,TfidfVectorizer,LogisticRegression,0.844032,0.834445,0.844032,0.985484
3,text,TfidfVectorizer,LogisticRegression,0.84191,0.835853,0.84191,0.987168
11,stemmed,TfidfVectorizer,LogisticRegression,0.823873,0.810696,0.823873,0.982055
6,lemmatized,TfidfVectorizer,MultinomialNB,0.822281,0.799417,0.822281,0.98604
2,text,TfidfVectorizer,MultinomialNB,0.82069,0.800688,0.82069,0.987224
4,lemmatized,CountVectorizer,MultinomialNB,0.806897,0.786673,0.806897,0.972428
5,lemmatized,CountVectorizer,LogisticRegression,0.806897,0.799401,0.806897,0.973705
1,text,CountVectorizer,LogisticRegression,0.803183,0.799707,0.803183,0.974059
0,text,CountVectorizer,MultinomialNB,0.797347,0.781862,0.797347,0.970854
10,stemmed,TfidfVectorizer,MultinomialNB,0.795756,0.769527,0.795756,0.983209


Ok, looks like it doesn't help at all. Before getting rid of count vectorizers completely, let's first try to scale the counts.

In [7]:
from sklearn.preprocessing import MaxAbsScaler

try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[CountVectorizer, TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s]))
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
11,stemmed,TfidfVectorizer,LogisticRegression,0.867905,0.861966,0.867905,0.989479
3,text,TfidfVectorizer,LogisticRegression,0.853581,0.848973,0.853581,0.988855
7,lemmatized,TfidfVectorizer,LogisticRegression,0.851989,0.847269,0.851989,0.98805
10,stemmed,TfidfVectorizer,MultinomialNB,0.846154,0.836445,0.846154,0.98685
2,text,TfidfVectorizer,MultinomialNB,0.838196,0.828103,0.838196,0.987023
6,lemmatized,TfidfVectorizer,MultinomialNB,0.8313,0.821446,0.8313,0.98678
9,stemmed,CountVectorizer,LogisticRegression,0.785146,0.78738,0.785146,0.978739
0,text,CountVectorizer,MultinomialNB,0.777719,0.769298,0.777719,0.968724
4,lemmatized,CountVectorizer,MultinomialNB,0.774536,0.764712,0.774536,0.968712
1,text,CountVectorizer,LogisticRegression,0.774005,0.778656,0.774005,0.976832


Nope, no point in using them. But scaling seems to help, especially for logistic regression accuracy and learning time:

In [8]:
try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  },
  select_features_fit=lambda X, y: (
    (s := MaxAbsScaler()) and
    (s.fit_transform(X), [s]))
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
3,lemmatized,TfidfVectorizer,LogisticRegression,0.861008,0.858198,0.861008,0.987958
1,text,TfidfVectorizer,LogisticRegression,0.85305,0.846447,0.85305,0.987075
5,stemmed,TfidfVectorizer,LogisticRegression,0.848276,0.84561,0.848276,0.985988
0,text,TfidfVectorizer,MultinomialNB,0.788329,0.773714,0.788329,0.966094
2,lemmatized,TfidfVectorizer,MultinomialNB,0.782493,0.771113,0.782493,0.963819
4,stemmed,TfidfVectorizer,MultinomialNB,0.762865,0.754086,0.762865,0.95925


Notice how NB got much worse without feature selection. LR also took a hit. Let's try a lower smoothing parameter for naive Bayes:

In [9]:
try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[TfidfVectorizer],
  models=[MultinomialNB],
  model_args={
    MultinomialNB: dict(alpha=0.5)
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s]))
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MultinomialNB,0.851989,0.840003,0.851989,0.987428
1,lemmatized,TfidfVectorizer,MultinomialNB,0.846154,0.840538,0.846154,0.986581
2,stemmed,TfidfVectorizer,MultinomialNB,0.834483,0.826015,0.834483,0.987506
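
For reference, `alpha` in `MultinomialNB` is additive (Laplace/Lidstone) smoothing: the per-class word probability is estimated as `(count + alpha) / (total + alpha * n_features)`. A small sketch on made-up counts, checking that formula against what scikit-learn stores in `feature_log_prob_` (the arrays and helper names are illustrative; the same formula applies to fractional TF-IDF values):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# 3 documents x 3 terms; the first two documents belong to class 0
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1]])
y_counts = np.array([0, 0, 1])

def class0_word_probs(alpha):
  # P(term | class 0) as estimated by scikit-learn
  nb = MultinomialNB(alpha=alpha).fit(X_counts, y_counts)
  return np.exp(nb.feature_log_prob_[0])

def manual_class0_word_probs(alpha):
  # the same estimate written out by hand
  counts = X_counts[y_counts == 0].sum(axis=0)  # [5, 0, 1]
  return (counts + alpha) / (counts.sum() + alpha * X_counts.shape[1])

probs_default = class0_word_probs(1.0)
probs_lower = class0_word_probs(0.5)

print(probs_default, probs_lower)
```

Lowering `alpha` moves less probability mass to terms a class has never produced (the middle term here), so the model trusts the observed counts more.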


With smart feature selection and lower smoothing, NB is almost on par with LR. Pretty good performance, considering how much faster and simpler NB is than LR. We have to keep in mind that default LR here also uses L2 regularization. Let's take a look at LR with different kinds of regularization:

In [10]:
class LRNone(LogisticRegression):
  def __init__(self, **kwargs):
    super().__init__(penalty=None, solver='saga', **kwargs)

class LRL1(LogisticRegression):
  def __init__(self, **kwargs):
    super().__init__(penalty='l1', solver='saga', **kwargs)

class LRL2(LogisticRegression):
  def __init__(self, **kwargs):
    super().__init__(penalty='l2', solver='saga', **kwargs)

try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[TfidfVectorizer],
  models=[LRNone, LRL1, LRL2],
  model_args={
    LRNone: dict(max_iter=500),
    LRL1: dict(max_iter=500),
    LRL2: dict(max_iter=500),
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s]))
)



Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
5,lemmatized,TfidfVectorizer,LRL2,0.857294,0.853415,0.857294,0.989124
3,lemmatized,TfidfVectorizer,LRNone,0.856233,0.853239,0.856233,0.989118
8,stemmed,TfidfVectorizer,LRL2,0.855172,0.849353,0.855172,0.989918
2,text,TfidfVectorizer,LRL2,0.854642,0.849647,0.854642,0.987401
6,stemmed,TfidfVectorizer,LRNone,0.85305,0.847506,0.85305,0.989878
0,text,TfidfVectorizer,LRNone,0.848276,0.843097,0.848276,0.986751
7,stemmed,TfidfVectorizer,LRL1,0.779841,0.770466,0.779841,0.978777
4,lemmatized,TfidfVectorizer,LRL1,0.773475,0.766414,0.773475,0.974276
1,text,TfidfVectorizer,LRL1,0.771883,0.766111,0.771883,0.970404


After 6 mins and 500 iterations, 6 out of 9 models still hadn't converged. Even so, we can see that L2 regularization comes out slightly ahead of no regularization, while L1 is clearly worse. Let's perform 5-fold cross-validation to see if lemmatization and stemming have any noticeable effect:

In [12]:
try_models(
  df=df_half,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
  k_folds=5
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
3,lemmatized,TfidfVectorizer,LogisticRegression,0.835±0.011,0.829±0.012,0.835±0.011,0.986±0.002
0,text,TfidfVectorizer,MultinomialNB,0.835±0.001,0.824±0.002,0.835±0.001,0.985±0.002
2,lemmatized,TfidfVectorizer,MultinomialNB,0.833±0.011,0.822±0.013,0.833±0.011,0.985±0.002
5,stemmed,TfidfVectorizer,LogisticRegression,0.832±0.01,0.827±0.01,0.832±0.01,0.986±0.002
1,text,TfidfVectorizer,LogisticRegression,0.832±0.008,0.827±0.009,0.832±0.008,0.985±0.001
4,stemmed,TfidfVectorizer,MultinomialNB,0.825±0.007,0.814±0.008,0.825±0.007,0.985±0.001
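
One caveat about these cross-validated numbers: in `try_models` the vectorizer and the chi-squared selector are fitted on the whole training split before `cross_validate` is called, so every validation fold has already influenced the vocabulary, the IDF weights and (through its labels) the selected features. A hedged sketch of the alternative is to wrap the steps in a pipeline so they are refitted inside each fold. `cv_without_leakage` and the toy corpus are illustrative assumptions, not the code that produced the table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

def cv_without_leakage(texts, labels, k_folds=5, percentile=50):
  # every step is refitted on the training part of each fold only
  pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    SelectPercentile(chi2, percentile=percentile),
    MaxAbsScaler(),
    MultinomialNB(alpha=0.5),
  )
  return cross_validate(pipeline, texts, labels, cv=k_folds, scoring=('accuracy', 'f1_macro'))

# tiny stand-in corpus so the sketch runs without downloading anything
toy_texts = [
  "the goalie stopped the puck in overtime",
  "hockey fans cheered when the team scored",
  "the puck hit the post and the goalie smiled",
  "our hockey team lost the playoff game",
  "the referee called a penalty on the hockey player",
  "the team traded its best goalie before the game",
  "the rocket reached orbit after launch",
  "nasa delayed the shuttle launch again",
  "the satellite entered a stable orbit",
  "the shuttle crew repaired the satellite",
  "a new rocket engine was tested by nasa",
  "the probe left orbit and headed to mars",
]
toy_targets = [0] * 6 + [1] * 6

toy_cv = cv_without_leakage(toy_texts, toy_targets, k_folds=3)
print(toy_cv['test_accuracy'])
```

On the real data this would be called with `df['text']` and `df['y']`. It is slower, since the vectorizer is refitted for every fold, and I would expect the scores to be slightly lower than the ones above.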


Looks like there's practically no difference. Finally, let's take a look at the performance on the whole dataset:

In [13]:
try_models(
  df=df,
  input_columns=['text', 'lemmatized', 'stemmed'],
  vectorizers=[TfidfVectorizer],
  models=[MultinomialNB, LogisticRegression],
  model_args={
    LogisticRegression: dict(max_iter=500)
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
  k_folds=5
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
5,stemmed,TfidfVectorizer,LogisticRegression,0.87±0.001,0.868±0.001,0.87±0.001,0.99±0.0
3,lemmatized,TfidfVectorizer,LogisticRegression,0.877±0.004,0.875±0.004,0.877±0.004,0.991±0.0
1,text,TfidfVectorizer,LogisticRegression,0.873±0.004,0.871±0.004,0.873±0.004,0.991±0.001
2,lemmatized,TfidfVectorizer,MultinomialNB,0.871±0.007,0.863±0.006,0.871±0.007,0.991±0.001
0,text,TfidfVectorizer,MultinomialNB,0.866±0.003,0.858±0.004,0.866±0.003,0.991±0.001
4,stemmed,TfidfVectorizer,MultinomialNB,0.859±0.004,0.85±0.003,0.859±0.004,0.99±0.001


Now let's try using a simple neural net from scikit-learn:

In [14]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(10,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=50)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s]))
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.81008,0.802017,0.81008,0.978159


Yeah, that's not good. Let's try a few more units, and also keep only the top 25% of features instead of 50%:

In [15]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(50,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=25)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s]))
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.850928,0.847085,0.850928,0.988836


Ok, that's almost the same as NB. Let's continue increasing neurons:

In [19]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(100,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=25)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
  k_folds=5
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.867±0.011,0.864±0.01,0.867±0.011,0.99±0.002


Still a bit worse than NB. I'm not sure whether feeding huge sparse vectors to the NN is a good idea, so a bit later I'll use word2vec to make dense vectors instead. For now, let's just try to increase the accuracy enough to beat NB and LR without using too many neurons:

In [17]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(250,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=25)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.861538,0.856277,0.861538,0.988086


Again, same as NB. Another problem with comparing these results is that I should use k-fold cross-validation, but a single fit already takes more than 5 mins, so...

In [18]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(250, 125))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=25)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.859947,0.856978,0.859947,0.989669


With 1 more layer the results seem to be worse.

In [20]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(500,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=25)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.851989,0.850234,0.851989,0.987666


15 mins to get 0.85 accuracy...

In [21]:
from sklearn.neural_network import MLPClassifier

try_models(
  df=df_half,
  input_columns=['text'],
  vectorizers=[TfidfVectorizer],
  models=[MLPClassifier],
  model_args={
    MLPClassifier: dict(hidden_layer_sizes=(500,))
  },
  select_features_fit=lambda X, y: (
    (p := SelectPercentile(chi2, percentile=10)) and
    (s := MaxAbsScaler()) and
    (s.fit_transform(p.fit_transform(X, y)), [p, s])),
)

Unnamed: 0,input,vectorizer,model,accuracy,f1_macro,f1_micro,auc ovr
0,text,TfidfVectorizer,MLPClassifier,0.846154,0.837919,0.846154,0.986064


So yeah, it's much harder to train NNs than NB or LR. Finally, before moving on to using word2vec, let's make a truly correct comparison using grid search and cross validation and leave this for a day to calculate:

In [137]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
  TfidfVectorizer(stop_words='english', max_features=100_000),
  SelectPercentile(chi2, percentile=10),
  MaxAbsScaler(),
  MLPClassifier()
)

param_grid = {
  'selectpercentile__percentile': [5, 10, 20, 40],
  'mlpclassifier__hidden_layer_sizes': [(50,), (250,), (500,), (1000,), (50, 25), (250, 125), (500, 250)]
}

mlp_grid_search = GridSearchCV(pipeline, param_grid, scoring=('accuracy', 'roc_auc_ovr'), refit='accuracy', cv=5, verbose=3)

mlp_grid_search.fit(df['text'], df['y'])

columns_rename = {
  'param_mlpclassifier__hidden_layer_sizes': 'hidden_layer_sizes',
  'param_selectpercentile__percentile': 'percentile'
}
columns_to_select = [
  'hidden_layer_sizes',
  'percentile',
  'mean_test_accuracy',
  'std_test_accuracy',
  'mean_test_roc_auc_ovr',
  'std_test_roc_auc_ovr',
  'mean_fit_time',
  'std_fit_time'
]
scores_df = pd.DataFrame(mlp_grid_search.cv_results_).rename(columns=columns_rename)
scores_df[columns_to_select].sort_values(by=['mean_test_accuracy'], ascending=False)


Fitting 5 folds for each of 28 candidates, totalling 140 fits
[CV 1/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=5; accuracy: (test=0.821) roc_auc_ovr: (test=0.982) total time= 1.7min
[CV 2/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=5; accuracy: (test=0.823) roc_auc_ovr: (test=0.984) total time= 1.9min
[CV 3/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=5; accuracy: (test=0.820) roc_auc_ovr: (test=0.983) total time= 1.9min
[CV 4/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=5; accuracy: (test=0.820) roc_auc_ovr: (test=0.983) total time= 1.6min
[CV 5/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=5; accuracy: (test=0.812) roc_auc_ovr: (test=0.982) total time= 1.8min
[CV 1/5] END mlpclassifier__hidden_layer_sizes=(50,), selectpercentile__percentile=10; accuracy: (test=0.855) roc_auc_ovr: (test=0.987) total time= 2.1min
[CV 2/5] END 

Unnamed: 0,hidden_layer_sizes,percentile,mean_test_accuracy,std_test_accuracy,mean_test_roc_auc_ovr,std_test_roc_auc_ovr,mean_fit_time,std_fit_time
7,"(250,)",40,0.886766,0.003797,0.992452,0.000662,1193.496633,207.691884
3,"(50,)",40,0.884485,0.002196,0.992263,0.000761,288.075506,18.48939
27,"(500, 250)",40,0.882468,0.003801,0.991855,0.000865,1969.442595,343.112819
11,"(500,)",40,0.881831,0.007362,0.991377,0.001358,2734.831152,520.489281
15,"(1000,)",40,0.877322,0.005955,0.990539,0.001262,4993.085214,236.257278
6,"(250,)",20,0.877003,0.004383,0.991127,0.000996,705.027613,76.707757
2,"(50,)",20,0.875995,0.00243,0.990932,0.00083,188.560122,25.644004
23,"(250, 125)",40,0.874721,0.004865,0.990987,0.001578,1361.096519,206.8475
10,"(500,)",20,0.874721,0.003721,0.990609,0.001289,1410.458531,270.969656
14,"(1000,)",20,0.871272,0.003765,0.989806,0.001314,2519.259795,237.499484


Ok, that took longer than expected. In the end, we get a 0.01-0.02 accuracy increase over NB and LR. It's funny how 1000 neurons didn't give the best results: I assumed it would (and that it would overfit), but in reality it just didn't deliver. We can also see that the number of input features directly affects the accuracy of the network: the five best configurations all keep 40% of the features. That being said, I'm not sure anyone actually uses sparse vectors for NN models, so now is the time for dense vectors. Let's start with a pretrained word2vec. I will use a 50-unit layer, as it seems to provide good results while still being relatively quick to train:

In [4]:
import gensim.downloader

word2vec =  gensim.downloader.load('word2vec-google-news-300')

Loaded word2vec in a separate cell so that it stays in RAM (otherwise it takes a damn 40 secs to load every time). Now let's prepare the vectors:

In [5]:
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')

def text2word2vec(text):
  tokenized_text =  [w for w in word_tokenize(text) if w not in eng_stopwords]
  return word2vec.get_mean_vector(tokenized_text) if len(tokenized_text) else None

df['mean_word2vec'] = df['text'].apply(text2word2vec)

df['mean_word2vec'].head()

0    [0.018354833, 0.01911231, -0.003590454, 0.0425...
1    [-0.01658663, -0.0056632212, 0.0072250823, 0.0...
2    [-0.00084048323, 0.004721938, 0.018587181, 0.0...
3    [-0.031490926, -0.006962086, 0.0083416365, 0.0...
4    [0.02332852, -0.0047870562, 0.022667598, 0.035...
Name: mean_word2vec, dtype: object
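
To spell out what this representation keeps and what it throws away, here is a numpy-only sketch of averaging word vectors. It assumes, as I believe gensim's `get_mean_vector` does by default, that each vector is normalized to unit length first and that unknown tokens are skipped. The three-dimensional `toy_vectors` and the `mean_vector` helper are made up for illustration:

```python
import numpy as np

# made-up 3-d "embeddings"; real word2vec vectors have 300 dimensions
toy_vectors = {
  'hockey': np.array([1.0, 0.0, 0.0]),
  'puck': np.array([0.8, 0.6, 0.0]),
  'orbit': np.array([0.0, 0.0, 2.0]),
}

def mean_vector(tokens, vectors):
  # unit-normalize every known token vector, skip out-of-vocabulary tokens
  known = [vectors[t] / np.linalg.norm(vectors[t]) for t in tokens if t in vectors]
  # mirror text2word2vec: no usable tokens means no vector
  return np.mean(known, axis=0) if known else None

doc_a = mean_vector(['hockey', 'puck'], toy_vectors)
doc_b = mean_vector(['puck', 'hockey'], toy_vectors)
doc_c = mean_vector(['orbit', 'zamboni'], toy_vectors)
doc_d = mean_vector(['zamboni'], toy_vectors)

print(doc_a, doc_b, doc_c, doc_d)
```

The mean ignores word order (`doc_a` equals `doc_b`) and silently drops anything outside the vocabulary, so a long post and a one-liner both end up as a single point in the same 300-dimensional space.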

In [6]:
df['mean_word2vec'].isna().sum()

34

In [7]:
df = df.dropna()
df['mean_word2vec'].isna().sum()

0

Got rid of some None vectors.

In [8]:
import numpy as np

# every cell of df['mean_word2vec'] holds its own 300-d array, so the column is a 1-D
# object array of shape (18812,) that sklearn refuses to take;
# np.stack joins the per-row arrays into one 2-D float matrix
mean_word2vec_matrix = np.stack(df['mean_word2vec'].to_numpy())
mean_word2vec_matrix.shape

(18812, 300)

In [10]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)

cv_scores = cross_validate(mlp, mean_word2vec_matrix, df['y'], cv=5, scoring=('accuracy', 'roc_auc_ovr'), verbose=3)

print(f"{cv_scores['test_accuracy'].mean()}±{cv_scores['test_accuracy'].std()}")
print(f"{cv_scores['test_roc_auc_ovr'].mean()}±{cv_scores['test_roc_auc_ovr'].std()}")

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ... accuracy: (test=0.716) roc_auc_ovr: (test=0.963) total time=  29.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   29.2s remaining:    0.0s


[CV] END ... accuracy: (test=0.722) roc_auc_ovr: (test=0.962) total time=  28.0s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   57.2s remaining:    0.0s


[CV] END ... accuracy: (test=0.720) roc_auc_ovr: (test=0.963) total time=  31.1s
[CV] END ... accuracy: (test=0.715) roc_auc_ovr: (test=0.963) total time=  29.2s
[CV] END ... accuracy: (test=0.713) roc_auc_ovr: (test=0.963) total time=  54.6s
0.7172547749760779±0.0030662066612993206
0.9628886928505217±0.0003607229761157865


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.9min finished


Yeah...

In [11]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()

mean_word2vec_matrix_normalized = standardScaler.fit_transform(mean_word2vec_matrix)

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)

cv_scores = cross_validate(mlp, mean_word2vec_matrix_normalized, df['y'], cv=5, scoring=('accuracy', 'roc_auc_ovr'), verbose=3)

print(f"{cv_scores['test_accuracy'].mean()}±{cv_scores['test_accuracy'].std()}")
print(f"{cv_scores['test_roc_auc_ovr'].mean()}±{cv_scores['test_roc_auc_ovr'].std()}")

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ... accuracy: (test=0.720) roc_auc_ovr: (test=0.965) total time=  32.3s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   32.3s remaining:    0.0s


[CV] END ... accuracy: (test=0.724) roc_auc_ovr: (test=0.964) total time=  46.5s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV] END ... accuracy: (test=0.731) roc_auc_ovr: (test=0.964) total time=  45.8s
[CV] END ... accuracy: (test=0.717) roc_auc_ovr: (test=0.963) total time=  41.6s
[CV] END ... accuracy: (test=0.719) roc_auc_ovr: (test=0.963) total time=  39.8s
0.7220923022411196±0.005135455919028041
0.9637498152252227±0.0007827808913493769


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.4min finished


That's pretty bad. I feel like there's some serious mistake in my reasoning or code as the score seems to be way too low. Maybe I should train word2vec myself? AFAIK averaged word2vec vector is pretty similar to doc2vec in terms of performance. Let's quickly try doc2vec to compare:

In [22]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

documents = [TaggedDocument(word_tokenize(doc), [i]) for i, doc in enumerate(df['text'])]

doc2vec = Doc2Vec(documents, min_count=10, workers=8)

df['doc2vec'] = df['text'].apply(lambda doc: doc2vec.infer_vector(word_tokenize(doc)))

df['doc2vec'].head()

0    [-0.35365555, 0.17339046, 0.18393995, -0.58620...
1    [-0.2553672, 0.12558939, 0.15981369, -0.221346...
2    [0.29246834, -0.16016297, -0.13159648, 1.15566...
3    [-0.29304528, -0.7467441, 0.30352396, -0.75571...
4    [-0.21903262, 0.11049359, 0.004415522, -0.1185...
Name: doc2vec, dtype: object

In [25]:
import numpy as np

# stack the per-row vectors into one 2-D matrix
doc2vec_matrix = np.stack(df['doc2vec'].to_numpy())
doc2vec_matrix.shape

(18846, 100)

In [28]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)

cv_scores = cross_validate(mlp, doc2vec_matrix, df['y'], cv=5, scoring=('accuracy', 'roc_auc_ovr'))

print(f"{cv_scores['test_accuracy'].mean()}±{cv_scores['test_accuracy'].std()}")
print(f"{cv_scores['test_roc_auc_ovr'].mean()}±{cv_scores['test_roc_auc_ovr'].std()}")



0.6019313216220838±0.005598206367794166
0.9473297495737375±0.001923384963970542




In [29]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()

doc2vec_matrix_normalized = standardScaler.fit_transform(doc2vec_matrix)

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)

cv_scores = cross_validate(mlp, doc2vec_matrix_normalized, df['y'], cv=5, scoring=('accuracy', 'roc_auc_ovr'))

print(f"{cv_scores['test_accuracy'].mean()}±{cv_scores['test_accuracy'].std()}")
print(f"{cv_scores['test_roc_auc_ovr'].mean()}±{cv_scores['test_roc_auc_ovr'].std()}")



0.5846860011837459±0.008096889133146996
0.941561599511474±0.0013361609949245645




Ok, fuck it, let's just forget everything now and take spacy with the default text classification config and see the results:

In [3]:
import spacy
from spacy.tokens import DocBin

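nlp = spacy.blank('en')

categories = df['y_text'].unique()

def prepare_data(X, y, filename):
  # a fresh DocBin per file, so the files don't accumulate each other's docs
  db = DocBin()

  for text, category in zip(X, y):
    doc = nlp.make_doc(text)
    # one-hot labels: 1 for the true category, 0 for all the others
    doc.cats = {c: 0 for c in categories}
    doc.cats[category] = 1
    db.add(doc)

  db.to_disk(f'{filename}.spacy')

# 70% train; the remaining 30% is split in half into dev and test
df_train = df.sample(frac=0.7)
df_dev_test = df.drop(df_train.index)
df_dev = df_dev_test.sample(frac=0.5)
df_test = df_dev_test.drop(df_dev.index)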

prepare_data(df_train['text'], df_train['y_text'], 'train')
prepare_data(df_dev['text'], df_dev['y_text'], 'dev')
prepare_data(df_test['text'], df_test['y_text'], 'test')
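
Since the reported score is only meaningful if the dev and test files share no documents with the training file, a three-way split like this is worth checking with a few assertions. A sketch on a toy frame, where `three_way_split` is a hypothetical helper and the fixed `seed` is my addition for reproducibility:

```python
import pandas as pd

def three_way_split(frame, train_frac=0.7, seed=0):
  # train is sampled from the full frame
  train = frame.sample(frac=train_frac, random_state=seed)
  # dev and test are both carved out of what is left after removing train
  rest = frame.drop(train.index)
  dev = rest.sample(frac=0.5, random_state=seed)
  test = rest.drop(dev.index)
  return train, dev, test

toy_frame = pd.DataFrame({'text': [f'doc {i}' for i in range(100)]})
toy_train, toy_dev, toy_test = three_way_split(toy_frame)

# the three parts must be pairwise disjoint and together cover the frame
assert toy_train.index.intersection(toy_dev.index).empty
assert toy_train.index.intersection(toy_test.index).empty
assert toy_dev.index.intersection(toy_test.index).empty
assert len(toy_train) + len(toy_dev) + len(toy_test) == len(toy_frame)
```

The same three index checks can be run on `df_train`, `df_dev` and `df_test` before writing the `.spacy` files.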

In [None]:
!python -m spacy init config --pipeline textcat -G -F config.cfg
!python -m spacy train config.cfg --paths.train ./train.spacy  --paths.dev ./dev.spacy --output spacy_model
!python -m spacy evaluate spacy_model/model-best/ --output metrics.json ./test.spacy

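And after a few hours we get a macro f-score of 0.93... 💀 Keep in mind that this number only means something if the dev and test files share no documents with the training file.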

What's funny is that the default config doesn't seem to use anything complex. The CPU "accuracy" config uses an ensemble of a linear BoW model and a CNN model, and the GPU "accuracy" one is the same but also incorporates transformers, while the one I used looks almost like an LR with a simple counting BoW (the docs suck btw, had to look at the source). But then again, why does it take so long to train such a simple model? Which extra hyperparameters do they update?

Need to dig deeper into spacy now.