# Traditional Machine Learning Methods

Several traditional methods perform well at classifying tokenized text.
We will use:
1. Probabilistic classifiers such as Multinomial Naive Bayes and Complement Naive Bayes
2. Linear SVMs
3. Decision-tree ensembles such as the Random Forest classifier
4. Other classical algorithms such as logistic regression, nearest centroid, and gradient boosting classifiers

In [0]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We first prepare the dataset and extract the required `flair` and `full_text` columns.

In [0]:
df = pd.read_csv('drive/My Drive/dataset_final.csv', engine='python')

In [6]:
df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | title | flair | score | num_comments | author | created_utc | self_post | over_18 | full_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Even the poorest are supporting Modi in this. | Politics | 64.0 | 78.0 | hungarywolf | 1586106000.0 | False | False | Even the poorest are supporting Modi in this. |
| 1 | 1 | Someone tried to sell Statue of Unity on Olx. ... | Non-Political | 50.0 | 143.0 | Athar147 | 1586106000.0 | False | False | Someone tried to sell Statue of Unity on Olx. ... |
| 2 | 2 | Captured India with my phone. | Photography | 49.0 | 202.0 | random_saiyajin | 1586102000.0 | False | False | Captured India with my phone. |
| 3 | 3 | You guys are too impure to understand Modi Ji'... | Coronavirus | 39.0 | 81.0 | AdmiralSP | 1586101000.0 | True | False | You guys are too impure to understand Modi Ji'... |
| 4 | 4 | Posting again because stupidity was on show to... | Coronavirus | 34.0 | 21.0 | msbuttergourd | 1586108000.0 | False | False | Posting again because stupidity was on show to... |


In [0]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106037 entries, 0 to 106036
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   title         106036 non-null  object 
 1   flair         106036 non-null  object 
 2   score         106036 non-null  float64
 3   num_comments  106036 non-null  float64
 4   author        106036 non-null  object 
 5   created_utc   106036 non-null  float64
 6   self_post     106036 non-null  object 
 7   over_18       106036 non-null  object 
 8   full_text     106022 non-null  object 
dtypes: float64(3), object(6)
memory usage: 7.3+ MB


In [9]:
df['flair'].value_counts()

Non-Political         34037
Politics              30192
Policy/Economy        12880
AskIndia              12640
Science/Technology     4616
Business/Finance       4513
[R]eddiquette          3351
Sports                 1660
Photography            1128
Coronavirus            1019
Name: flair, dtype: int64

In [0]:
# Drop rows with a missing title or a missing full_text, then renumber the index.
df = df.dropna(subset=['title', 'full_text']).reset_index(drop=True)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106022 entries, 0 to 106021
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   title         106022 non-null  object 
 1   flair         106022 non-null  object 
 2   score         106022 non-null  float64
 3   num_comments  106022 non-null  float64
 4   author        106022 non-null  object 
 5   created_utc   106022 non-null  float64
 6   self_post     106022 non-null  object 
 7   over_18       106022 non-null  object 
 8   full_text     106022 non-null  object 
dtypes: float64(3), object(6)
memory usage: 7.3+ MB


In [0]:
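# Distinct flair labels, i.e. the classes we want to predict.
target_flairs = df['flair'].unique().tolist()
X_text = list(df['full_text'])
Y = list(df['flair'])
# Hold out 30% of the posts for testing. No random_state is set, so the split differs between runs.
X_train, X_test, y_train, y_test = train_test_split(X_text, Y, test_size=0.3)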
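
Since the flair distribution is heavily imbalanced, one option worth considering is a stratified, seeded split, which keeps the class proportions the same in both partitions and makes the run repeatable. The following is only a sketch on a hypothetical toy dataset (`toy_texts` and `toy_labels` are made up for illustration); it is not the split used for the results in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 "Politics" posts and 4 "Sports" posts.
toy_texts = [f"post {i}" for i in range(12)]
toy_labels = ["Politics"] * 8 + ["Sports"] * 4

# stratify keeps the 2:1 class ratio in both partitions;
# random_state makes the split reproducible across runs.
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

print(Counter(tr_y))
print(Counter(te_y))
```

With an unstratified split, a rare class such as Coronavirus could end up under-represented in either partition purely by chance.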

The first algorithm we will try is [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), a commonly recommended method for text classification.

1. We first build a pipeline that tokenizes the text into count vectors, applies a TF-IDF transform, and passes the result to the `MultinomialNB` classifier.

2. To get the best out of each algorithm, we must try several parameter settings. Scikit-learn's `GridSearchCV` makes this easy and efficient: it builds a grid from the parameter values we supply, cross-validates every combination, and returns the best-performing parameters.

3. We predict on the test set with the tuned model and print the final classification report.
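
The three steps above can be sketched end to end on a tiny corpus. This is a minimal illustration, assuming a hypothetical eight-document dataset (`docs`, `labels`) and a reduced grid; the names prefixed with `toy_` are placeholders and do not appear elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Reddit posts (not the real data).
docs = [
    "election results and party politics",
    "parliament debates new politics bill",
    "minister speaks on election campaign",
    "party leaders meet before election",
    "cricket match ends in a draw",
    "football league final match tonight",
    "team wins the cricket series",
    "players train before the football match",
]
labels = ["Politics"] * 4 + ["Sports"] * 4

# Step 1: tokenize -> count vectors -> TF-IDF -> classifier.
toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Step 2: "<step name>__<parameter>" addresses a parameter inside the pipeline.
toy_grid = {"vect__ngram_range": [(1, 1), (1, 2)], "clf__alpha": [1.0, 0.1]}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(docs, labels)

# Step 3: the fitted search object predicts with the best pipeline it found.
print(toy_search.best_params_)
print(toy_search.predict(["cricket match tonight"]))
```

The double-underscore naming is what lets a single grid tune the vectorizer, the TF-IDF transform, and the classifier together.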

In [0]:
naive_bayes = Pipeline([('vect', CountVectorizer(stop_words='english')),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB(fit_prior=False)),
              ], verbose=True)

In [0]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf__alpha': (1, 1e-1, 1e-2)}
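
To see how large this search is, the grid can be enumerated by hand. The sketch below uses only the standard library and assumes five cross-validation folds, which is what the fitting log reports; `param_grid` is a copy of the dictionary above under a different name.

```python
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1, 1e-1, 1e-2],
}

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, combo)) for combo in product(*param_grid.values())
]

n_folds = 5  # assumed number of cross-validation folds
print(len(candidates))            # number of candidates
print(len(candidates) * n_folds)  # number of cross-validation fits
```

This gives 2 × 2 × 3 = 12 candidates and 60 cross-validation fits, followed by one final refit of the best candidate on the full training set.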

In [0]:
gs_naive_bayes = GridSearchCV(naive_bayes, parameters, verbose=3)

In [19]:
gs_naive_bayes = gs_naive_bayes.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.1s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.596, total=   2.3s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.3s remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.1s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.591, total=   2.3s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.6s remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.1s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.582, total=   2.3s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.1s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.589, total=   2.3s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.1s
[CV]  clf__alpha=1, tfidf__use_

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  5.1min finished


[Pipeline] .............. (step 1 of 3) Processing vect, total=   7.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.6s


In [20]:
y_pred = gs_naive_bayes.predict(X_test)
print(classification_report(y_test, y_pred))

                    precision    recall  f1-score   support

          AskIndia       0.54      0.65      0.59      3776
  Business/Finance       0.62      0.15      0.24      1368
       Coronavirus       0.72      0.11      0.19       324
     Non-Political       0.55      0.69      0.61     10260
       Photography       0.80      0.23      0.36       333
    Policy/Economy       0.56      0.50      0.53      3899
          Politics       0.69      0.77      0.73      8942
Science/Technology       0.66      0.18      0.28      1419
            Sports       0.86      0.41      0.56       512
     [R]eddiquette       0.42      0.02      0.03       974

          accuracy                           0.60     31807
         macro avg       0.64      0.37      0.41     31807
      weighted avg       0.60      0.60      0.57     31807



In [21]:
gs_naive_bayes.best_params_

{'clf__alpha': 0.1, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
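
The gap between the macro and weighted averages in the report above comes from class imbalance: the macro average treats every flair equally, while the weighted average weights each flair by its support. The sketch below recomputes both F1 averages from the per-class values printed in the report; because those values are rounded to two decimals, the results only approximately match the reported 0.41 and 0.57.

```python
# Per-class F1 scores and supports copied from the Multinomial Naive Bayes report.
f1 = [0.59, 0.24, 0.19, 0.61, 0.36, 0.53, 0.73, 0.28, 0.56, 0.03]
support = [3776, 1368, 324, 10260, 333, 3899, 8942, 1419, 512, 974]

# Macro average: plain mean over classes.
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean over classes, weighted by the number of test samples.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(sum(support))
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The low macro average shows that the small classes are classified poorly even though overall accuracy looks acceptable.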

Next, we try the [Complement Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB) algorithm, which is particularly suited to imbalanced datasets such as ours.
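
For reference, the weight estimate behind Complement Naive Bayes, as described in the scikit-learn documentation (following Rennie et al., 2003), can be written as follows. Here $d_{ij}$ is the count (or TF-IDF value) of term $i$ in training document $j$, $\alpha_i$ is the smoothing parameter with $\alpha = \sum_i \alpha_i$, and $t_i$ is the value of term $i$ in the document being classified:

$$
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j : y_j \neq c} d_{ij}}{\alpha + \sum_{j : y_j \neq c} \sum_k d_{kj}}, \qquad
w_{ci} = \log \hat{\theta}_{ci}, \qquad
\hat{c} = \arg\min_c \sum_i t_i \, w_{ci}
$$

Because the weights for class $c$ are estimated from all documents outside $c$, a rare class draws on far more data than its own few examples would provide, which is the motivation for using it on imbalanced data.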

In [0]:
c_naive_bayes = Pipeline([('vect', CountVectorizer(stop_words='english')),
               ('tfidf', TfidfTransformer()),
               ('clf', ComplementNB(fit_prior=False)),
              ], verbose=True)

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf__alpha': (1, 1e-1, 1e-2)}

c_gs_naive_bayes = GridSearchCV(c_naive_bayes, parameters, n_jobs=-1, verbose=3)
c_gs_naive_bayes = c_gs_naive_bayes.fit(X_train, y_train)

y_pred = c_gs_naive_bayes.predict(X_test)
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  1.4min finished


[Pipeline] .............. (step 1 of 3) Processing vect, total=   7.2s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.5s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.7s
                    precision    recall  f1-score   support

          AskIndia       0.58      0.60      0.59      3811
  Business/Finance       0.59      0.16      0.26      1341
       Coronavirus       0.71      0.25      0.37       320
     Non-Political       0.56      0.69      0.62     10232
       Photography       0.78      0.22      0.35       338
    Policy/Economy       0.59      0.45      0.51      3838
          Politics       0.67      0.81      0.73      9037
Science/Technology       0.65      0.22      0.33      1363
            Sports       0.78      0.57      0.66       521
     [R]eddiquette       0.56      0.02      0.04      1006

          accuracy                           0.61     31807
         macro avg       0.65      0.40      0.45     31807
     

In [0]:
c_gs_naive_bayes.best_params_

{'clf__alpha': 1, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

Next, we train a linear SVM, fitted with stochastic gradient descent (`SGDClassifier` with hinge loss). The grid search reuses the `parameters` grid defined above.

In [0]:
sgd = Pipeline([('vect', CountVectorizer(stop_words='english')),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2', random_state=42, tol=None)),
               ], verbose=3)

In [24]:
gs_sgd = GridSearchCV(sgd, parameters, n_jobs=1, verbose=3, cv=2)
gs_sgd = gs_sgd.fit(X_train, y_train)

Fitting 2 folds for each of 12 candidates, totalling 24 fits
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=  56.5s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.293, total=  58.6s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1) .....


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   58.6s remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=  54.7s
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.320, total=  56.9s
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2) .....


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s
[Pipeline] ............... (step 3 of 3) Processing clf, total= 2.0min
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.286, total= 2.1min
[CV] clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2) .....
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s
[Pipeline] ............... (step 3 of 3) Processing clf, total= 1.9min
[CV]  clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.320, total= 2.0min
[CV] clf__alpha=1, tfidf__use_idf=False, vect__ngram_range=(1, 1) ....
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=  53.2s
[CV]  clf__alpha=1, tfidf__use_

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 33.9min finished


[Pipeline] .............. (step 1 of 3) Processing vect, total=   7.6s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.8s
[Pipeline] ............... (step 3 of 3) Processing clf, total= 4.3min


In [25]:
y_pred = gs_sgd.predict(X_test)
print(classification_report(y_test, y_pred))
print(gs_sgd.best_params_)

                    precision    recall  f1-score   support

          AskIndia       0.58      0.67      0.62      3776
  Business/Finance       0.54      0.20      0.30      1368
       Coronavirus       0.68      0.40      0.50       324
     Non-Political       0.61      0.63      0.62     10260
       Photography       0.74      0.18      0.29       333
    Policy/Economy       0.58      0.46      0.51      3899
          Politics       0.62      0.83      0.71      8942
Science/Technology       0.57      0.26      0.36      1419
            Sports       0.75      0.57      0.65       512
     [R]eddiquette       0.71      0.02      0.03       974

          accuracy                           0.61     31807
         macro avg       0.64      0.42      0.46     31807
      weighted avg       0.61      0.61      0.58     31807

{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
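
Note that `alpha` means something different here than in naive Bayes: in `SGDClassifier` it multiplies the regularization term, whereas in naive Bayes it is additive smoothing. The sketch below writes out the regularized hinge objective (mean hinge loss plus `alpha` times half the squared weight norm) in NumPy for a hypothetical two-point dataset, to illustrate why a value as large as 1 can be too strong. The function name `svm_objective` and the toy data are assumptions for illustration only.

```python
import numpy as np

def svm_objective(w, b, X, y, alpha):
    """Mean hinge loss plus an L2 penalty scaled by alpha."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + alpha * 0.5 * np.dot(w, w)

# Hypothetical toy data: two points, labels +1 and -1.
X_toy = np.array([[2.0, 0.0], [0.0, 2.0]])
y_toy = np.array([1.0, -1.0])

w_sep = np.array([1.0, -1.0])  # separates both points with margin 2
w_zero = np.zeros(2)           # ignores the data entirely

for alpha in (1.0, 0.01):
    print(alpha,
          svm_objective(w_sep, 0.0, X_toy, y_toy, alpha),
          svm_objective(w_zero, 0.0, X_toy, y_toy, alpha))
```

With `alpha=1.0`, the separating weights cost as much as the all-zero weights, so the optimizer has no incentive to fit the data; with `alpha=0.01`, the separating solution is clearly preferred. This is consistent with the low cross-validation scores logged for `clf__alpha=1`, and it suggests that a grid with smaller values could be worth trying.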


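Next, we train a logistic regression model with the `saga` solver and an L2 penalty, tuning the inverse regularization strength `C` with a grid search.

In [0]: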
logreg = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(solver="saga", penalty='l2',verbose=1)),
               ], verbose=5)

parameters = {'clf__C': np.logspace(0,4,10)}
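
The grid for `C` (the inverse of the regularization strength) is logarithmically spaced. A short sketch of what `np.logspace(0, 4, 10)` produces, using the hypothetical name `c_grid`:

```python
import numpy as np

# Ten values from 10**0 to 10**4, evenly spaced on a log scale.
c_grid = np.logspace(0, 4, 10)

# Each value is a constant factor larger than the previous one.
ratio = c_grid[1] / c_grid[0]

print(c_grid.round(2))
print(ratio)
```

Each step multiplies `C` by 10^(4/9), roughly 2.78, so the search covers four orders of magnitude with only ten candidates. In the log that follows, the larger values of `C` stop with "max_iter reached" instead of converging, so raising `max_iter` above its default of 100 might be worth trying for those candidates.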

In [37]:
gs_logreg = GridSearchCV(logreg, parameters, verbose=3, cv=2, n_jobs=1)
gs_logreg = gs_logreg.fit(X_train, y_train)
y_pred = gs_logreg.predict(X_test)
print(classification_report(y_test, y_pred))
print(gs_logreg.best_params_)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV] clf__C=1.0 ......................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 22 epochs took 5 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.2s finished


[CV] .......................... clf__C=1.0, score=0.599, total=  11.3s
[CV] clf__C=1.0 ......................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.3s remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.4s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 23 epochs took 5 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.5s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.5s finished


[CV] .......................... clf__C=1.0, score=0.602, total=  11.7s
[CV] clf__C=2.7825594022071245 .......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   23.1s remaining:    0.0s


[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 27 epochs took 6 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=   6.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.2s finished


[CV] ........... clf__C=2.7825594022071245, score=0.607, total=  12.2s
[CV] clf__C=2.7825594022071245 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 29 epochs took 6 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=   6.8s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.7s finished


[CV] ........... clf__C=2.7825594022071245, score=0.609, total=  12.9s
[CV] clf__C=7.742636826811269 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 44 epochs took 10 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  10.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.1s finished


[CV] ............ clf__C=7.742636826811269, score=0.608, total=  16.1s
[CV] clf__C=7.742636826811269 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 48 epochs took 11 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  11.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.1s finished


[CV] ............ clf__C=7.742636826811269, score=0.608, total=  17.1s
[CV] clf__C=21.544346900318832 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 87 epochs took 20 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  19.8s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   19.8s finished


[CV] ........... clf__C=21.544346900318832, score=0.606, total=  25.9s
[CV] clf__C=21.544346900318832 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 95 epochs took 22 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  22.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.0s finished


[CV] ........... clf__C=21.544346900318832, score=0.603, total=  28.1s
[CV] clf__C=59.94842503189409 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.3s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.3s finished


[CV] ............ clf__C=59.94842503189409, score=0.603, total=  29.5s
[CV] clf__C=59.94842503189409 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.4s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.5s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.4s finished


[CV] ............ clf__C=59.94842503189409, score=0.601, total=  29.6s
[CV] clf__C=166.81005372000593 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   4.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.4s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 24 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  24.6s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   24.6s finished


[CV] ........... clf__C=166.81005372000593, score=0.601, total=  30.8s
[CV] clf__C=166.81005372000593 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   4.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.1s finished


[CV] ........... clf__C=166.81005372000593, score=0.598, total=  29.3s
[CV] clf__C=464.15888336127773 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.2s finished


[CV] ........... clf__C=464.15888336127773, score=0.600, total=  29.3s
[CV] clf__C=464.15888336127773 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.9s finished


[CV] ........... clf__C=464.15888336127773, score=0.596, total=  29.0s
[CV] clf__C=1291.5496650148827 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.9s finished


[CV] ........... clf__C=1291.5496650148827, score=0.599, total=  28.9s
[CV] clf__C=1291.5496650148827 .......................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   4.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  22.9s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.8s finished


[CV] ........... clf__C=1291.5496650148827, score=0.597, total=  29.7s
[CV] clf__C=3593.813663804626 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 22 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  22.7s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.7s finished


[CV] ............ clf__C=3593.813663804626, score=0.598, total=  28.7s
[CV] clf__C=3593.813663804626 ........................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  22.8s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.8s finished


[CV] ............ clf__C=3593.813663804626, score=0.596, total=  28.8s
[CV] clf__C=10000.0 ..................................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  22.9s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.9s finished


[CV] ...................... clf__C=10000.0, score=0.599, total=  28.9s
[CV] clf__C=10000.0 ..................................................
[Pipeline] .............. (step 1 of 3) Processing vect, total=   3.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.3s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 23 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  23.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.2s finished


[CV] ...................... clf__C=10000.0, score=0.595, total=  29.3s


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  8.1min finished


[Pipeline] .............. (step 1 of 3) Processing vect, total=   8.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.7s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 26 epochs took 13 seconds
[Pipeline] ............... (step 3 of 3) Processing clf, total=  13.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.2s finished


                    precision    recall  f1-score   support

          AskIndia       0.61      0.63      0.62      3776
  Business/Finance       0.47      0.27      0.34      1368
       Coronavirus       0.70      0.43      0.53       324
     Non-Political       0.58      0.70      0.63     10260
       Photography       0.78      0.30      0.43       333
    Policy/Economy       0.57      0.55      0.56      3899
          Politics       0.71      0.74      0.72      8942
Science/Technology       0.53      0.33      0.40      1419
            Sports       0.80      0.57      0.67       512
     [R]eddiquette       0.51      0.03      0.06       974

          accuracy                           0.62     31807
         macro avg       0.63      0.45      0.50     31807
      weighted avg       0.62      0.62      0.61     31807

{'clf__C': 2.7825594022071245}


We will now try the Random Forest classifier.

In [0]:
rfc = Pipeline([('vect', CountVectorizer(stop_words='english')),
                ('tfidf', TfidfTransformer()),
                ('clf', RandomForestClassifier(n_estimators=100, max_depth=160, verbose=5)),
               ], verbose=2)

In [58]:
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

[Pipeline] .............. (step 1 of 3) Processing vect, total=   2.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.6s remaining:    0.0s


building tree 2 of 100


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.5s remaining:    0.0s


building tree 3 of 100


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.6s remaining:    0.0s


building tree 4 of 100


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    8.8s remaining:    0.0s


building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 1

[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  3.6min finished


[Pipeline] ............... (step 3 of 3) Processing clf, total= 3.6min


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.3s finished


                    precision    recall  f1-score   support

          AskIndia       0.60      0.56      0.58      3776
  Business/Finance       0.78      0.04      0.08      1368
       Coronavirus       0.77      0.22      0.34       324
     Non-Political       0.48      0.81      0.60     10260
       Photography       0.95      0.06      0.11       333
    Policy/Economy       0.65      0.35      0.46      3899
          Politics       0.71      0.66      0.69      8942
Science/Technology       0.71      0.08      0.14      1419
            Sports       0.87      0.27      0.41       512
     [R]eddiquette       0.93      0.01      0.03       974

          accuracy                           0.57     31807
         macro avg       0.75      0.31      0.34     31807
      weighted avg       0.63      0.57      0.53     31807



NearestCentroid is the last of the traditional ML algorithms we will try.

In [0]:
from sklearn.neighbors import NearestCentroid
nrc = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', NearestCentroid()),
                     ])
nrc.fit(X_train, y_train)
y_pred = nrc.predict(X_test)
print(classification_report(y_test, y_pred))



                    precision    recall  f1-score   support

          AskIndia       0.51      0.69      0.58      3811
  Business/Finance       0.22      0.54      0.31      1341
       Coronavirus       0.57      0.48      0.52       320
     Non-Political       0.61      0.36      0.46     10232
       Photography       0.19      0.68      0.30       338
    Policy/Economy       0.52      0.44      0.47      3838
          Politics       0.75      0.54      0.63      9037
Science/Technology       0.25      0.48      0.33      1363
            Sports       0.43      0.64      0.51       521
     [R]eddiquette       0.06      0.15      0.09      1006

          accuracy                           0.48     31807
         macro avg       0.41      0.50      0.42     31807
      weighted avg       0.57      0.48      0.50     31807
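
For intuition, the nearest-centroid rule can be sketched in a few lines of NumPy: average the feature vectors of each class, then assign a new sample to the class whose centroid is closest in Euclidean distance. The helper names (`fit_centroids`, `predict_nearest`) and the toy data are hypothetical; this is a simplified illustration, not scikit-learn's implementation.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the class labels and the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest(X, classes, centroids):
    """Assign each row of X to the class with the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Hypothetical 2-feature toy data.
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_toy = np.array(["Politics", "Politics", "Sports", "Sports"])

classes, centroids = fit_centroids(X_toy, y_toy)
preds = predict_nearest(np.array([[0.8, 0.2], [0.2, 0.7]]), classes, centroids)
print(preds)
```

Because every class is summarized by a single point regardless of its size, the rule does not favor large classes, which may explain why its recall on the small flairs is higher than that of the other models while its overall accuracy is lower.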



Multinomial Naive Bayes, Complement Naive Bayes, the linear SVM, and logistic regression perform comparably (60% to 62% accuracy), with logistic regression just edging out the others at 62%.

These comparable accuracies were only reached after exploring the parameter grid for each algorithm. The grid search logs show that some settings performed very poorly; for example, the linear SVM scored only about 0.3 in cross-validation with `alpha=1`.

Random Forest (57%) and NearestCentroid (48%) did not perform as well as the other methods.

It remains an open question whether training on a balanced dataset would improve performance on this test set. The extras folder of the repository contains the code in which I run these algorithms after augmenting the dataset by oversampling with SMOTE.
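
As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified stand-in written in NumPy under stated assumptions (the function name `smote_like_samples`, the parameters, and the toy points are all hypothetical); it is not the code from the extras folder, nor the implementation in any library.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))              # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its k nearest neighbours
        gap = rng.random()                    # position along the connecting segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Hypothetical minority-class points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplicates.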

From here, we proceed to deep learning algorithms for classification. These methods are implemented in notebooks 3A, 3B, and 3C.