# Traditional Machine Learning Methods

Several traditional methods classify tokenized text well.
We will use:
1. Probabilistic classifiers such as Multinomial Naive Bayes and Complement Naive Bayes
2. Linear SVMs
3. Tree-based ensemble methods such as the Random Forest classifier
4. Other classic algorithms such as the nearest centroid and gradient boosting classifiers

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

Using TensorFlow backend.


We will first prepare the dataset and extract the required flair and full_text columns.

In [2]:
df = pd.read_csv('dataset_final.csv', engine='python')

In [3]:
df.head(5)

| | Unnamed: 0 | title | flair | score | num_comments | author | created_utc | self_post | over_18 | full_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Even the poorest are supporting Modi in this. | Politics | 64.0 | 78.0 | hungarywolf | 1586106000.0 | False | False | Even the poorest are supporting Modi in this. |
| 1 | 1 | Someone tried to sell Statue of Unity on Olx. ... | Non-Political | 50.0 | 143.0 | Athar147 | 1586106000.0 | False | False | Someone tried to sell Statue of Unity on Olx. ... |
| 2 | 2 | Captured India with my phone. | Photography | 49.0 | 202.0 | random_saiyajin | 1586102000.0 | False | False | Captured India with my phone. |
| 3 | 3 | You guys are too impure to understand Modi Ji'... | Coronavirus | 39.0 | 81.0 | AdmiralSP | 1586101000.0 | True | False | You guys are too impure to understand Modi Ji'... |
| 4 | 4 | Posting again because stupidity was on show to... | Coronavirus | 34.0 | 21.0 | msbuttergourd | 1586108000.0 | False | False | Posting again because stupidity was on show to... |


In [4]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106037 entries, 0 to 106036
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   title         106036 non-null  object 
 1   flair         106036 non-null  object 
 2   score         106036 non-null  float64
 3   num_comments  106036 non-null  float64
 4   author        106036 non-null  object 
 5   created_utc   106036 non-null  float64
 6   self_post     106036 non-null  object 
 7   over_18       106036 non-null  object 
 8   full_text     106022 non-null  object 
dtypes: float64(3), object(6)
memory usage: 7.3+ MB


In [6]:
df['flair'].value_counts()

Non-Political         34037
Politics              30192
Policy/Economy        12880
AskIndia              12640
Science/Technology     4616
Business/Finance       4513
[R]eddiquette          3351
Sports                 1660
Photography            1128
Coronavirus            1019
Name: flair, dtype: int64

In [7]:
# Drop rows with a missing title or full_text instead of hard-coding row indices,
# then rebuild a contiguous index.
df = df.dropna(subset=['title', 'full_text'])
df = df.reset_index(drop=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106022 entries, 0 to 106021
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   title         106022 non-null  object 
 1   flair         106022 non-null  object 
 2   score         106022 non-null  float64
 3   num_comments  106022 non-null  float64
 4   author        106022 non-null  object 
 5   created_utc   106022 non-null  float64
 6   self_post     106022 non-null  object 
 7   over_18       106022 non-null  object 
 8   full_text     106022 non-null  object 
dtypes: float64(3), object(6)
memory usage: 7.3+ MB


In [9]:
# Unique flair names (the class labels); .index would return row numbers instead
target_flairs = df['flair'].unique().tolist()
X_text = list(df['full_text'])
Y = list(df['flair'])
X_train, X_test, y_train, y_test = train_test_split(X_text, Y, test_size=0.3)

The first algorithm we will try is [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), which is a recommended method for text classification.

1. We first build a processing pipeline: tokenize the text into count vectors, apply a TF-IDF transform, and pass the output to the MultinomialNB object.

2. We must try out various hyperparameters to get the best out of each algorithm. To do this easily and efficiently, we use Scikit-Learn's GridSearchCV, which builds a grid of the parameter values we specify, evaluates each combination with cross-validation, and returns the best parameters.

3. We predict on the test set with the trained model and print the final classification report.
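
The three steps above can be sketched end to end with a scikit-learn Pipeline, which keeps the vectorizer, the TF-IDF transform and the classifier together so that GridSearchCV refits all of them on every fold. This is only a minimal sketch on a tiny made-up corpus: the texts, labels, step names (vect, tfidf, clf) and cv=2 are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical; stands in for X_text and Y).
toy_texts = [
    "election results announced by the government",
    "parliament passes new bill after debate",
    "minister addresses rally before election",
    "government debate on new policy bill",
    "cricket team wins the final match",
    "striker scores twice in football match",
    "team announces squad for cricket series",
    "football league final ends in a draw",
]
toy_labels = ["Politics"] * 4 + ["Sports"] * 4

# Step 1: tokenize -> TF-IDF -> Multinomial Naive Bayes.
text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(fit_prior=False)),
])

# Step 2: grid over the smoothing parameter; "clf__" routes it to the last step.
param_grid = {"clf__alpha": (1, 1e-1, 1e-2)}
search = GridSearchCV(text_clf, param_grid, cv=2)
search.fit(toy_texts, toy_labels)

# Step 3: predict with the refitted best model.
toy_prediction = search.predict(["cricket match tonight"])
print(search.best_params_)
print(toy_prediction)
```

With a Pipeline, raw strings go straight into fit and predict, so the vocabulary is learned only from the training portion of each fold.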

In [10]:
vect = CountVectorizer(stop_words='english', ngram_range=(1,2))
X_train = vect.fit_transform(X_train)
X_test = vect.transform(X_test)

tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
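
To make the effect of stop_words and ngram_range concrete, the following sketch vectorizes two short made-up sentences and inspects the result. The sentences and the toy_ variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two hypothetical documents.
toy_docs = ["Captured India with my phone", "Someone tried to sell a phone"]

# Same settings as the cell above: English stop words removed,
# unigrams and bigrams kept as features.
toy_vect = CountVectorizer(stop_words="english", ngram_range=(1, 2))
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each unigram/bigram to its column index.
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)

# TF-IDF reweights the counts and (by default) L2-normalizes each row.
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
toy_row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(toy_row_norms)
```

Because stop words are removed before n-grams are formed, "india phone" becomes a bigram even though the two words are not adjacent in the original sentence.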

In [12]:
smote = SMOTE(sampling_strategy={'Coronavirus':6000, 'Science/Technology':6000,
'Business/Finance':6000,
'[R]eddiquette':6000,
'Sports':6000,
'Photography':6000})

X_train, y_train = smote.fit_resample(X_train, y_train)
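
SMOTE oversamples a minority class by creating synthetic points between existing samples and their nearest neighbours of the same class. The sketch below illustrates only that interpolation idea in plain NumPy. It is a simplified stand-in, not imblearn's implementation, and the function name smote_like_sample, the toy points and k=2 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])


def smote_like_sample(points, n_new, k=2, rng=rng):
    """Create n_new synthetic points by interpolating from a random point
    towards one of its k nearest neighbours in the same class."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every point in the class.
        dists = np.linalg.norm(points - points[i], axis=1)
        # Position 0 of the argsort is the point itself (distance 0), so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(points[i] + lam * (points[j] - points[i]))
    return np.array(synthetic)


new_points = smote_like_sample(minority, n_new=6)
print(new_points.shape)
```

In the cell above, the sampling_strategy dictionary asks SMOTE to generate enough such synthetic rows for each listed flair to reach 6000 training samples.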

In [14]:
naive_bayes = MultinomialNB(fit_prior=False)
parameters = {'alpha': (1, 1e-1, 1e-2)}
gs_naive_bayes = GridSearchCV(naive_bayes, parameters, verbose=3)
gs_naive_bayes = gs_naive_bayes.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................. alpha=1, score=0.665, total=   0.8s
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV] ............................. alpha=1, score=0.664, total=   0.8s
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.6s remaining:    0.0s


[CV] ............................. alpha=1, score=0.688, total=   0.7s
[CV] alpha=1 .........................................................
[CV] ............................. alpha=1, score=0.707, total=   0.7s
[CV] alpha=1 .........................................................
[CV] ............................. alpha=1, score=0.714, total=   0.7s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.706, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.712, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.724, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.746, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] .

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   11.8s finished


In [15]:
y_pred = gs_naive_bayes.predict(X_test)
print(classification_report(y_test, y_pred))
gs_naive_bayes.best_params_

                    precision    recall  f1-score   support

          AskIndia       0.53      0.65      0.58      3704
  Business/Finance       0.46      0.34      0.39      1430
       Coronavirus       0.52      0.46      0.48       346
     Non-Political       0.58      0.58      0.58     10232
       Photography       0.43      0.55      0.49       366
    Policy/Economy       0.56      0.50      0.53      3800
          Politics       0.70      0.76      0.72      9023
Science/Technology       0.49      0.36      0.41      1417
            Sports       0.75      0.68      0.71       519
     [R]eddiquette       0.17      0.07      0.10       970

          accuracy                           0.59     31807
         macro avg       0.52      0.50      0.50     31807
      weighted avg       0.58      0.59      0.59     31807



{'alpha': 0.1}

Next, we will try the [Complement Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB) algorithm, which is particularly suited to imbalanced datasets such as ours.
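
Complement Naive Bayes estimates the feature weights for each class from the documents of all the other classes, which makes the estimates less dependent on how few examples a rare class has. The sketch below fits both variants on a small, deliberately imbalanced made-up corpus; the texts, labels and imb_ names are assumptions for illustration. TfidfVectorizer is used here as a shorthand for CountVectorizer followed by TfidfTransformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Hypothetical imbalanced corpus: six Politics posts, two Sports posts.
imb_texts = [
    "government passes new budget bill",
    "election campaign begins in the state",
    "minister announces new policy",
    "parliament debate on the budget",
    "opposition questions the government",
    "state election results declared",
    "football team wins the match",
    "cricket match ends in a draw",
]
imb_labels = ["Politics"] * 6 + ["Sports"] * 2

imb_vect = TfidfVectorizer(stop_words="english")
X_imb = imb_vect.fit_transform(imb_texts)

# Both models expose the same fit/predict interface and the same alpha parameter.
mnb = MultinomialNB(fit_prior=False).fit(X_imb, imb_labels)
cnb = ComplementNB().fit(X_imb, imb_labels)

query = imb_vect.transform(["football match"])
mnb_pred = mnb.predict(query)
cnb_pred = cnb.predict(query)
print("MultinomialNB:", mnb_pred[0])
print("ComplementNB: ", cnb_pred[0])
```

On a corpus this small both models are expected to agree; differences tend to show up on larger datasets where rare classes share vocabulary with frequent ones.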

In [16]:
cnaive_bayes = ComplementNB(fit_prior=False)
parameters = {'alpha': (1, 1e-1, 1e-2)}
cgs_naive_bayes = GridSearchCV(cnaive_bayes, parameters, verbose=3)
cgs_naive_bayes = cgs_naive_bayes.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................. alpha=1, score=0.709, total=   1.0s
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s


[CV] ............................. alpha=1, score=0.706, total=   0.9s
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.9s remaining:    0.0s


[CV] ............................. alpha=1, score=0.727, total=   0.9s
[CV] alpha=1 .........................................................
[CV] ............................. alpha=1, score=0.753, total=   0.9s
[CV] alpha=1 .........................................................
[CV] ............................. alpha=1, score=0.753, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.670, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.669, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.677, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.694, total=   0.8s
[CV] alpha=0.1 .......................................................
[CV] .

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   13.3s finished


In [18]:
y_pred = cgs_naive_bayes.predict(X_test)
print(classification_report(y_test, y_pred))
cgs_naive_bayes.best_params_

                    precision    recall  f1-score   support

          AskIndia       0.57      0.60      0.59      3704
  Business/Finance       0.45      0.36      0.40      1430
       Coronavirus       0.42      0.70      0.52       346
     Non-Political       0.60      0.57      0.59     10232
       Photography       0.31      0.65      0.42       366
    Policy/Economy       0.60      0.46      0.52      3800
          Politics       0.68      0.80      0.73      9023
Science/Technology       0.48      0.39      0.43      1417
            Sports       0.60      0.81      0.69       519
     [R]eddiquette       0.20      0.06      0.10       970

          accuracy                           0.60     31807
         macro avg       0.49      0.54      0.50     31807
      weighted avg       0.59      0.60      0.59     31807



{'alpha': 1}

Next, we train a linear SVM, which we optimize using Stochastic Gradient Descent (SGD).
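
To show what training a linear SVM with SGD means, here is a minimal NumPy sketch of the hinge-loss update with an L2 penalty on a binary toy problem. It is a simplified illustration under stated assumptions: the data, the fixed learning rate and the epoch count are made up, and scikit-learn's SGDClassifier uses its own learning-rate schedule and handles multiple classes one-vs-rest.

```python
import numpy as np

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
                  [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(2)
b = 0.0
learning_rate = 0.1  # fixed step size (assumption)
alpha = 0.01         # L2 regularization strength, as in the grid's smallest value

for epoch in range(20):
    for x_i, y_i in zip(X_toy, y_toy):
        margin = y_i * (w @ x_i + b)
        # The L2 penalty shrinks the weights on every step.
        w -= learning_rate * alpha * w
        if margin < 1:
            # Hinge loss is active: push the boundary away from x_i.
            w += learning_rate * y_i * x_i
            b += learning_rate * y_i

predictions = np.sign(X_toy @ w + b)
print(w, b)
print(predictions)
```

A larger alpha shrinks the weights more strongly on every step, which is why the regularization strength is the parameter being tuned in the grid search.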

In [19]:
lsvm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, tol=None)
parameters = {'alpha': (1, 1e-1, 1e-2)}
gs_lsvm = GridSearchCV(lsvm, parameters, n_jobs=1, verbose=3, cv=2)
gs_lsvm = gs_lsvm.fit(X_train, y_train)

Fitting 2 folds for each of 3 candidates, totalling 6 fits
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................. alpha=1, score=0.221, total= 5.2min
[CV] alpha=1 .........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.2min remaining:    0.0s


[CV] ............................. alpha=1, score=0.214, total= 4.8min
[CV] alpha=0.1 .......................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 10.0min remaining:    0.0s


[CV] ........................... alpha=0.1, score=0.240, total= 4.6min
[CV] alpha=0.1 .......................................................
[CV] ........................... alpha=0.1, score=0.244, total= 4.3min
[CV] alpha=0.01 ......................................................
[CV] .......................... alpha=0.01, score=0.677, total= 4.5min
[CV] alpha=0.01 ......................................................
[CV] .......................... alpha=0.01, score=0.709, total= 4.0min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 27.5min finished


In [20]:
y_pred = gs_lsvm.predict(X_test)
print(classification_report(y_test, y_pred))
print(gs_lsvm.best_params_)

                    precision    recall  f1-score   support

          AskIndia       0.61      0.60      0.61      3704
  Business/Finance       0.44      0.36      0.40      1430
       Coronavirus       0.24      0.74      0.36       346
     Non-Political       0.66      0.50      0.57     10232
       Photography       0.21      0.71      0.33       366
    Policy/Economy       0.59      0.44      0.50      3800
          Politics       0.65      0.82      0.72      9023
Science/Technology       0.45      0.42      0.44      1417
            Sports       0.41      0.80      0.54       519
     [R]eddiquette       0.24      0.07      0.11       970

          accuracy                           0.58     31807
         macro avg       0.45      0.55      0.46     31807
      weighted avg       0.60      0.58      0.58     31807

{'alpha': 0.01}

Next, we try Logistic Regression with the saga solver and an L2 penalty, tuning the inverse regularization strength C.


In [22]:
logreg = LogisticRegression(solver="saga", penalty='l2',verbose=1)
parameters = {'C': np.logspace(0,4,10)}
gs_logreg = GridSearchCV(logreg, parameters, n_jobs=1, verbose=3, cv=2)
gs_logreg = gs_logreg.fit(X_train, y_train)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV] C=1.0 ...........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 22 epochs took 9 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.5s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................... C=1.0, score=0.644, total=   9.7s
[CV] C=1.0 ...........................................................
convergence after 21 epochs took 8 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.9s finished
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   17.8s remaining:    0.0s


[CV] ............................... C=1.0, score=0.666, total=   8.1s
[CV] C=2.7825594022071245 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 26 epochs took 11 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.4s finished


[CV] ................ C=2.7825594022071245, score=0.699, total=  11.6s
[CV] C=2.7825594022071245 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 25 epochs took 9 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.4s finished


[CV] ................ C=2.7825594022071245, score=0.741, total=   9.6s
[CV] C=7.742636826811269 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 52 epochs took 21 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.3s finished


[CV] ................. C=7.742636826811269, score=0.720, total=  21.6s
[CV] C=7.742636826811269 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 46 epochs took 16 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.5s finished


[CV] ................. C=7.742636826811269, score=0.754, total=  16.7s
[CV] C=21.544346900318832 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 90 epochs took 37 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   36.6s finished


[CV] ................ C=21.544346900318832, score=0.724, total=  36.8s
[CV] C=21.544346900318832 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 98 epochs took 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   34.6s finished


[CV] ................ C=21.544346900318832, score=0.759, total=  34.9s
[CV] C=59.94842503189409 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 40 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.5s finished


[CV] ................. C=59.94842503189409, score=0.720, total=  40.7s
[CV] C=59.94842503189409 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.7s finished


[CV] ................. C=59.94842503189409, score=0.756, total=  35.9s
[CV] C=166.81005372000593 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 40 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.2s finished


[CV] ................ C=166.81005372000593, score=0.716, total=  40.5s
[CV] C=166.81005372000593 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.1s finished


[CV] ................ C=166.81005372000593, score=0.753, total=  35.3s
[CV] C=464.15888336127773 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 41 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.8s finished


[CV] ................ C=464.15888336127773, score=0.714, total=  41.0s
[CV] C=464.15888336127773 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.1s finished


[CV] ................ C=464.15888336127773, score=0.751, total=  35.3s
[CV] C=1291.5496650148827 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 40 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.3s finished


[CV] ................ C=1291.5496650148827, score=0.713, total=  40.5s
[CV] C=1291.5496650148827 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.2s finished


[CV] ................ C=1291.5496650148827, score=0.750, total=  35.4s
[CV] C=3593.813663804626 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 40 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.4s finished


[CV] ................. C=3593.813663804626, score=0.713, total=  40.6s
[CV] C=3593.813663804626 .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.0s finished


[CV] ................. C=3593.813663804626, score=0.750, total=  35.2s
[CV] C=10000.0 .......................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 41 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.4s finished


[CV] ........................... C=10000.0, score=0.713, total=  40.6s
[CV] C=10000.0 .......................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 35 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.1s finished


[CV] ........................... C=10000.0, score=0.750, total=  35.3s


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 10.1min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


convergence after 97 epochs took 67 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.1min finished


In [23]:
y_pred = gs_logreg.predict(X_test)
print(classification_report(y_test, y_pred))
print(gs_logreg.best_params_)

                    precision    recall  f1-score   support

          AskIndia       0.59      0.61      0.60      3704
  Business/Finance       0.47      0.36      0.41      1430
       Coronavirus       0.62      0.49      0.55       346
     Non-Political       0.59      0.65      0.62     10232
       Photography       0.56      0.43      0.48       366
    Policy/Economy       0.55      0.53      0.54      3800
          Politics       0.71      0.74      0.73      9023
Science/Technology       0.50      0.39      0.44      1417
            Sports       0.58      0.66      0.62       519
     [R]eddiquette       0.29      0.07      0.12       970

          accuracy                           0.61     31807
         macro avg       0.54      0.49      0.51     31807
      weighted avg       0.60      0.61      0.60     31807

{'C': 21.544346900318832}


We will now try the Random Forest classifier.

In [24]:
rfc = RandomForestClassifier(n_estimators=100, max_depth=160, verbose=5)
rfc.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.8s remaining:    0.0s


building tree 2 of 100


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   15.8s remaining:    0.0s


building tree 3 of 100


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   23.9s remaining:    0.0s


building tree 4 of 100


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   31.8s remaining:    0.0s


building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 1

[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 14.1min finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=160, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=5, warm_start=False)

In [25]:
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.1s finished


                    precision    recall  f1-score   support

          AskIndia       0.57      0.39      0.46      3704
  Business/Finance       0.90      0.03      0.06      1430
       Coronavirus       0.72      0.26      0.38       346
     Non-Political       0.43      0.87      0.58     10232
       Photography       0.92      0.21      0.34       366
    Policy/Economy       0.78      0.11      0.19      3800
          Politics       0.73      0.62      0.67      9023
Science/Technology       0.79      0.06      0.12      1417
            Sports       0.79      0.40      0.53       519
     [R]eddiquette       0.96      0.02      0.04       970

          accuracy                           0.53     31807
         macro avg       0.76      0.30      0.34     31807
      weighted avg       0.64      0.53      0.48     31807

