We will do some data preprocessing, then train a Naive Bayes model and a Random Forest model on the AG News data and test their accuracy.

In [18]:
from datasets import load_dataset
import pandas as pd
import spacy
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier

In [4]:
nlp = spacy.load('en_core_web_sm')
tqdm.pandas()

In [5]:
dataset = load_dataset('ag_news')

In [7]:
# Let's convert them to dataframes
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])

In [8]:
# Let's create a function to remove stopwords, punctuation and numbers, and convert to lower case
def rem_stopwords(text):
    doc = nlp(text)
    tokens = [token for token in doc if not token.is_stop ]
    tokens = [token.text for token in tokens if token.is_alpha]
    return ' '.join(tokens).lower()

In [9]:
df_train['processed'] = df_train['text'].progress_apply(rem_stopwords)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120000/120000 [20:49<00:00, 96.04it/s]


In [10]:
df_test['processed'] = df_test['text'].progress_apply(rem_stopwords)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7600/7600 [01:17<00:00, 98.56it/s]


In [11]:
df_train.head(10)

Unnamed: 0,text,label,processed
0,Wall St. Bears Claw Back Into the Black (Reute...,2,wall bears claw black reuters reuters short se...
1,Carlyle Looks Toward Commercial Aerospace (Reu...,2,carlyle looks commercial aerospace reuters reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,2,oil economy cloud stocks outlook reuters reute...
3,Iraq Halts Oil Exports from Main Southern Pipe...,2,iraq halts oil exports main southern pipeline ...
4,"Oil prices soar to all-time record, posing new...",2,oil prices soar time record posing new menace ...
5,"Stocks End Up, But Near Year Lows (Reuters) Re...",2,stocks end near year lows reuters reuters stoc...
6,Money Funds Fell in Latest Week (AP) AP - Asse...,2,money funds fell latest week ap ap assets nati...
7,Fed minutes show dissent over inflation (USATO...,2,fed minutes dissent inflation retail sales bou...
8,Safety Net (Forbes.com) Forbes.com - After ear...,2,safety net earning sociology danny bazil riley...
9,Wall St. Bears Claw Back Into the Black NEW Y...,2,wall bears claw black new york reuters short s...


Let's fit a TF-IDF vectorizer to the processed training text.

In [12]:
vectorizer = TfidfVectorizer()
vectorizer.fit(df_train['processed'])

In [13]:
X_train = vectorizer.transform(df_train['processed'])
X_test = vectorizer.transform(df_test['processed'])

In [14]:
X_train.shape, X_test.shape

((120000, 60405), (7600, 60405))
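
To make the numbers inside X_train less opaque, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings (smoothed IDF, then L2 normalisation of each row). The function name tfidf and the two toy documents are made up for illustration, and the sketch splits on whitespace, whereas the real vectorizer uses a regex tokenizer that also drops single-character tokens.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: returns (vocab, rows) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical toy corpus, not taken from AG News
vocab, rows = tfidf(["oil prices soar", "oil stocks fall"])
for row in rows:
    print([round(w, 3) for w in row])
```

A token that appears in every document ("oil" here) ends up with a lower weight than tokens that appear in only one, which is the point of the IDF term.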

Let's fit a Naive Bayes model.

In [15]:
clf_mnb = MultinomialNB()

In [16]:
clf_mnb.fit(X_train, df_train['label'])
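
For intuition about what the fit above estimates, here is a rough pure-Python sketch of multinomial Naive Bayes with additive smoothing (alpha=1.0, which is also the MultinomialNB default). The names train_mnb and predict_mnb and the toy documents and labels are hypothetical. Note that the sketch uses raw token counts, whereas this notebook feeds TF-IDF weights to the model.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token counts."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary token gets alpha pseudo-counts
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {tok: math.log((token_counts[c][tok] + alpha) / total)
                      for tok in vocab}
    return log_prior, log_lik

def predict_mnb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][tok] for tok in doc.split() if tok in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy data; the labels are arbitrary integers
toy_docs = ["oil prices soar", "stocks fall oil",
            "team wins match", "coach praises team"]
toy_labels = [2, 2, 1, 1]
log_prior, log_lik = train_mnb(toy_docs, toy_labels)
print(predict_mnb("oil stocks soar", log_prior, log_lik))
print(predict_mnb("team match", log_prior, log_lik))
```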

In [19]:
print('Train Accuracy: {}, Test Accuracy: {}'.format(
    accuracy_score(y_true=df_train['label'], y_pred=clf_mnb.predict(X_train)), 
    accuracy_score(y_true=df_test['label'], y_pred=clf_mnb.predict(X_test)))
)

Train Accuracy: 0.9177666666666666, Test Accuracy: 0.9021052631578947


In [24]:
print('Train f-score: {}, Test f-score: {}'.format(
    f1_score(y_true=df_train['label'], y_pred=clf_mnb.predict(X_train), average='weighted'), 
    f1_score(y_true=df_test['label'], y_pred=clf_mnb.predict(X_test), average='weighted'))
)

Train f-score: 0.9176070023000131, Test f-score: 0.9018850978591217
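
The average='weighted' option used above computes one F1 score per class and averages them, weighting each class by its number of true examples. A small hand-rolled sketch of that calculation follows; weighted_f1 and the toy labels are illustrative only and are not part of scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score

# Hypothetical toy labels
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
result = weighted_f1(toy_true, toy_pred)
print(round(result, 4))  # per-class F1 is 0.5, 0.8 and 2/3, so this prints 0.6556
```

AG News has the same number of examples in every class, which is why the weighted f-scores in this notebook stay so close to the accuracy figures.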


Let's try a random forest now.

In [25]:
clf_rf = RandomForestClassifier(n_estimators=1000, verbose=1, n_jobs=6)

In [26]:
clf_rf.fit(X_train, df_train['label'])

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:   50.0s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:  3.8min
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:  8.7min
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed: 15.7min
[Parallel(n_jobs=6)]: Done 1000 out of 1000 | elapsed: 19.8min finished


In [27]:
print('Train Accuracy: {}, Test Accuracy: {}'.format(
    accuracy_score(y_true=df_train['label'], y_pred=clf_rf.predict(X_train)), 
    accuracy_score(y_true=df_test['label'], y_pred=clf_rf.predict(X_test)))
)

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    5.1s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   11.7s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:   20.8s
[Parallel(n_jobs=6)]: Done 1000 out of 1000 | elapsed:   26.2s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    0.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    0.4s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:    0.9s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:    1.6s


Train Accuracy: 0.9994416666666667, Test Accuracy: 0.8976315789473684


[Parallel(n_jobs=6)]: Done 1000 out of 1000 | elapsed:    2.0s finished


In [28]:
print('Train f-score: {}, Test f-score: {}'.format(
    f1_score(y_true=df_train['label'], y_pred=clf_rf.predict(X_train), average='weighted'), 
    f1_score(y_true=df_test['label'], y_pred=clf_rf.predict(X_test), average='weighted'))
)

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    5.0s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   11.6s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:   20.8s
[Parallel(n_jobs=6)]: Done 1000 out of 1000 | elapsed:   26.4s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    0.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    0.4s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:    0.9s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:    1.5s


Train f-score: 0.9994416887414893, Test f-score: 0.897225432163428


[Parallel(n_jobs=6)]: Done 1000 out of 1000 | elapsed:    2.0s finished
