We will pre-process the AG News data, train Naive Bayes and Random Forest models on it, and test their accuracy.

In [1]:
from datasets import load_dataset
import pandas as pd
import spacy
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

In [2]:
nlp = spacy.load('en_core_web_sm')
tqdm.pandas()

In [3]:
dataset = load_dataset('ag_news')

In [4]:
# Convert the train and test splits to pandas DataFrames
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])

In [5]:
# Let's create a function that removes stopwords, punctuation and numbers, and converts the text to lower case
def rem_stopwords(text):
    doc = nlp(text)
    # Drop stopwords first, then keep only purely alphabetic tokens
    tokens = [token for token in doc if not token.is_stop]
    tokens = [token.text for token in tokens if token.is_alpha]
    return ' '.join(tokens).lower()
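
Calling the pipeline once per row is slow on 120,000 texts. A possible speed-up, shown here only as a sketch, is to stream the texts through spaCy's `nlp.pipe`, which processes them in batches. The helper name `rem_stopwords_batch` and the batch size are assumptions, not part of the original notebook, and the demo uses a blank English pipeline so it runs without downloading `en_core_web_sm`.

```python
import spacy

def rem_stopwords_batch(texts, nlp, batch_size=1000):
    """Hypothetical batched variant of rem_stopwords (same filtering rules)."""
    cleaned = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        tokens = [t.text for t in doc if not t.is_stop and t.is_alpha]
        cleaned.append(' '.join(tokens).lower())
    return cleaned

# Assumption: a tokenizer-only pipeline is enough for this demo, because
# is_stop and is_alpha are lexical attributes that need no trained model.
demo_nlp = spacy.blank('en')
demo = rem_stopwords_batch(['The 3 cats sat on the mat.'], demo_nlp)
print(demo)
```

With the notebook's own objects, the call would look like `df_train['processed'] = rem_stopwords_batch(df_train['text'], nlp)`; whether it is faster in practice depends on the machine and the pipeline components that are enabled.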

In [6]:
df_train['processed'] = df_train['text'].progress_apply(rem_stopwords)

100%|██████████| 120000/120000 [17:14<00:00, 116.03it/s]


In [7]:
df_test['processed'] = df_test['text'].progress_apply(rem_stopwords)

100%|██████████| 7600/7600 [01:05<00:00, 116.22it/s]


In [8]:
df_train.head(10)

,text,label,processed
0,Wall St. Bears Claw Back Into the Black (Reute...,2,wall bears claw black reuters reuters short se...
1,Carlyle Looks Toward Commercial Aerospace (Reu...,2,carlyle looks commercial aerospace reuters reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,2,oil economy cloud stocks outlook reuters reute...
3,Iraq Halts Oil Exports from Main Southern Pipe...,2,iraq halts oil exports main southern pipeline ...
4,"Oil prices soar to all-time record, posing new...",2,oil prices soar time record posing new menace ...
5,"Stocks End Up, But Near Year Lows (Reuters) Re...",2,stocks end near year lows reuters reuters stoc...
6,Money Funds Fell in Latest Week (AP) AP - Asse...,2,money funds fell latest week ap ap assets nati...
7,Fed minutes show dissent over inflation (USATO...,2,fed minutes dissent inflation retail sales bou...
8,Safety Net (Forbes.com) Forbes.com - After ear...,2,safety net earning sociology danny bazil riley...
9,Wall St. Bears Claw Back Into the Black NEW Y...,2,wall bears claw black new york reuters short s...


Let's fit a TF-IDF vectorizer on the processed training text. The test text is only transformed with it, never used for fitting.

In [9]:
vectorizer = TfidfVectorizer()
vectorizer.fit(df_train['processed'])

In [10]:
X_train = vectorizer.transform(df_train['processed'])
X_test = vectorizer.transform(df_test['processed'])
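
To illustrate what fitting on the training text alone implies, here is a minimal sketch on made-up sentences (the toy documents and the `toy_` names are hypothetical). The vocabulary comes from the training documents only, so a word that appears only at test time is ignored during `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy stand-ins for df_train['processed'] and df_test['processed']
train_docs = ['oil prices soar', 'stocks end near lows', 'oil exports halt']
test_docs = ['oil stocks rally']  # 'rally' never appears in the training docs

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(train_docs)  # vocabulary and IDF weights come from train only
toy_train = toy_vectorizer.transform(train_docs)
toy_test = toy_vectorizer.transform(test_docs)

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))                     # the learned vocabulary
print(toy_train.shape, toy_test.shape)   # same number of columns for both
print(toy_test.nnz)                      # non-zero entries in the test row
```

Both matrices share the same columns, which is what lets a classifier trained on one be applied to the other.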

In [11]:
X_train.shape, X_test.shape

((120000, 60405), (7600, 60405))
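
These matrices are stored in sparse form. As a rough check of what a dense copy would cost, the arithmetic below uses the training shape printed above and assumes 8 bytes per value (float64, the default dtype of `TfidfVectorizer`).

```python
n_rows, n_cols = 120000, 60405  # X_train.shape from the output above
bytes_per_value = 8             # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 2**30
print(f'{dense_bytes:,} bytes, about {dense_gib:.1f} GiB')
```

That is far more than most machines have in RAM, which matters for any step that calls `.toarray()` on `X_train`.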

Let's fit a Gaussian Naive Bayes model.

In [12]:
clf_gnb = GaussianNB()

In [13]:
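# GaussianNB needs dense input, hence .toarray(); a dense 120000 x 60405 float64 array takes roughly 58 GB of memory
clf_gnb.fit(X_train.toarray(), df_train['label'])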
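
If the dense conversion does not fit in memory, one alternative (not the model used in this notebook) is `MultinomialNB`, which accepts the sparse TF-IDF matrix directly. The sketch below runs on hypothetical toy sentences; the labels follow the AG News ids, where 1 is Sports and 2 is Business.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the AG News DataFrames
docs = ['oil prices soar', 'stocks fall oil', 'team wins match', 'coach praises team']
labels = [2, 2, 1, 1]  # AG News ids: 1 = Sports, 2 = Business

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # stays a SciPy sparse matrix

clf_mnb = MultinomialNB()
clf_mnb.fit(X, labels)       # no .toarray() needed

pred = clf_mnb.predict(vec.transform(['oil stocks', 'team match']))
print(pred)
```

On the real data this would be `clf_mnb.fit(X_train, df_train['label'])`. It is a different Naive Bayes variant from `GaussianNB`, so its accuracy is not directly comparable to a Gaussian model's.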
