**Let's first understand how to generate n-grams using CountVectorizer**

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
v.fit(["Uzair is looking for ML job."])
v.vocabulary_

{'uzair': 5, 'is': 1, 'looking': 3, 'for': 0, 'ml': 4, 'job': 2}

- Bag of N-Grams

In [4]:
v = CountVectorizer(ngram_range=(1, 2))
v.fit(["Uzair is looking for ML job."])
v.vocabulary_

{'uzair': 9,
 'is': 2,
 'looking': 5,
 'for': 0,
 'ml': 7,
 'job': 4,
 'uzair is': 10,
 'is looking': 3,
 'looking for': 6,
 'for ml': 1,
 'ml job': 8}

In [5]:
v = CountVectorizer(ngram_range=(1, 3))
v.fit(["Uzair is looking for ML job."])
v.vocabulary_

{'uzair': 12,
 'is': 3,
 'looking': 7,
 'for': 0,
 'ml': 10,
 'job': 6,
 'uzair is': 13,
 'is looking': 4,
 'looking for': 8,
 'for ml': 1,
 'ml job': 11,
 'uzair is looking': 14,
 'is looking for': 5,
 'looking for ml': 9,
 'for ml job': 2}
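
The vocabularies above can be reproduced by hand, which helps show what `ngram_range` does. The following is a minimal sketch, not scikit-learn's actual implementation: the helper names `word_ngrams` and `build_vocabulary` are made up for illustration, and the tokenization (lowercasing, keeping tokens of two or more word characters) is assumed to mirror `CountVectorizer`'s defaults.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror CountVectorizer's default tokenization).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, ngram_range=(1, 2)):
    """Map each distinct n-gram to its rank in alphabetical order."""
    unique = sorted({g for doc in docs for g in word_ngrams(doc, ngram_range)})
    return {gram: idx for idx, gram in enumerate(unique)}

vocab = build_vocabulary(["Uzair is looking for ML job."], ngram_range=(1, 2))
print(vocab)
```

The index assigned to each n-gram is simply its position in the alphabetically sorted list of features, which is why `'for'` gets 0 and `'uzair is'` gets 10 in the `(1, 2)` vocabulary.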

- We will now take a simple collection of text documents, preprocess them to remove stop words and lemmatize, and then generate a bag of 1-grams and 2-grams from them.

In [6]:
corpus = [
    "Thor ate pizza",
    "Loki is tall",
    "Loki is eating pizza"
]

In [9]:
import spacy

# load the english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # remove the stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [10]:
preprocess("Thor ate pizza")

'thor eat pizza'

In [11]:
preprocess("Loki is tall")

'Loki tall'

In [12]:
corpus_processed = [
    preprocess(text) for text in corpus
]
corpus_processed

['thor eat pizza', 'Loki tall', 'Loki eat pizza']

In [13]:
v = CountVectorizer(ngram_range=(1,2))
v.fit(corpus_processed)
v.vocabulary_

{'thor': 7,
 'eat': 0,
 'pizza': 5,
 'thor eat': 8,
 'eat pizza': 1,
 'loki': 2,
 'tall': 6,
 'loki tall': 4,
 'loki eat': 3}

- Now generate a bag of n-grams vector for a few sample documents

In [14]:
v.transform(["Thor eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 1, 1]])

- Let's take a document that contains an out-of-vocabulary (OOV) term and see how the bag of n-grams vectorizer turns it into a vector

In [15]:
v.transform(["Hulk eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 0, 0]])

### **News Category Classification Problem**

Okay, now that we know the basics of the bag of n-grams vectorizer 😎, it is time to work on a real problem. Here we want to do news category classification. We will use bag of n-grams and train a machine learning model that can categorize any news item into one of the following categories:

- BUSINESS
- SPORTS
- CRIME
- SCIENCE

In [16]:
import pandas as pd

df = pd.read_json("./datasets/news_dataset.json")
print(df.shape)
print(df.head())

(12695, 2)
                                                text  category
0  Watching Schrödinger's Cat Die University of C...   SCIENCE
1     WATCH: Freaky Vortex Opens Up In Flooded Lake    SCIENCE
2  Entrepreneurs Today Don't Need a Big Budget to...  BUSINESS
3  These Roads Could Recharge Your Electric Car A...  BUSINESS
4  Civilian 'Guard' Fires Gun While 'Protecting' ...     CRIME


- Check category distribution in dataset

In [17]:
df.category.value_counts()

category
BUSINESS    4254
SPORTS      4167
CRIME       2893
SCIENCE     1381
Name: count, dtype: int64

## Handle class imbalance

As you can see above, the SCIENCE category has roughly one third as many samples as the BUSINESS and SPORTS categories (1381 versus 4254 and 4167). I initially trained a model without handling the imbalance and saw a lower f1-score for the SCIENCE category. Hence we need to address this imbalance.

Of the techniques available for handling class imbalance, I will use **undersampling** here.

In undersampling, we take the minority class and draw that many samples from each of the other classes. This means we are not using all the data samples for training, and in the ML world, not using all the data for training is considered a SIN! 😵 In real life, you are advised to use a technique such as SMOTE so that you can use your whole dataset for training, but since this tutorial is more about bag of n-grams than class imbalance itself, I'll go with the simple technique of undersampling.

In [20]:
# SCIENCE is our minority class, with this many articles
min_sample = 1381

df_BUSINESS = df[df.category == "BUSINESS"].sample(min_sample, random_state=42)
df_SPORTS = df[df.category == "SPORTS"].sample(min_sample, random_state=42)
df_CRIME = df[df.category == "CRIME"].sample(min_sample, random_state=42)
df_SCIENCE = df[df.category == "SCIENCE"].sample(min_sample, random_state=42)

In [22]:
df_balanced = pd.concat([df_BUSINESS, df_SPORTS, df_CRIME, df_SCIENCE], axis=0)
df_balanced.category.value_counts()

category
BUSINESS    1381
SPORTS      1381
CRIME       1381
SCIENCE     1381
Name: count, dtype: int64
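
The same undersampling can be written without hard-coding the minority count or repeating one line per category. The sketch below is one possible alternative, not the approach used in this notebook; it runs on a hypothetical toy DataFrame (`toy`) so that it is self-contained.

```python
import pandas as pd

# Hypothetical toy data standing in for the news DataFrame.
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(11)],
    "category": ["BUSINESS"] * 5 + ["SPORTS"] * 4 + ["SCIENCE"] * 2,
})

# Size of the smallest class, computed rather than hard-coded.
min_count = int(toy["category"].value_counts().min())

# Draw min_count rows from every category.
balanced = toy.groupby("category").sample(n=min_count, random_state=42)
print(balanced["category"].value_counts())
```

Computing `min_count` from the data keeps the cell correct if the dataset changes, and `groupby(...).sample(...)` scales to any number of categories.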

- **Convert text category to a number**

In [23]:
target = {"BUSINESS": 0, "SPORTS": 1, "CRIME": 2, "SCIENCE": 3}

df_balanced['category_num'] = df_balanced.category.map(target)

In [26]:
df_balanced.sample(5)

|       | text | category | category_num |
|-------|------|----------|--------------|
| 12166 | U.S. Women's 4x100 Relay Team Scores Gold Afte... | SPORTS | 1 |
| 5445 | Report: New York Police Recruiting Muslim Info... | CRIME | 2 |
| 8801 | Centuries-Old 'Meditating' Mummy Found | SCIENCE | 3 |
| 6150 | New Evidence of Prehistoric Trade With Asia Fo... | SCIENCE | 3 |
| 4828 | Researchers Find More Women Buried At Stonehen... | SCIENCE | 3 |


- **Build a model with original text (no pre processing)**

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.text,
    df_balanced.category_num,
    test_size=0.2,
    random_state=42,
    stratify=df_balanced.category_num
)

In [29]:
print(X_train.shape)
X_train.head()

(4419,)


6414     Arby's Employee Keeps Job After Refusing To Se...
1318     Colorful NASA Image Shows Off Pluto’s Psychede...
4170     Women in Business Q&A: Sophie Delafontaine, Ar...
11310    5 Formalized Referral Systems to Grow Your Sal...
4188     Hawaii's Kilauea Volcano Sees A Mesmerizing Ri...
Name: text, dtype: object

In [30]:
y_train.value_counts()

category_num
0    1105
3    1105
1    1105
2    1104
Name: count, dtype: int64

In [31]:
y_test.value_counts()

category_num
2    277
0    276
1    276
3    276
Name: count, dtype: int64

- **Attempt 1 : Use 1-gram which is nothing but a Bag Of Words (BOW) model**

In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,1))),
    ('Multi NB', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.92      0.84       276
           1       0.92      0.85      0.88       276
           2       0.91      0.89      0.90       277
           3       0.89      0.82      0.85       276

    accuracy                           0.87      1105
   macro avg       0.88      0.87      0.87      1105
weighted avg       0.88      0.87      0.87      1105



In [33]:
X_test[:5]

12446    Shocking Video Of Officer Punching Woman Ignit...
3868     YOLO ATTACK: Birthday Stabbing Turns Party Int...
3301                             Great News For Obamacare 
11543    Katie Nolan Calls On Dallas Cowboys To Get Hel...
9501     Markets Tumble On Fears Of Global Economic Slo...
Name: text, dtype: object

In [35]:
y_pred[:5]

array([2, 2, 0, 1, 0])

In [34]:
y_test[:5]

12446    2
3868     2
3301     0
11543    1
9501     0
Name: category_num, dtype: int64

- **Attempt 2 : Use 1-gram and bigrams**

In [36]:
clf = Pipeline([
    ('vectorizer_1_2_grams', CountVectorizer(ngram_range=(1,2))),
    ('Multi NB', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.95      0.82       276
           1       0.92      0.83      0.87       276
           2       0.91      0.88      0.89       277
           3       0.93      0.79      0.85       276

    accuracy                           0.86      1105
   macro avg       0.87      0.86      0.86      1105
weighted avg       0.87      0.86      0.86      1105



- **Attempt 3 : Use 1-gram to trigrams**

In [37]:
clf = Pipeline([
    ('vectorizer_1_3_grams', CountVectorizer(ngram_range=(1,3))),
    ('Multi NB', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.95      0.81       276
           1       0.92      0.80      0.85       276
           2       0.91      0.88      0.89       277
           3       0.92      0.77      0.84       276

    accuracy                           0.85      1105
   macro avg       0.86      0.85      0.85      1105
weighted avg       0.86      0.85      0.85      1105
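
The three attempts differ only in `ngram_range`, so they can also be compared in a loop. The following is a rough sketch under stated assumptions: the helper `macro_f1_for_range` is a made-up name, and the tiny `toy_*` lists are invented purely so the snippet runs on its own. With the real data you would pass `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def macro_f1_for_range(X_train, y_train, X_test, y_test, ngram_range):
    """Fit CountVectorizer + MultinomialNB and return the macro-averaged F1."""
    clf = Pipeline([
        ("vectorizer", CountVectorizer(ngram_range=ngram_range)),
        ("nb", MultinomialNB()),
    ])
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="macro")

# Hypothetical toy data, only to make the sketch runnable on its own.
toy_train = [
    "stocks fall sharply",
    "team wins final",
    "market rally continues",
    "coach praises team",
]
toy_labels = [0, 1, 0, 1]
toy_test = ["stocks rally", "team wins"]
toy_test_labels = [0, 1]

# One score per n-gram range, keyed by the range itself.
scores = {
    rng: macro_f1_for_range(toy_train, toy_labels, toy_test, toy_test_labels, rng)
    for rng in [(1, 1), (1, 2), (1, 3)]
}
for rng, score in scores.items():
    print(rng, round(score, 2))
```

Collecting a single number per configuration makes it easier to compare the ranges side by side than reading three full classification reports.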



- **Use text pre-processing to remove stop words and punctuation and apply lemmatization**

You may wonder why we have not done any text preprocessing yet, such as removing stop words and punctuation or applying lemmatization. We wanted to train the model without any preprocessing first and check its performance. Now we will redo the same thing with preprocessed text.

In [38]:
df_balanced['preprocessed_txt'] = df_balanced['text'].apply(preprocess)

In [40]:
df_balanced.head()

|       | text | category | category_num | preprocessed_txt |
|-------|------|----------|--------------|------------------|
| 594 | How to Develop the Next Generation of Innovato... | BUSINESS | 0 | develop Generation Innovators stop treat way g... |
| 3093 | Madoff Victims' Payout Nears $7.2 Billion, Tru... | BUSINESS | 0 | Madoff Victims Payout near $ 7.2 billion Trust... |
| 7447 | Bay Area Floats 'Sanctuary In Transit Policy' ... | BUSINESS | 0 | Bay Area Floats Sanctuary Transit Policy prote... |
| 10388 | Microsoft Agrees To Acquire LinkedIn For $26.2... | BUSINESS | 0 | Microsoft agree acquire linkedin $ 26.2 billio... |
| 1782 | Inside A Legal, Multibillion Dollar Weed Market | BUSINESS | 0 | inside Legal Multibillion Dollar Weed Market |


- **Build a model with pre processed text**

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.preprocessed_txt, 
    df_balanced.category_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df_balanced.category_num
)

In [42]:
print(X_train.shape)
X_train.head()

(4419,)


5230    fairy Witches Astronauts boy run away girl lik...
2111    anticipation psychology wait Line spend lot ti...
7443           Jake Snake Roberts Intensive Care Collapse
1631    Jeweler order pay $ 34,500 Trashing Rival Fake...
7066    7 kill Australia Worst Mass Shooting 1996 desc...
Name: preprocessed_txt, dtype: object

In [43]:
y_train.value_counts()

category_num
3    1105
2    1105
0    1105
1    1104
Name: count, dtype: int64

In [44]:
y_test.value_counts()

category_num
1    277
0    276
3    276
2    276
Name: count, dtype: int64

In [45]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 2))), #using the ngram_range parameter 
    ('Multi NB', MultinomialNB())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.90      0.85       276
           1       0.93      0.82      0.87       277
           2       0.84      0.92      0.88       276
           3       0.90      0.82      0.86       276

    accuracy                           0.86      1105
   macro avg       0.87      0.86      0.86      1105
weighted avg       0.87      0.86      0.86      1105



- **Plot Confusion Matrix**

In [46]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[248,   8,  10,  10],
       [ 16, 228,  21,  12],
       [ 16,   4, 254,   2],
       [ 30,   5,  16, 225]])
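
The per-class precision and recall in the classification report can be recovered directly from this matrix. The sketch below copies the matrix values from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the cell above: rows are true labels, columns are predictions.
cm = np.array([
    [248,   8,  10,  10],
    [ 16, 228,  21,  12],
    [ 16,   4, 254,   2],
    [ 30,   5,  16, 225],
])

true_positives = np.diag(cm)

# Recall: correct predictions divided by all samples of that true class (row sum).
recall = true_positives / cm.sum(axis=1)

# Precision: correct predictions divided by all samples predicted as that class (column sum).
precision = true_positives / cm.sum(axis=0)

print(np.round(recall, 2))     # [0.9  0.82 0.92 0.82]
print(np.round(precision, 2))  # [0.8  0.93 0.84 0.9 ]
```

These values agree with the precision and recall columns of the report printed for the preprocessed-text model.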

In [None]:
from matplotlib import pyplot as plt
import seaborn as sn

# draw the confusion matrix as an annotated heatmap (rows: truth, columns: prediction)
plt.figure(figsize=(10, 7))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')
plt.show()