In a language, the order of words is important. In the bag-of-words model, we capture individual words but not the relationships between them.

The meaning of a sentence depends on the order of its words, so the algorithm we use has to capture that order.

One way to do this is to slide a window over the text and capture pairs of adjacent words instead of single words. Each such pair is called a bi-gram; more generally, a run of n adjacent words is an n-gram.
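
The sliding window can be sketched in plain Python. The helper name `ngrams` is hypothetical, and splitting on whitespace is an assumption made for brevity; it is not how `CountVectorizer` tokenizes text.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: simple lowercase + whitespace tokenization
tokens = "Thor ate pizza".lower().split()

unigrams = ngrams(tokens, 1)  # window of one word
bigrams = ngrams(tokens, 2)   # window of two adjacent words

print(unigrams)
print(bigrams)
```

A window wider than the sentence simply yields no n-grams, so short texts contribute fewer features.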

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(ngram_range=(1,2))
v.fit(["Thor hathodawala is looking for a job"])
v.vocabulary_

{'thor': 9,
 'hathodawala': 2,
 'is': 4,
 'looking': 7,
 'for': 0,
 'job': 6,
 'thor hathodawala': 10,
 'hathodawala is': 3,
 'is looking': 5,
 'looking for': 8,
 'for job': 1}
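
Two details of this output are worth spelling out: the word "a" is missing, and the indices follow alphabetical order of the features. The sketch below reproduces both in plain Python, assuming the `CountVectorizer` defaults of lowercasing and keeping only tokens of two or more word characters. The function name `build_vocabulary` is hypothetical.

```python
import re


def build_vocabulary(text, max_n=2):
    # Assumption: lowercase, then keep tokens with at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    # Indices are assigned in sorted (alphabetical) order of the features
    return {feature: index for index, feature in enumerate(sorted(features))}


vocab = build_vocabulary("Thor hathodawala is looking for a job")
print(vocab)
```

Because "a" is dropped before the n-grams are formed, "for" and "job" become neighbours, which explains the 'for job' entry.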

In [15]:
corpus = ["Thor ate pizza", "Loki is tall", "Loki is eating pizza"]

In [16]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [17]:
def preprocess(text):
    doc = nlp(text)
    
    processed = []
    
    for token in doc:
        if not token.is_stop and not token.is_punct:
            processed.append(token.lemma_)
            
    return " ".join(processed) #returns string

In [18]:
preprocess("Thor ate pizza")

'thor eat pizza'

In [19]:
preprocess("Loki is eating pizza")

'Loki eat pizza'
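
To make the three filters in `preprocess` concrete without loading a spaCy model, here is a toy version. The stop-word set and the lemma table are small hand-picked stand-ins (assumptions for illustration only), not spaCy's actual data, and `toy_preprocess` is a hypothetical name.

```python
import string

# Hypothetical stand-ins for spaCy's stop-word list and lemmatizer
STOP_WORDS = {"is", "a", "the", "for"}
LEMMAS = {"ate": "eat", "eating": "eat"}


def toy_preprocess(text):
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation)  # drop surrounding punctuation
        if not token or token.lower() in STOP_WORDS:
            continue                            # skip empty tokens and stop words
        kept.append(LEMMAS.get(token.lower(), token))  # map to base form if known
    return " ".join(kept)


print(toy_preprocess("Loki is eating pizza"))
print(toy_preprocess("Loki is tall."))
```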

In [20]:
corpus_processed = [preprocess(text) for text in corpus]

In [21]:
corpus_processed

['thor eat pizza', 'Loki tall', 'Loki eat pizza']

In [22]:
v = CountVectorizer(ngram_range=(1,2))

In [23]:
v.fit(corpus_processed)

In [24]:
v.vocabulary_

{'thor': 7,
 'eat': 0,
 'pizza': 5,
 'thor eat': 8,
 'eat pizza': 1,
 'loki': 2,
 'tall': 6,
 'loki tall': 4,
 'loki eat': 3}

In [27]:
#convert sentence into vector using bag of n-grams model
v.transform(["Thor eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 1, 1]])

In [28]:
v.transform(["Hulk eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 0, 0]])
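
The two vectors differ only in the last two positions because "hulk" and "hulk eat" were never seen during fitting, so they are ignored. A minimal sketch of that lookup, using the vocabulary printed above and assuming the same lowercase, two-or-more-character tokenization (the function name `to_count_vector` is hypothetical):

```python
import re

# Vocabulary copied from the fitted vectorizer output
vocabulary = {
    "thor": 7, "eat": 0, "pizza": 5, "thor eat": 8, "eat pizza": 1,
    "loki": 2, "tall": 6, "loki tall": 4, "loki eat": 3,
}


def to_count_vector(text, vocabulary, max_n=2):
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    vector = [0] * len(vocabulary)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            index = vocabulary.get(" ".join(tokens[i:i + n]))
            if index is not None:  # n-grams outside the vocabulary are skipped
                vector[index] += 1
    return vector


print(to_count_vector("Thor eat pizza", vocabulary))
print(to_count_vector("Hulk eat pizza", vocabulary))
```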

In [30]:
import pandas as pd

In [32]:
df = pd.read_json("/Users/raghavraahul/Downloads/News_Category_Dataset_v3.json", lines = True)

In [33]:
df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [34]:
df.category.value_counts()

POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN              3572
CRIME              3562
IMPACT             3484
DIVORCE            3426
WORLD NEWS         3299
MEDIA              2944
WEIRD NEWS         2777
GREEN              2622
WORLDPOST          2579
RELIGION           2577
STYLE              2254
SCIENCE            2206
TECH               2104
TASTE              2096
MONEY              1756
ARTS               1509
ENVIRONMENT        1444
FIFTY              1401
GOOD NEWS          1398
U.S. NEWS          1377
ARTS & CULTURE     1339
COLLEGE            1144
LATINO VOICES      1130
CULTURE & ARTS     1074
EDUCATION       

In [41]:
#handling the imbalanced dataset by undersampling

#the smallest category overall has 1014 rows, but this example uses only
#BUSINESS, SPORTS, CRIME and SCIENCE and draws 1381 rows from each of them

#in real projects, throwing away training data like this is usually not advised

min_samples = 1381

#a fixed random_state makes each sample reproducible
df_business = df[df.category == "BUSINESS"].sample(min_samples, random_state=1)
df_sports = df[df.category == "SPORTS"].sample(min_samples, random_state=1)
df_crime = df[df.category == "CRIME"].sample(min_samples, random_state=1)
df_science = df[df.category == "SCIENCE"].sample(min_samples, random_state=1)
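
The idea behind undersampling can be sketched without pandas. Unlike the cell above, which uses a fixed `min_samples`, this hypothetical `undersample` helper derives the target size from the smallest class in the data; the toy rows are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key, seed=1):
    # Group rows by their class label
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Every class is cut down to the size of the smallest one
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy data: 6 BUSINESS, 4 SPORTS and 3 SCIENCE rows
rows = (
    [{"category": "BUSINESS"}] * 6
    + [{"category": "SPORTS"}] * 4
    + [{"category": "SCIENCE"}] * 3
)
balanced = undersample(rows, "category")
counts = Counter(row["category"] for row in balanced)
print(counts)
```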

In [42]:
df_balanced = pd.concat([df_business, df_sports, df_crime, df_science], axis = 0)
df_balanced.category.value_counts()

BUSINESS    1381
SPORTS      1381
CRIME       1381
SCIENCE     1381
Name: category, dtype: int64

In [43]:
#convert category to numerical values for machine learning
target = {"BUSINESS": 0, "SPORTS": 1, "CRIME": 2, "SCIENCE": 3}
df_balanced['category_num'] = df_balanced.category.map(target)

In [44]:
df_balanced.head(5)

Unnamed: 0,link,headline,category,short_description,authors,date,category_num
11814,https://www.huffingtonpost.com/entry/monster-e...,Trapped Inside The Monster Energy Frat House,BUSINESS,A woman who worked at the drink company said s...,Emily Peck,2018-03-29,0
130521,https://www.huffingtonpost.com/entry/pfizer-as...,Pfizer Is Abandoning Controversial Plan,BUSINESS,,,2014-05-26,0
173584,https://www.huffingtonpost.com/entry/officemax...,"OfficeMax, Office Depot Merger Could Happen Th...",BUSINESS,"The deal is not yet done, and talks could stil...","Reuters, Reuters",2013-02-18,0
201085,https://www.huffingtonpost.com/entry/las-cruce...,"Las Cruces, New Mexico, Threatens To Shut Off ...",BUSINESS,CORRECTION: A previous version of this article...,Harry Bradford,2012-04-28,0
117165,https://www.huffingtonpost.com/entry/how-sophi...,How Sophia Broke the Rules for Advice Based Bu...,BUSINESS,You see Sophia completely smashes the myth tha...,"John Murphy, ContributorBusiness Coach and Adv...",2014-10-25,0


In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_balanced.headline, 
                                                    df_balanced.category_num, 
                                                    test_size = 0.2, random_state = 1, 
                                                    stratify= df_balanced.category_num)

#stratify keeps the class proportions of category_num the same in the train and test sets
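
A simplified picture of what stratification does: split each class separately with the same test fraction, then combine. This is only a sketch under that assumption, not scikit-learn's implementation, and `stratified_split` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict


def stratified_split(items, labels, test_size=0.2, seed=1):
    # Group items by class so each class is split on its own
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)  # same fraction for every class
        test += [(item, label) for item in group[:n_test]]
        train += [(item, label) for item in group[n_test:]]
    return train, test


# Toy data: classes 0 and 1 have 10 items each, class 2 has 20
labels = [0] * 10 + [1] * 10 + [2] * 20
items = [f"headline {i}" for i in range(len(labels))]

train, test = stratified_split(items, labels)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts, test_counts)
```

Class 2 is twice as large as the others in both splits, so the proportions are preserved rather than the counts being made equal.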

In [46]:
X_train.head()

94508     Spurs Assistant Coach Becky Hammon Just Made H...
208809    Top 10 Retailers With The Most Sales Worldwide...
67776     ESPN To Have First Female Analyst Call Top Soc...
56288     Police Capture Ahmad Khan Rahami, Manhattan Bo...
111554              WATCH: An Astronaut's Guide To Optimism
Name: headline, dtype: object

In [47]:
y_train.value_counts()

0    1105
2    1105
3    1105
1    1104
Name: category_num, dtype: int64

In [48]:
#bag of words model

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [49]:
#Naive Bayes is a commonly recommended baseline for text classification problems
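
For intuition, multinomial Naive Bayes scores each class by adding the log prior of the class to the log likelihood of every token, with add-one smoothing (matching the default `alpha=1.0` of `MultinomialNB`). The following is a from-scratch sketch on invented headlines, not the scikit-learn implementation; `train_nb` and `predict_nb` are hypothetical names, and skipping unseen tokens is an assumption.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    class_counts = Counter(labels)        # documents per class
    token_counts = defaultdict(Counter)   # token frequencies per class
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab, alpha


def predict_nb(model, doc):
    class_counts, token_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for token in doc.lower().split():
            if token in vocab:  # assumption: unseen tokens are ignored
                score += math.log((token_counts[label][token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy headlines: 0 = BUSINESS, 1 = SPORTS
docs = [
    "stocks fall on wall street",
    "team wins the final game",
    "market rallies as stocks rise",
    "coach praises team after game",
]
labels = [0, 1, 0, 1]

model = train_nb(docs, labels)
print(predict_nb(model, "stocks market"))
print(predict_nb(model, "team game"))
```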

In [50]:
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85       276
           1       0.88      0.85      0.87       277
           2       0.86      0.89      0.88       276
           3       0.87      0.81      0.84       276

    accuracy                           0.86      1105
   macro avg       0.86      0.86      0.86      1105
weighted avg       0.86      0.86      0.86      1105
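
For reference, the per-class columns of this report can be computed by hand from true and predicted labels. A minimal sketch with invented labels (the function name `per_class_scores` is hypothetical):

```python
def per_class_scores(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Invented toy labels
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

precision, recall, f1, support = per_class_scores(y_true, y_pred, label=0)
print(precision, recall, f1, support)
```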



In [51]:
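#bag of n-grams model using unigrams and bigrams
clf1 = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),
    ('nb', MultinomialNB())
])

#fit and evaluate clf1 (not clf, which is the unigram-only pipeline)
#the report recorded below came from a run that refit clf, so it repeats the unigram scores;
#rerun this cell to get the bigram scores
clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))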

              precision    recall  f1-score   support

           0       0.82      0.88      0.85       276
           1       0.88      0.85      0.87       277
           2       0.86      0.89      0.88       276
           3       0.87      0.81      0.84       276

    accuracy                           0.86      1105
   macro avg       0.86      0.86      0.86      1105
weighted avg       0.86      0.86      0.86      1105



In [52]:
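#bag of n-grams model using unigrams, bigrams and trigrams
clf1 = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,3))),
    ('nb', MultinomialNB())
])

#fit and evaluate clf1 (not clf, which is the unigram-only pipeline)
#the report recorded below came from a run that refit clf, so it repeats the unigram scores;
#rerun this cell to get the trigram scores
clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))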

              precision    recall  f1-score   support

           0       0.82      0.88      0.85       276
           1       0.88      0.85      0.87       277
           2       0.86      0.89      0.88       276
           3       0.87      0.81      0.84       276

    accuracy                           0.86      1105
   macro avg       0.86      0.86      0.86      1105
weighted avg       0.86      0.86      0.86      1105
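
When comparing `ngram_range` settings, keep in mind that a wider range enlarges the vocabulary, and with it the number of features the classifier sees. The sketch below counts distinct features for the small processed corpus from earlier, assuming lowercase tokens of two or more word characters; `vocabulary_size` is a hypothetical helper.

```python
import re


def vocabulary_size(corpus, max_n):
    features = set()
    for text in corpus:
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                features.add(" ".join(tokens[i:i + n]))
    return len(features)


corpus_processed = ["thor eat pizza", "Loki tall", "Loki eat pizza"]

# Distinct features for ngram_range (1,1), (1,2) and (1,3)
sizes = {max_n: vocabulary_size(corpus_processed, max_n) for max_n in (1, 2, 3)}
print(sizes)
```

On a real headline corpus the growth is much steeper, and most of the added n-grams occur only once.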



In [53]:
#here, we didn't do any preprocessing

#let's see whether preprocessing will help improve accuracy of the model
df_balanced["preprocessed_text"] = df_balanced.headline.apply(preprocess)

In [54]:
df_balanced.head()

Unnamed: 0,link,headline,category,short_description,authors,date,category_num,preprocessed_text
11814,https://www.huffingtonpost.com/entry/monster-e...,Trapped Inside The Monster Energy Frat House,BUSINESS,A woman who worked at the drink company said s...,Emily Peck,2018-03-29,0,trap inside Monster Energy Frat House
130521,https://www.huffingtonpost.com/entry/pfizer-as...,Pfizer Is Abandoning Controversial Plan,BUSINESS,,,2014-05-26,0,Pfizer abandon Controversial Plan
173584,https://www.huffingtonpost.com/entry/officemax...,"OfficeMax, Office Depot Merger Could Happen Th...",BUSINESS,"The deal is not yet done, and talks could stil...","Reuters, Reuters",2013-02-18,0,OfficeMax Office Depot Merger happen week report
201085,https://www.huffingtonpost.com/entry/las-cruce...,"Las Cruces, New Mexico, Threatens To Shut Off ...",BUSINESS,CORRECTION: A previous version of this article...,Harry Bradford,2012-04-28,0,Las Cruces New Mexico threaten shut Public Uti...
117165,https://www.huffingtonpost.com/entry/how-sophi...,How Sophia Broke the Rules for Advice Based Bu...,BUSINESS,You see Sophia completely smashes the myth tha...,"John Murphy, ContributorBusiness Coach and Adv...",2014-10-25,0,Sophia break rule Advice Based business


In [55]:
X_train, X_test, y_train, y_test = train_test_split(df_balanced.preprocessed_text, 
                                                    df_balanced.category_num, 
                                                    test_size = 0.2, random_state = 1, 
                                                    stratify= df_balanced.category_num)

In [56]:
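#bag of n-grams model (unigrams and bigrams) trained on the preprocessed text
clf1 = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),
    ('nb', MultinomialNB())
])

#fit and evaluate clf1 (not clf, which is the unigram-only pipeline)
#the report recorded below came from a run that refit clf on the preprocessed text;
#rerun this cell to get the bigram scores
clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))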

              precision    recall  f1-score   support

           0       0.87      0.86      0.86       276
           1       0.86      0.87      0.87       277
           2       0.85      0.92      0.88       276
           3       0.90      0.83      0.86       276

    accuracy                           0.87      1105
   macro avg       0.87      0.87      0.87      1105
weighted avg       0.87      0.87      0.87      1105



In [None]:
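#in this run, accuracy rose from 0.86 on the raw headlines to 0.87 on the preprocessed text
#the gain is small, so verify it on your own data rather than assuming preprocessing always helps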