@dataset{dataset,
    author = {Timilsina, Bimal},
    year = {2021},
    month = {08},
    title = {News Article Category Dataset}
}

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../input/newsarticlecategories/news-article-categories.csv')

In [3]:
display(df)

| Unnamed: 0 | category | title | body |
|---|---|---|---|
| 0 | ARTS & CULTURE | Modeling Agencies Enabled Sexual Predators For... | In October 2017, Carolyn Kramer received a dis... |
| 1 | ARTS & CULTURE | Actor Jeff Hiller Talks “Bright Colors And Bol... | This week I talked with actor Jeff Hiller abou... |
| 2 | ARTS & CULTURE | New Yorker Cover Puts Trump 'In The Hole' Afte... | The New Yorker is taking on President Donald T... |
| 3 | ARTS & CULTURE | Man Surprises Girlfriend By Drawing Them In Di... | Kellen Hickey, a 26-year-old who lives in Huds... |
| 4 | ARTS & CULTURE | This Artist Gives Renaissance-Style Sculptures... | There’s something about combining the traditio... |
| ... | ... | ... | ... |
| 6872 | WOMEN | Casually Fearless: Why Millennials Are Natural... | I still think about that Tuesday night dinner ... |
| 6873 | WOMEN | Happy Birthday To Us | I remember the morning of my high school gradu... |
| 6874 | WOMEN | The Culture of Love | My husband, Gene, doesn't wear pajamas. I aske... |
| 6875 | WOMEN | Carpe Diem, Oprah Style | \nBy AntonioGuillem, via ThinkStock\nBy Lisa ... |
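
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV's first header cell is empty, for example when an index was written to the file without a name. Assuming that is the case here (the source does not say), the following sketch shows how `index_col=0` would turn that column into the index. The inline CSV text is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# Made-up two-row CSV that mimics the layout above: the first header cell is empty.
csv_text = ",category,title\n0,ARTS & CULTURE,First title\n1,WOMEN,Second title\n"

# Default read: the unnamed first column is kept as a regular column.
default_df = pd.read_csv(io.StringIO(csv_text))
print(list(default_df.columns))

# With index_col=0 the first column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_df.columns))
```

Whether to do this in the notebook itself is a judgment call; the rest of the notebook never uses the `Unnamed: 0` column.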


# Prepare Data

## Remove stop words

Stop words are common words, such as 'is', 'a', and 'the', that contribute little to the meaning of a text. Removing them during data preparation lets the categorization focus on the words that carry the most meaning.

To accomplish this, we'll import the Natural Language Toolkit (NLTK) and download its English stop word list.
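
As a toy illustration of the idea, the sketch below filters one of the dataset's titles against a tiny hand-written stop word set. The set `toy_stop_words` and the helper `drop_stop_words` are hypothetical names used only for this example; they are stand-ins, not NLTK's list or tokenizer.

```python
# Illustrative stand-in for a real stop word list (not NLTK's list).
toy_stop_words = {"is", "a", "the", "of", "in"}

def drop_stop_words(text):
    # Split on whitespace and compare each word case-insensitively.
    return [w for w in text.split() if w.lower() not in toy_stop_words]

print(drop_stop_words("The Culture of Love"))  # ['Culture', 'Love']
```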

In [4]:
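import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Store the words in a set for fast membership tests.
# Note: this rebinds the name `stopwords` from the corpus reader to the set.
stopwords = set(stopwords.words('english'))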

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
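import re
from nltk import word_tokenize

def tokenize_title(title):
    """Split a title into tokens, dropping stop words and stray punctuation."""
    tokens = word_tokenize(title)
    cleaned = []
    for t in tokens:
        # The stop word check runs on the raw token, so a token that still
        # carries punctuation (for example "'In") is not recognized as one.
        if t.lower() in stopwords:
            continue
        # Remove quotes, whitespace, plus signs, and commas.
        filtered_word = re.sub(r'[\'"\s+,]', '', t)
        # Keep only words longer than one character.
        if len(filtered_word) > 1:
            cleaned.append(filtered_word)
    return cleaned

df['tokenized'] = df['title'].map(tokenize_title)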

In [6]:
print(df['tokenized'])

0       [Modeling, Agencies, Enabled, Sexual, Predator...
1       [Actor, Jeff, Hiller, Talks, Bright, Colors, B...
2       [New, Yorker, Cover, Puts, Trump, In, Hole, Ra...
3       [Man, Surprises, Girlfriend, Drawing, Differen...
4       [Artist, Gives, Renaissance-Style, Sculptures,...
                              ...                        
6872    [Casually, Fearless, Millennials, Natural, Ent...
6873                                [Happy, Birthday, Us]
6874                                      [Culture, Love]
6875                          [Carpe, Diem, Oprah, Style]
6876                       [Month, Online, Dating, Detox]
Name: tokenized, Length: 6877, dtype: object
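
Row 2 above still contains `In`, even though "in" is a stop word. One plausible explanation (an assumption, since it depends on how `word_tokenize` splits the quoted phrase) is that the token arrives as `'In`, fails the stop word test because of the leading quote, and only afterwards has the quote stripped. The sketch below shows the alternative order: clean each token first, then test it. To stay runnable without NLTK data, it uses a small hand-written stop word set and a hand-written token list; `clean_then_filter` is a hypothetical helper, not part of the notebook.

```python
import re

# Hand-written stand-in for a stop word list (an assumption, not NLTK's list).
toy_stop_words = {"in", "the", "after", "a", "of"}

def clean_then_filter(tokens):
    kept = []
    for t in tokens:
        # Strip quotes and commas first ...
        word = re.sub(r"['\",]", "", t)
        # ... then test the cleaned, lowercased word against the stop words.
        if len(word) > 1 and word.lower() not in toy_stop_words:
            kept.append(word)
    return kept

# Tokens as they might look if a quoted phrase keeps its opening quote attached.
tokens = ["Trump", "'In", "The", "Hole", "'"]
print(clean_then_filter(tokens))  # ['Trump', 'Hole']
```

Applying the same ordering inside `tokenize_title` would change the printed output above, so it is shown here only as a possible refinement.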
