
# News Category Automation NLP

# Data Wrangling

## Import Modules

In [12]:
import pandas as pd
import numpy as np
import json

## Load Data

In [13]:
file_name = 'News_Category_Dataset_v3.json'
# pd.read_json has no error_bad_lines option (passing it raises TypeError),
# so parse the JSON Lines file one record at a time and skip malformed lines.
records = []
with open(file_name, 'r') as f:
    for i, line in enumerate(f):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line {i+1}: {e}")
            continue  # Skip the problematic line
data = pd.DataFrame(records)

In [14]:
data.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   link               209527 non-null  object
 1   headline           209527 non-null  object
 2   category           209527 non-null  object
 3   short_description  209527 non-null  object
 4   authors            209527 non-null  object
 5   date               209527 non-null  object
dtypes: object(6)
memory usage: 9.6+ MB
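
The `info()` output above lists `date` as `object`, which means the dates are held as plain strings. Below is a minimal sketch of converting them, assuming every value follows the `YYYY-MM-DD` pattern seen in `data.head()`. The toy frame `sample` is a hypothetical stand-in for `data`, and `errors='coerce'` is an assumption that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the last value is deliberately malformed.
sample = pd.DataFrame({'date': ['2022-09-23', '2022-09-22', 'not a date']})

# Parse ISO-style strings; anything that does not match the format becomes NaT.
sample['date'] = pd.to_datetime(sample['date'], format='%Y-%m-%d', errors='coerce')

# Count how many values failed to parse.
unparsed = int(sample['date'].isna().sum())
print(unparsed)
```

On the real frame the same call would be `data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d', errors='coerce')`, followed by a check that the number of `NaT` values is zero.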


## Explore Data

In [16]:
#check for duplicates
duplicates = data.duplicated()
data[duplicates]

Unnamed: 0,link,headline,category,short_description,authors,date
67677,https://www.huffingtonpost.comhttp://www.mothe...,"On Facebook, Trump's Longtime Butler Calls For...",POLITICS,"Anthony Senecal, who worked as Donald Trump's ...",,2016-05-12
67923,https://www.huffingtonpost.comhttp://gizmodo.c...,Former Facebook Workers: We Routinely Suppress...,TECH,Facebook workers routinely suppressed news sto...,,2016-05-09
70239,https://www.huffingtonpost.comhttp://www.cnbc....,"On Equal Pay Day, The Gap Is Still Too Wide",WOMEN,Equal Pay Day falls on April 12 in 2016. It's ...,,2016-04-12
139830,https://www.huffingtonpost.comhttp://www.cnn.c...,The World's Most Dangerous Workout?,WELLNESS,"Is the ""sport of fitness"" the world's most dan...",,2014-02-10
144409,https://www.huffingtonpost.comhttp://www.upwor...,Some People Call It 'The Best Anti-Smoking Ad ...,WELLNESS,Almost all smokers know cigarettes are bad for...,,2013-12-22
145142,https://www.huffingtonpost.comhttp://www.weath...,10 Cities That Could Run Out Of Water - Weathe...,ENVIRONMENT,"Securing access to plentiful, renewable source...",,2013-12-15
178155,https://www.huffingtonpost.comhttp://www.busin...,Google Is Attacking Apple From The Inside Out ...,TECH,After years of hammering away at Apple's share...,,2013-01-01
187329,https://www.huffingtonpost.comhttp://www.nytim...,"Eating For Health, Not Weight",WELLNESS,Almost half of Americans are on a diet -- not ...,,2012-09-23
194596,https://www.huffingtonpost.comhttp://blogs.wsj...,Apple Removes Green EPEAT Electronics Certific...,TECH,Apple has pulled its products off the U.S. gov...,,2012-07-07
194598,https://www.huffingtonpost.comhttp://www.theda...,Microsoft's $6.2 Billion Writedown Shows It's ...,TECH,Fighting for online advertising dominance with...,,2012-07-07


In [17]:
#drop duplicates
data.drop_duplicates(inplace=True)

#confirm no duplicates
data.duplicated().sum()

np.int64(0)

In [18]:
#count of each category
data['category'].value_counts()

category,count
POLITICS,35601
WELLNESS,17942
ENTERTAINMENT,17362
TRAVEL,9900
STYLE & BEAUTY,9811
PARENTING,8791
HEALTHY LIVING,6694
QUEER VOICES,6347
FOOD & DRINK,6340
BUSINESS,5992


There are many categories, and several of them overlap. It is worth consolidating near-duplicates such as ARTS, ARTS & CULTURE, and CULTURE & ARTS.

Other categories need a closer look before we can tell what they cover: THE WORLDPOST, TASTE, MONEY, FIFTY, and GOOD NEWS.
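
One way to surface overlapping labels without reading the whole list is to group labels by the words they share. The sketch below is only an illustration: `labels_sharing_a_word` is a hypothetical helper, and the `labels` list is a small hand-picked subset of the category names rather than the full set.

```python
from collections import defaultdict

def labels_sharing_a_word(labels):
    """Group labels by each word they contain; keep words shared by 2+ labels."""
    by_word = defaultdict(list)
    for label in labels:
        # Split on whitespace and ignore the '&' joiner.
        for word in set(label.split()) - {'&'}:
            by_word[word].append(label)
    return {word: sorted(group) for word, group in by_word.items() if len(group) > 1}

# Hand-picked subset of category names, for illustration only.
labels = ['ARTS', 'ARTS & CULTURE', 'CULTURE & ARTS', 'STYLE', 'STYLE & BEAUTY', 'POLITICS']
candidates = labels_sharing_a_word(labels)
for word in sorted(candidates):
    print(word, '->', candidates[word])
```

On the real data the input would be `data['category'].unique()`. A shared word only marks a pair as a candidate; whether two labels really mean the same thing still needs a manual look at the articles.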

## Explore Categories

In [19]:
#The WorldPost
data[data['category'] == 'THE WORLDPOST'].head(10)

Unnamed: 0,link,headline,category,short_description,authors,date
22902,https://www.huffingtonpost.com/entry/north-kor...,It's Too Late To Stop North Korea As A Nuclear...,THE WORLDPOST,The window to roll back Pyongyang's weapons pr...,"Cui Lei, ContributorResearch fellow, China Ins...",2017-10-13
23954,https://www.huffingtonpost.com/entry/weekend-r...,Weekend Roundup: The Battle For Europe,THE WORLDPOST,Wary German voters and Catalan separatists are...,"Nathan Gardels, ContributorEditor-in-chief, Th...",2017-09-29
23979,https://www.huffingtonpost.com/entry/eu-france...,The Fate Of Europe Rests On The French-German ...,THE WORLDPOST,Macron and Merkel must figure out a common way...,"Sébastien Maillard, ContributorDirector, Jacqu...",2017-09-29
24046,https://www.huffingtonpost.com/entry/macron-me...,"After Merkel's Victory, Macron Sets Sights On ...",THE WORLDPOST,France’s transformation under Macron will push...,"Sylvie Goulard, ContributorFormer French defen...",2017-09-28
24460,https://www.huffingtonpost.com/entry/weekend-r...,Weekend Roundup: Trump's U.N. Speech Marks The...,THE WORLDPOST,We are leaving the postwar era that saw the U....,"Nathan Gardels, ContributorEditor-in-chief, Th...",2017-09-23
24477,https://www.huffingtonpost.com/entry/amal-cloo...,"Amal Clooney: 'Finally, We Have A Coordinated ...",THE WORLDPOST,"Thanks to a new international investigation, v...","Amal Clooney, ContributorLawyer, Activist and ...",2017-09-23
24544,https://www.huffingtonpost.com/entry/china-hol...,Why The China-Hollywood Relationship Is Compli...,THE WORLDPOST,"With China, Hollywood must navigate a constant...",Suzanne Gaber,2017-09-22
24654,https://www.huffingtonpost.com/entry/mike-meda...,Why The New Hollywood Will Never Live Up To Ol...,THE WORLDPOST,The industry has changed since the legendary t...,Rosa O'Hara,2017-09-21
24666,https://www.huffingtonpost.com/entry/trump-nat...,Why We Won’t See Trump-Like Nationalism In Ind...,THE WORLDPOST,"""America First"" is based in exclusivism, but I...","David T. Hill and Krishna Sen, Contributors",2017-09-21
24842,https://www.huffingtonpost.com/entry/islam-pop...,Islamist And Western Populism Have More In Com...,THE WORLDPOST,Reactive populism has become a decisive force ...,"Rainer Heufers, ContributorExecutive Director,...",2017-09-19


This looks like a mix of world news/politics, entertainment, and some tech. We'll leave it as its own category for now.

In [20]:
# Taste Category
data[data['category'] == 'TASTE'].head(10)

Unnamed: 0,link,headline,category,short_description,authors,date
16173,https://www.huffingtonpost.com/entry/ice-water...,It's Weird That American Restaurants Serve Ice...,TASTE,But why do we even have ice in our drinks in t...,Todd Van Luling,2018-01-16
16242,https://www.huffingtonpost.com/entry/pineapple...,"Pineapple Casserole, The Southern Dish That's ...",TASTE,"It's got pineapple, cheddar and a whole lot of...",Kristen Aiken,2018-01-16
16516,https://www.huffingtonpost.com/entry/how-to-ge...,How To Actually Get A Bartender's Attention,TASTE,Plus other things they wish you knew.,Taylor Pittman,2018-01-11
16599,https://www.huffingtonpost.com/entry/diet-coke...,Diet Coke's Millennial-Inspired Makeover Leave...,TASTE,"It's not like a regular soda, it's a cool soda.",Abigail Williams,2018-01-10
16776,https://www.huffingtonpost.com/entry/sunions-t...,We Tested The New 'Tearless' Onions To See If ...,TASTE,"Put away your goggles, people.",Kristen Aiken,2018-01-08
16993,https://www.huffingtonpost.com/entry/in-n-out-...,In-N-Out Burger Now Serves Hot Cocoa And It's ...,TASTE,Some people are so excited.,Ron Dicker,2018-01-04
17007,https://www.huffingtonpost.com/entry/kfc-troll...,KFC Trolls Trump With Spoof Ronald McDonald Th...,TASTE,"""Mine is a box meal which is bigger and more p...",Lee Moran,2018-01-04
17138,https://www.huffingtonpost.com/entry/best-mock...,The Best Mocktails To Order At The Bar,TASTE,These aren't your lame old virgin daiquiris.,Kristen Aiken,2018-01-03
17144,https://www.huffingtonpost.com/entry/how-do-ai...,"How The Heck Do Air Fryers Work, Anyway?",TASTE,Make everyday Fry Day 🍟,Brittany Nims,2018-01-03
17271,https://www.huffingtonpost.com/entry/coffee-ca...,Coffee Cake To Cocktails: 20 Festive New Year'...,TASTE,There’s no better way to ring in the New Year ...,"Jennifer Segal, ContributorChef, Cookbook Auth...",2017-12-31


This can definitely be combined with the FOOD & DRINK category.

In [21]:
# Money
data[data['category'] == 'MONEY'].head(10)

Unnamed: 0,link,headline,category,short_description,authors,date
1796,https://www.huffpost.com/entry/holiday-shoppin...,Why You Should Get Your Holiday Shopping Done ...,MONEY,"Supply chain issues are out of your control, b...",Caroline Bologna,2021-10-21
3259,https://www.huffpost.com/entry/gamestop-stock-...,Investors Who Made Money Trading GameStop Have...,MONEY,"Those who recently flipped shares of GameStop,...",Casey Bond,2021-02-04
3335,https://www.huffpost.com/entry/tax-season-dela...,The IRS Delayed Tax Season. Here's How To Get ...,MONEY,Last-minute changes to tax laws in 2020 mean t...,Casey Bond,2021-01-22
4623,https://www.huffpost.com/entry/stimulus-check-...,Where's My Stimulus Check? How To Track Your C...,MONEY,"If your check was sent but you never got it, y...",Casey Bond,2020-06-16
4854,https://www.huffpost.com/entry/pay-taxes-coron...,Do You Have To Pay Taxes On Your Coronavirus S...,MONEY,Getting a payment won't mean a higher tax bill...,Casey Bond,2020-05-08
5182,https://www.huffpost.com/entry/bear-market-exp...,This Is What Happens When Stocks Enter A Bear ...,MONEY,"Experts say on average, bear markets have last...","Alex Viega, AP",2020-03-12
5183,https://www.huffpost.com/entry/coronavirus-cru...,Cruise Lines Are Paying Customers Not To Cance...,MONEY,"Would you risk your health ― and potentially, ...",Casey Bond,2020-03-12
5190,https://www.huffpost.com/entry/coronavirus-tim...,"Thanks To The Coronavirus, Now Is A Great Time...",MONEY,The only catch? There's much more demand than ...,Casey Bond,2020-03-10
5191,https://www.huffpost.com/entry/panic-buying-co...,The Psychology Behind Panic-Buying – And How T...,MONEY,"People are stockpiling – or ""panic-buying"" – t...",Natasha Hinde,2020-03-10
5239,https://www.huffpost.com/entry/buy-nothing-gro...,'Buy Nothing' Groups: Stop Spending Money And ...,MONEY,These Facebook groups prohibit exchanging mone...,Casey Bond,2020-02-28


This seems to be a lifestyle/shopping category rather than a business- or finance-focused one.

In [22]:
# Fifty
data[data['category'] == 'FIFTY'].head(10)

Unnamed: 0,link,headline,category,short_description,authors,date
43952,https://www.huffingtonpost.com/entry/love-face...,"Love, Facebook and Infidelity",FIFTY,,"Roz Warren, ContributorAuthor of OUR BODIES, O...",2017-02-05
47074,https://www.huffingtonpost.com/entry/boomers-w...,"Boomers Were Time's ""Man of the Year"" Fifty Ye...",FIFTY,,"Candy Leonard, ContributorSociologist, author ...",2017-01-02
47660,https://www.huffingtonpost.com/entry/be-gratef...,Be Grateful At The Holidays For Sprinkles Of H...,FIFTY,,"Honey Good, ContributorFounder of HoneyGood.com",2016-12-26
47711,https://www.huffingtonpost.com/entry/a-no-bull...,A No Bullsh-t Holiday Letter,FIFTY,,"Iris Ruth Pastor, ContributorSlice-of-life col...",2016-12-25
48303,https://www.huffingtonpost.com/entry/vocabular...,How Our Vocabulary Gives Away Our Age,FIFTY,We may look much younger than we really are. W...,"Delfín Carbonell, ContributorPh.D. in Philolog...",2016-12-18
48377,https://www.huffingtonpost.com/entry/living-ab...,The Truth About Retiring And Living Abroad Wit...,FIFTY,If you're hesitating making an international m...,"Kathleen Peddicord, ContributorPublisher, Live...",2016-12-17
48389,https://www.huffingtonpost.com/entry/joy-alzhe...,Bringing Joy To A Loved One With Alzheimer's,FIFTY,The concert would either bring Ed great joy or...,"Marie Marley, Contributoraward-winning author,...",2016-12-17
48966,https://www.huffingtonpost.com/entry/age-frien...,Age Friendliness on Our Minds,FIFTY,,"Marian L. Knapp, ContributorAuthor, columnist,...",2016-12-10
50141,https://www.huffingtonpost.com/entry/sexually-...,How To Become A Sexually Empowered Woman,FIFTY,There is a 'Divine Feminine Goddess' movement ...,"Pamela Madsen, ContributorSexuality and Relati...",2016-11-27
50142,https://www.huffingtonpost.com/entry/growing-u...,5 Things We Did As Children In The 1970s That ...,FIFTY,II grew up in the 70s and it was great. I had ...,"James Baxley, ContributorFreelance Writer, She...",2016-11-27


This is a category for the 50+ community, focused more on lifestyle than anything else. There also seem to be a lot of missing descriptions here, which we will have to tackle.

## Combine some similar categories

In [23]:
# Define function for combining categories
def combine_categories(category1, category2):
    """Relabel every row of `data` whose category is category2 as category1."""
    # A boolean mask selects the rows to relabel; the global `data` is modified in place
    data.loc[data['category'] == category2, 'category'] = category1
    return data

In [24]:
# Style + Style & Beauty
data = combine_categories('STYLE & BEAUTY', 'STYLE')

In [25]:
# Green + Environment
data = combine_categories('ENVIRONMENT', 'GREEN')

In [26]:
# Arts & Culture
data = combine_categories('ARTS & CULTURE', 'CULTURE & ARTS')
data = combine_categories('ARTS & CULTURE', 'ARTS')

In [27]:
# Check categories
data['category'].value_counts()

category,count
POLITICS,35601
WELLNESS,17942
ENTERTAINMENT,17362
STYLE & BEAUTY,12065
TRAVEL,9900
PARENTING,8791
HEALTHY LIVING,6694
QUEER VOICES,6347
FOOD & DRINK,6340
BUSINESS,5992


We merged a few of the similar categories to consolidate the labels a little.
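
The same merges can also be written as a single lookup table, which keeps every relabelling decision in one place. This is a sketch of an alternative, not the method used above: `CATEGORY_MAP` and the toy frame `sample` are hypothetical names, and the mapping only repeats the four merges already applied.

```python
import pandas as pd

# Hypothetical mapping: old label -> consolidated label.
CATEGORY_MAP = {
    'STYLE': 'STYLE & BEAUTY',
    'GREEN': 'ENVIRONMENT',
    'CULTURE & ARTS': 'ARTS & CULTURE',
    'ARTS': 'ARTS & CULTURE',
}

# Toy stand-in for `data`.
sample = pd.DataFrame({'category': ['STYLE', 'ARTS', 'GREEN', 'POLITICS', 'CULTURE & ARTS']})

# Series.replace swaps values found among the dict keys and leaves all others untouched.
sample['category'] = sample['category'].replace(CATEGORY_MAP)
merged = sample['category'].tolist()
print(merged)
```

Adding a further merge later would then only need one more entry in the dictionary.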

## Review Missing data

In [28]:
# Review missing data
data.isna().sum()

Unnamed: 0,0
link,0
headline,0
category,0
short_description,0
authors,0
date,0


This shows no missing values, but we saw blank short descriptions earlier (for example in FIFTY). `isna()` only flags NaN/None values, not empty strings, so we'll check for empty strings instead.

In [29]:
#check if empty string in headlines & description
headline_missing = data[data['headline'] == '']
description_missing = data[data['short_description'] == '']
print(headline_missing.count())
print(description_missing.count())

link                 6
headline             6
category             6
short_description    6
authors              6
date                 6
dtype: int64
link                 19712
headline             19712
category             19712
short_description    19712
authors              19712
date                 19712
dtype: int64


Only six headlines are missing, so we could look those up and fill them in by hand. However, 19,712 descriptions (a little under 10% of rows) are empty, so it may be worth scraping those pages for the headline and description.
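
To see where the blank descriptions are concentrated, the empty-string check can be broken down by category. The sketch below uses a hypothetical toy frame `sample` in place of `data`. It also treats whitespace-only strings as blank, which is an assumption that goes slightly beyond the `== ''` test above.

```python
import pandas as pd

# Toy stand-in for `data`.
sample = pd.DataFrame({
    'category': ['FIFTY', 'FIFTY', 'TASTE', 'TASTE'],
    'short_description': ['', '   ', 'Some text', ''],
})

# Treat empty and whitespace-only strings as missing.
is_blank = sample['short_description'].str.strip().eq('')

# The mean of a boolean Series is the share of True values in each group.
blank_share = is_blank.groupby(sample['category']).mean().sort_values(ascending=False)
print(blank_share)
```

Categories with a high share would be the first candidates for scraping or for dropping the description feature.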

In [35]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import HTTPError, ConnectionError, RequestException

# Scraping function
def scrape_missing_data(url, index):
    """Scrape a page for its headline and description.

    Returns (headline, description); each is '' when nothing was found or the
    request failed. `index` is unused and kept so existing calls still work.
    """
    headline = ''
    description = ''
    try:
        # The 10-second timeout is an arbitrary choice that stops one dead host from stalling the loop
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        soup = BeautifulSoup(response.content, 'html.parser')

        # Attempt to find the headline
        headline_tag = soup.find('h1')
        if headline_tag:
            headline = headline_tag.text.strip()

        # Attempt to find description-like content (can be adjusted based on typical website structure)
        description_tags = soup.find_all(['p', 'div'], class_=['dek', 'excerpt', 'description', 'summary'])  # Common classes for descriptions
        if not description_tags:
            # Fall back to every <p> tag; this can also pick up banners and other page chrome
            description_tags = soup.find_all('p')

        description = ' '.join(tag.text.strip() for tag in description_tags)

    except HTTPError:
        print(f"HTTP error for {url}")
    except ConnectionError:
        print(f"Connection error for {url}")
    except RequestException as e:
        # Timeouts, too many redirects, and other request failures
        print(f"Request failed for {url}: {e}")
    except Exception as e:
        # Parsing problems: report them rather than failing silently
        print(f"Unexpected error for {url}: {e}")

    return headline, description

In [31]:
#combine headline & description missing
total_missing = pd.concat([headline_missing, description_missing])
total_missing.head()

Unnamed: 0,link,headline,category,short_description,authors,date
90944,https://www.huffingtonpost.com/entry/lincoln-2...,,POLITICS,,"Robert Moran, ContributorRobert Moran leads Br...",2015-08-22
95567,https://www.huffingtonpost.com/entry/post_9671...,,RELIGION,Let everyone not wrapped in tired and disprove...,"Matthew Fox, ContributorRadical theologian Mat...",2015-06-30
103675,https://www.huffingtonpost.com/entry/us-and-eu...,,WORLDPOST,,"Natasha Srdoc, ContributorAuthor, Economist, C...",2015-03-29
109100,https://www.huffingtonpost.com/entry/disney-ce...,,BUSINESS,,"Gary Snyder, ContributorWriter and Media Strat...",2015-01-25
110153,https://www.huffingtonpost.com/entry/beverly-h...,,MEDIA,,"Gary Snyder, ContributorWriter and Media Strat...",2015-01-13


In [36]:
from requests.exceptions import HTTPError, ConnectionError, Timeout, TooManyRedirects

# Scrape missing headlines/descriptions and write them back into total_missing.
# scrape_missing_data handles request errors itself and returns '' on failure,
# so the assignments below need no try/except of their own.
for index, row in total_missing.iterrows():
    url = row['link']
    headline, description = scrape_missing_data(url, index)
    if description:
        total_missing.loc[index, 'short_description'] = description
    if headline:
        total_missing.loc[index, 'headline'] = headline

total_missing.head()

HTTP error for https://www.huffingtonpost.com/entry/smoke-and-mirrors-behind-_b_13919130.html
HTTP error for https://www.huffingtonpost.com/entry/an-empty-free-library-nor_b_13851526.html
HTTP error for https://www.huffingtonpost.com/entry/overdosing-on-rape-cultur_b_12594064.html
HTTP error for https://www.huffingtonpost.com/entry/billy-bush-to-donate-10-m_b_12508684.html
HTTP error for https://www.huffingtonpost.com/entry/best-of-new-york-fashion_b_11967658.html
HTTP error for https://www.huffingtonpost.com/entry/will-heaven-be-boring_b_10663512.html
HTTP error for https://www.huffingtonpost.com/entry/the-devil-is-in-the-detai_b_10674622.html
HTTP error for https://www.huffingtonpost.com/entry/jeremy-scott-gallery-open_b_10198624.html
Connection error for https://www.huffingtonpost.comhttp://www.vocativ.com/322545/the-worlds-greatest-crying-jordan-artist/
HTTP error for https://www.huffingtonpost.com/entry/vacation-jason-of-the-chr_b_9708430.html
Connection error for https://www.huff

Unnamed: 0,link,headline,category,short_description,authors,date
90944,https://www.huffingtonpost.com/entry/lincoln-2...,Lincoln 2.0?,POLITICS,Could we ever vote for an Abraham Lincoln 2.0 ...,"Robert Moran, ContributorRobert Moran leads Br...",2015-08-22
95567,https://www.huffingtonpost.com/entry/post_9671...,,RELIGION,Already contributed? Log in to hide these mess...,"Matthew Fox, ContributorRadical theologian Mat...",2015-06-30
103675,https://www.huffingtonpost.com/entry/us-and-eu...,,WORLDPOST,Already contributed? Log in to hide these mess...,"Natasha Srdoc, ContributorAuthor, Economist, C...",2015-03-29
109100,https://www.huffingtonpost.com/entry/disney-ce...,,BUSINESS,Already contributed? Log in to hide these mess...,"Gary Snyder, ContributorWriter and Media Strat...",2015-01-25
110153,https://www.huffingtonpost.com/entry/beverly-h...,,MEDIA,Already contributed? Log in to hide these mess...,"Gary Snyder, ContributorWriter and Media Strat...",2015-01-13
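
The connection errors in the log above come from links where the HuffPost host is glued in front of an external URL, such as `https://www.huffingtonpost.comhttp://www.vocativ.com/...`. A possible repair is sketched below. `strip_doubled_prefix` is a hypothetical helper, and it assumes the embedded external URL is the intended target. Whether those external pages are still reachable is a separate question.

```python
PREFIX = 'https://www.huffingtonpost.com'

def strip_doubled_prefix(url, prefix=PREFIX):
    """Drop `prefix` when it is immediately followed by another http(s) URL."""
    rest = url[len(prefix):]
    if url.startswith(prefix) and rest.startswith(('http://', 'https://')):
        return rest
    return url

# One malformed link from the log and one well-formed link from the table.
broken = 'https://www.huffingtonpost.comhttp://www.vocativ.com/322545/the-worlds-greatest-crying-jordan-artist/'
normal = 'https://www.huffingtonpost.com/entry/lincoln-20_b_8023742.html'

print(strip_doubled_prefix(broken))
print(strip_doubled_prefix(normal))
```

Applied to the frame, this would be `total_missing['link'].map(strip_doubled_prefix)` before the scraping loop runs.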


In [37]:
# Check number of missing
total_missing[total_missing['short_description'] == ''].count()

Unnamed: 0,0
link,344
headline,344
category,344
short_description,344
authors,344
date,344


In [38]:
#view rows with headline/description scraped
pd.set_option('display.max_colwidth', None)
total_missing.head(25)

Unnamed: 0,link,headline,category,short_description,authors,date
90944,https://www.huffingtonpost.com/entry/lincoln-20_b_8023742.html,Lincoln 2.0?,POLITICS,Could we ever vote for an Abraham Lincoln 2.0 - an artificial intelligence built to replicate the decision making style of one of our most accomplished Presidents?,"Robert Moran, ContributorRobert Moran leads Brunswick Insight, and writes and speaks on...",2015-08-22
95567,https://www.huffingtonpost.com/entry/post_9671_b_7683632.html,,RELIGION,"Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. For two decades, HuffPost has reported on the stories that matter — from the personal to the political — and stood strong through shifting administrations, economic upheavals, and cultural reckonings. In an era where the media is pressured to bow to power, hard-won LGBTQ+ rights are being rolled back, immigration policies grow harsher, and economic policies like tariffs reshape everyday lives, our mission is more urgent than ever.Support journalism that speaks truth to power. We can't do it without you. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. Now as we continue, we need your help more than ever. We hope you will join us once again. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. 
Now as we continue, we need your help more than ever. We hope you will join us once again. Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how.","Matthew Fox, ContributorRadical theologian Matthew Fox is the author of more than 30 b...",2015-06-30
103675,https://www.huffingtonpost.com/entry/us-and-europes-economic-a_b_6962262.html,,WORLDPOST,"Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. For two decades, HuffPost has reported on the stories that matter — from the personal to the political — and stood strong through shifting administrations, economic upheavals, and cultural reckonings. In an era where the media is pressured to bow to power, hard-won LGBTQ+ rights are being rolled back, immigration policies grow harsher, and economic policies like tariffs reshape everyday lives, our mission is more urgent than ever.Support journalism that speaks truth to power. We can't do it without you. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. Now as we continue, we need your help more than ever. We hope you will join us once again. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. 
Now as we continue, we need your help more than ever. We hope you will join us once again. Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how.","Natasha Srdoc, ContributorAuthor, Economist, Co-Founder, Adriatic Institute and Internat...",2015-03-29
109100,https://www.huffingtonpost.com/entry/disney-ceo-iger-readies-m_b_6520290.html,,BUSINESS,"Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. For two decades, HuffPost has reported on the stories that matter — from the personal to the political — and stood strong through shifting administrations, economic upheavals, and cultural reckonings. In an era where the media is pressured to bow to power, hard-won LGBTQ+ rights are being rolled back, immigration policies grow harsher, and economic policies like tariffs reshape everyday lives, our mission is more urgent than ever.Support journalism that speaks truth to power. We can't do it without you. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. Now as we continue, we need your help more than ever. We hope you will join us once again. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. 
Now as we continue, we need your help more than ever. We hope you will join us once again. Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how.","Gary Snyder, ContributorWriter and Media Strategist",2015-01-25
110153,https://www.huffingtonpost.com/entry/beverly-hills-hotel-caugh_b_6414708.html,,MEDIA,"Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. For two decades, HuffPost has reported on the stories that matter — from the personal to the political — and stood strong through shifting administrations, economic upheavals, and cultural reckonings. In an era where the media is pressured to bow to power, hard-won LGBTQ+ rights are being rolled back, immigration policies grow harsher, and economic policies like tariffs reshape everyday lives, our mission is more urgent than ever.Support journalism that speaks truth to power. We can't do it without you. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. Now as we continue, we need your help more than ever. We hope you will join us once again. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. 
Now as we continue, we need your help more than ever. We hope you will join us once again. Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how.","Gary Snyder, ContributorWriter and Media Strategist",2015-01-13
122145,https://www.huffingtonpost.com/entry/beverly-hills-hotel-boyco_b_5711931.html,,QUEER VOICES,"Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. For two decades, HuffPost has reported on the stories that matter — from the personal to the political — and stood strong through shifting administrations, economic upheavals, and cultural reckonings. In an era where the media is pressured to bow to power, hard-won LGBTQ+ rights are being rolled back, immigration policies grow harsher, and economic policies like tariffs reshape everyday lives, our mission is more urgent than ever.Support journalism that speaks truth to power. We can't do it without you. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. Now as we continue, we need your help more than ever. We hope you will join us once again. We remain committed to providing you with the unflinching, fact-based journalism everyone deserves.Thank you again for your support along the way. We’re truly grateful for readers like you! Your initial support helped get us here and bolstered our newsroom, which kept us strong during uncertain times. 
Now as we continue, we need your help more than ever. We hope you will join us once again. Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how. Big money interests are running the government — and influencing the news you read. While other outlets are retreating behind paywalls and bending the knee to political pressure, HuffPost is proud to be unbought and unfiltered. Will you help us keep it that way? Already contributed? Log in to hide these messages. Do you have info to share with HuffPost reporters? Here’s how.","Gary Snyder, ContributorWriter and Media Strategist",2014-08-28
20773,https://www.huffingtonpost.com/entry/the-big-blue-wave_us_5a050838e4b0ee8ec3694054,The Big Blue Wave,COMEDY,The Big Blue Wave,"Shan Wells, ContributorSciency sun venerator + cartoons",2017-11-10
21523,https://www.huffingtonpost.com/entry/inside-rukban-camp-one-of-syrias-most-desperate-settlements_us_59f87fd8e4b0aec1467ace64,"Inside Rukban Camp, One Of Syria’s Most Desperate Settlements",WORLD NEWS,"Inside Rukban Camp, One Of Syria’s Most Desperate Settlements","Yasser Allawi, Syria Deeply",2017-10-31
22793,https://www.huffingtonpost.com/entry/syrian-refugees-return-from-lebanon-only-to-flee-war-yet-again_us_59e263d6e4b03a7be580fff6,Syrian Refugees Return From Lebanon Only To Flee War Yet Again,WORLD NEWS,Syrian Refugees Return From Lebanon Only To Flee War Yet Again,"Abby Sewell, Refugees Deeply",2017-10-14
32223,https://www.huffingtonpost.com/entry/your-guide-to-the-best-bbq-in-st-louis_us_594806e8e4b0cddbb0088011,Your Guide To The Best BBQ In St. Louis According to Zagat,TASTE,"Your Guide To The Best BBQ In St. Louis According to Zagat Tourist or local, this place has great BBQ",,2017-06-19


In [40]:
#remove rows whose urls redirected to the home page
#the scraped text for those rows is the site's donation banner, so match on its opening sentence
banner_prefix = "Already contributed? Log in to hide these messages."
total_missing = total_missing[~total_missing['short_description'].str.startswith(banner_prefix, na=False)]

In [41]:
total_missing.head(25)

Unnamed: 0,link,headline,category,short_description,authors,date
90944,https://www.huffingtonpost.com/entry/lincoln-20_b_8023742.html,Lincoln 2.0?,POLITICS,Could we ever vote for an Abraham Lincoln 2.0 - an artificial intelligence built to replicate the decision making style of one of our most accomplished Presidents?,"Robert Moran, ContributorRobert Moran leads Brunswick Insight, and writes and speaks on...",2015-08-22
20773,https://www.huffingtonpost.com/entry/the-big-blue-wave_us_5a050838e4b0ee8ec3694054,The Big Blue Wave,COMEDY,The Big Blue Wave,"Shan Wells, ContributorSciency sun venerator + cartoons",2017-11-10
21523,https://www.huffingtonpost.com/entry/inside-rukban-camp-one-of-syrias-most-desperate-settlements_us_59f87fd8e4b0aec1467ace64,"Inside Rukban Camp, One Of Syria’s Most Desperate Settlements",WORLD NEWS,"Inside Rukban Camp, One Of Syria’s Most Desperate Settlements","Yasser Allawi, Syria Deeply",2017-10-31
22793,https://www.huffingtonpost.com/entry/syrian-refugees-return-from-lebanon-only-to-flee-war-yet-again_us_59e263d6e4b03a7be580fff6,Syrian Refugees Return From Lebanon Only To Flee War Yet Again,WORLD NEWS,Syrian Refugees Return From Lebanon Only To Flee War Yet Again,"Abby Sewell, Refugees Deeply",2017-10-14
32223,https://www.huffingtonpost.com/entry/your-guide-to-the-best-bbq-in-st-louis_us_594806e8e4b0cddbb0088011,Your Guide To The Best BBQ In St. Louis According to Zagat,TASTE,"Your Guide To The Best BBQ In St. Louis According to Zagat Tourist or local, this place has great BBQ",,2017-06-19
34488,https://www.huffingtonpost.com/entry/the-bechdel-test_us_5925be6be4b0dfb1ca3a106a,The Bechdel Test,COMEDY,The Bechdel Test,"Hilary Fitzgerald Campbell, ContributorHilary's cartoons have appeared in The New Yorker and other fu...",2017-05-24
35584,https://www.huffingtonpost.com/entry/how-to-add-some-activism-to-your-mothers-day-traditions_us_59147334e4b030d4f1f059d0,How To Add Some Activism To Your Mother's Day Traditions,WOMEN,Brought to you by the Women's March.,Jenavieve Hatch,2017-05-11
35692,https://www.huffingtonpost.com/entry/how-to-add-some-activism-to-your-mothers-day-traditions_us_59131c5de4b0a58297e13ded,How To Add Some Activism To Your Mother's Day Traditions,WOMEN,Brought to you by the Women's March.,Jenavieve Hatch,2017-05-10
40313,https://www.huffingtonpost.com/entry/fired-us-attorney-preet-bharara-said-to-have-been-investigating-hhs-secretary-tom-price_us_58cc2b1ee4b0be71dcf4a17b,Fired U.S. Attorney Preet Bharara Said to Have Been Investigating HHS Secretary Tom Price,POLITICS,REPORT: Preet Was Investigating Trump's Health Secretary At Time Of Firing,"Robert Faturechi, ProPublica",2017-03-17
43295,https://www.huffingtonpost.com/entry/love-needs-to-be-remember_b_14696322.html,"Love Needs To Be Remembered, Restored and Renewed",HEALTHY LIVING,"Love Needs To Be Remembered, Restored and Renewed","Ed and Deb Shapiro, ContributorMindfulness, Meditation, Yoga experts; Bestselling Authors: YO...",2017-02-12


In [42]:
#check where description and headline are exact match
total_missing[total_missing['short_description'] == total_missing['headline']].count()

Unnamed: 0,0
link,10451
headline,10451
category,10451
short_description,10451
authors,10451
date,10451


In [44]:
# drop where exact match
total_missing = total_missing[total_missing['short_description'] != total_missing['headline']]
total_missing.head(25)

Unnamed: 0,link,headline,category,short_description,authors,date
90944,https://www.huffingtonpost.com/entry/lincoln-20_b_8023742.html,Lincoln 2.0?,POLITICS,Could we ever vote for an Abraham Lincoln 2.0 - an artificial intelligence built to replicate the decision making style of one of our most accomplished Presidents?,"Robert Moran, ContributorRobert Moran leads Brunswick Insight, and writes and speaks on...",2015-08-22
32223,https://www.huffingtonpost.com/entry/your-guide-to-the-best-bbq-in-st-louis_us_594806e8e4b0cddbb0088011,Your Guide To The Best BBQ In St. Louis According to Zagat,TASTE,"Your Guide To The Best BBQ In St. Louis According to Zagat Tourist or local, this place has great BBQ",,2017-06-19
35584,https://www.huffingtonpost.com/entry/how-to-add-some-activism-to-your-mothers-day-traditions_us_59147334e4b030d4f1f059d0,How To Add Some Activism To Your Mother's Day Traditions,WOMEN,Brought to you by the Women's March.,Jenavieve Hatch,2017-05-11
35692,https://www.huffingtonpost.com/entry/how-to-add-some-activism-to-your-mothers-day-traditions_us_59131c5de4b0a58297e13ded,How To Add Some Activism To Your Mother's Day Traditions,WOMEN,Brought to you by the Women's March.,Jenavieve Hatch,2017-05-10
40313,https://www.huffingtonpost.com/entry/fired-us-attorney-preet-bharara-said-to-have-been-investigating-hhs-secretary-tom-price_us_58cc2b1ee4b0be71dcf4a17b,Fired U.S. Attorney Preet Bharara Said to Have Been Investigating HHS Secretary Tom Price,POLITICS,REPORT: Preet Was Investigating Trump's Health Secretary At Time Of Firing,"Robert Faturechi, ProPublica",2017-03-17
43951,https://www.huffingtonpost.com/entry/evolution-weekend-now-mor_b_14629804.html,Evolution Weekend: Now More Than Ever!,RELIGION,Evolution Weekend: Now More Than Ever!,"Michael Zimmerman, Ph.D., ContributorFounder, The Clergy Letter Project",2017-02-05
44676,https://www.huffingtonpost.com/entry/expanding-the-emergency-r_b_14444940.html,Expanding the Emergency Room Model: <br>'Central Care System' Could Help Americans <br>Gain Universal Health Care Access,HEALTHY LIVING,Expanding the Emergency Room Model:'Central Care System' Could Help AmericansGain Universal Health Care Access,"Dr. Sudip Bose, Contributor★Emergency Physician ★Iraq War Veteran ★CNN Hero for treating ...",2017-01-28
46508,https://www.huffingtonpost.com/entry/the-quick-search-you-shou_b_13679592.html,The Quick Search You Should Do in Your Rental Property Due Diligence,EDUCATION,The Quick Search You Should Do in Your Rental Property Due Diligence,"Dean Graziosi, ContributorNew York Times Best Selling Author",2017-01-08
46526,https://www.huffingtonpost.com/entry/stage-door-ute-lempers-so_b_14041358.html,"Stage Door: <i>Ute Lemper's Songs From The Broken Heart, Confucius</i>",ARTS & CULTURE,Stage Door:,"Fern Siegel, ContributorDeputy Editor, MediaPost",2017-01-08
46529,https://www.huffingtonpost.com/entry/goals-2017-tax-planning_b_14039352.html,Goals 2017: Tax Planning & Timing,BUSINESS,Goals 2017: Tax Planning & Timing,"Mark Steber, ContributorChief Tax Officer, Jackson Hewitt Tax Service Inc.",2017-01-08


It seems most of these rows are missing a real description and simply reuse the headline as the description. It will be best to drop the rows without descriptions for our model.
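The exact-match comparison used earlier can miss rows where the two fields differ only by stray whitespace or leftover HTML tags such as `<i>`. A minimal sketch of a normalized comparison follows. The helper `normalize_text` and the sample rows are hypothetical and only illustrate the idea; this is not a step from the original notebook, and it will not catch every near-duplicate.

```python
import re
import pandas as pd

def normalize_text(text: str) -> str:
    """Hypothetical helper: strip HTML tags, collapse whitespace, lowercase."""
    no_tags = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", no_tags).strip().lower()

# Made-up rows for illustration only
sample = pd.DataFrame({
    "headline": [
        "Goals 2017: Tax Planning & Timing ",
        "Stage <i>Door</i>",
        "Lincoln 2.0?",
    ],
    "short_description": [
        "Goals 2017: Tax Planning & Timing",
        "Stage Door",
        "Could we ever vote for an AI?",
    ],
})

# True where the description is just the headline after normalization
same_text = sample["headline"].map(normalize_text) == sample["short_description"].map(normalize_text)

n_matches = int(same_text.sum())   # number of headline-only descriptions
kept = sample[~same_text]          # rows with a distinct description
```

Summing the boolean mask gives the match count directly, and negating it keeps only rows whose description adds text beyond the headline.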

In [45]:
#drop rows with missing descriptions from original df
#copy so that adding a column later does not trigger a SettingWithCopyWarning
data = data[data['short_description'] != ''].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 189802 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   link               189802 non-null  object
 1   headline           189802 non-null  object
 2   category           189802 non-null  object
 3   short_description  189802 non-null  object
 4   authors            189802 non-null  object
 5   date               189802 non-null  object
dtypes: object(6)
memory usage: 10.1+ MB


## Combine headline & description

Combining the two gives a single text column to work with in the NLP steps.

In [46]:
#combine then drop individual columns
data['description'] = data['headline'] + ' ' + data['short_description']
data.drop(columns=['headline', 'short_description'], inplace=True)

data.head()

Unnamed: 0,link,category,authors,date,description
0,https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9,U.S. NEWS,"Carla K. Johnson, AP",2022-09-23,Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.
1,https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe,U.S. NEWS,Mary Papenfuss,2022-09-23,"American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles."
2,https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02,COMEDY,Elyse Wanshel,2022-09-23,"23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) ""Until you have a dog you don't understand what could be eaten."""
3,https://www.huffpost.com/entry/funniest-parenting-tweets_l_632d7d15e4b0d12b5403e479,PARENTING,Caroline Bologna,2022-09-23,"The Funniest Tweets From Parents This Week (Sept. 17-23) ""Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce."""
4,https://www.huffpost.com/entry/amy-cooper-loses-discrimination-lawsuit-franklin-templeton_n_632c6463e4b09d8701bd227e,U.S. NEWS,Nina Golgowski,2022-09-22,Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Employer Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral.
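The same combination can be written without in-place mutation. The sketch below uses made-up rows and assumes both columns hold strings. The `.str.strip()` calls are an addition that is not in the original notebook; they are there to avoid doubled spaces at the join.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "headline": ["Headline A", "Headline B "],
    "short_description": ["First description.", " Second description."],
    "category": ["COMEDY", "POLITICS"],
})

# Build the combined column, then drop the two source columns in one chain
combined = (
    sample
    .assign(
        description=sample["headline"].str.strip()
        + " "
        + sample["short_description"].str.strip()
    )
    .drop(columns=["headline", "short_description"])
)
```

Chaining `assign` and `drop` returns a new frame and leaves `sample` unchanged, which makes the cell safe to re-run.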


## Export Data

In [47]:
#export cleaned df
data.to_csv('news_data.csv', index=False)
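A quick sanity check after exporting is to read the file back and confirm that the text survived the round trip, since descriptions contain commas and quotation marks. The sketch below uses made-up rows and an in-memory buffer as a stand-in for `news_data.csv`.

```python
import io
import pandas as pd

# Made-up rows containing characters that CSV writers must quote
sample = pd.DataFrame({
    "category": ["COMEDY", "POLITICS"],
    "description": ["A headline, with a comma", 'A "quoted" description'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read it back and compare against what was written
buffer.seek(0)
reloaded = pd.read_csv(buffer)

same_shape = reloaded.shape == sample.shape
same_text = reloaded["description"].tolist() == sample["description"].tolist()
```

With the real file, the same comparison would use `pd.read_csv('news_data.csv')` in place of the buffer.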