# Clean Parsed CSV Datasets

For this task, I parsed the raw dataset from an mbox file into a CSV file called 2000_news_headlines.csv, which we can use for analysis.

__Objectives__

The primary purpose of this notebook is to clean the news_headlines dataset. The result is a cleaned dataset containing all the data needed for the research question analysis.

In [228]:
import numpy as np
import pandas as pd

In [229]:
# The directory used to store the raw datasets.
datasets_dir = '../../FYP/Raw_dataset/'
# The directory where the cleaned dataset produced by this notebook is stored.
headlines_dir = '../../FYP/Clean_dataset/'

In [230]:
df = pd.read_csv(datasets_dir+'1000_news_headlines.csv')

In [231]:
df_clean = df.copy()
df_clean.sort_values(by='Date', inplace=True)
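Sorting on `Date` works here because the values are ISO-formatted strings, which sort chronologically. As a minimal sketch (the sample rows below are hypothetical, not taken from the dataset), the column can also be converted to real datetimes first, which makes malformed dates visible before sorting:

```python
import pandas as pd

# Hypothetical sample mimicking the Date/Headlines columns of the raw CSV.
sample = pd.DataFrame({
    'Date': ['2022-08-27', '2018-03-17', '2019-12-14'],
    'Headlines': ['c', 'a', 'b'],
})

# Convert the strings to datetimes; unparseable values become NaT
# instead of raising, so they can be inspected afterwards.
sample['Date'] = pd.to_datetime(sample['Date'], errors='coerce')
bad_dates = sample['Date'].isna().sum()

# A stable sort keeps rows that share a date in their original order.
sample_sorted = sample.sort_values(by='Date', kind='stable')
```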

In [232]:
df_clean

Unnamed: 0,Date,Headlines,Author,Tag Text
592,2018-03-17,Ritual killing? Outrage in Kakamega as missing...,SDE Entertainment News,"Occurrences of ritual killings in Kakamega, wh..."
583,2018-03-17,Meal-ordering app Ritual exposes government em...,The Verge,A couple months after Strava unintentionally e...
584,2018-03-17,"In Spanish Basque Country, Sampling Cider and ...",New York Times,No one really tells you what to do when you fi...
585,2018-03-17,Perspectives | Scapegoating Becomes a Pre-Elec...,EurasiaNet,Perspectives | Scapegoating Becomes a Pre-Elec...
586,2018-03-17,Ready for the new moon? Try this guided ritual...,Well+Good,Mindfulness rockstar Kelly Morris is here to l...
...,...,...,...,...
1636,2022-08-27,"9 New Moon Rituals For Intention Setting, Mani...",Experts - MindBodyGreen,MindBodyGreenNew moons are an excellent time t...
1645,2022-08-27,The Importance of Fire Ritual | Burning Man Jo...,Burning Man Journal,The Importance of Fire Ritual · Extracting a f...
1644,2022-08-27,Pune: Woman Made To Bathe In Public As Per Rit...,In Laws,Outlook IndiaA woman in Maharashtra's Pune has...
1642,2022-08-27,Cult Of The Lamb: The Best Rituals (& When To ...,Game Rant,The Lamb gets one free Ritual when they first ...


#### A few rows contain the tag text in the Author column and a URL in the Tag Text column because the author was missing in the raw dataset, so we need to handle them.

In [233]:
# Filter rows where 'Tag Text' starts with a '<https://www' URL.
mask = df_clean['Tag Text'].str.startswith('<https://www')
df_clean.loc[mask]

Unnamed: 0,Date,Headlines,Author,Tag Text
1171,2018-08-25,Ritual,"White Lies - Ritual. Buy Vinyl, CD.",<https://www.google.com/url?rct=j&sa=t&url=htt...
1927,2018-11-03,Funerary Cheek-Piercing Ritual,Funerary Cheek-Piercing Ritual (Nayarit #1997....,<https://www.google.com/url?rct=j&sa=t&url=htt...
984,2019-03-16,Living Ritual,"Rituals that heal us, feed us, change us, and ...",<https://www.google.com/url?rct=j&sa=t&url=htt...
1114,2019-06-01,Ritual,"Ritual, an album by Tiësto, Jonas Blue, Rita O...",<https://www.google.com/url?rct=j&sa=t&url=htt...
1395,2019-07-13,SERaT--Database of Ritual Scenes,,<https://www.google.com/url?rct=j&sa=t&url=htt...
622,2019-12-14,Try a shut-down ritual,Ending your day well can set up a great day to...,<https://www.google.com/url?rct=j&sa=t&url=htt...


In [234]:
df_clean.loc[mask, 'Tag Text'] = df_clean.loc[mask, 'Author']

In [235]:
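# Rows whose 'Author' was also missing now have an empty 'Tag Text'; drop them.
df_clean = df_clean.dropna(subset=['Tag Text'])
# Confirm that no 'Tag Text' value starts with a URL any more.
mask = df_clean['Tag Text'].str.startswith('<https://www')
mask.sum()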

0
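Note that the cells above copy the description into `Tag Text` but leave the same text in `Author`, so those rows are not later treated as having a missing author. A minimal sketch of one way to handle this, assuming the shifted rows should end up with an empty `Author` (the `demo` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has its description shifted into 'Author'.
demo = pd.DataFrame({
    'Headlines': ['Normal headline', 'Ritual'],
    'Author': ['The Verge', 'White Lies - Ritual. Buy Vinyl, CD.'],
    'Tag Text': ['Some description', '<https://www.google.com/url?rct=j'],
})

# na=False treats missing 'Tag Text' values as non-matches.
shifted = demo['Tag Text'].str.startswith('<https://www', na=False)

# Move the description into 'Tag Text', then clear the stale 'Author' value.
demo.loc[shifted, 'Tag Text'] = demo.loc[shifted, 'Author']
demo.loc[shifted, 'Author'] = np.nan
```

With this variant, the cleared authors would be picked up by any later step that fills missing `Author` values.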

#### Handle duplicated values

In [236]:
# Count the rows that are exact duplicates of an earlier row.
df_clean.duplicated().sum()

4

In [237]:
df_clean[df_clean.duplicated()]

Unnamed: 0,Date,Headlines,Author,Tag Text
1416,2018-09-08,Ritual Sacrifice May Have Shaped Dog Domestica...,Discover Magazine (blog),"In the city of Salekhard, Russia, where it mee..."
890,2018-11-17,Five Maryland prep football players charged wi...,CBSSports.com,Five members of a Maryland high school footbal...
1076,2020-05-30,Communion ritual unchanged in Orthodox Church ...,,"One by one, the children and adults line up fo..."
900,2022-04-02,Bloodstained: Ritual of the Night - New Playab...,YouTube,YouTubePlay as Aurora from Child of Light. Blo...


In [238]:
# For each set of duplicated records, keep the first record and remove the rest.
df_clean = df_clean.drop_duplicates()
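By default, `duplicated()` flags only the later copies, so the listing above shows one row per duplicate group. A small sketch on hypothetical data shows how `keep=False` can be used to inspect every copy side by side before dropping:

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row.
demo = pd.DataFrame({
    'Date': ['2018-09-08', '2018-09-08', '2018-11-17'],
    'Headlines': ['Same story', 'Same story', 'Other story'],
    'Author': ['Blog A', 'Blog A', 'Site B'],
})

# keep=False marks every copy, not just the later ones.
all_copies = demo[demo.duplicated(keep=False)]

# The default keep='first' retains the earliest occurrence of each group.
deduped = demo.drop_duplicates()
```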

#### Handle missing values

In [239]:
df_clean

Unnamed: 0,Date,Headlines,Author,Tag Text
592,2018-03-17,Ritual killing? Outrage in Kakamega as missing...,SDE Entertainment News,"Occurrences of ritual killings in Kakamega, wh..."
583,2018-03-17,Meal-ordering app Ritual exposes government em...,The Verge,A couple months after Strava unintentionally e...
584,2018-03-17,"In Spanish Basque Country, Sampling Cider and ...",New York Times,No one really tells you what to do when you fi...
585,2018-03-17,Perspectives | Scapegoating Becomes a Pre-Elec...,EurasiaNet,Perspectives | Scapegoating Becomes a Pre-Elec...
586,2018-03-17,Ready for the new moon? Try this guided ritual...,Well+Good,Mindfulness rockstar Kelly Morris is here to l...
...,...,...,...,...
1636,2022-08-27,"9 New Moon Rituals For Intention Setting, Mani...",Experts - MindBodyGreen,MindBodyGreenNew moons are an excellent time t...
1645,2022-08-27,The Importance of Fire Ritual | Burning Man Jo...,Burning Man Journal,The Importance of Fire Ritual · Extracting a f...
1644,2022-08-27,Pune: Woman Made To Bathe In Public As Per Rit...,In Laws,Outlook IndiaA woman in Maharashtra's Pune has...
1642,2022-08-27,Cult Of The Lamb: The Best Rituals (& When To ...,Game Rant,The Lamb gets one free Ritual when they first ...


In [240]:
# Count the missing values in each column.
df_clean.isnull().sum()

Date           0
Headlines      0
Author       247
Tag Text       0
dtype: int64

In [241]:
# To avoid losing data from the other columns, I replace missing authors with the generic name "Unknown".
df_clean.loc[df_clean['Author'].isna(), 'Author'] = 'Unknown'

In [242]:
print('Writing the cleaned DataFrame to 2000_news_headlines.csv (%s rows, %s columns).' % df_clean.shape)
df_clean.to_csv(headlines_dir+'2000_news_headlines.csv', index=False)
df_clean.shape, df_clean.columns

Writing the cleaned DataFrame to 2000_news_headlines.csv (2200 rows, 4 columns).


((2200, 4), Index(['Date', 'Headlines', 'Author', 'Tag Text'], dtype='object'))

#### Create the dataframe with a new label column and save it. Using Excel, I will manually label a subset of headlines that represents the entire dataset as positive, negative, or neutral to create a "ground truth" dataset.

In [243]:
# Load the dataset
df = pd.read_csv(headlines_dir+'2000_news_headlines.csv')

# Add an empty column that will be filled in manually.
df['True_Label'] = np.nan

# Save the updated dataframe back to CSV
df.to_csv(headlines_dir+'2000_news_headlines_TrueGround.csv', index=False)
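The notebook does not say how the subset for manual labelling is chosen. One possible approach, sketched here purely as an assumption, is to draw the same fraction of headlines from every year so that each period is represented (the `demo` frame and the 50% fraction are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned headlines file.
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2018-03-17', '2018-08-25', '2019-03-16',
                            '2019-06-01', '2020-05-30', '2020-06-06']),
    'Headlines': [f'headline {i}' for i in range(6)],
})
demo['True_Label'] = np.nan

# Sample the same fraction from each year; random_state makes the
# draw repeatable across runs.
to_label = demo.groupby(demo['Date'].dt.year).sample(frac=0.5, random_state=0)
```

The sampled rows could then be exported with `to_csv` and labelled in Excel.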