# Data Collection

I have downloaded the `wikihowAll.csv` from the [dataset repo](https://github.com/mahnazkoupaee/WikiHow-Dataset) and will be cleaning and parsing it with the following code.

In [1]:
import numpy as np
import os
import pandas as pd

In [17]:
raw_data = pd.read_csv('wikihowAll.csv', delimiter=',')
raw_data.head()

Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."


In [18]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215365 entries, 0 to 215364
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   headline  214547 non-null  object
 1   title     215364 non-null  object
 2   text      214294 non-null  object
dtypes: object(3)
memory usage: 4.9+ MB


## Data Pre-processing
From the info above, it seems that some entries are missing the text field. Looking at the data description below, we can see that the top `text` value is in fact an empty string, and there seem to be some duplicate values for headlines and text bodies.

In [19]:
raw_data.describe()

Unnamed: 0,headline,title,text
count,214547,215364,214294
unique,214096,215364,209178
top,"\nAcquire a pot.,\nGather the ingredients need...",How to Be an Organized Artist1,",,"
freq,11,1,524


This is confirmed by looking at the null counts in the dataset.

In [29]:
raw_data.isnull().sum()

headline     818
title          1
text        1071
dtype: int64

We will now remove any rows that are missing an entry and drop rows that contain duplicate `text` values so that all entries are from unique articles.

In [33]:
df = raw_data.copy()
df.dropna(inplace=True)
df.isnull().sum()

headline    0
title       0
text        0
dtype: int64

In [35]:
df.drop_duplicates(subset=['text'],inplace=True)
df.describe()

Unnamed: 0,headline,title,text
count,209178,209178,209178
unique,208821,209178,209178
top,"\nAcquire a pot.,\nGather the ingredients need...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
freq,11,1,1


Next, we will preprocess the text as done in [WikiHow-Dataset/process.py](https://github.com/mahnazkoupaee/WikiHow-Dataset/blob/master/process.py) to:
1. remove short articles with long summaries
2. remove extra commas in abstracts (headlines)
3. remove extra commas in articles (text)

In [37]:
df = df[df['headline'].str.len() < 0.75*df['text'].str.len()]
df['headline'].str.replace(".,", ".")
df['text'].replace(to_replace=r"[.]+[\n]+[,]", value=r".\n", regex=True)


  df['headline'].str.replace(".,", ".")


Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."
...,...,...,...
215360,\nConsider changing the spelling of your name....,How to Pick a Stage Name3,"If you have a name that you like, you might f..."
215361,"\nTry out your name.,\nDon’t legally change yo...",How to Pick a Stage Name4,Your name might sound great to you when you s...
215362,"\nUnderstand the process of relief printing.,\...",How to Identify Prints1,Relief printing is the oldest and most tradit...
215363,\nUnderstand the process of intaglio printing....,How to Identify Prints2,"Intaglio is Italian for ""incis­ing,"" and corr..."


## Data Cleaning
Now, we will clean the text by getting rid of punctuation, stopwords, extraneous characters, and standardizing the case.

In [39]:
from bs4 import BeautifulSoup
from nltk.corpus import stopwords