# Preparing Collected Data

In this notebook we'll go through the process of preparing the collected article data. The steps outlined here are meant to be reproducible in production. The main goal is to normalize the data so that it can be used easily for exploration and modeling.

## Imports

These are all the modules that we'll need to run the code in this notebook.

In [1]:
import numpy as np
import pandas as pd

## Pulling the Training Data

A set of article data has been set aside for data analysis and model training. Here we'll load that data and take a look at which features we're working with.

In [8]:
df = pd.read_csv('../data/articles-with-topic-label.csv')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         250 non-null    object 
 1   publication_id  250 non-null    object 
 2   title           250 non-null    object 
 3   subtitle        248 non-null    object 
 4   date            250 non-null    object 
 5   word_count      250 non-null    float64
 6   read_time       250 non-null    float64
 7   url             250 non-null    object 
 8   tags            250 non-null    object 
 9   topics          250 non-null    object 
 10  lang            250 non-null    object 
 11  author          244 non-null    object 
 12  publication     245 non-null    object 
dtypes: float64(2), object(11)
memory usage: 25.5+ KB
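
As a complement to `df.info()`, a per-column null count makes it easier to see which columns contain missing values. The sketch below runs on a small made-up frame: the column names mirror the article data, but the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the article data.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "subtitle": ["s1", np.nan, "s3"],
    "author": ["x", "y", np.nan],
    "publication": [np.nan, "p2", "p3"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Running the same two lines against `df` should list only the subtitle, author, and publication columns, which matches the non-null counts shown above.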


## Removing Null Values

The first thing that needs to be done is removing null observations that won't provide any value to us. Rows with a missing value in the author or publication column are of no use since these articles are not publicly visible. Details about this issue can be found in the building_labeled_data.ipynb notebook. Here we'll simply remove these rows. Null values in the subtitle column are not a problem since some articles simply have no subtitle, but we'll replace them with empty strings instead of np.nan.

In [12]:
# Fill nulls in the subtitle column with an empty string.
df['subtitle'] = df['subtitle'].fillna('')

In [13]:
# Remove all remaining rows with missing values (only author and publication still contain nulls).
df = df.dropna()

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 0 to 249
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         241 non-null    object 
 1   publication_id  241 non-null    object 
 2   title           241 non-null    object 
 3   subtitle        241 non-null    object 
 4   date            241 non-null    object 
 5   word_count      241 non-null    float64
 6   read_time       241 non-null    float64
 7   url             241 non-null    object 
 8   tags            241 non-null    object 
 9   topics          241 non-null    object 
 10  lang            241 non-null    object 
 11  author          241 non-null    object 
 12  publication     241 non-null    object 
dtypes: float64(2), object(11)
memory usage: 26.4+ KB
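
Since these steps are meant to be repeated in production, it may help to bundle them into a single function. The following is a minimal sketch rather than the notebook's actual code: `prepare_articles` is a hypothetical name, and it restricts `dropna` to the author and publication columns, which should select the same rows as the plain `df.dropna()` above as long as no other column contains nulls. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd


def prepare_articles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that bundles the cleaning steps from this notebook."""
    out = df.copy()  # leave the caller's frame untouched
    # Missing subtitles are normal, so replace them with empty strings.
    out["subtitle"] = out["subtitle"].fillna("")
    # Articles without an author or publication are not publicly visible; drop them.
    out = out.dropna(subset=["author", "publication"])
    return out


# Toy frame with made-up values to exercise each case.
toy = pd.DataFrame({
    "title": ["t0", "t1", "t2", "t3"],
    "subtitle": ["s0", np.nan, "s2", "s3"],
    "author": ["a0", "a1", np.nan, "a3"],
    "publication": ["p0", "p1", "p2", np.nan],
})

clean = prepare_articles(toy)
print(clean)
```

The row with only a missing subtitle is kept (with an empty string), while the rows missing an author or a publication are dropped. The original index labels are preserved, as in the `Int64Index` output above; calling `reset_index(drop=True)` afterwards would renumber the rows if that is preferred.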
