# 0.1.1 - Preprocessing: Clean Features

**Overview**: This notebook is responsible for cleaning up the raw dataset.

**Actions**: This notebook performs the following actions:

- Rename fields to short, lowercase names without whitespace.
- Reorder fields in the dataset.
- Drop redundant or irrelevant fields from the dataset (here, `Crawl Timestamp`).

**Dependencies**: This notebook depends on the following artifact(s):

- `data/interim/ecommerce_data-cleaned-0.1.0.csv`

**Targets**: This notebook outputs one (1) artifact:

- `data/interim/ecommerce_data-cleaned-0.1.1.csv`

## Setup

The following cells import the libraries required for analysis in Python, set the working directory to the project root so that the project's `src/` modules can be imported, and enable autoreload so that changes to source files outside the notebook are picked up without restarting the kernel. Each step is optional; include it only if it is needed for development.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

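# Set project directory to project root.
# Assumes the notebook is run from a directory one level below the project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

# Import utilities.
import pandas as pd  # Used below as `pd`; imported explicitly rather than via the star imports.
from src.data import *
from src.features import *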

A:\Library\My Repositories\rit\2211_FALL\ISTE780\Project


## Load Data

In [2]:
# Read dataset into pandas dataframe.
input_filepath = get_interim_filepath("0.1.0", tag="cleaned")
input_filepath

WindowsPath('A:/Library/My Repositories/rit/2211_FALL/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.0.csv')
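
For readers without access to the project's `src/` package, the path lookup above can be approximated as follows. This is a minimal sketch, not the project's implementation: `interim_filepath_sketch` and its `root` parameter are hypothetical names, and the filename pattern is inferred from the artifact paths listed at the top of this notebook.

```python
from pathlib import Path


def interim_filepath_sketch(version: str, tag: str = "cleaned", root: Path = Path(".")) -> Path:
    """Hypothetical stand-in for get_interim_filepath (assumed behavior)."""
    # Pattern inferred from the artifact names: ecommerce_data-<tag>-<version>.csv
    return root / "data" / "interim" / f"ecommerce_data-{tag}-{version}.csv"


print(interim_filepath_sketch("0.1.0", tag="cleaned").as_posix())
```

Keeping the version and tag as separate arguments means each notebook in the pipeline only needs to state which version it reads and which it writes.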

In [3]:
df_input = pd.read_csv(
    input_filepath,
    index_col=0,
    parse_dates=["Crawl Timestamp"],
)
df_input.info()
df_input.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   Crawl Timestamp  29604 non-null  datetime64[ns, UTC]
 1   Product Name     29604 non-null  object             
 2   Description      29552 non-null  object             
 3   List Price       29604 non-null  float64            
 4   Sale Price       29604 non-null  float64            
 5   Brand            29045 non-null  object             
 6   Category         29588 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(2), object(4)
memory usage: 1.8+ MB


Unnamed: 0,Crawl Timestamp,Product Name,Description,List Price,Sale Price,Brand,Category
0,2019-12-18 10:20:52+00:00,"La Costena Chipotle Peppers, 7 OZ (Pack of 12)",We aim to show you accurate product informati...,31.93,31.93,La Costeï¿½ï¿½a,"Food | Meal Solutions, Grains & Pasta | Canned..."
1,2019-12-18 17:21:48+00:00,Equate Triamcinolone Acetonide Nasal Allergy S...,We aim to show you accurate product informati...,10.48,10.48,Equate,Health | Equate | Equate Allergy | Equate Sinu...
2,2019-12-18 17:46:41+00:00,AduroSmart ERIA Soft White Smart A19 Light Bul...,We aim to show you accurate product informati...,10.99,10.99,AduroSmart ERIA,Electronics | Smart Home | Smart Energy and Li...
3,2019-12-18 22:14:22+00:00,"24"" Classic Adjustable Balloon Fender Set Chro...",We aim to show you accurate product informati...,38.59,38.59,lowrider,Sports & Outdoors | Bikes | Bike Accessories |...
4,2019-12-18 06:56:02+00:00,Elephant Shape Silicone Drinkware Portable Sil...,We aim to show you accurate product informati...,5.81,5.81,Anself,Baby | Feeding | Sippy Cups: Alternatives to P...


## Fieldname Preprocessing

Feature names should be renamed so that they can be referenced consistently. Additionally, they should be ordered intuitively.

In [4]:
# Prepare map for field names.
fieldnames = {
    'Product Name': 'name',
    'Description': 'description',
    'List Price': 'price_raw',
    'Sale Price': 'discount_raw',
    'Brand': 'brand',
    'Category': 'category_raw',
}

# Rename columns; fields missing from the map (Crawl Timestamp) keep their names.
df_clean1 = clean_features.rename_columns(df_input, mapping=fieldnames)

Renamed columns to Index(['Crawl Timestamp', 'name', 'description', 'price_raw', 'discount_raw',
       'brand', 'category_raw'],
      dtype='object')
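
The renaming step above relies on a project helper whose source is not shown here. A minimal sketch of how such a helper could be written with plain pandas is given below; `rename_columns_sketch` is a hypothetical name, and the assumption is that the helper wraps `DataFrame.rename` and prints the resulting column index.

```python
import pandas as pd


def rename_columns_sketch(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Hypothetical stand-in for clean_features.rename_columns (assumed behavior)."""
    # DataFrame.rename returns a new frame; columns absent from the mapping keep their names.
    renamed = df.rename(columns=mapping)
    print(f"Renamed columns to {renamed.columns}")
    return renamed


# Small illustrative frame (made-up values, not taken from the dataset).
demo = pd.DataFrame(
    {"Crawl Timestamp": ["2019-12-18"], "Product Name": ["Widget"], "List Price": [9.99]}
)
demo_renamed = rename_columns_sketch(demo, {"Product Name": "name", "List Price": "price_raw"})
```

Because `rename` does not modify its input, the original frame stays available for comparison, which matches the notebook's pattern of keeping `df_input` and `df_clean1` as separate variables.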


In [5]:
# Define the column order; fields not listed here (Crawl Timestamp) are dropped.
features = [
    'brand',
    'name',
    'description',
    'category_raw',
    'price_raw',
    'discount_raw',
]
df_clean2 = clean_features.reorganize_columns(df_clean1, features)

Reordered columns to Index(['brand', 'name', 'description', 'category_raw', 'price_raw',
       'discount_raw'],
      dtype='object')
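
Reordering and dropping can both be expressed as a single column selection in pandas. The sketch below illustrates that idea; `reorganize_columns_sketch` is a hypothetical name, and it is only an assumption that the project's helper works this way.

```python
import pandas as pd


def reorganize_columns_sketch(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in for clean_features.reorganize_columns (assumed behavior)."""
    # Selecting with a list returns the columns in the listed order
    # and leaves out any column that is not listed.
    reordered = df.loc[:, columns].copy()
    print(f"Reordered columns to {reordered.columns}")
    return reordered


# Small illustrative frame (made-up values, not taken from the dataset).
demo = pd.DataFrame(
    {"Crawl Timestamp": ["2019-12-18"], "name": ["Widget"], "price_raw": [9.99], "brand": ["Acme"]}
)
demo_reordered = reorganize_columns_sketch(demo, ["brand", "name", "price_raw"])
```

A name in the list that does not exist in the frame raises a `KeyError`, so a misspelled field fails loudly instead of silently producing an incomplete dataset.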


## Save Interim Dataset

The dataset now has renamed and reordered fields, with `Crawl Timestamp` dropped, and is ready for the next step in the pipeline.

In [6]:
# Save the file
df_output = df_clean2
save_interim(df_output, "0.1.1")

Saving (cleaned) dataframe (29604, 6) to A:\Library\My Repositories\rit\2211_FALL\ISTE780\Project\data\interim\ecommerce_data-cleaned-0.1.1.csv.
File saved.
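
For reference, a save step of this kind can be sketched as below. `save_interim_sketch` and its `root` parameter are hypothetical names, and the filename pattern and log message are modeled on the output above rather than taken from the project's source. The sketch writes to a temporary directory so that it does not touch the real `data/interim/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_interim_sketch(df: pd.DataFrame, version: str, tag: str = "cleaned",
                        root: Path = Path(".")) -> Path:
    """Hypothetical stand-in for save_interim (assumed behavior)."""
    out_path = root / "data" / "interim" / f"ecommerce_data-{tag}-{version}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"Saving ({tag}) dataframe {df.shape} to {out_path}.")
    # The index is written as the first column, so it can be restored with index_col=0.
    df.to_csv(out_path)
    print("File saved.")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # Small illustrative frame (made-up values, not taken from the dataset).
    demo = pd.DataFrame({"brand": ["Acme"], "price_raw": [9.99]})
    saved = save_interim_sketch(demo, "0.1.1", root=Path(tmp))
    reloaded = pd.read_csv(saved, index_col=0)
    reloaded_columns = list(reloaded.columns)
    reloaded_shape = reloaded.shape
```

Reading the file back with `index_col=0`, as the load cell in this notebook does, keeps the saved index from reappearing as an extra unnamed column.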
