# 0.1.2 - Preprocessing: Category Extraction

**Overview**: This notebook is responsible for extracting categories from the `category_raw` text feature.

**Actions**: This notebook performs the following actions:

- Load intermediate processed data.
- Select the `category_raw` feature vector.
- Split the feature vector on the "|" separator into category levels.
- Extract the primary category features (`category_1` through `category_3`).
- Extract the remaining sub-levels as a combined `keywords` feature.

**Dependencies**: This notebook depends on the following artifact(s):

- `ecommerce_data-cleaned-0.1.1.csv`

**Targets**: This notebook outputs one (1) artifact:

- `ecommerce_data-cleaned-0.1.2.csv`

## Setup

The following cell enables autoreloading so that edits to source files outside the notebook are picked up without a restart, sets the working directory to the project root so the project's `src/` modules can be imported, and imports the required utilities. Each step is optional; include only what development needs.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

# Update the working directory to be the project root directory.
from pathlib import Path, PurePosixPath
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

# Import utilities.
from IPython.display import display
from src.data import *

D:\Repositories\rit\ISTE780\Project


## Load Data

In [2]:
# Read dataset into pandas dataframe.
input_filepath = get_interim_filepath("0.1.1", tag="cleaned")
input_filepath

WindowsPath('D:/Repositories/rit/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.1.csv')

In [3]:
df_input = pd.read_csv(input_filepath, index_col = 0)
df_input.info()
df_input.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29045 non-null  object 
 1   name          29604 non-null  object 
 2   description   29552 non-null  object 
 3   category_raw  29588 non-null  object 
 4   price_raw     29604 non-null  float64
 5   discount_raw  29604 non-null  float64
dtypes: float64(2), object(4)
memory usage: 1.6+ MB


Unnamed: 0,brand,name,description,category_raw,price_raw,discount_raw
0,La Costeña,"La Costena Chipotle Peppers, 7 OZ (Pack of 12)",We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",31.93,31.93
1,Equate,Equate Triamcinolone Acetonide Nasal Allergy S...,We aim to show you accurate product informati...,Health | Equate | Equate Allergy | Equate Sinu...,10.48,10.48
2,AduroSmart ERIA,AduroSmart ERIA Soft White Smart A19 Light Bul...,We aim to show you accurate product informati...,Electronics | Smart Home | Smart Energy and Li...,10.99,10.99
3,lowrider,"24"" Classic Adjustable Balloon Fender Set Chro...",We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Accessories |...,38.59,38.59
4,Anself,Elephant Shape Silicone Drinkware Portable Sil...,We aim to show you accurate product informati...,Baby | Feeding | Sippy Cups: Alternatives to P...,5.81,5.81


## Parsing Category

In [4]:
# Get dataframe containing only category text.
categories = df_input[["category_raw"]]
categories

Unnamed: 0,category_raw
0,"Food | Meal Solutions, Grains & Pasta | Canned..."
1,Health | Equate | Equate Allergy | Equate Sinu...
2,Electronics | Smart Home | Smart Energy and Li...
3,Sports & Outdoors | Bikes | Bike Accessories |...
4,Baby | Feeding | Sippy Cups: Alternatives to P...
...,...
29994,"Food | Snacks, Cookies & Chips | Chips & Crisp..."
29996,Sports & Outdoors | Bikes | Bike Components | ...
29997,"Food | Meal Solutions, Grains & Pasta | Canned..."
29998,Beauty | Hair Care | Hair Styling Tools | Flat...


In [5]:
# Split into multiple columns.
category_table = categories['category_raw'].str.split('|', expand=True).rename(columns = lambda x: "category_"+str(x+1))
category_table

Unnamed: 0,category_1,category_2,category_3,category_4,category_5,category_6,category_7
0,Food,"Meal Solutions, Grains & Pasta",Canned Goods,Canned Vegetables,,,
1,Health,Equate,Equate Allergy,Equate Sinus Congestion & Nasal Care,,,
2,Electronics,Smart Home,Smart Energy and Lighting,Smart Lighting,Smart Light Bulbs,,
3,Sports & Outdoors,Bikes,Bike Accessories,Bike Fenders,,,
4,Baby,Feeding,Sippy Cups: Alternatives to Plastic,,,,
...,...,...,...,...,...,...,...
29994,Food,"Snacks, Cookies & Chips",Chips & Crisps,Chips & Crisps,,,
29996,Sports & Outdoors,Bikes,Bike Components,Bike Forks,,,
29997,Food,"Meal Solutions, Grains & Pasta",Canned Goods,Canned Fruit,,,
29998,Beauty,Hair Care,Hair Styling Tools,Flat Irons,Hair Flat Irons,,
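
The raw text pads each separator with spaces (`Food | Meal Solutions, ...`), so splitting on a bare `|` appears to leave that padding on the resulting values. The following is a minimal sketch of stripping it after the split. It uses made-up rows rather than the dataset, and it is not what the cell above ran.

```python
import pandas as pd

# Toy rows shaped like `category_raw` (illustrative, not taken from the dataset).
raw = pd.Series(["Food | Canned Goods | Canned Vegetables", "Baby | Feeding", None])

# Split on the pipe, then strip the padding left around each piece.
levels = raw.str.split("|", expand=True)
levels = levels.apply(lambda col: col.str.strip())
levels = levels.rename(columns=lambda i: f"category_{i + 1}")

print(levels)
```

Stripping at this stage would change the exact strings stored in every derived column, so any counts computed downstream would need to be regenerated.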


In [6]:
# See number of nulls in each category level.
category_levels = ["category_" + str(x + 1) for x in range(7)]
category_counts = category_table.describe()
category_nulls = category_table.isnull().sum()
category_nulls.name = "nulls"
# DataFrame.append was removed in pandas 2.0; pd.concat adds the same "nulls" row.
pd.concat([category_counts, category_nulls.to_frame().T])

Unnamed: 0,category_1,category_2,category_3,category_4,category_5,category_6,category_7
count,29588,29588,29330,17898,5600,261,3
unique,33,311,1552,1619,769,95,2
top,Sports & Outdoors,Sports,Bike Components,All Bike Components,"All Lures, Baits and Attractants",All Flashlights,Shop All Table Lamps by Style
freq,10963,3746,1284,317,85,38,2
nulls,16,16,274,11706,24004,29343,29601


Missing entries in category levels 1, 2, and 3 are filled with an "Other" placeholder, which is why each level gains one unique value in the summary below.

In [7]:
category_primary = category_table[category_levels[0:3]]
category_primary = category_primary.fillna("Other")
category_primary.describe()

Unnamed: 0,category_1,category_2,category_3
count,29604,29604,29604
unique,34,312,1553
top,Sports & Outdoors,Sports,Bike Components
freq,10963,3746,1284


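We will combine the category sub-levels beyond level 3 into a single "keywords" feature. Note that the format string below joins only levels 4 through 6; level 7, which is populated in just 3 rows, is selected but not included.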

In [8]:
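# Join sub-levels by position; .iloc avoids the deprecated integer-key lookup on a label-indexed row.
category_keywords = category_table[category_levels[3:7]].apply(lambda x : '{} {} {}'.format(x.iloc[0], x.iloc[1], x.iloc[2]), axis=1)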
category_keywords = category_keywords.apply(lambda x: x.replace("None", "").strip()).replace('', "Unknown")
category_keywords.name = "keywords"
category_keywords.value_counts()

Unknown                                                    11690
All Bike Components                                          317
Golf Clothing                                                208
Sports Equipment Fan Shop                                    142
Golf Training Aids                                            95
                                                           ...  
Great Value Bags                                               1
Top Drug Test Brands   EZ Level Drug Test                      1
Nursing Bra                                                    1
Mens Cold Weather Clothing   Mens Cold Weather Shop All        1
Korean & Japanese Beauty   Personal Care   Ear Care            1
Name: keywords, Length: 2078, dtype: int64
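
The format-and-replace approach has a few edge cases worth knowing about. Deleting the substring "None" would also alter any real label that contains it. The "Unknown" count (11690) is also 16 short of the null count for `category_4` (11706), which likely corresponds to rows with no `category_raw` at all, where NaN formats as `nan` instead of `None`. Below is a minimal sketch of an alternative that joins only the non-missing levels. The helper name `join_keywords` and the toy values are hypothetical, and this is not the notebook's method.

```python
import pandas as pd

# Toy stand-in for category_table[category_levels[3:7]] (hypothetical values).
sub_levels = pd.DataFrame(
    {
        "category_4": ["Smart Lighting", None, float("nan")],
        "category_5": ["Smart Light Bulbs", None, float("nan")],
        "category_6": [None, None, float("nan")],
        "category_7": [None, None, float("nan")],
    }
)


def join_keywords(row: pd.Series) -> str:
    """Join the non-missing, whitespace-stripped levels of one row."""
    parts = [str(value).strip() for value in row if pd.notna(value)]
    return " ".join(part for part in parts if part) or "Unknown"


keywords = sub_levels.apply(join_keywords, axis=1)
keywords.name = "keywords"
print(keywords.tolist())
```

Because `pd.notna` treats both `None` and NaN as missing, short rows and fully empty rows both fall back to "Unknown", and every level that is present (including level 7) is kept.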

In [9]:
# Combine new engineered features into one dataframe.
category_features = category_primary.join(category_keywords, how='outer')
category_features.describe()

Unnamed: 0,category_1,category_2,category_3,keywords
count,29604,29604,29604,29604
unique,34,312,1553,2078
top,Sports & Outdoors,Sports,Bike Components,Unknown
freq,10963,3746,1284,11690


In [10]:
# Add preprocessed features into the cleaned dataframe.
df_clean1 = df_input.join(category_features, how='outer')
# Filter and reorder columns.
output_columns = ["brand", "name", "description", "category_1", "category_2", "category_3", "keywords", "price_raw", "discount_raw"]
df_clean1 = df_clean1[output_columns]
df_clean1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29045 non-null  object 
 1   name          29604 non-null  object 
 2   description   29552 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
dtypes: float64(2), object(7)
memory usage: 3.3+ MB
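
An outer join on the index only stays row-for-row when both frames carry the same index; otherwise it adds rows filled with missing values. The following is a small sketch of a guard that could precede the join, using toy frames with a gapped index like the one above. The frame names are illustrative.

```python
import pandas as pd

# Toy frames sharing a non-contiguous index (illustrative values).
df_left = pd.DataFrame({"name": ["a", "b", "c"]}, index=[0, 2, 5])
features = pd.DataFrame({"keywords": ["x", "Unknown", "y"]}, index=[0, 2, 5])

# With identical indexes an outer join is row-for-row; check before joining.
assert df_left.index.equals(features.index), "indexes differ; outer join would add rows"
joined = df_left.join(features, how="outer")

# The row count is unchanged and no missing values were introduced.
assert len(joined) == len(df_left)
print(joined.shape)
```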


## Save Interim Dataset

The dataset now contains the engineered category features.

In [11]:
# Save the file.
df_output = df_clean1
save_interim(df_output, "0.1.2")

Saving (cleaned) dataframe (29604, 9) to D:\Repositories\rit\ISTE780\Project\data\interim\ecommerce_data-cleaned-0.1.2.csv.
File saved.
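
One practical reason to store "Other" and "Unknown" placeholders instead of empty strings is that, with default settings, `pd.read_csv` reads empty fields back as missing values. The following is a minimal sketch of that round trip. It writes to an in-memory buffer rather than calling the project's `save_interim` helper, and the frame is made up.

```python
import io

import pandas as pd

# Toy frame: one row uses placeholders, the other an empty-string keyword.
frame = pd.DataFrame(
    {"category_3": ["Other", "Bike Fenders"], "keywords": ["Unknown", ""]}
)

# Write to an in-memory buffer instead of the project's interim directory.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# Placeholders survive the round trip; the empty string comes back as NaN.
print(reloaded.isna().sum())
```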
