# 0.1.3 - Preprocessing: Clean Text

**Overview**: This notebook cleans the text fields of the interim dataset (version 0.1.2).

**Actions**: This notebook performs the following actions:

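- Replace non-alphanumeric characters (including non-ASCII characters and punctuation) in the text features with spaces.
- Remove English stop words, stem the remaining tokens, and lowercase the result.
- Fill missing text values with empty strings.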

**Dependencies**: This notebook depends on the following artifact(s):

- `data/interim/ecommerce_data-cleaned-0.1.2.csv`

**Targets**: This notebook outputs one (1) artifact:

- `data/interim/ecommerce_data-cleaned-0.1.3.csv`

## Setup

The following cells import the required Python libraries, set the working directory to the project root so that the project's `src/` modules can be imported, and enable autoreload so that edits to source files outside the notebook are picked up without restarting the kernel. These steps are optional; include them as needed during development.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

# Set project directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

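# Import utilities. `pd` (pandas) is used below, so import it explicitly
# rather than relying on the wildcard imports to expose it.
import pandas as pd
from src.data import *
from src.features import *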

D:\Repositories\rit\ISTE780\Project
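
As a small illustration of how the project root is derived in the cell above, the sketch below uses `PurePosixPath` with a hypothetical directory layout (the `/work/project/notebooks` path is an assumption for this example, not the actual repository layout). `parents[0]` is the immediate parent of a path, so the setup cell assumes the notebook is run from a directory one level below the project root.

```python
from pathlib import PurePosixPath

# Hypothetical working directory of a running notebook (an assumption for illustration).
notebook_dir = PurePosixPath("/work/project/notebooks")

# parents[0] is the immediate parent directory, i.e. the assumed project root.
project_dir = notebook_dir.parents[0]
print(project_dir)
```

If the notebook were started from the project root itself, the same expression would resolve one level too high, so the working directory is worth checking before running the imports.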


## Load Data

In [2]:
# Read dataset into pandas dataframe.
input_filepath = get_interim_filepath("0.1.2", tag="cleaned")
input_filepath

WindowsPath('D:/Repositories/rit/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.2.csv')

In [3]:
df_input = pd.read_csv(input_filepath, index_col = 0)
df_input.info()
df_input.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29045 non-null  object 
 1   name          29604 non-null  object 
 2   description   29552 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
dtypes: float64(2), object(7)
memory usage: 2.3+ MB


| | brand | name | description | category_1 | category_2 | category_3 | keywords | price_raw | discount_raw |
|---|---|---|---|---|---|---|---|---|---|
| 0 | La Costeï¿½ï¿½a | La Costena Chipotle Peppers, 7 OZ (Pack of 12) | We aim to show you accurate product informati... | Food | Meal Solutions, Grains & Pasta | Canned Goods | Canned Vegetables | 31.93 | 31.93 |
| 1 | Equate | Equate Triamcinolone Acetonide Nasal Allergy S... | We aim to show you accurate product informati... | Health | Equate | Equate Allergy | Equate Sinus Congestion & Nasal Care | 10.48 | 10.48 |
| 2 | AduroSmart ERIA | AduroSmart ERIA Soft White Smart A19 Light Bul... | We aim to show you accurate product informati... | Electronics | Smart Home | Smart Energy and Lighting | Smart Lighting Smart Light Bulbs | 10.99 | 10.99 |
| 3 | lowrider | 24" Classic Adjustable Balloon Fender Set Chro... | We aim to show you accurate product informati... | Sports & Outdoors | Bikes | Bike Accessories | Bike Fenders | 38.59 | 38.59 |
| 4 | Anself | Elephant Shape Silicone Drinkware Portable Sil... | We aim to show you accurate product informati... | Baby | Feeding | Sippy Cups: Alternatives to Plastic | Unknown | 5.81 | 5.81 |


## Text Preprocessing

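We preprocess each text field with four operations: a regex that replaces every non-alphanumeric character with a space, English stop-word removal, stemming with NLTK's `PorterStemmer`, and lowercasing.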

In [4]:
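# Import the stop word list, the stemmer, and the regex engine.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import regex

# Set up helpers for cleaning text fields.
# A set gives constant-time membership checks during stop word filtering.
words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(x):
    # Replace every non-alphanumeric character with a space, then lowercase
    # before filtering so that capitalized stop words (e.g. "The") are dropped too.
    tokens = regex.sub("[^a-zA-Z0-9]", " ", x).lower().split()
    return " ".join(stemmer.stem(i) for i in tokens if i not in words)

def clean_feature(feature):
    # Clean every value of a single text column (a pandas Series).
    print("Cleaning {}...".format(feature.name))
    return feature.apply(clean_text)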
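
To make the regex step concrete, the following sketch applies the same character class to a hypothetical sample string modeled on a product name in the dataset preview. It uses the standard-library `re` module in place of `regex` (an assumption made only to keep the example dependency-free) and omits stop-word removal and stemming. Because the class keeps only ASCII letters and digits, an accented letter such as `ñ` is replaced by a space and splits the word around it.

```python
import re

# Hypothetical sample value, modeled on a product name in the dataset preview.
sample = "La Coste\u00f1a Chipotle Peppers, 7 OZ (Pack of 12)"

# Same character class as clean_text: anything outside a-z, A-Z, 0-9 becomes a space.
tokens = re.sub("[^a-zA-Z0-9]", " ", sample).split()
print(tokens)
# ['La', 'Coste', 'a', 'Chipotle', 'Peppers', '7', 'OZ', 'Pack', 'of', '12']
```

Splitting on whitespace afterwards collapses the runs of spaces left behind by punctuation, so consecutive removed characters do not produce empty tokens.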

In [5]:
# Select text fields to clean.
df_text = df_input[["brand", "name", "description", "category_1", "category_2", "category_3", "keywords"]]
df_text.describe()

| | brand | name | description | category_1 | category_2 | category_3 | keywords |
|---|---|---|---|---|---|---|---|
| count | 29045 | 29604 | 29552 | 29604 | 29604 | 29604 | 29604 |
| unique | 10594 | 29308 | 28748 | 34 | 312 | 1553 | 2078 |
| top | Unique Bargains | FOOTBALL AMERICA YOUTH INTEGRATED FOOTBALL PAN... | We aim to show you accurate product informati... | Sports & Outdoors | Sports | Bike Components | Unknown |
| freq | 317 | 8 | 84 | 10963 | 3746 | 1284 | 11690 |


In [6]:
# Fill missing values with empty strings.
df_text = df_text.fillna("")
df_text.describe()

| | brand | name | description | category_1 | category_2 | category_3 | keywords |
|---|---|---|---|---|---|---|---|
| count | 29604 | 29604 | 29604 | 29604 | 29604 | 29604 | 29604 |
| unique | 10595 | 29308 | 28749 | 34 | 312 | 1553 | 2078 |
| top | | FOOTBALL AMERICA YOUTH INTEGRATED FOOTBALL PAN... | We aim to show you accurate product informati... | Sports & Outdoors | Sports | Bike Components | Unknown |
| freq | 559 | 8 | 84 | 10963 | 3746 | 1284 | 11690 |
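
Filling missing values matters because pandas represents a missing text value as a float `NaN`, and a regex substitution only accepts strings. The sketch below shows the failure mode with the standard-library `re` module standing in for `regex` (an assumption for illustration; the expectation is that `regex.sub` rejects non-string input in the same way). The helper name `strip_non_alnum` is hypothetical.

```python
import re

missing = float("nan")  # stand-in for how pandas represents a missing text value

def strip_non_alnum(x):
    # Same substitution used in clean_text.
    return re.sub("[^a-zA-Z0-9]", " ", x)

try:
    strip_non_alnum(missing)
    failed = False
except TypeError:
    # A float is not a valid input string for a regex substitution.
    failed = True

print(failed)                      # True
print(repr(strip_non_alnum("")))   # '' (an empty string passes through unchanged)
```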


In [8]:
%%time

# Apply the clean text function to all text fields.
df_filtered = df_text.apply(clean_feature)

Cleaning brand...
Cleaning name...
Cleaning description...
Cleaning category_1...
Cleaning category_2...
Cleaning category_3...
Cleaning keywords...
Wall time: 1min 18s


In [15]:
df_output = df_filtered.join(df_input[["price_raw", "discount_raw"]], how='outer')
df_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29604 non-null  object 
 1   name          29604 non-null  object 
 2   description   29604 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
dtypes: float64(2), object(7)
memory usage: 3.3+ MB
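
The join above aligns rows on the index rather than on position. A minimal sketch with made-up data (the frames and values below are hypothetical) suggests why the row count stays at 29604: when both frames share the same, possibly non-contiguous, index, an outer join returns one row per index label and introduces no missing values.

```python
import pandas as pd

# Hypothetical frames sharing a non-contiguous index, similar to an index
# with 29604 entries spanning 0 to 29999.
text = pd.DataFrame({"name": ["pepper", "bulb", "fender"]}, index=[0, 2, 5])
nums = pd.DataFrame({"price_raw": [31.93, 10.99, 38.59]}, index=[0, 2, 5])

# Index-aligned outer join; identical indexes mean no rows are added or lost.
joined = text.join(nums, how="outer")
print(joined.shape)
print(int(joined.isna().sum().sum()))
```

If the two indexes ever diverged, an outer join would add rows containing `NaN`, so comparing the shape before and after the join is a cheap sanity check.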


## Save Interim Dataset

The dataset now has cleaned text fields and is ready for the next step in the pipeline.

In [16]:
# Save the file
save_interim(df_output, "0.1.3")

Saving (cleaned) dataframe (29604, 9) to D:\Repositories\rit\ISTE780\Project\data\interim\ecommerce_data-cleaned-0.1.3.csv.
File saved.
