# 0.2.5 - EDA: PCA

**Overview**: This notebook explores the cleaned dataset to assess whether PCA-style dimensionality reduction, tuned with GridSearchCV, is a viable approach to feature selection.

**Actions**: This notebook performs the following actions:

- Creates a pipeline whose dimensionality-reduction step is chosen and tuned through a GridSearchCV parameter grid.

**Dependencies**: This notebook depends on the following artifact(s):

- `data/interim/ecommerce_data-cleaned-0.1.3.csv`

**Targets**: This notebook does not output any artifacts.

## Setup

The following cells import the libraries required for analysis in Python, set the working directory to the project root so that the project's `src/` modules can be imported, and enable autoreload so that changes to source files outside the notebook are picked up without restarting the kernel. All of these steps are optional; include them as needed for development.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

# Set project directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

# Import utilities.
from src.data import *
from src.features import *

D:\Repositories\rit\ISTE780\Project


## Load Data

In [2]:
# Read dataset into pandas dataframe.
input_filepath = get_interim_filepath("0.1.3", tag="cleaned")
input_filepath

WindowsPath('D:/Repositories/rit/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.3.csv')

In [3]:
# Reference: https://stackoverflow.com/questions/10867028/get-pandas-read-csv-to-read-empty-values-as-empty-string-instead-of-nan
# Import pandas explicitly rather than relying on the wildcard imports above.
import pandas as pd

# For this file, empty strings are valid values and should not be read as missing (NaN).
df_input = pd.read_csv(input_filepath, index_col=0, keep_default_na=False)
df_input.info()
df_input.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29604 non-null  object 
 1   name          29604 non-null  object 
 2   description   29604 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
dtypes: float64(2), object(7)
memory usage: 2.3+ MB


| | brand | name | description | category_1 | category_2 | category_3 | keywords | price_raw | discount_raw |
|---|---|---|---|---|---|---|---|---|---|
| 0 | la cost | la costena chipotl pepper 7 oz pack 12 | we aim show accur product inform manufactur su... | food | meal solut grain pasta | can good | can veget | 31.93 | 31.93 |
| 1 | equat | equat triamcinolon acetonid nasal allergi spra... | we aim show accur product inform manufactur su... | health | equat | equat allergi | equat sinu congest nasal care | 10.48 | 10.48 |
| 2 | adurosmart eria | adurosmart eria soft white smart a19 light bul... | we aim show accur product inform manufactur su... | electron | smart home | smart energi light | smart light smart light bulb | 10.99 | 10.99 |
| 3 | lowrid | 24 classic adjust balloon fender set chrome bi... | we aim show accur product inform manufactur su... | sport outdoor | bike | bike accessori | bike fender | 38.59 | 38.59 |
| 4 | anself | eleph shape silicon drinkwar portabl silicon c... | we aim show accur product inform manufactur su... | babi | feed | sippi cup altern plastic | unknown | 5.81 | 5.81 |


## Validation Subset

We hold out half of the data as a test set for validation. The features are the seven text columns, and the target is `price_raw` cast to integers.

In [4]:
from sklearn.model_selection import train_test_split

# Create a 50%/50% train-test split.
X = df_input.drop(columns=['price_raw', 'discount_raw'])
y = df_input['price_raw'].astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=100)
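
As a small illustration of the integer cast applied to the target above, the sketch below uses the first three `price_raw` values from the preview (not the full dataset) to show that `astype('int')` truncates the fractional part rather than rounding. The rounded variant is included only for comparison; the notebook does not state which behaviour is intended.

```python
import pandas as pd

# First three price_raw values from the preview, used purely as sample input.
prices = pd.Series([31.93, 10.48, 10.99], name="price_raw")

# astype("int") drops the fractional part (truncation); it does not round.
truncated = prices.astype("int")

# For comparison only: round to the nearest whole unit before casting.
rounded = prices.round().astype("int")

print(truncated.tolist())
print(rounded.tolist())
```

If the fractional part of the price matters for the model, keeping `price_raw` as a float target may be worth considering.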

## Pipeline Creation

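First, we prepare the data for a learning model by encoding each text column with TF-IDF. A grid search then compares three ways of reducing the encoded features (PCA, TruncatedSVD, and chi-squared `SelectKBest`) ahead of a random forest regressor.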

In [7]:
# Function for creating the pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

def get_feature_transformer(columns, vectorizer):
    """Prepare the ColumnTransformer."""
    return ColumnTransformer(
        [(feature, vectorizer, feature) for feature in columns],
        remainder="drop",
        verbose_feature_names_out=True,
    )

vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True, norm="l2")
column_transformer = get_feature_transformer(["brand", "name", "description", "category_1", "category_2", "category_3", "keywords"], vectorizer)

def get_pipeline():
    """Get the composed Pipeline"""
    return Pipeline([
        ("vect", column_transformer),
        ("dim", "passthrough"),
        ("clf", RandomForestRegressor())
    ])

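# Candidate output sizes tried for every reducer.
N_FEATURES = [2, 4, 10]

def get_param_grid():
    """Get the parameter grid: two projection reducers and one chi-squared selector."""
    return [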
        {
            "dim": [PCA(iterated_power=7), TruncatedSVD(n_iter=7)],
            "dim__n_components": N_FEATURES,
        },
        {
            "dim": [SelectKBest(chi2)],
            "dim__k": N_FEATURES,
        },
    ]

reducer_labels = ["PCA", "TruncatedSVD", "KBest(chi2)"]

# Grid search over the reducer choices, run serially (n_jobs=1) with default scoring and CV settings.
grid = GridSearchCV(get_pipeline(), n_jobs=1, param_grid=get_param_grid())
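
To see how much work this grid search schedules, the candidates can be enumerated with scikit-learn's `ParameterGrid`. The sketch below rebuilds the same grid as above under the local name `param_grid`. The fold count is an assumption: the notebook does not pass `cv`, so the default 5-fold setting is assumed.

```python
from collections import Counter

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import ParameterGrid

N_FEATURES = [2, 4, 10]

# Same grid as in the notebook cell above.
param_grid = [
    {
        "dim": [PCA(iterated_power=7), TruncatedSVD(n_iter=7)],
        "dim__n_components": N_FEATURES,
    },
    {
        "dim": [SelectKBest(chi2)],
        "dim__k": N_FEATURES,
    },
]

# Expand the grid into individual parameter combinations.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# Assumption: GridSearchCV's default 5-fold cross-validation.
N_FOLDS = 5
n_fits = n_candidates * N_FOLDS

# Number of candidates contributed by each reducer class.
per_reducer = Counter(type(params["dim"]).__name__ for params in candidates)

print(n_candidates, n_fits, dict(per_reducer))
```

Each reducer contributes one candidate per entry in `N_FEATURES`, so a reducer that fails for every setting would account for a fixed share of the total fits.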

In [8]:
%%time

grid.fit(X_train, y_train)

15 fits failed out of a total of 45.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\effen\.conda\envs\iste780\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\effen\.conda\envs\iste780\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\effen\.conda\envs\iste780\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\effen\.conda\envs\iste780\lib\site-packages\joblib\memory.py", line 349, in __call__
    

Wall time: 3min 7s
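
The traceback above is truncated, so the actual error is not visible here. The count itself is suggestive: assuming the default 5-fold CV, 15 failures correspond to 3 candidates, which is what a single reducer failing for every value in `N_FEATURES` would produce. One possible cause, offered as an assumption rather than a diagnosis, is that TF-IDF output is a SciPy sparse matrix, which `PCA` rejects in some scikit-learn versions while `TruncatedSVD` accepts it. The sketch below probes this on made-up toy documents; the result of the `PCA` probe depends on the installed version.

```python
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
docs = [
    "chipotle pepper pack",
    "nasal allergy spray",
    "smart light bulb",
    "bike fender set",
]

# TfidfVectorizer returns a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(docs)
is_sparse = sparse.issparse(X_sparse)

# TruncatedSVD is designed to work directly on sparse input.
svd_out = TruncatedSVD(n_components=2, n_iter=7, random_state=0).fit_transform(X_sparse)

# Probe whether this scikit-learn version's PCA accepts sparse input.
try:
    PCA(n_components=2).fit(X_sparse)
    pca_accepts_sparse = True
except (TypeError, ValueError) as err:
    pca_accepts_sparse = False
    print(f"PCA rejected sparse input: {err}")

# PCA on a dense copy works regardless, at the cost of memory on large data.
pca_dense_out = PCA(n_components=2).fit_transform(X_sparse.toarray())

print(is_sparse, svd_out.shape, pca_accepts_sparse, pca_dense_out.shape)
```

Re-running the search with `error_score='raise'`, as the warning suggests, would surface the real exception and confirm or rule out this explanation.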


GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        ColumnTransformer(transformers=[('brand',
                                                                         TfidfVectorizer(stop_words='english',
                                                                                         sublinear_tf=True),
                                                                         'brand'),
                                                                        ('name',
                                                                         TfidfVectorizer(stop_words='english',
                                                                                         sublinear_tf=True),
                                                                         'name'),
                                                                        ('description',
                                                                         TfidfVectorizer(stop_wo