## High-Value Imports Data

This notebook prepares the 100 high-value import products for use as input to our trained model. To do so, we repeat the product list once for each importer (Trading Partner) and once more for each representative line count value.
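
The expansion described above amounts to a cross join: every product is paired with every trading partner and every line count. The following is a minimal sketch of that idea on a toy frame. It is not the actual `transform` implementation from `etl_hvi_data`; the helper name `expand_products` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def expand_products(products, partner_columns, line_counts):
    """Hypothetical helper: repeat each product for every partner and line count."""
    # One row per trading partner, one-hot encoded across the partner columns.
    partners = pd.DataFrame(
        [
            {col: int(col == active) for col in partner_columns}
            for active in partner_columns
        ]
    )
    counts = pd.DataFrame({"line_count": line_counts})
    # how="cross" builds the Cartesian product of the two frames.
    return products.merge(partners, how="cross").merge(counts, how="cross")


toy_products = pd.DataFrame({"product_id": [1, 2], "msrp": [17.9, 13.7]})
toy_partners = ["trading_partner_cn", "trading_partner_hk"]
expanded = expand_products(toy_products, toy_partners, [1, 8, 740])
print(len(expanded))  # 2 products x 2 partners x 3 line counts = 12
```

With 100 products, 5 trading partners and 7 line count values, the same construction would give 3500 rows.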

In [14]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np



In [15]:
from data_util import load_hvi

hvi_products_df = load_hvi()
hvi_products_df.head()

Unnamed: 0,product_id,product,category,individual_category,brand_name,description,price_usd
0,2296012,Wearing Apparel/Accessories,Bottom Wear,jeans,Roadster,roadster men navy blue slim fit mid rise clean...,17.9092
1,13780156,Wearing Apparel/Accessories,Bottom Wear,track-pants,LOCOMOTIVE,locomotive men black white solid slim fit tra...,13.727599
2,11895958,Wearing Apparel/Accessories,Topwear,shirts,Roadster,roadster men navy white black geometric print...,16.714456
3,4335679,Wearing Apparel/Accessories,Lingerie & Sleep Wear,shapewear,Zivame,zivame women black saree shapewear zi3023core0...,15.471924
4,11690882,Wearing Apparel/Accessories,Western,tshirts,Roadster,roadster women white solid v neck pure cotton ...,7.156511


Since the input data must have the same features as the data the model was trained on, we'll drop the following columns:

  1. `category`
  2. `description`
  3. `individual_category`
  4. `product`

Furthermore, we'll rename the following columns:

  1. `price_usd` -> `msrp`

Finally, we'll keep the following columns for reference, but they will have to be removed immediately before passing the data to the model:

  1. `product_id`
  2. `brand_name`
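
The column handling listed above could look roughly like the sketch below. It runs on a one-row toy frame that mirrors the columns of the raw products table (the description is abridged), and the variable names `prepared`, `id_columns` and `model_input` are assumptions for illustration rather than names from `etl_hvi_data`.

```python
import pandas as pd

# Toy row mirroring the columns of the raw products table (values abridged).
raw = pd.DataFrame(
    {
        "product_id": [2296012],
        "product": ["Wearing Apparel/Accessories"],
        "category": ["Bottom Wear"],
        "individual_category": ["jeans"],
        "brand_name": ["Roadster"],
        "description": ["roadster men navy blue slim fit"],
        "price_usd": [17.9092],
    }
)

# Keep the price plus the two identifier columns, and rename the price to msrp.
prepared = raw[["product_id", "brand_name", "price_usd"]].rename(
    columns={"price_usd": "msrp"}
)

# Identifier columns stay in the saved file but are dropped right before prediction.
id_columns = ["product_id", "brand_name"]
model_input = prepared.drop(columns=id_columns)
print(list(model_input.columns))  # ['msrp']
```

Keeping `product_id` and `brand_name` in the saved file makes it possible to join predictions back to the products later.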

In [16]:
from feature_util import load_model_pickle

ipr_model = load_model_pickle('models/ipr_model_202408180449.pkl')
print(ipr_model.feature_names_in_)

['line_count' 'msrp' 'trading_partner_cn' 'trading_partner_hk'
 'trading_partner_other_countries' 'trading_partner_sg'
 'trading_partner_tr']
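
Assuming `ipr_model` is a scikit-learn style estimator, recent scikit-learn versions check that prediction input carries the same feature names, in the same order, as the training data. A minimal sketch of aligning a frame to that order is shown below; `feature_names` is a stand-in array copied from the printed output, so the example runs without the pickled model.

```python
import numpy as np
import pandas as pd

# Stand-in for ipr_model.feature_names_in_ (copied from the printed output above).
feature_names = np.array(
    [
        "line_count",
        "msrp",
        "trading_partner_cn",
        "trading_partner_hk",
        "trading_partner_other_countries",
        "trading_partner_sg",
        "trading_partner_tr",
    ]
)

# Toy prediction row with extra identifier columns and a different column order.
row = pd.DataFrame(
    [
        {
            "product_id": 2296012,
            "brand_name": "Roadster",
            "msrp": 17.9092,
            "trading_partner_cn": 1,
            "trading_partner_tr": 0,
            "trading_partner_other_countries": 0,
            "trading_partner_sg": 0,
            "trading_partner_hk": 0,
            "line_count": 24,
        }
    ]
)

# Fail early if a feature is missing, then select the features in training order.
missing = [name for name in feature_names if name not in row.columns]
assert not missing, f"missing features: {missing}"
X = row[list(feature_names)]
print(list(X.columns) == list(feature_names))  # True
```

Selecting by `feature_names` also drops the identifier columns in the same step.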


To choose representative line counts, we'll load the processed IPR data and use the summary statistics of `line_count` as common values:

In [17]:
ipr_processed_df = pd.concat(
    [
        pd.read_csv(f)
        for f in [
            "data/processed/ipr_data_processed_cv.csv",
            "data/processed/ipr_data_processed_train.csv",
            "data/processed/ipr_data_processed_test.csv",
        ]
    ]
)

ipr_processed_df.head()

Unnamed: 0,msrp,trading_partner_cn,trading_partner_hk,trading_partner_other_countries,trading_partner_sg,trading_partner_tr,line_count,seized
0,3898.05,1,0,0,0,0,277,1.0
1,15.0,0,1,0,0,0,10,1.0
2,8400.0,0,0,0,0,1,25,1.0
3,299000.0,0,1,0,0,0,1,1.0
4,815.0,0,1,0,0,0,7,1.0


In [18]:
ipr_processed_df.line_count.describe()

count    444113.00000
mean         24.08539
std          56.99873
min           1.00000
25%           3.00000
50%           8.00000
75%          19.00000
max         740.00000
Name: line_count, dtype: float64
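
The values in this summary can also be read off programmatically instead of being typed by hand. The sketch below does that on a small toy series. Truncating to whole numbers is an assumption, chosen because the mean and standard deviation above appear as 24 and 56 in the next cell.

```python
import pandas as pd

# Toy stand-in for ipr_processed_df.line_count.
line_count = pd.Series([1, 3, 8, 19, 277])

stats = line_count.describe()
# Take mean, std, min, quartiles and max, truncated to whole line counts.
keys = ["mean", "std", "min", "25%", "50%", "75%", "max"]
line_count_values = [int(stats[k]) for k in keys]
print(line_count_values)
```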

In [19]:
from etl_hvi_data import transform

trading_partner_columns = [
    f for f in ipr_model.feature_names_in_ if f.startswith("trading_partner_")
]
# mean, std, min, 25%, 50%, 75% and max of line_count (truncated to integers)
line_count_values = [24, 56, 1, 3, 8, 19, 740]

hvi_processed_df = transform(
    hvi_products_df, trading_partner_columns, line_count_values, copy=True
)
hvi_processed_df.head()

Unnamed: 0,product_id,brand_name,msrp,trading_partner_cn,trading_partner_tr,trading_partner_other_countries,trading_partner_sg,trading_partner_hk,line_count
0,2296012,Roadster,17.9092,1,0,0,0,0,24
1,13780156,LOCOMOTIVE,13.727599,1,0,0,0,0,24
2,11895958,Roadster,16.714456,1,0,0,0,0,24
3,4335679,Zivame,15.471924,1,0,0,0,0,24
4,11690882,Roadster,7.156511,1,0,0,0,0,24


In [20]:
hvi_processed_df.describe()

Unnamed: 0,product_id,msrp,trading_partner_cn,trading_partner_tr,trading_partner_other_countries,trading_partner_sg,trading_partner_hk,line_count
count,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0
mean,12575320.0,24.713262,0.2,0.2,0.2,0.2,0.2,121.571429
std,4424019.0,20.248316,0.400057,0.400057,0.400057,0.400057,0.400057,253.100159
min,1864573.0,4.767025,0.0,0.0,0.0,0.0,0.0,1.0
25%,10539420.0,11.935484,0.0,0.0,0.0,0.0,0.0,3.0
50%,13648650.0,17.9092,0.0,0.0,0.0,0.0,0.0,19.0
75%,16037080.0,30.101553,0.0,0.0,0.0,0.0,0.0,56.0
max,17899310.0,143.369176,1.0,1.0,1.0,1.0,1.0,740.0


In [21]:
hvi_processed_df[trading_partner_columns].sum()

trading_partner_cn                 700
trading_partner_hk                 700
trading_partner_other_countries    700
trading_partner_sg                 700
trading_partner_tr                 700
dtype: int64
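
These counts follow from the size of the cross product. As a quick arithmetic check, assuming the 100 products mentioned in the introduction, the 5 trading partner columns and the 7 line count values used above:

```python
n_products = 100  # stated in the introduction
n_partners = 5    # trading_partner_* columns in the model
line_count_values = [24, 56, 1, 3, 8, 19, 740]

total_rows = n_products * n_partners * len(line_count_values)
rows_per_partner = n_products * len(line_count_values)
mean_line_count = sum(line_count_values) / len(line_count_values)

print(total_rows)                 # 3500
print(rows_per_partner)           # 700
print(round(mean_line_count, 6))  # 121.571429
```

All three numbers agree with the `describe()` output and the per-partner sums.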

In [22]:
hvi_processed_df.isnull().sum()

product_id                         0
brand_name                         0
msrp                               0
trading_partner_cn                 0
trading_partner_tr                 0
trading_partner_other_countries    0
trading_partner_sg                 0
trading_partner_hk                 0
line_count                         0
dtype: int64

Finally, with the dataset adapted for use with the model, we'll save it. Predictions and evaluation are performed in a separate notebook.

In [23]:
hvi_processed_df.to_csv('data/processed/hvi_data_processed.csv', index=False)
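
Passing `index=False` keeps the row index out of the file, so reading it back yields the same columns without an extra `Unnamed: 0` column. A small sketch of that round trip, using an in-memory buffer and a toy frame instead of the real path and data:

```python
import io

import pandas as pd

toy = pd.DataFrame({"product_id": [2296012], "msrp": [17.9092], "line_count": [24]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same option as above, written to memory
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))  # ['product_id', 'msrp', 'line_count']
```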