![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

# MLA Tabular Data - Final Project: Product Substitute Prediction
# Day 1: Data Processing and Developing ML Model
___
### Problem Definition:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your result CSV file to the leaderboard: __https://leaderboard.corp.amazon.com/tasks/478__

### __Dataset and Files:__

* __asin_product_data.csv__: Each line gives information about a specific product, such as ASIN, category, and item name. We will use this to create a feature vector for each product. This file has 113 columns; not all of them are suitable, so we will select some useful ones in this notebook. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


* __dataset_metadata.csv__: Provides detailed information about all 113 columns in the asin_product_data.csv


* __training.csv__: Product pairs to consider are given here. Its columns are:
> - `ID:` ID of the record
> - `key_asin:` Key product ASIN
> - `cand_asin:` Candidate product ASIN
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).

* __Sample submission file:__ You can download a sample file from [here](https://leaderboard.corp.amazon.com/datasets/1469)

___
## 1. Reading the Dataset

In [1]:
import boto3
from os import path
import pandas as pd

# import the datasets
bucketname = 'mlu-student-datalake' # replace with your bucket name
filename1 = 'MLA-TAB/asin_product.csv' # replace with your object key
filename2 = 'MLA-TAB/training.csv' # replace with your object key
filename3 = 'MLA-TAB/public_test_features.csv' # replace with your object key
s3 = boto3.resource('s3')
if not path.exists("asin_product.csv"):
    s3.Bucket(bucketname).download_file(filename1, 'asin_product.csv')
if not path.exists("training.csv"):
    s3.Bucket(bucketname).download_file(filename2, 'training.csv')
if not path.exists("public_test_features.csv"):
    s3.Bucket(bucketname).download_file(filename3, 'public_test_features.csv')
    
asin_product_data = pd.read_csv('asin_product.csv', encoding='ISO-8859-1')
training_data = pd.read_csv('training.csv')
test_data = pd.read_csv('public_test_features.csv')
#env.head()



Let's see what our data looks like below:

__"asin_product_data.csv"__ file gives us information about products. We will use this as our main data table to construct feature vectors for each ASIN (product) in our training and test datasets.

In [2]:
asin_product_data.head()

Unnamed: 0,Region Id,MarketPlace Id,ASIN,Binding Code,binding_description,brand_code,case_pack_quantity,classification_code,classification_description,color_map,...,pkg_weight,pkg_weight_uom,pkg_width,release_date_embargo_level,dw_creation_date,dw_last_updated,is_deleted,last_updated,version,external_testing_certification
0,1,1,153427507,hardcover,Hardcover,,,base_product,Base Product,,...,,,,,4-Jan-11,22-Jul-16,N,21-Jul-16,145,
1,1,1,267648340,hardcover,Hardcover,FOS3T,,base_product,Base Product,,...,0.85,pounds,5.98,,16-Sep-16,23-Feb-18,N,22-Feb-18,33,
2,1,1,545496470,hardcover,Hardcover,KLUTZ,6.0,base_product,Base Product,Black,...,1.631404,pounds,8.267717,low,20-Aug-12,23-Dec-17,N,23-Dec-17,20912,
3,1,1,679858040,paperback,Paperback,,,base_product,Base Product,,...,1.2,pounds,7.7,,4-Jan-11,4-Nov-17,N,3-Nov-17,2108,
4,1,1,078694742X,toy,Toy,WZDCS,12.0,base_product,Base Product,,...,0.62,pounds,6.535433,,4-Jan-11,21-Oct-17,N,20-Oct-17,9395,


__"training.csv"__ file is our training data. The label is 1 if the two products are substitutes for each other and 0 otherwise.

In [3]:
training_data.head()

Unnamed: 0,ID,key_asin,cand_asin,label
0,42595,B01L7CFUWC,B01CU4SOQ0,1
1,35775,B01KDAKKTM,B01CGQE5YC,0
2,37152,B013FA0UVA,B06W9HY6MV,1
3,4340,B008KPZLEC,B01M5AMNA1,0
4,37349,B0196BJHXY,B00XCHMLI2,1


__"public_test_features.csv"__ file is the test data. Let's see what it looks like. See below that we don't have the label column in this data. We will predict the labels with our Machine Learning model.

In [4]:
test_data.head()

Unnamed: 0,ID,key_asin,cand_asin
0,39236,B01C5TFLSE,B06XDMZ5MY
1,1353,B003YJ8TVQ,B01G6R24CM
2,39280,B0063X7BT6,B01BO2NOD2
3,1665,B01DJH637O,B017SCJACQ
4,14925,B003U8ESI4,B00HGDMGM4


_____
## 2. Exploratory Data Analysis and Feature Engineering

As you can see from the __"asin_product_data.head()"__ result above, we are dealing with a lot of columns. Each column is a potential feature for our problem. As we don't know how to deal with text and categorical columns yet, we will only consider numerical columns in this first submission.
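
One way to list the candidate numerical columns is pandas' `select_dtypes`. The sketch below runs on a small made-up table (the `toy` dataframe and its values are illustrative assumptions, not the project data). ID-like columns such as `Region Id` are numeric by dtype but carry no measurement, so the result still needs a manual review.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for asin_product_data
toy = pd.DataFrame({
    "ASIN": ["A1", "B2", "C3"],
    "item_name": ["lamp", "necklace", "card box"],
    "item_width": [5.1, np.nan, 13.0],
    "pkg_weight": [0.69, 0.03, np.nan],
})

# Keep only the columns whose dtype is numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Missing-value counts for just those columns
print(toy[numeric_cols].isna().sum())
```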

In [5]:
# count missing values per column; numerical columns of interest include "item_package_quantity", "pkg_height", "pkg_length", "pkg_weight"
asin_product_data.isna().sum()

Region Id                             0
MarketPlace Id                        0
ASIN                                  0
Binding Code                      10072
binding_description               10072
brand_code                        45368
case_pack_quantity                44866
classification_code                   1
classification_description            1
color_map                         46490
cpsia_cautionary_statement        55653
creation_date                         0
currency_code                         0
customer_return_method            65091
customer_return_policy            61869
delivery_option                   63364
discontinued_date                 65715
ean                               11445
esrb_age_rating                   65624
esrb_descriptors                  65713
excluded_direct_browse_node_id    53635
fedas_id                          65702
fma_qualified_price_max           14540
fma_override                      65695
Product Group Code                    0


Let's prepare the numerical features below by filling in their missing values.

In [6]:
# Replace missing values with the column mean for now.
# Assigning the result back avoids relying on in-place modification of a column selection.
columns_to_fill = ["item_package_quantity", "item_height", "item_width", "item_length", "item_weight",
                   "pkg_height", "pkg_width", "pkg_length", "pkg_weight"]
for col in columns_to_fill:
    asin_product_data[col] = asin_product_data[col].fillna(asin_product_data[col].mean())

In [7]:
# re-checking the replaced values
asin_product_data.isna().sum()

Region Id                             0
MarketPlace Id                        0
ASIN                                  0
Binding Code                      10072
binding_description               10072
brand_code                        45368
case_pack_quantity                44866
classification_code                   1
classification_description            1
color_map                         46490
cpsia_cautionary_statement        55653
creation_date                         0
currency_code                         0
customer_return_method            65091
customer_return_policy            61869
delivery_option                   63364
discontinued_date                 65715
ean                               11445
esrb_age_rating                   65624
esrb_descriptors                  65713
excluded_direct_browse_node_id    53635
fedas_id                          65702
fma_qualified_price_max           14540
fma_override                      65695
Product Group Code                    0


___
## 3. Creating the feature map

We are giving you a helper function here. Because of the nature of our training and test data, we need to map ASINs to their features. For this reason, we will use a dictionary variable named __feature_map__. The main advantage of this is fast lookup time when we are putting together the training and test features.

We finished constructing our features for each product in section 2. If you used the given columns, you will have 9 numeric features for each product. The code below maps each ASIN to its full row of product data; the numeric columns are selected afterwards.

Let's print our dataframe below.

In [8]:
asin_product_data.head()

Unnamed: 0,Region Id,MarketPlace Id,ASIN,Binding Code,binding_description,brand_code,case_pack_quantity,classification_code,classification_description,color_map,...,pkg_weight,pkg_weight_uom,pkg_width,release_date_embargo_level,dw_creation_date,dw_last_updated,is_deleted,last_updated,version,external_testing_certification
0,1,1,153427507,hardcover,Hardcover,,,base_product,Base Product,,...,6.23624,,7.714382,,4-Jan-11,22-Jul-16,N,21-Jul-16,145,
1,1,1,267648340,hardcover,Hardcover,FOS3T,,base_product,Base Product,,...,0.85,pounds,5.98,,16-Sep-16,23-Feb-18,N,22-Feb-18,33,
2,1,1,545496470,hardcover,Hardcover,KLUTZ,6.0,base_product,Base Product,Black,...,1.631404,pounds,8.267717,low,20-Aug-12,23-Dec-17,N,23-Dec-17,20912,
3,1,1,679858040,paperback,Paperback,,,base_product,Base Product,,...,1.2,pounds,7.7,,4-Jan-11,4-Nov-17,N,3-Nov-17,2108,
4,1,1,078694742X,toy,Toy,WZDCS,12.0,base_product,Base Product,,...,0.62,pounds,6.535433,,4-Jan-11,21-Oct-17,N,20-Oct-17,9395,


In [9]:
feature_map = {}

for index, row in asin_product_data.iterrows():
    # load all features in (some are useless)
    feature_map[row["ASIN"]] = row.tolist()
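
`iterrows()` is convenient but slow on large tables. A minimal alternative sketch builds the same kind of dictionary with `zip`, assuming we only want selected numeric columns per ASIN rather than the full row stored above. The `products` table and the `selected_cols` list are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for asin_product_data
products = pd.DataFrame({
    "ASIN": ["A1", "B2"],
    "item_width": [1.0, 2.5],
    "item_height": [0.5, 3.0],
})
selected_cols = ["item_width", "item_height"]

# Pair each ASIN with its list of selected feature values
toy_feature_map = dict(zip(products["ASIN"], products[selected_cols].values.tolist()))
print(toy_feature_map["A1"])
```

Storing only the selected columns keeps every feature vector numeric, at the cost of rebuilding the map whenever the column selection changes.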

In [10]:
# Test: This should work if the mapper is correctly created
print(feature_map["1785245481"])

[1, 1, '1785245481', 'office_product', 'Office Product', nan, nan, 'base_product', 'Base Product', nan, nan, '24-Feb-16', 'USD', nan, nan, nan, nan, 9780000000000.0, nan, nan, nan, nan, 26.31, nan, 229, 'gl_office_product', 'Y', 'N', 'N', 'N', 'N', nan, nan, 'N', nan, 'Y', 'N', 'N', nan, 'Y', nan, '1785245481', 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0.25, 11.81, 'Inspirational Quotes, Notable Quotables 2017 Monthly Wall Calendar, 12" x 12"', 1.0, 0.5, 11.81, 'en_US', 'The Gifted Stationary Co', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 'OFFICE_PRODUCTS', 3523, nan, nan, nan, nan, nan, nan, nan, nan, 'The Gifted Stationary Co', nan, nan, nan, nan, nan, nan, nan, nan, 'COLOR', 2.0, nan, nan, nan, nan, 'inches', 0.25, 11.81199999, 0.5, 'pounds', 11.81199999, nan, '25-Feb-16', '9-Jun-17', 'N', '9-Jun-17', 69, nan]


___
## 4. Getting features

We provide the __getFeatures()__ function below. We will get training and test features with this function.

In this function, we are concatenating key product features and candidate product features. For example:
key_asin_feature = [0.12, 2.5, 1, 4.2] and cand_asin_feature = [0.5, 0.1, 3.2, 2.75] results in [0.12, 2.5, 1, 4.2, 0.5, 0.1, 3.2, 2.75]

In [11]:
def getFeatures(data_df, feature_map):
    features = []
    for index, row in data_df.iterrows():
        key_features = feature_map[row["key_asin"]]
        cand_features = feature_map[row["cand_asin"]]
        
        # Concatenate feature vectors
        concat_features = key_features + cand_features
        features.append(concat_features)
        
    return features

Let's use this function twice to get our training and test features.

In [12]:
train_features = getFeatures(training_data, feature_map)
test_features = getFeatures(test_data, feature_map)

In [13]:
# create column names for features dataframes
key_columns = ['key_' + val for val in asin_product_data.columns]
cand_columns = ['cand_' + val for val in asin_product_data.columns]
concat_columns = key_columns + cand_columns

training_features_df = pd.DataFrame(train_features, columns=concat_columns)

pd.set_option('display.max_columns', None)
training_features_df.head()

Unnamed: 0,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,key_classification_description,key_color_map,key_cpsia_cautionary_statement,key_creation_date,key_currency_code,key_customer_return_method,key_customer_return_policy,key_delivery_option,key_discontinued_date,key_ean,key_esrb_age_rating,key_esrb_descriptors,key_excluded_direct_browse_node_id,key_fedas_id,key_fma_qualified_price_max,key_fma_override,key_Product Group Code,key_Product Group Description,key_has_ean,key_has_online_play,key_has_platform,key_has_recommended_browse_nodes,key_has_upc,key_inner_package_type,key_is_adult_product,key_is_advantage,key_is_certified_organic,key_is_conveyable,key_is_discontinued,key_is_manufacture_on_demand,key_is_phone_upgradeable,key_Is Sortable,key_is_super_saver_shipping_excl,key_isbn,key_item_classification_id,key_item_display_diameter,key_item_display_height,key_item_display_length,key_item_display_length_uom,key_item_display_volume,key_item_display_volume_uom,key_item_display_weight,key_item_display_weight_uom,key_item_display_width,key_item_height,key_item_length,key_item_name,key_item_package_quantity,key_item_weight,key_item_width,key_language_code,key_manufacturer_name,key_manufacturer_sku,key_manufacturer_vendor_code,key_max_weight_recommendation,key_mfg_series_number,key_min_weight_recommendation,key_model_number,key_monthly_recurring_charge,key_number_of_items,key_number_of_licenses,key_number_of_pages,key_number_of_points,key_ordering_channel,key_preferred_vendor,key_product_sample_received_day,key_product_type,key_product_type_id,key_program_member,key_program_member_code,key_publication_date,key_publication_day,key_publication_month,key_publication_year,key_publisher,key_publisher_code,key_publisher_studio_label,key_recall_description,key_recall_external_identifier,key_recall_notice_expiration_date,key_recall_notice_publication_date,key_recall_notice_receive_date,key_target_gender,key_unit_count,key_upc,key_variation_theme_description,key_variation_theme_id,key_video_game_region,key_video_game_region_description,key_wireless_provider,key_wireless_provider_code,key_pkg_dimensional_uom,key_pkg_height,key_pkg_length,key_pkg_weight,key_pkg_weight_uom,key_pkg_width,key_release_date_embargo_level,key_dw_creation_date,key_dw_last_updated,key_is_deleted,key_last_updated,key_version,key_external_testing_certification,cand_Region Id,cand_MarketPlace Id,cand_ASIN,cand_Binding Code,cand_binding_description,cand_brand_code,cand_case_pack_quantity,cand_classification_code,cand_classification_description,cand_color_map,cand_cpsia_cautionary_statement,cand_creation_date,cand_currency_code,cand_customer_return_method,cand_customer_return_policy,cand_delivery_option,cand_discontinued_date,cand_ean,cand_esrb_age_rating,cand_esrb_descriptors,cand_excluded_direct_browse_node_id,cand_fedas_id,cand_fma_qualified_price_max,cand_fma_override,cand_Product Group Code,cand_Product Group Description,cand_has_ean,cand_has_online_play,cand_has_platform,cand_has_recommended_browse_nodes,cand_has_upc,cand_inner_package_type,cand_is_adult_product,cand_is_advantage,cand_is_certified_organic,cand_is_conveyable,cand_is_discontinued,cand_is_manufacture_on_demand,cand_is_phone_upgradeable,cand_Is Sortable,cand_is_super_saver_shipping_excl,cand_isbn,cand_item_classification_id,cand_item_display_diameter,cand_item_display_height,cand_item_display_length,cand_item_display_length_uom,cand_item_display_volume,cand_item_display_volume_uom,cand_item_display_weight,cand_item_display_weight_uom,cand_item_display_width,cand_item_height,cand_item_length,cand_item_name,cand_item_package_quantity,cand_item_weight,cand_item_width,cand_language_code,cand_manufacturer_name,cand_manufacturer_sku,cand_manufacturer_vendor_code,cand_max_weight_recommendation,cand_mfg_series_number,cand_min_weight_recommendation,cand_model_number,cand_monthly_recurring_charge,cand_number_of_items,cand_number_of_licenses,cand_number_of_pages,cand_number_of_points,cand_ordering_channel,cand_preferred_vendor,cand_product_sample_received_day,cand_product_type,cand_product_type_id,cand_program_member,cand_program_member_code,cand_publication_date,cand_publication_day,cand_publication_month,cand_publication_year,cand_publisher,cand_publisher_code,cand_publisher_studio_label,cand_recall_description,cand_recall_external_identifier,cand_recall_notice_expiration_date,cand_recall_notice_publication_date,cand_recall_notice_receive_date,cand_target_gender,cand_unit_count,cand_upc,cand_variation_theme_description,cand_variation_theme_id,cand_video_game_region,cand_video_game_region_description,cand_wireless_provider,cand_wireless_provider_code,cand_pkg_dimensional_uom,cand_pkg_height,cand_pkg_length,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,1,1,B01L7CFUWC,,,,,base_product,Base Product,,,29-Aug-16,USD,,,,,,,,,,35.17,,60,gl_home_improvement,N,N,N,N,N,,,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,0.03937,0.03937,KINGSO E26/ E27 Pendant Light Triple Lamp Sock...,1.0,0.00022,0.03937,en_US,KINGSO,,,,,,,,,,,,,,,HOME_LIGHTING_AND_LAMPS,930,,,,,,,,,KINGSO,Product Safety Investigation,,,,,,,,COLOR,2.0,,,,,inches,6.1,8.7,0.69,pounds,7.0,,30-Aug-16,7-Feb-18,N,6-Feb-18,49,,1,1,B01CU4SOQ0,,,,,base_product,Base Product,7 Head,,11-Mar-16,USD,,,,,742000000000.0,,,,,74.95,,60,gl_home_improvement,Y,N,N,N,Y,,,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,11.0,10.8,Lightess Pendant Lights Vintage Multi Cord Edi...,1.0,2.8,5.1,en_US,Lightess,,,,,,,,,,,,,,,HOME_LIGHTING_AND_LAMPS,930,,,,,,,,,Lightess,,,,,,,,742000000000.0,COLOR,2.0,,,,,inches,5.12,13.0,2.819712,pounds,10.0,,12-Mar-16,27-Feb-18,N,26-Feb-18,181,
1,1,1,B01KDAKKTM,baby_product,Baby Product,,,base_product,Base Product,White,,14-Aug-16,USD,,,,,702000000000.0,,,,,29.4,,75,gl_baby_product,Y,N,N,N,Y,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,3.0,2.0,Hawaiian Fish Hook Pendant Silicone Teething N...,1.0,7.098467,0.38,en_US,"Glorified Enterprises USA, LLC",,,,,,,,,,,,,,,BABY_PRODUCT,6336,,,,,,,,,"Glorified Enterprises USA, LLC",,,,,,,,702000000000.0,,,,,,,inches,0.25,5.5,0.03,pounds,4.5,,14-Aug-16,21-Apr-17,N,20-Apr-17,50,,1,1,B01CGQE5YC,,,,,base_product,Base Product,,,2-Mar-16,USD,,,,,,,,,,,,517,gl_guild,N,N,N,N,N,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,1045.436025,15.782868,Chew Necklace for Boys/Sensory Necklace/Toddle...,4.025675,7.098467,10.503185,en_US,,,,,,,,,,,,,,,,GUILD_JEWELRY,34170,,,,,,,,,,,,,,,,,,,,,,,,,3.470359,12.572308,6.23624,,7.714382,,3-Mar-16,31-Aug-17,Y,30-Aug-17,16,
2,1,1,B013FA0UVA,toy,Toy,POKB7,1.0,base_product,Base Product,,,4-Aug-15,USD,,,,,4170000000000.0,,,,,34.29,,21,gl_toy,Y,N,N,N,Y,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,2.0,9.25,TCG: Shiny Rayquaza-EX Box Card Game (Disconti...,1.0,0.639341,13.0,en_US,R&M,,RMAAD,,,,290-80016,,,,,,toy_ordering_channel,,9-Dec-15,TOYS_AND_GAMES,6335,,,,,,,,RMAAD,R&M,,,,,,unisex,,617000000000.0,,,,,,,inches,1.299213,13.188976,0.573202,pounds,9.409449,,5-Aug-15,10-Mar-18,N,9-Mar-18,5706,,1,1,B06W9HY6MV,toy,Toy,,,base_product,Base Product,,,13-Feb-17,USD,,,,,722000000000.0,,,,,,,21,gl_toy,Y,N,N,N,Y,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,0.3,11.0,Pokemon Oversize Card Lot - 4 Oversize Promo C...,4.025675,7.098467,8.5,en_US,Pokemon,,,,,,,,,,,,,,,TOYS_AND_GAMES,6335,,,,,,,,,Pokemon,,,,,,,,722000000000.0,,,,,,,inches,0.19685,12.007874,0.050706,pounds,8.897638,,14-Feb-17,4-Mar-17,N,4-Mar-17,9,
3,1,1,B008KPZLEC,health_and_beauty,Health and Beauty,,,base_product,Base Product,,,12-Jul-12,USD,,,,,854000000000.0,,,228013.0,,37.23,,121,gl_drugstore,Y,N,N,N,Y,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,0.2,pounds,,1045.436025,15.782868,LifeSeasons Metabolism Weight Control - Natura...,1.0,0.2,10.503185,en_US,LifeSeasons,,,,,,,,,,,,,,,HEALTH_PERSONAL_CARE,1203,,,,,,,,,LifeSeasons,,,,,,,70.0,854000000000.0,SIZE,1.0,,,,,inches,2.2,4.7,0.25,pounds,2.3,,13-Jul-12,22-Mar-18,N,21-Mar-18,466,,1,1,B01M5AMNA1,miscellaneous,Misc.,,,base_product,Base Product,,,12-Oct-16,USD,,,,,815000000000.0,,,,,61.32,,194,gl_beauty,Y,N,N,N,Y,,N,N,,Y,N,N,,Y,,,1.0,,,,,,,,,,1045.436025,15.782868,Ouai Finishing Crème,1.0,0.264555,10.503185,en_US,,,,,,,,,,,,,,,,BEAUTY,3606,,,,,,,,,,,,,,,,,815000000000.0,,,,,,,inches,1.6,5.8,0.26,pounds,2.1,,13-Oct-16,8-Feb-18,N,8-Feb-18,157,
4,1,1,B0196BJHXY,kitchen,Kitchen,CRCF7,,base_product,Base Product,Red,,10-Dec-15,USD,,,,,48894060000.0,,,,,83.46,,79,gl_kitchen,Y,N,N,N,Y,,,N,,Y,N,N,,N,,,1.0,,,,,,,,,,14.8,9.4,Crock-pot SCCPVL610-R-A 6-Quart Programmable C...,1.0,7.098467,15.1,en_US,Crock-pot,,OSTE9,,,,SCCPVL610-R-A,,,,,,kitchen_ordering_channel,,,KITCHEN,6419,,,,,,,,OSTE9,Crock-pot,,,,,,,,48894060000.0,COLOR_NAME,8.0,,,,,inches,9.015748,15.511811,13.549611,pounds,15.0,,10-Dec-15,23-Jan-18,N,23-Jan-18,1399,,1,1,B00XCHMLI2,,,,,base_product,Base Product,,,8-May-15,USD,,,,,798000000000.0,,,,,92.93,,79,gl_kitchen,Y,N,N,N,Y,,,N,,Y,N,N,,N,,,1.0,,,,,,,,,,1045.436025,15.782868,Crock-Pot 7-qt. Slow Cooker (Turquoise),1.0,12.720673,10.503185,en_US,,,,,,,COMIN16JU032690,,,,,,,,,KITCHEN,6419,,,,,,,,,,,,,,,,,798000000000.0,COLOR_NAME,8.0,,,,,inches,9.5,14.3,12.7,pounds,14.2,,9-May-15,21-Mar-18,N,20-Mar-18,212,


Selecting numerical columns below.

In [14]:
X_train = training_features_df[['key_item_width', 'key_item_height', 'key_item_length', 'key_item_weight', 'key_item_package_quantity', 'key_pkg_height', 'key_pkg_length', 'key_pkg_weight', 'key_pkg_width',
                                'cand_item_width', 'cand_item_height', 'cand_item_length', 'cand_item_weight', 'cand_item_package_quantity', 'cand_pkg_height', 'cand_pkg_length', 'cand_pkg_weight', 'cand_pkg_width']].values
y_train = training_data['label'].values

___
## 5. Fitting the classifier

We will use the K Nearest Neighbors classifier from the scikit-learn library: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Things to do in this section:
* Initialize the classifier and fit it to your training data below.

In [15]:
# Enter your code here
from sklearn.neighbors import KNeighborsClassifier

print("Training data shape:", X_train.shape)
print("Training label shape:", y_train.shape)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)

Training data shape: (41400, 18)
Training label shape: (41400,)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [16]:
train_preds = knn.predict(X_train)

In [17]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_train, train_preds))

[[15337  4997]
 [ 3587 17479]]
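
To read the matrix above: scikit-learn puts true labels on the rows and predicted labels on the columns, so the entries are `[[TN, FP], [FN, TP]]`. The short sketch below derives accuracy, precision, and recall from the printed counts; the variable names are illustrative. The numbers are measured on the training data itself, so they are likely optimistic compared with what the leaderboard will report.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[15337, 4997],
               [3587, 17479]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # fraction of all pairs classified correctly
precision = tp / (tp + fp)       # of pairs predicted as substitutes, fraction correct
recall = tp / (tp + fn)          # of true substitutes, fraction found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```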


___
## 6. Getting test predictions

In this section, you will use your test_features and make predictions using our trained KNN classifier.
* Use the classifier's __predict()__ method (here, __knn.predict()__)

In [18]:
test_features_df = pd.DataFrame(test_features, columns=concat_columns)

In [19]:
# Enter your code here
X_test = test_features_df[['key_item_width', 'key_item_height', 'key_item_length', 'key_item_weight', 'key_item_package_quantity', 'key_pkg_height', 'key_pkg_length', 'key_pkg_weight', 'key_pkg_width',
                                'cand_item_width', 'cand_item_height', 'cand_item_length', 'cand_item_weight', 'cand_item_package_quantity', 'cand_pkg_height', 'cand_pkg_length', 'cand_pkg_weight', 'cand_pkg_width']].values

test_preds = knn.predict(X_test)

In [20]:
test_preds[0:5]

array([0, 0, 0, 1, 1])

In [21]:
len(test_preds)

4600

In [22]:
len(test_data)

4600

_____
## 7. Writing results to a CSV file for leaderboard submission:
Let's write our test predictions to a CSV file. You will submit this file to [Leaderboard using this link](https://leaderboard.corp.amazon.com/tasks/478/submit)

In [23]:
import pandas as pd

result_df = pd.DataFrame(columns=["ID", "label"])
result_df["ID"] = test_data["ID"].tolist()
result_df["label"] = test_preds
result_df.to_csv("results.csv", index=False)

## 8. Ideas for improvement:
* Tune the K parameter: You can use the train-test split function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and search for the optimum K value using your validation performance.
* Try different combinations of features instead of concatenating them in __getFeatures()__. For example, the item-wise difference.
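
The K search suggested in the first bullet can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic (generated with NumPy) so the block runs on its own, and the candidate K values, the 80/20 split, and the names `val_scores` and `best_k` are illustrative choices, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (assumption: swap in the real arrays)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Hold out 20% of the labeled data for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit one classifier per candidate K and record its validation accuracy
val_scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)
print(val_scores, "best K:", best_k)
```

After choosing K this way, it is common to refit on the full training set before predicting on the test data.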

In [24]:
# try item-wise difference
# try filtering out outliers
# use package group and such
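
A minimal sketch of the item-wise difference idea, reusing the example vectors from section 4. The helper name `difference_features` and the choice of the absolute difference are assumptions made for illustration; a signed difference or a ratio are equally plausible variants.

```python
import numpy as np

def difference_features(key_features, cand_features):
    """Return the element-wise absolute difference of two numeric feature vectors."""
    key = np.asarray(key_features, dtype=float)
    cand = np.asarray(cand_features, dtype=float)
    return np.abs(key - cand).tolist()

# Example vectors from section 4
key_asin_feature = [0.12, 2.5, 1, 4.2]
cand_asin_feature = [0.5, 0.1, 3.2, 2.75]

diff = difference_features(key_asin_feature, cand_asin_feature)
print(diff)
```

With difference features, a pair yields as many features as a single product rather than twice as many. This only works if the feature map holds numeric values, not full rows with text columns.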