# <a name="0">Machine Learning Lab</a>

Build a classifier to predict the __label__ field (substitute or not substitute) of the product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the following tasks in this notebook and submit your notebook via Colab:

1. <a href="#1">Read the datasets</a> (Given)
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation - Test Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)
<br>
<a href="https://propensity-labs-screening.s3.amazonaws.com/machine_learning/ml_data.zip">Download Dataset</a>

Then, we read the __training__ and __test__ datasets into dataframes.

In [4]:
import pandas as pd
import numpy as np


## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [9]:
# Implement EDA here
df=pd.read_csv('/training.csv')



In [10]:
df

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.118110,,18-Apr-13,14-Oct-17,N,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.100000,pounds,4.500000,,19-May-16,21-Mar-18,N,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,N,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,N,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.196850,,26-Jul-12,9-Mar-18,N,9-Mar-18,1253,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1,1,B0002ABA8E,consumer_electronics,Electronics,HEWL4,10.0,base_product,...,0.260000,pounds,5.100000,,9-Sep-16,21-Mar-18,N,20-Mar-18,60,
36799,16965,1,1,1,B000H46XQE,kitchen,Kitchen,CUIJ9,2.0,base_product,...,7.900000,pounds,12.500000,,6-Apr-13,30-May-17,N,29-May-17,298,
36800,50014,1,1,1,B01HFRC7UQ,miscellaneous,Misc.,,,base_product,...,7.000000,pounds,,,2-Nov-16,17-Jun-17,N,17-Jun-17,13,
36801,42674,1,1,1,B001T0HHDS,health_and_beauty,Health and Beauty,O3S14,12.0,base_product,...,3.000000,pounds,11.700000,,4-Jan-11,15-Nov-17,N,14-Nov-17,618058,


In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
df.shape

(36803, 228)

In [13]:
df.isnull()

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,False,False,False,False,True
1,False,False,False,False,False,False,False,True,True,False,...,False,False,False,True,False,False,False,False,False,True
2,False,False,False,False,False,False,False,True,True,False,...,False,False,False,True,False,False,False,False,False,True
3,False,False,False,False,False,False,False,True,True,False,...,False,False,False,True,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True
36799,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True
36800,False,False,False,False,False,False,False,True,True,False,...,False,False,True,True,False,False,False,False,False,True
36801,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True


In [14]:
# dropna returns a new DataFrame, so assign the result to keep the filtered rows
df = df.dropna(subset=['label'])
df.shape

(36803, 228)
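
The full boolean mask returned by `df.isnull()` is hard to read for a frame this wide. A per-column summary is usually easier to act on. The following is a minimal sketch on a made-up toy frame (the column names mirror the real data, the values are invented for illustration); the same calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; values are invented for illustration.
toy = pd.DataFrame({
    "label": [0, 1, 1, 0],
    "key_item_height": [1.0, np.nan, 0.83, 9.33],
    "cand_item_height": [0.0, 0.3, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Class balance of the target column.
label_counts = toy["label"].value_counts()

# Summary statistics (count, mean, std, quartiles) of the numeric columns.
summary = toy.describe()

print(missing_fraction)
print(label_counts)
print(summary)
```

Columns that are almost entirely missing are candidates to drop before modelling, and a skewed `label_counts` suggests using a stratified split later.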

### 2.2 <a name="22">Select features to build the model</a>

For a quick start, we recommend using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__. Feel free to explore other fields from the metadata-dataset.xlsx file.


In [15]:
# Implement here
# Numerical features suggested above, taken for both the key_ and cand_ products
base_features = ['item_package_quantity', 'item_height', 'item_width', 'item_length', 'item_weight',
                 'pkg_height', 'pkg_width', 'pkg_length', 'pkg_weight']
feature_cols = [f'{prefix}_{name}' for prefix in ('key', 'cand') for name in base_features]
new_df = df[['ID', 'label'] + feature_cols].copy()

In [16]:
new_df

Unnamed: 0,ID,label,key_item_package_quantity,key_item_height,key_item_width,key_item_length,key_item_weight,key_pkg_height,key_pkg_width,key_pkg_length,key_pkg_weight,cand_item_package_quantity,cand_item_height,cand_item_width,cand_item_length,cand_item_weight,cand_pkg_height,cand_pkg_width,cand_pkg_length,cand_pkg_weight
0,34016,0,1.0,1.00,66.00,86.00,6.000000,10.0,15.0,20.0,6.300000,1.0,0.00000,18.00000,40.0000,0.530000,1.574803,5.118110,18.110236,0.529104
1,3581,0,6.0,2.00,0.10,2.50,,0.2,4.0,4.8,0.022046,1.0,0.30000,4.50000,6.7500,0.110231,0.300000,4.500000,6.750000,0.100000
2,36025,1,1.0,0.83,2.24,5.94,0.789375,2.1,4.6,7.2,1.050000,1.0,0.86614,3.62204,2.3622,0.396832,2.007874,3.937008,5.236220,0.654773
3,42061,1,1.0,,,,,,,,,1.0,2.36000,20.29000,10.2400,3.480000,2.401575,10.314961,20.590551,3.549442
4,14628,1,1.0,9.33,7.50,2.75,0.438000,0.2,7.5,9.2,0.250000,1.0,8.50000,9.87500,11.7500,,1.102362,5.196850,7.874016,0.396832
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1.0,5.70,3.19,0.60,0.500000,1.3,6.2,10.0,0.650364,1.0,2.00000,5.10000,6.4000,0.264555,2.000000,5.100000,6.400000,0.260000
36799,16965,1,1.0,7.00,11.06,11.88,0.881849,7.9,11.5,12.4,5.800000,1.0,9.50000,12.37000,13.1200,,9.500000,12.500000,12.800000,7.900000
36800,50014,1,,,,,,5.9,8.7,13.4,3.300000,,,,,7.000000,,,,7.000000
36801,42674,1,1.0,10.00,3.50,8.50,1.800000,4.1,9.8,11.9,2.599250,1.0,11.50000,13.00000,3.7500,2.906250,4.000000,11.700000,12.500000,3.000000


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets; however, the test dataset is missing the labels. The goal of the project is to predict these labels.

To produce a validation set for evaluating model performance before submitting, split the training dataset into train and validation parts. The validation data you get here will be used later in section 3 to tune your classifier.

In [47]:
# Convert all columns to strings
new_df_str = new_df.astype(str)

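# Concatenate columns into one string per row
# (note: the values are joined without a separator, so adjacent numbers run together)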
concatenated_cols = new_df_str['key_item_package_quantity'] + \
                    new_df_str['key_item_height'] + \
                    new_df_str['key_item_width'] + \
                    new_df_str['key_item_length'] + \
                    new_df_str['key_item_weight'] + \
                    new_df_str['key_pkg_height'] + \
                    new_df_str['key_pkg_width'] + \
                    new_df_str['key_pkg_length'] + \
                    new_df_str['key_pkg_weight'] + \
                    new_df_str['cand_item_package_quantity'] + \
                    new_df_str['cand_item_height'] + \
                    new_df_str['cand_item_width'] + \
                    new_df_str['cand_item_length'] + \
                    new_df_str['cand_item_weight'] + \
                    new_df_str['cand_pkg_height'] + \
                    new_df_str['cand_pkg_width'] + \
                    new_df_str['cand_pkg_length'] + \
                    new_df_str['cand_pkg_weight']

# Use TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(concatenated_cols)


In [64]:
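# Hold out 25% of the training data for validation
# (X_test and y_test below are this validation split, not the public test set)
X_train, X_test, y_train, y_test = train_test_split(X, new_df['label'], test_size=0.25, random_state=30)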
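
When the two classes are not equally frequent, a stratified split keeps the class proportions the same in the training and validation parts. The following is a minimal sketch on a made-up numeric toy frame; `X_num`, `X_tr`, `X_val`, `y_tr` and `y_val` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two numeric features and a 40/60 label split (invented values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "key_item_height": rng.random(100),
    "cand_item_height": rng.random(100),
    "label": [0] * 40 + [1] * 60,
})

X_num = toy[["key_item_height", "cand_item_height"]]
y = toy["label"]

# stratify=y preserves the 40/60 class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_num, y, test_size=0.25, stratify=y, random_state=30
)

print(len(X_tr), len(X_val), y_val.mean())
```

Splitting the raw numeric columns, rather than an already transformed matrix, also means any imputation or scaling can be fitted on the training part only.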

### 2.4 <a name="24">Data processing with Pipeline</a>

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train the classifier on the imputed and scaled dataset.
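
One way such a pipeline could look is sketched here on a synthetic numeric matrix with missing entries. The median imputation strategy, the `max_iter` value and the step names (`imputer`, `scaler`, `clf`) are assumptions chosen for illustration, not requirements of the assignment.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (invented data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # knock out roughly 10% of entries

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # zero mean, unit variance
    ("clf", LogisticRegression(max_iter=1000)),     # final classifier
])

# fit() learns the medians and scaling statistics from the training data only;
# predict() and score() reuse them on whatever data is passed in.
pipeline.fit(X_toy, y_toy)
train_accuracy = pipeline.score(X_toy, y_toy)
print(f"Training accuracy on the toy data: {train_accuracy:.3f}")
```

Because the preprocessing steps live inside the pipeline, the same fitted object can be applied to the validation and test features without refitting anything.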


In [None]:
# Implement here


## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train the classifier on the training split and tune it using the validation split.

In [65]:
# Implement here
model = LogisticRegression()
model.fit(X_train, y_train)

In [66]:
y_pred = model.predict(X_test)

# Calculate accuracy on the held-out validation split
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.5943919139223998
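
Tuning can be done with a cross-validated grid search over the regularization strength `C`. The sketch below uses synthetic data; the grid values, `cv=5` and the step names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the numeric product features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] - X_toy[:, 2] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the data passed to `fit` with the best parameters, and it can be used for prediction directly.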


## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. Test accuracy will be displayed upon a valid submission to the leaderboard.

In [41]:
# Implement here

# Get test data to test the classifier
# ! test data should come from public_test_features.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...
x_public_test=pd.read_csv('/public_test_features.csv')


In [53]:
x_public_test.shape

(15774, 227)

In [51]:
# Convert all columns to strings
n_df_str = x_public_test.astype(str)

# Concatenate columns
concatenated_col = n_df_str['key_item_package_quantity'] + \
                    n_df_str['key_item_height'] + \
                    n_df_str['key_item_width'] + \
                    n_df_str['key_item_length'] + \
                    n_df_str['key_item_weight'] + \
                    n_df_str['key_pkg_height'] + \
                    n_df_str['key_pkg_width'] + \
                    n_df_str['key_pkg_length'] + \
                    n_df_str['key_pkg_weight'] + \
                    n_df_str['cand_item_package_quantity'] + \
                    n_df_str['cand_item_height'] + \
                    n_df_str['cand_item_width'] + \
                    n_df_str['cand_item_length'] + \
                    n_df_str['cand_item_weight'] + \
                    n_df_str['cand_pkg_height'] + \
                    n_df_str['cand_pkg_width'] + \
                    n_df_str['cand_pkg_length'] + \
                    n_df_str['cand_pkg_weight']

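# Reuse the vectorizer fitted on the training data. Fitting a second
# TfidfVectorizer on the test text builds a different vocabulary (16948
# features instead of the 26078 the model was trained on), and
# model.predict then raises:
# ValueError: X has 16948 features, but LogisticRegression is expecting 26078 features as input.
p = vectorizer.transform(concatenated_col)


In [54]:
p.shape

(15774, 26078)

In [52]:
y_predS = model.predict(p)
x_public_test.loc[:, "label"] = y_predS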
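
A fitted vectorizer defines the feature space the model was trained on, so new data has to go through `transform` on that same object rather than through a freshly fitted vectorizer. The toy sketch below (invented documents, illustrative variable names) shows how the column counts diverge when a second vectorizer is fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents, for illustration only.
train_docs = ["red apple", "green apple", "yellow banana"]
test_docs = ["green banana", "purple grape"]

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vectorizer.transform(test_docs)        # reuses that vocabulary

# A second vectorizer fitted on the test text learns its own vocabulary,
# so its output has a different number of columns.
refit = TfidfVectorizer().fit_transform(test_docs)

print(X_train_toy.shape, X_test_toy.shape, refit.shape)
```

Words that were never seen during fitting (here `purple` and `grape`) are simply ignored by `transform`, which keeps the column count aligned with what the classifier expects.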