<h1>Importing Libraries</h1>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install mlxtend

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.models import Model
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
#from mlxtend.classifier import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from tqdm import tqdm
import random
import lightgbm as lgb
import pandas as pd
import numpy as np
import pickle
import os

<h1>Importing Train Test Files</h1>

In [4]:
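# Load the pickled train and test feature frames; the with-blocks close the files after reading
with open('/content/drive/MyDrive/instacart-market-basket-analysis/final_data_trainn.pkl','rb') as f:
    train_data = pickle.load(f)
with open('/content/drive/MyDrive/instacart-market-basket-analysis/final_data_test.pkl','rb') as f:
    test_data = pickle.load(f)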

<h1>Data & Result Split</h1>

In [5]:
X = train_data.drop('reordered', axis=1)
Y = train_data['reordered'].values

In [6]:
# Replacing the infinite values with NaN first and then with the mean value.

X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.fillna(X.mean(), inplace=True)
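To make the effect of the two calls above concrete, here is a minimal, library-free sketch of the same idea on a single column: non-finite entries are treated as missing and replaced by the mean of the finite ones. The helper name `fill_non_finite_with_mean` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math


def fill_non_finite_with_mean(values):
    """Replace inf, -inf and NaN entries with the mean of the finite entries."""
    finite = [v for v in values if math.isfinite(v)]
    # With no finite value there is nothing to average, so the fill value stays NaN
    mean = sum(finite) / len(finite) if finite else float("nan")
    return [v if math.isfinite(v) else mean for v in values]


# Toy column with one +inf, one -inf and one NaN
column = [1.0, float("inf"), 3.0, float("-inf"), float("nan"), 5.0]
cleaned = fill_non_finite_with_mean(column)
print(cleaned)
```

The mean is taken over the finite values only, which mirrors how a column mean that skips missing values behaves once the infinities have been turned into NaN.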

<h1>Preprocessing Data</h1>

In [7]:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
def normalize(df):
    # Scales every column to the [0, 1] range with a min-max scaler.
    # The scaler is refit on each column of whichever dataframe is passed in.

    result1 = df.copy()
    for feature_name in df.columns:
        array_values = np.asarray(df[feature_name].values)
        array = array_values.reshape(-1,1)
        minmax.fit(array)
        scaled = minmax.transform(array)
        result1[feature_name] = scaled
    return result1

processed_x = normalize(X)
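Note that `normalize` refits the scaler on whichever dataframe it receives, so calling it later on the Kaggle test frame scales that frame by its own minima and maxima rather than by those of the training data. A common alternative is to learn the per-column range once and reuse it. The sketch below illustrates that idea in plain Python; `fit_min_max` and `apply_min_max` are hypothetical helpers, and the numbers are toy values.

```python
def fit_min_max(rows):
    """Learn a (min, max) pair for every column of a list of equal-length rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (low, high) in zip(row, params):
            span = high - low
            # A constant column has zero span; map it to 0.0 to avoid dividing by zero
            scaled_row.append((value - low) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


train_rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test_rows = [[2.0, 40.0]]

params = fit_min_max(train_rows)                 # learned on training rows only
scaled_train = apply_min_max(train_rows, params)
scaled_test = apply_min_max(test_rows, params)   # may fall outside [0, 1]
print(scaled_train, scaled_test)
```

With scikit-learn the same effect is usually obtained by calling `fit` on the training frame once and `transform` on every other frame.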

<h1>Train Test Split</h1>

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(processed_x, Y, stratify = Y, test_size=0.2, random_state=14)

print("Train size", X_train.shape, Y_train.shape)
print("Test size", X_test.shape, Y_test.shape)

Train size (3024724, 25) (3024724,)
Test size (756181, 25) (756181,)


<h1>Building Models</h1>

**Note**

1. There are many models, and making a Kaggle submission with each of them is time-consuming.

2. So I will look at the validation F1-score of each model.

3. The models will be compared on that score, and the one with the best validation F1-score will be picked.

4. That model will then be used to predict on the real test data and to create the submission file.

<h2> 1. Logistic Regression </h2>

<h3> Tuning Logistic Regression</h3>

In [None]:
C = [0.001,1,100]
for i in C:
  lr_temp = LogisticRegression(C=i)
  lr_temp.fit(X_train, Y_train)
  train_predict = (lr_temp.predict_proba(X_train)[:,1]>=0.1).astype('int')
  test_predict = (lr_temp.predict_proba(X_test)[:,1]>=0.1).astype('int')

  f1_train = f1_score(train_predict, Y_train)
  f1_test = f1_score(test_predict, Y_test)

  print("Train F1 Score At C = {0} is {1}".format(i, f1_train))
  print("Test F1 Score At C = {0} is {1}".format(i, f1_test))


Train F1 Score At C = 0.001 is 0.3069966414255283
Test F1 Score At C = 0.001 is 0.3067085709851395
Train F1 Score At C = 1 is 0.31092603218627923
Test F1 Score At C = 1 is 0.3101801068610453
Train F1 Score At C = 100 is 0.31089136789169575
Test F1 Score At C = 100 is 0.3101805781174407


<h3>Logistic Regression With Best Parameter</h3>

In [None]:
lr = LogisticRegression(C=1)
lr.fit(X_train, Y_train)

train_pred = (lr.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
test_pred = (lr.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

f1_train = f1_score(train_pred, Y_train)
f1_test = f1_score(test_pred, Y_test)

print("Train F1 Score:", f1_train)
print("Validation F1 Score:", f1_test)

Train F1 Score: 0.31092603218627923
Validation F1 Score: 0.3101801068610453
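Every model in this notebook is scored by thresholding the predicted probability (0.1 in most cells) and computing F1. The sketch below, in plain Python, shows how F1 follows from the true positive, false positive and false negative counts, and how a few candidate thresholds could be compared on the validation split. The function names and the toy labels and probabilities are assumptions made for illustration only.

```python
def binary_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def apply_threshold(probabilities, threshold):
    """Turn probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def best_threshold(y_true, probabilities, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    scored = [(binary_f1(y_true, apply_threshold(probabilities, t)), t)
              for t in candidates]
    best_score, best_t = max(scored)
    return best_t, best_score


# Toy validation labels and predicted probabilities (not real results)
y_valid = [1, 0, 0, 1, 0, 0]
p_valid = [0.40, 0.15, 0.05, 0.22, 0.30, 0.08]

threshold, score = best_threshold(y_valid, p_valid, [0.1, 0.2, 0.3])
print(threshold, round(score, 3))
```

The formula also explains why the cells here get the same value although they pass the predictions first and the labels second, while scikit-learn's signature is `f1_score(y_true, y_pred)`: swapping the two arguments only exchanges false positives with false negatives, which leaves binary F1 unchanged.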


<h2> 2. Decision Tree </h2>

<h3>Tuning Decision Tree</h3>

In [None]:
depth = [3, 5, 7]
splits = [50,100,300]

for i in depth:
    for j in splits:
        dtree = DecisionTreeClassifier(max_depth = i, min_samples_split = j)
        dtree.fit(X_train,Y_train)

        train_pred = (dtree.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
        test_pred = (dtree.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

        f1_train = f1_score(train_pred, Y_train)
        f1_test = f1_score(test_pred, Y_test)

        print("Train F1 Score At Depth {0} & Minimum Split {1} is {2}".format(i,j, f1_train))
        print("Test F1 Score At Depth {0} & Minimum Split {1} is {2}".format(i,j, f1_test))

Train F1 Score At Depth 3 & Minimum Split 5 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 5 is 0.2747626874686363
Train F1 Score At Depth 3 & Minimum Split 20 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 20 is 0.2747626874686363
Train F1 Score At Depth 3 & Minimum Split 50 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 50 is 0.2747626874686363
Train F1 Score At Depth 3 & Minimum Split 100 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 100 is 0.2747626874686363
Train F1 Score At Depth 3 & Minimum Split 300 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 300 is 0.2747626874686363
Train F1 Score At Depth 3 & Minimum Split 500 is 0.2743777273744212
Test F1 Score At Depth 3 & Minimum Split 500 is 0.2747626874686363
Train F1 Score At Depth 5 & Minimum Split 5 is 0.30772668137912546
Test F1 Score At Depth 5 & Minimum Split 5 is 0.3066342812422703
Train F1 Score At Depth 5 & Minimum Split 20 is 0.3077266813791254

<h3>Decision Tree With Best Parameter</h3>

In [None]:
dtree_tuned = DecisionTreeClassifier(max_depth=10, min_samples_split = 500, random_state=21)
dtree_tuned.fit(X_train, Y_train)
train_pred = (dtree_tuned.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
test_pred = (dtree_tuned.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

f1_train = f1_score(train_pred, Y_train)
f1_test = f1_score(test_pred, Y_test)

print("Train F1 Score:", f1_train)
print("Test F1 Score:", f1_test)

Train F1 Score: 0.32804105909439757
Test F1 Score: 0.32022944099562056


<h2> 3. Random Forest </h2>

<h3>Tuning Random Forest</h3>

In [None]:
rf= RandomForestClassifier()
n_estimators = [50,100,300]
max_depth = [3,5,7]
for i in n_estimators:
    print("Estimator :",i)
    for j in max_depth:
        print("Depth :",j)
        rf_tuned = RandomForestClassifier(max_depth=j, n_estimators = i)
        rf_tuned.fit(X_train, Y_train)

        train_pred = (rf_tuned.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
        test_pred = (rf_tuned.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

        f1_train = f1_score(train_pred, Y_train)
        f1_test = f1_score(test_pred, Y_test)

        print("Train F1 Score:", f1_train)
        print("Test F1 Score:", f1_test)
    print("_"*40)   

Estimator : 10
Depth : 3
Train F1 Score: 0.28109909451122816
Test F1 Score: 0.281397527999643
Depth : 5
Train F1 Score: 0.29492718425918824
Test F1 Score: 0.2939154114543443
Depth : 7
Train F1 Score: 0.3147034570744003
Test F1 Score: 0.31293574487599
Depth : 10
Train F1 Score: 0.32664776197706547
Test F1 Score: 0.31899101853621253
________________________________________
Estimator : 50
Depth : 3
Train F1 Score: 0.28458758393848205
Test F1 Score: 0.2852084022642117
Depth : 5
Train F1 Score: 0.3077378578593095
Test F1 Score: 0.30680448166663987
Depth : 7
Train F1 Score: 0.3153004063429908
Test F1 Score: 0.31369295892577487
Depth : 10
Train F1 Score: 0.3287738816658795
Test F1 Score: 0.3211972814023204
________________________________________
Estimator : 100
Depth : 3
Train F1 Score: 0.2914878514167378
Test F1 Score: 0.2908020603384842
Depth : 5
Train F1 Score: 0.3092073921646533
Test F1 Score: 0.3081204228516472
Depth : 7
Train F1 Score: 0.31470620392352866
Test F1 Score: 0.3129934607920

<h3>Random Forest With Best Parameter</h3>

In [None]:
rf_tuned = RandomForestClassifier(max_depth=7, n_estimators = 500)
rf_tuned.fit(X_train, Y_train)

train_pred = (rf_tuned.predict_proba(X_train)[:, 1] >= 0.21).astype('int')
test_pred = (rf_tuned.predict_proba(X_test)[:, 1] >= 0.21).astype('int')

f1_train = f1_score(train_pred, Y_train)
f1_test = f1_score(test_pred, Y_test)

print("Train F1 Score:", f1_train)
print("Test F1 Score:", f1_test)

<h2> 4. Light GBM</h2>

<h3>Tuning Light GBM</h3>

In [9]:
# GBDT was too slow for hyper-parameter tuning, so it is not used here.
# Light GBM is a much faster gradient boosting library, so it is used instead of GBDT.

estimator = [10,50,100,300,500]
depths = [3,5,7,10]

for i in depths:
    for j in estimator:
        lgb_model = lgb.LGBMClassifier(boosting_type='gbdt', n_estimators=j, max_depth = i)
        lgb_model.fit(X_train,Y_train)

        train_pred = (lgb_model.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
        test_pred = (lgb_model.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

        f1_train = f1_score(train_pred, Y_train)
        f1_test = f1_score(test_pred, Y_test)

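        # Note: j is n_estimators here; the "Minimum Split" label is carried over from the decision tree cell
        print("Train F1 Score At Depth {0} & Minimum Split {1} is {2}".format(i,j, f1_train))
        print("Test F1 Score At Depth {0} & Minimum Split {1} is {2}".format(i,j, f1_test))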
    print("_"*35)

Train F1 Score At Depth 3 & Minimum Split 10 is 0.3020379760532217
Test F1 Score At Depth 3 & Minimum Split 10 is 0.30175365468782567
Train F1 Score At Depth 3 & Minimum Split 50 is 0.3060937913503412
Test F1 Score At Depth 3 & Minimum Split 50 is 0.3066379000359583
Train F1 Score At Depth 3 & Minimum Split 100 is 0.3089156539023571
Test F1 Score At Depth 3 & Minimum Split 100 is 0.3089804391853948
Train F1 Score At Depth 3 & Minimum Split 300 is 0.3106864493468638
Test F1 Score At Depth 3 & Minimum Split 300 is 0.31026946183497045
Train F1 Score At Depth 3 & Minimum Split 500 is 0.3115562967268355
Test F1 Score At Depth 3 & Minimum Split 500 is 0.310597299719564
___________________________________
Train F1 Score At Depth 5 & Minimum Split 10 is 0.30899596895388987
Test F1 Score At Depth 5 & Minimum Split 10 is 0.308244640809756
Train F1 Score At Depth 5 & Minimum Split 50 is 0.3104333163591114
Test F1 Score At Depth 5 & Minimum Split 50 is 0.3095952705775352
Train F1 Score At Depth 5 
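The tuning loops in this notebook print every score and leave picking the winner to the reader. One possible way to keep track of the best combination programmatically is sketched below. `grid_search` and `toy_validation_score` are hypothetical names; the scoring function is a stand-in for "fit the model and return its validation F1" and does not reproduce any result from this notebook.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(max_depth, n_estimators):
    """Stand-in scorer with made-up numbers; replace with real fit-and-score code."""
    return 0.30 + 0.001 * max_depth + 0.00001 * n_estimators


best_params, best_score = grid_search(
    toy_validation_score,
    {"max_depth": [3, 5, 7, 10], "n_estimators": [10, 50, 100, 300, 500]},
)
print(best_params, best_score)
```

Keeping the best parameters in a variable avoids copying values by hand from the printed log into the "best parameter" cell.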

<h3>Light GBM With Best Parameters</h3>

In [10]:
lgb_tuned = lgb.LGBMClassifier(boosting_type='gbdt', n_estimators=500, max_depth=10, random_state=21)

lgb_tuned.fit(X_train, Y_train)

train_pred = (lgb_tuned.predict_proba(X_train)[:, 1] >= 0.1).astype('int')
test_pred = (lgb_tuned.predict_proba(X_test)[:, 1] >= 0.1).astype('int')

f1_train = f1_score(train_pred, Y_train)
f1_test = f1_score(test_pred, Y_test)

print("Train F1 Score:", f1_train)
print("Test F1 Score:", f1_test)

Train F1 Score: 0.31927604671064946
Test F1 Score: 0.3115880451755819


In [None]:
pickle.dump((predicted_df),open('/content/drive/MyDrive/instacart-market-basket-analysis/base_model_predicted_df.pkl','wb'))
pickle.dump((models),open('/content/drive/MyDrive/instacart-market-basket-analysis/base_models.pkl','wb'))

<h2>Feature Importance</h2>

In [None]:
lgb.plot_importance(lgb_tuned)

**Observation**

1. Every model gives almost the same results on both train and validation data.

2. Light GBM still has slightly better results.

3. So the final prediction data will be created with Light GBM.

<h2>Function To Predict On Test Data </h2>

In [None]:
test_data.replace([np.inf, -np.inf], np.nan, inplace=True)
test_data.fillna(X.mean(), inplace=True)

final_test = normalize(test_data)

In [None]:
# This function predicts on the test data with the given model and attaches the order_id of each user's test order

def predict_final(data_point, model):
  new_data = data_point.copy()
  new_data['prediction'] = (model.predict_proba(data_point)[:, 1] >= 0.1).astype('int')
  new_data = new_data.reset_index()

  final_test_data = new_data[['product_id', 'user_id', 'prediction']]

  # Get the user_id and order_id pairs which belong to the "test" set
  orders = pd.read_csv('/content/drive/MyDrive/instacart-market-basket-analysis/orders.csv')
  test_from_order = orders.loc[orders['eval_set']=='test',['user_id','order_id']]

  # Merge the predictions with the real test data on user_id, which gives us 75000 test points
  predicted_results = pd.merge(final_test_data, test_from_order, how='left', on='user_id')
  predicted_results = predicted_results.drop('user_id',axis=1)
  # Number of columns that contain nulls; 0 means every row was matched to an order_id
  print(predicted_results.isnull().any().sum())

  return predicted_results
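The left merge in `predict_final` attaches to every predicted (user, product) row the `order_id` of that user's test order. As a rough illustration of what that join does, the plain-Python sketch below uses a dictionary lookup in place of `pd.merge`. The helper names and all ids are invented for the example.

```python
def attach_order_ids(predictions, user_to_order):
    """Left-join predictions to orders on user_id, then drop user_id."""
    joined = []
    for row in predictions:
        joined.append({
            "product_id": row["product_id"],
            "prediction": row["prediction"],
            # A user without a test order gets None, like a null after a left merge
            "order_id": user_to_order.get(row["user_id"]),
        })
    return joined


def count_unmatched(rows):
    """Number of rows that did not find an order_id."""
    return sum(1 for row in rows if row["order_id"] is None)


# Toy inputs: user 3 has no test order
predictions = [
    {"user_id": 1, "product_id": 196, "prediction": 1},
    {"user_id": 1, "product_id": 250, "prediction": 0},
    {"user_id": 3, "product_id": 196, "prediction": 1},
]
user_to_order = {1: 9001, 2: 9002}

joined = attach_order_ids(predictions, user_to_order)
print(joined, count_unmatched(joined))
```

A non-zero count of unmatched rows would indicate that some predicted users have no order in the test set, which is what the null check printed by `predict_final` is meant to catch.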

<h2>Creating the submission file with the final dataframe</h2>

In [None]:
# final_predict_function builds a dictionary {order_id: "product ids separated by spaces"}.
# create_submission_file takes the final dataframe and turns that dictionary into the submission file.

def final_predict_function(final_df):
    user_product = dict()

    for row in tqdm(final_df.itertuples()):
        if row.prediction == 1:
            if row.order_id in user_product:
                user_product[row.order_id] += ' ' + str(row.product_id)
            else:
                user_product[row.order_id] = str(row.product_id)

    # Orders without any predicted product get the string 'None'
    for order in final_df.order_id:
        if order not in user_product:
            user_product[order] = 'None'
    return user_product

def create_submission_file(final_df):
    user_product = final_predict_function(final_df)

    sub = pd.DataFrame.from_dict(user_product, orient='index')
    #Reset index
    sub.reset_index(inplace=True)
    #Set column names
    sub.columns = ['order_id', 'products']

    return sub
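The grouping step above can also be written as a single pass that collects product ids in a list per order and joins them at the end. The sketch below is one possible variant under that assumption; `build_order_products` is a hypothetical name, the rows are toy data, and the resulting key order may differ from the dictionary built above.

```python
def build_order_products(rows):
    """Map each order_id to a space-separated string of predicted product ids.

    rows: iterable of (order_id, product_id, prediction) tuples.
    Orders with no positive prediction map to the string 'None'.
    """
    products_by_order = {}
    for order_id, product_id, prediction in rows:
        # setdefault registers the order even when nothing is predicted for it
        bucket = products_by_order.setdefault(order_id, [])
        if prediction == 1:
            bucket.append(str(product_id))
    return {
        order_id: " ".join(items) if items else "None"
        for order_id, items in products_by_order.items()
    }


# Toy rows: order 101 has two predicted products, order 202 has none
rows = [(101, 7, 1), (101, 9, 0), (101, 12, 1), (202, 7, 0)]
order_products = build_order_products(rows)
print(order_products)
```

Collecting into lists and joining once avoids repeated string concatenation and removes the need for the second loop over all order ids.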

<h2>Prediction & Submission File By Each Model</h2>

<h3>1. Logistic Regression</h3>

In [None]:
lr_predicted_df = predict_final(final_test, lr)
lr_submission_file = create_submission_file(lr_predicted_df)
lr_submission_file.to_csv('/content/drive/MyDrive/instacart-market-basket-analysis/submission files/logistic_regression_submission.csv',index=False, header=True)

In [None]:
print("F1 Score on Test Data From Kaggle 0.359470")

<h3>2. Decision Tree</h3>

In [None]:
dtree_predicted_df = predict_final(final_test, dtree_tuned)
dtree_submission_file = create_submission_file(dtree_predicted_df)

dtree_submission_file.to_csv('/content/drive/MyDrive/instacart-market-basket-analysis/submission files/Decision_Tree_Submission.csv', index=False, header=True)

In [None]:
print("F1 Score on Test Data From Kaggle 0.366460")

<h3>3. Random Forest</h3>

In [None]:
random_forest_predicted_df = predict_final(final_test, rf_tuned)
random_forest_submission_file = create_submission_file(random_forest_predicted_df)

random_forest_submission_file.to_csv('/content/drive/MyDrive/instacart-market-basket-analysis/submission files/random_forest_submission.csv', index=False, header=True)

In [None]:
print("F1 Score on Test Data From Kaggle 0.366450")

<h3>4. Light GBM</h3>

In [None]:
lgb_predicted_df = predict_final(final_test, lgb_tuned)
lgb_submission_file = create_submission_file(lgb_predicted_df)

lgb_submission_file.to_csv('/content/drive/MyDrive/instacart-market-basket-analysis/submission files/light_gbm_submission.csv', index=False, header=True)

In [None]:
print("F1 Score on Test Data From Kaggle 0.363680")

<h1>Comparing Results Of Each Model</h1>

In [None]:
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Model", "Train F1 Score", "Validation F1 Score", "Kaggle Test F1 Score"]
table.add_row(["Logistic Regression", 0.414431, 0.415396, 0.359470])
table.add_row(["Decision Tree", 0.425405, 0.423107, 0.366460])           
table.add_row(["Random Forest", 0.425082, 0.422441, 0.366450])
table.add_row(["Light GBM", 0.432666, 0.428093, 0.363680])
table.add_row(["Stacked Model",  0.298419, 0.294584, 0.255930])
table.add_row(["Meta Classifier",  0.294019, 0.295305, 0.28001])
table.add_row(["Multi Level Perceptron",  0.241301, 0.081301, 0.063320])

print(table) 

<h1> Observation</h1>

The Decision Tree is the best model, with a Kaggle test F1-score of 0.36646.

So the product dataframe generated by the Decision Tree will be used.

<h1>Final Predict Function</h1>

1. The task is to predict which products are going to be reordered within a given order id.

2. The final predict function can therefore never expect a new order id or a new user. If the order id or user is new, no product has been ordered before, so nothing can be reordered.

3. This can be seen as a cold-start problem.

4. For my final predict function I first tried the approach below.
I took all the predictions from train, validation and test, so that when an order id comes in and we already have it, the function returns the list of all products which can be reordered.

5. That approach had an issue. I tried to concatenate the train, validation and test predictions and build a dictionary of order ids and product ids, but it took a huge amount of time. Even when I left it running overnight, the session crashed or disconnected by itself. So I can't use that approach.

6. The time taken last night was
13094627it [9:57:59, 127.73it/s]

7. As an alternative, I have taken just the test data, and the predict function works on top of that.

8. Since the first approach was not working, I removed its code to keep the notebook clean.

<h1>Function For Prediction</h1>

In [None]:
user_product = final_predict_function(lgb_predicted_df)

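def pred(order_id, prediction_dict = user_product):
    # Look the order id up in the dictionary that was passed in; unknown ids get an explanatory message
    product_list = prediction_dict.get(order_id,"The order id does not exist! The user is new so no point of product being reordered")
    return product_list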


In [None]:
products = pd.read_csv('/content/drive/MyDrive/instacart-market-basket-analysis/products.csv')

In [None]:
products.head(3)

In [None]:
test_point_success = pred(2897111)
test_point_fail = pred(2897)

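# product_id is numeric in products.csv, so the ids from the prediction string are converted to int
product_ids = list()
for i in test_point_success.split(" "):
    if i != 'None':
        product_ids.append(int(i))

product_list = products[products['product_id'].isin(product_ids)]['product_name']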

print("Valid Case")
print('Products to be re-ordered are: ')
print(product_list)
print("_"*80)
print("Invalid Case", test_point_fail)