# [실습] e-commerce 데이터를 활용한 정시 배송 여부 판단

**[모델링 체크리스트]**

1. 어떠한 지도학습이 적합한가? (분류 vs 회귀)
2. 선택한 지도학습 모델 중에서 3가지 이상을 골라서 성능을 비교해보자. 
3. 성능을 확인하고 어떠한 모델을 사용했을 때 예측이 잘되었는지 평가지표와 함께 기록해보자.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1. 실습 데이터 (pd.read_csv 로 로드)
    - 데이터 출처 :  https://www.kaggle.com/datasets/prachi13/customer-analytics
    - 데이터 설명 : e-commerce 배송의 정시 도착여부 (1: 정시배송 0 : 정시미배송)
    
    1) x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv
    
    2) y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv
    
    3) x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv
    
    4)x_label(평가용) : https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_test.csv
    

→ 데이터가 train-test로 나누어져 있으므로 train_test_split 메서드는 train데이터를 train-validation 데이터로 나누어 모델링하는데 사용하세요

→ test 데이터는 성능을 측정하기 위한 고정된 데이터로 활용하고, train데이터와 병합하지 않습니다

In [2]:
x_trn_org = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv')
y_trn_org = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv')

In [3]:
x_train = x_trn_org.drop(['ID'], axis=1)
y_train = y_trn_org.drop(['ID'], axis=1)

In [4]:
x_train.head()

Unnamed: 0,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms
0,A,Flight,4,3,266,5,high,F,5,1590
1,F,Ship,3,1,174,2,low,M,44,1556
2,F,Road,4,1,154,10,high,M,10,5674
3,F,Ship,4,3,158,3,medium,F,27,1207
4,A,Flight,5,3,175,3,low,M,7,4833


In [5]:
x_train['Mode_of_Shipment'].value_counts()

Mode_of_Shipment
Ship      4512
Flight    1066
Road      1020
Name: count, dtype: int64

In [6]:
x_train['Product_importance'].value_counts()

Product_importance
low       3162
medium    2866
high       570
Name: count, dtype: int64

In [7]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6598 entries, 0 to 6597
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Warehouse_block      6598 non-null   object
 1   Mode_of_Shipment     6598 non-null   object
 2   Customer_care_calls  6598 non-null   object
 3   Customer_rating      6598 non-null   int64 
 4   Cost_of_the_Product  6598 non-null   int64 
 5   Prior_purchases      6598 non-null   int64 
 6   Product_importance   6598 non-null   object
 7   Gender               6598 non-null   object
 8   Discount_offered     6598 non-null   int64 
 9   Weight_in_gms        6598 non-null   int64 
dtypes: int64(5), object(5)
memory usage: 515.6+ KB


In [8]:
# 범주형 데이터를 sklearn label encoder 이용해서 수치화

from sklearn.preprocessing import LabelEncoder

col_lst = ['Warehouse_block', 'Mode_of_Shipment', 'Product_importance', 'Gender']

def lb_encoding(df, col_lst):
	for col in col_lst:
		encoder = LabelEncoder()
		df[col] = encoder.fit_transform(df[col])

lb_encoding(x_train, col_lst)
x_train.head()


Unnamed: 0,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms
0,0,0,4,3,266,5,0,0,5,1590
1,4,2,3,1,174,2,1,1,44,1556
2,4,1,4,1,154,10,0,1,10,5674
3,4,2,4,3,158,3,2,0,27,1207
4,0,0,5,3,175,3,1,1,7,4833


In [9]:
x_train.loc[x_train['Customer_care_calls'] == '$7'] = 7
x_train['Customer_care_calls'] = x_train['Customer_care_calls'].astype('int')

In [10]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6598 entries, 0 to 6597
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Warehouse_block      6598 non-null   int64
 1   Mode_of_Shipment     6598 non-null   int64
 2   Customer_care_calls  6598 non-null   int64
 3   Customer_rating      6598 non-null   int64
 4   Cost_of_the_Product  6598 non-null   int64
 5   Prior_purchases      6598 non-null   int64
 6   Product_importance   6598 non-null   int64
 7   Gender               6598 non-null   int64
 8   Discount_offered     6598 non-null   int64
 9   Weight_in_gms        6598 non-null   int64
dtypes: int64(10)
memory usage: 515.6 KB


In [11]:
def metrics(model, y_tst, pred):
		acc = accuracy_score(y_tst, pred)
		precision = precision_score(y_tst, pred)
		recall = recall_score(y_tst, pred)
		f1 = f1_score(y_tst, pred)
		
		return f'{model} Accuracy : {round(acc, 4)}, Precision : {round(precision, 4)}, Recall : {round(recall, 4)}, F1 Score : {round(f1, 4)}'

In [12]:
def model_predict(model_lst, x_trn, x_tst, y_trn, y_tst):
		
	for model in model_lst:
		model.fit(x_trn, y_trn)
		pred = model.predict(x_tst)
		result = metrics(model, y_tst, pred)
		print(result)


In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')


## Validation

In [14]:
x_trn, x_val, y_trn, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

rf_cl = RandomForestClassifier()
dtree = DecisionTreeClassifier()
logistic = LogisticRegression()
lgbm = LGBMClassifier()
xgb = XGBClassifier()

model_lst = [rf_cl, dtree, logistic, lgbm] #xgb
model_predict(model_lst, x_trn, x_val, y_trn, y_val)

RandomForestClassifier() Accuracy : 0.678, Precision : 0.7735, Recall : 0.6553, F1 Score : 0.7095
DecisionTreeClassifier() Accuracy : 0.6356, Precision : 0.6922, Recall : 0.7071, F1 Score : 0.6996
LogisticRegression() Accuracy : 0.6379, Precision : 0.6924, Recall : 0.7134, F1 Score : 0.7027
[LightGBM] [Info] Number of positive: 3145, number of negative: 2133
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000269 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 573
[LightGBM] [Info] Number of data points in the train set: 5278, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.595870 -> initscore=0.388284
[LightGBM] [Info] Start training from score 0.388284
LGBMClassifier() Accuracy : 0.6621, Precision : 0.75, Recall : 0.6553, F1 Score : 0.6995


## GridSearchCV

In [15]:
from sklearn.model_selection import GridSearchCV

def grid_search(model, params, scores, x_train, y_train):
	for score in scores:
		grid_cv = GridSearchCV(model, param_grid=params, scoring=score, cv=5)
		grid_cv.fit(x_train , y_train)
		print('GridSearchCV 최고 {0} 정확도 수치:{1:.4f}'.format(score, grid_cv.best_score_))
		print('GridSearchCV 최적 하이퍼 파라미터:', grid_cv.best_params_)


### Random Forest

In [16]:
params = {
	'n_estimators' : [100, 300],
    'max_depth' : [5, 10, 16 ,20],
	'max_leaf_nodes' : [15, 20]
	}

scores = ['accuracy', 'precision', ] # 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6857
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 20, 'max_leaf_nodes': 15, 'n_estimators': 300}
GridSearchCV 최고 precision 정확도 수치:0.9050
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 15, 'n_estimators': 300}


In [17]:
params = {
	'n_estimators' : [100, 300],
    'max_depth' : [5, 10],
	'max_leaf_nodes' : [5, 10, 15, 20]
	}

scores = ['accuracy', 'precision', ] # 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6861
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 15, 'n_estimators': 100}
GridSearchCV 최고 precision 정확도 수치:0.9022
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 10, 'n_estimators': 300}


In [18]:
params = {
	'n_estimators' : [100, 300, 350],
    'max_depth' : [10, 15],
	'max_leaf_nodes' : [5, 10, 15, 20]
	}

scores = ['accuracy', 'precision', ] # 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6860
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 15, 'max_leaf_nodes': 15, 'n_estimators': 100}
GridSearchCV 최고 precision 정확도 수치:0.9050
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 15, 'n_estimators': 350}


In [19]:
params = {
	'n_estimators' : [50, 100, 300],
    'max_depth' : [10, 15],
	'max_leaf_nodes' : [10, 15, 20]
	}

scores = ['accuracy',  ] # 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6872
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 15, 'max_leaf_nodes': 10, 'n_estimators': 50}


In [20]:
params = {
	'n_estimators' : [100, 150, 300],
    'max_depth' : [15, 20],
	'max_leaf_nodes' : [10, 15, 20]
	}

scores = ['accuracy' ] # , 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6860
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 15, 'max_leaf_nodes': 10, 'n_estimators': 100}


In [21]:
params = {
	'n_estimators' : [100, 150, 170],
    'max_depth' : [15, 20, 25],
	'max_leaf_nodes' : [10, 15, 20, 25]
	}

scores = ['accuracy' ] # , 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6870
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 25, 'max_leaf_nodes': 20, 'n_estimators': 100}


In [22]:
params = {
	'n_estimators' : [100, 150, 170],
    'max_depth' : [10, 15, 20],
	'max_leaf_nodes' : [10, 15, 20]
	}

scores = ['accuracy' ] # , 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6872
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 10, 'n_estimators': 100}


In [23]:
params = {
	'n_estimators' : [100, 150, 170],
    'max_depth' : [10, 15, 17],
	'max_leaf_nodes' : [15, 17, 20]
	} 

scores = ['accuracy' ] # , 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6866
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 15, 'max_leaf_nodes': 17, 'n_estimators': 150}


In [24]:
params = {
	'n_estimators' : [80, 100],
    'max_depth' : [15, 17, 20],
	'max_leaf_nodes' : [15, 17, 20]
	} 

scores = ['accuracy' ] # , 'precision', 'recall', 'f1', 'roc_auc'

grid_search(rf_cl, params, scores, x_train , y_train)

GridSearchCV 최고 accuracy 정확도 수치:0.6852
GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 20, 'max_leaf_nodes': 17, 'n_estimators': 80}


### Decision Tree

In [25]:
params = {
    'criterion':['gini','entropy'], 
    'max_depth':[None,2,3,4,5,6], 
    'max_leaf_nodes':[None,2,3,4,5,6,7], 
    'min_samples_split':[2,3,4,5,6], 
    'min_samples_leaf':[1,2,3], 
    'max_features':[None,'sqrt','log2',3,4,5]
    }

## Test

In [26]:
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv').drop('ID', axis=1)
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_test.csv').drop('ID', axis=1)


In [27]:
x_test['Customer_care_calls'].value_counts()

Customer_care_calls
4     1442
3     1298
5      925
6      409
2      234
$7      93
Name: count, dtype: int64

In [28]:
lb_encoding(x_test, col_lst)
x_test.loc[x_test['Customer_care_calls'] == '$7'] = 7
x_test['Customer_care_calls'] = x_test['Customer_care_calls'].astype('int')
x_test.head()


Unnamed: 0,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms
0,3,2,5,2,259,5,1,0,7,1032
1,4,2,3,5,133,3,2,0,4,5902
2,4,1,3,4,191,5,2,0,4,4243
3,3,2,4,2,221,3,1,1,10,4126
4,3,0,4,5,230,2,1,0,38,2890


In [29]:
rf_cl = RandomForestClassifier()
model_lst = [rf_cl]
model_predict(model_lst, x_train, x_test, y_train, y_test)

RandomForestClassifier() Accuracy : 0.666, Precision : 0.7809, Recall : 0.612, F1 Score : 0.6862


In [50]:
# GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 10, 'max_leaf_nodes': 10, 'n_estimators': 300}
rf_cl = RandomForestClassifier(max_depth=15, max_leaf_nodes=10, n_estimators=50)
model_lst = [rf_cl]
model_predict(model_lst, x_train, x_test, y_train, y_test)

RandomForestClassifier(max_depth=15, max_leaf_nodes=10, n_estimators=50) Accuracy : 0.6778, Precision : 0.9021, Recall : 0.516, F1 Score : 0.6565


In [51]:
rf_cl = RandomForestClassifier(max_depth=10, max_leaf_nodes=10, n_estimators=100)
model_lst = [rf_cl]
model_predict(model_lst, x_train, x_test, y_train, y_test)

RandomForestClassifier(max_depth=10, max_leaf_nodes=10) Accuracy : 0.6723, Precision : 0.8503, Recall : 0.5472, F1 Score : 0.6659


In [49]:
rf_cl = RandomForestClassifier(max_depth=25, max_leaf_nodes=20, n_estimators=100)
model_lst = [rf_cl]
model_predict(model_lst, x_train, x_test, y_train, y_test)

RandomForestClassifier(max_depth=25, max_leaf_nodes=20) Accuracy : 0.678, Precision : 0.8843, Recall : 0.5297, F1 Score : 0.6625
