<a href="https://colab.research.google.com/github/wahyunh10/Project-Ecommerce-Shipping-Clasification-Modeling/blob/main/Stage_2_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import Libraries**

In [None]:
#code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy import stats

# **Load Dataset**

In [None]:
#code
df = pd.read_csv('Full_data.csv')
dfSel = df.copy()

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   10999 non-null  int64 
 1   Warehouse_block      10999 non-null  object
 2   Mode_of_Shipment     10999 non-null  object
 3   Customer_care_calls  10999 non-null  int64 
 4   Customer_rating      10999 non-null  int64 
 5   Cost_of_the_Product  10999 non-null  int64 
 6   Prior_purchases      10999 non-null  int64 
 7   Product_importance   10999 non-null  object
 8   Gender               10999 non-null  object
 9   Discount_offered     10999 non-null  int64 
 10  Weight_in_gms        10999 non-null  int64 
 11  Reached.on.Time_Y.N  10999 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.0+ MB


In [None]:
# group the columns by type
nums = ['Customer_care_calls', 'Customer_rating', 'Prior_purchases', 'Discount_offered', 'Cost_of_the_Product', 'Weight_in_gms', 'Reached.on.Time_Y.N']
cats = ['Mode_of_Shipment', 'Product_importance', 'Gender','Warehouse_block']

In [None]:
df[nums].describe()

| | Customer_care_calls | Customer_rating | Prior_purchases | Discount_offered | Cost_of_the_Product | Weight_in_gms | Reached.on.Time_Y.N |
|---|---|---|---|---|---|---|---|
| count | 10999.0 | 10999.0 | 10999.0 | 10999.0 | 10999.0 | 10999.0 | 10999.0 |
| mean | 4.054459 | 2.990545 | 3.567597 | 13.373216 | 210.196836 | 3634.016729 | 0.596691 |
| std | 1.14149 | 1.413603 | 1.52286 | 16.205527 | 48.063272 | 1635.377251 | 0.490584 |
| min | 2.0 | 1.0 | 2.0 | 1.0 | 96.0 | 1001.0 | 0.0 |
| 25% | 3.0 | 2.0 | 3.0 | 4.0 | 169.0 | 1839.5 | 0.0 |
| 50% | 4.0 | 3.0 | 3.0 | 7.0 | 214.0 | 4149.0 | 1.0 |
| 75% | 5.0 | 4.0 | 4.0 | 10.0 | 251.0 | 5050.0 | 1.0 |
| max | 7.0 | 5.0 | 10.0 | 65.0 | 310.0 | 7846.0 | 1.0 |


Some observations:

* The `Customer_care_calls`, `Customer_rating`, and `Cost_of_the_Product` columns look roughly symmetric (mean and median are close)
* The `Discount_offered` and `Prior_purchases` columns are right-skewed (mean above the median, long right tail)
* The `Reached.on.Time_Y.N` column is binary (0/1)
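
The mean-versus-median comparison can be backed by a direct skewness calculation. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not data from `Full_data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for df[nums]; not the real dataset.
toy = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5, 6, 7],
    'right_skewed': [1, 1, 2, 2, 3, 4, 30],
})

# Sample skewness per column: close to 0 means roughly symmetric,
# clearly positive means a long right tail.
skewness = toy.skew()

# Gap between mean and median, the heuristic used in the observations.
mean_minus_median = toy.mean() - toy.median()

print(skewness)
print(mean_minus_median)
```

On the real data the same two calls would be `df[nums].skew()` and `df[nums].mean() - df[nums].median()`.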

In [None]:
df[cats].describe()

| | Mode_of_Shipment | Product_importance | Gender | Warehouse_block |
|---|---|---|---|---|
| count | 10999 | 10999 | 10999 | 10999 |
| unique | 3 | 3 | 2 | 5 |
| top | Ship | low | F | F |
| freq | 7462 | 5297 | 5545 | 3666 |


Some observations:

* Gender: **female (F)** is only slightly more common (5,545 of 10,999 rows)
* Product importance: the **low** category is the most common (5,297 rows)
* Mode of shipment: dominated by **Ship** (7,462 rows)
* Warehouse block: **block F** is the most common (3,666 rows)
* Every categorical column has a small number of unique values, **between 2 and 5**
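
To make "dominated" concrete, the top-category counts can be turned into shares of the total. This sketch uses only the numbers printed by `describe()`; the Series labels are made up for readability.

```python
import pandas as pd

# Row count and top-category frequencies copied from the describe() output.
total_rows = 10999
top_freq = pd.Series({
    'Mode_of_Shipment = Ship': 7462,
    'Product_importance = low': 5297,
    'Gender = F': 5545,
    'Warehouse_block = F': 3666,
})

# Share of rows held by the most frequent value of each column, in percent.
top_share_pct = (top_freq / total_rows * 100).round(1)
print(top_share_pct)
```

With `df` loaded, `df[col].value_counts(normalize=True)` gives the shares for every category of a column, not just the most frequent one.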

In [None]:
data_clean = df.copy()
data_clean.info()

# **Data Cleansing**
**Handle missing values**

In [None]:
#code
df.isna().sum()

No handling is needed: none of the features has missing values.

**Handle duplicated data**

In [None]:
#code
df.duplicated().sum()

In [None]:
df.duplicated(subset=['Customer_care_calls', 'Customer_rating', 'Prior_purchases', 'Discount_offered', 'Cost_of_the_Product',
                      'Weight_in_gms', 'Reached.on.Time_Y.N', 'Mode_of_Shipment', 'Product_importance', 'Gender','Warehouse_block']).sum()

No handling is needed: there are no duplicated rows, whether the check uses all columns or every column except `ID`.
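
The second check leaves out `ID` because an identifier column typically differs on every row and would hide rows that repeat all feature values. This is a minimal sketch on a made-up frame; `toy` and its values are illustrative assumptions, not data from `Full_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share every feature value but not the ID.
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Gender': ['F', 'M', 'F'],
    'Customer_rating': [5, 3, 5],
})

# Full-row check: the unique ID makes every row look distinct.
full_row_dupes = int(toy.duplicated().sum())

# Feature-only check: build the subset from the columns instead of typing it out.
feature_cols = toy.columns.drop('ID').tolist()
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```

Deriving the subset from `columns.drop('ID')` avoids silently omitting a column when the list is written by hand.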

**Handle outliers**

In [None]:
#code
df2 = df.copy()
print(f'Number of rows before filtering outliers: {len(df2)}')

filtered_entries1 = np.array([True] * len(df2))

for col in nums:
    zscore = abs(stats.zscore(df2[col])) # compute the absolute z-score
    filtered_entries1 = (zscore < 3) & filtered_entries1 # keep rows with an absolute z-score below 3
    
# keep only rows whose z-scores are all below 3; copy so later column assignments do not warn
df2 = df2[filtered_entries1].copy()

print(f'Number of rows after filtering outliers: {len(df2)}')

Number of rows before filtering outliers: 10999
Number of rows after filtering outliers: 10642

Filtering on the z-score of every numeric feature removes 357 rows, **about 3% of the data**, leaving 10,642 rows. Because we consider every row valuable, **we chose the z-score method** so as not to discard too much data.
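
The same filter can be written without an explicit loop. The following is a sketch only: `filter_by_zscore` is a hypothetical helper, not part of the notebook, and `toy` is made-up data with one obvious outlier.

```python
import numpy as np
import pandas as pd
from scipy import stats

def filter_by_zscore(frame, columns, threshold=3.0):
    """Keep rows whose absolute z-score is below `threshold` in every listed column."""
    # z-scores are computed column by column (axis=0) on a plain float array
    z = np.abs(stats.zscore(frame[columns].to_numpy(dtype=float), axis=0))
    keep = (z < threshold).all(axis=1)
    return frame[keep].copy()

# Hypothetical toy data: 'weight' has one extreme value, 'calls' has none.
toy = pd.DataFrame({
    'weight': list(range(1, 20)) + [1000],
    'calls': list(range(20)),
})

filtered = filter_by_zscore(toy, ['weight', 'calls'])
print(len(toy), len(filtered))
```

Applied to the notebook's data, the call would be `filter_by_zscore(df, nums)`. A row is dropped as soon as one column exceeds the threshold, which matches the behavior of the loop.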

**Feature transformation**

In [None]:
#code
df2.describe()

In [None]:
# Normalization:
df2['Customer_rating'] = MinMaxScaler().fit_transform(df2['Customer_rating'].values.reshape(len(df2), 1))

# Standardization:
df2['Customer_care_calls'] = StandardScaler().fit_transform(df2['Customer_care_calls'].values.reshape(len(df2), 1))
df2['Cost_of_the_Product'] = StandardScaler().fit_transform(df2['Cost_of_the_Product'].values.reshape(len(df2), 1))
df2['Prior_purchases'] = StandardScaler().fit_transform(df2['Prior_purchases'].values.reshape(len(df2), 1))
df2['Discount_offered'] = StandardScaler().fit_transform(df2['Discount_offered'].values.reshape(len(df2), 1))
df2['Weight_in_gms'] = StandardScaler().fit_transform(df2['Weight_in_gms'].values.reshape(len(df2), 1))

In [None]:
df2.describe()

Some features are standardized so that they share a common scale (mean 0, standard deviation 1), which makes modeling easier. Standardization is a linear rescaling, so it does not change the shape of a distribution or bring it closer to normal. `Customer_rating` is min-max normalized instead, because its range is known in advance (1 to 5).
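
A minimal sketch of the two scalers on made-up data (`toy` and its values are illustrative assumptions). It also compares skewness before and after scaling, since a linear rescaling changes location and spread but not the shape of a distribution.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: a bounded rating and a right-skewed discount.
toy = pd.DataFrame({
    'Customer_rating': [1, 2, 3, 4, 5],
    'Discount_offered': [1, 2, 3, 4, 40],
}).astype(float)

skew_before = toy['Discount_offered'].skew()

scaled = toy.copy()
# Min-max scaling maps the known 1-5 range onto 0-1.
scaled[['Customer_rating']] = MinMaxScaler().fit_transform(toy[['Customer_rating']])
# Standardization subtracts the mean and divides by the standard deviation.
scaled[['Discount_offered']] = StandardScaler().fit_transform(toy[['Discount_offered']])

skew_after = scaled['Discount_offered'].skew()
print(skew_before, skew_after)
```

Selecting with double brackets (`toy[['col']]`) yields the 2-D input the scalers expect, so no manual `reshape` is needed.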

In [None]:
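features = nums
n_cols = len(features) // 2  # subplot needs an integer; a 3 x 3 grid holds the 7 features
plt.figure(figsize=(20, 10))
for i in range(0, len(features)):
    plt.subplot(3, n_cols, i+1)
    sns.histplot(x=df2[features[i]], kde=True, color='green')
    plt.xlabel(features[i])
plt.tight_layout()  # adjust spacing once, after all subplots are drawn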

In [None]:
sns.kdeplot(df2['Prior_purchases']);

In [None]:
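# df2['Prior_purchases'] is standardized by now and contains values <= 0,
# so take the log of the original values for the rows kept in df2
sns.kdeplot(np.log(df.loc[df2.index, 'Prior_purchases']));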

In [None]:
sns.kdeplot(df2['Discount_offered']);

In [None]:
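# df2['Discount_offered'] is standardized by now and contains values <= 0,
# so take the log of the original values for the rows kept in df2
sns.kdeplot(np.log(df.loc[df2.index, 'Discount_offered']));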

In [None]:
sns.kdeplot(df2['Weight_in_gms']);