### Imports

In [53]:
# Import packages
import pandas as pd
import numpy as np

In [54]:
# Read in dataset
df = pd.read_csv('Data/processed_df.csv')
df.head()

Unnamed: 0,return,no_delivery,is_female,is_male,is_bday,most_returned_item,least_returned_item,most_returned_color,least_returned_color,log_item_price,...,item_size_l,item_size_m,item_size_s,item_size_unsized,item_size_xl,item_size_xs,item_size_xxl,item_size_xxxl,user_return_rate,brand_return_rate
0,0,0,1,0,0,0,0,0,0,-0.0902,...,0,0,0,0,0,0,0,1,0.0,0.361744
1,0,1,1,0,0,0,1,0,0,-1.274292,...,0,0,0,0,0,0,0,0,0.0,0.361744
2,1,0,1,0,0,0,0,0,0,0.6034,...,0,0,0,0,0,0,1,0,0.615385,0.541545
3,0,0,1,0,0,1,0,0,0,0.946548,...,0,0,0,0,0,0,1,0,0.615385,0.541545
4,1,0,1,0,0,0,0,0,0,-1.589747,...,0,0,0,0,0,0,1,0,0.615385,0.370151


### Function for Comparison

**Imports**

In [55]:
# Standard Scaler
from sklearn.preprocessing import StandardScaler

# Train Test Split
from sklearn.model_selection import train_test_split

# XGBoost
from xgboost.sklearn import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, precision_score, recall_score

In [56]:
# Set X and y
# Note: 'Unnamed: 0' (the row index saved with the CSV) remains in X as a feature here
X = df.drop('return', axis=1)
y = df['return']

# Scale X
X_ss = StandardScaler().fit_transform(X)
X_ss = pd.DataFrame(X_ss, columns=X.columns)
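
The scaler above is fitted on every row before the train/test split, so test-set statistics feed into the scaling. A minimal sketch of the alternative, fitting on the training rows only, is shown here. It uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, not the notebook's dataset) and is not the approach used to produce the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y_demo = rng.integers(0, 2, size=1000)

# Split first, so the test rows never influence the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only, then reuse the same mean/std on the test rows.
scaler = StandardScaler().fit(X_train)
X_train_ss = scaler.transform(X_train)
X_test_ss = scaler.transform(X_test)

print(X_train_ss.shape, X_test_ss.shape)
```

The same pattern would apply to PCA, FastICA and UMAP: fit on the training split, then call `transform` on the test split.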

**Comparison Model (XGBoost)**

Use the best model from modelling.ipynb to compare the effects of dimensionality reduction.

In [59]:
param_grid = {
    'n_estimators': 134,
    'eta': 0.015,
    'lambda': 0.00001,
    'random_state': 42,
    'objective': 'binary:logistic',
    'booster': 'gblinear',
    'eval_metric': 'auc'
}

In [60]:
def xgb(X, y, param_grid, threshold=0.5):    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    classifier = XGBClassifier(**param_grid)
    classifier.fit(X_train, y_train)
     
    # Making predictions on the held-out test set
    y_pred_probs = classifier.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_probs > threshold).astype(int)

    # Evaluating the model
    auc = roc_auc_score(y_test, y_pred_probs)
    accuracy = balanced_accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    print(f'AUC: {auc}')
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')

### Model without Dimensionality Reduction

In [61]:
xgb(X, y, param_grid, threshold=0.49)

AUC: 0.7677417195434493
Accuracy: 0.6999611805690149
Precision: 0.6452017764047113
Recall: 0.7363375936535919


### Principal Component Analysis (PCA)

Linear Component-based Reduction

In [62]:
from sklearn.decomposition import PCA

pca = PCA(n_components=7, random_state=42)
X_pca = pca.fit_transform(X_ss)

print(pca.explained_variance_ratio_)

[0.06175897 0.05148013 0.04539799 0.03115076 0.03096632 0.03012022
 0.02956794]


In [63]:
xgb(X_pca, y, param_grid)

AUC: 0.7225629137424336
Accuracy: 0.6568363107807313
Precision: 0.6316331198536139
Recall: 0.6085279858968708


- Each component explains only a small share of the variance in the data (about 3-6% each, roughly 28% in total across the 7 components).
- However, with 7 components the metrics are only moderately lower than without reduction (AUC 0.723 vs 0.768).
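
The cumulative share of variance retained by the seven components can be checked directly from the ratios printed above. This is a small sketch that copies the printed values by hand; in the notebook itself the same list would come from `pca.explained_variance_ratio_`.

```python
from itertools import accumulate

# Values copied from the explained_variance_ratio_ output above.
ratios = [
    0.06175897, 0.05148013, 0.04539799, 0.03115076,
    0.03096632, 0.03012022, 0.02956794,
]

# Running total of explained variance after each component.
cumulative = list(accumulate(ratios))
total = cumulative[-1]

for i, share in enumerate(cumulative, start=1):
    print(f"{i} components: {share:.1%}")
```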

### Independent Component Analysis (ICA)

Linear Component-based Reduction (statistically independent components)

In [64]:
from sklearn.decomposition import FastICA

transformer = FastICA(n_components=7, random_state=42, whiten='unit-variance', max_iter=500)

X_fica = transformer.fit_transform(X_ss)
X_fica.shape

(99999, 7)

In [65]:
xgb(X_fica, y, param_grid)

AUC: 0.7346542617699692
Accuracy: 0.6678067336180146
Precision: 0.6378013713780137
Recall: 0.6354120758043191


With the same 7 components as PCA, ICA comes closer to the pre-reduction performance (AUC 0.735 vs 0.723 for PCA, against 0.768 without reduction).

### Uniform Manifold Approximation and Projection (UMAP)

https://pair-code.github.io/understanding-umap/

In [66]:
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=7, init='spectral', n_jobs=1, metric='euclidean', n_epochs=200, random_state=42)
X_umap = reducer.fit_transform(X_ss)
X_umap.shape

(99999, 7)

In [67]:
xgb(X_umap, y, param_grid)

AUC: 0.5121532565879665
Accuracy: 0.5
Precision: 0.0
Recall: 0.0




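At the 0.5 threshold the classifier predicts no returns at all (precision and recall are 0, balanced accuracy is exactly 0.5), and the AUC of 0.51 is close to chance. It appears that the data might be too clustered to effectively apply UMAP; the linear gblinear booster may also be a poor fit for a non-linear embedding.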
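
The metric pattern above is what any classifier produces when none of its predicted probabilities exceed the threshold. The sketch below illustrates this with made-up labels and probabilities (not the notebook's actual predictions) and hand-rolled metrics; `summarize` is a hypothetical helper, with balanced accuracy taken as the mean of recall and specificity.

```python
def summarize(y_true, probs, threshold=0.5):
    """Return balanced accuracy, precision and recall for a given threshold."""
    y_pred = [int(p > threshold) for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Undefined ratios (zero denominators) are reported as 0.0.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
    }

# Made-up example: every probability sits below 0.5.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
probs = [0.41, 0.44, 0.39, 0.46, 0.43, 0.40, 0.45, 0.42]

degenerate = summarize(y_true, probs, threshold=0.5)   # no positives predicted
lowered = summarize(y_true, probs, threshold=0.425)    # some positives predicted

print(degenerate)
print(lowered)
```

Inspecting the spread of `y_pred_probs` before picking a threshold would show whether a lower cut-off recovers any signal or whether the scores are simply uninformative.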

### Conclusion

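Based on the current inputs, the best reduction (ICA) costs roughly 3 percentage points in balanced accuracy and AUC; PCA costs about 4 to 4.5 points and UMAP far more. Since the XGBoost model trains quickly, we opt for the original model. Given the complexity of this data set, the manual feature selection and extraction seems to have reached a local optimum (hopefully), and any further reduction lowers prediction accuracy. Note that the baseline was run on unscaled X with a 0.49 threshold, while the reduced models used scaled inputs and the default 0.5, so the accuracy figures are not strictly like-for-like.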
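
The size of each drop can be recomputed from the metrics printed in the cells above. The sketch below copies those values by hand and reports the difference from the unreduced model in percentage points; the dictionary names are illustrative only.

```python
# Metrics copied from the outputs above ("Accuracy" there is balanced accuracy).
baseline = {"auc": 0.7677417195434493, "balanced_accuracy": 0.6999611805690149}
reduced = {
    "PCA": {"auc": 0.7225629137424336, "balanced_accuracy": 0.6568363107807313},
    "ICA": {"auc": 0.7346542617699692, "balanced_accuracy": 0.6678067336180146},
    "UMAP": {"auc": 0.5121532565879665, "balanced_accuracy": 0.5},
}

# Drop relative to the baseline, in percentage points.
drops = {
    name: {k: round((baseline[k] - metrics[k]) * 100, 2) for k in baseline}
    for name, metrics in reduced.items()
}

for name, d in drops.items():
    print(f"{name}: AUC -{d['auc']} pts, balanced accuracy -{d['balanced_accuracy']} pts")
```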

##### Export ICA-Reduced Dataset

In [68]:
df = pd.concat([pd.DataFrame(X_fica), df[['return']]], axis=1)
df.to_csv('Data/reduced_df.csv', index=False)