# Analytics Vidhya JanataHack Solution:
---

Solution to Analytics Vidhya's [JanataHack ML Hackathon](https://datahack.analyticsvidhya.com/contest/janatahack-e-commerce-analytics-ml-hackathon/).<br>
**Problem Statement**: Predict the gender of e-commerce users from their product-viewing records.<br>
**Leaderboard Info**: 86.8% on public leaderboard, 87.6% on private leaderboard (Rank 42, top 10%).

# Table of Contents:
---
1. Imports and Loading Data
2. Miscellaneous Variables
3. Feature Engineering
4. Preparing the data
5. Model and Submission


## 1. Imports and Loading Data:
---

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack
from category_encoders.cat_boost import CatBoostEncoder

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## 2. Miscellaneous Variables:
---

This cell defines the list of features to be used and initialises the lists that will hold the fitted encoders.

In [3]:
print("The columns are:", *list(train.columns))

features = ['feat_A', 'feat_B', 'feat_C', 'feat_D', 'seconds']
cat_cols = ['feat_A', 'feat_B', 'feat_C', 'feat_D']

encs_cb = []
encs_ohe = []

The columns are: session_id startTime endTime ProductList gender


## 3. Feature Engineering:
---

* Only two kinds of features are used:
  1. Session duration in seconds, computed as `endTime` minus `startTime`.
  2. The first product in `ProductList`, split into separate A, B, C and D columns.<br><br>
   So if x = A00002/B00003/C00006/D28435/<br>
   x[:6]    = A00002<br>
   x[7:13]  = B00003<br>
   x[14:20] = C00006<br>
   x[21:27] = D28435<br><br>
   Each of these columns is encoded with both CatBoostEncoder and OneHotEncoder; using both gave better leaderboard accuracy than using either encoding alone.

* I tested more features, but this was the smallest set that worked well.
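
To give a feel for what a target-statistic encoder such as CatBoostEncoder does with a column like `feat_A`, here is a deliberately simplified sketch: each category is replaced by a smoothed mean of the target. This is not the library's implementation (which, among other things, can compute the statistic in an ordered fashion over the training rows). The helper `smoothed_target_mean`, the smoothing weight `a` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

def smoothed_target_mean(categories, target, a=1.0):
    """Hypothetical helper: map each category to (sum + prior * a) / (count + a)."""
    prior = target.mean()  # global mean of the target
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + prior * a) / (stats["count"] + a)
    return encoding, prior

# Made-up rows; 1 = female, 0 = male, as in the mapping used in this notebook
toy = pd.DataFrame({
    "feat_A": ["A00001", "A00001", "A00002", "A00002", "A00002", "A00003"],
    "gender": [1, 1, 0, 0, 1, 0],
})
encoding, prior = smoothed_target_mean(toy["feat_A"], toy["gender"])

# Categories that were never seen while fitting fall back to the global prior
new_codes = pd.Series(["A00001", "A00009"])
encoded = new_codes.map(encoding).fillna(prior)

print(encoding)
print(encoded.tolist())
```

The smoothing pulls rare categories towards the global mean, which is one reason such an encoding can complement a plain one-hot representation.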

In [53]:
def clean_data(data, train=True):
    enc_val = []
    
    # Calculating the time (in seconds) for each session_id
    time = (pd.to_datetime(data['endTime']) - pd.to_datetime(data['startTime']))
    data['seconds'] = time.dt.total_seconds()
    
    # Slicing the A B C D for the first product for each session_id
    data['feat_A'] = data['ProductList'].apply(lambda x: x[:6])
    data['feat_B'] = data['ProductList'].apply(lambda x: x[7:13])
    data['feat_C'] = data['ProductList'].apply(lambda x: x[14:20])
    data['feat_D'] = data['ProductList'].apply(lambda x: x[21:27])   
    
    if train:
        # only for training dataset
        data['gender'] = data['gender'].map({'male': 0, 'female': 1})    
        for col in cat_cols:
            # Cat Boost Encoder
            cb = CatBoostEncoder()
            cb.fit(data[col].values.reshape(-1, 1), data['gender'])
            # One Hot Encoder
            ohe = OneHotEncoder(handle_unknown='ignore')
            ohe.fit(data[col].values.reshape(-1, 1))   
            # Storing these encoder objects for test set
            encs_cb.append(cb)
            encs_ohe.append(ohe)
    
    # Transforming the data
    for i, enc in enumerate(encs_ohe):
        enc_val.append(enc.transform(data[cat_cols[i]].values.reshape(-1, 1)))
    for i, enc in enumerate(encs_cb):
        data[cat_cols[i]] = enc.transform(data[cat_cols[i]].values.reshape(-1, 1))     

    return data[features].values, enc_val
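
The duration and slicing steps above can be tried in isolation on a toy frame. The sketch below uses the vectorised pandas accessors `.dt.total_seconds()` and `.str.slice()` instead of `apply`. The timestamps and product codes are made up, and the ISO timestamp format is an assumption rather than the format of the competition file.

```python
import pandas as pd

# Made-up rows with the same column names as the competition data
toy = pd.DataFrame({
    "startTime": ["2014-12-15 18:11:00", "2014-12-16 14:35:00"],
    "endTime": ["2014-12-15 18:12:00", "2014-12-16 14:41:30"],
    "ProductList": ["A00002/B00003/C00006/D28435/", "A00001/B00009/C00031/D29404/"],
})

# Session duration in seconds
duration = pd.to_datetime(toy["endTime"]) - pd.to_datetime(toy["startTime"])
toy["seconds"] = duration.dt.total_seconds()

# Fixed-width slices of the first product: 6 characters per code, 1 per slash
slices = {"feat_A": (0, 6), "feat_B": (7, 13), "feat_C": (14, 20), "feat_D": (21, 27)}
for name, (start, stop) in slices.items():
    toy[name] = toy["ProductList"].str.slice(start, stop)

print(toy[["seconds", "feat_A", "feat_B", "feat_C", "feat_D"]])
```

Fixed-width slicing only works while every code keeps the six-character layout shown earlier; splitting on `/` would be a more defensive choice if that ever changed.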

## 4. Preparing the data:
---
* The dense features (`seconds` and the CatBoost-encoded columns) and the one-hot encoded features are concatenated with `hstack` from `scipy.sparse`, since ordinary NumPy concatenation does not work with sparse matrices.

* A simple 80-20 train/validation split is used. Because the classes are imbalanced, the split is stratified so that the train and validation sets keep the same class ratio.

In [54]:
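# clean_data adds the engineered columns to each DataFrame in place,
# so only the list of one-hot matrices is needed from its return value
_, vals1 = clean_data(train)
_, vals2 = clean_data(test, False)

# Stack the four sparse one-hot matrices side by side
X = hstack(vals1)
X_test = hstack(vals2)

# Append the dense columns (CatBoost-encoded codes and seconds)
X = hstack((X, train[features].values))
X_test = hstack((X_test, test[features].values))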
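
As a small illustration of the stacking step, the sketch below one-hot encodes a toy column, shows what `handle_unknown='ignore'` does with a code that only appears at test time, and joins the result with a dense column. All values and variable names are made up. The dense block is wrapped in `csr_matrix` explicitly here, which is a choice of this sketch rather than something the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

train_codes = np.array([["A00001"], ["A00002"], ["A00001"]])
test_codes = np.array([["A00002"], ["A00009"]])  # "A00009" is unseen in training
train_dense = np.array([[60.0], [390.0], [15.0]])
test_dense = np.array([[42.0], [7.0]])

ohe = OneHotEncoder(handle_unknown="ignore")
train_sparse = ohe.fit_transform(train_codes)
test_sparse = ohe.transform(test_codes)  # unseen codes become an all-zero row

# hstack returns a COO matrix by default; CSR supports row indexing
X_train_toy = hstack([train_sparse, csr_matrix(train_dense)]).tocsr()
X_test_toy = hstack([test_sparse, csr_matrix(test_dense)]).tocsr()

print(X_train_toy.shape, X_test_toy.shape)
print(X_test_toy.toarray())
```

The all-zero row for the unseen code is why the encoders can be fitted on the training set alone and still be applied to the test set without errors.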

In [55]:
y = train['gender'].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y)
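
To see what `stratify` buys, the following sketch splits a made-up label vector with an 80/20 class imbalance and compares the class ratio on both sides. The data and the fixed `random_state` are assumptions added for reproducibility; the split above does not set a seed, so its exact rows differ between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 80 positives and 20 negatives
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratification both parts keep the original share of positives
print(y_tr.mean(), y_va.mean())
```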

## 5. Model and Submission:
---
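* Used a GradientBoostingClassifier with a max_depth of 2 and 50 estimators.
* These parameters were found by trial and error; a systematic search such as GridSearchCV would be a better approach.
* Once these parameters were found, the model was fit on the full data, not just the train split. The train and validation figures printed below are therefore both measured on rows the model has already seen, so the validation score is not a true hold-out estimate.
* The printed loss is computed from hard class predictions (`model.predict`) rather than from `model.predict_proba`, so it mainly reflects the error rate and is not a calibrated log loss.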
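
A systematic search of the kind mentioned above could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the parameter grid, the 3-fold setting and the class weights are illustrative assumptions, not values that were tried for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real feature matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.25, 0.75], random_state=0
)

# Small illustrative grid around the settings used in this notebook
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # stratified folds are used by default for classifiers
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

By default `GridSearchCV` refits the best configuration on all the data it was given, which mirrors the "fit on the full data" step described above.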

In [56]:
model = GradientBoostingClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)
print("Train Accuracy, Loss: ", accuracy_score(y_train, model.predict(X_train)), log_loss(y_train, model.predict(X_train)))
print("Val Accuracy, Loss: ", accuracy_score(y_val, model.predict(X_val)), log_loss(y_val, model.predict(X_val)))

Train Accuracy, Loss:  0.8894047619047619 3.8198853110026443
Val Accuracy, Loss:  0.8961904761904762 3.5855121869035003
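
The loss values above come from passing hard 0/1 predictions to `log_loss`. When such predictions are clipped to 1e-15, each misclassified row contributes about 34.54 (that is, -log(1e-15)), and the reported numbers are consistent with this: (1 - 0.8894) × 34.54 ≈ 3.82 and (1 - 0.8962) × 34.54 ≈ 3.59. A sketch of an evaluation that fits only on the training split and scores probabilities instead is shown below; it uses synthetic data, and every variable name in it is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the competition data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
clf.fit(X_tr, y_tr)  # fit on the training split only, so validation rows stay unseen

val_acc = accuracy_score(y_va, clf.predict(X_va))
val_loss = log_loss(y_va, clf.predict_proba(X_va))  # probabilities, not labels

# What a hard-label "log loss" amounts to when clipping at 1e-15
hard_label_loss = (1 - val_acc) * -np.log(1e-15)

print("Val accuracy:", val_acc)
print("Val log loss (probabilities):", val_loss)
print("Error rate x 34.54:", hard_label_loss)
```

Once the hold-out numbers look acceptable, refitting on the full data before predicting on the test set remains a reasonable final step.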


In [57]:
preds = model.predict(X_test)
sub = pd.DataFrame({'session_id': test['session_id'], 'gender': pd.Series(preds).map({0: 'male', 1: 'female'})})
sub.to_csv('submission.csv', index=False)