# EBAY Assignment - by Sunny Wong

A private (non-business) seller who has just sold an item on eBay may be particularly receptive to a marketing message to buy on eBay. After all, the seller just had a positive selling experience and is in a good mood with respect to eBay, and he just made some money that's burning a hole in his pocket. 

By contract, eBay sends several transactional email messages for each successful auction.

Due to their transactional nature, these emails have a high open rate and are thus a perfect candidate for including a buying-focused marketing message.

Against this background, your business partner has asked you whether you think there's potential in including a buying marketing message in transactional emails. The high-level question she asks is: “Do you think sellers who just sold an item will be responsive to a buying marketing message, and if yes, which sellers should we target?”.


# Problem breakdown and summary

The high-level problem comes down to building a predictive model of which sellers will make a purchase after a sale, and thereby identifying the sellers most likely to respond to a marketing message. Below I outline the high-level steps taken.

### Part A) Data Processing

- data exploration and data assumptions
- data cleaning (dealing with null values)
- feature engineering

### Part B) Modelling

- after the data transformation step, we have a dataframe ready for building the predictive model
- create a simple model as a baseline (decision tree)
- build more advanced models (random forest and gradient boosting) to beat the baseline model
- use grid search and cross-validation to find good hyperparameters
- evaluate against the test set
- discuss the model results

### Part C) Insights generated and discussion of possible improvements

- insights from models
- areas of possible improvement

### To take this even further, I will discuss how experimentation can show whether marketing messages are effective.

***

### Part A Data Processing Summary

- took a look at the data before starting on the problem
- generated the target label for the next step from after_7d_value
- created a data_transformer class with 2 parameters
- created a new feature based on the difference between the start and final sell price (I think it could be a good predictor)
- dropped rows with nulls (ONLY because there were relatively few of them); alternatives are also provided for the case of many nulls
- the prep_df function also uses pd.get_dummies to handle categorical variables
- dropped unneeded columns


In [139]:
# import some common packages for data processing
import os
import pandas as pd
import numpy as np
import math

In [140]:
# load data - change the path of the data file to replicate results
path = r'C:\Users\sunny.wong2\JupyterNotebook\ebay assignment\Test1.csv'
# need list of column names to help read the csv
column_names = ['seller',
                'buyer_segment',
                'full_category',
                'category',
                'auction_duration',
                'start_price',
                'total_bids',
                'first_2d_bids',
                'last_2d_bids',
                'final_price',
                'final_price_cat_pctl',
                'last_7d_searches',
                'last_7d_item_views',
                'last_7d_purchases',
                'last_2d_searches',
                'last_2d_item_views',
                'last_2d_purchases',
                'after_7d_value',
                'after_7d_purchases'
                ]
df = pd.read_csv(path, sep=';', header=None, names=column_names)



In [141]:
# a transformer class that holds the feature transformation functions
class data_transformer(object):
    
    def __init__(self):
        # a seller counts as responsive if after_7d_value exceeds this amount
        self.amount_consider_as_responsive = 0
        # a relative price gain above this fraction counts as significant
        self.perc_consider_as_significant_gain = 0.2
    
    # generate the label from after_7d_value (1 is responsive, 0 is not)
    def generate_label(self, y):
        if y > self.amount_consider_as_responsive:
            return 1
        else:
            return 0
    
    # flag whether the relative gain from start price to final price is significant;
    # the idea is that a seller is more likely to buy something after making more money
    def price_difference_gain_or_loss(self, y):
        if y > self.perc_consider_as_significant_gain:
            return 1
        else:
            return 0
        
dt = data_transformer()

In [142]:
# generate label
df['responsive_label'] = df['after_7d_value'].apply(dt.generate_label)

In [143]:
# this will give a rate of increase/decrease from original sell price to final price
df['sell_price_difference_percentage'] = (df['final_price']-df['start_price'])/df['start_price']
df['sell_price_difference_significant'] = df['sell_price_difference_percentage'].apply(dt.price_difference_gain_or_loss)
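
The ratio above divides by `start_price`, so any listing with a start price of 0 would produce `inf` (or `NaN` for 0/0). Whether such rows exist in this dataset is not checked here; the following is a minimal sketch, on a made-up toy frame, of one way to guard against them. The helper name `price_change_ratio` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def price_change_ratio(start_price: pd.Series, final_price: pd.Series) -> pd.Series:
    """Relative change from start to final price; NaN where start price is not positive."""
    # mask non-positive start prices so the division cannot yield inf
    safe_start = start_price.where(start_price > 0)
    return (final_price - safe_start) / safe_start


# toy data (assumed values, only for illustration)
toy = pd.DataFrame({"start_price": [1.0, 0.0, 10.0],
                    "final_price": [1.5, 5.0, 9.0]})
toy["ratio"] = price_change_ratio(toy["start_price"], toy["final_price"])
# same 0.2 threshold as perc_consider_as_significant_gain; NaN compares as False, so it maps to 0
toy["significant"] = (toy["ratio"] > 0.2).astype(int)
print(toy)
```

Rows with a masked ratio then remain `NaN` and can be dropped or imputed together with the other nulls.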

In [144]:
# this shows that full_category and category have many classes - I dropped them after experimenting with the model
print(len(df['buyer_segment'].unique()))
print(len(df['full_category'].unique()))
print(len(df['category'].unique()))

7
11935
3215


In [145]:
# this data is actually really clean with very few null values; below is a count of nulls in each column
# so a simple dropna will do for this assignment
df.isnull().sum(axis = 0)

seller                                0
buyer_segment                        97
full_category                        97
category                             97
auction_duration                     97
start_price                          97
total_bids                           97
first_2d_bids                        97
last_2d_bids                         97
final_price                          97
final_price_cat_pctl                 97
last_7d_searches                     97
last_7d_item_views                   97
last_7d_purchases                    97
last_2d_searches                     97
last_2d_item_views                   97
last_2d_purchases                    97
after_7d_value                       97
after_7d_purchases                   97
responsive_label                      0
sell_price_difference_percentage     97
sell_price_difference_significant     0
dtype: int64

In [146]:
df = df.dropna(axis=0, how='any')

In [147]:
# not strictly needed for this assignment since so few rows have null values, but it is always safer to have ways to deal
# with them; below are 2 simple approaches (for categorical and numerical variables) for a bigger
# dataset with more nulls
def prep_df(input_df):
    
    prepped_df = input_df.copy()

    # replace null values with not_known in these categorical variables
    columns = ['buyer_segment']
    for c in columns:
        prepped_df[c] = prepped_df[c].fillna('not_known')
    
    # a simple way to deal with NaN in these numerical features is to replace it with the column mean
    columns = [ 'auction_duration',
                'start_price',
                'total_bids',
                'first_2d_bids',
                'last_2d_bids',
                'final_price',
                'final_price_cat_pctl',
                'last_7d_searches',
                'last_7d_item_views',
                'last_7d_purchases',
                'last_2d_searches',
                'last_2d_item_views',
                'last_2d_purchases']
    for c in columns:
        # use the mean of the dataframe passed in rather than the global df
        prepped_df[c] = prepped_df[c].fillna(prepped_df[c].mean())
        
    # one-hot encode the remaining categorical columns
    prepped_df = pd.get_dummies(prepped_df)
    return prepped_df

In [148]:
# keep only the columns considered for modelling
# note seller is removed since it is just an identifier and carries no signal
# 'full_category' and 'category' are removed since they are too sparse, with over 10k and 3k classes respectively
# 'after_7d_value' and 'after_7d_purchases' are removed since responsive_label, our target, is derived from after_7d_value
column_names = [
                'buyer_segment',
                'auction_duration',
                'start_price',
                'total_bids',
                'first_2d_bids',
                'last_2d_bids',
                'final_price',
                'final_price_cat_pctl',
                'last_7d_searches',
                'last_7d_item_views',
                'last_7d_purchases',
                'last_2d_searches',
                'last_2d_item_views',
                'last_2d_purchases',
                'sell_price_difference_percentage',
                'sell_price_difference_significant',
                'responsive_label'
                ]
df = df[column_names]

In [149]:
# just to get an idea of the class distribution between responsive or not
r = len(df.loc[(df['responsive_label'] == 1)]) / len(df)
print(r)

0.06538217709625621


A fairly important finding: only about 6.5% of sellers end up buying something in the 7 days after selling (after_7d_value > 0).

This matters because it drives how we evaluate the models later: with classes this imbalanced, recall is much more informative than accuracy alone.
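
To make the point concrete, here is a small sketch of what a "model" that always predicts not responsive would score. It uses the responsive rate printed above; the sample size of 100,000 is an arbitrary assumption for illustration.

```python
# share of responsive sellers, taken from the notebook output above
positive_rate = 0.06538217709625621
n = 100_000  # hypothetical sample size, only for illustration

n_pos = round(n * positive_rate)
y_true = [1] * n_pos + [0] * (n - n_pos)
y_pred = [0] * n  # always predict "not responsive"

# accuracy: share of predictions that match the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
# recall: share of truly responsive sellers that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_pos

print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The trivial predictor reaches high accuracy while finding none of the responsive sellers, which is why accuracy alone says little here.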

___

### Part B) Modelling Summary


- after the data transformation step, we have a dataframe ready for building the predictive model
- create a simple model as a baseline (decision tree)
- build more advanced models (random forest and gradient boosting) to beat the baseline model
- use grid search and cross-validation to find good hyperparameters
- evaluate against the test set
- discuss the model results

In [150]:
# import typical ml and metrics packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score, accuracy_score
from xgboost.sklearn import XGBClassifier

In [151]:
# X is our predictors, y is the label we want to predict
y = df['responsive_label']
X = df.drop(['responsive_label'], axis=1)

In [152]:
# sanity check to see if y and X have same number of rows
print(len(y))
print(len(X))

695633
695633


In [153]:
# use function below to prepare our X dataframe for scikit learn models
X_prepped = prep_df(X)

In [154]:
# this function first splits the data into train and test sets, then runs a cross-validated grid search on the train set
# to identify the best hyperparameters, and finally evaluates the best estimator against the test set
def train_model_with_cv(model, params, X, y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # use the train data for hyperparameter selection in a grid search
    # (scoring is left at its default, which for classifiers is accuracy)
    gs_clf = GridSearchCV(model, params, n_jobs=1, cv=5)
    gs_clf = gs_clf.fit(X_train, y_train)
    model = gs_clf.best_estimator_

    # use the best model and the test data for the final evaluation
    y_pred = model.predict(X_test)

    # note: with average='micro' on a single-label problem, F1 is identical to accuracy
    _f1 = f1_score(y_test, y_pred, average='micro')
    _confusion = confusion_matrix(y_test, y_pred).ravel()
    _accuracy = accuracy_score(y_test, y_pred)
    _precision = precision_score(y_test, y_pred)
    _recall = recall_score(y_test, y_pred)
    _statistics = {'f1_score': _f1,
                   'confusion_matrix': 'tn, fp, fn, tp' + str(_confusion),
                   'accuracy': _accuracy,
                   'precision': _precision,
                   'recall': _recall
                   }

    return model, _statistics 
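
Since recall is the metric of interest, one possible variation is to let the grid search select on recall and to stratify the split so both sets keep the same class ratio. The sketch below is not what produced the results in this notebook; it runs on synthetic data from `make_classification`, and the `class_weight="balanced"` setting and the small grid are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced toy data (about 7% positives), standing in for X_prepped and y
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.93, 0.07], random_state=42)

# stratify keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33,
                                          random_state=42, stratify=y_toy)

# scoring="recall" makes the grid search pick hyperparameters by recall instead of accuracy
search = GridSearchCV(DecisionTreeClassifier(class_weight="balanced", random_state=42),
                      {"max_depth": [3, 5, 10]},
                      scoring="recall", cv=5)
search.fit(X_tr, y_tr)

test_recall = recall_score(y_te, search.predict(X_te))
print(search.best_params_, test_recall)
```

Selecting purely on recall tends to cost precision, so in practice a combined score such as `scoring="f1"` may be a better compromise.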

In [155]:
# baseline model using decision tree model
clf = DecisionTreeClassifier()
param_grid = {"max_depth": [10, 15, 20],
              "min_impurity_decrease": [0],
              "criterion": ["gini"],
              "min_samples_split": [50],
              "min_samples_leaf": [50],
              "max_features": [None]
              }

dt_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.9463275236431593, 'confusion_matrix': 'tn, fp, fn, tp[212264   2325   9996   4974]', 'accuracy': 0.9463275236431593, 'precision': 0.6814632141389232, 'recall': 0.3322645290581162}


In [156]:
# random forest model
# note running the grid may take a long time; consider testing fewer combinations of hyperparameters
clf = RandomForestClassifier()
param_grid = {"n_estimators": [100, 150],
                  "max_depth": [3, 8, 12],
                  # "auto" was an alias of "sqrt" for classifiers and is rejected by recent scikit-learn releases
                  "max_features": ["sqrt"],
                  "min_samples_split": [30, 75],
                  "min_samples_leaf": [30, 75],
                  "bootstrap": [True],
                  "criterion": ["gini"]
              }

rf_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.9477781311122631, 'confusion_matrix': 'tn, fp, fn, tp[213203   1386  10602   4368]', 'accuracy': 0.9477781311122631, 'precision': 0.7591240875912408, 'recall': 0.29178356713426856}


In [157]:
# gradient boosting model
# note running the grid may take a long time; consider testing fewer combinations of hyperparameters
clf = XGBClassifier()
param_grid = {"learning_rate": [0.1],
              "n_estimators": [100, 150],  # number of boosting rounds (trees)
              "max_depth": [3, 8, 15],  # maximum depth of each tree
              "colsample_bytree": [0.33, 0.66],   # fraction of columns sampled for each tree
              "subsample": [0.5, 0.8, 1]  # fraction of rows sampled for each tree
             }


gb_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.9557063761385962, 'confusion_matrix': 'tn, fp, fn, tp[213388   1201   8967   6003]', 'accuracy': 0.9557063761385962, 'precision': 0.8332870627429206, 'recall': 0.40100200400801606}


All of the models I tested reached similar accuracy (roughly 0.95), but they differ on recall: XGBoost performed best (0.40), followed by the decision tree baseline (0.33) and the random forest (0.29).

Remember that accuracy is not the most important metric here because of the class imbalance: always guessing "not responsive" would already give roughly 93.5% accuracy.
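
Note that the `f1_score` values printed above are identical to the accuracy values, because `average='micro'` reduces to accuracy on a single-label problem. As a sketch, the F1 of the responsive class can be recomputed from the confusion matrix counts reported above; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and accuracy for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# (tn, fp, fn, tp) copied from the printed outputs above
reported = {
    "decision_tree": (212264, 2325, 9996, 4974),
    "random_forest": (213203, 1386, 10602, 4368),
    "xgboost": (213388, 1201, 8967, 6003),
}

results = {name: binary_metrics(*counts) for name, counts in reported.items()}
for name, m in results.items():
    print(f"{name}: f1={m['f1']:.3f} precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} accuracy={m['accuracy']:.3f}")
```

The positive-class F1 is far lower than the micro-averaged figure, which is a more honest picture of how well the responsive sellers are identified.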

Below I discuss a few possible ways to improve the models.

___

### Part C) Insights generated and discussion of possible improvements


Using the trained random forest model, we can take advantage of its feature importances to see which features are strong predictors.

In [158]:
feature_important_list = sorted(zip(map(lambda x: round(x, 4), rf_model.feature_importances_), list(X_prepped)), reverse=True)
feature_important_list_top5 = feature_important_list[:5]

In [159]:
# show top 5 features from feature important list
print(feature_important_list_top5)

[(0.3084, 'last_7d_purchases'), (0.1828, 'last_7d_item_views'), (0.1455, 'last_2d_item_views'), (0.1243, 'last_2d_purchases'), (0.0614, 'last_7d_searches')]


Insights to Leverage

The features that best predict whether someone will purchase after selling are last_7d_purchases and last_7d_item_views, i.e. recent buying and browsing activity. Therefore eBay should target sellers who have purchased or viewed items in the last 7 days.

### Possible ideas for model improvement

- Dimensionality reduction: use feature selection and PCA to reduce the feature space
- Consider binning certain variables
- Scaling would be needed for algorithms like SVM or KNN
- Add new features if possible

A comment about unit testing (not done for this assignment): ideally there would be unit tests for each of the functions written, to ensure they behave as intended.
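
As a rough illustration of what such tests might look like, here is a pytest-style sketch for the two labelling functions. It re-declares a trimmed copy of `data_transformer` so the example is self-contained; the test function names are hypothetical.

```python
class data_transformer(object):
    # trimmed copy of the notebook's class, repeated here so the sketch runs on its own
    def __init__(self):
        self.amount_consider_as_responsive = 0
        self.perc_consider_as_significant_gain = 0.2

    def generate_label(self, y):
        return 1 if y > self.amount_consider_as_responsive else 0

    def price_difference_gain_or_loss(self, y):
        return 1 if y > self.perc_consider_as_significant_gain else 0


def test_generate_label():
    dt = data_transformer()
    assert dt.generate_label(12.5) == 1
    assert dt.generate_label(0) == 0
    # comparisons with NaN are False, so a missing value is labelled 0
    assert dt.generate_label(float("nan")) == 0


def test_price_difference_gain_or_loss():
    dt = data_transformer()
    assert dt.price_difference_gain_or_loss(0.5) == 1
    # the threshold itself does not count as a significant gain
    assert dt.price_difference_gain_or_loss(0.2) == 0
    assert dt.price_difference_gain_or_loss(-0.3) == 0
```

Saved to a file, these can be collected by a test runner such as pytest; the NaN case documents behaviour that is easy to overlook when labels are generated before nulls are dropped.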

___

### Experimentation 


### 1) Experimentation metrics

##### To take this even further, I will discuss the use of experimentation to see which types of marketing message are more effective.

Given that the feature has been proposed by the eBay team to increase sellers' incentive to buy after selling, the key measure of success should reflect an increase in the proportion of responsive sellers.

    - I would define the KPI as the average number of purchases after selling per seller.

In addition, we want to track these metrics at the seller level:

    i) average views of other products (are sellers considering purchasing?)
    ii) average revenue per seller (since eBay's marketing email may be annoying to non-responsive sellers, is this feature worthwhile?)
    iii) open rate of these emails (making sure the new feature won't cause sellers to stop opening the transactional email)

### 2) Experimentation plan

Suppose that sellers are exposed to one of two options at 4pm.

    - No special message 
    - Get a marketing message
    
We will refer to these 2 groups of sellers as the control group and the test group.
For this experiment, both groups should come from the same city/region.

We are interested in testing one hypothesis: whether the marketing message increases the average number of purchases completed after selling per seller. Since the metrics are averages over i.i.d. sellers (assuming we randomized correctly), the central limit theorem tells us that the sample means are approximately normal for large samples, so a two-sample z-test will suffice to determine whether the difference in average purchases completed is statistically significant.
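
A minimal sketch of such a test, using only the standard library, is shown below. The data is simulated: the group sizes and the purchase rates of 6.5% and 7.5% are assumptions for illustration, not measurements, and the helper name `two_sample_z_test` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(control, treatment):
    """Two-sided z-test for a difference in means between two independent samples."""
    diff = mean(treatment) - mean(control)
    # standard error of the difference, using the sample variances
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# simulated purchases per seller (0 or 1), with assumed rates for each group
rng = random.Random(42)
control = [1 if rng.random() < 0.065 else 0 for _ in range(20000)]
treatment = [1 if rng.random() < 0.075 else 0 for _ in range(20000)]

z, p = two_sample_z_test(control, treatment)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

If the p-value falls below the chosen significance level (commonly 0.05), the difference in average purchases would be considered statistically significant.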

We need to consider experimental power: we need enough observations that the minimal difference of interest in a metric can be detected. This impacts the length of the experiment.
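
As a sketch of how the required number of observations could be estimated, the standard formula for comparing two means with equal group sizes is n = 2σ²(z₁₋α/₂ + z₁₋β)² / δ² per group. The standard deviation and minimal detectable difference used below are placeholder assumptions; in practice they would come from historical data and from the business. The helper name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, min_detectable_diff, alpha=0.05, power=0.8):
    """Observations needed per group for a two-sided two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / min_detectable_diff ** 2
    return ceil(n)


# placeholder inputs: standard deviation of 1.0 purchase, detect a lift of 0.1 purchases
n_needed = sample_size_per_group(sigma=1.0, min_detectable_diff=0.1)
print(n_needed)
```

Dividing the required sample size by the number of eligible sellers per day at the chosen exposure rate gives a rough estimate of how long the experiment needs to run.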

For a risky experiment, we may not want to expose the whole population at once. If the traffic is high enough, we can often get away with a lower exposure rate and still complete the experiment in a reasonable time.