
# <center>Predicting the returns of orders for a retail shoe seller</center>




## Introduction
###  Challenge SD210 2018
#### Authors :  Florence D'Alché & Umut Şimşekli & Moussab Djerrab


**Context of the challenge:**

An e-commerce company that sells shoes has a high return rate on its products, above 20%. This large number of returns and exchanges has a negative impact on its margin. To remedy this problem, the company wants to understand the phenomenon better and to have tools that quantify the probability of return for a given product. It makes available its database of orders placed between October 2011 and October 2015, its product feedback data, and its customer and product databases (a data dictionary is provided).

**Goal of the challenge:**
<ul>
<li>Identify the conditions that favor product returns (e.g. what type of product is usually returned, which customers are more likely to return a product, what type of order or purchase context most often leads to a return).</li>
<li>Build a return forecasting model for each product in a shopping cart.
</li>
</ul>

To go further: this project aims to bring out purchasing behaviors. With this knowledge, the e-merchant wishes to plan its activity better. In particular, it wants to forecast the turnover generated by its customers.



**Training data:**

The training dataset contains $N = 1067290$ order lines. For each order line, it reports whether the order has been returned (***ReturnQuantityBin***) and the quantity returned (***ReturnQuantity***). The target column is ***ReturnQuantityBin***, a binary column ($y = 1$ if returned and $y = 0$ otherwise).

**Test data:**

The test data contain $N_\text{test} = 800468$ lines of orders. Everything else is similar to the training data.


## Additional Data

As part of the challenge, two additional datasets are available, namely **customers.csv** and **products.csv**. These two sets contain information on the customers and on the products. A good prediction model will most likely require extracting information from these datasets. Students are free to use these data as they see fit. Please keep in mind that both sets also contain customers and products that are not present in the training or test sets.

A dictionary of variables (**dictionnary.xlsx**) is available in the folder containing the datasets. Please refer to it for the definition of each variable.


## The goal and the performance criterion

In this challenge, we will use an evaluation metric that is commonly used in binary prediction, namely the ROC AUC criterion. **The closer to 1 the better (be afraid if it is below 0.5, which is the score of a random guess).**
Hence the file to send must have the following form:


| <center> probability </center>  |
| ------------- |
| <center> .90  </center>         |
| <center> ...  </center>         |
| <center> .42  </center>         |


The order of the probabilities needs to respect the order in the test set.
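
As a reminder of what the metric measures: ROC AUC is the probability that a randomly chosen returned order receives a higher score than a randomly chosen non-returned one, with ties counted as one half. The sketch below computes it by brute force on a toy example. The function name `pairwise_auc` and the data are illustrative only; on the real data, `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
y_toy = [1, 0, 1, 0, 0]
p_toy = [0.90, 0.30, 0.42, 0.42, 0.10]
auc_toy = pairwise_auc(y_toy, p_toy)
print(auc_toy)
```

Only the ranking of the probabilities matters for this criterion, not their calibration, which is why the submitted values must stay in the same order as the test rows.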



# Training Data

https://www.dropbox.com/sh/uo4oudw43j45mp3/AACA0UqkitNKSWdE_7fs2Wbla?dl=0


In [2]:
from __future__ import division
#from importlib import reload
import os
import sys
#reload(sys)
#sys.setdefaultencoding("utf-8") -> PYTHON 2 ONLY

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

## Loading the data

In [3]:
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
X_train = pd.read_csv("X_train.csv")
X_test   = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv")

## Defining a feature transformation

In [4]:
def funk_mask(d):
    """Define a simple mask over the input data: drop identifiers and dates, one-hot encode the rest."""
    columns_ext = ["OrderNumber", "VariantId", "CustomerId", "OrderCreationDate", "OrderShipDate", "BillingPostalCode"]
    # .copy() avoids modifying a view of the caller's DataFrame
    X1 = d.loc[:, [xx for xx in d.columns if xx not in columns_ext]].copy()
    # Prices use a decimal comma; convert them to floats (works on Python 2 and 3)
    X1["UnitPMPEUR"] = X1["UnitPMPEUR"].str.replace(",", ".").astype(np.float64)
    # One-hot encode the remaining string (object) columns
    columns2bin = [x for x in X1.columns if X1[x].dtype == np.dtype('O')]
    X2 = pd.get_dummies(X1.loc[:, columns2bin])
    X1 = X1.loc[:, [xx for xx in X1.columns if xx not in columns2bin]]
    res = pd.concat([X1, X2], axis=1)
    res = res.fillna(0)
    return res

## Applying the mask

In [5]:
x1 = funk_mask(X_train)
x2 = funk_mask(X_test)
seleckt_columns = np.intersect1d(x1.columns,x2.columns)
x1 = x1.loc[:,seleckt_columns]
x2 = x2.loc[:,seleckt_columns]
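
The intersection step matters because `pd.get_dummies` creates one column per category actually observed, so the train and test tables can end up with different columns. A small sketch on made-up data (the frames `train_toy` and `test_toy` are hypothetical) shows the effect, together with `reindex` as one possible alternative that keeps every training column:

```python
import pandas as pd

train_toy = pd.DataFrame({"IsoCode": ["FR", "SE", "FR"], "Quantity": [1, 2, 1]})
test_toy = pd.DataFrame({"IsoCode": ["FR", "ES"], "Quantity": [1, 1]})

d_train = pd.get_dummies(train_toy)
d_test = pd.get_dummies(test_toy)

# IsoCode_SE exists only in train and IsoCode_ES only in test,
# so the intersection keeps just the shared columns.
common = sorted(set(d_train.columns) & set(d_test.columns))

# Alternative: keep all training columns and fill those missing in test with 0.
d_test_aligned = d_test.reindex(columns=d_train.columns, fill_value=0)

print(common)
print(list(d_test_aligned.columns))
```

The intersection discards categories seen on one side only, whereas the `reindex` variant keeps what the model was trained on; which one is preferable depends on how rare those categories are.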

## Supervised learning : Logistic regression model

In [6]:
clf = LogisticRegression()
clf.fit(x1.iloc[:50000], y_train.ReturnQuantityBin[:50000])
y_tosubmit = clf.predict_proba(x2.loc[:,x1.columns])

## Score of our prediction : on the train

In [7]:
# Score on the first 100000 training rows (the first 50000 were used for fitting, so this is optimistic)
yres = clf.predict_proba(x1.iloc[:100000])
roc_auc_score(y_train.ReturnQuantityBin.iloc[:100000], yres[:,1])

# Submission to the system
np.savetxt('y_pred.txt', y_tosubmit[:,1], fmt='%f')
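
Because the training score computed earlier uses rows that overlap with the ones used for fitting, it is likely optimistic. A minimal sketch of a held-out estimate follows, run here on synthetic data so that it is self-contained. `X_demo` and `y_demo` are made-up stand-ins for `x1` and `y_train.ReturnQuantityBin`, and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 5))
# Synthetic binary target driven by the first two features plus noise.
logits = 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=2000)
y_demo = (logits > 0).astype(int)

# Hold out 20% of the rows; stratify keeps the class ratio in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf_demo = LogisticRegression()
clf_demo.fit(X_fit, y_fit)

auc_fit = roc_auc_score(y_fit, clf_demo.predict_proba(X_fit)[:, 1])
auc_val = roc_auc_score(y_val, clf_demo.predict_proba(X_val)[:, 1])
print("AUC on fitted rows: %.3f, on held-out rows: %.3f" % (auc_fit, auc_val))
```

On the challenge data, the held-out figure is the one to compare between models, since it is closer to what the test set will measure.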


# <center> That's all folks; Good Luck! </center>

In [8]:
print("CUSTOMERS :")
customers.head()

CUSTOMERS :


Unnamed: 0,CustomerId,CountryISOCode,BirthDate,Gender,FirstOrderDate
0,14089083.0,SE,1979-02-05 00:00:00,Femme,2013-03-16 23:00:05
1,12862066.0,FR,1982-08-04 00:00:00,Femme,2012-02-14 17:47:33
2,14791699.0,FR,1965-04-02 00:00:00,Femme,2013-10-04 23:10:42
3,10794664.0,FR,1966-04-09 00:00:00,Femme,2010-03-25 18:46:59
4,15268576.0,ES,1980-04-22 00:00:00,Femme,2014-03-19 10:48:39


In [9]:
print("PRODUCTS :")
products.head()

PRODUCTS :


Unnamed: 0,VariantId,GenderLabel,MarketTargetLabel,SeasonLabel,SeasonalityLabel,BrandId,UniverseLabel,TypeBrand,ProductId,ProductType,...,UpperHeight,HeelHeight,PurchasePriceHT,IsNewCollection,SubtypeLabel,UpperMaterialLabel,LiningMaterialLabel,OutSoleMaterialLabel,RemovableSole,SizeAdviceDescription
0,728257.0,Homme,Classique,Automne/Hiver,Saisonnier,66.0,Détente,Standard,17267.0,Baskets,...,,0.0,30.5,0.0,Montantes,,,,False,Prenez une taille en dessous de votre pointure...
1,806356.0,Femme,ND,Automne/Hiver,Saisonnier,842.0,ND,Standard,30824.0,Baskets,...,0.0,0.0,43.0,0.0,Montantes,,,,True,Prenez votre pointure habituelle
2,768790.0,Femme,ND,Automne/Hiver,Reconduit,988.0,Ville,Standard,62475.0,Bottines et boots,...,6.0,3.0,54.9,0.0,Bout pointu,,,,False,Prenez votre pointure habituelle
3,515679.0,Femme,ND,Automne/Hiver,Saisonnier,769.0,Ville,Standard,43983.0,Escarpins,...,0.0,13.0,34.5,0.0,Bout rond,,,,False,Prenez votre pointure habituelle
4,1025246.0,Femme,ND,Automne/Hiver,Saisonnier,1244.0,ND,Standard,81493.0,Bottines et boots,...,8.0,4.0,43.76,0.0,Bout rond,,,,False,Prenez votre pointure habituelle


In [10]:
X_train.head()

Unnamed: 0,OrderNumber,VariantId,LineItem,CustomerId,OrderStatusLabel,OrderTypelabel,SeasonLabel,PayementModeLabel,CustomerTypeLabel,IsoCode,DeviceTypeLabel,PricingTypeLabel,TotalLineItems,Quantity,UnitPMPEUR,OrderCreationDate,OrderShipDate,OrderNumCustomer,IsOnSale,BillingPostalCode
0,73521754,439729,1,12443972,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Plein Tarif,2,1,5264,2011-10-26 12:10:48,2011-10-26 18:27:00,1,0.0,87000
1,73521754,440174,2,12443972,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Plein Tarif,2,1,5264,2011-10-26 12:10:48,2011-10-26 18:27:00,1,0.0,87000
2,73525226,494501,1,12443958,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Plein Tarif,1,1,1317,2011-10-26 12:11:38,2011-10-26 17:48:00,1,0.0,77700
3,73529009,439590,1,12443946,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Plein Tarif,2,1,564,2011-10-26 12:13:09,2011-10-26 17:59:00,1,0.0,44600
4,73529009,559476,2,12443946,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Plein Tarif,2,1,37,2011-10-26 12:13:09,2011-10-26 17:59:00,1,0.0,44600


In [11]:
d1 = "1969-10-03 00:00:00"
from datetime import datetime

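try:
    from tqdm import tqdm_notebook as tqdm
except ImportError:
    print("tqdm is not installed. Run 'pip install tqdm' in a terminal and re-run the cell.")


def age_from_date(date):
    """Return the age in whole years, as of today, for a 'YYYY-MM-DD HH:MM:SS' birth date string."""
    try:
        d = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Some strings are longer than the expected format; keep the first 19 characters only
        d = datetime.strptime(date[0:19], "%Y-%m-%d %H:%M:%S")
    now = datetime.now()
    delta = now.year - d.year
    # Subtract one year if the birthday has not occurred yet this year
    if (now.month, now.day) < (d.month, d.day):
        delta -= 1
    return delta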

age_from_date(d1)

48
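
Since `age_from_date` relies on `datetime.now()`, its output changes with the day the notebook is run. A possible variant is sketched below, under the assumption that a fixed reference date is acceptable (the name `age_at` and the reference date are hypothetical choices). It compares (month, day) tuples to decide whether the birthday has already passed.

```python
from datetime import datetime


def age_at(birth, reference):
    """Age in whole years on `reference` for a 'YYYY-MM-DD HH:MM:SS' birth string."""
    b = datetime.strptime(birth[:19], "%Y-%m-%d %H:%M:%S")
    # Subtract one year if the birthday has not yet occurred in the reference year.
    not_yet = (reference.month, reference.day) < (b.month, b.day)
    return reference.year - b.year - int(not_yet)


# Hypothetical reference: the end of the order period covered by the data
ref = datetime(2015, 10, 31)
print(age_at("1969-10-03 00:00:00", ref))
print(age_at("1969-11-03 00:00:00", ref))
```

With a fixed reference, the `Age` feature is reproducible and reflects the customer's age around the time of the orders rather than at the time of the analysis.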

In [12]:
from tqdm import tqdm_notebook as tqdm

tabAge = []
for date in tqdm(customers["BirthDate"]):
    tabAge.append(age_from_date(date))
customers["Age"] = tabAge




In [13]:
customers.head()

Unnamed: 0,CustomerId,CountryISOCode,BirthDate,Gender,FirstOrderDate,Age
0,14089083.0,SE,1979-02-05 00:00:00,Femme,2013-03-16 23:00:05,39
1,12862066.0,FR,1982-08-04 00:00:00,Femme,2012-02-14 17:47:33,35
2,14791699.0,FR,1965-04-02 00:00:00,Femme,2013-10-04 23:10:42,53
3,10794664.0,FR,1966-04-09 00:00:00,Femme,2010-03-25 18:46:59,51
4,15268576.0,ES,1980-04-22 00:00:00,Femme,2014-03-19 10:48:39,37




In [15]:
tab = pd.merge(customers, X_train)
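
One point to check here: `pd.merge` defaults to an inner join, so orders whose `CustomerId` has no match in `customers` are dropped and the result no longer lines up row by row with `y_train`. The sketch below, on made-up frames (`orders_toy` and `customers_toy` are hypothetical), shows a left join that keeps every order in its original position. It assumes `CustomerId` is unique in the customer table.

```python
import pandas as pd

orders_toy = pd.DataFrame({"OrderNumber": [10, 11, 12, 13], "CustomerId": [3, 1, 99, 1]})
customers_toy = pd.DataFrame({"CustomerId": [1, 2, 3], "Gender": ["Femme", "Homme", "Femme"]})

# Default inner join: the order from unknown customer 99 disappears.
inner = pd.merge(customers_toy, orders_toy)

# Left join from the orders: one output row per order, in the original order.
# validate raises an error if CustomerId is duplicated in the customer table.
left = pd.merge(orders_toy, customers_toy, on="CustomerId", how="left", validate="many_to_one")

print(len(inner), len(left))
print(left)
```

Orders without a matching customer get missing values in the customer columns, which then need to be imputed before fitting a model.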

In [16]:
del tab['FirstOrderDate']
del tab['BirthDate']
del tab['OrderCreationDate']
del tab['OrderShipDate']

In [32]:
tab.head()

Unnamed: 0,CustomerId,CountryISOCode,Gender,Age,OrderNumber,VariantId,LineItem,OrderStatusLabel,OrderTypelabel,SeasonLabel,PayementModeLabel,CustomerTypeLabel,IsoCode,DeviceTypeLabel,PricingTypeLabel,TotalLineItems,Quantity,UnitPMPEUR,OrderNumCustomer,IsOnSale
0,14089100.0,SE,Femme,39,89882287,728257,1,Expédié,DIRECT,Printemps/Eté,Carte bancaire,Nouveau,SE,ND,Promo Sans CP,2,1,24.343,1,0.0
1,14089100.0,SE,Femme,39,89882287,806356,2,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,SE,ND,Promo Sans CP,2,1,32.8711,1,0.0
2,12862100.0,FR,Femme,35,67014446,288068,1,Expédié,DIRECT,Automne/Hiver,PayPal,Nouveau,FR,ND,Soldes,2,1,38.64,1,1.0
3,12862100.0,FR,Femme,35,67014446,515679,2,Expédié,DIRECT,Automne/Hiver,PayPal,Nouveau,FR,ND,Plein Tarif,2,1,32.77,1,0.0
4,12862100.0,FR,Femme,35,24318335,678091,1,Expédié,DIRECT,Printemps/Eté,PayPal,Fidélisé,FR,ND,Promo Avec CP,1,1,38.22,2,0.0


In [18]:
s='saousan'

In [19]:
# Replace the decimal comma by a dot in one vectorized call
tab["UnitPMPEUR"] = tab["UnitPMPEUR"].str.replace(",", ".")




In [20]:
tab['UnitPMPEUR']=tab['UnitPMPEUR'].astype(float)

In [23]:
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, RandomForestClassifier
import matplotlib as mpl
from sklearn.metrics import r2_score as r2
from matplotlib.pyplot import cm 
from sklearn import datasets, svm, preprocessing
from sklearn.model_selection import cross_val_score

In [24]:
del tab["BillingPostalCode"]

In [38]:
cols_to_transform = [ 'CountryISOCode', 'Gender', 'OrderStatusLabel', 'OrderTypelabel', 'SeasonLabel', 'PayementModeLabel','CustomerTypeLabel','IsoCode','DeviceTypeLabel','PricingTypeLabel' ]
tab = pd.get_dummies( tab,columns = cols_to_transform )



In [39]:


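from sklearn.ensemble import BaggingClassifier

# Fit on the binary target only; tab must hold the same rows, in the same order, as y_train
tree = DecisionTreeClassifier(max_depth=30)
tree.fit(tab, y_train.ReturnQuantityBin)

# The base estimator is a classifier, so wrap it in a BaggingClassifier,
# and fit on the encoded table rather than on the raw X_train
bagging = BaggingClassifier(tree)
bagging.fit(tab, y_train.ReturnQuantityBin)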

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
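
The `ValueError` above indicates that the feature matrix passed to `fit` still contains missing or infinite values. One way to locate and neutralize them before fitting is sketched below on a made-up frame (`tab_toy` is hypothetical; filling with 0 mirrors what `funk_mask` does and is only one of several possible imputation choices).

```python
import numpy as np
import pandas as pd

tab_toy = pd.DataFrame({
    "Age": [39.0, np.nan, 53.0],
    "UnitPMPEUR": [24.3, np.inf, 38.6],
    "IsOnSale": [0.0, 1.0, np.nan],
})

# Treat +/-inf as missing so that a single check covers both problems.
cleaned = tab_toy.replace([np.inf, -np.inf], np.nan)

# Count missing values per column to see where they come from.
missing_per_column = cleaned.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Simple imputation: replace every missing value by 0.
tab_ready = cleaned.fillna(0)
print(np.isfinite(tab_ready.to_numpy()).all())
```

Inspecting the per-column counts first is worthwhile, since a column that is mostly missing may be better dropped than filled.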

In [29]:
tab.tail()

Unnamed: 0,CustomerId,CountryISOCode,Gender,Age,OrderNumber,VariantId,LineItem,OrderStatusLabel,OrderTypelabel,SeasonLabel,PayementModeLabel,CustomerTypeLabel,IsoCode,DeviceTypeLabel,PricingTypeLabel,TotalLineItems,Quantity,UnitPMPEUR,OrderNumCustomer,IsOnSale
790408,14585000.0,NL,Femme,38,62112457,592732,1,Expédié,DIRECT,Printemps/Eté,Carte bancaire,Nouveau,NL,Tablet,Promo Avec CP,1,1,34.02,1,0.0
790409,14243000.0,FR,Femme,48,61488192,938737,1,Expédié,DIRECT,Printemps/Eté,Carte bancaire,Nouveau,FR,ND,Plein Tarif,1,1,17.8,1,0.0
790410,13711800.0,FR,Femme,30,70803896,538985,1,Expédié,DIRECT,Automne/Hiver,Carte bancaire,Nouveau,FR,ND,Promo Avec CP,1,1,75.0,1,0.0
790411,13154100.0,FR,Femme,34,60877753,430461,1,Expédié,DIRECT,Printemps/Eté,Carte bancaire,Nouveau,FR,ND,Promo Avec CP,1,1,24.95,1,0.0
790412,13176000.0,IT,Femme,43,10409304,574985,1,Expédié,DIRECT,Printemps/Eté,Carte bancaire,Nouveau,IT,ND,Plein Tarif,1,1,12.0058,1,0.0


In [None]:
y_train.head()