# Prepare data for training dataframe

The goal is to build a training table in which each `order_id` is paired with every coupon that was available on the order date in the departments the order's products came from, together with a `True`/`False` flag indicating whether that coupon was actually used in the order.

The steps are as follows:
1. Create the `coupon_date_department` table, which maps each coupon to its applicable departments and to every date on which it was valid.
2. Create the `order_date_department` table, which maps each order to the departments its products came from and to the order date.
3. Create `order_coupons_available` by merging the two tables above on date and department. The result maps each order to the coupons available on the order date, limited by department (i.e. a row is kept only if the order contained products from a department for which a coupon existed).
4. Create `order_coupon_used` by dropping columns from `order_details`, leaving only the mapping between orders and the coupons used in them.
5. Combine the results of steps 3 and 4 into a dataset of order-coupon pairs, with a flag indicating whether each coupon was used.
6. Add customer details (such as age and gender, plus some statistics about customer purchases).
7. Add coupon details (type, discount, mean product price, etc.).
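
The core of steps 3 to 5 can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the frames `orders_toy`, `coupons_toy` and `used_toy` and all their values are invented, and an inner join is used here so that orders without any available coupon simply drop out.

```python
import pandas as pd

# Toy data; every value below is invented for illustration only.
orders_toy = pd.DataFrame({
    'order_id': [1, 1, 2],
    'department': ['Men', 'Women', 'Men'],
    'date': pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-05']),
})
coupons_toy = pd.DataFrame({
    'coupon_id': [10, 11, 12],
    'department': ['Men', 'Women', 'Men'],
    'date': pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02']),
})
used_toy = pd.DataFrame({'order_id': [1], 'coupon_id': [10]})

pair_cols = ['order_id', 'coupon_id']

# Step 3: coupons available to each order (same date, same department)
available = orders_toy.merge(coupons_toy, on=['date', 'department'])[pair_cols]

# Step 5: flag the available (order, coupon) pairs that were actually used
available['coupon_used'] = (
    available.set_index(pair_cols).index.isin(used_toy.set_index(pair_cols).index)
)
print(available)
```

Order 2 has no matching coupon on its date, so it does not appear in the result; coupon 12 is valid on a date with no orders and is likewise absent.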

In [1]:
import datetime
import os

from IPython.display import Image
import numpy as np
import pandas as pd

In [2]:
Image('../data_diagram.png')

FileNotFoundError: No such file or directory: '../data_diagram.png'

<IPython.core.display.Image object>

## Read and Prepare Tables

In [3]:
data_dir = 'data_0419_0'

In [4]:
coupons = pd.read_csv(os.path.join(data_dir, 'coupons.csv'))
coupons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          971 non-null    int64 
 1   type        971 non-null    object
 2   department  971 non-null    object
 3   discount    971 non-null    int64 
 4   how_many    971 non-null    int64 
 5   start_date  971 non-null    object
 6   end_date    971 non-null    object
dtypes: int64(3), object(4)
memory usage: 53.2+ KB


In [5]:
coupons.sample(5)

Unnamed: 0,id,type,department,discount,how_many,start_date,end_date
797,798,just_discount,Girls,17,1,2012-07-06,2012-07-27
802,803,buy_all,Boys,38,5,2012-07-11,2012-08-06
793,794,buy_all,Sport,10,4,2012-07-03,2012-08-01
442,443,just_discount,Girls,14,1,2011-05-31,2011-06-13
13,14,buy_all,Women,61,3,2010-01-01,2010-01-13


In [6]:
coupons.rename(columns={'id': 'coupon_id'}, inplace=True)
coupons.start_date = pd.to_datetime(coupons.start_date, format='%Y-%m-%d')
coupons.end_date = pd.to_datetime(coupons.end_date, format='%Y-%m-%d')

In [7]:
coupon_product = pd.read_csv(os.path.join(data_dir, 'coupon_product.csv'))
coupon_product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   coupon_id   1700 non-null   int64
 1   product_id  1700 non-null   int64
dtypes: int64(2)
memory usage: 26.7 KB


In [8]:
products = pd.read_csv(os.path.join(data_dir, 'products.csv'))
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           3000 non-null   int64  
 1   name         3000 non-null   object 
 2   category     3000 non-null   object 
 3   sizes        3000 non-null   object 
 4   vendor       3000 non-null   object 
 5   description  3000 non-null   object 
 6   buy_price    3000 non-null   float64
 7   department   3000 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 187.6+ KB


In [9]:
products.sample(5)

Unnamed: 0,id,name,category,sizes,vendor,description,buy_price,department
2302,2303,SEHIHENTOND - Dark brown Onesy for Women,Onesy,S-XL,Nununu,SEHIHENTOND - Dark brown Onesy for Women by Nu...,10.95,Women
419,420,TERNETHSTOR - Bone Cargo short for Men,Cargo short,one-size,Converse,TERNETHSTOR - Bone Cargo short for Men by Conv...,10.73,Men
2199,2200,HADTITEDNDAR - Black V-neck t-shirt for Men,V-neck t-shirt,XS-XXL,Gucci,HADTITEDNDAR - Black V-neck t-shirt for Men by...,4.84,Men
2967,2968,TOHIS - Bisque Bathing suit for Sport,Bathing suit,28-48,Carhartt,TOHIS - Bisque Bathing suit for Sport by Carha...,4.3,Sport
1741,1742,NGRESTEVETI - Catawba Cargo short for Women,Cargo short,28-48,Dior,NGRESTEVETI - Catawba Cargo short for Women by...,6.25,Women


In [10]:
products.rename(columns={'id': 'product_id'}, inplace=True)

In [11]:
orders = pd.read_csv(os.path.join(data_dir, 'orders.csv'))
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302793 entries, 0 to 302792
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id           302793 non-null  int64 
 1   customer_id  302793 non-null  int64 
 2   order_date   302793 non-null  object
dtypes: int64(2), object(1)
memory usage: 6.9+ MB


In [12]:
orders.rename(columns={'id': 'order_id', 'order_date': 'date'}, inplace=True)
orders.date = pd.to_datetime(orders.date, format='%Y-%m-%d').dt.date

In [13]:
order_details = pd.read_csv(os.path.join(data_dir, 'order_details.csv'))
order_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2162820 entries, 0 to 2162819
Data columns (total 7 columns):
 #   Column            Dtype  
---  ------            -----  
 0   id                int64  
 1   order_id          int64  
 2   product_id        int64  
 3   quantity_ordered  int64  
 4   original_price    float64
 5   buy_price         float64
 6   coupon_id         float64
dtypes: float64(3), int64(4)
memory usage: 115.5 MB


In [14]:
order_details.drop(['id'], axis=1, inplace=True)

In [15]:
order_details.sample(5)

Unnamed: 0,order_id,product_id,quantity_ordered,original_price,buy_price,coupon_id
1968308,275317,1722,17,10.33,10.33,
1088524,154704,2806,3,9.94,3.7772,476.0
1954012,273286,1023,6,7.0,7.0,
695959,96807,2257,20,2.08,2.08,
1886904,263618,37,1,18.83,14.3108,819.0


In [16]:
order_details.coupon_id.isnull().sum() / len(order_details)

0.7023219685410714

In [17]:
customers = pd.read_csv(os.path.join(data_dir, 'customers.csv'))
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           1000 non-null   int64 
 1   name         1000 non-null   object
 2   gender       1000 non-null   object
 3   age          1000 non-null   int64 
 4   phone        1000 non-null   object
 5   address      1000 non-null   object
 6   city         1000 non-null   object
 7   state        1000 non-null   object
 8   postalCode   1000 non-null   int64 
 9   country      1000 non-null   object
 10  creditLimit  1000 non-null   int64 
dtypes: int64(4), object(7)
memory usage: 86.1+ KB


In [18]:
customers.rename(columns={'id': 'customer_id'}, inplace=True)

In [19]:
customers.sample(5)

Unnamed: 0,customer_id,name,gender,age,phone,address,city,state,postalCode,country,creditLimit
723,724,Manogna DDS,F,59,(514)172-5225,181 Miller Trail,Valley Forge,Pennsylvania,19496,US,7603
169,170,Afton Batz,F,48,(210)599-1186,107 Highland Trail,Scammon,Kansas,66773,US,8271
664,665,Francisco Collins,M,32,(268)454-9145,59 10th Road,Pittsburgh,Pennsylvania,15251,US,7171
36,37,Yandell PhD,M,20,(633)347-8509,341 7th Avenue,Randolph,Maine,4346,US,9200
224,225,Ladainian MD,M,78,(773)588-2713,173 Oak Road,Keldron,South Dakota,57634,US,3366


## Data Prep

#### Step 1. Create `coupon_date_department` table

1. Create `coupon_department` mapping coupons to departments for which they are valid
2. Create `coupon_dates` mapping coupons to all dates on which they were valid
3. Merge the two into `coupon_date_department`
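
Sub-step 2, expanding each validity period into one row per valid day, can be sketched on toy data as follows. The frame `coupons_toy` and its dates are invented, and the sketch assumes pandas 1.2 or later, where `merge` accepts `how='cross'`; with older versions a constant helper key is needed instead.

```python
import pandas as pd

# Toy validity periods; values invented for illustration only.
coupons_toy = pd.DataFrame({
    'coupon_id': [1, 2],
    'start_date': pd.to_datetime(['2010-01-01', '2010-01-03']),
    'end_date': pd.to_datetime(['2010-01-03', '2010-01-04']),
})

# One row per calendar day covering the whole period
all_dates = pd.DataFrame({'date': pd.date_range('2010-01-01', '2010-01-04', freq='D')})

# Cross join: every coupon paired with every day
coupon_dates_toy = coupons_toy.merge(all_dates, how='cross')

# Keep only the days that fall inside each coupon's validity period (inclusive)
in_period = coupon_dates_toy['date'].between(
    coupon_dates_toy['start_date'], coupon_dates_toy['end_date'])
coupon_dates_toy = coupon_dates_toy.loc[in_period, ['coupon_id', 'date']].reset_index(drop=True)
print(coupon_dates_toy)
```

The cross join grows as coupons × days, which is manageable at the scale reported in this notebook (971 coupons, 1114 days) but may need a different approach for much larger inputs.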

In [20]:
# Step 1.1 Map coupons to departments
coupon_department = pd.merge(coupon_product, products[['product_id', 'department']], on='product_id')\
                      .drop(['product_id'], axis=1).drop_duplicates()

# Coupon_product does not include coupons valid for all products in a department, add this info here
department_coupons = coupons.loc[coupons.type == 'department'][['coupon_id', 'department']]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
coupon_department = pd.concat([coupon_department, department_coupons]).sort_values(by='coupon_id').reset_index(drop=True)
coupon_department

Unnamed: 0,coupon_id,department
0,1,Men
1,2,Men
2,3,Men
3,4,Sport
4,5,Sport
...,...,...
966,967,Girls
967,968,Men
968,969,Women
969,970,Boys


In [21]:
# Validate that all coupons are present
assert len(coupon_department) == len(coupons.coupon_id.unique())

In [22]:
# Step 1.2 Map coupons to dates on which they were valid
coupon_dates = coupons.drop(['type', 'department', 'discount', 'how_many'], axis=1)
# Get the earliest and latest date in the dataset
start = coupon_dates.start_date.min()
end = coupon_dates.end_date.max()
days = (end-start).days + 1
print(f'Start: {start.date()}, end: {end.date()}. {days} days')

Start: 2010-01-01, end: 2013-01-18. 1114 days


In [23]:
# Create a dataframe with row for each day from the earliest to the latest date in the set
all_dates = pd.DataFrame(pd.date_range(start=start, end=end, freq='D'), columns=['date'])
assert days == len(all_dates)

In [24]:
# Step 1.2.1: Perform a cross join of `all_dates` and `coupon_dates` - which contains info on validity periods
coupon_dates['key'] = 1
all_dates['key'] = 1
coupon_dates = pd.merge(coupon_dates, all_dates, on='key').drop('key', axis=1)

# Step 1.2.2 Drop rows where a date does not fall within the validity period of a coupon
coupon_dates = coupon_dates[(coupon_dates['date'] >= coupon_dates['start_date']) & \
                            (coupon_dates['date'] <= coupon_dates['end_date'])]
coupon_dates.drop(['start_date', 'end_date'], axis=1, inplace=True)
coupon_dates

Unnamed: 0,coupon_id,date
0,1,2010-01-01
1,1,2010-01-02
2,1,2010-01-03
3,1,2010-01-04
4,1,2010-01-05
...,...,...
1080559,970,2012-12-29
1080560,970,2012-12-30
1080561,970,2012-12-31
1080562,970,2013-01-01


In [25]:
# Validate coupon_dates is consistent with the original data in terms of coupon validity dates
coupons['days_valid'] = (coupons.end_date - coupons.start_date).dt.days + 1
df = pd.merge(coupons, coupon_dates.groupby(by='coupon_id').count().rename(columns={'date': 'days_valid'}), on='coupon_id')
assert 0 == len(df.loc[df.days_valid_x != df.days_valid_y])
coupons.drop(['days_valid'], axis=1, inplace=True)

# Validate no coupon has been lost
assert len(coupons) == len(coupon_dates.coupon_id.unique())

In [26]:
coupon_dates.groupby('date').count().median()

coupon_id    14.0
dtype: float64

In [27]:
# Step 1.3 Merge coupon_department and coupon_dates
coupon_date_department = pd.merge(coupon_dates, coupon_department, on='coupon_id', how='left')
coupon_date_department.date = coupon_date_department.date.dt.date
coupon_date_department

Unnamed: 0,coupon_id,date,department
0,1,2010-01-01,Men
1,1,2010-01-02,Men
2,1,2010-01-03,Men
3,1,2010-01-04,Men
4,1,2010-01-05,Men
...,...,...,...
15557,970,2012-12-29,Boys
15558,970,2012-12-30,Boys
15559,970,2012-12-31,Boys
15560,970,2013-01-01,Boys


#### Step 2. Create `order_date_department` table

1. Map `order_id` and product `department` -> `order_department`
2. Add dates by joining with `orders` -> `order_date_department`

In [28]:
order_department = pd.merge(order_details[['order_id', 'product_id']], products[['product_id', 'department']],
                            on='product_id', how='left').drop(['product_id'], axis=1).drop_duplicates()
order_department

Unnamed: 0,order_id,department
0,1,Men
1,1,Women
2,2,Boys
4,2,Women
5,2,Men
...,...,...
2162809,302791,Boys
2162813,302792,Boys
2162814,302793,Sport
2162817,302793,Girls


In [29]:
order_date_department = pd.merge(order_department, orders[['order_id', 'date']], on='order_id', how='right')
order_date_department

Unnamed: 0,order_id,department,date
0,1,Men,2010-01-01
1,1,Women,2010-01-01
2,2,Boys,2010-01-01
3,2,Women,2010-01-01
4,2,Men,2010-01-01
...,...,...,...
998789,302791,Boys,2012-12-30
998790,302792,Boys,2012-12-30
998791,302793,Sport,2012-12-30
998792,302793,Girls,2012-12-30


#### Step 3. Create `order_coupons_available`

In [30]:
order_coupons_available = pd.merge(order_date_department, coupon_date_department, on=['date', 'department'], how='left')\
                            .drop(['department', 'date'], axis=1).drop_duplicates()
order_coupons_available

Unnamed: 0,order_id,coupon_id
0,1,1.0
1,1,2.0
2,1,3.0
3,1,13.0
4,1,14.0
...,...,...
2821732,302793,967.0
2821733,302793,971.0
2821734,302793,945.0
2821735,302793,947.0


In [31]:
order_coupons_available.groupby('order_id').count().sort_values(by='coupon_id', ascending=False)

Unnamed: 0_level_0,coupon_id
order_id,Unnamed: 1_level_1
236046,15
99259,15
99264,15
224116,15
99269,15
...,...
15365,0
15422,0
15338,0
15410,0


#### Step 4. Create `order_coupon_used`

In [32]:
order_coupon_used = order_details[['order_id', 'coupon_id']].dropna().drop_duplicates()
order_coupon_used

Unnamed: 0,order_id,coupon_id
12,4,4.0
28,4,6.0
40,7,4.0
49,7,3.0
54,8,13.0
...,...,...
2162765,302786,967.0
2162773,302787,960.0
2162776,302788,971.0
2162780,302788,966.0


#### Step 5. Combine `order_coupons_available` and `order_coupon_used` to flag which available coupons were used and which were ignored

In [33]:
order_coupon_used.set_index(['order_id', 'coupon_id'], inplace=True)
order_coupons_available.set_index(['order_id', 'coupon_id'], inplace=True)

In [34]:
order_coupons_available['coupon_used'] = order_coupons_available.index.isin(order_coupon_used.index)

In [35]:
order_coupons = order_coupons_available.reset_index().dropna().reset_index(drop=True)

In [36]:
order_coupons.coupon_used.value_counts(normalize=True)

False    0.905772
True     0.094228
Name: coupon_used, dtype: float64
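
The index membership test used in Step 5 could also be written as a left merge with `indicator=True`. The sketch below uses invented toy frames (`available_toy`, `used_toy`) and assumes, as in the notebook, that the used pairs contain no duplicates; otherwise the left merge would multiply rows.

```python
import pandas as pd

# Toy (order, coupon) pairs; values invented for illustration only.
available_toy = pd.DataFrame({'order_id': [1, 1, 2], 'coupon_id': [10.0, 11.0, 10.0]})
used_toy = pd.DataFrame({'order_id': [1, 2], 'coupon_id': [11.0, 12.0]})

# The indicator column '_merge' is 'both' when the pair exists in both frames
labelled = available_toy.merge(used_toy, on=['order_id', 'coupon_id'],
                               how='left', indicator=True)
labelled['coupon_used'] = labelled['_merge'] == 'both'
labelled = labelled.drop(columns='_merge')
print(labelled)
```

The pair (2, 12.0) was used but never listed as available, so the left merge leaves it out, matching the behaviour of the `isin` approach.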

## Prepare customer data
- From `customers` table, get:
    - gender
    - age bracket: `young` < 30, `mid` >= 30 & < 60, `old` >= 60
- From `order_details` get:
    - number of unique products bought
    - number of unique products bought with a coupon
    - total number of coupons used
    - total products bought (counted as order lines)
    - mean price paid
    - mean discount used
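
The age brackets listed above can be expressed compactly with `pd.cut`. This is a sketch on an invented `ages` series, not the code the notebook runs; `right=False` is chosen so that the bins are closed on the left, which matches the `< 30`, `>= 30 & < 60`, `>= 60` boundaries.

```python
import pandas as pd

# Toy ages chosen to hit each boundary; values invented for illustration only.
ages = pd.Series([18, 29, 30, 59, 60, 78], name='age')

# Bins are [0, 30), [30, 60), [60, inf) because right=False
age_bracket = pd.cut(ages, bins=[0, 30, 60, float('inf')],
                     labels=['young', 'mid', 'old'], right=False)
print(age_bracket.tolist())
```

Note that `pd.cut` returns a categorical series, whereas the assignment-based approach produces plain strings; `astype(str)` converts it if the downstream code expects `object` dtype.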

In [37]:
customer_demo = customers.drop(['name', 'phone', 'address', 'city', 'state', 'postalCode', 'country', 'creditLimit'],
                               axis=1)

In [38]:
customer_demo['age_bracket'] = None
customer_demo.loc[customer_demo.age < 30, 'age_bracket'] = 'young'
customer_demo.loc[(customer_demo.age >= 30) & (customer_demo.age < 60), 'age_bracket'] = 'mid'
customer_demo.loc[(customer_demo.age >= 60), 'age_bracket'] = 'old'
customer_demo.age = customer_demo.age_bracket
customer_demo.drop(['age_bracket'], axis=1, inplace=True)

In [39]:
customer_demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  1000 non-null   int64 
 1   gender       1000 non-null   object
 2   age          1000 non-null   object
dtypes: int64(1), object(2)
memory usage: 23.6+ KB


In [40]:
customer_demo.rename(columns={'age': 'cust_age', 'gender': 'cust_gender'}, inplace=True)

In [41]:
cust_orders = pd.merge(order_details, orders[['order_id', 'customer_id']], on='order_id', how='left')\
                .drop(['order_id'], axis=1)
cust_orders['discount'] = 100 * ((cust_orders.original_price - cust_orders.buy_price) / cust_orders.original_price)
cust_orders.drop(['original_price'], axis=1, inplace=True)
cust_orders.sample(5)

Unnamed: 0,product_id,quantity_ordered,buy_price,coupon_id,customer_id,discount
977922,1502,20,9.28,,315,0.0
1551626,2538,2,4.99,,429,0.0
826876,2518,17,14.78,,726,0.0
2039051,1338,7,6.52,,948,0.0
1841153,1938,5,3.348,795.0,35,40.0


In [42]:
cust_stats = pd.pivot_table(cust_orders,
                            values=['product_id', 'buy_price', 'coupon_id', 'discount'],
                            index='customer_id',
                            aggfunc={
                                'product_id': lambda x: len(set(x)),  # number of unique products bought
                                'coupon_id': lambda x: x.notnull().sum(),  # total coupons used
                                'discount': lambda x: np.round(np.mean(x), decimals=2),  # mean discount used
                                'buy_price': lambda x: np.round(np.mean(x), decimals=2)  # mean price paid
                            })
cust_stats.rename(columns={
    'product_id': 'cust_unique_products',
    'coupon_id': 'cust_total_coupons',
    'discount': 'cust_mean_discount',
    'buy_price': 'cust_mean_buy_price'
}, inplace=True)

cust_stats['cust_unique_products_coupon'] = cust_orders.loc[cust_orders.coupon_id.notnull()]\
    .groupby('customer_id').agg({'product_id': 'nunique'})
cust_stats.fillna(value=0, inplace=True)

cust_stats['cust_total_products'] = cust_orders.groupby('customer_id').count().product_id

In [43]:
customer_data = pd.merge(customer_demo, cust_stats, on='customer_id', how='left')
customer_data

Unnamed: 0,customer_id,cust_gender,cust_age,cust_mean_buy_price,cust_total_coupons,cust_mean_discount,cust_unique_products,cust_unique_products_coupon,cust_total_products
0,1,M,old,11.62,285.0,9.16,866.0,232.0,1102.0
1,2,F,mid,14.29,984.0,11.10,1566.0,634.0,2980.0
2,3,F,old,11.43,209.0,10.95,510.0,183.0,629.0
3,4,F,young,10.65,154.0,10.13,501.0,148.0,564.0
4,5,M,young,5.68,0.0,0.00,2.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...
995,996,F,old,12.58,2826.0,10.69,2310.0,1092.0,9264.0
996,997,M,mid,12.70,932.0,9.67,1709.0,603.0,3365.0
997,998,M,old,13.38,118.0,11.11,327.0,107.0,355.0
998,999,F,mid,13.28,102.0,6.75,465.0,99.0,527.0
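
The per-customer statistics can also be sketched with `groupby` and named aggregation (available since pandas 0.25), which avoids the separate rename step. The frame `cust_orders_toy` and its values are invented; the output column names mirror those used in this notebook.

```python
import pandas as pd

nan = float('nan')
# Toy order lines; values invented for illustration only.
cust_orders_toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2],
    'product_id': [100, 100, 101, 100],
    'buy_price': [10.0, 8.0, 6.0, 4.0],
    'coupon_id': [nan, 5.0, nan, nan],
    'discount': [0.0, 20.0, 0.0, 0.0],
})

cust_stats_toy = cust_orders_toy.groupby('customer_id').agg(
    cust_unique_products=('product_id', 'nunique'),  # distinct products
    cust_total_coupons=('coupon_id', 'count'),       # count() skips NaN
    cust_mean_discount=('discount', 'mean'),
    cust_mean_buy_price=('buy_price', 'mean'),
    cust_total_products=('product_id', 'size'),      # number of order lines
).round(2)
print(cust_stats_toy)
```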


## Prepare coupon data
- From `coupons` table, take:
    - type
    - discount
    - department
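- From merging `coupon_product` with `products` (plus all products of the department for `department`-type coupons), take:
    - `coupon_mean_prod_price`: mean price of the products the coupon applies to
    - `coupon_prods_avail`: number of products the coupon applies to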

In [44]:
coupon_info = coupons.drop(['start_date', 'end_date'], axis=1)
coupon_info.loc[coupon_info.how_many == -1, 'how_many'] = 1
coupon_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   coupon_id   971 non-null    int64 
 1   type        971 non-null    object
 2   department  971 non-null    object
 3   discount    971 non-null    int64 
 4   how_many    971 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 38.1+ KB


In [45]:
coupon_info.rename(
    columns={
        'type': 'coupon_type',
        'how_many': 'coupon_how_many',
        'discount': 'coupon_discount',
        'department': 'coupon_dpt'},
    inplace=True)

In [46]:
coupon_stats = pd.merge(coupon_product, products[['product_id', 'buy_price']], on='product_id').drop_duplicates()

# Coupon_product does not include coupons valid for all products in a department, add those product prices here
department_coupons = coupons.loc[coupons.type == 'department'][['coupon_id', 'department']]\
                            .merge(products[['department', 'buy_price', 'product_id']], on='department', how='left')\
                            .drop(['department'], axis=1)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
coupon_stats = pd.concat([coupon_stats, department_coupons]).reset_index(drop=True)

coupon_stats = pd.pivot_table(coupon_stats, index='coupon_id', values=['buy_price', 'product_id'],
                              aggfunc={
                                  'buy_price': lambda x: np.round(np.mean(x), decimals=2),
                                  'product_id': lambda x: len(set(x))
                              })
coupon_stats.rename(columns={'buy_price': 'coupon_mean_prod_price', 'product_id': 'coupon_prods_avail'}, inplace=True)
coupon_stats

Unnamed: 0_level_0,coupon_mean_prod_price,coupon_prods_avail
coupon_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,7.16,4
2,9.12,1
3,1.13,1
4,5.85,4
5,9.59,1
...,...,...
967,4.81,1
968,1.05,1
969,6.41,1
970,11.36,3


In [47]:
coupon_data = pd.merge(coupon_info, coupon_stats, on='coupon_id', how='left')
coupon_data

Unnamed: 0,coupon_id,coupon_type,coupon_dpt,coupon_discount,coupon_how_many,coupon_mean_prod_price,coupon_prods_avail
0,1,buy_all,Men,10,4,7.16,4
1,2,buy_more,Men,23,3,9.12,1
2,3,just_discount,Men,12,1,1.13,1
3,4,buy_all,Sport,49,4,5.85,4
4,5,buy_more,Sport,20,4,9.59,1
...,...,...,...,...,...,...,...
966,967,buy_more,Girls,33,2,4.81,1
967,968,just_discount,Men,10,1,1.05,1
968,969,buy_more,Women,27,5,6.41,1
969,970,buy_all,Boys,28,3,11.36,3


## Merge everything into one dataframe

1. Add `customer_id` to `order_coupons` (from `orders`) -> `final`
2. Merge `final` with `customer_data`
3. Merge `final` with `coupon_data`
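
Because each of these merges attaches lookup data to an existing row, none of them should change the row count. One way to guard against accidental duplication is the `validate` argument of `merge`, sketched here on invented toy frames (`pairs`, `customer_toy`, `coupon_toy`); the notebook itself does not use this check.

```python
import pandas as pd

# Toy frames; values invented for illustration only.
pairs = pd.DataFrame({'order_id': [1, 1, 2],
                      'coupon_id': [10, 11, 10],
                      'customer_id': [7, 7, 8]})
customer_toy = pd.DataFrame({'customer_id': [7, 8], 'cust_gender': ['F', 'M']})
coupon_toy = pd.DataFrame({'coupon_id': [10, 11], 'coupon_discount': [15, 30]})

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
final_toy = (pairs
             .merge(customer_toy, on='customer_id', how='left', validate='many_to_one')
             .merge(coupon_toy, on='coupon_id', how='left', validate='many_to_one'))

# A left merge against unique keys must preserve the number of rows
assert len(final_toy) == len(pairs)
print(final_toy)
```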

In [48]:
final = pd.merge(order_coupons, orders[['order_id', 'customer_id']], on='order_id', how='left')
final = pd.merge(final, customer_data, on='customer_id', how='left')
final = pd.merge(final, coupon_data, on='coupon_id', how='left')
final

Unnamed: 0,order_id,coupon_id,coupon_used,customer_id,cust_gender,cust_age,cust_mean_buy_price,cust_total_coupons,cust_mean_discount,cust_unique_products,cust_unique_products_coupon,cust_total_products,coupon_type,coupon_dpt,coupon_discount,coupon_how_many,coupon_mean_prod_price,coupon_prods_avail
0,1,1.0,False,9,M,young,12.67,337.0,8.32,930.0,283.0,1285.0,buy_all,Men,10,4,7.16,4
1,1,2.0,False,9,M,young,12.67,337.0,8.32,930.0,283.0,1285.0,buy_more,Men,23,3,9.12,1
2,1,3.0,False,9,M,young,12.67,337.0,8.32,930.0,283.0,1285.0,just_discount,Men,12,1,1.13,1
3,1,13.0,False,9,M,young,12.67,337.0,8.32,930.0,283.0,1285.0,just_discount,Women,8,1,4.87,1
4,1,14.0,False,9,M,young,12.67,337.0,8.32,930.0,283.0,1285.0,buy_all,Women,61,3,8.43,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2821563,302793,967.0,False,995,F,mid,14.28,2873.0,10.81,2363.0,1113.0,8961.0,buy_more,Girls,33,2,4.81,1
2821564,302793,971.0,False,995,F,mid,14.28,2873.0,10.81,2363.0,1113.0,8961.0,just_discount,Girls,24,1,1993.67,1
2821565,302793,945.0,False,995,F,mid,14.28,2873.0,10.81,2363.0,1113.0,8961.0,buy_all,Women,65,5,8.67,5
2821566,302793,947.0,False,995,F,mid,14.28,2873.0,10.81,2363.0,1113.0,8961.0,just_discount,Women,12,1,57.58,1


In [49]:
final.coupon_used.value_counts(normalize=True)

False    0.905772
True     0.094228
Name: coupon_used, dtype: float64

## Dropping IDs, encoding

In [50]:
train = final.drop(['order_id', 'coupon_id', 'customer_id'], axis=1)

In [51]:
train.coupon_used = train.coupon_used.astype(int)
train = pd.get_dummies(train, columns=['cust_gender', 'cust_age', 'coupon_type', 'coupon_dpt'])

In [52]:
train.columns

Index(['coupon_used', 'cust_mean_buy_price', 'cust_total_coupons',
       'cust_mean_discount', 'cust_unique_products',
       'cust_unique_products_coupon', 'cust_total_products', 'coupon_discount',
       'coupon_how_many', 'coupon_mean_prod_price', 'coupon_prods_avail',
       'cust_gender_F', 'cust_gender_M', 'cust_age_mid', 'cust_age_old',
       'cust_age_young', 'coupon_type_buy_all', 'coupon_type_buy_more',
       'coupon_type_department', 'coupon_type_just_discount',
       'coupon_dpt_Boys', 'coupon_dpt_Girls', 'coupon_dpt_Men',
       'coupon_dpt_Sport', 'coupon_dpt_Women'],
      dtype='object')
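
The encoding step can be illustrated on a small invented frame (`train_toy`). One assumption worth flagging: in pandas 2.0 and later, `pd.get_dummies` returns boolean columns by default, so the sketch passes `dtype=int` to obtain 0/1 indicators regardless of version; the notebook's own call does not set it.

```python
import pandas as pd

# Toy training rows; values invented for illustration only.
train_toy = pd.DataFrame({
    'coupon_used': [True, False, False],
    'cust_gender': ['F', 'M', 'F'],
    'coupon_discount': [10, 23, 12],
})

# Convert the target to 0/1 and one-hot encode the categorical column
train_toy['coupon_used'] = train_toy['coupon_used'].astype(int)
encoded = pd.get_dummies(train_toy, columns=['cust_gender'], dtype=int)
print(encoded)
```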

## Save as CSV

In [53]:
train.to_csv(os.path.join(data_dir, 'train.csv'), index=False)

In [54]:
final.drop(['order_id'], axis=1).to_csv(os.path.join(data_dir, 'train_before_encoding.csv'), index=False)