<a href="https://colab.research.google.com/github/sushil1902/Statistics/blob/main/2_Feature_Engineering_(Beginning).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1><center><strong>Feature Engineering</strong></center></h1>

# **Table of Contents**

1. [Introduction](#Section1)<br><br>
2. [Importing Data](#Section2)<br><br>
3. [Dataset Description](#Section3)<br><br>
4. [Exploring Data](#Section4)<br><br>
5. [Feature Engineering](#Section5)<br><br>

<a id = Section1></a>
### **1. Introduction**

- **Feature engineering** is about **creating new input features** from your **existing** ones.

- Feature engineering is the process of **transforming raw data** into **features** that **better represent** the underlying **problem** to the predictive models, resulting in **improved** model **accuracy** on unseen data.

- Feature engineering is **manually designing** what the **input x’s** should be.

<br> 

- **Advantages** of Feature Engineering:

  - Better features mean **flexibility**.
  
  - Better features mean **simpler models**.
  
  - Better features mean **better results**.

<br> 

- Feature engineering is an **art** in the same way that engineering, programming, and medicine are arts.

  - There are **well defined procedures** that are methodical, provable and understood.

  - The **data** is a **variable** and is **different** every time.
  
  - We get good at **deciding** which **procedures** to use and when, by **practice**.

<br> 

- **Iterative Process** of Feature Engineering:

  - **Brainstorm features**: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
  
  - **Devise features**: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
  
  - **Evaluate models**: Estimate model accuracy on unseen data using the features.

<a id = Section2></a>
### **2. Importing Data**

**Downloading the dataset**

In [None]:
!wget https://github.com/insaid2018/Term-2/raw/master/Data/train.zip

'wget' is not recognized as an internal or external command,
operable program or batch file.


**Unzipping the dataset**

In [None]:
!unzip -qq train.zip

'unzip' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
!ls

'ls' is not recognized as an internal or external command,
operable program or batch file.
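
The outputs above show that `wget`, `unzip`, and `ls` are not available when the notebook runs from a Windows command prompt. A minimal, portable sketch using only the Python standard library is given below. The helper names `fetch_zip` and `extract_zip` are hypothetical; the URL is the one used in the download cell.

```python
import os
import urllib.request
import zipfile

# URL taken from the download cell above.
DATA_URL = "https://github.com/insaid2018/Term-2/raw/master/Data/train.zip"


def fetch_zip(url, zip_path):
    """Download `url` to `zip_path` (hypothetical helper, stands in for wget)."""
    urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract_zip(zip_path, dest_dir="."):
    """Extract all members into `dest_dir` (stands in for unzip); return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (needs network access, so it is left commented out):
# extract_zip(fetch_zip(DATA_URL, "train.zip"))
# print(os.listdir("."))  # stands in for ls
```

On Colab or Linux the original shell cells work as written, so this fallback is only needed where those commands are missing.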


<a id = Section3></a>
### **3. Dataset Description**

#### **Problem Statement:**

- XYZ Credit Card company regularly helps its merchants **understand** their **data** better and take key business decisions accurately by providing **machine learning** and **analytics consulting**.

  - ABC is an established Brick & Mortar retailer that frequently conducts marketing campaigns for its diverse product range.
  
  - As a merchant of XYZ, ABC has asked XYZ to **assist** with its **discount marketing** process using the power of machine learning.

- **Discount marketing** and **coupon usage** are very widely used **promotional techniques** to **attract new customers** and to **retain** & **reinforce loyalty** of existing customers.

  - The measurement of a **consumer’s propensity towards coupon usage** and the **prediction** of the **redemption behavior** are crucial parameters in assessing the effectiveness of a marketing campaign.

- ABC’s promotions are shared across various channels including email, notifications, etc.

  - A number of these campaigns include **coupon discounts** that are offered for a specific product/range of products.
  
  - The retailer would like the ability to **predict** whether **customers redeem** the **coupons** received across channels, which will enable the retailer’s marketing team to design coupon constructs accurately and to develop more precise and targeted marketing strategies.

- The data available in this problem contains the following information, including the details of a sample of campaigns and coupons used in previous campaigns:

  - **User Demographic** Details
  
  - **Campaign** and **coupon** Details
  
  - **Product** details
  
  - **Previous transactions**

- Based on transaction and performance data from the **last campaigns**, **predict** the probability that the **customer** will **redeem** the coupon, for each **coupon** and **customer combination** in the **next campaigns**.

#### **Description:**

- Here is the **schema** for the different data tables available.

<center><img src="https://raw.githubusercontent.com/insaid2018/Term-2/master/images/dataset_schema.png" width="1000"/></center>

`train.csv:` Train data containing the **coupons offered** to the given customers under the campaigns.

| **Variable**      | **Definition**                                          |
| :---------------: | :-----------------------------------------------------: |
| id                | Unique id for coupon customer impression                |
| campaign_id       | Unique id for a discount campaign                       |
| coupon_id         | Unique id for a discount coupon                         |
| customer_id       | Unique id for a customer                                |
| redemption_status | (Target) (0 - Coupon not redeemed, 1 - Coupon redeemed) |

<br> 

`campaign_data.csv:` **Campaign information** for each of the campaigns.

| **Variable**  | **Definition**                    |
| :-----------: | :-------------------------------: |
| campaign_id   | Unique id for a discount campaign |
| campaign_type | Anonymized Campaign Type (X/Y)    |
| start_date    | Campaign Start Date               |
| end_date      | Campaign End Date                 |

<br> 

`coupon_item_mapping.csv:` **Mapping** of **coupon and items** valid for discount under that coupon.

| **Variable** | **Definition**                                               |
| :----------: | :----------------------------------------------------------: |
| coupon_id    | Unique id for a discount coupon (no order)                   |
| item_id      | Unique id for items for which given coupon is valid (no order) |

<br> 

`customer_demographics.csv:` **Customer demographic information** for some customers.

| **Variable**   | **Definition**                                               |
| :------------: | :----------------------------------------------------------: |
| customer_id    | Unique id for a customer                                     |
| age_range      | Age range of customer family in years                        |
| marital_status | Married/Single                                               |
| rented         | 0 - not rented accommodation, 1 - rented accommodation       |
| family_size    | Number of family members                                     |
| no_of_children | Number of children in the family                             |
| income_bracket | Label Encoded Income Bracket (Higher income corresponds to higher number) |

<br> 

`customer_transaction_data.csv:` **Transaction data** for all customers for duration of campaigns in the data.

| **Variable**    | **Definition**                                               |
| :-------------: | :----------------------------------------------------------: |
| date            | Date of Transaction                                          |
| customer_id     | Unique id for a customer                                     |
| item_id         | Unique id for item                                           |
| quantity        | Quantity of item bought                                      |
| selling_price   | Sales value of the transaction                               |
| other_discount  | Discount from other sources such as manufacturer coupon/loyalty card |
| coupon_discount | Discount availed from retailer coupon                        |

<br> 

`item_data.csv:` **Item information** for each item sold by the retailer.

| **Variable** | **Definition**                 |
| :----------: | :----------------------------: |
| item_id      | Unique id for item             |
| brand        | Unique id for item brand       |
| brand_type   | Brand Type (Local/Established) |
| category     | Item Category                  |


#### **Loading Data**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from datetime import datetime

In [None]:
train = pd.read_csv('train.csv')
campaign_data = pd.read_csv('campaign_data.csv')
coupon_item_mapping = pd.read_csv('coupon_item_mapping.csv')
customer_demographics = pd.read_csv('customer_demographics.csv')
customer_transaction_data = pd.read_csv('customer_transaction_data.csv')
item_data = pd.read_csv('item_data.csv')

In [None]:
print('train :', train.shape)
print('campaign_data :', campaign_data.shape)
print('coupon_item_mapping :', coupon_item_mapping.shape)
print('customer_demographics :', customer_demographics.shape)
print('customer_transaction_data :', customer_transaction_data.shape)
print('item_data :', item_data.shape)

train : (78369, 5)
campaign_data : (28, 4)
coupon_item_mapping : (92663, 2)
customer_demographics : (760, 7)
customer_transaction_data : (1324566, 7)
item_data : (74066, 4)


<a id = Section4></a>
### **4. Exploring Data**

#### **Train Data**

In [None]:
train.head()

Unnamed: 0,id,campaign_id,coupon_id,customer_id,redemption_status
0,1,13,27,1053,0
1,2,13,116,48,0
2,6,9,635,205,0
3,7,13,644,1050,0
4,9,8,1017,1489,0


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78369 entries, 0 to 78368
Data columns (total 5 columns):
id                   78369 non-null int64
campaign_id          78369 non-null int64
coupon_id            78369 non-null int64
customer_id          78369 non-null int64
redemption_status    78369 non-null int64
dtypes: int64(5)
memory usage: 3.0 MB


#### **Campaign Data**

In [None]:
campaign_data.head()

Unnamed: 0,campaign_id,campaign_type,start_date,end_date
0,24,Y,21/10/13,20/12/13
1,25,Y,21/10/13,22/11/13
2,20,Y,07/09/13,16/11/13
3,23,Y,08/10/13,15/11/13
4,21,Y,16/09/13,18/10/13


In [None]:
campaign_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 4 columns):
campaign_id      28 non-null int64
campaign_type    28 non-null object
start_date       28 non-null object
end_date         28 non-null object
dtypes: int64(1), object(3)
memory usage: 1.0+ KB


#### **Coupon Item Mapping**

In [None]:
coupon_item_mapping.head()

Unnamed: 0,coupon_id,item_id
0,105,37
1,107,75
2,494,76
3,522,77
4,518,77


In [None]:
coupon_item_mapping.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92663 entries, 0 to 92662
Data columns (total 2 columns):
coupon_id    92663 non-null int64
item_id      92663 non-null int64
dtypes: int64(2)
memory usage: 1.4 MB


#### **Customer Demographics**

In [None]:
customer_demographics.head()

Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,1,70+,Married,0,2,,4
1,6,46-55,Married,0,2,,5
2,7,26-35,,0,3,1.0,3
3,8,26-35,,0,4,2.0,6
4,10,46-55,Single,0,1,,5


In [None]:
customer_demographics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 760 entries, 0 to 759
Data columns (total 7 columns):
customer_id       760 non-null int64
age_range         760 non-null object
marital_status    431 non-null object
rented            760 non-null int64
family_size       760 non-null object
no_of_children    222 non-null object
income_bracket    760 non-null int64
dtypes: int64(3), object(4)
memory usage: 41.7+ KB
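
The `info()` output above shows that `marital_status` and `no_of_children` are sparsely populated: 431 and 222 non-null values out of 760, so roughly 43% and 71% of the entries are missing. One quick way to quantify this is the per-column missing fraction. The sketch below uses a small made-up frame (`toy`) that only mimics the column layout, not the real table.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the layout of customer_demographics.
toy = pd.DataFrame({
    'customer_id': [1, 6, 7, 8],
    'marital_status': ['Married', 'Married', np.nan, np.nan],
    'no_of_children': [np.nan, np.nan, '1', '2'],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `customer_demographics` would give the real shares.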


#### **Customer Transaction Data**

In [None]:
customer_transaction_data.head()

Unnamed: 0,date,customer_id,item_id,quantity,selling_price,other_discount,coupon_discount
0,2012-01-02,1501,26830,1,35.26,-10.69,0.0
1,2012-01-02,1501,54253,1,53.43,-13.89,0.0
2,2012-01-02,1501,31962,1,106.5,-14.25,0.0
3,2012-01-02,1501,33647,1,67.32,0.0,0.0
4,2012-01-02,1501,48199,1,71.24,-28.14,0.0


In [None]:
customer_transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1324566 entries, 0 to 1324565
Data columns (total 7 columns):
date               1324566 non-null object
customer_id        1324566 non-null int64
item_id            1324566 non-null int64
quantity           1324566 non-null int64
selling_price      1324566 non-null float64
other_discount     1324566 non-null float64
coupon_discount    1324566 non-null float64
dtypes: float64(3), int64(3), object(1)
memory usage: 70.7+ MB


#### **Item Data**

In [None]:
item_data.head()

Unnamed: 0,item_id,brand,brand_type,category
0,1,1,Established,Grocery
1,2,1,Established,Miscellaneous
2,3,56,Local,Bakery
3,4,56,Local,Grocery
4,5,56,Local,Grocery


In [None]:
item_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74066 entries, 0 to 74065
Data columns (total 4 columns):
item_id       74066 non-null int64
brand         74066 non-null int64
brand_type    74066 non-null object
category      74066 non-null object
dtypes: int64(2), object(2)
memory usage: 2.3+ MB


<a id = Section5></a>
### **5. Feature Engineering**

**Note:** We will only be focussing on **creating new features** out of the existing features in this notebook.

#### **Working with Campaign Data**

- Creating new features using the `start_date` and `end_date` features of **campaign_data**.

- First, we need to **convert** our `start_date` and `end_date` columns from **object** dtype to **datetime** dtype.

In [None]:
# The explicit format already fixes the day-first order, so dayfirst=True is not needed.
campaign_data['start_date'] = pd.to_datetime(campaign_data['start_date'], format='%d/%m/%y')
campaign_data['end_date'] = pd.to_datetime(campaign_data['end_date'], format='%d/%m/%y')
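
A string such as `07/09/13` is ambiguous without a format (7 September or 9 July), which is why the cell above passes `format='%d/%m/%y'`. The sketch below parses two strings copied from the `campaign_data.head()` output to show how they are interpreted; the variable names `raw` and `parsed` are only for illustration.

```python
import pandas as pd

# Two start dates as they appear in campaign_data.head().
raw = pd.Series(['21/10/13', '07/09/13'])

# Day first, then month, then a two-digit year.
parsed = pd.to_datetime(raw, format='%d/%m/%y')

print(parsed.dt.month.tolist())  # [10, 9]
print(parsed.dt.year.tolist())   # [2013, 2013]
```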

- Our new features show the **time between** the **start** and **end dates** of each campaign on the basis of **days**, **months**, and **weeks**.

In [None]:
# Campaign duration as a timedelta, then expressed in days, months, and weeks.
duration = campaign_data['end_date'] - campaign_data['start_date']
campaign_data['diff_d'] = duration / np.timedelta64(1, 'D')
# np.timedelta64(1, 'M') is an average month (365.2425 / 12 days); newer pandas may reject it, so divide explicitly.
campaign_data['diff_m'] = campaign_data['diff_d'] / (365.2425 / 12)
campaign_data['diff_w'] = duration / np.timedelta64(1, 'W')

In [None]:
campaign_data.head()

Unnamed: 0,campaign_id,campaign_type,start_date,end_date,diff_d,diff_m,diff_w
0,24,Y,2013-10-21,2013-12-20,60.0,1.971293,8.571429
1,25,Y,2013-10-21,2013-11-22,32.0,1.051356,4.571429
2,20,Y,2013-09-07,2013-11-16,70.0,2.299842,10.0
3,23,Y,2013-10-08,2013-11-15,38.0,1.248486,5.428571
4,21,Y,2013-09-16,2013-10-18,32.0,1.051356,4.571429
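
As a check on the first row of the output above, the arithmetic can be reproduced by hand. The sketch assumes that one "month" means the average Gregorian month of 365.2425 / 12 = 30.436875 days, which is consistent with the `diff_m` values shown.

```python
import pandas as pd

# Dates of campaign 24, taken from the output above.
start = pd.Timestamp('2013-10-21')
end = pd.Timestamp('2013-12-20')

days = (end - start) / pd.Timedelta(days=1)
weeks = days / 7
months = days / (365.2425 / 12)  # average Gregorian month

print(days, round(weeks, 6), round(months, 6))  # 60.0 8.571429 1.971293
```

Because `diff_m` and `diff_w` are fixed multiples of `diff_d`, the three columns are perfectly correlated, which is worth keeping in mind when feeding them to a model.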


#### **Merging the Campaign Data with Train Data**

- We are **merging train** data with **campaign** data on `campaign_id`, performing a **left** join.

In [None]:
df = train.merge(campaign_data, on='campaign_id', how='left')
df.head()

Unnamed: 0,id,campaign_id,coupon_id,customer_id,redemption_status,campaign_type,start_date,end_date,diff_d,diff_m,diff_w
0,1,13,27,1053,0,X,2013-05-19,2013-07-05,47.0,1.54418,6.714286
1,2,13,116,48,0,X,2013-05-19,2013-07-05,47.0,1.54418,6.714286
2,6,9,635,205,0,Y,2013-03-11,2013-04-12,32.0,1.051356,4.571429
3,7,13,644,1050,0,X,2013-05-19,2013-07-05,47.0,1.54418,6.714286
4,9,8,1017,1489,0,X,2013-02-16,2013-04-05,48.0,1.577034,6.857143


In [None]:
df.shape

(78369, 11)
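
The shape confirms that the left join kept all 78,369 rows of `train`. A minimal sketch of how such a join can be guarded is shown below on made-up frames (`left`, `right`). It uses the `validate` argument of `DataFrame.merge`, which is not used in the cell above.

```python
import pandas as pd

# Made-up stand-ins for train and campaign_data.
left = pd.DataFrame({'id': [1, 2, 6], 'campaign_id': [13, 13, 9]})
right = pd.DataFrame({'campaign_id': [13, 9], 'campaign_type': ['X', 'Y']})

# validate='many_to_one' raises MergeError if campaign_id repeats in `right`.
merged = left.merge(right, on='campaign_id', how='left', validate='many_to_one')

# A left join on a unique right-hand key keeps the row count of the left table.
assert len(merged) == len(left)
print(merged)
```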

#### **Merging Item Data and Coupon Item Mapping Data**

In [None]:
item_data.head()

Unnamed: 0,item_id,brand,brand_type,category
0,1,1,Established,Grocery
1,2,1,Established,Miscellaneous
2,3,56,Local,Bakery
3,4,56,Local,Grocery
4,5,56,Local,Grocery


- Printing the **number of distinct values** in different item data columns.

In [None]:
for j in ['brand', 'brand_type', 'category']:
    print(j, ':', item_data[j].nunique())

brand : 5528
brand_type : 2
category : 19


- **Converting** the **dtype** of categorical features to **category** dtype.

In [None]:
for j in ['brand', 'brand_type', 'category']:
    item_data[j] = item_data[j].astype('category')
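
One practical effect of the **category** dtype is lower memory use for columns with few distinct values, such as `brand_type`. The sketch below illustrates this on a made-up column (`labels`); the actual saving on `item_data` depends on how many distinct values each column has.

```python
import pandas as pd

# Made-up column: 10,000 strings drawn from two distinct labels.
labels = pd.Series(['Established', 'Local'] * 5000)

# deep=True counts the string payloads, not just the pointers.
as_strings = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
```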

- We are **merging coupon_item_mapping** data with **item** data on `item_id`, performing a **left** join.

In [None]:
coupon_item_mapping = coupon_item_mapping.merge(item_data, on='item_id', how='left')

In [None]:
coupon_item_mapping.coupon_id.nunique()

1116

In [None]:
coupon_item_mapping.head()

Unnamed: 0,coupon_id,item_id,brand,brand_type,category
0,105,37,56,Local,Grocery
1,107,75,56,Local,Grocery
2,494,76,209,Established,Grocery
3,522,77,278,Established,Grocery
4,518,77,278,Established,Grocery


In [None]:
coupon_item_mapping.shape

(92663, 5)