# Introduction

This kernel covers the Instacart competition, which uses customers' orders over time to predict which previously purchased products will be in a user's next order. Specifically, it describes the feature engineering part in detail, implementing basic data science methods in Python.

# Business Insights
* Day-time
    * Delta-hour: difference between the hour of the day at which a specific order occurred and the hour of the day of each of the user's previous orders
* Orders
    * How many users bought a specific product for the first time
    * How many orders each user has made
* Re-orders
    * In how many consecutive orders a product has been bought
    * How many orders have been made by a user who has bought a product at least one time
    * Probability that a product will be repurchased consecutive times
    * Probability that a product will be repurchased within "N" orders
* High-frequency products
    * Whether or not users have bought high-frequency products

# Python Skills
* Merge two data frames with inner, left, right join
* Use Lists
* Use Dictionaries
* Perform a .groupby( ) on data of a data frame that meet a condition
* Add a new column to an already existing data frame
* Divide columns of a data frame element-wise
* Check for NaN (Not a Number) values and modify them
* Manually set an index for a data frame
* Filter data frames based on a condition
* Select rows of a data frame, based on a condition
* Create a new data frame with column(s) of an existing data frame
* Compute numerical data ranks along axis / Compute cumulative sum over an axis
* Create ratios through supportive variables
* Perform a loop
* Drop duplicate values

# 1. Variable: Delta Time
Our purpose is to find the difference, in hours, between the hour of the day of an order and the hour of the day of each of the orders that preceded it. This will be calculated for the three previous orders of each order.

## 1.1 Loading CSV files
First of all, we load all the CSV files that we are going to use.

In [None]:
import pandas as pd
orders = pd.read_csv('../input/orders.csv' )
products = pd.read_csv('../input/products.csv')
order_products = pd.read_csv('../input/order_products__train.csv' )
order_products_prior = pd.read_csv('../input/order_products__prior.csv')
aisles = pd.read_csv('../input/aisles.csv')
departments = pd.read_csv('../input/departments.csv')

## 1.2 Modifications
It is also useful to make some modifications to the table "orders". These changes will help us in the next steps of our methodology.

So, initially, we bind the table "orders" to a new name, "order_tbl", and then we sort the rows of "order_tbl" by user id and order number. Note that the assignment does not copy the data: both names refer to the same data frame, so the in-place sort also reorders "orders".

In [None]:
order_tbl = orders
order_tbl.sort_values(['user_id', 'order_number'], inplace=True)

## 1.3 Importing the three previous orders as columns
Our next step is to adjust the data frame "order_tbl" in such a way that it helps us create the new variable. This will be achieved by adding some columns.

Firstly, we add three columns. These columns hold the order ids of the last three orders that occurred before each order. Specifically, if "t" denotes an order/row of "order_tbl", we add to each row the order ids of the orders "t-1", "t-2", and "t-3" of the same user.

For example, if we look at the 4th order of the first user, we will also see the order id of the 3rd, 2nd, and 1st order.

In the code, we use shift(i) with i=1,2,3 inside a group by user id. This moves each user's order ids down by i rows, so the first i orders of every user, which do not have i previous orders, receive null values.

In [None]:
order_tbl['t-1_order_id'] = order_tbl.groupby('user_id')['order_id'].shift(1)
order_tbl['t-2_order_id'] = order_tbl.groupby('user_id')['order_id'].shift(2)
order_tbl['t-3_order_id'] = order_tbl.groupby('user_id')['order_id'].shift(3)
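
To make the effect of the grouped shift visible, here is a minimal sketch on a made-up table (the data frame `toy` and its values are assumptions for illustration, not part of the Instacart data):

```python
import pandas as pd

# Toy stand-in for the orders table (values are made up for illustration).
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_id':     [10, 11, 12, 20, 21],
    'order_number': [1, 2, 3, 1, 2],
})

# shift(1) moves each user's order ids down by one row inside that user's group,
# so the first order of every user has no predecessor and gets NaN.
toy['t-1_order_id'] = toy.groupby('user_id')['order_id'].shift(1)
print(toy)
```

The shift never crosses from one user to the next, which is why the first order of user 2 is null instead of receiving the last order id of user 1.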

We use the method .head() in order to view the first five rows of the data frame.

In [None]:
order_tbl.head()

## 1.4 Importing more information for the previous orders
Apart from these three columns, we need two more columns for each previous order. They provide the day of the week and the hour of the day at which the previous order occurred.

To achieve this, we first create a list "col" which contains the columns that are going to be used. Then, we add the necessary prefix "t-i" with i=1,2,3 to the names of these columns and we merge the resulting columns into "order_tbl" using a left join on the respective "t-i_order_id".

In [None]:
col = ['order_id', 'order_dow', 'order_hour_of_day']
order_tbl = pd.merge(order_tbl, order_tbl[col].add_prefix('t-1_'), on='t-1_order_id', how='left')
order_tbl = pd.merge(order_tbl, order_tbl[col].add_prefix('t-2_'), on='t-2_order_id', how='left')
order_tbl = pd.merge(order_tbl, order_tbl[col].add_prefix('t-3_'), on='t-3_order_id', how='left')

We can see the modified table using the function .head() as previously.

In [None]:
order_tbl.head()
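
The renaming trick behind this self-join can be sketched on a tiny table. The names `toy`, `lookup` and `joined` and all values are assumptions made for this illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'order_id':          [10, 11, 12],
    'order_hour_of_day': [8, 14, 9],
    't-1_order_id':      [float('nan'), 10, 11],   # previous order of the same user
})

# add_prefix renames 'order_id' to 't-1_order_id', which is exactly the
# key column we join on, and 'order_hour_of_day' to 't-1_order_hour_of_day'.
lookup = toy[['order_id', 'order_hour_of_day']].add_prefix('t-1_')

# Left join: every row of toy is kept; the row without a previous order gets NaN.
joined = pd.merge(toy, lookup, on='t-1_order_id', how='left')
print(joined)
```

A left join is used so that first orders, which have no previous order to look up, are not dropped from the table.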

## 1.5 Calculating the new variables
We are now ready to create the variables that we want. For each order, we calculate the difference between the hour at which the order occurred and the hour of each of the three previous orders.

We put the new information into three new columns called "delta_hour_t-i" with i=1,2,3.

In [None]:
order_tbl['delta_hour_t-1'] = order_tbl['order_hour_of_day'] - order_tbl['t-1_order_hour_of_day']
order_tbl['delta_hour_t-2'] = order_tbl['order_hour_of_day'] - order_tbl['t-2_order_hour_of_day']
order_tbl['delta_hour_t-3'] = order_tbl['order_hour_of_day'] - order_tbl['t-3_order_hour_of_day']
order_tbl.head()
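
Since the three assignments follow the same pattern, they could also be written as a loop. The following is only a sketch on a made-up table (`toy` and its values are assumptions); it is meant to show the pattern, not to replace the cell above:

```python
import pandas as pd

toy = pd.DataFrame({
    'order_hour_of_day':     [9, 18],
    't-1_order_hour_of_day': [14, 9],
    't-2_order_hour_of_day': [8, 14],
    't-3_order_hour_of_day': [float('nan'), 8],   # first row has no t-3 order
})

# One delta column per previous order; NaN propagates where no previous order exists.
for i in (1, 2, 3):
    toy[f'delta_hour_t-{i}'] = toy['order_hour_of_day'] - toy[f't-{i}_order_hour_of_day']

print(toy[['delta_hour_t-1', 'delta_hour_t-2', 'delta_hour_t-3']])
```

A negative value means that the current order was placed earlier in the day than the previous one.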

## 1.6 Important notice
The usefulness of this variable becomes apparent when we combine it with other pieces of information provided by the CSV files or with other variables.

For instance, we may want to combine the difference in the hours of the orders with the products that these orders have in common. One hypothesis worth testing is that, if the difference between the hours is small, the next order placed around the same hour is likely to contain a large share of repurchased items.

# 2. Variable: Probability that a product will be repurchased consecutive times
The second variable that we are going to create is the ratio of two other variables.

The first of these variables refers to how many consecutive orders a product has been bought in. The second variable refers to how many orders have been made by a user who has bought this product at least once. So, it counts how many times a user had the chance to buy this item, whether he or she did so or not.

To make it more understandable, we give an example:

If a user bought an item at least once, the second variable stores the total number of orders that he or she has made. So, if the user has made 10 orders and the item appears in at least one of them, the second variable is 10.

For the first variable there are many cases that should be examined.
* If a user bought it exactly one time, the first variable would be the number 0.
* If a user bought it more than once, it depends on how many of these purchases were consecutive. For instance, consider this pattern:
    * **1st Order**: Bought
    * **2nd Order**: Not Bought
    * **3rd, 4th, 5th Order**: Bought
    * **6th Order**: Not Bought
    * **7th, 8th Order**: Bought
    * **9th Order**: Not Bought
    * **10th Order**: Bought

In this case, the first variable would be 5: the purchases in the 3rd, 4th, 5th, 7th, and 8th orders are the ones that belong to a run of consecutive orders.

So, the final variable would be the ratio 5/10 = 0.5. The actual calculation is made per product and aggregates all such patterns across all users.
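
The arithmetic of this example can be checked with a few lines of plain Python. This is a sketch of the prose example only, under the reading that a purchase counts when a neighbouring order also contains the product; the helper `in_consecutive_run` is hypothetical, and the loop in section 2.4 works on a reduced table and counts differently:

```python
# 1 = bought, 0 = not bought, for orders 1..10 of a single user
bought = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

def in_consecutive_run(flags, i):
    """True if order i contains the product and so does a neighbouring order."""
    prev_bought = i > 0 and flags[i - 1] == 1
    next_bought = i < len(flags) - 1 and flags[i + 1] == 1
    return flags[i] == 1 and (prev_bought or next_bought)

first_variable = sum(in_consecutive_run(bought, i) for i in range(len(bought)))
second_variable = len(bought)          # the user made 10 orders in total
ratio = first_variable / second_variable
print(first_variable, second_variable, ratio)
```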

To make this happen we are going to use elements of Python such as:
* Joins
* Dictionaries
* Loops

## 2.1 Creating a data frame that contains data from multiple sources
The first step is to set up the data frame we are going to use.

We create a data frame that contains the orders made by the customers and the products purchased in each order.
Specifically, we use the data frames:

orders (all the orders made by all customers) ...

In [None]:
orders.head()

... and the order_products_prior (the products purchased in each order).

In [None]:
order_products_prior.head()

We merge these two data frames on their matching column, order_id. The option how='inner' keeps only those rows whose order_id can be found in both data frames.

![](https://www.w3schools.com/Sql/img_innerjoin.gif)

In [None]:
prd = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prd.head(10)
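
The difference between the join types can be seen on two tiny tables. The frames `toy_orders` and `toy_items` and their values are made up for this sketch:

```python
import pandas as pd

toy_orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [7, 7, 8]})
toy_items = pd.DataFrame({'order_id': [1, 1, 2, 4],
                          'product_id': [100, 101, 100, 102]})

# Inner join: only order ids present in BOTH tables survive (orders 1 and 2).
inner = pd.merge(toy_orders, toy_items, on='order_id', how='inner')

# Left join: every order is kept; order 3 has no items, so product_id is NaN.
left = pd.merge(toy_orders, toy_items, on='order_id', how='left')

print(inner)
print(left)
```

Order 4 exists only in `toy_items`, so it appears in neither result; it would appear only with a right or outer join.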

Now that we have a data frame that combines both the orders and the products purchased on each order, we will get insights for each product.

## 2.2 Finding how many orders have been made by each user
We want to add a column that holds the highest order number of each user, which equals the number of orders that the user has made.

To do this, we first import the package "numpy", which we need in order to find the maximum of the column "order_number" for each user id.

Then, we group by the user id and the product id and keep only the first two rows of each (user, product) pair.

In [None]:
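import numpy as np
# Highest order number per user, i.e. how many orders the user has made
prd['user_max_onb'] = prd.groupby('user_id').order_number.transform(np.max)
# Sort so that each (user, product) pair is contiguous and chronological, then keep its first two purchases
prd = prd.sort_values(['user_id', 'product_id', 'order_number'])
prd = prd.groupby(['user_id', 'product_id']).head(2)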

In the output below, we can see the new column that we created.

In [None]:
prd.head(30)
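
The two group operations used here behave differently: transform returns one value per original row, while head keeps a subset of rows. A minimal sketch on made-up data (`toy` and `first_two` are names assumed for this illustration; the string `'max'` is the pandas alias that gives the same result as `np.max` in this case):

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2],
    'product_id':   [100, 100, 100, 200, 100],
    'order_number': [1, 2, 3, 3, 1],
})

# transform broadcasts the per-user maximum back onto every row of that user.
toy['user_max_onb'] = toy.groupby('user_id')['order_number'].transform('max')

# head(2) keeps at most the first two rows of every (user, product) pair.
first_two = toy.groupby(['user_id', 'product_id']).head(2)

print(toy)
print(first_two)
```

The third purchase of product 100 by user 1 is dropped by head(2), while the column user_max_onb still reflects all three orders of that user because it was computed beforehand.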

## 2.3 Creating variables which are used for the new data frame
In this section, we create all the variables that we need in order to build the final data frame. The first thing to do is to import "defaultdict" from the package "collections".

After that, we create two dictionaries: item_cnt (item count) and item_chance.

At this point, we should describe what exactly a dictionary is in Python. A dictionary is a changeable collection of key-value pairs in which each value is looked up by its key (since Python 3.7, a dictionary also keeps its items in insertion order). Dictionaries are written with curly brackets.

What attracts us to dictionaries rather than lists is that a dictionary is indexed by arbitrary keys, whereas a list is indexed only by integer position. Hence, in our case, the keys will be the product ids and the values will be the values of our variables. In this way, we can easily turn the dictionaries into data frames in which the rows are the keys, the columns are the variables, and the cells hold the values of the dictionaries.

Back to our code, the reason we use defaultdict(int) is to avoid the KeyError that a plain dictionary raises when we access a key that does not exist. With defaultdict(int), a missing key is created automatically with the default value 0 the first time it is accessed.

Finally, we assign the value "None" to three helper variables: pid_bk (product id), uid_bk (user id), and onb_bk (order number). These will hold the values of the previous row in our loop.

In [None]:
from collections import defaultdict
item_cnt    = defaultdict(int)
item_chance = defaultdict(int)
pid_bk = uid_bk = onb_bk = None
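
The difference between a plain dictionary and defaultdict(int) can be sketched as follows (the names `plain` and `counts` are made up for this illustration):

```python
from collections import defaultdict

plain = {}
try:
    plain[24852] += 1          # the key does not exist yet
except KeyError:
    print('plain dict raised KeyError')

counts = defaultdict(int)      # int() returns 0, which is used as the default value
counts[24852] += 1             # missing key is created with value 0, then incremented
counts[24852] += 1
print(dict(counts))
```

This is why the counters in the loop can be incremented with `+= 1` without first checking whether the product id is already present.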

## 2.4 Using a loop for the calculation of the main indexes
Now, we are ready to move on to the most important part of the code.

Here we create a loop which calculates two important variables: item_cnt and item_chance. Specifically:
* item_cnt : counts, over all users, how many times a specific product was bought in two consecutive orders
* item_chance : counts, over all users, how many of the rows that contain this product belong to an order that is not the user's last one, i.e. how many times the user still had a chance to buy it again.

These variables have already been described in the previous sections.

How does this loop work?

The first line declares that the code inside the loop runs for every row of the table "prd".

The variables that we previously set to "None" (pid_bk, uid_bk, onb_bk) are updated at the end of every iteration. They hold the values of the previous row and are used in the first of the two "if" conditions.

The first condition checks whether a user bought a product in two consecutive orders. Specifically, if the user_id and the product_id are the same as those of the previous row and the current order number minus the previous order number equals one, then 1 is added in the dictionary item_cnt to the respective key, which is the product id in our case. This comparison is only meaningful when the rows of each (user, product) pair are adjacent and ordered by order number.

The second condition checks whether the current order number differs from the user's total number of orders (onb != max_onb). If this is true, then it adds one as previously, but this time in the dictionary item_chance.

This operation is performed for every row of the table "prd".

In [None]:
for uid, pid, onb, max_onb in prd[['user_id', 'product_id', 'order_number', 'user_max_onb']].values:
        
    if uid==uid_bk and pid==pid_bk and (onb-onb_bk==1):
        item_cnt[pid] +=1
    if onb!=max_onb:
        item_chance[pid] +=1
    
    pid_bk = pid
    uid_bk = uid
    onb_bk = onb
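
To trace the loop by hand, it helps to run the same logic on a few rows. The sketch below uses made-up tuples instead of the real table and assumes that the rows are sorted by user, product and order number; the names `rows`, `toy_cnt`, `toy_chance` and the `prev_*` variables are hypothetical:

```python
from collections import defaultdict

# (user_id, product_id, order_number, user_max_onb)
rows = [
    (1, 100, 1, 4),
    (1, 100, 2, 4),   # bought again in the very next order -> counts
    (1, 200, 1, 4),
    (1, 200, 4, 4),   # gap of 3 orders, and it is the user's last order
    (2, 100, 2, 3),
]

toy_cnt = defaultdict(int)
toy_chance = defaultdict(int)
prev_uid = prev_pid = prev_onb = None

for uid, pid, onb, max_onb in rows:
    # Same user and product as the previous row, exactly one order later
    if uid == prev_uid and pid == prev_pid and onb - prev_onb == 1:
        toy_cnt[pid] += 1
    # The order is not the user's last one, so a later repurchase was possible
    if onb != max_onb:
        toy_chance[pid] += 1
    prev_uid, prev_pid, prev_onb = uid, pid, onb

print(dict(toy_cnt), dict(toy_chance))
```

Product 200 never enters `toy_cnt` because its two purchases are three orders apart.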

## 2.5 Organizing our findings in data frames
Now that we have the variables that we want, we organize them in data frames.

As far as item_cnt is concerned, we organize its numbers in a table with 2 columns, "product_id" and "item_first_cnt". We do the same for item_chance as well, but the second column is called "item_first_chance".

After that, we merge the two data frames into one called df, using "product_id" as the key and applying an outer join, so that a product that appears in only one of the two tables is kept. Any null values created by the join are filled with 0.

Finally, we create another column which calculates the ratio of item_first_cnt to item_first_chance.

In [None]:
item_cnt = pd.DataFrame.from_dict(item_cnt, orient='index').reset_index()
item_cnt.columns = ['product_id', 'item_first_cnt']
item_chance = pd.DataFrame.from_dict(item_chance, orient='index').reset_index()
item_chance.columns = ['product_id', 'item_first_chance']
df = pd.merge(item_cnt, item_chance, on='product_id', how='outer').fillna(0)
df['item_first_ratio'] = df.item_first_cnt/df.item_first_chance

We use the function below in order to see our final data frame.

In [None]:
df.head()
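
The conversion from dictionaries to a merged data frame can be followed on two small dictionaries. The inputs `toy_cnt` and `toy_chance` and the intermediate names are made up for this sketch:

```python
import pandas as pd

toy_cnt = {100: 1}
toy_chance = {100: 3, 200: 1}

# orient='index' turns the keys into the index; reset_index moves them into a column.
cnt_df = pd.DataFrame.from_dict(toy_cnt, orient='index').reset_index()
cnt_df.columns = ['product_id', 'item_first_cnt']
chance_df = pd.DataFrame.from_dict(toy_chance, orient='index').reset_index()
chance_df.columns = ['product_id', 'item_first_chance']

# Outer join keeps product 200, which has a chance count but no consecutive purchase.
toy_df = pd.merge(cnt_df, chance_df, on='product_id', how='outer').fillna(0)
toy_df['item_first_ratio'] = toy_df['item_first_cnt'] / toy_df['item_first_chance']
print(toy_df)
```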

# 3. Variable: Whether or not users have bought high-frequency products
After exploring the tables, we chose five high-frequency products: Banana, Bag of Organic Bananas (abbreviated as BoO-Bananas), Organic Strawberries, Organic Baby Spinach, and Organic Hass Avocado.

This variable tells us whether or not a user has bought each of these items. This information may help us understand a user's behaviour better and draw useful conclusions about the likelihood of purchase for items with similar characteristics.

## 3.1 Deduplicate a data frame
Initially, we remove the duplicate values of user_id from the data frame "prd" and we assign the deduplicated column "user_id" to the new "user" data frame.

In [None]:
user = prd.drop_duplicates('user_id')[['user_id']].reset_index(drop=True)
user.head()

## 3.2 Creating and interpreting the new table
The methodology that will be applied is the following:

* Find the id of the high-frequency product that you are looking for.
* Add to the data frame a column named after the item and put "0" in each row.
* If a user has bought the product, we put the value "1" in the cell defined by the product's column and the user's row.

Here, we use one block for each of the five frequent products: Banana, BoO-Bananas, Organic Strawberries, Organic Baby Spinach, and Organic Hass Avocado. These 5 items will be the columns of our new data frame along with the user_id.

In general, we fill the table with zeros and, if the user has bought one of these products, the value is replaced by 1. So, if a user bought bananas, the cell defined by the column "hyb_Banana" and his or her user_id contains the number 1.

To sum up, the answers to the question "have you ever bought (hyb)...?" are 1 for "yes" and 0 for "no".

In [None]:
tag_user = prd[prd.product_id==24852].user_id
user['hyb_Banana'] = 0
user.loc[user.user_id.isin(tag_user), 'hyb_Banana'] = 1
    
tag_user = prd[prd.product_id==13176].user_id
user['hyb_BoO-Bananas'] = 0
user.loc[user.user_id.isin(tag_user), 'hyb_BoO-Bananas'] = 1
    
tag_user = prd[prd.product_id==21137].user_id
user['hyb_Organic-Strawberries'] = 0
user.loc[user.user_id.isin(tag_user), 'hyb_Organic-Strawberries'] = 1
    
tag_user = prd[prd.product_id==21903].user_id
user['hyb_Organic-Baby-Spinach'] = 0
user.loc[user.user_id.isin(tag_user), 'hyb_Organic-Baby-Spinach'] = 1

tag_user = prd[prd.product_id==47209].user_id
user['hyb_Organic-Hass-Avocado'] = 0
user.loc[user.user_id.isin(tag_user), 'hyb_Organic-Hass-Avocado'] = 1
user.head()
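
Because the five blocks differ only in the product id and the column name, the same flags could also be produced with a dictionary and a loop. This is a sketch on a made-up purchase log; `toy_prd`, `toy_user`, `high_freq` and the product id 99999 are assumptions, while the five product ids and column names are the ones used above:

```python
import pandas as pd

toy_prd = pd.DataFrame({
    'user_id':    [1, 1, 2, 3],
    'product_id': [24852, 21137, 13176, 99999],
})
toy_user = toy_prd.drop_duplicates('user_id')[['user_id']].reset_index(drop=True)

# Column name -> product id
high_freq = {
    'hyb_Banana': 24852,
    'hyb_BoO-Bananas': 13176,
    'hyb_Organic-Strawberries': 21137,
    'hyb_Organic-Baby-Spinach': 21903,
    'hyb_Organic-Hass-Avocado': 47209,
}

for col_name, product_id in high_freq.items():
    # Users who bought this product at least once
    buyers = toy_prd.loc[toy_prd['product_id'] == product_id, 'user_id']
    # isin gives True/False per user; astype(int) turns it into 1/0
    toy_user[col_name] = toy_user['user_id'].isin(buyers).astype(int)

print(toy_user)
```

With this layout, adding a sixth product only requires one more entry in the dictionary.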

# 4. Variable: Probability that a product will be repurchased within "N" orders
The procedure for creating these variables is almost the same as the one that we implemented in the second chapter, but it is a bit more extended and complicated, and the results provide us with more information. The steps that we are going to follow are the same though.

Previously, we calculated for each product the ratio of how many consecutive times a user bought it to the number of chances the user had to buy it. Now, we are going to calculate 4 ratios that look like this one. The general ratio is the number of times a product has been repurchased within "N" orders over the number of repurchases counted as chances, i.e. those followed by at least "N" more orders of the same user. This is going to be calculated for N=2,3,4,5.

The big difference here is that we don't care about whether the purchases are consecutive; we care about the distance between the order in which the item was purchased and the order in which it was repurchased. This distance is bounded by "N".

We are not going to be as extensive here, since the procedure looks like the previous one that we examined in depth.

## 4.1 Creating variables which are used for the new data frame
As we previously did, we are going to create dictionaries. We create 8 dictionaries and 3 variables with the value "None", which are going to be used in our conditions.

In [None]:
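# The loop below compares each row with the previous one, so rows must be ordered by user, product and order number
prd = prd.sort_values(['user_id', 'product_id', 'order_number'])
prd['user_max_onb'] = prd.groupby('user_id').order_number.transform(np.max)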
item_N2_cnt    = defaultdict(int)
item_N2_chance = defaultdict(int)
item_N3_cnt    = defaultdict(int)
item_N3_chance = defaultdict(int)
item_N4_cnt    = defaultdict(int)
item_N4_chance = defaultdict(int)
item_N5_cnt    = defaultdict(int)
item_N5_chance = defaultdict(int)
pid_bk = uid_bk = onb_bk = None

## 4.2 Using a loop for the calculation of the main indexes
After that, we create the loop. The only differences are in the conditions. Specifically, if the user and product id are the same as in the previous row, the product was bought again within at most N orders (onb - onb_bk <= N), and the user made at least N more orders after the current one (max_onb - onb >= N), the variable item_N_cnt is increased by 1 at the respective key/product_id.

The second condition does the same for the dictionary item_N_chance, but it only requires that the item has already been purchased once by the same user and that at least N more orders follow.

In [None]:
for pid, uid, onb, max_onb in prd[['product_id', 'user_id', 'order_number','user_max_onb']].values:
        
    if pid==pid_bk and uid==uid_bk and (onb-onb_bk)<=2 and (max_onb-onb) >=2:
        item_N2_cnt[pid] +=1
    if pid==pid_bk and uid==uid_bk and (max_onb-onb) >=2:
        item_N2_chance[pid] +=1

    if pid==pid_bk and uid==uid_bk and (onb-onb_bk)<=3 and (max_onb-onb) >=3:
        item_N3_cnt[pid] +=1
    if pid==pid_bk and uid==uid_bk and (max_onb-onb) >=3:
        item_N3_chance[pid] +=1

    if pid==pid_bk and uid==uid_bk and (onb-onb_bk)<=4 and (max_onb-onb) >=4:
        item_N4_cnt[pid] +=1
    if pid==pid_bk and uid==uid_bk and (max_onb-onb) >=4:
        item_N4_chance[pid] +=1

    if pid==pid_bk and uid==uid_bk and (onb-onb_bk)<=5 and (max_onb-onb) >=5:
        item_N5_cnt[pid] +=1
    if pid==pid_bk and uid==uid_bk and (max_onb-onb) >=5:
        item_N5_chance[pid] +=1

    pid_bk = pid
    uid_bk = uid
    onb_bk = onb
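
The four pairs of conditions differ only in the value of N, so the same counts could be collected with one dictionary per N. The sketch below runs on made-up tuples that are assumed to be sorted by user, product and order number; `rows`, `N_VALUES`, `cnt`, `chance` and the `prev_*` variables are hypothetical names:

```python
from collections import defaultdict

# (product_id, user_id, order_number, user_max_onb)
rows = [
    (100, 1, 1, 10),
    (100, 1, 3, 10),   # gap of 2 orders, 7 orders remain
    (100, 1, 7, 10),   # gap of 4 orders, 3 orders remain
]

N_VALUES = (2, 3, 4, 5)
cnt = {n: defaultdict(int) for n in N_VALUES}
chance = {n: defaultdict(int) for n in N_VALUES}
prev_pid = prev_uid = prev_onb = None

for pid, uid, onb, max_onb in rows:
    same_pair = pid == prev_pid and uid == prev_uid
    for n in N_VALUES:
        # Chance: a repurchase row that is followed by at least n more orders
        if same_pair and (max_onb - onb) >= n:
            chance[n][pid] += 1
            # Count: the repurchase happened within n orders of the previous purchase
            if (onb - prev_onb) <= n:
                cnt[n][pid] += 1
    prev_pid, prev_uid, prev_onb = pid, uid, onb

for n in N_VALUES:
    print(n, dict(cnt[n]), dict(chance[n]))
```

Nesting the count inside the chance condition is equivalent to the two separate "if" statements above, because the count condition is the chance condition plus the gap check.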

## 4.3 Organizing our findings in data frames
Then, we turn each dictionary into a data frame which contains 2 columns: the product id and the variable that we have already calculated.

In [None]:
item_N2_cnt = pd.DataFrame.from_dict(item_N2_cnt, orient='index').reset_index()
item_N2_cnt.columns = ['product_id', 'item_N2_cnt']
item_N2_chance = pd.DataFrame.from_dict(item_N2_chance, orient='index').reset_index()
item_N2_chance.columns = ['product_id', 'item_N2_chance']

item_N3_cnt = pd.DataFrame.from_dict(item_N3_cnt, orient='index').reset_index()
item_N3_cnt.columns = ['product_id', 'item_N3_cnt']
item_N3_chance = pd.DataFrame.from_dict(item_N3_chance, orient='index').reset_index()
item_N3_chance.columns = ['product_id', 'item_N3_chance']

item_N4_cnt = pd.DataFrame.from_dict(item_N4_cnt, orient='index').reset_index()
item_N4_cnt.columns = ['product_id', 'item_N4_cnt']
item_N4_chance = pd.DataFrame.from_dict(item_N4_chance, orient='index').reset_index()
item_N4_chance.columns = ['product_id', 'item_N4_chance']

item_N5_cnt = pd.DataFrame.from_dict(item_N5_cnt, orient='index').reset_index()
item_N5_cnt.columns = ['product_id', 'item_N5_cnt']
item_N5_chance = pd.DataFrame.from_dict(item_N5_chance, orient='index').reset_index()
item_N5_chance.columns = ['product_id', 'item_N5_chance']

The next step is to merge each pair of data frames. We use an outer join in order to combine the two data frames that share the same N. Hence, we create 4 new data frames, each with the key product_id and 2 variable columns: item_N_cnt and item_N_chance.

In [None]:
df2 = pd.merge(item_N2_cnt, item_N2_chance, on='product_id', how='outer')
df3 = pd.merge(item_N3_cnt, item_N3_chance, on='product_id', how='outer')
df4 = pd.merge(item_N4_cnt, item_N4_chance, on='product_id', how='outer')
df5 = pd.merge(item_N5_cnt, item_N5_chance, on='product_id', how='outer')

We continue merging the data frames until we reach the final data frame. With the same type of join, we merge the last 4 data frames into one and fill any null value with 0.

In [None]:
df = pd.merge(pd.merge(df2, df3, on='product_id', how='outer'),
              pd.merge(df4, df5, on='product_id', how='outer'), 
              on='product_id', how='outer').fillna(0)
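
When many data frames share the same key, the nested merges can also be chained with functools.reduce. This is an alternative sketch on made-up frames (`frames` and `merged` are assumed names), not the approach used in the cell above:

```python
from functools import reduce
import pandas as pd

frames = [
    pd.DataFrame({'product_id': [100, 200], 'item_N2_cnt': [1, 4]}),
    pd.DataFrame({'product_id': [100],      'item_N3_cnt': [2]}),
    pd.DataFrame({'product_id': [200, 300], 'item_N4_cnt': [5, 6]}),
]

# Merge the frames one after another on product_id, keeping every product.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='product_id', how='outer'),
    frames,
).fillna(0)

print(merged)
```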

Finally, in our final data frame, we create four new columns which hold the ratio of item_N_cnt to item_N_chance for every value of N.

In [None]:
df['item_N2_ratio'] = df['item_N2_cnt']/df['item_N2_chance']
df['item_N3_ratio'] = df['item_N3_cnt']/df['item_N3_chance']
df['item_N4_ratio'] = df['item_N4_cnt']/df['item_N4_chance']
df['item_N5_ratio'] = df['item_N5_cnt']/df['item_N5_chance']

At this point, we fill any null values with zero (a product with item_N_chance equal to 0 gives the division 0/0, which results in NaN) and we are ready to see the table.

In [None]:
df.fillna(0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head(20)
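
The reason for the second fillna can be reproduced on two rows. The frame `toy` and the flag `has_nan_before` are made up for this sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    'product_id':     [100, 200],
    'item_N2_cnt':    [1.0, 0.0],
    'item_N2_chance': [2.0, 0.0],
})

# Element-wise division; for product 200 this is 0/0, which pandas returns as NaN.
toy['item_N2_ratio'] = toy['item_N2_cnt'] / toy['item_N2_chance']
has_nan_before = bool(toy['item_N2_ratio'].isna().any())

# Replace the NaN by 0 so that the ratio is defined for every product.
toy = toy.fillna(0)
print(has_nan_before)
print(toy)
```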

## Important Notice
At this point, we should underline the business meaning of these ratios. They can be read as estimates of the probability that a user who has bought a product will purchase it again within the next N orders, based on past purchase patterns.