# Introduction
This kernel has been created by the [Information Systems Lab](http://islab.uom.gr) at the University of Macedonia, Greece for the needs of the elective course Special Topics of Information Systems I at the [Business Administration](http://www.uom.gr/index.php?tmima=2&categorymenu=2) department of the University of Macedonia, Greece.

The main creator of this kernel is [Anastasios Papadopoulos](https://www.kaggle.com/ba15104) and we would like to thank him for his contribution to our course.

# Business Insights

* Probability that a product will be repurchased in consecutive orders, based on two variables:
    - In how many consecutive orders a product has been bought
    - How many orders have been made by users who have bought the product at least once
* Probability that a product will be repurchased within "N" orders



# Python Skills
* Create and use dictionaries to update index-based information for each product
* Perform a loop
* Convert a dictionary to a DataFrame.

# Loading the required packages and .csv files
We load the pandas and numpy packages. <br>
For our datasets, we read the .csv files we need to build a DataFrame with the prior orders and the products purchased in each of them (the prd DataFrame).

In [None]:
#Libraries
import pandas as pd
import numpy as np

#Datasets
orders = pd.read_csv('../input/orders.csv' )
order_products = pd.read_csv('../input/order_products__train.csv' )
order_products_prior = pd.read_csv('../input/order_products__prior.csv')

# 1. Probability that a product will be repurchased consecutive times
Our first feature is the ratio of two other variables.

The first variable (numerator) counts in how many consecutive orders a product has been bought. Orders are consecutive when the product appears in both order n and order n+1. <br>
The second variable (denominator) counts how many orders have been made by a user who has bought this product at least once. In other words, it measures how many chances the user had to buy the item, whether or not they actually bought it.

To make the feature clearer, here is an example:

If a user bought an item at least once, the second variable (denominator) stores the total number of orders that he or she has made. So, if the user has made 10 orders and the item appears in at least one of them, the second variable is 10.

For the first variable there are several cases to examine.
* If a user bought it exactly once, the first variable would be 0.
* If a user bought it more than once, **it depends on how many of these purchases were consecutive**.

For instance, if there was this pattern:

1st Order: Bought <br>
2nd Order: Not Bought <br>
**3rd, 4th, 5th Order**: Bought <br>
6th Order: Not Bought <br>
**7th, 8th Order**: Bought<br>
9th Order: Not Bought<br>
10th Order: Bought<br>

With this pattern, the first variable would be 5.

So, the final value would be the ratio 5/10 = 0.5. The actual feature is computed per product and aggregates all such patterns across all users.
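The arithmetic of this example can be reproduced with a few lines of plain Python. This is only a sketch of the simplified counting used in the example, not the loop applied later to the full data; the names `total_orders`, `bought_in` and `consecutive` are made up for illustration.

```python
# Hypothetical purchase history for one user and one product (orders 1..10).
total_orders = 10
bought_in = {1, 3, 4, 5, 7, 8, 10}

# An order counts as "consecutive" when the product was also bought
# in the order right before or right after it.
consecutive = [n for n in sorted(bought_in)
               if (n - 1) in bought_in or (n + 1) in bought_in]

ratio = len(consecutive) / total_orders
print(consecutive, ratio)
```

Orders 1 and 10 are left out because neither has a neighbouring order that contains the product.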

To make this happen we are going to use elements of Python such as:
* Joins
* Dictionaries
* Loops

## 1.1 Creating a data frame that contains data from multiple sources
The first step is to set up the data frame we are going to use.

We create a data frame that contains the orders made by the customers and the products purchased in each order. 
To do so, we use two data frames:

orders (all the orders made by all customers) ...

In [None]:
orders.head()

... and the order_products_prior (the products purchased in each order).

In [None]:
order_products_prior.head()

We merge these two DataFrames on their common column, order_id. With how='inner', the merge keeps only the rows whose order_id appears in both DataFrames.

![](https://www.w3schools.com/Sql/img_innerjoin.gif)
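To illustrate the inner join on something small enough to check by eye, here is a sketch with two toy tables. The frames `toy_orders` and `toy_products` and all their values are invented for this illustration; only `pd.merge` and its `on` and `how` arguments come from the notebook itself.

```python
import pandas as pd

# Toy stand-ins for orders and order_products_prior (values are made up).
toy_orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [10, 10, 11]})
toy_products = pd.DataFrame({'order_id': [1, 1, 3, 4],
                             'product_id': [100, 200, 100, 300]})

# Inner join: order 2 (no products) and order 4 (no order row) are dropped.
toy_prd = pd.merge(toy_orders, toy_products, on='order_id', how='inner')
print(toy_prd)
```

Each order row is repeated once per product it contains, which is the shape the prd DataFrame has as well.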

In [None]:
prd = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prd.head(10)

Now that we have a DataFrame that combines all the prior orders with the products purchased in each order, we can derive insights for each product.

## 1.2 Finding how many orders have been made by each user
In this step we add a column that holds the maximum order number of each user, i.e. how many prior orders the user has made.

In [None]:
# Highest order_number per user, repeated on every row of that user
prd['user_max_onb'] = prd.groupby('user_id').order_number.transform('max')
prd.head(20)
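The effect of `groupby(...).transform(...)` is easier to see on a handful of rows. The sketch below uses an invented frame called `toy`; it assumes the string alias `'max'`, which pandas accepts in place of a function such as `np.max`.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],
                    'order_number': [1, 2, 3, 1, 2]})

# transform returns one value per input row, so the result can be
# assigned straight back as a new column.
toy['user_max_onb'] = toy.groupby('user_id')['order_number'].transform('max')
print(toy)
```

Unlike an aggregation, which would return one row per user, `transform` keeps the original number of rows.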

## 1.3 Creating dictionaries to store the results of our loops
To store the results during our iterations we import "defaultdict" from the collections module. 
Calling "defaultdict" with a type creates an empty dictionary whose missing keys get a default value of that type (in our case integers, so the default is 0).
> A dictionary is a collection which is changeable and indexed. In Python dictionaries are written with curly brackets, and they have keys and values. 

In our case, we create two dictionaries: item_cnt (item count) and item_chance. 
The reason we use defaultdict(int) is to avoid the KeyError that is raised when we access a key that is not in the dictionary. With defaultdict(int), a missing key is created automatically with the value 0.

The keys of the dictionaries will be the product ids and the values will be the values of our variables.
This way, we can easily turn the dictionaries into DataFrames in which the keys become the rows, the variables become the columns and the dictionary values fill the cells. 

In [None]:
from collections import defaultdict
item_cnt    = defaultdict(int)
item_chance = defaultdict(int)
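The difference between a plain dictionary and `defaultdict(int)` can be shown with a short sketch. The names `plain` and `counts` and the key `'apple'` are placeholders chosen for this illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain['apple'] += 1          # key does not exist yet
except KeyError:
    print('plain dict raises KeyError')

counts = defaultdict(int)
counts['apple'] += 1             # missing key starts at int() == 0
counts['apple'] += 1
print(dict(counts))
```

This is why the loop in the following steps can write `item_cnt[product_id] += 1` without first checking whether the product id is already a key.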

## 1.4 Explore the data that we are going to work with
Before creating our loop, have a look at the prd DataFrame with the variables that we are going to use:

In [None]:
prd[['user_id', 'product_id', 'order_number', 'user_max_onb']].head(5)

## 1.5 Using a loop for the calculation of the main indices
Here we create a loop that calculates two variables: item_cnt and item_chance. <br>
Specifically:
* item_cnt : counts, across all users, how many times a specific product was bought in two consecutive orders
* item_chance : counts, across all users, how many orders contained this specific product, excluding each user's last order.

How does this loop work?
1. We assign the value "None" to three new variables: pid_back (product id), uid_back (user id), and onb_back (order number). These variables are initialized so we can use them in our loop.
2. The "for" statement declares that the code inside the loop runs for every row of the table "prd". 
3. The variables we previously set to "None" (pid_back, uid_back, onb_back) change their value in every iteration. They hold the information of the previous row and are used in the two "if" conditions.
4. The first condition checks whether a user bought a product in consecutive orders. Specifically, if the user_id and the product_id are the same as those of the previous row and the current order number minus the previous order number equals one, then the dictionary item_cnt is increased by 1 at the respective key, which is the product id in our case. 
5. The second condition checks whether the current order number differs from the total number of orders of the user, i.e. whether it is not the user's last order. If this is true, it adds one as before, but this time to the dictionary item_chance.<br>

This operation is repeated for every row of the table "prd".



In [None]:
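# The loop compares each row with the previous one, so the rows must be
# ordered by user, product and order number.
prd = prd.sort_values(['user_id', 'product_id', 'order_number'])

pid_back = uid_back = onb_back = None

for user_id, product_id, order_number, max_onb in prd[['user_id', 'product_id', 'order_number', 'user_max_onb']].values:

    # Same user and product as the previous row, bought in the very next order
    if user_id==uid_back and product_id==pid_back and (order_number-onb_back==1):
        item_cnt[product_id] +=1
    # Every purchase that is not in the user's last order counts as a chance
    if order_number!=max_onb:
        item_chance[product_id] +=1

    # Remember the current row for the next iteration
    uid_back = user_id
    pid_back = product_id
    onb_back = order_number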
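To see what the loop does on a single purchase history, the sketch below replays the same two conditions on the ten-order pattern from the example in section 1. The rows and the names `toy_cnt` and `toy_chance` are made up for illustration. Traced by hand, the first condition fires once for each pair of back-to-back orders (3 to 4, 4 to 5, 7 to 8) and the second fires for every purchase outside the last order, so the counts it produces can differ from the simplified 5/10 arithmetic of the introductory example.

```python
from collections import defaultdict

# Hypothetical rows (user_id, product_id, order_number, user_max_onb) for one
# user who made 10 orders and bought product 1 in orders 1, 3, 4, 5, 7, 8, 10.
rows = [(1, 1, n, 10) for n in (1, 3, 4, 5, 7, 8, 10)]

toy_cnt = defaultdict(int)
toy_chance = defaultdict(int)
uid_back = pid_back = onb_back = None

for user_id, product_id, order_number, max_onb in rows:
    # Same user and product as the previous row, in the very next order
    if user_id == uid_back and product_id == pid_back and order_number - onb_back == 1:
        toy_cnt[product_id] += 1
    # Purchase that is not in the user's last order
    if order_number != max_onb:
        toy_chance[product_id] += 1
    uid_back, pid_back, onb_back = user_id, product_id, order_number

print(toy_cnt[1], toy_chance[1])
```

The rows are listed in order of user, product and order number; the comparison with the previous row only makes sense under that ordering.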

Have a look at the results for a specific product, 24852 (bananas):

In [None]:
item_cnt[24852]

In [None]:
item_chance[24852]

What do the above metrics mean?

## 1.6 Transforming the dictionaries into DataFrame
Now that we have the variables we want, we organize them in DataFrames.

For item_cnt, we organize its numbers in a table with 2 columns, "product_id" and "item_first_cnt". We do the same for item_chance, but the second column is called "item_first_chance". 

After that, we merge the two DataFrames into one called df, using "product_id" as the key and applying an outer join, and we fill any missing values with 0.

In [None]:
item_cnt = pd.DataFrame.from_dict(item_cnt, orient='index').reset_index()
item_cnt.columns = ['product_id', 'item_first_cnt']
item_chance = pd.DataFrame.from_dict(item_chance, orient='index').reset_index()
item_chance.columns = ['product_id', 'item_first_chance']

df = pd.merge(item_cnt, item_chance, on='product_id', how='outer').fillna(0)

## 1.7 Creating the final feature

Finally, we create the final feature (ratio) of item_first_cnt to item_first_chance. <br>
We store the results on our prd DataFrame.

In [None]:
df['item_first_ratio'] = df.item_first_cnt/df.item_first_chance
prd = prd.merge(df[['product_id', 'item_first_ratio']], on='product_id', how='left')
prd.head()
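Because missing counts were filled with 0 before the division, some denominators may be zero. The sketch below shows, on invented numbers, what pandas returns in that case and one possible way to handle it. Mapping these cases to 0 is an assumption made here for illustration, not something the original kernel prescribes at this step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'product_id': [1, 2, 3],
                    'item_first_cnt': [3.0, 0.0, 2.0],
                    'item_first_chance': [6.0, 0.0, 0.0]})

toy['item_first_ratio'] = toy['item_first_cnt'] / toy['item_first_chance']
# 0/0 gives NaN and 2/0 gives inf; map both to 0 (an assumption, not a rule).
toy['item_first_ratio'] = (toy['item_first_ratio']
                           .replace([np.inf, -np.inf], np.nan)
                           .fillna(0))
print(toy)
```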

# 2. Probability that a product will be repurchased within "N" orders
The procedure for creating this feature is almost the same as the one we followed in the previous section.

Previously, we calculated for each product the ratio of how many consecutive times a user bought it to the total number of orders the user made. Now, we are going to calculate 4 similar ratios. The general ratio is how many times a product has been purchased within "N" orders to the total number of orders made by each user, provided that this number is at least "N". It is calculated for N=2,3,4,5.

The big difference here is that we don't care about the consecutive times that an item is included in the user's orders; we care about the distance between the order in which the item was purchased and the order in which it was repurchased. This distance is denoted by "N".

## 2.1 Creating dictionaries for the new metrics
As we did previously, we are going to create dictionaries. We create 8 dictionaries here; the 3 variables with the value "None" that are going to be used in our conditions are set just before the loop.

In [None]:
prd['user_max_onb'] = prd.groupby('user_id').order_number.transform('max')
item_N2_cnt    = defaultdict(int)
item_N2_chance = defaultdict(int)
item_N3_cnt    = defaultdict(int)
item_N3_chance = defaultdict(int)
item_N4_cnt    = defaultdict(int)
item_N4_chance = defaultdict(int)
item_N5_cnt    = defaultdict(int)
item_N5_chance = defaultdict(int)

## 2.2 Using a loop for the calculation of the main indices
After that, we create the loop. The only differences are in the conditions. Specifically, if the user id and the product id are the same as in the previous row, the product was repurchased within N orders, and the user has at least N more orders after the current one, the dictionary item_N_cnt is increased by 1 at the respective key (the product id). 

The second condition does the same for the dictionary item_N_chance whenever the item has already been purchased before by the same user and at least N more orders follow.

In [None]:
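# The comparison with the previous row assumes the rows are ordered by user, product and order number
prd = prd.sort_values(['user_id', 'product_id', 'order_number'])

pid_back = uid_back = onb_back = None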

for product_id, user_id, order_number, max_order_number in prd[['product_id', 'user_id', 'order_number','user_max_onb']].values:
        
    if product_id==pid_back and user_id==uid_back and (order_number-onb_back)<=2 and (max_order_number-order_number) >=2:
        item_N2_cnt[product_id] +=1
    if product_id==pid_back and user_id==uid_back and (max_order_number-order_number) >=2:
        item_N2_chance[product_id] +=1

    if product_id==pid_back and user_id==uid_back and (order_number-onb_back)<=3 and (max_order_number-order_number) >=3:
        item_N3_cnt[product_id] +=1
    if product_id==pid_back and user_id==uid_back and (max_order_number-order_number) >=3:
        item_N3_chance[product_id] +=1

    if product_id==pid_back and user_id==uid_back and (order_number-onb_back)<=4 and (max_order_number-order_number) >=4:
        item_N4_cnt[product_id] +=1
    if product_id==pid_back and user_id==uid_back and (max_order_number-order_number) >=4:
        item_N4_chance[product_id] +=1

    if product_id==pid_back and user_id==uid_back and (order_number-onb_back)<=5 and (max_order_number-order_number) >=5:
        item_N5_cnt[product_id] +=1
    if product_id==pid_back and user_id==uid_back and (max_order_number-order_number) >=5:
        item_N5_chance[product_id] +=1

    pid_back = product_id
    uid_back = user_id
    onb_back = order_number
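The four pairs of conditions above differ only in the value of N, so the same logic can be written once and repeated over the window sizes. The following is a sketch of that idea on made-up rows, not the kernel's own code; the names `windows`, `cnt`, `chance`, `gap` and `remaining` are introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical rows (product_id, user_id, order_number, user_max_onb),
# already sorted by user, product and order number.
rows = [(1, 1, 1, 10), (1, 1, 2, 10), (1, 1, 6, 10), (1, 1, 9, 10)]

windows = (2, 3, 4, 5)
cnt = {n: defaultdict(int) for n in windows}      # plays the role of item_N_cnt
chance = {n: defaultdict(int) for n in windows}   # plays the role of item_N_chance
pid_back = uid_back = onb_back = None

for product_id, user_id, order_number, max_onb in rows:
    if product_id == pid_back and user_id == uid_back:
        gap = order_number - onb_back        # orders since the previous purchase
        remaining = max_onb - order_number   # orders left after the current one
        for n in windows:
            if remaining >= n:
                chance[n][product_id] += 1
                if gap <= n:
                    cnt[n][product_id] += 1
    pid_back, uid_back, onb_back = product_id, user_id, order_number

for n in windows:
    print(n, cnt[n][1], chance[n][1])
```

Keeping one dictionary per N inside an outer dictionary avoids eight separate variables, at the cost of one extra level of indexing.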

## 2.3 Transforming the dictionaries into DataFrame
Then, we turn each dictionary into a data frame with 2 columns: the product id and the variable we have already calculated.

In [None]:
item_N2_cnt = pd.DataFrame.from_dict(item_N2_cnt, orient='index').reset_index()
item_N2_cnt.columns = ['product_id', 'item_N2_cnt']
item_N2_chance = pd.DataFrame.from_dict(item_N2_chance, orient='index').reset_index()
item_N2_chance.columns = ['product_id', 'item_N2_chance']

item_N3_cnt = pd.DataFrame.from_dict(item_N3_cnt, orient='index').reset_index()
item_N3_cnt.columns = ['product_id', 'item_N3_cnt']
item_N3_chance = pd.DataFrame.from_dict(item_N3_chance, orient='index').reset_index()
item_N3_chance.columns = ['product_id', 'item_N3_chance']

item_N4_cnt = pd.DataFrame.from_dict(item_N4_cnt, orient='index').reset_index()
item_N4_cnt.columns = ['product_id', 'item_N4_cnt']
item_N4_chance = pd.DataFrame.from_dict(item_N4_chance, orient='index').reset_index()
item_N4_chance.columns = ['product_id', 'item_N4_chance']

item_N5_cnt = pd.DataFrame.from_dict(item_N5_cnt, orient='index').reset_index()
item_N5_cnt.columns = ['product_id', 'item_N5_cnt']
item_N5_chance = pd.DataFrame.from_dict(item_N5_chance, orient='index').reset_index()
item_N5_chance.columns = ['product_id', 'item_N5_chance']
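Since the same two lines are repeated for every dictionary, the conversion could also be wrapped in a small helper. The function name `dict_to_frame` and the sample values below are hypothetical; the calls to `pd.DataFrame.from_dict` and `reset_index` are the same ones used above.

```python
import pandas as pd

def dict_to_frame(counts, value_name):
    """Turn a {product_id: value} dictionary into a two-column DataFrame."""
    frame = pd.DataFrame.from_dict(counts, orient='index').reset_index()
    frame.columns = ['product_id', value_name]
    return frame

toy_counts = {24852: 7, 13176: 4}   # made-up values
toy_frame = dict_to_frame(toy_counts, 'item_N2_cnt')
print(toy_frame)
```

With such a helper, each conversion would take a single line, for example `dict_to_frame(item_N2_cnt, 'item_N2_cnt')`.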

The next step is to merge each pair of DataFrames. We use an outer join to group the results by N. Hence, we create 4 new data frames, each with the product id and 2 value columns: item_N_cnt and item_N_chance.

In [None]:
df2 = pd.merge(item_N2_cnt, item_N2_chance, on='product_id', how='outer')
df3 = pd.merge(item_N3_cnt, item_N3_chance, on='product_id', how='outer')
df4 = pd.merge(item_N4_cnt, item_N4_chance, on='product_id', how='outer')
df5 = pd.merge(item_N5_cnt, item_N5_chance, on='product_id', how='outer')

We continue combining the data frames until we reach the final one. With the same outer join, we merge the last 4 data frames into one and fill any null values with 0.

In [None]:
df = pd.merge(pd.merge(df2, df3, on='product_id', how='outer'),
              pd.merge(df4, df5, on='product_id', how='outer'), 
              on='product_id', how='outer').fillna(0)
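When several data frames share the same key, the nested merges can alternatively be expressed with `functools.reduce`, which applies the merge pairwise from left to right. The sketch below uses three invented frames standing in for df2 to df5; it is one possible formulation, not the approach taken in the cell above.

```python
from functools import reduce
import pandas as pd

# Made-up per-N frames standing in for df2..df5.
frames = [pd.DataFrame({'product_id': [1, 2], 'item_N2_cnt': [5, 1]}),
          pd.DataFrame({'product_id': [1, 3], 'item_N3_cnt': [6, 2]}),
          pd.DataFrame({'product_id': [2, 3], 'item_N4_cnt': [2, 3]})]

# Outer-join the frames one after another, then fill the gaps with 0.
merged = reduce(lambda left, right: pd.merge(left, right, on='product_id', how='outer'),
                frames).fillna(0)
print(merged)
```

Every product id that appears in at least one frame survives the outer join, and the columns it was missing from are filled with 0.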

Finally, in our final data frame we create four new columns which are the ratios of item_N_cnt to item_N_chance for every value of N.

In [None]:
df['item_N2_ratio'] = df['item_N2_cnt']/df['item_N2_chance']
df['item_N3_ratio'] = df['item_N3_cnt']/df['item_N3_chance']
df['item_N4_ratio'] = df['item_N4_cnt']/df['item_N4_chance']
df['item_N5_ratio'] = df['item_N5_cnt']/df['item_N5_chance']

In [None]:
prd = prd.merge(df[['product_id', 'item_N2_ratio', 'item_N3_ratio', 'item_N4_ratio', 'item_N5_ratio']], on='product_id', how='left')

At this point, we reset the index, look at the table and fill any remaining null values with zero.

In [None]:
df.reset_index(drop=True, inplace=True)
df.head(20)

In [None]:
df.fillna(0, inplace=True)

## Important Notice
At this point, we should underline the business meaning of these ratios. They can be read as probabilities: they indicate how likely it is that a user will purchase the product in their next order, based on the pattern of past purchases.