In [None]:
#!pip install --upgrade pip
#!pip install apyori

In [None]:
import os 
import pandas as pd
import numpy as np
from apyori import apriori
from collections import Counter
from datetime import datetime
from itertools import combinations
import matplotlib.pyplot as plt

# Loading and studying the files:

In [None]:
aisles = pd.read_csv('../input/instacart-market-basket-analysis/aisles.csv')
aisles.dtypes

In [None]:
aisles

Here we have the names of the aisles and the primary key that identifies each of them. Let's now check for missing values:

In [None]:
aisles.isna().sum(axis = 0)

As there are no missing values, this table needs no special treatment.

In [None]:
departments = pd.read_csv('../input/instacart-market-basket-analysis/departments.csv')
departments.dtypes

In [None]:
departments

As with aisles, here we have two columns: the primary key and the name of the department.

In [None]:
departments.isna().sum(axis = 0)

In [None]:
products = pd.read_csv('../input/instacart-market-basket-analysis/products.csv')

In [None]:
products.dtypes

In [None]:
products

Here we see something new: foreign keys. Each product is linked to an aisle (`aisle_id`) and to a department (`department_id`). We can look up the names of these aisles and departments:
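
As a minimal sketch of how these foreign keys can be followed for all products at once, the example below joins three small toy tables with `merge`. The `*_demo` frames and their values are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up for illustration).
aisles_demo = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["cookies", "fresh fruits"]})
departments_demo = pd.DataFrame({"department_id": [10, 20], "department": ["snacks", "produce"]})
products_demo = pd.DataFrame({
    "product_id": [100, 101, 102],
    "product_name": ["Chocolate Cookie", "Banana", "Apple"],
    "aisle_id": [1, 2, 2],
    "department_id": [10, 20, 20],
})

# Follow each foreign key to its lookup table; validate= raises an error if a
# key unexpectedly matches more than one row on the lookup side.
products_full = (
    products_demo
    .merge(aisles_demo, on="aisle_id", how="left", validate="many_to_one")
    .merge(departments_demo, on="department_id", how="left", validate="many_to_one")
)
print(products_full[["product_name", "aisle", "department"]])
```

A left join keeps every product even if a key had no match, which would then show up as a missing aisle or department name.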

In [None]:
aisles[aisles['aisle_id'] == 61]

In [None]:
departments[departments['department_id'] == 19]

In [None]:
products.describe()

The "count" is the same for every numeric column, indicating that there are no missing values in them.

In [None]:
orders = pd.read_csv('../input/instacart-market-basket-analysis/orders.csv')
orders.dtypes

In [None]:
orders

In [None]:
orders.shape

In [None]:
orders.eval_set.value_counts()

This column already splits the dataset into prior, train, and test sets. We will keep only the "prior" records:

In [None]:
orders = orders[orders.eval_set == 'prior']

As this column now carries no information (it only marked the split of the dataset), we will drop it:

In [None]:
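#reassign instead of dropping in place: orders is a filtered slice, and an in-place drop on it can raise a SettingWithCopyWarning
orders = orders.drop('eval_set', axis = 1)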

In [None]:
orders.isna().sum(axis = 0)

Each user's first order is marked with *NaN* because the user had never made a purchase before, so there is no number of days since a prior order. This *NaN* is therefore not a missing value. We can locate these first orders:
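
A minimal sketch of how this claim could be checked, on a toy table with made-up values: the *NaN* rows should be exactly the rows with `order_number` equal to 1, and there should be one of them per user.

```python
import numpy as np
import pandas as pd

# Toy orders table: two users, made-up values.
orders_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_first = orders_demo["order_number"] == 1
is_nan = orders_demo["days_since_prior_order"].isna()

# The NaN rows should be exactly the first orders, one per user.
nan_only_on_first_orders = bool((is_first == is_nan).all())
one_nan_per_user = int(is_nan.sum()) == orders_demo["user_id"].nunique()
print(nan_only_on_first_orders, one_nan_per_user)
```

The same two comparisons can be run on the real `orders` frame in place of `orders_demo`.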

In [None]:
orders.loc[orders.days_since_prior_order.isna()]

Let's look at the order numbers of just the first 100 orders:

In [None]:
plt.plot(orders.order_number[:100])
plt.title('Sequence of order number')
plt.xlabel('Sequence in the dataframe')
plt.ylabel('Order Number');

This graph shows the order number of the first 100 rows, which belong to only a few users. The first user placed 10 orders; the low point right after that marks the start of the second user, whose order number then grows to approximately 13 or 14. Within these 100 rows, the user whose order number reaches 20 placed the most orders.
Let's view the distribution by day of the week (dow) and by hour of the day:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].boxplot(orders.order_dow)
ax[0].set_title('Boxplot day of week')
ax[0].set_ylabel('day of week')
ax[1].hist(orders.order_hour_of_day)
ax[1].set_title('Histogram hour of day')
ax[1].set_xlabel('hour')
ax[1].set_ylabel('count');

The orange line marks the median, which is day 3. We can also see that there are some orders at hour 0, but most orders are placed between hours 14 and 16.
Finally, an analysis of `days_since_prior_order`:

In [None]:
plt.figure(figsize = (15,5))
plt.bar(range(100), orders.days_since_prior_order[:100] + 1)
plt.title('Days since prior order')
plt.xlabel('index')
plt.ylabel('days since prior order + 1');

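Here we see how many days passed before each customer returned after the previous purchase. Because the plot adds 1 to each value, an order placed 0 days after the previous one still shows a bar, and only the *NaN* values appear as blanks. For the first user, the first order is *NaN* (blank), and the next bar, at about 16 or 17, shows that the user returned roughly 15 or 16 days later. **The blanks are the NaN values.**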

In [None]:
order_products = pd.read_csv('../input/instacart-market-basket-analysis/order_products__prior.csv')
order_products.dtypes

In [None]:
order_products

The `order_id` links each row to an order (and, through `orders`, to a customer), `add_to_cart_order` is the position in which the product was added to the cart, and `reordered` indicates whether the product had already been ordered in previous purchases (**1 means it was purchased before, 0 means it is the first time the customer buys the product**).

In [None]:
order_products.isna().sum(axis=0)

# Performing data exploration:

In [None]:
orders_apriori = orders.copy()
orders_user = orders.groupby('user_id')['order_number'].max() #it takes the maximum number of orders placed
orders_user.head()

Let's create a data frame that holds, for each order, the `user_id` and the number of products purchased (`order_size`).

In [None]:
products_user = orders[['order_id', 'user_id']].merge(
    order_products[['order_id', 'add_to_cart_order']].groupby('order_id').max().rename({'add_to_cart_order': 'order_size'}, axis = 1),
    on = 'order_id')
products_user

We can filter by order:

In [None]:
products_user[products_user.order_id == 2]

Let's now aggregate the data frame to get the total number of products purchased by each `user_id`:
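
The two steps (size of each order, then total per user) can be sketched on toy data as follows. The `*_demo` frames and their values are made up; the sketch also assumes that cart positions run from 1 to n without gaps, so the highest position equals the order size.

```python
import pandas as pd

# Toy data (made up): user 1 placed orders 10 and 11, user 2 placed order 12.
orders_demo = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 2]})
order_products_demo = pd.DataFrame({
    "order_id":          [10, 10, 11, 12, 12, 12],
    "add_to_cart_order": [1,  2,  1,  1,  2,  3],
})

# Step 1: the highest cart position in an order is the size of that order.
order_size = (
    order_products_demo.groupby("order_id")["add_to_cart_order"]
    .max()
    .rename("order_size")
    .reset_index()
)

# Step 2: attach the size to each order, then add up the sizes per user.
per_order = orders_demo.merge(order_size, on="order_id")
per_user = per_order.groupby("user_id")["order_size"].sum()
print(per_user)
```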

In [None]:
products_user = products_user.drop('order_id', axis = 1).groupby('user_id')['order_size'].sum()
products_user

Creating a graph to be able to view these results:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15,5))
ax[0].hist(orders_user, bins = max(orders_user) - min(orders_user))
ax[0].set_title('Count of orders by user')
ax[0].set_xlabel('number of orders')
ax[0].set_ylabel('count')

ax[1].hist(products_user, bins = 100)
ax[1].set_title('Count of products by user')
ax[1].set_xlabel('number of products')
ax[1].set_ylabel('count');

The graphs show how many customers fall into each range. In the first graph, more than 20000 customers made only about 3 purchases, and fewer than 5000 made more than 20. There is a peak near 100, which suggests that the dataset caps the order count: customers with more than 100 purchases seem to be recorded as 99 or 100, so they accumulate at that value. The second graph shows that more than 50,000 customers purchased fewer than about 50 products, and that fewer than 10,000 bought more than 500 products.

We will now delete the `user_id` and `order_id` columns because they are not needed to create the association rules.

In [None]:
orders_apriori.drop(['user_id', 'order_id'], axis = 1, inplace=True)
orders_apriori.head()

# Order Number

In [None]:
orders.head()

I will create a graph that counts how many orders exist for each `order_number`:

In [None]:
orders_by_order_number = orders.order_number.value_counts()
plt.bar(orders_by_order_number.index, orders_by_order_number)
plt.title('Number of orders by order number')
plt.xlabel('order number')
plt.ylabel('number of orders');

The graph shows that there are over 200000 orders with `order_number` = 1.
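
Association rules work on categorical items, so a numeric column such as `order_number` has to be binned into labelled ranges. As a hedged alternative to a hand-written function, the sketch below uses `pandas.cut`. The bin edges are an assumption, chosen so that each label covers exactly the range it names.

```python
import pandas as pd

# Bin edges are right-inclusive: (0, 3], (3, 5], (5, 10], ...
# These edges are an assumption chosen to match the labels below.
edges = [0, 3, 5, 10, 20, 40, 60, float("inf")]
labels = [
    "order_number_1-3", "order_number_4-5", "order_number_6-10",
    "order_number_11-20", "order_number_21-40", "order_number_41-60",
    "order_number_60+",
]

order_numbers = pd.Series([1, 3, 4, 5, 6, 10, 11, 20, 21, 60, 61, 99])
binned = pd.cut(order_numbers, bins=edges, labels=labels, right=True).astype(str)
print(pd.DataFrame({"order_number": order_numbers, "category": binned}))
```

Printing boundary values such as 3, 4, 60 and 61 next to their category is a quick way to confirm that no range is off by one.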

In [None]:
#Convert to categorical variables since we will work with association rules.
#Each range matches its label (order numbers start at 1):

def order_number_categorical(order_number):
  if order_number <= 3:
    return 'order_number_1-3'
  if order_number <= 5:
    return 'order_number_4-5'
  if order_number <= 10:
    return 'order_number_6-10'
  if order_number <= 20:
    return 'order_number_11-20'
  if order_number <= 40:
    return 'order_number_21-40'
  if order_number <= 60:
    return 'order_number_41-60'
  return 'order_number_60+'

In [None]:
orders_apriori.order_number = orders_apriori.order_number.map(order_number_categorical)
orders_apriori.head()

# Day of Week

In [None]:
#Total orders per day of the week:

orders_by_dow = orders.order_dow.value_counts()
orders_by_dow

In [None]:
#Total products per day of the week:

products_by_dow = orders[['order_id', 'order_dow']].merge(
    order_products[['order_id', 'add_to_cart_order']].groupby('order_id').max().rename({'add_to_cart_order': 'order_size'}, axis = 1),
    on = 'order_id')
products_by_dow = products_by_dow.drop('order_id', axis=1).groupby('order_dow')['order_size'].sum()
products_by_dow

In [None]:
#The results in a more visual way:

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].bar(orders_by_dow.index, orders_by_dow)
ax[0].set_title('Number of orders by day of week')
ax[0].set_xlabel('day of week')
ax[0].set_ylabel('number of orders')

ax[1].bar(products_by_dow.index, products_by_dow)
ax[1].set_title('Number of products by day of week')
ax[1].set_xlabel('day of week')
ax[1].set_ylabel('number of products');

In the first graph we can see that day 0 has more than 500000 orders, as does day 1. From day 2 on, the numbers fall to less than 400500 orders per day. The second graph shows that on day 0 more than 6 million products were ordered (the axis is scaled by 1e6). From day 2 on, fewer than 5 million products were ordered per day.

Now let's create the function that transforms the column into a categorical variable. The graphs show heavy activity on days 0 and 1, so we will treat them as the weekend (Saturday and Sunday).

In [None]:
def dow_categorical(dow):
    if dow in [0, 1]:
        return 'weekend'
    else:
        return 'weekday'

In [None]:
orders_apriori.order_dow = orders_apriori.order_dow.map(dow_categorical)
orders_apriori.head()

# Hours of day

In [None]:
orders_by_hour = orders.order_hour_of_day.value_counts()
orders_by_hour

Let's create a chart to view the number of orders per hour and the products per hour:

In [None]:
products_by_hour = orders[['order_id', 'order_hour_of_day']].merge(
    order_products[['order_id', 'add_to_cart_order']].groupby('order_id').max().rename({'add_to_cart_order': 'order_size'}, axis = 1),
    on = 'order_id')

products_by_hour = products_by_hour.drop('order_id', axis = 1).groupby('order_hour_of_day')['order_size'].sum()

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].bar(orders_by_hour.index, orders_by_hour)
ax[0].set_title('Number of orders by hour of day')
ax[0].set_xlabel('hour of day')
ax[0].set_ylabel('number of orders')

ax[1].bar(products_by_hour.index, products_by_hour)
ax[1].set_title('Number of products by hour of day')
ax[1].set_xlabel('hour of day')
ax[1].set_ylabel('number of products');

Graph 1 shows that most orders are placed between hours 9 and 17, and graph 2 shows the same for products. There is not much difference between the two graphs.

In [None]:
# conversion to categorical:

def hour_categorical(hour):
  if hour in range(7):
    return 'early_hours'
  if hour in range(7,10):
    return 'hour_' + str(hour)
  if hour in range(10, 17):
    return 'peak_hours'
  if hour in range(17, 24):
    return 'hour_' + str(hour)

In [None]:
orders_apriori.order_hour_of_day = orders_apriori.order_hour_of_day.map(hour_categorical)
orders_apriori.head()

# Days Since Prior Order

In [None]:
plt.hist(orders.days_since_prior_order, bins = 30)
plt.title('Histogram of days since prior order')
plt.xlabel('days')
plt.ylabel('number of orders');

The graph shows that a large number of customers return only after about a month (the spike at the right end of the histogram), and that many customers return within a few days of the previous purchase.

In [None]:
# conversion to categorical:

def interval_categorical(interval):
    if np.isnan(interval):
        return 'first_order'
    elif interval in [7, 14, 21]:
        return 'interval_weekly'
    elif interval == 30:
        return 'interval_30+'
    else:
        return 'interval_others'

In [None]:
orders_apriori.days_since_prior_order = orders_apriori.days_since_prior_order.map(interval_categorical)
orders_apriori.head()

# Order Products

**Reordered Products:** Create a dictionary to associate ids with product names.

In [None]:
products_id_to_name = {k: v for k, v in zip(products.product_id, products.product_name)}
print(products_id_to_name)

In [None]:
#create a new data frame:

order_products_names = order_products.copy()
order_products_names['product_name'] = order_products_names.product_id.map(lambda x: products_id_to_name[x])
order_products_names

In [None]:
#count how many times the product was purchased for the first time and how many times a product was repurchased:

reorder_proportion = pd.crosstab(order_products_names.product_name, order_products_names.reordered)
reorder_proportion

In this case, 0 indicates the number of times the product was purchased for the first time and 1 indicates the number of times it was repurchased. Ordering the products that **were most purchased** for the first time:

In [None]:
reorder_proportion.sort_values(by = 0, ascending=False)

Ordering by the products that were **repurchased most often**:

In [None]:
reorder_proportion.sort_values(by = 1, ascending=False)

As a proportion of each product's total:

In [None]:
reorder_proportion['total'] = reorder_proportion.sum(axis = 1)
reorder_proportion['0.perc'] = reorder_proportion[0] / reorder_proportion['total']
reorder_proportion['1.perc'] = reorder_proportion[1] / reorder_proportion['total']
reorder_proportion.head()

Products that were first purchased and never repurchased:

In [None]:
reorder_proportion.sort_values(by = ['0.perc', 'total'], ascending = False)[['0.perc', 'total']]

Because these are proportions, a value of 1 means that 100% of the purchases of that product were first-time purchases.
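
As a side note, `pandas.crosstab` can return these proportions directly through its `normalize` argument. A minimal sketch on toy order lines (the product names and counts are made up):

```python
import pandas as pd

# Toy order lines (made up): reordered is 0 for a first purchase, 1 for a repeat.
lines = pd.DataFrame({
    "product_name": ["Banana", "Banana", "Banana", "Banana", "Candle", "Candle"],
    "reordered":    [0,        1,        1,        1,        0,        0],
})

# normalize="index" divides each row by its total, giving proportions per product.
proportions = pd.crosstab(lines["product_name"], lines["reordered"], normalize="index")
print(proportions)
```

With this option the totals are not kept, so a separate count is still needed when sorting by both the proportion and the total.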

Now the products that, after the first purchase, are most often bought again:

In [None]:
reorder_proportion.sort_values(by = ['1.perc', 'total'], ascending = False)[['1.perc', 'total']]

The most purchased products:

In [None]:
reorder_proportion.total.sort_values(ascending=False)

**Products not ordered:** is there a product that was never purchased?

In [None]:
products_bought = sorted(order_products.product_id.unique())
print(len(products_bought), len(products))

The first value is the number of distinct products that were ordered; the second is the total number of products in the data frame. Let's find out which 11 products were never purchased!

In [None]:
products_not_bought = list(products.product_id[~products.product_id.isin(products_bought)])
products_not_bought

In [None]:
#the name of the products not bought
[products_id_to_name[product] for product in products_not_bought]

I will do a sanity check to confirm that every product bought is registered in `products`:

In [None]:
products_not_registered = list(pd.Series(products_bought)[~pd.Series(products_bought).isin(products.product_id)])
print(len(products_not_registered), products_not_registered)

**Market Basket:** a study of the size of the market basket and how often each size occurs.

In [None]:
cart_size = order_products.groupby('order_id')['add_to_cart_order'].max()
cart_size = cart_size.value_counts()
plt.bar(cart_size.index, cart_size)
plt.title('Count of order size')
plt.xlabel('order size')
plt.ylabel('count');

We can see that more than 200000 orders have an `order_size` between 5 and 15, and that fewer than 50000 have an `order_size` above 20.
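
The basket size here is taken as the highest `add_to_cart_order` of each order. A minimal sketch, on made-up data, compares this with simply counting the rows of each order; the two agree as long as cart positions run from 1 to n without gaps, which is an assumption about the data.

```python
import pandas as pd

# Toy order lines (made up): order 1 has 3 products, orders 2 and 3 have 1 each.
order_products_demo = pd.DataFrame({
    "order_id":          [1, 1, 1, 2, 3],
    "add_to_cart_order": [1, 2, 3, 1, 1],
})

grouped = order_products_demo.groupby("order_id")
size_from_max = grouped["add_to_cart_order"].max()   # highest cart position
size_from_rows = grouped.size()                      # number of rows per order

# If cart positions run 1..n without gaps, both give the same order sizes.
same = bool((size_from_max == size_from_rows).all())

# How many orders exist for each basket size.
size_counts = size_from_rows.value_counts().sort_index()
print(same, size_counts.to_dict())
```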

**Most Frequent Products:** now we will link each product with its `add_to_cart_order`.

In [None]:
add_to_cart = pd.crosstab(order_products_names.product_name, order_products_names.add_to_cart_order)
add_to_cart

The table shows the number of times each product was added to the cart at each position.

Let's use a `for` loop to see which products are most often added at each of the first five positions:

In [None]:
for i in range(1,6):
    print('ORDER = ', i)
    print(add_to_cart.sort_values(by = i, ascending=False)[i][:5])
    print('\n')

It shows the products that are most often placed first in the cart.

# Association Rules

First, I will use Apriori on shopping habits, and later find associations between products!

**Shopping Habits:**

In [None]:
orders_apriori.head()

In [None]:
orders_apriori.shape

In order to use the Apriori algorithm it will be necessary to transform the data frame into a list.

In [None]:
#every value is converted to a string, giving one list of items per order:
trans = orders_apriori.astype(str).values.tolist()

In [None]:
trans[:4]

With this list, we can start creating rules to look for possible patterns.
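
Before running the library, it may help to see what the three thresholds measure. The sketch below computes support, confidence and lift for one rule by hand on four toy transactions. The helper names `support` and `rule_metrics` are introduced here for illustration and are not part of `apyori`.

```python
# Toy transactions in the same shape as `trans`: one list of string items per order.
transactions = [
    ["weekend", "peak_hours"],
    ["weekend", "peak_hours"],
    ["weekend", "early_hours"],
    ["weekday", "peak_hours"],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def rule_metrics(a, b, transactions):
    """Support, confidence and lift of the rule a -> b."""
    support_ab = support(set(a) | set(b), transactions)
    confidence = support_ab / support(a, transactions)
    lift = confidence / support(b, transactions)
    return support_ab, confidence, lift

s, c, l = rule_metrics(["weekend"], ["peak_hours"], transactions)
print(s, c, l)
```

In this toy case the lift is below 1, so a filter such as `min_lift = 2` would discard the rule: the two items appear together less often than they would if they were independent.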

In [None]:
start = datetime.now()
rules = apriori(trans, min_support = 0.005, min_confidence = 0.2, min_lift = 2)
results = list(rules)
print('Execution time: ', datetime.now() - start)

In [None]:
results[0]

In [None]:
#the items (itemset) of the first result:
results[0][0]

In [None]:
#the support of the first result:
results[0][1]

In [None]:
#create a variable r with the rules (ordered statistics) of the first result:
r = results[0][2]
r

In [None]:
#it will return the first rule
r[0]

In [None]:
#it will return the second rule
r[1]

In [None]:
#it returns the confidence and the lift of the first rule:
r[0][2], r[0][3]

Now we transform the results into a data frame, which makes it easier to evaluate each rule that was created!
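
The results are named tuples, so the fields can also be read by name instead of by position. The sketch below defines stand-in tuples that mirror the field names used by `apyori` (`items`, `support`, `ordered_statistics`, and `items_base`, `items_add`, `confidence`, `lift`), so that it runs without the library. The record and its numbers are made up, and `flatten` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-ins that mirror the field names of apyori's result tuples, so the
# sketch runs without the library; the numbers are made up.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

demo_results = [
    RelationRecord(
        items=frozenset({"weekend", "first_order"}),
        support=0.02,
        ordered_statistics=[
            OrderedStatistic(frozenset({"first_order"}), frozenset({"weekend"}), 0.4, 2.1),
        ],
    ),
]

def flatten(results):
    """One dict per rule, reading fields by name instead of by position."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append({
                "A": sorted(stat.items_base),
                "B": sorted(stat.items_add),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

rows = flatten(demo_results)
print(rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame` to obtain the same columns.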

In [None]:
A = []
B = []
support = []
confidence = []
lift = []

for result in results:
  s = result[1]
  result_rules = result[2]
  for result_rule in result_rules:
    a = list(result_rule[0])
    b = list(result_rule[1])
    c = result_rule[2]
    l = result_rule[3]
    A.append(a)
    B.append(b)
    support.append(s)
    confidence.append(c)
    lift.append(l) 

rules_df = pd.DataFrame({
    'A': A,
    'B': B,
    'support': support,
    'confidence': confidence,
    'lift': lift
})

rules_df = rules_df.sort_values(by = 'lift', ascending = False).reset_index(drop = True)
len(rules_df)

This gives us 38 rules.

In [None]:
A[0], B[0], A[1], B[1]

The data frame improves our view of the rules:

In [None]:
rules_df

Here it is much easier to visualize. Let's look at rule 0: `peak_hours` and `first_order` on one side, `weekend` and `order_number_1-3` on the other. It relates a first purchase made during peak hours to that purchase happening on a weekend and being among the customer's first three orders. Now let's look at rule 33: an interval of 30 days or more and its relationship with the order being the 4th or 5th. It suggests that customers who come back after 30 days are often returning for their fourth or fifth purchase.

**Association between products:** Let's repeat the Apriori approach, but now implemented manually, to find associations between products! As we have a large number of products and users, it is easier to do it manually, because just building the list of transactions for the library would take a long time.

Due to the size of the data, we will first work with only the first 5000 rows of `order_products`, which also avoids running out of memory on the machine.
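
One caveat of slicing by rows is that the last order in the slice may be cut short. A possible alternative, sketched below on made-up data, is to choose whole orders and keep all of their rows. `n_sample_orders` is a hypothetical parameter introduced for this sketch.

```python
import pandas as pd

# Toy order lines (made up): three orders with 3, 2 and 3 products.
order_products_demo = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 3, 3],
    "product_id": [10, 11, 12, 10, 13, 11, 12, 14],
})

# Taking the first 4 rows cuts order 2 in half.
head_sample = order_products_demo[:4]

# Alternative: pick whole orders, then keep every row of the chosen orders.
# n_sample_orders is a hypothetical parameter introduced for this sketch.
n_sample_orders = 2
chosen_ids = order_products_demo["order_id"].drop_duplicates().head(n_sample_orders)
order_sample = order_products_demo[order_products_demo["order_id"].isin(chosen_ids)]

print(head_sample["order_id"].value_counts().to_dict())
print(order_sample["order_id"].value_counts().to_dict())
```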

In [None]:
transactions_df = order_products[['order_id', 'product_id']][:5000]
transactions_df

In [None]:
n_orders = len(set(transactions_df.order_id))
n_products = len(set(transactions_df.product_id))
print(n_orders, n_products)

We have 499 orders and 2809 products.

I will compute the frequency of each product in the transactions and plot it:

In [None]:
product_frequency = transactions_df.product_id.value_counts() / n_orders
plt.hist(product_frequency, bins = 100)
plt.title('Number of times each product frequency occurs')
plt.xlabel('product frequency')
plt.ylabel('number of times');

The graph shows that more than 1750 products have a frequency below 0.02 (they appear in fewer than 2% of the orders), and that fewer than 250 products have a frequency above 0.02.

In [None]:
#a zoom:

plt.hist(product_frequency, bins = 100)
plt.title('Number of times each product frequency occurs')
plt.xlabel('product frequency')
plt.ylabel('number of times')
plt.ylim([0, 100]);

We will now filter out the products that appear only a few times. We will keep only the products that appear at least 5 times, which corresponds to a support of 0.01 (0.01 * 499 = 4.99, so a product needs at least 5 occurrences to pass).
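
The arithmetic behind this threshold can be checked with a few lines. This is only a sketch of the boundary case, using the 499 orders reported for the sample.

```python
import math

n_orders = 499        # number of orders in the sample
min_support = 0.01

# A product passes the filter when count / n_orders >= min_support,
# i.e. when count >= min_support * n_orders.
threshold = min_support * n_orders
min_count = math.ceil(threshold)
print(threshold, min_count)

# Check the boundary: 4 occurrences fail, 5 pass.
passes_4 = 4 / n_orders >= min_support
passes_5 = 5 / n_orders >= min_support
print(passes_4, passes_5)
```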

In [None]:
min_support = 0.01
products_apriori = product_frequency[product_frequency >= min_support]
print(products_apriori)

With this filter, we reduced the 2809 products to 149! Let's create a data frame to see this better:

In [None]:
transactions_apriori = transactions_df[transactions_df.product_id.isin(products_apriori.index)]
transactions_apriori

In [None]:
order_sizes = transactions_apriori.order_id.value_counts()
order_sizes

The order with `order_id` 431 has 15 products.

In [None]:
plt.hist(order_sizes, bins = max(order_sizes) - min(order_sizes))
plt.title('Number of times each order size occurs')
plt.xlabel('order size')
plt.ylabel('number of times');

Most orders have only 1 or 2 products, and fewer than 20 orders have 8 products! It makes no sense to build association rules from an order with a single product, so let's delete the orders that have only one product.

In [None]:
min_lenght = 2
orders_apriori = order_sizes[order_sizes >= min_lenght]
print(orders_apriori)

In [None]:
transactions_apriori = transactions_apriori[transactions_apriori.order_id.isin(orders_apriori.index)]
transactions_apriori

Let's generate all possible pairs of products within each order:

In [None]:
transactions_by_order = transactions_apriori.groupby('order_id')['product_id']
for order_id, order_list in transactions_by_order:
  print('Order_id:', order_id, '\nOrder_list: ', list(order_list))
  product_combinations = combinations(order_list, 2)
  print('Product combinations:')
  print([i for i in product_combinations])
  print('\n')

This code prints the order id, the products of the order, and the pairs of those products. Now let's put all the combinations together and count the number of occurrences:
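
A minimal sketch of this counting idea on three toy orders, restricted to pairs for brevity. The data are made up, and `pair_counts` is a hypothetical helper name; unlike the notebook's code, it also removes duplicate products inside an order before pairing.

```python
from collections import Counter
from itertools import combinations

# Toy orders (made up): order_id -> product ids in that order.
orders_demo = {
    1: [30, 10, 20],
    2: [10, 20],
    3: [20, 30],
}

def pair_counts(orders):
    """Count how many orders contain each unordered pair of products."""
    counts = Counter()
    for products in orders.values():
        # Sorting first makes (10, 20) and (20, 10) the same key.
        counts.update(combinations(sorted(set(products)), 2))
    return counts

counts = pair_counts(orders_demo)
frequency = {pair: n / len(orders_demo) for pair, n in counts.items()}
print(counts)
```

Dividing each count by the number of orders turns it into the support of the pair, which is what gets compared with `min_support`.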

In [None]:
def product_combinations(transactions_df, max_length = 5):
  transactions_by_order = transactions_df.groupby('order_id')['product_id']
  max_length_reference = max_length
  for order_id, order_list in transactions_by_order:
    max_length = min(max_length_reference, len(order_list))
    order_list = sorted(order_list)
    for l in range(2, max_length + 1):
      product_combinations = combinations(order_list, l)
      for combination in product_combinations:
        yield combination

In [None]:
combs = product_combinations(transactions_apriori)
combs

In [None]:
#view the first 100 combinations of products that have been generated:

for _ in range(100):
  print(next(combs))

In [None]:
#how often each of these combinations appears:

combs = product_combinations(transactions_apriori)
counter = Counter(combs).items()
combinations_count = pd.Series([x[1] for x in counter], index = [x[0] for x in counter])
combinations_frequency = combinations_count / n_orders
print(combinations_frequency)

In [None]:
combinations_apriori = combinations_frequency[combinations_frequency >= min_support]
combinations_apriori = combinations_apriori[combinations_apriori.index.map(len) >= min_lenght]
print(combinations_apriori, len(combinations_apriori))

Let's write code that splits every frequent combination into all of its possible antecedent (A) and consequent (B) parts:
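
The idea of splitting a combination into the two sides of a rule can be sketched as a small generator. `split_rules` is a hypothetical name, and the sketch assumes the items of a combination are unique.

```python
from itertools import combinations

def split_rules(itemset):
    """Yield every (antecedent, consequent) split of `itemset`.

    Both sides are non-empty tuples and together contain each item once.
    """
    items = tuple(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(x for x in items if x not in antecedent)
            yield antecedent, consequent

splits = list(split_rules((10, 20, 30)))
print(splits)
```

A combination of n items yields 2^n - 2 splits, so three items give six candidate rules. The notebook's code does the same and additionally unwraps one-element tuples into plain ids, so that they match the keys of the single-product supports.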

In [None]:
A = []
B = []
AB = []
for c in combinations_apriori.index:
  c_length = len(c)
  for l in range(1, c_length):
    comb = combinations(c, l)
    for a in comb:
      AB.append(c)
      b = list(c)
      for e in a:
        b.remove(e)
      b = tuple(b)
      if len(a) == 1:
        a = a[0]
      A.append(a)
      if len(b) == 1:
        b = b[0]
      B.append(b)

In [None]:
apriori_df = pd.DataFrame({'A': A,
                           'B': B,
                           'AB': AB})

In [None]:
apriori_df.head()

In [None]:
products_apriori

In [None]:
combinations_frequency

In [None]:
support = {**{k: v for k, v in products_apriori.items()},
           **{k: v for k, v in combinations_frequency.items()}}
support

In [None]:
#adding the supports of A, B and AB to apriori_df:

apriori_df[['support_A', 'support_B', 'support_AB']] = apriori_df[['A', 'B', 'AB']].applymap(lambda x: support[x])
apriori_df

In [None]:
apriori_df.drop('AB', axis = 1, inplace=True)
apriori_df.head()

In [None]:
#generating confidence and lift:

apriori_df['confidence'] = apriori_df.support_AB / apriori_df.support_A
apriori_df['lift'] = apriori_df.confidence / apriori_df.support_B
apriori_df

In [None]:
min_confidence = 0.2
min_lift = 1.0
apriori_df = apriori_df[apriori_df.confidence >= min_confidence]
apriori_df = apriori_df[apriori_df.lift >= min_lift]
apriori_df = apriori_df.sort_values(by = 'lift', ascending=False).reset_index(drop = True) #ordering by the lift
apriori_df.head()

Now it is easy to read a rule: whoever buys A (product 12341) also tends to buy B (product 16797). We have the support of A and of B, the support of AB, and the confidence and lift of the rule!
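
For reference, the columns of this data frame follow the usual definitions of the three measures, written here for a rule $A \Rightarrow B$:

```latex
\[
\operatorname{support}(A) = \frac{\text{number of orders containing } A}{\text{total number of orders}}
\]
\[
\operatorname{confidence}(A \Rightarrow B) = \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)},
\qquad
\operatorname{lift}(A \Rightarrow B) = \frac{\operatorname{confidence}(A \Rightarrow B)}{\operatorname{support}(B)}
\]
```

A lift above 1 means that A and B appear together more often than would be expected if they were independent, which is why `min_lift = 1.0` is used as the cut-off.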

In [None]:
#getting the names of the products:

def convert_product_id_to_name(product_ids):
  #a single id (Python or NumPy integer) maps to a single name
  if isinstance(product_ids, (int, np.integer)):
    return products_id_to_name[product_ids]
  #a tuple of ids maps to a tuple of names
  return tuple(products_id_to_name[prod] for prod in product_ids)

In [None]:
#applying the names in the data frame:

apriori_df[['A', 'B']] = apriori_df[['A', 'B']].applymap(convert_product_id_to_name)
apriori_df

Now we have our final data frame with the rules.

**The function to generate association rules**: let's put all the previous steps together in a single function.

In [None]:
def association_rules(order_products, min_support, min_length = 2, max_length = 5, 
                      min_confidence = 0.2, min_lift = 1.0):
    
    print('Loading data...')
    transactions_df = order_products[['order_id', 'product_id']]

    print('Calculating product supports...')
    n_orders = len(set(transactions_df.order_id))
    product_frequency = transactions_df.product_id.value_counts()/n_orders
    products_apriori = product_frequency[product_frequency >= min_support]
    transactions_apriori = transactions_df[transactions_df.product_id.isin(products_apriori.index)]
    
    order_sizes = transactions_apriori.order_id.value_counts()
    orders_apriori = order_sizes[order_sizes >= min_length]
    transactions_apriori = transactions_apriori[transactions_apriori.order_id.isin(orders_apriori.index)]
    
    print('Calculating product combinations and supports...')
    
    def product_combinations(transactions_df, max_length = max_length):
        transactions_by_order = transactions_df.groupby('order_id')['product_id']
        max_length_reference = max_length
        for order_id, order_list in transactions_by_order:
            max_length = min(max_length_reference, len(order_list))
            order_list = sorted(order_list)
            for l in range(2, max_length + 1):
                product_combinations = combinations(order_list, l)
                for combination in product_combinations:
                    yield combination
   
    combs = product_combinations(transactions_apriori)
    counter = Counter(combs).items()
    combinations_count = pd.Series([x[1] for x in counter], index = [x[0] for x in counter])
    combinations_frequency = combinations_count/n_orders
    combinations_apriori = combinations_frequency[combinations_frequency >= min_support]
    combinations_apriori = combinations_apriori[combinations_apriori.index.map(len) >= min_length]
    
    print('Populating dataframe...')
    A = []
    B = []
    AB = []
    for c in combinations_apriori.index:
        c_length = len(c)
        for l in range(1, c_length):
            comb = combinations(c, l)
            for a in comb:
                AB.append(c)
                b = list(c)
                for e in a:
                    b.remove(e)
                b = tuple(b)
                if len(a) == 1:
                    a = a[0]
                A.append(a)
                if len(b) == 1:
                    b = b[0]
                B.append(b)
            
    apriori_df = pd.DataFrame({'A': A,
                               'B': B,
                               'AB': AB})
    support = {**{k: v for k, v in products_apriori.items()}, 
               **{k: v for k, v in combinations_frequency.items()}}
    apriori_df[['support_A', 'support_B', 'support_AB']] = apriori_df[['A', 'B', 'AB']].applymap(lambda x: support[x])
    apriori_df.drop('AB', axis = 1, inplace = True)
    apriori_df['confidence'] = apriori_df.support_AB/apriori_df.support_A
    apriori_df['lift'] = apriori_df.confidence / apriori_df.support_B
    apriori_df = apriori_df[apriori_df.confidence >= min_confidence]
    apriori_df = apriori_df[apriori_df.lift >= min_lift]
    apriori_df = apriori_df.sort_values(by = 'lift', ascending = False).reset_index(drop = True)
    
    def convert_product_id_to_name(product_ids):
        #a single id (Python or NumPy integer) maps to a single name
        if isinstance(product_ids, (int, np.integer)):
            return products_id_to_name[product_ids]
        #a tuple of ids maps to a tuple of names
        return tuple(products_id_to_name[prod] for prod in product_ids)
    
    apriori_df[['A', 'B']] = apriori_df[['A', 'B']].applymap(convert_product_id_to_name)

    print('{} rules were generated'.format(len(apriori_df)))

    return apriori_df

Let's now apply the function:

In [None]:
start = datetime.now()
rules = association_rules(order_products, min_support = 0.01)
print('Execution time: ', datetime.now() - start)

In this example, we see that 11 rules were generated. Let's see them in more detail:

In [None]:
rules

In this first example, Banana and organic foods stand out. We can conclude that customers tend to buy organic food together with Banana.

I will create another example and thus have new insights:

In [None]:
start = datetime.now()
rules = association_rules(order_products, min_support = 0.009, max_length = 4)
print('Execution time: ', datetime.now() - start)

In [None]:
rules

Much like the previous example, this one shows just a few new rules. The set formed by organic products, with the chance of Banana being bought as well, remains. There is a clear pattern in the consumption of these products: they are all healthy and natural. Another similarity is that they come from the same family of foods: fruits, vegetables, greens...

In [None]:
start = datetime.now()
rules = association_rules(order_products, min_support = 0.002, max_length=3)
print('Execution time: ', datetime.now() - start)

In [None]:
rules.head(20)

Here the lift and the confidence already show higher values. In the first 7 rules, the chance of one product being bought when the other is bought is very high, showing a strong relationship between these products. Rules 11 to 14 show a taste for flavored waters: when one is bought, the chance of buying another is high. This suggests that consumers buy more than one flavored water at a time.

In [None]:
start = datetime.now()
rules = association_rules(order_products, min_support = 0.001, max_length=2)
print('Execution time: ', datetime.now() - start)

In [None]:
rules.head(20)

In this example, yogurt appears very often. Rules 2 and 3 bring us back to the waters, but the yogurts show a relationship similar to the previous case: consumers tend to buy a yogurt if they have already picked another one. Here the lift is even greater than in the previous example, so the chances of these products being bought together are even higher.

Now taking a look at the last 10 rules:

In [None]:
rules.tail(10)

Banana reappears; here it is bought when another product has already been ordered. As in the first example, purchases of organic products increase the chance of buying Banana as well. There are also some curiosities: in rule number 389, Parmesan Grated and Banana appear together, which seems an interesting choice.

# CONCLUSIONS:
Due to limitations in the computer's memory, it was not possible to run more examples. However, it was possible to see certain purchase patterns, such as organic products accompanied by bananas, and refrigerated products such as yogurt, which are often bought together. There is a pattern of products from the same department being bought together: organic food products, mostly fruits, and flavored waters. In the last rules Banana appears again; as seen previously, it is frequent in the carts and tends to be the first product added to the cart. This suggests that the market has a great demand for organic and healthy food products, which stand out in the market basket in these examples.