## Python segmentation and basket analysis for dog food

Hi, I would like to present my work on basket analysis for dog food.
This is a simple marketing segmentation aimed at retailers who sell dog food and wish to grow their customer base and revenue.

<span style="color:green">Please upvote my kernel !</span>

![Ici](https://cdn.openpr.com/T/8/T803228229_g.jpg)

**Importing Libraries**

In [None]:
from plotly.offline import init_notebook_mode, iplot
from mlxtend.frequent_patterns import apriori, association_rules
from plotly.subplots import make_subplots
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import holoviews as hv
init_notebook_mode(connected=True)
hv.extension('bokeh')

**Loading csv files**

In [None]:
aisles = pd.read_csv("../input/instacart-market-basket-analysis/aisles.csv")
departements = pd.read_csv("../input/instacart-market-basket-analysis/departments.csv")
order_products__prior = pd.read_csv("../input/instacart-market-basket-analysis/order_products__prior.csv")
order_products__train = pd.read_csv("../input/instacart-market-basket-analysis/order_products__train.csv")
orders = pd.read_csv("../input/instacart-market-basket-analysis/orders.csv")
products = pd.read_csv("../input/instacart-market-basket-analysis/products.csv")


In [None]:
aisles[aisles['aisle'].str.contains('dog')]  # Which aisle contains the dog food?
dog_food_products = products.query("aisle_id == 40")  # Products of the dog food aisle

### I - Exploratory Data Analysis

In [None]:
# Users, orders and products in the same dataframe
orders_order_product = pd.merge(orders, order_products__train, on = ['order_id'])
user_product_dog = pd.merge(orders_order_product, dog_food_products, on = ['product_id'])

### How many customers purchase dog food?

In [None]:
nb_users = orders['user_id'].nunique()  # number of users in the dataset
nb_users_buying_fooddog = user_product_dog['user_id'].nunique()  # number of users buying dog food
labels = ['Other food', 'Dog food']
values = [nb_users - nb_users_buying_fooddog, nb_users_buying_fooddog]

# pull is given as a fraction of the pie radius, one value per slice: only the dog food slice is pulled out
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0.2])])
fig.show()

Only a few customers buy dog food in this dataset (0.57%). However, there is enough dog food data to continue our analysis.

### Which dog food products are bought the most?

In [None]:
most_dogfood_bought = user_product_dog['product_name'].value_counts().head(15).index.tolist()  # The 15 most bought dog food products
user_product_dog_top = user_product_dog[user_product_dog['product_name'].isin(most_dogfood_bought)]  # Keep only the rows for those products

In [None]:
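# Count purchases per product: summing product_id values would give meaningless slice sizes
top_counts = user_product_dog_top.groupby('product_name').size().reset_index(name='purchases')
fig = px.pie(top_counts, values='purchases', names='product_name', title='Top 15 dog food products bought')
fig.show()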

We now have the 15 most bought dog food products. Next, we will look at some line charts to see at which times of day people buy dog food.

Let's plot the hourly variation for two of the top dog food products:

In [None]:
dog_product_1 = user_product_dog_top.query("product_name == 'Standard Size Pet Waste bags'")['order_hour_of_day'].value_counts().sort_index().reindex(np.arange(24), fill_value=0)
dog_product_2 = user_product_dog_top.query("product_name == 'Select Tender Chicken with Vegetables & Brown Rice Dog Food Recipe'")['order_hour_of_day'].value_counts().sort_index().reindex(np.arange(24), fill_value=0)


fig = go.Figure([go.Scatter(x=np.arange(24), y=dog_product_1)])
fig.update_yaxes(ticks="inside", 
                 dtick = 1, 
                title="Number of purchases")
fig.update_xaxes(ticks="inside", 
                 dtick = 1, 
                title="Hour of the day")
fig.update_layout(title = "How many 'Standard Size Pet Waste bags' are bought per hour of the day?")

fig.show()


Dog food is bought in the morning and in the afternoon, which matches normal grocery store hours. Some people also buy dog food in the evening, perhaps because they walk the dog and realise they need dog food.

In [None]:
fig = go.Figure([go.Scatter(x=np.arange(24), y=dog_product_2)])
fig.update_yaxes(ticks="inside", 
                 dtick = 1, 
                title="Number of purchases")
fig.update_xaxes(ticks="inside", 
                 dtick = 1, 
                title="Hour of the day")
fig.update_layout(title = "How many 'Select Tender Chicken with Vegetables & Brown Rice Dog Food Recipe' are bought per hour of the day?")

fig.show()

We now combine the hourly purchase counts of the top dog food products using the median (more robust to outliers than the mean) to build a typical daily profile and get a clearer view of when customers buy dog food during the day.
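As a quick illustration of why the median is preferred here, the sketch below compares the two statistics on made-up numbers. The counts are hypothetical and do not come from the Instacart data.

```python
from statistics import mean, median

# Hypothetical purchase counts of five products during the same hour.
# One best-seller (40) dominates the others.
counts_same_hour = [2, 3, 3, 4, 40]

hour_mean = mean(counts_same_hour)      # pulled upward by the best-seller
hour_median = median(counts_same_hour)  # stays close to the typical product

print(hour_mean, hour_median)
```

With a single dominant product, the mean no longer describes a typical product, while the median does.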

In [None]:
top_dogfood_hour_list = []
top_dogfood_weekday_list = []
def combiningHourSeriesVariation(df, product_id):
    return df[df['product_id'] == product_id]['order_hour_of_day'].value_counts().sort_index().reindex(np.arange(24), fill_value=0)

def combiningWeekDaySeriesVariation(df, product_id):
    return df[df['product_id'] == product_id]['order_dow'].value_counts().sort_index().reindex(np.arange(7), fill_value=0)

for value in user_product_dog_top['product_id'].unique():
    top_dogfood_hour_list.append(combiningHourSeriesVariation(user_product_dog, value))
    top_dogfood_weekday_list.append(combiningWeekDaySeriesVariation(user_product_dog, value))
    
    

In [None]:
median_hour_dogfood = pd.concat(top_dogfood_hour_list, axis = 1).median(axis = 1)

fig = go.Figure([go.Scatter(x=np.arange(24), y=median_hour_dogfood)])
fig.update_yaxes(ticks="inside", 
                 dtick = 1, 
                title="Median number of purchases")
fig.update_xaxes(ticks="inside", 
                 dtick = 1, 
                title="Hour of the day")
fig.update_layout(title = "At which hours do customers buy dog food the most?")

fig.show()



It seems that people mostly buy dog food in the morning, around breakfast time. Some people also buy dog food in the afternoon.

Now, let's do the same by day of the week.

In [None]:
median_weekday_dogfood = pd.concat(top_dogfood_weekday_list, axis = 1).median(axis = 1)

fig = go.Figure([go.Scatter(x=np.arange(7), y=median_weekday_dogfood)])
fig.update_yaxes(ticks="inside", 
                 dtick = 1, 
                title="Median number of purchases")
fig.update_xaxes(ticks="inside", 
                 dtick = 1, 
                title="Day of the week", 
                tickvals=[0, 1, 2, 3, 4, 5, 6],
                ticktext = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
fig.update_layout(title = "On which day of the week do customers buy dog food the most?")

fig.show()

According to this model, dog food seems to sell best on Saturday and Sunday.

### II - Association rules learning - Apriori

Now, we are going to use an association rule algorithm named Apriori to find implicative rules between items. It answers this question: if article A is bought, is article B usually bought as well, and with what confidence?

These associations can help us choose appropriate offers for customers, to best meet their needs and improve dog food sales.

I am using Apriori, one of the most common association rule learning algorithms in this field.

First, I will apply the algorithm to all products found in orders that contain dog food, and then focus on the rules involving dog food, to get the most confident implicative rules between other products and dog food.

Second, I will apply the algorithm to dog food only, to get the most confident implicative rules between dog food products.

#### 1 - Data processing 

In [None]:
# Collect the order_ids of the orders that contain dog food
mask_orderid_withdogfood = order_products__train[order_products__train['product_id'].isin(dog_food_products['product_id'].unique().tolist())]['order_id'].tolist()
# Keep every product line of the orders with dog food
order_products_withdogfood = order_products__train[order_products__train['order_id'].isin(mask_orderid_withdogfood)]
# Add the product names
order_products_withdogfood = pd.merge(order_products_withdogfood, products, how = 'inner', on = 'product_id')
# Cross tabulation to get a one-hot encoded pandas DataFrame: True if the product is in the order, False otherwise.
# Product names (not ids) are used as columns so that the rules can later be matched against dog food names.
df_apriori_alldataset = pd.crosstab(order_products_withdogfood['order_id'], order_products_withdogfood['product_name']).astype(bool)
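To make the one-hot encoding step easier to follow, here is a minimal sketch on a made-up table. The order lines and product names are hypothetical; only `pd.crosstab` and the boolean cast mirror the cell above.

```python
import pandas as pd

# Hypothetical order lines: one row per (order, product) pair.
toy_lines = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "product_name": ["Dog Food", "Milk", "Dog Food", "Dog Food", "Milk", "Milk"],
})

# crosstab counts occurrences per (order, product);
# casting to bool collapses the counts to presence/absence.
basket = pd.crosstab(toy_lines["order_id"], toy_lines["product_name"]).astype(bool)
print(basket)
```

Each row is one order and each column one product, which is the input shape that `apriori` expects.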

#### 2 - Apriori on the entire dataset with dog food

In [None]:
# Look for the frequent itemsets with minimum support = 0.005 (share of orders in which the itemset occurs)
freq_itemsets = apriori(df_apriori_alldataset, min_support=0.005, use_colnames=True) 
# Derive the rules, keeping those with lift >= 1
rules = association_rules(freq_itemsets, metric='lift', min_threshold=1)

In [None]:
# Convert the frozensets to lists so that they can be exploded later
rules['antecedents'] = rules['antecedents'].apply(list)
rules['consequents'] = rules['consequents'].apply(list)

Now let's list the top implicative rules ([A] -> [B]) in descending order of confidence.
The confidence of a rule is based on how frequently the items occur (their support).
More precisely, it is defined as:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$
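To make the confidence formula concrete, here is a small worked sketch in plain Python. The baskets are made up, and the helpers `support` and `confidence` are hypothetical names written for this illustration; they are not part of mlxtend.

```python
# Hypothetical baskets, each one a set of product names.
baskets = [
    {"Beef Stew", "Lamb Stew"},
    {"Beef Stew", "Lamb Stew", "Waste Bags"},
    {"Beef Stew"},
    {"Waste Bags"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

x, y = {"Beef Stew"}, {"Lamb Stew"}
print(support(x, baskets))        # 3 of the 4 baskets contain Beef Stew
print(confidence(x, y, baskets))  # 2 of those 3 baskets also contain Lamb Stew
```

A confidence of 2/3 reads as: among the orders containing the antecedent, two thirds also contain the consequent.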

In [None]:
mask_exploded_rules = rules.explode('antecedents').explode('consequents')
dog_food_names = dog_food_products['product_name'].unique().tolist()
mask_exploded_rules_antecedents = mask_exploded_rules[mask_exploded_rules['antecedents'].isin(dog_food_names)]
mask_exploded_rules_consequents = mask_exploded_rules[mask_exploded_rules['consequents'].isin(dog_food_names)]
# DataFrame.append was removed in pandas 2.0, so the two selections are combined with pd.concat
pd.concat([mask_exploded_rules_antecedents, mask_exploded_rules_consequents]).sort_index().drop_duplicates().sort_values(by = ['confidence'], ascending = False)[['antecedents', 'consequents', 'confidence']]

We now have the implicative rules that involve dog food.
All of them are binary rules ([product A] -> [product B]).
This shows that people buying dog food also buy other dog food, as we can see at the top of the list.

#### 3 - Apriori only on dog food

In [None]:
# Keep only the order lines that are dog food products
order_products_onlydogfood = order_products__train[order_products__train['product_id'].isin(dog_food_products['product_id'].unique().tolist())]
# Add the dog food names
order_products_onlydogfood = pd.merge(order_products_onlydogfood, products, how = 'inner', on = 'product_id')
# Cross tabulation to get a one-hot encoded pandas DataFrame: True if the product is in the order, False otherwise.
df_apriori_onlydogfood = pd.crosstab(order_products_onlydogfood['order_id'], order_products_onlydogfood['product_name']).astype(bool)
# Look for the frequent itemsets with minimum support = 0.005 (share of orders in which the itemset occurs)
freq_itemsets = apriori(df_apriori_onlydogfood, min_support=0.005, use_colnames=True) 
# Derive the rules, keeping those with lift >= 1
rules = association_rules(freq_itemsets, metric='lift', min_threshold=1)
rules[['antecedents', 'consequents', 'confidence']].sort_values(['confidence'], ascending = False)

This confirms what we said before: people buying dog food also buy other dog food.

In [None]:
rules_order_confidence = rules[['antecedents', 'consequents', 'confidence']].sort_values(['confidence'], ascending = False)
# Each rule side is a frozenset; keep its first item as a plain string (the rules here are binary)
rules_order_confidence["antecedents"] = rules_order_confidence["antecedents"].apply(lambda x: list(x)[0]).astype(str)
rules_order_confidence["consequents"] = rules_order_confidence["consequents"].apply(lambda x: list(x)[0]).astype(str)

In [None]:
graph_list = list(zip(list(rules_order_confidence['antecedents']), 
         list(rules_order_confidence['consequents']), 
         list(rules_order_confidence['confidence'])))

In [None]:
graph = hv.Graph(graph_list, vdims = "confidence")
graph.opts(directed = True, arrowhead_length=0.05)

This graph shows that there are 3 groups of people buying dog food. The arrows give the direction of each rule, from the antecedent to the consequent.

For example, people buying "Prepared Meals Beef Stew Wet Dog Food" are also buying "Prepared Meals Savory Rice & Lamb Stew Dog Food" and "Prepared Meals Simmered Chicken Medley Wet Dog Food".
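One possible way to check such groups without reading them off the plot is to treat each rule as an edge and compute the connected components of the resulting graph. The sketch below does this with a small union-find. It assumes that a "group" means products linked by at least one rule, whatever its direction. The function name `rule_groups` and the toy rules are hypothetical.

```python
def rule_groups(edges):
    """Group products that are linked by at least one rule (direction ignored)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the two sides of every rule into the same group.
    for antecedent, consequent, _confidence in edges:
        parent[find(antecedent)] = find(consequent)

    # Collect the members of each group under its representative.
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Hypothetical rules in the same (antecedent, consequent, confidence) shape as graph_list.
toy_rules = [("A", "B", 0.6), ("B", "A", 0.5), ("B", "C", 0.4), ("D", "E", 0.7)]
groups = rule_groups(toy_rules)
print(groups)
```

Calling `rule_groups(graph_list)` on the real rules would give one set of product names per group.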



## Conclusion

Now that we have finished our basket analysis of dog food, we can draw some conclusions about the behaviour of dog food buyers:
- People buy dog food mostly at the beginning and at the end of the week;
- Most people buy dog food around midday, and fewer in the afternoon;

The Apriori algorithm suggests that the grocery store should organise its dog food into 3 groups to better reflect customer behaviour:
- Prepared dog food, sold together;
- Organix dog food;
- The other dog food products, which should be sold individually.

Thank you for reading my kernel. I hope you enjoyed it; please upvote it!


