# Pet Food Customer Orders Data Insights

* [Basic exploration of data](#exploration)
* [Attribute construction](#attributecon)
* [Data preparation](#dataprep)
* [Data labelling](#datalabel)
* [Visualisation](#visualisation)

## Modelling
* [Prepare Data before Implementing Feature Selection Methods](#dataprepmodel)
* [Feature selection methods](#featureselectionmethods)
* [Models](#models)
* [Models for 1order/ 1+orders classification](#secondmodel)




In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
import datetime as dt
import seaborn as sns

<a id='exploration'></a>
## Basic exploration of data

In [None]:
df = pd.read_csv('/kaggle/input/pet-food-customer-orders-online/pet_food_customer_orders.csv')
df.head()

In [None]:
df.shape

In [None]:
df['order_payment_date'].min()

In [None]:
df['order_payment_date'].max()

- The dataset contains order details of customers and profile data of their pets
- There are 49042 transactions in the dataset, covering the 15-month period from 2018-12-30 to 2020-03-30
- Data is available on approximately 36 transactional and profile features

In [None]:
df.describe()

In [None]:
sns.set(rc={'figure.figsize':(12,9)})

In [None]:
sns.histplot(df['kibble_kcal'], kde=True)  # distplot is deprecated in recent seaborn releases
# might need to do scaling for such variables

In [None]:
df.describe(include=object)  # np.object was removed from recent NumPy releases

In [None]:
df['customer_id'].nunique()

In [None]:
df['pet_id'].nunique()

- The dataset contains 11168 unique customers and 13087 unique pets
- The average number of web sessions is approximately 8, and the average time spent on the website since the last order is around 90 minutes
- The most common pet_food_tier in the dataset is superpremium
- pet_signup_datetime is set to an arbitrary demo value that does not reflect the actual datetime
- There are 136 distinct pre-Tails brands in the dataset. Harringtons is mentioned most often
- Most pets are at the mature life stage

In [None]:
df.columns

In [None]:
# selected some variables for aggregation of data around them

shortlisted_vars = ['wet_food_order_number', 'pet_has_active_subscription', 'pet_food_tier', 'neutered', 
                    'gender', 'pet_breed_size', 'ate_wet_food_pre_tails', 'pet_life_stage_at_order', 'wet_tray_size', 
                    'wet_food_textures_in_order', 'customer_support_ticket_category']

In [None]:
for var in shortlisted_vars:
    print(df.groupby(var).size())

#### Some insights from above data manipulation

- Wet order number 1 appears 4181 times in the dataset, which effectively means that 4181 pets have ordered wet food at least once (customers ordering for their pets)
- Wet order number 2 appears 3035 times in the dataset, which effectively means that 3035 pets have made a follow-on second order for wet food
- Of the 49042 transactions, approx 33 thousand concern pets who have active subscription
- Of the 49042 transactions, approx 31 thousand concern pets who did not eat wet food prior to Tails
- Of the 49042 transactions, approx 28 thousand concern pets who come under food tier 'superpremium'
- Most popular wet food texture is 'gravy jelly pate'
- In terms of tray size, 150g has a slight edge over 300g in terms of number of orders

In [None]:
df.isnull().sum()

<a id='attributecon'></a>
# Further Feature Extraction / Attribute construction

#### Attribute: 'number_of_pets_in_house' 
- since some customers have multiple pets

In [None]:
# Shows how many pets each customer has. Some have multiple pets. 
# Pet population in a household may impact the nature of orders placed for each pet. Hence potentially an important explanatory variable

df.groupby('customer_id')['pet_id'].nunique().reset_index()

In [None]:
pet_population = pd.DataFrame(df.groupby('customer_id')['pet_id'].nunique().reset_index())

In [None]:
pet_population.shape

In [None]:
pet_population.columns=['customer_id', 'number_of_pets_in_house']

In [None]:
pet_population.sample(5)

In [None]:
# update the df with the new attribute
df_updated = pd.merge(df,pet_population, on='customer_id')

In [None]:
df_updated.head().T
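
The same attribute can also be attached in one step, without building a separate table and merging it back. The sketch below uses a made-up toy frame (the values are illustrative, not taken from the dataset) and `groupby(...).transform('nunique')`, which returns one value per original row.

```python
import pandas as pd

# Toy transactions: customer 1 has two pets, customer 2 one pet, customer 3 two pets.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3],
    "pet_id":      [10, 10, 11, 20, 30, 31],
})

# transform keeps the original row order, so no merge is needed afterwards.
toy["number_of_pets_in_house"] = (
    toy.groupby("customer_id")["pet_id"].transform("nunique")
)
print(toy)
```

Whether this is preferable to the merge is a matter of taste; the merge keeps `pet_population` around as a table of its own, which can be handy for inspection.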

### Attribute 'communication_gap' (time between last_customer_support_ticket_date and the order_payment_date)
- Based on the assumption that a communication gap may affect the frequency or other characteristics of the orders placed

In [None]:
df_updated['last_customer_support_ticket_date']= df_updated['last_customer_support_ticket_date'].astype('datetime64[ns]')
df_updated['order_payment_date']= df_updated['order_payment_date'].astype('datetime64[ns]')

In [None]:
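# whole-day difference between the order date and the last support ticket date (NaN when there is no ticket)
df_updated['communication_gap'] = (df_updated['order_payment_date'].dt.normalize() - df_updated['last_customer_support_ticket_date'].dt.normalize()).dt.days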

In [None]:
df_updated.head().T

In [None]:
df_updated['last_customer_support_ticket_date'].isnull().sum()

- Of the 49 thousand transactions, almost 39 thousand have no last_customer_support_ticket_date
- When aggregated per pet, the mean communication gap is missing for 10178 of the 13087 unique pets
- We still suggest using a marker for whether any communication took place, because customer support ticket data exists for roughly 2,900 pets (13087 - 10178 = 2909), which is about 22% of all pets
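
To make the effect of missing ticket dates concrete, here is a minimal sketch on a made-up three-row frame. The column names follow the notebook; the dates are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy orders: the second row has no support ticket on record,
# the third has a ticket on the same day as the order.
toy = pd.DataFrame({
    "order_payment_date": pd.to_datetime(["2019-03-10", "2019-04-01", "2019-05-05"]),
    "last_customer_support_ticket_date": pd.to_datetime(["2019-03-01", None, "2019-05-05"]),
})

# Whole-day difference; a missing ticket date propagates as NaN.
toy["communication_gap"] = (
    toy["order_payment_date"] - toy["last_customer_support_ticket_date"]
).dt.days

# NaN > 0 evaluates to False, so rows without a ticket are marked 0.
toy["communication"] = np.where(toy["communication_gap"] > 0, 1, 0)
print(toy[["communication_gap", "communication"]])
```

Note that under the `> 0` rule a ticket raised on the order date itself (gap of 0) is also marked 0, which may or may not be the intended behaviour.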

In [None]:
df_updated['communication_gap'].describe()

In [None]:
# Marking those orders with 1 where some communication had taken place (regardless of subject)

df_updated['communication'] = np.where(df_updated['communication_gap'] > 0, 1, 0)

In [None]:
df_updated['communication'].isnull().sum()

In [None]:
df_updated.groupby('communication').size()    

- Of the 49 thousand transactions, approximately 9 thousand were preceded by customer communication
- Interpretation (see the aggregation below): on average there is a gap of 162 days from the time the company communicates proactively with the client until the client orders
- call_back, product and website are good 'categories' of communication in terms of leading to an order from the client
- Communication about things like promotion, blend, yodel or packaging does not seem to elicit a quick response in terms of materializing into an order

In [None]:
df_updated.groupby('customer_support_ticket_category')['communication_gap'].mean().sort_values()

In [None]:
del df_updated['communication_gap']

# since we will not need it from here on. We have kept the 'communication' marker in the dataframe though

### Attribute 'days_before_closing'
- This attribute gives the difference in days between order_payment_date and the last order date in the dataset
- It can serve as an anchor for use cases that need something like a cut-off date (e.g. churn analysis in non-subscription scenarios)
- An attribute giving customer tenure would have been more helpful, but that needs an accurate customer sign-up date, which is not available
- Hopefully 'days_before_closing' is a good substitute for 'tenure' and will be helpful in modelling
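
As a small worked example of this attribute, the sketch below uses three invented order dates; the latest one plays the role of the dataset's closing date.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_payment_date": pd.to_datetime(["2020-03-30", "2020-03-01", "2019-12-31"]),
})

# The most recent order date acts as the cut-off ("closing") date.
cutoff = toy["order_payment_date"].max()
toy["days_before_closing"] = (cutoff - toy["order_payment_date"]).dt.days
print(toy)
```

After aggregating with `max` per pet, the value reflects the pet's earliest order inside the observation window, which is why it can stand in for tenure only approximately.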

In [None]:
df_updated['days_before_closing'] = ((df_updated['order_payment_date'].max() -  df_updated['order_payment_date'])/np.timedelta64(1, 'D'))

In [None]:
df_updated['days_before_closing'].describe()

### Attribute : 'Ratio of wet and dry calories in an order'
- To examine how the ratio/mix of wet and dry food drives the wet food orders
- This attribute may be used in one-order/more-orders classification. will be more appropriate there
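
One thing worth guarding against when building a ratio is a zero denominator. The sketch below is an assumption-laden illustration: the toy values are made up, and it is not known here whether `kibble_kcal` is ever 0 in the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "wet_kcal":    [100.0, 50.0, 0.0],
    "kibble_kcal": [400.0, 0.0, 300.0],
})

# Float division by zero yields inf in pandas rather than raising.
ratio = toy["wet_kcal"] / toy["kibble_kcal"]

# Treat infinite ratios as missing so they do not distort means later on.
toy["wet_dry_cal_ratio"] = ratio.replace([np.inf, -np.inf], np.nan).round(2)
print(toy)
```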


In [None]:
df_updated['wet_dry_cal_ratio'] = df_updated['wet_kcal']/df_updated['kibble_kcal']

In [None]:
df_updated['wet_dry_cal_ratio'] = df_updated['wet_dry_cal_ratio'].round(2)

# rounding the result to 2 decimal places

In [None]:
df_updated.head().T

In [None]:
df_updated['wet_dry_cal_ratio'].describe()

### Further attributes that might help
- 'wet_trays_in_first_wet_order', presuming that the number of wet trays in the first order determines how many are ordered subsequently and when; this could be a variable for the one-order/more-orders classification
- some variable derived from the 'orders since first wet order' variable at the top

<a id='dataprep'></a>
# Data Preparation

### Data Preparation: Aggregated transactional data table

- The customer_id column can be skipped: apart from the derived variable 'number_of_pets_in_house', there is no customer-specific variable that could explain whether a customer buys wet food. All profile variables, and even order dates and web sessions, relate to pets
- We therefore aggregate the transactional data around pet_id and conduct the analysis per pet. 'number_of_pets_in_house' is kept because it also relates to pets: it can be an explanatory variable indicating how much in-house pet company each pet has

In [None]:
df_updated.columns

#### Let us now include variables that make sense (non-categorical first): variables that could explain whether a customer/pet tries wet food and, among those who do, the variance in the number and nature of wet food orders

In [None]:
# this is selected transactional data to be aggregated for each pet 

numbers_df = df_updated.groupby('pet_id').agg({'pet_order_number':['max'],                    # number of dry orders can be related to number of wet food orders
                                             'kibble_kcal':['mean'],                          # high dry kcal could lead to ordering more food like wet food
                                             'wet_food_discount_percent':['mean'],            # discount given can determine whether client buys wet food or not
                                             'total_minutes_on_website_since_last_order':['mean'],  # time spent on web can determine whether client buys wet or not
                                               'number_of_pets_in_house':['mean'],            # this variable could have a bearing
                                               'communication':['max'],                       # whether there has been customer communication
                                               'days_before_closing':['max'],                 # this column could be used for specifying cut off date for churn analysis
                                               'wet_dry_cal_ratio':['mean'],                  # a mix/ratio like this could determine how important wet food is for the pet
                                               'wet_food_order_number':['max']})               # this column will be used for labelling later on, and then dropped

In [None]:
numbers_df.head()

In [None]:
numbers_df.shape

In [None]:
df['pet_id'].nunique()

In [None]:
numbers_df.columns.ravel()

In [None]:
# Join each (field, aggregation) pair into a single flat column name, e.g. 'kibble_kcal_mean'
numbers_df.columns = ["_".join(col) for col in numbers_df.columns]

In [None]:
numbers_df.head()

In [None]:
#Reset the index
numbers_df = numbers_df.reset_index()

In [None]:
numbers_df.head()
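
An alternative that avoids the flattening step altogether is pandas named aggregation, where each output column is named up front. This is only a sketch on a toy frame with two of the variables; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "pet_id":           [1, 1, 2],
    "pet_order_number": [1, 2, 1],
    "kibble_kcal":      [300.0, 320.0, 500.0],
})

# Each keyword is the output column name; the tuple is (source column, aggregation).
toy_agg = (
    toy.groupby("pet_id")
       .agg(pet_order_number_max=("pet_order_number", "max"),
            kibble_kcal_mean=("kibble_kcal", "mean"))
       .reset_index()
)
print(toy_agg)
```

The resulting column names match the `<field>_<aggregation>` pattern used in `numbers_df`, so downstream cells would not need to change.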

In [None]:
numbers_df.shape

In [None]:
numbers_df.isnull().sum()

#### Missing values of wet_food_discount_percent_mean and wet_food_order_number_max can be filled with 0 without affecting anything, because 0 is the appropriate value for a pet that has no wet food orders. The imputation is carried out below.
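
A minimal sketch of such an imputation, on a toy frame whose column names follow `numbers_df` (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pet_id": [1, 2, 3],
    "wet_food_discount_percent_mean": [0.1, np.nan, 0.0],
    "wet_food_order_number_max": [3.0, np.nan, 1.0],
})

# Only these two columns are filled; other missing values are left untouched.
fill_cols = ["wet_food_discount_percent_mean", "wet_food_order_number_max"]
toy[fill_cols] = toy[fill_cols].fillna(0)
print(toy)
```

Restricting `fillna` to named columns avoids silently zero-filling variables for which 0 would not be a meaningful value.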

# Get Pet profile data

In [None]:
df.columns

In [None]:
# Category variables which will be used to create a dataframe for visualisation of pets who place wet orders versus those who dont
# These category variables will later be dummified for modelling purposes


category_columns = ['pet_has_active_subscription', 'pet_food_tier', 'pet_allergen_list', 'pet_fav_flavour_list', 
                    'pet_health_issue_list', 'neutered', 'gender','pet_breed_size', 'signup_promo', 'ate_wet_food_pre_tails',
                    'dry_food_brand_pre_tails', 'pet_life_stage_at_order']

#### The cell below takes about 30 seconds, depending on how fast the machine is

In [None]:
category_df = df.groupby('pet_id')[category_columns].max().reset_index()

# This extracts the pet profile (categorical variables) from the transactional data df
# max() is used only to pick one value per pet; assuming the profile values are constant across a pet's orders, min() or first() would give the same result

In [None]:
category_df.head()

In [None]:
category_df.shape

In [None]:
# Merging two tables (no dummification done yet)
visual_df = pd.merge(numbers_df, category_df, on='pet_id')
visual_df.head()

<a id='datalabel'></a>
# Data Labelling
- We will create two label columns in visual_df: one column depicting whether a pet has ever placed a wet food order, and the other column depicting three different categories (no wet order, 1 wet order, multiple wet orders)
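
The two labelling rules can be sketched on a toy column as follows. The frame is invented; `np.select` is used here as one possible way to express the three-class rule and is not necessarily how the cells below do it.

```python
import numpy as np
import pandas as pd

# One pet with no wet order (NaN), one with a single wet order, one with four.
toy = pd.DataFrame({"wet_food_order_number_max": [np.nan, 1.0, 4.0]})
n = toy["wet_food_order_number_max"]

# Binary label: has the pet ever had a wet food order? (NaN > 0 is False)
toy["Label"] = (n > 0).astype(int)

# Three-class label: conditions are checked in order, first match wins.
toy["LabelB"] = np.select([n > 1, n > 0], [2, 1], default=0)
print(toy)
```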

In [None]:
# Data labelling depicting whether a pet/customer has ever placed a wet food order
visual_df['Label'] = np.where(visual_df['wet_food_order_number_max'] > 0, 1, 0)

In [None]:
# Data labelling depicting whether a pet/customer has placed one wet food order or multiple wet food orders over the months

visual_oneormore_df = visual_df[visual_df['Label']==1].copy()  # copy so that relabelling does not trigger SettingWithCopyWarning

In [None]:
del visual_oneormore_df['Label']

In [None]:
visual_oneormore_df['Label'] = np.where(visual_oneormore_df['wet_food_order_number_max'] > 1, 2, 1)

In [None]:
visual_oneormore_df.shape

In [None]:
# visual_df.to_csv('visual_df.csv')    
# for visualisation in tableau

In [None]:
# visual_oneormore_df.to_csv('visual_oneormore_df.csv')  
# for visualisation in tableau

In [None]:
# insert another label column LabelB for 3 class labelling. '2' to mark those who have placed multiple wet food orders

visual_df['LabelB'] = 0
visual_df.loc[visual_df.wet_food_order_number_max>0,'LabelB'] = 1
visual_df.loc[visual_df.wet_food_order_number_max>1,'LabelB'] = 2

In [None]:
visual_df.head().T

In [None]:
visual_df.groupby('Label').size()

# Of the 13087 pets, 4263 have purchased wet food at least once
# Why does this number (4263) not tie in with the 4181 seen in the groupby output above (cell 14)?
# Imagine a pet whose wet_food_order_number_max is e.g. 4 but whose order number 1 is missing from the original transactional dataset:
# our data labelling still assigns '1' to such a pet.
# This also means that wet_food_order_number 1 is missing for 4263 - 4181 = 82 pets in the original transactional dataset
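
The discrepancy can be reproduced on a toy extract. In the sketch below (invented data), pet 2's first two wet orders are absent, so counting rows with order number 1 gives a smaller figure than counting pets with any wet order.

```python
import pandas as pd

toy = pd.DataFrame({
    "pet_id": [1, 1, 2, 2, 3],
    # pet 2's wet orders 1 and 2 are not in the extract; pet 3 never ordered wet food
    "wet_food_order_number": [1, 2, 3, 4, None],
})

# Count based on the presence of order number 1 (analogous to the 4181 figure).
pets_with_order_1 = toy.loc[toy["wet_food_order_number"] == 1, "pet_id"].nunique()

# Count based on the per-pet maximum (analogous to the 4263 figure).
pets_with_any_wet = (toy.groupby("pet_id")["wet_food_order_number"].max() > 0).sum()

print(pets_with_order_1, pets_with_any_wet)
```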

In [None]:
visual_df.groupby('LabelB').size()

# Of the 4263 pets who have purchased wet food at least once, 3132 went on to buy wet food again

In [None]:
visual_df.isnull().sum()
# keep missing values as they are right now. try the visualization first

<a id='visualisation'></a>
# Visualisation

In [None]:
chart=sns.countplot(x='Label', data = visual_df, palette = 'hls')
for p in chart.patches:
    height = p.get_height()
    chart.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height),
            ha="center") 

plt.show()

In [None]:
chart=sns.countplot(x='Label', data = visual_oneormore_df, palette = 'hls')
for p in chart.patches:
    height = p.get_height()
    chart.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height),
            ha="center") 

plt.show()

In [None]:
chart=sns.countplot(x='LabelB', data = visual_df, palette = 'hls')
for p in chart.patches:
    height = p.get_height()
    chart.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height),
            ha="center") 

plt.show()

In [None]:
visual_df.groupby('Label').mean(numeric_only=True).transpose()  # numeric_only avoids errors on the categorical columns

In [None]:
pd.crosstab(visual_df.gender,visual_df.Label).plot(kind='bar')
plt.title('gender vs Label')
plt.xlabel('gender')
plt.ylabel('Number of Pets')


pd.crosstab(visual_oneormore_df.gender,visual_oneormore_df.Label).plot(kind='bar')
plt.title('gender vs Label')
plt.xlabel('gender')
plt.ylabel('Number of Pets')



pd.crosstab(visual_df.gender,visual_df.LabelB).plot(kind='bar')
plt.title('gender vs LabelB')
plt.xlabel('gender')
plt.ylabel('Number of Pets')
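
Because the groups differ in size, raw counts can make it hard to compare buying rates across categories. A row-normalised crosstab shows the share of each label within a category instead. The sketch below uses an invented five-row frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female"],
    "Label":  [1, 0, 0, 1, 0],
})

# normalize='index' divides each row by its total, giving within-category shares.
share = pd.crosstab(toy["gender"], toy["Label"], normalize="index")
print(share)
```

Calling `.plot(kind='bar', stacked=True)` on such a table would give bars of equal height, which makes the proportions directly comparable.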

In [None]:
pd.crosstab(visual_df.number_of_pets_in_house_mean,visual_df.Label).plot(kind='bar')
plt.title('number_of_pets_in_house_mean vs Label')
plt.xlabel('number_of_pets_in_house_mean')
plt.ylabel('Number of Pets')


pd.crosstab(visual_oneormore_df.number_of_pets_in_house_mean,visual_oneormore_df.Label).plot(kind='bar')
plt.title('number_of_pets_in_house_mean vs Label')
plt.xlabel('number_of_pets_in_house_mean')
plt.ylabel('Number of Pets')


pd.crosstab(visual_df.number_of_pets_in_house_mean,visual_df.LabelB).plot(kind='bar')
plt.title('number_of_pets_in_house_mean vs LabelB')
plt.xlabel('number_of_pets_in_house_mean')
plt.ylabel('Number of Pets')

In [None]:
pd.crosstab(visual_df.ate_wet_food_pre_tails,visual_df.Label).plot(kind='bar')
plt.title('ate_wet_food_pre_tails vs Label')
plt.xlabel('ate_wet_food_pre_tails')
plt.ylabel('Number of Pets')



pd.crosstab(visual_oneormore_df.ate_wet_food_pre_tails,visual_oneormore_df.Label).plot(kind='bar')
plt.title('ate_wet_food_pre_tails vs Label')
plt.xlabel('ate_wet_food_pre_tails')
plt.ylabel('Number of Pets')



pd.crosstab(visual_df.ate_wet_food_pre_tails,visual_df.LabelB).plot(kind='bar')
plt.title('ate_wet_food_pre_tails vs LabelB')
plt.xlabel('ate_wet_food_pre_tails')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.communication_max,visual_df.Label).plot(kind='bar')
plt.title('communication_max vs Label')
plt.xlabel('communication_max')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.communication_max,visual_oneormore_df.Label).plot(kind='bar')
plt.title('communication_max vs Label')
plt.xlabel('communication_max')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.communication_max,visual_df.LabelB).plot(kind='bar')
plt.title('communication_max vs LabelB')
plt.xlabel('communication_max')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.pet_has_active_subscription,visual_df.Label).plot(kind='bar')
plt.title('pet_has_active_subscription vs Label')
plt.xlabel('pet_has_active_subscription')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.pet_has_active_subscription,visual_oneormore_df.Label).plot(kind='bar')
plt.title('pet_has_active_subscription vs Label')
plt.xlabel('pet_has_active_subscription')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.pet_has_active_subscription,visual_df.LabelB).plot(kind='bar')
plt.title('pet_has_active_subscription vs LabelB')
plt.xlabel('pet_has_active_subscription')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.pet_food_tier,visual_df.Label).plot(kind='bar')
plt.title('pet_food_tier vs Label')
plt.xlabel('pet_food_tier')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.pet_food_tier,visual_oneormore_df.Label).plot(kind='bar')
plt.title('pet_food_tier vs Label')
plt.xlabel('pet_food_tier')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.pet_food_tier,visual_df.LabelB).plot(kind='bar')
plt.title('pet_food_tier vs LabelB')
plt.xlabel('pet_food_tier')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.pet_fav_flavour_list,visual_df.Label).plot(kind='bar')
plt.title('pet_fav_flavour_list vs Label')
plt.xlabel('pet_fav_flavour_list')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.pet_fav_flavour_list,visual_oneormore_df.Label).plot(kind='bar')
plt.title('pet_fav_flavour_list vs Label')
plt.xlabel('pet_fav_flavour_list')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.pet_fav_flavour_list,visual_df.LabelB).plot(kind='bar')
plt.title('pet_fav_flavour_list vs LabelB')
plt.xlabel('pet_fav_flavour_list')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.neutered,visual_df.Label).plot(kind='bar')
plt.title('neutered vs Label')
plt.xlabel('neutered')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.neutered,visual_oneormore_df.Label).plot(kind='bar')
plt.title('neutered vs Label')
plt.xlabel('neutered')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.neutered,visual_df.LabelB).plot(kind='bar')
plt.title('neutered vs LabelB')
plt.xlabel('neutered')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.pet_breed_size,visual_df.Label).plot(kind='bar')
plt.title('pet_breed_size vs Label')
plt.xlabel('pet_breed_size')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.pet_breed_size,visual_oneormore_df.Label).plot(kind='bar')
plt.title('pet_breed_size vs Label')
plt.xlabel('pet_breed_size')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.pet_breed_size,visual_df.LabelB).plot(kind='bar')
plt.title('pet_breed_size vs LabelB')
plt.xlabel('pet_breed_size')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.signup_promo,visual_df.Label).plot(kind='bar')
plt.title('signup_promo vs Label')
plt.xlabel('signup_promo')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.signup_promo,visual_oneormore_df.Label).plot(kind='bar')
plt.title('signup_promo vs Label')
plt.xlabel('signup_promo')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.signup_promo,visual_df.LabelB).plot(kind='bar')
plt.title('signup_promo vs LabelB')
plt.xlabel('signup_promo')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
visual_df.columns

In [None]:
pd.crosstab(visual_df.pet_life_stage_at_order,visual_df.Label).plot(kind='bar')
plt.title('pet_life_stage_at_order vs Label')
plt.xlabel('pet_life_stage_at_order')
plt.ylabel('Number of Pets')

pd.crosstab(visual_oneormore_df.pet_life_stage_at_order,visual_oneormore_df.Label).plot(kind='bar')
plt.title('pet_life_stage_at_order vs Label')
plt.xlabel('pet_life_stage_at_order')
plt.ylabel('Number of Pets')

pd.crosstab(visual_df.pet_life_stage_at_order,visual_df.LabelB).plot(kind='bar')
plt.title('pet_life_stage_at_order vs LabelB')
plt.xlabel('pet_life_stage_at_order')
plt.ylabel('Number of Pets')

In [None]:
visual_df.columns

In [None]:
# interpretation: the longer the client stays with the company (i.e. the higher the number of orders), the higher the probability of buying wet food

plt.figure(figsize=(10,6))
visual_df[visual_df['Label']==1]['pet_order_number_max'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Label=1')
visual_df[visual_df['Label']==0]['pet_order_number_max'].hist(alpha=0.2,color='green',
                                              bins=10,label='Label=0')
plt.legend()
plt.xlabel('pet_order_number_max')
plt.ylabel('Frequency')




plt.figure(figsize=(10,6))
visual_oneormore_df[visual_oneormore_df['Label']==2]['pet_order_number_max'].hist(alpha=0.5,color='brown',
                                              bins=10,label='Label=2')
visual_oneormore_df[visual_oneormore_df['Label']==1]['pet_order_number_max'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Label=1')
plt.legend()
plt.xlabel('pet_order_number_max')
plt.ylabel('Frequency')




plt.figure(figsize=(10,6))
visual_df[visual_df['LabelB']==2]['pet_order_number_max'].hist(alpha=0.5,color='brown',
                                              bins=20,label='Label=2')
visual_df[visual_df['LabelB']==1]['pet_order_number_max'].hist(alpha=0.5,color='blue',
                                              bins=20,label='Label=1')
visual_df[visual_df['LabelB']==0]['pet_order_number_max'].hist(alpha=0.2,color='green',
                                              bins=20,label='Label=0')
plt.legend()
plt.xlabel('pet_order_number_max')
plt.ylabel('Frequency')

In [None]:
visual_df.columns

In [None]:
plt.figure(figsize=(10,6))
visual_df[visual_df['Label']==1]['kibble_kcal_mean'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Buyers')
visual_df[visual_df['Label']==0]['kibble_kcal_mean'].hist(alpha=0.2,color='green',
                                              bins=13,label='Non-Buyers')
plt.legend()
plt.xlabel('kibble_kcal_mean')
plt.ylabel('Number of Pets')




plt.figure(figsize=(10,6))
visual_oneormore_df[visual_oneormore_df['Label']==2]['kibble_kcal_mean'].hist(alpha=0.5,color='brown',
                                              bins=8,label='Buyers (1+ orders)')
visual_oneormore_df[visual_oneormore_df['Label']==1]['kibble_kcal_mean'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Buyers (1 order)')
plt.legend()
plt.xlabel('kibble_kcal_mean')
plt.ylabel('Number of Pets')




plt.figure(figsize=(10,6))
visual_df[visual_df['LabelB']==2]['kibble_kcal_mean'].hist(alpha=0.5,color='brown',
                                              bins=20,label='Label=2')
visual_df[visual_df['LabelB']==1]['kibble_kcal_mean'].hist(alpha=0.5,color='blue',
                                              bins=20,label='Label=1')
visual_df[visual_df['LabelB']==0]['kibble_kcal_mean'].hist(alpha=0.2,color='green',
                                              bins=20,label='Label=0')
plt.legend()
plt.xlabel('kibble_kcal_mean')
plt.ylabel('Frequency')

In [None]:
visual_df.columns

In [None]:
plt.figure(figsize=(10,6))
visual_df[visual_df['Label']==1]['days_before_closing_max'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Label=1')
visual_df[visual_df['Label']==0]['days_before_closing_max'].hist(alpha=0.2,color='green',
                                              bins=10,label='Label=0')
plt.legend()
plt.xlabel('days_before_closing_max')
plt.ylabel('Frequency')




plt.figure(figsize=(10,6))
visual_oneormore_df[visual_oneormore_df['Label']==2]['days_before_closing_max'].hist(alpha=0.5,color='brown',
                                              bins=10,label='Label=2')
visual_oneormore_df[visual_oneormore_df['Label']==1]['days_before_closing_max'].hist(alpha=0.5,color='blue',
                                              bins=10,label='Label=1')
plt.legend()
plt.xlabel('days_before_closing_max')
plt.ylabel('Frequency')




plt.figure(figsize=(10,6))
visual_df[visual_df['LabelB']==2]['days_before_closing_max'].hist(alpha=0.5,color='brown',
                                              bins=20,label='Label=2')
visual_df[visual_df['LabelB']==1]['days_before_closing_max'].hist(alpha=0.5,color='blue',
                                              bins=20,label='Label=1')
visual_df[visual_df['LabelB']==0]['days_before_closing_max'].hist(alpha=0.2,color='green',
                                              bins=20,label='Label=0')
plt.legend()
plt.xlabel('days_before_closing_max')
plt.ylabel('Frequency')

<a id='dataprepmodel'></a>
# MODELLING
## Prepare Data before Implementing Feature Selection Methods
- Earlier we aggregated the transactional/numerical data for each pet into numbers_df and the categorical data of pets into category_df
- We will now dummify categorical data in category_df, then aggregate that for each pet
- Then we will merge the categorical and numerical data
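
The dummify-then-merge steps can be sketched on two toy tables. The table and column names mirror the notebook, but the contents are invented and only one categorical column is shown.

```python
import pandas as pd

toy_numbers = pd.DataFrame({"pet_id": [1, 2, 3],
                            "kibble_kcal_mean": [310.0, 500.0, 420.0]})
toy_category = pd.DataFrame({"pet_id": [1, 2, 3],
                             "gender": ["male", "female", "male"]})

# One indicator column per category value; drop_first=False keeps all of them.
toy_dummy = pd.get_dummies(toy_category, columns=["gender"], drop_first=False)

# Both tables have one row per pet, so the merge is one-to-one on pet_id.
toy_transformed = pd.merge(toy_numbers, toy_dummy, on="pet_id")
print(toy_transformed)
```

High-cardinality columns such as dry_food_brand_pre_tails (136 brands) produce one column per value, so the merged table becomes wide quickly.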

In [None]:
numbers_df.head()

In [None]:
category_df.head()

In [None]:
#dummify categorical variables
dummy_df = pd.get_dummies(category_df, 
                             columns=['pet_has_active_subscription', 'pet_food_tier', 'pet_allergen_list', 'pet_fav_flavour_list', 
                    'pet_health_issue_list', 'neutered', 'gender','pet_breed_size', 'signup_promo', 'ate_wet_food_pre_tails',
                    'dry_food_brand_pre_tails', 'pet_life_stage_at_order'], 
                             drop_first = False)

In [None]:
dummy_df.head()

### Merge the two tables (aggregated transactional data and dummified profile/categorical data)

In [None]:
transformed_df = pd.merge(numbers_df, dummy_df, on='pet_id')
transformed_df.head().T

In [None]:
transformed_df.shape

## Data Labelling of final dataframes prior to modelling 
### (above data labelling was for the visual_df, the dataframe used for visualisations)
- Create 2 tables: 
- (1) pet orders table no-wet-food-orders/at-least-one-wet-food-order
- (2) pet orders table one-wet-food-order/multiple-wet_food_orders

In [None]:
# (Table 1) pet orders table no-wet-food-orders/at-least-one-wet-food-order
transformed_df['Label'] = np.where(transformed_df['wet_food_order_number_max'] > 0, 1, 0)

In [None]:
# (Table 2) pet orders table one-wet-food-order/multiple-wet_food_orders
wet_food_df = transformed_df[transformed_df['Label']==1].copy()  # copy so that relabelling does not trigger SettingWithCopyWarning

In [None]:
del wet_food_df['Label']

In [None]:
wet_food_df['Label'] = np.where(wet_food_df['wet_food_order_number_max'] > 1, 2, 1)
# Label '2' meaning 2 or more wet food orders

### Some correlation analysis below which can sometimes be used for feature selection as well

In [None]:
transformed_df.corr()['wet_food_order_number_max'].sort_values(ascending=False)

In [None]:
transformed_df.corr()['wet_food_order_number_max'].sort_values(ascending=True)

In [None]:
wet_food_df.corr()['wet_food_order_number_max'].sort_values(ascending=False)

In [None]:
wet_food_df.corr()['wet_food_order_number_max'].sort_values(ascending=True)

In [None]:
# now we can delete the column that lets us decide data labels
del transformed_df['wet_food_order_number_max']
del wet_food_df['wet_food_order_number_max']

In [None]:
numbers_df.columns

In [None]:
# furthermore we delete a couple more columns from transformed_df because these variables are not appropriate as explanatory variables for zero-wet-orders/some-wet-orders classification
del transformed_df['wet_food_discount_percent_mean']
del transformed_df['wet_dry_cal_ratio_mean']

# The above two explanatory variables are kept in wet_food_df table since they are relevant variables in one-wet-order/more-wet-orders classification

In [None]:
transformed_df.sample(9).T

In [None]:
wet_food_df.head()

In [None]:
transformed_df.groupby('Label').size()
# Figures consistent with earlier analysis

In [None]:
wet_food_df.groupby('Label').size()
# Figures consistent with earlier analysis

In [None]:
transformed_df.columns[transformed_df.isnull().any()].tolist()

In [None]:
wet_food_df.columns[wet_food_df.isnull().any()].tolist()

# The discount column shows no missing values here because every wet order carries either a discount value or 0, never NaN. It is NaN only when the wet order number is NaN

## Balancing labelled data (time permitting)

In [None]:
transformed_df.groupby('Label').size()
# Imbalanced data. Half of the examples labelled 0 should be removed; otherwise the algorithm trains mostly on '0' examples
# and predicts 0s more easily than 1s, which hurts recall. Balance the data if time permits.
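
The balancing step described above is not carried out in this notebook. A minimal sketch of random undersampling is shown here on a small synthetic frame; `toy_df` and `undersample_majority` are hypothetical names used only for illustration, and `toy_df` stands in for `transformed_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for transformed_df: 80 rows labelled 0, 20 labelled 1
rng = np.random.default_rng(21)
toy_df = pd.DataFrame({
    'kibble_kcal_mean': rng.normal(500, 100, size=100),
    'Label': [0] * 80 + [1] * 20,
})

def undersample_majority(df, label_col='Label', random_state=21):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

balanced_df = undersample_majority(toy_df)
print(balanced_df.groupby('Label').size())
```

If this is used, it is usually applied to the training split only, so that the test set keeps the original class ratio. Passing `class_weight='balanced'` to the classifier is an alternative that keeps all rows.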

## Scaling of columns like kibble_kcal and days_before_closing_max (time permitting)

In [None]:
numbers_df['kibble_kcal_mean'].describe()

In [None]:
sns.distplot(numbers_df['kibble_kcal_mean'])

In [None]:
numbers_df.columns

In [None]:
sns.distplot(numbers_df['days_before_closing_max'])
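
Scaling was left as a time-permitting step. A minimal sketch with `StandardScaler` follows; `toy_numbers` is a hypothetical stand-in for `numbers_df`, filled with made-up values on two very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for numbers_df with two columns on very different scales
rng = np.random.default_rng(0)
toy_numbers = pd.DataFrame({
    'kibble_kcal_mean': rng.normal(600, 150, size=200),
    'days_before_closing_max': rng.integers(0, 30, size=200).astype(float),
})

# Standardise each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy_numbers),
    columns=toy_numbers.columns,
    index=toy_numbers.index,
)
print(scaled.describe().loc[['mean', 'std']])
```

Tree-based models such as random forests are largely insensitive to this kind of rescaling, while logistic regression tends to benefit from it. In a real run the scaler would be fitted on the training split and then applied to the test split.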

<a id='featureselectionmethods'></a>
# Feature Selection Methods

## Feature importances

In [None]:
features = transformed_df[transformed_df.columns.difference(['Label','pet_id'])]

labels = transformed_df['Label']

- All rows are kept here because no classification model is being evaluated yet; the fit is used only to obtain feature importances

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier()
clf.fit(features,labels)

preds = clf.predict(features)

# in-sample accuracy only: the model is scored on the same rows it was fitted on
accuracy = accuracy_score(labels,preds)
print(accuracy)

In [None]:
VI = pd.DataFrame(clf.feature_importances_, columns = ["RF"], index=features.columns)
VI = VI.reset_index()
VI

In [None]:
VI.sort_values(['RF'],ascending=False)[0:20]
# Get the top features from the list below

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

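# chi2 requires non-negative inputs, hence abs(); note that this folds negative values onto positive ones
model = SelectKBest(score_func=chi2, k=5)
fit = model.fit(features.abs(), labels)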

In [None]:
pd.options.display.float_format = '{:.2f}'.format
chi_sq = pd.DataFrame(fit.scores_, columns = ["Chi_Square"], index=features.columns)

In [None]:
chi_sq = chi_sq.reset_index()

In [None]:
chi_sq.sort_values('Chi_Square',ascending=False)[0:20]
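
`chi2` only accepts non-negative inputs, which is why `features.abs()` is passed above. One alternative that keeps the ordering of values within each column is to rescale every feature to the range 0 to 1 first. The sketch below uses synthetic data; `toy_features`, `toy_labels` and `toy_scores` are hypothetical names, not objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: one feature related to the label (and sometimes negative), one pure noise
rng = np.random.default_rng(0)
toy_labels = pd.Series(rng.integers(0, 2, size=300), name='Label')
toy_features = pd.DataFrame({
    'informative': toy_labels * 2.0 + rng.normal(scale=0.5, size=300),
    'noise': rng.normal(size=300),
})

# Rescale each column to [0, 1] so chi2 receives non-negative inputs
scaled = MinMaxScaler().fit_transform(toy_features)

selector = SelectKBest(score_func=chi2, k=1).fit(scaled, toy_labels)
toy_scores = pd.Series(selector.scores_, index=toy_features.columns).sort_values(ascending=False)
print(toy_scores)
```

Whether `abs()` or min-max scaling is more appropriate depends on what the negative values mean in each column, so this is one option rather than a recommendation.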

## Checking for multicollinearity using variance inflation factor (time permitting)
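
This check was not run. As a minimal sketch, the variance inflation factor of each column can be read off the diagonal of the inverse correlation matrix. `variance_inflation_factors` and `toy_features` are hypothetical names, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(df):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name='VIF').sort_values(ascending=False)

# Synthetic data: 'a_plus_noise' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy_features = pd.DataFrame({
    'a': a,
    'b': b,
    'a_plus_noise': a + rng.normal(scale=0.1, size=500),
})

vif = variance_inflation_factors(toy_features)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of strong collinearity. The matrix inversion fails when columns are perfectly collinear, for example a full set of one-hot dummies, so one level per categorical variable would need to be dropped first.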

## Important Features: 

The following appear to be the most important features under each method:

- Random Forest (Ate Wet Food Pre Tails, Kibble_kcal, Days before closing, Total minutes on website since last order, Pet order number)
- SelectKBest (Kibble_kcal, Total minutes on website since last order, Ate Wet Food Pre Tails, Pet Life Stage at Order, Breed Size)

In [None]:
transformed_df.corr()['Label'].sort_values(ascending=False)

In [None]:
transformed_df.corr()['Label'].sort_values(ascending=True)

<a id='models'></a>
# Models

In [None]:
classification_df = transformed_df[transformed_df.columns.difference(['pet_id'])]

In [None]:
classification_df.head().T

In [None]:
features = transformed_df[transformed_df.columns.difference(['Label','pet_id'])]

labels = transformed_df['Label']

In [None]:
Class_Features = transformed_df.columns.difference(['Label','pet_id'])

In [None]:
Selected_Features = ['kibble_kcal_mean', 'total_minutes_on_website_since_last_order_mean', 'ate_wet_food_pre_tails_True', 
                     'pet_life_stage_at_order_mature', 'pet_order_number_max',
                     'number_of_pets_in_house_mean', 'days_before_closing_max']

In [None]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split

train, test = train_test_split(classification_df, test_size = 0.4, random_state=21)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [None]:
features_train = train[Class_Features]
label_train = train['Label']
features_test = test[Class_Features]
label_test = test['Label']

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(features_train,label_train)

pred_train = clf.predict(features_train)
pred_test = clf.predict(features_test)

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)

from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), clf.predict_proba(features_train)[:,1])
auc_train = metrics.auc(fpr,tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), clf.predict_proba(features_test)[:,1])
auc_test = metrics.auc(fpr,tpr)

print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test),"{:.2f}".format(auc_train),"{:.2f}".format(auc_test))

In [None]:
features.shape

In [None]:
pd.crosstab(label_train,pd.Series(pred_train),rownames=['ACTUAL'],colnames=['PRED'])

In [None]:
pd.crosstab(label_test,pd.Series(pred_test),rownames=['ACTUAL'],colnames=['PRED'])

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print('Accuracy Score')
print(accuracy_score(label_test, pred_test),'\n')

print('Precision Score')
print(precision_score(label_test, pred_test,average = None),'\n')

print('Confusion Matrix')
array = confusion_matrix(label_test, pred_test)
columns = ['Not Buyer','Buyer']  # confusion_matrix sorts labels ascending (0 = Not Buyer, 1 = Buyer); adapt to your classification problem
print(pd.DataFrame(array,columns = columns, index = columns),'\n')

print('Classification Report')
print(classification_report(label_test, pred_test),'\n')
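
`confusion_matrix` orders its rows and columns by sorted label value, so display names have to follow that order. The sketch below uses made-up predictions and assumes, as in the labelling used in this notebook, that 0 marks pets with no wet food orders and 1 marks pets with at least one. `label_names` and `cm_df` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration only
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

# Map each label value to a display name, then pass the label order explicitly
label_names = {0: 'Not Buyer', 1: 'Buyer'}
labels_sorted = sorted(label_names)
names = [label_names[value] for value in labels_sorted]

cm = confusion_matrix(y_true, y_pred, labels=labels_sorted)
cm_df = pd.DataFrame(
    cm,
    index=pd.Index(names, name='ACTUAL'),
    columns=pd.Index(names, name='PRED'),
)
print(cm_df)
```

Passing `labels=` explicitly ties the row and column names to the label values instead of relying on the order in which the names happen to be typed.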

### Classification using only the selected feature columns

In [None]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split

train, test = train_test_split(classification_df, test_size = 0.3, random_state=21)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [None]:
features_train = train[Selected_Features]
label_train = train['Label']
features_test = test[Selected_Features]
label_test = test['Label']

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(features_train,label_train)

pred_train = clf.predict(features_train)
pred_test = clf.predict(features_test)

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)

from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), clf.predict_proba(features_train)[:,1])
auc_train = metrics.auc(fpr,tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), clf.predict_proba(features_test)[:,1])
auc_test = metrics.auc(fpr,tpr)

print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test),"{:.2f}".format(auc_train),"{:.2f}".format(auc_test))

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lrclf = LogisticRegression()

lrclf.fit(features_train,label_train)

pred_train = lrclf.predict(features_train)
pred_test = lrclf.predict(features_test)

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)

from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), lrclf.predict_proba(features_train)[:,1])
auc_train = metrics.auc(fpr,tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), lrclf.predict_proba(features_test)[:,1])
auc_test = metrics.auc(fpr,tpr)

print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test),"{:.2f}".format(auc_train),"{:.2f}".format(auc_test))

## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(features_train, label_train)
# predict train set
pred_train=tree.predict(features_train)
# predict test set
pred_test=tree.predict(features_test)

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)

from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), tree.predict_proba(features_train)[:,1])
auc_train = metrics.auc(fpr,tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), tree.predict_proba(features_test)[:,1])
auc_test = metrics.auc(fpr,tpr)

print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test),"{:.2f}".format(auc_train),"{:.2f}".format(auc_test))
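
The fit, predict and score steps are repeated almost verbatim for the three classifiers above. A helper along the following lines could replace the copies. `evaluate_classifier` is a hypothetical name, and the data comes from `make_classification` rather than from `classification_df`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_classifier(model, X_train, y_train, X_test, y_test):
    """Fit the model, then return accuracy and AUC on both the train and test splits."""
    model.fit(X_train, y_train)
    results = {}
    for name, X, y in [('train', X_train, y_train), ('test', X_test, y_test)]:
        results['accuracy_' + name] = accuracy_score(y, model.predict(X))
        # Column 1 of predict_proba is the probability of model.classes_[1]
        results['auc_' + name] = roc_auc_score(y, model.predict_proba(X)[:, 1])
    return results

# Synthetic binary problem standing in for the notebook's data
X, y = make_classification(n_samples=400, n_features=6, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]:
    scores = evaluate_classifier(model, X_train, y_train, X_test, y_test)
    print(type(model).__name__, {key: round(value, 2) for key, value in scores.items()})
```

Reporting train and test scores side by side makes overfitting easy to spot: an unpruned decision tree typically reaches a train accuracy of 1.00 while scoring noticeably lower on the test split.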

<a id='secondmodel'></a>
# Model for 1order/1+orders classification

## Feature importances for the second classification

In [None]:
features = wet_food_df[wet_food_df.columns.difference(['Label','pet_id'])]


labels = wet_food_df['Label']

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(features,labels)

preds = clf.predict(features)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(preds,labels)
print(accuracy)

In [None]:
VII = pd.DataFrame(clf.feature_importances_, columns = ["RF"], index=features.columns)
VII = VII.reset_index()
VII

In [None]:
VII.sort_values(['RF'],ascending=False)[0:30]
# Shows the top 30; get the top 10 features from the list below

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

model = SelectKBest(score_func=chi2, k=5)
fit = model.fit(features.abs(), labels)

In [None]:
pd.options.display.float_format = '{:.2f}'.format
chi_sq = pd.DataFrame(fit.scores_, columns = ["Chi_Square"], index=features.columns)

In [None]:
chi_sq = chi_sq.reset_index()

In [None]:
chi_sq.sort_values('Chi_Square',ascending=False)[0:20]

## Pearson correlation

In [None]:
wet_food_df.corr()['Label'].sort_values(ascending=False)

In [None]:
wet_food_df.corr()['Label'].sort_values(ascending=True)

### Features that emerge from the feature selection methods above
- 'wet_food_discount_percent_mean', 'pet_order_number_max', 'wet_dry_cal_ratio_mean', 'total_minutes_on_website_since_last_order_mean', 'kibble_kcal_mean', 'communication_max', 'days_before_closing_max'

# Models for 1order/1+orders classification

In [None]:
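from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import metrics

# Split wet_food_df here; without this, features_train/label_train still hold
# the split made for the zero-wet-orders/some-wet-orders classification
train, test = train_test_split(wet_food_df, test_size = 0.3, random_state=21)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

# Class_Features2 is a helper name introduced here: all columns except the label and the id
Class_Features2 = wet_food_df.columns.difference(['Label','pet_id'])
features_train = train[Class_Features2]
label_train = train['Label']
features_test = test[Class_Features2]
label_test = test['Label']

clf = RandomForestClassifier()
clf.fit(features_train,label_train)

pred_train = clf.predict(features_train)
pred_test = clf.predict(features_test)

accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)

# Labels are 1 and 2 here, so roc_curve needs pos_label; column 1 of predict_proba is class 2
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), clf.predict_proba(features_train)[:,1], pos_label=2)
auc_train = metrics.auc(fpr,tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), clf.predict_proba(features_test)[:,1], pos_label=2)
auc_test = metrics.auc(fpr,tpr)

print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test),"{:.2f}".format(auc_train),"{:.2f}".format(auc_test))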

In [None]:
pd.crosstab(label_train,pd.Series(pred_train),rownames=['ACTUAL'],colnames=['PRED'])

In [None]:
pd.crosstab(label_test,pd.Series(pred_test),rownames=['ACTUAL'],colnames=['PRED'])

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print('Accuracy Score')
print(accuracy_score(label_test, pred_test),'\n')

print('Precision Score')
print(precision_score(label_test, pred_test,average = None),'\n')

print('Confusion Matrix')
array = confusion_matrix(label_test, pred_test)
columns = ['Buyer1','Buyer1+']  #to adapt to your classification problem
print(pd.DataFrame(array,columns = columns, index = columns),'\n')

print('Classification Report')
print(classification_report(label_test, pred_test),'\n')

## Running the algorithm using the selected features that emerged from the feature selection methods above

In [None]:
Selected_Features2 = ['wet_food_discount_percent_mean','pet_order_number_max','wet_dry_cal_ratio_mean','total_minutes_on_website_since_last_order_mean',
                      'kibble_kcal_mean','communication_max','days_before_closing_max']

In [None]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split

train, test = train_test_split(wet_food_df, test_size = 0.3, random_state=21)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [None]:
features_train = train[Selected_Features2]
label_train = train['Label']
features_test = test[Selected_Features2]
label_test = test['Label']

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(features_train,label_train)

pred_train = clf.predict(features_train)
pred_test = clf.predict(features_test)

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(pred_train,label_train)
accuracy_test = accuracy_score(pred_test,label_test)


print("{:.2f}".format(accuracy_train),"{:.2f}".format(accuracy_test))
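
The cell above reports accuracy only. Because `Label` takes the values 1 and 2 in `wet_food_df`, `roc_curve` cannot infer the positive class on its own and needs `pos_label` set explicitly. The sketch below shows this on synthetic data; `toy_clf` and `toy_auc_test` are hypothetical names, and the relabelling step only imitates the 1/2 coding.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem, relabelled from 0/1 to 1/2 to imitate wet_food_df['Label']
X, y01 = make_classification(n_samples=400, n_features=6, random_state=21)
y = y01 + 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
toy_clf = RandomForestClassifier(random_state=21).fit(X_train, y_train)

# Look up which predict_proba column belongs to class 2 instead of assuming it
pos_index = list(toy_clf.classes_).index(2)
proba_2 = toy_clf.predict_proba(X_test)[:, pos_index]

# pos_label=2 tells roc_curve that label 2 (two or more wet food orders) is the positive class
fpr, tpr, _ = metrics.roc_curve(y_test, proba_2, pos_label=2)
toy_auc_test = metrics.auc(fpr, tpr)
print("{:.2f}".format(toy_auc_test))
```

The same two changes, selecting the class 2 probability column and passing `pos_label=2`, would let the AUC be printed alongside the accuracy figures for the selected-feature model.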