# **Pascual Capstone Group 4 Notebook - Route Optimization**

Group members: *Abdullah Alshaarawi, James Alarde, Hiromitsu Fujiyama, Sanjo Joy, Thomas Arturo Renwick Morales*

---

This notebook is organized in the following sections:

* [Part 0 - Importing the Necessary Libraries](#0)

* [Part 1 - Data Loading](#1)

* [Part 2 - Data Cleaning/ Wrangling](#2)
  * [Part 2.1 - Preliminary Analysis of the Dataset](#2.1)
  * [Part 2.2 - Converting Column Names to Pythonic Snake-Case](#2.2)
  * [Part 2.3 - Dealing with Duplicates](#2.3)
  * [Part 2.4 - Ensuring Correct Data Types](#2.4)
  * [Part 2.5 - Dealing with Null/Missing Values](#2.5)
  * [Part 2.6 - Final Checks](#2.6)

* [Part 3 - Aggregating the Client-Level Dataset](#3)

* [Part 4 - Exploratory Data Analysis](#4)

---

<a id='0'></a>
# Part 0 - Importing the Necessary Libraries

First, we imported the libraries which were necessary for our analysis.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
import numpy as np
import joblib

# Displaying only 2 decimal points for visual purposes
pd.set_option('display.float_format', '{:.2f}'.format) 

In [2]:
#To reset to default display option if needed later on:
## pd.reset_option('display.float_format')

<a id='1'></a>
# Part 1 - Data Loading

Then, we proceeded to load the dataset.

In [3]:
df = pd.read_csv('dataset/Orders_Master_Data(in).csv')

<a id='2'></a>
# Part 2 - Data Cleaning/ Wrangling

<a id='2.1'></a>
## Part 2.1 - Preliminary Analysis of the Dataset

Before cleaning the data, we ran basic pandas functions to get a preliminary view of the dataset.

In [4]:
df.head()

Unnamed: 0,Date,City,Channel,Client ID,Promotor ID,Volume,Income,Number of orders,Median Ticket (€),Prom Contacts Month,Tel Contacts Month
0,01.01.2024,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0
1,01.01.2024,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0
2,02.01.2024,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0
3,02.01.2024,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0
4,02.01.2024,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035735 entries, 0 to 1035734
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   Date                 1035735 non-null  object 
 1   City                 1035735 non-null  object 
 2   Channel              1035735 non-null  object 
 3   Client ID            1035735 non-null  int64  
 4   Promotor ID          1035735 non-null  int64  
 5   Volume               1035735 non-null  float64
 6   Income               1035735 non-null  float64
 7   Number of orders     1035735 non-null  int64  
 8   Median Ticket (€)    1035735 non-null  float64
 9   Prom Contacts Month  1035735 non-null  int64  
 10  Tel Contacts Month   1035735 non-null  int64  
dtypes: float64(3), int64(5), object(3)
memory usage: 86.9+ MB


<a id='2.2'></a>
## Part 2.2 - Converting Column Names to Pythonic Snake-Case

Next, we converted the column names to Pythonic snake-case, as this simplifies the later machine learning steps.

In [6]:
df = df.rename(columns={'Date':'date', 
                        'City':'city', 
                        'Channel':'channel', 
                        'Client ID': 'client_id',
                        'Promotor ID': 'promotor_id',
                        'Volume': 'volume',
                        'Income': 'income',
                        'Number of orders': 'number_of_orders',
                        'Median Ticket (€)':'median_ticket',
                        'Prom Contacts Month': 'prom_contacts_month',
                        'Tel Contacts Month': 'tel_contacts_month'})
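
As an alternative to writing the mapping by hand, the same renaming can be sketched with a small helper. The function name `to_snake_case` and its regular expressions are our own illustration (not part of the original notebook), and it assumes that anything in parentheses, such as the `(€)` unit, can be dropped.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a label such as 'Median Ticket (€)' to 'median_ticket' (illustrative helper)."""
    name = re.sub(r"\(.*?\)", "", name)         # drop parenthesised units, e.g. "(€)"
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # collapse spaces and symbols into underscores
    return name.strip("_").lower()

original_columns = [
    "Date", "City", "Channel", "Client ID", "Promotor ID", "Volume", "Income",
    "Number of orders", "Median Ticket (€)", "Prom Contacts Month", "Tel Contacts Month",
]
renamed = [to_snake_case(col) for col in original_columns]
print(renamed)
```

Since `DataFrame.rename` accepts a callable, this could be applied as `df = df.rename(columns=to_snake_case)`. The explicit dictionary above remains the safer choice when exact control over each name matters.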

<a id='2.3'></a>
## Part 2.3 - Dealing with Duplicates

Then we checked for duplicated rows and found that the dataset did contain some.

In [7]:
df.duplicated().any()

True

We then counted the duplicated rows: 20,770 of the 1,035,735 rows (about 2%) were exact repeats of an earlier row.

In [8]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")

Exact Duplicates: 20770 out of 1035735


We inspected the duplicated rows side by side to confirm that they were exact duplicates.

In [9]:
exact_duplicates = df[df.duplicated(keep=False)]
exact_duplicates.sort_values(by=['client_id', 'date']).head(10)

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month
919356,11.03.2024,Tarragona,HR,100854769,306190165,54.3,117.02,1,117.02,4,0
1018754,11.03.2024,Tarragona,HR,100854769,306190165,54.3,117.02,1,117.02,4,0
917803,12.02.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
1017201,12.02.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
925032,13.06.2024,Tarragona,HR,100854769,306190165,45.2,90.5,1,90.5,4,0
1024430,13.06.2024,Tarragona,HR,100854769,306190165,45.2,90.5,1,90.5,4,0
930843,16.09.2024,Tarragona,HR,100854769,306190165,129.0,74.14,1,74.14,4,0
1030241,16.09.2024,Tarragona,HR,100854769,306190165,129.0,74.14,1,74.14,4,0
917063,29.01.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
1016461,29.01.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0


Since these were exact duplicates, we dropped the repeated rows and kept the first occurrence of each, so that no valuable data points were lost.

In [10]:
df = df.drop_duplicates(keep='first')
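
To make the role of the `keep` argument concrete, here is a minimal sketch on a toy frame (the three rows are illustrative, loosely based on the client shown above). It shows why the row count after dropping equals the original count minus `df.duplicated().sum()`, which matches the figures in this notebook (1,035,735 - 20,770 = 1,014,965).

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["100854769", "100854769", "100854769"],
    "date": ["11.03.2024", "11.03.2024", "12.02.2024"],
    "income": [117.02, 117.02, 45.93],
})

n_extra_copies = toy.duplicated().sum()             # repeats only; first occurrences are not counted
n_rows_involved = toy.duplicated(keep=False).sum()  # every row that has at least one twin
deduplicated = toy.drop_duplicates(keep="first")    # keeps one row per group of identical rows

print(n_extra_copies, n_rows_involved, len(deduplicated))  # prints: 1 2 2
```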

We checked once more, to see if we had dealt with the duplicates properly and to observe if there were any remaining ones.

In [11]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")

Exact Duplicates: 0 out of 1014965


As there weren't any remaining duplicates (i.e., we had dealt with them properly), we next made sure that each column had the appropriate data type.

<a id='2.4'></a>
## Part 2.4 - Ensuring Correct Data Types

After a preliminary view of the dataset, we determined that the columns should be of the following data types:
* `date`: datetime
* `city`: object
* `channel`: object	
* `client_id`: object
* `promotor_id` : object
* `volume`: float	
* `income`: float	
* `number_of_orders`: integer	
* `median_ticket`: float	
* `prom_contacts_month`: integer	
* `tel_contacts_month`: integer

Therefore we proceeded to check if the columns were in fact in the data types we wanted them to be.

In [12]:
df.dtypes

date                    object
city                    object
channel                 object
client_id                int64
promotor_id              int64
volume                 float64
income                 float64
number_of_orders         int64
median_ticket          float64
prom_contacts_month      int64
tel_contacts_month       int64
dtype: object

Most columns were already of the appropriate data type except for: `date`, `client_id`, and `promotor_id`. Therefore we proceeded to modify these into their appropriate data types.

In [13]:
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
df['client_id'] = df['client_id'].astype(str)
df['promotor_id'] = df['promotor_id'].astype(str)
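
The explicit `format='%d.%m.%Y'` matters because the raw dates are day-first. The sketch below uses a few illustrative strings to show the parsing. The `errors="coerce"` variant is an optional extra we add for illustration: it turns unparseable values into `NaT` so they can be counted instead of raising an error.

```python
import pandas as pd

raw_dates = pd.Series(["01.01.2024", "02.01.2024", "11.03.2024"])
parsed = pd.to_datetime(raw_dates, format="%d.%m.%Y")
print(parsed.dt.month.tolist())  # prints: [1, 1, 3] ("02.01.2024" is 2 January, not 1 February)

# Optional defensive variant: unparseable strings become NaT instead of raising
with_bad_value = pd.Series(["02.01.2024", "not a date"])
coerced = pd.to_datetime(with_bad_value, format="%d.%m.%Y", errors="coerce")
n_unparsed = coerced.isna().sum()
print(n_unparsed)  # prints: 1
```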

We made a final check to make sure we had properly transformed these columns into their correct data type.

In [14]:
df.dtypes

date                   datetime64[ns]
city                           object
channel                        object
client_id                      object
promotor_id                    object
volume                        float64
income                        float64
number_of_orders                int64
median_ticket                 float64
prom_contacts_month             int64
tel_contacts_month              int64
dtype: object

With all columns in the correct data type, we proceeded to check for missing/null values in the dataset.

<a id='2.5'></a>
## Part 2.5 - Dealing with Null/Missing Values

In [15]:
df.isna().any()

date                   False
city                   False
channel                False
client_id              False
promotor_id            False
volume                 False
income                 False
number_of_orders       False
median_ticket          False
prom_contacts_month    False
tel_contacts_month     False
dtype: bool

In [16]:
df.isna().any().sum()

0

We found there were no missing/null values across the whole dataset.
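
Note that `df.isna().any().sum()` counts the number of columns that contain at least one null, not the number of null cells. A small sketch on a toy frame (illustrative values) shows the difference. For this dataset both quantities are 0, so the conclusion is unchanged.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "volume": [5.94, np.nan, np.nan],
    "income": [0.0, 21.02, 92.57],
})

columns_with_nulls = toy.isna().any().sum()  # number of columns with at least one null
total_null_cells = toy.isna().sum().sum()    # number of individual null cells
nulls_per_column = toy.isna().sum()          # breakdown per column

print(columns_with_nulls, total_null_cells)  # prints: 1 2
```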

Having completed this data wrangling step, we proceeded to make some final checks to the dataset before proceeding to aggregate data at the client-level.

<a id='2.6'></a>
## Part 2.6 - Final Checks

We took another preliminary view of the data.

In [17]:
df.head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month
0,2024-01-01,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0
1,2024-01-01,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0
3,2024-01-02,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0
4,2024-01-02,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1014965 entries, 0 to 1014964
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype         
---  ------               --------------    -----         
 0   date                 1014965 non-null  datetime64[ns]
 1   city                 1014965 non-null  object        
 2   channel              1014965 non-null  object        
 3   client_id            1014965 non-null  object        
 4   promotor_id          1014965 non-null  object        
 5   volume               1014965 non-null  float64       
 6   income               1014965 non-null  float64       
 7   number_of_orders     1014965 non-null  int64         
 8   median_ticket        1014965 non-null  float64       
 9   prom_contacts_month  1014965 non-null  int64         
 10  tel_contacts_month   1014965 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(3), object(4)
memory usage: 92.9+ MB


We checked for duplicates again.

In [19]:
df.duplicated().any()

False

We checked for data types again.

In [20]:
df.dtypes

date                   datetime64[ns]
city                           object
channel                        object
client_id                      object
promotor_id                    object
volume                        float64
income                        float64
number_of_orders                int64
median_ticket                 float64
prom_contacts_month             int64
tel_contacts_month              int64
dtype: object

Finally, we checked for any missing/null values again.

In [21]:
df.isna().any()

date                   False
city                   False
channel                False
client_id              False
promotor_id            False
volume                 False
income                 False
number_of_orders       False
median_ticket          False
prom_contacts_month    False
tel_contacts_month     False
dtype: bool

In [22]:
df.isna().any().sum()

0

Having confirmed that the dataset was clean, we moved on to the next step: aggregating the data at the client level.

As a precaution, we saved a copy of the cleaned dataset to CSV.

In [23]:
#df.to_csv('dataset/clean_orders_data/clean_orders_data.csv', index=False)

---

# **Questions for Pascual**

## *Data-related questions*

In [24]:
#Create month column for the analysis
df['month'] = df['date'].dt.to_period('M')
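
For readers unfamiliar with period columns, this is a minimal sketch of what `dt.to_period('M')` produces and how a single month can be selected. The dates are illustrative, and comparing against an explicit `pd.Period` is just one way to write the filter.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-02", "2024-01-29", "2024-03-04"]))
month = dates.dt.to_period("M")  # each timestamp is mapped to its calendar month

january = pd.Period("2024-01", freq="M")
in_january = month == january    # element-wise comparison against one period

print(month.astype(str).tolist())  # prints: ['2024-01', '2024-01', '2024-03']
print(in_january.tolist())         # prints: [True, True, False]
```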

**NOTE**: We split our analysis by positive and negative entries of income.

### 1. What is the relationship between median ticket and number of orders with net income?

We found it follows this general rule: 
* income = median_ticket * number_of_orders

However, there are some exceptions which do not coincide with this. We found for these exceptions that:
* income != median_ticket * number_of_orders
* Instead --> income = median_ticket

This exception is not taking the number of orders into account.

For this case, we only took into account entries with income equal to or greater than 0 (non-negative income).
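
The rule and its exception can be told apart row by row. The sketch below is one possible way to label them, using four rows taken from the outputs in this notebook. The column name `relationship`, the labels, and the use of an absolute difference with the same 1-euro tolerance are our own assumptions for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "income": [207.48, 197.30, 143.48, 362.69],
    "number_of_orders": [1, 2, 1, 2],
    "median_ticket": [207.48, 197.30, 71.74, 181.34],
})

TOLERANCE = 1.0  # assumption: rounding noise stays below 1 euro

expected = sample["median_ticket"] * sample["number_of_orders"]
follows_rule = (sample["income"] - expected).abs() <= TOLERANCE
equals_ticket = (sample["income"] - sample["median_ticket"]).abs() <= TOLERANCE

# Conditions are evaluated in order; the first match wins
sample["relationship"] = np.select(
    [follows_rule, equals_ticket],
    ["income = ticket * orders", "income = ticket"],
    default="neither",
)
print(sample)
```

On these rows, the first and last follow the general rule (the last one only up to a rounding difference of 0.01), the second matches the exception, and the third matches neither pattern, since its income is twice the median ticket despite a single order.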

In [25]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
positive_income = df[df['income'] >= 0].copy()

# Consistency check for the income / median_ticket relationship
positive_income['check'] = positive_income['income'] - (positive_income['median_ticket'] * positive_income['number_of_orders'])

# This check should be equal to 0; if it is not, the relationship does not hold

# Collect the inconsistencies
# We use a threshold of > 1 because differences caused by rounding are not an issue (minimal error)
inconsistent_rel = positive_income[positive_income['check'] > 1]

# We found around 2000 entries which do not match this criterion
inconsistent_rel

# This is not a very large proportion of the whole dataset. Should we drop them?
# How would you recommend we proceed?


Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01,71.74
1069,2024-01-16,Alicante,HR,739047412,729030652,0.00,25.69,0,25.69,0,0,2024-01,25.69
1230,2024-01-18,Alicante,AR,426657251,39304770,166.94,45.56,1,22.78,4,0,2024-01,22.78
1603,2024-01-24,Alicante,AR,531963963,218497097,8.58,870.04,1,435.02,0,0,2024-01,435.02
2483,2024-02-06,Alicante,AR,413503307,551409294,890.94,412.08,1,206.04,2,0,2024-02,206.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010580,2024-11-29,Valencia,HR,449827392,307450899,0.00,120.95,0,60.48,0,0,2024-11,120.95
1010703,2024-12-02,Valencia,AR,380180714,444765134,1551.34,583.99,1,292.00,0,4,2024-12,292.00
1011400,2024-12-05,Valencia,AR,588478841,52875287,245.90,94.58,1,47.29,4,0,2024-12,47.29
1011634,2024-12-09,Valencia,AR,644476280,998162842,1080.00,2743.49,1,1371.74,0,4,2024-12,1371.74


In [26]:
#We explored the data a bit more 
#-->by month for a client within this inconsistent_rel dataframe
## (first client_id which appears in the inconsistent_rel dataframe)
client_194410127= df[df['client_id']== '194410127']
client_194410127

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.30,2,197.30,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.20,186.23,1,186.23,0,8,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
24826,2024-12-04,Alicante,AR,194410127,729030652,400.20,232.76,1,232.76,0,8,2024-12
25214,2024-12-11,Alicante,AR,194410127,729030652,384.60,255.60,1,255.60,0,8,2024-12
25518,2024-12-16,Alicante,AR,194410127,729030652,157.20,94.43,1,94.43,0,8,2024-12
25676,2024-12-18,Alicante,AR,194410127,729030652,190.80,123.63,1,123.63,0,8,2024-12


In [27]:
#January
client_194410127[client_194410127['month']=='2024-01']
##2nd entry: the relationship is inconsistent

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.3,2,197.3,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.2,186.23,1,186.23,0,8,2024-01
1915,2024-01-29,Alicante,AR,194410127,729030652,931.74,362.69,2,181.34,0,8,2024-01


In [28]:
#March
client_194410127[client_194410127['month']=='2024-03']
##3rd entry: the relationship is inconsistent

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
4433,2024-03-04,Alicante,AR,194410127,729030652,325.68,198.4,1,198.4,0,8,2024-03
4936,2024-03-11,Alicante,AR,194410127,729030652,492.55,330.09,1,330.09,0,8,2024-03
5489,2024-03-18,Alicante,AR,194410127,729030652,288.74,194.35,2,194.35,0,8,2024-03
5858,2024-03-22,Alicante,AR,194410127,729030652,186.0,170.23,1,170.23,0,8,2024-03
6173,2024-03-27,Alicante,AR,194410127,729030652,343.5,215.04,1,215.04,0,8,2024-03


In [29]:
#May 
client_194410127[client_194410127['month']=='2024-05']
##3rd entry: the relationship is inconsistent

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8946,2024-05-06,Alicante,AR,194410127,729030652,322.99,246.52,1,246.52,0,8,2024-05
9508,2024-05-13,Alicante,AR,194410127,729030652,333.25,211.0,1,211.0,0,8,2024-05
10083,2024-05-20,Alicante,AR,194410127,729030652,395.17,227.13,2,227.13,0,8,2024-05
10624,2024-05-27,Alicante,AR,194410127,729030652,331.74,186.21,1,186.21,0,8,2024-05


### 2. What does negative income for an entry mean? Are these reimbursements?

We would assume that these are reimbursements. However, there is something we don't understand:
* Among the rows with negative income, the volume can be positive, negative, or zero. What does each of these cases mean?
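
The three cases can also be counted in a single pass. This is a minimal sketch on a few illustrative rows (the first four are taken from the outputs in this notebook; the last is a positive-income row that should be filtered out). The `sign_labels` mapping is our own naming.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "volume": [-103.99, 12.00, 0.00, -1.20, 48.0],
    "income": [-60.90, -745.01, -17.98, -3.07, 21.02],
})

negative_income = sample[sample["income"] < 0]

# np.sign returns -1.0, 0.0 or 1.0, which we map to readable labels
sign_labels = {-1.0: "negative", 0.0: "zero", 1.0: "positive"}
volume_sign = np.sign(negative_income["volume"]).map(sign_labels)
counts = volume_sign.value_counts()
print(counts)
```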

In [30]:
negative_income = df[df['income']<0]

In [31]:
#Negative volume
negative_vol = negative_income[negative_income['volume']< 0]
negative_vol

#There are 1689 rows with negative income entries and negative volume

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
144,2024-01-03,Alicante,HR,310637681,91937945,-103.99,-60.90,1,-30.45,0,0,2024-01
289,2024-01-04,Alicante,HR,454699461,551409294,-1.20,-3.07,0,-3.07,0,0,2024-01
376,2024-01-05,Alicante,HR,129590664,91937945,-3.50,-21.34,0,-21.34,0,0,2024-01
441,2024-01-08,Alicante,AR,394499568,39304770,-10.20,-50.75,0,-50.75,0,0,2024-01
991,2024-01-15,Alicante,HR,986671407,551409294,-2.34,-23.92,0,-23.92,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1009697,2024-11-25,Valencia,AR,859513033,249555220,-138.20,-105.35,1,-52.67,0,0,2024-11
1012470,2024-12-13,Valencia,AR,835982499,460456701,-90.00,-113.52,0,-113.52,0,0,2024-12
1012956,2024-12-17,Valencia,HR,527370739,327176535,-2.00,-46.90,0,-46.90,0,0,2024-12
1013410,2024-12-19,Valencia,HR,468061603,444765134,-0.13,-24.60,0,-24.60,0,0,2024-12


In [32]:
#Positive volume
positive_vol = negative_income[negative_income['volume']> 0]
positive_vol

#There are 238 rows with negative income entries and positive volume

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
4053,2024-02-27,Alicante,HR,557261162,39304770,12.00,-745.01,1,-248.34,0,0,2024-02
5928,2024-03-22,Alicante,HR,392868386,729030652,1.00,-35.37,1,-35.37,0,0,2024-03
10607,2024-05-24,Alicante,HR,867377147,729030652,120.00,-120.22,1,-60.11,0,0,2024-05
11572,2024-06-06,Alicante,HR,279230392,39304770,34.95,-13.72,1,-13.72,0,0,2024-06
13430,2024-07-02,Alicante,AR,688556611,551409294,11.50,-6.12,1,-6.12,0,0,2024-07
...,...,...,...,...,...,...,...,...,...,...,...,...
987813,2024-07-04,Valencia,HR,761984352,139088935,143.16,-42.43,1,-42.43,0,0,2024-07
988980,2024-07-11,Valencia,HR,297605797,52875287,32.64,-123.23,1,-123.23,0,0,2024-07
1001565,2024-10-03,Valencia,HR,371882962,249555220,108.00,-73.52,1,-73.52,0,0,2024-10
1008398,2024-11-14,Valencia,HR,796836014,376164172,1.00,-0.52,1,-0.52,0,0,2024-11


In [33]:
#Zero volume
zero_vol = negative_income[negative_income['volume'] == 0]
zero_vol

#There are 1116 rows with negative income entries and zero volume

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
33,2024-01-02,Alicante,AR,697976135,39304770,0.00,-17.98,0,-17.98,0,0,2024-01
1295,2024-01-18,Alicante,HR,972240381,551409294,0.00,-3.60,0,-3.60,0,0,2024-01
1323,2024-01-19,Alicante,AR,702594377,218497097,0.00,-6.04,0,-6.04,0,0,2024-01
1516,2024-01-23,Alicante,HR,101926782,91937945,0.00,-49.68,0,-49.68,0,0,2024-01
2021,2024-01-30,Alicante,AR,912139581,91937945,0.00,-0.31,1,-0.15,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1009521,2024-11-22,Valencia,HR,211786455,52875287,0.00,-21.14,0,-21.14,0,0,2024-11
1010185,2024-11-27,Valencia,HR,371882962,249555220,0.00,-90.32,0,-45.16,0,0,2024-11
1011607,2024-12-09,Valencia,AR,383525450,998162842,0.00,-776.16,0,-776.16,0,0,2024-12
1011624,2024-12-09,Valencia,AR,576161489,998162842,0.00,-86.18,0,-86.18,0,0,2024-12


### 3. What does it mean to have clients with a negative total income (i.e., the sum of income across all entries amounts to a negative number)?

In [37]:
#Summing total income across all entries by client
total_income = df.groupby('client_id')['income'].sum().sort_values()

#Creating a series which shows just negative total income (from most negative to least)
negative_income_clients = total_income[total_income < 0]

negative_income_clients.sort_values()

client_id
216722324   -17252.42
680649272    -2433.60
850271991    -2212.17
249067654    -1299.63
874370762     -758.87
686054949     -714.27
833968223     -604.13
434326174     -540.00
954110509     -441.56
193935197     -388.95
671328462     -329.81
572342924     -288.74
338415545     -279.69
326529393     -263.93
375350895     -249.09
119006128     -222.92
300010850     -210.92
473454133     -171.29
534193468     -155.28
533833249     -132.39
358565223     -110.22
327274699      -98.28
708945253      -87.16
127804495      -76.62
364776507      -68.21
327696319      -61.26
155877396      -59.21
353065642      -56.40
868006169      -56.00
516009589      -55.90
113456478      -55.90
545660944      -55.69
620948028      -55.60
462322659      -50.84
626656988      -49.50
909210497      -43.96
615645715      -32.60
908365968      -24.76
937477703      -21.60
129761549      -21.28
619532117      -20.55
523894728      -17.89
227049778      -16.64
223778293      -13.07
603561469       -8.15


In [40]:
print(f'There are {len(negative_income_clients)} clients with a negative total income')

There are 55 clients with a negative total income


What does this mean? What should we do with these clients? Should we drop them from our analysis?
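
Pending an answer, one option is to flag these clients rather than drop them, so that the decision stays reversible. The sketch below uses toy data, and the column name `negative_total_client` is a hypothetical choice of ours.

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["A", "A", "B", "B"],
    "income": [100.0, -20.0, 10.0, -50.0],
})

# transform("sum") broadcasts each client's total income back onto its rows
client_total = toy.groupby("client_id")["income"].transform("sum")
toy["negative_total_client"] = client_total < 0

# If exclusion is eventually agreed, filtering becomes a one-liner
kept = toy[~toy["negative_total_client"]]
print(toy)
```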

### 4. What does it mean to have more than one order on a specific date?

For demonstration purposes, we took the client with the highest total number of orders and did a short month-by-month analysis.

In [44]:
highest_no_of_orders_client = df.groupby('client_id')['number_of_orders'].sum().sort_values(ascending=False).index[0]
most_orders_df = df[df['client_id']== highest_no_of_orders_client]
most_orders_df

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.32,1188.17,8,148.52,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.87,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.00,0.00,0,0.00,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.93,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1012139,2024-12-11,Valencia,HR,577029300,376164172,1326.25,1301.94,11,118.36,0,4,2024-12
1012962,2024-12-17,Valencia,HR,577029300,376164172,626.10,569.97,12,47.50,0,4,2024-12
1014109,2024-12-25,Valencia,HR,577029300,376164172,861.81,1231.28,12,102.61,0,4,2024-12
1014490,2024-12-27,Valencia,HR,577029300,376164172,462.00,350.00,1,350.00,0,4,2024-12


In [45]:
#January
most_orders_df[most_orders_df['month'] == '2024-01'].head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.32,1188.17,8,148.52,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.87,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.0,0.0,0,0.0,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.93,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01


In [None]:
#February
most_orders_df[most_orders_df['month'] == '2024-02'].head()

##Entry 5: Negative income, median ticket doesn't follow the standard relationship
##positive volume --> All related to previous questions. 

##What do we do with these types of entries?

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
960835,2024-02-02,Valencia,HR,577029300,376164172,172.4,121.44,1,60.72,0,4,2024-02
961034,2024-02-05,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-02
961556,2024-02-07,Valencia,HR,577029300,376164172,972.45,1325.27,11,120.48,0,4,2024-02
962150,2024-02-09,Valencia,HR,577029300,376164172,138.6,105.0,1,105.0,0,4,2024-02
962353,2024-02-12,Valencia,HR,577029300,376164172,138.6,-12.62,1,-4.21,0,0,2024-02


In [48]:
#March
most_orders_df[most_orders_df['month'] == '2024-03'].head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
965913,2024-03-01,Valencia,HR,577029300,376164172,231.0,175.0,1,175.0,0,4,2024-03
966134,2024-03-04,Valencia,HR,577029300,376164172,231.0,175.0,1,175.0,0,4,2024-03
966661,2024-03-06,Valencia,HR,577029300,376164172,831.22,858.69,8,107.34,0,4,2024-03
966963,2024-03-07,Valencia,HR,577029300,376164172,214.8,155.6,2,77.8,0,4,2024-03
967284,2024-03-08,Valencia,HR,577029300,376164172,138.6,105.0,1,105.0,0,4,2024-03


### 5. Is it normal that the client which generates the highest median ticket/income has 0 promotor visits?

In [52]:
print(f"Client which generates the highest median ticket --> client_id = {df.groupby('client_id')['median_ticket'].sum().sort_values(ascending=False).index[0]}")
print(f"Client which generates the highest total income --> client_id = {df.groupby('client_id')['income'].sum().sort_values(ascending=False).index[0]}")

Client which generates the highest median ticket --> client_id = 386121207
Client which generates the highest total income --> client_id = 386121207


The client with the highest total income is also the client with the highest sum of median tickets. This makes sense, as the two features are clearly related.
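
As a side note, the top client can be looked up without sorting the whole series. This sketch on toy data uses `idxmax`, which returns the index label of the first maximum. The variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["A", "A", "B", "C"],
    "income": [10.0, 15.0, 40.0, 5.0],
    "median_ticket": [10.0, 15.0, 40.0, 5.0],
})

totals = toy.groupby("client_id")[["income", "median_ticket"]].sum()
top_income_client = totals["income"].idxmax()         # label of the largest total income
top_ticket_client = totals["median_ticket"].idxmax()  # label of the largest ticket sum

print(top_income_client, top_ticket_client)  # prints: B B
```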

In [None]:
highest_income_client = df.groupby('client_id')['income'].sum().sort_values(ascending=False).index[0]

## *Conceptual questions*

### 1. In which unit is Volume recorded (kg, litres, cases, other)? 

We assume that it is recorded in kilograms (kg).

---

# Playing around --> (based on tutor suggestions) 

Tutor suggestions:
* Focus on some interesting clients, e.g. high median ticket, high efficiency, high no. of contacts, etc.
* Envision the target variable --> The variable we are trying to predict/optimize.

Important discoveries:
* Median ticket seems to be the same as income at the daily level. What the instructions appear to be telling us is how to aggregate median ticket.

---

## Exploration

In [22]:
df.head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month
0,2024-01-01,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0
1,2024-01-01,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0
3,2024-01-02,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0
4,2024-01-02,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3


Client with the largest median ticket & income (summing): client_id = '386121207'

In [23]:
df.groupby('client_id')['median_ticket'].sum().sort_values(ascending=False)

client_id
386121207    532399.730000
215933226    424433.325000
223570774    311281.457333
697872448    262831.362500
169675793    260507.315000
                 ...      
874370762      -758.870000
680649272     -1216.800000
249067654     -1299.630000
850271991     -2212.170000
216722324    -10989.060000
Name: median_ticket, Length: 42149, dtype: float64

In [44]:
df.groupby('client_id')['income'].sum().sort_values(ascending=False)

client_id
386121207    599249.42
215933226    527932.32
223570774    494810.12
697872448    351473.41
908335862    327914.20
               ...    
874370762      -758.87
249067654     -1299.63
850271991     -2212.17
680649272     -2433.60
216722324    -17252.42
Name: income, Length: 42149, dtype: float64

Client with the highest number of orders: client_id = '577029300'

In [37]:
df.groupby('client_id')['number_of_orders'].sum().sort_values(ascending=False)

client_id
577029300    761
365042657    263
966347937    251
744372710    233
240393159    220
            ... 
210556623      0
595570908      0
158388030      0
380244978      0
776482901      0
Name: number_of_orders, Length: 42149, dtype: int64

In [26]:
#Creating month column --> can be useful to see how contacts work, and to find which is the established period.
df['month'] = df['date'].dt.to_period('M')

In [27]:
df['month'].value_counts()

month
2024-05    93571
2024-10    93197
2024-04    90794
2024-07    89315
2024-01    87700
2024-02    85706
2024-06    83213
2024-09    82282
2024-03    82164
2024-11    78518
2024-12    76008
2024-08    72497
Freq: M, Name: count, dtype: int64

In [28]:
df.head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
0,2024-01-01,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0,2024-01
1,2024-01-01,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0,2024-01
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0,2024-01
3,2024-01-02,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0,2024-01
4,2024-01-02,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3,2024-01


### 1. Exploring the client with the largest sum of median tickets

'386121207'

Surprisingly, the largest client has 0 promotor contacts per month (`prom_contacts_month` = 0), although it does have 2 telephone contacts per month.

In [29]:
largest_sum_mt = df[df['client_id']== '386121207']
largest_sum_mt

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
871432,2024-01-03,Sevilla,AR,386121207,662836107,18570.68,27796.43,1,27796.43,0,2,2024-01
873833,2024-01-22,Sevilla,AR,386121207,662836107,6131.34,9251.91,1,9251.91,0,2,2024-01
875031,2024-01-31,Sevilla,AR,386121207,662836107,9064.8,14361.84,1,14361.84,0,2,2024-01
877151,2024-02-15,Sevilla,AR,386121207,662836107,18484.26,24774.23,1,24774.23,0,2,2024-02
878920,2024-02-29,Sevilla,AR,386121207,662836107,9333.54,16979.35,1,16979.35,0,2,2024-02
880635,2024-03-13,Sevilla,AR,386121207,662836107,17949.5,23338.66,1,23338.66,0,2,2024-03
882662,2024-03-27,Sevilla,AR,386121207,662836107,17035.46,27066.38,1,27066.38,0,2,2024-03
883481,2024-04-05,Sevilla,AR,386121207,662836107,10438.74,13957.42,1,13957.42,0,2,2024-04
885741,2024-04-24,Sevilla,AR,386121207,662836107,19300.76,26658.77,1,26658.77,0,2,2024-04
887528,2024-05-08,Sevilla,AR,386121207,662836107,19971.44,21774.33,2,10887.165,0,2,2024-05


In [30]:
df[df['client_id']== '386121207'][['client_id','date', 'volume', 'income', 'number_of_orders', 'median_ticket', 'prom_contacts_month', 'month']]

Unnamed: 0,client_id,date,volume,income,number_of_orders,median_ticket,prom_contacts_month,month
871432,386121207,2024-01-03,18570.68,27796.43,1,27796.43,0,2024-01
873833,386121207,2024-01-22,6131.34,9251.91,1,9251.91,0,2024-01
875031,386121207,2024-01-31,9064.8,14361.84,1,14361.84,0,2024-01
877151,386121207,2024-02-15,18484.26,24774.23,1,24774.23,0,2024-02
878920,386121207,2024-02-29,9333.54,16979.35,1,16979.35,0,2024-02
880635,386121207,2024-03-13,17949.5,23338.66,1,23338.66,0,2024-03
882662,386121207,2024-03-27,17035.46,27066.38,1,27066.38,0,2024-03
883481,386121207,2024-04-05,10438.74,13957.42,1,13957.42,0,2024-04
885741,386121207,2024-04-24,19300.76,26658.77,1,26658.77,0,2024-04
887528,386121207,2024-05-08,19971.44,21774.33,2,10887.165,0,2024-05


In [31]:
#January
# Efficient --> number of orders (3) > number of promotor contacts per month (0)
largest_sum_mt[largest_sum_mt['month']=='2024-01']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
871432,2024-01-03,Sevilla,AR,386121207,662836107,18570.68,27796.43,1,27796.43,0,2,2024-01
873833,2024-01-22,Sevilla,AR,386121207,662836107,6131.34,9251.91,1,9251.91,0,2,2024-01
875031,2024-01-31,Sevilla,AR,386121207,662836107,9064.8,14361.84,1,14361.84,0,2,2024-01


In [32]:
#February
#Efficient --> No. of orders > number of contacts in month
largest_sum_mt[largest_sum_mt['month']=='2024-02']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
877151,2024-02-15,Sevilla,AR,386121207,662836107,18484.26,24774.23,1,24774.23,0,2,2024-02
878920,2024-02-29,Sevilla,AR,386121207,662836107,9333.54,16979.35,1,16979.35,0,2,2024-02


In [33]:
#March
largest_sum_mt[largest_sum_mt['month']=='2024-03']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
880635,2024-03-13,Sevilla,AR,386121207,662836107,17949.5,23338.66,1,23338.66,0,2,2024-03
882662,2024-03-27,Sevilla,AR,386121207,662836107,17035.46,27066.38,1,27066.38,0,2,2024-03


In [34]:
#April
largest_sum_mt[largest_sum_mt['month']=='2024-04']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
883481,2024-04-05,Sevilla,AR,386121207,662836107,10438.74,13957.42,1,13957.42,0,2,2024-04
885741,2024-04-24,Sevilla,AR,386121207,662836107,19300.76,26658.77,1,26658.77,0,2,2024-04


In [35]:
#May
largest_sum_mt[largest_sum_mt['month']=='2024-05']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
887528,2024-05-08,Sevilla,AR,386121207,662836107,19971.44,21774.33,2,10887.165,0,2,2024-05
888894,2024-05-17,Sevilla,AR,386121207,662836107,14975.82,23138.46,1,23138.46,0,2,2024-05


In [36]:
#June
largest_sum_mt[largest_sum_mt['month']=='2024-06']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
891134,2024-06-05,Sevilla,AR,386121207,662836107,12892.2,18317.46,1,18317.46,0,2,2024-06
891326,2024-06-06,Sevilla,AR,386121207,662836107,19364.4,25839.39,1,25839.39,0,2,2024-06
891843,2024-06-11,Sevilla,AR,386121207,662836107,6633.9,10096.54,1,10096.54,0,2,2024-06
893125,2024-06-20,Sevilla,AR,386121207,662836107,9780.3,11400.29,2,5700.145,0,2,2024-06
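
The month-by-month cells above can be condensed into one grouped summary. The following is a minimal sketch on toy data that reuses the notebook's column names. It assumes that the monthly contact figures repeat on every row of a month, so taking the maximum recovers them; `toy`, `monthly_summary` and `orders_exceed_visits` are hypothetical names.

```python
import pandas as pd

# Toy rows mirroring the notebook's columns (values are illustrative only)
toy = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'A', 'A'],
    'date': pd.to_datetime(['2024-01-03', '2024-01-22', '2024-01-31',
                            '2024-02-15', '2024-02-29']),
    'income': [100.0, 50.0, 75.0, 120.0, 80.0],
    'number_of_orders': [1, 1, 1, 1, 1],
    'prom_contacts_month': [0, 0, 0, 0, 0],
    'tel_contacts_month': [2, 2, 2, 2, 2],
})
toy['month'] = toy['date'].dt.to_period('M')

# One row per month: orders and income are summed; the monthly contact
# figures are assumed to repeat on every row, so we take their maximum.
monthly_summary = toy.groupby('month').agg(
    orders=('number_of_orders', 'sum'),
    income=('income', 'sum'),
    prom_contacts=('prom_contacts_month', 'max'),
    tel_contacts=('tel_contacts_month', 'max'),
)

# Flag months in which the client ordered more often than it was visited
monthly_summary['orders_exceed_visits'] = (
    monthly_summary['orders'] > monthly_summary['prom_contacts']
)
print(monthly_summary)
```

Applied to `largest_sum_mt`, the same pattern would replace the separate January to June cells with a single table.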


### 2. Checking client with highest number of orders

In [39]:
most_orders_df = df[df['client_id']=='577029300']
most_orders_df

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.317,1188.17,8,148.521250,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.350,451.73,2,225.865000,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.000,0.00,0,0.000000,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.927,1494.96,12,124.580000,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.250,131.25,1,131.250000,0,4,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1012139,2024-12-11,Valencia,HR,577029300,376164172,1326.250,1301.94,11,118.358182,0,4,2024-12
1012962,2024-12-17,Valencia,HR,577029300,376164172,626.100,569.97,12,47.497500,0,4,2024-12
1014109,2024-12-25,Valencia,HR,577029300,376164172,861.810,1231.28,12,102.606667,0,4,2024-12
1014490,2024-12-27,Valencia,HR,577029300,376164172,462.000,350.00,1,350.000000,0,4,2024-12


In [42]:
#January
most_orders_df[most_orders_df['month'] == '2024-01']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.317,1188.17,8,148.52125,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.865,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.0,0.0,0,0.0,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.927,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
957434,2024-01-15,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
957688,2024-01-16,Valencia,HR,577029300,376164172,1104.707,892.85,10,89.285,0,4,2024-01
957942,2024-01-17,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
958401,2024-01-19,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
958656,2024-01-22,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01


In [50]:
#Investigating relationship between income and median ticket
most_orders_df[most_orders_df['month'] == '2024-01'].head(12)

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.317,1188.17,8,148.52125,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.865,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.0,0.0,0,0.0,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.927,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
957434,2024-01-15,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
957688,2024-01-16,Valencia,HR,577029300,376164172,1104.707,892.85,10,89.285,0,4,2024-01
957942,2024-01-17,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
958401,2024-01-19,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
958656,2024-01-22,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01


In [51]:
print(1188.17/8, 1494.96/12,1312.05/15 )

148.52125 124.58 87.47


In [52]:
#February
most_orders_df[most_orders_df['month'] == '2024-02']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
960835,2024-02-02,Valencia,HR,577029300,376164172,172.4,121.44,1,60.72,0,4,2024-02
961034,2024-02-05,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-02
961556,2024-02-07,Valencia,HR,577029300,376164172,972.447,1325.27,11,120.479091,0,4,2024-02
962150,2024-02-09,Valencia,HR,577029300,376164172,138.6,105.0,1,105.0,0,4,2024-02
962353,2024-02-12,Valencia,HR,577029300,376164172,138.6,-12.62,1,-4.206667,0,0,2024-02
962836,2024-02-14,Valencia,HR,577029300,376164172,885.62,718.67,12,59.889167,0,4,2024-02
963092,2024-02-15,Valencia,HR,577029300,376164172,60.0,325.2,1,325.2,0,4,2024-02
963353,2024-02-16,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-02
963537,2024-02-19,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-02
963775,2024-02-20,Valencia,HR,577029300,376164172,1682.604,1783.27,11,162.115455,0,4,2024-02


In [54]:
print(1325.27/11, 1783.27/11)

120.47909090909091 162.11545454545455


### 3. Exploring relationship between median ticket vs income (for positive income values only)

In [56]:
positive_income = df[df['income']>=0]
positive_income

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
0,2024-01-01,Alicante,AR,398150871,729030652,5.940,0.00,1,0.00,0,0,2024-01
1,2024-01-01,Alicante,HR,410234355,551409294,48.000,21.02,1,21.02,4,0,2024-01
2,2024-01-02,Alicante,AR,123463493,551409294,125.250,92.57,1,92.57,1,0,2024-01
3,2024-01-02,Alicante,AR,124527399,729030652,83.000,60.94,1,60.94,4,0,2024-01
4,2024-01-02,Alicante,AR,130100821,729030652,768.000,244.33,1,244.33,1,3,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1014960,2024-12-31,Valencia,HR,974505828,249555220,120.000,119.20,1,119.20,4,0,2024-12
1014961,2024-12-31,Valencia,HR,976757748,327176535,79.963,255.49,1,255.49,4,0,2024-12
1014962,2024-12-31,Valencia,HR,977650762,937854151,85.890,280.38,1,280.38,0,1,2024-12
1014963,2024-12-31,Valencia,HR,982745366,52875287,178.500,280.24,1,280.24,0,4,2024-12


In [57]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
positive_income = positive_income.assign(
    check=positive_income['income'] - positive_income['median_ticket'] * positive_income['number_of_orders']
)
positive_income


Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
0,2024-01-01,Alicante,AR,398150871,729030652,5.940,0.00,1,0.00,0,0,2024-01,0.0
1,2024-01-01,Alicante,HR,410234355,551409294,48.000,21.02,1,21.02,4,0,2024-01,0.0
2,2024-01-02,Alicante,AR,123463493,551409294,125.250,92.57,1,92.57,1,0,2024-01,0.0
3,2024-01-02,Alicante,AR,124527399,729030652,83.000,60.94,1,60.94,4,0,2024-01,0.0
4,2024-01-02,Alicante,AR,130100821,729030652,768.000,244.33,1,244.33,1,3,2024-01,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1014960,2024-12-31,Valencia,HR,974505828,249555220,120.000,119.20,1,119.20,4,0,2024-12,0.0
1014961,2024-12-31,Valencia,HR,976757748,327176535,79.963,255.49,1,255.49,4,0,2024-12,0.0
1014962,2024-12-31,Valencia,HR,977650762,937854151,85.890,280.38,1,280.38,0,1,2024-12,0.0
1014963,2024-12-31,Valencia,HR,982745366,52875287,178.500,280.24,1,280.24,0,4,2024-12,0.0


In [60]:
dif_from_0= positive_income[positive_income['check']!= 0.0]
dif_from_0

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
23,2024-01-02,Alicante,AR,529158878,91937945,301.212,146.36,3,48.786667,4,0,2024-01,-8.526513e-14
31,2024-01-02,Alicante,AR,688971420,729030652,1386.480,1655.90,4,551.966667,8,0,2024-01,-5.519667e+02
400,2024-01-05,Alicante,HR,542471045,729030652,274.000,174.02,2,174.020000,1,0,2024-01,-1.740200e+02
429,2024-01-08,Alicante,AR,194410127,729030652,252.320,197.30,2,197.300000,0,8,2024-01,-1.973000e+02
657,2024-01-10,Alicante,HR,324862039,729030652,50.858,129.80,3,43.266667,0,0,2024-01,-8.526513e-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1014638,2024-12-30,Valencia,HR,126703933,327176535,17.444,36.90,3,12.300000,4,0,2024-12,-7.105427e-15
1014667,2024-12-30,Valencia,HR,348190599,998162842,27.128,67.17,2,67.170000,4,0,2024-12,-6.717000e+01
1014689,2024-12-30,Valencia,HR,575399626,998162842,45.418,182.71,2,182.710000,4,0,2024-12,-1.827100e+02
1014893,2024-12-31,Valencia,HR,449827392,307450899,0.000,1302.44,0,1302.440000,0,0,2024-12,1.302440e+03


In [64]:
dif_from_0['check'].value_counts().sort_values(ascending=False)

check
-1.136868e-13    193
 1.136868e-13    177
 1.023182e-12    147
 8.526513e-14    141
-1.023182e-12    139
                ... 
 6.943750e+01      1
-5.292000e+01      1
 5.741500e+01      1
 7.553000e+01      1
 1.302440e+03      1
Name: count, Length: 10155, dtype: int64
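
Many of the non-zero `check` values above are of the order of 1e-13, which is floating-point rounding noise rather than a real discrepancy. A tolerance-based comparison separates the two cases. This is a minimal sketch on toy data; the one-cent tolerance and the `mismatch` column name are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy rows: the first and third are consistent, the second is not
toy = pd.DataFrame({
    'income': [146.36, 1655.90, 21.02],
    'number_of_orders': [3, 4, 1],
    'median_ticket': [146.36 / 3, 1655.90 / 3, 21.02],
})

expected_income = toy['median_ticket'] * toy['number_of_orders']

# Differences below one cent are treated as rounding noise (assumed tolerance)
toy['mismatch'] = ~np.isclose(toy['income'], expected_income, rtol=0.0, atol=0.01)
print(toy[toy['mismatch']])
```

With this filter only the rows whose income genuinely differs from `median_ticket * number_of_orders` remain.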

In [65]:
dif_from_0[dif_from_0['check']> 1]

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
924,2024-01-15,Alicante,AR,194410127,729030652,247.880,143.48,1,71.740,0,8,2024-01,71.740
1069,2024-01-16,Alicante,HR,739047412,729030652,0.000,25.69,0,25.690,0,0,2024-01,25.690
1230,2024-01-18,Alicante,AR,426657251,39304770,166.940,45.56,1,22.780,4,0,2024-01,22.780
1603,2024-01-24,Alicante,AR,531963963,218497097,8.580,870.04,1,435.020,0,0,2024-01,435.020
2483,2024-02-06,Alicante,AR,413503307,551409294,890.940,412.08,1,206.040,2,0,2024-02,206.040
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010580,2024-11-29,Valencia,HR,449827392,307450899,0.000,120.95,0,60.475,0,0,2024-11,120.950
1010703,2024-12-02,Valencia,AR,380180714,444765134,1551.336,583.99,1,291.995,0,4,2024-12,291.995
1011400,2024-12-05,Valencia,AR,588478841,52875287,245.900,94.58,1,47.290,4,0,2024-12,47.290
1011634,2024-12-09,Valencia,AR,644476280,998162842,1080.000,2743.49,1,1371.745,0,4,2024-12,1371.745


In [67]:
#checking within this client id 
client_194410127= df[df['client_id']== '194410127']
client_194410127

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.30,2,197.30,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.20,186.23,1,186.23,0,8,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
24826,2024-12-04,Alicante,AR,194410127,729030652,400.20,232.76,1,232.76,0,8,2024-12
25214,2024-12-11,Alicante,AR,194410127,729030652,384.60,255.60,1,255.60,0,8,2024-12
25518,2024-12-16,Alicante,AR,194410127,729030652,157.20,94.43,1,94.43,0,8,2024-12
25676,2024-12-18,Alicante,AR,194410127,729030652,190.80,123.63,1,123.63,0,8,2024-12


In [68]:
#January
client_194410127[client_194410127['month']=='2024-01']

## 2nd entry: the formula doesn't match (median_ticket equals income although number_of_orders is 2) and it doesn't seem to make sense

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.3,2,197.3,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.2,186.23,1,186.23,0,8,2024-01
1915,2024-01-29,Alicante,AR,194410127,729030652,931.74,362.69,2,181.345,0,8,2024-01


In [69]:
#February
client_194410127[client_194410127['month']=='2024-02']

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
2294,2024-02-02,Alicante,AR,194410127,729030652,246.0,240.49,1,240.49,0,8,2024-02
2889,2024-02-12,Alicante,AR,194410127,729030652,139.99,148.64,1,148.64,0,8,2024-02
3408,2024-02-19,Alicante,AR,194410127,729030652,410.4,267.14,1,267.14,0,8,2024-02
3576,2024-02-21,Alicante,AR,194410127,729030652,88.8,90.2,1,90.2,0,8,2024-02
3908,2024-02-26,Alicante,AR,194410127,729030652,258.0,132.94,1,132.94,0,8,2024-02


In [70]:
#March
client_194410127[client_194410127['month']=='2024-03']
## 3rd entry: the formula doesn't match (median_ticket equals income although number_of_orders is 2) and it doesn't seem to make sense

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
4433,2024-03-04,Alicante,AR,194410127,729030652,325.68,198.4,1,198.4,0,8,2024-03
4936,2024-03-11,Alicante,AR,194410127,729030652,492.55,330.09,1,330.09,0,8,2024-03
5489,2024-03-18,Alicante,AR,194410127,729030652,288.74,194.35,2,194.35,0,8,2024-03
5858,2024-03-22,Alicante,AR,194410127,729030652,186.0,170.23,1,170.23,0,8,2024-03
6173,2024-03-27,Alicante,AR,194410127,729030652,343.5,215.04,1,215.04,0,8,2024-03


In [71]:
#April
client_194410127[client_194410127['month']=='2024-04']
# In this case the 2nd entry makes sense (median_ticket = income / number_of_orders). The median ticket formula is not applied consistently across rows

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
6558,2024-04-03,Alicante,AR,194410127,729030652,385.35,254.26,1,254.26,0,8,2024-04
7155,2024-04-10,Alicante,AR,194410127,729030652,339.6,234.03,2,117.015,0,8,2024-04
7589,2024-04-17,Alicante,AR,194410127,729030652,338.0,251.73,1,251.73,0,8,2024-04
8133,2024-04-24,Alicante,AR,194410127,729030652,434.23,281.52,1,281.52,0,8,2024-04
8483,2024-04-29,Alicante,AR,194410127,729030652,272.79,165.59,1,165.59,0,8,2024-04


In [72]:
#May 
client_194410127[client_194410127['month']=='2024-05']
## 3rd entry: the formula doesn't match (median_ticket equals income although number_of_orders is 2) and it doesn't seem to make sense

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8946,2024-05-06,Alicante,AR,194410127,729030652,322.99,246.52,1,246.52,0,8,2024-05
9508,2024-05-13,Alicante,AR,194410127,729030652,333.25,211.0,1,211.0,0,8,2024-05
10083,2024-05-20,Alicante,AR,194410127,729030652,395.174,227.13,2,227.13,0,8,2024-05
10624,2024-05-27,Alicante,AR,194410127,729030652,331.74,186.21,1,186.21,0,8,2024-05
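
In the rows inspected above, `median_ticket` looks like `income` divided by some count that does not always equal `number_of_orders`. One way to probe this is to compute the divisor implied by each row. The sketch below uses four rows copied from the outputs above; `implied_divisor` and `divisor_matches_orders` are hypothetical names, and reading the ratio as a hidden ticket count is only an assumption.

```python
import numpy as np
import pandas as pd

# Rows copied from the outputs above (income, number_of_orders, median_ticket)
toy = pd.DataFrame({
    'income': [1655.90, 197.30, 143.48, 362.69],
    'number_of_orders': [4, 2, 1, 2],
    'median_ticket': [551.966667, 197.30, 71.74, 181.345],
})

# income / median_ticket, leaving NaN where median_ticket is 0
toy['implied_divisor'] = (
    toy['income'] / toy['median_ticket'].replace(0, np.nan)
).round(2)

# True only where the implied divisor agrees with the recorded order count
toy['divisor_matches_orders'] = toy['implied_divisor'] == toy['number_of_orders']
print(toy)
```

Running `value_counts()` on `implied_divisor - number_of_orders` over the full dataset would show how often, and in which direction, the two disagree.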


### 4. Exploring negative income values

In [73]:
df.head()

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
0,2024-01-01,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0,2024-01
1,2024-01-01,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0,2024-01
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0,2024-01
3,2024-01-02,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0,2024-01
4,2024-01-02,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3,2024-01


In [74]:
negative_income = df[df['income']<0]
negative_income

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
33,2024-01-02,Alicante,AR,697976135,39304770,0.000,-17.98,0,-17.98,0,0,2024-01
144,2024-01-03,Alicante,HR,310637681,91937945,-103.990,-60.90,1,-30.45,0,0,2024-01
289,2024-01-04,Alicante,HR,454699461,551409294,-1.200,-3.07,0,-3.07,0,0,2024-01
376,2024-01-05,Alicante,HR,129590664,91937945,-3.500,-21.34,0,-21.34,0,0,2024-01
441,2024-01-08,Alicante,AR,394499568,39304770,-10.200,-50.75,0,-50.75,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1012956,2024-12-17,Valencia,HR,527370739,327176535,-2.000,-46.90,0,-46.90,0,0,2024-12
1013410,2024-12-19,Valencia,HR,468061603,444765134,-0.134,-24.60,0,-24.60,0,0,2024-12
1014456,2024-12-27,Valencia,HR,377818083,460456701,6.000,-76.71,1,-76.71,0,0,2024-12
1014586,2024-12-30,Valencia,AR,496800310,460456701,0.000,-316.24,0,-316.24,0,0,2024-12


In [75]:
negative_income['number_of_orders'].value_counts()

number_of_orders
0    2681
1     346
2      14
3       2
Name: count, dtype: int64

In [76]:
pos_orders = negative_income[negative_income['number_of_orders'] > 0]
pos_orders

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
144,2024-01-03,Alicante,HR,310637681,91937945,-103.99,-60.90,1,-30.450000,0,0,2024-01
2021,2024-01-30,Alicante,AR,912139581,91937945,0.00,-0.31,1,-0.155000,0,0,2024-01
4053,2024-02-27,Alicante,HR,557261162,39304770,12.00,-745.01,1,-248.336667,0,0,2024-02
5928,2024-03-22,Alicante,HR,392868386,729030652,1.00,-35.37,1,-35.370000,0,0,2024-03
6319,2024-03-27,Alicante,HR,867912400,91937945,0.00,-21.12,1,-10.560000,0,0,2024-03
...,...,...,...,...,...,...,...,...,...,...,...,...
994639,2024-08-21,Valencia,AR,392518446,444765134,0.00,-2.62,1,-1.310000,0,0,2024-08
1001565,2024-10-03,Valencia,HR,371882962,249555220,108.00,-73.52,1,-73.520000,0,0,2024-10
1008398,2024-11-14,Valencia,HR,796836014,376164172,1.00,-0.52,1,-0.520000,0,0,2024-11
1009697,2024-11-25,Valencia,AR,859513033,249555220,-138.20,-105.35,1,-52.675000,0,0,2024-11


### 5. Exploring clients with total income < 0

In [78]:
df.groupby('client_id')['income'].sum().sort_values()

client_id
216722324    -17252.42
680649272     -2433.60
850271991     -2212.17
249067654     -1299.63
874370762      -758.87
               ...    
908335862    327914.20
697872448    351473.41
223570774    494810.12
215933226    527932.32
386121207    599249.42
Name: income, Length: 42149, dtype: float64

In [None]:
client_216722324 = df[df['client_id'] == '216722324']
client_216722324

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
387361,2024-02-26,Girona,HR,216722324,638196450,0.0,18056.83,0,18056.83,0,0,2024-02
393503,2024-05-29,Girona,HR,216722324,638196450,0.0,0.0,0,0.0,0,0,2024-05
398748,2024-08-09,Girona,HR,216722324,638196450,0.0,-12526.72,0,-6263.36,0,0,2024-08
406408,2024-12-16,Girona,HR,216722324,638196450,0.0,-22782.53,0,-22782.53,0,0,2024-12


In [None]:
client_680649272 = df[df['client_id'] == '680649272']
client_680649272

In [83]:
client_850271991 =df[df['client_id'] == '850271991']
client_850271991

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
477082,2024-01-18,Madrid,HR,850271991,64444732,0.0,-2212.17,0,-2212.17,0,0,2024-01


In [84]:
client_249067654 =df[df['client_id'] == '249067654']
client_249067654

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
387362,2024-02-26,Girona,HR,249067654,916155200,0.0,-1299.63,0,-1299.63,0,0,2024-02
388422,2024-03-13,Girona,HR,249067654,916155200,0.0,0.0,0,0.0,0,0,2024-03
393511,2024-05-29,Girona,HR,249067654,916155200,0.0,0.0,0,0.0,0,0,2024-05
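
Rather than inspecting each negative-total client in its own cell, the clients can be listed together with their order counts and number of rows. This is a minimal sketch on toy data; `totals` and `negative_total` are hypothetical names.

```python
import pandas as pd

# Toy data: A and C end the year with negative total income, B does not
toy = pd.DataFrame({
    'client_id': ['A', 'A', 'B', 'B', 'C'],
    'income': [18056.83, -22782.53, 100.0, -20.0, -2212.17],
    'number_of_orders': [0, 0, 1, 0, 0],
})

totals = toy.groupby('client_id').agg(
    total_income=('income', 'sum'),
    total_orders=('number_of_orders', 'sum'),
    n_rows=('income', 'size'),
)

# Keep only clients whose income sums to less than zero, most negative first
negative_total = totals[totals['total_income'] < 0].sort_values('total_income')
print(negative_total)
```

A `total_orders` of 0 for every such client would be consistent with the reimbursement hypothesis, although the data alone cannot confirm it.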


### 6. Checking whether median ticket equals income at the daily level --> it does not; the number of orders must also be taken into account

In [None]:
df['check'] = df['income'] - df['median_ticket']
df['check']

In [None]:
df['check'].value_counts(ascending=False)

In [None]:
df[df['check'] == 173.92]

In [None]:
# income = median_ticket * number_of_orders?
173.92 * 2 # appears to be so

In [None]:
df.head()

In [None]:
df['check_2'] = (df['median_ticket'] * df['number_of_orders']) - df['income']
df['check_2'].value_counts()

In [None]:
df['check_2'].max()

In [None]:
df[df['check_2']== 22782.53]

In [None]:
#it seems to be an issue where there are negative median tickets and income --> these are most likely reimbursements

In [None]:
# df_income_equal_to_or_greater_than_0 == df_ietogt0

df_ietogt0 = df[df['income'] >= 0]
df_ietogt0

In [None]:
df_ietogt0['check_2'].value_counts()

In [None]:
df_ietogt0['check_2'].max()

In [None]:
df_ietogt0[df_ietogt0['check_2']== 10108.8] #unsure what to do with this --> here income = median ticket and number of orders = 2

### 7. Exploring negative income and negative median ticket to check what this means

In [None]:
df_neg = df[df['income'] < 0]
df_neg

In [None]:
df_neg[df_neg['volume']>0]

In [None]:
print(f'There are {df_neg.shape[0]} entries where income < 0')

In [None]:
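# .copy() makes df_orders_0 an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
df_orders_0 = df_neg[df_neg['number_of_orders'] == 0].copy()
df_orders_0['neg_check'] = df_orders_0['income'] - df_orders_0['median_ticket']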

In [None]:
# errors='ignore' avoids a KeyError when the 'efficiency' column is not present
df_orders_0.drop(columns=['efficiency'], inplace=True, errors='ignore')

In [None]:
df_orders_0

In [None]:
df_orders_0[df_orders_0['neg_check'] == -13.355]

### 8. Client-by-client analysis

In [None]:
df.head()

In [None]:
# errors='ignore' avoids a KeyError when the 'efficiency' column is not present
df.drop(columns=['efficiency'], inplace=True, errors='ignore')

In [None]:
client_398150871 = df[df['client_id']== '398150871']
client_398150871

In [None]:
#January
## Appears that it has 4 promotional contacts monthly
## It appears that income = number_of_orders * median_ticket
# Efficient: no. of orders > contacts = 4
client_398150871[client_398150871['month'] == '2024-01']

Quick note --> check income vs median ticket where number_of_orders > 2: a true median ticket should generally not equal income / number_of_orders.

In [None]:
#February
# Not efficient: no of orders=3 < contacts=4
client_398150871[client_398150871['month'] == '2024-02']

In [None]:
#March
client_398150871[client_398150871['month'] == '2024-03']
# Efficient: no. of orders > contacts = 4

In [None]:
#April
client_398150871[client_398150871['month'] == '2024-04']

---

Thomas's ideas:
* Might be a good idea to do an analysis segmenting by city and channel, as profitability & efficiency might differ by channel &/or city.
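
The segmentation idea above could start from a grouped summary per city and channel. This is a minimal sketch on toy data; `segment_summary` and `income_per_order` are hypothetical names, and income per order is only one possible profitability proxy.

```python
import pandas as pd

# Toy data with the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    'city': ['Alicante', 'Alicante', 'Valencia', 'Valencia', 'Valencia'],
    'channel': ['AR', 'HR', 'HR', 'HR', 'AR'],
    'client_id': ['1', '2', '3', '4', '5'],
    'income': [100.0, 50.0, 200.0, 100.0, 30.0],
    'number_of_orders': [2, 1, 4, 2, 1],
})

# One row per (city, channel) segment
segment_summary = toy.groupby(['city', 'channel']).agg(
    clients=('client_id', 'nunique'),
    total_income=('income', 'sum'),
    total_orders=('number_of_orders', 'sum'),
).reset_index()

# Simple profitability proxy per segment
segment_summary['income_per_order'] = (
    segment_summary['total_income'] / segment_summary['total_orders']
)
print(segment_summary)
```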

In [None]:
df.head()

In [None]:
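# Double quotes outside, single quotes inside: nesting the same quote type in an f-string is a SyntaxError before Python 3.12
print(f"In this dataset there are {df['client_id'].nunique()} clients")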

We focus on HR (restaurants and bars) and AR (small shops, e.g. alimentación) --> Most complex to manage due to large numbers

In these channels, each client has a pre-established number of contacts per week.

These contacts are the number of times a client is contacted or visited by a promotor (an agent who takes the client's orders) within a month.

In [None]:
df.sort_values(by='client_id') #Maybe should add a month column and focus on a specific client (id) here.

Questions:
* Median Ticket: How does Median Ticket work? If median ticket is the median income generated by the orders of a certain client within the selected period then the data must also be aggregated at a daily level, right? (Also, what is *THE* selected period? Monthly? Weekly?)
* Number of Orders: I assume the number of orders is daily? It doesn't make sense that a client can have 17 orders in one day though?
* Number of contacts: Which is the established period? From what I can deduce monthly, but I am unsure
* What does a negative Median Ticket mean? This doesn't make sense: the minimum value a median ticket should have is 0.

In [None]:
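# Double quotes outside, single quotes inside: nesting the same quote type in an f-string is a SyntaxError before Python 3.12
print(f"Max. no. of orders: {df['number_of_orders'].max()}")
print(f"Min. no. of orders: {df['number_of_orders'].min()}")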

Extra notes: (Unsure whether to do this after aggregating)
* Maybe can add logistics cost: logistic_cost = 10€ * number_of_orders
* Maybe can add promotor visit cost: contact_cost = 15€ * prom_contacts_month
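
The two cost ideas above translate directly into columns. This is a minimal sketch on toy data that holds one month per client; the unit costs come from the notes above, while `net_income` is a hypothetical name for income after both costs.

```python
import pandas as pd

LOGISTIC_COST_PER_ORDER = 10.0    # EUR per order, taken from the note above
CONTACT_COST_PER_VISIT = 15.0     # EUR per promotor contact, taken from the note above

# Toy data: one row per client for a single month
toy = pd.DataFrame({
    'client_id': ['A', 'B'],
    'income': [500.0, 120.0],
    'number_of_orders': [4, 2],
    'prom_contacts_month': [2, 4],
})

toy['logistic_cost'] = LOGISTIC_COST_PER_ORDER * toy['number_of_orders']
toy['contact_cost'] = CONTACT_COST_PER_VISIT * toy['prom_contacts_month']

# Hypothetical profitability measure: income after logistics and visit costs
toy['net_income'] = toy['income'] - toy['logistic_cost'] - toy['contact_cost']
print(toy)
```

Because `prom_contacts_month` repeats on every daily row, the contact cost is best computed after aggregating to one row per client and month, otherwise it is counted once per order date.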

In [None]:
df[df['client_id'] == '653025']

<a id='3'></a>
# Part 3 - Aggregating the Client-Level Dataset

In [None]:
# --- STEP 1: Create `frequency` (median orders per month per client) ---
df['month'] = df['date'].dt.to_period('M')
print(df.head())

monthly_orders = df.groupby(['client_id', 'month'])['number_of_orders'].sum().reset_index()
print(monthly_orders.head())

frequency_df = monthly_orders.groupby('client_id')['number_of_orders'].median().reset_index()
frequency_df.rename(columns={'number_of_orders': 'frequency'}, inplace=True)
print(frequency_df.head())

In [None]:
# --- STEP 1.5: Validate that 'channel' and 'city' are unique per client ---
multi_channel = df.groupby('client_id')['channel'].nunique()
print("Clients with >1 unique channel:", (multi_channel > 1).sum())

multi_city = df.groupby('client_id')['city'].nunique()
print("Clients with >1 unique city:", (multi_city > 1).sum())

#Check to be able to agg by channel and city --> each client id has a unique city and channel

In [None]:
# --- STEP 2: Aggregate Remaining Data Per Client ---
# Notes:
# - 'income', 'volume', 'number_of_orders' -> summed: represents cumulative behavior
# - 'prom_contacts_month', 'tel_contacts_month' -> summed for now; these monthly values repeat on every row, so the sum is inflated (mean is the open alternative)
# - 'median_ticket' -> median to reduce outlier skew
# - 'channel', 'city' -> assumed to be static, validated above

client_df = df.groupby('client_id').agg({
    'income': 'sum',
    'volume': 'sum',  # volume = total weight/space across orders, relevant for logistics
    'number_of_orders': 'sum',
    'prom_contacts_month': 'sum', #mean?
    'tel_contacts_month': 'sum', #mean?
    'median_ticket': 'median',
    'channel': 'first',
    'city': 'first'
}).reset_index()

client_df.head()
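
The `#mean?` question above arises because the monthly contact figure repeats on every daily row, so a plain sum counts it once per order date. One option is to collapse to one value per client and month before summing. This is a minimal sketch on toy data; it assumes the maximum within a month is the true monthly figure, and `naive_total`, `monthly_contacts` and `deduplicated_total` are hypothetical names.

```python
import pandas as pd

# Client A has two order dates in January and one in February, 4 contacts per month
toy = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2024-01-03', '2024-01-22', '2024-02-15', '2024-01-05']),
    'prom_contacts_month': [4, 4, 4, 1],
})
toy['month'] = toy['date'].dt.to_period('M')

# Summing row by row counts January's 4 contacts twice for client A
naive_total = toy.groupby('client_id')['prom_contacts_month'].sum()

# Collapse to one figure per client and month, then sum across months
monthly_contacts = toy.groupby(['client_id', 'month'])['prom_contacts_month'].max()
deduplicated_total = monthly_contacts.groupby(level='client_id').sum()

print(naive_total)
print(deduplicated_total)
```

Here client A has 12 contacts under the naive sum but 8 after de-duplication, which would change any efficiency ratio built on the total.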

In [None]:
# --- STEP 3: Merge Frequency ---
client_df = client_df.merge(frequency_df, on='client_id', how='left')

In [None]:
# --- STEP 4: Create Efficiency Feature ---
client_df['efficiency'] = client_df['number_of_orders'] / client_df['prom_contacts_month']

In [None]:
# --- STEP 5: Handle Division by Zero or NaNs ---
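# Replace +/-inf with NaN explicitly; passing None as the value can make older pandas versions forward-fill instead
client_df['efficiency'] = client_df['efficiency'].replace([np.inf, -np.inf], np.nan)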
client_df['efficiency'] = client_df['efficiency'].fillna(0)
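
Setting efficiency to 0 for clients without promotor contacts makes them indistinguishable from clients who were visited but never ordered. A flag column keeps the two cases apart. This is a minimal sketch on toy data; `prom_contacts` and `no_contacts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data: a visited client, an unvisited client with orders, an inactive client
toy = pd.DataFrame({
    'number_of_orders': [10, 5, 0],
    'prom_contacts': [5, 0, 0],
})

# Division by zero yields inf, and 0/0 yields NaN
ratio = toy['number_of_orders'] / toy['prom_contacts']

# Map inf and NaN to 0, as in the step above
toy['efficiency'] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Keep track of which zeros come from having no contacts at all
toy['no_contacts'] = toy['prom_contacts'] == 0
print(toy)
```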

In [None]:
client_df.dtypes

In [None]:
client_df.duplicated().any()

In [None]:
client_df.isna().any().sum()

In [None]:
client_df

In [None]:
client_df.rename(columns={
    'prom_contacts_month': 'total_prom_contacts',
    'tel_contacts_month': 'total_tel_contacts'
}, inplace=True)
# Renaming columns to avoid confusion --> these are no longer monthly figures but total contacts

In [None]:
#client_df.to_csv('dataset/aggregated_client_data.csv')

<a id='3'></a>
# Part 4 - Exploratory Data Analysis (aggregated)

In [None]:
# Set plot style
sns.set(style='whitegrid')

### Income & Median ticket per client

In [None]:
client_df['income'].describe()

In [None]:
# 1. Distribution of Total Income
plt.figure(figsize=(6, 4))
sns.histplot(client_df['income'], bins=50, kde=True)
plt.title('Distribution of Total Income per Client')
plt.xlabel('Total Income (€)')
plt.ylabel('Number of Clients')
plt.tight_layout()
plt.show()

In [None]:
neg_income = client_df[client_df['income'] < 0]
neg_ticket = client_df[client_df['median_ticket'] < 0]
print(f"Negative income clients: {len(neg_income)}")
print(f"Negative ticket clients: {len(neg_ticket)}")

# Optional: see overlap
neg_both = client_df[(client_df['income'] < 0) & (client_df['median_ticket'] < 0)]

In [None]:
len(neg_both)

In [None]:
#to drop
#client_df = client_df[(client_df['income'] >= 0) & (client_df['median_ticket'] >= 0)]


In [None]:
client_df['median_ticket'].describe()

In [None]:
# 2. Distribution of Median Ticket
plt.figure(figsize=(6, 4))
sns.histplot(client_df['median_ticket'], bins=50, kde=True)
plt.axvline(80, color='red', linestyle='--', label='Ticket Threshold (80€)')
plt.title('Distribution of Median Ticket per Client')
plt.xlabel('Median Ticket (€)')
plt.ylabel('Number of Clients')
plt.legend()
plt.tight_layout()
plt.show()

### 3. Efficiency Distribution

In [None]:
# 3. Efficiency Distribution
plt.figure(figsize=(6, 4))
sns.histplot(client_df['efficiency'], bins=50, kde=True)
plt.title('Distribution of Client Efficiency (Orders / Physical Contacts)')
plt.xlabel('Efficiency')
plt.ylabel('Number of Clients')
plt.tight_layout()
plt.show()

### 4. Orders vs Promotor Contacts (Scatter)

In [None]:
# 4. Orders vs Promotor Contacts (Scatter)
plt.figure(figsize=(6, 5))
# The contact column was renamed to 'total_prom_contacts' in Part 3
sns.scatterplot(data=client_df, x='total_prom_contacts', y='number_of_orders', hue='channel', alpha=0.6, edgecolor='w')
plt.plot([0, client_df['total_prom_contacts'].max()], [0, client_df['total_prom_contacts'].max()], '--', color='grey', label='Ideal 1:1 Line')
plt.title('Number of Orders vs. Promotor Contacts')
plt.xlabel('Total Promotor Contacts')
plt.ylabel('Total Orders')
plt.legend()
plt.tight_layout()
plt.show()

### 5. Average Efficiency by Channel

In [None]:
# 5. Average Efficiency by Channel
plt.figure(figsize=(6, 4))
sns.barplot(data=client_df, x='channel', y='efficiency', estimator=np.mean)
plt.title('Average Efficiency by Channel')
plt.xlabel('Channel')
plt.ylabel('Avg Efficiency')
plt.tight_layout()
plt.show()

### 6. Correlation Matrix

In [None]:
# 6. Correlation Matrix
plt.figure(figsize=(8, 6))
# The contact column was renamed to 'total_prom_contacts' in Part 3
corr = client_df[['income', 'volume', 'number_of_orders', 'total_prom_contacts', 'median_ticket', 'frequency', 'efficiency']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()