# **Pascual Capstone Group 4 Notebook - Questions**

Group members: *Abdullah Alshaarawi, James Alarde, Hiromitsu Fujiyama, Sanjo Joy, Thomas Arturo Renwick Morales*

---

This notebook is organized in the following sections:

* [Part 0 - Importing the Necessary Libraries](#0)

* [Part 1 - Data Loading](#1)

* [Part 2 - Data Cleaning/ Wrangling](#2)
  * [Part 2.1 - Preliminary Analysis of the Dataset](#2.1)
  * [Part 2.2 - Converting Column Names to Pythonic Snake-Case](#2.2)
  * [Part 2.3 - Dealing with Duplicates](#2.3)
  * [Part 2.4 - Ensuring Correct Data Types](#2.4)
  * [Part 2.5 - Dealing with Null/Missing Values](#2.5)
  * [Part 2.6 - Final Checks](#2.6)

* [Part 3 - **Questions for Pascual**](#3)
  * [Part 3.1 - *Data-related questions*](#3.1)
  * [Part 3.2 - *Conceptual questions*](#3.2)

---

<a id='0'></a>
# Part 0 - Importing the Necessary Libraries

First, we imported the libraries which were necessary for our analysis.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
import numpy as np
import joblib

# Displaying only 2 decimal places for visual purposes
pd.set_option('display.float_format', '{:.2f}'.format) 

In [2]:
#To reset to default display option if needed later on:
## pd.reset_option('display.float_format')

<a id='1'></a>
# Part 1 - Data Loading

Then, we proceeded to load the dataset.

In [3]:
df = pd.read_csv('dataset/Orders_Master_Data(in).csv')

<a id='2'></a>
# Part 2 - Data Cleaning/ Wrangling

<a id='2.1'></a>
## Part 2.1 - Preliminary Analysis of the Dataset

Before beginning the data cleaning/wrangling, we ran basic pandas functions to get a preliminary view of the dataset.

In [4]:
df.head()

index,Date,City,Channel,Client ID,Promotor ID,Volume,Income,Number of orders,Median Ticket (€),Prom Contacts Month,Tel Contacts Month
0,01.01.2024,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0
1,01.01.2024,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0
2,02.01.2024,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0
3,02.01.2024,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0
4,02.01.2024,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035735 entries, 0 to 1035734
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   Date                 1035735 non-null  object 
 1   City                 1035735 non-null  object 
 2   Channel              1035735 non-null  object 
 3   Client ID            1035735 non-null  int64  
 4   Promotor ID          1035735 non-null  int64  
 5   Volume               1035735 non-null  float64
 6   Income               1035735 non-null  float64
 7   Number of orders     1035735 non-null  int64  
 8   Median Ticket (€)    1035735 non-null  float64
 9   Prom Contacts Month  1035735 non-null  int64  
 10  Tel Contacts Month   1035735 non-null  int64  
dtypes: float64(3), int64(5), object(3)
memory usage: 86.9+ MB


<a id='2.2'></a>
## Part 2.2 - Converting Column Names to Pythonic Snake-Case

Next, we converted the column names to Pythonic snake-case, as this simplifies the later machine learning steps.

In [6]:
df = df.rename(columns={'Date':'date', 
                        'City':'city', 
                        'Channel':'channel', 
                        'Client ID': 'client_id',
                        'Promotor ID': 'promotor_id',
                        'Volume': 'volume',
                        'Income': 'income',
                        'Number of orders': 'number_of_orders',
                        'Median Ticket (€)':'median_ticket',
                        'Prom Contacts Month': 'prom_contacts_month',
                        'Tel Contacts Month': 'tel_contacts_month'})
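The same renaming could also be done programmatically rather than with a hand-written mapping. The following is only a sketch: `to_snake_case` is a hypothetical helper (not part of the notebook), and it assumes that unit suffixes in parentheses, such as `(€)`, should simply be dropped.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a column label and join its words with underscores."""
    # Drop anything in parentheses, e.g. the "(€)" unit suffix (assumption)
    name = re.sub(r"\(.*?\)", "", name)
    # Replace every run of non-alphanumeric characters with a single underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()

# Empty toy frame that only carries three of the original column labels
example = pd.DataFrame(columns=["Client ID", "Median Ticket (€)", "Number of orders"])
example = example.rename(columns=to_snake_case)
print(list(example.columns))
```

An explicit mapping, as used in the cell above, remains the safer choice when exact control over each name is needed.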

<a id='2.3'></a>
## Part 2.3 - Dealing with Duplicates

Then, we checked if there were duplicates, which was in fact the case.

In [7]:
df.duplicated().any()

True

We then counted how many rows in the dataset were exact duplicates: 20,770 of 1,035,735 rows (about 2%).

In [8]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")

Exact Duplicates: 20770 out of 1035735


We inspected the duplicated rows to confirm that they were exact duplicates.

In [9]:
exact_duplicates = df[df.duplicated(keep=False)]
exact_duplicates.sort_values(by=['client_id', 'date']).head(10)

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month
919356,11.03.2024,Tarragona,HR,100854769,306190165,54.3,117.02,1,117.02,4,0
1018754,11.03.2024,Tarragona,HR,100854769,306190165,54.3,117.02,1,117.02,4,0
917803,12.02.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
1017201,12.02.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
925032,13.06.2024,Tarragona,HR,100854769,306190165,45.2,90.5,1,90.5,4,0
1024430,13.06.2024,Tarragona,HR,100854769,306190165,45.2,90.5,1,90.5,4,0
930843,16.09.2024,Tarragona,HR,100854769,306190165,129.0,74.14,1,74.14,4,0
1030241,16.09.2024,Tarragona,HR,100854769,306190165,129.0,74.14,1,74.14,4,0
917063,29.01.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0
1016461,29.01.2024,Tarragona,HR,100854769,306190165,105.0,45.93,1,45.93,4,0


As these were exact duplicates, we dropped the duplicated rows, keeping the first occurrence of each so as not to lose any valuable data points.

In [10]:
df = df.drop_duplicates(keep='first')
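To illustrate what `keep='first'` does, here is a small sketch on a toy frame (the values are made up for illustration). It also shows `reset_index(drop=True)`, an optional step that the notebook does not perform and that would renumber the rows after the drop.

```python
import pandas as pd

# Toy data: the second row is an exact copy of the first
toy = pd.DataFrame({
    "client_id": ["a", "a", "b", "b"],
    "date": ["01.01.2024", "01.01.2024", "02.01.2024", "03.01.2024"],
    "income": [10.0, 10.0, 5.0, 5.0],
})

# duplicated() marks every repeat of an earlier row (the first occurrence is not marked)
n_dupes = int(toy.duplicated().sum())
share = n_dupes / len(toy)

# Keep the first occurrence of each duplicated row; optionally renumber the index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"{n_dupes} duplicate(s), {share:.0%} of rows, {len(deduped)} rows kept")
```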

We checked once more, to see if we had dealt with the duplicates properly and to observe if there were any remaining ones.

In [11]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")

Exact Duplicates: 0 out of 1014965


As there weren't any remaining duplicates (i.e., we had dealt with them properly), we next ensured that each column had the appropriate data type.

<a id='2.4'></a>
## Part 2.4 - Ensuring Correct Data Types

After the preliminary view of the dataset, we determined that the columns should be of the following data types:
* `date`: datetime
* `city`: object
* `channel`: object	
* `client_id`: object
* `promotor_id` : object
* `volume`: float	
* `income`: float	
* `number_of_orders`: integer	
* `median_ticket`: float	
* `prom_contacts_month`: integer	
* `tel_contacts_month`: integer

Therefore we proceeded to check if the columns were in fact in the data types we wanted them to be.

In [12]:
df.dtypes

date                    object
city                    object
channel                 object
client_id                int64
promotor_id              int64
volume                 float64
income                 float64
number_of_orders         int64
median_ticket          float64
prom_contacts_month      int64
tel_contacts_month       int64
dtype: object

Most columns were already of the appropriate data type, except for `date`, `client_id`, and `promotor_id`. We therefore converted these three columns to their appropriate data types.

In [13]:
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
df['client_id'] = df['client_id'].astype(str)
df['promotor_id'] = df['promotor_id'].astype(str)
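With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not match. A more defensive variant, sketched below on toy data, uses `errors="coerce"` so that unparsable values become `NaT` and can be counted. The malformed date string is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["01.01.2024", "31.12.2024", "2024-02-30"],  # last value is deliberately malformed
    "client_id": [398150871, 410234355, 123463493],
})

# Values that do not match day.month.year become NaT instead of raising an error
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y", errors="coerce")
n_unparsed = int(parsed.isna().sum())

# Identifiers are labels, not quantities, so they are stored as strings
toy["client_id"] = toy["client_id"].astype(str)
print(f"{n_unparsed} date(s) could not be parsed")
```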

We made a final check to make sure we had properly transformed these columns into their correct data type.

In [14]:
df.dtypes

date                   datetime64[ns]
city                           object
channel                        object
client_id                      object
promotor_id                    object
volume                        float64
income                        float64
number_of_orders                int64
median_ticket                 float64
prom_contacts_month             int64
tel_contacts_month              int64
dtype: object

With all columns in the correct data type, we proceeded to check for missing/null values in the dataset.

<a id='2.5'></a>
## Part 2.5 - Dealing with Null/Missing Values

In [15]:
df.isna().any()

date                   False
city                   False
channel                False
client_id              False
promotor_id            False
volume                 False
income                 False
number_of_orders       False
median_ticket          False
prom_contacts_month    False
tel_contacts_month     False
dtype: bool

In [16]:
df.isna().any().sum()

0

We found there were no missing/null values across the whole dataset.
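Note that `df.isna().any().sum()` counts the number of columns that contain at least one null, not the number of null cells. The sketch below, on made-up toy data, contrasts the two counts and adds a per-column breakdown.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "volume": [1.0, np.nan, 3.0],
    "income": [np.nan, np.nan, 2.0],
    "city": ["A", "B", "C"],
})

cols_with_nulls = int(toy.isna().any().sum())  # columns containing at least one null
null_cells = int(toy.isna().sum().sum())       # total number of null cells
per_column = toy.isna().sum()                  # null count for each column
print(per_column)
```

For a dataset without nulls, as here, both counts are 0, so the conclusion above is unaffected.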

Having completed this data wrangling step, we proceeded to make some final checks to the dataset before proceeding to aggregate data at the client-level.

<a id='2.6'></a>
## Part 2.6 - Final Checks

We took another preliminary view of the data.

In [17]:
df.head()

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month
0,2024-01-01,Alicante,AR,398150871,729030652,5.94,0.0,1,0.0,0,0
1,2024-01-01,Alicante,HR,410234355,551409294,48.0,21.02,1,21.02,4,0
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0
3,2024-01-02,Alicante,AR,124527399,729030652,83.0,60.94,1,60.94,4,0
4,2024-01-02,Alicante,AR,130100821,729030652,768.0,244.33,1,244.33,1,3


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1014965 entries, 0 to 1014964
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype         
---  ------               --------------    -----         
 0   date                 1014965 non-null  datetime64[ns]
 1   city                 1014965 non-null  object        
 2   channel              1014965 non-null  object        
 3   client_id            1014965 non-null  object        
 4   promotor_id          1014965 non-null  object        
 5   volume               1014965 non-null  float64       
 6   income               1014965 non-null  float64       
 7   number_of_orders     1014965 non-null  int64         
 8   median_ticket        1014965 non-null  float64       
 9   prom_contacts_month  1014965 non-null  int64         
 10  tel_contacts_month   1014965 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(3), object(4)
memory usage: 92.9+ MB


We checked for duplicates again.

In [19]:
df.duplicated().any()

False

We checked for data types again.

In [20]:
df.dtypes

date                   datetime64[ns]
city                           object
channel                        object
client_id                      object
promotor_id                    object
volume                        float64
income                        float64
number_of_orders                int64
median_ticket                 float64
prom_contacts_month             int64
tel_contacts_month              int64
dtype: object

Finally, we checked for any missing/null values again.

In [21]:
df.isna().any()

date                   False
city                   False
channel                False
client_id              False
promotor_id            False
volume                 False
income                 False
number_of_orders       False
median_ticket          False
prom_contacts_month    False
tel_contacts_month     False
dtype: bool

In [22]:
df.isna().any().sum()

0

Having confirmed that the dataset was clean, we moved on to the next step, which was to aggregate the data at the client-level.

As a precaution, the following cell (commented out here) writes a copy of the cleaned dataset to CSV.

In [23]:
#df.to_csv('dataset/clean_orders_data/clean_orders_data.csv', index=False)
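One thing to keep in mind when reloading such a CSV: data types are not stored in the file, so ID columns would be read back as integers and dates as plain strings unless told otherwise. The sketch below does the round trip through an in-memory buffer instead of a file. The second `client_id`, with a leading zero, is hypothetical and only there to show what would be lost.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "client_id": ["398150871", "012345678"],  # second ID is hypothetical (leading zero)
    "income": [0.0, 21.02],
})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended types explicitly when reading back
restored = pd.read_csv(buffer, dtype={"client_id": str}, parse_dates=["date"])
print(restored.dtypes)
```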

---

<a id='3'></a>
# Part 3 - **Questions for Pascual**

<a id='3.1'></a>
## Part 3.1 - *Data-related questions*

In [24]:
#Create month column for the analysis
df['month'] = df['date'].dt.to_period('M')

**NOTE**: We split our analysis by positive and negative entries of income.

### 1. What is the relationship between median ticket and number of orders with net income?

We found it follows this general rule: 
* income = median_ticket * number_of_orders

However, some rows do not follow this rule. For these exceptions:
* income != median_ticket * number_of_orders
* Instead --> income = median_ticket

In other words, these rows do not take the number of orders into account.

For this case, we only took into account entries with income equal to or greater than 0 (non-negative income).

In [25]:
#.copy() avoids pandas' SettingWithCopyWarning when adding the new column below
positive_income = df[df['income']>=0].copy()

#Created a column to check the consistency of this income and median_ticket relationship
positive_income['check'] = positive_income['income'] - (positive_income['median_ticket']*positive_income['number_of_orders'])

#The idea is that this consistency check should be equal to 0
#If this is not equal to 0, then the relationship does not hold

#This new dataframe is to check those inconsistencies
#We use a threshold of > 1, as "inconsistencies" which are due to rounding are not an issue (minimal error)
#Note: this filter only flags rows where income exceeds the product by more than 1;
#rows where income falls short of it (check < -1) are not included
inconsistent_rel = positive_income[positive_income['check']>1]

#We found there are around 2000 entries which do not match this criterion
inconsistent_rel

#This is not a very large proportion of the whole dataset, should we drop them?
#How would you recommend we proceed?


index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01,71.74
1069,2024-01-16,Alicante,HR,739047412,729030652,0.00,25.69,0,25.69,0,0,2024-01,25.69
1230,2024-01-18,Alicante,AR,426657251,39304770,166.94,45.56,1,22.78,4,0,2024-01,22.78
1603,2024-01-24,Alicante,AR,531963963,218497097,8.58,870.04,1,435.02,0,0,2024-01,435.02
2483,2024-02-06,Alicante,AR,413503307,551409294,890.94,412.08,1,206.04,2,0,2024-02,206.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010580,2024-11-29,Valencia,HR,449827392,307450899,0.00,120.95,0,60.48,0,0,2024-11,120.95
1010703,2024-12-02,Valencia,AR,380180714,444765134,1551.34,583.99,1,292.00,0,4,2024-12,292.00
1011400,2024-12-05,Valencia,AR,588478841,52875287,245.90,94.58,1,47.29,4,0,2024-12,47.29
1011634,2024-12-09,Valencia,AR,644476280,998162842,1080.00,2743.49,1,1371.74,0,4,2024-12,1371.74
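The check above can be sketched in a symmetric form that labels deviations in both directions. The toy rows reuse the January values of client 194410127 shown in this notebook. The tolerance of 1 mirrors the threshold used above, and the `status` labels are names made up for this illustration.

```python
import numpy as np
import pandas as pd

# January rows of client 194410127 (income, number_of_orders, median_ticket)
toy = pd.DataFrame({
    "income": [207.48, 197.30, 143.48, 362.69],
    "number_of_orders": [1, 2, 1, 2],
    "median_ticket": [207.48, 197.30, 71.74, 181.34],
})

expected = toy["median_ticket"] * toy["number_of_orders"]
diff = toy["income"] - expected
tolerance = 1.0  # assumption: differences within 1 are treated as rounding noise

# Label each row by the direction of its deviation from the general rule
toy["status"] = np.select(
    [diff.abs() <= tolerance, diff > tolerance],
    ["consistent", "income_above_product"],
    default="income_below_product",
)
print(toy)
```

Here the second row (income equal to the median ticket despite two orders) falls below the product, while the third row (income twice the median ticket with one order) lies above it.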


In [26]:
#We explored the data a bit more 
#-->by month for a client within this inconsistent_rel dataframe
## (first client_id which appears in the inconsistent_rel dataframe)
client_194410127= df[df['client_id']== '194410127']
client_194410127

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.30,2,197.30,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.20,186.23,1,186.23,0,8,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
24826,2024-12-04,Alicante,AR,194410127,729030652,400.20,232.76,1,232.76,0,8,2024-12
25214,2024-12-11,Alicante,AR,194410127,729030652,384.60,255.60,1,255.60,0,8,2024-12
25518,2024-12-16,Alicante,AR,194410127,729030652,157.20,94.43,1,94.43,0,8,2024-12
25676,2024-12-18,Alicante,AR,194410127,729030652,190.80,123.63,1,123.63,0,8,2024-12


In [27]:
#January
client_194410127[client_194410127['month']=='2024-01']
##2nd entry: the relationship is inconsistent

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8,2024-01-02,Alicante,AR,194410127,729030652,350.69,207.48,1,207.48,0,8,2024-01
429,2024-01-08,Alicante,AR,194410127,729030652,252.32,197.3,2,197.3,0,8,2024-01
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01
1093,2024-01-17,Alicante,AR,194410127,729030652,148.99,162.11,1,162.11,0,8,2024-01
1369,2024-01-22,Alicante,AR,194410127,729030652,271.2,186.23,1,186.23,0,8,2024-01
1915,2024-01-29,Alicante,AR,194410127,729030652,931.74,362.69,2,181.34,0,8,2024-01


In [28]:
#March
client_194410127[client_194410127['month']=='2024-03']
##3rd entry: the relationship is inconsistent

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
4433,2024-03-04,Alicante,AR,194410127,729030652,325.68,198.4,1,198.4,0,8,2024-03
4936,2024-03-11,Alicante,AR,194410127,729030652,492.55,330.09,1,330.09,0,8,2024-03
5489,2024-03-18,Alicante,AR,194410127,729030652,288.74,194.35,2,194.35,0,8,2024-03
5858,2024-03-22,Alicante,AR,194410127,729030652,186.0,170.23,1,170.23,0,8,2024-03
6173,2024-03-27,Alicante,AR,194410127,729030652,343.5,215.04,1,215.04,0,8,2024-03


In [29]:
#May 
client_194410127[client_194410127['month']=='2024-05']
##3rd entry: the relationship is inconsistent

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
8946,2024-05-06,Alicante,AR,194410127,729030652,322.99,246.52,1,246.52,0,8,2024-05
9508,2024-05-13,Alicante,AR,194410127,729030652,333.25,211.0,1,211.0,0,8,2024-05
10083,2024-05-20,Alicante,AR,194410127,729030652,395.17,227.13,2,227.13,0,8,2024-05
10624,2024-05-27,Alicante,AR,194410127,729030652,331.74,186.21,1,186.21,0,8,2024-05


### 2. What does negative income for an entry mean? Are these reimbursements?

We would assume that these are reimbursements; however, there is something we don't understand:
* Across rows with negative income, the volume can be positive, negative or zero. What does this mean?

In [30]:
negative_income = df[df['income']<0]

In [31]:
#Negative volume
negative_vol = negative_income[negative_income['volume']< 0]
negative_vol

#There are 1689 rows with negative income entries and negative volume

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
144,2024-01-03,Alicante,HR,310637681,91937945,-103.99,-60.90,1,-30.45,0,0,2024-01
289,2024-01-04,Alicante,HR,454699461,551409294,-1.20,-3.07,0,-3.07,0,0,2024-01
376,2024-01-05,Alicante,HR,129590664,91937945,-3.50,-21.34,0,-21.34,0,0,2024-01
441,2024-01-08,Alicante,AR,394499568,39304770,-10.20,-50.75,0,-50.75,0,0,2024-01
991,2024-01-15,Alicante,HR,986671407,551409294,-2.34,-23.92,0,-23.92,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1009697,2024-11-25,Valencia,AR,859513033,249555220,-138.20,-105.35,1,-52.67,0,0,2024-11
1012470,2024-12-13,Valencia,AR,835982499,460456701,-90.00,-113.52,0,-113.52,0,0,2024-12
1012956,2024-12-17,Valencia,HR,527370739,327176535,-2.00,-46.90,0,-46.90,0,0,2024-12
1013410,2024-12-19,Valencia,HR,468061603,444765134,-0.13,-24.60,0,-24.60,0,0,2024-12


In [32]:
#Positive volume
positive_vol = negative_income[negative_income['volume']> 0]
positive_vol

#There are 238 rows with negative income entries and positive volume

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
4053,2024-02-27,Alicante,HR,557261162,39304770,12.00,-745.01,1,-248.34,0,0,2024-02
5928,2024-03-22,Alicante,HR,392868386,729030652,1.00,-35.37,1,-35.37,0,0,2024-03
10607,2024-05-24,Alicante,HR,867377147,729030652,120.00,-120.22,1,-60.11,0,0,2024-05
11572,2024-06-06,Alicante,HR,279230392,39304770,34.95,-13.72,1,-13.72,0,0,2024-06
13430,2024-07-02,Alicante,AR,688556611,551409294,11.50,-6.12,1,-6.12,0,0,2024-07
...,...,...,...,...,...,...,...,...,...,...,...,...
987813,2024-07-04,Valencia,HR,761984352,139088935,143.16,-42.43,1,-42.43,0,0,2024-07
988980,2024-07-11,Valencia,HR,297605797,52875287,32.64,-123.23,1,-123.23,0,0,2024-07
1001565,2024-10-03,Valencia,HR,371882962,249555220,108.00,-73.52,1,-73.52,0,0,2024-10
1008398,2024-11-14,Valencia,HR,796836014,376164172,1.00,-0.52,1,-0.52,0,0,2024-11


In [33]:
#Zero volume
zero_vol = negative_income[negative_income['volume'] == 0]
zero_vol

#There are 1116 rows with negative income entries and zero volume

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
33,2024-01-02,Alicante,AR,697976135,39304770,0.00,-17.98,0,-17.98,0,0,2024-01
1295,2024-01-18,Alicante,HR,972240381,551409294,0.00,-3.60,0,-3.60,0,0,2024-01
1323,2024-01-19,Alicante,AR,702594377,218497097,0.00,-6.04,0,-6.04,0,0,2024-01
1516,2024-01-23,Alicante,HR,101926782,91937945,0.00,-49.68,0,-49.68,0,0,2024-01
2021,2024-01-30,Alicante,AR,912139581,91937945,0.00,-0.31,1,-0.15,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1009521,2024-11-22,Valencia,HR,211786455,52875287,0.00,-21.14,0,-21.14,0,0,2024-11
1010185,2024-11-27,Valencia,HR,371882962,249555220,0.00,-90.32,0,-45.16,0,0,2024-11
1011607,2024-12-09,Valencia,AR,383525450,998162842,0.00,-776.16,0,-776.16,0,0,2024-12
1011624,2024-12-09,Valencia,AR,576161489,998162842,0.00,-86.18,0,-86.18,0,0,2024-12
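The three filters above could also be condensed into a single tabulation of the volume sign among negative-income rows. This is only a sketch on a handful of toy rows whose values are taken from the tables in this notebook. The `sign_labels` mapping is a name introduced here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "income": [-60.90, -745.01, -17.98, 21.02, -3.07],
    "volume": [-103.99, 12.00, 0.00, 48.00, -1.20],
})

# Keep only negative-income rows, then classify each by the sign of its volume
negative = toy[toy["income"] < 0]
sign_labels = {-1.0: "negative", 0.0: "zero", 1.0: "positive"}
volume_sign = np.sign(negative["volume"]).map(sign_labels)

counts = volume_sign.value_counts()
print(counts)
```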


### 3. What does it mean to have clients with a negative total income (i.e., the sum of income across all entries amounts to a negative number)?

In [34]:
#Summing total income across all entries by client
total_income = df.groupby('client_id')['income'].sum().sort_values()

#Creating a series which shows just negative total income (from most negative to least)
negative_income_clients = total_income[total_income < 0]

negative_income_clients.sort_values()

client_id
216722324   -17252.42
680649272    -2433.60
850271991    -2212.17
249067654    -1299.63
874370762     -758.87
686054949     -714.27
833968223     -604.13
434326174     -540.00
954110509     -441.56
193935197     -388.95
671328462     -329.81
572342924     -288.74
338415545     -279.69
326529393     -263.93
375350895     -249.09
119006128     -222.92
300010850     -210.92
473454133     -171.29
534193468     -155.28
533833249     -132.39
358565223     -110.22
327274699      -98.28
708945253      -87.16
127804495      -76.62
364776507      -68.21
327696319      -61.26
155877396      -59.21
353065642      -56.40
868006169      -56.00
516009589      -55.90
113456478      -55.90
545660944      -55.69
620948028      -55.60
462322659      -50.84
626656988      -49.50
909210497      -43.96
615645715      -32.60
908365968      -24.76
937477703      -21.60
129761549      -21.28
619532117      -20.55
523894728      -17.89
227049778      -16.64
223778293      -13.07
603561469       -8.15


In [35]:
print(f'There are {len(negative_income_clients)} clients with a negative total income')

There are 55 clients with a negative total income


What does this mean? What should we do with these clients? Should we drop them from our analysis?

### 4. What does it mean to have more than one order on a specific date?

In this case, we took the client with the highest total number of orders for demonstration purposes and did a short month-by-month analysis.

In [36]:
highest_no_of_orders_client = df.groupby('client_id')['number_of_orders'].sum().sort_values(ascending=False).index[0]
most_orders_df = df[df['client_id']== highest_no_of_orders_client]
most_orders_df

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.32,1188.17,8,148.52,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.87,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.00,0.00,0,0.00,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.93,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1012139,2024-12-11,Valencia,HR,577029300,376164172,1326.25,1301.94,11,118.36,0,4,2024-12
1012962,2024-12-17,Valencia,HR,577029300,376164172,626.10,569.97,12,47.50,0,4,2024-12
1014109,2024-12-25,Valencia,HR,577029300,376164172,861.81,1231.28,12,102.61,0,4,2024-12
1014490,2024-12-27,Valencia,HR,577029300,376164172,462.00,350.00,1,350.00,0,4,2024-12


In [37]:
#January
most_orders_df[most_orders_df['month'] == '2024-01'].head()

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
955431,2024-01-03,Valencia,HR,577029300,376164172,934.32,1188.17,8,148.52,0,4,2024-01
955927,2024-01-05,Valencia,HR,577029300,376164172,581.35,451.73,2,225.87,0,4,2024-01
956153,2024-01-08,Valencia,HR,577029300,376164172,0.0,0.0,0,0.0,0,0,2024-01
956402,2024-01-09,Valencia,HR,577029300,376164172,992.93,1494.96,12,124.58,0,4,2024-01
957212,2024-01-12,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-01


In [38]:
#February
most_orders_df[most_orders_df['month'] == '2024-02'].head()

##Entry 5: Negative income, median ticket doesn't follow the standard relationship
##positive volume --> All related to previous questions. 

##What do we do with these types of entries?

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
960835,2024-02-02,Valencia,HR,577029300,376164172,172.4,121.44,1,60.72,0,4,2024-02
961034,2024-02-05,Valencia,HR,577029300,376164172,173.25,131.25,1,131.25,0,4,2024-02
961556,2024-02-07,Valencia,HR,577029300,376164172,972.45,1325.27,11,120.48,0,4,2024-02
962150,2024-02-09,Valencia,HR,577029300,376164172,138.6,105.0,1,105.0,0,4,2024-02
962353,2024-02-12,Valencia,HR,577029300,376164172,138.6,-12.62,1,-4.21,0,0,2024-02


In [39]:
#March
most_orders_df[most_orders_df['month'] == '2024-03'].head()

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
965913,2024-03-01,Valencia,HR,577029300,376164172,231.0,175.0,1,175.0,0,4,2024-03
966134,2024-03-04,Valencia,HR,577029300,376164172,231.0,175.0,1,175.0,0,4,2024-03
966661,2024-03-06,Valencia,HR,577029300,376164172,831.22,858.69,8,107.34,0,4,2024-03
966963,2024-03-07,Valencia,HR,577029300,376164172,214.8,155.6,2,77.8,0,4,2024-03
967284,2024-03-08,Valencia,HR,577029300,376164172,138.6,105.0,1,105.0,0,4,2024-03


Therefore, our question is, what does one row represent in this dataset?
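One way to narrow this question down is to test whether a candidate key, such as (client_id, date), identifies each row uniquely. The sketch below does this on toy data. The repeated (client_id, date) pair is invented to show what a violation would look like and is not a finding from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["577029300", "577029300", "577029300", "100006690"],
    "date": pd.to_datetime(["2024-01-03", "2024-01-03", "2024-01-05", "2024-01-03"]),
    "number_of_orders": [8, 1, 2, 1],  # the second row is a hypothetical repeat of the key
})

key = ["client_id", "date"]  # assumption: candidate definition of "one row"
rows_per_key = toy.groupby(key).size()

# If every key occurs exactly once, one row corresponds to one client-day
is_unique_key = bool(rows_per_key.max() == 1)
repeated_keys = rows_per_key[rows_per_key > 1]
print(is_unique_key)
print(repeated_keys)
```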

### 5. What do rows with number of orders = 0 mean? Are we able to just discard these?

We found there are approximately 16,000 rows with number of orders = 0.

In [40]:
zero_orders_df = df[df['number_of_orders']== 0]
zero_orders_df

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
33,2024-01-02,Alicante,AR,697976135,39304770,0.00,-17.98,0,-17.98,0,0,2024-01
64,2024-01-02,Alicante,HR,852243122,729030652,0.00,0.00,0,0.00,0,0,2024-01
103,2024-01-03,Alicante,AR,702594377,218497097,0.00,0.00,0,0.00,0,0,2024-01
140,2024-01-03,Alicante,HR,255446686,551409294,0.00,0.00,0,0.00,0,0,2024-01
141,2024-01-03,Alicante,HR,266456261,551409294,0.00,0.00,0,0.00,0,0,2024-01
...,...,...,...,...,...,...,...,...,...,...,...,...
1014805,2024-12-31,Valencia,AR,661172837,327176535,0.00,0.00,0,0.00,0,0,2024-12
1014881,2024-12-31,Valencia,HR,382663741,327176535,0.00,0.00,0,0.00,0,0,2024-12
1014893,2024-12-31,Valencia,HR,449827392,307450899,0.00,1302.44,0,1302.44,0,0,2024-12
1014915,2024-12-31,Valencia,HR,650595433,376164172,0.00,0.00,0,0.00,0,0,2024-12


Most of the rows with zero orders have promotor contacts = 0. Could this be what is leading to inconsistencies across the values of prom_contacts_month? (More detail on this below.)

In [41]:
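#Double quotes outside, single quotes inside: reusing the same quote type inside an f-string is a SyntaxError before Python 3.12
print(f"{zero_orders_df['prom_contacts_month'].value_counts()}")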

prom_contacts_month
0    15674
4      144
1      104
2       98
8       13
3        4
Name: count, dtype: int64


### 6. Is it normal that the client which generates the highest median ticket/income has 0 promotor visits?

In [42]:
#Double quotes outside, single quotes inside: reusing the same quote type inside an f-string is a SyntaxError before Python 3.12
print(f"Client which generates the highest median ticket --> client_id = {df.groupby('client_id')['median_ticket'].sum().sort_values(ascending=False).index[0]}")
print(f"Client which generates the highest total income --> client_id = {df.groupby('client_id')['income'].sum().sort_values(ascending=False).index[0]}")

Client which generates the highest median ticket --> client_id = 386121207
Client which generates the highest total income --> client_id = 386121207


The client which generates the highest income and median ticket is the same one. This makes sense as both these features are clearly related.

In [43]:
highest_income_client = df.groupby('client_id')['income'].sum().sort_values(ascending=False).index[0]
highest_income_client_df = df[df['client_id']==highest_income_client]
highest_income_client_df['prom_contacts_month'].value_counts()

prom_contacts_month
0    31
Name: count, dtype: int64

In [44]:
highest_income_client_df.head(3)

index,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month
871432,2024-01-03,Sevilla,AR,386121207,662836107,18570.68,27796.43,1,27796.43,0,2,2024-01
873833,2024-01-22,Sevilla,AR,386121207,662836107,6131.34,9251.91,1,9251.91,0,2,2024-01
875031,2024-01-31,Sevilla,AR,386121207,662836107,9064.8,14361.84,1,14361.84,0,2,2024-01


For these high-income clients with zero or few visits, do you want us to increase the number of visits? Is this also an objective of the capstone? Put differently, should efficiency (no. of orders / prom contacts month) always equal 1, or should we focus only on clients with efficiency < 1?

### 7. What is the final outcome you are expecting/looking for?

daily table --> monthly table --> yearly table (for 2024)

Yearly table has the following components:
* client_id
* city (which is unique for each client_id --> we have checked)
* channel (which is unique for each client_id --> we have checked)
* volume (yearly sum)
* income (yearly sum)
* number of orders (yearly sum)
* median ticket (**unsure what to do with median ticket**)
* prom_contacts_month (established number of monthly contacts for the client)
* tel_contacts_month (established number of monthly tel contacts for the client)
* frequency (median number of monthly orders --> can be compared against prom_contacts_month)
* efficiency (yearly --> prom_contacts_month * 12 / number of orders)
* logistics cost (yearly --> 10 * number of orders)
* visit cost (yearly --> prom_contacts_month * 12 * 15)
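The claim that `city` and `channel` are unique for each `client_id` can be verified with a `groupby(...).nunique()` check. The sketch below runs it on toy data in which client "b" is deliberately given two cities. All values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["a", "a", "b", "b"],
    "city": ["Madrid", "Madrid", "Valencia", "Sevilla"],  # client "b" is deliberately inconsistent
    "channel": ["AR", "AR", "HR", "HR"],
})

# Number of distinct cities and channels observed for each client
distinct = toy.groupby("client_id")[["city", "channel"]].nunique()

# Clients with more than one value in either column would make 'first' aggregation ambiguous
ambiguous_clients = distinct[(distinct > 1).any(axis=1)].index.tolist()
print(ambiguous_clients)
```

An empty list means that taking the `first` value per client is safe for these columns.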

(Another issue we are having is understanding the yearly table --> it mixes monthly and yearly components, which may be confusing)

Let's assume the data is fully clean and that we have taken care of the median ticket issues, etc.

In [45]:
clean_df = df[(df['number_of_orders'] != 0) & (df['income'] > 0)]
#Here we have no orders equal to 0 and positive income only

In [46]:
#This is the monthly aggregated table --> intermediate step
monthly_df = clean_df.groupby(['client_id', 'month']).agg({'city':'first',
                                              'channel':'first',
                                              'volume':'sum',
                                              'income':'sum',
                                              'number_of_orders':'sum',
                                              'median_ticket':'median', #Unsure here
                                              'prom_contacts_month': 'first'})

monthly_df

client_id,month,city,channel,volume,income,number_of_orders,median_ticket,prom_contacts_month
100006690,2024-01,Madrid,AR,202.50,203.99,2,102.00,2
100006690,2024-02,Madrid,AR,195.12,160.66,2,80.33,2
100006690,2024-03,Madrid,AR,138.53,111.39,2,55.70,2
100006690,2024-04,Madrid,AR,156.67,184.01,3,54.37,2
100006690,2024-05,Madrid,AR,175.29,172.16,3,59.78,2
...,...,...,...,...,...,...,...,...
999976985,2024-07,Barcelona,HR,270.00,1024.60,4,227.80,1
999976985,2024-08,Barcelona,HR,291.00,1217.87,4,300.48,1
999976985,2024-09,Barcelona,HR,78.00,240.43,1,240.43,1
999976985,2024-10,Barcelona,HR,209.00,806.33,3,211.83,1


Is this the type of table you would like us to analyse and segment by low median ticket clients (<80€) and by inefficient clients (efficiency < 1)?
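
To make the question concrete, here is a minimal sketch of how the two thresholds mentioned above could be turned into segment flags. It assumes a per-client table that already has `median_ticket` and `efficiency` columns; the toy values, the flag names `low_ticket` and `inefficient`, and the constant names are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy per-client table; the values are invented for illustration only.
toy_yearly = pd.DataFrame({
    'client_id': ['A', 'B', 'C', 'D'],
    'median_ticket': [60.0, 130.0, 75.0, 90.0],
    'efficiency': [1.09, 0.46, 0.52, 2.67],
})

TICKET_THRESHOLD = 80      # low median ticket: < 80
EFFICIENCY_THRESHOLD = 1   # inefficient: efficiency < 1

toy_yearly['low_ticket'] = toy_yearly['median_ticket'] < TICKET_THRESHOLD
toy_yearly['inefficient'] = toy_yearly['efficiency'] < EFFICIENCY_THRESHOLD

# Cross-tabulate the two flags to count clients in each combination.
segment_counts = pd.crosstab(toy_yearly['low_ticket'], toy_yearly['inefficient'])
print(toy_yearly)
print(segment_counts)
```

Keeping the thresholds as named constants makes it easy to adjust them once the expected segmentation is confirmed.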

In [47]:
#Fully aggregated table --> one row per client id, at a yearly level (missing extra features)

yearly_df = monthly_df.groupby('client_id').agg(
    city=('city', 'first'),
    channel=('channel', 'first'),
    volume=('volume', 'sum'),
    income=('income', 'sum'),
    number_of_orders=('number_of_orders', 'sum'),
    median_ticket=('median_ticket', 'median'), #Unsure here
    prom_contacts_month=('prom_contacts_month', 'first'),
    frequency=('number_of_orders', 'median')
).reset_index()

#renaming columns which are yearly columns
yearly_df.rename(columns={'volume': 'yearly_volume',
                          'income': 'yearly_income',
                          'number_of_orders':'yearly_number_of_orders'},
                          inplace=True)

#Creating the new features --> as specified above in markdown
yearly_df['efficiency'] = yearly_df['prom_contacts_month']*12 / yearly_df['yearly_number_of_orders']
yearly_df['logistics_cost'] = yearly_df['yearly_number_of_orders']*10
yearly_df['visit_cost'] = yearly_df['prom_contacts_month'] *12 *15
yearly_df['total_cost'] = yearly_df['logistics_cost'] + yearly_df['visit_cost']

#displaying output
yearly_df

Unnamed: 0,client_id,city,channel,yearly_volume,yearly_income,yearly_number_of_orders,median_ticket,prom_contacts_month,frequency,efficiency,logistics_cost,visit_cost,total_cost
0,100006690,Madrid,AR,1658.71,1494.53,22,60.40,2,2.00,1.09,220,360,580
1,100008050,Barcelona,AR,3982.00,1905.59,14,132.37,0,1.00,0.00,140,0,140
2,100042162,Barcelona,HR,1812.85,2243.30,18,128.78,4,2.00,2.67,180,720,900
3,100046227,Barcelona,AR,4590.18,2273.12,16,129.06,2,4.00,1.50,160,360,520
4,100125158,Cádiz,HR,1266.50,2204.24,26,87.58,1,3.00,0.46,260,180,440
...,...,...,...,...,...,...,...,...,...,...,...,...,...
41846,999934164,Barcelona,HR,691.00,785.30,23,38.22,1,2.00,0.52,230,180,410
41847,999940211,Barcelona,AR,557.82,260.55,3,76.43,0,1.00,0.00,30,0,30
41848,999940578,Madrid,AR,1101.52,1044.61,13,70.44,2,1.00,1.85,130,360,490
41849,999941988,Madrid,AR,5343.15,3828.31,35,108.86,2,3.00,0.69,350,360,710
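
On the `#Unsure here` aggregation of `median_ticket`: a median of monthly medians is generally not the same as the median over all rows of the year. The sketch below shows the difference on invented data and also computes one possible alternative, a yearly average ticket defined as total income divided by total number of orders. That alternative is only an assumption on our side, and the names `toy_daily` and `avg_ticket` are hypothetical.

```python
import pandas as pd

# Invented daily rows for one client; each row has a single order.
toy_daily = pd.DataFrame({
    'client_id': ['A'] * 5,
    'month': ['2024-01', '2024-01', '2024-01', '2024-02', '2024-02'],
    'income': [10.0, 20.0, 300.0, 40.0, 50.0],
    'number_of_orders': [1, 1, 1, 1, 1],
})
toy_daily['median_ticket'] = toy_daily['income'] / toy_daily['number_of_orders']

# Two-step approach: median per month, then median of those monthly medians.
monthly_median = toy_daily.groupby(['client_id', 'month'])['median_ticket'].median()
median_of_medians = monthly_median.groupby(level='client_id').median()

# Median taken directly over all rows of the year.
direct_median = toy_daily.groupby('client_id')['median_ticket'].median()

# Assumed alternative: average ticket = total income / total orders.
totals = toy_daily.groupby('client_id')[['income', 'number_of_orders']].sum()
avg_ticket = totals['income'] / totals['number_of_orders']

print(median_of_medians['A'], direct_median['A'], avg_ticket['A'])
```

On this toy client the three numbers differ (32.5, 40.0 and 84.0), which is why the choice of aggregation for the yearly table matters.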


In [48]:
#Checking there are no clients with negative yearly income
yearly_df[yearly_df['yearly_income'] < 0]

Unnamed: 0,client_id,city,channel,yearly_volume,yearly_income,yearly_number_of_orders,median_ticket,prom_contacts_month,frequency,efficiency,logistics_cost,visit_cost,total_cost


In [49]:
#Checking there are no clients with negative yearly volumes
yearly_df[yearly_df['yearly_volume'] < 0]

Unnamed: 0,client_id,city,channel,yearly_volume,yearly_income,yearly_number_of_orders,median_ticket,prom_contacts_month,frequency,efficiency,logistics_cost,visit_cost,total_cost


<a id='3.2'></a>
## Part 3.2 - *Conceptual questions*

### 1. In which unit is Volume recorded (kg, litres, cases, other)? 

We assume it is recorded in kg.

### 2. What does one row represent in this dataset?
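
While this question is open, one way to probe it from the data is to test whether a candidate key repeats. The sketch below assumes, purely for illustration, that one row is one client on one date; the toy data and the `candidate_key` name are hypothetical, and the same check could be run on the real dataframe with any other combination of columns.

```python
import pandas as pd

# Invented rows: client A appears twice on 2024-01-02.
toy = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'client_id': ['A', 'B', 'A', 'A'],
    'number_of_orders': [1, 1, 1, 2],
})

candidate_key = ['date', 'client_id']  # assumption: one row per client per day

# keep=False marks every row of a repeated key, not only the later ones.
repeated = toy[toy.duplicated(subset=candidate_key, keep=False)]
print(len(repeated), "rows share a (date, client_id) pair")
```

If the real data returns no repeated rows for a given key, that key is at least consistent with being the row identifier.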

### 3. Who decides frequency? Isn't frequency decided by the client, as they choose how many times they can order?

---

### Checking if the number of promotional contacts is consistent across the whole dataset for every client.

We first checked whether prom_contacts_month stayed the same for every client across the raw dataset, that is, without excluding returns/reimbursements, rows with number of orders equal to 0, or rows with zero or negative income. On this unfiltered data, prom_contacts_month did vary for some clients.

In [50]:
# Group by client_id and count unique values of prom_contacts_month
prom_contact_variability = df.groupby('client_id')['prom_contacts_month'].nunique().reset_index()
prom_contact_variability.columns = ['client_id', 'unique_prom_contacts_values']

# Filter clients with more than one unique value
inconsistent_clients = prom_contact_variability[prom_contact_variability['unique_prom_contacts_values'] > 1]

print("Number of clients with inconsistent prom_contacts_month:", inconsistent_clients.shape[0])
print("List of clients with inconsistencies:")
print(inconsistent_clients)

Number of clients with inconsistent prom_contacts_month: 8540
List of clients with inconsistencies:
       client_id  unique_prom_contacts_values
4      100125158                            2
26     100570715                            2
33     100607540                            2
34     100648094                            2
45     100900019                            2
...          ...                          ...
42133  999773029                            2
42135  999807215                            2
42142  999905686                            2
42146  999940578                            2
42147  999941988                            2

[8540 rows x 2 columns]


Therefore, we excluded rows with number of orders equal to 0 and rows with zero or negative income, and checked again whether prom_contacts_month is consistent for each client. After this filtering, prom_contacts_month stayed the same for every client throughout the whole dataset.

In [51]:
no_zero_orders = df[df['number_of_orders'] != 0]
positive_income_nzo = no_zero_orders[no_zero_orders['income']> 0]
prom_contact_variability = positive_income_nzo.groupby('client_id')['prom_contacts_month'].nunique().reset_index()
prom_contact_variability.columns = ['client_id', 'unique_prom_contacts_values']

# Filter clients with more than one unique value
inconsistent_clients = prom_contact_variability[prom_contact_variability['unique_prom_contacts_values'] > 1]

print("Number of clients with inconsistent prom_contacts_month:", inconsistent_clients.shape[0])
print("List of clients with inconsistencies:")
print(inconsistent_clients)

Number of clients with inconsistent prom_contacts_month: 0
List of clients with inconsistencies:
Empty DataFrame
Columns: [client_id, unique_prom_contacts_values]
Index: []


If we removed rows with number of orders = 0 and rows with zero or negative income, we would still keep approximately 98% of the dataset.

In [52]:
(len(positive_income_nzo) / len(df))*100

98.16772006916494

Removing rows with zero or negative income, as well as rows with zero orders, does not eliminate the inconsistencies in the relationship between income and median ticket.

In [53]:
#Working on an explicit copy so that adding a column does not raise SettingWithCopyWarning
positive_income_nzo = positive_income_nzo.copy()
positive_income_nzo['check'] = positive_income_nzo['income'] - (positive_income_nzo['median_ticket']*positive_income_nzo['number_of_orders'])

#The idea is that this consistency check should be equal to 0
#If this is not equal to 0, then the relationship does not hold

#This new dataframe is to check those inconsistencies
#We use a threshold of > 1, as "inconsistencies" due to rounding are not an issue (minimal error)
#Note: only positive differences are flagged; rows where income is below median_ticket*number_of_orders are not
inconsistent_rel = positive_income_nzo[positive_income_nzo['check']>1]
inconsistent_rel


Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
924,2024-01-15,Alicante,AR,194410127,729030652,247.88,143.48,1,71.74,0,8,2024-01,71.74
1230,2024-01-18,Alicante,AR,426657251,39304770,166.94,45.56,1,22.78,4,0,2024-01,22.78
1603,2024-01-24,Alicante,AR,531963963,218497097,8.58,870.04,1,435.02,0,0,2024-01,435.02
2483,2024-02-06,Alicante,AR,413503307,551409294,890.94,412.08,1,206.04,2,0,2024-02,206.04
2612,2024-02-07,Alicante,HR,102561197,39304770,29.00,111.33,1,55.66,4,0,2024-02,55.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1009106,2024-11-20,Valencia,AR,785149986,998162842,491.33,369.47,1,184.74,4,0,2024-11,184.74
1009156,2024-11-20,Valencia,HR,256289268,52875287,71.25,55.05,1,27.52,2,0,2024-11,27.52
1010703,2024-12-02,Valencia,AR,380180714,444765134,1551.34,583.99,1,292.00,0,4,2024-12,292.00
1011400,2024-12-05,Valencia,AR,588478841,52875287,245.90,94.58,1,47.29,4,0,2024-12,47.29
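
In the rows displayed above, `number_of_orders` is 1 and `check` is roughly equal to `median_ticket`, which suggests that income is about twice the median ticket on those rows. A small sketch of how that pattern could be quantified, using four of the displayed rows. The `ratio` column is an illustrative addition, and the absolute value is our own change so that negative discrepancies would be caught as well.

```python
import pandas as pd

# Four rows copied from the output above.
rows = pd.DataFrame({
    'income': [143.48, 45.56, 870.04, 412.08],
    'median_ticket': [71.74, 22.78, 435.02, 206.04],
    'number_of_orders': [1, 1, 1, 1],
})

expected_income = rows['median_ticket'] * rows['number_of_orders']
rows['check'] = rows['income'] - expected_income
rows['ratio'] = rows['income'] / expected_income

# Symmetric tolerance: flags income above or below the expected value.
flagged = rows[rows['check'].abs() > 1]
print(flagged[['income', 'median_ticket', 'check', 'ratio']])
print(rows['ratio'].round(2).value_counts())
```

Running the same `ratio` count on the full `inconsistent_rel` dataframe would show whether the factor of two holds for all flagged rows or only for the ones displayed.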


In [54]:
positive_income_nzo

Unnamed: 0,date,city,channel,client_id,promotor_id,volume,income,number_of_orders,median_ticket,prom_contacts_month,tel_contacts_month,month,check
1,2024-01-01,Alicante,HR,410234355,551409294,48.00,21.02,1,21.02,4,0,2024-01,0.00
2,2024-01-02,Alicante,AR,123463493,551409294,125.25,92.57,1,92.57,1,0,2024-01,0.00
3,2024-01-02,Alicante,AR,124527399,729030652,83.00,60.94,1,60.94,4,0,2024-01,0.00
4,2024-01-02,Alicante,AR,130100821,729030652,768.00,244.33,1,244.33,1,3,2024-01,0.00
5,2024-01-02,Alicante,AR,159147063,91937945,756.00,229.82,1,229.82,1,1,2024-01,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1014960,2024-12-31,Valencia,HR,974505828,249555220,120.00,119.20,1,119.20,4,0,2024-12,0.00
1014961,2024-12-31,Valencia,HR,976757748,327176535,79.96,255.49,1,255.49,4,0,2024-12,0.00
1014962,2024-12-31,Valencia,HR,977650762,937854151,85.89,280.38,1,280.38,0,1,2024-12,0.00
1014963,2024-12-31,Valencia,HR,982745366,52875287,178.50,280.24,1,280.24,0,4,2024-12,0.00


## More tests

In [None]:
clean_df = df[(df['number_of_orders'] != 0) & (df['income'] > 0)]

In [None]:
clean_df

In [None]:
clean_df['income'].min()

Checking with this dataset if promotional contacts are consistent across the whole dataset for every client --> They are

In [None]:
prom_contact_variability = clean_df.groupby('client_id')['prom_contacts_month'].nunique().reset_index()
prom_contact_variability.columns = ['client_id', 'unique_prom_contacts_values']

# Filter clients with more than one unique value
inconsistent_clients = prom_contact_variability[prom_contact_variability['unique_prom_contacts_values'] > 1]

print("Number of clients with inconsistent prom_contacts_month:", inconsistent_clients.shape[0])
print("List of clients with inconsistencies:")
print(inconsistent_clients)

Checking if volume values make sense here --> no, there are still some negative volumes. Unsure what to do here.

In [None]:
len(clean_df[clean_df['volume'] < 0])

In [None]:
clean_df[clean_df['volume'] < 0]

In [None]:
# therefore, for the analysis let's suppose we drop it
clean_df = clean_df[clean_df['volume'] >=0]

In [None]:
clean_df

Checking consistency of median ticket and income

In [None]:
#Creating a column to check the consistency of the income and median_ticket relationship
#assign returns a new dataframe, which avoids SettingWithCopyWarning on the filtered clean_df
clean_df = clean_df.assign(check=clean_df['income'] - (clean_df['median_ticket']*clean_df['number_of_orders']))

#The idea is that this consistency check should be equal to 0
#If this is not equal to 0, then the relationship does not hold

#This new dataframe is to check those inconsistencies
#We use a threshold of > 1, as "inconsistencies" due to rounding are not an issue (minimal error)
inconsistent_rel = clean_df[clean_df['check']>1]

#We found there are around 2000 entries which do not meet this criterion
inconsistent_rel

#This is not a very large proportion of the whole dataset, should we drop them?
#How would you recommend we proceed?

Checking whether the relationship holds in total for a given customer, i.e. whether the per-row differences sum to zero.

In [None]:
#Comparing as strings so the filter works whether client_id is stored as str or int
client_194410127 = clean_df[clean_df['client_id'].astype(str) == '194410127']
client_194410127

In [None]:
client_194410127.groupby('client_id').agg({'income':'sum',
                                           'volume':'sum',
                                           'number_of_orders':'sum',
                                           'check': 'sum'})



In [None]:
70* 190806.71

---

Creating the aggregated table --> desired output?

daily table --> monthly table --> yearly table (for 2024)

Yearly table has the following components:
* client_id
* city (which is unique for each client_id --> we have checked)
* channel (which is unique for each client_id --> we have checked)
* volume (yearly sum)
* income (yearly sum)
* number of orders (yearly sum)
* median ticket (unsure what to do with median ticket)
* prom_contacts_month (established number of monthly contacts for the client)
* tel_contacts_month (established number of monthly tel contacts for the client)
* frequency (median number of monthly orders --> can be compared against prom_contacts_month)
* efficiency (yearly --> prom_contacts_month * 12 / number of orders)
* logistics cost (yearly --> 10 * number of orders)
* visit cost (prom_contacts_month * 12 * 15)

The issue we are having trying to understand the yearly table is that it has monthly and yearly components and this may be confusing.

Then --> ...

In [None]:
full_table = clean_df.groupby(['client_id']).agg({'city':'first',
                                              'channel':'first',
                                              'volume':'sum',
                                              'income':'sum',
                                              'number_of_orders':'sum',
                                              'median_ticket':'median', #Unsure here
                                              'prom_contacts_month': 'first'})



In [None]:
yearly_df = monthly_df.groupby(['client_id']).agg({'city':'first',
                                              'channel':'first',
                                              'volume':'sum',
                                              'income':'sum',
                                              'number_of_orders':['sum', 'median'],
                                              'median_ticket':'median', #Unsure here
                                              'prom_contacts_month': 'first'})
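
Passing a list such as `['sum', 'median']` inside a dictionary aggregation makes pandas return two-level column labels, for example `('number_of_orders', 'sum')`. A minimal sketch of one way to flatten them, on invented data; the joined names such as `number_of_orders_sum` are just one possible convention, and the toy dataframe names are hypothetical.

```python
import pandas as pd

# Invented monthly rows for two clients.
toy_monthly = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'B', 'B'],
    'income': [100.0, 150.0, 50.0, 80.0, 20.0],
    'number_of_orders': [2, 3, 1, 4, 2],
})

toy_yearly = toy_monthly.groupby('client_id').agg({
    'income': 'sum',
    'number_of_orders': ['sum', 'median'],
})

# Columns are now tuples like ('number_of_orders', 'sum'); join the two levels.
toy_yearly.columns = ['_'.join(col) for col in toy_yearly.columns]
toy_yearly = toy_yearly.reset_index()
print(toy_yearly)
```

Named aggregation, as used in the earlier `yearly_df` cell, avoids the two-level columns altogether and may be easier to read.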