# **Pascual Capstone Group 4 Notebook - Route Optimization**

Group members: *Abdullah Alshaarawi, James Alarde, Hiromitsu Fujiyama, Sanjo Joy, Thomas Arturo Renwick Morales*

---

This notebook is organized in the following sections:

* [Part 0 - Importing the Necessary Libraries](#0)

* [Part 1 - Data Loading](#1)

* [Part 2 - Data Cleaning/Wrangling](#2)
  * [Part 2.1 - Preliminary Analysis of the Dataset](#2.1)
  * [Part 2.2 - Dealing with Duplicates](#2.2)
  * [Part 2.3 - Ensuring Correct Data Types](#2.3)
  * [Part 2.4 - Dealing with Null/Missing Values](#2.4)
  * [Part 2.5 - Creating New Columns or Aggregating?](#2.5)

* [Part 3 - Exploratory Data Analysis](#3)



---

<a id='0'></a>
# Part 0 - Importing the Necessary Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import numpy
import joblib

<a id='1'></a>
# Part 1 - Data Loading

In [2]:
df = pd.read_csv('dataset/Orders_Master_Data(in).csv')

<a id='2'></a>
# Part 2 - Data Cleaning/Wrangling

<a id='2.1'></a>
## Part 2.1 - Preliminary Analysis of the Dataset

In [3]:
df.head()

|   | Date | City | Channel | Client ID | Promotor ID | Volume | Income | Number of orders | Median Ticket (€) | Prom Contacts Month | Tel Contacts Month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01.01.2024 | Alicante | AR | 398150871 | 729030652 | 5.94 | 0.0 | 1 | 0.0 | 0 | 0 |
| 1 | 01.01.2024 | Alicante | HR | 410234355 | 551409294 | 48.0 | 21.02 | 1 | 21.02 | 4 | 0 |
| 2 | 02.01.2024 | Alicante | AR | 123463493 | 551409294 | 125.25 | 92.57 | 1 | 92.57 | 1 | 0 |
| 3 | 02.01.2024 | Alicante | AR | 124527399 | 729030652 | 83.0 | 60.94 | 1 | 60.94 | 4 | 0 |
| 4 | 02.01.2024 | Alicante | AR | 130100821 | 729030652 | 768.0 | 244.33 | 1 | 244.33 | 1 | 3 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035735 entries, 0 to 1035734
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   Date                 1035735 non-null  object 
 1   City                 1035735 non-null  object 
 2   Channel              1035735 non-null  object 
 3   Client ID            1035735 non-null  int64  
 4   Promotor ID          1035735 non-null  int64  
 5   Volume               1035735 non-null  float64
 6   Income               1035735 non-null  float64
 7   Number of orders     1035735 non-null  int64  
 8   Median Ticket (€)    1035735 non-null  float64
 9   Prom Contacts Month  1035735 non-null  int64  
 10  Tel Contacts Month   1035735 non-null  int64  
dtypes: float64(3), int64(5), object(3)
memory usage: 86.9+ MB

<a id='2.2'></a>
## Part 2.2 - Dealing with Duplicates

In [28]:
df.duplicated().any()

True

In [29]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")


Exact Duplicates: 20770 out of 1035735


In [27]:
# Show every copy of each exact duplicate (keep=False flags all occurrences) to confirm
# the rows are truly identical before dropping the repeats and keeping only the first occurrence.
duplicate_rows = df[df.duplicated(keep=False)]
duplicate_rows.sort_values(by=['Client ID', 'Date']).head(10)

|   | Date | City | Channel | Client ID | Promotor ID | Volume | Income | Number of orders | Median Ticket (€) | Prom Contacts Month | Tel Contacts Month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 919356 | 11.03.2024 | Tarragona | HR | 100854769 | 306190165 | 54.295 | 117.02 | 1 | 117.02 | 4 | 0 |
| 1018754 | 11.03.2024 | Tarragona | HR | 100854769 | 306190165 | 54.295 | 117.02 | 1 | 117.02 | 4 | 0 |
| 917803 | 12.02.2024 | Tarragona | HR | 100854769 | 306190165 | 105.0 | 45.93 | 1 | 45.93 | 4 | 0 |
| 1017201 | 12.02.2024 | Tarragona | HR | 100854769 | 306190165 | 105.0 | 45.93 | 1 | 45.93 | 4 | 0 |
| 925032 | 13.06.2024 | Tarragona | HR | 100854769 | 306190165 | 45.2 | 90.5 | 1 | 90.5 | 4 | 0 |
| 1024430 | 13.06.2024 | Tarragona | HR | 100854769 | 306190165 | 45.2 | 90.5 | 1 | 90.5 | 4 | 0 |
| 930843 | 16.09.2024 | Tarragona | HR | 100854769 | 306190165 | 129.0 | 74.14 | 1 | 74.14 | 4 | 0 |
| 1030241 | 16.09.2024 | Tarragona | HR | 100854769 | 306190165 | 129.0 | 74.14 | 1 | 74.14 | 4 | 0 |
| 917063 | 29.01.2024 | Tarragona | HR | 100854769 | 306190165 | 105.0 | 45.93 | 1 | 45.93 | 4 | 0 |
| 1016461 | 29.01.2024 | Tarragona | HR | 100854769 | 306190165 | 105.0 | 45.93 | 1 | 45.93 | 4 | 0 |


In [31]:
df = df.drop_duplicates(keep='first')

In [33]:
# Total number of rows
total_rows = df.shape[0]

# Number of exact duplicates (all columns identical)
exact_duplicates = df.duplicated().sum()
print(f"Exact Duplicates: {exact_duplicates} out of {total_rows}")

Exact Duplicates: 0 out of 1014965
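
The counts above are consistent: 1035735 - 20770 = 1014965. The sketch below uses a small toy DataFrame with made-up values (not the project data) to show how the two `keep` settings used in this section behave. `keep=False` flags every copy in a duplicate group, which is useful for inspection. The default `keep='first'` flags only the repeats, which are the rows that `drop_duplicates` removes.

```python
import pandas as pd

# Toy data (hypothetical values) in which one row appears twice.
toy = pd.DataFrame({
    "Client ID": [1, 1, 2, 3],
    "Date": ["01.01.2024", "01.01.2024", "02.01.2024", "03.01.2024"],
    "Volume": [10.0, 10.0, 5.0, 7.5],
})

# keep=False marks every member of a duplicate group, so both copies show up.
all_copies = toy[toy.duplicated(keep=False)]

# The default keep='first' marks only the later repeats,
# so the sum equals the number of rows that will be dropped.
n_repeats = int(toy.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
deduped = toy.drop_duplicates(keep="first")

print(len(all_copies), n_repeats, len(deduped))
```
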


<a id='2.3'></a>
## Part 2.3 - Ensuring Correct Data Types

In [None]:
# Ensuring correct data types
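
The cell above is still empty. As a starting point, here is a minimal sketch of one possible conversion, based only on what `df.info()` and `df.head()` show: `Date` is stored as a day.month.year string, and the ID columns are integers even though they act as labels. The helper name `fix_dtypes` is hypothetical, and casting `City`/`Channel` to `category` and the IDs to `string` is an assumption about what the later analysis needs, not a decision the notebook has made.

```python
import pandas as pd

# Toy rows mirroring a few columns of the dataset (values copied from df.head()).
sample = pd.DataFrame({
    "Date": ["01.01.2024", "02.01.2024"],
    "City": ["Alicante", "Alicante"],
    "Channel": ["AR", "HR"],
    "Client ID": [398150871, 410234355],
    "Promotor ID": [729030652, 551409294],
})

def fix_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with more suitable column types."""
    out = frame.copy()
    # Dates are day.month.year strings; an explicit format avoids day/month mix-ups.
    out["Date"] = pd.to_datetime(out["Date"], format="%d.%m.%Y")
    # Assumption: these text columns have few distinct values, so category fits.
    for col in ["City", "Channel"]:
        out[col] = out[col].astype("category")
    # Assumption: IDs are identifiers, not quantities, so store them as strings.
    for col in ["Client ID", "Promotor ID"]:
        out[col] = out[col].astype("string")
    return out

typed = fix_dtypes(sample)
print(typed.dtypes)
```

On the full dataset the same helper would be applied as `df = fix_dtypes(df)`.
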

<a id='2.4'></a>
## Part 2.4 - Dealing with Null/Missing Values

In [5]:
df.isna().any()  # No column contains any null values

Date                   False
City                   False
Channel                False
Client ID              False
Promotor ID            False
Volume                 False
Income                 False
Number of orders       False
Median Ticket (€)      False
Prom Contacts Month    False
Tel Contacts Month     False
dtype: bool

In [6]:
df.isna().any().sum()  # Number of columns that contain at least one null value: 0

0
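
Note that `df.isna().any().sum()` counts columns, not cells. The sketch below uses a toy DataFrame with made-up values to show the difference between the number of columns containing nulls and the total number of null cells. Both are 0 for the project data, since a column count of 0 implies there are no null cells at all.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with three missing cells spread over two columns.
toy = pd.DataFrame({
    "Volume": [5.94, np.nan, 125.25],
    "Income": [0.0, np.nan, np.nan],
    "City": ["Alicante", "Alicante", "Alicante"],
})

# Number of columns that contain at least one null value.
cols_with_nulls = int(toy.isna().any().sum())

# Total number of null cells in the whole frame.
total_null_cells = int(toy.isna().sum().sum())

# Null count per column, useful for deciding how to treat each column.
nulls_per_column = toy.isna().sum()

print(cols_with_nulls, total_null_cells)
print(nulls_per_column)
```
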

<a id='2.5'></a>
## Part 2.5 - Creating New Columns or Aggregating?

Open question: should the data be aggregated now, or should the new columns be created first?
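
One way to approach this ordering question is to split the new columns into two kinds. Columns that serve as grouping keys (for example a month derived from `Date`) have to exist before aggregating. Ratios are usually safer to compute after aggregating, from the summed totals, because the mean of row-level ratios generally differs from the ratio of the totals. The sketch below illustrates this on toy data. The monthly per-client grain and the names `Month`, `income`, `n_orders`, `income_per_order` and `mean_of_row_ratios` are assumptions for illustration, not choices the notebook has made.

```python
import pandas as pd

# Toy orders (hypothetical values) using column names from the dataset.
orders = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01.01.2024", "15.01.2024", "02.02.2024", "03.02.2024"],
        format="%d.%m.%Y",
    ),
    "Client ID": [1, 1, 1, 2],
    "Income": [10.0, 60.0, 20.0, 50.0],
    "Number of orders": [1, 3, 1, 2],
})

# Step 1: create row-level columns that are needed as grouping keys.
orders["Month"] = orders["Date"].dt.to_period("M")

# Step 2: aggregate to one row per client and month.
monthly = orders.groupby(["Client ID", "Month"], as_index=False).agg(
    income=("Income", "sum"),
    n_orders=("Number of orders", "sum"),
)

# Step 3: compute ratios from the aggregated totals.
monthly["income_per_order"] = monthly["income"] / monthly["n_orders"]

# For comparison: averaging row-level ratios gives a different answer.
row_ratio = orders["Income"] / orders["Number of orders"]
mean_of_row_ratios = row_ratio.groupby(
    [orders["Client ID"], orders["Month"]]
).mean()

print(monthly)
print(mean_of_row_ratios)
```

For client 1 in January 2024, the ratio of totals is 70 / 4 = 17.5, while the mean of the two row-level ratios (10 and 20) is 15.
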

<a id='3'></a>
# Part 3 - Exploratory Data Analysis