## Author: Rodolpho Justino
### Data Analyst / Data Scientist

This exploratory data analysis looks for insights in a logistics dataset from a company called Loggi. The data is available [here](https://www.kaggle.com/datasets/franklinposso/loggi-deliveries).


The first step is to load the libraries

In [1]:
import pandas as pd
import numpy as np
import json
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from scipy import stats
import sklearn as skl

In [2]:
!wget -q "https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/dataset/deliveries.json" -O deliveries.json 

In [3]:
with open('deliveries.json', mode='r', encoding = 'utf8') as file:
    data = json.load(file)

In [4]:
len(data)

199

In [5]:
df = pd.DataFrame(data)

In [6]:
df.head()

Unnamed: 0,name,region,origin,vehicle_capacity,deliveries
0,cvrp-2-df-33,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': '313483a19d2f8d65cd5024c8d215cfbd', 'p..."
1,cvrp-2-df-73,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'bf3fc630b1c29601a4caf1bdd474b85', 'po..."
2,cvrp-2-df-20,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'b30f1145a2ba4e0b9ac0162b68d045c3', 'p..."
3,cvrp-1-df-71,df-1,"{'lng': -47.89366206897872, 'lat': -15.8051175...",180,"[{'id': 'be3ed547394196c12c7c27c89ac74ed6', 'p..."
4,cvrp-2-df-87,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'a6328fb4dc0654eb28a996a270b0f6e4', 'p..."


We see that the nested columns were not flattened when the data was loaded. For example, each cell of `deliveries` holds a list of records, so we need to "explode" the column to get one row per delivery.

First, we flatten a simpler nested column, `origin`, which holds a dictionary with the `lng` and `lat` values.

In [7]:
origin_df = pd.json_normalize(df["origin"])
origin_df.head()

Unnamed: 0,lng,lat
0,-48.054989,-15.838145
1,-48.054989,-15.838145
2,-48.054989,-15.838145
3,-47.893662,-15.805118
4,-48.054989,-15.838145


The procedure worked, so we can merge the flattened columns back into the original DataFrame.

In [8]:
df = pd.merge(left = df, right = origin_df, how = "inner", left_index = True, right_index = True)
df.head()

Unnamed: 0,name,region,origin,vehicle_capacity,deliveries,lng,lat
0,cvrp-2-df-33,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': '313483a19d2f8d65cd5024c8d215cfbd', 'p...",-48.054989,-15.838145
1,cvrp-2-df-73,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'bf3fc630b1c29601a4caf1bdd474b85', 'po...",-48.054989,-15.838145
2,cvrp-2-df-20,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'b30f1145a2ba4e0b9ac0162b68d045c3', 'p...",-48.054989,-15.838145
3,cvrp-1-df-71,df-1,"{'lng': -47.89366206897872, 'lat': -15.8051175...",180,"[{'id': 'be3ed547394196c12c7c27c89ac74ed6', 'p...",-47.893662,-15.805118
4,cvrp-2-df-87,df-2,"{'lng': -48.05498915846707, 'lat': -15.8381445...",180,"[{'id': 'a6328fb4dc0654eb28a996a270b0f6e4', 'p...",-48.054989,-15.838145
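
The merge above pairs rows by index, which works because `pd.json_normalize` returns a fresh 0..n-1 index that matches the default index of `df`. The sketch below reproduces the idea on toy data (the names and values are made up for illustration) and uses `DataFrame.join`, which also aligns on the index. If the original DataFrame had a non-default index, the two frames would no longer line up this way.

```python
import pandas as pd

# Toy records shaped like the dataset rows (values are made up).
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "origin": [
            {"lng": -48.05, "lat": -15.84},
            {"lng": -47.89, "lat": -15.81},
            {"lng": -48.05, "lat": -15.84},
        ],
    }
)

# json_normalize returns a fresh 0..n-1 index, matching toy's default index.
flat = pd.json_normalize(toy["origin"].tolist())

# join aligns on the index, like merge(left_index=True, right_index=True).
joined = toy.drop(columns="origin").join(flat)
print(joined)
```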


We now drop the redundant `origin` column, reorder the remaining columns, and rename `lng` and `lat` to `hub_long` and `hub_lat`.

In [9]:
df = df.drop("origin", axis = 1)
df = df[["name","region", "lng", "lat", "vehicle_capacity", "deliveries"]]
df.rename(columns = {"lng":"hub_long","lat":"hub_lat"}, inplace = True)
df.head()

Unnamed: 0,name,region,hub_long,hub_lat,vehicle_capacity,deliveries
0,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,"[{'id': '313483a19d2f8d65cd5024c8d215cfbd', 'p..."
1,cvrp-2-df-73,df-2,-48.054989,-15.838145,180,"[{'id': 'bf3fc630b1c29601a4caf1bdd474b85', 'po..."
2,cvrp-2-df-20,df-2,-48.054989,-15.838145,180,"[{'id': 'b30f1145a2ba4e0b9ac0162b68d045c3', 'p..."
3,cvrp-1-df-71,df-1,-47.893662,-15.805118,180,"[{'id': 'be3ed547394196c12c7c27c89ac74ed6', 'p..."
4,cvrp-2-df-87,df-2,-48.054989,-15.838145,180,"[{'id': 'a6328fb4dc0654eb28a996a270b0f6e4', 'p..."


Next, we "explode" the `deliveries` column so that each delivery gets its own row:

In [10]:
df_exploded = df[["deliveries"]].explode("deliveries")
df_exploded.head()

Unnamed: 0,deliveries
0,"{'id': '313483a19d2f8d65cd5024c8d215cfbd', 'po..."
0,"{'id': '320c94b17aa685c939b3f3244c3099de', 'po..."
0,"{'id': '3663b42f4b8decb33059febaba46d5c8', 'po..."
0,"{'id': 'e11ab58363c38d6abc90d5fba87b7d7', 'poi..."
0,"{'id': '54cb45b7bbbd4e34e7150900f92d7f4b', 'po..."


In [11]:
new_df = pd.concat([pd.DataFrame(df_exploded["deliveries"].apply(lambda record: record["size"])).rename(columns={"deliveries": "delivery_size"}),
                    pd.DataFrame(df_exploded["deliveries"].apply(lambda record: record["point"]["lng"])).rename(columns={"deliveries": "delivery_lng"}),
                    pd.DataFrame(df_exploded["deliveries"].apply(lambda record: record["point"]["lat"])).rename(columns={"deliveries": "delivery_lat"})
                   ], axis = 1)
new_df.head()

Unnamed: 0,delivery_size,delivery_lng,delivery_lat
0,9,-48.116189,-15.848929
0,2,-48.118195,-15.850772
0,1,-48.112483,-15.847871
0,2,-48.118023,-15.846471
0,7,-48.114898,-15.858055
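
As a possible alternative to the three `apply` calls, the exploded records could be flattened in one pass with `pd.json_normalize`. This is only a sketch on toy data (the ids and values are made up). The key assumption is that the repeated index produced by `explode` has to be restored by hand, because `json_normalize` discards it.

```python
import pandas as pd

# Toy routes with the same record layout as the dataset: id, point, size.
toy = pd.DataFrame(
    {
        "name": ["route-a", "route-b"],
        "deliveries": [
            [
                {"id": "x1", "point": {"lng": -48.11, "lat": -15.84}, "size": 9},
                {"id": "x2", "point": {"lng": -48.12, "lat": -15.85}, "size": 2},
            ],
            [{"id": "y1", "point": {"lng": -47.90, "lat": -15.80}, "size": 5}],
        ],
    }
)

# One row per delivery; the index repeats the route's row number.
exploded = toy["deliveries"].explode()

# Flatten the nested dicts; nested keys become "point.lng" and "point.lat".
flat = pd.json_normalize(exploded.tolist())
flat.index = exploded.index  # restore the repeated route index for the later merge

flat = flat[["size", "point.lng", "point.lat"]].rename(
    columns={
        "size": "delivery_size",
        "point.lng": "delivery_lng",
        "point.lat": "delivery_lat",
    }
)
print(flat)
```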


Now that the data is "exploded", we merge it back into the previous DataFrame:

In [12]:
df = df.drop("deliveries", axis = 1)
df = pd.merge(left = df, right = new_df, how = "right", left_index = True, right_index = True )
df.reset_index(inplace = True, drop = True)
df

Unnamed: 0,name,region,hub_long,hub_lat,vehicle_capacity,delivery_size,delivery_lng,delivery_lat
0,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,9,-48.116189,-15.848929
1,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,2,-48.118195,-15.850772
2,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,1,-48.112483,-15.847871
3,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,2,-48.118023,-15.846471
4,cvrp-2-df-33,df-2,-48.054989,-15.838145,180,7,-48.114898,-15.858055
...,...,...,...,...,...,...,...,...
636144,cvrp-2-df-62,df-2,-48.054989,-15.838145,180,8,-48.064269,-15.997694
636145,cvrp-2-df-62,df-2,-48.054989,-15.838145,180,4,-48.065176,-16.003597
636146,cvrp-2-df-62,df-2,-48.054989,-15.838145,180,9,-48.065841,-16.003808
636147,cvrp-2-df-62,df-2,-48.054989,-15.838145,180,1,-48.062327,-16.001568
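
A quick sanity check for this kind of operation is that the number of rows after the explode equals the total number of records in the original lists. The sketch below shows the check on toy data (made-up values). Note that it assumes no list is empty, since `explode` turns an empty list into a single row holding `NaN`.

```python
import pandas as pd

# Toy routes with 2, 1, and 3 deliveries respectively.
toy = pd.DataFrame(
    {
        "name": ["route-a", "route-b", "route-c"],
        "deliveries": [
            [{"size": 1}, {"size": 2}],
            [{"size": 3}],
            [{"size": 4}, {"size": 5}, {"size": 6}],
        ],
    }
)

# Total number of delivery records before exploding.
expected_rows = toy["deliveries"].apply(len).sum()

# One row per delivery, with the route columns repeated.
exploded = toy.explode("deliveries").reset_index(drop=True)

assert len(exploded) == expected_rows
print(expected_rows)
```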


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636149 entries, 0 to 636148
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   name              636149 non-null  object 
 1   region            636149 non-null  object 
 2   hub_long          636149 non-null  float64
 3   hub_lat           636149 non-null  float64
 4   vehicle_capacity  636149 non-null  int64  
 5   delivery_size     636149 non-null  int64  
 6   delivery_lng      636149 non-null  float64
 7   delivery_lat      636149 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 38.8+ MB


From the output above we can conclude that:

* The data types are consistent with the data, so no changes or conversions are needed;
* There is no null data;
* There are 636,149 rows, and each row represents a different delivery.

To confirm that there is no null or missing data, we run the following line:

In [14]:
df.isna().any()

name                False
region              False
hub_long            False
hub_lat             False
vehicle_capacity    False
delivery_size       False
delivery_lng        False
delivery_lat        False
dtype: bool
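
`isna().any()` only answers yes or no per column. When a column does contain missing values, counting them is usually more informative. A minimal sketch on toy data (made-up values with gaps added on purpose):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values.
toy = pd.DataFrame(
    {
        "delivery_size": [9, 2, np.nan, 7],
        "delivery_lat": [-15.85, np.nan, np.nan, -15.86],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Overall number of missing cells.
total_missing = int(missing_per_column.sum())
print(missing_per_column)
```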

## Reverse geocoding the delivery hubs

Reverse geocoding converts latitude and longitude coordinates into their textual address descriptions.

In [15]:
hub_df = df[["region","hub_long","hub_lat"]]
hub_df = hub_df.drop_duplicates().sort_values(by = "region").reset_index(drop=True)
hub_df.head()

Unnamed: 0,region,hub_long,hub_lat
0,df-0,-47.802665,-15.657014
1,df-1,-47.893662,-15.805118
2,df-2,-48.054989,-15.838145


For this example, we use Nominatim for the reverse geocoding.

In [16]:
geolocator = Nominatim(user_agent = "logdata_geocoder")
geocoder = RateLimiter(geolocator.reverse, min_delay_seconds = 1)
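
To make the role of `min_delay_seconds` concrete, the sketch below shows the basic idea of spacing out calls. It is not geopy's implementation, only a simplified illustration: `rate_limited` and `fake_reverse` are hypothetical names, and the fake function stands in for `geolocator.reverse` so that no network access is needed. The delay is shortened to 0.05 seconds to keep the example fast.

```python
import time


def rate_limited(func, min_delay_seconds):
    """Wrap func so consecutive calls start at least min_delay_seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            # Sleep for whatever part of the delay has not yet elapsed.
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in for geolocator.reverse, so no network is needed.
def fake_reverse(coordinates):
    return f"address for {coordinates}"


limited = rate_limited(fake_reverse, min_delay_seconds=0.05)

start = time.monotonic()
results = [limited(c) for c in ["-15.66, -47.80", "-15.81, -47.89", "-15.84, -48.05"]]
elapsed = time.monotonic() - start
print(results)
```

With three calls there are two enforced pauses, so the loop takes at least roughly twice the configured delay.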

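Here the geocoder receives each coordinate pair as a single "latitude, longitude" string, so we build that string from the two coordinate columns. Note that latitude comes first, while the dataset lists longitude first.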

In [17]:
hub_df["coordinates"] = hub_df["hub_lat"].astype(str) + ", " + hub_df["hub_long"].astype(str)
hub_df["geodata"] = hub_df["coordinates"].apply(geocoder)
hub_df.head()

Unnamed: 0,region,hub_long,hub_lat,coordinates,geodata
0,df-0,-47.802665,-15.657014,"-15.657013854445248, -47.802664728268745","(Clinica dos Olhos, Rua 7, Quadra 2, Sobradinh..."
1,df-1,-47.893662,-15.805118,"-15.80511751066334, -47.89366206897872","(Bloco B / F, W1 Sul, SQS 103, Asa Sul, Brasíl..."
2,df-2,-48.054989,-15.838145,"-15.83814451122274, -48.05498915846707","(Armazém do Bolo, lote 4/8, CSB 4/5, Taguating..."


In [18]:
hub_geo_df = pd.json_normalize(hub_df["geodata"].apply(lambda data: data.raw))
hub_geo_df.columns

Index(['place_id', 'licence', 'osm_type', 'osm_id', 'lat', 'lon',
       'display_name', 'boundingbox', 'address.amenity', 'address.road',
       'address.residential', 'address.suburb', 'address.town',
       'address.municipality', 'address.county', 'address.state_district',
       'address.state', 'address.ISO3166-2-lvl4', 'address.region',
       'address.postcode', 'address.country', 'address.country_code',
       'address.building', 'address.neighbourhood', 'address.city',
       'address.shop', 'address.house_number'],
      dtype='object')

In [19]:
# Copy the selection so the in-place edits below act on an independent DataFrame.
hub_geo_df = hub_geo_df[["address.town", "address.suburb", "address.city"]].copy()
hub_geo_df.rename(columns = {"address.town":"hub_town", "address.suburb": "hub_suburb", "address.city": "hub_city"}, inplace = True)
# Fall back to the town when the city is missing, then to the city when the suburb is missing.
hub_geo_df["hub_city"] = np.where(hub_geo_df["hub_city"].notna(), hub_geo_df["hub_city"],hub_geo_df["hub_town"])
hub_geo_df["hub_suburb"] = np.where(hub_geo_df["hub_suburb"].notna(), hub_geo_df["hub_suburb"], hub_geo_df["hub_city"])
hub_geo_df = hub_geo_df.drop("hub_town", axis = 1)
hub_geo_df.head()

Unnamed: 0,hub_suburb,hub_city
0,Sobradinho,Sobradinho
1,Asa Sul,Brasília
2,Taguatinga,Taguatinga
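
The two `np.where` calls implement a fallback chain: use the city if present, otherwise the town; use the suburb if present, otherwise the city. The same logic could be sketched with `Series.fillna`, as below. The toy values are assumptions chosen to resemble the output above; which fields Nominatim actually left empty for each hub is not shown in the notebook.

```python
import numpy as np
import pandas as pd

# Toy geocoder output; the pattern of missing values is assumed.
toy = pd.DataFrame(
    {
        "hub_town": ["Sobradinho", np.nan, "Taguatinga"],
        "hub_suburb": [np.nan, "Asa Sul", np.nan],
        "hub_city": [np.nan, "Brasília", np.nan],
    }
)

# City falls back to town; suburb falls back to the (already filled) city.
toy["hub_city"] = toy["hub_city"].fillna(toy["hub_town"])
toy["hub_suburb"] = toy["hub_suburb"].fillna(toy["hub_city"])
toy = toy.drop(columns="hub_town")
print(toy)
```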


Now we combine the suburb and city information with the hub DataFrame, and then with the main one:

In [20]:
hub_df = pd.merge(left = hub_df, right = hub_geo_df, left_index = True, right_index = True)
hub_df = hub_df[["region","hub_suburb","hub_city"]]
hub_df.head()

Unnamed: 0,region,hub_suburb,hub_city
0,df-0,Sobradinho,Sobradinho
1,df-1,Asa Sul,Brasília
2,df-2,Taguatinga,Taguatinga


In [21]:
df = pd.merge(left = df, right = hub_df, how = "inner", on = "region")
df = df[["name", "region","hub_long", "hub_lat", "hub_city", "hub_suburb", "vehicle_capacity", "delivery_size", "delivery_lng", "delivery_lat"]]
df.head()
    

Unnamed: 0,name,region,hub_long,hub_lat,hub_city,hub_suburb,vehicle_capacity,delivery_size,delivery_lng,delivery_lat
0,cvrp-2-df-33,df-2,-48.054989,-15.838145,Taguatinga,Taguatinga,180,9,-48.116189,-15.848929
1,cvrp-2-df-33,df-2,-48.054989,-15.838145,Taguatinga,Taguatinga,180,2,-48.118195,-15.850772
2,cvrp-2-df-33,df-2,-48.054989,-15.838145,Taguatinga,Taguatinga,180,1,-48.112483,-15.847871
3,cvrp-2-df-33,df-2,-48.054989,-15.838145,Taguatinga,Taguatinga,180,2,-48.118023,-15.846471
4,cvrp-2-df-33,df-2,-48.054989,-15.838145,Taguatinga,Taguatinga,180,7,-48.114898,-15.858055
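
Because many deliveries share one hub, this is a many-to-one merge on `region`. As a hedged suggestion, pandas can verify that assumption through the `validate` argument of `pd.merge`, which raises an error if `region` is duplicated on the hub side. The sketch uses toy data (made-up delivery sizes) and also checks that the inner merge did not drop any delivery.

```python
import pandas as pd

# Toy deliveries: several rows may share the same region.
deliveries = pd.DataFrame(
    {"region": ["df-2", "df-2", "df-1", "df-0"], "delivery_size": [9, 2, 4, 7]}
)

# One row per hub region.
hubs = pd.DataFrame(
    {
        "region": ["df-0", "df-1", "df-2"],
        "hub_city": ["Sobradinho", "Brasília", "Taguatinga"],
    }
)

# validate="many_to_one" raises MergeError if hubs has duplicate regions.
merged = pd.merge(deliveries, hubs, how="inner", on="region", validate="many_to_one")

# An inner merge silently drops unmatched rows, so compare the row counts.
assert len(merged) == len(deliveries)
print(merged)
```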


For the delivery dataset, with over 600,000 entries, the reverse geocoding was done separately on another machine and saved to a separate CSV file. The data is available [here](https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/dataset/deliveries-geodata.csv).
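
A rough estimate suggests why the deliveries were not geocoded inside the notebook. Assuming one request per delivery row and the same 1-second minimum delay configured for the hubs (and ignoring network latency and any deduplication of coordinates), the run time would be:

```python
# Rows in the exploded delivery DataFrame (from df.info() above).
n_rows = 636_149

# Minimum spacing between requests, as configured in the RateLimiter.
min_delay_seconds = 1

total_seconds = n_rows * min_delay_seconds
total_hours = total_seconds / 3600
total_days = total_hours / 24
print(f"{total_hours:.1f} hours, about {total_days:.1f} days")
# prints: 176.7 hours, about 7.4 days
```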