
# **Project Name**    - FedEx EDA Project



##### **Project Type**    - EDA

# **Project Summary -**

The SCMS Delivery History Dataset (FedEx) is a comprehensive dataset containing details about delivery operations, including delivery times, package characteristics, and statuses. This project leverages Exploratory Data Analysis (EDA) to uncover patterns and trends in the dataset, with the ultimate goal of improving logistics performance, reducing delays, and identifying areas for optimization within FedEx's delivery system. By applying Univariate, Bivariate, and Multivariate analysis, the project aims to deliver actionable insights that will help businesses streamline their operations and enhance customer satisfaction.

The analysis begins with data cleaning and preprocessing, addressing missing values, outliers, and ensuring data consistency across variables. With over 20 visualizations, we explore different aspects of delivery performance such as the relationship between package weight, delivery status, and delivery time. Key performance metrics are analyzed, such as delivery delays, package characteristics, and performance by delivery zones. The goal is to identify trends, correlations, and bottlenecks that may be influencing delivery performance, which can then be optimized for greater efficiency.

Insights derived from the project will assist businesses in making data-driven decisions for optimizing delivery routes, reducing costs, and improving service quality. By understanding the factors influencing delivery times and performance, organizations can better allocate resources, improve scheduling, and provide a smoother experience for customers.

# **GitHub Link -**

https://github.com/sohanmahamuni

# **Problem Statement**


Efficient delivery management is critical for logistics companies like FedEx. Delays in deliveries, incorrect handling, and resource mismanagement can hurt customer satisfaction and raise operational costs. The challenge lies in identifying the underlying factors contributing to delivery inefficiencies and developing strategies to optimize delivery processes. In this project, we analyze the SCMS Delivery History dataset to uncover patterns, correlations, and bottlenecks that impact delivery performance, and suggest improvements that can lead to a more effective and efficient system.

#### **Define Your Business Objective?**

The business objective of this project is to help FedEx improve its delivery efficiency by identifying key factors that contribute to delays and inefficiencies. The goal is to generate insights that will allow the company to optimize its delivery operations, minimize delivery times, and enhance customer satisfaction. Through this analysis, we aim to provide a data-driven approach to improve resource allocation, delivery scheduling, and overall performance across different delivery zones.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4.   You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
    Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('SCMS_Delivery_History_Dataset.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_values[missing_values > 0]


In [None]:
# Visualizing the missing values
msno.matrix(df)
plt.title("Missing Values Matrix")
plt.show()

### What did you know about your dataset?

The dataset contains over 10,000 records and 33 features. It includes information about shipment modes, vendors, delivery schedules, item weights, costs, and more. Some columns like Weight (Kilograms) and Freight Cost (USD) have missing values and need cleaning. There are no significant duplicate records.
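
To complement the raw counts, the share of missing values per column can be put in one table. The sketch below is a minimal illustration on a small made-up frame; `missing_summary` is a hypothetical helper that is not part of the original notebook, and the toy values are not taken from the SCMS dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy data for illustration only (not taken from the SCMS dataset)
toy = pd.DataFrame({
    "Weight (Kilograms)": [10.0, np.nan, 25.0, np.nan],
    "Freight Cost (USD)": [100.0, 250.0, np.nan, 80.0],
    "Country": ["A", "B", "C", "D"],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(df)`, which makes it easier to judge whether dropping rows or imputing is the more reasonable choice for a given column.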

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Columns as a plain Python list (easier to read in full)
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe()

### Variables Description



  **Column Name | Description**

1. ID | Unique identifier for each delivery record
2. Project Code | Code associated with a specific logistics project
3. Project Object Code | Object code representing the type of item
4. Planned Delivery Date | Initially planned delivery date
5. Scheduled Delivery Date | The scheduled delivery date after planning
6. Delivered to Client Date | The date on which goods were actually delivered
7. Delivery Recorded Date | Date when delivery was recorded in the system
8. Delivery Status | Delivery status (On time / Late / In progress etc.)
9. Country | Country where goods are to be delivered
10. Beneficiary | End recipient of the goods
11. Final Delivery Point | Exact delivery address/location
12. Product Group | Group/category of product (e.g., Medical, Electrical)
13. Product Category | Specific category under product group
14. Item Description | Short description of the product
15. Brand | Brand name of the product
16. Dosage | Dosage information (for medical items)
17. Dosage Form | Form of medicine (tablet, capsule, etc.)
18. Units per Pack | Number of units per pack
19. Quantity (Pack) | Number of packs ordered
20. Unit of Measure (Per Pack) | Unit of measurement per pack (e.g., Tablet, Bottle)
21. Line Item Quantity | Total number of units = Units per Pack × Quantity (Pack)
22. Weight (Kilograms) | Weight of the shipment in kilograms
23. Freight Cost (USD) | Shipping cost in US dollars
24. Line Item Insurance (USD) | Insurance cost of the item
25. Line Item Value | Total value of the line item
26. Shipment Mode | Mode of shipment used (Air, Ocean, Truck, etc.)
27. Manufacturing Site | Site/factory where the product was manufactured
28. Vendor INCO Term | Vendor’s INCOTERM (logistics contractual term)
29. Purchasing Document | Reference document for purchasing
30. PO / SO | Purchase Order / Sales Order ID
31. Vendor | Supplier/vendor responsible for fulfilling the order
32. Shipment ID | Unique ID for each shipment
33. Material Number | Material code identifier


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
unique_vals = df.nunique().sort_values(ascending=False)
unique_vals

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
date_cols = [
    'Scheduled Delivery Date',
    'Delivered to Client Date',
    'Delivery Recorded Date'
]

for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

In [None]:
# Optional: Rename column names for convenience (remove spaces, make lowercase)
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()

In [None]:
# Count of nulls again after datetime conversion
df.isnull().sum().sort_values(ascending=False).head(10)

In [None]:
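# Drop rows with missing crucial data for analysis.
# .copy() makes df_clean an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
df_clean = df.dropna(subset=['weight_(kilograms)', 'freight_cost_(usd)', 'line_item_value']).copy()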

In [None]:
# New column: Delivery Delay in days
df_clean['delivery_delay'] = (df_clean['delivered_to_client_date'] - df_clean['scheduled_delivery_date']).dt.days

In [None]:
# Check data types
df_clean[['freight_cost_(usd)', 'weight_(kilograms)', 'line_item_value']].dtypes

In [None]:
# Convert to numeric; entries that cannot be parsed as numbers (e.g. text placeholders) become NaN
df_clean['freight_cost_(usd)'] = pd.to_numeric(df_clean['freight_cost_(usd)'], errors='coerce')
df_clean['weight_(kilograms)'] = pd.to_numeric(df_clean['weight_(kilograms)'], errors='coerce')
df_clean['line_item_value'] = pd.to_numeric(df_clean['line_item_value'], errors='coerce')

In [None]:
# Drop rows that are NaN in these columns after the numeric conversion
df_clean = df_clean.dropna(subset=['weight_(kilograms)', 'freight_cost_(usd)'])

### What all manipulations have you done and insights you found?

Converted date columns to datetime format.

Cleaned column names.

Dropped rows with missing critical data like weight and freight cost.

Engineered a new column for delivery delay (in days).

Ensured numerical columns are in proper format.
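
The steps listed above can be collected into a single function so that the cleaning is reproducible in one call. This is a minimal sketch, assuming the columns carry the names used in this notebook; `clean_deliveries` and the toy rows are hypothetical and for illustration only.

```python
import pandas as pd

def clean_deliveries(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps of this section and return a new frame."""
    out = raw.copy()
    # Normalise column names: trim, replace spaces with underscores, lowercase
    out.columns = out.columns.str.strip().str.replace(" ", "_").str.lower()
    # Parse dates; unparseable entries become NaT
    for col in ["scheduled_delivery_date", "delivered_to_client_date"]:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # Coerce numeric columns; text placeholders become NaN
    num_cols = ["weight_(kilograms)", "freight_cost_(usd)", "line_item_value"]
    for col in num_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Drop rows lacking any of the critical numeric fields
    out = out.dropna(subset=num_cols).copy()
    # Positive values mean the delivery arrived after the scheduled date
    out["delivery_delay"] = (
        out["delivered_to_client_date"] - out["scheduled_delivery_date"]
    ).dt.days
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "Scheduled Delivery Date": ["2020-01-10", "2020-02-01", "2020-03-05"],
    "Delivered to Client Date": ["2020-01-12", "2020-02-01", "2020-03-01"],
    "Weight (Kilograms)": ["120", "not recorded", "45"],
    "Freight Cost (USD)": ["950.5", "300", "120"],
    "Line Item Value": [5000.0, 700.0, 150.0],
})
cleaned = clean_deliveries(toy)
print(cleaned[["weight_(kilograms)", "delivery_delay"]])
```

Returning a new frame instead of modifying the input keeps the raw data available for comparison, which helps when checking how many rows each step removes.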

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Univariate Analysis

#### Chart - 1

In [None]:
# Chart 1 - Shipment Mode Count
plt.figure(figsize=(8,5))
sns.countplot(data=df_clean, x='shipment_mode', palette='viridis')
plt.title('Distribution of Shipment Modes')
plt.xlabel('Shipment Mode')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are ideal for categorical univariate data to compare frequencies across different shipment modes.

##### 2. What is/are the insight(s) found from the chart?

* Air is the dominant shipment mode, followed by Truck, Air Charter, and Ocean.

* There is a strong preference for air-based transport (Air + Air Charter), which suggests a need for speed in delivery.
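
The bar heights can also be expressed as percentage shares, which makes the "dominant mode" statement quantitative. The snippet below is a small sketch on made-up labels; the numbers it prints are not the real distribution.

```python
import pandas as pd

# Toy shipment modes for illustration only (not the real distribution)
modes = pd.Series(
    ["Air", "Air", "Air", "Truck", "Truck", "Ocean", "Air Charter", "Air"],
    name="shipment_mode",
)

# Relative frequency of each mode, in percent
mode_share = (modes.value_counts(normalize=True) * 100).round(1)

# Combined share of the two air-based modes
air_based_share = mode_share.reindex(["Air", "Air Charter"]).sum()

print(mode_share)
print(f"Air-based share: {air_based_share:.1f}%")
```

With the notebook's data, the same calculation would start from `df_clean['shipment_mode']`.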

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Logistics teams can evaluate if the high use of Air is justified in terms of cost vs. delivery speed.

* If Air shipments are being used even when urgency is not high, the company may reduce costs by shifting some deliveries to Ocean or Truck modes.

**Insights that lead to negative growth:**

* Over-dependence on Air shipment may lead to higher logistics costs, which could eat into the company's profit margins if not optimized.

* This also exposes the business to risks if air transport capacity becomes limited or prices surge due to geopolitical or fuel-related reasons.

#### Chart - 2

In [None]:
# Chart 2 - Top 10 Countries by Delivery Count
plt.figure(figsize=(10,5))
top_countries = df_clean['country'].value_counts().nlargest(10)
sns.barplot(x=top_countries.index, y=top_countries.values, palette='coolwarm')
plt.title('Top 10 Countries by Number of Deliveries')
plt.xlabel('Country')
plt.ylabel('Number of Deliveries')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is suitable for comparing the frequency of deliveries across different countries. It helps highlight which countries are the top delivery destinations, offering a clear understanding of demand concentration geographically.

##### 2. What is/are the insight(s) found from the chart?

* South Africa leads in the number of deliveries, followed by Nigeria and Côte d'Ivoire.

* These top 3 countries have significantly higher delivery volumes compared to others in the top 10 list.

* The rest show a gradual decline, with Tanzania and Zimbabwe at the lower end.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Identifying high-demand countries like South Africa and Nigeria allows for better resource allocation (e.g., warehouse capacity, staff, transportation).

* It can help in forecasting demand, ensuring that inventory levels match regional delivery needs.

* Businesses can consider investing in regional logistics hubs in top-performing countries to reduce delivery time and cost.

**Insights that lead to negative growth:**

* If over-reliance on a few countries like South Africa occurs, the business becomes vulnerable to regional instability or policy changes in those areas.

* Countries with fewer deliveries (like Tanzania and Zimbabwe) may be underperforming markets, indicating potential logistical issues or lack of demand, both of which require investigation.

#### Chart - 3

In [None]:
# Chart 3 - Freight Cost Distribution
plt.figure(figsize=(8,5))
sns.histplot(df_clean['freight_cost_(usd)'], bins=50, kde=True, color='teal')
plt.title('Distribution of Freight Cost (USD)')
plt.xlabel('Freight Cost (USD)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE (Kernel Density Estimate) curve is ideal for showing how freight costs are distributed. It highlights the skewness of the data, helps detect outliers, and identifies the most common cost ranges in logistics operations.

##### 2. What is/are the insight(s) found from the chart?

* The distribution is right-skewed, meaning most freight costs are on the lower end, with a few extreme high-cost outliers.

* The majority of deliveries have freight costs below $10,000, with a significant peak around $0–$5,000.

* There are very few deliveries with exceptionally high freight costs (above $50,000), which are outliers.
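
Skewness and outliers can be checked numerically rather than read off the histogram alone. The sketch below uses the sample skewness and Tukey's 1.5 × IQR rule, which is one common convention and not necessarily the threshold implied by the chart; `iqr_outlier_mask` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy freight costs for illustration only
freight = pd.Series([500, 800, 1200, 1500, 2000, 2500, 3000, 60000], dtype=float)

skewness = freight.skew()  # a value above 0 indicates a right-skewed distribution
outliers = freight[iqr_outlier_mask(freight)]

print(f"skewness = {skewness:.2f}")
print(outliers)
```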

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* By understanding the common cost ranges, businesses can optimize budget allocation, negotiate better shipping rates, or choose more cost-effective routes.

* Identifying outliers can help investigate unusual costs, which may be due to inefficient logistics, errors, or fraud—this can help reduce unnecessary expenses.

* Helps forecast and set freight cost benchmarks for future deliveries.

**Insights that lead to negative growth:**

* The presence of extreme high-cost outliers could signal inefficient routing, emergency shipments, or poor planning, which may be draining resources.

* If these high costs are not managed or reduced, they can erode profit margins, especially if they occur frequently or in high-priority deliveries.

#### Chart - 4

In [None]:
# Chart 4 - Line Item Value Distribution (uses the cleaned frame, like Charts 1-3)
plt.figure(figsize=(10, 6))
sns.histplot(df_clean['line_item_value'], bins=50, kde=True, color='darkorange')
plt.title('Distribution of Line Item Value (USD)', fontsize=14)
plt.xlabel('Line Item Value (USD)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve reveals how shipment values are spread, makes outliers visible, and gives a view of the overall pricing structure.

##### 2. What is/are the insight(s) found from the chart?

* Highly right-skewed distribution: Most of the line item values are concentrated on the lower end of the scale (close to $0), while fewer items have very high values.

* High Frequency of Low-Value Shipments: Most shipments appear to be in the lower price range, possibly standard or domestic shipments.

* Majority of the transactions fall within the lower range (probably below $100,000), as shown by the tall bars on the left.
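
When a distribution is this skewed, most of the histogram is compressed into the first few bins. One possible remedy, shown here as a sketch and not as something the original notebook does, is to look at the values on a log scale. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy line item values for illustration only
values = pd.Series([0.0, 50.0, 200.0, 1_000.0, 5_000.0, 250_000.0])

# log1p computes log(1 + x), so zero values are handled safely (log1p(0) == 0)
log_values = np.log1p(values)

raw_skew = values.skew()
log_skew = log_values.skew()
print(f"raw skewness: {raw_skew:.2f}")
print(f"log skewness: {log_skew:.2f}")
```

Plotting `log_values` with `sns.histplot` would spread the low-value shipments over more bins, at the cost of an axis that is no longer in plain USD.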

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Operational Focus: FedEx can optimize its operations to better handle high-volume, low-value shipments efficiently.

* Targeted Offerings: The data can inform decisions around pricing strategies or service bundles for frequent low-cost users.

* Risk Mitigation: Knowing that high-value items are rare but impactful allows for better fraud detection and exception handling protocols.

**Insights that lead to negative growth:**

* Revenue Disproportion: If a large portion of revenue depends on a few high-value shipments, this creates dependency risk.

* Cost-Heavy Low-Value Shipments: If the cost to process many low-value shipments exceeds the margin, it may require reevaluation of pricing models.

# Bivariate Analysis

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import matplotlib.ticker as ticker

# Convert 'weight_(kilograms)' and 'freight_cost_(usd)' to numeric, handling errors
df['weight_(kilograms)'] = pd.to_numeric(df['weight_(kilograms)'], errors='coerce')
df['freight_cost_(usd)'] = pd.to_numeric(df['freight_cost_(usd)'], errors='coerce')

# Drop rows with NaN values after conversion to avoid issues in plotting
df = df.dropna(subset=['weight_(kilograms)', 'freight_cost_(usd)'])

plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
sns.scatterplot(data=df, x='weight_(kilograms)', y='freight_cost_(usd)', alpha=0.5, color='purple', edgecolor=None)

# Add regression line
sns.regplot(
    data=df,
    x='weight_(kilograms)',
    y='freight_cost_(usd)',
    scatter=False,
    color='black',
    line_kws={'linewidth': 1.5}
)

# Format axes
plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(nbins=10))
plt.gca().yaxis.set_major_locator(ticker.MaxNLocator(nbins=10))
plt.xticks(rotation=45)
plt.xlabel('Weight (Kilograms)', fontsize=12)
plt.ylabel('Freight Cost (USD)', fontsize=12)
plt.title('Freight Cost vs Weight of Shipment', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot with a regression line was chosen to visualize the relationship between shipment weight and the corresponding freight cost. It helps determine whether heavier shipments incur higher freight costs, which is essential for understanding cost drivers in logistics.

##### 2. What is/are the insight(s) found from the chart?

* There is a positive linear trend between the shipment weight and freight cost — as the weight increases, the freight cost also tends to increase.

* However, the correlation is not very strong; a significant number of low-weight shipments still have high freight costs, and vice versa.

* There are a few outliers (very high weights with relatively low cost or extremely high cost) that may need further investigation.
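
The strength of the relationship described above can be quantified with a correlation coefficient. The sketch below computes Pearson's r and, because the data contain outliers, Spearman's rank correlation as well (calculated as Pearson on ranks, which needs only pandas). The numbers are made up for illustration.

```python
import pandas as pd

# Toy weight/cost pairs for illustration only
toy = pd.DataFrame({
    "weight_(kilograms)": [10, 50, 120, 300, 800, 1500],
    "freight_cost_(usd)": [900, 400, 1500, 1200, 5000, 4200],
})
x = toy["weight_(kilograms)"]
y = toy["freight_cost_(usd)"]

# Pearson measures linear association
pearson = x.corr(y)

# Spearman measures monotonic association: Pearson applied to the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A large gap between the two coefficients would suggest that a few extreme shipments drive the linear trend.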

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* By understanding the weight-cost correlation, companies can optimize shipment sizes to ensure cost-efficiency.

* Outliers might indicate inefficient routes, overcharges, or underutilized capacity, which can be flagged for process improvement.

* Freight contracts can be revisited using this insight to negotiate better rates based on shipment weight brackets.

**Insights that lead to negative growth:**

* If outliers or inefficient weight-cost relationships go unnoticed, logistics expenses may rise unnecessarily.

* Without using this insight, the company might continue paying disproportionately high freight costs for certain weight ranges.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 6))
sns.regplot(
    data=df,
    x='line_item_insurance_(usd)',
    y='freight_cost_(usd)',
    scatter_kws={'alpha':0.5, 'color':'orange'},
    line_kws={'color':'darkred'}
)
plt.title('Freight Cost vs Line Item Insurance')
plt.xlabel('Line Item Insurance (USD)')
plt.ylabel('Freight Cost (USD)')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot with a regression line was chosen because it is well suited to analyzing the relationship between two continuous variables, in this case freight cost and line item insurance. The chart helps identify whether there is a linear correlation or trend between the insurance value and the freight cost incurred.

##### 2. What is/are the insight(s) found from the chart?

* There is a weak to moderate positive correlation between line item insurance and freight cost—as the insurance value increases, freight cost tends to increase as well.

* However, the data is widely dispersed, especially for lower insurance values, where freight costs vary drastically.

* The regression line and its shaded confidence interval show a general upward trend, confirming the positive association.

* Some outliers exist where low insurance values are paired with unusually high freight costs and vice versa.
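
The regression line drawn by `sns.regplot` is an ordinary least-squares fit, so its slope and goodness of fit can also be computed directly. The following is a minimal sketch with NumPy on made-up pairs; it illustrates the calculation and does not reproduce the chart's actual numbers.

```python
import numpy as np

# Toy insurance/freight pairs for illustration only
insurance = np.array([5.0, 20.0, 60.0, 150.0, 400.0])
freight = np.array([800.0, 600.0, 2500.0, 1800.0, 6000.0])

# Least-squares line: freight ~ slope * insurance + intercept
slope, intercept = np.polyfit(insurance, freight, deg=1)

# R^2: share of the freight-cost variance explained by the fitted line
predicted = slope * insurance + intercept
ss_res = np.sum((freight - predicted) ** 2)
ss_tot = np.sum((freight - freight.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

A low R² alongside a positive slope would match the "weak to moderate" wording used above.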

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* This relationship can help in risk-based freight cost forecasting—if a product requires high insurance, companies can proactively budget higher freight expenses.

* Can guide logistics planning for high-value goods, improving cost efficiency and profitability by choosing optimized shipping options.

* It helps in setting insurance and shipping policies—for example, bundling high-value goods in fewer shipments to reduce cumulative freight cost.

**Insights that lead to negative growth:**

* If this correlation is ignored, businesses might underestimate freight costs for high-insurance items, leading to budget overruns.

* Also, frequent mismatches between freight cost and insurance value (seen in outliers) may signal inefficient shipment handling, possibly due to poor vendor selection or shipment mode misalignment.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12, 6))
avg_freight_by_vendor = df.groupby('vendor')['freight_cost_(usd)'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_freight_by_vendor.values, y=avg_freight_by_vendor.index, palette='mako')
plt.title('Top 10 Vendors by Average Freight Cost', fontsize=14)
plt.xlabel('Average Freight Cost (USD)', fontsize=12)
plt.ylabel('Vendor Name', fontsize=12)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This horizontal bar chart was chosen to clearly rank and compare vendors based on their average freight costs.

##### 2. What is/are the insight(s) found from the chart?

* The Medical Export Group BV has the highest average freight cost, followed by Shanghai Kehua Bioengineering Co., Ltd. and Inverness Medical Innovations Hong Kong Ltd.

* There's a significant gap in average freight cost between the top vendor (over $35,000) and the 10th-ranked vendor (just under $10,000), indicating varying shipping efficiencies or logistics complexities.

* Many high-cost vendors appear to be international companies, suggesting longer distances, heavier shipments, or premium shipping services may be involved.
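
A ranking by mean alone can be dominated by vendors with only one or two expensive shipments. The sketch below reports the shipment count next to the mean and applies a minimum-count filter; `MIN_SHIPMENTS` is a hypothetical threshold chosen for illustration, and the vendor rows are made up.

```python
import pandas as pd

# Toy vendor rows for illustration only
toy = pd.DataFrame({
    "vendor": ["V1", "V1", "V1", "V2", "V3", "V3"],
    "freight_cost_(usd)": [1000.0, 1400.0, 1200.0, 40000.0, 3000.0, 5000.0],
})

# Mean cost and number of shipments per vendor, most expensive first
vendor_stats = (
    toy.groupby("vendor")["freight_cost_(usd)"]
    .agg(mean_cost="mean", shipments="count")
    .sort_values("mean_cost", ascending=False)
)

# Hypothetical threshold: ignore vendors with fewer than 2 shipments
MIN_SHIPMENTS = 2
reliable = vendor_stats[vendor_stats["shipments"] >= MIN_SHIPMENTS]

print(vendor_stats)
print(reliable)
```

In the toy data, `V2` tops the unfiltered ranking on the strength of a single shipment and disappears once the filter is applied.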

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Vendor-Specific Cost Management: FedEx can engage with high-cost vendors to optimize shipping methods, consolidate shipments, or review contract terms.

* Contract Negotiation & Tiered Pricing: Helps in revisiting freight agreements with these vendors or introducing tiered pricing strategies based on volume or frequency.

* Supply Chain Optimization: Identifying expensive vendors may trigger a review of whether regional suppliers could be used instead, cutting down transportation costs.

**Insights that lead to negative growth:**

* Over-Reliance on Expensive Vendors: If high freight cost vendors are over-utilized without clear ROI, this can erode profit margins.

* Customer Pricing Imbalance: These elevated costs might be passed on to customers, risking competitive pricing and customer churn.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12,6))
sns.boxplot(data=df, x='product_group', y='freight_cost_(usd)', palette='coolwarm')
plt.title('Freight Cost Distribution by Product Group')
plt.xlabel('Product Group')
plt.ylabel('Freight Cost (USD)')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen because it shows the distribution, central tendency, and spread of freight costs across product groups. It also clearly highlights outliers, which matters in logistics cost analysis, where a few shipments may disproportionately affect overall expenses.

##### 2. What is/are the insight(s) found from the chart?

* ARV and HRDT product groups have the highest median freight costs and widest cost distributions, along with a significant number of outliers, indicating inconsistent and possibly inflated shipping costs for these categories.

* ACT has a relatively moderate median and a tighter distribution, suggesting more predictable freight costs.

* MRDT has almost negligible freight cost, or very low cost across the board.

* ANTM shows a small number of outliers and a moderate distribution, suggesting some variation but not extreme.
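
The medians and spreads read from the boxes can be tabulated so that the comparison does not depend on eyeballing the chart. This is a small sketch on made-up rows; `iqr_width` is a hypothetical helper, and only two of the product groups are used to keep the example short.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "product_group": ["ARV", "ARV", "ARV", "ARV", "ACT", "ACT", "ACT", "ACT"],
    "freight_cost_(usd)": [2000.0, 4000.0, 9000.0, 30000.0,
                           1000.0, 1200.0, 1500.0, 1800.0],
})

def iqr_width(values: pd.Series) -> float:
    """Interquartile range: the height of the box in a boxplot."""
    return values.quantile(0.75) - values.quantile(0.25)

# Median, box height, and sample size per product group
group_summary = toy.groupby("product_group")["freight_cost_(usd)"].agg(
    median="median", iqr=iqr_width, n="count"
)
print(group_summary)
```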



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* By identifying ARV and HRDT as groups with high and inconsistent freight costs, stakeholders can investigate reasons—such as shipment size, urgency, or origin—and optimize transportation strategies (e.g., consolidated shipping or alternate logistics partners).

* ACT and ANTM appear more manageable and predictable, which helps in forecasting and budgeting.

**Insights that lead to negative growth:**

* The high variability and extreme outliers in ARV and HRDT might be a cost sink, especially if these products are not contributing proportionally to revenue. This could impact profitability if not addressed.

* If MRDT is indeed critical but consistently cheap, it might signal under-utilization of shipment capacity, or potential quality/volume issues.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x='shipment_mode', y='freight_cost_(usd)', palette='pastel')
plt.title('Freight Cost by Mode of Shipment')
plt.xlabel('Mode of Shipment')
plt.ylabel('Freight Cost (USD)')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen for this analysis because it effectively visualizes the spread, central tendency, and variability of freight costs across shipment modes (Air, Truck, Air Charter, Ocean). It also highlights outliers that may significantly affect logistics costs and shows which modes are more cost-effective or more volatile.

##### 2. What is/are the insight(s) found from the chart?

* Air Charter has the highest median freight cost and widest cost spread, indicating it is the most expensive and inconsistent mode of shipment.

* Air and Truck have relatively similar medians, but Air has more extreme outliers, indicating potential spikes in cost.

* Ocean appears to be more cost-efficient, with a lower median and tighter distribution compared to others.

* Outliers are present in all modes, but they are more frequent and extreme in Air Charter and Air, pointing to potential inefficiencies or high-priority shipping decisions.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* This chart enables strategic decision-making by identifying which shipment modes contribute most to freight cost variability.

* Shifting non-urgent shipments from Air Charter to Ocean or Truck can significantly reduce freight expenses.

* Helps in budget forecasting and negotiating with logistics partners for cost control.

**Insights that lead to negative growth:**

* Heavy reliance on Air Charter, despite its high cost, could lead to profit margin erosion, especially if used unnecessarily for low-priority or bulk items.

* Ignoring the insights could result in unsustainable logistics costs and inefficient resource allocation.

# Multivariate Analysis

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='shipment_mode', y='freight_cost_(usd)', hue='product_group')
plt.title('Freight Cost by Mode of Shipment and Product Group')
plt.xlabel('Mode of Shipment')
plt.ylabel('Freight Cost (USD)')
plt.xticks(rotation=45)
plt.legend(title='Product Group')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot with a hue was chosen because it compares freight cost distributions across shipment modes and product groups at the same time. It conveys medians, variability, and outliers visually, making it easy to spot cost-intensive shipment methods and product categories at a glance.

##### 2. What is/are the insight(s) found from the chart?

* Air and Air Charter shipments tend to have higher freight costs, with Air Charter being especially expensive for the ARV group.

* The ARV product group shows high freight cost variability across all shipment types, particularly Air Charter, where median costs are significantly elevated.

* Ocean shipments, while slower, show lower median freight costs, especially for HRDT and ARV—suggesting a cost-effective option.

* Truck shipments are relatively moderate in cost, but ARV still shows a higher range of costs even here.

* HRDT and ARV are the most frequently shipped product groups across all modes.
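
With two grouping variables the boxes become crowded, so a pivot table of medians can serve as a compact cross-check. The sketch below uses made-up rows; combinations that do not occur in the data show up as `NaN`.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "shipment_mode": ["Air", "Air", "Air", "Ocean", "Ocean", "Air Charter"],
    "product_group": ["ARV", "ARV", "HRDT", "ARV", "HRDT", "ARV"],
    "freight_cost_(usd)": [5000.0, 7000.0, 3000.0, 1500.0, 900.0, 40000.0],
})

# Median freight cost for every (shipment mode, product group) combination
median_cost = toy.pivot_table(
    index="shipment_mode",
    columns="product_group",
    values="freight_cost_(usd)",
    aggfunc="median",
)
print(median_cost)
```

Passing `aggfunc="count"` instead would show how many shipments stand behind each cell, which is what a statement about shipping frequency should rest on.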

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Helps identify cost-saving opportunities by encouraging the use of ocean freight where feasible, particularly for non-urgent shipments.

* Businesses can optimize logistics planning—for instance, reserving Air Charter for high-priority or life-saving ARV shipments only.

* Assists in budget forecasting and inventory planning, ensuring better control over shipping expenses by product group.

**Insights that lead to negative growth:**

* Overreliance on Air Charter for ARV shipments can inflate logistics costs unnecessarily, especially if urgency is not critical.

* Without adjusting the mode of shipment by cost-effectiveness, the company may face reduced margins, particularly for high-volume shipments.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(12, 6))
grouped = df.groupby(['product_group', 'shipment_mode'])['freight_cost_(usd)'].sum().reset_index()
sns.barplot(data=grouped, x='product_group', y='freight_cost_(usd)', hue='shipment_mode')
plt.title('Total Freight Cost by Product Group & Shipment Mode')
plt.xlabel('Product Group')
plt.ylabel('Total Freight Cost (USD)')
plt.xticks(rotation=45)
plt.legend(title='Shipment Mode')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen because it shows the total freight cost for each product group broken down by shipment mode, making it easy to identify the most expensive logistics combinations and to see how costs are distributed across product types.

##### 2. What is/are the insight(s) found from the chart?

* ARV and HRDT are the top contributors to freight costs overall.

* Air shipments for ARV dominate the chart, with a total cost exceeding $25 million, making it the single most expensive shipping combination.

* Truck and Air Charter are also significantly used for ARV, contributing to overall high logistics spend.

* Ocean shipments are underutilized across all product groups despite their lower costs.

* ACT, ANTM, and MRDT have negligible freight cost contributions, implying either low volume or infrequent shipments.
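
To put statements such as "dominates" or "negligible" on a numeric footing, each combination's share of total freight cost can be computed. The sketch below is one possible way to do that. The helper `cost_share` and the toy figures are assumptions for illustration only and do not reproduce the dataset's values.

```python
import pandas as pd

# Toy rows invented purely for illustration; they are NOT from the SCMS dataset.
toy = pd.DataFrame({
    "product_group": ["ARV", "ARV", "ARV", "HRDT", "HRDT", "ACT"],
    "shipment_mode": ["Air", "Air", "Ocean", "Air", "Truck", "Air"],
    "freight_cost_(usd)": [500.0, 300.0, 100.0, 60.0, 30.0, 10.0],
})


def cost_share(frame: pd.DataFrame) -> pd.DataFrame:
    """Share of total freight cost per product group (rows) and mode (columns)."""
    totals = frame.pivot_table(
        index="product_group",
        columns="shipment_mode",
        values="freight_cost_(usd)",
        aggfunc="sum",
        fill_value=0,  # combinations with no shipments contribute nothing
    )
    # Divide every cell by the grand total so that all cells sum to 1
    return totals / totals.to_numpy().sum()


share = cost_share(toy)
print(share.round(3))
```

On the notebook's data this would presumably be called as `cost_share(df)`. A table of shares makes it easier to compare combinations than reading bar heights.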

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* The chart clearly pinpoints cost-heavy areas like ARV via Air and suggests where cost optimization is urgently needed.

* By shifting non-urgent ARV shipments to ocean or truck, businesses can significantly reduce logistics costs.

* Helps in supply chain budget planning and prioritizing shipment strategies based on cost-efficiency.

**Insights that lead to negative growth:**

* Continuing with Air as the dominant mode for ARV and HRDT will erode profit margins due to high logistics expenditure.

* Underutilization of cost-effective modes like Ocean indicates a missed opportunity for optimization and more sustainable shipping.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(14, 6))
sns.swarmplot(data=df, x='country', y='freight_cost_(usd)', hue='product_group')
plt.title('Freight Cost Distribution by Country & Product Group')
plt.xlabel('Country')
plt.ylabel('Freight Cost (USD)')
plt.xticks(rotation=90)
plt.legend(title='Product Group', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A swarm plot (a categorical scatter plot) was chosen because it visualizes the spread and variability of freight costs across different countries and product groups. It is well suited to identifying outliers, cost anomalies, and country-wise cost concentrations that would not be apparent in summary charts.

##### 2. What is/are the insight(s) found from the chart?

* ARV and HRDT dominate freight costs across most countries, visible through the large number of orange and blue dots.

* High-cost outliers are visible in countries like Ethiopia, Kenya, Mozambique, Nigeria, and Uganda, suggesting possible inefficiencies, urgent shipments, or long distances.

* Mozambique and Ethiopia stand out, with freight costs reaching roughly $200,000–$250,000 in certain instances.

* Countries like Guinea, Libya, Liberia, and Guatemala have minimal freight cost distributions—indicating fewer shipments or better cost control.

* ACT, MRDT, and ANTM product groups are rarely shipped or have very low freight costs across all countries.
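
The high-cost outliers mentioned above were identified visually. One common way to flag them programmatically is the 1.5 × IQR rule applied per country, sketched below. The helper `flag_high_cost`, the fence multiplier `k`, and the toy rows are assumptions for illustration; the notebook itself does not define an outlier rule for this chart.

```python
import pandas as pd

# Toy rows invented purely for illustration; they are NOT from the SCMS dataset.
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 4,
    "freight_cost_(usd)": [100.0, 110.0, 120.0, 130.0, 140.0, 5000.0,
                           50.0, 55.0, 60.0, 65.0],
})


def flag_high_cost(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Flag rows whose freight cost exceeds Q3 + k * IQR within their country."""
    costs = frame.groupby("country")["freight_cost_(usd)"]
    # transform() broadcasts each country's quartile back onto its rows
    q1 = costs.transform(lambda s: s.quantile(0.25))
    q3 = costs.transform(lambda s: s.quantile(0.75))
    upper = q3 + k * (q3 - q1)
    return frame.assign(
        upper_fence=upper,
        is_outlier=frame["freight_cost_(usd)"] > upper,
    )


flagged = flag_high_cost(toy)
print(flagged[flagged["is_outlier"]])
```

Applied to the notebook's `df`, the flagged rows could then be reviewed country by country before deciding whether they reflect urgent shipments, long distances, or data issues.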

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* The country-specific analysis helps to target cost-reduction strategies for the most expensive regions (e.g., Kenya, Mozambique).

* Enables a granular audit of outlier costs, which can uncover logistical bottlenecks, corruption, or inefficient routing.

* Identifies countries with stable and controlled costs, which could serve as models for best practices.

**Insights that lead to negative growth:**

* Failure to address outliers could lead to budget overruns, especially in high-volume countries.

* Without such detailed analysis, strategic planning and negotiation with local logistics partners becomes less effective, leading to missed opportunities for cost savings.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(12, 6))
sns.violinplot(
    data=df_clean,
    x='shipment_mode',
    y='delivery_delay',
    hue='product_group',
    split=True,  # with more than two hue levels this needs seaborn >= 0.13; use split=False on older versions
    inner='quartile',
    palette='Set2'
)
plt.title('Delivery Delay Distribution by Mode of Shipment and Product Group')
plt.xlabel('Mode of Shipment')
plt.ylabel('Delivery Delay (Days)')
plt.legend(title='Product Group', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The violin plot was chosen because it displays both the distribution and spread of delivery_delay across the shipment_mode categories, while also comparing across product_group. It gives a clear idea of how delays vary within and between shipment methods and product categories.

##### 2. What is/are the insight(s) found from the chart?

* Air shipments show wide variability in delivery delays, especially for ARV and ACT product groups.

* Truck shipments also display high negative outliers, indicating significant early deliveries or possibly incorrect date entries.

* Air Charter and Ocean have narrower distributions, suggesting more consistency in delivery timing.

* The ACT group tends to have more frequent and larger delivery delays, especially when shipped via Air.
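
The variability and early-delivery observations above can be cross-checked with a small summary table. The sketch below assumes that `delivery_delay` is measured in days, with positive values meaning late and negative values meaning early, as the discussion of negative outliers suggests. The helper `delay_profile` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows invented purely for illustration; they are NOT from the SCMS dataset.
toy = pd.DataFrame({
    "shipment_mode": ["Air"] * 4 + ["Ocean"] * 4,
    "delivery_delay": [-30, 0, 10, 40, -1, 0, 0, 1],
})


def delay_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Median, spread, and early/late shares of delivery delay per shipment mode."""
    delays = frame.groupby("shipment_mode")["delivery_delay"]
    return pd.DataFrame({
        "median": delays.median(),
        "std": delays.std(),
        # Assumption: negative delay = delivered early, positive = delivered late
        "share_early": delays.apply(lambda s: (s < 0).mean()),
        "share_late": delays.apply(lambda s: (s > 0).mean()),
    })


profile = delay_profile(toy)
print(profile)
```

Grouping by `["shipment_mode", "product_group"]` instead would give the per-product breakdown that the violin plot shows.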

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**The insights can help create a positive business impact in the following ways:**

* Helps identify which shipment methods and product categories are prone to delays or inconsistencies.

* Logistics and procurement teams can optimize shipping modes for products like ACT to improve delivery performance.

**Insights that lead to negative growth:**

* Continued use of modes with high variability for time-sensitive products can lead to customer dissatisfaction and loss of trust.

* Delays in delivery for critical goods like ARV (likely Antiretroviral drugs) could have serious consequences in a healthcare supply chain.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Selecting numerical features for correlation
corr_features = df_clean[['delivery_delay', 'freight_cost_(usd)', 'weight_(kilograms)', 'line_item_value']]

# Compute correlation matrix
corr_matrix = corr_features.corr()

# Plotting heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Key Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the Correlation Heatmap because it is an effective visualization to examine linear relationships between multiple numerical features simultaneously. It helps identify which variables are positively or negatively correlated and to what extent, guiding further feature selection, analysis, or modeling decisions.

##### 2. What is/are the insight(s) found from the chart?

* freight_cost_(usd) has a moderate positive correlation with line_item_value (0.43) and a weaker one with weight_(kilograms) (0.23), suggesting that freight cost tends to rise with item value and, to a lesser extent, with weight.

* line_item_value and weight_(kilograms) are moderately correlated (0.35), indicating that heavier line items tend to be more valuable.

* delivery_delay shows almost no correlation with the other variables (all values between -0.04 and -0.02), implying that delivery delays are likely influenced by other factors (e.g., logistics, external disruptions) not captured in these specific numerical columns.
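
The heatmap uses pandas' default Pearson correlation, which only measures linear association. With skewed cost data, a rank-based (Spearman) correlation can be a useful complement. The sketch below computes Spearman as the Pearson correlation of the ranks; the toy values are invented for illustration and are not from the dataset.

```python
import pandas as pd

# Toy rows invented purely for illustration; they are NOT from the SCMS dataset.
# Cost rises monotonically with weight, but far from linearly.
toy = pd.DataFrame({
    "weight_(kilograms)": [1.0, 2.0, 3.0, 4.0, 5.0],
    "freight_cost_(usd)": [1.0, 2.0, 4.0, 8.0, 100.0],
})

# Pearson: strength of the linear relationship
pearson = toy.corr(method="pearson")

# Spearman: Pearson correlation computed on the ranks of the values,
# so it captures any monotonic relationship, linear or not
spearman = toy.rank().corr(method="pearson")

print("Pearson :", round(pearson.iloc[0, 1], 3))
print("Spearman:", round(spearman.iloc[0, 1], 3))
```

If the two measures differ noticeably on the real columns, that would hint at a monotonic but non-linear relationship, or at outliers driving the Pearson value.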

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df_clean[['delivery_delay', 'freight_cost_(usd)', 'weight_(kilograms)', 'line_item_value']], kind='scatter', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the Pair Plot because it enables a detailed visual inspection of relationships and distributions among all combinations of multiple numerical features.

##### 2. What is/are the insight(s) found from the chart?

* Most features, like freight_cost_(usd), weight_(kilograms), and line_item_value, show some positive trends with one another, supporting the moderate correlations from the heatmap.

* delivery_delay is widely scattered against the other variables and shows no clear linear relationship, which is consistent with its near-zero correlations in the heatmap.

* The distributions along the diagonals highlight significant skewness, particularly in freight_cost_(usd) and line_item_value, which may require transformation (like log-scaling) for modeling or deeper analysis.

* There are clear outliers in all variables, especially in weight, freight cost, and line item value, which could distort the analysis and should be investigated or cleaned if necessary.
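
The log-scaling suggestion above can be illustrated with a short sketch. It uses `np.log1p` (the logarithm of 1 + x, which stays defined for zero costs) and compares skewness before and after. The toy values are invented for illustration and are not from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values invented purely for illustration; they are NOT from the SCMS dataset.
freight = pd.Series(
    [10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0, 10000.0],
    name="freight_cost_(usd)",
)

# log1p(x) = log(1 + x): compresses the long right tail and handles x = 0
logged = np.log1p(freight)

skew_before = freight.skew()
skew_after = logged.skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after : {skew_after:.2f}")
```

On the notebook's data, the same transformation would presumably be applied to `df_clean['freight_cost_(usd)']` and `df_clean['line_item_value']` before re-plotting or modeling.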

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.


**Solution to Business Objective:**

To achieve the business objective of optimizing FedEx’s delivery operations, we suggest a few key solutions based on the insights generated from the EDA:

1. Optimize Delivery Routes: By identifying patterns in delivery times and zones, FedEx can optimize delivery routes to reduce delays and improve efficiency.

2. Resource Allocation: Analyzing delivery times against package characteristics (weight, size) and delivery status will help allocate resources (drivers, vehicles) more effectively, ensuring timely deliveries.

3. Scheduling Improvements: The analysis highlights potential bottlenecks and the best times for deliveries. This insight can be used to refine delivery scheduling and ensure smoother operations.

4. Data-Driven Decision Making: The visualizations and analysis provide a clear view of the most critical performance metrics, helping managers make informed decisions to boost efficiency.

In addition to these solutions, the project also emphasizes continuous monitoring and real-time data integration, which will allow FedEx to make iterative improvements over time based on up-to-date data.

# **Conclusion**

This project provides valuable insights into FedEx’s delivery performance using a comprehensive Exploratory Data Analysis of the SCMS Delivery History dataset. Through the application of Univariate, Bivariate, and Multivariate analysis techniques, we’ve uncovered several key factors influencing delivery efficiency. By implementing the suggested solutions, such as optimizing delivery routes, improving resource allocation, and refining scheduling, FedEx can improve operational efficiency, reduce delivery times, and ultimately enhance customer satisfaction. The findings from this project offer a strategic roadmap for leveraging data to make better, more informed business decisions, driving continuous improvement in FedEx’s logistics and delivery systems.
