#Enterprise Fleet Analytics Pipeline: Focuses on the business outcome (analytics) and the domain (fleet/logistics).

![](./logistics_project.png)

##**1. Data Munging**

####1. Visually/manually open the files and capture a few data patterns (Manual Exploratory Data Analysis)

**Source 1: Logistics Shipment Data (JSON Format)**
- Data is received from the source system in JSON (semi-structured format)
- Records consist of key–value pairs

**Source 2: Logistics Data (CSV Format – 4 Columns)**
- Data is received in CSV format with 4 columns
- Header present, no footer
- Contains null columns and null records
- Data format inconsistencies observed, e.g. age contains string values
- Includes additional column(s)

**Source 3: Logistics Data (CSV Format – 7 Columns)**
- Data is received in CSV format with 7 columns
- Header present, no footer
- Contains duplicate records
- Contains null columns and null records
- Data format inconsistencies observed, e.g. age contains string values
- Includes additional column(s)

####2. Programmatically find a few data patterns by applying the EDA steps below (File: logistics_source1)

1. Apply inferSchema and toDF to create a DF and analyse the actual data.
2. Analyse the schema, datatypes, columns, etc.
3. Analyse the duplicate records count and summary of the dataframe.

In [0]:
source1_df=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source1",header=True,inferSchema=True).toDF("Shipment_id","First_Name","Last_Name","Age","Role")

source1_df.printSchema()  # printSchema() prints directly and returns None, so no print() wrapper is needed

source1_df.show(10,False)  # show() also prints directly and returns None
display(source1_df.columns)
display(source1_df.dtypes) #Age and Shipment_id are both of string type
print("actual count of the data in Source1:",source1_df.count())
print("De-duplicated record count (all columns using distinct):",source1_df.distinct().count())
print("de-duplicated given id column count:",source1_df.dropDuplicates(['Shipment_id']).count())


source2_df=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source2",header=True)
source2_df.show(10,False)  # show() prints directly and returns None
display(source2_df.columns)
display(source2_df.dtypes) #Without inferSchema every column is read as string, including Age and Shipment_id
print("actual count of the data in Source2:",source2_df.count())
print("De-duplicated record count (all columns using dropDuplicates)",source2_df.dropDuplicates().count())
print("de-duplicated given id column count:",source2_df.dropDuplicates(['Shipment_id']).count())

display(source1_df.summary())
display(source2_df.summary())


###a. Passive Data Munging -  (File: logistics_source1  and logistics_source2)
Without modifying the data, identify:<br>
Shipment IDs that appear in both master_v1 and master_v2<br>
Records where:<br>
1. shipment_id is non-numeric
2. age is not an integer<br>

Count rows having:<br>
3. fewer columns than expected<br>
4. more columns than expected<br>

In [0]:
# '[^0-9]' matches any non-digit character, so IDs containing symbols or spaces are caught as well as letters
non_numeric_df1 = source1_df.where("Shipment_id rlike '[^0-9]'")
print("shipment_id is non-numeric from source1_df:")
display(non_numeric_df1)
source1_df.printSchema()

non_numeric_df2 = source2_df.where("Shipment_id rlike '[^0-9]'")
print("shipment_id is non-numeric from source2_df:")
display(non_numeric_df2)
source2_df.printSchema()
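
The cell above covers the non-numeric shipment_id check. The other two passive checks (shipment IDs present in both sources, and ages that are not integers) could be sketched as below. This is a plain-Python illustration of the rules only, using hypothetical sample rows rather than the real files; on the DataFrames the same idea would typically use `intersect` on the ID column and an `rlike` filter on age.

```python
import re

# Hypothetical sample rows standing in for the two sources (not real data).
source1_rows = [
    {"shipment_id": "1001", "age": "34"},
    {"shipment_id": "1002", "age": "ten"},
    {"shipment_id": "A103", "age": "29"},
]
source2_rows = [
    {"shipment_id": "1002", "age": "41"},
    {"shipment_id": "1004", "age": "27.5"},
    {"shipment_id": "A103", "age": None},
]

# A whole number with an optional sign; decimals and words do not match.
INTEGER_PATTERN = re.compile(r"[+-]?\d+")


def is_integer_text(value):
    """Return True only when value is a string holding a whole number."""
    return value is not None and INTEGER_PATTERN.fullmatch(value.strip()) is not None


def common_shipment_ids(rows_a, rows_b):
    """Return the set of non-null shipment IDs that occur in both row lists."""
    ids_a = {row["shipment_id"] for row in rows_a if row["shipment_id"] is not None}
    ids_b = {row["shipment_id"] for row in rows_b if row["shipment_id"] is not None}
    return ids_a & ids_b


def rows_with_non_integer_age(rows):
    """Return the rows whose age is missing or is not a whole number."""
    return [row for row in rows if not is_integer_text(row["age"])]


shared_ids = common_shipment_ids(source1_rows, source2_rows)
bad_age_rows = rows_with_non_integer_age(source1_rows + source2_rows)
print("Shipment IDs in both sources:", sorted(shared_ids))
print("Rows with non-integer age:", bad_age_rows)
```

Nothing is modified here: both helpers only read the rows, which keeps the check passive.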

Count rows having:<br>
3. fewer columns than expected<br>
4. more columns than expected<br>

In [0]:
from pyspark.sql.functions import size,col,split,when
expected_cols = 5   # change as per your file
delimiter = ","
print("Source df1 Count rows:")
raw_df1=spark.read.text("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source1")
df_with_col_count=raw_df1.withColumn('actual_col_count',size(split(col('value'),delimiter)))

df_flagged=df_with_col_count.withColumn("column_status",when(col("actual_col_count")<expected_cols,"FEWER_COLUMNS").when(col("actual_col_count")>expected_cols,"MORE_COLUMNS").otherwise("EXPECTED_COLUMNS"))
display(df_flagged)

df_bad_record = df_with_col_count.where(col("actual_col_count") != expected_cols).groupBy("actual_col_count").count()
display(df_bad_record)

print("Source df2 Count rows:")
raw_df2=spark.read.text("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source2")
df_with_col_count1=raw_df2.withColumn("actual_column_count",size(split(col("value"),delimiter)))
df_flagged1=df_with_col_count1.withColumn("column_status",when(col("actual_column_count")<7,"FEWER_COLUMNS").when(col("actual_column_count")>7,"MORE_COLUMNS").otherwise("EXPECTED_COLUMNS"))
display(df_flagged1)
df_bad_record1=df_flagged1.where(col("actual_column_count")!=7).groupBy("actual_column_count").count()
display(df_bad_record1)
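
One caveat with splitting each raw line on the delimiter: a quoted field that itself contains a comma is counted as two columns, and the header line is counted as a row. A minimal plain-Python sketch of the same fewer/more/expected classification, using the standard `csv` module so quoted fields are handled, is shown below. The sample text is hypothetical and the function name is an illustration, not part of the notebook.

```python
import csv
import io
from collections import Counter


def classify_column_counts(text, expected_cols, delimiter=","):
    """Count lines with fewer, more, or exactly the expected number of fields."""
    status_counts = Counter()
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        if len(fields) < expected_cols:
            status_counts["FEWER_COLUMNS"] += 1
        elif len(fields) > expected_cols:
            status_counts["MORE_COLUMNS"] += 1
        else:
            status_counts["EXPECTED_COLUMNS"] += 1
    return status_counts


# Hypothetical sample: the third line has a quoted comma, which a naive
# split(",") would count as six fields instead of five.
sample_text = (
    'shipment_id,first_name,last_name,age,role\n'
    '1001,Asha,Rao,34,driver\n'
    '1002,Ben,"Lee, Jr.",41,loader\n'
    '1003,Chen,Wu\n'
    '1004,Dee,Ng,29,driver,extra\n'
)

column_status_counts = classify_column_counts(sample_text, expected_cols=5)
print(dict(column_status_counts))
```

Whether the real files contain quoted delimiters is not known from the notebook, so this only matters if they do.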

In [0]:
#Create a Spark Session Object
from pyspark.sql.session import SparkSession
spark=SparkSession.builder.appName("Logistic_analysis").getOrCreate()

###**b. Active Data Munging** File: logistics_source1 and logistics_source2

#####1.Combining Data + Schema Merging (Structuring)
1. Read both files without enforcing schema<br>
2. Align them into a single canonical schema: 
- shipment_id,<br>
- first_name,<br>
- last_name,<br>
- age,<br>
- role,<br>
- hub_location,<br>
- vehicle_type,<br>
- data_source<br>
3. Add data_source column with values as: system1, system2 in the respective dataframes<br>

Source 1 (System A): id, fname, lname, age<br>
Source 2 (System B): shipment_id, full_name, years<br>
Canonical schema (decided by you):- shipment_id, first_name, last_name, age<br>

All source data is reshaped into this structure before further use.
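
To make the System A / System B example concrete, the reshaping can be sketched in plain Python as one small mapping function per source. The sample rows are hypothetical, and splitting `full_name` on the first space is an assumption made for illustration; real names may need a different rule.

```python
def from_system_a(row):
    """Map a System A row (id, fname, lname, age) to the canonical schema."""
    return {
        "shipment_id": str(row["id"]),
        "first_name": row["fname"],
        "last_name": row["lname"],
        "age": row["age"],
        "data_source": "system1",
    }


def from_system_b(row):
    """Map a System B row (shipment_id, full_name, years) to the canonical schema."""
    # Assumption: the first space separates the first name from the rest.
    first, _, last = (row["full_name"] or "").strip().partition(" ")
    return {
        "shipment_id": str(row["shipment_id"]),
        "first_name": first or None,
        "last_name": last.strip() or None,
        "age": row["years"],
        "data_source": "system2",
    }


# Hypothetical rows, one per source system.
system_a_rows = [{"id": 1001, "fname": "Asha", "lname": "Rao", "age": "34"}]
system_b_rows = [{"shipment_id": 2001, "full_name": "Ben Lee", "years": "41"}]

canonical_rows = [from_system_a(r) for r in system_a_rows] + [
    from_system_b(r) for r in system_b_rows
]
for canonical_row in canonical_rows:
    print(canonical_row)
```

Once every source is mapped to the same keys, the rows can be combined by name, which is what `unionByName` does for DataFrames.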

In [0]:
from pyspark.sql.functions import col, lit, expr

source1_raw = spark.read \
    .option("header", True) \
    .option("mode", "permissive") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source1")

source2_raw = spark.read \
    .option("header", True) \
    .option("mode", "permissive") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_source2")

source1_canonical = source1_raw.select(
    col("shipment_id").cast("string").alias("shipment_id"),
    col("first_name"),
    col("last_name"),
    col("age").cast("string"),
    col("role"),
    lit(None).cast("string").alias("hub_location"),
    lit(None).cast("string").alias("vehicle_type"),
    lit("system1").alias("data_source")
)
source2_canonical = source2_raw.select(
    col("shipment_id").cast("string").alias("shipment_id"),
    col("first_name"),
    col("last_name"),
    col("age").cast("string"),
    col("role"),
    col("hub_location"),
    col("vehicle_type"),
    lit("system2").alias("data_source")
)
canonical_df = source1_canonical.unionByName(source2_canonical)
display(canonical_df)
canonical_df.printSchema()  # printSchema() prints directly and returns None


#####2. Cleansing:
Cleansing (removal of unwanted datasets)<br>
1. Mandatory Column Check - Drop any record where any of the following columns is NULL:shipment_id, role<br>
2. Name Completeness Rule - Drop records where both of the following columns are NULL: first_name, last_name<br>
3. Join Readiness Rule - Drop records where the join key is null: shipment_id<br>

In [0]:
#Mandatory Column Check - Drop any record where any of the following columns is NULL: shipment_id, role
print("Before mandatory column check:", canonical_df.count())
canonical_df1 = canonical_df.na.drop(how='any',subset=["shipment_id","role"])
print("After mandatory column check:", canonical_df1.count())


#Name Completeness Rule - Drop records where both of the following columns are NULL: first_name, last_name
print("Before name completeness check:", canonical_df1.count())
canonical_df2 = canonical_df1.na.drop(how='all',subset=["first_name","last_name"])
print("After name completeness check:", canonical_df2.count())


Join Readiness Rule
A record must have a valid join key (shipment_id) to participate in downstream joins.
If the join key is NULL, the record is not usable and must be dropped.

In [0]:
#Join Readiness Rule - Drop records where the join key is null: shipment_id
from pyspark.sql.functions import col
print("Before Join Readiness check:", canonical_df2.count())
canonical_df_join_ready = canonical_df2.filter(col("shipment_id").isNotNull())
print("After Join Readiness check:", canonical_df_join_ready.count())

#OR

#canonical_df_join_ready = canonical_df2.na.drop(subset=["shipment_id"])



#####3.Scrubbing (convert raw to tidy)<br>
4. Age Defaulting Rule - Fill NULL values in the age column with: -1<br>
5. Vehicle Type Default Rule - Fill NULL values in the vehicle_type column with: UNKNOWN<br>
6. Invalid Age Replacement - Replace the following values in age:
"ten" to -1<br>
"" to -1<br>
7. Vehicle Type Normalization - Replace inconsistent vehicle types: 
truck to LMV<br>
bike to TwoWheeler<br>

In [0]:
#Age Defaulting Rule - Fill NULL values in the age column with: -1
cleaned_df=canonical_df_join_ready.na.fill("-1",subset=["age"])

#Vehicle Type Default Rule - Fill NULL values in the vehicle_type column with: UNKNOWN
cleaned_df1=cleaned_df.na.fill("UNKNOWN",subset=["vehicle_type"])

#Invalid Age Replacement - Replace the following values in age: "ten" to -1 and "" to -1
replacedata={'ten':'-1','':'-1'}
cleaned_df2=cleaned_df1.na.replace(replacedata,subset=["age"])

#Vehicle Type Normalization - Replace inconsistent vehicle types: truck to LMV and bike to TwoWheeler
replacedata1={'Truck':'LMV','Bike':'TwoWheeler'}
cleaned_df3=cleaned_df2.na.replace(replacedata1,subset=["vehicle_type"])
display(cleaned_df3)
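
Note that `na.replace` matches values exactly, so a mapping keyed on 'Truck' will not catch 'truck' or 'TRUCK'. A minimal plain-Python sketch of the four scrubbing rules with case-insensitive matching is given below. The function names are illustrative, and whether the real data mixes cases is an assumption.

```python
# Lookup keys are lowercase so that matching ignores the case of the input.
VEHICLE_TYPE_MAP = {"truck": "LMV", "bike": "TwoWheeler"}
INVALID_AGE_VALUES = {"ten", ""}


def scrub_age(age):
    """Return '-1' for a missing or known-invalid age, else the age unchanged."""
    if age is None or age.strip().lower() in INVALID_AGE_VALUES:
        return "-1"
    return age


def scrub_vehicle_type(vehicle_type):
    """Default a missing vehicle type to UNKNOWN and normalise known variants."""
    if vehicle_type is None:
        return "UNKNOWN"
    return VEHICLE_TYPE_MAP.get(vehicle_type.strip().lower(), vehicle_type)


# Hypothetical values covering each rule.
for raw_age in [None, "ten", "", "34"]:
    print(repr(raw_age), "->", scrub_age(raw_age))
for raw_vehicle in [None, "truck", "TRUCK", "Bike", "Van"]:
    print(repr(raw_vehicle), "->", scrub_vehicle_type(raw_vehicle))
```

On the DataFrame, one way to get the same effect would be to lowercase `vehicle_type` before applying the replacement map.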

####3. Standardization, De-Duplication and Replacement / Deletion of Data to make it usable

Detail Dataframe creation <br>
1. Read data from logistics_shipment_detail_3000.json
2. As this is clean JSON data, it does not require any cleansing or scrubbing.

In [0]:
json_clean_df=spark.read.option("multiLine", True).json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/logistics_data_analysis/Silver_data/logistics_shipment_detail_3000.json")
display(json_clean_df.limit(5))


Standardizations:<br>

1. Add a column<br>
Source File: logistics_shipment_detail_3000.json<br>
domain as 'Logistics'
2. Column Uniformity: 
role - Convert to lowercase<br>
Source File: logistics_source1 & logistics_source2<br>
vehicle_type - Convert values to UPPERCASE<br>
Source Files: logistics_shipment_detail_3000.json (and the merged master files)
hub_location - Convert values to initcap case<br>
3. Format Standardization:<br>
Source Files: logistics_shipment_detail_3000.json
Convert shipment_ref to string<br>
Pad to 10 characters with leading zeros<br>
Convert dispatch_date to yyyy-MM-dd<br>
Ensure delivery_cost has 2 decimal precision<br>
4. Data Type Standardization<br>
Standardizing column data types to fix schema drift and enable mathematical operations.<br>
Source File: logistics_source1 & logistics_source2 <br>
age: Cast String to Integer<br>
Source File: logistics_shipment_detail_3000.json<br>
shipment_weight_kg: Cast to Double<br>
Source File: logistics_shipment_detail_3000.json<br>
is_expedited: Cast to Boolean<br>
5. Naming Standardization <br>
Source File: logistics_source1 & logistics_source2<br>
Rename: first_name to staff_first_name<br>
Rename: last_name to staff_last_name<br>
Rename: hub_location to origin_hub_city<br>
6. Reordering columns logically in a better standard format:<br>
Source File: All 3 files<br>
shipment_id (Identifier), staff_first_name (Dimension), staff_last_name (Dimension), role (Dimension), origin_hub_city (Location), shipment_cost (Metric), ingestion_timestamp (Audit)

In [0]:
from pyspark.sql.functions import col,lit,lower,upper,initcap
#Add a column
#Source File: logistics_shipment_detail_3000.json
#domain as 'Logistics'
json_clean_df1=json_clean_df.withColumn("domain",lit("Logistics"))
display(json_clean_df1.limit(5))

#Column Uniformity: role - Convert to lowercase
cleaned_df4=cleaned_df3.withColumn("role",lower(col("role")))


#vehicle_type - Convert values to UPPERCASE
cleaned_df5=cleaned_df4.withColumn("vehicle_type",upper(col("vehicle_type")))


#Source Files: logistics_shipment_detail_3000.json (and the merged master files)
#hub_location - Convert values to initcap case
cleaned_df6=cleaned_df5.withColumn("hub_location",initcap(col("hub_location")))
display(cleaned_df6.limit(100))  # show the final dataframe so the initcap result is visible

Format Standardization:<br>
Source Files: logistics_shipment_detail_3000.json Convert shipment_ref to string<br>
Pad to 10 characters with leading zeros<br>
Convert dispatch_date to yyyy-MM-dd<br>
Ensure delivery_cost has 2 decimal precision<br>

In [0]:
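from pyspark.sql.functions import col,to_date,round as spark_round  # alias avoids shadowing Python's built-in round
#Format Standardization:
#Source Files: logistics_shipment_detail_3000.json Convert shipment_ref to string
json_clean_df1.printSchema()
#->shipment_ref column is not present in the data, so this step is skipped

#Pad to 10 characters with leading zeros
#->Presumably refers to shipment_ref, which is not present in the data, so this step is skipped

#Convert dispatch_date to yyyy-MM-dd (applied to shipment_date, the date column used in this dataset)
json_clean_df2=json_clean_df1.withColumn("shipment_date",to_date(col("shipment_date"),'yyyy-MM-dd'))


#Ensure delivery_cost has 2 decimal precision (applied to shipment_cost, the cost column used in this dataset)
json_clean_df3=json_clean_df2.withColumn("shipment_cost",spark_round(col("shipment_cost"),2))
display(json_clean_df3.limit(15))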
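
Since shipment_ref is absent from the current data, the padding rule cannot be demonstrated on it directly. The plain-Python sketch below shows what the three format rules amount to on single values. The helper names and the list of accepted input date formats are assumptions; on a DataFrame the padding step would typically use `lpad`.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Assumed input formats; the real feed may use others.
ASSUMED_INPUT_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def pad_shipment_ref(shipment_ref, width=10):
    """Convert to string and left-pad with zeros; longer values are left as-is."""
    return str(shipment_ref).zfill(width)


def normalize_date(text):
    """Return the date as yyyy-MM-dd, or None if no assumed format matches."""
    for date_format in ASSUMED_INPUT_DATE_FORMATS:
        try:
            return datetime.strptime(text, date_format).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def two_decimal_cost(cost):
    """Return the cost as a string with exactly two decimal places."""
    return str(Decimal(str(cost)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


print(pad_shipment_ref(12345))
print(normalize_date("25/12/2024"))
print(two_decimal_cost(12.5))
```

Rounding a double to two places does not by itself guarantee that two digits are displayed (12.5 stays 12.5), which is why the sketch formats through `Decimal`.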

Data Type Standardization<br>
Standardizing column data types to fix schema drift and enable mathematical operations.<br>
Source File: logistics_source1 & logistics_source2<br>
age: Cast String to Integer<br>
Source File: logistics_shipment_detail_3000.json<br>
shipment_weight_kg: Cast to Double<br>
Source File: logistics_shipment_detail_3000.json<br>
is_expedited: Cast to Boolean<br>

In [0]:
#Source File: logistics_source1 & logistics_source2
#age: Cast String to Integer
from pyspark.sql.functions import col

cleaned_df6.printSchema()
cleaned_df7 = cleaned_df6.withColumn("age",col("age").cast("int"))
cleaned_df7.printSchema()

#shipment_weight_kg: Cast to Double
json_clean_df3.printSchema()
#->shipment_weight_kg is already of type double, so no cast is needed

#is_expedited: Cast to Boolean
#->The current dataset has no is_expedited column, so no cast is applied
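
If an is_expedited column does arrive in a later feed, the cast could follow a rule like the one sketched here in plain Python. The accepted tokens are an assumption chosen for illustration, and unrecognised values become None rather than raising an error.

```python
# Assumed spellings of true and false in the source data.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}


def to_boolean(value):
    """Map common true/false spellings to bool; return None when unrecognised."""
    if value is None:
        return None
    if isinstance(value, bool):
        return value
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None


# Hypothetical raw values for is_expedited.
for raw_value in ["Yes", "0", True, " TRUE ", "maybe", None]:
    print(repr(raw_value), "->", to_boolean(raw_value))
```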


In [0]:
#Naming Standardization
#Source File: logistics_source1 & logistics_source2
#Rename: first_name to staff_first_name
#Rename: last_name to staff_last_name
cleaned_df8=cleaned_df7.withColumnsRenamed({"first_name":"staff_first_name","last_name":"staff_last_name"})

#Rename: hub_location to origin_hub_city
cleaned_df9=cleaned_df8.withColumnRenamed("hub_location","origin_hub_city")
cleaned_df9.printSchema()

6. Reordering columns logically in a better standard format:<br>
Source File: All 3 files<br>
shipment_id (Identifier), staff_first_name (Dimension), staff_last_name (Dimension), role (Dimension), origin_hub_city (Location), shipment_cost (Metric), ingestion_timestamp (Audit)

In [0]:
#question is not clear
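
One possible reading of the reordering task is to put the listed columns first, in the listed order, and keep any other columns after them. A minimal plain-Python sketch of that interpretation follows. The helper name is hypothetical, and columns from the target list that a dataframe does not have are simply skipped.

```python
# Target order taken from the task statement above.
TARGET_COLUMN_ORDER = (
    "shipment_id",
    "staff_first_name",
    "staff_last_name",
    "role",
    "origin_hub_city",
    "shipment_cost",
    "ingestion_timestamp",
)


def reorder_columns(existing_columns, preferred_order=TARGET_COLUMN_ORDER):
    """Return existing columns with the preferred ones first, in preferred order."""
    present = [name for name in preferred_order if name in existing_columns]
    remaining = [name for name in existing_columns if name not in preferred_order]
    return present + remaining


# Column names as they stand after the renaming step in this notebook.
current_columns = [
    "shipment_id", "staff_first_name", "staff_last_name", "age", "role",
    "origin_hub_city", "vehicle_type", "data_source",
]
ordered_columns = reorder_columns(current_columns)
print(ordered_columns)
```

On the DataFrame this could then be applied as `cleaned_df9.select(*reorder_columns(cleaned_df9.columns))`.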

Deduplication:
1. Apply Record Level De-Duplication
2. Apply Column Level De-Duplication (Primary Key Enforcement)
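
The two levels of de-duplication differ in what counts as "the same": record level compares every column, while column level compares only the key. A plain-Python sketch on hypothetical rows is shown below. It keeps the first occurrence per key, which is an assumption; `dropDuplicates` on a subset of columns does not promise which duplicate survives unless the data is ordered explicitly.

```python
def drop_duplicate_records(rows):
    """Record-level de-duplication: drop rows identical in every column."""
    seen = set()
    unique_rows = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # order-independent row identity
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)
    return unique_rows


def drop_duplicate_keys(rows, key="shipment_id"):
    """Column-level de-duplication: keep the first row seen for each key value."""
    seen_keys = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen_keys:
            seen_keys.add(row[key])
            unique_rows.append(row)
    return unique_rows


# Hypothetical rows: one exact duplicate and one key-only duplicate.
sample_rows = [
    {"shipment_id": "1001", "role": "driver"},
    {"shipment_id": "1001", "role": "driver"},
    {"shipment_id": "1002", "role": "loader"},
    {"shipment_id": "1002", "role": "driver"},
]

record_level = drop_duplicate_records(sample_rows)
key_level = drop_duplicate_keys(record_level)
print(len(sample_rows), "->", len(record_level), "->", len(key_level))
```

Running the record-level step first means the key-level step only has to resolve rows that genuinely conflict.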