#EXPLORE - Deaths
####Find Data Quality Issues in:
>_himalaya.bronze.deaths_

###_Previewing Data_

In [0]:
import pyspark.sql.functions as F

In [0]:
df = spark.table("himalaya.bronze.deaths")

In [0]:
print(f"Rows: {df.count()}")
df.printSchema()
df.show(10)

###_Finding Errors_
1. Nulls
2. Duplicates
3. Distinct
4. Format
5. Timeliness

> ## 1. Nulls
- ####_Nationality_
  - Null values can probably be filled in from other data sets.
  - If a value cannot be found in other data sets, the row will be removed.
- ####_Cause of Death_
  - No need to fill these in; it is fine if the cause of death is unknown.
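
The fill-then-drop plan for `nationality` might look like the plain-Python sketch below. The `name` join key, the record fields, and the `lookup` dictionary are all assumptions for illustration; this notebook does not say which reference data set or key would be used. In Spark the same idea would typically be a left join followed by `F.coalesce` and a filter.

```python
from typing import Optional


def fill_nationality(rows: list[dict], lookup: dict[str, str]) -> list[dict]:
    """Fill missing nationality from a lookup; drop rows that stay unknown."""
    filled = []
    for row in rows:
        nationality: Optional[str] = row.get("nationality")
        if nationality is None:
            # Hypothetical join key: the person's name.
            nationality = lookup.get(row["name"])
        if nationality is not None:
            filled.append({**row, "nationality": nationality})
    return filled


# Made-up sample records, not taken from himalaya.bronze.deaths.
rows = [
    {"name": "A", "nationality": "Nepal"},
    {"name": "B", "nationality": None},
    {"name": "C", "nationality": None},
]
lookup = {"B": "Japan"}
result = fill_nationality(rows, lookup)
```

Rows that already have a value are left alone, so the lookup only ever fills gaps and never overwrites source data.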

In [0]:
for c in df.columns:
  null_count = df.filter(F.col(c).isNull()).count()
  status = "✅" if null_count == 0 else "❌"
  print(f"{status} Nulls in {c}: {null_count}")

In [0]:
for c in df.columns:
  null_count = df.filter(F.col(c).isNull()).count()
  # Show the offending rows only for columns that actually contain nulls
  if null_count > 0:
    print(f"{c}:")
    df.filter(F.col(c).isNull()).show()


> ## 2. Duplicates

In [0]:
total = df.count()
distinct = df.distinct().count()
duplicates = total - distinct

print(f"Total rows:    {total}")
print(f"Distinct rows: {distinct}")
print(f"Duplicates:    {duplicates}")

if duplicates == 0:
    print("✅ No duplicates")
else:
    print(f"❌ {duplicates} duplicate rows found")

> ## 3. Distinct

- ####⚠️ Nations 
  - Some unexpected or inconsistent country values.
- ####⚠️ Cause of Deaths 
  - Several values are redundant variants of the same cause.
  - Consolidate them into broader categories.
- ####✅ Mountains
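
One way to consolidate redundant `cause_of_death` values is a keyword-to-category mapping, sketched below in plain Python so it can be checked off-cluster. The category names and keywords are hypothetical; the real ones should be derived from the distinct values listed in this section. The function could later be wrapped in a UDF or rewritten as chained `F.when` expressions.

```python
from typing import Optional

# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "avalanche": ["avalanche"],
    "fall": ["fall", "fell"],
    "altitude sickness": ["altitude", "edema", "oedema"],
    "exposure": ["exposure", "frostbite", "hypothermia"],
}


def categorise_cause(cause: Optional[str]) -> str:
    """Map a free-text cause of death to a broad category."""
    if cause is None or not cause.strip():
        return "unknown"
    text = cause.lower()
    # First matching category wins, in dictionary order.
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"
```

Substring matching is deliberately crude: a keyword such as `fall` also matches inside longer words, so the keyword lists need reviewing against the actual values before use.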

In [0]:
cols = ["nationality", "cause_of_death", "mountain"]

for c in cols:
    distinct_count = df.select(c).distinct().count()
    print(f"{c}: {distinct_count} distinct values")

In [0]:
# show() prints the rows and returns None, so there is nothing to assign
df.select(cols[0]).distinct().show(1000, truncate=False)

In [0]:
df.select(cols[1]).distinct().show(1000, truncate=False)

In [0]:
df.select(cols[2]).distinct().show(1000, truncate=False)

> ## 4. Format

- ####⚠️ Date
  - Change date datatype from **string** to **date**
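
Before casting, it may help to know which strings would fail to parse. The sketch below assumes the values are ISO `yyyy-MM-dd` strings, which this notebook does not confirm; the sample values are made up. In Spark, `F.to_date` plays a similar role on the column itself.

```python
from datetime import date, datetime
from typing import Optional


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Return a date for a yyyy-MM-dd string, or None if it cannot be parsed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        # Wrong layout or impossible date (e.g. month 13)
        return None


# Made-up samples: one valid, one impossible, one in another layout, one null.
samples = ["1895-08-24", "2019-13-01", "24/08/1895", None]
parsed = [parse_iso_date(s) for s in samples]
```

Any value that comes back as `None` without having been null in the source is a format problem to fix before the cast.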

In [0]:
df.printSchema()

> ## 5. Timeliness

In [0]:
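from datetime import date

# `date` is still a string column here, so min/max and the comparison below are
# lexicographic; they only match chronological order for ISO yyyy-MM-dd strings.
min_date, max_date = df.select(F.min("date"), F.max("date")).first()

today = date.today()

print(f"Min: {min_date}")
print(f"Max: {max_date}")
print(f"Today: {today}")

print("✅ Max date is not in the future" if str(max_date) <= str(today) else "❌ Max date is in the future")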

# Issues Found

### 1. Nulls
- ⚠️ `nationality` — null values present. Will attempt to fill from other datasets. If not found, will drop.
- ✅ `cause_of_death` — nulls acceptable, not required for analysis.

### 2. Duplicates
- ✅ No duplicates.

### 3. Distinct
- ⚠️ `nationality` — contains unexpected/inconsistent country values. Needs standardisation in Silver.
- ⚠️ `cause_of_death` — various redundant categories. Will consolidate into broader categories in Silver.
- ✅ `mountain` — looks clean.

### 4. Format
- ⚠️ `date` — stored as string. Will cast to DateType in Silver.

### 5. Timeliness
- ✅ Max date is not in the future.