# Prerequisite

- Go to your target catalog and schema.
- Click on Volumes.
- Click on the Create drop-down.
- Click on "Create volume".


![Screenshot 1.png](./Screenshot 1.png "Screenshot 1.png")

Give it a unique name, for example "your_name_directory":

![Screenshot 2025-03-11 at 11.52.39 AM.png](./Screenshot 2025-03-11 at 11.52.39 AM.png "Screenshot 2025-03-11 at 11.52.39 AM.png")

You should now see your directory under Volumes:

![Screenshot 2025-03-11 at 11.55.28 AM.png](./Screenshot 2025-03-11 at 11.55.28 AM.png "Screenshot 2025-03-11 at 11.55.28 AM.png")  
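
The same volume can also be created from a notebook cell instead of the UI. This is a minimal sketch, assuming the `CREATE VOLUME IF NOT EXISTS` SQL statement is available in your workspace and that `spark` is the session Databricks provides in notebooks. The helper names `volume_sql` and `volume_path` and the catalog, schema, and volume names are hypothetical placeholders.

```python
def volume_sql(catalog: str, schema: str, volume: str) -> str:
    """Build the SQL statement that creates a managed volume if it is missing."""
    return f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}"


def volume_path(catalog: str, schema: str, volume: str) -> str:
    """Return the /Volumes path used to read and write files in the volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}"


# Placeholder names: replace with your own catalog, schema, and volume.
catalog, schema, volume = "my_catalog", "my_schema", "your_name_directory"

statement = volume_sql(catalog, schema, volume)
volume_dir = volume_path(catalog, schema, volume)

# `spark` only exists inside a Databricks notebook, so guard the call.
if "spark" in globals():
    spark.sql(statement)

print(statement)
print(volume_dir)
```

The path returned by `volume_path` has the same shape as the `directory` value used in the import cells.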


# Import data
You can do this section manually:
- Download the zip file from https://www150.statcan.gc.ca/n1/en/pub/13-26-0001/2020001/ODHF_v1.1.zip
- Unzip it on your local computer.
- Upload the CSV file to the directory you created.
- Go to the CSV file in Volumes, create a table, and specify the catalog, schema, and table name.

Alternatively, use Python to download and unzip the file:

In [0]:
directory = '/Volumes/razi_demo/vch_workshop/razi_bayati_directory'

In [0]:
import requests
import os

url = "https://www150.statcan.gc.ca/n1/en/pub/13-26-0001/2020001/ODHF_v1.1.zip"
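response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early if the download did not succeed

# Save the archive to the volume directory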
file_path = os.path.join(directory, "ODHF_v1.1.zip")
with open(file_path, "wb") as file:
    file.write(response.content)

In [0]:
import zipfile

with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall(directory)
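
Before or after extracting, it can help to look inside the archive and confirm where the CSV sits, since the next cell builds its path by hand. This is a minimal sketch using only the standard library. `csv_members` is a hypothetical helper name, and the demo builds a tiny throwaway archive so the block runs outside Databricks as well.

```python
import os
import tempfile
import zipfile


def csv_members(zip_path: str) -> list[str]:
    """Return the names of all .csv entries inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return [name for name in zip_ref.namelist() if name.lower().endswith(".csv")]


# Demo on a throwaway archive; in the notebook, pass `file_path` instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zip_ref:
        zip_ref.writestr("demo/data.csv", "a,b\n1,2\n")
        zip_ref.writestr("demo/readme.txt", "notes")
    found = csv_members(demo_zip)
    print(found)
```

In the notebook, `csv_members(file_path)` would list the CSV entries of the downloaded archive, which you can compare against the path used when reading the file into Spark.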

In [0]:
# Read the CSV file into a Spark DataFrame, create a temporary view, and display the DataFrame
csv_file_path = os.path.join(directory, "ODHF_v1.1", "odhf_v1.1.csv")
raw_df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
raw_df.createOrReplaceTempView("odhf_v1_1")
display(raw_df)
# Explore the data profile using the + next to the table icon in the cell output

Databricks data profile. Run in Databricks to view.

In [0]:
raw_df.write.mode("overwrite").saveAsTable("razi_demo.vch_workshop.raw_odhf_using_python")

You can now go back to Unity Catalog (UC) and explore the data there.

# Data enrichment

Using the data profile above, we found that 6.88% of rows are missing latitude and longitude. Let's drop them.

In [0]:
df_cleaned = raw_df.dropna(subset=["latitude", "longitude"])
display(df_cleaned)
# Use the data profile to confirm that latitude and longitude no longer have missing values

Databricks data profile. Run in Databricks to view.
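
The share of rows with missing coordinates, as reported by the data profile, can also be cross-checked in code. This is a minimal sketch using only the standard library, assuming the `latitude` and `longitude` column names used above. It treats any value that does not parse as a finite number within the valid range as missing. `is_valid_coordinate` and `missing_coordinate_share` are hypothetical helpers, and the sample rows are made up for illustration.

```python
import csv
import io
import math


def is_valid_coordinate(value, low, high):
    """True when value parses as a finite number within [low, high]."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return math.isfinite(number) and low <= number <= high


def missing_coordinate_share(rows):
    """Fraction of rows lacking a usable latitude or longitude."""
    total = 0
    missing = 0
    for row in rows:
        total += 1
        lat_ok = is_valid_coordinate(row.get("latitude"), -90.0, 90.0)
        lon_ok = is_valid_coordinate(row.get("longitude"), -180.0, 180.0)
        if not (lat_ok and lon_ok):
            missing += 1
    return missing / total if total else 0.0


# Made-up sample rows; in the notebook, read from `csv_file_path` instead.
sample = io.StringIO(
    "facility_name,latitude,longitude\n"
    "A,49.26,-123.11\n"
    "B,,\n"
    "C,45.50,-73.57\n"
    "D,43.65,\n"
)
share = missing_coordinate_share(csv.DictReader(sample))
print(f"{share:.2%}")
```

Run against the real file, the result should be close to the percentage shown in the data profile. A large gap would suggest that some coordinates are present but not numeric or out of range.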

In [0]:
df_cleaned.write.mode("overwrite").saveAsTable("razi_demo.vch_workshop.cleaned_odhf_using_python")

Go to UC and look at the Lineage tab for the new table:

![Screenshot 2025-03-11 at 12.30.48 PM.png](./Screenshot 2025-03-11 at 12.30.48 PM.png "Screenshot 2025-03-11 at 12.30.48 PM.png")

# Create visualization

Visualize the facilities on a map.

In [0]:
%pip install folium

import folium
from folium.plugins import MarkerCluster



In [0]:
# Convert Spark DataFrame to Pandas DataFrame
df_pandas = df_cleaned.toPandas()

# Create a map centered around the average latitude and longitude
map_center = [df_pandas['latitude'].mean(), df_pandas['longitude'].mean()]
facility_map = folium.Map(location=map_center, zoom_start=10)

# Add a marker cluster to the map
marker_cluster = MarkerCluster().add_to(facility_map)

# Add markers to the cluster
for idx, row in df_pandas.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=row['facility_name']
    ).add_to(marker_cluster)

# Display the map
display(facility_map)