To Dos
- [ x ] Write overview
- [ x ] Write before you begin section
- [ x ] Write summary
- [ x ] Write next steps 
- [  ] Create guide for adding data 
- [ x ] Create guide for setting up compute 
- [ x ] Add link for delta lake info
- [  ] Add docs for data table in catalog. 

# Data Ingestion, Cleaning and Exploration with Delta Lake 

Databricks provides a notebook interface compatible with Python, SQL, PySpark, Scala, R, and more. In this notebook, we discuss some basic data operations that can be performed on a table using PySpark. We also introduce Delta Lake and Delta tables. 

All code and descriptions below are written by Zoya Shafique, unless otherwise noted.

## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Overview
---

In this tutorial, you'll learn how to use Databricks' Delta Lake and PySpark functionality to handle data. This tutorial is intended for users with some experience with data handling, Python, and machine learning. 

By the end of this tutorial, you'll be able to:

* Create and manage Delta Tables
* Use PySpark for understanding and cleaning data

Note that the goal of this tutorial is not to provide a walkthrough of data cleaning but rather to show how Databricks' lakehouse storage layer, Delta Lake, can be used in combination with PySpark for data handling.

## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Before you begin
---
Before you start the tutorial, you should:

* Download the dataset from <a href="https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho" target="_blank">this link</a>.
* Create a compute resource. For information on how to initialize your compute, please <a href="https://github.com/zoyashaf/DataLakehouses101/blob/21011a4ffd7e4eb7f045f393720a428b940a3b3b/docs/create_compute.pdf" target="_blank">check here</a>.
 


### <img src='https://www.svgrepo.com/show/122877/presentation.svg' style="height: 65px; margin: 5px; padding: 5px"/> Loading Data

This section covers basics of loading data in a notebook in Databricks. Users familiar with Pandas will see many similarities between the PySpark interface and Pandas operations.

In [None]:
'''
To begin, we must first initialize our Spark session, which allows us to use the DataFrame API to handle our data. 
Below, we create our Spark session instance. We could use the SparkSession.builder object to configure the session, 
but for the sake of this tutorial, we use the default settings. 
'''
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 

<img src='https://www.svgrepo.com/show/530436/help.svg' style="height: 50px; margin: 0px; padding: 0px"/> Please refer to <a href="https://github.com/zoyashaf/DataLakehouses101/blob/5e5423427db745aa64ea52e3efe7fcd1fb04a288/figures/catalog.png" target="_blank">this image</a> to see how to import data into your Databricks account and notebook.



In [None]:
# Reading in data from our file storage 
## .option("header", "true") tells Spark to use the first row of the csv file as the column names 
car_data = df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/zshafiq001@citymail.cuny.edu/car_details_v4_edited-1.csv")
display(car_data.limit(5))
# NOTE: You may also use car_data.show(). However, display() renders the data as an easy-to-read table, whereas show() prints plain text. 

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990,1680,1505,5,35
Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995,1695,1555,5,42
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585,1595,1550,5,35
Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995,1745,1510,5,37
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735,1830,1795,7,55


#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 1: 


Click on the '+' sign next to 'Table' in the cell above. You will see options for visualization and data profile. 
  * First generate a data profile. Do you notice anything weird about the data? 
  * Next, generate a bar graph using the visualization option. Use 'make' as the x-axis and 'price' as the y-axis. Does there seem to be a discernible pattern between makes of cars and their prices?




!!!
< Your answer here > 
!!!

We can also use PySpark commands to learn more about our data*: 
* **`describe()`**:  displays count, mean, stddev, min, and max. 
* **`summary()`**:  displays the 25th, 50th, and 75th percentiles (from which the interquartile range, IQR, can be computed) in addition to the statistics from `describe()`.
* **`printSchema()`**: prints the table schema in a tree format, with each column name followed by its data type and a nullability indicator that shows whether the column allows nulls.


*Note: adapted from Databricks Academy ML 01- Data Cleansing tutorial 


In [None]:
display(car_data.describe())

summary,make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity
count,2059,2059,2059.0,2059.0,2049.0,2059,2059,2059,2059,2059,2059,1979,1979,1979,1923,1994.0,1994.0,1994.0,1990.0,1942.0
mean,,,1702991.6964545895,2016.4254492472076,54247.44899951196,,,,,,,,,,,4280.7557673019055,1768.0035105315949,1591.7888665997991,5.306532663316583,51.99603501544799
stddev,,,2419880.6354341814,3.363563584951663,57478.95417075647,,,,,,,,,,,442.5446885163208,135.2987754673467,136.08707867338376,0.8217184764800414,15.1060929440344
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],100000.0,1988.0,0.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,1047 cc,100 bhp @ 3600 rpm,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,100.0
max,Volvo,i20 Sportz 1.4 CRDI,999000.0,2022.0,99000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,999 cc,99 bhp @ 5000 rpm,99@2800,RWD,5569.0,2220.0,1995.0,8.0,95.0


In [None]:
display(car_data.summary())

summary,make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity
count,2059,2059,2059.0,2059.0,2049.0,2059,2059,2059,2059,2059,2059,1979,1979,1979,1923,1994.0,1994.0,1994.0,1990.0,1942.0
mean,,,1702991.6964545895,2016.4254492472076,54247.44899951196,,,,,,,,,,,4280.7557673019055,1768.0035105315949,1591.7888665997991,5.306532663316583,51.99603501544799
stddev,,,2419880.6354341814,3.363563584951663,57478.95417075647,,,,,,,,,,,442.5446885163208,135.2987754673467,136.08707867338376,0.8217184764800414,15.1060929440344
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],100000.0,1988.0,0.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,1047 cc,100 bhp @ 3600 rpm,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,100.0
25%,,,484999.0,2014.0,29000.0,,,,,,,,,,,3985.0,1695.0,1485.0,5.0,41.0
50%,,,825000.0,2017.0,50000.0,,,,,,,,,,,4370.0,1770.0,1545.0,5.0,50.0
75%,,,1925000.0,2019.0,72000.0,,,,,,,,,,,4629.0,1832.0,1675.0,5.0,60.0
max,Volvo,i20 Sportz 1.4 CRDI,999000.0,2022.0,99000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,999 cc,99 bhp @ 5000 rpm,99@2800,RWD,5569.0,2220.0,1995.0,8.0,95.0


In [None]:
car_data.printSchema()
# Since printSchema() is meant to print, we don't need to add a display wrapper

root
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- price: string (nullable = true)
 |-- year: string (nullable = true)
 |-- kilometer: string (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- location: string (nullable = true)
 |-- color: string (nullable = true)
 |-- owner: string (nullable = true)
 |-- seller_type: string (nullable = true)
 |-- engine: string (nullable = true)
 |-- max_power: string (nullable = true)
 |-- max_torque: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- length: string (nullable = true)
 |-- width: string (nullable = true)
 |-- height: string (nullable = true)
 |-- seating_capacity: string (nullable = true)
 |-- fuel_tank_capacity: string (nullable = true)



#### <img src='https://upload.wikimedia.org/wikipedia/commons/6/68/Exclamation_Point.svg' style="height: 45px; margin: 5px; padding: 5px"/> Concept Review

* <b>Data Profile</b> allows users to quickly and easily gain an understanding of their data. The tool provides a complete overview of the dataset's characteristics, statistics, and more. Furthermore, the data profile along with any visualizations can easily be added to a dashboard to quickly create effective summaries of the data. With these features, users can easily understand the basic structure of their data, explore the data distribution, identify missing values and more. 

### <img src='https://www.svgrepo.com/show/229520/lake.svg' style="height: 80px; margin: 5px; padding: 5px"/> Delta Lake Tables 

Currently, our data exists as a DataFrame that we built from our .csv file. To take full advantage of Databricks' software, we can convert our DataFrame into a Delta Lake table. Delta Lake is an open-source storage layer used by Databricks. It adds organization and ACID transaction support to traditional data lake storage, such as a distributed file system. As such, saving our data as a Delta table can provide extra functionality for efficient processing. 

More information about Delta Lake can be found <a href="https://docs.databricks.com/en/delta/index.html" target="_blank">here. </a> 



In [None]:
# Write the data to a table.
table_name = "car_data_v1"
car_data.write.saveAsTable(table_name)
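
Re-running the cell above raises an error once the table exists, because the writer's default save mode refuses to replace an existing table. The sketch below shows one way to make the write repeatable. The helper name `save_as_delta` is hypothetical and not part of the original notebook, and the explicit `format("delta")` call is an assumption for clarity (on recent Databricks runtimes Delta is typically already the default format).

```python
def save_as_delta(df, table_name, mode="overwrite"):
    """Write a Spark DataFrame as a managed Delta table (hypothetical helper).

    mode="overwrite" replaces an existing table with the same name, so the
    cell can be re-run. The writer's default mode raises an error instead.
    """
    (df.write
       .format("delta")      # stated explicitly; assumed to match the platform default
       .mode(mode)           # "overwrite", "append", "ignore", or "error"
       .saveAsTable(table_name))


# In the notebook this would be called as:
# save_as_delta(car_data, "car_data_v1")
```

Note that overwriting replaces the table's current contents, so it suits a tutorial setting better than a shared production table.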

In [None]:
## We can use a spark.sql query to get a quick overview of the table we created in the previous cell. The following command shows us the metadata associated with our table. 
display(spark.sql('DESCRIBE DETAIL car_data_v1'))


format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,73e2bb10-33af-4f1f-99f5-9c9f8ce87ab8,spark_catalog.default.car_data_v1,,dbfs:/user/hive/warehouse/car_data_v1,2024-05-12T22:03:28.930+0000,2024-05-12T22:03:36.000+0000,List(),1,69682,Map(),1,2,"List(appendOnly, invariants)",Map()


To view the created table, navigate to the catalog tab in the sidebar. You will see "car_data_v1" listed underneath tables in the "Database Tables" tab. You can find more information about how to view your table in the catalog here.

One of the main advantages of using a Delta table as opposed to a DataFrame is that the Delta table keeps a historical record of your data. This helps with data versioning and reproducibility. Furthermore, a historical record can help you monitor your data quality and keep track of changes. Best of all, Delta tables can be read and written through the Spark DataFrame API, so we can treat our Delta table just like a normal table if we choose to. 

Another advantage of storing our DataFrame as a Delta table is that it allows us to store and manage metadata along with our actual data. 

More information about the functionality of Delta tables can be found <a href="https://docs.databricks.com/en/delta/tutorial.html#create-a-table" target="_blank">here. </a> 
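
To make the idea of a historical record concrete, here is a minimal sketch of the two SQL statements Delta Lake provides for it: `DESCRIBE HISTORY` lists the versions of a table, and `VERSION AS OF` reads an older version. The helper functions `history_query` and `version_query` are hypothetical names introduced only for illustration. They just build the SQL text, which you would then pass to `spark.sql` inside a Databricks notebook.

```python
def history_query(table_name):
    """Build the SQL that lists a Delta table's version history."""
    return f"DESCRIBE HISTORY {table_name}"


def version_query(table_name, version):
    """Build the SQL that reads a Delta table as of a given version number."""
    return f"SELECT * FROM {table_name} VERSION AS OF {int(version)}"


print(history_query("car_data_v1"))
print(version_query("car_data_v1", 0))

# In a Databricks notebook, where `spark` and `display` are predefined:
# display(spark.sql(history_query("car_data_v1")))
# display(spark.sql(version_query("car_data_v1", 0)))
```

Version 0 is the state of the table right after it was first written. Each later write adds a new version to the history.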






### <img src = 'https://www.svgrepo.com/show/503651/vacuum-cleaner.svg' style="height: 80px; margin: 5px; padding: 5px"/> Data Cleaning 
---
In this section, we will explore how we can use Delta tables and PySpark to analyze and clean data.



In [None]:
# First let's load our Delta table 
## Note: We can load our table using the table name we specified earlier or by using the direct path to the table. 
car_table = spark.read.table(table_name)
display(car_table)


make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150.0,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990.0,1680.0,1505.0,5.0,35.0
Maruti Suzuki,Swift DZire VDI,450000,2014,75000.0,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995.0,1695.0,1555.0,5.0,42.0
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000.0,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585.0,1595.0,1550.0,5.0,35.0
Toyota,Glanza G,799000,2019,37500.0,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995.0,1745.0,1510.0,5.0,37.0
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000.0,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735.0,1830.0,1795.0,7.0,55.0
Maruti Suzuki,Ciaz ZXi,675000,2017,73315.0,Petrol,Manual,Pune,Grey,First,Individual,1373 cc,91 bhp @ 6000 rpm,130 Nm @ 4000 rpm,FWD,4490.0,1730.0,,5.0,43.0
Mercedes-Benz,CLA 200 Petrol Sport,1898999,2015,47000.0,Petrol,Automatic,Mumbai,White,Second,Individual,1991 cc,181 bhp @ 5500 rpm,300 Nm @ 1200 rpm,FWD,4630.0,1777.0,1432.0,,
BMW,X1 xDrive20d M Sport,2650000,2017,75000.0,Diesel,Automatic,Coimbatore,White,Second,Individual,1995 cc,188 bhp @ 4000 rpm,400 Nm @ 1750 rpm,AWD,4439.0,1821.0,1612.0,5.0,51.0
Skoda,Octavia 1.8 TSI Style Plus AT [2017],1390000,2017,56000.0,Petrol,Automatic,Mumbai,White,First,Individual,1798 cc,177 bhp @ 5100 rpm,250 Nm @ 1250 rpm,FWD,4670.0,1814.0,1476.0,5.0,50.0
Nissan,Terrano XL (D),575000,2015,85000.0,Diesel,Manual,Mumbai,White,First,Individual,1461 cc,84 bhp @ 3750 rpm,200 Nm @ 1900 rpm,FWD,4331.0,1822.0,1671.0,5.0,50.0


#### <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/> Looking at Datatypes

As we saw in our data profile and the describe() output above, many of the numerical columns were read as strings. We need to fix this before we can use our dataset.
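
String-typed numbers also explain why the `min` and `max` rows of the earlier statistics look wrong: strings are compared character by character, not by numeric value. The plain-Python sketch below (not Spark code) illustrates the effect with four price values that appear in this notebook's statistics.

```python
# Four price values that appear in this notebook's statistics, stored as text
prices_as_strings = ["49000", "100000", "999000", "35000000"]

# Comparing strings is lexicographic: "1..." sorts before "4...", "9..." sorts last
string_min = min(prices_as_strings)
string_max = max(prices_as_strings)

# Comparing numbers gives the expected result
prices_as_numbers = [float(p) for p in prices_as_strings]
numeric_min = min(prices_as_numbers)
numeric_max = max(prices_as_numbers)

print(string_min, string_max)    # 100000 999000
print(numeric_min, numeric_max)  # 49000.0 35000000.0
```

The string results match the misleading `min` and `max` reported for `price` before casting, while the numeric results match the statistics computed after the columns are converted.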

In [None]:
from pyspark.sql.functions import col, translate

fixed_price_df = car_table.withColumn("price", translate(col("price"), " ", "").cast("double"))

## Let's confirm that the change worked as expected 
fixed_price_df.printSchema()

root
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- price: double (nullable = true)
 |-- year: string (nullable = true)
 |-- kilometer: string (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- location: string (nullable = true)
 |-- color: string (nullable = true)
 |-- owner: string (nullable = true)
 |-- seller_type: string (nullable = true)
 |-- engine: string (nullable = true)
 |-- max_power: string (nullable = true)
 |-- max_torque: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- length: string (nullable = true)
 |-- width: string (nullable = true)
 |-- height: string (nullable = true)
 |-- seating_capacity: string (nullable = true)
 |-- fuel_tank_capacity: string (nullable = true)



#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 2: 
Which other columns should be numerical but were read as strings? Convert these to the correct data type. 
  * Hint: Some columns contain characters alongside numerical values. For these columns, use `translate(col("column_name"), "characters", "")` to remove them before casting the column to a numerical data type.
  * Consider: Can we simply apply `col` and `translate` to the `max_power` and `max_torque` columns?

In [None]:
'''
!!! Your answer here !!!
'''

Out[24]: '\n!!! Your answer here !!!\n'

In [None]:
## NOTE: This is just one solution. 
from pyspark.sql.functions import split #function used for splitting strings 

## Separating the max_torque and max_power columns into their parts 
### First, we split the strings at '@'. Then we assign each part of the split string to a separate column. Finally, we drop the original column from the dataframe. 

fixed_dtype_df = fixed_price_df.withColumn("max_torque_Nm", split("max_torque", "@")[0]) \
                   .withColumn("max_torque_rpm", split("max_torque", "@")[1])
## NOTE: "Max Torque" does not match the column name max_torque, so this drop() is a no-op and
## the original max_torque column is still present in the schema below. Use drop("max_torque") to remove it.
fixed_dtype_df = fixed_dtype_df.drop("Max Torque")

fixed_dtype_df = fixed_dtype_df.withColumn("max_power_bhp", split("max_power", "@")[0]) \
                   .withColumn("max_power_rpm", split("max_power", "@")[1])
fixed_dtype_df = fixed_dtype_df.drop("max_power")

## After splitting the columns, we need to remove the units from the rows 
columns_with_strings = [['max_power_bhp', ' bhp', 'double'],
                  ['max_power_rpm', ' rpm', 'double'],  ['max_torque_rpm', ' rpm', 'double'], ['max_torque_Nm', ' Nm', 'double']]

for column, string, dtype in columns_with_strings:
  fixed_dtype_df = fixed_dtype_df.withColumn(column, translate(col(column), string, '').cast(dtype))

## translate() deletes each listed character (here, spaces and 'c'), which strips the ' cc' unit. We then cast the column to the correct data type. 
fixed_dtype_df = fixed_dtype_df.withColumn("engine_cc", translate(col("engine"), " cc", "").cast("double")) 
fixed_dtype_df = fixed_dtype_df.drop("engine")

## Converting string columns to int/double types
columns_to_fix = [['year', 'int'], ['kilometer', 'double'], ['length', 'double'],
                  ['width', 'double'], ['height', 'double'], ['seating_capacity', 'int'],
                  ['fuel_tank_capacity', 'double']] 

for column, dtype in columns_to_fix:
  fixed_dtype_df = fixed_dtype_df.withColumn(column, translate(col(column), ' ', '').cast(dtype))

fixed_dtype_df.printSchema()


root
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- price: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- kilometer: double (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- location: string (nullable = true)
 |-- color: string (nullable = true)
 |-- owner: string (nullable = true)
 |-- seller_type: string (nullable = true)
 |-- max_torque: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- length: double (nullable = true)
 |-- width: double (nullable = true)
 |-- height: double (nullable = true)
 |-- seating_capacity: integer (nullable = true)
 |-- fuel_tank_capacity: double (nullable = true)
 |-- max_torque_Nm: double (nullable = true)
 |-- max_torque_rpm: double (nullable = true)
 |-- max_power_bhp: double (nullable = true)
 |-- max_power_rpm: double (nullable = true)
 |-- engine_cc: double (nullable = true)
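
The solution above relies on two behaviours that are easy to misread: `translate()` with an empty replacement deletes each listed character individually (it does not remove a substring), and splitting at `@` yields the value and the rpm as separate parts. The plain-Python sketch below mimics that logic outside Spark. The helper names `strip_chars` and `parse_value_at_rpm` are hypothetical, and the input `"100 bhp"` is an invented example of a value with no rpm part.

```python
def strip_chars(value, chars):
    """Delete every character in `chars` from `value`, one character at a time."""
    return value.translate(str.maketrans("", "", chars))


def parse_value_at_rpm(raw, unit_chars):
    """Split '87 bhp @ 6000 rpm' into (value, rpm); rpm is None without an '@' part."""
    parts = raw.split("@")
    value = float(strip_chars(parts[0], unit_chars))
    rpm = float(strip_chars(parts[1], " rpm")) if len(parts) > 1 else None
    return value, rpm


print(strip_chars("1198 cc", " cc"))                       # 1198
print(strip_chars("acid 12 cc", " cc"))                    # aid12 (every 'c' and space is removed)
print(parse_value_at_rpm("87 bhp @ 6000 rpm", " bhp"))     # (87.0, 6000.0)
print(parse_value_at_rpm("99@2800", " Nm"))                # (99.0, 2800.0)
print(parse_value_at_rpm("100 bhp", " bhp"))               # (100.0, None)
```

Because characters are removed individually, this approach is only safe when the unit characters cannot also occur inside the value you want to keep.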



In [None]:
# Now that our data is in the correct format, let's recalculate the statistics:
display(fixed_dtype_df.describe())

summary,make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc
count,2059,2059,2059.0,2059.0,2049.0,2059,2059,2059,2059,2059,2059,1979,1923,1994.0,1994.0,1994.0,1990.0,1942.0,1979.0,1979.0,1979.0,1975.0,1979.0
mean,,,1702991.6964545895,2016.4254492472076,54247.44899951196,,,,,,,,,4280.7557673019055,1768.0035105315949,1591.7888665997991,5.306532663316583,51.99603501544799,245.8510194037392,2619.545224861041,129.61177362304196,4835.093670886076,1692.5755432036383
stddev,,,2419880.6354341814,3.363563584951663,57478.95417075647,,,,,,,,,442.5446885163208,135.2987754673467,136.08707867338376,0.8217184764800414,15.1060929440344,140.46573097140717,1206.3147698317562,65.07379732207389,1097.368547624979,643.7362940735347
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],49000.0,1988.0,0.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,15.0,48.0,150.0,35.0,2910.0,624.0
max,Volvo,i20 Sportz 1.4 CRDI,35000000.0,2022.0,2000000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,99@2800,RWD,5569.0,2220.0,1995.0,8.0,105.0,780.0,6500.0,660.0,8250.0,6592.0


####  <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/> Handling extreme and null values 

##### Looking into extreme values 
From the describe() output above, we can see some strange values, such as a minimum of 0 km in the kilometer column and a minimum fuel tank capacity of 15. A maximum price of 3.5e7 also seems extreme for a used car. Let's explore this more. 

In [None]:
# Let's take a look at the price column 
display(fixed_dtype_df
        .groupBy("price").count()
        .orderBy(col("price").desc(), col("count"))
       )

## NOTE: Adding a visualization to our table here can help us quickly understand the distribution of our data and the outliers 
### Some cars seem very expensive compared to the majority of the dataset

price,count
35000000.0,1
27500000.0,1
24000000.0,1
22000000.0,1
20000000.0,3
19300000.0,1
18500000.0,2
18000000.0,1
16200000.0,1
14900000.0,1


Databricks visualization. Run in Databricks to view.

In [None]:
display(fixed_dtype_df.filter(col("kilometer") == 0))
# MINI Coopers can be expensive. The model year is 2022 and the car is unregistered, so perhaps it is a new car for sale? 

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc
MINI,Cooper JCW Hatchback,5200000.0,2022,0.0,Petrol,Automatic,Ahmedabad,Yellow,UnRegistered Car,Individual,320 Nm @ 1450 rpm,FWD,3850.0,1727.0,1414.0,4,44.0,320.0,1450.0,228.0,5200.0,1998.0


In [None]:
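# Only keep rows with km greater than 0. Note that this comparison is not true for null values,
# so rows where kilometer is null are dropped here as well.
pos_km_df = fixed_dtype_df.filter(col("kilometer") > 0)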

In [None]:
# Now let's look at the kilometer values, sorted from largest to smallest
display(pos_km_df
        .groupBy("kilometer").count()
        .orderBy(col("kilometer").desc(), col("count"))
       )

kilometer,count
2000000.0,1
925000.0,1
440000.0,1
261236.0,1
240000.0,1
222000.0,1
219000.0,1
211000.0,1
195000.0,1
192326.0,1


#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 3: 


Click on the '+' sign next to 'Table' in the cell above. You will see options for visualization.
  * Create two histogram plots: one with a bin size of 10 and one with a bin size of 100. What do these visualizations tell you about the data?



!!!
< Your answer here > 
!!!

In [None]:
# Two used cars with 1 km (about 0.6 miles) seems strange. Let's look into these rows further
display(pos_km_df.filter(col("kilometer") == 1))
# These are both 2022 Audis. Considering the high price, perhaps these are also new cars for sale? 

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc
Audi,Q5 45 TFSI Premium Plus,5651000.0,2022,1.0,Petrol,Automatic,Delhi,Blue,UnRegistered Car,Individual,370 Nm @ 1600 rpm,AWD,4663.0,1898.0,1659.0,5,70.0,370.0,1600.0,248.0,5000.0,1984.0
Audi,A4 Premium Plus 40 TFSI,4151000.0,2022,1.0,Petrol,Automatic,Delhi,Black,UnRegistered Car,Individual,320 Nm @ 1450 rpm,FWD,4762.0,1847.0,1433.0,5,54.0,320.0,1450.0,188.0,4200.0,1984.0


##### Looking into null values 
We also have many columns with null values. How we approach these depends greatly on the domain and task at hand. 

For a moment, let's consider a different example: a survey dataset where respondents were asked about their income level, education level, and whether they own a car. In this dataset, the "car ownership" column contains null values for some respondents. Some key considerations for this dataset are listed below. 

  * <b> Missing Data: </b> Null values in the "car ownership" column could indicate missing data or non-response. This could be due to various reasons such as respondents choosing not to answer the question, data entry errors, or survey design issues. Understanding the missing data mechanism is crucial for assessing data quality and potential biases in the dataset.
  * <b> Imputation Strategy: </b> The presence of null values in the "car ownership" column could influence the choice of imputation strategy if the goal is to fill in missing values. For example, if null values are more prevalent among respondents with lower income levels, simply imputing the mean or median car ownership rate may not be appropriate as it could bias the analysis.
  * <b> Analyzing Patterns: </b> If null values in the "car ownership" column are associated with certain demographic characteristics such as age or location, it could indicate differences in car ownership rates among different groups of respondents. Understanding these patterns could provide valuable insights for targeted marketing strategies or policy interventions.

How we handle null values changes depending on the context. Some approaches are*:
* Drop all rows with null values 
* Replace them with mean/median/zero/etc. 
* Replace them with the mode 
* Create a new column to denote rows that have null values 

For the purposes of this tutorial, we will replace the missing values with a representative value such as the mean or median. This process is known as imputing. For this, we will look at Spark's <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Imputer.html?highlight=imputer#pyspark.ml.feature.Imputer" target="_blank"><b> Imputer </b></a> class. Note that it is important to include an extra column denoting that a field has been imputed if the operation is performed*.


*Note: adapted from Databricks Academy ML 01- Data Cleansing tutorial 
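
As a small illustration of imputing with an indicator column, the plain-Python sketch below fills one missing value with the median and records which entries were missing. The ten `fuel_tank_capacity` values are those of the first ten rows of the table shown earlier. The variable names are illustrative, and this is not the Spark implementation.

```python
from statistics import median

# fuel_tank_capacity for the first ten rows shown earlier (None marks the missing value)
fuel_tank = [35.0, 42.0, 35.0, 37.0, 55.0, 43.0, None, 51.0, 50.0, 50.0]

# The fill value is computed from the observed (non-null) entries only
observed = [v for v in fuel_tank if v is not None]
fill_value = median(observed)

# Record which entries were missing before overwriting them
was_missing = [1.0 if v is None else 0.0 for v in fuel_tank]
imputed = [fill_value if v is None else v for v in fuel_tank]

print(fill_value)    # 43.0
print(imputed[6])    # 43.0
print(was_missing)   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Keeping the indicator alongside the filled column lets later analysis distinguish measured values from imputed ones.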

#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 4: 


Take a moment to look through the Imputer function's documentation (linked above).
  * Are there any requirements for the data type of the inputs? 
  * Can Imputer perform any type of imputation (e.g., numerical, categorical)? 


!!!
< Your answer here > 
!!!

In the following cells, we will prepare our data for imputing. 
  * First, we need to convert any integer columns into double 
  * We need to denote rows where null values are present so we can keep track of imputed values. 

In [None]:
'''
The code in this cell is taken from: Databricks Academy ML 01 Data Cleansing 
'''
from pyspark.sql.types import IntegerType

integer_columns = [x.name for x in pos_km_df.schema.fields if x.dataType == IntegerType()]
doubles_df = pos_km_df

for c in integer_columns:
    doubles_df = doubles_df.withColumn(c, col(c).cast("double"))

columns = "\n - ".join(integer_columns)
print(f"Columns converted from Integer to Double:\n - {columns}")

Columns converted from Integer to Double:
 - year
 - seating_capacity


In [None]:
# We need to denote which rows had null values before we impute our data. 
from pyspark.sql.functions import when

impute_cols = [
    "kilometer",
    "max_power_bhp", 
    "max_torque_Nm",
    "max_power_rpm", 
    "max_torque_rpm",
    "length",
    "width",
    "height",
    "seating_capacity",
    "fuel_tank_capacity",
    "engine_cc"
]

# We will put a 0 if a value is present in the given column for that row and a 1 if the value is null. 
for c in impute_cols:
    doubles_df = doubles_df.withColumn(c + "_na", when(col(c).isNull(), 1.0).otherwise(0.0))

In [None]:
display(doubles_df.limit(10))

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc,kilometer_na,max_power_bhp_na,max_torque_Nm_na,max_power_rpm_na,max_torque_rpm_na,length_na,width_na,height_na,seating_capacity_na,fuel_tank_capacity_na,engine_cc_na
Honda,Amaze 1.2 VX i-VTEC,505000.0,2017.0,87150.0,Petrol,Manual,Pune,Grey,First,Corporate,109 Nm @ 4500 rpm,FWD,3990.0,1680.0,1505.0,5.0,35.0,109.0,4500.0,87.0,6000.0,1198.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Maruti Suzuki,Swift DZire VDI,450000.0,2014.0,75000.0,Diesel,Manual,Ludhiana,White,Second,Individual,190 Nm @ 2000 rpm,FWD,3995.0,1695.0,1555.0,5.0,42.0,190.0,2000.0,74.0,4000.0,1248.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hyundai,i10 Magna 1.2 Kappa2,220000.0,2011.0,67000.0,Petrol,Manual,Lucknow,Maroon,First,Individual,112.7619 Nm @ 4000 rpm,FWD,3585.0,1595.0,1550.0,5.0,35.0,112.7619,4000.0,79.0,6000.0,1197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toyota,Glanza G,799000.0,2019.0,37500.0,Petrol,Manual,Mangalore,Red,First,Individual,113 Nm @ 4200 rpm,FWD,3995.0,1745.0,1510.0,5.0,37.0,113.0,4200.0,82.0,6000.0,1197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000.0,2018.0,69000.0,Diesel,Manual,Mumbai,Grey,First,Individual,343 Nm @ 1400 rpm,RWD,4735.0,1830.0,1795.0,7.0,55.0,343.0,1400.0,148.0,3400.0,2393.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Maruti Suzuki,Ciaz ZXi,675000.0,2017.0,73315.0,Petrol,Manual,Pune,Grey,First,Individual,130 Nm @ 4000 rpm,FWD,4490.0,1730.0,,5.0,43.0,130.0,4000.0,91.0,6000.0,1373.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Mercedes-Benz,CLA 200 Petrol Sport,1898999.0,2015.0,47000.0,Petrol,Automatic,Mumbai,White,Second,Individual,300 Nm @ 1200 rpm,FWD,4630.0,1777.0,1432.0,,,300.0,1200.0,181.0,5500.0,1991.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
BMW,X1 xDrive20d M Sport,2650000.0,2017.0,75000.0,Diesel,Automatic,Coimbatore,White,Second,Individual,400 Nm @ 1750 rpm,AWD,4439.0,1821.0,1612.0,5.0,51.0,400.0,1750.0,188.0,4000.0,1995.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Skoda,Octavia 1.8 TSI Style Plus AT [2017],1390000.0,2017.0,56000.0,Petrol,Automatic,Mumbai,White,First,Individual,250 Nm @ 1250 rpm,FWD,4670.0,1814.0,1476.0,5.0,50.0,250.0,1250.0,177.0,5100.0,1798.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Nissan,Terrano XL (D),575000.0,2015.0,85000.0,Diesel,Manual,Mumbai,White,First,Individual,200 Nm @ 1900 rpm,FWD,4331.0,1822.0,1671.0,5.0,50.0,200.0,1900.0,84.0,3750.0,1461.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We are now ready to impute our data! Users familiar with scikit-learn will recognize the syntax for applying the Imputer function. 
 * We first create an instance of the Imputer object, specifying the method that we want to use to impute our data. 
 * Next, we 'fit' the impute instance on our data.
 * Finally, we call Imputer's transform method to convert our existing dataframe with its null values into a dataframe with all the null values filled in. 

More speciically, Spark ML's APIs are standardized in much the same way as scikit-learn. This allows different methods to be packaged into one pipeline. More details on two of the key components of the Spark ML API are described below: 

**
* **Transformers**: Converts one DataFrame into another. Takes a DataFrame as input and returns an updated DataFrame, based on the function. Transformers do not learn any parameters from the data and simply apply rule-based transformations. It has a **`.transform()`** method. 

* **Estimator**: An algorithm which can be fit on a DataFrame to produce a Transformer. It has a **`.fit()`** method because it learns parameters from your DataFrame in order to transform it.   

**

It is important to note that any call to a `.fit()` method should only be applied to training data, so that nothing learned from held-out data leaks into the transformation. 

** Note: Descriptions taken from the Databricks Academy ML 01 - Data Cleansing tutorial. 
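To make the Estimator/Transformer split concrete, here is a minimal plain-Python sketch of the same contract. It is not Spark's implementation: the class names `MedianImputer` and `MedianImputerModel` and the toy rows are hypothetical and exist only for illustration. The point is that `fit()` learns the medians and returns a separate object whose `transform()` only applies them.

```python
from statistics import median


class MedianImputer:
    """Estimator: learns one median per column from the rows it is fit on."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, rows):
        # Learn parameters (the medians) from the non-null values only.
        medians = {
            col: median(r[col] for r in rows if r[col] is not None)
            for col in self.columns
        }
        return MedianImputerModel(medians)


class MedianImputerModel:
    """Transformer: applies already-learned medians and learns nothing new."""

    def __init__(self, medians):
        self.medians = medians

    def transform(self, rows):
        # Build new rows so the input rows are left untouched.
        return [
            {**r, **{c: m for c, m in self.medians.items() if r[c] is None}}
            for r in rows
        ]


# Hypothetical toy rows, not taken from the car dataset.
train_rows = [{"height": 1500.0}, {"height": None}, {"height": 1600.0}, {"height": 1700.0}]
test_rows = [{"height": None}]

model = MedianImputer(columns=["height"]).fit(train_rows)  # fit on training rows only
filled_train = model.transform(train_rows)
filled_test = model.transform(test_rows)  # reuses the median learned from training
```

Because the test rows are filled with the median learned from the training rows, nothing about the test data influences the imputation, which is the reason for fitting on training data only.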


In [None]:
from pyspark.ml.feature import Imputer

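# Estimator: configure median imputation, writing results back to the same columns
imputer = Imputer(strategy="median", inputCols=impute_cols, outputCols=impute_cols)

# fit() learns each column's median and returns a fitted model (a Transformer)
imputer_model = imputer.fit(doubles_df)
imputed_df = imputer_model.transform(doubles_df)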

In [None]:
## Let's display our imputed data 
display(imputed_df.limit(10))

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc,kilometer_na,max_power_bhp_na,max_torque_Nm_na,max_power_rpm_na,max_torque_rpm_na,length_na,width_na,height_na,seating_capacity_na,fuel_tank_capacity_na,engine_cc_na
Honda,Amaze 1.2 VX i-VTEC,505000.0,2017.0,87150.0,Petrol,Manual,Pune,Grey,First,Corporate,109 Nm @ 4500 rpm,FWD,3990.0,1680.0,1505.0,5.0,35.0,109.0,4500.0,87.0,6000.0,1198.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Maruti Suzuki,Swift DZire VDI,450000.0,2014.0,75000.0,Diesel,Manual,Ludhiana,White,Second,Individual,190 Nm @ 2000 rpm,FWD,3995.0,1695.0,1555.0,5.0,42.0,190.0,2000.0,74.0,4000.0,1248.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hyundai,i10 Magna 1.2 Kappa2,220000.0,2011.0,67000.0,Petrol,Manual,Lucknow,Maroon,First,Individual,112.7619 Nm @ 4000 rpm,FWD,3585.0,1595.0,1550.0,5.0,35.0,112.7619,4000.0,79.0,6000.0,1197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toyota,Glanza G,799000.0,2019.0,37500.0,Petrol,Manual,Mangalore,Red,First,Individual,113 Nm @ 4200 rpm,FWD,3995.0,1745.0,1510.0,5.0,37.0,113.0,4200.0,82.0,6000.0,1197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000.0,2018.0,69000.0,Diesel,Manual,Mumbai,Grey,First,Individual,343 Nm @ 1400 rpm,RWD,4735.0,1830.0,1795.0,7.0,55.0,343.0,1400.0,148.0,3400.0,2393.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Maruti Suzuki,Ciaz ZXi,675000.0,2017.0,73315.0,Petrol,Manual,Pune,Grey,First,Individual,130 Nm @ 4000 rpm,FWD,4490.0,1730.0,1545.0,5.0,43.0,130.0,4000.0,91.0,6000.0,1373.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Mercedes-Benz,CLA 200 Petrol Sport,1898999.0,2015.0,47000.0,Petrol,Automatic,Mumbai,White,Second,Individual,300 Nm @ 1200 rpm,FWD,4630.0,1777.0,1432.0,5.0,48.0,300.0,1200.0,181.0,5500.0,1991.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
BMW,X1 xDrive20d M Sport,2650000.0,2017.0,75000.0,Diesel,Automatic,Coimbatore,White,Second,Individual,400 Nm @ 1750 rpm,AWD,4439.0,1821.0,1612.0,5.0,51.0,400.0,1750.0,188.0,4000.0,1995.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Skoda,Octavia 1.8 TSI Style Plus AT [2017],1390000.0,2017.0,56000.0,Petrol,Automatic,Mumbai,White,First,Individual,250 Nm @ 1250 rpm,FWD,4670.0,1814.0,1476.0,5.0,50.0,250.0,1250.0,177.0,5100.0,1798.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Nissan,Terrano XL (D),575000.0,2015.0,85000.0,Diesel,Manual,Mumbai,White,First,Individual,200 Nm @ 1900 rpm,FWD,4331.0,1822.0,1671.0,5.0,50.0,200.0,1900.0,84.0,3750.0,1461.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
## Let's take another look at the summary statistics for the dataset. 
display(imputed_df.describe())

summary,make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,max_torque_Nm,max_torque_rpm,max_power_bhp,max_power_rpm,engine_cc,kilometer_na,max_power_bhp_na,max_torque_Nm_na,max_power_rpm_na,max_torque_rpm_na,length_na,width_na,height_na,seating_capacity_na,fuel_tank_capacity_na,engine_cc_na
count,2048,2048,2048.0,2048.0,2048.0,2048,2048,2048,2048,2048,2048,1968,1912,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0,2048.0
mean,,,1704247.0229492188,2016.4208984375,54273.93701171875,,,,,,,,,4283.826171875,1768.11328125,1590.505859375,5.29736328125,51.771142578124994,244.2941433105468,2590.76171875,129.122900390625,4808.0615234375,1685.70849609375,0.0,0.0390625,0.0390625,0.041015625,0.0390625,0.03173828125,0.03173828125,0.03173828125,0.03369140625,0.05712890625,0.0390625
stddev,,,2424270.300890236,3.367506557338118,57480.482321943775,,,,,,,,,436.4141185963212,133.28069215455344,134.31419815104138,0.8102620278922369,14.72639914296144,138.2089803446354,1190.121276393727,63.94120465816325,1082.6122861460306,633.0727574727189,0.0,0.1937910175313043,0.1937910175313043,0.1983747933140049,0.1937910175313043,0.1753453034347215,0.1753453034347215,0.1753453034347215,0.1804776988697658,0.2321454469375393,0.1937910175313043
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],49000.0,1988.0,1.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,15.0,48.0,150.0,35.0,2910.0,624.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,Volvo,i20 Sportz 1.4 CRDI,35000000.0,2022.0,2000000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,99@2800,RWD,5569.0,2220.0,1995.0,8.0,105.0,780.0,6500.0,660.0,8250.0,6592.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Consider: We could also have approached this problem by calculating statistics grouped by `make` and using those statistics to fill in null values for the relevant fields. What other ways could we have approached this problem? 
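As one possible reading of the group-based idea, the sketch below fills each null with the median of the row's group and falls back to the overall median when a group has no known values. It is written in plain Python on hypothetical toy rows to show the logic only; the helper name `fill_by_group_median` is an assumption, and this is not the approach used in the notebook.

```python
from collections import defaultdict
from statistics import median


def fill_by_group_median(rows, group_col, target_col):
    """Fill nulls in target_col with the median of the row's group,
    falling back to the overall median when a group has no known values."""
    known = [r[target_col] for r in rows if r[target_col] is not None]
    overall = median(known)

    # Collect the known values per group, then reduce each group to its median.
    by_group = defaultdict(list)
    for r in rows:
        if r[target_col] is not None:
            by_group[r[group_col]].append(r[target_col])
    group_medians = {g: median(vals) for g, vals in by_group.items()}

    filled = []
    for r in rows:
        if r[target_col] is None:
            # Copy the row rather than mutating the caller's data.
            r = {**r, target_col: group_medians.get(r[group_col], overall)}
        filled.append(r)
    return filled


# Hypothetical toy rows, not taken from the car dataset.
rows = [
    {"make": "A", "seating_capacity": 5.0},
    {"make": "A", "seating_capacity": None},
    {"make": "B", "seating_capacity": 7.0},
    {"make": "B", "seating_capacity": 7.0},
    {"make": "C", "seating_capacity": None},
]
filled = fill_by_group_median(rows, "make", "seating_capacity")
```

The fallback matters because some groups may have no observed value at all for a field, in which case a group statistic cannot be computed.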


Now that our data is cleaned, we can save our DataFrame to Delta Lake. As our table already exists as a Delta table, we can use the `overwrite` mode when saving our data to ensure that the existing table is replaced with our updated version. Saving our cleaned data to Delta Lake stores it reliably and efficiently, ready for subsequent analysis, including building machine learning models. We do not have to save it to our local machine and upload it again; we can access it directly from our catalog. 

In [None]:
## As we made changes to our schema (i.e., we changed the data type of some columns), we need to include the overwriteSchema option to write our updated table. 

imputed_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("default.car_data_v1")

If we go back to our catalog and view the history of our table, we can see that a new version has been added, recording the changes we made. This is a more compact and efficient way to manage data than saving a new file for every updated table, as it allows us to easily track changes. 

## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Summary
---
In this tutorial, you learned how to:

* Create Delta tables
* Use Delta tables with PySpark to clean data 
* Update your tables to maintain a consistent record of your data


## <img src = 'https://www.svgrepo.com/show/199671/next.svg' style="height: 50px; margin: 5px; padding: 5px"/> Next steps
---

Take a look into the docs for:
* A Look at the Delta Table Interface 

Or continue to the next tutorial: 
* Building a Machine Learning Model with Databricks ML Workflow 
