# Data Ingestion, Cleaning and Exploration with Databricks

Databricks provides a notebook interface compatible with Python, SQL, PySpark, Scala, R, and more. In this notebook, we discuss some basic data operations that can be performed on a table using PySpark.

## Overview
---

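In this tutorial, you'll learn how to load a CSV file into a Databricks notebook, explore it, and start cleaning it with PySpark. This tutorial is intended for readers who are new to Databricks and PySpark. It assumes you have basic knowledge of:

    Python
    Tabular data and CSV files
    Pandas-style DataFrame operations (helpful, not required)

By the end of this tutorial, you'll be able to:

    Create a Spark session and load a CSV file into a DataFrame
    Explore a DataFrame with the data profile, visualizations, `describe()`, `summary()`, and `printSchema()`
    Convert columns that were read as strings to numerical types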


## Before you begin
---
Before you start the tutorial, you should:

    Have access to a Databricks workspace where you can create and run notebooks.
    Upload the `car_details_v4.csv` dataset to your Databricks file storage (the Loading Data section links to an image showing how).


### Loading Data

This section covers the basics of loading data into a Databricks notebook. Users familiar with Pandas will see many similarities between the PySpark interface and Pandas operations.

In [0]:
'''
To begin, we initialize a Spark session, which lets us use the DataFrame API to handle our data.
Line 8 creates the Spark session instance. The SparkSession.builder object can be used to configure the session,
but for this tutorial we use the default settings.
'''
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate() 

Please refer to <a href="https://github.com/zoyashaf/DataLakehouses101/blob/5e5423427db745aa64ea52e3efe7fcd1fb04a288/figures/catalog.png" target="_blank">this image</a> to see how to import data into your Databricks account and notebook.



In [0]:
# Read the data from our file storage.
# .option("header", "true") tells Spark to treat the first row of the CSV file as column names.
car_data = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/zshafiq001@citymail.cuny.edu/car_details_v4.csv")

display(car_data)
# NOTE: You may also use car_data.show(). display() renders the data as an easy-to-read table, whereas show() prints plain text.

Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type,Engine,Max Power,Max Torque,Drivetrain,Length,Width,Height,Seating Capacity,Fuel Tank Capacity
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990.0,1680.0,1505.0,5.0,35.0
Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995.0,1695.0,1555.0,5.0,42.0
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585.0,1595.0,1550.0,5.0,35.0
Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995.0,1745.0,1510.0,5.0,37.0
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735.0,1830.0,1795.0,7.0,55.0
Maruti Suzuki,Ciaz ZXi,675000,2017,73315,Petrol,Manual,Pune,Grey,First,Individual,1373 cc,91 bhp @ 6000 rpm,130 Nm @ 4000 rpm,FWD,4490.0,1730.0,1485.0,5.0,43.0
Mercedes-Benz,CLA 200 Petrol Sport,1898999,2015,47000,Petrol,Automatic,Mumbai,White,Second,Individual,1991 cc,181 bhp @ 5500 rpm,300 Nm @ 1200 rpm,FWD,4630.0,1777.0,1432.0,5.0,
BMW,X1 xDrive20d M Sport,2650000,2017,75000,Diesel,Automatic,Coimbatore,White,Second,Individual,1995 cc,188 bhp @ 4000 rpm,400 Nm @ 1750 rpm,AWD,4439.0,1821.0,1612.0,5.0,51.0
Skoda,Octavia 1.8 TSI Style Plus AT [2017],1390000,2017,56000,Petrol,Automatic,Mumbai,White,First,Individual,1798 cc,177 bhp @ 5100 rpm,250 Nm @ 1250 rpm,FWD,4670.0,1814.0,1476.0,5.0,50.0
Nissan,Terrano XL (D),575000,2015,85000,Diesel,Manual,Mumbai,White,First,Individual,1461 cc,84 bhp @ 3750 rpm,200 Nm @ 1900 rpm,FWD,4331.0,1822.0,1671.0,5.0,50.0


#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 1: 


Click on the '+' sign next to 'Table' in the output of the cell above. You will see options for visualization and data profile. 
  * First, generate a data profile. Do you notice anything unusual about the data? 
  * Next, generate a bar graph using the visualization option. Use 'Make' as the x-axis and 'Price' as the y-axis. Does there seem to be a discernible pattern between the makes of cars and their prices?

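We can also use PySpark commands to learn more about our data: 
* **`describe()`**:  count, mean, stddev, min, max
* **`summary()`**:  everything in `describe()` plus the 25%, 50%, and 75% percentiles, from which the interquartile range (IQR) can be computed
* **`printSchema()`**:  the name, data type, and nullability of each column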


In [0]:
display(car_data.describe())

summary,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type,Engine,Max Power,Max Torque,Drivetrain,Length,Width,Height,Seating Capacity,Fuel Tank Capacity
count,2059,2059,2059.0,2059.0,2059.0,2059,2059,2059,2059,2059,2059,1979,1979,1979,1923,1995.0,1995.0,1995.0,1995.0,1946.0
mean,,,1702991.6964545895,2016.4254492472076,54224.71442447791,,,,,,,,,,,4280.860651629073,1767.9919799498746,1591.7353383458646,5.306265664160401,52.00220966084275
stddev,,,2419880.6354341814,3.363563584951663,57361.72131433033,,,,,,,,,,,442.45850677915513,135.26582519704775,136.07395597176094,0.8221701349425025,15.110197794109098
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],100000.0,1988.0,0.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,1047 cc,100 bhp @ 3600 rpm,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,100.0
max,Volvo,i20 Sportz 1.4 CRDI,999000.0,2022.0,99000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,999 cc,99 bhp @ 5000 rpm,99@2800,RWD,5569.0,2220.0,1995.0,8.0,95.0
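
One detail in the output above is worth a closer look: the reported maximum Price (999000.0) is lower than the mean (about 1702991). That is the pattern you would expect if the values were compared as text rather than as numbers, which is consistent with the columns having been read as strings. Here is a minimal plain-Python sketch of the effect, using a few prices that appear in the tables above:

```python
# Prices as strings, the way a CSV reader delivers them without type information
prices_as_text = ["505000", "1950000", "2650000", "999000", "100000"]

# String comparison goes character by character, so "999000" beats "2650000"
text_max = max(prices_as_text)
text_min = min(prices_as_text)

# Converting to numbers first gives the ordering we actually want
numeric_max = max(int(p) for p in prices_as_text)
numeric_min = min(int(p) for p in prices_as_text)

print(text_max, text_min)        # 999000 100000
print(numeric_max, numeric_min)  # 2650000 100000
```

The same reasoning would explain why the min and max rows for other numerical columns look odd until the columns are cast to numerical types.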


In [0]:
display(car_data.summary())

summary,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type,Engine,Max Power,Max Torque,Drivetrain,Length,Width,Height,Seating Capacity,Fuel Tank Capacity
count,2059,2059,2059.0,2059.0,2059.0,2059,2059,2059,2059,2059,2059,1979,1979,1979,1923,1995.0,1995.0,1995.0,1995.0,1946.0
mean,,,1702991.6964545895,2016.4254492472076,54224.71442447791,,,,,,,,,,,4280.860651629073,1767.9919799498746,1591.7353383458646,5.306265664160401,52.00220966084275
stddev,,,2419880.6354341814,3.363563584951663,57361.72131433033,,,,,,,,,,,442.45850677915513,135.26582519704775,136.07395597176094,0.8221701349425025,15.110197794109098
min,Audi,2 Series Gran Coupe 220d M Sport [2020-2021],100000.0,1988.0,0.0,CNG,Automatic,Agra,Beige,4 or More,Commercial Registration,1047 cc,100 bhp @ 3600 rpm,101 Nm @ 3000 rpm,AWD,3099.0,1475.0,1165.0,2.0,100.0
25%,,,484999.0,2014.0,29000.0,,,,,,,,,,,3985.0,1695.0,1485.0,5.0,41.0
50%,,,825000.0,2017.0,50000.0,,,,,,,,,,,4370.0,1770.0,1545.0,5.0,50.0
75%,,,1925000.0,2019.0,72000.0,,,,,,,,,,,4629.0,1832.0,1675.0,5.0,60.0
max,Volvo,i20 Sportz 1.4 CRDI,999000.0,2022.0,99000.0,Petrol + LPG,Manual,Zirakpur,Yellow,UnRegistered Car,Individual,999 cc,99 bhp @ 5000 rpm,99@2800,RWD,5569.0,2220.0,1995.0,8.0,95.0


In [0]:
car_data.printSchema()
# Since printSchema() is meant to print, we don't need to add a display wrapper

root
 |-- Make: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Kilometer: string (nullable = true)
 |-- Fuel Type: string (nullable = true)
 |-- Transmission: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Owner: string (nullable = true)
 |-- Seller Type: string (nullable = true)
 |-- Engine: string (nullable = true)
 |-- Max Power: string (nullable = true)
 |-- Max Torque: string (nullable = true)
 |-- Drivetrain: string (nullable = true)
 |-- Length: string (nullable = true)
 |-- Width: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Seating Capacity: string (nullable = true)
 |-- Fuel Tank Capacity: string (nullable = true)



### <img src='https://upload.wikimedia.org/wikipedia/commons/6/68/Exclamation_Point.svg' style="height: 65px; margin: 5px; padding: 5px"/> Concept Review

* The data profile gives a quick overview of a dataset: its basic structure, summary statistics, value distributions, and missing values. The data profile, along with any visualizations, can also be added to a dashboard to create an effective summary of the data.

### Data Cleaning 
---
In this section, we will explore the tools PySpark has for analyzing and cleaning data.

As the data profile and the `printSchema()` output above show, every column was read as a string, including the numerical ones. We need to fix this before we can use our dataset.

In [0]:
from pyspark.sql.functions import col, translate

# Remove any spaces from the Price strings, then cast the column to a numerical type
fixed_price_df = car_data.withColumn("Price", translate(col("Price"), " ", "").cast("double"))

display(fixed_price_df)

Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type,Engine,Max Power,Max Torque,Drivetrain,Length,Width,Height,Seating Capacity,Fuel Tank Capacity
Honda,Amaze 1.2 VX i-VTEC,505000.0,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990.0,1680.0,1505.0,5.0,35.0
Maruti Suzuki,Swift DZire VDI,450000.0,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995.0,1695.0,1555.0,5.0,42.0
Hyundai,i10 Magna 1.2 Kappa2,220000.0,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585.0,1595.0,1550.0,5.0,35.0
Toyota,Glanza G,799000.0,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995.0,1745.0,1510.0,5.0,37.0
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000.0,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735.0,1830.0,1795.0,7.0,55.0
Maruti Suzuki,Ciaz ZXi,675000.0,2017,73315,Petrol,Manual,Pune,Grey,First,Individual,1373 cc,91 bhp @ 6000 rpm,130 Nm @ 4000 rpm,FWD,4490.0,1730.0,1485.0,5.0,43.0
Mercedes-Benz,CLA 200 Petrol Sport,1898999.0,2015,47000,Petrol,Automatic,Mumbai,White,Second,Individual,1991 cc,181 bhp @ 5500 rpm,300 Nm @ 1200 rpm,FWD,4630.0,1777.0,1432.0,5.0,
BMW,X1 xDrive20d M Sport,2650000.0,2017,75000,Diesel,Automatic,Coimbatore,White,Second,Individual,1995 cc,188 bhp @ 4000 rpm,400 Nm @ 1750 rpm,AWD,4439.0,1821.0,1612.0,5.0,51.0
Skoda,Octavia 1.8 TSI Style Plus AT [2017],1390000.0,2017,56000,Petrol,Automatic,Mumbai,White,First,Individual,1798 cc,177 bhp @ 5100 rpm,250 Nm @ 1250 rpm,FWD,4670.0,1814.0,1476.0,5.0,50.0
Nissan,Terrano XL (D),575000.0,2015,85000,Diesel,Manual,Mumbai,White,First,Individual,1461 cc,84 bhp @ 3750 rpm,200 Nm @ 1900 rpm,FWD,4331.0,1822.0,1671.0,5.0,50.0



#### Task 2: 
Which other columns should be numerical but were read as strings? Convert these to the correct data type. 
  * Hint: Some columns contain characters alongside numerical values. For these columns, use translate(col("Column Name"), "Characters", "") to remove those characters before casting the column to a numerical data type.
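
Note that `translate` works on individual characters, not on whole substrings: every character listed in the second argument is removed wherever it appears. Python's built-in `str.translate` behaves in a comparable way, so the plain-Python sketch below illustrates the idea without needing a Spark session. The helper name `drop_chars` is hypothetical, and the sample values are taken from the rows shown earlier.

```python
# Python's str.translate, like Spark's translate, maps or deletes single characters
def drop_chars(text, chars):
    """Delete every occurrence of each character in `chars` from `text`."""
    return text.translate(str.maketrans("", "", chars))

engine = drop_chars("1198 cc", " c")
power = drop_chars("87 bhp @ 6000 rpm", " bhp@rm")

print(engine)  # 1198
print(power)   # 876000
```

The second result shows a pitfall: when a value holds two numbers, deleting the characters between them merges the digits. For such columns, another approach is probably a better fit, for example extracting only the leading number with `regexp_extract` from `pyspark.sql.functions`.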

## Summary
---
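In this tutorial, you learned how to:

    Create a Spark session and load a CSV file into a PySpark DataFrame
    Explore a dataset with the Databricks data profile and visualizations, and with `describe()`, `summary()`, and `printSchema()`
    Clean a column that was read as a string by removing unwanted characters with `translate()` and casting it to a numerical type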


