### Inspecting Data in Spark

In this notebook we demonstrate how to examine the data and metadata (schema) in Spark DataFrames. This notebook is based on material supplied by Cloudera under their Cloudera Academic Partner program and the *Spark: The Definitive Guide* book by Bill Chambers and Matei Zaharia. You can find out more about Spark here: [https://spark.apache.org/](https://spark.apache.org/ "Apache Spark"). 

Topics
In this module we will take our first look at a Spark DataFrame.
- Examining the schema
- Viewing some data
- Computing the number of rows (observations)
- Computing summary statistics
- Inspecting a column (variable)
  - Inspecting a key variable
  - Inspecting a categorical variable
  - Inspecting a continuous numerical variable
  - Inspecting a datetime variable

In [0]:
# Load the rider data from our location for data on S3 into a Spark DataFrame
riders = spark.read.csv("/mnt/cis442f-data/duocar/raw/riders/", header=True, inferSchema=True)


### Examining the schema

Use the `printSchema` method of the DataFrame class to examine the schema in a tree format. Follow [this link](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) to see the full set of Spark DataFrame methods. You can recognize a method call by its syntax, `instance_of_class.method_name()`, for example `riders.printSchema()`.

In [0]:
riders.printSchema()

Use the `dtypes`, `columns`, and `schema` attributes to view the schema in other ways.

**Notes:**
- The `schema` attribute provides a programmatic version of the schema.
- You can recognize an attribute by its syntax, `instance_of_class.attribute_name`, for example `riders.schema`. There are no parentheses `()` after an attribute name.
- `columns` returns a Python list of column names, and `dtypes` returns a Python list of `(column name, data type)` tuples.

In [0]:
riders.dtypes 

In [0]:
riders.columns

In [0]:
riders.schema

#### Viewing some data and counting the number of rows (observations)

- Use the `show` method to get a SQL-like display of the data
- Use the `head` or `take` method to get a display of the `Row` objects that is sometimes easier to read
- Use the `count` method to compute the number of rows

**Notes:** 
- `id` and `home_block` are long integers rather than strings.
- `birth_date` and `start_date` are Timestamps rather than Dates.
- `ethnicity`, `work_lat`, and `work_lon` appear to have null (missing) values.
- `student` is a boolean variable inefficiently encoded as an integer.

In [0]:
# Use the `show` method to get a SQL-like display of the data
riders.show(3)

In [0]:
# You can choose to show a subset of columns for neater printing
# Copying and pasting output from the columns attribute saves you from having to type all the column names 
riders.select(['id', 'first_name',  'last_name', 'sex',  'ethnicity',  'student',  'home_block']).show(3)


In [0]:
# Use the `head` or `take` method to get a display of the `Row` objects.
# The two methods are similar: `take` requires the number of rows, whereas
# `head()` with no argument returns only the first `Row`.
# Perhaps `head` was included to mirror pandas.
riders.head()

In [0]:
# `take(n)` returns a list of the first n `Row` objects
riders.take(2)

In [0]:
# Use the `count` method to compute the number of rows
riders.count()

#### Computing summary statistics

Use the `describe` method to compute summary statistics for numeric and string columns.

**Notes:** 
- The count in the `describe` method is a count of non-missing values.
- `sex` and `ethnicity` seem to have some missing values.

In [0]:
# Use the `describe` method to compute summary statistics
# but asking for all columns can be difficult to read
riders.describe().show() 


#### Inspecting a column (variable)

We often want to inspect one or a few variables at a time.
 

#### Inspect a key variable

The variable `id` should uniquely identify each row.  Let us confirm this.  Use the `select` method to select the `id` column.

**Notes:**
- We can express our requirements in either **DataFrame** or **SQL** syntax.
- We create a view for the SQL approach using `riders.createOrReplaceTempView("riders")`.

In [0]:
riders.select("id").describe().show()

In [0]:
# Alternatively, pass the column name to the `describe` method:
riders.describe("id").show()

In [0]:
# Asking for a few columns at a time can make output easier to read
riders.describe(['id', 'birth_date', 'start_date', 'first_name', 'last_name', 'sex', 'ethnicity']).show()
riders.describe(['student', 'home_block', 'home_lat']).show()
riders.describe(['home_lon', 'work_lat', 'work_lon']).show()

In [0]:
# Use the `countDistinct` function to determine the number of distinct values
from pyspark.sql.functions import count, countDistinct

riders.select(count("id"), countDistinct("id")).show()

# Note - This can take quite a bit of time on large DataFrames.

In [0]:
# You can use functional style, as shown above, or SQL style for examination of
# DataFrames.  SQL style requires one preliminary step:

riders.createOrReplaceTempView("riders")

spark.sql("select count(id), count(distinct id) from riders").show()

#### Inspect a categorical variable

The variable `sex` is a categorical variable.  Let us examine it more carefully.

**Notes:**
- The `countDistinct` function does not count `NULL` as a distinct value, but the `distinct` method does.
- We can use either DataFrame or SQL syntax to carry out our examination.
- We can execute SQL using `spark.sql` and the previously defined view.
- We can also write a SQL statement directly in a notebook cell with the `%sql` prefix (this also works in Zeppelin).

In [0]:
riders.select(count("*"), count("sex"), countDistinct("sex")).show()

In [0]:
# Use the `select` method to select a subset of columns:
riders.select("sex").show(5)

In [0]:
# Use the `distinct` method to determine the unique values of `sex`
riders.select("sex").distinct().show() 
riders.select("sex").distinct().count()

In [0]:
# An alternative approach is to use an aggregation
riders.groupby("sex").count().show()

# The same query in SQL style (note that "group by 1" refers to the first column in the select statement)
spark.sql("select sex, count(*) from riders group by 1").show()
  
# **Note:** `sex` contains null (missing) values that we may have to deal with

You can use the `%sql` magic command to run SQL directly on the view created above.
Databricks notebooks (and Zeppelin notebooks) include some basic visualization tools
for the output of queries.

See https://docs.databricks.com/user-guide/visualizations/index.html#visualizations-in-python for more on visualizations in Databricks notebooks.

In [0]:
%sql 
select sex, count(*) AS `Number of Riders`
from riders 
group by sex


| sex | Number of Riders |
| --- | --- |
| null | 77 |
| female | 694 |
| male | 952 |


In [0]:
# Databricks offers a direct way of visualizing data in Spark DataFrames
display(riders.groupBy("sex").count())

| sex | count |
| --- | --- |
| null | 77 |
| female | 694 |
| male | 952 |


#### Inspecting a numerical variable

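The variables `home_lat` and `home_lon` are continuous numerical variables.

**Notes:**
- Neither variable has missing values.
- Neither variable has extreme values (the distribution is tight about [Fargo, North Dakota](https://en.wikipedia.org/wiki/Fargo%2C_North_Dakota)).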

In [0]:
riders.select("home_lat", "home_lon").show(5)
riders.select("home_lat", "home_lon").describe().show(5)


In [0]:
# Use the `approxQuantile` method to get customized (approximate) quantiles:
riders.approxQuantile(
    ["home_lat", "home_lon"],
    probabilities=[0.0, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1.0],
    relativeError=0.1
)

# **Note:** This method returns a Python list (one list of quantiles per column), not a DataFrame.

#### Inspecting a datetime variable

Note that the original data was in Date format, but Spark read the data in
Timestamp format.  We will probably want to fix this.

In [0]:
# Let us inspect `birth_date` and `start_date`, which are both Timestamp
# variables:

dates = riders.select("birth_date", "start_date")
dates.show(5, truncate=False)
dates.head(5)

In [0]:
# Note that the `describe` method does not work with Date or Timestamp
# variables:

dates.describe().show(5)

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Read the raw driver data from HDFS into a Spark DataFrame.

(2) Inspect the driver DataFrame.  Are the data types for each column appropriate?

(3) Inspect the columns of the driver DataFrame.  Are there any issues with the data?



**References you might need to learn about Spark and Databricks**

[Spark SQL, DataFrames, and Datasets Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html)

[DataFrame class](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#dataframe-apis)

[Spark Types](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#data-types)

[Data visualizations in Databricks](https://docs.databricks.com/user-guide/visualizations/index.html#visualizations-in-python)