## Display function

There are different ways to view data in a DataFrame.

* Transformations...
  * `limit(..)`
  * `select(..)`
  * `drop(..)`
  * `distinct()`
  * `dropDuplicates(..)`
* Actions...
  * `show(..)`
  * `display(..)`

In [0]:
parquetDF = spark.read.parquet("/FileStore/tables/userdata1.parquet")

**show()**  
The `show(..)` method effectively has two optional parameters:
* **n**: The number of records to print to the console. The default is 20.
* **truncate**: If true, values wider than 20 characters are truncated. The default is true.

In [0]:
parquetDF.show()

In [0]:
parquetDF.show(100, False)

In [0]:
display(parquetDF)

### show(..) vs display(..)
* `show(..)` is part of core Spark - `display(..)` is specific to our notebooks.
* `show(..)` prints a plain-text table - `display(..)` renders a formatted, interactive table.
* `show(..)` has parameters for truncating both columns and rows - `display(..)` does not.
* `show(..)` is a method of the `DataFrame`/`Dataset` class - `display(..)` works with a number of different objects.
* `display(..)` is more powerful - with it, you can...
  * Download the results as CSV.
  * Render line charts, bar charts and other graphs, maps and more.
  * See up to 1000 records at a time.
  
For the most part, the difference between the two is going to come down to preference.

Like `DataFrame.show(..)`, `display(..)` is an **action** which triggers a job.

**limit()**  
> Returns a new dataset containing the first n rows.
>> It is a transformation.  
>> It is useful with `display(..)`, which, unlike `show(..)`, has no row-limit argument.

In [0]:
limitedDF=parquetDF.limit(5)

In [0]:
display(limitedDF)

In [0]:
limitedDF.show(100,False)

**select()**  
> Picks only the required columns from the dataset and returns them as a new `DataFrame`.

In [0]:
threecolDF=parquetDF.select('first_name','last_name','gender')
threecolDF.show(10)

**drop()**  
> Drops the unwanted columns and returns a new `DataFrame`; the original is left unchanged.

In [0]:
threecolDF.drop('gender').show(10)

**distinct() & dropDuplicates()**  
> Called without arguments, these two transformations do the same thing. `dropDuplicates(..)` can additionally take a list of column names, in which case only those columns are compared.

In [0]:
threecolDF

In [0]:
distinctDF=threecolDF.distinct()

In [0]:
distinctDF.show()

In [0]:
distinctgenderDF=threecolDF.select("gender").distinct()

In [0]:
distinctgenderDF.show()

In [0]:
distinctgenderDF2=threecolDF.select("gender").dropDuplicates()
distinctgenderDF2.show()

In [0]:
distinctgenderDF2.count()

**DataFrames vs SQL & Temporary Views**  

The `DataFrame` API is built on top of Spark's SQL engine.

As such, we can register a `DataFrame` as a temporary view and then query it with standard SQL.

Let's start by creating a temporary view from a previous `DataFrame`.

In [0]:
threecolDF.createOrReplaceTempView('People')

In [0]:
%sql
select * from People;

**Converting SQL results to DataFrame**

In [0]:
femaleDF=spark.sql("select first_name, last_name from People where gender='Female' order by first_name")

In [0]:
femaleDF.show()