# Reading Data - Parquet Files

**Technical Accomplishments:**
- Introduce the Parquet file format.
- Read data from:
  - Parquet files without a schema.
  - Parquet files with a schema.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
# If "/mnt/training" cannot be listed after "./Includes/Classroom-Setup", unmount it
# so that "./Includes/Dataset-Mounts-New" (run in the next cell) can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount('/mnt/training/')


/mnt/training/ has been unmounted.


In [0]:
%run "./Includes/Dataset-Mounts-New"

In [0]:
%run "./Includes/Utility-Methods"

Datasets are mounted


<div style="float:right; margin-right:1em">
  <img src="https://parquet.apache.org/assets/img/parquet_logo.png"><br>
  <a href="https://parquet.apache.org/" target="_blank">https&#58;//parquet.apache.org</a>
</div>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from Parquet Files

<strong style="font-size:larger">"</strong>Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.<strong style="font-size:larger">"</strong><br>

### About Parquet Files
* Free & Open Source.
* Increased query performance over row-based data stores.
* Provides efficient data compression.
* Designed for performance on large data sets.
* Supports limited schema evolution.
* Is a splittable "file format".
* A <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS" target="_blank">Column-Oriented</a> data store

&nbsp;&nbsp;&nbsp;&nbsp;** Row Format ** &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **Column Format**

<table style="border:0">

  <tr>
    <th>ID</th><th>Name</th><th>Score</th>
    <th style="border-top:0;border-bottom:0">&nbsp;</th>
    <th>ID:</th><td>1</td><td>2</td>
    <td style="border-right: 1px solid #DDDDDD">3</td>
  </tr>

  <tr>
    <td>1</td><td>john</td><td>4.1</td>
    <td style="border-top:0;border-bottom:0">&nbsp;</td>
    <th>Name:</th><td>john</td><td>mike</td>
    <td style="border-right: 1px solid #DDDDDD">sally</td>
  </tr>

  <tr>
    <td>2</td><td>mike</td><td>3.5</td>
    <td style="border-top:0;border-bottom:0">&nbsp;</td>
    <th style="border-bottom: 1px solid #DDDDDD">Score:</th>
    <td style="border-bottom: 1px solid #DDDDDD">4.1</td>
    <td style="border-bottom: 1px solid #DDDDDD">3.5</td>
    <td style="border-bottom: 1px solid #DDDDDD; border-right: 1px solid #DDDDDD">6.4</td>
  </tr>

  <tr>
    <td style="border-bottom: 1px solid #DDDDDD">3</td>
    <td style="border-bottom: 1px solid #DDDDDD">sally</td>
    <td style="border-bottom: 1px solid #DDDDDD; border-right: 1px solid #DDDDDD">6.4</td>
  </tr>

</table>
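
To make the row-versus-column picture concrete, here is a minimal pure-Python sketch that stores the three records from the table both ways. It is only an illustration of the idea, not how Parquet is implemented internally, and the names `rows`, `columns`, `average_score_from_rows`, and `average_score_from_columns` are made up for this example. Averaging the scores from the column layout touches a single list, whereas the row layout has to visit every record and skip the fields it does not need.

```python
# The three records from the table above, stored row by row.
rows = [
    (1, "john", 4.1),
    (2, "mike", 3.5),
    (3, "sally", 6.4),
]

# The same records stored column by column.
columns = {
    "ID": [1, 2, 3],
    "Name": ["john", "mike", "sally"],
    "Score": [4.1, 3.5, 6.4],
}

def average_score_from_rows(records):
    # Every record is visited, even though only the third field is used.
    return sum(record[2] for record in records) / len(records)

def average_score_from_columns(cols):
    # Only the "Score" column is read; "ID" and "Name" are never touched.
    scores = cols["Score"]
    return sum(scores) / len(scores)

print(average_score_from_rows(rows))
print(average_score_from_columns(columns))
```

Both functions return the same average; the difference is how much data each one has to walk through to get it, which is where a columnar format gains its query performance.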

See also
* <a href="https://parquet.apache.org/" target="_blank">https&#58;//parquet.apache.org</a>
* <a href="https://en.wikipedia.org/wiki/Apache_Parquet" target="_blank">https&#58;//en.wikipedia.org/wiki/Apache_Parquet</a>

### Data Source

The data for this example shows the number of requests to Wikipedia's mobile and desktop websites (<a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">23 MB</a> from Wikipedia). 

The original file, captured on August 5th, 2016, was downloaded, converted to a Parquet file, and made available for us at **/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/**

In [0]:
%fs ls /mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/

path,name,size,modificationTime
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/_SUCCESS,_SUCCESS,0,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/_committed_6241970109963426653,_committed_6241970109963426653,760,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/_started_6241970109963426653,_started_6241970109963426653,0,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2996913,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994285,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994196,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2992431,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2990093,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989931,1516688176000
dbfs:/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989314,1516688176000


Unlike our CSV and JSON examples, the Parquet "file" is actually a directory of files: the `part-*` files hold the bulk of the data, and the other three (`_SUCCESS`, `_committed_*`, and `_started_*`) hold metadata.
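
As a rough illustration of that split, the sketch below sorts a few of the names from the listing above into data files and metadata files. It assumes the convention visible in the listing, namely that metadata entries start with an underscore and data files start with `part-`; the function name `split_listing` is hypothetical.

```python
# A subset of the entry names from the listing above.
names = [
    "_SUCCESS",
    "_committed_6241970109963426653",
    "_started_6241970109963426653",
    "part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet",
    "part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet",
]

def split_listing(entry_names):
    """Separate data files from metadata files by name prefix (assumed convention)."""
    data = [n for n in entry_names if n.startswith("part-")]
    metadata = [n for n in entry_names if n.startswith("_")]
    return data, metadata

data_files, metadata_files = split_listing(names)
print(len(data_files), "data files;", len(metadata_files), "metadata files")
```

In a Databricks notebook, the names could instead be collected with `[f.name for f in dbutils.fs.ls(path)]`, where `path` is the Parquet directory.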

In [0]:
%fs ls /mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/

path,name,size,modificationTime
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/_SUCCESS,_SUCCESS,0,1509989475000
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00000-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,part-00000-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,8026016,1509989475000
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00001-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,part-00001-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,8350071,1509989475000
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00002-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,part-00002-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,8331056,1509989475000
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00003-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,part-00003-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,8344940,1509989475000
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00004-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,part-00004-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet,7680335,1509989475000


In [0]:
%fs head "/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/part-00000-2f6ae1e7-8431-4663-a275-3d2c5cc46e6e-c000.gz.parquet"

### Read in the Parquet Files

To read in these files, we will specify the location of the Parquet directory.

In [0]:
parquetFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/"

(spark.read              # The DataFrameReader
  .parquet(parquetFile)  # Creates a DataFrame from Parquet after reading in the file
  .printSchema()         # Print the DataFrame's schema
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)



### Review: Reading from Parquet Files
* We do not need to specify the schema - the column names and data types are stored in the parquet files.
* Only one job is required to **read** that schema from the parquet file's metadata.
* Unlike the CSV or JSON readers that have to load the entire file and then infer the schema, the parquet reader can "read" the schema very quickly because it's reading that schema from the metadata.
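
The schema is cheap to find because a Parquet file keeps its metadata, including the schema, in a footer at the end of the file: the file begins with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again. The sketch below only illustrates that outer layout and is not Spark's reader; `read_footer_length` is a hypothetical helper, and the bytes it runs on are a synthetic stand-in rather than a real Parquet file.

```python
import io
import struct

MAGIC = b"PAR1"

def read_footer_length(f):
    """Return the footer (metadata) length recorded in the last 8 bytes of a Parquet file."""
    f.seek(-8, io.SEEK_END)  # jump straight to the tail; no data pages are read
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: trailing magic bytes missing")
    (length,) = struct.unpack("<I", tail[:4])
    return length

# Synthetic stand-in with the same outer layout: magic, data, footer, footer length, magic.
fake_data = b"x" * 1000
fake_footer = b"m" * 42
fake_file = io.BytesIO(
    MAGIC + fake_data + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
)

print(read_footer_length(fake_file))
```

A reader that knows the footer length can seek directly to the metadata and decode the schema without scanning the data itself, which is why no inference pass over the records is needed.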

### Read in the Parquet Files w/Schema

If we want to avoid the extra job entirely, we can, again, specify the schema, even for Parquet files:

***WARNING*** *Providing a schema may avoid this one-time hit to determine the `DataFrame`'s schema.*  
*However, if you specify the wrong schema, it will conflict with the true schema and result in an analysis exception at runtime.*

In [0]:
# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

parquetSchema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

(spark.read               # The DataFrameReader
  .schema(parquetSchema)  # Use the specified schema
  .parquet(parquetFile)   # Creates a DataFrame from Parquet after reading in the file
  .printSchema()          # Print the DataFrame's schema
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)
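
Because a wrong schema only surfaces as an error later, one cautious option is to compare the declared schema with the one stored in the files before relying on it. The helper below is a hypothetical sketch (`schema_mismatches` is not part of Spark); it works on lists of `(name, type)` pairs in the form returned by a `DataFrame`'s `dtypes` attribute, so it can be tried without a cluster.

```python
def schema_mismatches(declared, actual):
    """Return human-readable differences between two lists of (column, type) pairs."""
    problems = []
    actual_types = dict(actual)
    for name, declared_type in declared:
        if name not in actual_types:
            problems.append(f"{name}: missing from the files")
        elif actual_types[name] != declared_type:
            problems.append(
                f"{name}: declared {declared_type}, files have {actual_types[name]}"
            )
    return problems

# Types as the printed schema above reports them, in the short form used by `dtypes`.
actual = [("timestamp", "string"), ("site", "string"), ("requests", "int")]

# A matching declaration produces no complaints.
print(schema_mismatches(actual, actual))

# A declaration that gets the type of `timestamp` wrong is reported.
print(schema_mismatches([("timestamp", "timestamp"), ("requests", "int")], actual))
```

In the notebook, the actual pairs could come from `spark.read.parquet(parquetFile).dtypes` and the declared pairs from `[(f.name, f.dataType.simpleString()) for f in parquetSchema.fields]`. That check costs the schema-reading job again, so it is better suited to a one-off validation than to every run.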



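Notice that the printed schema still reports `nullable = true` for every column even though we declared the fields as non-nullable: when Spark reads Parquet files, it converts all columns to nullable for compatibility reasons.

Let's take a look at some of the other details of the `DataFrame` we just created, for comparison's sake.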

In [0]:
parquetDF = spark.read.schema(parquetSchema).parquet(parquetFile)

print("Partitions: " + str(parquetDF.rdd.getNumPartitions()) )
printRecordsPerPartition(parquetDF)
print("-"*80)

Partitions: 5
Per-Partition Counts
#1: 1,463,276
#2: 1,462,749
#3: 1,462,393
#4: 1,463,679
#5: 1,347,903
--------------------------------------------------------------------------------
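
The per-partition counts above also tell us the total number of records and how evenly they are spread. The arithmetic below is plain Python on the printed numbers; the variable names are illustrative only.

```python
# Per-partition record counts copied from the output above.
counts = [1_463_276, 1_462_749, 1_462_393, 1_463_679, 1_347_903]

total = sum(counts)
print(f"Total records: {total:,}")

# Share of the records held by each partition, as a percentage.
for i, count in enumerate(counts, start=1):
    print(f"#{i}: {count / total:.1%}")
```

The counts add up to 7,200,000 records, with the first four partitions holding nearly equal shares and the fifth somewhat fewer. The five partitions line up with the five `part-*` files in the directory listing.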


In most cases, people do not provide the schema for Parquet files because reading the schema from the metadata is so cheap.

And lastly, let's peek at the data:

In [0]:
display(parquetDF)

timestamp,site,requests
2015-03-22T14:13:34,mobile,1425
2015-03-22T14:23:18,desktop,2534
2015-03-22T14:36:47,desktop,2444
2015-03-22T14:38:39,mobile,1488
2015-03-22T14:57:11,mobile,1519
2015-03-22T15:03:18,mobile,1559
2015-03-22T15:16:47,mobile,1510
2015-03-22T15:45:03,desktop,2673
2015-03-22T15:58:32,desktop,2463
2015-03-22T16:06:11,desktop,2525


## Next steps

Start the next lesson, [Reading Data - Tables and Views]($./4.Reading%20Data%20-%20Tables%20and%20Views)