# Reading Data - Parquet Files

**Technical Accomplishments:**
- Introduce the Parquet file format.
- Read data from:
  - Parquet files without a schema.
  - Parquet files with a schema.

## ➡️ Getting Started

Run the following cell to configure our notebook.

In [None]:
%run Utilities

## ➡️ Reading from Parquet Files

<strong style="font-size:larger">"</strong>Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.<strong style="font-size:larger">"</strong><br>

## ➡️ About Parquet Files
* Free & Open Source.
* Increased query performance over row-based data stores.
* Provides efficient data compression.
* Designed for performance on large data sets.
* Supports limited schema evolution.
* Is a splittable file format.

**Row-oriented systems**

A common method of storing a table is to serialize each row of data, like this:

```text
001:10,Smith,Joe,60000;
002:12,Jones,Mary,80000;
003:11,Johnson,Cathy,94000;
004:22,Jones,Bob,55000;
```

**Column-oriented systems**

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:

```text
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
60000:001,80000:002,94000:003,55000:004;
```
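
The two layouts above can be reproduced with a small plain-Python sketch. This is only an illustration of the idea, not how Parquet encodes data on disk; the function names `serialize_rows` and `serialize_columns` are made up for this example.

```python
# The example table: (row id, dept, last name, first name, salary)
rows = [
    ("001", 10, "Smith", "Joe", 60000),
    ("002", 12, "Jones", "Mary", 80000),
    ("003", 11, "Johnson", "Cathy", 94000),
    ("004", 22, "Jones", "Bob", 55000),
]

def serialize_rows(rows):
    """Row-oriented: all values of one row are stored together."""
    return "\n".join(
        f"{row_id}:" + ",".join(str(v) for v in values) + ";"
        for row_id, *values in rows
    )

def serialize_columns(rows):
    """Column-oriented: all values of one column are stored together."""
    ids = [row[0] for row in rows]
    columns = zip(*(row[1:] for row in rows))  # transpose the value columns
    return "\n".join(
        ",".join(f"{value}:{row_id}" for value, row_id in zip(column, ids)) + ";"
        for column in columns
    )

row_layout = serialize_rows(rows)
column_layout = serialize_columns(rows)
print(row_layout)
print(column_layout)
```

In the column layout, a query that only needs the salaries reads a single line (the last one), while the row layout forces it to scan every record.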

See also
* <a href="https://parquet.apache.org/" target="_blank">https&#58;//parquet.apache.org</a>
* <a href="https://en.wikipedia.org/wiki/Apache_Parquet" target="_blank">https&#58;//en.wikipedia.org/wiki/Apache_Parquet</a>

## ➡️ Data Source

This example uses population data from UNHCR’s annual statistical activities, dating back to 1951. See the [dataset readme](https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-22/readme.md) for more information.

In [None]:
%%sh

ls /lakehouse/default/Files/sampledata/population.parquet

## ➡️ Read in the Parquet File(s)

To read in these files, we will specify the location of the Parquet directory.

In [None]:
parquetFile = "Files/sampledata/population.parquet"

parquetDF = (spark.read              # The DataFrameReader
                .parquet(parquetFile)  # Creates a DataFrame from Parquet after reading in the file
            )

parquetDF.printSchema()         # Print the DataFrame's schema

## ➡️ Review: Reading from Parquet Files
* We do not need to specify the schema - the column names and data types are stored in the parquet files.
* Only one job is required to **read** that schema from the parquet file's metadata.
* Unlike the CSV or JSON readers, which have to scan the data to infer a schema, the parquet reader can "read" the schema very quickly because it takes that schema straight from the file's metadata.
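
To see why the metadata lookup is cheap: a Parquet file ends with its footer metadata, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. The sketch below uses only the standard library to read that length from the last 8 bytes of a single file. The helper name `footer_metadata_length` is hypothetical, and the sketch assumes an unencrypted file, not a directory of part files.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"

def footer_metadata_length(path):
    """Return the size in bytes of the footer metadata of one Parquet file."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)   # jump straight to the last 8 bytes
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])  # 4-byte little-endian length
    return length
```

A reader can use this length to seek directly to the footer, which holds the schema, without touching the column data. Since Spark usually writes a directory, the helper would be pointed at one part file inside it.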

In most cases, people do not provide the schema for Parquet files because reading in the schema is such a cheap process.
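
If you do want to supply a schema, `DataFrameReader.schema()` accepts either a `StructType` or a DDL-formatted string. The following is a minimal sketch: the column names and types in `population_schema` are assumptions for illustration and should be replaced with the ones reported by `printSchema()`, and `read_parquet_with_schema` is a hypothetical helper, not part of this notebook's utilities.

```python
# Hypothetical subset of columns; replace with the names and types
# that printSchema() reports for the real file.
population_schema = "year INT, coo_name STRING, refugees LONG"

def read_parquet_with_schema(spark, path, schema):
    """Read a Parquet path using a user-supplied schema
    instead of the one stored in the file metadata."""
    return (spark.read            # The DataFrameReader
                 .schema(schema)  # Use the supplied schema
                 .parquet(path))  # Creates a DataFrame from Parquet

# In the notebook, where `spark` is already defined:
# parquetSchemaDF = read_parquet_with_schema(
#     spark, "Files/sampledata/population.parquet", population_schema)
# parquetSchemaDF.printSchema()
```

The supplied types need to be compatible with what is stored in the files; a mismatch typically surfaces as an error when the data is actually read, not when the DataFrame is defined.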

And lastly, let's peek at the data:

In [None]:
display(parquetDF)