# Parquet For Data Processing

This notebook demonstrates the Parquet data format and its capabilities for big data processing.
We will use the pandas, sqlite, pyarrow and pyspark libraries to demonstrate those capabilities.
The dataset is an SQLite dump of [wikibooks](https://www.kaggle.com/datasets/dhruvildave/wikibooks-dataset) from Kaggle. It contains 270K chapters of Wikibooks in 12 languages, but we will concentrate on the English version. To access this dataset, you need to set up a Kaggle account and [download your kaggle.json file](https://www.kaggle.com/docs/api#authentication) before proceeding.

- pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools in Python.
- sqlite is an embedded SQL database engine that uses the more traditional [B-Tree data format](https://www.sqlite.org/fileformat2.html) for storage on disk.
- pyarrow is the Python API of the [Apache Arrow](https://arrow.apache.org/) framework, which defines an in-memory data representation and can read and write Parquet, including conversion to and from pandas.
- pyspark is the Python API to the [Apache Spark Engine](https://spark.apache.org/). It interfaces Python commands with a Java/Scala execution core, and because Spark supports Parquet natively, it gives Python programmers access to the Parquet format.

In [9]:
import sqlite3
import pandas as pd

In [7]:
from IPython.display import display
from ipywidgets import FileUpload
import os
import shutil

# Creating Parquet dataset from sqlite
We will fetch the SQLite dataset, convert each table to a Parquet file, and store the files on disk. We can then compare the on-disk sizes to get an idea of how efficient Parquet is. Note that the download is about 1.8 GB, so it might take a while depending on your network speed.

## Setting up kaggle token and downloading the dataset

In [29]:
# Download your Kaggle credentials file (kaggle.json) from Kaggle and supply it here.
# This is only necessary if you have not yet set up your credentials
# in the .kaggle folder in your home directory.
# Path to the .kaggle directory
kaggle_dir = os.path.expanduser('~/.kaggle')
kaggle_file_path = os.path.join(kaggle_dir, 'kaggle.json')

# Function to check and prompt for file upload
def check_and_prompt_for_upload():
    if not os.path.isfile(kaggle_file_path):
        print("kaggle.json file not found. Please upload the file.")
        upload = FileUpload(accept='application/json', multiple=False)
        display(upload)
        return upload
    else:
        print("kaggle.json file already exists in the '~/.kaggle' directory.")
        return None

# Write the uploaded file to ~/.kaggle/kaggle.json and restrict its permissions
def process_uploaded_file(upload_widget):
    # Ensure the .kaggle directory exists
    os.makedirs(kaggle_dir, exist_ok=True)
    
    if upload_widget:
        # Assuming the first item in the tuple is the file info dictionary
        file_info = upload_widget.value[0]  # Extract the file details from the tuple
        
        content = file_info['content']
        with open(kaggle_file_path, 'wb') as f:
            f.write(content)
        print(f"'{file_info['name']}' has been moved to '{kaggle_dir}'.")

        # Set file permissions to 600
        os.chmod(kaggle_file_path, 0o600)
        print(f"Permissions for '{file_info['name']}' set to 600.")

upload = check_and_prompt_for_upload()

kaggle.json file not found. Please upload the file.


FileUpload(value=(), accept='application/json', description='Upload')

In [30]:
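# `upload` is None (not undefined) when kaggle.json already existed,
# so check for None instead of catching NameError
if upload is None:
    print("Upload widget not displayed or file already exists.")
elif upload.value:
    process_uploaded_file(upload)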

'kaggle.json' has been moved to '/Users/yashdatta/.kaggle'.
Permissions for 'kaggle.json' set to 600.
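
Before calling the Kaggle API, it can help to confirm that the saved token looks usable. The sketch below is only an illustration: `check_kaggle_credentials` is a hypothetical helper (not part of the kaggle package), and it assumes the token is a JSON object with `username` and `key` fields and that the file should not be readable by other users on POSIX systems. The demo values are made up.

```python
import json
import os
import stat
import tempfile


def check_kaggle_credentials(path):
    """Return a list of problems found in a kaggle.json-style file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    # The token file is expected to be a JSON object
    try:
        with open(path, "r", encoding="utf-8") as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not parse {path}: {exc}"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing or empty field: {field}")

    # On POSIX systems, flag files that group or others can access
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if os.name == "posix" and mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


# Demonstration on a throwaway file with made-up values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"username": "example-user", "key": "not-a-real-key"}, f)
    os.chmod(demo_path, 0o600)
    demo_problems = check_kaggle_credentials(demo_path)

print(demo_problems)
```

In the notebook itself, the same helper could be pointed at `kaggle_file_path` right after the upload step.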


In [31]:
import kaggle

In [26]:
!kaggle datasets download -d dhruvildave/wikibooks-dataset

Downloading wikibooks-dataset.zip to /Users/yashdatta/Documents/Workspace/de-experiments/data/parquet
100%|██████████████████████████████████████| 1.82G/1.82G [01:41<00:00, 18.9MB/s]
100%|██████████████████████████████████████| 1.82G/1.82G [01:41<00:00, 19.3MB/s]


In [32]:
from zipfile import ZipFile

file_name = 'wikibooks-dataset.zip'  # exact name of the downloaded archive
# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(file_name, 'r') as zf:
    zf.extractall()
    print('Done')

Done
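
The archive expands to a much larger SQLite file than the 1.8 GB download, so it may be worth checking the free disk space before extracting. The following is a minimal sketch using only the standard library; `uncompressed_size_bytes` and `has_room_for` are hypothetical helper names, and the demo archive is created on the fly purely for illustration.

```python
import os
import shutil
import tempfile
from zipfile import ZIP_DEFLATED, ZipFile


def uncompressed_size_bytes(zip_path):
    """Sum the uncompressed sizes recorded in the archive's directory."""
    with ZipFile(zip_path, "r") as zf:
        return sum(info.file_size for info in zf.infolist())


def has_room_for(zip_path, target_dir="."):
    """Return (fits, needed_bytes, free_bytes) for extracting into target_dir."""
    needed = uncompressed_size_bytes(zip_path)
    free = shutil.disk_usage(target_dir).free
    return needed <= free, needed, free


# Demonstration on a small throwaway archive
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with ZipFile(demo_zip, "w", compression=ZIP_DEFLATED) as zf:
        zf.writestr("demo.txt", "parquet " * 1000)  # 8000 bytes of repetitive text
    demo_needed = uncompressed_size_bytes(demo_zip)
    demo_compressed = os.path.getsize(demo_zip)
    demo_fits, _, _ = has_room_for(demo_zip, tmp)

print(f"needs {demo_needed} bytes, archive is {demo_compressed} bytes, fits: {demo_fits}")
```

For the real dataset, `has_room_for('wikibooks-dataset.zip')` would report the space required in the current directory.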


## Convert data to parquet files

In [38]:
# Path to the SQLite database file
sqlite_file = 'wikibooks.sqlite'

# Get the size of the SQLite database file
sqlite_file_size_bytes = os.path.getsize(sqlite_file)
# Convert the size from bytes to megabytes (MB)
sqlite_file_size_mb = sqlite_file_size_bytes / (1024 ** 2)

# Establish a connection to the SQLite database
conn = sqlite3.connect(sqlite_file)

# Create a cursor object
cursor = conn.cursor()

# Execute the SQL query to retrieve table names
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

# Fetch all the table names
table_names = cursor.fetchall()

# Initialize a variable to hold the sum of the sizes of the Parquet files
sum_parquet_files_size_bytes = 0

# Iterate over the table names
for table_name in table_names:
    table_name = table_name[0]  # Extract the table name from the tuple

    file_name = f"{table_name}.parquet"
    
    # Check if the Parquet file already exists
    if os.path.exists(file_name):
        print(f"File '{file_name}' already exists. Skipping...")
        sum_parquet_files_size_bytes += os.path.getsize(file_name)
        continue
    
    # Read the whole table, including its column names, straight into a DataFrame.
    # The table name is quoted because identifiers cannot be bound as SQL parameters.
    df = pd.read_sql_query(f'SELECT * FROM "{table_name}";', conn)

    # Save the DataFrame as a Parquet file
    df.to_parquet(file_name, index=False)
    sum_parquet_files_size_bytes += os.path.getsize(file_name)

    print(f"Table '{table_name}' saved as '{file_name}'")

File 'pl.parquet' already exists. Skipping...
File 'hu.parquet' already exists. Skipping...
File 'he.parquet' already exists. Skipping...
File 'nl.parquet' already exists. Skipping...
File 'ja.parquet' already exists. Skipping...
File 'ru.parquet' already exists. Skipping...
File 'it.parquet' already exists. Skipping...
File 'en.parquet' already exists. Skipping...
File 'es.parquet' already exists. Skipping...
File 'pt.parquet' already exists. Skipping...
File 'de.parquet' already exists. Skipping...
File 'fr.parquet' already exists. Skipping...
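
The conversion loop above pulls each table into memory in one go, which can be heavy for a database of this size. As a hedged alternative, the sketch below reads a table in fixed-size batches with the standard `sqlite3` module. The helpers `list_tables` and `iter_batches` are hypothetical names, and the in-memory table with `title` and `body` columns is invented for the demo; it is not the actual Wikibooks schema.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def iter_batches(conn, table_name, batch_size=1000):
    """Yield (column_names, rows) pairs with at most batch_size rows each."""
    # Identifiers cannot be bound as parameters, so quote the name instead
    quoted = '"' + table_name.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    # cursor.description holds one tuple per column; the name is the first item
    column_names = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield column_names, rows


# Demonstration on an in-memory database with made-up data
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE en (title TEXT, body TEXT)")
demo_conn.executemany(
    "INSERT INTO en VALUES (?, ?)",
    [(f"Chapter {i}", f"Text {i}") for i in range(5)],
)

demo_tables = list_tables(demo_conn)
demo_batch_sizes = []
for demo_columns, demo_rows in iter_batches(demo_conn, "en", batch_size=2):
    demo_batch_sizes.append(len(demo_rows))
demo_conn.close()

print(demo_tables, demo_columns, demo_batch_sizes)
```

Each batch could then be turned into a DataFrame or an Arrow table and appended to a single Parquet file, for example with pyarrow's `ParquetWriter`, instead of building one large DataFrame per table.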


## Space savings

In [41]:
# Convert the sum of the sizes of the Parquet files from bytes to megabytes (MB)
sum_parquet_files_size_mb = sum_parquet_files_size_bytes / (1024 ** 2)

# Calculate the percentage of space saved
space_savings_percentage = (1 - (sum_parquet_files_size_mb / sqlite_file_size_mb)) * 100

# Print the size of the SQLite database file in MB
print(f"Size of SQLite database file: {sqlite_file_size_mb:.2f} MB")

# Print the sum of the sizes of the Parquet files in MB
print(f"Sum of sizes of Parquet files: {sum_parquet_files_size_mb:.2f} MB")

# Print the percentage of space saved
print(f"Percentage of space saved by converting to Parquet: {space_savings_percentage:.2f}%")

Size of SQLite database file: 11701.34 MB
Sum of sizes of Parquet files: 3309.33 MB
Percentage of space saved by converting to Parquet: 71.72%
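
As a quick check of the arithmetic, the snippet below recomputes the savings from the two sizes printed above and also expresses them as a size ratio. Keep in mind that this compares whole files: the SQLite file may also contain indexes and unused pages, and pandas writes Parquet with snappy compression by default, so the figure is an indication of the savings and not a pure format-to-format measurement.

```python
# Sizes in MB, copied from the output above
sqlite_mb = 11701.34
parquet_mb = 3309.33

savings_pct = (1 - parquet_mb / sqlite_mb) * 100  # share of space saved
size_ratio = sqlite_mb / parquet_mb               # how many times smaller

print(f"{savings_pct:.2f}% saved, {size_ratio:.2f}x smaller")
# prints: 71.72% saved, 3.54x smaller
```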


# Inspecting Parquet Data Format