# Lernspark Pipeline Playground

This is a Jupyter notebook playground that lets you explore different types of SQL commands and basic data engineering and analysis operations on a sample of a much larger data set. Working with a sample like this is a common day-to-day task in data engineering and analysis.

## Prereqs
Open this notebook only after running the commands `lernspark-data` and `lernspark-play`. The notebook depends on that setup to create your example data set, which is packaged as a `tar.gz` archive. **You must run `lernspark-data` before `lernspark-play`.**

# Part 0: Python Imports

As you bring other packages into this notebook, keep adding their `import` statements to this block. It is recommended to group all imports in a single cell at the top of a notebook.

In [1]:
import json
import tarfile
import os
import tempfile
import shutil

# Part 1: Load Data into Memory

The first step is to extract the data, which is stored in `~/Downloads/examples.tar.gz`, and load it into this notebook's memory. We will extract it into a system temporary folder (a common operation) and list the extracted files.

In [7]:
# Get the path to the downloads folder
downloads_folder = os.path.expanduser("~/Downloads")

# Specify the filename of the tar.gz file
filename = "examples.tar.gz"

# Construct the full path to the tar.gz file
file_path = os.path.join(downloads_folder, filename)

# Create a temporary directory
temp_dir = tempfile.mkdtemp()


try:
    # Open the tar.gz file
    with tarfile.open(file_path, "r:gz") as tar:
        # Extract all files to the temporary directory
        tar.extractall(path=temp_dir)
    print("Extraction completed.")
    print(f"Extracted files are located in: {temp_dir}")
    
    # Get the list of files in the temporary directory
    extracted_files = os.listdir(temp_dir)
    
    # Print the names of the extracted files
    print("Extracted files:")
    for file_name in extracted_files:
        print(f"\t{file_name}")
except FileNotFoundError:
    print(f"I can't find {file_path}, you need to run `lernspark-play` to create your data sample zip file")
except tarfile.ReadError:
    print(f"Error reading the tar.gz file: {file_path}")

Extraction completed.
Extracted files are located in: /var/folders/b4/f3v6ww_s0_zcm_srr9ndn7jr0000gn/T/tmpa_gqky4z
Extracted files:
	Football_teams.parquet
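
If you want to check what an archive contains before extracting it, `tarfile` can list the member names without writing anything to disk. The sketch below is a minimal illustration of that idea, not part of the lernspark tooling: the helper `list_archive_members` and the stand-in files `demo.txt` and `demo.tar.gz` are hypothetical names created only so the example runs on its own.

```python
import os
import shutil
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the member names of a tar.gz archive without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Build a tiny stand-in archive so the sketch is self-contained.
# "demo.txt" and "demo.tar.gz" are hypothetical, not real lernspark files.
work_dir = tempfile.mkdtemp()
sample_path = os.path.join(work_dir, "demo.txt")
with open(sample_path, "w") as handle:
    handle.write("hello")

archive_path = os.path.join(work_dir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    # arcname controls the name stored inside the archive
    tar.add(sample_path, arcname="demo.txt")

member_names = list_archive_members(archive_path)
print(member_names)

# Remove the stand-in files again
shutil.rmtree(work_dir)
```

In this notebook you could call the same helper with `file_path` to preview the archive before running the extraction cell.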


# Part N: Clean-up
After we are done with our analysis, we need to clean up the disk space used by the extracted files. Computers have plenty of storage today, but this is a good habit to keep, as you never know what machine your application will run on!

In [8]:
# Clean up the extracted files
shutil.rmtree(temp_dir)
print(f"Temporary directory {temp_dir} cleaned up.")

Temporary directory /var/folders/b4/f3v6ww_s0_zcm_srr9ndn7jr0000gn/T/tmpa_gqky4z cleaned up.
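
As an alternative to the manual `shutil.rmtree` call, the standard library's `tempfile.TemporaryDirectory` can be used as a context manager that deletes the folder automatically when the block ends, even if an error occurs inside it. The following is a minimal sketch under that assumption. The helper `extract_and_list` and the stand-in archive `demo.tar.gz` are hypothetical and only exist to make the example runnable by itself.

```python
import os
import tarfile
import tempfile


def extract_and_list(archive_path):
    """Extract a tar.gz into a self-cleaning temp dir and return its file names.

    Also returns whether the temp dir still exists after the block,
    to show that the cleanup happened automatically.
    """
    with tempfile.TemporaryDirectory() as scratch_dir:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(path=scratch_dir)
        names = sorted(os.listdir(scratch_dir))
    # Leaving the "with" block has already removed scratch_dir
    return names, os.path.exists(scratch_dir)


# Build a hypothetical stand-in archive so the sketch runs without lernspark.
with tempfile.TemporaryDirectory() as build_dir:
    sample_path = os.path.join(build_dir, "demo.txt")
    with open(sample_path, "w") as handle:
        handle.write("hello")

    demo_archive = os.path.join(build_dir, "demo.tar.gz")
    with tarfile.open(demo_archive, "w:gz") as tar:
        tar.add(sample_path, arcname="demo.txt")

    extracted_names, still_exists = extract_and_list(demo_archive)

print(extracted_names)
print(f"Temp dir still exists: {still_exists}")
```

The trade-off is that any analysis of the extracted files has to happen inside the `with` block, which is why this notebook keeps the explicit `mkdtemp` and `rmtree` cells so the data stays available across cells.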
