# Lernspark Pipeline Playground

This is a Jupyter notebook playground that lets you explore different types of SQL commands and basic data engineering and analysis operations on a sample of a much larger data set. Working with a sample like this is a common day-to-day task in data engineering and analysis.

## Prereqs
Only open this notebook after running the commands `lernspark-data` and `lernspark-play`. The notebook depends on their setup, which creates your example data set and packages it as a `tar.gz` archive. **You must run `lernspark-data` before `lernspark-play`.**
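
Before opening the archive, it can help to confirm that the setup commands actually produced it. The sketch below is one way to do that. `archive_ready` is a hypothetical helper name, not part of lernspark, and the demo runs on throwaway files rather than your real archive.

```python
import io
import os
import tarfile
import tempfile


def archive_ready(path):
    """Return True if path exists and can be opened as a tar archive."""
    # Check existence first: tarfile.is_tarfile raises if the file is missing.
    return os.path.isfile(path) and tarfile.is_tarfile(path)


# Demo on throwaway files (assumption: stand-ins for the real archive).
with tempfile.TemporaryDirectory() as demo_dir:
    missing = os.path.join(demo_dir, "missing.tar.gz")
    not_tar = os.path.join(demo_dir, "notes.txt")
    good = os.path.join(demo_dir, "good.tar.gz")

    # A plain text file that is not an archive
    with open(not_tar, "w") as handle:
        handle.write("this is not an archive")

    # A small but valid gzip-compressed tar archive with one member
    payload = b"hello"
    with tarfile.open(good, "w:gz") as tar:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    check_missing = archive_ready(missing)
    check_not_tar = archive_ready(not_tar)
    check_good = archive_ready(good)
```

In this notebook you could call the helper on the path to `examples.tar.gz` in your Downloads folder and rerun the setup commands if it returns `False`.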

# Part 0: Python Imports

As you pull in other packages for this notebook, keep adding their `import` statements to this block. It is recommended to group all imports into a single cell at the top of a notebook.

In [9]:
%pip install pandas pyarrow fastparquet

Collecting pyarrow
  Downloading pyarrow-16.1.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.0 kB)
Collecting fastparquet
  Downloading fastparquet-2024.5.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting cramjam>=2.3 (from fastparquet)
  Downloading cramjam-2.8.3-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.metadata (4.2 kB)
Collecting fsspec (from fastparquet)
  Downloading fsspec-2024.6.0-py3-none-any.whl.metadata (11 kB)
Downloading pyarrow-16.1.0-cp312-cp312-macosx_11_0_arm64.whl (26.0 MB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.0/26.0 MB[0m [31m50.0 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:01[0m
[?25hDownloading fastparquet-2024.5.0-cp312-cp312-macosx_11_0_arm64.whl (685 kB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m685.1/685.1 kB[0m [31m45.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading cramjam-2.8.3-cp312-cp312-macosx_10_12_x86_64.ma

In [10]:
import json
import tarfile
import os
import tempfile
import shutil
import pandas as pd

# Part 1: Load Data into Memory

The first step is to extract the data stored in `~/Downloads/examples.tar.gz` and load it into this notebook's memory. We will extract it into the system temporary folder (a common operation) and list the files it contains.

In [11]:
# Get the path to the downloads folder
downloads_folder = os.path.expanduser("~/Downloads")

# Specify the filename of the tar.gz file
filename = "examples.tar.gz"

# Construct the full path to the tar.gz file
file_path = os.path.join(downloads_folder, filename)

# Create a temporary directory
temp_dir = tempfile.mkdtemp()


try:
    # Open the tar.gz file
    with tarfile.open(file_path, "r:gz") as tar:
        # Extract all files to the temporary directory
        tar.extractall(path=temp_dir)
    print("Extraction completed.")
    print(f"Extracted files are located in: {temp_dir}")
    
    # Get the list of files in the temporary directory
    extracted_files = os.listdir(temp_dir)
    
    # Print the names of the extracted files
    print("Extracted files:")
    for file_name in extracted_files:
        print(f"\t{file_name}")
except FileNotFoundError:
    print(f"I can't find {file_path}, you need to run `lernspark-play` to create your data sample zip file")
except tarfile.ReadError:
    print(f"Error reading the tar.gz file: {file_path}")

Extraction completed.
Extracted files are located in: /var/folders/b4/f3v6ww_s0_zcm_srr9ndn7jr0000gn/T/tmpgtn4b1it
Extracted files:
	Football_teams.parquet
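
If you want to see what an archive contains before relying on its contents, `tarfile` can list member names with `getnames()`. The sketch below also passes `filter="data"` to `extractall` when the running Python supports it, which refuses members that would end up outside the destination directory. The helper name `list_and_extract` and the demo archive are illustrative assumptions, not part of the original notebook.

```python
import io
import os
import tarfile
import tempfile


def list_and_extract(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Extraction filters exist in Python 3.12 and in recent patch
        # releases of older versions; fall back if they are unavailable.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(path=dest_dir, filter="data")
        else:
            tar.extractall(path=dest_dir)
    return names


# Demo on a throwaway archive (assumption: a stand-in for examples.tar.gz).
with tempfile.TemporaryDirectory() as work_dir:
    archive_path = os.path.join(work_dir, "demo.tar.gz")
    payload = b"hello"
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    out_dir = os.path.join(work_dir, "out")
    os.mkdir(out_dir)
    demo_names = list_and_extract(archive_path, out_dir)
    demo_extracted = sorted(os.listdir(out_dir))
```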


## Load Parquet file into Python Memory
Now that we have extracted the example data, we can read it into Python using pandas, which relies on a Parquet engine such as `pyarrow` or `fastparquet`.

In [12]:
# Pick one of the file names printed above
# :: EDIT THIS LINE 
data_file = "Football_teams.parquet"

# Specify the path to the extracted Parquet file
parquet_file = os.path.join(temp_dir, data_file)

try:
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(parquet_file)
    
    # Print the first few rows of the DataFrame
    print("First few rows of the data:")
    print(df.head())
    
    # Print the summary statistics of the DataFrame
    print("\nSummary statistics:")
    print(df.describe())
    
    # Explore the data further as needed
    # ...
except FileNotFoundError:
    print(f"Parquet file not found: {parquet_file}")
except Exception as e:
    print(f"Error loading Parquet file: {str(e)}")

First few rows of the data:
   ID                Name
0   9  Elinor Satterfield
1   9         Marc Walker
2  81        Kianna Hoppe
3   7        Braulio Wolf
4  60          Chad Towne

Summary statistics:
                 ID
count  86254.000000
mean      49.976789
std       29.210527
min        0.000000
25%       25.000000
50%       50.000000
75%       75.000000
max      100.000000
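
With the data in a DataFrame, many common SQL commands have a direct pandas equivalent. The sketch below shows a few of them on a tiny DataFrame built from the five rows printed above, so it runs without the real file. The variable names and the choice of queries are illustrative assumptions, and the SQL in the comments uses a hypothetical table name `teams`.

```python
import pandas as pd

# The five rows shown by df.head() above, used as a small stand-in
sample = pd.DataFrame(
    {
        "ID": [9, 9, 81, 7, 60],
        "Name": [
            "Elinor Satterfield",
            "Marc Walker",
            "Kianna Hoppe",
            "Braulio Wolf",
            "Chad Towne",
        ],
    }
)

# SELECT * FROM teams WHERE ID > 50
high_ids = sample[sample["ID"] > 50]

# SELECT ID, COUNT(*) AS n FROM teams GROUP BY ID ORDER BY n DESC, ID ASC
counts = (
    sample.groupby("ID")
    .size()
    .reset_index(name="n")
    .sort_values(["n", "ID"], ascending=[False, True])
)

# SELECT Name FROM teams ORDER BY Name LIMIT 3
first_names = sample.sort_values("Name")["Name"].head(3).tolist()

print(high_ids)
print(counts)
print(first_names)
```

To run the same queries on the full data, replace `sample` with the `df` loaded from the Parquet file.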


# Part N: Clean-up
After we are done with our analysis, we need to clean up the disk space used by the temporary files we created. Computers have plenty of storage today, but this is a good habit to keep, as you never know what machine your application will run on!

In [13]:
# Clean up the extracted files
shutil.rmtree(temp_dir)
print(f"Temporary directory {temp_dir} cleaned up.")

Temporary directory /var/folders/b4/f3v6ww_s0_zcm_srr9ndn7jr0000gn/T/tmpgtn4b1it cleaned up.
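
An alternative to calling `shutil.rmtree` by hand is to let Python remove the directory for you. The sketch below uses `tempfile.TemporaryDirectory` as a context manager, which deletes the directory and its contents when the `with` block ends, even if an error is raised inside it. The file name `scratch.txt` is a made-up example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    # Anything written here lives only for the duration of the block
    scratch_file = os.path.join(scratch_dir, "scratch.txt")
    with open(scratch_file, "w") as handle:
        handle.write("temporary data")
    existed_inside = os.path.exists(scratch_file)

# The directory is gone once the block has exited
removed_after = not os.path.exists(scratch_dir)
print(existed_inside, removed_after)
```

The trade-off is that all work with the extracted files has to happen inside the `with` block, which suits a script better than a notebook split across several cells.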
