First of all, _`what does Annie want?`_

> She wants to understand her profits and margins. What tasks has she entrusted to us?

1. Efficiently ingest the relevant csv files into a suitable database.
2. Transform the data to calculate profits ($) and margin (%).
3. Create a report for Annie outlining:
* a. Top 10 products based on profit ($) and margin (%).
* b. Top 10 brands based on profit ($) and margin (%).
* c. Which brands / products she should drop as a wholesaler because they are losing money.
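
To make the second task concrete, here is a minimal sketch of the two metrics. It assumes profit is sales revenue minus purchase cost, and margin is profit as a percentage of revenue. The function names and sample figures are illustrative and are not taken from Annie's data.

```python
def profit(revenue: float, cost: float) -> float:
    """Profit in dollars: revenue earned minus cost paid."""
    return revenue - cost


def margin_pct(revenue: float, cost: float) -> float:
    """Margin as a percentage of revenue (assumed definition)."""
    if revenue == 0:
        # No sales means the ratio is undefined; this sketch reports 0.0.
        return 0.0
    return profit(revenue, cost) / revenue * 100


# Hypothetical product: $150 of sales against $100 of purchases.
print(profit(150.0, 100.0))                 # 50.0
print(round(margin_pct(150.0, 100.0), 2))   # 33.33
```

A negative result from either function is what would flag a product or brand as losing money.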

Now that we know what Annie wants, let's work to make it happen. First, we can see that the job has two parts:
1. _`Ingest, explore and transform the data`_: This covers the first two tasks.
2. _`Create a BI report for Annie to understand her data`_: Make the data accessible to Annie via a dashboard in a BI tool, so she can understand her data and how the business is running.

# 1. Ingest relevant files into a suitable database.

I would suggest a DB medallion architecture for Business Analytics.

`What is a medallion architecture?`:
Medallion architecture is a data storage pattern where the data passes through 3 "medals" before reaching our analytic goals. These "medals" are:
1. `Bronze`: We load the data just as we get it into a suitable database or data format. The goal is to store the data fast and keep it in its `raw` state, so the other two medals can be changed or rebuilt if needed. (CSVs in our file system for this case scenario.)
2. `Silver`: We fetch the data from the bronze medal, apply transformations and store the result in our database. This data is clean, and only the relevant information is stored in this medal. (Postgres database running locally.)
3. `Gold`: The gold medal is the goal of our analysis. It contains the data from the silver medal with aggregations applied, keeping only the most useful and "rich" data. These could be: predictions, flags, groupings, calculated new columns, etc. (Materialized views and aggregated data tables.)


Dev plan:
1. `Bronze`: Fetch the CSVs (delivered as ZIP archives) from the data source and store them in our file system. (The CSVs are our bronze medal.)
2. `Silver`: After EDA (Exploratory Data Analysis), we will store the relevant columns in our PostgreSQL database.
3. `Gold`: We will build calculated and aggregated columns, such as data joined across distinct tables and the margin and profit requested by Annie.
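
The three steps above can be sketched end to end on a toy dataset. This is only an illustration of the pattern: the column names, the cleaning rule (drop incomplete rows) and the in-memory CSV are assumptions, and plain Python structures stand in for the file system and the PostgreSQL database.

```python
import csv
from collections import defaultdict
from io import StringIO

# Bronze: raw data exactly as received (an in-memory CSV stands in for a file).
BRONZE_CSV = """brand,sales_dollars,purchase_dollars
A,100.0,60.0
A,50.0,40.0
B,30.0,45.0
B,,10.0
"""


def to_silver(raw_csv: str) -> list[dict]:
    """Silver: keep only complete rows and cast the numeric columns."""
    rows = []
    for row in csv.DictReader(StringIO(raw_csv)):
        if not row["sales_dollars"] or not row["purchase_dollars"]:
            continue  # drop incomplete records
        rows.append({
            "brand": row["brand"],
            "sales": float(row["sales_dollars"]),
            "cost": float(row["purchase_dollars"]),
        })
    return rows


def to_gold(silver_rows: list[dict]) -> dict[str, float]:
    """Gold: aggregate profit per brand."""
    profit_by_brand: dict[str, float] = defaultdict(float)
    for row in silver_rows:
        profit_by_brand[row["brand"]] += row["sales"] - row["cost"]
    return dict(profit_by_brand)


silver = to_silver(BRONZE_CSV)
gold = to_gold(silver)
print(gold)  # {'A': 50.0, 'B': -15.0}
```

Because the bronze data is never modified, the silver and gold steps can be rerun with different rules at any time.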

# 1. Bronze.

We will fetch the ZIP archives directly from the API and keep them in our local file system as ZIP files, so they do not use too much disk space.

* First we will create the adapter to make API calls to fetch the data.

In [1]:
import json
import requests
from typing import Any

class APIClient:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.headers: dict[str, Any] = {}

    def make_post_request(
        self,
        endpoint: str,
        **kwargs
    ) -> requests.Response:
        """
        Makes a POST request to the specified endpoint.

        Params:
            endpoint (str): The API endpoint to which the request is made.
            **kwargs: Additional keyword arguments to be sent as JSON in
            the request body.

        Returns:
            requests.Response: The raw response of the POST request.
        """
        url = f"{self.base_url}{endpoint}"
        print(f"Making a POST request to {url}", flush=True)
        # The keyword arguments are serialized to a JSON string for the body.
        response = requests.post(url, headers=self.headers, data=json.dumps(kwargs))
        return response

    def make_get_request(
        self,
        endpoint: str,
        **kwargs
    ) -> requests.Response:
        """
        Makes a GET request to the specified endpoint.

        Params:
            endpoint (str): The API endpoint to which the request is made.
            **kwargs: Additional keyword arguments to be sent as query
            parameters.

        Returns:
            requests.Response: The raw response of the GET request. Use
            `.content` to read the body as bytes.
        """
        url = f"{self.base_url}{endpoint}"
        print(f"Making a GET request to {url}", flush=True)
        # The keyword arguments are sent as URL query parameters.
        response = requests.get(url, headers=self.headers, params=kwargs)
        return response
            

* Now the script to store it locally

In [2]:
import os
from typing import Optional

def save_zip_into_directory(zip_bytes: bytes, output_dir: str = "extracted_files", zip_filename: str = "archive.zip") -> Optional[str]:
    """
    Saves the ZIP file to the file system.

    Args:
        zip_bytes (bytes): Raw ZIP file content (for example, `response.content`).
        output_dir (str): Directory where the ZIP will be saved (default: "extracted_files").
        zip_filename (str): Name to use for the saved ZIP file (default: "archive.zip").

    Returns:
        Optional[str]: Path to the saved ZIP file, or None if saving failed.
    """
    try:
        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)

        # Write the bytes unchanged so the archive stays in its raw state
        zip_path = os.path.join(output_dir, zip_filename)
        with open(zip_path, "wb") as f:
            f.write(zip_bytes)

        print(f"Successfully saved ZIP to: {zip_path}", flush=True)
        return zip_path

    except Exception as e:
        print(f"Error saving ZIP file: {e}", flush=True)
        return None
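
The function above writes whatever bytes it receives, so an HTML error page returned by the server would be saved with a `.zip` name. A minimal sketch of a guard, assuming we want to check the payload before saving; `looks_like_zip` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import zipfile
from io import BytesIO


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the bytes form a readable ZIP archive."""
    return zipfile.is_zipfile(BytesIO(payload))


# Build a small archive in memory to exercise the check.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "a,b\n1,2\n")

print(looks_like_zip(buffer.getvalue()))          # True
print(looks_like_zip(b"<html>Not Found</html>"))  # False
```

Such a check could run on `response.content` before the save step, skipping any file that fails it.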

* And then we will loop through the available ZIPs to store them

In [3]:
client = APIClient(base_url="https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/")

available_data = {
    "purchases": "PurchasesFINAL12312016csv.zip",
    "beginning_inventory": "BegInvFINAL12312016csv.zip",
    "purchase_prices": "2017PurchasePricesDeccsv.zip",
    "vendor_invoices": "VendorInvoices12312016csv.zip",
    "ending_inventory": "EndInvFINAL12312016csv.zip",
    "sales": "SalesFINAL12312016csv.zip"
}

for value in available_data.values():
    response = client.make_get_request(endpoint=value)
    save_zip_into_directory(
        zip_bytes = response.content, 
        output_dir="data", 
        zip_filename=value)

Making a GET request to https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/PurchasesFINAL12312016csv.zip
Successfully saved ZIP to: data\PurchasesFINAL12312016csv.zip
Making a GET request to https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/BegInvFINAL12312016csv.zip
Successfully saved ZIP to: data\BegInvFINAL12312016csv.zip
Making a GET request to https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/2017PurchasePricesDeccsv.zip
Successfully saved ZIP to: data\2017PurchasePricesDeccsv.zip
Making a GET request to https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/VendorInvoices12312016csv.zip
Successfully saved ZIP to: data\VendorInvoices12312016csv.zip
Making a GET request to https://www.pwc.com/us/en/careers/university_relations/data_analytics_cases_studies/EndInvFINAL12312016csv.zip
Successfully saved ZIP to: data\EndInvFINAL12312016csv.zip
Making a GET r

With the data stored as ZIP files, let's proceed with the `Exploratory Data Analysis`.
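
Keeping the archives zipped does not prevent us from inspecting them. The sketch below reads the first rows of a CSV directly from a ZIP without extracting it to disk. The member name, the column names and the UTF-8 encoding are assumptions for illustration; the real archives may differ.

```python
import csv
import zipfile
from io import BytesIO, TextIOWrapper
from itertools import islice


def read_csv_rows_from_zip(zip_source, member: str, limit: int = 5) -> list[dict]:
    """Read up to `limit` rows of a CSV stored inside a ZIP archive."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv can read it as text.
            text = TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(islice(csv.DictReader(text), limit))


# Hypothetical archive built in memory to stand in for a downloaded file.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sales.csv", "Brand,SalesDollars\n1004,16.49\n1005,10.99\n")

rows = read_csv_rows_from_zip(buffer, "sales.csv")
print(rows[0])  # {'Brand': '1004', 'SalesDollars': '16.49'}
```

`zip_source` can also be a path such as one returned by `save_zip_into_directory`, since `zipfile.ZipFile` accepts both paths and file-like objects.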

## [Go to EDA](eda.ipynb)