## `Exploratory Data Analysis`

We will explore the data to see what we need to store to achieve Annie's insight goals. As we could not reach her, we have to look into the data ourselves to see whether it fits our analytical purposes.

First, a function that helps us load a CSV file buffer from the ZIP archives.

In [2]:
# CONST
available_data = {
    "purchases": "PurchasesFINAL12312016csv.zip",
    "beginning_inventory": "BegInvFINAL12312016csv.zip",
    "purchase_prices": "2017PurchasePricesDeccsv.zip",
    "vendor_invoices": "VendorInvoices12312016csv.zip",
    "ending_inventory": "EndInvFINAL12312016csv.zip",
    "sales": "SalesFINAL12312016csv.zip"
}

In [2]:
from io import BytesIO
from typing import Optional
from zipfile import ZipFile

def get_first_csv_buffer_from_zip(file_path: str) -> Optional[BytesIO]:
    """
    Extracts the first CSV file from a ZIP archive at the given 
    file path and returns it as a BytesIO buffer.

    Args:
        file_path (str): Path to the ZIP file on disk.

    Returns:
        BytesIO: Buffer containing the CSV file's bytes, 
        or None if extraction failed.
    """
    try:
        with ZipFile(file_path, 'r') as zip_file:
            # Find the first CSV file in the ZIP
            for filename in zip_file.namelist():
                if filename.lower().endswith('.csv'):
                    with zip_file.open(filename) as csv_file:
                        csv_bytes = csv_file.read()
                        return BytesIO(csv_bytes)
            print("No CSV file found in ZIP archive.", flush=True)
            return None
    except Exception as e:
        print(f"Error extracting CSV from ZIP: {e}", flush=True)
        return None
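
A quick way to sanity-check this approach without the real archives is to build a tiny ZIP in memory. The sketch below is illustrative only: `first_csv_bytes` is a hypothetical variant of the helper above that accepts a path or a file-like object, and the archive contents are made up.

```python
from io import BytesIO
from typing import Optional
from zipfile import ZipFile


def first_csv_bytes(zip_source) -> Optional[BytesIO]:
    """Hypothetical variant of the helper: accepts a path or a file-like object."""
    with ZipFile(zip_source, "r") as zip_file:
        for filename in zip_file.namelist():
            # Case-insensitive match, as in the helper above
            if filename.lower().endswith(".csv"):
                return BytesIO(zip_file.read(filename))
    return None


# Build a small in-memory archive with one non-CSV and one CSV member (made-up data).
archive = BytesIO()
with ZipFile(archive, "w") as zf:
    zf.writestr("readme.txt", "not a csv")
    zf.writestr("Sample.CSV", "InventoryId,Quantity\n1_A_58,6\n")
archive.seek(0)

csv_buffer = first_csv_bytes(archive)
first_line = csv_buffer.readline().decode("utf-8").strip()
print(first_line)
```

The returned buffer can be passed to `pd.read_csv` in the same way as the buffer produced by `get_first_csv_buffer_from_zip`.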

## Now let's create the DataFrames to work with them.

Since we want to know profits and margins, let's look into the DataFrames that are relevant to those two metrics.

`Profit`: Difference between the total income and all expenses.
* `Gross Profit`: Difference between revenue and cost of goods sold (COGS).
* `Operating Profit`: Difference between Gross Profit and Operating Expenses.
* `Net Profit`: Difference between Operating Profit and all remaining expenses (taxes, interest, etc.).


`Margin`: Difference between a product's selling price and its cost, expressed as a percentage of the selling price.
* `Gross Margin`: Difference between revenue and the COGS divided by revenue.
* `Operating Margin`: Margin including operating expenses in addition to COGS.
* `Net Profit Margin`: Margin including all expenses.
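
As a worked example of the definitions above, here is a minimal sketch. The figures are made up for illustration, and the function names `gross_profit` and `margin` are our own, not part of the dataset or any library.

```python
def gross_profit(revenue: float, cogs: float) -> float:
    """Revenue minus cost of goods sold."""
    return revenue - cogs


def margin(profit: float, revenue: float) -> float:
    """Profit as a fraction of revenue; undefined when revenue is zero."""
    if revenue == 0:
        raise ValueError("margin is undefined for zero revenue")
    return profit / revenue


# Made-up figures, for illustration only.
revenue = 200.0
cogs = 150.0
operating_expenses = 30.0

gp = gross_profit(revenue, cogs)   # 50.0
op = gp - operating_expenses       # 20.0

print(f"Gross margin: {margin(gp, revenue):.1%}")      # Gross margin: 25.0%
print(f"Operating margin: {margin(op, revenue):.1%}")  # Operating margin: 10.0%
```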

We can see that we need the data from sales and from expenses, such as purchases and vendor invoices (the invoices will allow us to verify whether the vendors charged us the correct amount, and to check the balance).


As we need to calculate profits and margins, let's not use the data from 2017, as the sales data only covers the period ending 12-31-2016.

In [3]:
# CONST
available_data = {
    "purchases": "PurchasesFINAL12312016csv.zip",
    "beginning_inventory": "BegInvFINAL12312016csv.zip",
    # "purchase_prices": "2017PurchasePricesDeccsv.zip",
    "vendor_invoices": "VendorInvoices12312016csv.zip",
    "ending_inventory": "EndInvFINAL12312016csv.zip",
    "sales": "SalesFINAL12312016csv.zip"
}

## Purchases EDA.

In [4]:
import pandas as pd

csv_buffer = get_first_csv_buffer_from_zip(
    file_path=f"data/{available_data['purchases']}"
)
df_purchases = pd.read_csv(csv_buffer)
df_purchases.head()

Unnamed: 0,InventoryId,Store,Brand,Description,Size,VendorNumber,VendorName,PONumber,PODate,ReceivingDate,InvoiceDate,PayDate,PurchasePrice,Quantity,Dollars,Classification
0,69_MOUNTMEND_8412,69,8412,Tequila Ocho Plata Fresno,750mL,105,ALTAMAR BRANDS LLC,8124,2015-12-21,2016-01-02,2016-01-04,2016-02-16,35.71,6,214.26,1
1,30_CULCHETH_5255,30,5255,TGI Fridays Ultimte Mudslide,1.75L,4466,AMERICAN VINTAGE BEVERAGE,8137,2015-12-22,2016-01-01,2016-01-07,2016-02-21,9.35,4,37.4,1
2,34_PITMERDEN_5215,34,5215,TGI Fridays Long Island Iced,1.75L,4466,AMERICAN VINTAGE BEVERAGE,8137,2015-12-22,2016-01-02,2016-01-07,2016-02-21,9.41,5,47.05,1
3,1_HARDERSFIELD_5255,1,5255,TGI Fridays Ultimte Mudslide,1.75L,4466,AMERICAN VINTAGE BEVERAGE,8137,2015-12-22,2016-01-01,2016-01-07,2016-02-21,9.35,6,56.1,1
4,76_DONCASTER_2034,76,2034,Glendalough Double Barrel,750mL,388,ATLANTIC IMPORTING COMPANY,8169,2015-12-24,2016-01-02,2016-01-09,2016-02-16,21.32,5,106.6,1


Let's get the columns first.

In [9]:
df_purchases.columns

Index(['InventoryId', 'Store', 'Brand', 'Description', 'Size', 'VendorNumber',
       'VendorName', 'PONumber', 'PODate', 'ReceivingDate', 'InvoiceDate',
       'PayDate', 'PurchasePrice', 'Quantity', 'Dollars', 'Classification'],
      dtype='object')

Now let's see how many missing values we have. This matters most if columns such as Dollars, Quantity, or PurchasePrice have missing values.

In [5]:
df_purchases.isna().sum()

InventoryId       0
Store             0
Brand             0
Description       0
Size              3
VendorNumber      0
VendorName        0
PONumber          0
PODate            0
ReceivingDate     0
InvoiceDate       0
PayDate           0
PurchasePrice     0
Quantity          0
Dollars           0
Classification    0
dtype: int64
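
To focus the check on the columns that feed the profit calculations, the count can be restricted to those columns. This is a minimal sketch on a made-up frame that mirrors part of the purchases schema; `key_columns` is a name we introduce here.

```python
import pandas as pd

# Made-up rows mirroring part of the purchases schema; only Size has a gap.
toy = pd.DataFrame({
    "Size": ["750mL", None, "1.75L"],
    "PurchasePrice": [35.71, 9.35, 9.41],
    "Quantity": [6, 4, 5],
    "Dollars": [214.26, 37.40, 47.05],
})

key_columns = ["PurchasePrice", "Quantity", "Dollars"]

# Missing values per key column
missing_in_key = toy[key_columns].isna().sum()

# Rows where at least one key column is missing
rows_with_gaps = toy[toy[key_columns].isna().any(axis=1)]

print(missing_in_key)
print(len(rows_with_gaps), "rows with a missing key value")
```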

Let's see if there is more than one entry with the same InventoryId, and whether the data is the same for each entry:

In [11]:
df_purchases["InventoryId"].value_counts()

InventoryId
73_DONCASTER_8068     180
73_DONCASTER_3545     178
76_DONCASTER_1233     175
76_DONCASTER_5364     174
67_EANVERNESS_3545    171
                     ... 
76_DONCASTER_1074       1
23_ARBINGTON_1074       1
15_WANBORNE_1069        1
52_GRAYCOTT_759         1
74_PAENTMARWY_3907      1
Name: count, Length: 245907, dtype: int64

In [15]:
df_purchases.loc[df_purchases["InventoryId"] == "73_DONCASTER_8068"].head(3)

Unnamed: 0,InventoryId,Store,Brand,Description,Size,VendorNumber,VendorName,PONumber,PODate,ReceivingDate,InvoiceDate,PayDate,PurchasePrice,Quantity,Dollars,Classification
37089,73_DONCASTER_8068,73,8068,Absolut 80 Proof,1.75L,17035,PERNOD RICARD USA,8257,2015-12-30,2016-01-07,2016-01-13,2016-02-25,18.24,10,182.4,1
37364,73_DONCASTER_8068,73,8068,Absolut 80 Proof,1.75L,17035,PERNOD RICARD USA,8257,2015-12-30,2016-01-05,2016-01-13,2016-02-25,18.24,12,218.88,1
37740,73_DONCASTER_8068,73,8068,Absolut 80 Proof,1.75L,17035,PERNOD RICARD USA,8257,2015-12-30,2016-01-06,2016-01-13,2016-02-25,18.24,36,656.64,1


In [16]:
df_purchases.loc[df_purchases["InventoryId"] == "73_DONCASTER_8068", "PurchasePrice"].value_counts()

PurchasePrice
18.24    180
Name: count, dtype: int64

The price is the same for every entry of this InventoryId, so we can load a single entry per InventoryId for "purchases". Note that we only checked one InventoryId here.
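
The spot check above covers a single InventoryId. A minimal sketch of extending it to every InventoryId is shown below, using `groupby(...).nunique()` on a made-up frame; the IDs and prices are fictitious, and on the real data the same two lines would be applied to `df_purchases`.

```python
import pandas as pd

# Fictitious IDs and prices: the first ID is consistent, the second is not.
toy = pd.DataFrame({
    "InventoryId": ["1_TOWN_10"] * 3 + ["2_TOWN_20"] * 2,
    "PurchasePrice": [18.24, 18.24, 18.24, 9.35, 9.41],
})

# Number of distinct purchase prices seen for each InventoryId
prices_per_id = toy.groupby("InventoryId")["PurchasePrice"].nunique()

# IDs that were bought at more than one price
inconsistent_ids = prices_per_id[prices_per_id > 1].index.tolist()
print(inconsistent_ids)
```

If the resulting list is empty on the real data, keeping one row per InventoryId loses no price information. Otherwise an aggregation rule (for example a quantity-weighted average price) would have to be chosen.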

Let's also see if there are purchases with PurchasePrice == 0 (as these would distort our profits and margins).

In [9]:
print(df_purchases.loc[df_purchases["PurchasePrice"] == 0].value_counts().sum(), "Missing PurchasePrice")
print(df_purchases.shape[0], "Total entries")

153 Missing PurchasePrice
2372474 Total entries
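
A more direct way to count these rows is to sum a boolean mask. The sketch below uses made-up data. One caveat worth keeping in mind: `DataFrame.value_counts()` drops rows that contain a missing value by default, so counting through it can undercount when another column (such as Size) has gaps.

```python
import pandas as pd

# Made-up rows: two zero prices, one of them with a missing Size.
toy = pd.DataFrame({
    "PurchasePrice": [18.24, 0.0, 9.35, 0.0],
    "Size": ["1.75L", None, "750mL", "750mL"],
})

# Boolean mask: True where the price is exactly zero
zero_price_mask = toy["PurchasePrice"] == 0
zero_count = int(zero_price_mask.sum())
zero_share = zero_count / len(toy)

# For comparison: the value_counts-based count skips the row with a missing Size
count_via_value_counts = int(toy.loc[zero_price_mask].value_counts().sum())

print(zero_count, "rows with PurchasePrice == 0")
print(f"{zero_share:.1%} of all rows")
print(count_via_value_counts, "rows counted via value_counts()")
```

For scale, the 153 rows reported above out of 2,372,474 entries are roughly 0.006% of the purchases.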


There are some, which is good to keep in mind while making our charts.

Now let's look at the invoices so that we can later join and match the data.

In [5]:
df_vendor_invoices = pd.read_csv(
    f"data/{available_data['vendor_invoices']}", converters={"Approval": str}
)
df_vendor_invoices.head()

Unnamed: 0,VendorNumber,VendorName,InvoiceDate,PONumber,PODate,PayDate,Quantity,Dollars,Freight,Approval
0,105,ALTAMAR BRANDS LLC,2016-01-04,8124,2015-12-21,2016-02-16,6,214.26,3.47,
1,4466,AMERICAN VINTAGE BEVERAGE,2016-01-07,8137,2015-12-22,2016-02-21,15,140.55,8.57,
2,388,ATLANTIC IMPORTING COMPANY,2016-01-09,8169,2015-12-24,2016-02-16,5,106.6,4.61,
3,480,BACARDI USA INC,2016-01-12,8106,2015-12-20,2016-02-05,10100,137483.78,2935.2,
4,516,BANFI PRODUCTS CORP,2016-01-07,8170,2015-12-24,2016-02-12,1935,15527.25,429.2,


In [54]:
print(df_vendor_invoices.shape[0])

5543


In [58]:
df_vendor_invoices.isna().sum()

VendorNumber    0
VendorName      0
InvoiceDate     0
PONumber        0
PODate          0
PayDate         0
Quantity        0
Dollars         0
Freight         0
Approval        0
dtype: int64

It seems that the invoices are not itemized but grouped by date and vendor. As we cannot match each product price with what is being charged, let's make a general verification.

In [55]:
has_same_vendors = len(df_purchases["VendorName"].unique()) == len(df_vendor_invoices["VendorName"].unique())
has_same_quantities = df_purchases["Quantity"].sum() == df_vendor_invoices["Quantity"].sum()
has_same_amounts = df_purchases["Dollars"].sum() == df_vendor_invoices["Dollars"].sum()

print(has_same_vendors, ": For same vendors")
print(has_same_quantities, ": For same quantities")
print(has_same_amounts, ": For same amounts")

True : For same vendors
True : For same quantities
True : For same amounts


This is a good indicator, but not a source of truth: if we want to know which products have the best margins, we need to know the exact price of each product, and for that we would have to match every purchased product with its line on the invoice. For practical purposes, let's assume the data is as reliable as it can be.
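
A finer-grained check than the global totals, though still short of per-product matching, is to compare totals per purchase order. The sketch below assumes that `PONumber` identifies one invoice, which we have not verified on the full data. The rows are taken from the `head()` outputs shown above, and the 0.01 tolerance is our own choice.

```python
import numpy as np
import pandas as pd

# Rows from the purchases head() output shown above
purchases = pd.DataFrame({
    "PONumber": [8124, 8137, 8137, 8137, 8169],
    "Quantity": [6, 4, 5, 6, 5],
    "Dollars": [214.26, 37.40, 47.05, 56.10, 106.60],
})

# Matching rows from the vendor invoices head() output
invoices = pd.DataFrame({
    "PONumber": [8124, 8137, 8169],
    "Quantity": [6, 15, 5],
    "Dollars": [214.26, 140.55, 106.60],
})

# Aggregate purchase lines to one row per purchase order
po_totals = purchases.groupby("PONumber", as_index=False)[["Quantity", "Dollars"]].sum()

# Outer merge keeps purchase orders that appear on only one side
merged = po_totals.merge(
    invoices,
    on="PONumber",
    how="outer",
    suffixes=("_purchases", "_invoice"),
    indicator=True,
)

merged["quantity_ok"] = merged["Quantity_purchases"] == merged["Quantity_invoice"]
# Compare dollar amounts with a tolerance instead of exact float equality
merged["dollars_ok"] = np.isclose(
    merged["Dollars_purchases"], merged["Dollars_invoice"], atol=0.01
)

mismatches = merged[
    (merged["_merge"] != "both") | ~merged["quantity_ok"] | ~merged["dollars_ok"]
]
print(len(mismatches), "purchase orders with mismatches")
```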

Let's check the beginning inventory to see if there is any data we could use. Otherwise, we can continue with the sales DataFrame.

In [6]:
import pandas as pd

csv_buffer = get_first_csv_buffer_from_zip(
    file_path=f"data/{available_data['beginning_inventory']}"
)
df_beg_inventory = pd.read_csv(csv_buffer)
df_beg_inventory.head()

Unnamed: 0,InventoryId,Store,City,Brand,Description,Size,onHand,Price,startDate
0,1_HARDERSFIELD_58,1,HARDERSFIELD,58,Gekkeikan Black & Gold Sake,750mL,8,12.99,2016-01-01
1,1_HARDERSFIELD_60,1,HARDERSFIELD,60,Canadian Club 1858 VAP,750mL,7,10.99,2016-01-01
2,1_HARDERSFIELD_62,1,HARDERSFIELD,62,Herradura Silver Tequila,750mL,6,36.99,2016-01-01
3,1_HARDERSFIELD_63,1,HARDERSFIELD,63,Herradura Reposado Tequila,750mL,3,38.99,2016-01-01
4,1_HARDERSFIELD_72,1,HARDERSFIELD,72,No. 3 London Dry Gin,750mL,6,34.99,2016-01-01


It does not seem so, so let's ignore it for now.

In [7]:
import pandas as pd

csv_buffer = get_first_csv_buffer_from_zip(
    file_path=f"data/{available_data['sales']}"
)
df_sales = pd.read_csv(csv_buffer)
df_sales.head()

Unnamed: 0,InventoryId,Store,Brand,Description,Size,SalesQuantity,SalesDollars,SalesPrice,SalesDate,Volume,Classification,ExciseTax,VendorNo,VendorName
0,1_HARDERSFIELD_1004,1,1004,Jim Beam w/2 Rocks Glasses,750mL,1,16.49,16.49,2016-01-01,750.0,1,0.79,12546,JIM BEAM BRANDS COMPANY
1,1_HARDERSFIELD_1004,1,1004,Jim Beam w/2 Rocks Glasses,750mL,2,32.98,16.49,2016-01-02,750.0,1,1.57,12546,JIM BEAM BRANDS COMPANY
2,1_HARDERSFIELD_1004,1,1004,Jim Beam w/2 Rocks Glasses,750mL,1,16.49,16.49,2016-01-03,750.0,1,0.79,12546,JIM BEAM BRANDS COMPANY
3,1_HARDERSFIELD_1004,1,1004,Jim Beam w/2 Rocks Glasses,750mL,1,14.49,14.49,2016-01-08,750.0,1,0.79,12546,JIM BEAM BRANDS COMPANY
4,1_HARDERSFIELD_1005,1,1005,Maker's Mark Combo Pack,375mL 2 Pk,2,69.98,34.99,2016-01-09,375.0,1,0.79,12546,JIM BEAM BRANDS COMPANY


In [61]:
# Missing values
df_sales.isna().sum()

InventoryId       0
Store             0
Brand             0
Description       0
Size              0
SalesQuantity     0
SalesDollars      0
SalesPrice        0
SalesDate         0
Volume            0
Classification    0
ExciseTax         0
VendorNo          0
VendorName        0
dtype: int64

In [8]:
df_sales.columns

Index(['InventoryId', 'Store', 'Brand', 'Description', 'Size', 'SalesQuantity',
       'SalesDollars', 'SalesPrice', 'SalesDate', 'Volume', 'Classification',
       'ExciseTax', 'VendorNo', 'VendorName'],
      dtype='object')

From this shallow examination, the following DataFrames and columns look relevant for loading the data and building our report:

* Purchases: ['InventoryId', 'PurchasePrice']
* Sales: ['InventoryId', 'Store', 'Brand', 'SalesQuantity', 'SalesDollars', 'SalesPrice', 'SalesDate', 'Volume', 'ExciseTax', 'VendorName']
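
To illustrate how these columns could be combined, here is a minimal sketch on made-up data. It assumes one purchase price per InventoryId and approximates COGS as `SalesQuantity * PurchasePrice`, ignoring freight and excise tax. The names `unit_costs`, `cogs`, and `summary` are ours.

```python
import pandas as pd

# Made-up purchases: repeated rows per InventoryId with a constant price
purchases = pd.DataFrame({
    "InventoryId": ["1_TOWN_10", "1_TOWN_10", "1_TOWN_20"],
    "PurchasePrice": [10.0, 10.0, 20.0],
})

# Made-up sales rows
sales = pd.DataFrame({
    "InventoryId": ["1_TOWN_10", "1_TOWN_10", "1_TOWN_20"],
    "SalesQuantity": [1, 2, 1],
    "SalesDollars": [16.0, 32.0, 25.0],
})

# Keep one purchase price per InventoryId
unit_costs = purchases.drop_duplicates(subset="InventoryId")[["InventoryId", "PurchasePrice"]]

# validate raises if unit_costs has duplicate InventoryId values
merged = sales.merge(unit_costs, on="InventoryId", how="left", validate="many_to_one")
merged["cogs"] = merged["SalesQuantity"] * merged["PurchasePrice"]

# Gross profit and gross margin per InventoryId
summary = merged.groupby("InventoryId")[["SalesDollars", "cogs"]].sum()
summary["gross_profit"] = summary["SalesDollars"] - summary["cogs"]
summary["gross_margin"] = summary["gross_profit"] / summary["SalesDollars"]
print(summary)
```

Sales rows without a matching purchase price end up with a missing `cogs` after the left merge, so they should be inspected before any report is built on top of the summary.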

## [Go to ETL](etl.ipynb)
