In [1]:
import pandas as pd
from dotenv import load_dotenv
from pathlib import Path
import os

In [2]:
# load environment variables from .env file for project
dotenv_path = Path('../.env')
load_dotenv(dotenv_path=dotenv_path)

True

In an earlier stage we converted the comma-separated values (CSV) files to Apache Parquet files. Parquet files make processing with `pandas` faster and more memory efficient. The processed Parquet files are in the `OUTPUT_DIRECTORY` given in the project's `.env` file.

In [3]:
data_directory = os.getenv("OUTPUT_DIRECTORY")
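
`os.getenv` returns `None` when the variable is not set, and later calls may then fail with a less obvious error. A minimal sketch of a stricter lookup follows; the helper name `require_directory` is hypothetical and not part of the project.

```python
import os
from pathlib import Path


def require_directory(var_name: str) -> Path:
    """Return the directory stored in an environment variable, or fail early.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(var_name)
    if not value:
        # Variable is missing or empty: stop here with a clear message.
        raise KeyError(f"Environment variable {var_name!r} is not set")
    path = Path(value)
    if not path.is_dir():
        raise NotADirectoryError(f"{var_name}={value!r} is not a directory")
    return path
```

With this helper, `data_directory = require_directory("OUTPUT_DIRECTORY")` would raise immediately if the `.env` file was not loaded.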

List all files in the `OUTPUT_DIRECTORY`.

In [4]:
os.listdir(data_directory)

['OmzetEansCoicopsPlus_202206_202308.parquet',
 'converted_csvs',
 'OmzetEansCoicopsLidl_202007_202202.parquet',
 'OutputEansCoicopsPlus_202107_202205.parquet',
 'OmzetEansCoicopsPlus_202107_202205.parquet',
 'OmzetEansCoicopsLidl_202203_202308.parquet',
 'KassabonPlus_va_202201.parquet',
 'OmzetEansCoicopsLidl_2018_202006.parquet']
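
The listing mixes Parquet files with a `converted_csvs` entry. As a small sketch, the names can be filtered on their suffix and on a substring; the function name `select_parquet` is hypothetical, and the input list is copied from the output above.

```python
from pathlib import Path


def select_parquet(names, contains=""):
    """Keep only names ending in .parquet that contain the given substring."""
    return sorted(
        name
        for name in names
        if Path(name).suffix == ".parquet" and contains in name
    )


# File names copied from the directory listing above.
listing = [
    "OmzetEansCoicopsPlus_202206_202308.parquet",
    "converted_csvs",
    "OmzetEansCoicopsLidl_202007_202202.parquet",
    "OutputEansCoicopsPlus_202107_202205.parquet",
    "OmzetEansCoicopsPlus_202107_202205.parquet",
    "OmzetEansCoicopsLidl_202203_202308.parquet",
    "KassabonPlus_va_202201.parquet",
    "OmzetEansCoicopsLidl_2018_202006.parquet",
]

lidl_files = select_parquet(listing, contains="Lidl")
print(lidl_files)
```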

Let's focus on the first LIDL file, `OmzetEansCoicopsLidl_2018_202006.parquet`.

In [6]:
lidl_df = pd.read_parquet(os.path.join(data_directory, 'OmzetEansCoicopsLidl_2018_202006.parquet'), engine="pyarrow")
lidl_df.head()

Unnamed: 0,bg_number,month,coicop_number,coicop_name,isba_number,isba_name,esba_number,esba_name,rep_id,ean_number,ean_name,revenue,amount
0,908515,201808,56110,Schoonmaak- en onderhoudsproducten,56110901,Schoonmaak- en onderhoudsproducten,83_20,"Wasch-/Putz-/Reinigungsmittel_Putz-, Reinigung...",3184145,62789.0,Badreiniger,210.869995,213.0
1,908515,201808,56110,Schoonmaak- en onderhoudsproducten,56110901,Schoonmaak- en onderhoudsproducten,83_20,"Wasch-/Putz-/Reinigungsmittel_Putz-, Reinigung...",3185902,77358.0,Galzeep,52178.578125,35037.0
2,908515,201808,56110,Schoonmaak- en onderhoudsproducten,56110901,Schoonmaak- en onderhoudsproducten,83_20,"Wasch-/Putz-/Reinigungsmittel_Putz-, Reinigung...",3182649,90982.0,Allesreiniger SK1,149383.46875,150896.0
3,908515,201808,56110,Schoonmaak- en onderhoudsproducten,56110901,Schoonmaak- en onderhoudsproducten,83_20,"Wasch-/Putz-/Reinigungsmittel_Putz-, Reinigung...",3186380,90986.0,Allesreiniger eco,0.99,1.0
4,908515,201808,56110,Schoonmaak- en onderhoudsproducten,56110901,Schoonmaak- en onderhoudsproducten,83_20,"Wasch-/Putz-/Reinigungsmittel_Putz-, Reinigung...",3192008,99134.0,Eco afwasmiddel SK3,28517.759766,28826.0


According to the COICOP definition, COICOP numbers should be 5 digits long:
- Two digits for the COICOP division, ranging from 01 until
- One digit for the COICOP group
- One digit for the COICOP class
- One digit for the COICOP subclass

For more information, see the PDF [here](https://unstats.un.org/unsd/classifications/unsdclassifications/COICOP_2018_-_pre-edited_white_cover_version_-_2018-12-26.pdf).
Check whether this holds for the COICOP numbers in the LIDL dataframe.
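
The five-digit structure listed above can be sketched as a small parser. This is an illustration only: the names `CoicopParts` and `split_coicop` are hypothetical, and the input `"01234"` is a made-up code, not one taken from the data.

```python
import re
from typing import NamedTuple, Optional


class CoicopParts(NamedTuple):
    division: str      # first two digits
    group: str         # third digit
    coicop_class: str  # fourth digit
    subclass: str      # fifth digit


_FIVE_DIGITS = re.compile(r"[0-9]{5}")


def split_coicop(code: str) -> Optional[CoicopParts]:
    """Split a 5-digit code into its parts; return None for any other shape."""
    if not _FIVE_DIGITS.fullmatch(code):
        return None
    return CoicopParts(code[:2], code[2], code[3], code[4])


print(split_coicop("01234"))  # made-up 5-digit code
print(split_coicop("0"))      # too short, so None
```

The next cell checks the code lengths in the actual dataframe.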

In [9]:
lidl_df.coicop_number.str.len().value_counts().reset_index()

Unnamed: 0,coicop_number,count
0,5,123416
1,6,32925
2,1,1367


Most COICOP numbers have 5 digits (123,416 rows), but 32,925 rows have 6 digits and 1,367 rows have only 1 digit. Let's check the COICOP numbers with one digit first:

In [12]:
lidl_df[lidl_df.coicop_number.str.len() == 1].head(10)

Unnamed: 0,bg_number,month,coicop_number,coicop_name,isba_number,isba_name,esba_number,esba_name,rep_id,ean_number,ean_name,revenue,amount
3618,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,17927206,327185.0,Fietsbel ping,17.9,18.0
3619,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,17927207,327187.0,Fietswielverlichting LED,73.010002,15.0
3620,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,17927208,327218.0,Zadelhoes,69.0,69.0
3621,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,17927209,344101.0,Fietskrat wit,26.969999,3.0
3622,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193227,8005.0,Elektrische push-bel,2.0,1.0
3623,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193231,8009.0,Spiraalslot,10.0,2.0
3624,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193232,8010.0,Kettingslot,120.0,24.0
3625,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193233,8017.0,Fietstas 35 L,14.99,1.0
3626,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193235,8029.0,Bagagedragerkussen,43.889999,11.0
3627,908515,202001,0,Onbekend,0,Onbekend,121_100,Sport Hartwaren_Fahrrad,3193239,8035.0,Fietszadel design,5.0,1.0


In the first 10 rows, the one-digit COICOP number is `0`, with the name `Onbekend` (Dutch for "unknown"). Check which other one-digit values occur:

In [14]:
lidl_df[lidl_df.coicop_number.str.len() == 1].coicop_number.value_counts()

coicop_number
0    1367
Name: count, dtype: int64
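
The length checks above can be combined into a single labelling step. The following is a sketch on toy data, assuming the codes are stored as strings as in `lidl_df`; the function name `label_coicop_length` and the label values are hypothetical.

```python
import pandas as pd


def label_coicop_length(codes: pd.Series) -> pd.Series:
    """Label each code as 'five_digits', 'unknown' (the value '0') or 'unexpected_length'."""
    lengths = codes.str.len()
    labels = pd.Series("unexpected_length", index=codes.index, dtype="object")
    labels[lengths == 5] = "five_digits"
    labels[codes == "0"] = "unknown"
    return labels


# Toy values standing in for lidl_df.coicop_number (not real data).
toy_codes = pd.Series(["56110", "123456", "0", "56110", "0"], name="coicop_number")
toy_labels = label_coicop_length(toy_codes)
print(toy_labels.value_counts())
```

On the real data, `label_coicop_length(lidl_df.coicop_number).value_counts()` would give one count per category in a single call.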