# Sales and Forecast Data Analysis Project
Author: Sofia Shchetinina

## 1. Project Overview

This project involves the cleaning, processing, and analysis of sales and forecast data from different regions (Americas, EMEA, Asia).
The goal is to load, transform and consolidate this data into a unified database for easier querying while ensuring data quality and integrity.

The data is sourced from multiple CSV and Excel files provided by business teams, so, data inconsistencies are expected. The final output is stored in an SQLite database, ready for further analysis, and used for creation of interactive dashboard in Tableau.


## 2. Exploratory Data Analysis

In [639]:
import pandas as pd

import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')

In [640]:
# Load the data from csv sources
americas_data = pd.read_csv('data/americas.csv')
emea_data = pd.read_csv('data/emea.csv')
forecast_data = pd.read_csv('data/forecast.csv')

In [641]:
# Load the data from Excel ensuring the possibility of adding new sheets
asia_sheets_dict = pd.read_excel('data/asia.xlsx', sheet_name=None)

# Standardize column names for different sheets
def standardize_columns(df):
    df.columns = df.columns.str.lower()  # Convert all column names to lowercase
    return df
# Apply function to all sheets
asia_sheets_dict = {sheet_name: standardize_columns(df) for sheet_name, df in asia_sheets_dict.items()}
# Combine the sheets into one dataframe
asia_data = pd.concat(asia_sheets_dict.values(), ignore_index=True)

In [642]:
# Standardize columns for other dataframes
americas_data = standardize_columns(americas_data)
emea_data = standardize_columns(emea_data)
forecast_data = standardize_columns(forecast_data)

Quick overview of the data in all dataframes

In [643]:
print('Americas Data:')
print(americas_data.head(), '\n')

print('EMEA Data:')
print(emea_data.head(), '\n')

print('Asia Data:')
print(asia_data.head(), '\n')

print('Forecast Data:')
print(forecast_data.head(), '\n')

Americas Data:
   unnamed: 0    surcharge  material_nbr  period  \
0           0  5053.311857    11947192.0    2020   
1           1  3744.609416    12502640.0    2022   
2           2  2346.913894    11947192.0    2021   
3           3  3507.780298    12515444.0    2021   
4           4  1515.461933           NaN    2020   

   commercial_sales_territory_code  sales_tcfxact      net_qty  \
0                            923.0     866.632718   631.691689   
1                            923.0  -30561.190505    24.500444   
2                            923.0   27128.528866  1198.988892   
3                            921.0   13872.159979   314.982145   
4                            922.0    3199.227007  1587.437073   

  commercial_team commercial_subregion_desc  company_code  ...  \
0            CT13                  UNDEF CA        1318.0  ...   
1            CT13                  UNDEF CA        1318.0  ...   
2            CT13                  UNDEF CA        1318.0  ...   
3          

- Irrelevant columns detected: 'unnamed: 0' in americas_data, emea_data, forecast_data; 'sales_tcfxact' in americas_data
- Format inconsistencies in 'period' column: only year / year and month. Forecast is made for the year
- Potential naming inconsistencies in 'commercial_country_name'

## 3. Data Cleaning

In [644]:
# Drop extra columns
americas_data = americas_data.drop(columns=['unnamed: 0', 'sales_tcfxact'], errors='ignore')
emea_data = emea_data.drop(columns=['unnamed: 0', 'sales_tc_fxact'], errors='ignore')
asia_data = asia_data.drop(columns=['unnamed: 0'], errors='ignore')
forecast_data = forecast_data.drop(columns=['unnamed: 0'], errors='ignore')

In [645]:
# Check for duplicates
print('Duplicates in Americas data:', americas_data.duplicated().sum())
print('Duplicates in EMEA data:', emea_data.duplicated().sum())
print('Duplicates in Asia data:', asia_data.duplicated().sum())
print('Duplicates in Forecast data:', forecast_data.duplicated().sum())

Duplicates in Americas data: 0
Duplicates in EMEA data: 0
Duplicates in Asia data: 0
Duplicates in Forecast data: 0


In [646]:
# Add 'region' column and fill it with the name of the file
americas_data['region'] = 'Americas'
emea_data['region'] = 'EMEA'
asia_data['region'] = 'Asia'

In [647]:
# Check the data types and missing values
americas_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3717 entries, 0 to 3716
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   surcharge                        3717 non-null   float64
 1   material_nbr                     3613 non-null   float64
 2   period                           3717 non-null   int64  
 3   commercial_sales_territory_code  3717 non-null   float64
 4   net_qty                          3717 non-null   float64
 5   commercial_team                  3717 non-null   object 
 6   commercial_subregion_desc        3717 non-null   object 
 7   company_code                     3717 non-null   float64
 8   commercial_district_description  3717 non-null   object 
 9   commercial_area_description      3717 non-null   object 
 10  commercial_country_name          3717 non-null   object 
 11  commercial_area_code             3717 non-null   object 
 12  commercial_district 

Missing values detected in 'material_nbr', column 'period' is not in date format, which is not optimal

In [648]:
emea_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5707 entries, 0 to 5706
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   commercial_team                  5707 non-null   object 
 1   commercial_country_name          5707 non-null   object 
 2   period                           5707 non-null   float64
 3   commercial_subregion_desc        5707 non-null   object 
 4   commercial_district_description  5707 non-null   object 
 5   net_qty                          5707 non-null   float64
 6   commercial_district              5707 non-null   object 
 7   region_description               5707 non-null   object 
 8   commercial_sales_territory_code  5707 non-null   float64
 9   crop                             5707 non-null   object 
 10  commercial_area_description      5707 non-null   object 
 11  commercial_area_code             5707 non-null   object 
 12  commercial_team_desc

Missing values in 'material_nbr', column 'period' is not in date format

In [649]:
asia_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7403 entries, 0 to 7402
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   commercial_team                  7403 non-null   object 
 1   commercial_area_code             7403 non-null   object 
 2   surcharge                        7403 non-null   float64
 3   region_description               7403 non-null   object 
 4   commercial_area_description      7403 non-null   object 
 5   commercial_sales_territory_code  7403 non-null   int64  
 6   period                           7403 non-null   float64
 7   commercial_subregion_desc        7403 non-null   object 
 8   net_qty                          7403 non-null   float64
 9   commercial_district              7403 non-null   object 
 10  commercial_team_description      7403 non-null   object 
 11  commercial_district_description  7403 non-null   object 
 12  crop                

Missing values in 'material_nbr', column 'period' is not in date format

In [650]:
forecast_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128526 entries, 0 to 128525
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   material_number     128526 non-null  int64  
 1   commercial_segment  106590 non-null  object 
 2   cmrcl_wrld_area_cd  128526 non-null  object 
 3   cmrcl_rgn_nm        128526 non-null  object 
 4   cmrcl_cntry_cd      128526 non-null  object 
 5   cmrcl_cntry_dsc     128526 non-null  object 
 6   cmrcl_subrgn_cd     128526 non-null  object 
 7   cmrcl_subrgn_nm     128526 non-null  object 
 8   sku_cd              107940 non-null  object 
 9   year                128526 non-null  int64  
 10  forecast_val        128526 non-null  float64
dtypes: float64(1), int64(2), object(8)
memory usage: 10.8+ MB


Missing values in 'commercial_segment', 'sku_cd' detected. These columns don't match with sales data, therefore they can't be used in the analysis.

In [651]:
# Drop extra columns with missing values
forecast_data = forecast_data.drop(columns=['commercial_segment', 'sku_cd'], errors='ignore')

In [652]:
# Evaluate the share of missing values
americas_missing = americas_data['material_nbr'].isnull().mean() * 100
emea_missing = emea_data['material_nbr'].isnull().mean() * 100
asia_missing = asia_data['material_nbr'].isnull().mean() * 100

print(f"Missing 'material_nbr' in Americas: {americas_missing:.2f}%")
print(f"Missing 'material_nbr' in EMEA: {emea_missing:.2f}%")
print(f"Missing 'material_nbr' in Asia: {asia_missing:.2f}%")

Missing 'material_nbr' in Americas: 2.80%
Missing 'material_nbr' in EMEA: 6.12%
Missing 'material_nbr' in Asia: 1.99%


This column is essential for the correct join of sales and forecast, additional checks are needed to figure out if there are any regularities about these rows.

In [653]:
# Look at the rows where material_number is missing for EMEA region
missing_material_rows = emea_data[emea_data['material_nbr'].isnull()]
missing_material_rows.head(20)

Unnamed: 0,commercial_team,commercial_country_name,period,commercial_subregion_desc,commercial_district_description,net_qty,commercial_district,region_description,commercial_sales_territory_code,crop,...,commercial_team_description,surcharge,gross_sales,material_nbr,region_code,company_code,base_sales,discount,net_sales,region
0,CT7,Italy,2020.03,ITALY,CD-Italy,149.730967,CD-IT,EMEA,602.0,"BEAN, GARDEN",...,ITALY,1331.547643,17698.469477,,1,3515.0,16366.921834,1427.180248,16271.289229,EMEA
1,CT7,Italy,2020.03,ITALY,CD-Italy,628.91372,CD-IT,EMEA,602.0,OTHER VEGETABLE SEED,...,ITALY,2937.590259,18141.307759,,1,3515.0,15203.7175,2564.028564,15577.279194,EMEA
2,CT52,Germany,2021.06,NW OPEN FIELD,CD-Germany-Openf,1576.427147,CD-DE-OPEN,EMEA,671.0,OTHER VEGETABLE SEED,...,NW OPEN FIELD,3107.940426,32893.538171,,1,3385.0,29785.597744,4587.631824,28305.906346,EMEA
3,CT52,Germany,2022.12,NW OPEN FIELD,CD-Germany-Openf,1151.868825,CD-DE-OPEN,EMEA,671.0,OTHER VEGETABLE SEED,...,NW OPEN FIELD,2498.692982,20117.33427,,1,3385.0,17618.641288,3841.863792,16275.470478,EMEA
4,CT52,Netherlands,2021.07,NW OPEN FIELD,CD-Netherland-Openf,1076.568791,CD-NL-OPEN,EMEA,669.0,CARROT,...,NW OPEN FIELD,1789.255107,18492.63175,,1,3605.0,16703.376644,650.319978,17842.311772,EMEA
5,CT59,Iran,2022.1,IRAN,CD-Iran,2046.73795,CD-IR,EMEA,688.0,CARROT,...,IRAN,4957.499154,29348.580023,,1,3605.0,24391.080869,4241.998726,25106.581297,EMEA
6,CT59,Iran,2022.03,IRAN,CD-Iran,683.47052,CD-IR,EMEA,688.0,SQUASH,...,IRAN,3426.082475,22345.65978,,1,3605.0,18919.577305,2672.224993,19673.434787,EMEA
7,CT53,Kuwait,2020.02,MIDDLE EAST,CD-Kuwait,814.712329,CD-KW,EMEA,656.0,TOMATO,...,MIDDLE EAST & EGYPT,2066.209757,20373.990754,,1,3605.0,18307.780997,3685.186973,16688.803781,EMEA
8,CT51,Netherlands,2021.09,EMEA GLASS,CD-Netherlands-Glass,208.086718,CD-NL-GLAS,EMEA,663.0,TOMATO,...,EMEA GLASS,4787.052491,33142.540511,,1,3605.0,28355.488021,4121.115022,29021.425489,EMEA
9,CT52,Germany,2022.03,NW OPEN FIELD,CD-Germany-Openf,1775.678647,CD-DE-OPEN,EMEA,671.0,OTHER VEGETABLE SEED,...,NW OPEN FIELD,1589.365197,15238.269824,,1,3385.0,13648.904627,2703.065348,12535.204476,EMEA


Rows with missing values seem random and probably are caused by human error. I'll remove them because it'll be more robust to keep the column in integer format, and do not overcomplicate it with placeholders.

In [654]:
# Remove rows with missing material_number in all regions
americas_data = americas_data.dropna(subset=['material_nbr'])
emea_data = emea_data.dropna(subset=['material_nbr'])
asia_data = asia_data.dropna(subset=['material_nbr'])

In [655]:
# Convert material_number to integer after dropping missing rows
americas_data['material_nbr'] = americas_data['material_nbr'].astype(int)
emea_data['material_nbr'] = emea_data['material_nbr'].astype(int)
asia_data['material_nbr'] = asia_data['material_nbr'].astype(int)

Check the data for consistency

In [656]:
# Check the date formats in 'Period'
print('Unique period values in Americas Data:')
print(americas_data['period'].unique())

print('\nUnique period values in EMEA Data:')
print(emea_data['period'].unique())

print('\nUnique period values in Asia Data:')
print(asia_data['period'].unique())

print('\nUnique period values in Forecast Data:')
print(forecast_data['year'].unique())

Unique period values in Americas Data:
[2020 2022 2021]

Unique period values in EMEA Data:
[2021.03 2021.04 2022.07 2022.09 2021.08 2021.1  2021.01 2020.03 2021.02
 2020.01 2022.05 2022.04 2020.04 2020.02 2020.12 2020.11 2020.05 2021.07
 2022.02 2020.06 2020.08 2021.12 2020.1  2021.09 2022.12 2022.08 2021.06
 2020.09 2022.03 2022.06 2022.11 2020.07 2022.01 2021.11 2021.05 2022.1 ]

Unique period values in Asia Data:
[2020.1  2020.01 2020.04 2020.05 2020.06 2020.11 2020.02 2020.03 2020.12
 2020.07 2020.08 2020.09 2021.02 2021.04 2021.06 2021.12 2021.11 2021.03
 2021.08 2021.01 2021.1  2021.05 2021.07 2021.09 2022.1  2022.12 2022.01
 2022.04 2022.08 2022.03 2022.07 2022.09 2022.06 2022.02 2022.05 2022.11]

Unique period values in Forecast Data:
[2022 2020 2021]


In [657]:
# Fix date format
americas_data['period'] = americas_data['period'].astype(str) + '.01'
emea_data['period'] = emea_data['period'].apply(lambda x: f"{str(x).split('.')[0]}.{str(x).split('.')[1].zfill(2)}")
asia_data['period'] = asia_data['period'].apply(lambda x: f"{str(x).split('.')[0]}.{str(x).split('.')[1].zfill(2)}")
forecast_data['year'] = forecast_data['year'].astype(str) + '.01'

# Check the date formats in 'Period'
print('Unique period values in Americas Data:')
print(americas_data['period'].unique())

print('\nUnique period values in EMEA Data:')
print(emea_data['period'].unique())

print('\nUnique period values in Asia Data:')
print(asia_data['period'].unique())

print('\nUnique period values in Forecast Data:')
print(forecast_data['year'].unique())

Unique period values in Americas Data:
['2020.01' '2022.01' '2021.01']

Unique period values in EMEA Data:
['2021.03' '2021.04' '2022.07' '2022.09' '2021.08' '2021.01' '2020.03'
 '2021.02' '2020.01' '2022.05' '2022.04' '2020.04' '2020.02' '2020.12'
 '2020.11' '2020.05' '2021.07' '2022.02' '2020.06' '2020.08' '2021.12'
 '2021.09' '2022.12' '2022.08' '2021.06' '2020.09' '2022.03' '2022.06'
 '2022.11' '2020.07' '2022.01' '2021.11' '2021.05']

Unique period values in Asia Data:
['2020.01' '2020.04' '2020.05' '2020.06' '2020.11' '2020.02' '2020.03'
 '2020.12' '2020.07' '2020.08' '2020.09' '2021.02' '2021.04' '2021.06'
 '2021.12' '2021.11' '2021.03' '2021.08' '2021.01' '2021.05' '2021.07'
 '2021.09' '2022.01' '2022.12' '2022.04' '2022.08' '2022.03' '2022.07'
 '2022.09' '2022.06' '2022.02' '2022.05' '2022.11']

Unique period values in Forecast Data:
['2022.01' '2020.01' '2021.01']


In [658]:
# Convert to datetime format and check
americas_data['period'] = pd.to_datetime(americas_data['period'], format='%Y.%m')
print('Americas data:', americas_data['period'])

emea_data['period'] = pd.to_datetime(emea_data['period'], format='%Y.%m')
print('EMEA data:', emea_data['period'])

asia_data['period'] = pd.to_datetime(asia_data['period'], format='%Y.%m')
print('Asia data:', asia_data['period'])

forecast_data['year'] = pd.to_datetime(forecast_data['year'], format='%Y.%m')
print('Forecast data:', forecast_data['year'])

Americas data: 0      2020-01-01
1      2022-01-01
2      2021-01-01
3      2021-01-01
21     2022-01-01
          ...    
3712   2021-01-01
3713   2021-01-01
3714   2021-01-01
3715   2022-01-01
3716   2021-01-01
Name: period, Length: 3613, dtype: datetime64[ns]
EMEA data: 102    2021-03-01
119    2021-04-01
206    2022-07-01
291    2022-09-01
353    2021-08-01
          ...    
5702   2022-01-01
5703   2020-03-01
5704   2021-02-01
5705   2020-12-01
5706   2021-01-01
Name: period, Length: 5358, dtype: datetime64[ns]
Asia data: 0      2020-01-01
3      2020-01-01
4      2020-04-01
5      2020-05-01
6      2020-06-01
          ...    
7398   2022-09-01
7399   2022-09-01
7400   2022-07-01
7401   2022-09-01
7402   2022-02-01
Name: period, Length: 7256, dtype: datetime64[ns]
Forecast data: 0        2022-01-01
1        2022-01-01
2        2022-01-01
3        2022-01-01
4        2022-01-01
            ...    
128521   2021-01-01
128522   2021-01-01
128523   2021-01-01
128524   2021-01-01
1285

In [659]:
# Check the country names for consistency
print('Unique country names in Americas Data:')
print(sorted(americas_data['commercial_country_name'].unique()))

print('\nUnique country names in EMEA Data:')
print(sorted(emea_data['commercial_country_name'].unique()))

print('\nUnique country names in Asia Data:')
print(sorted(asia_data['commercial_country_name'].unique()))

print('\nUnique country names in Forecast Data:')
print(sorted(forecast_data['cmrcl_cntry_dsc'].unique()))

Unique country names in Americas Data:
['Argentina', 'Bolivia', 'Brazil', 'Canada', 'Canadá', 'Chile', 'Colombia', 'Costa Rica', 'Dominican Rep.', 'Ecuador', 'Ecuator', 'El Salvador', 'Guatemala', 'Honduras', 'Jamaica', 'Japan', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay', 'Peru', 'Trinidad,Tobago', 'USA', 'Uruguay', 'Venezuela']

Unique country names in EMEA Data:
['Albania', 'Algeria', 'Armenia', 'Azerbaijan', 'Bahrain', 'Belarus', 'Benin', 'Bosnia-Herz.', 'Bulgaria', 'Burundi', 'Congo Democr. R', 'Croatia', 'Cyprus', 'Egypt', 'Ethiopia', 'France', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Hungary', 'Iran', 'Iraq', 'Israel', 'Italy', 'Ivory Coast', 'Jordan', 'Kazakhstan', 'Kenya', 'Kuwait', 'Lebanon', 'Libya', 'Moldova,Rep. of', 'Morocco', 'Netherlands', 'Niger', 'Nigeria', 'North Macedonia', 'Oman', 'Palestinian Ter', 'Poland', 'Portugal', 'Qatar', 'Rep. of Belarus', 'Romania', 'Russian Fed.', 'Saudi Arabia', 'Senegal', 'Serbia', 'Slovakia', 'Slovenia', 'South Africa', 'Spain', 'Swe

In [660]:
# Map differently spelled values
country_name_mapping = {
    'Canadá': 'Canada',
    'México': 'Mexico',
    'Brasil': 'Brazil',
    'UK': 'United Kingdom',
    'U.S.A': 'United States',
    'Estados Unidos': 'United States',
    'España': 'Spain',
    'Türkiye': 'Turkey'
}

americas_data['commercial_country_name'] = americas_data['commercial_country_name'].replace(country_name_mapping)
emea_data['commercial_country_name'] = emea_data['commercial_country_name'].replace(country_name_mapping)
asia_data['commercial_country_name'] = asia_data['commercial_country_name'].replace(country_name_mapping)
forecast_data['cmrcl_cntry_dsc'] = forecast_data['cmrcl_cntry_dsc'].replace(country_name_mapping)

In [661]:
# Check crop field for consistency
print('Unique crop names in Americas Data:')
print(sorted(americas_data['crop'].unique()))

print('\nUnique crop names in EMEA Data:')
print(sorted(emea_data['crop'].unique()))

print('\nUnique crop names in Asia Data:')
print(sorted(asia_data['crop'].unique()))

Unique crop names in Americas Data:
['BEAN, DRY', 'BEAN, GARDEN', 'BROCCOLI', 'CABBAGE', 'CARROT', 'CAULIFLOWER', 'CUCUMBER', 'EGGPLANT', 'LETTUCE', 'MELON', 'ONION', 'OTHER VEGETABLE SEED', 'PEPPER', 'SPINACH', 'SQUASH', 'SWEET CORN', 'TOMATO', 'WATERMELON']

Unique crop names in EMEA Data:
['BEAN, GARDEN', 'BROCCOLI', 'CABBAGE', 'CARROT', 'CAULIFLOWER', 'CUCUMBER', 'EGGPLANT', 'FENNEL', 'LEEK', 'LETTUCE', 'MELON', 'ONION', 'OTHER VEGETABLE SEED', 'PEA', 'PEPPER', 'SPINACH', 'SQUASH', 'SWEET CORN', 'TOMATO', 'WATERMELON']

Unique crop names in Asia Data:
['BEAN, GARDEN', 'BROCCOLI', 'CABBAGE', 'CARROT', 'CAULIFLOWER', 'CUCUMBER', 'EGGPLANT', 'GOURD', 'LEEK', 'LETTUCE', 'MELON', 'OKRA', 'ONION', 'OTHER VEGETABLE SEED', 'PEA', 'PEPPER', 'RADISH', 'SPINACH', 'SQUASH', 'SWEET CORN', 'TOMATO', 'WATERMELON']


Crop names are consistent

In [662]:
# Combine all region's sales into one dataframe
combined_sales = pd.concat([americas_data, emea_data, asia_data], axis=0, ignore_index=True)
print('Combined sales data:', combined_sales.head())

Combined sales data:      surcharge  material_nbr     period  commercial_sales_territory_code  \
0  5053.311857      11947192 2020-01-01                            923.0   
1  3744.609416      12502640 2022-01-01                            923.0   
2  2346.913894      11947192 2021-01-01                            923.0   
3  3507.780298      12515444 2021-01-01                            921.0   
4  4209.321190      10762610 2022-01-01                            921.0   

       net_qty commercial_team commercial_subregion_desc  company_code  \
0   631.691689            CT13                  UNDEF CA        1318.0   
1    24.500444            CT13                  UNDEF CA        1318.0   
2  1198.988892            CT13                  UNDEF CA        1318.0   
3   314.982145            CT13                  UNDEF US        3605.0   
4   165.566886            CT13                  UNDEF US        3605.0   

  commercial_district_description commercial_area_description  ...  \
0      

## 4. Data Quality and Integrity Checks

In [663]:
# Overview for missing values and data types in combined sales
combined_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16227 entries, 0 to 16226
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   surcharge                        16227 non-null  float64       
 1   material_nbr                     16227 non-null  int64         
 2   period                           16227 non-null  datetime64[ns]
 3   commercial_sales_territory_code  16227 non-null  float64       
 4   net_qty                          16227 non-null  float64       
 5   commercial_team                  16227 non-null  object        
 6   commercial_subregion_desc        16227 non-null  object        
 7   company_code                     16227 non-null  float64       
 8   commercial_district_description  16227 non-null  object        
 9   commercial_area_description      16227 non-null  object        
 10  commercial_country_name          16227 non-null  object   

- Missing values - not found
- Data types - correct

In [664]:
# Overview for missing values and data types in forecast data
forecast_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128526 entries, 0 to 128525
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   material_number     128526 non-null  int64         
 1   cmrcl_wrld_area_cd  128526 non-null  object        
 2   cmrcl_rgn_nm        128526 non-null  object        
 3   cmrcl_cntry_cd      128526 non-null  object        
 4   cmrcl_cntry_dsc     128526 non-null  object        
 5   cmrcl_subrgn_cd     128526 non-null  object        
 6   cmrcl_subrgn_nm     128526 non-null  object        
 7   year                128526 non-null  datetime64[ns]
 8   forecast_val        128526 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 8.8+ MB


- Missing values - not found
- Data types - correct

In [665]:
# Check combined sales for duplicates
duplicates = combined_sales.duplicated().sum()
print(f"Duplicate rows in combined sales data: {duplicates}")

Duplicate rows in combined sales data: 0


In [666]:
# Check forecast data for duplicates
duplicates = forecast_data.duplicated().sum()
print(f"Duplicate rows in forecast data: {duplicates}")

Duplicate rows in forecast data: 0


In [667]:
# Check the country names in forecast data for consistency
print('Unique country names in forecast data data:')
print(sorted(forecast_data['cmrcl_cntry_dsc'].unique()))

Unique country names in forecast data data:
['Albania', 'Algeria', 'Angola', 'Antigua/Barbuda', 'Argentina', 'Armenia', 'Australia', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belize', 'Benin', 'Bolivia', 'Bosnia-Herz.', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cameroon', 'Canada', 'Chad', 'Chile', 'China', 'Colombia', 'Congo Democr. R', 'Costa Rica', 'Croatia', 'Cyprus', 'Dominican Rep.', 'Dutch Antilles', 'Ecuador', 'Egypt', 'El Salvador', 'Ethiopia', 'France', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guadeloupe', 'Guatemala', 'Haiti', 'Honduras', 'Hungary', 'India', 'Indonesia', 'Iran', 'Iraq', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Lebanon', 'Libya', 'Madagascar', 'Malaysia', 'Mali', 'Martinique', 'Mexico', 'Moldova,Rep. of', 'Morocco', 'Mozambique', 'Myanmar', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia', 'Oman', 'Pakistan', 'Palestinia

In [668]:
# Check the country names in forecast date for consistency
print('Unique country names in combined sales data:')
print(sorted(combined_sales['commercial_country_name'].unique()))

Unique country names in combined sales data:
['Albania', 'Algeria', 'Argentina', 'Armenia', 'Australia', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus', 'Benin', 'Bolivia', 'Bosnia-Herz.', 'Brazil', 'Bulgaria', 'Burundi', 'Canada', 'Chile', 'China', 'Colombia', 'Congo Democr. R', 'Costa Rica', 'Croatia', 'Cyprus', 'Dominican Rep.', 'Ecuador', 'Ecuator', 'Egypt', 'El Salvador', 'Ethiopia', 'France', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Honduras', 'Hungary', 'India', 'Iran', 'Iraq', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kuwait', 'Lebanon', 'Libya', 'Mexico', 'Moldova,Rep. of', 'Morocco', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia', 'Oman', 'Pakistan', 'Palestinian Ter', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Rep. of Belarus', 'Romania', 'Russian Fed.', 'Saudi Arabia', 'Senegal', 'Serbia', 'Slovakia', 'Slovenia', 'South Africa', 'South K

Date format was checked previously

## 5. Database Schema

Logical DB schema is pictured below

![Sales and Forecast Data Schema](./sales_forecast.drawio.png)

Create a database and load sales and forecast data into it. For simplicity I didn't include all sales columns in the diagram, only the most importnant once. Also, tables don't have primary keys, but I put there columns that I use to create a composite primary key for join.
There will be 2 tables: monthly sales and yearly sales aggregation with forecast.

In [669]:
import sqlite3

In [670]:
# Create a connection to the SQLite database
conn = sqlite3.connect('sales_forecast.db')

# Create a cursor object
cursor = conn.cursor()

In [671]:
combined_sales.to_sql('combined_sales', conn, if_exists='replace', index=False)

forecast_data.to_sql('forecast_data', conn, if_exists='replace', index=False)

128526

In [672]:
monthly_sales_query = """
SELECT 
    date(period) AS period,
    material_nbr AS material_number,
    commercial_country_name AS country,
    region,
    region_description,
    crop,
    net_sales,
    gross_sales,
    base_sales,
    surcharge,
    discount,
    net_qty,
    commercial_team,
    company_code,
    commercial_team_description
FROM combined_sales
"""

In [673]:
# Execute query and save to CSV. Note: Americas data is yearly due to the source.
monthly_sales_data = pd.read_sql_query(monthly_sales_query, conn)
monthly_sales_data.to_csv('monthly_sales_data.csv', index=False)

In [674]:
# Query for aggregated sales data joined with forecast data
yearly_sales_forecast_query = """
SELECT
    strftime('%Y', period) AS year,
    material_nbr AS material_number,
    commercial_country_name AS country,
    region,
    commercial_team,
    company_code,
    commercial_team_description,
    crop,
    region_description,
    SUM(net_sales) AS yearly_net_sales,
    SUM(gross_sales) AS yearly_gross_sales,
    SUM(base_sales) AS yearly_base_sales,
    SUM(surcharge) AS yearly_surcharge,
    SUM(discount) AS yearly_discount,
    SUM(net_qty) AS yearly_net_qty,
    forecast_val AS forecasted_sales
FROM
    combined_sales
LEFT JOIN
    forecast_data
ON
    combined_sales.material_nbr = forecast_data.material_number
    AND strftime('%Y', combined_sales.period) = strftime('%Y', forecast_data.year)
    AND combined_sales.commercial_country_name = forecast_data.cmrcl_cntry_dsc
    AND combined_sales.commercial_subregion_desc = forecast_data.cmrcl_subrgn_nm
GROUP BY
    1,2,3,4,5,6,7,8,9
"""

In [675]:
# Execute query and save to CSV
yearly_sales_forecast_data = pd.read_sql_query(yearly_sales_forecast_query, conn)
yearly_sales_forecast_data.to_csv('yearly_sales_forecast.csv', index=False)

In [676]:
conn.commit()
conn.close()

## 5. Known issues and potential improvements
- In americas_data and forecast_data, I converted years to full dates, which might be misleading in the context of the analysis
- Approximately 3.5% of Net Sales were lost due to the removal of rows with missing material numbers. While these rows could be further investigated using plots, the most effective solution would be to address this issue at the data source
- Implementing a proper ETL  process with distinct layers for raw data, cleaned and transformed data, and a curated datamart would be beneficial. For this project, I performed transformations upfront for simplicity, but using a star schema with separate dimension tables would be a good approache to reduce redundancy
- To improve sustainability, instead of cleaning country names in the sales data, it's better to use country codes and store names and additional information in a separate table
- Column naming across all tables could be improved for better clarity and consistency
- Some numerical columns, like commercial_sales_territory_code, could be converted from float64 to integers for better performance