# Sales and Forecast Data Analysis Project
Author: Sofia Shchetinina

## 1. Project Overview

This project involves the cleaning, processing, and analysis of sales and forecast data from different regions (Americas, EMEA, Asia).
The goal is to load, transform and consolidate this data into a unified database for easier querying while ensuring data quality and integrity.

The data is sourced from multiple CSV and Excel files provided by business teams, so data inconsistencies are expected. The final output is stored in an SQLite database, ready for further analysis, and is used to build an interactive dashboard in Tableau.


## 2. Exploratory Data Analysis

In [31]:
import pandas as pd

import warnings

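# Suppress warnings for the whole notebook; a bare `with warnings.catch_warnings():`
# block restores the previous filters as soon as it exits
warnings.simplefilter('ignore')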

In [None]:
# Load the data from csv sources
americas_data = pd.read_csv('data/americas.csv')
emea_data = pd.read_csv('data/emea.csv')
forecast_data = pd.read_csv('data/forecast.csv')

In [None]:
# Load all sheets from the Excel file so that newly added sheets are picked up automatically
asia_sheets_dict = pd.read_excel('data/asia.xlsx', sheet_name=None)

# Standardize column names for different sheets
def standardize_columns(df):
    df.columns = df.columns.str.lower()  # Convert all column names to lowercase
    return df
# Apply function to all sheets
asia_sheets_dict = {sheet_name: standardize_columns(df) for sheet_name, df in asia_sheets_dict.items()}
# Combine the sheets into one dataframe
asia_data = pd.concat(asia_sheets_dict.values(), ignore_index=True)

In [None]:
# Standardize columns for other dataframes
americas_data = standardize_columns(americas_data)
emea_data = standardize_columns(emea_data)
forecast_data = standardize_columns(forecast_data)

Quick overview of the data in all dataframes

In [None]:
print('Americas Data:')
print(americas_data.head(), '\n')

print('EMEA Data:')
print(emea_data.head(), '\n')

print('Asia Data:')
print(asia_data.head(), '\n')

print('Forecast Data:')
print(forecast_data.head(), '\n')

- Irrelevant columns detected: 'unnamed: 0' in americas_data, emea_data and forecast_data; 'sales_tcfxact' in americas_data
- Inconsistent formats in the 'period' column: some sources contain only the year, others the year and month. The forecast is made per year
- Potential naming inconsistencies in 'commercial_country_name'

## 3. Data Cleaning

In [None]:
# Drop extra columns
americas_data = americas_data.drop(columns=['unnamed: 0', 'sales_tcfxact'], errors='ignore')
emea_data = emea_data.drop(columns=['unnamed: 0', 'sales_tc_fxact'], errors='ignore')
asia_data = asia_data.drop(columns=['unnamed: 0'], errors='ignore')
forecast_data = forecast_data.drop(columns=['unnamed: 0'], errors='ignore')

In [None]:
# Check for duplicates
print('Duplicates in Americas data:', americas_data.duplicated().sum())
print('Duplicates in EMEA data:', emea_data.duplicated().sum())
print('Duplicates in Asia data:', asia_data.duplicated().sum())
print('Duplicates in Forecast data:', forecast_data.duplicated().sum())

In [None]:
# Add 'region' column and fill it with the name of the file
americas_data['region'] = 'Americas'
emea_data['region'] = 'EMEA'
asia_data['region'] = 'Asia'

In [None]:
# Check the data types and missing values
americas_data.info()

Missing values detected in 'material_nbr'. The 'period' column is not in a date format, which is not optimal.

In [None]:
emea_data.info()

Missing values in 'material_nbr', column 'period' is not in date format

In [None]:
asia_data.info()

Missing values in 'material_nbr', column 'period' is not in date format

In [None]:
forecast_data.info()

Missing values detected in 'commercial_segment' and 'sku_cd'. These columns have no counterpart in the sales data, so they can't be used in the analysis.

In [None]:
# Drop extra columns with missing values
forecast_data = forecast_data.drop(columns=['commercial_segment', 'sku_cd'], errors='ignore')

In [None]:
# Evaluate the share of missing values
americas_missing = americas_data['material_nbr'].isnull().mean() * 100
emea_missing = emea_data['material_nbr'].isnull().mean() * 100
asia_missing = asia_data['material_nbr'].isnull().mean() * 100

print(f"Missing 'material_nbr' in Americas: {americas_missing:.2f}%")
print(f"Missing 'material_nbr' in EMEA: {emea_missing:.2f}%")
print(f"Missing 'material_nbr' in Asia: {asia_missing:.2f}%")

This column is essential for correctly joining sales and forecast data, so additional checks are needed to see whether the rows with missing values follow any pattern.
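One way to look for such a pattern is to compute the share of missing 'material_nbr' per category, for example per country. This is a minimal sketch on a small made-up dataframe: the values are hypothetical, only the column names come from the sales data.

```python
import pandas as pd

# Hypothetical sample; only the column names mirror the sales data
sample = pd.DataFrame({
    'commercial_country_name': ['Spain', 'Spain', 'France', 'France', 'Italy'],
    'material_nbr': [101.0, None, 102.0, 103.0, None],
})

# Share of rows with a missing material number per country, in percent
missing_share = (
    sample['material_nbr'].isnull()
    .groupby(sample['commercial_country_name'])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(missing_share)
```

A share that is much higher for one country (or crop, or period) than for the others would suggest a systematic cause rather than random entry errors.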

In [None]:
# Look at the rows where material_number is missing for EMEA region
missing_material_rows = emea_data[emea_data['material_nbr'].isnull()]
missing_material_rows.head(20)

Rows with missing values seem random and are probably caused by human error. I'll remove them because it is more robust to keep the column in integer format and not overcomplicate it with placeholders.
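Before dropping the rows, it may be worth recording how much of net sales they carry, so that the impact of the removal is known. This is a minimal sketch on hypothetical values: only the column names 'material_nbr' and 'net_sales' come from the sales data.

```python
import pandas as pd

# Hypothetical sample; the real figures come from the regional dataframes
sample = pd.DataFrame({
    'material_nbr': [101.0, None, 102.0, None],
    'net_sales': [40.0, 5.0, 50.0, 5.0],
})

# Net sales carried by rows without a material number, as a share of the total
missing_mask = sample['material_nbr'].isnull()
lost_share = 100 * sample.loc[missing_mask, 'net_sales'].sum() / sample['net_sales'].sum()
print(f"Net sales in rows without material_nbr: {lost_share:.2f}%")
```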

In [None]:
# Remove rows with missing material_number in all regions
americas_data = americas_data.dropna(subset=['material_nbr'])
emea_data = emea_data.dropna(subset=['material_nbr'])
asia_data = asia_data.dropna(subset=['material_nbr'])

In [None]:
# Convert material_number to integer after dropping missing rows
americas_data['material_nbr'] = americas_data['material_nbr'].astype(int)
emea_data['material_nbr'] = emea_data['material_nbr'].astype(int)
asia_data['material_nbr'] = asia_data['material_nbr'].astype(int)

Check the data for consistency

In [None]:
# Check the date formats in 'Period'
print('Unique period values in Americas Data:')
print(americas_data['period'].unique())

print('\nUnique period values in EMEA Data:')
print(emea_data['period'].unique())

print('\nUnique period values in Asia Data:')
print(asia_data['period'].unique())

print('\nUnique period values in Forecast Data:')
print(forecast_data['year'].unique())

In [None]:
# Fix date format
americas_data['period'] = americas_data['period'].astype(str) + '.01'
emea_data['period'] = emea_data['period'].apply(lambda x: f"{str(x).split('.')[0]}.{str(x).split('.')[1].zfill(2)}")
asia_data['period'] = asia_data['period'].apply(lambda x: f"{str(x).split('.')[0]}.{str(x).split('.')[1].zfill(2)}")
forecast_data['year'] = forecast_data['year'].astype(str) + '.01'

# Check the date formats in 'Period'
print('Unique period values in Americas Data:')
print(americas_data['period'].unique())

print('\nUnique period values in EMEA Data:')
print(emea_data['period'].unique())

print('\nUnique period values in Asia Data:')
print(asia_data['period'].unique())

print('\nUnique period values in Forecast Data:')
print(forecast_data['year'].unique())
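One caveat with splitting on '.': if 'period' is parsed as a float, a value written as 2023.10 is read as 2023.1 and ends up as January. Whether this affects the data here depends on how the source files encode October, which is not shown. This is a minimal sketch, assuming the raw files contain text such as '2023.10', that reads the column as a string and normalizes it with a hypothetical helper `normalize_period`:

```python
import io
import pandas as pd

# Hypothetical CSV content; the real files may encode months differently
raw = io.StringIO("period,net_sales\n2023.1,10\n2023.10,20\n2023,30\n")

# Reading 'period' as text keeps the trailing zero of '2023.10'
df = pd.read_csv(raw, dtype={'period': str})

def normalize_period(value):
    """Return 'YYYY.MM'; year-only values are mapped to January."""
    year, _, month = value.partition('.')
    return f"{year}.{(month or '1').zfill(2)}"

df['period'] = pd.to_datetime(df['period'].map(normalize_period), format='%Y.%m')
print(df['period'].dt.strftime('%Y-%m').tolist())
```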

In [None]:
# Convert to datetime format and check
americas_data['period'] = pd.to_datetime(americas_data['period'], format='%Y.%m')
print('Americas data:', americas_data['period'])

emea_data['period'] = pd.to_datetime(emea_data['period'], format='%Y.%m')
print('EMEA data:', emea_data['period'])

asia_data['period'] = pd.to_datetime(asia_data['period'], format='%Y.%m')
print('Asia data:', asia_data['period'])

forecast_data['year'] = pd.to_datetime(forecast_data['year'], format='%Y.%m')
print('Forecast data:', forecast_data['year'])

In [None]:
# Check the country names for consistency
print('Unique country names in Americas Data:')
print(sorted(americas_data['commercial_country_name'].unique()))

print('\nUnique country names in EMEA Data:')
print(sorted(emea_data['commercial_country_name'].unique()))

print('\nUnique country names in Asia Data:')
print(sorted(asia_data['commercial_country_name'].unique()))

print('\nUnique country names in Forecast Data:')
print(sorted(forecast_data['cmrcl_cntry_dsc'].unique()))

In [None]:
# Map differently spelled values
country_name_mapping = {
    'Canadá': 'Canada',
    'México': 'Mexico',
    'Brasil': 'Brazil',
    'UK': 'United Kingdom',
    'U.S.A': 'United States',
    'Estados Unidos': 'United States',
    'España': 'Spain',
    'Türkiye': 'Turkey'
}

americas_data['commercial_country_name'] = americas_data['commercial_country_name'].replace(country_name_mapping)
emea_data['commercial_country_name'] = emea_data['commercial_country_name'].replace(country_name_mapping)
asia_data['commercial_country_name'] = asia_data['commercial_country_name'].replace(country_name_mapping)
forecast_data['cmrcl_cntry_dsc'] = forecast_data['cmrcl_cntry_dsc'].replace(country_name_mapping)

In [None]:
# Check crop field for consistency
print('Unique crop names in Americas Data:')
print(sorted(americas_data['crop'].unique()))

print('\nUnique crop names in EMEA Data:')
print(sorted(emea_data['crop'].unique()))

print('\nUnique crop names in Asia Data:')
print(sorted(asia_data['crop'].unique()))

Crop names are consistent

In [None]:
# Combine all regions' sales into one dataframe
combined_sales = pd.concat([americas_data, emea_data, asia_data], axis=0, ignore_index=True)
print('Combined sales data:', combined_sales.head())

## 4. Data Quality and Integrity Checks

In [None]:
# Overview for missing values and data types in combined sales
combined_sales.info()

- Missing values - not found
- Data types - correct

In [None]:
# Overview for missing values and data types in forecast data
forecast_data.info()

- Missing values - not found
- Data types - correct

In [None]:
# Check combined sales for duplicates
duplicates = combined_sales.duplicated().sum()
print(f"Duplicate rows in combined sales data: {duplicates}")

In [None]:
# Check forecast data for duplicates
duplicates = forecast_data.duplicated().sum()
print(f"Duplicate rows in forecast data: {duplicates}")

In [None]:
# Check the country names in forecast data for consistency
print('Unique country names in forecast data:')
print(sorted(forecast_data['cmrcl_cntry_dsc'].unique()))

In [None]:
# Check the country names in combined sales for consistency
print('Unique country names in combined sales data:')
print(sorted(combined_sales['commercial_country_name'].unique()))
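Comparing the two printed lists by eye gets harder as the number of countries grows. This is a minimal sketch that compares them as sets instead, using hypothetical sample values and the column names from the notebook:

```python
import pandas as pd

# Hypothetical samples; only the column names follow the notebook
sales = pd.DataFrame({'commercial_country_name': ['Spain', 'Turkey', 'Brazil']})
forecast = pd.DataFrame({'cmrcl_cntry_dsc': ['Spain', 'Türkiye', 'Brazil']})

sales_countries = set(sales['commercial_country_name'])
forecast_countries = set(forecast['cmrcl_cntry_dsc'])

# Names present on one side only will not match in the join
only_in_sales = sorted(sales_countries - forecast_countries)
only_in_forecast = sorted(forecast_countries - sales_countries)
print('Only in sales:', only_in_sales)
print('Only in forecast:', only_in_forecast)
```

Countries that legitimately have sales but no forecast would also show up here, so a non-empty result is a prompt to look closer, not necessarily an error.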

Date format was checked previously

## 5. Database Schema

The logical DB schema is pictured below.

![Sales and Forecast Data Schema](./sales_forecast.drawio.png)

Create a database and load sales and forecast data into it. For simplicity, I didn't include all sales columns in the diagram, only the most important ones. Also, the tables don't have primary keys, but the diagram shows the columns that I combine into a composite key for the join.
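Because nothing enforces the composite key, forecast rows that share the same key would multiply the matching sales rows in a LEFT JOIN. This is a minimal sketch of a uniqueness check on hypothetical forecast values; the column names are the ones used in the join.

```python
import pandas as pd

# Hypothetical forecast sample with one repeated key
forecast = pd.DataFrame({
    'material_number': [101, 101, 102],
    'year': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01']),
    'cmrcl_cntry_dsc': ['Spain', 'Spain', 'Spain'],
    'forecast_val': [100.0, 120.0, 80.0],
})

# Columns that together act as the join key
key_columns = ['material_number', 'year', 'cmrcl_cntry_dsc']

# keep=False marks every row that shares its key with another row
duplicated_keys = forecast[forecast.duplicated(subset=key_columns, keep=False)]
print(f"Rows sharing a join key: {len(duplicated_keys)}")
```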

In [None]:
import sqlite3

In [None]:
# Create a connection to the SQLite database
conn = sqlite3.connect('sales_forecast.db')

# Create a cursor object
cursor = conn.cursor()

In [None]:
combined_sales.to_sql('combined_sales', conn, if_exists='replace', index=False)

forecast_data.to_sql('forecast_data', conn, if_exists='replace', index=False)

In [None]:
query = """
SELECT 
    date(cs.period) AS period,
    cs.material_nbr AS material_number,
    cs.commercial_country_name AS country,
    cs.net_sales,
    cs.gross_sales,
    cs.base_sales,
    cs.surcharge,
    cs.discount,
    cs.net_qty,
    cs.commercial_team,
    cs.company_code,
    cs.commercial_team_description,
    cs.crop,
    cs.region,
    cs.region_description,
    fd.forecast_val AS forecasted_sales
FROM 
    combined_sales cs
LEFT JOIN 
    forecast_data fd 
ON 
    cs.material_nbr = fd.material_number
    AND strftime('%Y', cs.period) = fd.year
    AND cs.commercial_country_name = fd.cmrcl_cntry_dsc
"""

In [None]:
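# Revised join: 'year' in forecast_data was converted to a full date,
# so the year is extracted on both sides of the comparison
query = """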
SELECT 
    date(cs.period) AS period,
    cs.material_nbr AS material_number,
    cs.commercial_country_name AS country,
    cs.net_sales,
    cs.gross_sales,
    cs.base_sales,
    cs.surcharge,
    cs.discount,
    cs.net_qty,
    cs.commercial_team,
    cs.company_code,
    cs.commercial_team_description,
    cs.crop,
    cs.region,
    cs.region_description,
    fd.forecast_val AS forecasted_sales
FROM 
    combined_sales cs
LEFT JOIN 
    forecast_data fd 
ON 
    cs.material_nbr = fd.material_number
    AND strftime('%Y', cs.period) = strftime('%Y', fd.year)
    AND cs.commercial_country_name = fd.cmrcl_cntry_dsc
"""
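The difference between comparing against `fd.year` directly and applying `strftime('%Y', ...)` to both sides can be seen on a tiny in-memory database. This is a minimal sketch, assuming the datetime columns end up in SQLite as text timestamps of the form shown; the table names `sales` and `forecast` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE sales (period TEXT)")
conn.execute("CREATE TABLE forecast (year TEXT)")
# Assumed storage format of the datetime columns (text timestamps)
conn.execute("INSERT INTO sales VALUES ('2023-05-01 00:00:00')")
conn.execute("INSERT INTO forecast VALUES ('2023-01-01 00:00:00')")

# Compares '2023' with the full timestamp string: no match
raw_match = conn.execute(
    "SELECT COUNT(*) FROM sales s JOIN forecast f "
    "ON strftime('%Y', s.period) = f.year"
).fetchone()[0]

# Compares '2023' with '2023': the rows match
year_match = conn.execute(
    "SELECT COUNT(*) FROM sales s JOIN forecast f "
    "ON strftime('%Y', s.period) = strftime('%Y', f.year)"
).fetchone()[0]
conn.close()
print(raw_match, year_match)
```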

In [None]:
pd.read_sql_query(query, conn).to_csv('sales_forecast.csv', index=False)

In [None]:
conn.commit()
conn.close()

## 6. Known issues and potential improvements
- In americas_data and forecast_data, I converted years to full dates, which might be misleading in the context of the analysis
- Approximately 3.5% of Net Sales were lost due to the removal of rows with missing material numbers. While these rows could be further investigated using plots, the most effective solution would be to address this issue at the data source
- Implementing a proper ETL process with distinct layers for raw data, cleaned and transformed data, and a curated datamart would be beneficial. For this project, I performed transformations upfront for simplicity, but using a star schema with separate dimension tables would be a good approach to reduce redundancy
- To improve maintainability, instead of cleaning country names in the sales data, it's better to use country codes and store names and additional information in a separate table
- Column naming across all tables could be improved for better clarity and consistency
- Some numerical columns, like commercial_sales_territory_code, could be converted from float64 to integers for better performance
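
The country-code idea from the list above could look roughly like this. It is only a sketch under stated assumptions: the table names `dim_country` and `fact_sales`, the two-letter codes, and all values are hypothetical and not part of the current database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE dim_country (
    country_code TEXT PRIMARY KEY,   -- hypothetical two-letter code
    country_name TEXT NOT NULL
);
CREATE TABLE fact_sales (
    period TEXT NOT NULL,
    material_nbr INTEGER NOT NULL,
    country_code TEXT NOT NULL REFERENCES dim_country(country_code),
    net_sales REAL
);
INSERT INTO dim_country VALUES ('ES', 'Spain'), ('TR', 'Turkey');
INSERT INTO fact_sales VALUES
    ('2023-01-01', 101, 'ES', 10.0),
    ('2023-02-01', 101, 'ES', 15.0),
    ('2023-01-01', 102, 'TR', 7.5);
""")

# Country names live in one place; the fact table stores only the code
rows = conn.execute("""
    SELECT c.country_name, SUM(f.net_sales)
    FROM fact_sales f
    JOIN dim_country c USING (country_code)
    GROUP BY c.country_name
    ORDER BY c.country_name
""").fetchall()
conn.close()
print(rows)
```

With this layout a spelling change such as 'Turkey' to 'Türkiye' is a single update in the dimension table rather than a replacement across every sales row.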