📦 Supply Chain Analytics and Demand Forecasting with Python

This project is my practice with Python, applying statistical models learned during master's program at TU Dresden. It demonstrates a complete data analytics workflow on a real-world supply chain dataset, using Python for ETL (Extract, Transform, Load), Exploratory Data Analysis (EDA), and basic Time Series Forecasting.

📊 Dataset

Source:
Constante, Fabian; Silva, Fernando; Pereira, António (2019), DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

Please cite the original authors if you use this dataset.

Why I chose it:

This dataset reflects real business operations, including order details, product categories, regions, and dates. It's rich enough to support meaningful analysis, yet manageable for prototyping. It helped me practice a realistic analytics pipeline using Python.

📁 Data Overview

180,000+ customer order records
Covers multiple countries, product types, and time periods
Includes structured time-related features for trend modeling

⚙️ Tools & Libraries I used

Python (pandas, numpy, matplotlib, seaborn, pmdarima, statsmodels)
VS Code – main code editor
Anaconda – manages the Python environment, packages, and dependencies

⚒️ Key Tasks

Data ETL

This stage includes loading and cleaning the dataset. Irrelevant fields create noise and slow processing.

What was done:

Dropped irrelevant columns (customer emails, descriptions, GPS data, etc.)
Parsed order dates and created new time-based features (month, weekday, etc.)
Removed duplicates and missing values
Aggregated sales data by month

Exploratory Data Analysis (EDA)

This stage includes exploring and visualizing the data to get insights.

Here are some summary statistics of data features:

Feature	Count	Mean	Min	25%	50%	75%	Max	Std Dev
Category Id	169,138	31.80	2.00	18.00	29.00	45.00	76.00	15.76
Order Date	169,138	~2016-06	2015-01	2015-09	2016-06	2017-03	2018-01	–
Order Item Quantity	169,138	2.17	1.00	1.00	1.00	3.00	5.00	1.47
Sales	169,138	202.54	9.99	119.98	199.92	299.95	1999.99	133.55
Order Year	169,138	2015.97	2015	2015	2016	2017	2018	0.83

Insights from this:

Sales are typically between $120 and $300, with a few large outliers up to $1999.
Orders mainly span 2015 to early 2018, ideal for monthly time series modeling.
Most orders have 1 to 3 items, averaging around 2.17 items per order.
Data is evenly distributed across Category IDs, with 2–76 range.

Some visualizations and insights from them:

The top 5 product categories generate a significantly higher volume of sales compared to the rest. Most other categories contribute very little to overall revenue, indicating a long tail of low-performing SKUs. This could inform inventory focus, promotional efforts, or bundling strategies.

Western Europe, Central America, and South America are the top three regions by total sales. There's a clear sales imbalance — the bottom regions like Central Asia, Canada, and Southern Africa contribute very little. Potential to expand or market more effectively in underperforming regions, or reduce costs in low-return markets.

Sales were relatively stable from 2015 to late 2017, fluctuating around 900k–1.1 M. There is a sharp and unexpected drop starting Nov 2017 — likely due to incomplete or missing data (not actual business decline). => Action: Removed the last 3 months to avoid distorting the forecast model.

It can be seen that there's some seasonal or operational variability, but no extreme volatility. Without the noisy low months (Nov 2017–Jan 2018), the plot reflects a more realistic and interpretable trend.

3. Time Series Forecasting

The goal of this forecast is to predict the next 6-month demand using historical monthly demand. To do this, ARIMA and SARIMA models were used.

❓What Are ARIMA and SARIMA?

ARIMA (AutoRegressive Integrated Moving Average) is a time series model that looks at patterns in past values and past errors to forecast the future. It also removes trends to make data more predictable.
SARIMA (Seasonal ARIMA) is just like ARIMA, but it also considers seasonality — repeating patterns over time (like monthly or yearly sales cycles).

I used these 2 models because the dataset includes monthly sales across several years, which fits time series forecasting, and these 2 models are classic and popular in business to use in this case. However, ARIMA and SARIMA assume the data is stable over time, meaning no big trends or shifting behavior. Therefore, one important step is to conduct a stationarity check using the ADF (Augmented Dickey-Fuller) test. This is a statistical test used in time series analysis to check whether a series is stationary or not.

📃Results of ADF Test:

ADF Statistic: -0.374
p-value: 0.914

Interpretation: The p-value is much higher than 0.05, which means the time series is not stationary — it likely has a trend or changing behavior over time. This makes it unsuitable for ARIMA/SARIMA unless they are transformed.

=> Next steps: use auto_arima() to detect instationary and apply differencing automatically. Besides, use SARIMA with manual application of differencing.

The plot shows:

Historical Sales (Blue Line): Sales showed moderate growth with fluctuations, and a noticeable upward trend appeared in late 2017.
ARIMA Forecast (Red Dashed Line): Continues the rising trend from historical data and predicts higher sales each month/ The confidence interval suggests greater uncertainty in the forecast.
SARIMA Forecast (Green Line): Captures both trend and seasonal effects (slight dips and rises). The forecast is more conservative and realistic, not overly optimistic. A narrower confidence interval suggests a more stable prediction for short-term planning.

4. Project Conclusion:

Both models work, but they reflect different assumptions:

ARIMA fits recent growth aggressively, which can be useful if you expect continued expansion.
SARIMA is better if seasonality or external cycles (like holidays, campaigns, or market rhythms) matter.

🟢 Recommendation: Use SARIMA for operational and strategic planning where seasonality matters, and ARIMA as a secondary reference model for trend analysis.

🧾 Final Thoughts

This is a hands-on, learning-driven project where I applied forecasting models to understand and predict monthly sales. I'm actively improving my skills and open to suggestions — feedback is very welcome.

Feel free to open an issue or message me if you have advice or ideas. More updates, improvements, and model testing will follow soon!

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
LICENSE		LICENSE
Project_code.py		Project_code.py
README.md		README.md
project_code_and_output.ipynb		project_code_and_output.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

📦 Supply Chain Analytics and Demand Forecasting with Python

📊 Dataset

📁 Data Overview

⚙️ Tools & Libraries I used

⚒️ Key Tasks

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

thuynguyentud/SC-analytics-and-forecasting-using-Python

Folders and files

Latest commit

History

Repository files navigation

📦 Supply Chain Analytics and Demand Forecasting with Python

📊 Dataset

📁 Data Overview

⚙️ Tools & Libraries I used

⚒️ Key Tasks

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages