A collection of data analytics projects built with Python, covering the full spectrum from exploratory visualization to machine learning and inferential statistics. Each project is self-contained with its own dataset, notebook, and output charts.
python-analyst/
│
├── 📊 01_global-superstore-visualization/
├── ⚡ 02_household-energy-time-series/
├── 🏦 03_credit-risk-ml/
├── 🔬 04_credit-risk-inferential/
└── 🌍 05_governance-innovation-analysis/
Project 01: global-superstore-visualization
Type: Exploratory Data Analysis + Data Visualization
Dataset: Global Superstore — 51,290 retail transactions (2011–2014)
A deep-dive into retail business performance across 7 global markets. The analysis covers quarterly trends, product profitability, geographic distribution, customer segmentation, and logistics efficiency. Every major chart type from the curriculum is represented: line, bar, donut, sunburst, scatter, bubble, stacked bar, heatmap, violin.
Key finding: Discounts above 30% systematically destroy profit margin — confirmed both visually and quantitatively via Pearson correlation. The Furniture–Tables sub-category is the single largest source of losses in the catalogue.
Charts produced: 13 PNG outputs
Models: None (pure EDA)
Libraries: pandas · numpy · matplotlib · seaborn · plotly
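The discount finding can be checked quantitatively along the following lines. This is a minimal sketch on synthetic data, not the project's notebook code: the column names (`discount`, `sales`, `profit`), the margin model used to generate the rows, and the band labels are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the order table; column names are assumptions.
rng = np.random.default_rng(0)
n = 2_000
discount = rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7], size=n)
sales = rng.uniform(20, 2_000, size=n)
# Hypothetical margin model: margin shrinks as the discount grows, plus noise.
margin = 0.25 - 0.9 * discount + rng.normal(0, 0.08, size=n)
orders = pd.DataFrame({"discount": discount, "sales": sales, "profit": sales * margin})
orders["profit_margin"] = orders["profit"] / orders["sales"]

# Pearson correlation between discount and profit margin.
r, p_value = stats.pearsonr(orders["discount"], orders["profit_margin"])

# Mean margin on either side of the 30% discount threshold.
orders["discount_band"] = np.where(orders["discount"] > 0.30, "above 30%", "30% or below")
margin_by_band = orders.groupby("discount_band")["profit_margin"].mean()

print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
print(margin_by_band.round(3))
```

On the real dataset, the same two steps (correlation plus a banded comparison) would be run on the actual order rows in place of the generated ones.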
Project 02: household-energy-time-series
Type: Time Series Analysis + Forecasting
Dataset: UCI Individual Household Electric Power Consumption — ~2.07M minute-level readings (2006–2010)
Analysis of electricity consumption patterns in a French household over nearly four years (December 2006 to November 2010). The project moves from pattern mining (heatmaps, sub-metering breakdown) through classical decomposition and stationarity testing to multi-model forecasting with diagnostic validation.
Key finding: SARIMA outperforms ARIMA on the 60-day holdout by capturing weekly periodicity. ARCH-LM detects heteroscedasticity in all models' residuals — winter confidence intervals should be interpreted conservatively. The additive decomposition choice is data-driven, justified by a flat rolling coefficient of variation.
Charts produced: 16 PNG outputs
Models: ARIMA · SARIMA · Facebook Prophet
Diagnostics: Ljung-Box · Jarque-Bera · ARCH-LM
Libraries: pandas · numpy · matplotlib · seaborn · statsmodels · prophet · scikit-learn
Project 03: credit-risk-ml
Type: Supervised Classification + Model Evaluation
Dataset: German Credit Risk — 1,000 loan applicants
A credit scoring model that predicts default probability using three supervised learning algorithms. The project handles class imbalance explicitly, applies differentiated encoding strategies (ordinal vs one-hot), and evaluates models on operationally relevant metrics — F1 on the bad class and ROC-AUC rather than raw accuracy.
Key finding: Checking account status and loan duration are the dominant predictors across all three models. Without class imbalance correction, all models would achieve artificially high accuracy by predicting "good" for every applicant — masking their failure to detect actual defaults.
Charts produced: 7 PNG outputs
Models: Logistic Regression · Random Forest · XGBoost
Metrics: Accuracy · Precision · Recall · F1 · ROC-AUC · Confusion Matrix · Feature Importance
Libraries: pandas · numpy · matplotlib · seaborn · scikit-learn · xgboost
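The combination of differentiated encoding, imbalance correction, and bad-class metrics can be sketched with scikit-learn as follows. The data is synthetic, and the column names, category levels, and risk model are assumptions for illustration. `class_weight="balanced"` is one possible imbalance correction; the source does not say which technique the project uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic applicants; column names and category levels are assumptions.
rng = np.random.default_rng(7)
n = 1_000
levels = ["little", "moderate", "rich"]  # ordered from lowest to highest balance
X = pd.DataFrame({
    "checking_account": rng.choice(levels, size=n),
    "purpose": rng.choice(["car", "education", "furniture"], size=n),
    "duration_months": rng.integers(6, 61, size=n),
})
# Hypothetical risk model: longer loans and lower balances raise default odds.
rank = X["checking_account"].map({lvl: i for i, lvl in enumerate(levels)})
logit = -1.4 + 0.04 * X["duration_months"] - 0.8 * rank
is_bad = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # 1 = bad (minority)

preprocess = ColumnTransformer([
    # Ordered categories keep their ranking; unordered ones get one column per level.
    ("ordinal", OrdinalEncoder(categories=[levels]), ["checking_account"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
    ("numeric", StandardScaler(), ["duration_months"]),
])
model = Pipeline([
    ("prep", preprocess),
    # class_weight="balanced" reweights errors so the minority class is not ignored.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, is_bad, test_size=0.25, stratify=is_bad, random_state=0
)
model.fit(X_train, y_train)

f1_bad = f1_score(y_test, model.predict(X_test), pos_label=1)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"F1 (bad class): {f1_bad:.3f}  ROC-AUC: {auc:.3f}")
```

Scoring F1 with `pos_label=1` ties the metric to the bad class, so a model that predicts "good" for everyone scores zero rather than looking accurate.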
Project 04: credit-risk-inferential
Type: Inferential Statistics + Hypothesis Testing
Dataset: German Credit Risk — 1,000 loan applicants (same dataset, different lens)
A statistically rigorous investigation into whether demographic variables (gender, housing, age) are associated with differences in credit amounts and default rates. Every test follows the full assumption-checking pipeline: Shapiro-Wilk → Levene → parametric or non-parametric test → effect size. Includes OLS with interaction terms and Chi-Square with Cramér's V for categorical associations.
Key finding: Gender is statistically significant for loan amount but only weakly associated with default risk (low Cramér's V). Loan size differs by gender while default behavior barely does, which is directly relevant to fair lending. Checking account status remains the strongest predictor of default across both the ML and inferential approaches, so the same signal is confirmed by two different methodologies.
Charts produced: 9 PNG outputs
Tests: Shapiro-Wilk · Levene · Mann-Whitney U · Kruskal-Wallis · Dunn (Bonferroni) · OLS + ANOVA · Chi-Square · Cramér's V
Effect sizes: Rank-biserial r · η² · Cramér's V
Libraries: pandas · numpy · matplotlib · seaborn · scipy · statsmodels · scikit-posthocs
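For the categorical associations, a Chi-Square test paired with Cramér's V can be sketched as below. The contingency data is synthetic, and the category labels and default rates are invented for illustration. Cramér's V is computed by hand from its standard formula rather than taken from any project helper.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic applicants; housing categories and default rates are assumptions.
rng = np.random.default_rng(3)
n = 1_000
housing = pd.Series(rng.choice(["own", "rent", "free"], size=n, p=[0.7, 0.2, 0.1]),
                    name="housing")
bad_rate = housing.map({"own": 0.26, "rent": 0.40, "free": 0.41}).to_numpy()
risk = pd.Series(np.where(rng.random(n) < bad_rate, "bad", "good"), name="risk")

# Chi-Square test of independence on the housing x risk contingency table.
table = pd.crosstab(housing, risk)
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (N * (min(rows, cols) - 1))), bounded in [0, 1].
n_obs = table.to_numpy().sum()
cramers_v = float(np.sqrt(chi2 / (n_obs * (min(table.shape) - 1))))

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}, Cramér's V = {cramers_v:.3f}")
```

Reporting V next to the p-value separates "detectable with 1,000 rows" from "large enough to matter", which is the distinction the key finding relies on.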
Project 05: governance-innovation-analysis
Type: Comparative Country Analysis + Inferential Statistics
Dataset: Multi-country governance and innovation indicators — 200 countries
Construction of two composite indices (Governance and Innovation) from survey-based indicators, followed by systematic group comparisons across development status (Developed vs Developing) and seven world regions. The testing pipeline adapts to distributional properties at runtime — Welch t-test or Mann-Whitney depending on Shapiro-Wilk output.
Key finding: Regional heterogeneity is the dominant source of variance, so a country's region matters more than the developed/developing binary. Transition economies in Europe and Central Asia (ECA) show a governance-innovation gap: institutional quality is relatively high, but innovation remains constrained by limited access to capital and R&D investment rather than by governance deficits.
Charts produced: 8 PNG outputs
Tests: Shapiro-Wilk · Levene · Welch t-test · Mann-Whitney U · Kruskal-Wallis · Dunn (Bonferroni) · Spearman · Chi-Square · Cramér's V
Effect sizes: Cohen's d · Rank-biserial r · η² · Cramér's V
Libraries: pandas · numpy · matplotlib · seaborn · scipy · statsmodels · scikit-posthocs
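A runtime-adaptive two-group comparison of the kind described above might be sketched as follows. The function name `compare_two_groups`, the 0.05 threshold, and the synthetic index scores are assumptions. The Cohen's d here uses the root mean of the two sample variances, which is one common convention and may differ from the project's choice.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Pick Welch t-test or Mann-Whitney U from Shapiro-Wilk results (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)

    if p_norm_a > alpha and p_norm_b > alpha:
        # Both groups look normal: Welch t-test (no equal-variance assumption).
        _, p_value = stats.ttest_ind(a, b, equal_var=False)
        sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return {"test": "Welch t-test", "p_value": float(p_value),
                "effect": "Cohen's d", "value": float((a.mean() - b.mean()) / sd)}

    # Otherwise fall back to the rank-based test.
    u_stat, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
    # Rank-biserial r from U of the first sample (as returned by SciPy >= 1.7).
    r_rb = 2 * u_stat / (len(a) * len(b)) - 1
    return {"test": "Mann-Whitney U", "p_value": float(p_value),
            "effect": "Rank-biserial r", "value": float(r_rb)}


# Synthetic composite-index scores; group sizes and means are invented.
rng = np.random.default_rng(11)
developed = rng.normal(0.8, 0.5, size=40)
developing = rng.normal(-0.2, 0.6, size=120)

result = compare_two_groups(developed, developing)
print(result)
```

Returning the effect size together with the test name keeps the two branches comparable in a results table, since Cohen's d and rank-biserial r are on different scales.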
| Library | Role |
|---|---|
| pandas / numpy | Data loading, cleaning, aggregation, feature engineering |
| matplotlib / seaborn | Static visualizations (all projects) |
| plotly | Interactive charts: sunburst, scatter (Project 01) |
| statsmodels | Time series decomposition, ARIMA/SARIMA, OLS, ADF (Projects 02, 04, 05) |
| prophet | Facebook Prophet forecasting (Project 02) |
| scikit-learn | ML models, preprocessing, evaluation metrics (Projects 02, 03) |
| xgboost | Gradient boosting classifier (Project 03) |
| scipy | Full inferential statistics suite (Projects 04, 05) |
| scikit-posthocs | Dunn post-hoc test with Bonferroni correction (Projects 04, 05) |
Each project has its own requirements.txt and virtual environment. General setup:
git clone https://github.com/<your-username>/python-analyst.git
cd python-analyst/<project-folder>
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt

Note on Prophet (Project 02): if installation fails, use `pip install pystan==2.19.1.1 && pip install prophet` or `conda install -c conda-forge prophet`.

Note on datasets: large files (> 50 MB) are not included in the repo. Download links are in each project's individual README.
Giusy Grieco
Data analyst & policy evaluation specialist
📍 Bologna, Italy
🔗 LinkedIn · GitHub
Five projects. One dataset at a time. Always asking why the numbers look the way they do.