teragramgius/python-analyst

🐍 Python Analyst Portfolio

A collection of data analytics projects built with Python, covering the full spectrum from exploratory visualization to machine learning and inferential statistics. Each project is self-contained with its own dataset, notebook, and output charts.


📂 Repository Structure

python-analyst/
│
├── 📊 01_global-superstore-visualization/
├── ⚡ 02_household-energy-time-series/
├── 🏦 03_credit-risk-ml/
├── 🔬 04_credit-risk-inferential/
└── 🌍 05_governance-innovation-analysis/

📊 01 — Global Superstore Visualization

Type: Exploratory Data Analysis + Data Visualization
Dataset: Global Superstore — 51,290 retail transactions (2011–2014)

A deep-dive into retail business performance across 7 global markets. The analysis covers quarterly trends, product profitability, geographic distribution, customer segmentation, and logistics efficiency. Every major chart type from the curriculum is represented: line, bar, donut, sunburst, scatter, bubble, stacked bar, heatmap, violin.

Key finding: Discounts above 30% systematically destroy profit margin — confirmed both visually and quantitatively via Pearson correlation. The Furniture–Tables sub-category is the single largest source of losses in the catalogue.

Charts produced: 13 PNG outputs
Models: None (pure EDA)
Libraries: pandas · numpy · matplotlib · seaborn · plotly
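
The discount check described in the key finding can be sketched with a few lines of pandas. This is an illustration only: the column names Discount, Sales and Profit and the seven toy rows are assumptions, not values taken from the project notebook.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Global Superstore data.
# Column names (Discount, Sales, Profit) are assumed.
orders = pd.DataFrame({
    "Discount": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7],
    "Sales": [100.0] * 7,
    "Profit": [30.0, 22.0, 12.0, 2.0, -10.0, -25.0, -60.0],
})

# Profit margin per order line.
orders["margin"] = orders["Profit"] / orders["Sales"]

# Pearson correlation between discount level and margin.
pearson_r = orders["Discount"].corr(orders["margin"], method="pearson")

# Average margin on either side of the 30% discount threshold.
orders["band"] = (orders["Discount"] > 0.30).map(
    {True: "above 30%", False: "30% or less"}
)
margin_by_band = orders.groupby("band")["margin"].mean()

print(f"Pearson r (discount vs margin): {pearson_r:.2f}")
print(margin_by_band)
```

On real data the same two steps (a correlation plus a banded average) give a quick numeric counterpart to the scatter and heatmap charts.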

⚡ 02 — Household Energy Time Series

Type: Time Series Analysis + Forecasting
Dataset: UCI Individual Household Electric Power Consumption — ~2.07M minute-level readings (2006–2010)

Analysis of electricity consumption patterns in a French household over nearly four years (December 2006 to November 2010). The project moves from pattern mining (heatmaps, sub-metering breakdown) through classical decomposition and stationarity testing to multi-model forecasting with diagnostic validation.

Key finding: SARIMA outperforms ARIMA on the 60-day holdout by capturing weekly periodicity. ARCH-LM detects heteroscedasticity in all models' residuals — winter confidence intervals should be interpreted conservatively. The additive decomposition choice is data-driven, justified by a flat rolling coefficient of variation.

Charts produced: 16 PNG outputs
Models: ARIMA · SARIMA · Facebook Prophet
Diagnostics: Ljung-Box · Jarque-Bera · ARCH-LM
Libraries: pandas · numpy · matplotlib · seaborn · statsmodels · prophet · scikit-learn
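
The holdout comparison behind the key finding can be sketched without fitting any model. The example below is a simplified stand-in, not the project's ARIMA or SARIMA code: it uses a synthetic daily series with a weekly cycle, a plain naive forecast in place of the non-seasonal model, and a seasonal naive forecast in place of the seasonal one. All names and numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a weekly cycle (illustrative, not the UCI data).
rng = np.random.default_rng(0)
n_days = 365
days = pd.date_range("2010-01-01", periods=n_days, freq="D")
weekly_cycle = 0.4 * np.sin(2 * np.pi * np.arange(n_days) / 7)
series = pd.Series(1.2 + weekly_cycle + rng.normal(0, 0.05, n_days), index=days)

# Hold out the last 60 days, as in the project description.
HOLDOUT = 60
train, test = series.iloc[:-HOLDOUT], series.iloc[-HOLDOUT:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Non-seasonal stand-in: repeat the last observed value.
naive_forecast = np.full(HOLDOUT, train.iloc[-1])

# Seasonal stand-in: repeat the last observed week across the holdout.
last_week = train.iloc[-7:].to_numpy()
seasonal_forecast = np.tile(last_week, HOLDOUT // 7 + 1)[:HOLDOUT]

rmse_naive = rmse(test, naive_forecast)
rmse_seasonal = rmse(test, seasonal_forecast)
print(f"naive RMSE: {rmse_naive:.3f}  seasonal naive RMSE: {rmse_seasonal:.3f}")
```

The forecast that carries the weekly pattern forward scores a lower RMSE on this toy series, which is the same mechanism the README credits for SARIMA's advantage.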

🏦 03 — Credit Risk Prediction (Machine Learning)

Type: Supervised Classification + Model Evaluation
Dataset: German Credit Risk — 1,000 loan applicants

A credit scoring model that predicts default probability using three supervised learning algorithms. The project handles class imbalance explicitly, applies differentiated encoding strategies (ordinal vs one-hot), and evaluates models on operationally relevant metrics — F1 on the bad class and ROC-AUC rather than raw accuracy.
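
As a rough sketch of the ordinal versus one-hot split mentioned above: ordered categories get an integer rank, unordered ones get indicator columns. The column names, category levels and their ordering below are hypothetical and may not match the notebook.

```python
import pandas as pd

# Hypothetical columns and levels, for illustration only.
applicants = pd.DataFrame({
    "checking_account": ["little", "moderate", "rich", "little"],
    "purpose": ["car", "education", "car", "business"],
})

# Ordinal encoding: the categories have a natural order, so map them to ranks.
checking_order = {"little": 0, "moderate": 1, "rich": 2}
applicants["checking_account_ord"] = applicants["checking_account"].map(checking_order)

# One-hot encoding: "purpose" has no natural order, so expand it to indicators.
encoded = pd.get_dummies(
    applicants.drop(columns="checking_account"),
    columns=["purpose"],
    dtype=int,
)
print(encoded)
```

Keeping the order in a single integer column avoids splitting an ordered variable into unrelated indicators, while one-hot encoding avoids implying an order that does not exist.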

Key finding: Checking account status and loan duration are the dominant predictors across all three models. Without class imbalance correction, a model can reach a deceptively high accuracy simply by predicting "good" for every applicant, which masks its failure to detect actual defaults.

Charts produced: 7 PNG outputs
Models: Logistic Regression · Random Forest · XGBoost
Metrics: Accuracy · Precision · Recall · F1 · ROC-AUC · Confusion Matrix · Feature Importance
Libraries: pandas · numpy · matplotlib · seaborn · scikit-learn · xgboost
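
The reason accuracy is a poor headline metric here can be shown with a tiny worked example. The labels below are synthetic and the 70/30 class split is an assumption for illustration; the metrics are computed by hand so the arithmetic stays visible.

```python
import numpy as np

# Synthetic labels: 0 = good, 1 = bad (default). The 70/30 split is assumed.
y_true = np.array([0] * 700 + [1] * 300)

# A degenerate "model" that predicts good for every applicant.
y_pred = np.zeros_like(y_true)

accuracy = float((y_true == y_pred).mean())

# Precision, recall and F1 for the bad class (label 1).
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

precision_bad = tp / (tp + fp) if (tp + fp) else 0.0
recall_bad = tp / (tp + fn) if (tp + fn) else 0.0
f1_bad = (
    2 * precision_bad * recall_bad / (precision_bad + recall_bad)
    if (precision_bad + recall_bad)
    else 0.0
)

print(f"accuracy={accuracy:.2f}  recall(bad)={recall_bad:.2f}  F1(bad)={f1_bad:.2f}")
```

The always-good predictor scores 0.70 accuracy while catching no defaults at all, so its F1 on the bad class is 0.0. That gap is why the project reports F1 on the bad class and ROC-AUC.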

🔬 04 — Credit Risk Inferential Analysis

Type: Inferential Statistics + Hypothesis Testing
Dataset: German Credit Risk — 1,000 loan applicants (same dataset, different lens)

A statistically rigorous investigation into whether demographic variables (gender, housing, age) are associated with differences in credit amounts and default rates. Every test follows the full assumption-checking pipeline: Shapiro-Wilk → Levene → parametric or non-parametric test → effect size. Includes OLS with interaction terms and Chi-Square with Cramér's V for categorical associations.

Key finding: Loan amounts differ significantly by gender, but the association between gender and default risk is weak (low Cramér's V), which has a direct fair-lending implication. Checking account status remains the strongest predictor of default in both the ML and inferential approaches, so two different methodologies point to the same signal.

Charts produced: 9 PNG outputs
Tests: Shapiro-Wilk · Levene · Mann-Whitney U · Kruskal-Wallis · Dunn (Bonferroni) · OLS + ANOVA · Chi-Square · Cramér's V
Effect sizes: Rank-biserial r · η² · Cramér's V
Libraries: pandas · numpy · matplotlib · seaborn · scipy · statsmodels · scikit-posthocs
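
For readers unfamiliar with Cramér's V, the sketch below computes it from a contingency table using only numpy: the chi-square statistic is divided by n times (min(rows, columns) - 1) and square-rooted. The counts are made up for illustration and are not results from the notebook.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V for a contingency table of counts (no continuity correction)."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical counts: rows = two groups, columns = (good, bad).
weak_table = [[200, 100], [450, 250]]
perfect_table = [[50, 0], [0, 50]]

v_weak = cramers_v(weak_table)
v_perfect = cramers_v(perfect_table)
print(f"weak association: V={v_weak:.3f}  perfect association: V={v_perfect:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), which makes it easier to compare across tables than a raw chi-square statistic or p-value.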

🌍 05 — Governance & Innovation Analysis

Type: Comparative Country Analysis + Inferential Statistics
Dataset: Multi-country governance and innovation indicators — 200 countries

Construction of two composite indices (Governance and Innovation) from survey-based indicators, followed by systematic group comparisons across development status (Developed vs Developing) and seven world regions. The testing pipeline adapts to distributional properties at runtime — Welch t-test or Mann-Whitney depending on Shapiro-Wilk output.
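
One common way to build a composite index is to standardize each indicator and average the z-scores with equal weights. The README does not state the aggregation method, so the sketch below is an assumption, and the indicator names and values are hypothetical.

```python
import pandas as pd

# Hypothetical indicators on different scales for four made-up countries.
indicators = pd.DataFrame(
    {
        "rule_of_law": [0.2, 1.1, -0.5, 0.9],
        "regulatory_quality": [10.0, 30.0, 5.0, 25.0],
    },
    index=["A", "B", "C", "D"],
)

# Standardize each indicator so that scale differences do not dominate.
z_scores = (indicators - indicators.mean()) / indicators.std(ddof=0)

# Equal-weight composite: the mean z-score across indicators (an assumption).
governance_index = z_scores.mean(axis=1)
print(governance_index.sort_values(ascending=False))
```

Standardizing first means an indicator measured on a 0 to 100 scale carries the same weight as one measured on a -2.5 to 2.5 scale.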

Key finding: Regional heterogeneity is the dominant source of variance: a country's region matters more than the developed/developing binary. ECA (Europe and Central Asia) transition economies show a governance-innovation gap: institutional quality is relatively high but innovation remains constrained by limited access to capital and R&D investment, not by governance deficits.

Charts produced: 8 PNG outputs
Tests: Shapiro-Wilk · Levene · Welch t-test · Mann-Whitney U · Kruskal-Wallis · Dunn (Bonferroni) · Spearman · Chi-Square · Cramér's V
Effect sizes: Cohen's d · Rank-biserial r · η² · Cramér's V
Libraries: pandas · numpy · matplotlib · seaborn · scipy · statsmodels · scikit-posthocs
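
The runtime test selection described in this project can be sketched as a small helper built on scipy. The function name, the 0.05 threshold and the deterministic sample data are assumptions for illustration; the notebook's own pipeline also includes Levene's test and effect sizes, which are omitted here.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance threshold


def compare_two_groups(a, b, alpha=ALPHA):
    """Pick Welch's t-test or Mann-Whitney U based on Shapiro-Wilk results."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat the data as normal only if neither group rejects normality.
    looks_normal = all(stats.shapiro(g).pvalue > alpha for g in (a, b))
    if looks_normal:
        result = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
        name = "Welch t-test"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "Mann-Whitney U"
    return {
        "test": name,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Deterministic, normal-shaped groups (quantiles of a standard normal).
normal_a = stats.norm.ppf((np.arange(1, 41) - 0.5) / 40)
normal_b = normal_a + 1.0

# Deterministic, strongly right-skewed groups.
skewed_a = np.exp(np.linspace(0, 5, 40))
skewed_b = skewed_a * 2

result_normal = compare_two_groups(normal_a, normal_b)
result_skewed = compare_two_groups(skewed_a, skewed_b)
print(result_normal["test"], result_skewed["test"])
```

Letting the normality check drive the branch keeps the choice of test reproducible, although with large samples Shapiro-Wilk rejects for very small deviations, so the threshold deserves some judgment.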

🛠️ Tech Stack

Library                 Role
pandas / numpy          Data loading, cleaning, aggregation, feature engineering
matplotlib / seaborn    Static visualizations (all projects)
plotly                  Interactive charts: sunburst, scatter (Project 01)
statsmodels             Time series decomposition, ARIMA/SARIMA, OLS, ADF (Projects 02, 04, 05)
prophet                 Facebook Prophet forecasting (Project 02)
scikit-learn            ML models, preprocessing, evaluation metrics (Projects 02, 03)
xgboost                 Gradient boosting classifier (Project 03)
scipy                   Full inferential statistics suite (Projects 04, 05)
scikit-posthocs         Dunn post-hoc test with Bonferroni correction (Projects 04, 05)

⚙️ Setup

Each project has its own requirements.txt and virtual environment. General setup:

git clone https://github.com/<your-username>/python-analyst.git
cd python-analyst/<project-folder>

python -m venv .venv
source .venv/bin/activate       # macOS / Linux
# .venv\Scripts\activate        # Windows

pip install -r requirements.txt

Note on Prophet (Project 02): if installation fails, try conda install -c conda-forge prophet. With older Prophet releases that still depend on PyStan, pip install pystan==2.19.1.1 && pip install prophet may also work.

Note on datasets: large files (> 50 MB) are not included in the repo. Download links are in each project's individual README.


👤 Author

Giusy Grieco
Data analyst & policy evaluation specialist
📍 Bologna, Italy
🔗 LinkedIn · GitHub


Five projects. One dataset at a time. Always asking why the numbers look the way they do.
