# Appendix C -- Project Ideas and Further Reading
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/APP_C_Projects_and_Further_Reading.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

This appendix provides ten capstone project ideas built around the Stack Overflow (SO) 2025 survey dataset,
followed by curated further reading and course recommendations for each
major topic in the book.

Most projects use data you already have; Projects 2 and 7 also require
additional downloads. Each is scoped so that a motivated reader who has
finished the main chapters can complete it in a weekend.


---

## C.1 -- Ten Capstone Projects

### Project 1 -- Salary Predictor Web App

**Difficulty:** Intermediate  
**Chapters used:** 3, 6  
**Goal:** Deploy the Chapter 6 salary regression model as an interactive web app
where a user enters their profile (years of experience, country, languages) and
receives a salary prediction with a confidence interval.

**Key steps:**
1. Train the best Chapter 6 model and save it with `joblib`
2. Build a `Gradio` or `Streamlit` front-end with input widgets
3. Display prediction, confidence interval, and SHAP explanation
4. Deploy to Hugging Face Spaces (free)

**Stretch goal:** Add a 'compare yourself to similar developers' feature
showing the user's predicted salary vs the median for their country and role.
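
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the Chapter 6 model: the training data is synthetic, the single feature (years of experience) and the file name `salary_model.joblib` are assumptions, and the prediction interval and SHAP output are left out.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the Chapter 6 training data (assumption):
# one feature, years of experience, with salary rising roughly linearly.
rng = np.random.default_rng(0)
years = rng.uniform(0, 30, size=200)
salary = 40_000 + 3_000 * years + rng.normal(0, 5_000, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(years.reshape(-1, 1), salary)

# Persist the fitted model, then reload it as the web app would at start-up.
model_path = Path(tempfile.mkdtemp()) / "salary_model.joblib"
joblib.dump(model, model_path)
loaded = joblib.load(model_path)


def predict_salary(years_experience: float) -> float:
    """Return a point prediction; this is the function a UI would call."""
    features = np.array([[years_experience]])
    return float(loaded.predict(features)[0])
```

A function shaped like `predict_salary` is the kind of callable that could be passed as the `fn` argument of `gradio.Interface`, with one input widget per feature.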

---

### Project 2 -- Developer Trend Analysis (2019-2025)

**Difficulty:** Beginner-Intermediate  
**Chapters used:** 3, 4, 5  
**Goal:** Download multiple years of SO survey data and track how language
popularity, salary, and AI tool adoption have changed over time.

**Key steps:**
1. Download SO surveys from 2019, 2021, 2023, and 2025
2. Harmonise column names across years (they change frequently)
3. Build time-series charts for Python vs JavaScript vs SQL adoption
4. Test whether mean salaries differ significantly between years (Chapter 5 ANOVA)

**Stretch goal:** Forecast 2026 language adoption using scipy curve fitting.
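
The harmonisation in step 2 might look like the sketch below. The rename maps are illustrative: the raw column names are the ones I recall from the 2019 and 2025 exports and should be checked against each year's schema file, and the harmonised names `salary_usd` and `languages` are made up for this example.

```python
import pandas as pd

# Hypothetical per-year rename maps: raw column name -> harmonised name.
# Verify the raw names against each year's survey schema before relying on them.
RENAME_MAPS = {
    2019: {"ConvertedComp": "salary_usd", "LanguageWorkedWith": "languages"},
    2025: {"ConvertedCompYearly": "salary_usd", "LanguageHaveWorkedWith": "languages"},
}


def harmonise(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Rename one year's columns to the shared schema and tag rows with the year."""
    out = df.rename(columns=RENAME_MAPS[year])
    out = out[["salary_usd", "languages"]].copy()
    out["year"] = year
    return out


# Tiny made-up frames standing in for the downloaded CSV files.
df_2019 = pd.DataFrame(
    {"ConvertedComp": [60000, 85000], "LanguageWorkedWith": ["Python;SQL", "JavaScript"]}
)
df_2025 = pd.DataFrame(
    {"ConvertedCompYearly": [95000], "LanguageHaveWorkedWith": ["Python;Rust"]}
)

combined = pd.concat(
    [harmonise(df_2019, 2019), harmonise(df_2025, 2025)], ignore_index=True
)

# Share of respondents per year who list Python among their languages.
uses_python = combined["languages"].str.split(";").apply(lambda langs: "Python" in langs)
python_share = uses_python.groupby(combined["year"]).mean()
```

Keeping the maps in one dictionary makes it easy to add a year: only a new entry is needed, and `harmonise` stays unchanged.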

---

### Project 3 -- AI Tool Adoption Classifier

**Difficulty:** Intermediate  
**Chapters used:** 6, 8  
**Goal:** Build a model that predicts which AI tools a developer is likely
to adopt based on their role, experience, and current language stack.

**Key steps:**
1. Explode the `AIToolCurrently` column into binary flags per tool
2. Build a multi-label classifier (one binary classifier per tool)
3. Evaluate with precision/recall per tool
4. Identify which features most predict AI tool adoption (SHAP)

**Stretch goal:** Use Chapter 8 zero-shot classification to tag free-text
job descriptions with likely AI tool affinity.
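
Step 1 can be done in one call with `Series.str.get_dummies`. The sketch below uses made-up responses, and the semicolon delimiter is an assumption based on how survey exports usually encode multi-select answers; the `ai_` prefix is just a naming choice for this example.

```python
import pandas as pd

# Made-up responses; check the real delimiter in the survey export.
df = pd.DataFrame(
    {
        "AIToolCurrently": [
            "Writing code;Debugging",
            "Writing code",
            None,
            "Testing code;Debugging",
        ]
    }
)

# One 0/1 column per distinct option; missing answers become all-zero rows.
flags = df["AIToolCurrently"].str.get_dummies(sep=";").add_prefix("ai_")
df = df.join(flags)
```

Each `ai_` column can then serve as the target of one binary classifier in step 2.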

---

### Project 4 -- Compensation Equity Audit Tool

**Difficulty:** Intermediate-Advanced  
**Chapters used:** 5, 6, 9  
**Goal:** Build a reusable audit tool that takes any tabular dataset with
a salary column and a group column, and outputs a structured fairness report.

**Key steps:**
1. Generalise the Chapter 9 audit code into a `FairnessAuditor` class
2. Compute demographic parity, equalised odds, and calibration automatically
3. Generate a PDF report with charts and statistical test results
4. Apply to the SO 2025 dataset with Country, EdLevel, and RemoteWork as groups

**Stretch goal:** Add a mitigation recommendation engine that suggests
reweighting, resampling, or threshold adjustment based on the disparity type.
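
As a starting point for step 2, demographic parity can be computed as the gap in positive-outcome rates between groups. The sketch below is one possible building block for the auditor, not the Chapter 9 code: the function name `demographic_parity_gap` and the 0/1 column `high_salary_pred` are hypothetical, and the data is made up.

```python
import pandas as pd


def demographic_parity_gap(
    df: pd.DataFrame, outcome_col: str, group_col: str
) -> tuple[pd.Series, float]:
    """Return per-group positive-outcome rates and the largest gap between them."""
    rates = df.groupby(group_col)[outcome_col].mean()
    return rates, float(rates.max() - rates.min())


# Made-up data: `high_salary_pred` is a hypothetical 0/1 model output,
# for example "predicted salary above the overall median".
data = pd.DataFrame(
    {
        "RemoteWork": ["Remote"] * 4 + ["In-person"] * 4,
        "high_salary_pred": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)

rates, gap = demographic_parity_gap(data, "high_salary_pred", "RemoteWork")
```

Equalised odds and calibration also need the true outcomes, so a fuller auditor would take a ground-truth column as a third argument.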

---

### Project 5 -- Developer Archetype Deep-Dive

**Difficulty:** Intermediate  
**Chapters used:** 3, 4, 6  
**Goal:** Extend the Chapter 6 KMeans clustering into a full developer
archetype analysis with richer features and interpretable profiles.

**Key steps:**
1. Include 15+ binary language and tool flags as clustering features
2. Try k=5 and k=8 clusters; use silhouette score to choose
3. Name each cluster based on its top features
4. Build an interactive Plotly dashboard showing cluster profiles

**Stretch goal:** Assign new respondents to clusters using `model.predict()`
and build a 'which developer archetype are you?' quiz.
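
Step 2 can be sketched with scikit-learn's `silhouette_score`. The blobs below are synthetic and only stand in for the scaled survey features, so the scores themselves carry no meaning for the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled survey feature matrix (assumption).
X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.8, random_state=42)

scores = {}
for k in (5, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
```

With many binary flags as features, it may be worth comparing more than two values of k, since the silhouette curve is often flat.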

---

### Project 6 -- Salary Regression with Deep Learning

**Difficulty:** Advanced  
**Chapters used:** 6, 7  
**Goal:** Push the Chapter 7 MLP further with richer features, architecture
search, and a rigorous comparison against the Chapter 6 Random Forest.

**Key steps:**
1. One-hot encode Country and EdLevel; concatenate with numeric features
2. Try embedding layers for high-cardinality categoricals (Country has 100+ values)
3. Implement manual early stopping and learning rate warm-up
4. Compare Random Forest vs MLP on identical feature sets

**Stretch goal:** Implement a simple neural architecture search over
hidden layer sizes and depths using RandomizedSearchCV-style random sampling.
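
The logic behind step 3 is independent of any framework, so it can be sketched in plain Python. The names `warmup_lr` and `EarlyStopping` are illustrative, the warm-up is linear (one of several common schedules), and the validation losses are made up.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate up to base_lr over warmup_steps steps."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training loop.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In a PyTorch loop, the value returned by `warmup_lr` would typically be written to `param_group["lr"]` for each group in `optimizer.param_groups`, and the model weights would be checkpointed whenever `best_loss` improves.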

---

### Project 7 -- NLP: Mining Developer Sentiment at Scale

**Difficulty:** Advanced  
**Chapters used:** 8  
**Goal:** Apply the Chapter 8 sentiment pipeline to a larger corpus of
developer text (Stack Overflow questions, GitHub issues, or Reddit r/programming)
and track sentiment trends by topic.

**Key steps:**
1. Collect data using the Stack Overflow API or Reddit API
2. Run the Chapter 8 sentiment pipeline in batches
3. Aggregate sentiment by topic (Kubernetes, React, Python, etc.)
4. Visualise sentiment trends over time

**Stretch goal:** Fine-tune a model on domain-specific developer sentiment
rather than the general SST-2 model.
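
Steps 2 and 3 amount to batching the texts, scoring each batch, and grouping scores by topic. In the sketch below, `score_batch` is a hypothetical keyword-based placeholder for the Chapter 8 pipeline, and the posts are made up.

```python
from collections import defaultdict
from statistics import mean


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_batch(texts):
    """Hypothetical stand-in for the Chapter 8 sentiment pipeline.

    Returns +1.0 for texts containing 'love' and -1.0 otherwise.
    """
    return [1.0 if "love" in text.lower() else -1.0 for text in texts]


# Made-up (topic, text) pairs standing in for collected posts.
posts = [
    ("Python", "I love list comprehensions"),
    ("Python", "Packaging is painful"),
    ("React", "I love hooks"),
    ("Kubernetes", "YAML everywhere, send help"),
    ("React", "Love the new docs"),
]

scores_by_topic = defaultdict(list)
for batch in batched(posts, batch_size=2):
    topics = [topic for topic, _ in batch]
    texts = [text for _, text in batch]
    for topic, score in zip(topics, score_batch(texts)):
        scores_by_topic[topic].append(score)

mean_sentiment = {topic: mean(scores) for topic, scores in scores_by_topic.items()}
```

Swapping `score_batch` for a real model call leaves the batching and aggregation code unchanged.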

---

### Project 8 -- Reproducible ML Pipeline with MLflow

**Difficulty:** Intermediate  
**Chapters used:** 6  
**Goal:** Add experiment tracking to the Chapter 6 pipeline using MLflow,
so every training run is logged with parameters, metrics, and artifacts.

**Key steps:**
1. Install MLflow with `pip install mlflow`
2. Wrap the Chapter 6 training loop with `mlflow.start_run()`
3. Log hyperparameters with `mlflow.log_param()`, metrics with `mlflow.log_metric()`
4. Save the model with `mlflow.sklearn.log_model()`
5. Open the MLflow UI to compare runs

**Stretch goal:** Implement a hyperparameter sweep with 50 runs and
identify the Pareto frontier of accuracy vs training time.
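
For the stretch goal, a run is on the Pareto frontier if no other run is both at least as accurate and at least as fast, with a strict improvement on one of the two. The sketch below assumes the sweep results have already been exported to a list of dictionaries; the keys `accuracy` and `train_seconds` and the values are made up.

```python
def pareto_frontier(runs):
    """Return the runs that are not dominated by any other run.

    A run dominates another if it is at least as accurate and at least as
    fast, and strictly better on at least one of the two.
    """
    frontier = []
    for candidate in runs:
        dominated = any(
            other["accuracy"] >= candidate["accuracy"]
            and other["train_seconds"] <= candidate["train_seconds"]
            and (
                other["accuracy"] > candidate["accuracy"]
                or other["train_seconds"] < candidate["train_seconds"]
            )
            for other in runs
        )
        if not dominated:
            frontier.append(candidate)
    return sorted(frontier, key=lambda run: run["train_seconds"])


# Made-up results standing in for metrics exported from a sweep.
runs = [
    {"run": "a", "accuracy": 0.80, "train_seconds": 10},
    {"run": "b", "accuracy": 0.85, "train_seconds": 30},
    {"run": "c", "accuracy": 0.84, "train_seconds": 45},
    {"run": "d", "accuracy": 0.90, "train_seconds": 120},
    {"run": "e", "accuracy": 0.78, "train_seconds": 15},
]

frontier_ids = [run["run"] for run in pareto_frontier(runs)]
```

The pairwise check is quadratic in the number of runs, which is negligible for a 50-run sweep.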

---

### Project 9 -- Time-to-First-Job Predictor

**Difficulty:** Intermediate  
**Chapters used:** 3, 5, 6  
**Goal:** Use the SO 2025 `YearsCode` and `YearsCodePro` columns to estimate
time from first coding to first professional role, and model what factors
predict a faster transition.

**Key steps:**
1. Engineer `time_to_pro = YearsCode - YearsCodePro` (handle edge cases such as missing values and negative differences)
2. Analyse distribution by education level, country, and primary language
3. Build a regression model predicting `time_to_pro`
4. Identify the features most associated with faster professional entry

**Stretch goal:** Use survival analysis (`lifelines` library) to model
time-to-employment as a censored event.
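
The feature engineering in step 1 might look like the sketch below. The text categories were used in earlier survey years; whether the 2025 export still contains them is an assumption to verify, and the numeric values chosen for them (0.5 and 51) are arbitrary placeholders.

```python
import pandas as pd

# Made-up rows covering the main edge cases.
df = pd.DataFrame(
    {
        "YearsCode": ["10", "Less than 1 year", "5", None, "More than 50 years"],
        "YearsCodePro": ["4", "Less than 1 year", "8", "2", "30"],
    }
)

# Placeholder numeric values for the text categories (assumption).
TEXT_TO_NUMBER = {"Less than 1 year": "0.5", "More than 50 years": "51"}

for col in ["YearsCode", "YearsCodePro"]:
    # Map the text categories to numbers, then coerce anything unparseable to NaN.
    df[col + "_num"] = pd.to_numeric(df[col].replace(TEXT_TO_NUMBER), errors="coerce")

df["time_to_pro"] = df["YearsCode_num"] - df["YearsCodePro_num"]

# A negative gap means more professional years than total coding years were
# reported, so treat it as invalid rather than as a real value.
df.loc[df["time_to_pro"] < 0, "time_to_pro"] = float("nan")
```

Rows with a missing `time_to_pro` can be dropped before modelling, or kept for the survival-analysis variant, where incomplete observations are handled explicitly.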

---

### Project 10 -- End-to-End Book Capstone

**Difficulty:** Advanced  
**Chapters used:** All  
**Goal:** Build a complete, deployed ML product that uses techniques from
every chapter of the book.

**Suggested product:** A 'Developer Profile Analyser' that:
- Loads and cleans SO 2025 data (Ch 3)
- Produces an EDA report with charts (Ch 4)
- Runs statistical tests on group differences (Ch 5)
- Predicts salary with a tuned sklearn pipeline (Ch 6)
- Compares with a PyTorch MLP (Ch 7)
- Classifies developer type from free-text role description (Ch 8)
- Audits predictions for fairness and generates a model card (Ch 9)
- Deploys as a Gradio app on Hugging Face Spaces

This project is your portfolio centrepiece. It demonstrates every skill
covered in the book in an integrated, working product.


---

## C.2 -- Further Reading and Resources

### Foundational Books

**Python and Data Science:**
- *Python for Data Analysis* -- Wes McKinney (the Pandas creator; essential reference)
- *Python Data Science Handbook* -- Jake VanderPlas (free online at jakevdp.github.io)
- *Fluent Python* -- Luciano Ramalho (deep Python internals; read after finishing this book)

**Machine Learning:**
- *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* -- Aurélien Géron
- *The Elements of Statistical Learning* -- Hastie, Tibshirani, Friedman (free PDF; rigorous)
- *Pattern Recognition and Machine Learning* -- Bishop (Bayesian perspective)

**Deep Learning:**
- *Deep Learning* -- Goodfellow, Bengio, Courville (free at deeplearningbook.org)
- *Dive into Deep Learning* -- Zhang et al. (free at d2l.ai; code in PyTorch and other frameworks)

**NLP and Transformers:**
- *Natural Language Processing with Transformers* -- Lewis Tunstall et al. (Hugging Face team)
- *Speech and Language Processing* -- Jurafsky and Martin (free draft at web.stanford.edu/~jurafsky)

**Ethics and Responsible AI:**
- *Weapons of Math Destruction* -- Cathy O'Neil (accessible; real-world cases)
- *The Alignment Problem* -- Brian Christian (AI safety and values)
- *Data Feminism* -- Catherine D'Ignazio and Lauren Klein (power and data)

---

### Online Courses

**Free:**
- fast.ai Practical Deep Learning for Coders -- practical-first, PyTorch
- CS231n (Stanford) -- Convolutional Neural Networks; lectures on YouTube
- CS224n (Stanford) -- NLP with Deep Learning; lectures on YouTube
- Hugging Face NLP Course -- huggingface.co/learn/nlp-course
- Kaggle Learn -- short, practical courses on ML, feature engineering, deep learning

**Paid:**
- DeepLearning.AI Specialisations (Coursera) -- Andrew Ng; systematic and rigorous
- Full Stack Deep Learning (fullstackdeeplearning.com) -- production ML systems

---

### Key Papers

**Transformers:**
- *Attention Is All You Need* -- Vaswani et al. 2017 (the transformer paper)
- *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* -- Devlin et al. 2019
- *Language Models are Few-Shot Learners* -- Brown et al. 2020 (GPT-3)

**Fairness and Ethics:**
- *Model Cards for Model Reporting* -- Mitchell et al. 2019 (Google)
- *A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle* -- Suresh and Guttag 2021
- *Fairness and Abstraction in Sociotechnical Systems* -- Selbst et al. 2019

**Explainability:**
- *A Unified Approach to Interpreting Model Predictions* -- Lundberg and Lee 2017 (SHAP)
- *Why Should I Trust You? Explaining the Predictions of Any Classifier* -- Ribeiro et al. 2016 (LIME)

---

### Communities and Practice

- **Kaggle** -- competitions, notebooks, and datasets; essential for building intuition fast
- **Papers With Code** -- every paper linked to its implementation; great for staying current
- **Hugging Face Hub** -- 500k+ models and datasets; browse what is possible
- **r/MachineLearning** -- research discussion; high signal-to-noise
- **Stack Overflow** -- specific implementation questions; the dataset source for this book

---

## C.3 -- Your Learning Path Forward

You have completed a book that takes you from Python basics to fine-tuning
a transformer. That is more ground than most online courses cover.

The honest next steps, in order of impact:

**1. Build something.** Pick one of the ten projects above and complete it.
   A deployed, working project will teach you more than three additional courses.

**2. Enter a Kaggle competition.** The structured feedback loop of a leaderboard
   reveals gaps in your knowledge faster than any other mechanism.

**3. Read one paper per week.** Start with the papers listed above.
   Reading the original source builds understanding that tutorials cannot.

**4. Contribute to open source.** Fix a bug in a library you use.
   Reading production-quality code is one of the fastest ways to improve your own.

**5. Teach someone else.** Writing a blog post or explaining a concept to a
   colleague exposes every gap in your own understanding.

---

*End of Appendix C -- Python for AI/ML*  
