In [None]:
from lec_utils import *
plotly.io.renderers.default = 'notebook'

#### DAIR-3 Workshop, Day 2 • Building Robust ML Models

# Part 1: Overview of Machine Learning and Tools

**Instructor**: Suraj Rampure (rampure@umich.edu)

### Survey

As we filter in, please open the following website, and fill out the survey linked at the top.

<center><h3><a href="https://rampure.org/dair3">rampure.org/dair3</a></h3></center>
    
The survey will help me decide how to prioritize our time this morning.

### Why are we here?

- The replication crisis is pervasive in science.

- By writing code, we provide a lineage of analysis for others to read and reproduce.

    In an ideal world, to reproduce someone's work, we'd run:

    ```bash
    git clone https://github.com/surajrampure/analysis
    cd analysis
    python run.py
    ```
    
    and out would come the same figures and tables as in the original paper.

- **The above ideal is rare.** Code, and code notebooks in particular, have plenty of issues with reproducibility too.<br><small>See [this paper](https://link.springer.com/article/10.1007/s10664-021-09961-9) for more context.</small>

- And, even if that ideal is achieved, the results may never have been valid in the first place.

### Rigor and reproducibility in the context of generative AI

- New generative AI tools make it easy to get started writing code, but have increased the prevalence of **analytical errors** – cases where the code "runs" but performs flawed analyses.

> While some applications, such as using AI for literature review, don't involve writing code, most applications of AI for science are, in essence, software development.
> 
> Unfortunately, scientists are notoriously poor software engineers. Practices that are bog-standard in the industry, like automated testing, version control, and following programming design guidelines, are largely absent or haphazardly adopted in the research community. --- [Could AI slow science?](https://www.aisnakeoil.com/p/could-ai-slow-science)

- Now, more than ever, we need to build models in both a replicable and **mathematically sound** fashion.

- Today, we'll gain a concrete understanding of common failure modes to look for, building upon your knowledge of introductory machine learning.

### Agenda 📆

1. Overview of machine learning and tools.
2. Dimensionality reduction.
3. Model selection.
4. Model evaluation.

<div class="alert alert-danger"><h3>Watch for Warnings!</h3>

Since our emphasis today is on **rigor**, I will present several "warnings" to keep in mind when building models, in red boxes like this one. We'll summarize these at the end of the session.
    
</div>

### Tools of the trade

- This morning's session will use the Python programming language, and we will run code in Jupyter Notebooks.<br><small>All the code has already been written for you, so if you've never used Python before, you can just run the code. If you have used Python before, experiment!</small>

- There are four notebooks for our session, one per part.<br>Open the web versions by clicking the links on the session website:

<center><h3><a href="https://rampure.org/dair3">rampure.org/dair3</a></h3></center>

- Once you have a notebook open, hit **`Shift` + `Enter`** to run each code cell.

### Example: Wisconsin breast cancer dataset

- To make sure your notebooks are set up correctly, run the following cell.<br>You should see a table (DataFrame) with information about breast cancer patients.

In [None]:
from sklearn.datasets import load_breast_cancer

full = load_breast_cancer()
df = pd.DataFrame(full['data'], columns=full['feature_names'])
# sklearn encodes 0 = malignant, 1 = benign; flip so that 1 = malignant.
df['target'] = 1 - full['target']
df

- Run the cell below to see a scatter plot.

In [None]:
px.scatter(df, x='mean radius', y='mean texture', color=df['target'].replace({1: 'Malignant', 0: 'Benign'}))

### Local setup

- **Alternatively**, if you have JupyterLab or VS Code set up on your computer, you can access this session's materials by downloading the zip file at this link:

<h3><center><a href="https://github.com/surajrampure/dair3-2025">github.com/surajrampure/dair3-2025</a></center></h3>

- A benefit to downloading materials locally is that you can use AI-embedded IDEs, like **Cursor** and **Windsurf**, to edit files.<br><small>Throughout the session, I'll demonstrate how to (responsibly) use Cursor.</small>

## Taxonomy of machine learning

---

<center><img src="images/taxonomy.svg" width=900></center>

### From raw data to predictions

- In supervised learning, we don't jump straight to building models. First, we engineer features that best reflect the meaning in the data. This involves data cleaning and visualization, which we won't cover.

    <center><img src="images/data-to-features.svg" width=1000></center>

    <small>One of the benefits of neural networks (i.e. deep learning) is that features are largely created automatically. Then, the focus shifts from engineering features to engineering model architecture.</small>

- Once we've developed meaningful features, we choose an appropriate model.

<center><img src="images/features-to-preds.svg" width=1000></center>

### `sklearn`

- `sklearn` (scikit-learn) implements many common steps in the feature and model creation pipeline.<br><small>It is **widely** used in the Python ecosystem for (non-deep-learning) modeling.</small>

<center><img src='images/sklearn.png' width=20%></center>

- It interfaces with `numpy` arrays, and to an extent, `pandas` DataFrames.

- Huge benefit: the [documentation online](https://scikit-learn.org/stable/modules/classes.html) is excellent.

### Import, instantiate, fit, predict

- All models in `sklearn` are used by following the same four steps.

1. Import the relevant class.

In [None]:
from sklearn.linear_model import LogisticRegression

2. Instantiate an object from that class (optionally, with hyperparameters).

In [None]:
model = LogisticRegression()

3. Fit the model using **training** data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['mean radius', 'mean texture']], 
                                                    df['target'])

model.fit(X_train, y_train)

4. Make predictions and evaluate the model.

In [None]:
X_train

In [None]:
model.predict_proba([[14.59, 22.68]])

In [None]:
# How were predicted probabilities turned into classifications?
model.predict([[14.59, 22.68]])

In [None]:
model.score(X_test, y_test)
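
One way to explore the question in the comment above: the sketch below rebuilds the two-feature dataset, then checks whether `predict` agrees with picking the more probable class from `predict_proba`. For a binary problem, that is the same as thresholding the malignant probability at 0.5. The names `clf`, `manual`, and `thresholded` are illustrative, and the model is fit on all rows purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = data['data'][:, :2]   # columns 0 and 1 are 'mean radius' and 'mean texture'
y = 1 - data['target']    # flip so that 1 = malignant

clf = LogisticRegression().fit(X, y)

# One row per patient, one column per class (benign, malignant).
probs = clf.predict_proba(X)

# Option 1: pick whichever class has the larger probability.
manual = clf.classes_[np.argmax(probs, axis=1)]

# Option 2: threshold the malignant probability at 0.5.
thresholded = (probs[:, 1] > 0.5).astype(int)

print(np.array_equal(manual, clf.predict(X)))
print(np.array_equal(thresholded, clf.predict(X)))
```

The 0.5 cutoff is a default, not a law. In a screening setting, you might prefer a lower threshold that catches more malignant cases at the cost of more false alarms.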

### Logistic regression

In [None]:
model.intercept_, model.coef_

**Discussion Question**: What formula is our trained logistic regression model using to predict the probability of malignancy? **How** did it find that formula?
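
Once you've discussed, here is one way to check your answer. The sketch assumes the standard logistic regression form, in which the predicted probability is the sigmoid of a linear combination of the features, and compares a by-hand calculation against `predict_proba`. The names `clf`, `w0`, `w1`, `w2`, and `sigmoid` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = data['data'][:, :2]   # 'mean radius' and 'mean texture'
y = 1 - data['target']    # flip so that 1 = malignant
clf = LogisticRegression().fit(X, y)

w0 = clf.intercept_[0]    # intercept
w1, w2 = clf.coef_[0]     # one weight per feature

def sigmoid(t):
    # Squashes any real number into the interval (0, 1).
    return 1 / (1 + np.exp(-t))

x = np.array([14.59, 22.68])  # the same patient used earlier
p_manual = sigmoid(w0 + w1 * x[0] + w2 * x[1])
p_sklearn = clf.predict_proba([x])[0, 1]  # column 1 is P(malignant)

print(p_manual, p_sklearn)
```

As for the **how**: with default settings in recent versions of `sklearn`, the weights are found numerically, by an iterative solver that minimizes an L2-regularized version of the log loss (cross-entropy) on the training data.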

### Train-test splits

**Discussion Question**: What is the difference between training and test data, and why do we **need** a train-test split when building a predictive model?
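
A small experiment that may help the discussion: the sketch below swaps in a different model, a fully grown decision tree, because it makes the gap easy to see. It scores the tree on the rows it was fit on and on rows it has never seen. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `tree` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data['data'], 1 - data['target']  # all 30 features; 1 = malignant
X_tr, X_te, y_tr, y_te = train_test_split(X, y)

# A fully grown tree can effectively memorize the rows it was fit on.
tree = DecisionTreeClassifier().fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = tree.score(X_te, y_te)   # accuracy on data the model has not seen

print(f'train accuracy: {train_acc:.3f}')
print(f'test accuracy:  {test_acc:.3f}')
```

Training accuracy tells you how well the model fits data it has already seen. Only the test accuracy estimates how it will perform on new patients.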

### An inconsistency 🤔

- Run the code cells under "**Import, instantiate, fit, predict**" one more time. What do you notice?

- **Why** are the results different?
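
After you've thought about it, the sketch below illustrates one source of the variation. By default, `train_test_split` shuffles the rows randomly, so each run produces a different split. Passing `random_state` (the seed value 23 here is arbitrary) pins the shuffle down. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data['data'][:, :2]   # 'mean radius' and 'mean texture'
y = 1 - data['target']    # flip so that 1 = malignant

# Same seed, same split: the two test sets contain identical rows.
a_train, a_test, _, _ = train_test_split(X, y, random_state=23)
b_train, b_test, _, _ = train_test_split(X, y, random_state=23)
print(np.array_equal(a_test, b_test))

# No seed: each call draws a fresh shuffle, so the splits will almost surely differ.
c_train, c_test, _, _ = train_test_split(X, y)
d_train, d_test, _, _ = train_test_split(X, y)
print(np.array_equal(c_test, d_test))
```

A different split means different training rows, which means different fitted coefficients and a different test score. Fixing the seed makes a single run repeatable, but it does not tell you how much the score would vary across splits.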