# Think Like a Data Scientist: <br/> &emsp; An Introduction to Solving <br/> &emsp; Data Oriented Problems
:author: Stephen Webb

The purpose of this workshop is to give participants a guided, hands-on tour of how Data Scientists approach problems, and how that can differ from how Software Engineers approach the same problems. By the end of the workshop, everyone should have first-hand experience of tackling an open-ended, exploratory data problem using real data.

***
***
## Introduction

### The Data Problem

> Given a data set and some information about how it was collected,<br/> what questions can we ask of it, and <br/> what problems can we solve with it?

***

### A glib distinction between science and engineering from an authoritative source

> "In science, <br/>
if you know what you are doing, <br/> 
you should not be doing it. 
<br/> <br/> 
In engineering, <br/>
if you do not know what you are doing, <br/>
you should not be doing it."

 - Richard Hamming, [The Art of Doing Science and Engineering: Learning to Learn](https://www.google.com/books/edition/The_Art_of_Doing_Science_and_Engineering/vKbpywEACAAJ?hl=en)

Most software engineers come with a _Builder's Mindset_ – given a problem, how can we design a system that will solve it?

Most data scientists come with an _Experimenter's Mindset_ – given a question, how can we use the information available to us to go about answering it?

***

### The Challenge

![XKCD comic #1425, in which someone asks for a model to detect whether a photo is from a national park, easy with the presence of metadata, and then if there is a bird in the photo, which at the time was a full blown research problem.](./images/xkcd.png "XKCD #1425")

Seemingly small changes in the business problem being asked can move things from trivially solvable to impossible, or extremely challenging, based on the data available, the requirements on the solution, and other considerations.

# Example of Business Problem that then required a technical solution. Concrete example of a solution.

***
***
## Step 1: Look at the data


### Why do we need to look at the data?

| **How someone described the data** |
|:--:|
|![Futuristic utopian city](./images/future-city.jpg) |

| **What the people who work with it think the data looks like** |
|:--:|
|![Study After Velazquez's Portrait of Pope Innocent X](./images/innocent-x.png)|

| **What the data actually looks like** |
|:--:|
|![A Rube Goldberg Machine](./images/rubegoldberg.jpg) |



### So let's look at some data!

```python
import pandas as pd

# Load the Fremont Bridge bicycle counts used throughout the exercise.
df = pd.read_csv('./data/fremont-bridge-bicycle-counts-exercise.csv')
```
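A minimal sketch of a first look at a freshly loaded table might be something like the following. The helper `first_look` is hypothetical, and the demo frame is made-up stand-in data (the column names `timestamp` and `count` are assumptions, not the columns of the workshop CSV), so the block runs on its own.

```python
import pandas as pd


def first_look(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing values, and distinct values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isna().sum(),
            "n_unique": frame.nunique(),
        }
    )


# Made-up stand-in data, NOT the workshop CSV.
demo = pd.DataFrame(
    {
        "timestamp": ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 01:00"],
        "count": [12.0, None, 7.0],
    }
)

summary = first_look(demo)
print(demo.shape)   # number of rows and columns
print(demo.head())  # the first few raw rows
print(summary)      # one summary row per column
print(demo.duplicated(subset="timestamp").sum())  # repeated timestamps
```

In the workshop itself, the same helper could be called as `first_look(df)` on the loaded data; missing values and duplicated keys are usually the first surprises worth chasing.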

***
## Step 2: Ask a question of the data


***
## The Ladder of Data Maturity

![A diagram of a ladder indicating the steps of data maturity, beginning with Existence, then Reporting and Analytics, followed by "Baby" Machine Learning Models, and concluding with "Big" Machine Learning & Automation](./data_maturity.jpeg "Ladder of Data Maturity")

***
### **Existence**

**What it is** - 

The data is accessible through standardized data tables with access control, where the information input _should_ be considered accurate.

**What it looks like** - 

Moving from data manually entered into Excel spreadsheets that sit in OneDrive to access-controlled form entries that write to a table in Fabric with automatic processes for recording edits and revisions to entries.

[Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) is a very good guide to how to organize data. The three-bullet summary is:
* Each variable forms a column
* Each observation forms a row
* Each type of observational unit forms a table
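As a small illustration of the first two bullets, here is a sketch that reshapes a "wide" table into a tidy one with `DataFrame.melt`. The table and its column names (`site`, `jan`, `feb`, `riders`) are made up for the example.

```python
import pandas as pd

# Made-up "wide" table: one column per month, so the variable "month"
# is hidden in the column headers instead of living in its own column.
wide = pd.DataFrame(
    {
        "site": ["A", "B"],
        "jan": [10, 20],
        "feb": [11, 21],
    }
)

# melt() turns the month columns into rows: one row per (site, month) observation.
tidy = wide.melt(id_vars="site", var_name="month", value_name="riders")
print(tidy)
```

After the reshape, `month` is an ordinary column that can be filtered, grouped, or joined on like any other variable.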


***
### **Reporting & Analytics**

**What it is** - 

The data in the standardized tables can be aggregated (Reporting), summarized, and described (Analytics) in ways that provide information that informs decisions.

**What it looks like** - 

All stakeholders are able to look at a dashboard that exposes the data in SAP describing current status and forecasts that include seasonality effects based on data that is validated and updated synchronously.
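To make the distinction concrete, the sketch below aggregates a small table (Reporting) and then describes it (Analytics) with pandas. The records and the column names `date` and `units` are made up for illustration.

```python
import pandas as pd

# Made-up daily records, standing in for a standardized table.
records = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"]
        ),
        "units": [5, 7, 4, 8],
    }
)

# Reporting: aggregate the raw rows to one total per calendar month.
monthly_total = records.groupby(records["date"].dt.to_period("M"))["units"].sum()

# Analytics: summarize and describe the distribution of the daily values.
daily_summary = records["units"].describe()

print(monthly_total)
print(daily_summary)
```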

***
### **"Traditional" Machine Learning Models**

**What it is** - 

Complex correlations in data are encoded in single models that output probabilities or quantitative predictions – "traditional" machine learning, lightweight models, etc.

**What it looks like** - 

A statistical model which takes disparate features – month of the year, current data, unemployment and inflation data – and provides a three month demand forecast.


***
### **LLMs, RAG, Agentic Systems, Automation...**

**What it is** - 

Language models, agentic systems, and other automated or autonomous decision-making models.

**What it looks like** - 

An agentic system that receives a forecasted supply shortfall based on a demand forecasting model and adjusts production schedules automatically to minimize the shortfall.

***
***
## Common Pitfalls

### Falling in love with using a tool instead of solving a problem

Someone just published the latest GraphRAG algorithm and you're dying to use it for your information retrieval problem. 

But is it better than [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) for solving your problem?

**Corollary**: always start with the simplest model that you think might solve the problem, and see how it fails. Machine learning models are [the high interest credit card of technical debt](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf), so a more complex model should earn the debt it brings by clearly beating the simple baseline.
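To give a sense of how small such a baseline can be, here is a sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the function name `bm25_scores` and the three-document corpus are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up corpus, tokenized naively on whitespace.
corpus = [
    "bicycle counts on the fremont bridge".split(),
    "demand forecast for the next quarter".split(),
    "the bridge is closed for repairs".split(),
]
scores = bm25_scores("fremont bridge".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(scores, best)
```

A baseline like this gives a concrete number that any fancier retrieval method has to beat on the same queries.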

***

### A data point is not a unit test

I once worked on a project that was meant to be core infrastructure for the whole business to replace an outside vendor.

We developed a model that was substantially better than what was already in production in every metric we were asked to use.

The Product Owner blocked deployment for six months, at a cost of several million dollars, because the model wasn't perfect on one particular pathological data point that was deemed part of the acceptance criteria.

**Corollary**: Machine learning models learn to replicate distributions, and all distributions have tails and outliers. Models will always fail _somewhere_.
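The difference between the two views can be sketched with a handful of made-up error values: a candidate model that is better on aggregate still "fails" if a single point is treated as a pass/fail test. All numbers below are invented for illustration and are not from the project described above.

```python
# Made-up absolute errors on the same six evaluation points.
old_model_errors = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]
new_model_errors = [1.0, 2.0, 1.0, 9.0, 1.0, 2.0]  # one pathological point

pairs = list(zip(new_model_errors, old_model_errors))

# Unit-test view: every single point must pass, so one outlier blocks everything.
passes_every_point = all(new <= old for new, old in pairs)
worst_point = max(range(len(new_model_errors)), key=new_model_errors.__getitem__)

# Distribution view: compare aggregate error across all points.
old_mae = sum(old_model_errors) / len(old_model_errors)
new_mae = sum(new_model_errors) / len(new_model_errors)
share_improved = sum(new < old for new, old in pairs) / len(pairs)

print(passes_every_point, worst_point)
print(old_mae, new_mae, share_improved)
```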

***

### A machine learning model is not software

LLMs have a `Temperature` parameter that makes the output stochastic and non-repeatable for a fixed input.

A model given an input that is outside of the training distribution will give _an_ answer, but there is no _a priori_ check that the result will be sensible.

Model performance is measured using evaluation metrics, which are `Float`, rather than unit testing suites, which are `Boolean`.

There is no way to "refactor" a model – when requirements change, a new model likely needs to be developed.

**Corollary**: once deployed, though, a machine learning model looks a lot like an endpoint, and good software engineering and architecture practices do apply _to that endpoint_.
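The effect of temperature can be illustrated with a toy sketch of temperature-scaled softmax sampling. This is not any particular LLM's API: the function `sample_with_temperature`, the three logits, and the convention of treating zero temperature as greedy selection are all assumptions made for the example.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always pick the top logit.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
rng = random.Random(0)

greedy = {sample_with_temperature(logits, 0.0, rng) for _ in range(200)}
sampled = {sample_with_temperature(logits, 1.5, rng) for _ in range(200)}
print(greedy)   # one index only: the same input always gives the same output
print(sampled)  # more than one index: the same input gives varying outputs
```

A conventional function called 200 times with the same arguments returns the same value 200 times; with a non-zero temperature, the model does not.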

***

### Patterns can change, your model does not

I worked at a business travel booking company.

There were clear time periods of <br/>
"Before COVID", <br/> 
"After COVID, but before widespread vaccine availability", and <br/>
"After widespread vaccine availability"<br/>
in the user behavior.

**Corollary**: monitoring for data drift is an incredibly valuable and underrated activity.
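One simple way to monitor for drift, sketched here under stated assumptions, is to compare a recent window of a feature against a reference window with the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The data is synthetic, and the alert threshold of `0.2` is a hypothetical value chosen for illustration, not a recommendation.

```python
import random


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    i = j = 0
    largest_gap = 0.0
    while i < len(ref_sorted) and j < len(cur_sorted):
        value = min(ref_sorted[i], cur_sorted[j])
        # Advance both samples past every observation equal to `value`.
        while i < len(ref_sorted) and ref_sorted[i] == value:
            i += 1
        while j < len(cur_sorted) and cur_sorted[j] == value:
            j += 1
        gap = abs(i / len(ref_sorted) - j / len(cur_sorted))
        largest_gap = max(largest_gap, gap)
    return largest_gap


rng = random.Random(0)
before = [rng.gauss(100, 10) for _ in range(500)]   # reference period
same = [rng.gauss(100, 10) for _ in range(500)]     # same behavior, new sample
shifted = [rng.gauss(70, 10) for _ in range(500)]   # behavior has changed

DRIFT_THRESHOLD = 0.2  # hypothetical alert threshold
ks_same = ks_statistic(before, same)
ks_shifted = ks_statistic(before, shifted)
print(ks_same, ks_same > DRIFT_THRESHOLD)
print(ks_shifted, ks_shifted > DRIFT_THRESHOLD)
```

Run on a schedule against each input feature, a check like this flags when the data a model sees in production no longer resembles the data it was trained on.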