# Teaching Data Science Effectively

### Robert Schroll
### The Data Incubator

Follow along at [bit.ly/tdi-odsc](https://bit.ly/tdi-odsc)

## Data Scientists Needed

Quanthub estimates a shortage of 250,000 data scientists in 2020.

Few are coming directly from academia.

[Quanthub's survey](https://quanthub.com/data-scientist-shortage-2020/) is based on data from 2019 and 2020, so it probably does not fully reflect the effects of the pandemic.  But hiring remains strong, and this estimate is consistent with the order of magnitude from other research.

## Data Science Skills Needed

Many who don't consider themselves data scientists benefit from data science skills.

Pandas can be a game-changer by automating spreadsheet tasks.
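
As a small illustration of the kind of spreadsheet task pandas can automate, the sketch below totals sales by region, roughly what a hand-built pivot table would do. The data and column names are made up for this example.

```python
import pandas as pd

# Hypothetical data standing in for a spreadsheet export.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [120.0, 80.0, 45.5, 60.0, 10.0],
})

# One line replaces a manually built pivot table: total sales per region.
totals = sales.groupby("region")["amount"].sum()
print(totals)
```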

## Data Science Skills

Data scientists need to combine skills in:
1. Data analysis
2. Programming
3. Communication

## About the Data Incubator

The Data Incubator trains recent graduates and experienced workers.

Our Data Science Fellowship is an 8-week bootcamp.
- 1000+ students trained in the basics of data science.
- Students placed at 100+ companies, including Genentech, Squarespace, and Freddie Mac.

We run private, customized trainings for dozens of companies.

## About Me

I have been programming and conducting data analysis for most of this century.

First as a physicist, now as a data scientist.

I have been teaching data science for 5 years.

## Teaching Data Science Effectively

1. Jupyter as a pedagogical tool
2. Learning by doing
3. The importance of failure
4. Where we're still learning

# Jupyter as a Pedagogical Tool

Jupyter notebooks are a great tool for data science.

They also work well for teaching.

## Mixing Languages

Data scientists are often moving between multiple languages and tools.

Jupyter and the IPython kernel allow you to mix many of these tools in a single interface.

## Python

In [None]:
for i in range(10):
    print('Hi! ' * i)

## Bash

Shell code can be run with the `!` shortcut.

In [None]:
! grep "Bash" ODSC.ipynb

## Bash

There is also a bash magic, for longer code sections.

In [None]:
%%bash

for f in *; do
    echo "Hi $f"
done

## SQL

The [IPython SQL magic](https://github.com/catherinedevlin/ipython-sql) allows SQL code directly in cells.

In [None]:
%load_ext sql
%sql sqlite:///small_data/customers.sqlite

In [None]:
%%sql

SELECT * FROM customers LIMIT 5;
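
Outside a notebook, roughly the same query can be run with Python's built-in `sqlite3` module. This is a minimal sketch: it assumes a hypothetical `customers` table with made-up columns in an in-memory database, since the schema of `customers.sqlite` is not shown here.

```python
import sqlite3

# In-memory database with a hypothetical `customers` table; the real
# schema of customers.sqlite is an unknown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [(f"Customer {i}",) for i in range(1, 9)],
)

# Same query as the %%sql cell above.
rows = conn.execute("SELECT * FROM customers LIMIT 5;").fetchall()
for row in rows:
    print(row)
conn.close()
```

The magic saves the connection and cursor boilerplate, which keeps the focus on the SQL itself.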

## HTML/JS

Jupyter already has tools to display HTML documents, but our [iHTML](https://github.com/thedataincubator/ihtml) package lets you demo HTML code directly in a notebook.

First, we'll create a JavaScript document:

In [None]:
import ihtml

In [None]:
%%jsdoc clicker
document.addEventListener("DOMContentLoaded", function(e) {
    document.querySelector("h1").addEventListener("click", function(event) {
        var div = document.createElement("div");
        div.textContent = "Hi!";
        document.body.appendChild(div);
    })
})

## HTML/JS

Then we can use this inside an HTML document:

In [None]:
%%ihtml 200
<html>
    <head>
        <style>
            body { background: #eee; }
        </style>
        {{ clicker | jsdoc }}
    </head>
    <body>
        <h1>Click me!</h1>
    </body>
</html>

## Benefits of Jupyter Notebooks

**Interactivity:** Cause and effect are tightly coupled.

**Modifiability:** Easy to make small changes and see effects.

Together, these encourage experimentation.

## Deploying Jupyter for Everyone

[Jupyter](https://jupyter.org/) provides a server for a single user.

[JupyterHub](https://jupyter.org/hub) provides many users their own servers.

[Zero-to-JupyterHub](https://zero-to-jupyterhub.readthedocs.io/en/latest/) runs JupyterHub on a K8s cluster.

All for free, with a wonderful community!

For details on our setup, see [our DigitalOcean Tech Talk](https://www.digitalocean.com/community/tech_talks/scaling-a-school-bringing-data-science-curriculum-to-20-000-students-in-the-cloud).

## Learning by Doing

Promote action at four levels:
1. Prepared interactive elements
2. Small inline coding exercises
3. Stand-alone "miniprojects"
4. Capstone project

## Prepared Interactive Elements

### excerpt from _Bias, Variance, and Overfitting_

In this notebook, we will illustrate the notion of overfitting and learn how we can try to prevent it.  We'll use data describing daily sales in a store given the number of customers in the store that day. Let's start by plotting the data.

In [None]:
import seaborn as sns
sns.set()

import pandas as pd
import matplotlib.pyplot as plt

df_sales = pd.read_csv('small_data/SalesData.csv')

X = df_sales[['Customers']]
y = df_sales['Sales']

plt.plot(X, y, 'k.')
plt.title('Daily Store Sales Given Number of Customers That Day')
plt.xlabel('Number of customers')
plt.ylabel('Store sales [dollars]');

## Train a decision tree model


We want to predict store sales given the number of customers in the store. We will use a decision tree model to illustrate an example of the bias-variance tradeoff. Decision trees are used to solve both regression and classification problems. A decision tree model makes a prediction by using a flowchart-like binary tree structure that is constructed during training. 

One of the hyperparameters of a decision tree is `max_depth`, which controls the maximum depth of the tree from its root.  Let's experiment with trees of different depths by using a widget that will allow us to pick different `max_depth` values.

In [None]:
from ipywidgets import interactive, IntSlider
from sklearn.tree import DecisionTreeRegressor

def train_and_plot(max_depth):
    est = DecisionTreeRegressor(max_depth=max_depth)
    est.fit(X, y)

    plt.figure(figsize=(9,6))
    plt.plot(X, y, 'k.', label='data')
    line = plt.plot(X, est.predict(X), c='#ba2121ff', label='model')
    plt.setp(line, linewidth=3., alpha=0.7)
    plt.title('Daily Store Sales Given Number of Customers That Day')
    plt.xlabel('Number of customers')
    plt.ylabel('Store sales [dollars]')
    plt.legend(loc='upper left')
    plt.show()
    
max_depth_slider = IntSlider(min=1, max=10, step=1, value=2)
interactive(train_and_plot, max_depth=max_depth_slider)

### Exercise

Play around with the `max_depth` slider in the above plot.  What happens when you set it to `max_depth=5`?  Does the fitted model look better or worse?  What about for `max_depth=2` or `max_depth=3`?
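
One way to put numbers on what the slider shows is to compare the error on the training data with the error on held-out data for each depth. The sketch below is not part of the original notebook: it uses synthetic data in place of `SalesData.csv`, and the noise level and 70/30 split are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the sales data: a noisy linear trend.
rng = np.random.default_rng(0)
customers = rng.uniform(0, 1000, size=300).reshape(-1, 1)
sales = 8.0 * customers.ravel() + rng.normal(0, 800, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    customers, sales, test_size=0.3, random_state=0)

train_mse, test_mse = [], []
for depth in range(1, 11):
    est = DecisionTreeRegressor(max_depth=depth, random_state=0)
    est.fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, est.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, est.predict(X_test)))
    print(f"max_depth={depth:2d}  train MSE={train_mse[-1]:12.1f}  "
          f"test MSE={test_mse[-1]:12.1f}")
```

The training error can only shrink as the tree gets deeper. The held-out error typically stops improving at a moderate depth and then grows again, which is the usual signature of overfitting.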

## Inline Coding Exercises

### excerpt from _Basic Neural Networks_

In [None]:
import tensorflow as tf
import numpy as np

### The XOR problem

Research into artificial neurons dates to the late 40s, but it was not until 1969 that Marvin Minsky and Seymour Papert pointed out that basic neurons were unable to reproduce the **exclusive-or** (XOR) function.  This Boolean function of two Boolean variables returns true if exactly one of its inputs is true:

$$ \mathrm{XOR}(0, 0) = \mathrm{XOR}(1, 1) = 0 \ \ \ \ \ \ \mathrm{XOR}(0, 1) = \mathrm{XOR}(1, 0) = 1 $$
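
The limitation can be illustrated with a brute-force search. The sketch below models a "basic neuron" as a single threshold unit (an assumption made for this example) and tries every combination of weights and bias on a coarse grid. A grid search is only an illustration, not a proof, but it shows AND and OR being reproduced while XOR is not.

```python
from itertools import product

def xor(a, b):
    """Exclusive-or of two 0/1 inputs."""
    return int(a != b)

def neuron(a, b, w1, w2, bias):
    """A single threshold neuron: fires when the weighted sum is positive."""
    return int(w1 * a + w2 * b + bias > 0)

inputs = list(product([0, 1], repeat=2))
grid = [i / 2 for i in range(-8, 9)]  # candidate values from -4 to 4

def representable(target):
    """True if some (w1, w2, bias) on the grid reproduces `target` exactly."""
    return any(
        all(neuron(a, b, w1, w2, bias) == target(a, b) for a, b in inputs)
        for w1, w2, bias in product(grid, repeat=3)
    )

print("AND:", representable(lambda a, b: a & b))  # True
print("OR: ", representable(lambda a, b: a | b))  # True
print("XOR:", representable(xor))                 # False
```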

Below, we create a related two-class classification problem, with one class clustered about (0, 0) and (1, 1), and the other about (0, 1) and (1, 0).  It would be quite easy to draw a boundary separating the two classes by hand. 

In [None]:
centers = np.array([[0, 0]] * 100 + [[1, 1]] * 100
                   + [[0, 1]] * 100 + [[1, 0]] * 100)
np.random.seed(42)
data = (np.random.normal(0, 0.2, (400, 2)) + centers).astype(np.float32)
labels = np.array([[0]] * 200 + [[1]] * 200).astype(np.float32)

plt.scatter(data[:,0], data[:,1], c=labels.ravel(), cmap=plt.cm.RdYlBu)
plt.colorbar();

The math behind this isn't as bad as it might seem at first.  All of the weights of the neurons in the hidden layer can be combined into a single $2\times2$ matrix $W^{(1)}$.  The final neuron's weights will be in a $2\times1$ matrix $W^{(2)}$.  The biases behave similarly.  Then our final probabilistic prediction is just

$$ p_j = f_2\bigg( f_1\left( X_{ji} W^{(1)}_{ik} + b^{(1)}_k \right) W^{(2)}_k + b^{(2)} \bigg).$$

We are using the Einstein notation: All repeated indices are implicitly summed over.  Both $f_1$ and $f_2$ represent the logistic function, which is taken to operate element-wise over tensors.
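
The index notation maps directly onto `np.einsum`. The following NumPy sketch evaluates the expression above for a few random inputs; the array sizes and random values are arbitrary, and $W^{(2)}$ is stored as a length-2 vector to match its single index $k$ in the formula.

```python
import numpy as np

def logistic(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # X_{ji}: 5 samples, 2 features
W1 = rng.normal(size=(2, 2))   # W^{(1)}_{ik}
b1 = np.zeros(2)               # b^{(1)}_k
W2 = rng.normal(size=2)        # W^{(2)}_k
b2 = 0.0                       # b^{(2)}

# Repeated index i is summed: hidden_{jk} = f_1(X_{ji} W1_{ik} + b1_k)
hidden = logistic(np.einsum('ji,ik->jk', X, W1) + b1)
# Repeated index k is summed: p_j = f_2(hidden_{jk} W2_k + b2)
p = logistic(np.einsum('jk,k->j', hidden, W2) + b2)

# The same computation written with ordinary matrix products.
p_matmul = logistic(logistic(X @ W1 + b1) @ W2 + b2)
print(np.allclose(p, p_matmul))
```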

The **backpropagation** algorithm, developed by Paul Werbos in 1975, points out that we can use gradient descent (or similar algorithms) to optimize all of the parameters in these sorts of expressions.  All it takes is successive applications of the chain rule.  In fact, there's nothing special we have to do to make use of it: TensorFlow's optimizers automatically work through the successive derivatives to generate the update rules.  All we have to do is set up the calculation:

We will build a simple neural network with a single hidden layer.  TensorFlow's automatic differentiation will handle the backpropagation step for us.

In [None]:
tf.random.set_seed(42)  #  set seed for reproducibility

hidden_size = 2

layer_shapes = [(2, hidden_size), (hidden_size, 1)]

#  set up W and b of the appropriate sizes for each layer
layers = [(tf.Variable(tf.random.normal(shape)), tf.Variable(tf.zeros(shape[1])))
         for shape in layer_shapes]

We also need to redefine the `logits` function to account for the hidden layer (and therefore redefine `loss` for the new definition of `logits`).  The `predict` and `score` functions are the same as for logistic regression. 

In [None]:
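def logits(X):
    activations = X
    # Apply each hidden layer followed by the logistic (sigmoid) function.
    for W, b in layers[:-1]:
        activations = tf.nn.sigmoid(tf.matmul(activations, W) + b)

    # The final layer returns raw logits; the sigmoid is applied elsewhere.
    W, b = layers[-1]
    return tf.matmul(activations, W) + b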

def loss(X, y):
    def loss_():
        return tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=logits(X), labels=y))
    
    return loss_

def predict_proba(X):
    return tf.nn.sigmoid(logits(X))

def predict(X):
    return tf.cast(predict_proba(X) > 0.5, tf.float32)
    
def score(X, y):
    return tf.reduce_mean(tf.cast(tf.equal(predict(X), y), tf.float32))

And let's run it.  We need more steps to get all of the weights well-trained.

In [None]:
steps = 300
rounds = 10

eta = 0.5

opt = tf.keras.optimizers.SGD(learning_rate=eta)

for i in range(1, rounds+1):
    for j in range(steps):
        opt.minimize(loss(data, labels), [var for layer in layers for var in layer])
    print(f'Round {i}:')
    print(f'   Loss: {loss(data, labels)()}')
    print(f'   Accuracy: {score(data, labels)}')
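
Each call to `opt.minimize` in the loop above computes the gradient of the loss and takes one gradient-descent step. To make that less of a black box, here is a NumPy-only sketch of the same idea with the chain rule written out by hand. It is an illustration under stated assumptions, not TensorFlow's implementation: the data are regenerated with a different random generator, so the numbers will not match the cell above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Return hidden activations and predicted probabilities."""
    hidden = sigmoid(X @ W1 + b1)
    return hidden, sigmoid(hidden @ W2 + b2)

def loss(X, y, params):
    """Mean sigmoid cross-entropy over all samples."""
    p = forward(X, *params)[1]
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(X, y, params):
    """Chain rule applied from the output layer back to the input."""
    W1, b1, W2, b2 = params
    hidden, p = forward(X, *params)
    d_out = (p - y) / len(X)                        # dL/d(output logits)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)  # dL/d(hidden logits)
    return [X.T @ d_hid, d_hid.sum(axis=0), hidden.T @ d_out, d_out.sum(axis=0)]

# XOR-like data in the same layout as above, regenerated with NumPy only.
rng = np.random.default_rng(42)
centers = np.array([[0, 0]] * 100 + [[1, 1]] * 100 + [[0, 1]] * 100 + [[1, 0]] * 100)
X = rng.normal(0, 0.2, (400, 2)) + centers
y = np.array([[0.0]] * 200 + [[1.0]] * 200)

params = [rng.normal(size=(2, 2)), np.zeros(2), rng.normal(size=(2, 1)), np.zeros(1)]
eta = 0.5
print("loss before:", loss(X, y, params))
for step in range(300):
    grads = gradients(X, y, params)
    params = [w - eta * g for w, g in zip(params, grads)]  # SGD update
print("loss after: ", loss(X, y, params))
```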

We can verify the improved accuracy by examining our predictions.

In [None]:
mesh = np.column_stack([a.reshape(-1) for a in np.meshgrid(np.r_[-1:2:100j], np.r_[-1:2:100j])]).astype(np.float32)
ymesh = predict_proba(mesh).numpy()

plt.imshow(ymesh.reshape(100,100), cmap=plt.cm.RdYlBu, origin='lower',
           extent=(-1, 2, -1, 2), vmin=0, vmax=1)
plt.scatter(data[:, 0], data[:, 1], c=labels.ravel(), cmap=plt.cm.RdYlBu,
            edgecolor='w', lw=1)
plt.axis((-1, 2, -1, 2))
plt.colorbar();

### Exercise: Number of hidden neurons

Change the number of neurons in the hidden layer.  How does this change the predictions made by the model?  What happens when you add many neurons to this hidden layer?


## Miniprojects

### excerpt from _Assignment 1_

In [None]:
import grader

Other questions will require you to build a function that takes some input and returns a specified output. Build a function that takes a list of numbers as input and returns a list of their squares.

In [None]:
def square(x):
    ...

In this case, the `grader.score` method will ask the grader for the input, pass it to your function, and report the output back for grading. Before we submit our answer to the grader, let's discuss `grader.check`.

For some questions, we have checks along the way. These checks let you know if there are any potential problems with your solution. If the expression in `grader.check` is `True`, it will return `True`. If not, it will raise a `ValueError` and we should see how we can resolve the issue. In the check below, we make sure that the length of the output list is the same as the input list.

In [None]:
# Check to see that the input and output list are the same length.
grader.check(len(square([0, 1, 2])) == 3)
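
The behavior described for `grader.check` could be implemented along the following lines. This is a hypothetical stand-in written for illustration, not the actual `grader` module, and the `square` shown is just one possible solution to the exercise.

```python
def check(condition):
    """Return True if `condition` holds; otherwise raise a ValueError."""
    if not condition:
        raise ValueError("Check failed: revisit your solution before submitting.")
    return True

def square(numbers):
    """One possible solution: square each number in the input list."""
    return [n ** 2 for n in numbers]

# Same check as in the cell above, using the stand-in function.
print(check(len(square([0, 1, 2])) == 3))
```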

Now that we have passed the check, let's submit our `square` function.

In [None]:
grader.score("assignment1__square", square)

## Miniproject: New York Social Graph

New York Social Diary hosted 1000+ pages of photos.

[Each page](https://web.archive.org/web/20150913112351/http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers) has photographs with names in captions.

We can understand the social network by:
- Finding all photo pages
- Scraping all captions
- Detecting names in captions
- Building a connectivity graph
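
The last two steps can be sketched in a few lines of standard-library Python. The captions below are made up, and the name detector is a deliberately naive regular expression (two or more consecutive capitalized words); real captions need far more careful handling, which is part of what makes the miniproject challenging.

```python
import re
from collections import Counter
from itertools import combinations

# Made-up captions standing in for scraped photo captions.
captions = [
    "Alice Smith, Bob Jones, and Carol White",
    "Bob Jones and Alice Smith",
    "Carol White with Dan Brown",
]

# Naive name detector: two or more consecutive capitalized words.
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")

# Edge weights count how often two people appear in the same caption.
edges = Counter()
for caption in captions:
    names = sorted(set(NAME.findall(caption)))
    edges.update(combinations(names, 2))

for (a, b), weight in edges.most_common():
    print(f"{a} <-> {b}: {weight}")
```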

## Miniproject Feedback

Automatic grader provides feedback on answer, not code style.

Instructor-led code review sessions review student-submitted code.

Deep focus on one part of the project.
- Compare different approaches.
- Fix bad coding practices.

## Capstone Project

Data scientists must solve the *right* problems and communicate the results.

Students design and execute a full data science project.

The usual problems are not technical; they center on use cases.

Some recent projects:
- Finding broadband opportunities for under-served communities in Tennessee.
- Identifying useful product reviews.
- Finding shelter pets for adoption by image search.

## The Importance of Failure

You need both successes and failures to train an ML model.

Students learn as much from failures as from successes.

## The Benefits of Failure

Failure is the default state of programming.  Most time is spent debugging.

The marginal cost of code is $0. Known solutions would already be in a library.

Students need to learn how to find their way out of problems.

Hand-holding does not promote self-sufficiency.

## Google-fu is a Skill

<img src="https://i.redd.it/7lfrc6p5xna21.png" width="70%">

## Designing for Failure

More than just fill-in-the-blank.

Problems should involve aspects not covered previously.

Projects should have multiple routes to success.

## Helping Students Overcome Failure

Provide sketches or outlines of potential solutions.

Checkpoints let students check their own work.

Make instructors available:
- Lecture
- Office hours
- One-on-one meetings
- Slack

## Instructor Failures are Teaching Moments

Students learn the most when the instructor gets stuck.

Yes, some _schadenfreude_.

But also, an invaluable chance to teach debugging skills.

$\Rightarrow$ Get into trouble while teaching!

## Still Learning: Demoing Failure

Prepared failures don't resonate as well.

It's hard to fake the panic of facing a bewildering bug.

Part of bug hunting is going down, and then abandoning, blind alleys.

## Still Learning: Teaching Generalization

Some students want a flow chart to follow.

They repeat examples step by step, without understanding the logic.

These students struggle mightily on miniprojects.

## Still Learning: Magical Thinking

Some students treat code as a magical incantation.

They rearrange syntax until it runs.

They seem uninterested in understanding *why* code failed.

## Find Out More

- Data Science Fellowship
- Data Engineering Fellowship
- Private Training

[thedataincubator.com](https://www.thedataincubator.com)

Robert Schroll &mdash; robert@thedataincubator.com

Visit us at booth #9!

Slides: [bit.ly/tdi-odsc](https://bit.ly/tdi-odsc)