# Technology Innovation 510
## Introduction to Data Science Methods: Data Science and Visualization

**Instructor**: Wesley Beckner

**Contact**: wesleybeckner@gmail.com

<br>

---

<br>

🎉 Today, we'll be working from this _digital_ notebook to complete exercises! If you don't have a computer, not to worry. Grab a notepad and pencil to write down your ideas and notes! 🎉

<br>

---

# Preparing Notebook for Demos

## Importing Packages

Once we have our packages installed, we need to import them. We can also import packages that are pre-installed in the Colab environment.

In [None]:
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Importing Data

We also have the ability to import data, and use it elsewhere in the notebook 📝!

In [None]:
# No Data Today! :)

## 📊 What is Data Science?

### The Emergence of Data Science

Data Science is a broad field, and depending on who you talk to, it can mean different things. In short, it emerged as many independent scientific fields began accumulating large amounts of data. At the UW in particular, these were dominated by the astronomy and oceanography departments. Folks began to realize that they needed a particular set of tools to handle large amounts of data. This culminated in the [eScience Institute](https://escience.washington.edu/), which began to service the data needs of many departments on campus.

Today, data science is not only about large amounts of data; it refers more generally to tools that allow us to work with a variety of data types. Because of this, machine learning is one tool within data science. But there are other tools apart from machine learning that make up the data science ecosystem. Some of them are:

* data visualization
* databases 
* statistics

You could argue for others as well (algorithms, web servers, programming, etc.), but these are the areas most commonly named.

#### 💭 7

We've talked a lot! Wow! Last topic. D A T A S C I E N C E. What is it? Any idea? Talk to your neighbor, convene together, then let's share. Do this at 2 different levels:

How would you explain data science to:
1. your grandmother
2. a student

#### 💬 7

I'll write these down. Let's see if we can all agree on a precise definition.

### Saying Stuff About Data (Statistics)

When we're talking about statistics, we're really talking about data storytelling. Statistics is at the C O R E of data science, really. Without a basic knowledge of statistics, it'll be hard for you to construct your data narratives and have them hold water.

Let's start with some simple examples of data storytelling, and use these to generate our own thoughts on the matter.

#### Anscombe's Quartet

There's a very famous anomaly in data science called Anscombe's quartet. Observe the following data:

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds1.png"></img>

We can construct this in Python and confirm the summary statistics ourselves:

In [None]:
df = pd.read_excel("https://github.com/wesleybeckner/technology_explorers/blob"\
                   "/main/assets/data_science/anscombes.xlsx?raw=true",
              header=[0,1])
df

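| | I X | I Y | II X | II Y | III X | III Y | IV X | IV Y |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 1 | 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 2 | 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 3 | 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 4 | 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 5 | 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 7 | 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 8 | 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 9 | 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |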


We can calculate the mean and variance of X and Y for samples I, II, III, and IV:

In [None]:
df.mean()

I    X    9.000000
     Y    7.500909
II   X    9.000000
     Y    7.500909
III  X    9.000000
     Y    7.500000
IV   X    9.000000
     Y    7.500909
dtype: float64

In [None]:
# do we remember the relationship between standard deviation and variance?
df.std()**2

I    X    11.000000
     Y     4.127269
II   X    11.000000
     Y     4.127629
III  X    11.000000
     Y     4.122620
IV   X    11.000000
     Y     4.123249
dtype: float64
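
The comment in the cell above points at the relationship: the variance is the square of the standard deviation. Here is a minimal sketch of that check, assuming the X column of dataset I. The last value (5) is not displayed in the table above; it is inferred from the printed mean of 9.0 over 11 rows. One detail worth knowing is that pandas reports the sample statistics (`ddof=1`) by default, while NumPy defaults to the population version (`ddof=0`), so the two can disagree unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# X column of dataset I; the last value (5) is inferred from the printed
# mean of 9.0 over 11 rows, since the table above displays only ten rows
x = pd.Series([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)

# pandas defaults to the sample statistics (ddof=1)
var_from_std = x.std() ** 2
var_direct = x.var()

# NumPy defaults to the population variance (ddof=0) unless told otherwise
var_numpy_default = np.var(x.to_numpy())
var_numpy_sample = np.var(x.to_numpy(), ddof=1)

print(var_from_std, var_direct, var_numpy_default, var_numpy_sample)
```

The two pandas results and the `ddof=1` NumPy result should all agree with the 11.0 printed above, while the NumPy default comes out smaller because it divides by N rather than N - 1.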

We talked about the equation for a line last time:

$$y(x)= m\cdot x + b$$ 

In [None]:
model = LinearRegression()
sets = ['I', 'II', 'III', 'IV']
for data in sets:
  # sklearn expects a 2D feature array; -1 lets numpy infer the number of rows
  model.fit(df[data]['X'].values.reshape(-1,1),
            df[data]['Y'])
  print("Linear Regression Line: Y = {:.2f}X + {:.2f}".format(model.coef_[0], model.intercept_))

Linear Regression Line: Y = 0.50X + 3.00
Linear Regression Line: Y = 0.50X + 3.00
Linear Regression Line: Y = 0.50X + 3.00
Linear Regression Line: Y = 0.50X + 3.00
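
For a single feature, the least-squares slope and intercept also have a closed form: $m = \sum_i (x_i-\bar{x})(y_i-\bar{y}) / \sum_i (x_i-\bar{x})^2$ and $b = \bar{y} - m\bar{x}$. The sketch below applies this to dataset I using NumPy only and cross-checks the result with `np.polyfit`. It is an illustration, not how `LinearRegression` is implemented internally. The eleventh data point (5, 5.68) is not displayed in the table above; it is inferred from the printed means over 11 rows.

```python
import numpy as np

# Dataset I; the last point (5, 5.68) is inferred from the printed means,
# since the table above displays only ten of the eleven rows
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# center both variables around their means
x_c = x - x.mean()
y_c = y - y.mean()

# closed-form least-squares slope and intercept
m = np.sum(x_c * y_c) / np.sum(x_c ** 2)
b = y.mean() - m * x.mean()

# cross-check: polyfit returns coefficients from highest degree to lowest
m_check, b_check = np.polyfit(x, y, deg=1)

print("Y = {:.2f}X + {:.2f}".format(m, b))
```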


$R^2$ measures the goodness of fit. It is generally defined as one minus the ratio of the residual sum of squares $SS_{\sf res}$ to the total sum of squares $SS_{\sf tot}$.

We already talked about the residual sum of squares last session (what were we trying to do with this equation??)

$$SS_{\sf res}=\sum_{i=1}^{N} \left(y^{\sf exact}_i - y^{\sf calc}_i\right)^2$$

We now define the total sum of squares, a measure of the total variance in the data:

$$SS_{\sf tot}=\sum_{i=1}^{N} \left(y^{\sf exact}_i-\bar{y}\right)^2$$

$R^2$ tells us how much of the variance in the data is captured by the model we created:

$$R^2 = 1 - {SS_{\sf res}\over SS_{\sf tot}}$$

In the expression for $SS_{\sf tot}$, $\bar{y}=\sum_i y^{\sf exact}_i/N$ is the average value of y for $N$ points. The best value of $R^2$ is 1, but it can also be negative when the model fits worse than simply predicting $\bar{y}$ for every point.

In [None]:
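for data in sets:
  # refit on the current dataset so the predictions come from its own line
  # (otherwise the model left over from the previous cell is reused)
  X = df[data]['X'].values.reshape(-1,1)
  y = df[data]['Y']
  model.fit(X, y)

  # calc the ssr (residual sum of squares)
  ssr = np.sum((y - model.predict(X))**2)

  # calc the sst (total sum of squares)
  sst = np.sum((y - y.mean())**2)

  # calc the r2
  r2 = 1 - (ssr/sst)
  print("R2 = {:.2f}".format(r2))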

R2 = 0.67
R2 = 0.67
R2 = 0.67
R2 = 0.67
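
To see how $R^2$ behaves for better and worse models, here is a small sketch using dataset I and a hypothetical helper, `r_squared`, that follows the formula above. It scores three sets of predictions: the rounded line reported earlier, a baseline that always predicts the mean, and a deliberately wrong line with the slope sign flipped (an assumption made up for this illustration). The eleventh data point (5, 5.68) is not displayed in the table above; it is inferred from the printed means over 11 rows.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Dataset I; the last point (5, 5.68) is inferred from the printed means
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

good = 0.5 * x + 3.0                   # the (rounded) line reported above
baseline = np.full_like(y, y.mean())   # always predict the mean of y
bad = -0.5 * x + 3.0                   # hypothetical line with the wrong slope sign

r2_good = r_squared(y, good)
r2_baseline = r_squared(y, baseline)
r2_bad = r_squared(y, bad)

print("good: {:.2f}, baseline: {:.2f}, bad: {:.2f}".format(
    r2_good, r2_baseline, r2_bad))
```

Predicting the mean everywhere gives an $R^2$ of zero by construction, so any model that scores below zero is doing worse than that baseline.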


As we can see, everything checks out. The summary statistics are nearly identical across all four datasets!

Can we answer the following:

> Which dataset is best described by the line of best fit?

We will revisit this question when we talk about data visualization.

#### Taxonomy of Data Types

Another important topic in data science is simply what kind of data we are working with. This will help us decide what kind of models to build, how to visualize our data, and perhaps how to store it as well.

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds2.png"></img>

#### 💬 8

What are some examples of the different data types we can think of?
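
As one possible starting point for the discussion, here is a sketch of how different kinds of data might be represented in pandas. It assumes the common split into categorical (nominal, ordinal) and numerical (discrete, continuous) data, which may not match the figure above exactly. The table `df_types`, its column names, and its values are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy records; column names and values are invented for illustration
df_types = pd.DataFrame({
    "color": ["red", "blue", "red"],         # nominal: categories with no order
    "size": ["small", "large", "medium"],    # ordinal: categories with an order
    "n_siblings": [0, 2, 1],                 # discrete: whole-number counts
    "height_cm": [150.5, 162.0, 158.3],      # continuous: measurements
})

# nominal data: an unordered categorical
df_types["color"] = df_types["color"].astype("category")

# ordinal data: an ordered categorical with an explicit ranking
df_types["size"] = pd.Categorical(
    df_types["size"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(df_types.dtypes)

# ordering is meaningful for ordinal data, so min/max are defined
smallest = df_types["size"].min()
print(smallest)
```

Telling pandas that `size` is ordered is what makes comparisons such as `min` meaningful; asking for the minimum of the unordered `color` column would raise an error instead.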

### Data Visualization

Data visualization, like it sounds, has to do with how we display and communicate information. At the end of the day, our findings and algorithms aren't worth very much if we can't share them with others.

#### Guiding Principles of Data Visualization

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds3.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds9.gif"></img>


<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds7.png"></img>

Wattenberg and Viégas visualization

In [None]:
%%HTML
<video width="640" height="580" controls>
  <source src="https://github.com/wesleybeckner/technology_explorers/blob/main/assets/data_science/ds4.mp4?raw=true" type="video/mp4">
</video>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds6.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds4.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds5.png"></img>


<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds8.png"></img>


<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds9.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds10.png"></img>


<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds11.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds12.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds13.png"></img>

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds14.jpg"></img>

#### Visualization Un-Examples

**Unexample 1**

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds15.jpg"></img>

**Unexample 2**

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds19.png"></img>

**Unexample 3**

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds17.png"></img>

**Unexample 4**

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds18.png"></img>

#### 💭 8

Find an example of an interactive data visualization online. Here's one I [found](https://www.migrationpolicy.org/programs/data-hub/charts/us-immigrant-population-state-and-county) that I thought was quite interesting!

#### 💬 8

Swap visualization links with your neighbor. What do you think could be improved about each one?

#### Back to Anscombe's Quartet

<p align=center>
<img src="https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/data_science/ds20.png"></img>
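
A four-panel view of the quartet can be sketched with matplotlib as follows. This is one way to draw it, not necessarily how the figure above was produced. The data are hardcoded so the cell does not depend on the spreadsheet download. The eleventh row of each dataset is not displayed in the table earlier; the values here are those of the standard published quartet and are consistent with the printed means over 11 rows. The fitted line uses the rounded coefficients reported earlier (0.50 and 3.00).

```python
import numpy as np
import matplotlib.pyplot as plt

# Sets I-III share the same X values; the last entry of every list is the
# eleventh row, which the table earlier in the notebook does not display
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

# one panel per dataset, with shared axes so the panels are comparable
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
line_x = np.array([3, 20])

for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    # same fitted line in every panel (rounded coefficients from above)
    ax.plot(line_x, 0.5 * line_x + 3.0, color="tab:red")
    ax.set_title("Dataset " + name)

fig.tight_layout()
```

In Colab the figure renders when the cell finishes; in a plain script, add `plt.show()` at the end. Sharing the axes across panels is a deliberate choice here, since it keeps the eye on the shape of each point cloud rather than on differing scales.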