# The Data Science Design Manual

Notes by Tobias Reaper

---
---

## Introduction

Fundamental principles of becoming a good data scientist:

* Valuing doing the simple things right
  * Understanding the application domain
  * Cleaning and integrating relevant data sources
  * Presenting your results clearly to others
* Developing mathematical intuition
  * Particularly statistics and linear algebra
  * Why the concepts were developed, how they are useful, and when they work best
* Think like a computer scientist, but act like a statistician

---
---

## Chapter 1: What is Data Science?

### 1.2: Asking Interesting Questions from Data

Good data scientists have wide-ranging interests. They read the newspaper every day to get a broader perspective on what is exciting. They understand that the world is an interesting place. Knowing a little something about everything equips them to play in other people's backyards. They are brave enough to get out of their comfort zones a bit, and driven to learn more once they get there.

#### Baseball

First example / exercise is baseball. Here are (interesting?) questions I came up with:

- Who are the most expensive players, and how did they perform compared with the least expensive? Or compared with the median?
- How much does a home run cost for the most/least expensive players?
- Do height and weight determine the length of a player's career?
- Are the most valuable players those who both bat and throw well / are well-rounded?
- Do star players help a team win championships?

Some interesting demographic ones:

- How often do people return to live in the same place where they grew up?
- Do lefties live longer than righties?

#### IMDb



---
---

## Chapter 2: Mathematical Preliminaries

---

### 2.1 Probability

> Probability theory provides a formal framework for reasoning about the likelihood of events.

- An experiment is a procedure yielding one of a set of possible outcomes
  - On-going example: tossing two 6-sided dice, one red and one blue
- A _sample space_ $S$ is the set of possible outcomes of an experiment
  - In ex: there are 36 possible outcomes. 
- An _event_ $E$ is a specified subset of the outcomes of an experiment
- The _probability of an outcome_ $s$, denoted $p(s)$, is a number with two properties
  - For each outcome $s$ in sample space $S$, $0 \leq p(s) \leq 1$
  - The sum of probabilities of all outcomes adds to one
- The _probability of an event_ $E$ is the sum of the probabilities of the outcomes in $E$
  - An easier method of calculating the probability is via the complement of $E$: $P(E) = 1 - P(\bar{E})$
- A _random variable_ $V$ is a numerical function on the outcomes of a probability space
- The _expected value_ of a random variable $V$ defined on sample space $S$ is ...
  - the probability of each outcome times the value of $V$ at that outcome, summed over all outcomes

$E(V) = \sum_{s \in S} p(s) \cdot V(s)$
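
A minimal sketch of these definitions for the two-dice example, assuming fair dice (so every outcome has probability 1/36). The names `sample_space`, `p`, and `total` are illustrative choices, not from the book.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered (red, blue) pairs from two 6-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

# Assumption: fair dice, so each of the 36 outcomes has p(s) = 1/36.
p = {s: Fraction(1, len(sample_space)) for s in sample_space}

def total(s):
    """Random variable V(s): the sum of the two dice."""
    return s[0] + s[1]

# E(V) = sum over s in S of p(s) * V(s)
expected_total = sum(p[s] * total(s) for s in sample_space)

print(len(sample_space))  # 36
print(sum(p.values()))    # 1
print(expected_total)     # 7
```

Using `Fraction` keeps the arithmetic exact, so the probabilities sum to exactly 1 rather than to a nearby float.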

#### 2.1.1 Probability vs. Statistics

- Probability deals with predicting the likelihood of future events; theoretical math
  - Probability theory enables us to find the consequences of a given ideal world
- Statistics involves the analysis of the frequency of past events; applied math
  - Statistical theory enables us to measure the extent to which our world is ideal

#### 2.1.2 Compound Events and Independence

- Intersection
  - The outcomes in common between both events $A$ and $B$ are the intersection: $A \cap B$
  - Written as $A \cap B = A - (S - B)$
- Union
  - The outcomes in which either $A$ or $B$ appear are the union: $A \cup B$
- Events $A$ and $B$ are independent if and only if ${P(A \cap B) = P(A) \times P(B)}$
  - prob of intersection of A and B is equal to prob of A times prob of B
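
The set identities and the independence test can be checked by brute force on the two-dice sample space. This is a sketch assuming fair dice; the three events below are hypothetical examples picked for illustration.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product(range(1, 7), repeat=2))  # (red, blue) pairs

def prob(event):
    """Probability of an event (a set of outcomes), assuming fair dice."""
    return Fraction(len(event), len(sample_space))

# Hypothetical events chosen for illustration.
red_even = {s for s in sample_space if s[0] % 2 == 0}
blue_six = {s for s in sample_space if s[1] == 6}
sum_is_12 = {s for s in sample_space if s[0] + s[1] == 12}

# Intersection written two ways: A ∩ B and A - (S - B).
same_set = (red_even & blue_six) == red_even - (sample_space - blue_six)

def independent(a, b):
    """True when P(a ∩ b) == P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)

print(same_set)                          # True
print(independent(red_even, blue_six))   # True
print(independent(red_even, sum_is_12))  # False
```

The red die being even says nothing about the blue die, but it does change the chance of rolling a 12, which needs a red 6.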

#### 2.1.3 Conditional Probability

- Interested in the likelihood of event A as a function of some evidence B
- The conditional probability of $A$ given $B$, $P(A \mid B)$ is defined as

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Example:

- Event $A$ is that at least one of the two dice be an even number
- Event $B$ is the sum of the two dice is either a 7 or 11

Explanation:

- $P(A \mid B) = 1$ because any roll summing to an odd number must have one even and one odd die.
- $A \cap B = B$: every outcome in $B$ is also in $A$, so the intersection is $B$ itself
- $P(A \cap B) = P(B) = \frac{8}{36}$ (six ways to roll a 7, two ways to roll an 11) and $P(A) = 1 - \frac{9}{36} = \frac{27}{36}$ (every roll except the nine where both dice are odd)
- $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{8}{27}$

> The primary tool for computing conditional probabilities is Bayes' theorem:

$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}$

Using Bayes' theorem:

$P(B \mid A) = \frac{1 \cdot 8/36}{27/36} = 8/27$

(same result as above.)
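
The example can be checked by enumerating all 36 outcomes and counting. A sketch assuming fair dice; `prob` and `cond_prob` are helper names introduced here.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes), assuming fair dice."""
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b)."""
    return prob(a & b) / prob(b)

# A: at least one die is even.  B: the dice sum to 7 or 11.
A = {s for s in sample_space if s[0] % 2 == 0 or s[1] % 2 == 0}
B = {s for s in sample_space if sum(s) in (7, 11)}

p_a_given_b = cond_prob(A, B)
p_b_given_a = cond_prob(B, A)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
bayes = p_a_given_b * prob(B) / prob(A)

print(prob(A), prob(B))
print(p_a_given_b, p_b_given_a)
print(p_b_given_a == bayes)  # True: Bayes' theorem agrees with the definition
```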

#### 2.1.4 Probability Distributions

- Random variables are numerical functions with values associated with probabilities of occurrence
- Random variables can be represented by their _probability density function_, or pdf
  - x-axis is range of values the variable can take on
  - y-axis shows probability of that value
- pdf plots are related to histograms of data frequency
  - in this case, y-axis is the observed frequency of each value of x
- histogram -> pdf: divide each bucket by the total frequency over all buckets
  - the sum of the entries becomes 1, and we have a probability distribution
- Random variables can also be visualized with a _cumulative distribution function_, or cdf
  - the running sum of probabilities in the pdf
  - its value at $x$ is the probability that the variable is less than or equal to $x$
- pdf and cdf contain the same information

---

### 2.2 Descriptive Statistics

- Provide methods of capturing properties of a dataset or sample
  - aggregation as data reduction
- Two main types
  - Central tendency measures
  - Variability measures

#### 2.2.1 Centrality Measures

- Mean (arithmetic): sum the values and divide by the number of observations
  - Good for characterizing symmetric distributions without outliers
- Geometric mean: the nth root of the product of n values (multiply all values, then take the nth root)
  - Always less than or equal to the arithmetic mean
  - Very sensitive to values near zero (a single 0 destroys all meaning - like an outlier of infinity in arithmetic)
  - Good for averaging ratios
- Median: exact middle of a sorted array of values
  - Must be an actual observation (unless taking the arithmetic mean of middle two values)
  - Generally better for skewed distributions or data with outliers
- Mode: most frequent element in the data
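
Python's `statistics` module covers all four measures. A quick sketch on a made-up sample with one large outlier shows how differently they react:

```python
import statistics

# Hypothetical right-skewed sample with one large outlier.
data = [1, 2, 2, 3, 4, 100]

mean = statistics.mean(data)             # arithmetic mean, dragged up by the outlier
gmean = statistics.geometric_mean(data)  # nth root of the product (Python 3.8+)
median = statistics.median(data)         # even count: mean of the middle two values
mode = statistics.mode(data)             # most frequent element

print(median, mode)   # 2.5 2
print(gmean <= mean)  # True
print(mean > median)  # True: the outlier moves the mean, not the median
```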

#### 2.2.2 Variability Measures

- Standard deviation is the most common
  - Square root of the average squared difference between individual elements and the mean
  - variance is stdev squared (i.e. they measure the same thing)
    - variance is very sensitive to outliers
- Population vs sample: divide by $n$ or $n - 1$?
  - Full population: $n$
  - Sample: $n - 1$
  - Example:
    - Sampling one point doesn't say anything about underlying variance of any population
    - But it makes sense to say there is zero variance in X among the population of a one-person island
  - For reasonable-sized datasets, they are going to be about the same, so it doesn't really matter
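
The `statistics` module exposes both conventions: `pvariance` divides by $n$ and `variance` divides by $n - 1$. A small sketch with made-up heights, including the one-point case from the example above:

```python
import statistics

heights = [160, 165, 170, 175, 180]  # hypothetical sample, in cm

pop_var = statistics.pvariance(heights)    # divides by n
sample_var = statistics.variance(heights)  # divides by n - 1

print(pop_var, sample_var)  # population: 50, sample: 62.5

# One observation: population variance is 0, sample variance is undefined.
print(statistics.pvariance([170]))
try:
    statistics.variance([170])
except statistics.StatisticsError as err:
    print("sample variance undefined:", err)
```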

#### 2.2.3 Interpreting Variance

- _Sampling errors_ happen when observations capture unrepresentative circumstances
- _Measurement errors_ represent the limits of precision inherent in any sensing device
- _Signal to noise ratio_: how much the observations measure the actual quantity rather than the data variance
- Example: weighing yourself in the morning
  - when you last ate = sampling error
  - quality / age of the scale = measurement error
  - changes in body mass = actual variation
- Distressingly (for data scientists), much of what happens in the world is random fluctuations or arbitrary coincidence

#### 2.2.4 Characterizing Distributions

- Central tendencies by themselves don't do much to describe distributions
- Combined with measures of variability, a decent job can be done describing any distribution
- Best to always report both when characterizing a distribution

---

### 2.3 Correlation Analysis

- x and y are correlated when x has some predictive power of y
- Correlations around 0 are useless for predictions

#### 2.3.1 Correlation Coefficients: Pearson and Spearman Rank

- Measure somewhat different things but both operate on scale of -1 to 1
- Pearson Correlation Coefficient
  - How well a linear predictor (line) can fit the data
  - More prominent; shows overall the weight of points above and below the mean
  - Can be relatively easily tricked to show zero correlation when there is obvious predictive power
  - $r = ... = \frac{Cov(X, Y)}{\sigma (X) \sigma (Y)}$
  - $Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
- Spearman Rank Correlation Coefficient
  - Counts the number of pairs of input points that are out of order
  - Gives high score to non-linear but monotonic functions
  - Less sensitive to extreme outliers than Pearson
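
A sketch comparing the two coefficients in plain Python. One assumption to flag: Spearman is computed here the standard way, as the Pearson correlation of the ranks, rather than by literally counting out-of-order pairs. The rank helper ignores ties for simplicity.

```python
import math

def pearson(xs, ys):
    """Pearson r = Cov(X, Y) / (sigma(X) * sigma(Y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = list(range(1, 11))
cubic = [x ** 3 for x in xs]             # monotonic but non-linear
parabola = [(x - 5.5) ** 2 for x in xs]  # fully determined by x, but symmetric

print(round(spearman(xs, cubic), 6))      # 1.0
print(pearson(xs, cubic) < 1)             # True
print(abs(pearson(xs, parabola)) < 1e-9)  # True: Pearson sees no linear trend
```

The parabola is the "tricked" case: $y$ is completely predictable from $x$, yet the Pearson coefficient is zero because the relationship is not linear.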

#### 2.3.2 The Power and Significance of Correlation

- "if you want to fit your data with a straight line, best to sample only two points."
- Strength of correlation, $R^2$: square of sample correlation coefficient, $r^2$
  - Estimates the proportion of variance in Y explained by X in a simple linear regression
    - OLS line, by its very nature, will result in residuals with a mean of 0
    - If totally uncorrelated, $V(y) \approx V(r)$, where $V(r)$ here is the variance of the residuals - i.e. the fit contributes nothing
- Statistical significance
  - Depends on sample size $n$ and $r$
  - Traditionally, a correlation of $n$ points is significant if there is at most an $\alpha = 1/20 = 0.05$ chance of seeing a correlation as strong as $r$ in a random set of $n$ points
  - Not a very strong standard, as that significance can be somewhat easily achieved with large enough samples
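
The "proportion of variance explained" reading of $r^2$ can be checked numerically. A sketch with a closed-form least-squares fit on made-up data; the helper names are illustrative.

```python
def mean(vs):
    return sum(vs) / len(vs)

def variance(vs):
    """Population variance (divide by n)."""
    m = mean(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

# Hypothetical data: roughly linear with a little noise.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Closed-form OLS fit: y = slope * x + intercept.
mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
slope = cov / variance(xs)
intercept = my - slope * mx

residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

r = cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)
explained = 1 - variance(residuals) / variance(ys)

print(abs(mean(residuals)) < 1e-9)     # True: OLS residuals average to 0
print(abs(explained - r ** 2) < 1e-9)  # True: 1 - V(residuals)/V(y) equals r^2
```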

#### 2.3.3 Correlation Does Not Imply Causation!

- Treating correlation as proof of causation is a common error in thinking

#### 2.3.4 Detecting Periodicities by Autocorrelation

- Compare a sequence to itself
- A peak at a shift of 7 days (and multiples thereof) means there is a weekly periodicity
- Important concept in predicting future events: use previous observations as features in a model
- Autocorrelation function tends to be highest for short lags; long-term predictions are less accurate
- Efficient algorithm based on the _fast Fourier transform (FFT)_ makes it possible to calculate autocorrelation functions for long sequences
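
A sketch of the idea on a made-up daily series with a weekly pattern. This computes each lag directly, which is fine for short sequences; it is not the FFT-based algorithm mentioned above.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return pearson(series[:-lag], series[lag:])

# Hypothetical daily sales: the same weekly pattern repeated for 8 weeks.
weekly_pattern = [10, 12, 11, 13, 15, 30, 28]
series = weekly_pattern * 8

acf = {lag: autocorrelation(series, lag) for lag in range(1, 15)}
peaks = [lag for lag, value in acf.items() if value > 0.99]

print(peaks)  # [7, 14]
```

Real data would be noisy, so the peaks would be lower than 1 and the threshold would need tuning; the 0.99 cutoff only works because this toy series repeats exactly.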

---

### 2.4 Logarithms

- The logarithm is the inverse exponential function
  - $y = b^x$, which can be rewritten as $x = \log_b y$
  - Same as $b^{\log_b y} = y$
- Logarithms grow at a very slow rate
  - Associated with processes of repeated multiplication or division

#### 2.4.1 Logarithms and Multiplying Probabilities

- Logarithms are still very important, particularly in the multiplication of long chains of probabilities
  - Probabilities are small numbers
  - Multiplying many probabilities yields very small numbers governing the chances of very rare events
- Computers don't deal with very small floating point numbers well: the product can underflow to zero or lose precision
  - Summing the logarithms of probabilities is much more numerically stable than multiplying them
- Logarithms of probabilities are all negative numbers except $\log(1) = 0$
  - Something to be aware of; watch out for negative symbols in strange places
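
A small demonstration of the underflow problem, using a made-up chain of 1000 events with probability 0.01 each:

```python
import math

# Hypothetical chain: 1000 independent events, each with probability 0.01.
probs = [0.01] * 1000

product = math.prod(probs)                 # true value is 1e-2000: underflows
log_sum = sum(math.log(p) for p in probs)  # stays comfortably in range

print(product)  # 0.0
print(log_sum)  # about -4605.17
```

Two chains can still be compared through their log sums even when both raw products collapse to `0.0`.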

#### 2.4.2 Logarithms and Ratios

- Doing things like averaging ratios is committing a statistical sin
  - $200/100$ is $100\%$ above baseline, while $100/200$ is $50\%$ below
  - Same magnitude difference, but wildly different results
  - e.g. average the above changes, and it seems like an overall _increase_, though the net change should be zero
    - $200/100 = 2$; $100/200 = 0.5$ - average them together: $\frac{2.5}{2} = 1.25$
  - Doing the same thing with logs gives the correct answer
    - $\log_2 2 = 1$; $\log_2 (1/2) = -1$, which average to $0$
- Always plot the logarithm of ratios, not the ratio values themselves
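
The same arithmetic in code. Averaging in log space and transforming back is equivalent to taking the geometric mean of the ratios:

```python
import math
import statistics

ratios = [200 / 100, 100 / 200]  # a doubling and a halving

naive_mean = statistics.mean(ratios)
mean_log2 = statistics.mean([math.log2(r) for r in ratios])
back_transformed = 2 ** mean_log2  # geometric mean of the ratios

print(naive_mean, mean_log2, back_transformed)  # 1.25 0.0 1.0
```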

#### 2.4.3 Logarithms and Normalizing Skewed Distributions

- Logarithms are great at transforming skewed data into something more normally-distributed
  - e.g. power law distributions
- Sometimes logs are too drastic; can try square root
  - Plot the frequency distributions to see what comes up with the most normal distribution
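
One way to compare transforms without plotting is to look at sample skewness, which is near 0 for symmetric data. A sketch on synthetic log-normal data, chosen because its log is normal by construction; real data will rarely be this clean.

```python
import math
import random

def skewness(values):
    """Sample skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum((v - m) ** 3 for v in values) / (n * sd ** 3)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical heavy-tailed data.
raw = [rng.lognormvariate(0, 1) for _ in range(5000)]

rooted = [math.sqrt(v) for v in raw]
logged = [math.log(v) for v in raw]

for name, vals in [("raw", raw), ("sqrt", rooted), ("log", logged)]:
    print(f"{name:>4}: skewness = {skewness(vals):.2f}")
```

Here the square root only partly tames the tail while the log removes almost all of the skew; for less extreme data the square root may be the better fit.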

#### 2.5 - 2.6

- 2.5 is the war story about using logarithms to find more pronounced patterns in gene coding
- 2.6 is the chapter notes

---

### 2.7 Exercises

> Probability

#### 2-1

Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like both. Given that a randomly sampled person likes peanut butter, what is the probability that she also likes jelly?

$P(PB) = 0.80$, $P(J) = 0.89$, and $P(PB \cap J) = 0.78$

    p_pb = 0.8
    p_j = 0.89
    p_both = 0.78

    # Dependent if not equal to 0.78
    print(p_pb * p_j)  # 0.7120000000000001


These events are not independent, because $P(PB \cap J) \neq P(PB) \times P(J)$. Or, these events are dependent.

Looking for the probability of liking jelly, given she likes pb. Using Bayes Theorem: $P(J \mid PB) = \frac{P(PB \mid J)P(J)}{P(PB)}$

First, we need the probability of pb given jelly: $P(PB \mid J) = \frac{P(PB \cap J)}{P(J)} = \frac{0.78}{0.89} = 0.8764$

    # Prob of pb given jelly
    p_pb_j = p_both / p_j
    print(p_pb_j)  # 0.8764044943820225

Then, we have all the values needed for Bayes' theorem:

$P(J \mid PB) = \frac{P(PB \mid J)P(J)}{P(PB)} = \frac{0.8764 \cdot 0.89}{0.8} = 0.975$

    # Plug into Bayes' theorem
    p_j_pb = (p_pb_j * p_j) / p_pb
    print(p_j_pb)  # 0.975

#### 2-2

Suppose that $P(A) = 0.3$ and $P(B) = 0.7$

1. Can you compute $P(A \text{ and } B)$ if you only know $P(A)$ and $P(B)$?

Only if the events are independent. If they are independent, then $P(A \cap B) = P(A) \times P(B) = 0.3 \times 0.7 = 0.21$

2. Assuming that events $A$ and $B$ arise from independent random processes:
- What is $P(A \text{ and } B)$ / $P(A \cap B)$?

$P(A \cap B) = P(A) \times P(B) = 0.3 \times 0.7 = 0.21$

- What is $P(A \text{ or } B)$ / $P(A \cup B)$?

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.3 + 0.7 - 0.21 = 0.79$

- What is $P(A \mid B)$?

If the events are independent, the probability of A does not change at all given other information. So, $P(A \mid B) = P(A) = 0.3$.

#### 2-3

Consider a game where your score is the maximum value from two dice. Compute the probability of each event from {1, ..., 6}.

I believe this is asking for the probability that my score (the larger of the two dice) equals each value. The maximum equals $k$ when both dice are at most $k$ but not both at most $k - 1$, so $k^2 - (k - 1)^2 = 2k - 1$ of the 36 outcomes qualify:

- 1 => {(1, 1)} => $\frac{1}{36} = 0.0278$
- 2 => {(1, 2), (2, 1), (2, 2)} => $\frac{3}{36} = \frac{1}{12} = 0.0833$
- 3 => $\frac{5}{36} = 0.1389$
- 4 => $\frac{7}{36} = 0.1944$
- 5 => $\frac{9}{36} = \frac{1}{4} = 0.25$
- 6 => $\frac{11}{36} = 0.3056$
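
A brute-force check of the distribution of the maximum, assuming fair dice: enumerate all 36 rolls and count how often each value is the larger die.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))
counts = Counter(max(roll) for roll in rolls)
p_max = {k: Fraction(counts[k], len(rolls)) for k in range(1, 7)}

for k, p in p_max.items():
    print(k, p, round(float(p), 4))

# Each probability matches the closed form (2k - 1) / 36.
print(all(p_max[k] == Fraction(2 * k - 1, 36) for k in p_max))  # True
```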

#### 2-4

Prove that the cumulative distribution function of the maximum of a pair of values drawn from random variable $X$ is the square of the original distribution function of $X$.

#### 2-5

If two binary random variables $X$ and $Y$ are independent, are $\bar{X}$ (the complement of $X$) and $Y$ also independent? Give a proof or a counterexample.