# The Data Science Design Manual

Notes by Tobias Reaper

---

## Introduction

Fundamental principles of becoming a good data scientist:

* Valuing doing the simple things right
  * Understanding the application domain
  * Cleaning and integrating relevant data sources
  * Presenting your results clearly to others
* Developing mathematical intuition
  * Particularly statistics and linear algebra
  * Why the concepts were developed, how they are useful, and when they work best
* Think like a computer scientist, but act like a statistician

---

## Chapter 1: What is Data Science?

### 1.2: Asking Interesting Questions from Data

Good data scientists have wide-ranging interests. They read the newspaper every day to get a broader perspective on what is exciting. They understand that the world is an interesting place. Knowing a little something about everything equips them to play in other people's backyards. They are brave enough to get out of their comfort zones a bit, and driven to learn more once they get there.

#### Baseball

First example / exercise is baseball. Here are (interesting?) questions I came up with:

- Who are the most expensive players, and how did they perform compared with the least expensive? Or compared with the median?
- How much does a home run cost for the most/least expensive players?
- Do height and weight determine the length of a player's career?
- Are the most valuable players those who both bat and throw well / are well-rounded?
- Do star players help a team win championships?

Some interesting demographic ones:

- How often do people return to live in the same place where they grew up?
- Do lefties live longer than righties?

#### IMDb



---

## Chapter 2: Mathematical Preliminaries

### 2.1 Probability

> Probability theory provides a formal framework for reasoning about the likelihood of events.

- An experiment is a procedure yielding one of a set of possible outcomes
  - On-going example: tossing two 6-sided dice, one red and one blue
- A _sample space_ $S$ is the set of possible outcomes of an experiment
  - In the example, there are 36 possible outcomes: one for each (red, blue) pair
- An _event_ $E$ is a specified subset of the outcomes of an experiment
- The _probability of an outcome_ $s$, denoted $p(s)$, is a number with two properties
  - For each outcome $s$ in sample space $S$, $0 \leq p(s) \leq 1$
  - The probabilities of all outcomes sum to one: $\sum_{s \in S} p(s) = 1$
- The _probability of an event_ $E$ is the sum of the probabilities of the outcomes in $E$: $P(E) = \sum_{s \in E} p(s)$
  - It is often easier to calculate this via the complement of $E$: $P(E) = 1 - P(\bar{E})$
- A _random variable_ $V$ is a numerical function on the outcomes of a probability space
- The _expected value_ of a random variable $V$ defined on sample space $S$ is
  - the probability of each outcome times the value of $V$ at that outcome, summed over all outcomes

$E(V) = \sum_{s \in S} p(s) \cdot V(s)$
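
A minimal sketch of the definitions above, using the two-dice example. It assumes fair dice (every outcome equally likely), and the names `prob`, `E_bar`, and `V` are just illustrative choices, not from the book. The random variable here is the sum of the two dice.

```python
from fractions import Fraction
from itertools import product

# Sample space S: every (red, blue) pair from two 6-sided dice.
S = list(product(range(1, 7), repeat=2))

# Assumption: fair dice, so each outcome has the same probability.
p = {s: Fraction(1, len(S)) for s in S}

def prob(event):
    """P(E): sum of p(s) over the outcomes s in event E."""
    return sum((p[s] for s in event), Fraction(0))

# Event E: at least one die shows an even number.
E = {s for s in S if s[0] % 2 == 0 or s[1] % 2 == 0}
# Complement of E: both dice are odd.
E_bar = set(S) - E

# Random variable V: the sum of the two dice.
def V(s):
    return s[0] + s[1]

# E(V) = sum over all outcomes of p(s) * V(s)
expected_V = sum(p[s] * V(s) for s in S)

print(len(S))                     # 36
print(prob(E), 1 - prob(E_bar))   # 3/4 3/4
print(expected_V)                 # 7
```

Using `Fraction` keeps the arithmetic exact, so the complement rule can be checked with `==` rather than a floating-point tolerance.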

#### 2.1.1 Probability vs. Statistics

- Probability deals with predicting the likelihood of future events; theoretical math
  - Probability theory enables us to find the consequences of a given ideal world
- Statistics involves the analysis of the frequency of past events; applied math
  - Statistical theory enables us to measure the extent to which our world is ideal
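
One way to make the distinction concrete, as a rough sketch: derive a probability exactly from an assumed ideal world (fair dice), then measure the frequency of the same event in observed rolls. The rolls here are simulated with a fixed seed, which is my own stand-in for real data; the event "sum is 7" is also just an illustrative choice.

```python
import random
from itertools import product

# Probability: in an ideal world of fair dice, derive P(sum = 7) exactly.
S = list(product(range(1, 7), repeat=2))
theoretical = sum(1 for r, b in S if r + b == 7) / len(S)

# Statistics: observe many rolls and measure how often the sum was 7.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_rolls = 100_000
hits = sum(
    1 for _ in range(n_rolls)
    if rng.randint(1, 6) + rng.randint(1, 6) == 7
)
observed = hits / n_rolls

print(theoretical, observed)
```

If the observed frequency drifted far from the theoretical value, that would be evidence that the dice (the "world") are not as ideal as assumed.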

#### 2.1.2 Compound Events and Independence

- Intersection
  - The outcomes in common between both events $A$ and $B$ are the intersection: $A \cap B$
  - Equivalently, in terms of set difference: $A \cap B = A - (S - B)$
- Union
  - The outcomes that appear in $A$, in $B$, or in both are the union: $A \cup B$
- Events $A$ and $B$ are independent if and only if ${P(A \cap B) = P(A) \times P(B)}$
  - That is, the probability of both events occurring equals the probability of $A$ times the probability of $B$
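
A small sketch of these set operations on the two-dice sample space, again assuming fair dice. The events `red_even`, `blue_even`, and `high_sum` are hypothetical examples I picked to show one independent pair and one dependent pair.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # (red, blue) outcomes

def prob(event):
    # Assumes fair dice: every outcome is equally likely.
    return Fraction(len(event), len(S))

def independent(A, B):
    """True iff P(A ∩ B) == P(A) * P(B)."""
    return prob(A & B) == prob(A) * prob(B)

red_even = {s for s in S if s[0] % 2 == 0}
blue_even = {s for s in S if s[1] % 2 == 0}
high_sum = {s for s in S if s[0] + s[1] >= 10}

# Set identity from the notes: A ∩ B = A - (S - B)
assert red_even & blue_even == red_even - (S - blue_even)

print(independent(red_even, blue_even))  # True
print(independent(red_even, high_sum))   # False
```

The two dice do not influence each other, so events about different dice are independent. An event about the sum depends on both dice, so it is generally not independent of an event about one of them.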

#### 2.1.3 Conditional Probability

- The conditional probability of $A$ given $B$, $P(A \mid B)$, is defined (for $P(B) > 0$) as

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Example:

- Event $A$: at least one of the two dice shows an even number
- Event $B$: the sum of the two dice is either 7 or 11
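
Working the example through by brute-force enumeration, as a sketch that assumes fair dice. The helper names `prob` and `cond_prob` are my own.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # (red, blue) outcomes

def prob(event):
    # Assumes fair dice: every outcome is equally likely.
    return Fraction(len(event), len(S))

def cond_prob(A, B):
    """P(A | B) = P(A ∩ B) / P(B); undefined when P(B) = 0."""
    if not B:
        raise ValueError("P(B) must be positive")
    return prob(A & B) / prob(B)

# Event A: at least one die is even.
A = {s for s in S if s[0] % 2 == 0 or s[1] % 2 == 0}
# Event B: the sum is 7 or 11.
B = {s for s in S if s[0] + s[1] in (7, 11)}

print(prob(A))          # 3/4
print(prob(B))          # 2/9
print(cond_prob(A, B))  # 1
```

The result $P(A \mid B) = 1$ makes sense: 7 and 11 are both odd, and an odd sum requires exactly one even die, so every outcome in $B$ is also in $A$ (that is, $A \cap B = B$). Since $P(A \mid B) \neq P(A)$, the two events are not independent.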