# Practical Statistics

## Data Definition
> Distinct pieces of information

## Data Types
Quantitative and Categorical.
* **Quantitative** data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).
* **Categorical** data are used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.).

### Categorical Ordinal vs. Categorical Nominal
We can divide categorical data further into two types: **Ordinal** and **Nominal**.
* **Categorical Ordinal** data take on a ranked ordering (like a ranked interaction on a scale from `Very Poor` to `Very Good` with the dogs).
* **Categorical Nominal** data do not have an order or ranking (like the breeds of the dog).

### Quantitative Continuous vs. Discrete
We can think of quantitative data as being either continuous or discrete.

* **Continuous** data can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of the dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.

* **Discrete** data only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

![data_types](images/data_types.PNG)

### Another Look
To break down our data types, there are two main blocks:

### Quantitative and Categorical

**Quantitative** can be further divided into `Continuous` or `Discrete`.<br>

**Categorical** data can be divided into `Ordinal` or `Nominal`.

#### Quantitative vs. Categorical
Some of these can be a bit tricky - notice even though zip codes are a number, they aren’t really a quantitative variable. If we add two zip codes together, we do not obtain any useful information from this new value. Therefore, this is a categorical variable.<br>

**Height**, **Age**, the **Number of Pages** in a Book and **Annual Income** all take on values that we can add, subtract and perform other operations with to gain useful insight. Hence, these are `quantitative`.<br>

**Gender**, **Letter Grade**, **Breakfast Type**, **Marital Status**, and **Zip Code** can be thought of as labels for a group of items or individuals. Hence, these are `categorical`.

#### Continuous vs. Discrete
To consider if we have continuous or discrete data, we should see if we can split our data into smaller and smaller units. Consider time - we could measure an event in years, months, days, hours, minutes, or seconds, and even at seconds we know there are smaller units we could measure time in. Therefore, we know this data type is continuous. **Height**, **age**, and **income** are all examples of `continuous data`. Alternatively, the **number of pages** in a book, **dogs I count** outside a coffee shop, or **trees in a yard** are `discrete data`. We would not want to split our dogs in half.

#### Ordinal vs. Nominal
In looking at categorical variables, we found **Gender**, **Marital Status**, **Zip Code** and your **Breakfast items** are `nominal variables`, where there is no rank ordering associated with the data. Whether you ate cereal, toast, eggs, or only coffee for breakfast, there is no rank ordering associated with your breakfast.<br>

Alternatively, **Letter Grade** and **Survey Ratings** have a rank ordering associated with them, which makes them `ordinal data`. If you receive an A, this is higher than an A-. An A- is ranked higher than a B+, and so on. Ordinal variables frequently occur on rating scales from very poor to very good. In many cases we turn these ordinal variables into numbers, as we can more easily analyze them, but more on this later!
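
As a rough illustration of turning an ordinal variable into numbers, the sketch below maps each label to its rank. The five-level scale and the names `RATING_ORDER` and `rating_to_rank` are assumptions made for this example, not part of the original notes.

```python
# Hypothetical rating scale, listed from lowest to highest.
RATING_ORDER = ["Very Poor", "Poor", "Neutral", "Good", "Very Good"]

# Map each label to its rank (1 = lowest). The numbers preserve the order only;
# the gaps between ranks are not guaranteed to be equally meaningful.
rating_to_rank = {label: rank for rank, label in enumerate(RATING_ORDER, start=1)}

responses = ["Good", "Very Poor", "Very Good", "Good"]
ranks = [rating_to_rank[response] for response in responses]
print(ranks)  # [4, 1, 5, 4]
```

Because the ranks only encode order, summaries such as the median are usually safer on them than the mean.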

## Analyzing Quantitative Data

There are four main aspects to analyzing **Quantitative** data.

1. Measures of `Center`
2. Measures of `Spread`
3. The `Shape` of the data.
4. `Outliers`

> Analyzing **categorical data** has fewer parts to consider. Categorical data is usually analyzed by looking at
> 1. the counts, or
> 2. the proportion of individuals that fall into each group.
>
> For example, if we were looking at the breeds of the dogs, we would care about how many dogs are of each breed, or what proportion of dogs are of each breed type.
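
A minimal sketch of the counts and proportions described above, using only the standard library. The list of breeds is made-up sample data.

```python
from collections import Counter

# Made-up sample: the breed of each dog we met.
breeds = ["Collie", "Lab", "Poodle", "Lab", "Lab", "Collie"]

counts = Counter(breeds)   # how many dogs of each breed
total = len(breeds)
proportions = {breed: n / total for breed, n in counts.items()}

for breed in sorted(counts):
    print(f"{breed}: count={counts[breed]}, proportion={proportions[breed]:.2f}")
```

With pandas, `Series.value_counts()` gives the same counts, and dividing by the number of rows gives the proportions.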

### Measures of Center
Gives us an idea of an average element. There are three measures of center:
1. `Mean`
2. `Median`
3. `Mode`

#### The Mean
The mean is often called the **average** or the **expected value** in mathematics. 
> We calculate the mean by adding all of our values together, and dividing by the number of values in our dataset.
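
The calculation above can be sketched in a couple of lines. The values are arbitrary example data.

```python
values = [2, 4, 4, 5, 10]  # arbitrary example data

# Add all values together, then divide by how many there are.
mean = sum(values) / len(values)
print(mean)  # 5.0
```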

#### The Median
The `median` splits our data so that **50% of our values are lower and 50% are higher**. How we calculate the median depends on whether we have an even or an odd number of observations.<br>

In order to compute the median we MUST sort our values first.

> **For odd values**: If we have an odd number of observations, the median is simply the number in the direct middle.

> **For even values**: If we have an even number of observations, the median is the average of the two values in the middle.

Whether we use the `mean` or `median` to describe a dataset is largely dependent on the **shape** of our dataset and if there are any **outliers**.
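
The odd and even rules for the median can be sketched as follows. The function name `median` is chosen here for illustration. Python's built-in `statistics.median` follows the same rules.

```python
def median(values):
    """Return the median, following the odd/even rules described above."""
    ordered = sorted(values)  # the values MUST be sorted first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the value in the direct middle.
        return ordered[mid]
    # Even number of observations: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))     # 3
print(median([5, 1, 3, 8]))  # 4.0
```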

#### The Mode
The mode is the most frequently observed value in our dataset.<br>

> There might be multiple modes for a particular dataset, or no mode at all.

* **No Mode**: If all observations in our dataset are observed with the same frequency, there is no mode. If we have the dataset:
```
1, 1, 2, 2, 3, 3, 4, 4
```
There is no mode, because all observations occur the same number of times.

* **Many Modes**: If two (or more) numbers share the maximum frequency, then there is more than one mode. If we have the dataset:
```
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
```
There are two modes, 3 and 6, because these values share the maximum frequency of 3 occurrences, while all other values appear only once.
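
One possible way to compute the mode under the conventions used in this section (no mode when every value ties, several modes when more than one value shares the top frequency). The helper name `modes` is an assumption for this sketch. Note that `statistics.multimode` in the standard library would instead return every value when all of them tie.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    # Convention from this section: if all distinct values occur equally
    # often (and there is more than one distinct value), there is no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                 # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))  # [3, 6]
```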

### Notation Capital vs. Lower case letters for variables
**Random variables** are represented by `capital letters`. Once we observe an outcome of a random variable, we denote it with the lowercase version of the same letter, as follows:

`X` (capital) represents the amount of time individuals spend on the website, with the full set of observations written as:
```
[x1, x2, x3, x4, x5, ... , xn]
```
and `x` (lower) represents a single observation.<br>

As a quick recap, **capital letters** signify **random variables**. When we look at **individual instances** of a particular random variable, we identify these as **lowercase letters** with subscripts attached to each specific observation.

![notation](images/notation.PNG)

### Measures of Spread
> Gives us an idea of how the elements differ and how far are points from one another.

Measures of Spread are used to provide us an idea of how spread out our data are from one another. Common measures of spread include:

1. **Range**
2. **Interquartile Range (IQR)**
3. **Standard Deviation**
4. **Variance**

#### Histograms
Histograms help us understand how the data behave and give more information than the measures of center alone, as the following image of the number of dogs seen shows. Both distributions have about the same mean, median and mode, but on weekends we are more likely than on weekdays to see many more dogs than average, and also more likely to see far fewer dogs than average. These are examples of the insights a histogram provides.
![histogram](images/histogram.png) 

##### Quick notes
* Pay attention to the scale of histograms when comparing them.
* The median is the middle number and is not affected by outliers.
* The average factors in all the numbers so outliers will bring the average towards them.
* A left-skewed histogram has a long tail on the left: the frequencies start low and slope up toward a peak on the right. A right-skewed histogram peaks on the left, starting with high frequencies and sloping down into a long tail on the right.
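
The notes above suggest a quick check for skew: because the mean is pulled toward the long tail while the median is not, comparing the two hints at the direction of the skew. The two small datasets below are made up for illustration, and this comparison is a rule of thumb rather than a guarantee.

```python
from statistics import mean, median

# Made-up data with a long tail to the right (one large value).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Made-up data with a long tail to the left (one small value).
left_skewed = [1, 17, 17, 18, 18, 18, 19, 19, 20]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}-skewed: mean={mean(data):.2f}, median={median(data)}")

# Right skew: the mean sits above the median.
# Left skew: the mean sits below the median.
```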

#### Calculating the 5 Number Summary
The five number summary consists of 5 values:

1. **Minimum**: The smallest number in the dataset.
2. **Q1**: The value such that 25% of the data fall below.
3. **Q2**: The value such that 50% of the data fall below.
4. **Q3**: The value such that 75% of the data fall below.
5. **Maximum**: The largest value in the dataset.

Calculating each of the quartiles is essentially just finding a median: Q2 is the median of the full dataset, while Q1 and Q3 are the medians of its lower and upper halves.

#### Range and IQR (Interquartile range)
* The `range` is then calculated as the difference between the `maximum` and the `minimum`.
* The `interquartile range` is calculated as the difference between `Q3` and `Q1`
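
A sketch of the five number summary, range, and IQR using the "median of halves" approach. One assumption to flag: when the number of observations is odd, this version leaves the middle value out of both halves. Other conventions exist, so libraries such as NumPy or pandas may report slightly different quartiles. The function name `five_number_summary` is illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, Q2, Q3, and max using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # Assumed convention: for odd n, skip the middle value in both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(lower),
        "q2": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
data_range = summary["max"] - summary["min"]  # range = maximum - minimum
iqr = summary["q3"] - summary["q1"]           # IQR = Q3 - Q1
print(summary, data_range, iqr)
```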

#### Boxplot
A boxplot draws the five number summary as a picture. Boxplots are useful for quickly comparing the spread of two datasets.

![boxplot](images/boxplot.png)

#### Standard Deviation and Variance
The `standard deviation` is one of the most common measures for talking about the spread of data. It can be thought of as **the average distance of each observation from the mean**, and it is computed as the square root of the variance.

#### Recap

For datasets that are **not symmetric**, the five number summary and a corresponding box plot are a great way to get started with understanding the spread of your data. **Although I still prefer a histogram in most cases, box plots can be easier to compare two or more groups**. You will see this in the quizzes towards the end of this lesson.<br>

The variance is the **average squared difference of each observation from the mean**.

#### Important Final Points
1. The `variance` is used to compare the spread of two different groups. A set of data with higher variance is more spread out than a dataset with lower variance. Be careful though, there might just be an **outlier (or outliers) that is increasing the variance**, when most of the data are actually very close.
2. When comparing the spread between two datasets, the **units of each must be the same**.
3. When data are related to money or the economy, higher variance (or standard deviation) is associated with higher risk.
4. The standard deviation is used more often in practice than the variance, because it shares the units of the original dataset.
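
A worked sketch of the variance and standard deviation. It follows the definition used in this section, which averages over all `n` observations (the population form). The sample form divides by `n - 1` instead and gives a slightly larger value. The data are arbitrary.

```python
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example data

mean = sum(values) / len(values)

# Variance: the average squared difference of each observation from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: the square root of the variance, which puts the
# result back in the same units as the original data.
std_dev = sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```

The standard library offers `statistics.pvariance` and `statistics.pstdev` for this population form, and `statistics.variance` and `statistics.stdev` for the sample form.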



### Shape

![normal](images/normal.png)

![right](images/skew_right.png)

![left](images/skew_left.png)

### Outliers
We learned that `outliers` are points that fall very far from the rest of our data points. 

> Outliers influence measures like the mean and standard deviation much more than measures associated with the five number summary.
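
To make this concrete, the sketch below adds a single extreme value to a small made-up dataset and compares how far each summary moves.

```python
from statistics import mean, median, pstdev

data = [10, 11, 12, 13, 14]   # made-up data
with_outlier = data + [100]   # the same data plus one extreme value

mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)

print(f"mean:   {mean(data):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(data):.2f} -> {median(with_outlier):.2f}")
print(f"stdev:  {pstdev(data):.2f} -> {pstdev(with_outlier):.2f}")
```

The mean and standard deviation jump sharply, while the median moves by only half a unit.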

#### How to deal with Outliers
1. Note that they exist and their impact on summary statistics.
2. If an outlier is a typo, remove or fix it.
3. Understand why they exist, and their impact on the questions we are trying to answer about our data.
4. When we have outliers, reporting the five number summary values is often a better indication than measures like the mean and standard deviation.
5. Be careful in reporting. Know how to ask the right questions.

#### Outliers Advice
Below are my guidelines for working with any column (random variable) in your dataset.

1. Plot your data to identify if you have outliers.
2. Handle outliers accordingly via the methods above.
3. If no outliers and your data follow a normal distribution - use the mean and standard deviation to describe your dataset, and report that the data are normally distributed.
4. If you have skewed data or outliers, use the five number summary to summarize your data and report the outliers.

If you aren't sure if your data are normally distributed, there are plots called normal quantile plots and statistical methods like the [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test that are aimed to help you understand whether or not your data are normally distributed.

## Inferential Statistics
Drawing conclusions about a population based on data collected from a sample of individuals from that population.
### Descriptive vs. Inferential Statistics

1. **Descriptive Statistics**
    * `Descriptive statistics` **is about describing our collected data** using the measures discussed throughout this lesson: measures of center, measures of spread, shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.
2. **Inferential Statistics**
    * `Inferential Statistics` **is about using our collected data to draw conclusions to a larger population**. Performing inferential statistics well requires that we take a sample that accurately represents our population of interest.

A common way to collect data is via a survey. However, surveys may be extremely biased depending on the types of questions that are asked, and the way the questions are asked. This is a topic you should think about when tackling the first project.<br>

We looked at specific examples that allowed us to identify the following:
1. **Population** - our entire group of interest.
2. **Parameter** - a numeric summary about a population.
3. **Sample** - a subset of the population.
4. **Statistic** - a numeric summary about a sample.
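
The four terms can be illustrated with a small simulation. Everything here is hypothetical: the population is 10,000 simulated heights, and the seed, sizes, and variable names were chosen only for this sketch.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Population: our entire group of interest (simulated heights in cm).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a numeric summary about the population.
parameter = mean(population)

# Sample: a subset of the population, drawn at random.
sample = rng.sample(population, 100)

# Statistic: a numeric summary about the sample.
statistic = mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

In practice we rarely see the whole population, so the statistic is used to draw conclusions about the unknown parameter.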



### Simpson's Paradox
It is so easy to get caught up in looking at full aggregates of your data. Depending on how the data are aggregated we can be led to different conclusions, so it is very important to look at the data from multiple perspectives to avoid false conclusions.


```python
import pandas as pd
import numpy as np
```

```python
# Load and view the first few rows of the dataset
df = pd.read_csv('data/admission_data.csv')
df.head()
```

|   | student_id | gender | major | admitted |
|---|---|---|---|---|
| 0 | 35377 | female | Chemistry | False |
| 1 | 56105 | male | Physics | True |
| 2 | 31441 | female | Chemistry | False |
| 3 | 51765 | male | Physics | True |
| 4 | 53714 | female | Physics | True |


```python
# Proportion of candidates of each gender
df['gender'].value_counts() / len(df)
```

```
female    0.514
male      0.486
Name: gender, dtype: float64
```

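```python
# pd.get_dummies returns its columns in alphabetical order ('female', 'male'),
# so the new column names must be listed in that same order.
df[['female', 'male']] = pd.get_dummies(df['gender'], dtype=int)
```

```python
df.head()
```

|   | student_id | gender | major | admitted | female | male |
|---|---|---|---|---|---|---|
| 0 | 35377 | female | Chemistry | False | 1 | 0 |
| 1 | 56105 | male | Physics | True | 0 | 1 |
| 2 | 31441 | female | Chemistry | False | 1 | 0 |
| 3 | 51765 | male | Physics | True | 0 | 1 |
| 4 | 53714 | female | Physics | True | 1 | 0 |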


```python
# Count candidates and admissions for each (major, gender) pair
df_agg = df.groupby(['major', 'gender']).agg({'gender': 'count',
                                              'admitted': ['sum', 'mean']}).rename(columns={'gender': 'candidates'})

# Flatten the two-level column index, e.g. ('admitted', 'sum') -> 'admitted_sum'
df_agg.columns = ['_'.join(cols) for cols in df_agg.columns]

# Acceptance rate = admitted candidates / total candidates (same as admitted_mean)
df_agg['acceptance_rate'] = df_agg['admitted_sum'] / df_agg['candidates_count']
```

```python
df_agg.head()
```

| major | gender | candidates_count | admitted_sum | admitted_mean | acceptance_rate |
|---|---|---|---|---|---|
| Chemistry | female | 226 | 51.0 | 0.225664 | 0.225664 |
| Chemistry | male | 18 | 2.0 | 0.111111 | 0.111111 |
| Physics | female | 31 | 23.0 | 0.741935 | 0.741935 |
| Physics | male | 225 | 116.0 | 0.515556 | 0.515556 |


```python
# Aggregate by gender only, ignoring the major
df_agg_1 = df.groupby(['gender']).agg({'gender': 'count',
                                       'admitted': 'sum'}).rename(columns={'gender': 'candidates'})
df_agg_1['acceptance_rate'] = df_agg_1['admitted'] / df_agg_1['candidates']
```


```python
df_agg_1.head()
```

| gender | candidates | admitted | acceptance_rate |
|---|---|---|---|
| female | 257 | 74.0 | 0.287938 |
| male | 243 | 118.0 | 0.485597 |
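
The reversal in these results can be checked directly from the counts. The sketch below copies the candidate and admission counts from the output above into a plain dictionary (the names `counts`, `by_major`, and `overall` are illustrative) and recomputes both views without pandas.

```python
# (candidates, admitted) for each (major, gender), copied from the output above.
counts = {
    ("Chemistry", "female"): (226, 51),
    ("Chemistry", "male"): (18, 2),
    ("Physics", "female"): (31, 23),
    ("Physics", "male"): (225, 116),
}

# View 1: acceptance rate within each major.
by_major = {key: admitted / candidates
            for key, (candidates, admitted) in counts.items()}

# View 2: acceptance rate per gender, aggregated over both majors.
overall = {}
for gender in ("female", "male"):
    candidates = sum(c for (_, g), (c, _a) in counts.items() if g == gender)
    admitted = sum(a for (_, g), (_c, a) in counts.items() if g == gender)
    overall[gender] = admitted / candidates

for key, rate in by_major.items():
    print(key, round(rate, 3))
print(overall)
```

Within each major, women are admitted at a higher rate than men, yet the aggregated rate is lower for women. The reason is visible in the counts: most women applied to Chemistry, which admits far fewer candidates than Physics.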


## Probability

![proba](images/probability_statistics.PNG)

We can get three general rules from this:

* The probability of any event must be between 0 and 1, inclusive.

* The probability of the complement of an event is 1 minus the probability of the event. That is, the probability of all other possible outcomes is 1 minus the probability of the event itself. Therefore, the probabilities of all possible outcomes sum to 1.

* If our events are independent, then the probability of a string of events is the product of their probabilities. That is, the probability of one event AND the next AND the next is the product of the probabilities of those events.
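
A small sketch of these rules using coin flips. The fair coin (`p_heads = 0.5`) is an assumption made for the example.

```python
p_heads = 0.5  # assumed fair coin

# Rule 1: every probability lies between 0 and 1, inclusive.
assert 0 <= p_heads <= 1

# Rule 2 (complement): P(tails) = 1 - P(heads).
p_tails = 1 - p_heads

# Rule 3 (independence): P(heads AND heads AND heads) is the product.
p_three_heads = p_heads * p_heads * p_heads

# Combining rules 2 and 3: "at least one tail in three flips" is the
# complement of "three heads in a row".
p_at_least_one_tail = 1 - p_three_heads

print(p_tails, p_three_heads, p_at_least_one_tail)  # 0.5 0.125 0.875
```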

## Binomial Distribution

The **Binomial Distribution** helps us determine the probability of a string of independent 'coin flip like events'.<br>

The [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) associated with the binomial distribution is of the following form:

![binomial_formula](images/binomial_formula.PNG)

where **n** is the number of events, **x** is the number of "successes", and **p** is the probability of "success".<br>

We can now use this distribution to determine the probability of things like:

* The probability of 3 heads occurring in 10 flips.
* The probability of observing 8 or more heads occurring in 10 flips.
* The probability of not observing any heads in 20 flips.
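
The three probabilities above can be computed with a few lines of Python. This sketch assumes a fair coin (`p = 0.5`) and uses the standard binomial probability mass function, `C(n, x) * p**x * (1 - p)**(n - x)`. The function name `binomial_pmf` is chosen for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


p = 0.5  # assumed fair coin

# Exactly 3 heads in 10 flips.
p_3_of_10 = binomial_pmf(3, 10, p)

# 8 or more heads in 10 flips: add the probabilities for 8, 9, and 10 heads.
p_8_plus_of_10 = sum(binomial_pmf(k, 10, p) for k in range(8, 11))

# No heads at all in 20 flips.
p_0_of_20 = binomial_pmf(0, 20, p)

print(p_3_of_10)       # 0.1171875
print(p_8_plus_of_10)  # 0.0546875
print(p_0_of_20)       # about 9.5e-07
```

If SciPy is available, `scipy.stats.binom.pmf` computes the same quantity.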

```python
import pandas as pd
import numpy as np
```