# Practical Statistics

## Data Definition
> Distinct pieces of information

## Data Types
There are two main data types: quantitative and categorical.
* **Quantitative** data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).
* **Categorical** data is used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.).

### Categorical Ordinal vs. Categorical Nominal
We can divide categorical data further into two types: **Ordinal** and **Nominal**.
* **Categorical Ordinal** data take on a ranked ordering (like a ranked interaction on a scale from `Very Poor` to `Very Good` with the dogs).
* **Categorical Nominal** data do not have an order or ranking (like the breeds of the dog).

### Quantitative Continuous vs. Discrete
We can think of quantitative data as being either continuous or discrete.

* **Continuous** data can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of the dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.

* **Discrete** data only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

![data_types](images/data_types.PNG)

### Another Look
To break down our data types, there are two main blocks: **Quantitative** and **Categorical**.

**Quantitative** can be further divided into `Continuous` or `Discrete`.<br>

**Categorical** data can be divided into `Ordinal` or `Nominal`.

#### Quantitative vs. Categorical
Some of these can be a bit tricky. Notice that even though zip codes are numbers, they aren’t really a quantitative variable. If we add two zip codes together, the result carries no useful information. Therefore, zip code is a categorical variable.<br>

**Height**, **Age**, the **Number of Pages** in a Book and **Annual Income** all take on values that we can add, subtract and perform other operations with to gain useful insight. Hence, these are `quantitative`.<br>

**Gender**, **Letter Grade**, **Breakfast Type**, **Marital Status**, and **Zip Code** can be thought of as labels for a group of items or individuals. Hence, these are `categorical`.
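
The zip code point can be illustrated with a short Python sketch. The zip codes and variable names below are made up for illustration. Storing the codes as strings is one reasonable choice, not the only one.

```python
from collections import Counter

# Hypothetical zip codes, stored as strings because they are labels, not amounts.
zip_codes = ["02139", "94105", "02139", "60614"]

# Treating them as numbers loses information: the leading zero disappears,
# and the sum of two codes does not describe anything real.
as_numbers = [int(z) for z in zip_codes]
meaningless_sum = as_numbers[0] + as_numbers[1]

# Counting how often each label occurs is a summary that does make sense.
zip_counts = Counter(zip_codes)
```

Here `zip_counts` tells us how many individuals fall into each zip code, which is the kind of question we ask of a categorical variable.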

#### Continuous vs. Discrete
To decide whether data is continuous or discrete, we check whether it can be split into smaller and smaller units. Consider time: we could measure an event in years, months, days, hours, minutes, or seconds, and even at seconds we know there are smaller units we could use. Therefore, time is continuous. **Height**, **age**, and **income** are all examples of `continuous data`. Alternatively, the **number of pages** in a book, the **dogs I count** outside a coffee shop, and the **trees in a yard** are `discrete data`. We would not want to split our dogs in half.

#### Ordinal vs. Nominal
In looking at categorical variables, we found that **Gender**, **Marital Status**, **Zip Code**, and **Breakfast Items** are `nominal variables`: there is no order or ranking associated with this type of data. Whether you ate cereal, toast, eggs, or only coffee for breakfast, there is no rank ordering associated with your breakfast.<br>

Alternatively, **Letter Grade** and **Survey Ratings** have a rank ordering associated with them, which makes them `ordinal data`. If you receive an A, this is higher than an A-. An A- is ranked higher than a B+, and so on. Ordinal variables frequently occur on rating scales from very poor to very good. In many cases we turn these ordinal variables into numbers so that we can analyze them more easily, but more on this later!
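
As a minimal sketch of turning an ordinal variable into numbers, the Python snippet below maps each rating to its rank. The three labels between `Very Poor` and `Very Good`, and all variable names, are assumptions made for illustration.

```python
# Hypothetical five-point rating scale, listed from lowest to highest.
RATING_ORDER = ["Very Poor", "Poor", "Neutral", "Good", "Very Good"]

# Map each label to its rank (1 = lowest) so comparisons follow the ordering.
rating_to_rank = {label: rank for rank, label in enumerate(RATING_ORDER, start=1)}

# Hypothetical survey responses.
responses = ["Good", "Very Poor", "Very Good", "Good"]
ranks = [rating_to_rank[r] for r in responses]

# With ranks available, we can ask which response is the highest.
highest = max(responses, key=rating_to_rank.get)
```

The ranks preserve the order of the labels. Whether the gaps between neighbouring labels are really equal is a separate assumption that this encoding does not check.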

## Analyzing Quantitative Data

There are four main aspects to analyzing **Quantitative** data.

1. Measures of `Center`
2. Measures of `Spread`
3. The `Shape` of the data.
4. `Outliers`

> Analyzing **categorical data** has fewer parts to consider. Categorical data is usually analyzed by looking at
> 1. the counts, or
> 2. the proportion of individuals that fall into each group.
> 
> For example, if we were looking at the breeds of the dogs, we would care about how many dogs are of each breed, or what proportion of dogs are of each breed.
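
The counts and proportions described above can be sketched in Python as follows. The sample of breeds is hypothetical.

```python
from collections import Counter

# Hypothetical sample of dog breeds.
breeds = ["Collie", "Lab", "Poodle", "Lab", "Lab", "Collie"]

# Number of dogs in each breed.
counts = Counter(breeds)

# Share of all dogs that belong to each breed.
total = len(breeds)
proportions = {breed: n / total for breed, n in counts.items()}
```

The proportions always add up to 1, which makes them easier to compare across samples of different sizes than raw counts.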

### Measures of Center
Measures of center give us an idea of a typical value in the dataset. There are three measures of center:
1. `Mean`
2. `Median`
3. `Mode`

#### The Mean
The mean is often called the **average** or the **expected value** in mathematics. 
> We calculate the mean by adding all of our values together, and dividing by the number of values in our dataset.
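
A minimal Python sketch of this calculation, using made-up values. The data and variable names are assumptions for illustration.

```python
from statistics import mean

# Hypothetical number of dogs seen on five walks.
dogs_seen = [1, 3, 4, 4, 8]

# Add all values together, then divide by how many values there are.
manual_mean = sum(dogs_seen) / len(dogs_seen)

# The standard library's statistics.mean gives the same result.
library_mean = mean(dogs_seen)
```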

#### The Median
The `median` splits our data so that **50% of our values are lower and 50% are higher**. How we calculate the median depends on whether we have an even or an odd number of observations.<br>

In order to compute the median we MUST sort our values first.

> **For odd values**: If we have an odd number of observations, the median is simply the number in the direct middle.

> **For even values**: If we have an even number of observations, the median is the average of the two values in the middle.

Whether we use the `mean` or `median` to describe a dataset is largely dependent on the **shape** of our dataset and if there are any **outliers**.
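
The odd and even rules can be sketched in Python as below. The function name `median_by_hand` and the sample values are hypothetical. The last sample shows how a single outlier affects the mean far more than the median.

```python
from statistics import mean, median

def median_by_hand(values):
    """Median of a non-empty list, following the odd/even rule above."""
    ordered = sorted(values)  # sorting must come first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single value in the direct middle.
        return ordered[middle]
    # Even count: the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [5, 1, 9, 3, 7]
even_sample = [5, 1, 9, 3]
with_outlier = [1, 3, 5, 7, 100]

odd_median = median_by_hand(odd_sample)
even_median = median_by_hand(even_sample)
outlier_median = median_by_hand(with_outlier)
outlier_mean = mean(with_outlier)
```

In `with_outlier`, the value 100 pulls the mean well above most of the observations, while the median stays at the middle value. This is one reason the median is often preferred when outliers are present.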

#### The Mode
The mode is the most frequently observed value in our dataset.<br>

> There might be multiple modes for a particular dataset, or no mode at all.

* **No Mode**: If all observations in our dataset are observed with the same frequency, there is no mode. If we have the dataset:
```
1, 1, 2, 2, 3, 3, 4, 4
```
There is no mode, because all observations occur the same number of times.

* **Many Modes**: If two (or more) numbers share the maximum frequency, then there is more than one mode. If we have the dataset:
```
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
```
There are two modes, 3 and 6, because each appears 3 times, the highest frequency, while every other value appears only once.
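
Both cases can be reproduced with a small Python helper. The function `modes` is a hypothetical sketch that follows the convention used in these notes, where a dataset in which every value is equally frequent has no mode.

```python
from collections import Counter

def modes(values):
    """Return the modes, or an empty list when every value is equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention from this section: if all distinct values tie, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

no_mode = modes([1, 1, 2, 2, 3, 3, 4, 4])
two_modes = modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9])
```

Note that Python's built-in `statistics.multimode` uses a different convention: for the first dataset it returns all four values instead of reporting no mode.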

### Notation: Capital vs. Lowercase Letters for Variables
**Random variables** are represented by `capital letters`. Once we observe an outcome of a random variable, we denote it with the lowercase version of the same letter, as follows:

`X` (capital) stands for the whole set of times individuals spend on the website, like:
```
[x1, x2, x3, x4, x5, ... , xn]
```
and `x` (lower) represents a single observation.<br>

As a quick recap, **capital letters** signify **random variables**. When we look at **individual instances** of a particular random variable, we write them as **lowercase letters** with a subscript attached to each specific observation.
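
With this notation, the calculation of the mean described earlier can be written compactly. The symbol $\bar{x}$ for the mean of the observations is a common convention assumed here; it does not appear in the notes above.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
$$

Here $n$ is the number of observations and $x_i$ is the $i$-th observed value.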

![notation](images/notation.PNG)

### Measures of Spread
Measures of spread give us an idea of how much the values differ from one another.

