# Descriptive Statistics

- **Data**: Distinct pieces of information

## Data Types

### Quantitative vs. Categorical
- **Quantitative** data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).
- **Categorical** data is used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.).


### Categorical Ordinal vs. Categorical Nominal
- We can divide categorical data further into two types: **Ordinal** and **Nominal**.
- **Categorical Ordinal** data take on a ranked ordering (like a ranked interaction on a scale from `Very Poor` to `Very Good` with the dogs).
- **Categorical Nominal** data do not have an order or ranking (like the breeds of the dogs).

### Continuous vs. Discrete
- We can think of quantitative data as being either **continuous** or **discrete**.
- **Continuous** data can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of the dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.
- **Discrete** data only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

| **Data Types** |  |  |
| :-- | :-- | :-- |
| **Quantitative:** | **Continuous** | **Discrete** |
|  | Height, Age, Income | Pages in a Book, Trees in a Yard, Dogs at a Coffee Shop |
| **Categorical:** | **Ordinal** | **Nominal** |
|  | Letter Grade, Survey Rating | Gender, Marital Status, Breakfast Items |

## Summary Statistics

### **Four Aspects for Quantitative Data**

There are four main aspects to analyzing **Quantitative Data**:
1. Measures of `Center`
2. Measures of `Spread`
3. The `Shape` of data
4. `Outliers`

**Analyzing Categorical Data**

Analyzing categorical data has fewer parts to consider. **Categorical** data is usually analyzed by looking at the counts or proportions of individuals that fall into each group. For example, if we were looking at the breeds of the dogs, we would care about how many dogs are of each breed, or what proportion of dogs are of each breed type.
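
As a minimal sketch of counting and computing proportions, the snippet below uses a made-up list of breeds. The `breeds` data and the variable names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical sample of dog breeds (made-up data for illustration only).
breeds = ["Collie", "Lab", "Poodle", "Lab", "Collie", "Lab"]

# Count how many dogs fall into each breed.
counts = Counter(breeds)

# Convert each count into a proportion of all dogs.
total = len(breeds)
proportions = {breed: count / total for breed, count in counts.items()}

for breed, count in counts.items():
    print(f"{breed}: count={count}, proportion={proportions[breed]:.2f}")
```

The proportions always add up to 1, which is a quick sanity check on this kind of summary.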

### Measures of Center

There are three measures of center:
1. **Mean**
2. **Median**
3. **Mode**

---
**The Mean**

The mean is often called the average or the **expected value** in mathematics. We calculate the mean by adding all of our values together and dividing by the number of values in our dataset.

```python
# Mean
sum_num = 5 + 8 + 15 + 7 + 10 + 22 + 3 + 1 + 15
num_vals = 9
mean = sum_num / num_vals
print(mean)  # 9.555555555555555
```


---
**The Median**

The **median** splits our data so that 50% of our values are lower and 50% of our values are higher. How we calculate the median depends on whether we have an even or an odd number of observations.

**Median for Odd Values**

If we have an **odd** number of observations, the **median** is simply the number in the **direct** middle. For example, if we have 7 observations, the median is the fourth value when our numbers are ordered from smallest to largest. If we have 9 observations, the median is the fifth value.

**Median for Even Values**

If we have an **even** number of observations, the **median** is the **average of the two values in the middle**. For example, if we have 8 observations, we average the fourth and fifth values together when our numbers are ordered from smallest to largest.

In order to compute the median, we _must_ sort our values first.

Whether we use the mean or median to describe a dataset largely depends on the **shape** of our dataset and whether there are any **outliers**.

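```python
# Median (odd number of values: the fifth of nine sorted values)
nums = [1, 3, 5, 7, 8, 10, 15, 15, 22]
median = 8

# Median (even number of values: the average of 7 and 8)
nums2 = [1, 2, 3, 5, 7, 8, 10, 15, 15, 22]
median2 = 7.5
```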
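
The odd and even rules above can be sketched as one small function. The name `compute_median` is a hypothetical helper chosen for this illustration, and the sketch assumes a non-empty list of numbers.

```python
def compute_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # the median is only defined on sorted values
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: take the single value in the direct middle.
        return ordered[middle]
    # Even count: average the two values in the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(compute_median([1, 3, 5, 7, 8, 10, 15, 15, 22]))     # 8
print(compute_median([1, 2, 3, 5, 7, 8, 10, 15, 15, 22]))  # 7.5
```

Sorting inside the function means the caller does not need to remember to sort first.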

---
**The Mode**

The **mode** is the most frequently observed value in our dataset.

There might be multiple modes for a particular dataset, or no mode at all.

**No Mode**

If all observations in our dataset are observed with the same frequency, there is no mode. If we have the dataset `1, 1, 2, 2, 3, 3, 4, 4` there is no mode because all observations occur the same number of times.

**Many Modes**

If two (or more) values share the maximum frequency, then there is more than one mode. If we have the dataset `1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9`, then the two modes are 3 and 6, because these values share the maximum frequency of 3 occurrences, while all the other values appear only once.
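
A minimal sketch of these rules follows. The function name `find_modes` is hypothetical, and it deliberately follows the convention used in these notes: when every distinct value occurs equally often, it reports no mode (an empty list).

```python
from collections import Counter


def find_modes(values):
    """Return a sorted list of modes, or [] when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []  # an empty dataset has no mode
    highest = max(counts.values())
    # Convention from these notes: equal frequencies everywhere means no mode.
    if all(count == highest for count in counts.values()):
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(find_modes([1, 1, 2, 2, 3, 3, 4, 4]))                    # []
print(find_modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))     # [3, 6]
```

Other tools may use a different convention for the equal-frequency case, so it is worth checking before relying on a library function.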

```python
# Mean of a dataset with 12 values
list_of_num = [3, 4, 4, 4, 4, 5, 8, 10, 12, 12, 20, 32]
sums = sum(list_of_num)
avg = sums / len(list_of_num)
print(avg)  # 9.833333333333334
```

```python
# Mean of a dataset with 10 values
list_of_num = [1, 3, 5, 7, 8, 10, 10, 15, 15, 22]
sums = sum(list_of_num)
avg = sums / len(list_of_num)
print(avg)  # 9.6
```

## Notation

Notation is a common language used to communicate mathematical ideas. **Think of notation as a universal language used by academic and industry professionals to convey mathematical ideas**. In the next videos, you might see things that seem confusing. Use the quizzes to assist with your understanding of the concepts.

You likely already know some notation. Plus, minus, multiplication, division, and equal signs all have mathematical symbols that you are likely familiar with. Each of these symbols stands for an idea about how numbers interact with one another. In the coming concepts, you will be introduced to some additional ideas related to notation. Though you will not need to use notation to complete the project, it does have the following benefits:

1. **Understanding how to correctly use notation makes you seem really smart**. Knowing how to read and write in notation is like learning a new language. A language that is used to convey ideas associated with mathematics.

2. **It allows you to read documentation, and implement an idea to your own problem**. Notation is used to convey how problems are solved all the time. One really popular mathematical algorithm that is used to solve some of the world's most difficult problems is known as Gradient Boosting.

3. **It makes ideas that are hard to say in words easier to convey**. Sometimes we just don't have the right words to say. For those situations, I prefer to use notation to convey the message. Similar to the way an emoji or meme might convey a feeling better than words, notation can convey an idea better than words. Usually those ideas are related to mathematics but can be applied elsewhere.

**Before Collecting Data**

Before collecting data, we usually start with a question, or many questions, that we would like to answer. The purpose of data is to help us in answering these questions.

**Random Variables**

A **random variable** is a placeholder for the possible values of some process (mostly... the term 'some process' is a bit ambiguous). As we stated before, notation is useful in that it helps us take complex ideas and simplify them (often to a single letter or single symbol). We see random variables represented by capital letters (**X**, **Y**, or **Z** are common ways to represent a random variable).

We might have the random variable **X**, which is a holder for the possible values of the amount of time someone spends on our site. Or the random variable **Y**, which is a holder for the possible values of whether or not an individual purchases a product.

**X** is 'a holder' of the values that could possibly occur for the amount of time spent on our website. Any number from 0 to infinity really.

### Aggregations

An **aggregation** is a way to turn multiple numbers into fewer numbers (commonly one number).

**Summation** is a common aggregation. The notation used to sum our values is the Greek letter sigma (Σ).
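
As a sketch of what this looks like in symbols, suppose we label the observations $x_1, x_2, \ldots, x_n$. The labels $x_i$, the count $n$, and the symbol $\bar{x}$ for the mean are conventional choices assumed here, not something fixed by the notes above. The sum of the values, and the mean built from it, can then be written as:

```latex
\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n
\qquad\text{and}\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

For the nine values used in the mean example earlier, $n = 9$, the sum is 86, and the mean is $86 / 9 \approx 9.56$.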

### Measures of Spread

**Measures of Spread** are used to provide us an idea of how spread out our data are from one another. Common measures of spread include:

1. **Range**
2. **Interquartile Range (IQR)**
3. **Standard Deviation**
4. **Variance**

**Histograms**

Histograms are the most common visual for quantitative data. Histograms are super useful for understanding the different aspects of quantitative data.

**Calculating the 5 Number Summary**

The five number summary consists of 5 values:

1. **Minimum**: The smallest number in the dataset.
2. **Q1**: The value such that 25% of the data fall below.
3. **Q2**: The value such that 50% of the data fall below.
4. **Q3**: The value such that 75% of the data fall below.
5. **Maximum**: The largest value in the dataset.

Calculating each of these values is essentially a matter of finding medians: **Q2** is the median of the whole dataset, **Q1** is the median of the lower half of the data, and **Q3** is the median of the upper half. Each median calculation still depends on whether we have an odd or even number of values.

**Range**

The **range** is then calculated as the difference between the **maximum** and the **minimum**.

**IQR**

The **interquartile range** is calculated as the difference between **Q3** and **Q1**.

```python
dataset1 = [1, 1, 2, 3, 4, 5, 8, 8, 10, 12]
dataset2 = [1, 2, 3, 4, 5, 8, 8, 10, 12]
```
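
The sketch below computes the five number summary, range, and IQR for these two datasets using the median-of-halves approach. The helper names are hypothetical, and one convention is assumed: with an odd number of values, the middle value is left out of both halves. Other tools may use different quartile conventions and return slightly different results. At least two values are assumed.

```python
def median_of_sorted(ordered):
    """Median of an already-sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return min, Q1, Q2, Q3, max using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumed convention: skip the middle value when n is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median_of_sorted(lower),
        "q2": median_of_sorted(ordered),
        "q3": median_of_sorted(upper),
        "max": ordered[-1],
    }


dataset1 = [1, 1, 2, 3, 4, 5, 8, 8, 10, 12]
dataset2 = [1, 2, 3, 4, 5, 8, 8, 10, 12]

for data in (dataset1, dataset2):
    summary = five_number_summary(data)
    data_range = summary["max"] - summary["min"]
    iqr = summary["q3"] - summary["q1"]
    print(summary, "range:", data_range, "IQR:", iqr)
```

Under this convention, `dataset1` has Q1 = 2, Q2 = 4.5, and Q3 = 8, while `dataset2` has Q1 = 2.5, Q2 = 5, and Q3 = 9; both have a range of 11.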

**Box Plot**

The **box plot** is useful for quickly comparing the spread of datasets.

**Standard Deviation and Variance**

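The **standard deviation** is one of the most common measures for talking about the spread of data. It can be thought of as **the average distance of each observation from the mean**. More precisely, the **variance** is the average squared distance of each observation from the mean, and the standard deviation is the square root of the variance.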

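```python
dataset1 = [10, 14, 10, 6]

mean = sum(dataset1) / len(dataset1)  # 40 / 4 = 10
```

To find the standard deviation, start with the distance of each observation from the mean:

- `10 - 10 = 0`
- `14 - 10 = 4`
- `10 - 10 = 0`
- `6 - 10 = -4`

Squaring these distances gives 0, 16, 0, and 16. Their average, 8, is the variance, and its square root, about 2.83, is the standard deviation.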
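
The same calculation can be sketched in Python. This version divides by the number of values (the population formula), which is an assumption made here; the sample formula divides by one less than the number of values and would give a larger result.

```python
import math

dataset1 = [10, 14, 10, 6]
n = len(dataset1)

# Step 1: the mean.
mean = sum(dataset1) / n

# Step 2: distance of each observation from the mean.
deviations = [x - mean for x in dataset1]

# Step 3: square the distances so negatives do not cancel positives.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: the variance is the average squared distance (population formula).
variance = sum(squared_deviations) / n

# Step 5: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, deviations, variance, round(std_dev, 2))
```

Squaring matters because the raw distances here (0, 4, 0, -4) sum to zero, which would wrongly suggest there is no spread at all.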

### The Shape of Data


### Outliers