# Descriptive Statistics

## Learning Objectives

- Code summary statistics using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.
- Create basic data visualizations, including scatter plots, box plots, and histograms.
- Display characteristics and trends in a dataset using visualizations.
- Describe the bias and variance of statistical models.
- Identify a normal distribution within a dataset using summary statistics and data visualizations.

In [1]:
import pandas as pd
import seaborn as sns # visualization library
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact
plt.style.use('fivethirtyeight')

# This makes sure that graphs render in your notebook
%matplotlib inline

**Exercise.** Examine the Titanic Data Set

- Use pandas to read in the Titanic data set from ./data/titanic.csv

In [2]:
titanic_df = pd.read_csv('./data/titanic.csv')

- Print out the column names.

In [3]:
titanic_df.columns

Index(['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked'],
      dtype='object')

- Print out the dimensions of the DataFrame using the `.shape` attribute.

In [4]:
titanic_df.shape

(891, 11)

- Print out the data types of the columns using the `.dtypes` attribute.

In [5]:
titanic_df.dtypes

survived      int64
pclass        int64
name         object
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
dtype: object

- Print out the first two rows of the data.

In [6]:
titanic_df.head(2)

| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |


- Count the number of occurrences of each value in the `pclass` column.

In [11]:
titanic_df.loc[:, 'pclass'].value_counts()

3    491
1    216
2    184
Name: pclass, dtype: int64

- Pull up descriptive statistics for each variable, including categorical variables.

In [13]:
titanic_df.describe(include='all')

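| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | Silvey, Mrs. William Baird (Alice Munger) | male | NaN | NaN | NaN | CA. 2343 | NaN | B96 B98 | S |
| freq | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |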


- Which variables have missing values? (Don't confuse missing summary statistics with missing values of the variable!)

In [16]:
titanic_df.isnull().sum()

survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

<a id="descriptive-statistics-fundamentals"></a>
## Descriptive Statistics Fundamentals

### Summation Notation and Python Code

#### Summing a constant

Suppose we wanted to sum the number 4 seven times.

**Notation:**

$$\sum_{i=1}^{7}{4}$$

**Python code:**

In [19]:
# Using base Python `sum`
sum([4, 4, 4, 4, 4, 4, 4])
sum([4] * 7)

28

In [20]:
# Using numpy array .sum() method
np.array([4] * 7).sum()

28

In [None]:
# Building the list using multiplication
[4] * 7


**Exercise:** Use a for-loop to build a list of seven 4s.

In [25]:
# Build the list one element at a time, then sum it
fours = []
for _ in range(7):
    fours.append(4)
sum(fours)

28

---

In [27]:
# More compact approach: a list comprehension
[4 for _ in range(7)]
[num for num in range(7)]

[0, 1, 2, 3, 4, 5, 6]

In [28]:
# [num for num in range(7)] is equivalent to this loop:
my_list = []
for num in range(7):
    my_list.append(num)

my_list

[0, 1, 2, 3, 4, 5, 6]

General notation for the sum of a constant $k$, $n$ times:
$$\sum_{i=1}^nk$$

#### Summing consecutive integers

The sum of all numbers from 1 up to and including $10$:

$$\sum_{i=1}^{10}i$$

**Exercise.**

- Use a list comprehension to create a list of all numbers from one to ten.

In [30]:
[num for num in range(1,11)]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

- Turn this list into a `numpy` array and use that array's `.sum` method to calculate its sum. (You should get `55`.)

In [34]:
np.array([num for num in range(1,11)]).sum()

55

#### Summing elements of an arbitrary sequence

The sum of all $x$ from the first $x$ entry to the $n$th $x$ entry:
$$\sum_{i=1}^nx_i$$

In [36]:
# Use the "fare" column of the `titanic_df` dataframe to compute the total fare paid by passengers
titanic_df.loc[:,'fare'].sum()

28693.9493

**Exercise.** Calculate the sum of the ages of the Titanic passengers.

In [42]:
titanic_df.loc[:,'age'].sum()

21205.17

#### Takeaways

- Mathematicians use $\Sigma$ ("capital sigma") to denote a sum.
- A list comprehension is a compact way to create a list using a `for` loop.
- NumPy arrays and Pandas series have `.sum` methods.

<a id="measures-of-central-tendency"></a>
### Measures of Central Tendency

- Mean
- Median
- Mode

#### Mean
The **mean** (average) of a set of numbers is their sum divided by their count:
$$\bar{X} =\frac 1n\sum_{i=1}^nx_i$$

#### Median
The **median** is the middle value of a sorted list of numbers:

$$ 0,1,2,[3],5,5,1004 $$

$$ 1,3,4,[4,5],5,5,7 $$

To find the median:

- Arrange the numbers in order from smallest to largest.
    - If there is an odd number of values, the middle value is the median.
    - If there is an even number of values, the average of the middle two values is the median.
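The steps above can be sketched in plain Python. The function name `median_by_hand` is made up for this illustration; the two test lists are the examples shown earlier in this section.

```python
def median_by_hand(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)      # arrange from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # odd count: the single middle value
        return ordered[middle]
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([0, 1, 2, 3, 5, 5, 1004]))   # 3
print(median_by_hand([1, 3, 4, 4, 5, 5, 5, 7]))   # 4.5
```

Note that the outlier `1004` in the first list has no effect on the result.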

#### Mode
The mode of a set of values is the value that occurs most often.
A set of values may have more than one mode, or no mode at all.

$$1,0,1,5,7,8,9,3,4,1$$ 

$1$ is the mode, as it occurs the most often (three times).
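One possible way to find the mode in code is to tally each value and keep every value tied for the highest tally. This is a sketch using `collections.Counter` from the standard library; the helper name `modes` is made up for this illustration.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, in first-seen order."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 0, 1, 5, 7, 8, 9, 3, 4, 1]))  # [1]
print(modes([1, 1, 2, 2, 3]))                 # [1, 2]
```

When every value occurs exactly once, this helper returns all of them; by the convention used above, such a set is usually said to have no mode.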

**Exercise.**

- Calculate the mean, median, and mode for each of the following sets of numbers "by hand." (You may use Python's `+` and `/` operators, but no higher-level functions/methods.)

75, 87, 49, 68, 75, 84, 94

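In [56]:
my_list = [75, 87, 49, 68, 75, 84, 94]
my_list.sort()

# Mean: accumulate the total and the count in one pass.
# (`running_sum` avoids shadowing Python's built-in `sum`.)
running_sum = 0
count = 0
for num in my_list:
    count += 1
    running_sum += num
mean_value = running_sum / count

# Median: the middle element of the sorted list (the count is odd here).
median_value = my_list[count // 2]

# Mode: tally each value and keep the one seen most often.
tallies = {}
mode_value = my_list[0]
for num in my_list:
    tallies[num] = tallies.get(num, 0) + 1
    if tallies[num] > tallies[mode_value]:
        mode_value = num

print(mean_value, median_value, mode_value)

76.0 75 75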


- Calculate the mean, median, and mode for each of the following sets of numbers "by hand."

1, 1, 2, 3, 4, 5, 6, 7, 8, 1000

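- For the previous list of numbers, which "measure of central tendency" best captures the intuitive idea of a "typical" value in the set? Why do the others return "atypical" values?

The median (4.5) best captures a typical value. The single outlier 1000 pulls the mean up to 103.7, which is larger than nine of the ten values, and the mode (1) sits at the very bottom of the range.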

- If you are worried about outliers in a data set throwing off your results for the typical case, would you use methods based on the mean or the median?

#### Measures of Central Tendency in Pandas

In [None]:
# Use pandas to find the mean of the titanic.fare series


In [None]:
# Use pandas to find the median of the titanic.fare series


In [None]:
# Use pandas to find the mode of the titanic.fare series
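
A minimal sketch of the three pandas calls, run on a small made-up series so that it works without the data file. The fare values here are hypothetical; in the notebook the same methods would be called on `titanic_df.loc[:, 'fare']`.

```python
import pandas as pd

# Hypothetical fares for illustration only.
fares = pd.Series([7.25, 8.05, 8.05, 13.0, 26.0, 71.2833, 512.3292], name="fare")

print(fares.mean())    # arithmetic mean
print(fares.median())  # middle value of the sorted series
print(fares.mode())    # a Series, because several values can tie for most frequent
```

`.mode()` returns a Series rather than a single number, so use `.mode().iloc[0]` if only the first mode is wanted.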


**Exercise.** The mean, median, and mode are quite different in this case. What does that tell you about fares on the Titanic?

<a id="measures-of-dispersion-standard-deviation-and-variance"></a>
### Measures of Dispersion: Standard Deviation and Variance

![](./assets/images/variance_magnitude.png)

#### Definitions

- The **sample variance** of a data set is the average squared distance of each point from the mean, except that we divide by $n-1$ instead of $n$: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{X})^2$. (We divide by $n-1$ to make sample variance an "unbiased estimator" of true variance.)
- The **sample standard deviation** is the square root of the sample variance: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{X})^2}$
- The **sample standard error** of an estimator is an estimate of how much that estimator would vary from one sample to the next, which is a measure of its precision. For the sample mean, it is $s/\sqrt{n}$.

#### Notes

- You can think of sample variance/standard deviation as an estimate of the true variance/standard deviation of your variable, and of sample standard error as an estimate of the true standard error of your estimator.
- Standard deviation is more interpretable than variance because it is on the same scale as the original data. (The square root "undoes" the squaring.)
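To make these definitions concrete, here is a small simulation sketch. It assumes a hypothetical normal population with true standard deviation 2.0 and uses the standard error of the sample mean, $s/\sqrt{n}$, as the example estimator; the seed and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# Hypothetical population: normal with mean 5.0 and standard deviation 2.0.
for n in (10, 100, 10_000):
    sample = rng.normal(loc=5.0, scale=2.0, size=n)
    sample_std = sample.std(ddof=1)        # ddof=1 divides by n - 1
    std_error = sample_std / np.sqrt(n)    # estimated standard error of the sample mean
    print(f"n={n:>6}  sample std={sample_std:.3f}  standard error={std_error:.3f}")
```

Compare how the two printed columns behave as `n` grows.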

**Exercise.** As sample size increases, estimators get more precise, but the underlying distribution does not change. So what happens to the following quantities?

- variance (i.e., the true variance of the population)

- sample variance

- standard deviation

- sample standard deviation

- standard error

#### Calculating Sample Variance and Standard Deviation

In [None]:
# Take the first 5 rows of Titanic age data
first_five = titanic_df.loc[:4, 'age']

first_five

😱

**Exercise.**

- Use a built-in Pandas method to get the first five rows of the Titanic data. Then select the age column.

- **Challenge:** Why doesn't `.loc` use the usual "up-to-but-not-including" Python convention for ranges?
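
The difference between label-based and position-based slicing can be seen on a small series. The ages below are hypothetical stand-ins for `titanic_df.loc[:, 'age']`.

```python
import pandas as pd

# Hypothetical ages for illustration only.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0], name="age")

by_label = ages.loc[:4]        # label-based: the end label 4 is included
by_position = ages.iloc[:4]    # position-based: stop position 4 is excluded
first_five = ages.head(5)      # built-in helper for "the first five rows"

print(len(by_label), len(by_position), len(first_five))  # 5 4 5
```

A hint for the challenge: index labels are not always integers. They can be strings or dates, where "one past the end" has no obvious meaning.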

---

In [None]:
# Calculate the mean "by hand"


In [None]:
# Calculate the variance "by hand"


In [None]:
# Calculate the standard deviation "by hand"


In [None]:
# Calculate the mean, variance, and standard deviation using Pandas
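
One way the four cells above could be filled in, shown as a self-contained sketch. The five ages are hypothetical stand-ins for `first_five`; in the notebook the same steps would run on the real series.

```python
import math
import pandas as pd

# Hypothetical values standing in for the first five Titanic ages.
first_five = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="age")

# Mean "by hand": add everything up, then divide by the count.
n = len(first_five)
total = 0.0
for age in first_five:
    total += age
mean = total / n

# Sample variance "by hand": sum the squared deviations, divide by n - 1.
squared_deviation_total = 0.0
for age in first_five:
    squared_deviation_total += (age - mean) ** 2
variance = squared_deviation_total / (n - 1)

# Sample standard deviation: the square root of the sample variance.
std_dev = math.sqrt(variance)
print(mean, variance, std_dev)

# pandas uses the same n - 1 convention by default (ddof=1).
print(first_five.mean(), first_five.var(), first_five.std())
```

NumPy's `np.var` and `np.std` default to `ddof=0` (divide by $n$), so pass `ddof=1` there to match pandas.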


<a id="our-first-model"></a>
## Our First Model
---

A statistical model is like a model car: it doesn't capture everything about its target, but it tries to capture what we care about for some purpose.

For example, a model to predict lung cancer might use age and smoking history.

As an extremely simple model that we can use as a baseline, we can just use the mean of a variable to predict its value in a particular case.

In [None]:
# Get the `fare` column from the Titanic data and store it in variable `y`


In [None]:
# Create predictions y_pred (in this case just the mean of `y`)


In [None]:
# Find the average squared distance between each prediction and its actual value


This quantity is called "mean squared error" (MSE).

**Exercise.** The mean squared error is essentially equal to what descriptive statistic when our prediction is the mean? (Why only "essentially" equal?)

---

In [None]:
# Calculate the root mean squared error (RMSE), the square root of the MSE
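
The baseline model, its MSE, and its RMSE can be sketched as follows. The fares are hypothetical; in the notebook `y` would be `titanic_df.loc[:, 'fare']`.

```python
import numpy as np
import pandas as pd

# Hypothetical fares for illustration only.
y = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="fare")

# Baseline "model": predict the mean of y for every passenger.
y_pred = np.full(len(y), y.mean())

mse = ((y - y_pred) ** 2).mean()   # average squared distance between prediction and actual
rmse = np.sqrt(mse)                # square root puts the error back on the scale of the fares

print(mse, rmse)
```

RMSE is usually the easier number to interpret, because it is in the same units as the variable being predicted.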


<a id="a-short-introduction-to-model-bias-and-variance"></a>
## A Short Introduction to Model Bias and Variance 

- Model **bias** is systematic error that won't go away no matter how much data you collect.
- Model **variance** is random error that will fade away as you collect more data (and refit the model).
- What you should do next to improve your model depends on whether it has more of a bias problem or more of a variance problem.
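
A small simulation can illustrate the distinction. This is a sketch under made-up assumptions: the data follow $y = 2x + \text{noise}$, one model ignores $x$ and always predicts the training mean, and the other fits a straight line with `np.polyfit`. The function name `make_data`, the seed, and the sample sizes are all arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical data: y depends linearly on x, plus noise with variance 1."""
    x = rng.uniform(0, 10, size=n)
    y = 2.0 * x + rng.normal(0, 1.0, size=n)
    return x, y


x_test, y_test = make_data(5_000)

for n in (20, 200, 20_000):
    x_train, y_train = make_data(n)

    # Model A ignores x and always predicts the training mean.
    mean_pred = np.full(len(y_test), y_train.mean())
    mse_mean = np.mean((y_test - mean_pred) ** 2)

    # Model B fits a straight line, which matches how the data were generated.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    mse_line = np.mean((y_test - (slope * x_test + intercept)) ** 2)

    print(f"n={n:>6}  mean-only MSE={mse_mean:.2f}  line MSE={mse_line:.2f}")
```

The mean-only model's error stays large however much training data it sees, which is the signature of bias. The line's error settles near the noise variance of 1; the extra error it shows at small `n` comes from the fit changing from sample to sample, which is variance.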

<a id="correlation-and-association"></a>
## Correlation and Association

The magnitude of the **(Pearson) correlation coefficient** for a pair of variables measures how well you could predict one as a linear function of the other; its sign indicates whether they tend to move in the same direction or in opposite directions.
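
As a sketch, the coefficient can be computed directly from its standard definition (covariance divided by the product of the standard deviations) and checked against NumPy's `np.corrcoef`. The helper name `pearson_r` and the toy data are made up for this illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())


x = np.array([1, 2, 3, 4, 5])
noisy = [2, 1, 4, 3, 5]

print(pearson_r(x, 2 * x + 1))       # perfectly linear, increasing
print(pearson_r(x, -3 * x))          # perfectly linear, decreasing
print(pearson_r(x, noisy))           # roughly increasing, not perfectly linear
print(np.corrcoef(x, noisy)[0, 1])   # NumPy's built-in gives the same value
```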

![Example correlation values](./assets/images/correlation_examples.png)

**Exercise.** Based on the characterization of the correlation coefficient given above, why is its value 0 in the third row even though the two variables are clearly related?

<a id="codealong-correlation-in-pandas"></a>
## Correlation in Pandas

In [None]:
# Display the correlation matrix for all Titanic variables


In [None]:
# Use Seaborn to plot a heat map of the correlation matrix


In [None]:
# Take a closer look at survived and fare using a scatter plot
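
One way the three cells above could look, sketched on a small made-up DataFrame with Titanic-style column names. The rows are hypothetical; in the notebook the same calls would be made on `titanic_df`.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "pclass":   [3, 1, 3, 1, 3, 3, 2, 3],
    "fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 30.07, 21.08],
    "sex":      ["male", "female", "female", "female", "male", "male", "female", "male"],
})

# Correlation is only defined for numeric columns, so select those first.
corr = df.select_dtypes(include="number").corr()
print(corr)

# Heat map of the correlation matrix, with the coefficient written in each cell.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)

# Scatter plot of one pair of variables.
scatter_ax = df.plot(kind="scatter", x="fare", y="survived")
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale centered on zero, so positive and negative correlations of equal strength get equally intense colors.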


<a id="the-normal-distribution"></a>
## The Normal Distribution

### What is the Normal Distribution?

![](./assets/images/normal_distribution.png)

The normal distribution is a symmetric "bell curve" centered on its mean.

![](./assets/images/normal_std.png)

#### Why do we care?

- Sums and averages of many independent (or only weakly correlated) quantities are approximately normally distributed, regardless of the underlying distributions (central limit theorem). As a result, normal distributions show up a lot.
- They have some nice mathematical properties.
- Many models make normality assumptions (e.g. standard inference for linear regression assumes normally distributed errors).

#### Plot a histogram of 1,000 samples from a random normal distribution:
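
A minimal sketch using NumPy's random generator and Matplotlib. The seed, the mean of 0, the standard deviation of 1, and the 30 bins are arbitrary choices for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                       # seeded for repeatability
samples = rng.normal(loc=0.0, scale=1.0, size=1000)  # 1,000 standard normal draws

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(samples, bins=30)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("1,000 samples from a normal distribution")
```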

<a id="skewness"></a>
### Skewness
- Skewness is a measure of the asymmetry of the distribution of a random variable about its mean.
- Skewness can be positive or negative, or even undefined.
- Notice that the mean, median, and mode are the same when a single-peaked distribution is symmetric, and so has no skew.

![skewness](./assets/images/skewness---mean-median-mode.jpg)

#### Plot a lognormal distribution generated with NumPy.

In [None]:
# Plot a lognormal distribution generated with NumPy


Many algorithms rely on the mean being "typical." With skewed data, you may have to transform your data or set your algorithms to work with the median.
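
The sketch below generates lognormal samples, shows how the skew separates the mean from the median, and applies a log transform as one example of the kind of transformation mentioned above. The seed and the parameters `mean=0.0`, `sigma=1.0` are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# The long right tail pulls the mean above the median.
print(skewed.mean(), skewed.median(), skewed.skew())

# Taking logs of lognormal data gives back normally distributed values.
logged = np.log(skewed)
print(logged.mean(), logged.median(), logged.skew())

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
skewed.hist(bins=30, ax=left)
left.set_title("lognormal samples")
logged.hist(bins=30, ax=right)
right.set_title("after log transform")
```

A plain log only works for strictly positive values; for data that include zeros, `np.log1p` is a common alternative.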

<a id="determining-the-distribution-of-your-data"></a>
## Determining the Distribution of Your Data

In [None]:
# Plot all variables in titanic using histograms


In [None]:
# Create boxplots of the variables.


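The box encloses the middle 50% of the values (from the first quartile to the third quartile), and the line inside it marks the median. The whiskers extend out to the min and max, except that points more than 1.5 times the height of the box (the interquartile range) beyond either edge of the box are considered "outliers" and plotted as individual points.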
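
The usual 1.5 × IQR rule that box plots apply can be sketched with NumPy. The values reuse the ten-number list from the central tendency exercise; the variable names are made up for this illustration.

```python
import numpy as np

values = np.array([1, 1, 2, 3, 4, 5, 6, 7, 8, 1000])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                      # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as "outliers".
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, iqr)         # 2.25 4.5 6.75 4.5
print(outliers)                    # [1000]
```

The fences are measured from the edges of the box (the quartiles), not from the median.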

In [None]:
# Use showfliers=False to omit "outliers"
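
A sketch of how the three plotting cells above might be filled in, using a small made-up DataFrame. The rows are hypothetical; in the notebook the same calls would be made on `titanic_df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "age":  [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 512.33],
})

# One histogram per numeric column.
hist_axes = df.hist(bins=10, figsize=(8, 3))

# One box per column; extreme points appear as individual markers.
fig, (with_fliers, without_fliers) = plt.subplots(1, 2, figsize=(8, 3))
df.boxplot(ax=with_fliers)
with_fliers.set_title("default")

# showfliers=False is passed through to Matplotlib and hides the outlier markers.
df.boxplot(ax=without_fliers, showfliers=False)
without_fliers.set_title("showfliers=False")
```

Hiding the outlier markers lets the y-axis zoom in on the boxes, which helps when one extreme value (such as the largest fare here) would otherwise flatten everything else.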


<a id="exercise"></a>
**Exercise.**

Which Titanic variables are normal? Which are skewed? How do these characteristics affect our data modeling?

<a id="topic-review"></a>
## Lesson Review
---

1. We covered several different types of summary statistics. What are they?
2. We covered three different types of visualizations. Which ones?
3. Describe bias and variance and why they are important.
4. What are some important characteristics of distributions?



# Projects

Due Mon. 5/14:

- [Unit 2 Project](https://git.generalassemb.ly/chi-ds-8/unit2_project)
- [Final Project Pt 1](https://git.generalassemb.ly/chi-ds-8/final_project/blob/master/requirements.md#proposal)

See [Final Project Tips](https://git.generalassemb.ly/chi-ds-8/final_project#tips)

# Questions?

# [Exit Tickets](https://docs.google.com/forms/d/e/1FAIpQLSfvZxGW3CHmZzIhNqz-5uzMeDTQHHDOH8G2ovTPbT5U6Q8KLA/viewform?usp=send_form)