## **Statistical Thinking in Python (Part 1)**

**Course Description**

After all of the hard work of acquiring data and getting them into a form you can work with, you ultimately want to make clear, succinct conclusions from them. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference. In this course, you will start building the foundation you need to think statistically, to speak the language of your data, to understand what they are telling you. The foundations of statistical thinking took decades upon decades to build, but they can be grasped much faster today with the help of computers. With the power of Python-based tools, you will rapidly get up to speed and begin thinking statistically by the end of this course.

**Imports**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path

from sklearn.datasets import load_iris

**Pandas Configuration Options**

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

**Data Files Location**

* Most data files for the exercises can be found on the [course site](https://www.datacamp.com/courses/statistical-thinking-in-python-part-1)
    * [2008 election results (all states)](https://assets.datacamp.com/production/repositories/469/datasets/8fb59b9a99957c3b9b1c82b623aea54d8ccbcd9f/2008_all_states.csv)
    * [2008 election results (swing states)](https://assets.datacamp.com/production/repositories/469/datasets/e079fddb581197780e1a7b7af2aeeff7242535f0/2008_swing_states.csv)
    * [Belmont Stakes](https://assets.datacamp.com/production/repositories/469/datasets/7507bfed990379f246b4f166ea8a57ecf31c6c9d/belmont.csv)
    * [Speed of light](https://assets.datacamp.com/production/repositories/469/datasets/df23780d215774ff90be0ea93e53f4fb5ebbade8/michelson_speed_of_light.csv)

**Data File Objects**

In [None]:
data = Path.cwd() / 'data' / 'statistical_thinking_1'
elections_all_file = data / '2008_all_states.csv'
elections_swing_file = data / '2008_swing_states.csv'
belmont_file = data / 'belmont.csv'
sol_file = data / 'michelson_speed_of_light.csv'

**Iris Data Set**

In [None]:
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

def iris_typing(x):
    types = {0.0: 'setosa',
             1.0: 'versicolour',
             2.0: 'virginica'}
    return types[x]

iris_df['species'] = iris_df.target.apply(iris_typing)
iris_df.head()

# Graphical exploratory data analysis

Look before you leap! A very important proverb, indeed. Before diving headlong into sophisticated statistical inference techniques, you should first explore your data by plotting them and computing simple summary statistics. This process, called exploratory data analysis, is a crucial first step in statistical analysis of data. So it is a fitting subject for the first chapter of Statistical Thinking in Python.

## Introduction to exploratory data analysis

* Exploring the data is a crucial step of the analysis.
    * Organizing
    * Plotting
    * Computing numerical summaries
* This idea is known as exploratory data analysis (EDA)
* "Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone." - [John Tukey](https://en.wikipedia.org/wiki/John_Tukey)

In [None]:
swing = pd.read_csv(elections_swing_file)
swing.head()

* The raw data isn't particularly informative
* We could start computing parameters and their confidence intervals and run hypothesis tests...
* ...however, we should graphically explore the data first

### Tukey's comments on EDA

Even though you probably have not read Tukey's book, I suspect you already have a good idea about his viewpoint from the video introducing you to exploratory data analysis. Which of the following quotes is not directly from Tukey?

1. Exploratory data analysis is detective work.
1. There is no excuse for failing to plot and look.
1. The greatest value of a picture is that it forces us to notice what we never expected to see.
1. It is important to understand what you can do before you learn how to measure how well you seem to have done it.
1. ~~**Often times EDA is too time consuming, so it is better to jump right in and do your hypothesis tests.**~~

### Advantages of graphical EDA

Which of the following is not true of graphical EDA?

1. It often involves converting tabular data into graphical form.
1. If done well, graphical representations can allow for more rapid interpretation of data.
1. ~~**A nice looking plot is always the end goal of a statistical analysis.**~~
1. There is no excuse for neglecting to do graphical EDA.

## Plotting a histogram

* always label the axes

In [None]:
bin_edges = list(range(0, 110, 10))
plt.hist(x=swing.dem_share, bins=bin_edges, edgecolor='black')
plt.xticks(bin_edges)
plt.yticks(bin_edges[:-1])
plt.xlabel('Percent of vote for Obama')
plt.ylabel('Number of Counties')
plt.show()

**Seaborn**

In [None]:
import seaborn as sns
sns.set()

In [None]:
plt.hist(x=swing.dem_share)
plt.xlabel('Percent of vote for Obama')
plt.ylabel('Number of Counties')
plt.show()

### Plotting a histogram of iris data

For the exercises in this section, you will use a classic data set collected by botanist Edgar Anderson and made famous by Ronald Fisher, one of the most prolific statisticians in history. Anderson carefully measured the anatomical properties of samples of three different species of iris, *Iris setosa*, *Iris versicolor*, and *Iris virginica*. The full data set is [available as part of scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). Here, you will work with his measurements of petal length.

Plot a histogram of the petal lengths of his 50 samples of Iris versicolor using matplotlib/seaborn's default settings. Recall that to specify the default seaborn style, you can use `sns.set()`, where `sns` is the alias that `seaborn` is imported as.

The subset of the data set containing the Iris versicolor petal lengths in units of centimeters (cm) is stored in the NumPy array `versicolor_petal_length`.

In the video, Justin plotted the histograms by using the `pandas` library and indexing the DataFrame to extract the desired column. Here, however, you only need to use the provided NumPy array. Also, Justin assigned his plotting statements (except for `plt.show()`) to the dummy variable `_`. This is to prevent unnecessary output from being displayed. It is not required for your solutions to these exercises; however, it is good practice to use it. Alternatively, if you are working in an interactive environment such as a Jupyter notebook, you could use a `;` after your plotting statements to achieve the same effect. Justin prefers using `_`. Therefore, you will see it used in the solution code.

**Instructions**

* Import `matplotlib.pyplot` and `seaborn` as their usual aliases (`plt` and `sns`).
* Use `seaborn` to set the plotting defaults.
* Plot a histogram of the Iris versicolor petal lengths using `plt.hist()` and the provided NumPy array `versicolor_petal_length`.
* Show the histogram using `plt.show()`.

In [None]:
versicolor_petal_length = iris_df['petal length (cm)'][iris_df.species == 'versicolour']

In [None]:
plt.hist(versicolor_petal_length)
plt.show()

### Axis labels!

In the last exercise, you made a nice histogram of petal lengths of Iris versicolor, but **you didn't label the axes!** That's ok; it's not your fault since we didn't ask you to. Now, add axis labels to the plot using `plt.xlabel()` and `plt.ylabel()`. Don't forget to add units and assign both statements to `_`. The packages `matplotlib.pyplot` and `seaborn` are already imported with their standard aliases. This will be the case in what follows, unless specified otherwise.

**Instructions**

* Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just `'count'`. Your x-axis label is `'petal length (cm)'`. The units are essential!
* Display the plot constructed in the above steps using `plt.show()`.

In [None]:
plt.hist(versicolor_petal_length)
plt.xlabel('petal length (cm)')
plt.ylabel('count')
plt.show()

### Adjusting the number of bins in a histogram

The histogram you just made had ten bins. This is the default of matplotlib. The "square root rule" is a commonly used rule of thumb for choosing the number of bins: choose the number of bins to be the square root of the number of samples. Plot the histogram of Iris versicolor petal lengths again, this time using the square root rule for the number of bins. You specify the number of bins using the `bins` keyword argument of `plt.hist()`.

The plotting utilities are already imported and the seaborn defaults already set. The variable you defined in the last exercise, `versicolor_petal_length`, is already in your namespace.

**Instructions**

* Import `numpy` as `np`. This gives access to the square root function, `np.sqrt()`.
* Determine how many data points you have using `len()`.
* Compute the number of bins using the square root rule.
* Convert the number of bins to an integer using the built in `int()` function.
* Generate the histogram and make sure to use the `bins` keyword argument.
* Hit 'Submit Answer' to plot the figure and see the fruit of your labors!

In [None]:
# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

## Plotting all of your data: Bee swarm plots

* Binning bias: the same data may be interpreted differently depending on the choice of bins
* Additionally, not all of the data are plotted; the precision of the actual data is lost in the bins
* These issues can be resolved with swarm plots
    * Point position along the y-axis is the quantitative information
    * The data are spread in x to make them visible, but their precise location along the x-axis is unimportant
    * No binning bias and all the data are displayed.
    * Seaborn & Pandas
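Binning bias can be seen without drawing anything. The sketch below uses a small made-up sample (not the election data) and counts it twice with `np.histogram()`, keeping the bin width at 1 and only shifting the bin edges. The array `data` and both sets of edges are assumptions chosen for illustration.

```python
import numpy as np

# Made-up sample clustered around 3 (illustration only, not course data).
data = np.array([2.6, 2.8, 2.9, 3.1, 3.2, 3.4])

# Same bin width (1.0), two different placements of the bin edges.
counts_a, edges_a = np.histogram(data, bins=[2, 3, 4])
counts_b, edges_b = np.histogram(data, bins=[1.5, 2.5, 3.5, 4.5])

print(counts_a)  # the cluster is split evenly across two bins
print(counts_b)  # the same cluster lands in a single central bin
```

With the first set of edges the data look evenly spread over two bins; with the second they look like a single sharp peak. A swarm plot of the same six points would show one picture regardless.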

In [None]:
sns.swarmplot(x='state', y='dem_share', data=swing)
plt.xlabel('state')
plt.ylabel('percent of vote for Obama')
plt.title('% of Vote per Swing State County')
plt.show()

### Bee swarm plot

Make a bee swarm plot of the iris petal lengths. Your x-axis should contain each of the three species, and the y-axis the petal lengths. A data frame containing the data is in your namespace as `df`.

For your reference, the code Justin used to create the bee swarm plot in the video is provided below:

```python
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
```

In the IPython Shell, you can use `sns.swarmplot?` or `help(sns.swarmplot)` for more details on how to make bee swarm plots using seaborn.

**Instructions**

* In the IPython Shell, inspect the DataFrame `df` using `df.head()`. This will let you identify which column names you need to pass as the `x` and `y` keyword arguments in your call to `sns.swarmplot()`.
* Use `sns.swarmplot()` to make a bee swarm plot from the DataFrame containing the Fisher iris data set, `df`. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
* Label the axes.
* Show your plot.

In [None]:
sns.swarmplot(x='species', y='petal length (cm)', data=iris_df)
plt.xlabel('species')
plt.ylabel('petal length (cm)')
plt.show()

### Interpreting a bee swarm plot

Which of the following conclusions could you draw from the bee swarm plot of iris petal lengths you generated in the previous exercise? For your convenience, the bee swarm plot is regenerated and shown to the right.

**Instructions**

Possible Answers
1. All I. versicolor petals are shorter than I. virginica petals.
1. I. setosa petals have a broader range of lengths than the other two species.
1. __**I. virginica petals tend to be the longest, and I. setosa petals tend to be the shortest of the three species.**__
1. I. versicolor is a hybrid of I. virginica and I. setosa.

## Plotting all of your data: Empirical cumulative distribution functions

* x-value of an ECDF is the quantity being measured
* y-value is the fraction of data points that have a value less than or equal to the corresponding x-value
* Shows all the data and gives a complete picture of how the data are distributed
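A tiny worked example may make the definition concrete. The sketch below uses five made-up values (an assumption for illustration, not the election data) and builds the ECDF by hand: the sorted values are the x-values, and each y-value is the fraction of points at or below the matching x.

```python
import numpy as np

# Made-up measurements (illustration only).
data = np.array([4.0, 1.0, 3.0, 2.0, 5.0])

# x-values: the sorted data.
x = np.sort(data)

# y-values: 1/n, 2/n, ..., 1 for n data points.
y = np.arange(1, len(x) + 1) / len(x)

for xi, yi in zip(x, y):
    print(f"{yi:.0%} of the data are <= {xi}")

# Cross-check one point directly: the fraction of values at or below 3.0.
fraction_at_or_below_3 = np.mean(data <= 3.0)
```

Here `fraction_at_or_below_3` matches the y-value paired with x = 3.0, which is how a point on the ECDF is read.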

In [None]:
x = np.sort(swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)

plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('percent of vote for Obama')
plt.ylabel('ECDF')
plt.margins(0.02)  # keep data off plot edges

plt.show()

#### add annotations

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)           # Default margin is 0.05, value 0 means fit

x = np.sort(swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)

ax.plot(x, y, marker='.', linestyle='none')
plt.xlabel('percent of vote for Obama')
plt.ylabel('ECDF')

ax.annotate('20% of counties had <= 36% vote for Obama', xy=(36, .2),
            xytext=(40, 0.1), fontsize=10, arrowprops=dict(arrowstyle="->", color='b'))

ax.annotate('75% of counties had < 50% vote for Obama', xy=(50, .75),
            xytext=(55, 0.6), fontsize=10, arrowprops=dict(arrowstyle="->", color='b'))

plt.show()

#### plot multiple ECDFs

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)           # Default margin is 0.05, value 0 means fit

for state in swing.state.unique():
    x = np.sort(swing['dem_share'][swing.state == state])
    y = np.arange(1, len(x)+1) / len(x)
    ax.plot(x, y, marker='.', linestyle='none', label=state)

plt.xlabel('percent of vote for Obama')
plt.ylabel('ECDF')
plt.legend()

plt.show()

### Computing the ECDF

In this exercise, you will write a function that takes as input a 1D array of data and then returns the `x` and `y` values of the ECDF. You will use this function over and over again throughout this course and its sequel. ECDFs are among the most important plots in statistical analysis. You can write your own function, `foo(a,b)`, according to the following skeleton:

```python
def foo(a,b):
    """State what function does here"""
    # Computation performed here
    return x, y
```
    
The function `foo()` above takes two arguments `a` and `b` and returns two values `x` and `y`. The function header `def foo(a,b):` contains the function signature `foo(a,b)`, which consists of the function name, along with its parameters. For more on writing your own functions, see [DataCamp's course Python Data Science Toolbox (Part 1)](https://www.datacamp.com/courses/python-data-science-toolbox-part-1)!

**Instructions**

* Define a function with the signature `ecdf(data)`. Within the function definition,
    * Compute the number of data points, `n`, using the `len()` function.
    * The **x**-values are the sorted data. Use the `np.sort()` function to perform the sorting.
    * The **y** data of the ECDF go from `1/n` to `1` in equally spaced increments. You can construct this using `np.arange()`. Remember, however, that the end value in `np.arange()` is not inclusive.  Therefore, `np.arange()` will need to go from `1` to `n+1`. Be sure to divide this by `n`.
    * The function returns the values `x` and `y`.

In [None]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

### Plotting the ECDF

You will now use your `ecdf()` function to compute the ECDF for the petal lengths of Anderson's *Iris versicolor* flowers. You will then plot the ECDF. Recall that your `ecdf()` function returns two arrays so you will need to unpack them. An example of such unpacking is `x, y = foo(data)`, for some function `foo()`.

**Instructions**

* Use `ecdf()` to compute the ECDF of `versicolor_petal_length`. Unpack the output into `x_vers` and `y_vers`.
* Plot the ECDF as dots. Remember to include `marker = '.'` and `linestyle = 'none'` in addition to `x_vers` and `y_vers` as arguments inside `plt.plot()`.
* Label the axes. You can label the y-axis `'ECDF'`.
* Show your plot.

In [None]:
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Label the axes
plt.xlabel('Versicolor Petal Length (cm)')
plt.ylabel('ECDF')

# Display the plot
plt.margins(0.02)  # keep data off plot edges
plt.show()

### Comparison of ECDFs

ECDFs also allow you to compare two or more distributions (though plots get cluttered if you have too many). Here, you will plot ECDFs for the petal lengths of all three iris species. You already wrote a function to generate ECDFs so you can put it to good use!

To overlay all three ECDFs on the same plot, you can use `plt.plot()` three times, once for each ECDF. Remember to include `marker='.'` and `linestyle='none'` as arguments inside `plt.plot()`.

**Instructions**

* Compute ECDFs for each of the three species using your `ecdf()` function. The variables `setosa_petal_length`, `versicolor_petal_length`, and `virginica_petal_length` are all in your namespace. Unpack the ECDFs into `x_set`, `y_set`, `x_vers`, `y_vers` and `x_virg`, `y_virg`, respectively.
* Plot all three ECDFs on the same plot as dots. To do this, you will need three `plt.plot()` commands. Assign the result of each to `_`.
* A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!

In [None]:
virginica_petal_length = iris_df['petal length (cm)'][iris_df.species == 'virginica']
setosa_petal_length = iris_df['petal length (cm)'][iris_df.species == 'setosa']

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

## Onward toward the whole story

* Start with graphical EDA!

**Coming up...**

* Thinking probabilistically
* Discrete and continuous distributions
* The power of hacker statistics using the `np.random` module

# Quantitative exploratory data analysis

In the last chapter, you learned how to graphically explore data. In this chapter, you will compute useful summary statistics, which serve to concisely describe salient features of a data set with a few numbers.

## Introduction to summary statistics: The sample mean and median

* mean - average
    * heavily influenced by outliers
    * `np.mean()`
* median - middle value of the sorted dataset
    * immune to outlier influence
    * `np.median()`

### Means and medians

Which one of the following statements is true about means and medians?

**Possible Answers**

* ~~An outlier can significantly affect the value of both the mean and the median.~~
* **An outlier can significantly affect the value of the mean, but not the median.**
* ~~Means and medians are in general both robust to single outliers.~~
* ~~The mean and median are equal if there is an odd number of data points.~~
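The effect of a single outlier can be checked numerically. The sketch below uses made-up values (an assumption for illustration, not the iris measurements) and compares the mean and median before and after one extreme point is appended.

```python
import numpy as np

# Made-up measurements (illustration only).
values = np.array([4.0, 4.2, 4.4, 4.6, 4.8])

# The same measurements with one extreme value added.
with_outlier = np.append(values, 48.0)

mean_before, median_before = np.mean(values), np.median(values)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print('mean:  ', mean_before, '->', mean_after)
print('median:', median_before, '->', median_after)
```

The mean jumps from about 4.4 to well above every typical value, while the median moves only from 4.4 to 4.5.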

### Computing means

The mean of all measurements gives an indication of the typical magnitude of a measurement. It is computed using `np.mean()`.

**Instructions**

* Compute the mean petal length of Iris versicolor from Anderson's classic data set. The variable `versicolor_petal_length` is provided in your namespace. Assign the mean to `mean_length_vers`.

In [None]:
# Compute the mean: mean_length_vers
mean_length_vers = np.mean(versicolor_petal_length)

# Print the result with some nice formatting
print('I. versicolor:', mean_length_vers, 'cm')

#### with pandas.DataFrame

In [None]:
iris_df.groupby(['species']).mean()

## Percentiles, outliers and box plots

* The median is a special name for the 50th percentile
    * 50% of the data are less than the median
* The 25th percentile is the value of the data point that is greater than 25% of the sorted data
* percentiles are useful summary statistics and can be computed using `np.percentile()`

**Computing Percentiles**

```python
np.percentile(df_swing['dem_share'], [25, 50, 75])
```
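As a quick sanity check on a sample small enough to verify by eye, the sketch below passes a list of percentiles to `np.percentile()` and confirms that the 50th percentile agrees with `np.median()`. The sample `data` is made up for illustration.

```python
import numpy as np

# Made-up sample: the integers 1 through 9.
data = np.arange(1, 10)

# Passing a list of percentiles returns one value per percentile.
p25, p50, p75 = np.percentile(data, [25, 50, 75])

print(p25, p50, p75)
print(p50 == np.median(data))  # the median is the 50th percentile
```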

![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/statistical_thinking_1/box_plot.JPG)

* Box plots are a graphical method for displaying summary statistics
    * median is the middle line: 50th percentile
    * bottom and top lines of the box represent the 25th & 75th percentiles, respectively
    * the space between the 25th and 75th percentiles is the interquartile range (IQR)
    * Whiskers extend a distance of 1.5 times the IQR, or to the extent of the data, whichever is less extreme
    * Any points outside the whiskers are plotted as individual points, which we demarcate as outliers
        * There is no single definition for an outlier; however, being more than 2 IQRs away from the median is a common criterion.
        * An outlier is not necessarily erroneous
    * Box plots are a great alternative to bee swarm plots, because bee swarm plots become too cluttered with large data sets
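The quantities a box plot draws can be computed directly. The following is a minimal sketch on made-up data, assuming the whisker rule described above is measured from the edges of the box (the 25th and 75th percentiles), which is matplotlib's default behavior. The names `lower_fence` and `upper_fence` are illustrative, not library terms.

```python
import numpy as np

# Made-up sample with one extreme value (illustration only).
data = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Box: 25th, 50th, and 75th percentiles.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whisker limits: 1.5 IQRs beyond the box edges (assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the limits.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print('box:', q1, median, q3, 'IQR:', iqr)
print('whiskers:', whisker_low, whisker_high)
print('outliers:', outliers)
```

Only the value 30.0 falls outside the whiskers here; whether it is an error or a genuine observation is a separate question.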

In [None]:
all_states = pd.read_csv(elections_all_file)
all_states.head()

In [None]:
sns.boxplot(x='east_west', y='dem_share', data=all_states)
plt.xlabel('region')
plt.ylabel('percent of vote for Obama')
plt.show()

### Computing percentiles

### Comparing percentiles to ECDF

### Box-and-whisker plot

## Variance and standard deviation

### Computing the variance

### The standard deviation and the variance

## Covariance and Pearson correlation coefficient

### Scatter plots

### Variance and covariance by looking

### Computing the covariance

### Computing the Pearson correlation coefficient

# Thinking probabilistically-- Discrete variables

Statistical inference rests upon probability. Because we can very rarely say anything meaningful with absolute certainty from data, we use probabilistic language to make quantitative statements about data. In this chapter, you will learn how to think probabilistically about discrete quantities, those that can only take certain values, like integers. It is an important first step in building the probabilistic language necessary to think statistically.

## Probabilistic logic and statistical inference

### What is the goal of statistical inference?

### Why do we use the language of probability?

## Random number generators and hacker statistics

### Generating random numbers using the np.random module

### The np.random module and Bernoulli trials

### How many defaults might we expect?

### Will the bank fail?

## Probability distributions and stories: The Binomial distribution

### Sampling out of the Binomial distribution

### Plotting the Binomial PMF

## Poisson processes and the Poisson distribution

### Relationship between Binomial and Poisson distribution

### How many no-hitters in a season?

### Was 2015 anomalous?

# Thinking probabilistically-- Continuous variables

In the last chapter, you learned about probability distributions of discrete variables. Now it is time to move on to continuous variables, such as those that can take on any fractional value. Many of the principles are the same, but there are some subtleties. At the end of this last chapter of the course, you will be speaking the probabilistic language you need to launch into the inference techniques covered in the sequel to this course.

## Probability density functions

### Interpreting PDFs

### Interpreting CDFs

## Introduction to the Normal distribution

### The Normal PDF

### The Normal CDF

## The Normal distribution: Properties and warnings

### Gauss and the 10 Deutschmark banknote

### Are the Belmont Stakes results Normally distributed?

### What are the chances of a horse matching or beating Secretariat's record?

## The Exponential distribution

### Matching a story and a distribution

### Waiting for the next Secretariat

### If you have a story, you can simulate it!

### Distribution of no-hitters and cycles

## Final thoughts and encouragement toward Statistical Thinking II