# Lecture 6 

* Exploratory Data Analysis
* Hypothesis Testing
* Bootstrap Sampling

___

# Exploratory Data Analysis

*A first look at the data*.

![image.png](attachment:image.png)
copyright: https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9

<div class="alert alert-success">
    <b>Exploratory Data Analysis</b>
    
**Exploratory data analysis** or **EDA** is a critical first step in analyzing the data from an experiment. Here are the main reasons we use EDA:
* detection of mistakes
* checking of assumptions
* preliminary selection of appropriate models
* determining relationships among the explanatory variables, and
* assessing the direction and rough size of relationships between explanatory and outcome variables.

Loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis.
</div>

Exploratory data analysis is generally cross-classified in two ways. First, each method is either 

1. **non-graphical**, or 
2. **graphical**. 

And second, each method is either 
* **univariate**, or 
* **multivariate** (usually just bivariate).

<div class="alert alert-info">
    <b>Types of EDA</b>
    
The four types of EDA are:
* univariate non-graphical
* multivariate non-graphical
* univariate graphical
* multivariate graphical
</div>

Non-graphical methods generally involve calculation of **summary statistics**, while graphical methods obviously summarize the data in a diagrammatic or pictorial way. 

* Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. 
    * Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. 
    * *It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.*

## Univariate Data

The data that come from making a particular measurement on all of the subjects in a sample represent our observations for a single characteristic such as age, gender, speed at a task, or response to a stimulus. 

We should think of these measurements as representing a *sample distribution* of the variable, which in turn more or less represents the *population distribution* of the variable. 

The usual goal of univariate non-graphical EDA is to better appreciate the *sample distribution* and also to make some tentative conclusions about what population distribution(s) is/are compatible with the sample distribution. 

* Outlier detection is also a part of this analysis.

<div class="alert alert-info">
    <b>Population</b>
    
A **population** is a group of people, objects, events or observations that is being studied.
</div>

<div class="alert alert-info">
    <b>Parameters</b>
    
Often we are trying to assess some qualities or properties of that population. We call these **parameters**.
</div>

When the population is too large to directly measure the parameters of interest, then we try to draw inferences from a subset of the population.

<div class="alert alert-info">
    <b>Sample</b>
    
A **sample** from a population is a subset of the population that can be used to draw inferences about the parameters of interest.
</div>

* A sample is usually drawn randomly from the population.

* We usually require that each member of the sample is chosen independently from other members.

* Often, but not always, each member in the population is equally likely to be included in the sample.

<div class="alert alert-info">
    <b>Statistic</b>
    
A **statistic** is a measurement of a quality or property on a sample that is used to assess a parameter of the whole population.
</div>

When samples are small, the statistics often provide little or no information about the parameters.

* For example, consider the problem of determining whether a coin is fair or two-headed. The result of flipping the coin once provides little useful information for deciding between the two (a single head is consistent with either).

When samples are larger, they generally more accurately represent the population.

In practice, when dealing with data, there are generally two cases that we will encounter:

1. When designing an experiment, the statistician can choose the sample size to balance between being able to generate a useful statistic and the cost of taking more samples.

2. Sometimes the experiment has already been carried out or is not under the control of the statistician. For instance, the statistician wants to assess something based on an existing survey or compare effects of a change in laws on a set of states. In this case, the sample size is fixed.

### Example: Effect of 1994-2004 Federal Assault Weapon Ban

In 1994, the United States Congress passed a ban on a variety of semiautomatic rifles that are sometimes referred to as "assault weapons". The ban was in effect for 10 years, from 1994-2004. ([State Firearm Laws](https://www.statefirearmlaws.org/resources))

It might be guessed that the goal of any gun ban is to reduce gun violence. Thus it is natural to assess whether the "assault weapon" ban had any effect on gun violence.

Fortunately, the Centers for Disease Control and Prevention's National Center for Health Statistics tracks firearm mortality at the state level. Visualizations of firearm mortality by state, along with links to download the data, are available here:

https://www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm

Although this page does not have data prior to 2005, the data for 2005 should be similar to that before the ban because the ban was only on the **sale** of certain firearms. It would take many years for this ban to actually affect the availability of firearms.

Thus, we can use two sets of data on that page to measure the effect of the "assault weapons" ban:

* The 2005 data set represents firearm mortality after the ban had been in effect for a decade
* The 2014 data set represents firearm mortality a decade after the ban had expired

I have downloaded this data and saved it in the file called **"firearms-combined.csv"**.

**Make sure you have the CSV file wherever you are working on this notebook!**

Now let's read the data from the CSV file into a dataframe:

* Death rates are measured per 100,000 total population.

Let's access the sample values for columns "RATE-2005" and "RATE-2014":

In [None]:
# Note that I went directly to a numpy array here, instead of making a list first
# The reason for using a numpy array is that we want to apply numpy methods for 
# computing statistics further below!
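
The cell that read the file is not reproduced in these notes. A minimal sketch of this step, assuming the CSV has one row per state with the columns "RATE-2005" and "RATE-2014" named above. The small inline table, its "STATE" column name, and all of its values are made up so that the sketch runs without the file:

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for "firearms-combined.csv". The "STATE" column name and
# every value here are assumptions for illustration, NOT the CDC figures.
csv_text = """STATE,RATE-2005,RATE-2014
AA,10.0,11.5
BB,16.2,16.9
CC,8.1,8.0
DD,12.4,13.3
"""

# With the real file in the working directory this would instead be:
# df = pd.read_csv("firearms-combined.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Pull each column straight into a numpy array so numpy's statistics apply.
rates_2005 = df["RATE-2005"].to_numpy()
rates_2014 = df["RATE-2014"].to_numpy()

print(df.head())
print(rates_2005.mean(), rates_2014.mean())
```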



Let's begin by plotting this data:

## Histograms

A common visualization is a histogram of the data. Unlike the data in the histograms we previously generated, this data takes on **real values**, not just integers. Fortunately, ```matplotlib``` has functions to do the hard work of making histograms for us:

Some styling will help make this more legible:

Each bar of the histogram represents a "bin" of data values. In fact, the counts and bin edges are returned by the hist function. We can easily change the number of bins to provide more resolution:
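
As a rough sketch of the plotting steps described here, the following draws a styled histogram and inspects what `hist` returns. The rates are synthetic stand-in values rather than the CDC data, and the axis labels are only a suggestion:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in rates (NOT the CDC data), one value per state.
rng = np.random.default_rng(0)
rates = rng.normal(loc=11.0, scale=4.0, size=50)

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn bar patches.
counts, edges, patches = ax.hist(rates, bins=10, edgecolor="black", alpha=0.7)
ax.set_xlabel("Firearm deaths per 100,000 population")
ax.set_ylabel("Number of states")
ax.set_title("Histogram of stand-in rates")

print(counts)  # how many values fell into each bin
print(edges)   # one more edge than there are bins
```

Changing `bins` changes the resolution; with only 50 values, a large number of bins leaves most bars holding zero or one observation.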

Let's add some information to make this more useful:

* **What *inferences* might you make from this plot?**

However, it does not make sense to make the number of bins very large compared to the data size.

## Summary Statistics

Summary statistics are values calculated from sample data that measure some characteristic about the data.

* **What is the most common summary statistic?**

The **average** or **sample mean**. I **strongly** prefer the word average for the statistic computed from a set of data. 

We will use the word **mean** to refer to a type of average for random phenomena, when we do not have specific samples for those values. 

* What does the **average** or **sample mean** mean?

    1. The value around which most of the data is centered
    
    2. The value that has minimum distance from every value
    
    3. Value most likely to occur
    
    4. Value that divides group into 2 sets of equal size 

Both ```pandas``` and ```numpy``` provide methods to calculate the average:
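
A minimal sketch of both routes, using a tiny made-up column (the values are for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"RATE-2005": [10.0, 16.2, 8.1, 12.4]})
values = df["RATE-2005"].to_numpy()

avg_pandas = df["RATE-2005"].mean()       # pandas Series method
avg_numpy = np.mean(values)               # numpy function; values.mean() also works
avg_by_hand = values.sum() / len(values)  # sum of the values over their count

print(avg_pandas, avg_numpy, avg_by_hand)
```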

Other **summary statistics** are also used to summarize a set of observations. The most common ones are:

1. **Average** - the value around which most of the data is centered

2. **Size** - number of observations in the sample data

3. **Count** - number of non-empty observations in the sample data

4. **Median** - the "middle number" of the sorted sample data values

5. **Standard deviation** - a measure of dispersion; roughly, the typical distance between a single observation and the average value

6. **Quartiles** - the values that divide the sorted sample data into four quarters of equal size

7. **Interquartile Range (IQR)** - the width of the range that holds the "middle fifty" percent of the data

In ```pandas``` we can print a summary statistic table this way:
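
A sketch of the summary table, assuming a dataframe with the two rate columns. The values are synthetic stand-ins, and the hand-computed quantities show where a few of the table entries come from:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the CDC figures).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "RATE-2005": rng.normal(11.0, 4.0, size=50),
    "RATE-2014": rng.normal(11.6, 4.0, size=50),
})

# count, mean, std, min, quartiles (25%, 50%, 75%) and max for each column.
summary = df.describe()
print(summary)

# The same quantities by hand for one column.
x = df["RATE-2005"].to_numpy()
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
# pandas reports the sample standard deviation (ddof=1); numpy defaults to ddof=0.
std_sample = np.std(x, ddof=1)
print(median, iqr, std_sample)
```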

A good graphical descriptor that displays a few of these summary statistics is the **boxplot** or **whisker plot**:

![boxplot](https://www.simplypsychology.org/boxplot.jpg)

We can use ```matplotlib``` to display a boxplot:
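
One possible version, drawing the two years side by side. The arrays are synthetic stand-ins for the real columns, and the labels are only a suggestion:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (NOT the CDC figures).
rng = np.random.default_rng(2)
rates_2005 = rng.normal(11.0, 4.0, size=50)
rates_2014 = rng.normal(11.6, 4.0, size=50)

fig, ax = plt.subplots()
# One box per array; boxplot returns a dict of the drawn line objects.
box = ax.boxplot([rates_2005, rates_2014])
ax.set_xticklabels(["2005", "2014"])
ax.set_ylabel("Firearm deaths per 100,000 population")

# The line inside each box is drawn at the median of that array.
median_2005 = box["medians"][0].get_ydata()[0]
print(median_2005, np.median(rates_2005))
```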

Or we can use built-in ```pandas``` graphic visualizations directly on dataframes:

* **What *inferences* might you make from this plot?**

The sample mean of the 2014 data set is larger than that for the 2005 data set. This may indicate that the expiration of the assault weapon ban in 2004 is associated with an increase in firearms mortality.

However, the difference is relatively small, as are the sample sizes (50).

By performing EDA, we have gathered a lot of information and we may want to start answering some questions that require statistical hypothesis testing and modeling. 

* For example, for the firearm law example, we may *hypothesize* that the observed average difference is just based on random sampling from the underlying population, that is, that the ban did not have an effect on firearm mortality rate.

___

# Binary Hypothesis Testing

* The *null hypothesis* is that there is no real difference between the two data sets, and any differences are just based on random sampling from the underlying population.

So, let's **assume that the two samples are from the same population**. 

* By combining the samples (called **pooling**), we get a new, larger subset of the original population, if the null hypothesis is true. Moreover, this pooled set represents the original population better than either of the samples alone.

* We can test the null hypothesis by checking how often samples drawn from the pooled data set have a difference in means at least as large as the one observed.

<div class="alert alert-info">
    <b>Pooling</b>
    
**Pooling** describes the practice of gathering together small sets of data that are assumed to have been *drawn* from the same underlying population and using the combined larger set (the *pool*) to obtain a more precise estimate of that population.
</div>

## Sampling

**The big question:** to sample **with replacement** or **without replacement**?

<div class="alert alert-info">
    <b>Bootstrapping</b>
    
**Sampling with replacement** from a pooled set is called **bootstrapping** and is the most popular resampling technique. It is meant to better emulate independent sampling from the original population.
</div>

<div class="alert alert-info">
    <b>Permutations</b>
    
**Sampling without replacement** from a pooled set better emulates **permutation** tests, where we check every possible reordering of the data into samples. This will be discussed more later.
</div>

* Generally, *sampling without replacement* is more conservative (produces a higher $p$-value) than bootstrapping. 
* Bootstrapping is **easy** and **most popular**, so we apply it here.

**The Bootstrap Idea:** The original sample approximates the population from which it was drawn. So *resamples* from this sample approximate what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribution of the statistic, based on many samples.

### Bootstrap Model 1

* How would we randomly choose from this data **with replacement**?

* And, if each resample is a new sample, what size should the resample have?

Recall that ```numpy.random``` has a similar method:
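
A minimal sketch of drawing one resample with replacement from the pooled data. The arrays are synthetic stand-ins, and the resample is given the same size as one of the original samples, which is one reasonable answer to the question above rather than the only one:

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC figures).
rng = np.random.default_rng(3)
rates_2005 = rng.normal(11.0, 4.0, size=50)
rates_2014 = rng.normal(11.6, 4.0, size=50)

# Pool the two samples into one array of 100 values.
pooled = np.concatenate([rates_2005, rates_2014])

# Legacy interface: numpy.random.choice, sampling with replacement.
resample_legacy = np.random.choice(pooled, size=len(rates_2005), replace=True)

# Newer Generator interface, same idea with an explicit seeded generator.
resample = rng.choice(pooled, size=len(rates_2005), replace=True)

print(resample[:5])
```

Because the draw is with replacement, the same pooled value can appear more than once in a resample.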

For a significance level of $\alpha = 0.05$, let's build a Bootstrap simulation to compute the probability of observing a mean difference of 0.63 or larger:
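
The simulation could be sketched as follows. The data are synthetic stand-ins, so the observed difference is computed from them rather than fixed at 0.63, and the number of simulations is an arbitrary choice:

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC figures).
rng = np.random.default_rng(4)
rates_2005 = rng.normal(11.0, 4.0, size=50)
rates_2014 = rng.normal(11.6, 4.0, size=50)

# Observed difference of averages (0.63 for the lecture's data).
observed = rates_2014.mean() - rates_2005.mean()

pooled = np.concatenate([rates_2005, rates_2014])
n = len(rates_2005)
num_sims = 10_000

count = 0
for _ in range(num_sims):
    # Under the null hypothesis both samples come from the same pool.
    sample_a = rng.choice(pooled, size=n, replace=True)
    sample_b = rng.choice(pooled, size=n, replace=True)
    # One-sided check: is the resampled difference at least as large?
    if sample_b.mean() - sample_a.mean() >= observed:
        count += 1

p_value = count / num_sims
alpha = 0.05
print(p_value, p_value < alpha)
```

A two-sided version would compare absolute values of the differences instead.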

* **What is the conclusion?**

    * **Is the result statistically significant?** <!--No, because the p-value is larger than $\alpha=0.05$.-->
    * **Can we reject the null hypothesis?** <!--No, "we cannot reject the null hypothesis". -->
    * **Conclusion:** <!--The data do not provide evidence that the ban had an effect on firearm mortality rate.-->

### Bootstrap Model 2

A more reasonable bootstrap approach would be to randomly assign values from 2005 or 2014 **for each state** and then assess the difference:

In [None]:
# Alternatively: Use the Pandas library



Now, we want to use a special kind of array indexing: **fancy indexing**.

For a significance level of $\alpha = 0.05$, let's build a Bootstrap simulation to compute the probability of observing a mean difference of 0.63 or larger:
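
One possible reading of this model, sketched with fancy indexing: for every state, each resample takes either that state's 2005 value or its 2014 value at random. The data are synthetic stand-ins and the details (two independent picks per state, one-sided comparison) are assumptions, not necessarily the lecture's exact code:

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC figures): 2014 is 2005 plus a small shift.
rng = np.random.default_rng(5)
rates_2005 = rng.normal(11.0, 4.0, size=50)
rates_2014 = rates_2005 + rng.normal(0.6, 1.0, size=50)
observed = rates_2014.mean() - rates_2005.mean()

# Row 0 holds the 2005 values, row 1 the 2014 values; one column per state.
both = np.vstack([rates_2005, rates_2014])
n = both.shape[1]
states = np.arange(n)
num_sims = 10_000

count = 0
for _ in range(num_sims):
    # For each state pick row 0 or row 1 at random.
    pick_a = rng.integers(0, 2, size=n)
    pick_b = rng.integers(0, 2, size=n)
    # Fancy indexing pairs pick[i] with states[i]: one value per state.
    sample_a = both[pick_a, states]
    sample_b = both[pick_b, states]
    if sample_b.mean() - sample_a.mean() >= observed:
        count += 1

p_value = count / num_sims
print(p_value, p_value < 0.05)
```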

* **What is the conclusion?**

    * **Is the result statistically significant?** <!--Yes, because the p-value is smaller than $\alpha=0.05$.-->
    * **Can we reject the null hypothesis?** <!--Yes, we reject the null hypothesis-->
    * **Conclusion:** <!--Under this interpretation, the expiration of the restriction on assault weapons is associated with an increase in mean firearms mortality.-->
    
<!--It depends on how you interpret the data!-->

### Distribution of the bootstrap mean-difference

Every time we create a bootstrap value for the difference of means, we create a new random value. Let's see how the bootstrap means are distributed by looking at a histogram of those values:

A few observations:
    
1. The difference of means has a bell shape -- we saw that before. Why do you think that is?
2. Almost all of the values fall between -0.5 and +0.5. Thus, it is not surprising that getting a mean-difference as large as 0.6 is very rare.

**Topic for later:** The **Central Limit Theorem** (CLT) for sums says that if you keep drawing larger and larger samples and taking their sums, the sums form their own distribution (the sampling distribution), which approaches a normal distribution as the sample size increases. 

We can now consider the question: **what range of mean-difference values would lead us, at the 95% confidence level, to FAIL TO REJECT the null hypothesis?**

So, the percentage of data lying below -0.5 is:

Similarly, the percentage lying above 0.5 is:
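
These tail percentages can be computed directly from the stored bootstrap differences. A sketch, assuming the differences are kept in an array called `diffs` (a hypothetical name) and generated here from synthetic stand-in data, so the printed percentages will not match the lecture's:

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC figures): 2014 is 2005 plus a small shift.
rng = np.random.default_rng(6)
rates_2005 = rng.normal(11.0, 4.0, size=50)
rates_2014 = rates_2005 + rng.normal(0.6, 1.0, size=50)

both = np.vstack([rates_2005, rates_2014])  # shape (2, 50)
n = both.shape[1]
states = np.arange(n)
num_sims = 10_000

# One row of random year choices per simulated resample, shape (num_sims, n).
pick_a = rng.integers(0, 2, size=(num_sims, n))
pick_b = rng.integers(0, 2, size=(num_sims, n))
diffs = both[pick_b, states].mean(axis=1) - both[pick_a, states].mean(axis=1)

# The mean of a boolean array is the fraction of True entries.
below = 100 * np.mean(diffs < -0.5)
above = 100 * np.mean(diffs > 0.5)
inside = 100 - below - above
print(f"{below:.2f}% below, {above:.2f}% above, {inside:.2f}% inside")
```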

Another way to express this is that 99.80% of the data is between $[-0.5, 0.5]$.

This is an example of a **confidence interval**. 

* Confidence intervals offer an alternative to $p$-values that provide more information. 

* When we say an $x$% confidence interval, we usually mean the region such that $(100-x)/2$% of samples will fall below the confidence interval, and $(100-x)/2$% of samples will fall above the confidence interval. 

The confidence interval for a bootstrap statistic cannot be known exactly, but it can be estimated accurately given enough samples of the bootstrap statistic.

___

# Confidence Intervals

**Procedure for Estimating Confidence Interval for a Bootstrap Statistic**

1. Draw $N$ samples from the pooled data with replacement
2. For each sample, compute the desired statistic and store it
3. Sort all of the stored statistics
4. For confidence interval $x$%:
    * the lower bound of the confidence interval is the element in position $N(100-x)/200$
    * the upper bound of the confidence interval is the element in position $N-N(100-x)/200 = N(100+x)/200$
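
The procedure can be sketched as a small helper. The function name `bootstrap_ci` is hypothetical, and the stored statistics used in the example are synthetic stand-ins for real bootstrap mean-differences:

```python
import numpy as np

def bootstrap_ci(stats, x=95):
    """Return (lower, upper) bounds holding the middle x% of the statistics."""
    ordered = np.sort(stats)
    N = len(ordered)
    # Number of values to leave out in each tail.
    lower_pos = int(N * (100 - x) / 200)
    # Python indexes from 0, so subtract 1 to leave lower_pos values above.
    upper_pos = N - lower_pos - 1
    return ordered[lower_pos], ordered[upper_pos]

# Synthetic stand-in for 10,000 stored bootstrap mean-differences.
rng = np.random.default_rng(7)
stats = rng.normal(0.0, 0.16, size=10_000)

low, high = bootstrap_ci(stats, x=95)
print(f"95% confidence interval: [{low:.2f}, {high:.2f}]")
```

With $N = 10000$ and $x = 95$, the lower position is 250, so 250 stored values lie below the interval and 250 lie above it.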

**<font color=blue>Example 1:</font> Compute the 95% confidence interval for the example above.**

Find the **position** in the sorted sequence of the lower bound of the confidence interval:

Now find the **value** of the sorted data at that position. That is the lower end of our confidence region:

Finding the position of the upper bound of the confidence interval is most easily done using the position of the lower bound:

Thus, the 95% confidence interval is $[-0.31, 0.31]$.

**How can confidence intervals be used in place of $p$-values?** 
* Instead of conducting a binary hypothesis test with $\alpha=0.05$, we can compute the 95% confidence interval for the mean difference. Then we observe if the result lies within the 95% confidence interval.

The observed mean-difference value was 0.63. This falls outside the 95% confidence interval $[-0.31,0.31]$. The fact that the observed value is far outside the 95% confidence interval makes it likely that we could have used a stronger criterion (like 99% confidence intervals).