# Sampling Distributions Dance Party!
Get your dancing shoes ready! You are a DJ trying to make sure you are ready for a big party. You don’t have time to go through all the songs you can work with. Instead, you want to make sure that any sample of 30 songs from your playlist will get the party started. To do this, you will use the power of sampling distributions!

The dataset we are using for this project can be found here. For simplicity, we have removed some unnecessary columns.

Note that a **helper_functions.py** file is loaded for you in the workspace. This file contains functions that you will use throughout this project. A **solution.py** file is also loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers when you’re done!

In [3]:
from helper_functions import choose_statistic, population_distribution, sampling_distribution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

##### Tasks

### Loading in the Data
1. You will be working with a dataset called **spotify_data.csv**. In **script.py**, use the pandas `read_csv()` function to load **spotify_data.csv** into a variable called `spotify_data`.

<details>
<summary><b>Hint</b></summary>

To load in the data, write the following line of code:

```py
spotify_data = pd.read_csv('spotify_data.csv')
```
</details>

In [None]:
# task 1: load in the spotify dataset


2. Use the pandas `.head()` function to preview `spotify_data`. If you need a reminder of how to use this function, click the hint below.

<details>
<summary><b>Hint</b></summary>

To preview a dataframe using the `.head()` function, do the following:

```py
print(dataframe_variable.head())
```
In **script.py**, our `dataframe_variable` is `spotify_data`.
</details>

In [None]:
# task 2: preview the dataset


3. For this project, we are going to focus on the `tempo` variable. This column gives the beats per minute (bpm) of each song in **spotify_data.csv**. The other columns in our dataset are:

    - `danceability`
    - `energy`
    - `instrumentalness`
    - `liveness`
    - `valences`
    
    For now, we are going to ignore these other columns.

    Create a variable called `song_tempos` that contains the `tempo` column data.

<details>
<summary><b>Hint</b></summary>

To save the `tempo` column data in a variable called `song_tempos`, do the following:

```py
song_tempos = spotify_data['tempo']
```
</details>

In [None]:
# task 3: select the relevant column


### Helper Functions
4. Let’s investigate the helper functions we will use in the following sections. A file called **helper_functions.py** should be opened in the workspace for you. It contains three functions: `choose_statistic()`, `population_distribution()`, and `sampling_distribution()`. The code in these functions is similar to what we saw in the previous lesson, but let’s explore these together.

    `choose_statistic()` allows us to choose a statistic we want to calculate for our sampling and population distributions. It takes two parameters:

    - `x`: An array of numbers
    - `sample_stat_text`: A string that tells the function which statistic to calculate on `x`. It accepts one of three values: “Mean”, “Minimum”, or “Variance”.
    
    `population_distribution()` allows us to plot the population distribution of a dataframe with one function call. It takes the following parameter:

    - `population_data`: the dataframe being passed into the function
    
    `sampling_distribution()` allows us to plot a simulated sampling distribution of a statistic. The simulated sampling distribution is created by repeatedly taking random samples of a given size, calculating a particular statistic for each sample, and plotting a histogram of those sample statistics. It takes three parameters:

    - `population_data`: the dataframe being sampled from
    - `samp_size`: the size of each sample
    - `stat`: the specific statistic being measured for each sample — either “Mean”, “Minimum”, or “Variance”
    
    Read through these functions in **helper_functions.py** to familiarize yourself with them. Click the hint to see examples of `population_distribution()` and `sampling_distribution()` being used.
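
To make the idea of a simulated sampling distribution concrete, here is a minimal sketch of the sampling loop on its own, without any plotting. It is not the code in **helper_functions.py**: the function name `simulate_sample_stats`, the `n_samples` parameter, and the synthetic `toy_population` are all assumptions made for illustration.

```py
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for a tempo column (made-up values, not the Spotify data)
toy_population = rng.normal(loc=120, scale=25, size=10_000)

def simulate_sample_stats(population_data, samp_size, stat_func, n_samples=1000):
    """Hypothetical helper: draw n_samples random samples and
    return the chosen statistic calculated on each one."""
    sample_stats = []
    for _ in range(n_samples):
        sample = rng.choice(population_data, size=samp_size, replace=False)
        sample_stats.append(stat_func(sample))
    return np.array(sample_stats)

# Sampling distribution of the mean for samples of size 30
sample_means = simulate_sample_stats(toy_population, 30, np.mean)
print("Mean of the sample means:", round(sample_means.mean(), 2))
print("Population mean:", round(toy_population.mean(), 2))
```

A histogram of `sample_means` would be the simulated sampling distribution of the mean. Passing `np.min` or `np.var` instead of `np.mean` gives the other two statistics.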

<details>
<summary><b>Hint</b></summary>

Here is an example of how to use `population_distribution()`:

```py
# example function use case
population_distribution(population)
```
Here is an example of how to use `sampling_distribution()`:

```py
# example function use case for sampling distribution of the mean with samples of size 30
sampling_distribution(population, 30, "Mean")
```
</details>

In [None]:

# task 4:

### Sampling Distribution Exploration
5. Now that our data is loaded into **script.py** and we have gone over the functions in **helper_functions.py**, let’s start our sampling distributions exploration. Make sure to write your code in **script.py**.

    To start off, let’s use the `population_distribution()` function to graph the distribution of `song_tempos`.

    When you click run, you should see a graph with the following title:
    ```py
    Population Distribution
    ```
    How would you describe this distribution?


<details>
<summary><b>Hint</b></summary>

To use the `population_distribution()` function, do the following:

```py
population_distribution(_____)
```
In the blank, you should put `song_tempos` because it is our population data.

The population distribution is approximately normal with a little bit of right-skewness.
</details>

In [None]:
# task 5: plot the population distribution with the mean labeled


6. Now let’s plot the sampling distribution of the sample mean with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

    Once you hit run, you should see a graph with the following title:
    ```py
    Sampling Distribution of the Mean
    Mean of the Sample Means: {Mean of the Sample Means} 
    Population Mean: {Population Mean}
    ```

<details>
<summary><b>Hint</b></summary>

To use the `sampling_distribution()` function, do the following:

```py
sampling_distribution(_____, _____, _____)
```
In the first blank, you should put `song_tempos` because it is our population data. In the second blank, you should put `30` since we want each sample to be of size 30. In the last blank, you should put `"Mean"` since we want to analyze the sampling distribution of the sample mean.
</details>

In [None]:
# task 6: sampling distribution of the sample mean


7. Compare the mean of your sampling distribution of the sample mean to the population mean. Is the sample mean an unbiased or biased estimator of the population mean?

<details>
<summary><b>Hint</b></summary>

The sample mean is an unbiased estimator of the population mean: the mean of the sampling distribution of the mean is approximately the same as the population mean.
</details>

In [None]:
# task 7:


8. Now let’s plot the sampling distribution of the sample minimum with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

    Once you hit run, you should see a graph with the following title:
    ```py
    Sampling Distribution of the Minimum
    Mean of the Sample Minimums: {Mean of the Sample Minimums}
    Population Minimum: {Population Minimum}
    ```

<details>
<summary><b>Hint</b></summary>

To use the `sampling_distribution()` function, do the following:

```py
sampling_distribution(_____, _____, _____)
```
In the first blank, you should put `song_tempos` because it is our population data. In the second blank, you should put `30` since we want each sample to be of size 30. In the last blank, you should put `"Minimum"` since we want to analyze the sampling distribution of the sample minimum.
</details>

In [None]:
# task 8: sampling distribution of the sample minimum


9. Compare the mean of your sampling distribution of the sample minimum to the population minimum. Is the sample minimum an unbiased or biased estimator of the population minimum?

<details>
<summary><b>Hint</b></summary>

Notice that the mean of the sample minimums is consistently much higher than the population minimum, so the sample minimum is a biased estimator of the population minimum. Since you are looking for high-tempo songs for the party, this is actually a good thing! You will want to avoid having a lot of low-tempo songs.
</details>
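
The same effect shows up on made-up data. The sketch below assumes a synthetic `toy_population` (not the Spotify data) and compares the mean of many sample minimums to the population minimum. A sample minimum can never be lower than the population minimum, so the average of the sample minimums ends up above it.

```py
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic stand-in for a tempo column (made-up values, not the Spotify data)
toy_population = rng.normal(loc=120, scale=25, size=10_000)

# Minimum of each of 1,000 random samples of 30 values
sample_mins = np.array([
    rng.choice(toy_population, size=30, replace=False).min()
    for _ in range(1000)
])

print("Mean of the sample minimums:", round(sample_mins.mean(), 2))
print("Population minimum:", round(toy_population.min(), 2))
```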

In [None]:
# task 9:


10. Now let’s plot the sampling distribution of the sample variance with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

    Once you hit run, you should see a graph with the following title:
    ```py
    Sampling Distribution of the Variance
    Mean of the Sample Variances: {Mean of the Sample Variances}
    Population Variance: {Population Variance}
    ```

<details>
<summary><b>Hint</b></summary>

To use the `sampling_distribution()` function, do the following:

```py
sampling_distribution(_____, _____, _____)
```
In the first blank, you should put `song_tempos` because it is our population data. In the second blank, you should put `30` since we want each sample to be of size 30. In the last blank, you should put `"Variance"` since we want to analyze the sampling distribution of the sample variance.
</details>

In [None]:

# task 10: sampling distribution of the sample variance


11. Compare the mean of your sampling distribution of the sample variance to the population variance. Does the sample variance appear to be an unbiased or biased estimator of the population variance?

    Click the hint for more information about sample variance.

<details>
<summary><b>Hint</b></summary>

The mean of the sample variances is consistently slightly less than the population variance, meaning it is a biased estimator. However, it is super close. Let’s dig into this.

We calculated the sample variance the same way we calculate population variance.

However, the formulas for sample variance and population variance are actually distinct. As we have seen, population variance is calculated as:

$$ population\,variance = \frac{\sum(observation - \mu)^2}{n} $$

When we measure the sample variance using the same formula, it turns out that we tend to underestimate the population variance. Because of this, we divide by *n-1* instead of *n*:

$$ sample\,variance = \frac{\sum(observation - sample\,mean)^2}{n - 1} $$

Using this formula, sample variance becomes an unbiased estimator of the population variance. Let’s apply this in the next task!
</details>

In [None]:
# task 11:

12. Go to line 17 in **helper_functions.py**. You should see the following line of code:
    ```py
    np.var(x)
    ```
    Change this to:
    ```py
    np.var(x, ddof=1)
    ```
    Adding the `ddof=1` argument makes `np.var()` divide the sum of squared deviations by *n-1* instead of *n*, thereby applying the sample variance formula.

    After changing this line of code, run **script.py**. Does the sample variance appear to be an unbiased or biased estimator of the population variance?
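
To see the effect of `ddof` in isolation, here is a small simulation on made-up data. It is a sketch under stated assumptions: `toy_population` is synthetic (not the Spotify data), and the loop stands in for whatever the helper function does internally.

```py
import numpy as np

rng = np.random.default_rng(seed=2)

# Synthetic stand-in for a tempo column (made-up values, not the Spotify data)
toy_population = rng.normal(loc=120, scale=25, size=10_000)
population_variance = np.var(toy_population)  # divides by n

biased, corrected = [], []
for _ in range(5000):
    sample = rng.choice(toy_population, size=30, replace=False)
    biased.append(np.var(sample))             # divides by n
    corrected.append(np.var(sample, ddof=1))  # divides by n - 1

print("Population variance:", round(population_variance, 2))
print("Mean of sample variances (ddof=0):", round(np.mean(biased), 2))
print("Mean of sample variances (ddof=1):", round(np.mean(corrected), 2))
```

With samples of size 30, each `ddof=1` variance is exactly 30/29 times the `ddof=0` variance, which is why the default version sits slightly below the population variance on average.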

<details>
<summary><b>Hint</b></summary>

By changing the way we calculate sample variance, we have made it an unbiased estimator of the population variance.
</details>

In [None]:
# task 12:

### Calculating Probabilities
13. We have a good sense of some sample statistics now that we’ve investigated sampling distributions. Let’s take our analysis further by calculating probabilities.

    First, calculate the population mean and population standard deviation of `song_tempos`. Save these values in two separate variables called `population_mean` and `population_std`.

<details>
<summary><b>Hint</b></summary>

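Use the NumPy functions `np.mean()` and `np.std()`. Note that `np.std()` divides by *n* by default, which is what we want for a population standard deviation.
</details>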

In [None]:
# task 13: calculate the population mean and standard deviation


14. Use `population_mean` and `population_std` to calculate the standard error of the sampling distribution of the sample mean with a sample size of 30.

    Save this value in a variable called `standard_error`.

<details>
<summary><b>Hint</b></summary>

The formula for the standard error of the sample mean is:

$$ standard~error = \frac{standard~deviation}{\sqrt{sample~size}} $$
</details>
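
As a quick check of the arithmetic, here is the formula applied to made-up numbers. The names `toy_std` and `toy_samp_size` are hypothetical, and the values are not taken from the Spotify data.

```py
import numpy as np

# Made-up numbers for illustration; not taken from the Spotify data
toy_std = 30.0       # hypothetical population standard deviation
toy_samp_size = 25   # hypothetical sample size

# Standard error = standard deviation / square root of the sample size
toy_standard_error = toy_std / np.sqrt(toy_samp_size)
print(toy_standard_error)  # 6.0
```

Because the sample size sits under a square root, quadrupling the sample size only halves the standard error.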



In [None]:
# task 14: calculate the standard error


15. You are afraid that your party will not be enjoyable if the average tempo of the songs you randomly select is less than 140bpm.

    Using `population_mean` and `standard_error` in a CDF, calculate the probability that the sample mean of 30 selected songs is less than 140bpm.

    Remember to print your result into the output terminal.

<details>
<summary><b>Hint</b></summary>

Use the `.cdf()` method from the SciPy library.

As a reminder to use the `.cdf()` method, do the following:

```py
print(stats.norm.cdf(value_of_interest, mean, standard_error))
```
In this case, our `value_of_interest` is `140`, our `mean` is `population_mean`, and our `standard_error` is the variable of the same name.
</details>
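
Here is a sketch of how `stats.norm.cdf()` behaves on made-up numbers, so you can sanity-check your own result. All of the `toy_` names and values are assumptions for illustration and are not taken from the Spotify data.

```py
from scipy import stats

# Made-up values for illustration; not taken from the Spotify data
toy_mean = 100           # hypothetical mean of the sampling distribution
toy_standard_error = 5   # hypothetical standard error

# A value equal to the mean splits the distribution in half
prob_below_mean = stats.norm.cdf(100, toy_mean, toy_standard_error)
print(prob_below_mean)  # 0.5

# A value two standard errors below the mean leaves only a small lower tail
prob_below_90 = stats.norm.cdf(90, toy_mean, toy_standard_error)
print(prob_below_90)  # about 0.0228
```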

In [None]:
# task 15: calculate the probability of observing an average tempo of 140bpm or lower from a sample of 30 songs


16. You know the party will be truly epic if the randomly sampled songs have an average tempo of greater than 150bpm.

    Using `population_mean` and `standard_error` in a CDF, calculate the probability that the sample mean of 30 selected songs is GREATER than 150bpm.

    Remember to print your result into the output terminal.

    Does this probability make you feel confident about the party?

<details>
<summary><b>Hint</b></summary>

Use the `.cdf()` method from the SciPy library.

As a reminder, to use the `.cdf()` method to calculate the probability of observing some value of interest or greater, do the following:

```py
print(1 - stats.norm.cdf(value_of_interest, mean, standard_error))
```
In this case, our `value_of_interest` is `150`, our `mean` is `population_mean`, and our `standard_error` is the variable of the same name.
</details>
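
For upper-tail probabilities, SciPy also provides the survival function `stats.norm.sf()`, which returns the same quantity as `1 - cdf`. The sketch below compares the two on made-up numbers; the `toy_` names and values are assumptions for illustration only.

```py
from scipy import stats

# Made-up values for illustration; not taken from the Spotify data
toy_mean = 100           # hypothetical mean of the sampling distribution
toy_standard_error = 5   # hypothetical standard error
toy_value = 110          # two standard errors above the mean

# Two equivalent ways to get the probability of exceeding toy_value
upper_tail = 1 - stats.norm.cdf(toy_value, toy_mean, toy_standard_error)
upper_tail_sf = stats.norm.sf(toy_value, toy_mean, toy_standard_error)
print(upper_tail, upper_tail_sf)  # both about 0.0228
```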

In [None]:
# task 16: calculate the probability of observing an average tempo of 150bpm or higher from a sample of 30 songs


### Extras
17. Awesome job! You are ready to throw an awesome party! If you want to do some more exploration of sampling distributions, here are some more opportunities:

    - Add another sample statistic to the `choose_statistic()` function in **helper_functions.py** — such as median, mode, or maximum.
    - Explore a different column of data from the **spotify_data.csv** dataset.
    - Use the sampling distribution of the sample minimum to estimate the probability of observing a specific sample minimum. For example, from the plot, what is the chance of getting a sample minimum that is less than 130bpm?
    Happy coding!
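
For the last idea in the list, one possible approach is to simulate many sample minimums and count how often they fall below a cutoff. The sketch below assumes a synthetic `toy_population` and a hypothetical `toy_threshold`; with the real data you would sample from `song_tempos` and use your own cutoff instead.

```py
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic stand-in for a tempo column (made-up values, not the Spotify data)
toy_population = rng.normal(loc=120, scale=25, size=10_000)

# Minimum of each of 2,000 random samples of 30 values
sample_mins = np.array([
    rng.choice(toy_population, size=30, replace=False).min()
    for _ in range(2000)
])

toy_threshold = 70  # hypothetical cutoff, chosen only for this toy data

# The proportion of simulated minimums below the cutoff estimates the probability
prob_below = np.mean(sample_mins < toy_threshold)
print("Estimated P(sample minimum < threshold):", prob_below)
```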

In [None]:
# EXTRA