# Principles of bootstrapping

Bootstrapping is, in some sense, the opposite of sampling from a population. Sampling treats your dataset as the population and draws a random subset from it. Bootstrapping treats your dataset as a sample and resamples it to build up a theoretical population.

<center><img src="images/04.02.jpg" style="width: 400px; height: 300px;"/></center>


# With or without replacement?

So far in the course, you've seen sampling with and without replacement. It's important to know when to use each method.

<center><img src="images/04.03.jpg" style="width: 400px; height: 300px;"/></center>
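
As a small illustration of the difference, the sketch below samples a toy DataFrame both ways. The `toy` data and its values are made up for this example and are not part of the course datasets. Without replacement, every row appears at most once; with replacement, rows can repeat and others can be left out.

```python
import pandas as pd

# Hypothetical toy dataset: five rows with made-up values
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d", "e"],
    "danceability": [0.2, 0.4, 0.6, 0.8, 1.0],
})

# Without replacement: each row can be drawn at most once
without_repl = toy.sample(n=5, replace=False, random_state=42)

# With replacement: the same row can be drawn more than once
with_repl = toy.sample(n=5, replace=True, random_state=42)

# Compare which row labels ended up in each draw
print(sorted(without_repl.index.tolist()))
print(sorted(with_repl.index.tolist()))
```

Sampling all five rows without replacement just shuffles the data, which is why bootstrapping needs `replace=True` to produce resamples that differ from the original.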


# Generating a bootstrap distribution

The process for generating a bootstrap distribution is similar to the process for generating a sampling distribution; only the first step is different.

To make a sampling distribution, you start with the population and sample without replacement. To make a bootstrap distribution, you start with a sample and sample that with replacement. After that, the steps are the same: calculate the summary statistic that you are interested in on that sample/resample, then replicate the process many times. In each case, you can visualize the distribution with a histogram.

Here, `spotify_sample` is a subset of the `spotify_population` dataset. To make it easier to see how resampling works, a row index column called `'index'` has been added, and only the artist name, song name, and `danceability` columns have been included.

`spotify_sample` is available; `pandas`, `numpy`, and `matplotlib.pyplot` are loaded with their usual aliases.

In [1]:
# # Generate 1 bootstrap resample
# spotify_1_resample = spotify_sample.sample(frac=1, replace=True)

# # Calculate the mean of the danceability column of spotify_1_resample
# mean_danceability_1 = np.mean(spotify_1_resample['danceability'])

# # Print the result
# print(mean_danceability_1)

In [1]:
# # Replicate this 1000 times
# mean_danceability_1000 = []
# for i in range(1000):
#     mean_danceability_1000.append(
#         np.mean(spotify_sample.sample(frac=1, replace=True)['danceability'])
#     )

# # Draw a histogram of the resample means
# plt.hist(mean_danceability_1000)
# plt.show()
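
Because `spotify_sample` is not bundled with these notes, here is a self-contained sketch of the same two steps (resample, then replicate) on synthetic data. The `fake_sample` DataFrame and the `bootstrap_means` helper are hypothetical names introduced for illustration, and the danceability values are random numbers rather than real Spotify data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for spotify_sample (values are made up)
fake_sample = pd.DataFrame({"danceability": rng.uniform(0, 1, size=200)})

def bootstrap_means(df, column, n_replicates, seed):
    """Return the mean of `column` for n_replicates bootstrap resamples of df."""
    return [
        # frac=1 with replace=True gives a resample the same size as df
        df.sample(frac=1, replace=True, random_state=seed + i)[column].mean()
        for i in range(n_replicates)
    ]

boot_means = bootstrap_means(fake_sample, "danceability", 1000, seed=1)

# The center of the bootstrap distribution sits near the sample mean
print(np.mean(boot_means), fake_sample["danceability"].mean())
```

Passing `boot_means` to `plt.hist()` would give a histogram like the one drawn in the cell above.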

# Bootstrap statistics and population statistics

Bootstrap distribution statistics can be used to estimate population parameters. But can you always rely on them to give an accurate estimate of an unknown population parameter?

Should the mean and the standard deviation of the bootstrap distribution both be used to estimate the corresponding values of the population?

- No, the mean of the bootstrap distribution will always be near the sample mean, which may not necessarily be very close to the population mean.
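
One way to see this is to simulate it. The sketch below uses a synthetic normal population (made-up numbers, not the Spotify data) and a deliberately small sample of 30. The bootstrap distribution is built from the sample alone, so its mean tracks the sample mean, whatever distance that happens to be from the population mean.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic population with a known mean (made-up numbers)
population = rng.normal(loc=50, scale=10, size=100_000)

# One small sample: its mean may land some distance from the population mean
sample = rng.choice(population, size=30, replace=False)

# Bootstrap distribution of the mean, built from that sample only
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# How far is the bootstrap mean from the sample mean,
# and how far is the sample mean from the population mean?
gap_to_sample = abs(boot_means.mean() - sample.mean())
gap_to_population = abs(sample.mean() - population.mean())
print(gap_to_sample, gap_to_population)
```

The first gap is small by construction; the second depends on how lucky the original sample was, and resampling cannot shrink it.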

# Sampling distribution vs. bootstrap distribution

The sampling distribution and bootstrap distribution are closely linked. In the rare situations where you can repeatedly sample from a population, it's helpful to generate both the sampling distribution and the bootstrap distribution, one after the other, to see how they are related.

Here, the statistic you are interested in is the mean `popularity` score of the songs.

`spotify_population` (the whole dataset) and `spotify_sample` (500 randomly sampled rows from `spotify_population`) are available; `pandas` and `numpy` are loaded with their usual aliases.

In [2]:
# mean_popularity_2000_samp = []

# # Generate a sampling distribution of 2000 replicates
# for i in range(2000):
#     mean_popularity_2000_samp.append(
#         # Sample 500 rows and calculate the mean popularity
#         np.mean(spotify_population["popularity"].sample(500, replace=False))
#     )

# # Print the sampling distribution results
# print(mean_popularity_2000_samp)

In [3]:
# mean_popularity_2000_boot = []

# # Generate a bootstrap distribution of 2000 replicates
# for i in range(2000):
#     mean_popularity_2000_boot.append(
#         # Resample 500 rows and calculate the mean popularity
#         np.mean(spotify_sample.sample(n=500, replace=True)["popularity"])
#     )

# # Print the bootstrap distribution results
# print(mean_popularity_2000_boot)
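
The two cells above can be mirrored on synthetic data to compare the distributions side by side. This is a sketch under stated assumptions: `population` is a made-up array of integer scores from 0 to 100 standing in for the `popularity` column, and NumPy's `Generator.choice` is used in place of `DataFrame.sample`.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Synthetic stand-in for a popularity column (values are made up)
population = rng.integers(0, 101, size=20_000).astype(float)
n = 500
sample = rng.choice(population, size=n, replace=False)

# Sampling distribution: draw fresh samples from the population each time
sampling_dist = np.array([
    rng.choice(population, size=n, replace=False).mean() for _ in range(500)
])

# Bootstrap distribution: resample the one sample, with replacement
bootstrap_dist = np.array([
    rng.choice(sample, size=n, replace=True).mean() for _ in range(500)
])

# Centers: population mean vs. sample mean; spreads: both estimate the standard error
print(population.mean(), sampling_dist.mean(), sampling_dist.std(ddof=1))
print(sample.mean(), bootstrap_dist.mean(), bootstrap_dist.std(ddof=1))
```

The sampling distribution centers on the population mean and the bootstrap distribution centers on the sample mean, while the two spreads should come out similar.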

# Compare sampling and bootstrap means

To make the calculations easier, distributions similar to those from the previous exercise have been precomputed, this time using a sample size of 5000.

`spotify_population`, `spotify_sample`, `sampling_distribution`, and `bootstrap_distribution` are available; `pandas` and `numpy` are loaded with their usual aliases.

In [4]:
# # Calculate the population mean popularity
# pop_mean = spotify_population["popularity"].mean()

# # Calculate the original sample mean popularity
# samp_mean = spotify_sample["popularity"].mean()

# # Calculate the sampling dist'n estimate of mean popularity
# samp_distn_mean = np.mean(sampling_distribution)

# # Calculate the bootstrap dist'n estimate of mean popularity
# boot_distn_mean = np.mean(bootstrap_distribution)

# # Print the means
# print([pop_mean, samp_mean, samp_distn_mean, boot_distn_mean])

Based on the four means you just calculated (`pop_mean`, `samp_mean`, `samp_distn_mean`, and `boot_distn_mean`), which statement is true?
- The sampling distribution mean is the best estimate of the true population mean; the bootstrap distribution mean is closest to the original sample mean.

# Compare sampling and bootstrap standard deviations

In the same way that you looked at how the sampling distribution and bootstrap distribution could be used to estimate the population mean, you'll now take a look at how they can be used to estimate variation, or more specifically, the standard deviation, in the population.

Recall that the sample size is 5000.
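
The sample size matters because the standard deviation of a distribution of sample means is a standard error, not the population standard deviation itself. Using notation introduced here for illustration ($\sigma$ for the population standard deviation, $n$ for the sample size, and $\mathrm{SE}(\bar{x})$ for the standard error of the mean), the usual approximation is:

$$
\mathrm{SE}(\bar{x}) \approx \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sigma \approx \mathrm{SE}(\bar{x}) \times \sqrt{n}
$$

With $n = 5000$, this is why the standard deviations of the sampling and bootstrap distributions are multiplied by `np.sqrt(5000)` in the code below.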

`spotify_population`, `spotify_sample`, `sampling_distribution`, and `bootstrap_distribution` are available; `pandas` and `numpy` are loaded with their usual aliases.

In [5]:
# # Calculate the population std dev popularity
# pop_sd = np.std(spotify_population["popularity"], ddof = 0)

# # Calculate the original sample std dev popularity
# samp_sd = np.std(spotify_sample["popularity"], ddof = 1)

# # Calculate the sampling dist'n estimate of std dev popularity
# samp_distn_sd = np.std(sampling_distribution, ddof = 1) * np.sqrt(5000) 

# # Calculate the bootstrap dist'n estimate of std dev popularity
# boot_distn_sd = np.std(bootstrap_distribution, ddof = 1) * np.sqrt(5000) 

# # Print the standard deviations
# print([pop_sd, samp_sd, samp_distn_sd, boot_distn_sd])

Based on the four results you just calculated (`pop_sd`, `samp_sd`, `samp_distn_sd`, and `boot_distn_sd`), which statement is true?
- The calculation from the bootstrap distribution is the best estimate of the population standard deviation.
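
To check how this behaves when the true answer is known, the sketch below repeats the calculation on a synthetic normal population. The numbers are made up, and a sample size of 1000 is assumed here instead of 5000 to keep the run short.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic population with a known standard deviation (made-up numbers)
population = rng.normal(loc=55, scale=11, size=50_000)
n = 1000
sample = rng.choice(population, size=n, replace=False)

# Bootstrap distribution of the sample mean
boot_means = np.array([
    rng.choice(sample, size=n, replace=True).mean() for _ in range(1000)
])

pop_sd = np.std(population, ddof=0)
samp_sd = np.std(sample, ddof=1)

# Standard error of the mean, scaled back up by sqrt(n)
boot_sd_estimate = np.std(boot_means, ddof=1) * np.sqrt(n)

print(pop_sd, samp_sd, boot_sd_estimate)
```

All three values should land close together; how close depends on the random seed, the sample size, and the number of replicates.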

# Confidence interval interpretation

When reporting results, it is common to provide a confidence interval alongside an estimate.

What information does that confidence interval provide?

- A range of plausible values for an unknown quantity.

# Calculating confidence intervals

You have learned about two methods for calculating confidence intervals: the quantile method and the standard error method. The standard error method involves using the inverse cumulative distribution function (inverse CDF) of the normal distribution to calculate confidence intervals. In this exercise, you'll perform these two methods on the Spotify data.

`spotify_population`, `spotify_sample`, and `bootstrap_distribution` are available; `pandas` and `numpy` are loaded with their usual aliases, and `norm` has been loaded from `scipy.stats`.

In [6]:
# # Generate a 95% confidence interval using the quantile method
# lower_quant = np.quantile(bootstrap_distribution, 0.025)
# upper_quant = np.quantile(bootstrap_distribution, 0.975)

# # Print quantile method confidence interval
# print((lower_quant, upper_quant))

In [7]:
# # Find the mean and std dev of the bootstrap distribution
# point_estimate = np.mean(bootstrap_distribution)
# standard_error = np.std(bootstrap_distribution, ddof=1)

# # Find the lower limit of the confidence interval
# lower_se = norm.ppf(0.025, loc=point_estimate, scale=standard_error)

# # Find the upper limit of the confidence interval
# upper_se = norm.ppf(0.975, loc=point_estimate, scale=standard_error)

# # Print standard error method confidence interval
# print((lower_se, upper_se))
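
For a version that runs without the course data, the sketch below applies both methods to a bootstrap distribution built from a synthetic sample (made-up numbers). As a swapped-in technique, it uses the standard library's `statistics.NormalDist.inv_cdf` where the cell above uses `scipy.stats.norm.ppf`; both evaluate the inverse CDF of a normal distribution.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(99)

# Synthetic sample standing in for spotify_sample["popularity"]
sample = rng.normal(loc=55, scale=11, size=400)

# Bootstrap distribution of the sample mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# Quantile method: take the middle 95% of the bootstrap distribution
lower_quant, upper_quant = np.quantile(boot_means, [0.025, 0.975])

# Standard error method: normal inverse CDF centered on the point estimate
point_estimate = boot_means.mean()
standard_error = boot_means.std(ddof=1)
normal = NormalDist(mu=point_estimate, sigma=standard_error)
lower_se, upper_se = normal.inv_cdf(0.025), normal.inv_cdf(0.975)

print((lower_quant, upper_quant))
print((lower_se, upper_se))
```

When the bootstrap distribution is roughly bell-shaped, as it is here, the two intervals should nearly coincide. The standard error interval is symmetric around the point estimate by construction, while the quantile interval follows whatever shape the bootstrap distribution has.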