# Introduction to Sampling

Learn what sampling is and why it is so powerful. You’ll also learn about the problems caused by convenience sampling and the differences between true randomness and pseudo-randomness.

# 1. Sampling and point estimates

## 1.1 Reasons for sampling

Sampling is an important technique in your statistical arsenal. It isn't always appropriate though—you need to know when to use it and when to work with the whole dataset.

Which of the following is not a good scenario to use sampling?

Possible Answers:
- You've been handed one terabyte of data about error logs for your company's device.

- You wish to learn about the travel habits of all Pakistani adult citizens.

- You've finished collecting data on a small study of the wing measurements for 10 butterflies. (True)

- You are working to predict customer turnover on a big data project for your marketing firm.

Ten butterflies is already a small dataset that can be analyzed in full, so sampling isn't useful here.

## 1.2 Simple sampling with pandas

Throughout this chapter, you'll be exploring song data from Spotify. Each row of this population dataset represents a song, and there are over 40,000 rows. Columns include the song name, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. You'll start by looking at the durations.

Your first task is to sample the Spotify dataset and compare the mean duration of the population with the sample.
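
The Spotify file is read from a local path, so here is a minimal, self-contained sketch of the same idea on synthetic data. The `population` DataFrame, the gamma-distributed durations, and both seed values are assumptions made for illustration only; they are not part of the course dataset. The sketch draws 1000 rows with `DataFrame.sample()` and compares the sample mean to the population mean.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Spotify population: 40,000 synthetic
# song durations in minutes (gamma-distributed, an assumption for illustration)
rng = np.random.default_rng(seed=42)
population = pd.DataFrame({
    "duration_minutes": rng.gamma(shape=9.0, scale=0.4, size=40_000)
})

# Draw 1000 rows without replacement; random_state fixes the
# pseudo-random draw so the same rows come back on every run
sample = population.sample(n=1000, random_state=2024)

# Point estimates: the mean of the population and of the sample
pop_mean = population["duration_minutes"].mean()
samp_mean = sample["duration_minutes"].mean()

print(f"population mean: {pop_mean:.3f}")
print(f"sample mean:     {samp_mean:.3f}")
print(f"absolute gap:    {abs(pop_mean - samp_mean):.3f}")
```

Passing `random_state` is optional. Without it, each run draws a different sample, which is why the sample mean printed further down will not match exactly if you rerun the cell.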

In [3]:
import pandas as pd

# Load the population dataset (local Windows path; every backslash is escaped)
spotify_population = pd.read_feather("C:\\Users\\yazan\\Desktop\\Data_Analytics\\8-Sampling in Python\\Datasets\\spotify_2000_2020.feather")

In [4]:
# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population['duration_minutes'].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample['duration_minutes'].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)

       acousticness                  artists  danceability  duration_ms  \
31992        0.4260       ['Patrick Sweany']        0.4030     348707.0   
9563         0.8700          ['Alex Turner']        0.4360     247013.0   
3307         0.6580        ['Nature Sounds']        0.0884     570336.0   
13868        0.0391        ['Pretty Lights']        0.6400     344000.0   
22602        0.0151            ['21 Savage']        0.8840     220307.0   
...             ...                      ...           ...          ...   
781          0.0643                 ['Guts']        0.8740     313168.0   
31026        0.5680  ['Maroon 5', 'Cardi B']        0.8510     235545.0   
25378        0.8420          ['Johnny Cash']        0.4210     196733.0   
40942        0.8460        ['Ciaran Lavery']        0.6190     251960.0   
38678        0.2030        ['Brooks & Dunn']        0.6790     201427.0   

       duration_minutes  energy  explicit                      id  \
31992          5.811783   0.39