# Introduction to Sampling

Learn what sampling is and why it is so powerful. You’ll also learn about the problems caused by convenience sampling and the differences between true randomness and pseudo-randomness.

# 1. Sampling and point estimates

<b>1.1 Reasons for sampling</b>

Sampling is an important technique in your statistical arsenal. It isn't always appropriate though—you need to know when to use it and when to work with the whole dataset.

Which of the following is not a good scenario for sampling?

Possible Answers:
- You've been handed one terabyte of data about error logs for your company's device.

- You wish to learn about the travel habits of all Pakistani adult citizens.

- You've finished collecting data on a small study of the wing measurements for 10 butterflies. (True)

- You are working to predict customer turnover on a big data project for your marketing firm.

Ten butterflies is a small dataset that you can analyze in full, so sampling isn't useful here.

<b>1.2 Simple sampling with pandas</b>

Throughout this chapter, you'll be exploring song data from Spotify. Each row of this population dataset represents a song, and there are over 40,000 rows. Columns include the song name, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. You'll start by looking at the durations.

Your first task is to sample the Spotify dataset and compare the mean duration of the population with the sample.

In [10]:
import pandas as pd
import numpy as np
spotify_population = pd.read_feather("C:\\Users\\yazan\\Desktop\\Data_Analytics\\8-Sampling in Python\\Datasets\\spotify_2000_2020.feather")

In [11]:
# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population['duration_minutes'].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample['duration_minutes'].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)

       acousticness                                            artists  \
31320       0.80200                                     ['The Whites']   
26473       0.35100                                    ['Trevor Hall']   
26239       0.00481                                     ['Jawga Boyz']   
10228       0.10400  ['Rae Sremmurd', 'Swae Lee', 'Slim Jxmmi', 'Tr...   
11071       0.75800                                    ['Johnny Gill']   
...             ...                                                ...   
35377       0.06200                        ['Diplo', 'Jonas Brothers']   
29331       0.00702                                    ['Project Pat']   
22256       0.30200                                        ['J. Cole']   
25205       0.43700                                      ['Nate Dogg']   
988         0.95200                            ['Scripture Lullabies']   

       danceability  duration_ms  duration_minutes  energy  explicit  \
31320         0.659     214307.0
[output truncated]
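
The comparison above can also be tried without the Spotify file. The following is a minimal sketch on synthetic data: the `duration_minutes` values are randomly generated stand-ins (not real song durations), the names `population` and `sample` are illustrative, and the seeds `0` and `42` are arbitrary choices that make the run reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the population: 40,000 synthetic durations in minutes.
# These numbers are invented for illustration and are not the Spotify data.
rng = np.random.default_rng(seed=0)
population = pd.DataFrame(
    {"duration_minutes": rng.normal(loc=3.8, scale=1.0, size=40_000)}
)

# Simple random sample of 1,000 rows, drawn without replacement.
# random_state fixes the pseudo-random draw so the same rows come back each run.
sample = population.sample(n=1000, random_state=42)

# Population parameter and its point estimate from the sample
mean_dur_pop = population["duration_minutes"].mean()
mean_dur_samp = sample["duration_minutes"].mean()

print(mean_dur_pop)
print(mean_dur_samp)
print(abs(mean_dur_pop - mean_dur_samp))  # gap between estimate and parameter
```

With a sample of this size, the sample mean should land close to the population mean, which is what makes it a useful point estimate. Dropping `random_state` gives a different sample, and a slightly different estimate, on every run.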

<b>1.3 Simple sampling and calculating with NumPy</b>

You can also use NumPy to calculate parameters or statistics from a list or a pandas Series.

You'll be turning it up to eleven and looking at the loudness property of each song.

In [16]:
# Create a pandas Series from the loudness column of spotify_population
loudness_pop = spotify_population['loudness']

# Sample 100 values of loudness_pop
loudness_samp = loudness_pop.sample(n=100)

# Calculate the mean of loudness_pop
mean_loudness_pop = np.mean(loudness_pop)

# Calculate the mean of loudness_samp
mean_loudness_samp = np.mean(loudness_samp)

# Print the means
print(mean_loudness_pop)
print(mean_loudness_samp)

-7.366856851353947
-7.470740000000002
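
As a small illustration of the point that NumPy works on both lists and pandas Series, here is a hedged sketch. The loudness values are made up for the example, and drawing the sample with `rng.choice` is an alternative to the pandas `.sample()` call used above, not the method from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical loudness values in dB, invented for illustration
loudness_list = [-7.2, -5.1, -9.8, -6.4, -8.0, -4.3, -10.5, -7.7]
loudness_series = pd.Series(loudness_list)

# np.mean accepts a plain Python list as well as a pandas Series
mean_from_list = np.mean(loudness_list)
mean_from_series = np.mean(loudness_series)

# Sampling directly with NumPy: draw 4 values without replacement.
# The seed is an arbitrary choice that makes the draw reproducible.
rng = np.random.default_rng(seed=123)
loudness_samp = rng.choice(loudness_list, size=4, replace=False)
mean_loudness_samp = np.mean(loudness_samp)

print(mean_from_list)
print(mean_from_series)
print(mean_loudness_samp)
```

The first two printed values agree because the list and the Series hold the same numbers. The third is a point estimate from only four values, so it can differ noticeably from the full mean.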
