# Setup

In [1]:
from google.colab import files
from IPython.display import clear_output

In [2]:
!pip install -U -q kaggle

In [3]:
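# Upload kaggle.json (the Kaggle API token file moved into ~/.kaggle/ in the next cell)
files.upload()
# Hide the upload widget output
clear_output()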

In [4]:
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

In [5]:
import numpy as np
import pandas as pd

# Sampling with replacement in NumPy

In [6]:
np.random.choice(a=12, size=12, replace=True)

array([ 8,  9,  9,  0,  5,  7, 10,  3,  7,  4,  3,  8])
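The draw above changes on every run. As a minimal sketch, assuming you want a repeatable result, the same sampling can be done with NumPy's `Generator` API. The seed `0` and the names `rng`, `draws` and `missing` are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np

# Seeded generator so the draw is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# Draw 12 values from range(12) with replacement, mirroring the cell above.
draws = rng.choice(12, size=12, replace=True)

# Values in range(12) that were never drawn.
missing = np.setdiff1d(np.arange(12), draws)

print(draws)
print("distinct values drawn:", np.unique(draws).size)
print("values never drawn:", missing)
```

Because sampling is done with replacement, some values usually repeat and others are left out entirely, which is the effect quantified later in this notebook.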


# Sampling with replacement in Pandas
Let's load the [House Sales in King County, USA](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) dataset. For now we will use only a small subset of it: six columns and the first 15 rows.

In [7]:
!kaggle datasets download harlfoxem/housesalesprediction

Downloading housesalesprediction.zip to /content
  0% 0.00/780k [00:00<?, ?B/s]
100% 780k/780k [00:00<00:00, 107MB/s]


In [9]:
!unzip housesalesprediction.zip && rm -rf housesalesprediction.zip

Archive:  housesalesprediction.zip
  inflating: kc_house_data.csv       


In [10]:
df = pd.read_csv(
    'kc_house_data.csv',
    usecols=['bedrooms', 'bathrooms', 'sqft_living', 
             'sqft_lot', 'floors', 'price'],
    nrows=15
    )

In [11]:
df

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors
0,221900.0,3,1.0,1180,5650,1.0
1,538000.0,3,2.25,2570,7242,2.0
2,180000.0,2,1.0,770,10000,1.0
3,604000.0,4,3.0,1960,5000,1.0
4,510000.0,3,2.0,1680,8080,1.0
5,1225000.0,4,4.5,5420,101930,1.0
6,257500.0,3,2.25,1715,6819,2.0
7,291850.0,3,1.5,1060,9711,1.0
8,229500.0,3,1.0,1780,7470,1.0
9,323000.0,3,2.5,1890,6560,2.0


In [12]:
df.sample(n=15, replace=True)

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors
0,221900.0,3,1.0,1180,5650,1.0
13,400000.0,3,1.75,1370,9680,1.0
1,538000.0,3,2.25,2570,7242,2.0
7,291850.0,3,1.5,1060,9711,1.0
0,221900.0,3,1.0,1180,5650,1.0
4,510000.0,3,2.0,1680,8080,1.0
7,291850.0,3,1.5,1060,9711,1.0
1,538000.0,3,2.25,2570,7242,2.0
12,310000.0,3,1.0,1430,19901,1.5
13,400000.0,3,1.75,1370,9680,1.0
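Since `sample` keeps the original index labels, repeated and missing rows can be read off the index. The sketch below is illustrative only: it rebuilds a five-row toy frame from two columns of the table shown earlier so that it runs without the CSV, and the names `toy`, `boot`, `counts` and `left_out`, as well as `random_state=42`, are assumptions made for this example.

```python
import pandas as pd

# First five rows of the table shown earlier (two columns only).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "bedrooms": [3, 3, 2, 4, 3],
})

# random_state makes the draw repeatable; the value 42 is arbitrary.
boot = toy.sample(n=len(toy), replace=True, random_state=42)

# Index labels are preserved, so repeated rows appear as duplicate labels.
counts = boot.index.value_counts()

# Labels from the original frame that never made it into the sample.
left_out = toy.index[~toy.index.isin(boot.index)]

print(boot)
print("times each row was drawn:")
print(counts)
print("rows never drawn:", list(left_out))
```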


# How many lost samples/rows should you expect when sampling with replacement to create a bootstrapped dataset?
Let $N$ be the size of your original dataset. To create a bootstrapped dataset, you draw $N$ rows from it with replacement.<br>
For a single draw, the probability that a particular row of the original dataset is not picked is:
$$1 - \frac{1}{N}$$
The $N$ draws are independent, so the probability that a particular row is not picked in any of them, and is therefore missing from the bootstrapped dataset, is:
$$\Big(1 - \frac{1}{N}\Big)^N$$
<br>
$$\lim_{N\to\infty}\Big(1 - \frac{1}{N}\Big)^N = e^{-1} \approx 0.3679$$
That means the larger your dataset, the closer the expected percentage of rows missing from the bootstrapped dataset gets to about 36.79%.<br>
To illustrate this, let's load the full King County dataset.

In [13]:
df = pd.read_csv(
    'kc_house_data.csv',
    usecols=['bedrooms', 'bathrooms', 'sqft_living', 
             'sqft_lot', 'floors', 'price']
    )

In [14]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors
0,221900.0,3,1.0,1180,5650,1.0
1,538000.0,3,2.25,2570,7242,2.0
2,180000.0,2,1.0,770,10000,1.0
3,604000.0,4,3.0,1960,5000,1.0
4,510000.0,3,2.0,1680,8080,1.0


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        21613 non-null  float64
 1   bedrooms     21613 non-null  int64  
 2   bathrooms    21613 non-null  float64
 3   sqft_living  21613 non-null  int64  
 4   sqft_lot     21613 non-null  int64  
 5   floors       21613 non-null  float64
dtypes: float64(3), int64(3)
memory usage: 1013.2 KB


In [16]:
df_bootstrapped = df.sample(frac=1, replace=True)

In [18]:
len(df_bootstrapped.index.unique()) / len(df)

0.635312080692176

The bootstrapped dataset contains about 63.5% of the original rows. That means about 36.5% of the rows in the original dataset were left out, which is close to the $e^{-1}$ limit derived above.
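
A single bootstrap draw is only one noisy observation. As a rough check, under the assumption that row indices can stand in for the rows themselves, the sketch below compares the formula $(1 - 1/N)^N$ with the average missing fraction over several simulated bootstrap samples. The helper names `prob_row_missing` and `simulated_missing_fraction`, the number of repeats and the seed are hypothetical choices for illustration.

```python
import math
import numpy as np

def prob_row_missing(n):
    """Probability (1 - 1/n)**n that a given row is never drawn in n draws."""
    return (1 - 1 / n) ** n

def simulated_missing_fraction(n, repeats=20, seed=0):
    """Average fraction of rows absent from a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(repeats):
        idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
        fractions.append(1 - np.unique(idx).size / n)
    return float(np.mean(fractions))

# 21_613 is the number of rows reported by df.info() above.
for n in (10, 100, 1_000, 21_613):
    print(n, round(prob_row_missing(n), 5), round(simulated_missing_fraction(n), 5))

print("1/e =", round(math.exp(-1), 5))
```

The formula column should move toward $e^{-1}$ as $N$ grows, and the simulated column should stay close to it.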