# Pig rations via bootstrap

Use the bootstrap (resampling with replacement) procedure to test
whether the observed difference between the two ration groups would be
surprising in the distribution of differences between new simulated
samples.

First we need to get the measured data from the data file using the
Pandas library:

In [None]:
# Load the Numpy library for arrays.
import numpy as np
# Load the Pandas library for loading and selecting data.
import pandas as pd

We load the file containing the data:

In [None]:
# Read the data file containing pig ration data.
rations_df = pd.read_csv('data/pig_rations.csv')
# Show the first 5 rows.
rations_df.head()

Let us first select the rows containing data for ration B (we will get
the rows for ration A afterwards):

In [None]:
# Select ration B rows.
ration_b_df = rations_df[rations_df['ration'] == 'B']
# Show the first five rows.
ration_b_df.head()

Finally for ration B, convert the weights to an array for use in the
simulation.

In [None]:
b_weights = np.array(ration_b_df['weight_gain'])
# Show the result.
b_weights

Select ration A rows, and get the weights as an array:

In [None]:
ration_a_df = rations_df[rations_df['ration'] == 'A']
a_weights = np.array(ration_a_df['weight_gain'])
# Show the result.
a_weights
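
The simulation further on compares its results against an observed difference of 38 between the groups. As a minimal sketch of where such a number comes from, the same select-then-convert steps can be followed by a difference in sums. The data frame here is a small made-up stand-in (the `toy_` names and values are assumptions for illustration, not the pig ration measurements):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, NOT the real pig ration measurements.
toy_df = pd.DataFrame({
    'ration': ['A', 'A', 'A', 'B', 'B', 'B'],
    'weight_gain': [31, 34, 29, 26, 24, 28],
})

# Boolean-mask selection, then conversion to arrays, as in the cells above.
toy_a = np.array(toy_df[toy_df['ration'] == 'A']['weight_gain'])
toy_b = np.array(toy_df[toy_df['ration'] == 'B']['weight_gain'])

# Observed difference in summed weight gain (A minus B).
toy_observed_diff = np.sum(toy_a) - np.sum(toy_b)
print(toy_observed_diff)
```

Applying the same last line to `a_weights` and `b_weights` would give the observed difference for the real data.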

We will use the `a_weights` and `b_weights` arrays for our simulation.
We are going to resample from all the weights taken together, so we
first *concatenate* the two arrays (see <a href="#sec-concatenate" class="quarto-xref"><span
class="quarto-unresolved-ref">sec-concatenate</span></a>) into a single
array that we can draw from:

In [None]:
both = np.concatenate([a_weights, b_weights])
both
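
Because the bootstrap draws *with replacement*, a resample can contain the same value more than once and can miss other values entirely. The sketch below contrasts this with a plain shuffle, which keeps every value exactly once. The `pool` array and the seed are assumptions chosen only for illustration:

```python
import numpy as np

# A seeded generator so the illustration is repeatable (the seed is arbitrary).
rnd = np.random.default_rng(42)

# Hypothetical small pool of values standing in for the combined weights.
pool = np.array([10, 20, 30, 40])

# Bootstrap draw: choice samples with replacement by default (replace=True).
resample = rnd.choice(pool, size=len(pool))

# Shuffle: permutation returns every value exactly once, in random order.
shuffled = rnd.permutation(pool)

print('With replacement:', resample)
print('Shuffled:', shuffled)
```

Every value in `resample` comes from `pool`, but repeats are allowed; `shuffled` always holds the same four values as `pool`.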

Now do the simulation:

In [None]:
import matplotlib.pyplot as plt

# Set up the random number generator.
rnd = np.random.default_rng()

# Set the number of trials
n_trials = 10_000

# An empty array to store the trial results.
results = np.zeros(n_trials)

# Do 10,000 experiments.
for i in range(n_trials):
    # Take a "resample" of 12 with replacement from both and put it in fake_a
    fake_a = rnd.choice(both, size=12)
    # Likewise to make fake_b
    fake_b = rnd.choice(both, size=12)
    # Sum the first "resample."
    fake_a_sum = np.sum(fake_a)
    # Sum the second "resample."
    fake_b_sum = np.sum(fake_b)
    # Calculate the difference between the two resamples.
    fake_diff = fake_a_sum - fake_b_sum
    # Keep track of each trial result.
    results[i] = fake_diff
    # End one experiment, go back and repeat until all trials are complete,
    # then proceed.
# Produce a histogram of trial results.
plt.hist(results, bins=25)
plt.xlabel('First resample minus second')
plt.title('Distribution of difference in sums of resamples')

From this histogram we see that a very small proportion of the trials
produced a difference between groups as large as the observed difference
of 38 (or larger), in either direction. Python will calculate this
proportion for us with the following code:

In [None]:
# Count the trials that produced a difference of 38 or more.
count_more = np.sum(results >= 38)
# Likewise for a difference of -38 or less.
count_less = np.sum(results <= -38)
# Add the two together.
k = count_more + count_less
# Divide by number of trials to convert to proportion.
kk = k / n_trials
# Print the result.
print('Proportion of trials with either >=38 or <=-38:', kk)
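
The two counts above can also be taken in one step, because `results >= 38` or `results <= -38` is the same condition as `np.abs(results) >= 38`. As a hedged sketch, the whole procedure could be wrapped in a function that draws all resamples at once. The function name, its parameters, and the toy arrays are assumptions for illustration, not part of the original analysis:

```python
import numpy as np


def bootstrap_sum_diff_proportion(group_a, group_b, n_trials=10_000, seed=None):
    """Two-sided bootstrap proportion for the difference in group sums.

    Hypothetical helper: resamples both groups from the combined values
    and reports how often the resampled difference is at least as far
    from zero as the observed difference.
    """
    rnd = np.random.default_rng(seed)
    both = np.concatenate([group_a, group_b])
    observed = np.sum(group_a) - np.sum(group_b)
    # Draw all resamples at once: one row per trial, with replacement.
    fake_a = rnd.choice(both, size=(n_trials, len(group_a)))
    fake_b = rnd.choice(both, size=(n_trials, len(group_b)))
    fake_diffs = fake_a.sum(axis=1) - fake_b.sum(axis=1)
    # Proportion of trials at least as extreme as observed, either direction.
    return np.mean(np.abs(fake_diffs) >= np.abs(observed))


# Made-up stand-in values, NOT the real pig ration data.
toy_a = np.array([31, 34, 29, 26])
toy_b = np.array([26, 24, 28, 29])
toy_prop = bootstrap_sum_diff_proportion(toy_a, toy_b, seed=0)
print('Proportion at least as extreme as observed:', toy_prop)
```

Calling the function with `a_weights` and `b_weights` should give a proportion close to the loop above, up to random variation between runs.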