# Chapter 1

- population = the whole dataset
- sample = a subset of the population
    - sampling with replacement (draws are independent events): `df["col"].sample(5, replace = True)`
    - sampling without replacement (draws are dependent events): `df["col"].sample(5, replace = False)`
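
A minimal sketch of the difference, using a made-up five-value Series (the data, the seed and the variable names are assumptions for illustration). Without replacement a value can be drawn at most once, so each draw changes what is left; with replacement every draw sees the full data again, so repeats are possible and the sample may even be larger than the data.

```python
import pandas as pd

# Hypothetical data, for illustration only
values = pd.Series([10, 20, 30, 40, 50], name="col")

# Without replacement: each row can be drawn at most once,
# so a sample of 5 from 5 rows is just a reshuffle of the data
no_repl = values.sample(5, replace=False, random_state=0)

# With replacement: every draw is made from the full data,
# so values can repeat and the sample can exceed the data size
with_repl = values.sample(8, replace=True, random_state=0)

print(sorted(no_repl.tolist()))
print(with_repl.tolist())
```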

### Random numbers

```
# Generate 10 uniform random values between 0 and 5 (loc=0, scale=5)
from scipy.stats import uniform
uniform.rvs(0, 5, size=10)
# Generate binomial random values
from scipy.stats import binom
binom.rvs(1, 0.5, size=8) # 1 coin, flip 8 times, probability of success 50%
binom.rvs(8, 0.5, size=1) # 8 coins, flip 1 time, probability of success 50%
binom.rvs(3, 0.5, size=10) # 3 coins, flip 10 times, probability of success 50%

# Generate 10 random normal values with mean 161 and std of 7
from scipy.stats import norm
norm.rvs(161, 7, size=10)

# Generate 10 random poisson values with lambda of 8
from scipy.stats import poisson
poisson.rvs(8, size=10)

# Generate 10 random values from a t-distribution with 5 degrees of freedom
from scipy.stats import t
t.rvs(df=5, size=10)

# Generate 10 random values from a log-normal distribution
# s=0.8 is the std of the underlying normal; scale=1.5 is exp(mean of the underlying normal), i.e. the median
from scipy.stats import lognorm
lognorm.rvs(s=0.8, scale=1.5, size=10)

# Alternatives using NumPy
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
random_beta = np.random.beta(a=2, b=2, size=5000)
random_normal = np.random.normal(loc=2, scale=1.5, size=2)
random_uniform = np.random.uniform(low=-3, high=3, size=5000)

# Visualize the uniform values
plt.hist(random_uniform, bins=np.arange(-3, 3.1, 0.25))
plt.show()
```

# Chapter 2

- Systematic sampling : sampling by taking every n-th row of a shuffled dataset
- Simple random sampling : sampling by taking purely random rows from a dataset
- Stratified sampling : sampling within each subgroup so that subgroup proportions are taken into account
- Weighted sampling : sampling with weights on subgroups to adjust the relative probability of a row being sampled
- Cluster sampling : first randomly pick subgroups of the dataset, then randomly sample rows from only those subgroups
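
To see what stratification buys over a simple random sample, here is a small sketch on a made-up population (the 700/200/100 split, the column values and the seed are assumptions for illustration). The stratified sample reproduces the subgroup shares exactly, while the simple random sample only does so approximately.

```python
import pandas as pd

# Hypothetical population: 700 "a" rows, 200 "b" rows, 100 "c" rows
df = pd.DataFrame({"cat_col": ["a"] * 700 + ["b"] * 200 + ["c"] * 100})

# Simple random sample: subgroup shares vary from draw to draw
simple = df.sample(frac=0.1, random_state=1)

# Proportional stratified sample: 10% of every subgroup
stratified = df.groupby("cat_col").sample(frac=0.1, random_state=1)

print(simple["cat_col"].value_counts(normalize=True))
print(stratified["cat_col"].value_counts(normalize=True))
```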

### Sampling

```
# Visualize the distribution of one sample
df_sample = df.sample(n=10)
df_sample["col"].hist(bins=np.arange(59, 93, 2))
plt.show()
# Sampling with replacement (draws are independent events)
df["col"].sample(5, replace = True)
# Sampling without replacement (draws are dependent events)
df["col"].sample(5, replace = False)

# Simple random sampling
simple_sample = df.sample(n=5, random_state=42)

# Systematic sampling
sample_size = 5
pop_size = len(df)
interval = pop_size // sample_size
shuffled_df = df.sample(frac=1)
shuffled_df = shuffled_df.reset_index(drop=True).reset_index()
systematic_sample = shuffled_df.iloc[::interval]

# Stratified sampling
prop_stratified_sample = df.groupby("cat_col").sample(frac=0.1, random_state=42)
equal_stratified_sample = df.groupby("cat_col").sample(n=15, random_state=42)

# Weighted sampling
condition = df['cat_col'] == "Val"
df['weight'] = np.where(condition, 2, 1)
weighted_sample = df.sample(frac=0.1, weights="weight")

# Cluster sampling
import random
category_list = list(df['cat_col'].unique())
random_categories = random.sample(category_list, k=3)
subset_rows = df['cat_col'].isin(random_categories)
subset_df = df[subset_rows].copy()  # copy so the assignment below does not write to a view of df
# .cat requires cat_col to have category dtype
subset_df['cat_col'] = subset_df['cat_col'].cat.remove_unused_categories()
sample_cluster = subset_df.groupby("cat_col").sample(n=5, random_state=42)

# Plot a column against the shuffled row index: it should look like noise with no visible pattern
shuffled_df.plot(x="index", y="col", kind="scatter")
plt.show()
```

# Chapter 3

- The larger the sample size, the more accurate the estimate tends to be
- Sampling distribution : the distribution of a statistic (here the mean) computed over many samples
- As the sample size increases, the range of calculated sample means tends to decrease.
- Central Limit Theorem:
	- The distribution of the sample means gets closer to a normal distribution as the sample size grows
	- The width of the distribution of means gets narrower with larger sample size
- Standard error
	- Standard deviation of the sampling distribution (here, of the sample means)
- The amount of variation in the sampling distribution is related to the amount of variation in the population and the sample size: the standard error of the mean is approximately the population standard deviation divided by the square root of the sample size. This is another consequence of the Central Limit Theorem.
- Bootstrapping : resample with replacement from a single sample, treating that sample as a stand-in for the population, to estimate population parameters and how much a statistic varies
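
The points above can be checked with a small simulation. This is only a sketch: the exponential population, the sample sizes and the seed are assumptions for illustration. For each sample size it compares the observed standard deviation of the sample means with the value predicted by population std / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population, for illustration only
population = rng.exponential(scale=2.0, size=100_000)
pop_std = population.std(ddof=0)

results = {}
for n in (5, 50, 500):
    # 2000 samples of size n, drawn with replacement from the population
    samples = rng.choice(population, size=(2000, n), replace=True)
    sample_means = samples.mean(axis=1)
    observed_se = sample_means.std(ddof=1)   # spread of the sampling distribution
    predicted_se = pop_std / np.sqrt(n)      # population std / sqrt(sample size)
    results[n] = (observed_se, predicted_se)
    print(n, round(observed_se, 3), round(predicted_se, 3))
```

The observed and predicted values should be close for each `n`, and both shrink as `n` grows, which is the narrowing described above.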

### All possible combinations

```
from itertools import product
import pandas as pd
# Five eight-sided dice: one row per possible outcome (8**5 rows)
die1 = die2 = die3 = die4 = die5 = [1,2,3,4,5,6,7,8]
dice = pd.DataFrame(list(product(die1, die2, die3, die4, die5)), columns=['die1', 'die2', 'die3', 'die4', 'die5'])
```
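
Once every outcome is enumerated, the exact sampling distribution of the mean follows by counting. The sketch below uses two six-sided dice to keep the table small; the dice, the column names and `exact_dist` are assumptions for illustration.

```python
from itertools import product
import pandas as pd

faces = [1, 2, 3, 4, 5, 6]

# One row per possible outcome of rolling two dice (6 * 6 = 36 rows)
dice = pd.DataFrame(list(product(faces, repeat=2)), columns=["die1", "die2"])

# Mean of each outcome
dice["mean_roll"] = dice[["die1", "die2"]].mean(axis=1)

# Exact sampling distribution: probability of each possible mean
exact_dist = dice["mean_roll"].value_counts(normalize=True).sort_index()
print(exact_dist)
```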

### Simulate Dice Roll

```
# high is the number of sides per die; no_of_dice is how many dice are rolled
# Example: 10 coins treated as dice -> high = 2 (a coin has 2 sides), no_of_dice = 10
# Example: 5 Ludo dice -> high = 6 (a die has 6 sides), no_of_dice = 5
# replace=True because a face that has already come up can come up again
import numpy as np
high, no_of_dice = 6, 5
simulation = np.random.choice(list(range(1,high+1)), size=no_of_dice, replace=True)
```

### Statistics

```
import statistics
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
# mode 
statistics.mode(df['col'])
# median 
statistics.median(df['col'])
# sample mean 
statistics.mean(df['col'])
# sample variance 
np.var(df['col'], ddof=1)
# sample standard deviation 
np.std(df['col'], ddof=1)
# population variance 
np.var(df['col'], ddof=0)
# population standard deviation 
np.std(df['col'], ddof=0)
# quantile 
np.quantile(df['col'], [0, 0.25, 0.5, 0.75, 1])
# iqr 
scipy.stats.iqr(df['col'])
from scipy.stats import norm
# Sort the values in the column for CDF and inverse CDF
sorted_col = np.sort(df['col'])
# Calculate PDF using a probability distribution (e.g., normal distribution)
pdf_values = norm.pdf(sorted_col, loc=df['col'].mean(), scale=df['col'].std())
# Calculate CDF
cdf_values = norm.cdf(sorted_col, loc=df['col'].mean(), scale=df['col'].std())
# Calculate Inverse CDF (Quantiles); probabilities stop short of 0 and 1, where ppf returns -inf and inf
quantiles_values = norm.ppf(np.linspace(0.01, 0.99, len(sorted_col)), loc=df['col'].mean(), scale=df['col'].std())

# Visualize PDF, CDF, Inverse CDF
result_df = pd.DataFrame({
    'col': sorted_col,
    'PDF': pdf_values,
    'CDF': cdf_values,
    'Quantiles': quantiles_values
})
sns.lineplot(x='col', y='PDF', data=result_df)
sns.lineplot(x='col', y='CDF', data=result_df)
sns.scatterplot(x='col', y='Quantiles', data=result_df)
plt.show()
```

# Chapter 4

- population standard deviation : estimated as the standard error of the bootstrap distribution * square root of the sample size
- If the original sample is biased, bootstrapping from that sample carries the same bias.
- confidence interval : 
    - a range of plausible values for an unknown quantity, centered on the mean, with a width set by an estimate of the standard deviation
    - `(mean - 1*std , mean + 1* std)` covers about 68% of a normal distribution
- Probability Density Function (PDF) = the bell curve, for the normal distribution
- Cumulative Distribution Function (CDF) = the S curve, the area under the PDF up to a given value
- Inverse CDF = the flipped S curve, gives the value at a given percentile, used for confidence interval bounds
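
How the CDF and inverse CDF relate to a confidence interval can be sketched with `scipy.stats.norm`. The mean of 161 and the standard error of 7 are made-up values for illustration, and the 95% level is an assumption too.

```python
from scipy.stats import norm

# Hypothetical point estimate and standard error, for illustration only
mean, std_error = 161.0, 7.0

# CDF: probability of being at or below a value.
# The difference of two CDF values is the area between them.
p_within_1 = (norm.cdf(mean + std_error, loc=mean, scale=std_error)
              - norm.cdf(mean - std_error, loc=mean, scale=std_error))

# Inverse CDF (ppf): the value at a given cumulative probability.
# The 2.5% and 97.5% points bound a 95% interval.
lower = norm.ppf(0.025, loc=mean, scale=std_error)
upper = norm.ppf(0.975, loc=mean, scale=std_error)

# ppf undoes cdf: feeding the bound back in returns the probability
roundtrip = norm.cdf(lower, loc=mean, scale=std_error)

print(round(p_within_1, 4), round(lower, 2), round(upper, 2))
```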

### Bootstrapping

```
# All possible combinations of dice rolls
from itertools import product
import pandas as pd
die1 = die2 = die3 = die4 = die5 = [1,2,3,4,5,6,7,8]
dice = pd.DataFrame(list(product(die1, die2, die3, die4, die5)), columns=['die1', 'die2', 'die3', 'die4', 'die5'])

# Example 1 : Simulating dice rolls (sampling with replacement from the faces of a die)
import numpy as np
import matplotlib.pyplot as plt
sample_means_1000 = []
# Simulate 1000 turns of 4 dice rolls and record the mean of each turn
for i in range(1000):
    sample_means_1000.append( np.random.choice(list(range(1,7)), size=4, replace=True).mean() )
# Visualize distribution
plt.hist(sample_means_1000, bins=20)
plt.show()

# Bootstrapping example 2 : Resampling from an existing sample
import numpy as np
sample_means = []
for i in range(1000):
    # Resample the whole sample with replacement (same size) and record its mean
    sample_means.append(np.mean(df.sample(frac=1, replace=True)['col']))
```
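
A bootstrap distribution like the one above can be turned into a standard error, a population standard deviation estimate and a confidence interval. The following is a sketch under stated assumptions: the 200-value normal sample, the 2000 resamples, the seed and the 95% quantile interval are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 200 values, for illustration only
sample = rng.normal(loc=161, scale=7, size=200)

# Bootstrap distribution of the mean: resample with replacement, same size as the sample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# Standard error: standard deviation of the bootstrap distribution
std_error = boot_means.std(ddof=1)

# Population std estimate: standard error * square root of sample size
pop_std_estimate = std_error * np.sqrt(sample.size)

# 95% confidence interval for the mean, by the quantile method
ci_lower, ci_upper = np.quantile(boot_means, [0.025, 0.975])

print(round(std_error, 3), round(pop_std_estimate, 3))
print(round(ci_lower, 2), round(ci_upper, 2))
```

The population std estimate should land close to the standard deviation of the sample itself, which is a quick sanity check on the resampling loop.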