In [None]:
set.seed(1234) #reproducibility
##(1)##
standardized_data <- ... |>
    select(...) |>
    mutate(across(everything(), scale))

##(2)##
elbow_stats <- tibble(k = 1:10) |>
    rowwise() |>
    mutate(... = list(kmeans(..., centers = k, nstart = ...))) |>
    mutate(glanced = list(glance(...))) |>
    select(-...) |>
    unnest(glanced)
##(4)##
elbow_plot <- ... |>
    ggplot(aes(x = k, y = tot.withinss)) +
    geom_point(size = 2) +
    geom_line() +
    labs(x = "K",
         y = "Total within-cluster sum of squares",
         title = "Elbow Plot") +
    scale_x_continuous(breaks = 1:10) +
    theme(text = element_text(size = 20))

elbow_plot

##(5)##
optimal_kmeans_object <- kmeans(..., nstart = ..., centers = ...)
optimal_kmeans_object

##(6)##
cluster_assignments <- augment(optimal_kmeans_object, standardized_data)
cluster_assignments

##(7)##
... <- ... |>
 ggplot(aes(x=...,y=..., color = .cluster)) +
    geom_point() +
    labs(x = "...(standardized)", y = "...(standardized)", color = "Clusters", 
    title = "K-means clustering with ... clusters")



# BOOK SOLUTION #

In [None]:
#1#
penguin_clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(penguin_clusts = list(kmeans(standardized_data, nstart = 10, k)),
         glanced = list(glance(penguin_clusts)))
#2#
clustering_statistics <- penguin_clust_ks |>
  unnest(glanced)
#3#
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  xlab("K") +
  ylab("Total within-cluster sum of squares") +
  scale_x_continuous(breaks = 1:9) + 
  theme(text = element_text(size = 12))

elbow_plot
#4#
... <- kmeans(..., nstart = ... , centers = ...)

#5#
... <- augment(..., standardized_data) |>  # first argument: the kmeans object from #4#
    ggplot(aes(x=..., y=..., color = .cluster)) + 
    geom_point(alpha = 0.5, size = 2) +
    labs(x="... Standardized", y= "... Standardized",
    color = "Clusters", title = "... vs. ...")
               
               
#########
... <- ... |>
    pivot_longer(cols = -.cluster, names_to = 'category', values_to = 'value')  |> 
    ggplot(aes(value, fill = .cluster)) +
        geom_density(alpha = 0.4, colour = 'white') +
        facet_wrap(~ category, scales = 'free') +
        theme_minimal() +
        theme(text = element_text(size = 20))

##(8)##
glance(optimal_kmeans_object)

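**nstart:** To counter a “bad” random initialization, we pass the `nstart` argument to the `kmeans` function. It tells R how many times to run the algorithm from different random starting points; R then returns the best run (the one with the lowest total within-cluster sum of squares).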
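As a rough illustration of what `nstart` does, the sketch below runs K-means ten times by hand with one random start each, keeps the run with the lowest total within-cluster sum of squares, and then lets `kmeans` do the same bookkeeping through `nstart = 10`. The toy data and the names `toy_data`, `single_runs`, `single_wss`, `best_manual`, and `best_nstart` are made up for this example; it uses only base R, so the `glance`/`augment` helpers from the cells above are not needed.

```r
# Toy data (an assumption for illustration): two well-separated groups in 2D
set.seed(1234)
toy_data <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Run K-means from 10 separate random starts, one start per call
single_runs <- lapply(1:10, function(i) kmeans(toy_data, centers = 3, nstart = 1))
single_wss <- sapply(single_runs, function(fit) fit$tot.withinss)

# Keep the run with the lowest total within-cluster sum of squares
best_manual <- single_runs[[which.min(single_wss)]]

# Let kmeans try 10 random starts internally and return the best one
best_nstart <- kmeans(toy_data, centers = 3, nstart = 10)

single_wss                 # one value per random start; they may differ
best_manual$tot.withinss   # the smallest of the values above
best_nstart$tot.withinss   # best of kmeans' own 10 starts
```

If the values in `single_wss` differ from one another, some starts ended in a worse local optimum, which is the situation `nstart` is meant to guard against.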

# What is inference?

- Inference questions use the available data (a sample) to draw conclusions about a wider population
 
**point estimate:** a single statistic (e.g., a mean or a proportion) calculated from a sample drawn from the population and used to estimate the corresponding population parameter

**sampling distribution:** all possible samples of a given size are drawn from the population and a point estimate is calculated for each one. The sampling distribution displays these point estimates, usually as a histogram.

- **Quantitative (mean):** If the samples are representative of the population, we can expect the sample means to center around the population mean
- **By increasing the sample size**, we decrease the sampling variability: the sample means cluster more tightly around the population mean, which gives us more reliable point estimates.
- The sampling distribution for both means and proportions only becomes bell-shaped once the sample is large enough (around 20 observations)

**representative sampling:** the sample must reflect the characteristics of the true population; otherwise point estimates computed from it will not describe that population well
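The points above can be checked with a small simulation. This is only a sketch: the population is invented (`rexp` values standing in for something like nightly prices), the sampling distribution is approximated with 1,000 samples rather than all possible samples, and the helper `sample_means` is a hypothetical name. It uses base R's `sample` and `replicate` instead of `rep_sample_n` so that it runs without extra packages.

```r
set.seed(123)

# Hypothetical population: 10,000 right-skewed values
population <- rexp(10000, rate = 1 / 50)

# Draw `reps` samples of size `n` (without replacement) and
# return the mean of each sample
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_n10  <- sample_means(10)
means_n100 <- sample_means(100)

# Both sets of sample means center near the population mean...
c(population = mean(population),
  n10 = mean(means_n10),
  n100 = mean(means_n100))

# ...but the larger sample size gives a narrower sampling distribution
c(n10 = sd(means_n10), n100 = sd(means_n100))
```

Plotting `means_n10` and `means_n100` as histograms should also show the larger sample size producing a more symmetric, bell-like shape, even though the population itself is skewed.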

In [None]:
# SAMPLING DISTRIBUTION for SAMPLE PROPORTION
# ------------------------------------------------------------------------------------
set.seed(123)

samples <- rep_sample_n(..., size = ..., reps = ...)
samples

sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_proportion = sum(... == "...") / ...)
sample_estimates


sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
  xlab("Sample proportions")
sampling_distribution

sample_estimates |>
  summarize(mean = mean(sample_proportion))

In [None]:
# SAMPLING DISTRIBUTION for SAMPLE MEAN
# ------------------------------------------------------------------------------------
set.seed(123)
#1#
samples <- rep_sample_n(..., size = ..., reps = ...)
samples

sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_mean = mean(...))
sample_estimates


sampling_distribution <- ggplot(sample_estimates, aes(x = sample_mean)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  xlab("Sample mean price per night ($)")
sampling_distribution

# Bootstrapping

For a sample of size n, you:

1. Randomly select an observation from the original sample, which was drawn from the population
2. Record the observation’s value
3. Replace that observation
4. Repeat steps 1 - 3 (sampling with replacement) until you have n observations, which form a bootstrap sample
    

> Steps 1-4 give you one bootstrap sample. When repeated several times, you get several bootstrap samples. R creates several bootstrap samples using `rep_sample_n(size = n, replace = TRUE, reps = ...)`
> 
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the n observations in your bootstrap sample
6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution)
7. Calculate the plausible range of values around our observed point estimate.
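
The steps above can be sketched in base R as follows. The original sample here is simulated and purely hypothetical, the number of bootstrap replicates (5,000) is an arbitrary choice, and `sample(..., replace = TRUE)` stands in for `rep_sample_n` so that the example runs without extra packages.

```r
set.seed(123)

# Hypothetical original sample of n = 40 observations
original_sample <- rexp(40, rate = 1 / 50)
n <- length(original_sample)

# Steps 1-6: resample n observations with replacement, compute the mean
# of each bootstrap sample, and repeat many times
boot_means <- replicate(
  5000,
  mean(sample(original_sample, size = n, replace = TRUE))
)

# Step 7: the middle 95% of the bootstrap distribution gives a
# plausible range around the observed point estimate
bounds <- quantile(boot_means, c(0.025, 0.975))

mean(original_sample)  # observed point estimate
bounds                 # lower and upper ends of the plausible range
```

Note that the bootstrap distribution is centered on the sample mean, not on the unknown population mean; it is the spread of the distribution that carries over to the interval.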

In [None]:
#Making Bootstrap Samples
... <- ... |>
    rep_sample_n(size = ..., replace = TRUE, reps = ...) |>
    group_by(replicate) |>
    summarize(mean = mean(...))

#Making bootstrap distribution
boot_est_dist <- ggplot(..., aes(x = mean)) +
  geom_histogram(fill = "dodgerblue3") +
  xlab("Sample mean price per night ($)") +
  ggtitle("Bootstrap Distribution") +
  theme(text = element_text(size = 20))
boot_est_dist


#95% confidence interval: 2.5th and 97.5th percentiles of the bootstrap distribution
bounds <- ... |>
  select(mean) |>
  pull() |>
  quantile(c(0.025, 0.975))
bounds