In [None]:
%%capture
%%sh
chmod u+x ./helpful-script.sh
./helpful-script.sh setup

In [None]:
import otter
grader = otter.Notebook('GAI-E08.ipynb')

# Week 8 GenAI Learning Log

## Your Mission: Understand Populations, Samples, and Distributions

Welcome to your next GenAI learning module! 🎉 Last week, you focused on **probability**, the mathematics of chance. This week, we move to the foundations of statistical inference: **sampling and describing data**. This assignment guides you through gathering data correctly (**sampling**), visualizing it (**distributions**), and summarizing it numerically (**summary statistics**). As always, your AI assistant is here to help you explore these concepts and prepare for hands-on practice in class. Let's dive in!

### What You'll Need

* Access to TerrierGPT or ChatGPT (or your preferred AI assistant)
* This notebook for recording your responses
* About 2-3 hours of focused exploration time (but not necessarily all at once!)

**Important:** This is a GREEN ZONE assignment. AI collaboration is not just allowed but encouraged!

## Part 1: Sampling Concepts (75 min)

**Your Mission:** Understand the foundational concepts of gathering data from large groups.

---

### Question 1.1: Population vs. Sample

Ask your AI assistant:

"I need to understand how we get data from the real world: **What's the difference between a population and a sample?**"

In the variable `pop_vs_sample`, define a **population** and a **sample**, and give one simple example that illustrates the difference (e.g., all students in a university vs. 100 students surveyed).
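
If it helps to see the idea in code, here is a minimal sketch (not part of the graded answer). It assumes a made-up population of 10,000 students with invented study-hour values, and it uses only the Python standard library. The names `population` and `sample` are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: daily study hours for 10,000 students (made-up values)
population = [random.uniform(0, 8) for _ in range(10_000)]

# A sample: 100 students drawn at random, without replacement, from that population
sample = random.sample(population, k=100)

population_mean = statistics.mean(population)  # describes the whole group
sample_mean = statistics.mean(sample)          # estimate computed from the sample only

print(f"Population size: {len(population)}, mean: {population_mean:.2f}")
print(f"Sample size:     {len(sample)}, mean: {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is the gap that statistical inference tries to reason about.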

In [None]:
pop_vs_sample = ...

In [None]:
grader.check("q1.1")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 1.2: Why We Sample

Ask your AI assistant:

"I need to understand how we get data from the real world: **Why can't we always study the entire population?**"

In the variable `why_sample`, state two primary reasons (e.g., cost, feasibility, time) why studying the entire population is often impossible or impractical.

In [None]:
why_sample = ...

In [None]:
grader.check("q1.2")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 1.3: Good vs. Bad Samples

Ask your AI assistant:

"I need to understand how we get data from the real world: **What makes a good sample vs a bad sample?**"

In the variable `good_vs_bad_sample`, describe the key characteristic that determines a **good sample** (e.g., representativeness) and one example of a **bad sample** (e.g., convenience sample).

In [None]:
good_vs_bad_sample = ...

In [None]:
grader.check("q1.3")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 1.4: Sampling Methods

Ask your AI assistant:

"I need to understand how we get data from the real world: **What are different sampling methods and when do we use each?**"

In the variable `sampling_methods`, describe two different types of random sampling methods (e.g., simple random sampling, stratified sampling) and briefly explain when a data scientist would choose one over the other.
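
As a rough illustration of how two such methods differ in practice, the sketch below draws 50 students from a hypothetical roster, first by simple random sampling and then by proportional stratified sampling. The roster, the group sizes, and the 45/5 allocation are assumptions made up for this example.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical roster: 900 undergraduates and 100 graduate students
roster = [("undergrad", i) for i in range(900)] + [("grad", i) for i in range(100)]

# Simple random sampling: every student has the same chance of selection
srs = random.sample(roster, k=50)

# Stratified sampling: sample each group separately, in proportion to its size
allocation = {"undergrad": 45, "grad": 5}  # 50 * 0.9 and 50 * 0.1
stratified = []
for group, n_from_group in allocation.items():
    members = [student for student in roster if student[0] == group]
    stratified.extend(random.sample(members, k=n_from_group))

print("Simple random counts:", Counter(group for group, _ in srs))
print("Stratified counts:   ", Counter(group for group, _ in stratified))
```

The simple random sample's group counts vary from draw to draw, while the stratified sample fixes them by design, which matters when a small group must not be missed.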

In [None]:
sampling_methods = ...

In [None]:
grader.check("q1.4")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 1.5: Sampling Method Exercise

Imagine two researchers want to sample BU students for a survey:

1.  **Ben** took the full list of BU student email addresses and randomly chose 1,000 of them to receive his survey link.
2.  **Jerry** stood outside the CDS building from 10 am to 12 pm and asked students passing by to fill out his survey.

In the variable `sampling_comparison`, identify what Ben's and Jerry's sampling methods are called, and explain which method is better for achieving a **representative sample** and why.

In [None]:
sampling_comparison = ...

In [None]:
grader.check("q1.5")

In [None]:
!./helpful-script.sh save 1>/dev/null


--- 

## Part 2: Sampling Bias and Problems (60 min)

**Your Mission:** Identify potential pitfalls and errors that can undermine data quality.

---

### Question 2.1: What is Sampling Bias?

Ask your AI assistant:

"What can go wrong with sampling? **What is sampling bias and how does it happen?**"

In the variable `sampling_bias_def`, define **sampling bias** and explain one common way it can occur (e.g., self-selection bias, undercoverage).
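
To make undercoverage concrete, here is a small simulation under invented assumptions: a hypothetical population in which on-campus students visit the gym more often, and a survey that reaches only on-campus students. All numbers are made up for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5,000 students: (lives_on_campus, weekly_gym_visits).
# Assumption for illustration only: on-campus students visit the gym more often.
population = []
for _ in range(5_000):
    on_campus = random.random() < 0.5
    visits = random.gauss(4, 1) if on_campus else random.gauss(2, 1)
    population.append((on_campus, visits))

true_mean = statistics.mean(v for _, v in population)

# Biased sample: only on-campus students can be selected (off-campus is undercovered)
on_campus_only = [v for on_campus, v in population if on_campus]
biased_sample = random.sample(on_campus_only, k=200)

# Random sample from the full population, for comparison
random_sample = [v for _, v in random.sample(population, k=200)]

print(f"True population mean: {true_mean:.2f}")
print(f"Biased sample mean:   {statistics.mean(biased_sample):.2f}")
print(f"Random sample mean:   {statistics.mean(random_sample):.2f}")
```

Taking a larger biased sample would not fix the problem, because the error comes from who can be selected rather than from how many are selected.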

In [None]:
sampling_bias_def = ...

In [None]:
grader.check("q2.1")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 2.2: Examples of Biased Samples

Ask your AI assistant:

"What can go wrong with sampling? **Give me examples of biased samples from real surveys or studies**"

In the variable `biased_example`, describe one concrete historical or hypothetical example of a biased sample and explain why it failed to represent the population accurately.

In [None]:
biased_example = ...

In [None]:
grader.check("q2.2")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 2.3: The Effect of Response Rates

Ask your AI assistant:

"What can go wrong with sampling? **How do response rates affect sample quality?**"

In the variable `response_rate_effect`, explain the potential problem that arises when a survey has a low response rate (i.e., non-response bias).
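
The sketch below simulates one way a low response rate can distort results. It assumes a made-up response model in which dissatisfied customers reply far more often than satisfied ones; the function `responds` and its probabilities are hypothetical.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical satisfaction scores (1 to 10) for 2,000 invited customers
invited = [random.randint(1, 10) for _ in range(2_000)]

def responds(score):
    """Made-up response model: dissatisfied customers reply far more often."""
    return random.random() < (0.6 if score <= 3 else 0.1)

# Only the people who choose to answer end up in the data
respondents = [score for score in invited if responds(score)]

response_rate = len(respondents) / len(invited)
print(f"Response rate:            {response_rate:.0%}")
print(f"Mean score (all invited): {statistics.mean(invited):.2f}")
print(f"Mean score (respondents): {statistics.mean(respondents):.2f}")
```

Even though the invitations went to everyone, the respondents' average is pulled toward the group that was most motivated to answer.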

In [None]:
response_rate_effect = ...

In [None]:
grader.check("q2.3")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 2.4: Representative Samples

Ask your AI assistant:

"What can go wrong with sampling? **What is a representative sample and why is it important?**"

In the variable `representative_sample_importance`, define a **representative sample** and explain why achieving one is the main goal of statistical sampling.

In [None]:
representative_sample_importance = ...

In [None]:
grader.check("q2.4")

In [None]:
!./helpful-script.sh save 1>/dev/null


--- 

## Part 3: Data Distributions (90 min)

**Your Mission:** Understand how to visualize and interpret the shape of your data.

---

### Question 3.1: What is a Distribution?

Ask your AI assistant:

"Now I want to understand how data is distributed: **What is a distribution and why do we care about the shape of data?**"

In the variable `distribution_meaning`, define a **data distribution** and explain why its shape is a critical piece of information for data analysis.

In [None]:
distribution_meaning = ...

In [None]:
grader.check("q3.1")

### Question 3.2: Histograms

Ask your AI assistant:

"Now I want to understand how data is distributed: **What do histograms tell us about our data?**"

In the variable `histogram_info`, explain what a **histogram** visualizes and list two specific things a data scientist can learn by looking at one (e.g., skewness, modality).
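
A histogram is, underneath, just a count of how many values fall into each bin. The sketch below shows that counting step with a text-only histogram, using invented exam scores and a bin width of 10 (both are assumptions for illustration).

```python
import random
from collections import Counter

random.seed(4)  # fixed seed so the sketch is repeatable

# Hypothetical exam scores, roughly centered on 70 and clipped to the 0-100 range
scores = [min(100, max(0, random.gauss(70, 12))) for _ in range(200)]

# Assign each score to a bin of width 10, labeled by its lower edge (0, 10, ..., 90).
# A score of exactly 100 is placed in the top bin.
bin_width = 10
counts = Counter(min(int(s // bin_width), 9) * bin_width for s in scores)

# Print one row per bin; the bar length is the number of scores in that bin
for start in range(0, 100, bin_width):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * counts[start]}")
```

Changing `bin_width` changes how coarse or detailed the shape looks, which is the same trade-off as choosing `bins` in a plotting library.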

In [None]:
histogram_info = ...

In [None]:
grader.check("q3.2")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 3.3: Normal Distributions

Ask your AI assistant:

"Now I want to understand how data is distributed: **What are normal distributions and why are they important?**"

In the variable `normal_dist_importance`, describe the key visual characteristic of a **normal distribution** (the 'bell curve') and explain why it is so frequently used in statistics and data science.
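
One well-known property of the normal distribution is that about 68% of values fall within one standard deviation of the mean and about 95% within two. The sketch below checks this on simulated data; the mean of 100 and standard deviation of 15 are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is repeatable

# Draw 10,000 values from a normal distribution with mean 100 and standard deviation 15
values = [random.gauss(100, 15) for _ in range(10_000)]

center = statistics.mean(values)
sd = statistics.stdev(values)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(v - center) <= k * sd for v in values) / len(values)

print(f"Within 1 SD: {share_within(1):.1%}")  # theory: about 68%
print(f"Within 2 SD: {share_within(2):.1%}")  # theory: about 95%
```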

In [None]:
normal_dist_importance = ...

In [None]:
grader.check("q3.3")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 3.4: Outliers

Ask your AI assistant:

"Now I want to understand how data is distributed: **What are outliers and how do they affect our analysis?**"

In the variable `outlier_effect`, define an **outlier** and briefly explain one way it can distort the conclusions drawn from a dataset.
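
There is no single definition of an outlier, but one common rule of thumb (often attributed to Tukey) flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up commute times; treat it as one convention, not the only way to define an outlier.

```python
import statistics

# Made-up commute times in minutes; one value is far from the rest
data = [42, 45, 47, 50, 52, 55, 58, 60, 63, 250]

# Quartile cut points (the middle value is the median and is not needed here)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR below Q1 or above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Flagged as outliers:", outliers)
```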

In [None]:
outlier_effect = ...

In [None]:
grader.check("q3.4")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Exercise: Interpreting a Sample Distribution

The code block below generates a sample dataset. Run the code to visualize the data distribution before answering the next two questions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(42)

# Generate a right-skewed dataset (exponential draws shifted up by 30,000)
skewed_data = np.random.exponential(scale=20000, size=500) + 30000

# Append a single, extreme outlier
salaries_data = np.append(skewed_data, [500000])

# Create and display the histogram
plt.figure(figsize=(10, 6))
plt.hist(salaries_data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Hypothetical Salaries (Right Skewed with Outlier)')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')
plt.axvline(np.median(salaries_data), color='red', linestyle='dashed', linewidth=1, label='Median')
plt.axvline(np.mean(salaries_data), color='green', linestyle='dashed', linewidth=1, label='Mean')
plt.legend()
plt.show()


### Question 3.5: Identifying Skewness

Based on the histogram generated above, is the `salaries_data` distribution left-skewed, right-skewed, or roughly symmetric (as a normal distribution would be)? In the variable `skewness_and_center`, state the skewness and describe how the mean (green line) and median (red line) relate to each other in this shape.

In [None]:
skewness_and_center = ...

In [None]:
grader.check("q3.5")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 3.6: Identifying the Outlier

Referring to the histogram, is there an obvious outlier in `salaries_data`? In the variable `outlier_location`, describe where any outlier sits relative to the rest of the data, and explain its effect on the mean and on the median.

In [None]:
outlier_location = ...

In [None]:
grader.check("q3.6")

In [None]:
!./helpful-script.sh save 1>/dev/null


--- 

## Part 4: Summary Statistics (55 min)

**Your Mission:** Learn how to summarize and quantify the key features of a data distribution.

---

### Question 4.1: Measures of Center

Ask your AI assistant:

"How do we summarize distributions with numbers? **What do mean, median, and mode tell us differently?**"

In the variable `center_measures_difference`, summarize the unique information provided by the **mean**, **median**, and **mode**.
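
For a quick side-by-side comparison, the sketch below computes all three measures on a tiny made-up dataset using Python's standard `statistics` module. The values are invented so that the three measures do not all agree.

```python
import statistics

# Made-up data: number of books read last year by seven people
books = [2, 3, 3, 3, 4, 5, 15]

print("Mean:  ", statistics.mean(books))    # 5  (sum of 35 divided by 7 values)
print("Median:", statistics.median(books))  # 3  (middle value of the sorted list)
print("Mode:  ", statistics.mode(books))    # 3  (most frequent value)
```

Here the single large value of 15 lifts the mean above the median and the mode, so the three numbers answer slightly different questions about what is "typical".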

In [None]:
center_measures_difference = ...

In [None]:
grader.check("q4.1")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 4.2: Measures of Spread

Ask your AI assistant:

"How do we summarize distributions with numbers? **How do we measure spread (range, standard deviation)?**"

In the variable `spread_measures`, define the **range** and the **standard deviation**, and explain what each one tells you about the variability of the data.
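
The sketch below compares two made-up sets of quiz scores that share the same mean but differ in spread. It uses `statistics.stdev`, which computes the sample standard deviation; the population version is `statistics.pstdev`.

```python
import statistics

# Two made-up sets of quiz scores with the same mean (50) but different spread
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, scores in [("tight", tight), ("wide", wide)]:
    value_range = max(scores) - min(scores)  # distance between the two extremes
    sd = statistics.stdev(scores)            # sample standard deviation
    print(f"{name}: mean={statistics.mean(scores)}, range={value_range}, sd={sd:.2f}")
```

The range depends only on the two most extreme values, while the standard deviation uses every value's distance from the mean.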

In [None]:
spread_measures = ...

In [None]:
grader.check("q4.2")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 4.3: Outliers' Effect on Summary Statistics (Theory)

Ask your AI assistant:

"How do we summarize distributions with numbers? **How do outliers affect different summary statistics?**"

In the variable `outlier_on_stats`, explain how an **outlier** affects the **mean** and how it affects the **median**.
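
A small before-and-after comparison can make the effect visible. The sketch below uses five made-up salaries and then appends one extreme value; the numbers are invented for illustration and are separate from the `salaries_data` exercise.

```python
import statistics

# Five made-up salaries, evenly spaced around 50,000
salaries = [40_000, 45_000, 50_000, 55_000, 60_000]

# The same data with one extreme value appended
with_outlier = salaries + [500_000]

print("Mean before:  ", statistics.mean(salaries))        # 50000
print("Mean after:   ", statistics.mean(with_outlier))    # 125000
print("Median before:", statistics.median(salaries))      # 50000
print("Median after: ", statistics.median(with_outlier))  # 52500.0
```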

In [None]:
outlier_on_stats = ...

In [None]:
grader.check("q4.3")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 4.4: Best Measure of Center (General)

Ask your AI assistant:

"How do we summarize distributions with numbers? **When is each measure of center most useful?**"

In the variable `best_measure_of_center`, explain which measure of center (**mean, median, or mode**) is generally considered the most robust (least affected by outliers) and why.

In [None]:
best_measure_of_center = ...

In [None]:
grader.check("q4.4")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Exercise: Applying Measures of Center to Skewed Data

The code block below regenerates the histogram from Part 3, using the same **right-skewed salary data with an outlier**. Use this visualization to answer the next two questions about central tendency.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(42)

# Generate a right-skewed dataset (exponential draws shifted up by 30,000)
skewed_data = np.random.exponential(scale=20000, size=500) + 30000

# Introduce a single, extreme outlier
salaries_data = np.append(skewed_data, [500000])

# Create and display the histogram
plt.figure(figsize=(10, 6))
plt.hist(salaries_data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Hypothetical Salaries (Right Skewed with Outlier)')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')
plt.axvline(np.median(salaries_data), color='red', linestyle='dashed', linewidth=1, label='Median')
plt.axvline(np.mean(salaries_data), color='green', linestyle='dashed', linewidth=1, label='Mean')
plt.legend()
plt.show()


### Question 4.5: Best Center Metric for Skewed Data (Application)

Look closely at the histogram above, specifically at the positions of the **Mean** (green line) and the **Median** (red line). Which of these two metrics better describes the *typical* salary in this specific dataset? Record your choice in the variable `best_center_metric`.

In [None]:
best_center_metric = ...

In [None]:
grader.check("q4.5")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 4.6: Justifying the Metric Choice

In the variable `justification_for_metric`, explain **why** the metric you chose in Question 4.5 is the better description of the center of the `salaries_data` distribution, specifically considering the skew and the extreme outlier.

In [None]:
justification_for_metric = ...

In [None]:
grader.check("q4.6")

In [None]:
!./helpful-script.sh save 1>/dev/null