# Data and Sampling Distributions

Cheat sheet for creating a proper, representative dataset.

**1️⃣  Define Your Objective**
* Clearly state what you want to measure or predict.
* Identify your **population of interest**.
  * Example: “All adults in New York City” for a city-wide poll.
* Determine the type of data needed: categorical, numerical, text, etc.

**2️⃣  Identify the Population**
* **Population** = full group you want to represent.
* Define **inclusion/exclusion** criteria.
* Check **feasibility**: Can you reach this population?

**3️⃣ Choose Sampling Method**

| Method            | When to Use                              | Notes / Pitfalls                                                                    |
|-------------------|------------------------------------------|-------------------------------------------------------------------------------------|
| **Simple Random** | Small, well-defined populations          | Everyone has an equal chance; requires a complete list of the population            |
| **Stratified**    | Known subgroups exist (age, gender, region) | Sample proportionally from each subgroup so none is over- or underrepresented    |
| **Cluster**       | Population is large and spread out       | Randomly select clusters, then sample within them; similar units within a cluster reduce precision |
| **Systematic**    | Ordered population list exists           | Pick every k-th individual from a random start; avoid periodicity bias              |
| **Convenience**   | Quick, cheap exploratory work            | Often biased; avoid for formal studies                                              |
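
The four probability-based methods in the table can be sketched with Python's standard `random` module. This is a minimal illustration, not a production sampler: the function names, the `people` list, and its `region` field are all hypothetical, and the stratified version assumes proportional allocation (the same fraction from every stratum).

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; every unit is equally likely."""
    return random.Random(seed).sample(population, n)

def stratified_sample(population, key, fraction, seed=None):
    """Proportional allocation: draw the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[key(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(population, k, seed=None):
    """Take every k-th unit of an ordered list, starting at a random offset."""
    start = random.Random(seed).randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, per_cluster, seed=None):
    """Pick clusters at random, then sample units inside each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample

# Hypothetical population: 1000 people, one quarter of them in the "north" region.
people = [{"id": i, "region": "north" if i % 4 == 0 else "south"} for i in range(1000)]
```

With this toy population, `stratified_sample(people, key=lambda p: p["region"], fraction=0.1)` returns 25 people from the north and 75 from the south, mirroring the 1:3 split in the population.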

**4️⃣ Determine Sample Size**

Influenced by:
* **Population size**: If the population is very large (like all adults in a country), you can often treat it as “effectively infinite” for sample size calculation.
* **Desired margin of error**: Maximum acceptable difference between sample estimate and true population value. Example: ±5% means your estimate should be within 5 percentage points of the true value.
* **Confidence level CL** (typically 95%): The share of repeated samples whose confidence interval would contain the true population value. A higher confidence level requires a larger sample.

Formula for estimating a proportion (approximate for large populations):

$n = \frac{Z^2 \cdot p \cdot (1-p)}{E^2}$

Where:
* $Z$ = z-score for confidence level (1.96 for 95%)
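* $p$ = estimated proportion of success (use 0.5 if unknown; it maximizes $p(1-p)$ and so gives the largest, most conservative $n$)
* $E$ = margin of error as a proportion (0.05 for ±5 percentage points)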

| Confidence Level (CL) | Z-score |
|-----------------|---------|
| 90%             | 1.645   |
| 95%             | 1.96    |
| 99%             | 2.576   |
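
The z-scores in the table can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_score` is an assumption made for illustration, and it assumes a two-sided interval (half of the leftover probability in each tail).

```python
from statistics import NormalDist

def z_score(confidence_level):
    """Two-sided critical value: leave (1 - CL) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

for cl in (0.90, 0.95, 0.99):
    print(f"{cl:.0%}: {z_score(cl):.3f}")
# 90%: 1.645
# 95%: 1.960
# 99%: 2.576
```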

Example:

You want to estimate the proportion of people who like a new app in a population of millions:
* $CL = 95\% \rightarrow Z = 1.96$
* $p = 0.5$ (unknown)
* $E = 0.05$

$n = \frac{1.96^2 \cdot 0.5 \cdot 0.5}{0.05^2}
= \frac{3.8416 \cdot 0.25}{0.0025}
= \frac{0.9604}{0.0025}
= 384.16 \Rightarrow n = \lceil 384.16 \rceil = 385$

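→ So you need at least 385 completed responses to estimate the proportion within ±5 percentage points at 95% confidence. Always round up, since rounding down would widen the margin of error.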
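
The calculation above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `sample_size` is hypothetical, and the optional `population` argument applies the standard finite population correction, $n = \frac{n_0}{1 + (n_0 - 1)/N}$, which this cheat sheet does not otherwise cover and which only matters when the population is small.

```python
import math

def sample_size(z, p, e, population=None):
    """Sample size for estimating a proportion, rounded up to a whole unit."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)  # large-population formula
    if population is not None:
        # Finite population correction (assumption: not part of the formula above).
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size(1.96, 0.5, 0.05))                   # 385
print(sample_size(1.96, 0.5, 0.05, population=1000))  # 278
```

The second call shows why "effectively infinite" is a safe default: correcting for a population of only 1,000 reduces the requirement, so the uncorrected formula never under-samples.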


**5️⃣ Data Collection Best Practices**

* **Reduce nonresponse bias**: follow up with participants who don’t respond, and track the response rate.
* **Ensure clarity**: questions or measurement methods should be unambiguous.
* **Randomize order** if applicable to reduce bias.
* **Record metadata**: time, source, collection method.


**6️⃣ Check Representativeness**
* Compare sample vs. population demographics (age, gender, region, etc.).
* If skewed, apply weighting: give each group a weight equal to its population share divided by its sample share, so underrepresented groups count for more.
* Detect outliers or errors early.
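
The weighting step can be sketched as follows. This is one simple approach (post-stratification on a single variable), and the age groups and population shares below are made-up illustrative values, not real figures; the function name is also hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}

# Hypothetical sample of 100 respondents, skewed toward the middle age group.
sample_groups = ["18-34"] * 20 + ["35-54"] * 50 + ["55+"] * 30
# Hypothetical population shares for the same groups.
population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

weights = poststratification_weights(sample_groups, population_shares)
for group, w in weights.items():
    print(f"{group}: {w:.2f}")
```

Here the 18-34 group makes up 20% of the sample but 30% of the population, so each of its respondents gets a weight of about 1.5, while the overrepresented 35-54 group is weighted down to about 0.8. The weights sum to the sample size, so weighted totals stay on the original scale.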


**7️⃣ Data Cleaning & Validation**
* Handle missing values: impute, remove, or flag.
* Correct obvious errors (typos, impossible values).
* Standardize formats (dates, strings, categorical codes).
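
A minimal sketch of these three cleaning steps on a single survey record, using only the standard library. The field names (`age`, `region`, `collected_on`), the accepted date formats, and the 0 to 120 age range are all assumptions chosen for illustration; this version flags problems rather than imputing values.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats

def parse_date(text):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_record(record):
    """Standardize one record and flag missing or impossible values."""
    cleaned = {"flags": []}

    raw = record.get("age")
    raw_age = "" if raw is None else str(raw).strip()
    if not raw_age:
        cleaned["age"] = None
        cleaned["flags"].append("age_missing")
    elif raw_age.isdigit() and 0 <= int(raw_age) <= 120:
        cleaned["age"] = int(raw_age)
    else:
        cleaned["age"] = None
        cleaned["flags"].append("age_invalid")  # e.g. typo or impossible value

    # Standardize categorical strings: trim whitespace, lowercase.
    cleaned["region"] = str(record.get("region") or "").strip().lower() or None

    cleaned["collected_on"] = parse_date(str(record.get("collected_on") or ""))
    if cleaned["collected_on"] is None:
        cleaned["flags"].append("date_unparsed")
    return cleaned

print(clean_record({"age": " 34 ", "region": " North ", "collected_on": "03/02/2024"}))
print(clean_record({"age": "250", "region": "", "collected_on": "n/a"}))
```

Keeping a `flags` list on each record, instead of silently dropping bad rows, makes it possible to report later how many values were missing or invalid.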

**8️⃣ Documentation**

Always note:
* Sampling method
* Inclusion/exclusion criteria
* Collection method
* Limitations / potential biases
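
One way to keep these notes alongside the dataset is a small machine-readable record. The sketch below is only a suggestion: the `SamplingRecord` class and every value in it are hypothetical examples.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class SamplingRecord:
    sampling_method: str
    inclusion_criteria: list
    exclusion_criteria: list
    collection_method: str
    limitations: list = field(default_factory=list)

# Hypothetical example values.
record = SamplingRecord(
    sampling_method="stratified",
    inclusion_criteria=["adults aged 18+", "resident of New York City"],
    exclusion_criteria=["no valid contact details"],
    collection_method="online survey",
    limitations=["people without internet access are underrepresented"],
)

# Serialize so the record can be stored next to the data files.
print(json.dumps(asdict(record), indent=2))
```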
