# Chapter 12 - Sample Surveys

* step from making statements about a small sample to statements about the entire population
* three underlying ideas

## Idea 1: Examine a Part of the Whole

* examine a smaller group of individuals than the entire population -- a **sample**
* if selected properly, a small sample can represent the entire population
* **sample surveys** ask questions of a small group in hopes of learning something about the entire population

### Bias

* selecting a sample fairly is difficult
* sampling methods that, by their nature, tend to over- or underemphasize some characteristics of the population are said to be **biased**
* conclusions based on biased samples are inherently flawed
* modern polls select individuals _at random_

## Idea 2: Randomize

* randomizing protects us from the influences of _all_ the features of our population by making sure that, _on average_, the sample looks like the rest of the population

## Idea 3: Sample Size

* how large does the sample need to be to reasonably represent the population?
* what matters isn't the _fraction_ or _percentage_ of the population, but _the number of individuals in the sample_
    * note: _percentage_ can matter if the population is small to begin with, but not when the sample is a small fraction of the population
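
A small simulation can make this concrete. The sketch below is only an illustration: the population sizes, the 30% true proportion, and the helper name `spread_of_sample_proportions` are assumptions made up for the example, not taken from the chapter. It measures how much the sample proportion varies across repeated random samples.

```python
import random
import statistics

def spread_of_sample_proportions(pop_size, n, p=0.3, trials=2000, seed=1):
    """Standard deviation of the sample proportion across repeated samples of size n."""
    rng = random.Random(seed)
    n_ones = int(pop_size * p)
    # Hypothetical population of 0/1 values in which a proportion p are ones.
    population = [1] * n_ones + [0] * (pop_size - n_ones)
    props = [sum(rng.sample(population, n)) / n for _ in range(trials)]
    return statistics.stdev(props)

# Same sample size, very different sampled fractions (1% vs 0.01%).
small_pop = spread_of_sample_proportions(pop_size=10_000, n=100)
large_pop = spread_of_sample_proportions(pop_size=1_000_000, n=100)

# Same population, four times the sample size.
quadrupled = spread_of_sample_proportions(pop_size=10_000, n=400)

print(small_pop, large_pop, quadrupled)
```

Under these assumptions the first two spreads should come out nearly equal even though the sampled fractions differ by a factor of 100, while quadrupling $n$ should roughly halve the spread.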

## Does a Census Make Sense?

* a **census** is a sample that consists of the entire population
* challenges:
  * difficult or impractical to complete
  * population is changing (even during census generation)
  * can be more complex than sampling

## Populations and Parameters

* **parameter** - a number, usually unknown, in a model describing some aspect of the population
* **statistic** - a summary found from data based on a sample

* we want statistics we compute from a sample to reflect the corresponding parameters accurately; a sample that does this is said to be **representative**

## Simple Random Samples

* how to draw a representative sample?
* ensure that every individual has an equal chance of being selected?
    * this is _necessary_, but not _sufficient_
    * example: School with equal number of males and females.  Flip a coin.  If heads, pick 100 females at random.  If tails, pick 100 males at random.  Each individual has an equal chance of being selected, but in either case, the sample will not be representative.
* instead, ensure that every possible _sample_ or _combination of individuals_ of the size intended has an equal chance of being selected
* a sample drawn this way is called a **simple random sample** or **SRS**
* the **sampling frame** is the list of individuals from which the sample will be drawn
* assign random numbers to individuals in sampling frame, then select those satisfying some rule
* samples drawn at random generally differ from one another; these differences lead to different values for the variables measured; these sample-to-sample differences are called **sampling variability** or **sampling error**
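
As a minimal sketch of the ideas above, Python's standard `random.sample` draws without replacement, so every set of `n` individuals from the frame is equally likely. The sampling frame of student IDs, the made-up heights, and the helper names `draw_srs` and `mean_height` are all hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 1,000 student IDs, each with a made-up height in cm.
frame = list(range(1000))
height = {sid: rng.gauss(170, 10) for sid in frame}

def draw_srs(frame, n, rng):
    """Simple random sample: every set of n individuals is equally likely."""
    return rng.sample(frame, n)

def mean_height(ids):
    """Mean of the made-up heights for the given IDs."""
    return sum(height[sid] for sid in ids) / len(ids)

# The parameter is known here only because we invented the whole population.
parameter = mean_height(frame)

# Two statistics computed from two independent SRSs of the same size.
stat_a = mean_height(draw_srs(frame, 50, rng))
stat_b = mean_height(draw_srs(frame, 50, rng))

print(parameter, stat_a, stat_b)
```

The two statistics will generally differ from each other and from the parameter; that difference is the sample-to-sample variation described above, not a mistake in the sampling.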

## Stratified Sampling

* involves slicing the population into homogeneous groups, called **strata**, before the sample is selected
* SRS is used within each stratum, then the results are combined
* most important benefit of stratified sampling is that samples taken within a stratum vary less, reducing sampling variability
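
One way to sketch this in code: group the frame by stratum, draw an SRS within each group, then combine. The frame, the stratum labels, and the per-stratum sample sizes below are hypothetical; choosing sizes proportional to each stratum is an assumption of this example, not something the notes prescribe.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Draw an SRS of the requested size within each stratum, then combine."""
    strata = defaultdict(list)
    for individual in frame:
        strata[stratum_of(individual)].append(individual)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum[label]))
    return sample

rng = random.Random(0)

# Hypothetical frame of (id, class_year) pairs: 300 first-years and 100 seniors.
frame = [(i, "first-year") for i in range(300)] + [(i, "senior") for i in range(300, 400)]

sample = stratified_sample(
    frame,
    stratum_of=lambda person: person[1],
    n_per_stratum={"first-year": 30, "senior": 10},
    rng=rng,
)
print(len(sample))
```

Because the number drawn from each stratum is fixed in advance, the makeup of the sample by class year cannot vary from sample to sample.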

## Cluster and Multistage Sampling

* splitting the population into _representative_ **clusters** can make sampling more practical
* can select one or more clusters at random and perform a census within each
* this sampling design is called **cluster sampling**
* if each cluster fairly represents the population, then cluster sampling will be unbiased
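
A minimal sketch of cluster sampling, assuming the clusters are already defined: pick whole clusters at random and include everyone in them. The dorm-floor clusters and the helper name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, k, rng):
    """Pick k whole clusters at random and take everyone in them (a census per cluster)."""
    chosen = rng.sample(sorted(clusters), k)
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

rng = random.Random(7)

# Hypothetical clusters: 10 dorm floors with 20 residents each.
clusters = {
    f"floor-{f}": [f"resident-{f}-{i}" for i in range(20)]
    for f in range(10)
}

chosen, sample = cluster_sample(clusters, 2, rng)
print(chosen, len(sample))
```

Randomness enters only in the choice of clusters; within a chosen cluster nobody is left out.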

### Difference between Stratified and Cluster Sampling

* we stratify to ensure our sample represents different groups in the population
  * sample randomly within each stratum
  * strata are internally _homogeneous_, but differ from one another
* we cluster to make sampling more practical or affordable
  * clusters are internally _heterogeneous_, each resembling the overall population
* Boston cream pie example
  * cluster sampling: make multiple vertical slices - each slice is a cluster
  * stratified sampling: take some random tastes from the cake layer, from the cream layer, and from the chocolate layer; each layer is a stratum
  

* sampling schemes that combine several methods are called **multistage samples**
* book example: stratify by _section_ (to factor in increasing complexity by section); randomly select a chapter from each section; within each chapter, randomly select one or more pages (as clusters); finally, perform an SRS of sentences within each page
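
The book example can be sketched as nested random choices. Everything about the data layout here is an assumption: the nested dictionary `book`, the counts of sections, chapters, pages, and sentences, and the helper name `multistage_sample` are invented for illustration.

```python
import random

def multistage_sample(book, pages_per_chapter, sentences_per_page, rng):
    """book is a hypothetical layout: {section: {chapter: {page: [sentences]}}}."""
    sample = []
    for section in sorted(book):                     # stage 1: every section is a stratum
        chapters = book[section]
        chapter = rng.choice(sorted(chapters))       # stage 2: one random chapter per section
        pages = chapters[chapter]
        for page in rng.sample(sorted(pages), pages_per_chapter):       # stage 3: random pages
            sample.extend(rng.sample(pages[page], sentences_per_page))  # stage 4: SRS of sentences
    return sample

rng = random.Random(3)

# Made-up book: 3 sections, 4 chapters each, 8 pages per chapter, 15 sentences per page.
book = {
    f"section-{s}": {
        f"chapter-{s}.{c}": {
            p: [f"s{s}c{c}p{p}-sentence{i}" for i in range(15)]
            for p in range(8)
        }
        for c in range(4)
    }
    for s in range(3)
}

sample = multistage_sample(book, pages_per_chapter=2, sentences_per_page=5, rng=rng)
print(len(sample))
```

Each stage uses one of the methods already described: stratifying by section, choosing chapters and pages at random, and an SRS at the final stage.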

## Systematic Samples

* true random sampling, particularly with a very large population, can be very expensive / time consuming
* systematic sampling aims to reduce that cost, while still generating a representative, unbiased sample
* example: survey every 10th person on a list of students
  * the order of the list must not be associated in any way with the response being evaluated
  * you must start the selection from a randomly selected individual
    * (why?) - if the order is _truly_ not associated with the response, it shouldn't matter if you start at the first entry.
    
* you must justify that the systematic method used _is not_ associated with any of the measured variables
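
A minimal sketch of the every-10th-person example, with the random start made explicit. The student list and the helper name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(frame, step, rng):
    """Take every `step`-th individual, starting at a random position among the first `step`."""
    start = rng.randrange(step)
    return frame[start::step]

rng = random.Random(11)

# Hypothetical list of 200 students.
students = [f"student-{i:03d}" for i in range(200)]

sample = systematic_sample(students, 10, rng)
print(sample[:3], len(sample))
```

One common justification for the random start is that it gives each individual on the list the same 1-in-`step` chance of being selected, whatever the order of the list.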

## Step-by-Step Example : Sampling

* plan: state what you want to know
* identify the W's of the study
  * _why_ determines population and sampling frame
  * _what_ identifies parameter of interest and variables measured
  * _who_ is the sample we actually draw
* sampling plan: specify the sampling method, and sample size $n$; specify:
  * how the sample was actually drawn
  * what is the sampling frame
  * how randomization was performed
  * description should be complete enough to allow others to replicate
* sampling details: include how respondents were contacted, what incentives were offered to encourage response, how nonrespondents were treated, etc.
* conclusion: 
  * discuss all elements of sampling
  * include any special circumstances
  * if appropriate, include details of the specific questions asked
  * show a display of the data, provide and interpret the statistics from the sample, state conclusions

## Defining the "Who"

* define the group: it may be difficult to provide a clear definition of the group you want to study
* specify the sampling frame: usually, the sampling frame is not the group you _really_ want to know about; it limits what your survey can find out
* note distinction between _target sample_ and _actual sample_: non-response is one of the main factors separating the two, and can be a major challenge
* as a result, the actual sample might not be representative of the sampling frame or the population

* each constraint can introduce biases into the final sample
* a careful study should address the question of how well each group matches the population of interest

## The Valid Survey

A valid survey yields the information we are seeking about the population we are interested in.

* Important preliminary questions
  * What do I want to know?
  * Am I asking the right respondents?
  * Am I asking the right questions?
  * What would I do with the answers if I had them; would they address the things I want to know?

* Know what you want to know: understand what you hope to learn and about whom you hope to learn it.
* Use the right frame: Have you identified the population of interest and sampled from it appropriately?
* Avoid non-response bias by only asking necessary questions.  (longer questionnaires yield fewer responses)
* Ask specific rather than general questions
* Ask for quantitative results when possible
* Be careful in framing questions: 
  * aim for understanding on the part of the respondent
  * realize that respondents often won't ask for clarification
  * respondents may answer dishonestly due to embarrassment, intimidation, insult, or to avoid offending
* Avoid phrases with double meanings to avoid confusion
* Be aware that subtle differences in how a question is phrased can change the responses
* Be careful in phrasing answers; strive for neutral values

Consider doing a **pilot** survey, a trial run of the survey you eventually plan to give to a larger group, using a draft of the survey questions, administered to a small sample drawn from the same sampling frame you intend to use.

## Sampling Errors: How to Sample Badly

* bad sample designs yield worthless data
* many convenient forms of sampling can be seriously biased
* there is no way to correct for the bias from a bad sample

Some examples of poor sampling:

### Voluntary Response Sampling

* **voluntary response sample**: a large group of individuals is invited to respond, and all who do respond are counted
  * almost always biased
  * the sampling frame is often hard to define
* the sample is not representative, even though every individual in the population may have been offered the chance to respond. The resulting **voluntary response bias** invalidates the survey
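
A small simulation can show how this bias arises. All numbers below are invented assumptions: a population in which 20% hold a strong opinion, and response rates of 30% for those people versus 3% for everyone else. The sketch compares the volunteers with an SRS of the same size.

```python
import random

rng = random.Random(5)

# Hypothetical population: 1 = holds a strong opinion (20%), 0 = does not (80%).
population = [1] * 2_000 + [0] * 8_000

def responds(opinion, rng):
    """Assumed response rates: 30% for strong opinions, 3% for everyone else."""
    rate = 0.30 if opinion == 1 else 0.03
    return rng.random() < rate

# Everyone is invited; only those who choose to respond are counted.
volunteers = [x for x in population if responds(x, rng)]

# For comparison, an SRS of the same size drawn from the same population.
srs = rng.sample(population, len(volunteers))

true_share = sum(population) / len(population)
volunteer_share = sum(volunteers) / len(volunteers)
srs_share = sum(srs) / len(srs)

print(true_share, volunteer_share, srs_share)
```

Under these assumed response rates the volunteers should overstate the share of strong opinions by a wide margin, while the SRS should land close to the true 20%. Collecting more volunteers would not fix this, because the gap comes from who chooses to respond.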

### Convenience Sampling

* In **convenience sampling** we simply include the individuals who are convenient for us to sample.
* e.g. an online convenience survey used to measure internet access: it reaches only people who already have access

### Incomplete Sampling Frame

* introduces bias because the individuals included may differ from the ones not in the frame

### Undercoverage

* in **undercoverage**, some portion of the population is not sampled at all or has a smaller representation in the sample

## What Can Go Wrong?

* watch out for nonrespondents
* work hard to avoid influencing responses

### How to Think About Biases

* look for biases in any survey you encounter
* spend your time and resources reducing biases
* if you can, pilot-test your survey, then refine survey based on results
* always report your sampling methods in detail

## What Have We Learned

* [p. 298]