# Sampling data and simulations. 

In this lecture, we will:
- Define a sampling frame and sampling error.
- Give examples of how to do simple random sampling and systematic sampling.
- Explain the differences between the two sampling types and when to choose one over the other.
- Discuss the differences between cluster sampling and convenience sampling.


## Sampling definitions

When sampling, we extract a subset of the population data so that we can perform inferential statistics on it.

In other words, we want to infer some results of a population based on the analysis of the sample data. 

Why do this?  Well, in many cases, analyzing the entire population may not be feasible, or even necessary. 


### Sampling frame
A sampling frame is simply a list of individuals from which a sample is actually selected. 

This list may be a physical or concrete list, for example, a list of students enrolled at a university.

The list may also be a theoretical one, for example, the list of students who may enroll in the university for the fall quarter.



### Undercoverage

Your sampling frame should include everyone in the population who is supposed to have a chance of being selected for your sample.

Failure to do this may lead to *undercoverage*, where part of the population can never be selected. This can skew your results and may ultimately lead to poor decision-making based on your analysis.

How can this happen? For example, if the population list is incomplete or out of date, you may miss people who were added after you received the data.




### Sampling errors.

When taking samples from a population, you will almost certainly have errors in both means and percentages.

That is to say, a sample mean will almost certainly differ from the population mean.

Likewise, sample percentages will differ from population percentages.

These types of errors are usually unavoidable, but steps can be taken to account for them.
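
As a rough illustration of sampling error, the sketch below builds a made-up population of exam scores (the numbers are assumptions, not real data), draws a few random samples, and prints how far each sample mean lands from the population mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 made-up exam scores (not real data).
population = [rng.gauss(70, 10) for _ in range(1000)]
population_mean = statistics.mean(population)

# Draw five random samples of size 25 and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 25)) for _ in range(5)]

for sample_mean in sample_means:
    # The gap between a sample mean and the population mean is the sampling error.
    error = sample_mean - population_mean
    print(f"sample mean = {sample_mean:.2f}, sampling error = {error:+.2f}")
```

Each run of the loop should show a slightly different sample mean, none of which is expected to match the population mean exactly.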


### Non-sampling errors. 
These errors are avoidable, and care should be taken to make sure that they don't occur.

One example is using a sample list that has incomplete or bad data.
It is important to make sure that everyone in the population who is supposed to be on your list is actually there. It is equally important to make sure that individuals who are not supposed to be on that list are not included.
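
One simple way to check a list for both problems is to compare it against a trusted reference, if one exists. The sketch below assumes two hypothetical sets of student IDs (both made up for illustration) and uses set differences to find who is missing and who should not be listed.

```python
# Hypothetical ID lists, made up for illustration.
current_enrollment = {"S001", "S002", "S003", "S004", "S005"}  # who should be on the list
sampling_frame = {"S001", "S002", "S004", "S099"}              # the outdated list we received

# People who belong in the frame but are absent from it (undercoverage).
missing_from_frame = current_enrollment - sampling_frame
# People on the list who do not belong to the population.
should_not_be_listed = sampling_frame - current_enrollment

print(sorted(missing_from_frame))    # ['S003', 'S005']
print(sorted(should_not_be_listed))  # ['S099']
```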



### Causes of errors. 

- Sampling errors.  Caused by the fact that regardless of what you do, your sample will not perfectly represent the population. 

- Non-sampling errors.  Caused by many things, including
    - Poor sample design.
    - Sloppy data collection.
    - Inaccurate measurement instruments. 
    - Bias in data collection. 
    - Systemic problems introduced by the researchers. 
    


It is important to do our best to avoid non-sampling errors.
One way we do this is by avoiding undercoverage in our sampling frame.

## Simulations. 

A simulation is defined as a 'numerical facsimile or representation of a real world phenomenon'. 

In other words, we work through a pretend situation to see how it compares to a real one.
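
As a minimal sketch of what a simulation can look like, the code below pretends to roll a fair six-sided die many times and compares the simulated share of sixes with the theoretical value of 1/6. The die, the number of rolls, and the seed are all assumptions chosen for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rolls = 10_000

# Simulate rolling a fair six-sided die n_rolls times.
rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

# Compare the simulated proportion of sixes with the theoretical one.
proportion_of_sixes = rolls.count(6) / n_rolls
print(f"simulated: {proportion_of_sixes:.3f}, theoretical: {1 / 6:.3f}")
```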

## Simple random sampling

A definition of simple random sampling:
'A simple random sample of *n* measurements from a population is a subset of the population selected in such a manner that every sample of size *n*  has an equal chance of being selected.'

One consequence is that any individual in the population has the same chance of being selected for the sample as any other individual.

For example, if we have a list of all the students in a class and we want to take a sample of size n=5 (remember that the lowercase n stands for the sample size), then a simple random sample means that every possible combination of five students has an equal chance of being picked.
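
To get a feel for how many different samples are possible, the sketch below counts the distinct groups of five students that can be drawn, assuming a hypothetical class of 30 students (the class size is not given in the lecture).

```python
import math

class_size = 30   # hypothetical class size, assumed for illustration
sample_size = 5   # n, the sample size from the example

# Number of distinct groups of 5 students that can be chosen from 30.
n_possible_samples = math.comb(class_size, sample_size)
print(n_possible_samples)  # 142506

# Under simple random sampling, each of these groups is equally likely.
probability_of_each_sample = 1 / n_possible_samples
print(f"{probability_of_each_sample:.8f}")
```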






### How to pick a random sample.
In our previous example with students, there are many good ways to pick a random sample.
For example:

- We could label every individual in the population with a unique number, such as a student ID.
- We could put all of the student ID numbers somewhere we can draw from without looking, such as a hat.
- We could then draw five IDs and use those students as our sample.
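
The hat procedure above can be sketched in a few lines of Python. The student IDs here are hypothetical, and `random.sample` plays the role of drawing slips without putting them back.

```python
import random

# Hypothetical student IDs for a class of 30 (made up for illustration).
student_ids = [f"S{number:03d}" for number in range(1, 31)]

rng = random.Random(2024)  # fixed seed so the run is repeatable

# random.sample draws without replacement, like pulling slips from a hat.
sample = rng.sample(student_ids, k=5)
print(sample)
```

Because the draw is without replacement, no student can appear in the sample twice.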


### Limits of random sampling. 

Although random sampling is a good way to draw a sample from a population, this method has drawbacks.

- We need a list.  If we don't know ahead of time what our population will look like, then we can't create a list from it.

- The list has to be representative.  There's always the danger of undercoverage when compiling lists from a population. 


## Stratified sampling
With stratified sampling, we divide our list into groups, or *strata*.
We do this to make sure that certain groups appear in the final sample in the proportions we want.
We then perform random sampling within each stratum.

An example of strata might be dividing the population of students at a university into first-year, second-year, third-year and fourth-year students, or into undergraduate students, graduate students and post-doctoral researchers.
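
A minimal sketch of stratified sampling by year of study follows. The stratum sizes, the 5% sampling fraction, and the helper name `stratified_sample` are all assumptions made for illustration; the same fraction is drawn from each stratum so that the sample keeps the population's proportions.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical sampling frame of (student_id, year) pairs; sizes are made up.
strata_sizes = {"first": 400, "second": 300, "third": 200, "fourth": 100}
frame = [(f"{year}-{i}", year) for year, size in strata_sizes.items() for i in range(size)]

def stratified_sample(frame, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    # Group the student IDs by stratum.
    by_stratum = {}
    for student_id, year in frame:
        by_stratum.setdefault(year, []).append(student_id)
    # Take a random sample within each stratum.
    sample = {}
    for year, members in by_stratum.items():
        n = round(len(members) * fraction)
        sample[year] = rng.sample(members, n)
    return sample

sample = stratified_sample(frame, fraction=0.05, rng=rng)
print({year: len(ids) for year, ids in sample.items()})
```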




### Limitations of stratified sampling. 
Oversampling one group means that your summary statistics are unbalanced.  For example, if the same number of individuals is drawn from a small group as from a much larger one, then the smaller group may be over-represented in the final sample.
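
The arithmetic behind over-representation can be checked directly. The sketch below assumes made-up stratum sizes and an equal allocation of 50 individuals per stratum, then compares each group's share of the population with its share of the sample.

```python
# Hypothetical stratum sizes in the population (made-up numbers).
population_sizes = {"undergraduate": 9000, "graduate": 900, "postdoc": 100}
# Equal allocation: the same number is sampled from every stratum.
sample_sizes = {group: 50 for group in population_sizes}

total_population = sum(population_sizes.values())
total_sample = sum(sample_sizes.values())

for group in population_sizes:
    population_share = population_sizes[group] / total_population
    sample_share = sample_sizes[group] / total_sample
    print(f"{group}: {population_share:.1%} of population, {sample_share:.1%} of sample")
```

Under these assumptions the smallest group makes up a far larger share of the sample than of the population, so an unweighted summary statistic would lean toward that group.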

As with random sampling, you need a list to be able to stratify your groups. 
It may also be difficult to actually split your list into groups. 



## Systematic sampling. 

Systematic sampling can be done with or without a list.
To perform systematic sampling, we do the following:
- Arrange the individuals of the population in some particular order, for example, alphabetical order.
- Start by picking a random individual from the population.
- Define a number *k*.
- Starting from that individual, pick every *k*th member of the population for the sample.
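
The steps above can be sketched as follows. This version makes one assumption beyond the list: the random starting individual is chosen from the first *k* positions, a common variant that lets the sample span the whole ordered list. The names and the helper `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(ordered_population, k, rng):
    """Pick a random start among the first k individuals, then every k-th one after it."""
    start = rng.randrange(k)  # random starting position in 0..k-1
    return ordered_population[start::k]

# Hypothetical population of 100 names, arranged in alphabetical order.
names = sorted(f"student_{i:03d}" for i in range(100))

rng = random.Random(1)  # fixed seed so the run is repeatable
sample = systematic_sample(names, k=10, rng=rng)
print(sample)
```

With 100 individuals and *k* = 10, the sample always contains 10 individuals spaced 10 positions apart.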


### Characteristics of Systematic Sampling. 
Systematic sampling doesn't work well when the ordering of the population has a repeating pattern (for example, if the population is arranged in alternating genders: boy, girl, boy, girl), because picking every *k*th individual can line up with the pattern and select only one kind of individual.
Systematic sampling is appropriate where there is no known periodicity, for example, in a clinical setting.
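
The alternating example can be checked with a tiny sketch. The line-up below is hypothetical, and *k* = 2 is chosen deliberately to match the period of the pattern.

```python
# Hypothetical line-up in strictly alternating order.
population = ["boy", "girl"] * 10

k = 2  # sampling interval that matches the period of the pattern

sample_from_first = population[0::k]   # start at the first individual
sample_from_second = population[1::k]  # start at the second individual

print(set(sample_from_first))   # {'boy'}
print(set(sample_from_second))  # {'girl'}
```

Whichever starting point is used, the sample contains only one of the two groups.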



## Cluster Sampling.
Cluster sampling is effective when the population involves geographic locations. 

We use cluster sampling when a problem is localized to a particular geographic location. 


In cluster sampling, we begin by dividing the map into geographical areas. 
Then, we randomly pick clusters, or areas, from the map. We take all of the people in each chosen cluster for our sample.
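
A minimal sketch of this procedure follows, assuming a hypothetical map that has already been divided into eight areas with twenty residents each (all names and counts are made up).

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical map: 8 areas, each with a list of 20 residents.
clusters = {
    f"area_{a}": [f"area_{a}_person_{p}" for p in range(20)]
    for a in range(8)
}

# Randomly pick whole areas (clusters) from the map.
chosen_areas = rng.sample(sorted(clusters), k=2)

# Everyone in a chosen cluster is included in the sample.
sample = [person for area in chosen_areas for person in clusters[area]]
print(chosen_areas, len(sample))
```

Note that the randomness applies to the areas, not to the people: once an area is chosen, all of its residents are taken.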



### Problems with cluster sampling. 

Sometimes, the individuals located in a cluster are all similar in a way that makes the problem hard to study. 

For example, suppose we're trying to analyze statistical information to see whether a particular factory is causing health problems, such as cancer, in a population. If the people in that population are all in lower income brackets, and there are perhaps other sources of pollution in that geographic area, then it will be difficult to extract meaningful data for the problem that we are trying to solve.



## Convenience sampling. 

Convenience sampling is generally used for *low risk* scenarios, for example, trying to find the best restaurant in the area next to a hotel. 

To do this, you might ask some of the guests or workers in the hotel for their recommendation. 
 
Note that these results are often unreliable. 

Defining convenience sampling:
- Using results or data that is easily obtained. 
- Can be useful if there are not a lot of resources available for the study. 
- Generally uses an already assembled group for the survey. 


### Problems with convenience sampling. 
- There is usually a bias in each group.
- We can often miss important subpopulations (stratified sampling addresses this).
- Results can be severely biased.
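
To see how such a bias can arise, the sketch below simulates restaurant ratings under a made-up assumption: hotel guests, who are the easiest people to ask, rate the restaurant higher on average than other diners. All numbers are invented for illustration.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed so the run is repeatable

# Hypothetical ratings: 100 hotel guests and 900 other diners (made-up values).
hotel_guests = [rng.gauss(4.5, 0.3) for _ in range(100)]
other_diners = [rng.gauss(3.5, 0.5) for _ in range(900)]
population = hotel_guests + other_diners

# Convenience sample: whoever is easiest to ask, here the first 30 hotel guests.
convenience_sample = hotel_guests[:30]
# Simple random sample of the same size, for comparison.
random_sample = rng.sample(population, 30)

print(f"population mean:         {statistics.mean(population):.2f}")
print(f"convenience sample mean: {statistics.mean(convenience_sample):.2f}")
print(f"random sample mean:      {statistics.mean(random_sample):.2f}")
```

Under these assumptions the convenience sample overstates the average rating, because it never reaches the larger group of other diners.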


## Multi-stage sampling. 

Multi-stage sampling combines several sampling strategies, applying them one after another.

For example:

- Stage 1.  Cluster sample of US states. 
- Stage 2.  Simple random sample of counties in each state. 
- Stage 3. Stratified sample of secondary schools in each county. 
- Stage 4.  Stratified sample of students in each classroom in the school. 
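
The first three stages can be sketched as nested random draws. The frame below is entirely hypothetical (ten states, five counties each, six schools per county), and a plain random draw stands in for the stratified step, so this is a simplified outline rather than a full implementation of the example.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

# Hypothetical nested frame: state -> county -> list of school names (all made up).
frame = {
    f"state_{s}": {
        f"county_{s}_{c}": [f"school_{s}_{c}_{k}" for k in range(6)]
        for c in range(5)
    }
    for s in range(10)
}

# Stage 1: randomly choose whole states (clusters).
states = rng.sample(sorted(frame), k=3)

selected_schools = []
for state in states:
    # Stage 2: simple random sample of counties within each chosen state.
    counties = rng.sample(sorted(frame[state]), k=2)
    for county in counties:
        # Stage 3: sample schools within each chosen county
        # (a plain random draw stands in for the stratified step here).
        selected_schools.extend(rng.sample(frame[state][county], k=2))

print(len(selected_schools))  # 12
```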

