# Chapter 11 - Data Collection

With the recent growth in capacity for data storage and large-scale analysis, Big Data has become a hot topic in both data science and broader society. However, bigger data is not always better data. For example, difficulties or errors with measurement, data management, or data entry can lead to data that is incorrect or uninformative. The popular adage "garbage in, garbage out" refers to the use of incorrect or uninformative data leading to incorrect or uninformative analyses and results. In addition, the use of a large dataset that was not collected with the research question in mind may lead to misleading results if researchers are not cognizant of what types of data were collected and in what manner. It is important for data scientists to collect data using well-thought-out experimental design, including proper measurement, sampling, and tailoring to the research question(s). In this chapter, we will explore several important topics in data collection: causality vs association, observational vs experimental studies, sampling, and biases.

## Causality vs Association

As scientists, we are often looking for patterns or relations between variables. When there exists a pattern between two variables, we call this an **association** [^*]. For example, time of day is associated with traffic on Lake Shore Drive, and the temperature outside is associated with the number of people at Lake Michigan.

When we see that two variables X and Y are associated, we often wonder if one causes the other. There are 3 possible scenarios:
1. **Causation**: change in X causes change in Y (or vice versa)
2. Common response (confounding): some other variable Z causes change in both X and Y
3. Common outcome (colliding): changes in both X and Y cause change in some variable Z

Well-designed studies, which we will discuss further in the next section, can help distinguish between the three scenarios, which are often depicted using causal graphs. A causal graph is a graph where each node depicts a variable and each edge is directed (an arrow), pointing in the direction of a cause. The figure below shows causal graphs as well as examples for all three scenarios.

<p style="text-align:center;">
<img src="/Users/amandakube/textbook-datascience-1/textbook/images/Causality.png" alt="Three Types of Association" width="500"/>
</p>

The first panel shows a causal association. When we see a causal association between X and Y, we can depict it with an arrow from the cause to the effect. For example, jumping in the lake is the direct cause of getting wet, so the arrow is drawn from jumping in the lake to getting wet.

The second panel shows an association between X and Y (the dotted line) that is present due to a confounding variable, Z. *Conditioning on* a confounding variable is best practice to remove the false association between X and Y. Conditioning on a variable means looking at only one value of the conditioned variable. For example, suppose I have a dataset that contains information about things happening at the beach. I plot ice cream sales and shark attacks and see that there is a positive association such that as ice cream sales increase so do shark attacks. Should I conclude that ice cream attracts sharks? Thinking more deeply about the problem, I realize that shark attacks increase when the weather is warm because there are more people in the ocean. Ice cream sales also increase during warm weather; therefore, both variables have a common cause, weather. When I condition on weather and only consider ice cream sales and shark attacks in the summer months, the association disappears.
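To make the ice cream and shark attack example concrete, here is a minimal simulation sketch. All of the numbers are invented for illustration, and the helper `pearson` and the variable names are our own assumptions rather than part of any dataset. Warm weather raises both quantities, but neither one influences the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical daily records: (warm?, ice cream sales, shark attack index).
# Weather (Z) drives both X and Y; X and Y are otherwise unrelated.
days = []
for _ in range(5000):
    warm = rng.random() < 0.5
    ice_cream = rng.gauss(200 if warm else 50, 20)
    sharks = rng.gauss(6 if warm else 2, 0.5)
    days.append((warm, ice_cream, sharks))

# Association over all days (weather is ignored)
r_all = pearson([d[1] for d in days], [d[2] for d in days])

# Association after conditioning on weather (warm days only)
warm_days = [d for d in days if d[0]]
r_warm = pearson([d[1] for d in warm_days], [d[2] for d in warm_days])

print(f"all days:       r = {r_all:.2f}")
print(f"warm days only: r = {r_warm:.2f}")
```

Under these assumptions the correlation over all days is strongly positive, while the correlation within warm days alone is close to zero.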

In the last panel, we see that an association between X and Y is due to the collider variable Z. We see false associations between two variables X and Y when both are causes of a third variable Z and we are conditioning on Z[^**]. For example, looking only at hospitalized patient data (conditioning on being hospitalized), we see a negative association between diabetes and heart disease such that those who have diabetes are less likely to have heart disease. However, it is known that diabetes is a risk factor for heart disease, not a protective factor, so we should see the opposite effect. This reversal in association occurs because we are only looking at hospitalized patients and both heart disease and diabetes are causes of hospitalization. Diabetes increases likelihood of heart disease and likelihood of hospitalization. Heart disease increases likelihood of hospitalization as well. If you are hospitalized for diabetes, it is less likely you also have heart disease. Therefore, those with diabetes in this sample of hospitalized patients have lower incidence of heart disease than those with diabetes in the general population, reversing the association between diabetes and heart disease.

As colliding is a difficult concept to grasp, consider another example. Suppose your friend is complaining about a recent date. The person she went to dinner with was very good-looking but had no sense of humor. Your friend comments that it seems all good-looking people have a bad sense of humor. You know that in reality looks and humor are not related. Your friend is conditioning on a collider by considering only people that she dates. She likely only dates people who meet a certain threshold of looks and humor. Those who are very good-looking don't need to have as good a sense of humor to get a date, whereas those who are less good-looking must have a better sense of humor, creating a negative association between looks and humor that does not exist outside of her dating pool.
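The dating example can also be sketched as a small simulation. We assume, purely for illustration, that looks and humor are independent standard normal scores and that your friend dates anyone whose combined score exceeds a made-up threshold of 1.0. The helper `pearson` and all names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: (looks, humor), generated independently
people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Conditioning on the collider: keep only people who clear the dating threshold
dated = [(looks, humor) for looks, humor in people if looks + humor > 1.0]

r_everyone = pearson([p[0] for p in people], [p[1] for p in people])
r_dated = pearson([p[0] for p in dated], [p[1] for p in dated])

print(f"everyone:    r = {r_everyone:.2f}")
print(f"dating pool: r = {r_dated:.2f}")
```

In the full population the correlation is near zero, but within the dating pool it is clearly negative, even though nothing in the simulation links looks to humor.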

[^*]: An association is often referred to as a correlation. Correlations are discussed in more detail in Chapter 17.
[^**]: A more thorough discussion of colliders is beyond the scope of this book, but interested readers are referred to *The Book of Why* by Judea Pearl and Dana Mackenzie.

## Observational vs Experimental Studies

In most research questions or investigations, we are interested in finding an association that is causal (scenario 1 above). For example, "Is the COVID-19 vaccine effective?" is a causal question. The researcher is looking for an association between receiving the COVID-19 vaccine and catching COVID-19, but more specifically wants to show that the vaccine is the cause of reduction in probability of catching COVID-19. 

### Experimental Studies
There are 3 necessary and sufficient conditions [^***] to show that a variable X (for example vaccine) causes an outcome Y (such as not catching COVID-19). 
1. Temporal Precedence: We must show that X (the cause) happened before Y (the effect).
2. Non-spuriousness: We must show that the effect Y was not seen by chance.
3. No alternate cause: We must show that no other variable accounts for the relationship between X and Y.

The best way to show all three necessary and sufficient conditions is by conducting an **experiment**. Experiments involve controllable factors, which are measured and determined by the experimenter; uncontrollable factors, which are measured but not determined by the experimenter; and experimental variability or noise, which is unmeasured and uncontrolled. Controllable factors that the experimenter manipulates in his or her experiment are known as independent variables. In our vaccination example, the independent variable is receipt of vaccine. Uncontrollable factors that are hypothesized to depend on the independent variable are known as dependent variables. The dependent variable in the vaccination example is whether patients catch COVID-19. The experimenter cannot control whether participants catch the disease, but can measure whether or not they do, and it is hypothesized that catching the disease is dependent on vaccination status.

When conducting an experiment, it is important to have a comparison or **control group**. The control group is used to better understand the effect of the independent variable. For example, if all patients are given the vaccine, it would be impossible to measure whether the vaccine is effective, as we would not know what would have happened if patients had not received the vaccine. In order to measure the effect of the vaccine, the researcher must compare the incidence rate of COVID-19 in patients who did not receive the vaccine to that of patients who did receive the vaccine. This comparison group of patients who did not receive the vaccine is the control group for the experiment. The control group allows the researcher to view an effect or association.

When scientists say that the COVID-19 vaccine is 95% effective, this does not mean that only 5% of people who got the vaccine in their study caught COVID-19. That would not take into account the incidence rate of COVID-19 for those without a vaccine. Rather, 95% effective refers to having a 95% lower incidence rate *compared to the control group.* Say Chicago has an incidence rate of 25% and a population of almost 3,000,000. Without the vaccine, 750,000 residents would be expected to catch COVID-19. If everyone were vaccinated, the number expected to catch COVID-19 would drop to 37,500, only 1.25% of the population. This is a large reduction!

However, it is important that the researcher show that this effect is non-spurious and therefore important/significant. One way to do this is through **replication**, applying a treatment independently across two or more experimental subjects. In our example, the researcher may conduct the experiment on multiple patients or repeat the experiment for multiple groups of patients to show that the effect can be seen reliably.
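The efficacy arithmetic in the Chicago example can be checked with a few lines of Python. This is a minimal sketch using the rounded figures from the text. The function name `efficacy_from_rates` is our own, and it simply encodes "percent reduction in incidence relative to the control group."

```python
def efficacy_from_rates(control_rate, vaccinated_rate):
    """Relative reduction in incidence compared to the control group."""
    return 1 - vaccinated_rate / control_rate

population = 3_000_000   # rounded Chicago population from the example
control_rate = 0.25      # incidence rate without the vaccine
efficacy = 0.95          # "95% effective"

# Incidence rate among the vaccinated implied by 95% efficacy
vaccinated_rate = control_rate * (1 - efficacy)

cases_without_vaccine = population * control_rate
cases_with_vaccine = population * vaccinated_rate

print(f"expected cases without vaccine: {cases_without_vaccine:,.0f}")
print(f"expected cases with vaccine:    {cases_with_vaccine:,.0f}")
print(f"incidence among vaccinated:     {vaccinated_rate:.2%}")
print(f"efficacy recovered from rates:  {efficacy_from_rates(control_rate, vaccinated_rate):.0%}")
```

Note that the 95% figure is recovered only by comparing the two incidence rates. Neither rate on its own tells us how effective the vaccine is.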

Now that the researcher has seen an association with temporal precedence by design and shown that it is non-spurious, they must be able to show there is no alternate cause for the association in order to prove causality. This can be done through **randomization**, random assignment of treatment to experimental subjects. Consider a group of patients where all male patients are given the treatment and all female patients are in the control group. If an association is found, it would be unclear whether this association is due to the treatment or the fact that the groups were of differing sex. By randomizing experimental subjects to groups, researchers ensure there is no systematic difference between groups other than the treatment and therefore no alternate cause for the relationship between treatment and outcome. Another way of ensuring there is no alternate cause is by **blocking**, or grouping similar experimental units together and assigning different treatments within such groups. Blocking is a way of dealing with sources of variability that are not of primary interest to the experimenter. In our vaccine example, the researcher may block on sex by grouping males together and females together and assigning treatments and controls within the different groups. Best practices are to block the largest and most salient sources of variability and randomize what is difficult or impossible to block. In our example, blocking would account for variability introduced by sex, whereas randomization would account for sources of variability such as age or medical history, which are more difficult to block.
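As a rough sketch of how randomization and blocking might look in code, the example below assigns 40 hypothetical participants to treatment and control. The participant list, the function names `randomize` and `block_randomize`, and the half-and-half split are all assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical participants: (id, sex), 20 female and 20 male
participants = [(i, "F" if i % 2 == 0 else "M") for i in range(40)]

def randomize(units, rng):
    """Completely randomized design: shuffle, then split in half."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

def block_randomize(units, block_of, rng):
    """Randomized block design: randomize separately within each block."""
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    groups = {"treatment": [], "control": []}
    for members in blocks.values():
        assignment = randomize(members, rng)
        groups["treatment"] += assignment["treatment"]
        groups["control"] += assignment["control"]
    return groups

simple = randomize(participants, rng)
blocked = block_randomize(participants, lambda unit: unit[1], rng)

def count_female(group):
    return sum(1 for unit in group if unit[1] == "F")

print("females in treatment (randomized only):", count_female(simple["treatment"]))
print("females in treatment (blocked on sex): ", count_female(blocked["treatment"]))
```

With plain randomization the sex balance between groups is only approximate and varies from run to run. Blocking on sex guarantees that each group contains exactly half of the females and half of the males, while assignment within each block is still random.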

### Observational Studies
Randomized experiments are considered the "Gold Standard" for showing a causal relationship. However, it is not always ethical or feasible to conduct a randomized experiment. Consider the following research question: Does living in Northern Chicago increase life expectancy? It would be infeasible to conduct an experiment which randomly allocates people to live in different parts of the city. Therefore, we must turn to observational data to test this question. Where experiments involve one or more factors/variables controlled by the experimenter (dose of a drug, for example), in **observational studies** there is no effort or intention to manipulate or control the object of study. Researchers collect the data without interfering with the subjects. For example, researchers may conduct a survey gathering both health and neighborhood data, or they may have access to administrative data from a local hospital. In these cases, the researchers are merely *observing* variables and outcomes.

There are two types of observational studies: retrospective studies and prospective studies. In a **retrospective study**, data is collected after events have taken place. This may be through surveys, historical data, or administrative records. An example of a retrospective study would be using administrative data from a hospital to study incidence of disease. In contrast, a **prospective study** identifies subjects beforehand and collects data as events unfold. For example, one might use a prospective study to study how personality traits develop in children by following a predetermined set of children through elementary school and giving them personality assessments each year.

[^***]: Necessary conditions must be met for the outcome of interest to occur. In this case, all 3 conditions must be met for the association to be causal. Sufficient conditions are conditions that, when met, produce the outcome of interest. In this case, meeting all three conditions makes the association causal.

## Sampling

In both experimental and observational studies, the goal is to come to a conclusion about a certain **population**. That population may be cancer patients, residents of Hyde Park, or students in this data science course. A survey of every **unit**, or individual member, of a population is known as a **census**. Often, it is not possible to collect data on every subject in a population. For example, I may be able to survey all students in this course, but it would be difficult to survey every Hyde Park resident. This is due to logistical issues, the amount of time it would take, and the expense. For these reasons, researchers often study a **sample** of the population and use that sample to gain information about the entire population through statistical inference. A numerical characteristic computed from the sample is known as a **statistic**. Statistics are used to estimate values of **parameters**, which are numerical characteristics of the entire population.
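The distinction between a parameter and a statistic can be illustrated with a short sketch. Here we pretend we have access to an entire population (which is rarely true in practice) so that we can compare the two directly. The population of commute times is simulated, and all values and names are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: commute times (minutes) for 10,000 residents
population = [rng.gauss(30, 8) for _ in range(10_000)]

# Parameter: a numerical characteristic of the whole population.
# Computing it requires a census.
parameter = statistics.mean(population)

# Statistic: the same characteristic computed from a sample of 200 residents
sample = rng.sample(population, 200)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The sample mean will generally not equal the population mean exactly, but with a well-chosen sample it should land close to it.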

### Sampling Designs

In order to be able to generalize from the sample to the population as a whole, the sample must be **representative** of the population. Otherwise, inference on that sample can produce misleading conclusions. For example, if a researcher is interested in understanding how those living in South Chicago view the University of Chicago and the researcher collects a sample of 50 freshmen attending the University of Chicago, the sample of freshmen most likely is not representative of the feelings of everyone living in South Chicago. Similarly, if a cancer researcher wants to know how well a drug works on cancer patients, but uses a sample consisting only of men under 30, any conclusions she might draw from the experiment cannot be generalized to the entire population but rather only to the population of men under 30. There are two **sampling designs**, or processes by which a sample is collected, that suffer from this lack of generalizability yet sometimes cannot be avoided. The first is known as a **convenience sample**. A convenience sample is, as its name suggests, a sample that is collected out of ease of access for the researchers. Looking through research in psychology in particular, many researchers collect a convenience sample of students from introductory psychology courses. Though this is an easy way of gathering a sample, it is not the most generalizable, as introductory psychology students are likely not representative of the broader population the researchers seek to understand. A second example of a sampling design that is not generalizable is the **voluntary response sample**, where participants volunteer to be part of the study. Restaurant reviews provide a nice example of a voluntary response sample. Those with strong opinions of the restaurant (either positive or negative) are more likely to write reviews. Voluntary response samples oversample those who feel strongly about the topic being studied and undersample those who do not care as much. These samples are always **biased**, or not representative of the broader population.

There are several sampling designs that are meant to help collect a more representative sample. The first is a **simple random sample (SRS)**. In a simple random sample of size n, every group of n units in the population has an equal chance of being selected as the sample. This eliminates sampling bias by ensuring that portions of the population are not systematically over- or under-sampled. In addition, an SRS allows a researcher to mathematically or computationally quantify variation due to sampling (aka the precision of a statistic). The downside of an SRS is that it requires a **sampling frame**, or list of names or IDs of all units in a population. Acquiring such a sampling frame is often impractical for large populations.
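Here is a minimal sketch of drawing an SRS from a sampling frame and of quantifying sampling variation computationally. The frame of 1,000 unit IDs, the simulated measurements, and the function name `srs_mean` are all hypothetical. We use `random.sample` from Python's standard library, which draws without replacement.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: one ID per unit, each with a measured value
frame = list(range(1000))
value = {unit: rng.gauss(50, 10) for unit in frame}

def srs_mean(n):
    """Draw a simple random sample of n IDs and return the sample mean."""
    chosen = rng.sample(frame, n)  # every set of n IDs is equally likely
    return statistics.mean(value[unit] for unit in chosen)

# Repeat the sampling many times to see how much the statistic varies
means_n10 = [srs_mean(10) for _ in range(500)]
means_n100 = [srs_mean(100) for _ in range(500)]

spread_10 = statistics.stdev(means_n10)
spread_100 = statistics.stdev(means_n100)

print(f"spread of sample means, n = 10:  {spread_10:.2f}")
print(f"spread of sample means, n = 100: {spread_100:.2f}")
```

The spread of the sample means shrinks as the sample size grows, which is one way of seeing that larger simple random samples give more precise statistics.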

The next sampling design is a **stratified random sample**, which divides the population into sub-populations of similar units (called strata) and chooses a separate SRS for each stratum. This allows a researcher to gain more exact information than an SRS of the same size by ensuring that each stratum is represented in the sample. Many universities employ stratified random sampling when conducting surveys gauging student or faculty opinions. Those conducting the survey split the population of university students into strata by year (e.g., freshman, sophomore, junior, senior, super senior), then take a simple random sample of students from each stratum. This sampling design works well when cases within a stratum are similar but there are large differences between strata. However, it has the same downside as an SRS, as it too needs a sampling frame for each stratum.
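The university survey example might be sketched as follows. The student list, the four class years, the choice of 5 students per stratum, and the function name `stratified_sample` are assumptions for illustration only.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible

years = ["freshman", "sophomore", "junior", "senior"]

# Hypothetical sampling frame: (student_id, year), 100 students per year
students = [(i, years[i % 4]) for i in range(400)]

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Split units into strata, then take a separate SRS within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample += rng.sample(members, n_per_stratum)
    return sample

sample = stratified_sample(students, lambda s: s[1], 5, rng)

for year in years:
    print(year, sum(1 for s in sample if s[1] == year))
```

Every class year contributes exactly 5 students, something a single SRS of 20 students would not guarantee.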

The next sampling design is **cluster sampling**, which is commonly confused with stratified random sampling as both split the population into sub-populations. However, cluster sampling splits the population into clusters and takes a random sample *of* those clusters, rather than taking a random sample *within* subpopulations as in stratified random sampling. This sampling design works well when there is small variation between clusters but large variation within clusters. Cluster sampling is commonly used for geographical and market research. For example, the head of a major department store may be interested in how well a particular product is selling. Rather than analyzing all sales for all stores across the whole country, the market research team would cluster sales by store and take a random sample of stores.
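For contrast with stratified sampling, here is a minimal sketch of the department store example. The 20 stores, their simulated sales records, and the function name `cluster_sample` are hypothetical. The random draw is over stores, and every record from a chosen store is kept.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is reproducible

# Hypothetical clusters: 20 stores, each with 30 sale amounts in dollars
stores = {
    store_id: [round(rng.uniform(5, 100), 2) for _ in range(30)]
    for store_id in range(20)
}

def cluster_sample(clusters, n_clusters, rng):
    """Take an SRS of clusters and keep every unit within the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {cluster_id: clusters[cluster_id] for cluster_id in chosen}

sampled = cluster_sample(stores, 4, rng)

print("stores sampled:", sorted(sampled))
print("sales records collected:", sum(len(sales) for sales in sampled.values()))
```

Only a list of stores is needed to draw this sample, not a list of every individual sale in the country.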

The image below depicts each type of sampling (in maroon) from a population of 100 (in grey). Panel a shows the full population before sampling. Panel b shows a simple random sample of 10 units. Panel c shows a stratified random sample of 10 units with 5 strata. Panel d shows a cluster sample of 10 units with 20 clusters.

<p style="text-align:center;">
<img src="/Users/amandakube/textbook-datascience-1/textbook/images/SamplingSchemes.png" alt="Image of Sampling Designs" width="500"/>
</p>

The last type of sampling we will discuss, **multistage sampling**, conducts sampling in stages and is often used for large nationwide samples of households or individuals. The major advantage of multistage sampling is that it does not require a complete sampling frame. For example, consider a polling company interested in obtaining a generalizable sample of American households. First, they might stratify households by state to ensure sampling from each state. Within states, they might cluster households by county and take a simple random sample of those counties. Lastly, they take a simple random sample of n households from each of the sampled counties. This sampling strategy is depicted below. Multistage sampling mixes stratified and cluster sampling in stages and as a result, the researcher never required a list of all households in the US, but rather a sampling frame of US counties and then a sampling frame of households within a much smaller subset of US counties.

<p style="text-align:center;">
<img src="/Users/amandakube/textbook-datascience-1/textbook/images/MultistageSampling.png" alt="Example of Multistage Sampling" width="600"/>
</p>
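The polling example can be sketched in code as three nested stages. Everything here is a toy assumption: 5 states with 10 counties each, a stand-in function `list_households` that plays the role of building a household frame for a single county, and the function name `multistage_sample`.

```python
import random

rng = random.Random(9)  # fixed seed so the sketch is reproducible

# Hypothetical frame of counties, grouped by state
states = {
    f"state_{s}": [f"s{s}_county_{c}" for c in range(10)]
    for s in range(5)
}

def list_households(county):
    """Stand-in for compiling a household sampling frame for one county."""
    return [f"{county}_hh_{h}" for h in range(200)]

def multistage_sample(states, counties_per_state, households_per_county, rng):
    sample = []
    # Stage 1: stratify by state, so every state is included
    for counties in states.values():
        # Stage 2: counties are clusters; take an SRS of counties
        for county in rng.sample(counties, counties_per_state):
            # Stage 3: build a household frame only for sampled counties
            frame = list_households(county)
            sample += rng.sample(frame, households_per_county)
    return sample

households = multistage_sample(states, 2, 5, rng)
print("households sampled:", len(households))
```

In this sketch, `list_households` is called for only 10 of the 50 counties, which mirrors the main advantage of multistage sampling: a full list of households is never needed.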

### Biases

Recall the introduction to this chapter where we stated that bigger data is not always better. This is often due to the sampling method used to gather that "Big Data". We have already discussed the need for representative samples to ensure generalization of the sample to the population. The bias introduced by oversampling some portions of the population over others is known as **selection bias**. However, this is not the only bias that can be introduced during the data collection process. Imagine sampling participants and emailing each participant a survey to complete. Some participants might not complete that survey. **Non-response bias** occurs when the people who decline to respond are different in some meaningful way from those who do respond. Perhaps you wish to study parenting and all single parents were too busy to complete the survey. Your study would be missing an important perspective.
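A small simulation can show how non-response bias distorts a statistic. In this made-up scenario, we survey parents about their weekly free hours. Single parents have fewer free hours and are also much less likely to respond. The response probabilities and all other numbers are invented for illustration.

```python
import random
import statistics

rng = random.Random(21)  # fixed seed so the sketch is reproducible

everyone = []      # free hours for every sampled parent
respondents = []   # free hours for those who completed the survey

for _ in range(10_000):
    single = rng.random() < 0.3                 # 30% are single parents
    hours = rng.gauss(10 if single else 20, 3)  # single parents have less free time
    everyone.append(hours)
    # Hypothetical response rates: 10% for single parents, 60% otherwise
    if rng.random() < (0.1 if single else 0.6):
        respondents.append(hours)

true_mean = statistics.mean(everyone)
observed_mean = statistics.mean(respondents)

print(f"mean free hours, all sampled parents: {true_mean:.1f}")
print(f"mean free hours, respondents only:    {observed_mean:.1f}")
```

Because single parents are underrepresented among respondents, the survey overstates how much free time parents have, even though the original sample was drawn at random.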

Turning attention from those who did not respond to those who did, their responses can suffer from **response bias**. Response bias can appear in multiple forms. Sometimes, participants have an incentive to respond in ways that might not be truthful, especially if questions are sensitive or embarrassing. For example, in a survey of campus sexual health, students might be embarrassed to report STIs, and therefore trends in the data may be misleading. This can be influenced by the wording or tone of the questions as well as whether participants have been assured their data is private. Some response bias can be due more to boredom than truthfulness. For example, especially in long surveys, participants may care more about completing the task than completing the task well. Some participants may choose to select random answers, select the same answer for every question, or answer questions in a pattern. It is important for a researcher to consider the wording, tone, and length of a survey carefully and to check all surveys for possible response bias before analyzing data.
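As one possible way to screen surveys before analysis, the sketch below flags two of the careless patterns described above: giving the same answer to every question and repeating a short cycle of answers. The function names, the example responses, and the maximum cycle length of 4 are our own assumptions. These checks are simple heuristics, and truly random answering is much harder to detect.

```python
def is_straight_lined(answers):
    """True if every question received the same answer."""
    return len(answers) > 1 and len(set(answers)) == 1

def is_repeating_pattern(answers, max_period=4):
    """True if the answers repeat with a short period, e.g. 1, 2, 3, 1, 2, 3."""
    n = len(answers)
    for period in range(2, max_period + 1):
        # Require at least two full cycles before calling it a pattern
        if n >= 2 * period and all(
            answers[i] == answers[i % period] for i in range(n)
        ):
            return True
    return False

def screen(answers):
    if is_straight_lined(answers):
        return "straight-lined"
    if is_repeating_pattern(answers):
        return "patterned"
    return "ok"

# Hypothetical responses on a 1-5 scale, keyed by participant ID
responses = {
    "p1": [3, 3, 3, 3, 3, 3, 3, 3],
    "p2": [1, 2, 3, 1, 2, 3, 1, 2],
    "p3": [4, 2, 5, 1, 3, 4, 2, 2],
}

flagged = {pid: screen(answers) for pid, answers in responses.items()}
print(flagged)
```

A flag is a prompt for a closer look rather than proof of carelessness, since a participant may sincerely hold the same opinion on every item.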