# Introduction to Probability and Data

https://www.coursera.org/learn/probability-intro/home/week/1

## Week 1 - Introduction to Data

Suggested readings and practice problems from OpenIntro Statistics, 3rd edition (a free online introductory statistics textbook co-authored by Dr. Cetinkaya-Rundel) for this week:

Suggested reading: Chapter 1, Sections 1.1 - 1.5

Practice exercises: End of chapter exercises in Chapter 1: 1.1, 1.3, 1.11, 1.13, 1.17, 1.19, 1.25, 1.27, 1.31

(Reminder: the solutions to the end of chapter exercises are at the end of the OpenIntro Statistics book)

# Table of Contents

1. [using stents to prevent strokes](#using stents to prevent strokes)
2. [Guided Practice 1.1](#Guided Practice 1.1)
3. [1.3.1 Populations and samples](#1.3.1 Populations and samples)
4. [1.1 Migraine and acupuncture](#1.1 Migraine and acupuncture)
5. [1.3 Air pollution and birth outcomes, study components](#1.3 Air pollution and birth outcomes, study components)
6. [1.11 Buteyko method, scope of inference](#1.11 Buteyko method, scope of inference)
7. [1.13 Relaxing after work](#1.13 Relaxing after work)
8. [1.17 Course satisfaction across sections](#1.17 Course satisfaction across sections)
9. [1.19 Internet use and life expectancy](#1.19 Internet use and life expectancy)
10. [1.25 Flawed reasoning](#1.25 Flawed reasoning)



## Introduction to data

It is helpful to put statistics in the context of a general process of investigation:

1. Identify a question or problem.
2. Collect relevant data on the topic.
3. Analyze the data.
4. Form a conclusion.

Statistics as a subject focuses on making stages 2-4 objective, rigorous, and efficient.
That is, statistics has three primary components: How best can we collect data? How
should it be analyzed? And what can we infer from the analysis?

## 1.1 Case study: using stents to prevent strokes  <a class="anchor" id="using stents to prevent strokes"></a>

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

> Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335. NY Times article reporting on the study: www.nytimes.com/2011/09/08/health/research/08stent.html.

In this section we will consider an experiment that studies the effectiveness of stents in treating patients at risk of stroke. Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke.

We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question collected data on 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

**Treatment group.** Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

**Control group.** Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can
measure the medical impact of stents in the treatment group.

Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Table 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period.

| Patient | Group     | 0-30 days | 0-365 days |
| ------- | --------- | --------- | ---------- |
| 1       | treatment | no event  | no event   |
| 2       | treatment | stroke    | stroke     |
| 3       | treatment | no event  | no event   |
| ...     | ...       | ...       | ...        |
| 450     | control   | no event  | no event   |
| 451     | control   | no event  | no event   |

Table 1.1: Results for five patients from the stent study.

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. 

Table 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study.

For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left-side of the table at the intersection of the treatment and stroke: 33.

| Group     | 0-30 days: stroke | 0-30 days: no event | 0-365 days: stroke | 0-365 days: no event |
| --------- | ----------------- | ------------------- | ------------------ | -------------------- |
| Treatment | 33                | 191                 | 45                 | 179                  |
| Control   | 13                | 214                 | 28                 | 199                  |
| Total     | 46                | 405                 | 73                 | 378                  |

Table 1.2: Descriptive statistics for the stent study.

In [9]:
# Counts from Table 1.2, in column order:
# 0-30 days stroke, 0-30 days no event, 0-365 days stroke, 0-365 days no event
Treatment <- c(33, 191, 45, 179)
Control <- c(13, 214, 28, 199)

In [10]:
# Construct a 2 x 4 matrix of counts, one row per group
stent_study <- matrix(c(Treatment, Control), nrow = 2, byrow = TRUE)

In [21]:
# Vectors of group and outcome labels, used for naming
samples <- c("Treatment", "Control")
outcomes <- c("0-30 days - stroke", "0-30 days -no event","0-365 days -stroke", "0-365 days -no event")

In [19]:
# Name the columns with the outcome labels
colnames(stent_study) <- outcomes

In [20]:
# Name the rows with the group labels
rownames(stent_study) <- samples

In [29]:
# Bind a row of column totals to the bottom of the matrix
Totals <- colSums(stent_study)

stent_study_totals <- rbind(stent_study, Totals)

stent_study_totals

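|           | 0-30 days - stroke | 0-30 days -no event | 0-365 days -stroke | 0-365 days -no event |
| --------- | ------------------ | ------------------- | ------------------ | -------------------- |
| Treatment | 33                 | 191                 | 45                 | 179                  |
| Control   | 13                 | 214                 | 28                 | 199                  |
| Totals    | 46                 | 405                 | 73                 | 378                  |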


## Guided Practice 1.1 <a class="anchor" id="Guided Practice 1.1"></a>

Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.

## Patients in the treatment group who had a stroke by the end of their first year.

In [36]:
treatment_patients_in_first_year_stroke <- stent_study_totals[1,3]
treatment_patients_in_first_year_stroke

In [37]:
# patients in the treatment group who did not have a stroke by the end of their first year.
treatment_patients_in_first_year_no_event <- stent_study_totals[1,4]
treatment_patients_in_first_year_no_event

In [40]:
# treatment_patients_in_first_year
treatment_patients_in_first_year <- treatment_patients_in_first_year_no_event + treatment_patients_in_first_year_stroke
treatment_patients_in_first_year

In [51]:
# Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
Proportion_treatment_who_had_a_stroke <- treatment_patients_in_first_year_stroke / treatment_patients_in_first_year
Proportion_treatment_who_had_a_stroke

In [52]:
# Create number and format as percent rounded to one decimal place
Percentage_treatment_who_had_a_stroke <- paste(round((Proportion_treatment_who_had_a_stroke)*100,digits=1),"%",sep="")
Percentage_treatment_who_had_a_stroke

We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.

## Patients in the control group who had a stroke by the end of their first year.

In [56]:
# Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.

control_patients_in_first_year_stroke <- stent_study_totals[2,3]
control_patients_in_first_year_no_event <- stent_study_totals[2,4]
control_patients_in_first_year <- control_patients_in_first_year_no_event + control_patients_in_first_year_stroke
Proportion_control_who_had_a_stroke <- control_patients_in_first_year_stroke / control_patients_in_first_year
Percentage_control_who_had_a_stroke <- paste(round((Proportion_control_who_had_a_stroke)*100,digits=1),"%",sep="")
Percentage_control_who_had_a_stroke


These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is
important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the
data show a “real” difference between the groups?
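
The comparison between the two groups can be reproduced directly from the one-year counts in Table 1.2. The following is a minimal, self-contained sketch; the variable names are illustrative and are not part of the original notebook.

```r
# One-year results from Table 1.2: stroke counts and group sizes
stroke_counts <- c(Treatment = 45, Control = 28)
group_sizes <- c(Treatment = 224, Control = 227)

# Proportion of each group that had a stroke within 365 days
stroke_proportions <- stroke_counts / group_sizes

# Difference between the groups, expressed in percentage points
difference_pct_points <- unname(
  (stroke_proportions["Treatment"] - stroke_proportions["Control"]) * 100
)

round(stroke_proportions * 100, 1)
round(difference_pct_points, 1)
```

The difference comes out at roughly 7.8 percentage points, which the text rounds to 8%.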

## 1.3.1 Populations and samples <a class="anchor" id="1.3.1 Populations and samples"></a>

Consider the following three research questions:

Guided Practice 1.7: For the questions below, identify the target population and what represents an individual case.

### 1. What is the average mercury content in swordfish in the Atlantic Ocean?

**target population** = all swordfish in the Atlantic Ocean

**individual case** = a swordfish in the Atlantic Ocean

### 2. Over the last 5 years, what is the average time to complete a degree for Duke undergraduate students?

**target population** = all Duke undergraduate students who completed a degree in the last 5 years

**individual case** = a Duke undergraduate student who graduated in the last 5 years

### 3. Does a new drug reduce the number of deaths in patients with severe heart disease?

**target population** = patients with severe heart disease

**individual case** = a patient with severe heart disease



## Exercises

### 1.1 Migraine and acupuncture. <a class="anchor" id="1.1 Migraine and acupuncture"></a>

A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 females diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. 

43 patients in the treatment group received acupuncture that is specifically designed to treat migraines. 46 patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). 24 hours after patients received acupuncture, they were asked if they were pain free. Results are summarized in the contingency table below.

In [10]:
Treatment <- c(10, 33)
Control <- c(2, 44)
migraine_study <- matrix(c(Treatment, Control), nrow = 2, byrow = TRUE)
# Vectors of group and outcome labels, used for naming
samples <- c("Treatment", "Control")
outcomes <- c("Pain-free", "Not Pain-free")
# Name the columns with the outcome labels
colnames(migraine_study) <- outcomes
# Name the rows with the group labels
rownames(migraine_study) <- samples
# Column totals (per outcome) and row totals (per group)
Pain_Totals <- colSums(migraine_study)
Group_Totals <- rowSums(migraine_study)
# Append the group totals as a column and the outcome totals as a row
migraine_study_totals1 <- cbind(migraine_study, Group_Totals)
migraine_study_totals2 <- rbind(migraine_study, Pain_Totals)

#migraine_study_totals <- rbind(migraine_study_totals1, migraine_study_totals2)
migraine_study_totals1

migraine_study_totals2

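|           | Pain-free | Not Pain-free | Group_Totals |
| --------- | --------- | ------------- | ------------ |
| Treatment | 10        | 33            | 43           |
| Control   | 2         | 44            | 46           |

|             | Pain-free | Not Pain-free |
| ----------- | --------- | ------------- |
| Treatment   | 10        | 33            |
| Control     | 2         | 44            |
| Pain_Totals | 12        | 77            |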


In [12]:
combined_migraine_study_totals <- addmargins(migraine_study, FUN = list(Total = sum), quiet = TRUE)
combined_migraine_study_totals

|           | Pain-free | Not Pain-free | Total |
| --------- | --------- | ------------- | ----- |
| Treatment | 10        | 33            | 43    |
| Control   | 2         | 44            | 46    |
| Total     | 12        | 77            | 89    |


(a) (1) What percent of patients in the treatment group were pain free after receiving acupuncture?

In [18]:
patients_in_the_treatment_group_pain_free_after_acupuncture <- combined_migraine_study_totals[1,1]
patients_in_the_treatment_group_total <- combined_migraine_study_totals[1,3]
Proportion_treatment_pain_free <- patients_in_the_treatment_group_pain_free_after_acupuncture / patients_in_the_treatment_group_total
Percentage_treatment_pain_free <- paste(round((Proportion_treatment_pain_free)*100,digits=1),"%",sep="")
Percentage_treatment_pain_free

(a) (2) What percent of patients in the control group were pain free after receiving acupuncture?

In [19]:
patients_in_the_control_group_pain_free_after_acupuncture <- combined_migraine_study_totals[2,1]
patients_in_the_control_group_total <- combined_migraine_study_totals[2,3]
Proportion_control_pain_free <- patients_in_the_control_group_pain_free_after_acupuncture / patients_in_the_control_group_total
Percentage_control_pain_free <- paste(round((Proportion_control_pain_free)*100,digits=1),"%",sep="")
Percentage_control_pain_free
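
As an alternative to indexing the cells one at a time, both percentages can be read off a table of row proportions. This is a minimal, self-contained sketch using base R's `prop.table()`; the object names `migraine_counts` and `row_percentages` are illustrative and do not appear in the original notebook.

```r
# Counts from the migraine contingency table
migraine_counts <- matrix(
  c(10, 33,
     2, 44),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Treatment", "Control"), c("Pain-free", "Not Pain-free"))
)

# margin = 1 divides each count by its row (group) total
row_percentages <- round(prop.table(migraine_counts, margin = 1) * 100, 1)
row_percentages
```

The `Pain-free` column gives 23.3% for the treatment group and 4.3% for the control group.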

(b) At first glance does acupuncture appear to be an effective treatment for migraine ? Explain your reasoning.

At first glance, acupuncture does appear to be an effective treatment for migraine: 23.3% of the treatment group were pain free, compared with 4.3% of the control group, which received placebo acupuncture. The effect is limited, though, as the majority of the treatment group (76.7%) still had pain.

(c) Does the data provide convincing evidence that there is a real pain reduction for those patients in the treatment group ? Or do you think the observed difference might be due to chance ?

The data provide some evidence that there is a real pain reduction for patients in the treatment group, but on their own they are not fully convincing.

The observed difference seems unlikely to be due solely to chance, as the percentage of pain-free patients in the treatment group (almost 1 in 4) is much larger than in the control group (fewer than 1 in 20). Deciding whether the difference is statistically significant requires a formal test.
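
One way to probe whether a difference of this size could plausibly arise by chance is a randomization (shuffling) simulation. The sketch below is an illustration and not part of the exercise. It assumes that, if acupuncture had no effect, the 12 pain-free outcomes would be spread across the two groups at random. The seed, the number of simulations, and all object names are arbitrary choices made for this example.

```r
set.seed(1)  # arbitrary seed, used only to make the run reproducible

# Rebuild the 89 individual records from the contingency table
group <- rep(c("treatment", "control"), times = c(43, 46))
pain_free <- c(rep(c(TRUE, FALSE), times = c(10, 33)),
               rep(c(TRUE, FALSE), times = c(2, 44)))

# Difference in pain-free proportions: treatment minus control
diff_in_props <- function(outcome, grp) {
  mean(outcome[grp == "treatment"]) - mean(outcome[grp == "control"])
}

observed_diff <- diff_in_props(pain_free, group)

# Shuffle the group labels many times and recompute the difference each time
n_sims <- 10000
simulated_diffs <- replicate(n_sims, diff_in_props(pain_free, sample(group)))

# Share of shuffles with a difference at least as large as the observed one
p_value_estimate <- mean(simulated_diffs >= observed_diff - 1e-12)

observed_diff
p_value_estimate
```

A small value of `p_value_estimate` means that a difference as large as the observed one rarely appears when the labels are shuffled, which would count against chance as the sole explanation.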

## 1.3 Air pollution and birth outcomes, study components.  <a class="anchor" id="1.3 Air pollution and birth outcomes, study components"></a>

Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM10) in μg/m³. Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.

In this study, identify

(a) the cases,

143,196 eligible study subjects born in Southern California between 1989 and 1993.

(b) the variables and their types, and

Air pollution levels:

Carbon monoxide, recorded in parts per million: continuous numerical variable.

Nitrogen dioxide and ozone, recorded in parts per hundred million: continuous numerical variables.

Coarse particulate matter (PM10), recorded in μg/m³: continuous numerical variable.

Birth outcome:

Length of gestation: continuous numerical variable.

(c) the main research question.

“Is there an association between air pollution exposure and preterm births?”

## 1.11 Buteyko method, scope of inference. <a class="anchor" id="1.11 Buteyko method, scope of inference"></a>

Exercise 1.4 introduces a study on using the Buteyko shallow breathing technique to reduce asthma symptoms and improve quality of life. As part of this study 600 asthma patients aged 18-69 who relied on medication for asthma treatment were recruited and randomly assigned to two groups: one practiced the Buteyko method and the other did not. Those in the Buteyko group experienced, on average, a significant reduction in asthma symptoms and an improvement in quality of life. 

(a) Identify the population of interest and the sample in this study.

population of interest: all asthma patients aged 18-69 who rely on medication for asthma treatment

sample in this study: 600 such patients

(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

"If the patients in this sample, who are likely not randomly sampled, can be considered to be representative of all asthma patients aged 18-69 who rely on medication for asthma treatment, then the results are generalizable to the population defined above."

In my view this sentence is ambiguous or tautological. There are two possibilities:

Either the sample was not randomly drawn from the general asthma patient population, and the study cannot be considered representative in isolation.

Or the sample was randomly drawn from the general asthma patient population, and the study can be considered representative in isolation.

"Additionally, since the study is experimental, the findings can be used to establish causal relationships."

The study was an experiment which randomly divided patients into control and treatment groups; the results demonstrated a causal connection between the Buteyko method and reduction in asthma symptoms within the sample.

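Can a causal connection only be inferred through the law of large numbers? No: the causal conclusion rests on the random assignment of patients to the two groups. The law of large numbers explains only why a larger sample estimates the size of the effect more precisely.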

https://en.wikipedia.org/wiki/Law_of_large_numbers
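
To illustrate why random assignment supports causal conclusions, the sketch below randomly splits 600 patients into two groups and checks that a baseline characteristic ends up balanced between them. Everything here is an assumption made for illustration: the exercise does not state the group sizes (an equal 300/300 split is assumed), and `baseline_severity` is a simulated, hypothetical covariate rather than data from the study.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

n_patients <- 600

# Hypothetical baseline covariate, simulated purely for illustration
baseline_severity <- rnorm(n_patients, mean = 50, sd = 10)

# Random assignment; the equal split between groups is an assumption
assignment <- sample(rep(c("Buteyko", "control"), each = n_patients / 2))

# With random assignment the group averages should be close to each other
group_means <- tapply(baseline_severity, assignment, mean)

table(assignment)
group_means
```

Because chance alone decides who lands in which group, characteristics such as baseline severity tend to balance out, so a later difference in outcomes can more reasonably be attributed to the treatment.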

## 1.13 Relaxing after work. <a class="anchor" id="1.13 Relaxing after work"></a>

The 2010 General Social Survey asked the question, “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?” to a random sample of 1,155 Americans. The average relaxing time was found to be 1.65 hours. Determine which of the following is an observation, a variable, a sample statistic, or a
population parameter.

(a) An American in the sample. - Observation

(b) Number of hours spent relaxing after an average work day. - Variable

(c) 1.65. - Sample statistic (mean)

(d) Average number of hours all Americans spend relaxing after an average work day. - Population parameter (mean)

## 1.17 Course satisfaction across sections. <a class="anchor" id="1.17 Course satisfaction across sections"></a>

A large college class has 160 students. All 160 students attend the lectures together, but the students are divided into 4 groups, each of 40 students, for lab sections administered by different teaching assistants. The professor wants to conduct a survey about how satisfied the students are with the course, and he believes that the lab section a student is in might affect the student’s overall satisfaction with the course.

(a) What type of study is this?

A prospective observational study.

(b) Suggest a sampling strategy for carrying out this study.

Use stratified sampling to randomly sample a fixed number of students, say 10, from each section for a total sample size of 40 students.
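
The stratified sampling strategy described here can be sketched in base R. This is a minimal illustration under stated assumptions: the `students` data frame, its `student_id` and `section` columns, and the section labels are all hypothetical stand-ins for the real class roster.

```r
set.seed(42)  # arbitrary seed, used only to make the run reproducible

# Hypothetical roster: 160 students in 4 lab sections of 40 each
students <- data.frame(
  student_id = 1:160,
  section = rep(paste0("Lab ", 1:4), each = 40)
)

# Treat each lab section as a stratum and draw 10 row indices from each
rows_by_section <- split(seq_len(nrow(students)), students$section)
sampled_rows <- unlist(
  lapply(rows_by_section, function(rows) sample(rows, size = 10)),
  use.names = FALSE
)

survey_sample <- students[sampled_rows, ]

# Each section should contribute exactly 10 students
table(survey_sample$section)
```

Sampling within each section guarantees that every teaching assistant's lab is represented equally, which a simple random sample of 40 students would not ensure.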

## 1.19 Internet use and life expectancy. <a class="anchor" id="1.19 Internet use and life expectancy"></a>

The following scatterplot was created as part of a study evaluating the relationship between estimated life expectancy at birth (as of 2014) and percentage of internet users (as of 2009) in 208 countries for which such data were available. 

(a) Describe the relationship between life expectancy and percentage of internet users.

A low life expectancy correlates with a low percentage of internet users.

(b) What type of study is this?

Retrospective Observational.

(c) State a possible confounding variable that might explain this relationship and describe its potential effect.

Data not available for internet use


Solutions from the book, for comparison:

(a) Positive, non-linear, somewhat strong. Countries in which a higher percentage of the population has access to the internet also tend to have higher average life expectancies; however, the rise in life expectancy trails off at around 80 years.

(b) Observational.

(c) Wealth: countries whose residents can widely afford the internet can probably also afford basic medical care. (Note: Answers may vary.)

## 1.25 Flawed reasoning. <a class="anchor" id="1.25 Flawed reasoning"></a>


Identify the flaw(s) in reasoning in the following scenarios. Explain what the individuals in the study should have done differently if they wanted to make such strong conclusions.

(a) Students at an elementary school are given a questionnaire that they are asked to return after their parents have completed it. One of the questions asked is, “Do you find that your work schedule makes it difficult for you to spend time with your kids after school?” Of the parents who replied, 85% said “no”. Based on these results, the school officials conclude that a great majority of the parents have no difficulty spending time with their kids after school.

(b) A survey is conducted on a simple random sample of 1,000 women who recently gave birth, asking them about whether or not they smoked during pregnancy. A follow-up survey asking if the children have respiratory problems is conducted 3 years later, however, only 567 of these women are reached at the same address. The researcher reports that these 567 women are representative of all mothers.

(c) An orthopedist administers a questionnaire to 30 of his patients who do not have any joint problems and finds that 20 of them regularly go running. He concludes that running decreases the risk of joint problems.