# Introduction to Probability and Data

https://www.coursera.org/learn/probability-intro/home/week/1

## Week 1 - Introduction to Data

Suggested readings and practice problems from OpenIntro Statistics, 3rd edition (a free online introductory statistics textbook co-authored by Dr. Cetinkaya-Rundel) for this week:

Suggested reading: Chapter 1, Sections 1.1 - 1.5

Practice exercises: End of chapter exercises in Chapter 1: 1.1, 1.3, 1.11, 1.13, 1.17, 1.19, 1.25, 1.27, 1.31

(Reminder: the solutions to the end of chapter exercises are at the end of the OpenIntro Statistics book)

# Table of Contents

1. [1.1 Case study: using stents to prevent strokes](#using stents to prevent strokes)
2. [Guided Practice 1.1](#Guided Practice 1.1)
3. [1.3.1 Populations and samples](#1.3.1 Populations and samples)
4. [1.1 Migraine and acupuncture](#1.1 Migraine and acupuncture)



## Introduction to data

It is helpful to put statistics in the context of a general process of investigation:

1. Identify a question or problem.
2. Collect relevant data on the topic.
3. Analyze the data.
4. Form a conclusion.

Statistics as a subject focuses on making stages 2-4 objective, rigorous, and efficient.
That is, statistics has three primary components: How best can we collect data? How
should it be analyzed? And what can we infer from the analysis?

## 1.1 Case study: using stents to prevent strokes  <a class="anchor" id="using stents to prevent strokes"></a>

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

> Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335. NY Times article reporting on the study: www.nytimes.com/2011/09/08/health/research/08stent.html.

In this section we will consider an experiment that studies the effectiveness of stents in treating patients at risk of stroke. Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke.

We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question collected data on 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

**Treatment group.** Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

**Control group.** Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can
measure the medical impact of stents in the treatment group.
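
The random assignment described above can be sketched in R. This is only an illustration: the patient IDs 1 to 451 and the seed are assumptions made for the example, and the study's actual randomization procedure is not described here.

```r
# Illustrative sketch only: patient IDs and the seed are assumptions.
set.seed(1)                      # arbitrary seed for reproducibility
patient_id <- 1:451

# Shuffle a vector holding 224 "treatment" and 227 "control" labels
assigned_group <- sample(rep(c("treatment", "control"), times = c(224, 227)))

assignment <- data.frame(patient_id, assigned_group)
table(assignment$assigned_group)
```

Shuffling a fixed set of labels guarantees the group sizes stated in the text (224 and 227) while leaving it to chance which patient lands in which group.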

Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Table 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period.

| Patient        | Group           | 0-30 days  | 0-365 days |
| -------------  |-------------    | -----      | -----      |
| 1              | treatment       | no event   | no event   |
| 2              | treatment       | stroke     | stroke     |
| 3              | treatment       | no event   | no event   |
| ...            | ...             | ...        | ...        |
| 450            | control         | no event   | no event   |
| 451            | control         | no event   | no event   |

Table 1.1: Results for five patients from the stent study.

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. 

Table 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study.

For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left side of the table at the intersection of the treatment row and the stroke column: 33.

| Group     | 0-30 days: stroke | 0-30 days: no event | 0-365 days: stroke | 0-365 days: no event |
| --------- | ----------------- | ------------------- | ------------------ | -------------------- |
| Treatment | 33                | 191                 | 45                 | 179                  |
| Control   | 13                | 214                 | 28                 | 199                  |
| Total     | 46                | 405                 | 73                 | 378                  |

Table 1.2: Descriptive statistics for the stent study.
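
To make the step from patient-level records to a summary table concrete, here is a minimal sketch for the 0-30 day outcomes. It rebuilds one record per patient from the counts in Table 1.2 (the original patient-level file is not available here, so the order of the records is an assumption) and then cross-tabulates them with `table()`.

```r
# Rebuild one record per patient from the 0-30 day counts in Table 1.2.
# The ordering of patients is an assumption; only the counts come from the text.
patient_group <- rep(c("Treatment", "Control"), times = c(224, 227))
outcome_30 <- c(rep(c("stroke", "no event"), times = c(33, 191)),   # treatment
                rep(c("stroke", "no event"), times = c(13, 214)))   # control

# Use factors so rows and columns appear in the same order as Table 1.2
patient_group <- factor(patient_group, levels = c("Treatment", "Control"))
outcome_30 <- factor(outcome_30, levels = c("stroke", "no event"))

# Cross-tabulate group against outcome, then append a row of column totals
counts_30 <- table(patient_group, outcome_30)
addmargins(counts_30, margin = 1)
```

The same approach would apply to the 0-365 day outcomes by swapping in those counts.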

In [9]:
# Counts from Table 1.2: stroke and no event at 0-30 days, then at 0-365 days
Treatment <- c(33, 191, 45, 179)
Control <- c(13, 214, 28, 199)

In [10]:
# Build a 2 x 4 matrix with one row per group
stent_study <- matrix(c(Treatment, Control), nrow = 2, byrow = TRUE)

In [21]:
# Vectors of group names (rows) and outcome names (columns), used for naming
samples <- c("Treatment", "Control")
outcomes <- c("0-30 days - stroke", "0-30 days -no event","0-365 days -stroke", "0-365 days -no event")

In [19]:
# Name the columns with outcomes
colnames(stent_study) <- outcomes

In [20]:
# Name the rows with titles
rownames(stent_study) <- samples

In [29]:
# Compute the column totals and bind them to the matrix as a new row:
Totals <- colSums(stent_study)

stent_study_totals <- rbind(stent_study, Totals)

stent_study_totals

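|           | 0-30 days - stroke | 0-30 days -no event | 0-365 days -stroke | 0-365 days -no event |
| --------- | ------------------ | ------------------- | ------------------ | -------------------- |
| Treatment | 33                 | 191                 | 45                 | 179                  |
| Control   | 13                 | 214                 | 28                 | 199                  |
| Totals    | 46                 | 405                 | 73                 | 378                  |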


## Guided Practice 1.1 <a class="anchor" id="Guided Practice 1.1"></a>

Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.

## Patients in the treatment group who had a stroke by the end of their first year.

In [36]:
treatment_patients_in_first_year_stroke <- stent_study_totals[1,3]
treatment_patients_in_first_year_stroke

In [37]:
# patients in the treatment group who did not have a stroke by the end of their first year.
treatment_patients_in_first_year_no_event <- stent_study_totals[1,4]
treatment_patients_in_first_year_no_event

In [40]:
# treatment_patients_in_first_year
treatment_patients_in_first_year <- treatment_patients_in_first_year_no_event + treatment_patients_in_first_year_stroke
treatment_patients_in_first_year

In [51]:
# Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
Proportion_treatment_who_had_a_stroke <- treatment_patients_in_first_year_stroke / treatment_patients_in_first_year
Proportion_treatment_who_had_a_stroke

In [52]:
# Create number and format as percent rounded to one decimal place
Percentage_treatment_who_had_a_stroke <- paste(round((Proportion_treatment_who_had_a_stroke)*100,digits=1),"%",sep="")
Percentage_treatment_who_had_a_stroke

We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.

## Patients in the control group who had a stroke by the end of their first year.

In [56]:
# Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.

control_patients_in_first_year_stroke <- stent_study_totals[2,3]
control_patients_in_first_year_no_event <- stent_study_totals[2,4]
control_patients_in_first_year <- control_patients_in_first_year_no_event + control_patients_in_first_year_stroke
Proportion_control_who_had_a_stroke <- control_patients_in_first_year_stroke / control_patients_in_first_year
Percentage_control_who_had_a_stroke <- paste(round((Proportion_control_who_had_a_stroke)*100,digits=1),"%",sep="")
Percentage_control_who_had_a_stroke


These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is
important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the
data show a “real” difference between the groups?
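
The "additional 8%" can be checked directly from the one-year counts in Table 1.2. The sketch below is self-contained; the variable names are chosen for this example only.

```r
# One-year (0-365 days) counts taken from Table 1.2
stroke_365 <- c(Treatment = 45, Control = 28)
group_size <- c(Treatment = 224, Control = 227)

# Proportion of each group that had a stroke within one year
prop_stroke <- stroke_365 / group_size
round(prop_stroke, 3)

# Difference in proportions, treatment minus control
diff_stroke <- unname(prop_stroke["Treatment"] - prop_stroke["Control"])
round(diff_stroke, 3)
```

The difference is 45/224 - 28/227, which is roughly 0.078, or about 8 percentage points.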

## 1.3.1 Populations and samples <a class="anchor" id="1.3.1 Populations and samples"></a>

Consider the following three research questions:

Guided Practice 1.7: For the second and third questions below, identify the target population and what represents an individual case. (The first question is answered in the same way for reference.)

### 1. What is the average mercury content in swordfish in the Atlantic Ocean?

**target population** = all swordfish in the Atlantic Ocean

**individual case** = a swordfish in the Atlantic Ocean

### 2. Over the last 5 years, what is the average time to complete a degree for Duke undergraduate students?

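**target population** = all Duke undergraduate students who graduated in the last 5 years. Only students who completed their degree count, because time to completion is undefined for a student who never finished.

**individual case** = a Duke undergraduate student who graduated in the last 5 years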

### 3. Does a new drug reduce the number of deaths in patients with severe heart disease?

**target population** = patients with severe heart disease

**individual case** = a patient with severe heart disease



## Exercises

### 1.1 Migraine and acupuncture. <a class="anchor" id="1.1 Migraine and acupuncture"></a>

A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 females diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. 

43 patients in the treatment group received acupuncture that is specifically designed to treat migraines. 46 patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). 24 hours after patients received acupuncture, they were asked if they were pain free. Results are summarized in the contingency table below.

In [10]:
Treatment <- c(10, 33)
Control <- c(2, 44)
migraine_study <- matrix(c(Treatment, Control), nrow = 2, byrow = TRUE)
# Vectors used for naming
samples <- c("Treatment", "Control")
outcomes <- c("Pain-free", "Not Pain-free")
# Name the columns with outcomes
colnames(migraine_study) <- outcomes
# Name the rows with titles
rownames(migraine_study) <- samples
# Column totals (one per outcome) and row totals (one per group):
Pain_Totals <- colSums(migraine_study)
Group_Totals <- rowSums(migraine_study)
migraine_study_totals1 <- cbind(migraine_study,Group_Totals)
migraine_study_totals2 <- rbind(migraine_study, Pain_Totals)

#migraine_study_totals <- rbind(migraine_study_totals1, migraine_study_totals2)
migraine_study_totals1

migraine_study_totals2

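|           | Pain-free | Not Pain-free | Group_Totals |
| --------- | --------- | ------------- | ------------ |
| Treatment | 10        | 33            | 43           |
| Control   | 2         | 44            | 46           |

|             | Pain-free | Not Pain-free |
| ----------- | --------- | ------------- |
| Treatment   | 10        | 33            |
| Control     | 2         | 44            |
| Pain_Totals | 12        | 77            |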


In [12]:
combined_migraine_study_totals <- addmargins(migraine_study, FUN = list(Total = sum), quiet = TRUE)
combined_migraine_study_totals

|           | Pain-free | Not Pain-free | Total |
| --------- | --------- | ------------- | ----- |
| Treatment | 10        | 33            | 43    |
| Control   | 2         | 44            | 46    |
| Total     | 12        | 77            | 89    |


(a) (1) What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?

In [18]:
patients_in_the_treatment_group_pain_free_after_acupuncture <- combined_migraine_study_totals[1,1]
patients_in_the_treatment_group_total <- combined_migraine_study_totals[1,3]
Proportion_treatment_pain_free <- patients_in_the_treatment_group_pain_free_after_acupuncture / patients_in_the_treatment_group_total
Percentage_treatment_pain_free <- paste(round((Proportion_treatment_pain_free)*100,digits=1),"%",sep="")
Percentage_treatment_pain_free

(a) (2) What percent of patients in the control group were pain free 24 hours after receiving placebo acupuncture?

In [19]:
patients_in_the_control_group_pain_free_after_acupuncture <- combined_migraine_study_totals[2,1]
patients_in_the_control_group_total <- combined_migraine_study_totals[2,3]
Proportion_control_pain_free <- patients_in_the_control_group_pain_free_after_acupuncture / patients_in_the_control_group_total
Percentage_control_pain_free <- paste(round((Proportion_control_pain_free)*100,digits=1),"%",sep="")
Percentage_control_pain_free
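
As a cross-check on the two calculations above, both percentages can be obtained in one step with `prop.table()`. This sketch rebuilds the contingency table under a new name (`migraine_counts`, chosen for this example) so that it runs on its own.

```r
# Same counts as the contingency table above
migraine_counts <- matrix(c(10, 33,
                            2, 44),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("Treatment", "Control"),
                                          c("Pain-free", "Not Pain-free")))

# margin = 1 divides each cell by its row total, giving within-group proportions
row_props <- prop.table(migraine_counts, margin = 1)
round(100 * row_props, 1)
```

The pain-free column gives 23.3% for the treatment group (10/43) and 4.3% for the control group (2/46).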

(b) At first glance, does acupuncture appear to be an effective treatment for migraine? Explain your reasoning.

At first glance, yes: 23.3% of the treatment group were pain free, compared with 4.3% of the control group, so acupuncture appears to be more effective at removing pain than placebo acupuncture. Note, however, that the majority of the treatment group (76.7%) still had pain, so the benefit is far from universal.

(c) Do the data provide convincing evidence that there is a real pain reduction for those patients in the treatment group? Or do you think the observed difference might be due to chance?

The data provide some evidence of a real pain reduction in the treatment group, but on their own they are not convincing.

The pain-free percentage in the treatment group (23.3%, almost 1 in 4) is much larger than in the control group (4.3%), which suggests the difference is not due solely to chance. However, with only 89 patients, a formal analysis is needed before chance can be ruled out.
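
One informal way to explore the "due to chance" question is a randomization sketch: shuffle the group labels many times, as if acupuncture had no effect, and see how often the shuffled difference is at least as large as the observed one. This is not part of the exercise or its official solution; the seed, the number of shuffles, and the helper `diff_prop()` are assumptions made for this illustration.

```r
set.seed(42)   # arbitrary seed, chosen only for reproducibility

# One record per patient, rebuilt from the contingency table
group <- rep(c("Treatment", "Control"), times = c(43, 46))
pain_free <- c(rep(c(TRUE, FALSE), times = c(10, 33)),   # treatment
               rep(c(TRUE, FALSE), times = c(2, 44)))    # control

# Hypothetical helper: pain-free proportion in treatment minus control
diff_prop <- function(g) {
  mean(pain_free[g == "Treatment"]) - mean(pain_free[g == "Control"])
}
observed_diff <- diff_prop(group)

# Shuffle the group labels many times to mimic "no real effect"
n_sims <- 10000
sim_diffs <- replicate(n_sims, diff_prop(sample(group)))

# Share of shuffles with a difference at least as large as the observed one
# (small tolerance guards against floating-point ties)
chance_share <- mean(sim_diffs >= observed_diff - 1e-12)
chance_share
```

A small value of `chance_share` would suggest that a difference this large rarely arises from random assignment alone; a large value would suggest chance is a plausible explanation.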