Case-control studies help determine whether certain exposures are associated with outcomes such as developing cancer. The built-in dataset `esoph` contains data from a case-control study in France comparing people with esophageal cancer (cases, counted in `ncases`) to people without esophageal cancer (controls, counted in `ncontrols`) who are carefully matched on a variety of demographic and medical characteristics. The study compares alcohol intake in grams per day (`alcgp`) and tobacco intake in grams per day (`tobgp`) across cases and controls grouped by age range (`agegp`).

The dataset is available in base R and can be accessed by the name `esoph`:

`head(esoph)` <br>
You will be using this dataset to answer the following four multi-part questions (Questions 3-6).

You may wish to use the tidyverse package:

`library(tidyverse)` <br>
The following three parts have you explore some basic characteristics of the dataset.

Each row contains one group of the study. Each group has a different combination of age, alcohol consumption, and tobacco consumption. The number of cancer cases and the number of controls (individuals without cancer) are reported for each group.
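A quick way to confirm this structure is to inspect the column types and factor levels. The sketch below uses only base R and assumes the built-in `esoph` data frame is available, as it is in a standard R installation.

```r
# Inspect the structure of the built-in esoph data frame (base R only).
data(esoph)

# Column names and types: three ordered factors plus two count columns.
str(esoph)

# The ordered levels of each grouping variable, from lowest to highest.
levels(esoph$agegp)
levels(esoph$alcgp)
levels(esoph$tobgp)

# TRUE if every row is a distinct age/alcohol/tobacco combination.
!any(duplicated(esoph[, c("agegp", "alcgp", "tobgp")]))
```

Because the grouping columns are ordered factors, functions such as `min()` and `max()` return the lowest and highest level rather than an alphabetical extreme.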

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Question 3a
How many groups are in the study?


In [3]:
head(esoph)

,agegp,alcgp,tobgp,ncases,ncontrols
,<ord>,<ord>,<ord>,<dbl>,<dbl>
1,25-34,0-39g/day,0-9g/day,0,40
2,25-34,0-39g/day,10-19,0,10
3,25-34,0-39g/day,20-29,0,6
4,25-34,0-39g/day,30+,0,5
5,25-34,40-79,0-9g/day,0,27
6,25-34,40-79,10-19,0,7


In [4]:
nrow(esoph)

## Question 3b
<b>How many cases are there? </b><br>
Save this value as `all_cases` for later problems.

In [7]:
all_cases <- sum(esoph$ncases)
all_cases

## Question 3c
<b>How many controls are there?</b> <br>
Save this value as `all_controls` for later problems.

In [10]:
all_controls <- sum(esoph$ncontrols)
all_controls

## Question 4a
What is the probability that a subject in the highest alcohol consumption group is a cancer case?

In [11]:
head(esoph)

,agegp,alcgp,tobgp,ncases,ncontrols
,<ord>,<ord>,<ord>,<dbl>,<dbl>
1,25-34,0-39g/day,0-9g/day,0,40
2,25-34,0-39g/day,10-19,0,10
3,25-34,0-39g/day,20-29,0,6
4,25-34,0-39g/day,30+,0,5
5,25-34,40-79,0-9g/day,0,27
6,25-34,40-79,10-19,0,7


In [12]:
max(esoph$alcgp)

In [27]:
df <- esoph %>% filter(alcgp >= '120+')
df

agegp,alcgp,tobgp,ncases,ncontrols
<ord>,<ord>,<ord>,<dbl>,<dbl>
25-34,120+,0-9g/day,0,1
25-34,120+,10-19,1,1
25-34,120+,20-29,0,1
25-34,120+,30+,0,2
35-44,120+,0-9g/day,2,3
35-44,120+,10-19,0,3
35-44,120+,20-29,2,4
45-54,120+,0-9g/day,4,4
45-54,120+,10-19,3,4
45-54,120+,20-29,2,3


In [26]:
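# Note: cases / controls is the odds of being a case, not the probability.
# The probability is cases / (cases + controls), as in the answer below.
sum(df$ncases)/sum(df$ncontrols)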
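Dividing cases by controls gives the odds of being a case rather than the probability. The sketch below shows how the two quantities relate. The helper names `odds_to_prob` and `prob_to_odds` are hypothetical and not part of the original notebook, and the counts are the ones reported for the `"120+"` group in the summary output later in this notebook.

```r
# Hypothetical helpers (not part of the original notebook) relating odds and probability.
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

# Counts reported later in this notebook for the "120+" alcohol group.
cases <- 45
controls <- 67

odds_case   <- cases / controls            # what cases / controls computes
p_case_high <- cases / (cases + controls)  # probability of being a case

# Converting the odds gives back the same probability.
all.equal(odds_to_prob(odds_case), p_case_high)
```

The two measures are close only when cases are rare, so the distinction matters for a group in which a large share of subjects are cases.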

### Answer

In [30]:
esoph %>%
filter(alcgp == "120+") 

agegp,alcgp,tobgp,ncases,ncontrols
<ord>,<ord>,<ord>,<dbl>,<dbl>
25-34,120+,0-9g/day,0,1
25-34,120+,10-19,1,1
25-34,120+,20-29,0,1
25-34,120+,30+,0,2
35-44,120+,0-9g/day,2,3
35-44,120+,10-19,0,3
35-44,120+,20-29,2,4
45-54,120+,0-9g/day,4,4
45-54,120+,10-19,3,4
45-54,120+,20-29,2,3


In [31]:
esoph %>%
filter(alcgp == "120+") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols))

ncases,ncontrols
<dbl>,<dbl>
45,67


In [33]:
esoph %>%
filter(alcgp == "120+") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols)) %>%
mutate(p_case = ncases / (ncases + ncontrols)) 

ncases,ncontrols,p_case
<dbl>,<dbl>,<dbl>
45,67,0.4017857


In [34]:
esoph %>%
filter(alcgp == "120+") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols)) %>%
mutate(p_case = ncases / (ncases + ncontrols)) %>%
pull(p_case)
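The same calculation can be run for every alcohol group at once, which is a useful cross-check on the single-group pipeline. This is a minimal base R sketch, not the notebook's own method. The name `by_alc` is introduced here for illustration.

```r
# Base R sketch: probability of being a case within each alcohol group.
data(esoph)

# Sum cases and controls within each level of alcgp.
by_alc <- aggregate(cbind(ncases, ncontrols) ~ alcgp, data = esoph, FUN = sum)

# Share of subjects in each group who are cases.
by_alc$p_case <- by_alc$ncases / (by_alc$ncases + by_alc$ncontrols)
by_alc
```

The row for `"120+"` should match the value returned by the pipeline above, and the other rows show how the proportion of cases changes across consumption levels.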

## Question 4b
What is the probability that a subject in the lowest alcohol consumption group is a cancer case?

In [36]:
min(esoph$alcgp)

In [37]:
esoph %>% filter(alcgp == "0-39g/day")

agegp,alcgp,tobgp,ncases,ncontrols
<ord>,<ord>,<ord>,<dbl>,<dbl>
25-34,0-39g/day,0-9g/day,0,40
25-34,0-39g/day,10-19,0,10
25-34,0-39g/day,20-29,0,6
25-34,0-39g/day,30+,0,5
35-44,0-39g/day,0-9g/day,0,60
35-44,0-39g/day,10-19,1,14
35-44,0-39g/day,20-29,0,7
35-44,0-39g/day,30+,0,8
45-54,0-39g/day,0-9g/day,1,46
45-54,0-39g/day,10-19,0,18


In [40]:
esoph %>% filter(alcgp == "0-39g/day") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols))

ncases,ncontrols
<dbl>,<dbl>
29,415


In [41]:
esoph %>% filter(alcgp == "0-39g/day") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols)) %>%
mutate(p_case = ncases/(ncases+ncontrols))

ncases,ncontrols,p_case
<dbl>,<dbl>,<dbl>
29,415,0.06531532


In [42]:
esoph %>% filter(alcgp == "0-39g/day") %>%
summarize(ncases = sum(ncases), ncontrols = sum(ncontrols)) %>%
mutate(p_case = ncases/(ncases+ncontrols)) %>%
pull(p_case)

## Question 4c
Given that a person is a case, what is the probability that they smoke 10g or more a day?

In [44]:
# Pr(case) * Pr(smoking a cigarette)

In [47]:
p_case <- sum(esoph$ncases)/(sum(esoph$ncases) + sum(esoph$ncontrols))
p_case

In [50]:
min(esoph$tobgp)

In [52]:
df <- esoph %>% filter(tobgp != "0-9g/day")
df

agegp,alcgp,tobgp,ncases,ncontrols
<ord>,<ord>,<ord>,<dbl>,<dbl>
25-34,0-39g/day,10-19,0,10
25-34,0-39g/day,20-29,0,6
25-34,0-39g/day,30+,0,5
25-34,40-79,10-19,0,7
25-34,40-79,20-29,0,4
25-34,40-79,30+,0,7
25-34,80-119,10-19,0,1
25-34,80-119,30+,0,2
25-34,120+,10-19,1,1
25-34,120+,20-29,0,1


In [53]:
nrow(df) # Groups smoking 10g or more

In [54]:
nrow(esoph)

In [56]:
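# Pr(cig): fraction of groups (rows), not people, smoking 10g or more
p_cig <- nrow(df)/nrow(esoph)
p_cig

In [57]:
# First attempt (incorrect): Pr(cig) * Pr(case) is not Pr(smokes 10g or more | case),
# and p_cig counts groups rather than people. See the answer below.
p_cig*p_case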
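Because the question conditions on being a case, only cases enter the calculation. Written as a conditional probability, with $S$ denoting "smokes 10g or more a day" (notation introduced here for illustration):

$$
\Pr(S \mid \text{case}) = \frac{\Pr(S \cap \text{case})}{\Pr(\text{case})} = \frac{\text{number of cases with } S}{\text{total number of cases}}
$$

The product $\Pr(S)\,\Pr(\text{case})$ equals the joint probability $\Pr(S \cap \text{case})$ only if smoking and case status are independent, and even then it is not the conditional probability.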

### Answer

In [58]:
tob_cases <- esoph %>%
  filter(tobgp != "0-9g/day") %>%
  pull(ncases) %>%
  sum()

tob_cases/all_cases
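To see where this number comes from, it can help to look at how the cases are distributed across all four tobacco groups. The following is a base R sketch under the assumption that the built-in `esoph` data are loaded. The names `case_share_by_tob` and `p_tob10_given_case` are introduced here for illustration.

```r
data(esoph)

# Number of cases in each tobacco group, divided by the total number of cases.
case_share_by_tob <- tapply(esoph$ncases, esoph$tobgp, sum) / sum(esoph$ncases)
case_share_by_tob

# Pr(smokes 10g or more | case): every group except the lowest one.
p_tob10_given_case <- sum(case_share_by_tob[names(case_share_by_tob) != "0-9g/day"])
p_tob10_given_case
```

The shares sum to 1, so the result is also one minus the share of cases in the `"0-9g/day"` group.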

## Question 4d
Given that a person is a control, what is the probability that they smoke 10g or more a day?

In [63]:
tob_controls <- esoph %>%
  filter(tobgp != "0-9g/day") %>%
  pull(ncontrols) %>%
  sum()

tob_controls/all_controls
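Questions 4c and 4d apply the same calculation to different count columns, so the two can be placed side by side with one small function. This is a sketch in base R. The helper `p_tob10` and the result names are hypothetical and not part of the original notebook.

```r
data(esoph)

# Hypothetical helper: share of a count column that falls in
# tobacco groups of 10g/day or more.
p_tob10 <- function(counts, tobgp = esoph$tobgp) {
  heavy <- tobgp != "0-9g/day"
  sum(counts[heavy]) / sum(counts)
}

p_tob10_cases    <- p_tob10(esoph$ncases)     # Pr(smokes 10g or more | case)
p_tob10_controls <- p_tob10(esoph$ncontrols)  # Pr(smokes 10g or more | control)

c(cases = p_tob10_cases, controls = p_tob10_controls)
```

Comparing the two values shows whether heavier smoking is more common among cases than among controls in this dataset.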