# Pearson's Chi-Squared Test

R has a built-in function for conducting Pearson's Chi-Squared Test. In this write-up we demonstrate that function and then perform the test by hand, calculating each step. Before we start, let's replicate the gender vs. trouble status data from the article.

In [1]:
gender = c(rep('boys', 117), rep('girls', 120))
trouble = c(rep('trouble', 46), rep('no trouble', 71), rep('trouble', 37), rep('no trouble', 83))
o.table = table(gender, trouble)
print(o.table)

       trouble
gender  no trouble trouble
  boys          71      46
  girls         83      37


## Using Built-in Function

In [2]:
xsq.test = chisq.test(gender, trouble, correct = FALSE)
print(xsq.test)
#xsq.test = chisq.test(o.table, correct = FALSE) #Inputting the data as a contingency table


	Pearson's Chi-squared test

data:  gender and trouble
X-squared = 1.8733, df = 1, p-value = 0.1711



There are two ways to pass the data to the `chisq.test` function. We can either supply the two variables directly (the `x=` and `y=` arguments), or supply their contingency table (`o.table` above, as in the commented-out line). With `correct = TRUE`, which is the default, the test applies Yates' continuity correction to 2-by-2 tables; we set `correct = FALSE` to obtain the uncorrected Pearson statistic.
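As a quick check of the two input styles, the sketch below rebuilds the data from the first cell and runs the test both ways, then once more with the default Yates' correction. The names `from.vectors`, `from.table`, and `yates` are illustrative and do not come from the article.

```r
# Rebuild the data from the first cell so this block runs on its own
gender = c(rep('boys', 117), rep('girls', 120))
trouble = c(rep('trouble', 46), rep('no trouble', 71),
            rep('trouble', 37), rep('no trouble', 83))
o.table = table(gender, trouble)

# Same test, two input styles (illustrative variable names)
from.vectors = chisq.test(gender, trouble, correct = FALSE)
from.table = chisq.test(o.table, correct = FALSE)

# The two calls should agree on the test statistic
print(all.equal(unname(from.vectors$statistic), unname(from.table$statistic)))

# correct = TRUE is the default, so this call applies Yates' correction
yates = chisq.test(o.table)

# For a 2-by-2 table the corrected statistic is smaller than the uncorrected one
print(unname(yates$statistic) < unname(from.table$statistic))
```

The correction shrinks each |observed - expected| difference before squaring, which is why the corrected statistic comes out smaller here.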

We can also extract individual components from the result of the `chisq.test` function. Suppose our input data is simply the two vectors of variables, and we are interested in obtaining the observed and expected contingency tables. Instead of taking an extra step with the `table()` function, we can extract them directly from the output of the test.

In [3]:
xsq.test$observed   # observed counts (same as o.table above)
xsq.test$expected   # expected counts under the null
xsq.test$statistic  # test statistic
xsq.test$parameter  # the degrees of freedom
xsq.test$p.value    # p-value

       trouble
gender  no trouble trouble
  boys          71      46
  girls         83      37

      no trouble  trouble
boys    76.02532 40.97468
girls   77.97468 42.02532
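To see everything the result object stores, `names()` can be applied to it. The sketch below also pulls out the Pearson residuals, a component that `chisq.test` returns in addition to the ones listed above; their squares add up to the test statistic. The data are rebuilt here only so that the block runs on its own.

```r
# Rebuild the data and rerun the test so this block is self-contained
gender = c(rep('boys', 117), rep('girls', 120))
trouble = c(rep('trouble', 46), rep('no trouble', 71),
            rep('trouble', 37), rep('no trouble', 83))
xsq.test = chisq.test(gender, trouble, correct = FALSE)

# List every component stored in the result
print(names(xsq.test))

# Pearson residuals: (observed - expected) / sqrt(expected), one per cell
print(xsq.test$residuals)

# Squaring and summing the residuals recovers the test statistic
print(sum(xsq.test$residuals^2))
```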


## Using Basic Calculations
Even though the built-in function is simple to use, going through the basic calculations allows us to gain a deeper understanding of the testing procedure. Recall that the test statistic for Pearson's Chi-Squared Test is
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

where observed is the observed count in each cell, expected is the expected count (when the two variables are independent), and the sum runs over all cells of the contingency table. We can also express this formula using the joint probabilities:
$$\chi^2 = \sum N \frac{(\text{observed.p} - \text{expected.p})^2}{\text{expected.p}}$$

where observed.p is the observed joint probability, expected.p is the expected joint probability, and N is the total count (refer to the article for more information).
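The two forms agree because each count is N times the corresponding joint probability, so a factor of N cancels between numerator and denominator. A minimal sketch of that check follows, with the observed counts typed in directly; the names `obs`, `obs.p`, `exp.p`, `stat.counts`, and `stat.probs` are illustrative.

```r
# Observed counts from the table above (filled column by column)
obs = matrix(c(71, 83, 46, 37), nrow = 2,
             dimnames = list(c('boys', 'girls'), c('no trouble', 'trouble')))
N = sum(obs)

# Observed joint probabilities, and expected joint probabilities under independence
obs.p = obs / N
exp.p = outer(rowSums(obs.p), colSums(obs.p))

# Count form: sum of (observed - expected)^2 / expected
stat.counts = sum((obs - N * exp.p)^2 / (N * exp.p))

# Probability form: sum of N * (observed.p - expected.p)^2 / expected.p
stat.probs = sum(N * (obs.p - exp.p)^2 / exp.p)

print(c(stat.counts, stat.probs))
```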

In [4]:
gender.prob = table(gender)/length(gender)
trouble.prob = table(trouble)/length(trouble)
e.table = matrix(0, nrow = length(gender.prob), ncol = length(trouble.prob))
for(i in 1:length(gender.prob)){ #Create expected count table
  for(j in 1:length(trouble.prob)){
    e.table[i,j] = gender.prob[i] * trouble.prob[j] * length(gender)
  }
}
colnames(e.table) = c('no trouble', 'trouble')
rownames(e.table) = c('boys', 'girls')
print(o.table)

       trouble
gender  no trouble trouble
  boys          71      46
  girls         83      37


In [5]:
print(e.table) #Compare this with the output extracted above

      no trouble  trouble
boys    76.02532 40.97468
girls   77.97468 42.02532
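The double loop above can also be written without explicit iteration. The sketch below is one possible vectorized alternative, using `outer()` on the row and column totals; `e.vec` is an illustrative name, and the data are rebuilt so the block runs on its own.

```r
# Rebuild the data and the observed table
gender = c(rep('boys', 117), rep('girls', 120))
trouble = c(rep('trouble', 46), rep('no trouble', 71),
            rep('trouble', 37), rep('no trouble', 83))
o.table = table(gender, trouble)

# Expected count for each cell: (row total * column total) / grand total
e.vec = outer(rowSums(o.table), colSums(o.table)) / sum(o.table)
print(e.vec)
```

Because `outer()` carries over the names of the row and column totals, the row and column labels do not need to be set by hand.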


In [6]:
test.stat = sum((o.table - e.table)^2/e.table) #The test statistic
print(test.stat)

[1] 1.873294


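For Pearson's Chi-Squared Test, the test statistic approximately follows a $\chi^2$ distribution under the null hypothesis, with $(r-1)(c-1)$ degrees of freedom, where $r$ and $c$ are the numbers of rows and columns of the contingency table. Here $(2-1)(2-1) = 1$. The critical value (assuming $\alpha = 0.05$) can be found with the following command: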

In [7]:
crit.val = qchisq(p = 0.95, df = 1)
print(crit.val)

[1] 3.841459
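Rather than hard-coding the degrees of freedom, they can be derived from the dimensions of the table. The following is a small sketch of that idea; `obs`, `alpha`, and `df` are illustrative names, and the counts are typed in directly so the block runs on its own.

```r
# Observed counts from the table above (filled column by column)
obs = matrix(c(71, 83, 46, 37), nrow = 2,
             dimnames = list(c('boys', 'girls'), c('no trouble', 'trouble')))

alpha = 0.05

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (nrow(obs) - 1) * (ncol(obs) - 1)

# Critical value: the (1 - alpha) quantile of the chi-squared distribution
crit.val = qchisq(p = 1 - alpha, df = df)
print(c(df, crit.val))
```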


Since the test statistic (1.873) is less than the critical value (3.841), we fail to reject the null hypothesis. The p-value can be calculated by:

In [8]:
pchisq(q = test.stat, df = 1, lower.tail = FALSE)

The p-value (about 0.1711, matching the output of the built-in function) is larger than $\alpha = 0.05$, hence we fail to reject the null hypothesis.
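As a final check, the manual calculation can be compared against the built-in function from start to finish. The sketch below repeats the steps on the observed counts and confirms that the critical-value rule and the p-value rule lead to the same decision; `obs`, `builtin`, and `alpha` are illustrative names.

```r
# Observed counts from the table above (filled column by column)
obs = matrix(c(71, 83, 46, 37), nrow = 2,
             dimnames = list(c('boys', 'girls'), c('no trouble', 'trouble')))
alpha = 0.05

# Manual calculation: expected counts, statistic, degrees of freedom, p-value
expected = outer(rowSums(obs), colSums(obs)) / sum(obs)
test.stat = sum((obs - expected)^2 / expected)
df = (nrow(obs) - 1) * (ncol(obs) - 1)
p.value = pchisq(q = test.stat, df = df, lower.tail = FALSE)

# Built-in function, without Yates' correction
builtin = chisq.test(obs, correct = FALSE)
print(all.equal(p.value, builtin$p.value))

# Both decision rules should agree: TRUE means "fail to reject"
crit.val = qchisq(p = 1 - alpha, df = df)
print(c(test.stat < crit.val, p.value > alpha))
```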