## Example
Let's begin with an example. Suppose that you conduct an opinion survey among pilots. In the survey, you ask for basic demographics (gender, race, etc.) and whether they agree or disagree with a statement on a scale.

You then have a set of categorical data in which responses can be compared across demographic groups (or against other responses). Here we compare pilot gender with the sense of discrimination within the field.

In [4]:
library("gmodels")

gender <- structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Female", "Male"), class = "factor")

discrim <- structure(c(3L, 4L, 3L, 3L, 3L, 2L, 4L, 3L, 2L, 3L, 3L, 2L, 3L, 
3L, 3L, 3L, 2L, 3L, 4L, 3L), .Label = c("A Lot", "No", "Not at all", 
"Yes"), class = "factor")

CrossTable(gender, discrim, prop.c=FALSE, prop.r=FALSE, chisq=FALSE, prop.t=FALSE, prop.chisq=FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|-------------------------|

 
Total Observations in Table:  20 

 
             | discrim 
      gender |         No | Not at all |        Yes |  Row Total | 
-------------|------------|------------|------------|------------|
      Female |          2 |          4 |          2 |          8 | 
-------------|------------|------------|------------|------------|
        Male |          2 |          9 |          1 |         12 | 
-------------|------------|------------|------------|------------|
Column Total |          4 |         13 |          3 |         20 | 
-------------|------------|------------|------------|------------|

 


In this representation we see the response count for each combination of the two categorical variables. Summing the counts over all combinations gives the total number of responses (n=20). For example, the table shows two (2) females who responded 'No' to feeling discrimination.

The table also shows row and column totals for each category. Following the rows, we can see that the total number of women is eight (8) and the total number of men is twelve (12). From the columns, there were a total of four (4) 'No,' thirteen (13) 'Not at all,' three (3) 'Yes,' and zero (0) 'A lot.'

It is by comparing the proportions of counts across rows and columns that we determine whether there is a relationship between the two categorical variables.

If we assume that the responses to the question of discrimination are independent of gender, then the expected number of females responding 'Yes' equals the fraction of females multiplied by the number responding 'Yes.'

$$\begin{align}
n_{f,y} =& 3 \times \frac{8}{20}\\
=& 1.2
\end{align}$$

However, the table shows that two females actually responded 'Yes,' which is more than the expected 1.2. The question then becomes whether a difference of this size is plausible if responses are independent of gender (i.e. male and female responses come from the same distribution), or whether it indicates that they come from different distributions.
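The same expected-count calculation can be applied to every cell at once. The following is a minimal sketch in base R, assuming the counts are typed in directly from the table above; the variable names `observed`, `n`, and `expected` are illustrative and not part of the original notebook.

```r
# Observed counts copied from the cross table (rows: gender, columns: response).
observed <- matrix(c(2, 4, 2,
                     2, 9, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender  = c("Female", "Male"),
                                   discrim = c("No", "Not at all", "Yes")))

n <- sum(observed)

# Under independence: expected count = row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

print(round(expected, 3))
```

The Female/'Yes' entry of `expected` is the 1.2 derived by hand above.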


## Chi squared
The chi squared test of independence is calculated as follows:

$$\begin{align}
\chi^2_{\text{cell}} =& \frac{\left( \text{observed} - \text{expected} \right)^2}{\text{expected}} \\
\chi^2 =& \sum_{\text{all cells}} \chi^2_{\text{cell}}
\end{align}$$
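As a sketch of how these two formulas translate into code, the per-cell contributions and their sum can be computed in base R from the survey counts. The names `contrib` and `chi_sq` are illustrative assumptions, and the counts are entered by hand from the cross table.

```r
# Observed counts (rows: Female, Male; columns: No, Not at all, Yes).
observed <- matrix(c(2, 4, 2,
                     2, 9, 1),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution: (observed - expected)^2 / expected.
contrib <- (observed - expected)^2 / expected

# Test statistic: sum of the contributions over all cells.
chi_sq <- sum(contrib)

print(round(contrib, 3))
print(round(chi_sq, 3))
```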

This value is then compared against the chi squared distribution to determine how likely a statistic at least this large would be if responses were independent of gender. The chi squared distribution depends on the number of degrees of freedom (dof) of the measurement. The dof used to select the distribution is:

$$\begin{align}
\text{dof} =& \left( \text{number of rows} - 1\right) \left( \text{number of columns} - 1 \right)
\end{align}$$
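To illustrate how the degrees of freedom enter the comparison, the sketch below uses base R's `qchisq` and `pchisq` with the survey counts. The 0.05 significance level and the variable names are assumptions made for the example.

```r
# Observed counts (rows: Female, Male; columns: No, Not at all, Yes).
observed <- matrix(c(2, 4, 2,
                     2, 9, 1),
                   nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
chi_sq   <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical value at an assumed 0.05 significance level.
critical <- qchisq(0.05, df = dof, lower.tail = FALSE)

# p-value: probability of a statistic at least as large as chi_sq
# when the two variables are independent.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

print(c(dof = dof, critical = critical, chi_sq = chi_sq, p_value = p_value))
```

A statistic below the critical value (equivalently, a p-value above 0.05) means independence is not rejected at that level.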

For our sample, we can recalculate the table to include the expected value and chi squared contribution of each cell, as well as the p-value of the test of independence between gender and response:

In [5]:
CrossTable(gender, discrim, prop.c=FALSE, prop.r=FALSE, chisq=TRUE, prop.t=FALSE, prop.chisq=TRUE, expected=TRUE)

“Chi-squared approximation may be incorrect”


 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
| Chi-square contribution |
|-------------------------|

 
Total Observations in Table:  20 

 
             | discrim 
      gender |         No | Not at all |        Yes |  Row Total | 
-------------|------------|------------|------------|------------|
      Female |          2 |          4 |          2 |          8 | 
             |      1.600 |      5.200 |      1.200 |            | 
             |      0.100 |      0.277 |      0.533 |            | 
-------------|------------|------------|------------|------------|
        Male |          2 |          9 |          1 |         12 | 
             |      2.400 |      7.800 |      1.800 |            | 
             |      0.067 |      0.185 |      0.356 |            | 
-------------|------------|------------|------------|------------|
Column Total |          4 |         13 |          3 |         20 | 
-------------|------------|------------|------------|------------|

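For each cell we now have the count, expected value, and chi squared contribution. Summing the contributions gives a test statistic of about 1.52, and with 2 degrees of freedom the resulting p-value is 0.468. That is, if responses were independent of gender, a statistic at least this large would occur about 47% of the time. In this case we cannot reject the null hypothesis of independence.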
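These figures can be cross-checked with base R's `chisq.test`, which accepts a matrix of counts. This is a minimal sketch rather than part of the original notebook; the counts are entered by hand, and the warning is suppressed here only so that the block runs quietly.

```r
# Observed counts (rows: Female, Male; columns: No, Not at all, Yes).
observed <- matrix(c(2, 4, 2,
                     2, 9, 1),
                   nrow = 2, byrow = TRUE)

# chisq.test warns when expected counts are small; suppress it for this check.
result <- suppressWarnings(chisq.test(observed))

print(result$statistic)   # Pearson chi squared statistic
print(result$parameter)   # degrees of freedom
print(result$p.value)     # p-value of the test of independence

# Number of cells with an expected count below 5.
small_cells <- sum(result$expected < 5)
print(small_cells)
```

Counting the cells with expected values below 5 gives a quick indication of how far the sample is from the regime where the chi squared approximation is usually considered reliable.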

Note the warning provided by the test, "Chi-squared approximation may be incorrect." R issues this warning when expected cell counts are small (conventionally below 5), as four of the six are here; this is a very small sample for a table of this size. A power analysis suggests that detecting a moderate effect size of 0.3 with 80% power at a 0.05 significance level would require a sample of n=108.

In [7]:
library("pwr")
pwr.chisq.test(w=0.3,N=NULL,df=2,sig.level=0.05,power=0.8)


     Chi squared power calculation 

              w = 0.3
              N = 107.0521
             df = 2
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations
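For comparison, the power available at a given sample size can be sketched in base R using the noncentral chi squared distribution. This assumes the usual formulation in which the noncentrality parameter is N times w squared; the helper `chisq_power` is a hypothetical name introduced for illustration and is not a function of the `pwr` package.

```r
# Hypothetical helper: power of a chi squared test for effect size w,
# assuming a noncentral chi squared distribution with ncp = n * w^2.
chisq_power <- function(n, w = 0.3, df = 2, sig_level = 0.05) {
  # Critical value of the central chi squared distribution.
  critical <- qchisq(sig_level, df = df, lower.tail = FALSE)
  # Probability of exceeding it under the alternative.
  pchisq(critical, df = df, ncp = n * w^2, lower.tail = FALSE)
}

power_survey <- chisq_power(20)    # the survey's sample size
power_target <- chisq_power(108)   # sample size rounded up from 107.0521

print(c(n20 = power_survey, n108 = power_target))
```

Rounding N up to the next whole respondent keeps the power at or above the 0.8 target, while the survey's n=20 falls well short of it.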
