# Comparing Group Membership 

**The $\chi^2$ Test**

The $\chi^2$ test can be used to compare two categorical variables and helps us answer questions like:

- Is whether or not a customer churns independent of their subscription plan?
- Are doctors less likely to smoke?
- Does playing on the home field give a soccer team an advantage?

In this lesson we will dive into how the test is performed.

<hr style="border:2px solid gray">

## The Chi-Square Contingency Table Test

The $\chi^2$ ('$k\bar{i}$') test can also be used in several other ways, but we will use what is referred to as the *contingency table test*, which lets us test the hypothesis that one categorical variable is independent of another. To do this, we will:

1. Create a contingency table of observed values.

1. Use `stats.chi2_contingency` to generate the test statistic, p-value, degrees of freedom, and a contingency table of expected values, all based on the observed values.

1. Compare the p-value to our alpha and draw conclusions.

<div class="alert alert-block alert-info">
To manually compute $\chi^2$, we would also compute a contingency table of expected values. Then we would compute our test statistic, $\chi^2 = \sum{\frac{(O - E)^2}{E}}$, where $O$ are the observed values, $E$ are the expected values, and the sum runs over every cell of the table. You can see an example of this below, in the Bonus Content.
</div>
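As a minimal sketch of that formula, the snippet below computes the statistic by hand for a small table and compares it with SciPy. The counts are made up purely for illustration and are not from the `mpg` data.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of observed counts (illustrative numbers only)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Marginal totals: one total per row, one per column, and the grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count for each cell if rows and columns were independent
expected = row_totals * col_totals / n

# chi^2 = sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# SciPy's result for the same table, for comparison
chi2_scipy, p, degf, expected_scipy = stats.chi2_contingency(observed)

print(f'manual chi^2 = {chi2_manual:.4f}')
print(f'scipy  chi^2 = {chi2_scipy:.4f}')
```

The two values should agree here because the table has more than one degree of freedom; for 2x2 tables, `chi2_contingency` applies a continuity correction by default, so a hand calculation would only match with `correction=False`.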

For this lesson, we will look at the dataset on cars that we explored previously.

____________________

**Example 1**

We will investigate the question of whether a car's drive type is independent of its transmission type.

- $H_{0}$ (Null Hypothesis): drive is independent of transmission type. 

- $H_{a}$ (Alternative Hypothesis): drive is dependent on transmission type. 

In [1]:
#imports
import pandas as pd
from scipy import stats
from pydataset import data
import numpy as np

mpg = data('mpg')
mpg['transmission'] = mpg.trans.str[:-4] # a little cleaning goes a long way
mpg.head()

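| &nbsp; | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | transmission |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | auto |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | manual |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | manual |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | auto |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | auto |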


In [2]:
mpg['drv'].value_counts()

f    106
4    103
r     25
Name: drv, dtype: int64

In [3]:
# strips the trailing 4 characters, e.g. (l5), from the end of trans values
mpg['trans'] = mpg['trans'].str[:-4]
mpg.trans.value_counts()

auto      157
manual     77
Name: trans, dtype: int64

First, create an observed crosstab, or contingency table, from the dataframe's two columns of interest. 

In [4]:
observed = pd.crosstab(mpg.drv, mpg.trans)
observed

| drv / trans | auto | manual |
| --- | --- | --- |
| 4 | 75 | 28 |
| f | 65 | 41 |
| r | 17 | 8 |


The `chi2_contingency` function returns 4 items (*in this order*):

1. the **test statistic**: $\chi^2$
2. the **p-value**: the probability of seeing a test statistic at least this large if the two variables were truly independent
3. the **degrees of freedom**: (number of rows - 1) × (number of columns - 1) of the contingency table
4. the contingency table of the **expected values**, which represents what the values would be if everything was proportional and there was no relationship between the 2 variables. 
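A quick way to get a feel for these four return values is to check them against the table itself. The sketch below re-enters the observed drive/transmission counts from above as a plain NumPy array (so it runs without the `mpg` dataframe) and confirms how the degrees of freedom and the expected table relate to the observed one.

```python
import numpy as np
from scipy import stats

# Observed counts from the crosstab above: rows are drv (4, f, r),
# columns are trans (auto, manual)
observed = np.array([[75, 28],
                     [65, 41],
                     [17, 8]])

chi2, p, degf, expected = stats.chi2_contingency(observed)

nrows, ncols = observed.shape

# Degrees of freedom depend only on the shape of the table
print('degrees of freedom:', degf, '== (nrows - 1) * (ncols - 1):', (nrows - 1) * (ncols - 1))

# The expected table has the same shape and the same row and column totals
print('same shape:', expected.shape == observed.shape)
print('same row totals:', np.allclose(expected.sum(axis=1), observed.sum(axis=1)))
print('same column totals:', np.allclose(expected.sum(axis=0), observed.sum(axis=0)))
```

The expected table keeps the row and column totals of the observed table and only redistributes the counts inside it, which is what "no relationship between the 2 variables" means in practice.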

In [5]:
alpha = 0.05
chi2, p, degf, expected = stats.chi2_contingency(observed)

Let's look at our results:

In [29]:
# print 'Observed Values' followed by a new line
print('Observed Values\n')

# print the values from the 'observed' dataframe
print(observed.values)

# print --- and then a new line, 'Expected Values', followed by another new line
print('---\nExpected Values\n')

# print the expected values array (astype(int) drops the decimals, so 33.89 prints as 33)
print(expected.astype(int))

# print a new line
print('---\n')

# print the chi2 value, formatted as a float with 4 decimal places
print(f'chi^2 = {chi2:.4f}') 

# print the p-value, formatted as a float with 4 decimal places
print(f'p     = {p:.4f}')

Observed Values

[[75 28]
 [65 41]
 [17  8]]
---
Expected Values

[[69 33]
 [71 34]
 [16  8]]
---

chi^2 = 3.1368
p     = 0.2084


**Takeaways**

Comparing the contingency tables, we can see that the observed values are close to the expected values. With a p-value of about 0.21, well above our alpha of 0.05, the available data do not show a significant relationship between transmission type and drive type. We fail to reject the null hypothesis.

**Example 2**

We will now investigate the question of whether the car's 'class' is independent of number of cylinders. Number of cylinders, while it is represented numerically, is a discrete variable. It is not continuous and there are a limited number of options. 

- $H_{0}$ (Null Hypothesis): class is independent of cylinders. 

- $H_{a}$ (Alternative Hypothesis): class is dependent on cylinders.

First, create an observed crosstab, or contingency table, from the dataframe's two columns of interest. 

In [14]:
# used mpg['class'] instead of mpg.class because class is a reserved word
observed = pd.crosstab(mpg['class'], mpg.cyl)
observed

| class / cyl | 4 | 5 | 6 | 8 |
| --- | --- | --- | --- | --- |
| 2seater | 0 | 0 | 0 | 5 |
| compact | 32 | 2 | 13 | 0 |
| midsize | 16 | 0 | 23 | 2 |
| minivan | 1 | 0 | 10 | 0 |
| pickup | 3 | 0 | 10 | 20 |
| subcompact | 21 | 2 | 7 | 5 |
| suv | 8 | 0 | 16 | 38 |


When we run a Chi-Square Contingency Test, we are testing whether there is a relationship between **at least** 2 categories of one variable with **at least** 2 categories of the other variable. The results do not tell us where the relationship lies, or that there is a relationship between all categories. We can do additional testing to figure out where the relationship lies, if necessary. 

Therefore, when we run a chi-square test with these 2 variables, the resulting p-value will tell us whether there is a relationship between at least 2 classes of vehicles with at least 2 types of cylinders. 

In [19]:
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

1.5351076620141742e-20

In [16]:
alpha = 0.05

def eval_results(p, alpha, group1, group2):
    '''
    this function will take in the p-value, alpha, and a name for the 2 variables 
    you are comparing (group 1 and group 2)
    '''
    if p < alpha:
        return f'There exists some relationship between {group1} and the {group2}. (p-value: {p})'
    else:
        return f'There is not a significant relationship between {group1} and {group2}. (p-value: {p})'
        

In [17]:
eval_results(p, alpha, group1='class', group2='cylinders')

'There exists some relationship between class and the cylinders. (p-value: 1.5351076620141742e-20)'

Now we know there is *some* relationship. But where does that relationship exist? 

In [81]:
print("Expected DataFrame")
# convert the expected-values array into a labeled dataframe (astype('int') drops the decimals)
print(pd.DataFrame(expected.astype('int'), index=observed.index, columns=observed.columns))
print("\n")
print("Observed DataFrame")
print(observed)

Expected DataFrame
cyl          4  5   6   8
class                    
2seater      1  0   1   1
compact     16  0  15  14
midsize     14  0  13  12
minivan      3  0   3   3
pickup      11  0  11   9
subcompact  12  0  11  10
suv         21  1  20  18


Observed DataFrame
cyl          4  5   6   8
class                    
2seater      0  0   0   5
compact     32  2  13   0
midsize     16  0  23   2
minivan      1  0  10   0
pickup       3  0  10  20
subcompact  21  2   7   5
suv          8  0  16  38


Several cells contain noticeably more observations than expected: 2-seater vehicles with 8 cylinders, compact cars with 4 cylinders, midsize cars with 6 cylinders, minivans with 6 cylinders, pickups with 8 cylinders, subcompact cars with 4 cylinders, and finally SUVs with 8 cylinders. 

To identify which of these relationships are significant, I can test each of these "interesting groups" on its own, for example 2-seater vs. non-2-seater vehicles against 8-cylinder vs. non-8-cylinder vehicles.

In [82]:
# create a variable that is a 1 if a vehicle is a 2 seater, and a 0 otherwise.
mpg['class_2seater'] = (mpg['class'] == '2seater').astype('int')

# create a variable that is a 1 if the vehicle is 8 cylinders and a 0 otherwise. 
mpg['cyl_8'] = (mpg['cyl'] == 8).astype('int')

# generate a crosstab of these 2 new variables
observed = pd.crosstab(mpg['class_2seater'], mpg['cyl_8'])

observed

| class_2seater / cyl_8 | 0 | 1 |
| --- | --- | --- |
| 0 | 164 | 65 |
| 1 | 0 | 5 |


In [83]:
# run chi-square test
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [84]:
# evaluate results
eval_results(p, alpha, group1='2 seater cars', group2='8 cylinders')

There exists some relationship between 2 seater cars and the 8 cylinders. (p-value: 0.0030157710558452104)


Compare one more interesting group: SUVs and 8-cylinders. We have already created the 8-cylinder boolean variable, so we just need to create a boolean variable for SUV. 

In [85]:
# create a new variable that is a 1 if the class is an SUV and a 0 otherwise. 
mpg['class_SUV'] = (mpg['class'] == 'suv').astype('int')

# generate a crosstab
observed = pd.crosstab(mpg.class_SUV, mpg.cyl_8)
observed

| class_SUV / cyl_8 | 0 | 1 |
| --- | --- | --- |
| 0 | 140 | 32 |
| 1 | 24 | 38 |


In [86]:
# run chi-square test
chi2, p, degf, expected = stats.chi2_contingency(observed)

# evaluate results
eval_results(p, alpha, group1='SUVs', group2='8 cylinders')

There exists some relationship between SUVs and the 8 cylinders. (p-value: 8.702491537516895e-10)


I can repeat these steps for each interesting group I identified when comparing the expected and observed values, to verify which relationships are significant. If the stakes are low, I can instead test only the groups that seem least obvious (taking into account both the difference between the values AND the sample size); if those are significant, it is probably safe to assume that the other interesting groups are significant as well. 
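Repeating the one-vs-rest comparison by hand gets tedious, so it can be wrapped in a small function. The sketch below is one possible way to do that; the helper name `one_vs_rest_test` and the `cars` dataframe are hypothetical, and `cars` is rebuilt from the class-by-cylinder counts shown earlier so that the example runs without `pydataset`.

```python
import pandas as pd
from scipy import stats

# Class-by-cylinder counts copied from the observed crosstab above
counts = {
    '2seater':    {4: 0,  5: 0, 6: 0,  8: 5},
    'compact':    {4: 32, 5: 2, 6: 13, 8: 0},
    'midsize':    {4: 16, 5: 0, 6: 23, 8: 2},
    'minivan':    {4: 1,  5: 0, 6: 10, 8: 0},
    'pickup':     {4: 3,  5: 0, 6: 10, 8: 20},
    'subcompact': {4: 21, 5: 2, 6: 7,  8: 5},
    'suv':        {4: 8,  5: 0, 6: 16, 8: 38},
}

# Expand the counts back into one row per car
cars = pd.DataFrame([
    {'class': cls, 'cyl': cyl}
    for cls, by_cyl in counts.items()
    for cyl, k in by_cyl.items()
    for _ in range(k)
])


def one_vs_rest_test(df, col_a, value_a, col_b, value_b):
    '''
    Hypothetical helper: builds the 2x2 table of (col_a == value_a) vs.
    (col_b == value_b) and returns that table along with the p-value.
    '''
    is_a = (df[col_a] == value_a).rename(f'{col_a}_{value_a}')
    is_b = (df[col_b] == value_b).rename(f'{col_b}_{value_b}')
    table = pd.crosstab(is_a, is_b)
    chi2, p, degf, expected = stats.chi2_contingency(table)
    return table, p


# Test a few of the "interesting groups" in one loop
pairs = [('2seater', 8), ('suv', 8), ('compact', 4)]
results = {}
for vehicle_class, cylinders in pairs:
    table, p = one_vs_rest_test(cars, 'class', vehicle_class, 'cyl', cylinders)
    results[(vehicle_class, cylinders)] = p
    print(f'{vehicle_class} vs. {cylinders} cylinders: p = {p:.3g}')
```

Two cautions apply when reading these results. With only five 2-seaters in the data, the expected counts in that table are very small, so its p-value deserves extra skepticism. The more groups I test, the more likely it also becomes that one of them looks significant by chance alone.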

<hr style="border:2px solid gray">

## Exercises

Continue working in your `hypothesis_testing` notebook.

1. Answer with the type of stats test you would use (assume normal distribution): 

    - Do students get better test grades if they have a rubber duck on their desk?
    - Does smoking affect whether or not someone has lung cancer? 
    - Is gender independent of a person’s blood type?
    - A farming company wants to know if a new fertilizer has improved crop yield or not
    - Does the length of time of the lecture correlate with a student's grade? 
    - Do people with dogs live in apartments more than people with cats? 


2. Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.

    |               &nbsp;     | Codeup Student | Not Codeup Student |
    | --------------------- | -------------- | ------------------ |
    | Uses a Macbook        | 49             | 20                 |
    | Doesn't Use A Macbook | 1              | 30                 |


3. Choose another 2 categorical variables from the `mpg` dataset and perform a $\chi^2$ contingency table test with them. Be sure to state your null and alternative hypotheses.


4. Use the data from the employees database to answer these questions:

    - Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)
    - Is an employee's gender independent of whether or not they are or have been a manager?

<hr style="border:2px solid gray">

## Bonus Content

### Manual Calculation, Example 1

For this example, we will look at the dataset on cars that we explored previously.

As we did above, we will investigate the question of whether a car's drive type is independent of its transmission type.

- $H_{0}$ (Null Hypothesis): drive is independent of transmission type. 

- $H_{a}$ (Alternative Hypothesis): drive is dependent on transmission type. 

In [30]:
mpg = data('mpg')
mpg['transmission'] = mpg.trans.str[:-4] # a little cleaning goes a long way
mpg.head()

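| &nbsp; | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | transmission |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | auto |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | manual |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | manual |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | auto |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | auto |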


**Expected Values**

To begin with, we will calculate the values we would expect to see if the two groups are independent.

For each subgroup, we calculate the proportion of the total that it makes up, then multiply each transmission type's proportion by each drive type's proportion to determine the expected proportions.

To start with, we'll calculate the proportions for transmission type:

In [4]:
n = mpg.shape[0]

transmission_proportions = mpg.transmission.value_counts() / n
transmission_proportions

auto      0.67094
manual    0.32906
Name: transmission, dtype: float64

This tells us that cars with automatic transmissions make up ~ 67% of the total, and cars with manual transmissions make up ~ 33% of the total.

Now we'll do the same for drive types.

In [5]:
drive_proportions = mpg.drv.value_counts() / n
drive_proportions

f    0.452991
4    0.440171
r    0.106838
Name: drv, dtype: float64

To find the overall proportions, we multiply all the combinations of proportions together.

For example, to find the expected proportion of automatic transmission cars with 4-wheel drive, we would multiply those two proportions together.

$$ .67 * .44 = .2948 $$

So we would expect about 29.5% of the total cars to be automatic and 4-wheel drive.

Below we show some code that will loop through all of the proportions and perform this calculation for all combinations of groups.

In [6]:
expected = pd.DataFrame()

# .items() replaces .iteritems(), which was removed in pandas 2.0
for transmission_group, t_prop in transmission_proportions.items():
    for drive_group, d_prop in drive_proportions.items():
        expected.loc[drive_group, transmission_group] = t_prop * d_prop

expected.sort_index(inplace=True)
expected

| &nbsp; | auto | manual |
| --- | --- | --- |
| 4 | 0.295328 | 0.144843 |
| f | 0.30393 | 0.149061 |
| r | 0.071682 | 0.035156 |


To convert these proportions to expected counts, we multiply by the total number of observations:

In [7]:
expected *= n
expected

| &nbsp; | auto | manual |
| --- | --- | --- |
| 4 | 69.106838 | 33.893162 |
| f | 71.119658 | 34.880342 |
| r | 16.773504 | 8.226496 |
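The same expected counts can also be reached without looping over proportions, because each expected count equals (row total × column total) / n. The sketch below illustrates this with NumPy, using the observed drive/transmission counts from Example 1 entered by hand.

```python
import numpy as np

# Observed counts from Example 1: rows are drv (4, f, r),
# columns are transmission (auto, manual)
observed = np.array([[75, 28],
                     [65, 41],
                     [17, 8]])

row_totals = observed.sum(axis=1)  # cars per drive type
col_totals = observed.sum(axis=0)  # cars per transmission type
n = observed.sum()                 # total number of cars

# Outer product gives every row-total * column-total combination
expected_counts = np.outer(row_totals, col_totals) / n

print(expected_counts.round(2))
```

This is algebraically the same as multiplying the two proportions and then multiplying by n, since (row / n) × (col / n) × n = row × col / n.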


**Observed Values**

Now that we have the expected counts, we need the observed counts so that we can compare them. To do this, we'll use the `crosstab` function from pandas.

In [8]:
observed = pd.crosstab(mpg.drv, mpg.transmission)
observed

| drv / transmission | auto | manual |
| --- | --- | --- |
| 4 | 75 | 28 |
| f | 65 | 41 |
| r | 17 | 8 |


**Calculate Chi-Square**

Now we can calculate our test statistic, $\chi^2$.

In [9]:
chi2 = ((observed - expected)**2 / expected).values.sum()
chi2

3.136769245971112

We also need to find our degrees of freedom for the distribution. The degrees of freedom are given by:

$$ (\mbox{nrows} - 1) \times (\mbox{ncols} - 1) $$

Where nrows and ncols are the number of rows and columns in our contingency table.

In [10]:
nrows, ncols = observed.shape

degrees_of_freedom = (nrows - 1) * (ncols - 1)

Now, based on the test statistic and degrees of freedom, we could look up the corresponding p-value in a pre-calculated table, or use `scipy`'s chi2 distribution.

In [11]:
stats.chi2(degrees_of_freedom).sf(chi2)

0.20838152534979645

With a p-value this high (about 0.21, well above our alpha of 0.05), we fail to reject our null hypothesis.
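To make the `sf` call less of a black box: `sf` is the survival function, i.e. 1 minus the CDF, so it gives the probability of a $\chi^2$ value at least as large as ours. For the special case of 2 degrees of freedom, the $\chi^2$ distribution also has the simple closed form $P(X \ge x) = e^{-x/2}$, which makes a handy sanity check. The sketch below compares all three, using the test statistic computed above.

```python
import math
from scipy import stats

chi2 = 3.136769245971112   # test statistic from the manual calculation above
degrees_of_freedom = 2     # (3 - 1) * (2 - 1)

distribution = stats.chi2(degrees_of_freedom)

p_sf = distribution.sf(chi2)            # survival function: P(X >= chi2)
p_cdf = 1 - distribution.cdf(chi2)      # same quantity via the CDF
p_closed_form = math.exp(-chi2 / 2)     # closed form, valid only for 2 degrees of freedom

print(f'sf          : {p_sf:.6f}')
print(f'1 - cdf     : {p_cdf:.6f}')
print(f'exp(-x / 2) : {p_closed_form:.6f}')
```

The closed form only holds for 2 degrees of freedom; for other table shapes, `sf` is the general tool.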

### Manual Calculation, Example 2

**Observed Values**

Suppose we have the following contingency table:

| &nbsp;   | Product A | Product B |
| -------- | --------- | --------- |
| Churn    | 100       | 50        |
| No Churn | 120       | 28        |

And we want to know if a customer churning is independent of which product offering they have.

**Expected Values**

We have all the information that we need to run a $\chi^2$ test, because we can calculate the population proportions from the above table.

1. Find the row and column totals for Product A, Product B, Churn, and No Churn

    | &nbsp;   | Product A | Product B | &nbsp; |
    | -------- | --------- | --------- | ---    |
    | Churn    | 100       | 50        | 150    |
    | No Churn | 120       | 28        | 148    |
    |          | 220       | 78        | 298    |
    
1. Calculate the proportions

    - Product A = 220 / 298 = .738
    - Product B = 78 / 298 = .262
    - Churn = 150 / 298 = .503
    - No churn = 148 / 298 = .497
    
1. Multiply these together to produce a contingency table of expected values

    First we calculate proportions:

    | &nbsp;   | Product A | Product B |
    | ------   | --------- | --------- |
    | Churn    | 0.372     | 0.132     |
    | No Churn | 0.367     | 0.130     |
    
    Then we multiply by the total number of observations (298) to get the expected counts:
    
    |          | Product A | Product B |
    | -------  | --------- | --------- |
    | Churn    | 110.7     | 39.3      |
    | No Churn | 109.3     | 38.7      |

**Calculate Chi-Square**

1. Calculate the test statistic and compute a p-value

In [32]:
index = ['Churn', 'No Churn']
columns = ['Product A', 'Product B']

observed = pd.DataFrame([[100, 50], [120, 28]], index=index, columns=columns)
n = observed.values.sum()

expected = pd.DataFrame([[.372, .132], [.367, .130]], index=index, columns=columns) * n

chi2 = ((observed - expected)**2 / expected).values.sum()

nrows, ncols = observed.shape

degrees_of_freedom = (nrows - 1) * (ncols - 1)

p = stats.chi2(degrees_of_freedom).sf(chi2)

print('Observed')
print(observed)
print('---\nExpected')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
          Product A  Product B
Churn           100         50
No Churn        120         28
---
Expected
          Product A  Product B
Churn       110.856     39.336
No Churn    109.366     38.740
---

chi^2 = 7.9656
p     = 0.0048
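Note that the expected counts in the cell above come from proportions rounded to three decimal places, so they do not quite add up to the 298 observations. As a rough cross-check, the sketch below lets SciPy compute the expected counts from the unrounded totals. It passes `correction=False` because, for 2x2 tables, `chi2_contingency` otherwise applies Yates' continuity correction, which the manual formula does not include.

```python
import numpy as np
from scipy import stats

# Observed churn counts: rows are Churn / No Churn, columns are Product A / Product B
observed = np.array([[100, 50],
                     [120, 28]])

# Uncorrected test, comparable to the manual (O - E)^2 / E calculation
chi2_exact, p_exact, degf, expected_exact = stats.chi2_contingency(observed, correction=False)

# Default behavior for a 2x2 table: Yates' continuity correction is applied
chi2_yates, p_yates, _, _ = stats.chi2_contingency(observed)

print('Expected counts from unrounded totals:')
print(expected_exact.round(3))
print(f'chi^2 without correction = {chi2_exact:.4f}, p = {p_exact:.4f}')
print(f'chi^2 with Yates         = {chi2_yates:.4f}, p = {p_yates:.4f}')
```

The uncorrected statistic should land close to, but not exactly on, the value from the rounded proportions, and the conclusion (reject the null hypothesis at alpha = 0.05) is the same either way.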

