# Multiple Response Contingency Tables

In [12]:
import numpy as np
import numpy.random as rand
import pandas as pd

In [46]:
import os
import sys
# Add the parent directory to the import path so local app packages can be imported
PACKAGE_PARENT = '../..'
sys.path.append(os.path.normpath(PACKAGE_PARENT))

In [24]:
%load_ext autoreload
%autoreload 2

## A basic table:

Contingency tables, also known as "crosstabs", are used to summarize and compare the relationship between two or more experimental factors in a data set. For example, if we conducted a survey asking people 1) their age and 2) their favorite television show, we might want to see whether or not younger people prefer different shows than older people do.

Here's a brief example using simulated data:

In [16]:
age = np.random.choice(['18-36','37-54','55+'], size = 2000, p = [0.3,0.4,0.3])
favorite_show = np.random.choice(['NCIS','House of Cards','Westworld'], size = 2000, p = [0.2,0.4,0.4])
survey_results = pd.DataFrame({"age": age, "favorite_show": favorite_show})
survey_results.index.name = "respondent_id"
survey_results.head(10)

respondent_id,age,favorite_show
0,37-54,NCIS
1,55+,House of Cards
2,55+,Westworld
3,37-54,Westworld
4,18-36,NCIS
5,55+,Westworld
6,18-36,Westworld
7,55+,House of Cards
8,18-36,NCIS
9,18-36,House of Cards


In [39]:
from statsmodels.stats.contingency_tables import Table

table =  Table.from_data(survey_results)
print(table)

Contingency Table: 
favorite_show  House of Cards  NCIS  Westworld
age                                           
18-36                     223   117        252
37-54                     319   158        317
55+                       247   114        253


The table above makes it easy to look up, for example, how many 18-36-year-olds prefer House of Cards.
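
As a small illustration of that lookup, the sketch below rebuilds the printed counts as a plain pandas `DataFrame` (it does not use the `Table` object itself) and reads off one cell. The variable names `counts` and `row_shares` are only illustrative.

```python
import pandas as pd

# Counts copied from the printed contingency table above.
counts = pd.DataFrame(
    [[223, 117, 252],
     [319, 158, 317],
     [247, 114, 253]],
    index=pd.Index(["18-36", "37-54", "55+"], name="age"),
    columns=pd.Index(["House of Cards", "NCIS", "Westworld"], name="favorite_show"),
)

# How many 18-36-year-olds prefer House of Cards?
young_house_of_cards = counts.loc["18-36", "House of Cards"]

# Share of each age group choosing each show (every row sums to 1).
row_shares = counts.div(counts.sum(axis=1), axis=0)

print(young_house_of_cards)
print(row_shares.round(3))
```

Comparing the rows of `row_shares` is a quick informal check: if age and favorite show are unrelated, the rows should look roughly alike.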

With this table in hand we can perform a chi-squared test to evaluate whether or not there is a relationship between age and favorite show:

In [45]:
independence_result = table.test_nominal_association()
print(independence_result)

Contingency Table Independence Result:
chi-squared statistic: 1.6138440777244998
degrees of freedom: 4
p value: 0.8063017941381443



As you can see, the p-value is fairly close to one, so the table gives us essentially no evidence that age is related to favorite TV show. That's unsurprising, because age and favorite show were generated independently of each other.
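
To make the statistic less of a black box, here is a minimal sketch that recomputes it from the printed counts using only NumPy. It does not call the `Table` class, and the closed-form p-value used here is valid only for 4 degrees of freedom, which happens to be the case for a 3x3 table.

```python
import math
import numpy as np

# Counts copied from the printed age x favorite_show table above.
observed = np.array([[223, 117, 252],
                     [319, 158, 317],
                     [247, 114, 253]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected counts if age and favorite show were independent.
expected = row_totals @ col_totals / n

# Pearson chi-squared statistic and its degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of a chi-squared variable.
# This closed form holds only for 4 degrees of freedom.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(chi2, dof, p_value)
```

The three printed values should agree with the test result shown above, up to floating-point rounding.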

## A table with a relationship:

Now let's try again, this time generating the data in a way that builds in a relationship between our factors.

In [51]:
age = np.random.choice(['less than 18','19-36','37+'], size = 2000, p = [0.3,0.4,0.3])
survey_results = pd.DataFrame({"age": age})
survey_results.index.name = "respondent_id"

def weighted_choice(age):
    # weight tuples: (snapchat, instagram, facebook)
    weights = {'less than 18': (.5, .4, .1),
               '19-36': (.3, .3, .4),
               '37+': (.1, .2, .7)}
    choices = ("snapchat", "instagram", "facebook")
    favorite_network = np.random.choice(choices, p=weights[age])
    return favorite_network

favorite_social_network = survey_results.age.apply(weighted_choice)
survey_results['favorite_social_network'] = favorite_social_network
survey_results.head(10)

respondent_id,age,favorite_social_network
0,less than 18,instagram
1,19-36,instagram
2,37+,facebook
3,less than 18,instagram
4,19-36,facebook
5,less than 18,snapchat
6,19-36,snapchat
7,19-36,facebook
8,37+,instagram
9,19-36,facebook


In [52]:
from statsmodels.stats.contingency_tables import Table

table =  Table.from_data(survey_results)
print(table)

Contingency Table: 
favorite_social_network  facebook  instagram  snapchat
age                                                   
19-36                         335        239       263
37+                           411        121        52
less than 18                   58        221       300


In [53]:
independence_result = table.test_nominal_association()
print(independence_result)

Contingency Table Independence Result:
chi-squared statistic: 468.4061928530731
degrees of freedom: 4
p value: 0.0



As expected, the p-value here is so small that it prints as zero. In other words, the contingency table provides extremely strong evidence that people's age is related to their favorite social network.
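
A large statistic says that a relationship exists but not where it comes from. One common way to look further, sketched below under the assumption that we work directly from the printed counts rather than the `Table` object, is to compute Pearson residuals, (observed - expected) / sqrt(expected), for every cell. Cells with large positive residuals are over-represented relative to independence.

```python
import numpy as np
import pandas as pd

# Counts copied from the printed age x favorite_social_network table above.
observed = pd.DataFrame(
    [[335, 239, 263],
     [411, 121, 52],
     [58, 221, 300]],
    index=pd.Index(["19-36", "37+", "less than 18"], name="age"),
    columns=pd.Index(["facebook", "instagram", "snapchat"],
                     name="favorite_social_network"),
)

n = observed.to_numpy().sum()

# Expected counts under independence: (row total * column total) / n.
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n,
    index=observed.index,
    columns=observed.columns,
)

# Pearson residuals; their squares add up to the chi-squared statistic.
residuals = (observed - expected) / np.sqrt(expected)
chi2 = (residuals ** 2).to_numpy().sum()

# The (age, network) cell that is most over-represented.
most_overrepresented = residuals.stack().idxmax()

print(residuals.round(2))
print(chi2, most_overrepresented)
```

In this simulated data the largest positive residual belongs to the oldest group and facebook, which matches the weights used to generate the answers.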

## A table with multiple response factors

Most statistical tests designed for categorical data assume that the categories are mutually exclusive, i.e., each observation belongs to exactly one category. For example, if a survey asks "are you a French citizen?" the answer will be either "yes" or "no" or maybe "I don't know", but it will never be "both" or "neither".

In the real world, especially with survey data, questions often allow multiple answers. For example, a question might ask "which of the following movies have you seen: (Star Wars, The Godfather, Top Five)?" and it would be perfectly valid to answer "Star Wars **and** Top Five".

We are still free to build a contingency table showing the answers to multiple response questions, but these tables violate key assumptions that are required for chi-square tests of independence. So we cannot use traditional chi-square tests to evaluate whether the answers to the questions are independent.
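
A tiny example makes the violation concrete. The sketch below uses made-up answers from four hypothetical respondents to the movie question; all data and variable names are invented for illustration. Because a respondent can tick several boxes (or none), the item counts no longer add up to the number of respondents, so the cells of a table built from them are not counts of distinct, independent observations.

```python
import pandas as pd

# Hypothetical answers to "which of these movies have you seen?" (1 = selected).
movies = pd.DataFrame(
    {
        "Star Wars":     [1, 1, 0, 1],
        "The Godfather": [0, 1, 0, 0],
        "Top Five":      [1, 0, 0, 1],
    },
    index=pd.Index(range(4), name="respondent_id"),
)

# A single-response question would give exactly one selection per respondent.
selections_per_respondent = movies.sum(axis=1)

# Number of respondents who selected each movie.
item_counts = movies.sum(axis=0)

n_respondents = len(movies)
total_selections = int(item_counts.sum())

print(selections_per_respondent.tolist())
print(n_respondents, total_selections)
```

Here four respondents produce six selections, and one respondent contributes none at all.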

Instead, statsmodels provides a different type of contingency table class that automatically applies independence tests that are valid for multiple response data.

We will walk through an example using a data set extracted from a survey of undecided swing-state voters taken before the 2016 presidential election. The data set is available in statsmodels as `presidential2016`.

In [58]:
import statsmodels.api as sm
from statsmodels.datasets import presidential2016

data = sm.datasets.presidential2016.load_pandas()
data.data.head()

,Hillary_Clinton,Donald_Trump,Jill_Stein,Gary_Johnson,None_Of_The_Above,I_Probably_Wont_Vote,Hillary_Clinton_is_involved_in_many_coverups,Trump_changes_his_positions_all_of_the_time,Hillary_Clinton_lied_to_the_families_of_Americans_killed_in_Benghazi,Trump_is_a_successful_businessman,Trumps_temper_could_get_the_country_into_trouble,I_wish_another_candidate_had_won_the_primary,Need_to_do_more_research,Dont_like__any_candidate,Not_sure_which_candidate_shares_my_values,Waiting_for_debates
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


The first question asked "Which candidate do you think you are most likely to ultimately vote for in the 2016 presidential election?" The answers are in columns 1 through 6, with a '1' representing 'selected' and '0' representing 'not selected'. Each respondent could select only one candidate, making this a traditional single-select categorical variable.

The second question asked "Please check all of the statements you believe are true" and presented respondents with a list of assertions (shown in columns 7 through 11). Respondents could check as many options as they liked, making this a multiple response question.

The third question asked "Please select any factors that contribute to your not being sure who you'll vote for" and presented the reasons shown in columns 12 through 16. Respondents could pick as many as they liked, making this a multiple response question as well.
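
The column layout described above can be checked directly. The sketch below re-types the five rows shown in the output as a small `DataFrame` (so it does not depend on the data set loader) and splits it into the three question blocks. The names `candidate`, `believe_true`, `reasons`, and `expected_choice` are illustrative.

```python
import pandas as pd

columns = [
    # Question 1: single response (columns 1-6)
    "Hillary_Clinton", "Donald_Trump", "Jill_Stein", "Gary_Johnson",
    "None_Of_The_Above", "I_Probably_Wont_Vote",
    # Question 2: multiple response (columns 7-11)
    "Hillary_Clinton_is_involved_in_many_coverups",
    "Trump_changes_his_positions_all_of_the_time",
    "Hillary_Clinton_lied_to_the_families_of_Americans_killed_in_Benghazi",
    "Trump_is_a_successful_businessman",
    "Trumps_temper_could_get_the_country_into_trouble",
    # Question 3: multiple response (columns 12-16)
    "I_wish_another_candidate_had_won_the_primary", "Need_to_do_more_research",
    "Dont_like__any_candidate", "Not_sure_which_candidate_shares_my_values",
    "Waiting_for_debates",
]

# The five rows printed by data.data.head() above.
rows = [
    [0, 0, 0, 0, 1, 0,  0, 0, 0, 1, 0,  0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0,  0, 0, 0, 0, 0,  0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0,  0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0,  0, 0, 0, 0, 0,  0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0,  1, 0, 0, 0, 1,  0, 0, 1, 0, 0],
]
head = pd.DataFrame(rows, columns=columns, dtype=float)

candidate = head.iloc[:, :6]        # question 1
believe_true = head.iloc[:, 6:11]   # question 2
reasons = head.iloc[:, 11:16]       # question 3

# Single response: exactly one candidate per respondent,
# so the block collapses to a single label per row.
expected_choice = candidate.idxmax(axis=1)

# Multiple response: the number of ticks per respondent can be 0, 1, or more.
ticks_believe_true = believe_true.sum(axis=1)

print(expected_choice.tolist())
print(ticks_believe_true.tolist())
```

In these five rows every respondent picked exactly one candidate, while the number of statements marked as true ranges from zero to two.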

### Single response versus multiple response

The appropriate statistical tests differ somewhat depending on whether we are comparing two multiple response variables or one multiple response variable with one single response variable.

We will start with a single response versus a multiple response by comparing the first survey question with the second.

In [63]:
from statsmodels.stats.contingency_tables import Factor, MRCVTable

In [65]:
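# Question 1 (first six columns): single-response candidate choice
rows_factor = Factor(data.data.iloc[:, :6], data.data.columns[:6], "expected_choice", orientation="wide")
# Question 2 (next five columns): multiple-response "believe true" statements
columns_factor = Factor(data.data.iloc[:, 6:11], data.data.columns[6:11], "believe_true", orientation="wide")
multiple_response_table = MRCVTable([rows_factor], [columns_factor])
# One chi-squared statistic per statement in the multiple-response question
pairwise_chis = multiple_response_table._calculate_pairwise_chi2s_for_MMI_item_response_table(rows_factor, columns_factor)
pairwise_chis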

Hillary_Clinton_is_involved_in_many_coverups                            27.493592
Trump_changes_his_positions_all_of_the_time                             36.511000
Hillary_Clinton_lied_to_the_families_of_Americans_killed_in_Benghazi    33.065276
Trump_is_a_successful_businessman                                       19.305086
Trumps_temper_could_get_the_country_into_trouble                        16.238789
dtype: float64
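
The values above come from a private helper, and this notebook does not spell out what it computes. One plausible reading, stated here as an assumption rather than a description of the actual implementation, is that each value is an ordinary Pearson chi-squared statistic for the single-response variable crossed with the 0/1 indicator of one statement. The sketch below illustrates that idea on simulated data with hypothetical names (`choice`, `item_a`, `item_b`, `pearson_chi2`); it does not call `MRCVTable` and will not reproduce the numbers above.

```python
import numpy as np
import pandas as pd

def pearson_chi2(counts):
    """Pearson chi-squared statistic for a two-way table of counts."""
    counts = np.asarray(counts, dtype=float)
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    return ((counts - expected) ** 2 / expected).sum()

rng = np.random.default_rng(0)
n = 600

# Hypothetical single-response variable with three categories.
choice = pd.Series(rng.choice(["A", "B", "C"], size=n), name="choice")

# Hypothetical multiple-response items stored as 0/1 columns:
# item_a is determined by the choice, item_b is unrelated to it.
items = pd.DataFrame({
    "item_a": (choice == "A").astype(int),
    "item_b": rng.integers(0, 2, size=n),
})

# One chi-squared statistic per item: choice crossed with that item's indicator.
pairwise = pd.Series(
    {item: pearson_chi2(pd.crosstab(choice, items[item])) for item in items.columns}
)

print(pairwise)
```

Under this reading, a large value for an item suggests that agreement with that statement varies across the categories of the single-response question.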