## IS6 in Python: Relationships Between Categorical Variables–Contingency Tables (Chapter 3)

### Introduction and background

This document is intended to help students work through the examples shown in the Sixth Edition of Intro Stats (2022) by De Veaux, Velleman, and Bock. This PDF file, as well as the associated ipynb source file used to create it, can be found at (INSERT WEBSITE LINK HERE).

### Chapter 3: Relationships Between Categorical Variables–Contingency Tables
#### Section 3.1: Contingency Tables

In [20]:
#Import the libraries needed for this chapter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [28]:
#Table 3.1, page 68

okcupid = pd.read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv")
#Clean dataframe: the first row holds the real column names, so use it as the header, then drop it
okcupid = okcupid.rename(columns = okcupid.iloc[0])
okcupid = okcupid.drop(labels = 0)

#Make contingency table
table = pd.crosstab(index = okcupid["CatsDogsBoth"], columns = okcupid["Gender"], margins = True, margins_name = "Total")
table

| CatsDogsBoth / Gender | F | M | Total |
|---|---|---|---|
| Has Both | 897 | 577 | 1474 |
| Has cats | 3412 | 2388 | 5800 |
| Has dogs | 3431 | 3587 | 7018 |
| Total | 7740 | 6552 | 14292 |


#### New functions explained:
Note that in this example, data_frame is called okcupid

1. data_frame.rename(columns = ...): Rename axis labels. Here we wanted to rename the columns, so we used the columns argument, passing the first row (okcupid.iloc[0]) as the new names.
2. data_frame.drop(labels = ...): Remove rows or columns by index or column label. Here we wanted to remove the first row, which now duplicates the column names, so we passed the index label 0.

Note that in this example, pandas is called pd

3. pandas.crosstab(index = [series], columns = [series], margins = True/False, margins_name = "string"): Compute a contingency table, where index gives the values to group by in the rows, columns gives the values to group by in the columns, margins specifies whether to add row and column margins (subtotals), and margins_name sets the label used for those margins.
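As a minimal sketch of the three functions above, the following cell uses a small made-up data frame rather than the OkCupid file. The names `raw`, `toy`, `toy_table`, `V1`, `V2`, and `Pet` are hypothetical and chosen only for illustration; the first row of `raw` holds the real column names, mimicking the layout that the cleaning step assumes.

```python
import pandas as pd

# Hypothetical toy data (not the OkCupid data): the real column
# names sit in row 0, as the cleaning step above assumes.
raw = pd.DataFrame({
    "V1": ["Pet", "Has cats", "Has dogs", "Has dogs", "Has cats", "Has dogs"],
    "V2": ["Gender", "F", "M", "F", "F", "M"],
})

# Use row 0 as the column names, then drop that row.
toy = raw.rename(columns=raw.iloc[0])
toy = toy.drop(labels=0)

# Counts with row and column totals labeled "Total".
toy_table = pd.crosstab(index=toy["Pet"], columns=toy["Gender"],
                        margins=True, margins_name="Total")
print(toy_table)
```

With only five rows it is easy to count the cells by hand and confirm that the margins are the row and column sums.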

In [40]:
#Table 3.2, page 69

okcupid = pd.read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv")
#Clean dataframe
okcupid = okcupid.rename(columns = okcupid.iloc[0])
okcupid = okcupid.drop(labels = 0)

#Make contingency table, column percent format
table = pd.crosstab(index = okcupid["CatsDogsBoth"], columns = okcupid["Gender"], margins = True, margins_name = "Total",
                    normalize = "columns")
table

| CatsDogsBoth / Gender | F | M | Total |
|---|---|---|---|
| Has Both | 0.115891 | 0.088065 | 0.103135 |
| Has cats | 0.440827 | 0.364469 | 0.405821 |
| Has dogs | 0.443282 | 0.547466 | 0.491044 |


#### Same crosstab() function, new argument:

To make a contingency table, formatted as column percents, we still use the same crosstab() function. Very convenient!
Note that in this example, pandas is called pd

pandas.crosstab(..., normalize = {True, False} or {"all", "index", "columns"} or {0,1}): Everything is the same as in the crosstab() function introduced above, with the addition of normalize. The normalize argument divides each count by a sum of counts, so the table shows proportions instead of counts. Which sum is used depends on the value passed:
- If passed ‘all’ or True, will normalize over all values.
- If passed ‘index’ will normalize over each row.
- If passed ‘columns’ will normalize over each column.
- If margins is True, will also normalize margin values.
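The effect of each normalize option can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical values, not the OkCupid data) and confirms which sums come out to 1 under each option. It also shows one way to display percents rather than proportions, since crosstab() returns proportions.

```python
import pandas as pd

# Hypothetical toy data (not the OkCupid data).
toy = pd.DataFrame({
    "Pet": ["Has cats", "Has dogs", "Has dogs", "Has cats", "Has dogs"],
    "Gender": ["F", "M", "F", "F", "M"],
})

by_column = pd.crosstab(toy["Pet"], toy["Gender"], normalize="columns")
by_row = pd.crosstab(toy["Pet"], toy["Gender"], normalize="index")
overall = pd.crosstab(toy["Pet"], toy["Gender"], normalize="all")

# Each option divides the counts by a different sum.
print(by_column.sum(axis=0))      # each column sums to 1 (up to rounding)
print(by_row.sum(axis=1))         # each row sums to 1 (up to rounding)
print(overall.to_numpy().sum())   # all cells together sum to 1 (up to rounding)

# Multiply by 100 to display percents instead of proportions.
print((by_column * 100).round(1))
```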

In [37]:
#Table 3.3, page 69

okcupid = pd.read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv")
#Clean dataframe: the first row holds the real column names, so use it as the header, then drop it
okcupid = okcupid.rename(columns = okcupid.iloc[0])
okcupid = okcupid.drop(labels = 0)

#Make contingency table, row percent format
table = pd.crosstab(index = okcupid["CatsDogsBoth"], columns = okcupid["Gender"], margins = True, margins_name = "Total",
                    normalize = "index")
table

| CatsDogsBoth / Gender | F | M |
|---|---|---|
| Has Both | 0.608548 | 0.391452 |
| Has cats | 0.588276 | 0.411724 |
| Has dogs | 0.488886 | 0.511114 |
| Total | 0.541562 | 0.458438 |


In [39]:
#Table 3.3, page 69

okcupid = pd.read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv")
#Clean dataframe
okcupid = okcupid.rename(columns = okcupid.iloc[0])
okcupid = okcupid.drop(labels = 0)

#Make contingency table, table percent format
table = pd.crosstab(index = okcupid["CatsDogsBoth"], columns = okcupid["Gender"], margins = True, margins_name = "Total",
                    normalize = "all")
table

| CatsDogsBoth / Gender | F | M | Total |
|---|---|---|---|
| Has Both | 0.062762 | 0.040372 | 0.103135 |
| Has cats | 0.238735 | 0.167086 | 0.405821 |
| Has dogs | 0.240064 | 0.25098 | 0.491044 |
| Total | 0.541562 | 0.458438 | 1.0 |
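To see where these proportions come from, the column, row, and table proportions can be recomputed directly from the counts in Table 3.1. This is only a sketch: the counts are typed in by hand from the first table in this section, and the names `counts`, `column_props`, `row_props`, and `table_props` are our own.

```python
import pandas as pd

# Counts copied from Table 3.1 above (without the Total margins).
counts = pd.DataFrame(
    {"F": [897, 3412, 3431], "M": [577, 2388, 3587]},
    index=["Has Both", "Has cats", "Has dogs"],
)

# Divide each count by its column total, its row total, or the grand total.
column_props = counts.div(counts.sum(axis=0), axis=1)
row_props = counts.div(counts.sum(axis=1), axis=0)
table_props = counts / counts.to_numpy().sum()

print(column_props.round(6))
print(row_props.round(6))
print(table_props.round(6))
```

For example, the "Has Both" and "F" cell is 897/7740 as a column proportion, 897/1474 as a row proportion, and 897/14292 as a table proportion, matching the three normalized tables in this section.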


#### Example 3.1: Exploring Marginal Distributions
