## Pandas Tutorial 13: Crosstab Tutorial
A crosstab (contingency table) shows the frequency distribution of one variable in rows against another in columns. The pandas `crosstab()` function is commonly used to build these tables, which are widely used in survey analysis and business analytics.

#### Topics covered:
* **What is a crosstab (contingency table)?**
* **Creating a contingency table with `crosstab()`**
* **Accessing Pandas Crosstab documentation in Jupyter Notebook**

This tutorial will guide you through generating and using contingency tables for insightful data analysis.

In [12]:
import pandas as pd
import numpy as np
# Quote the requirement so the shell does not treat ">=2.0.1" as an output redirect
!pip install "xlrd>=2.0.1"

In [5]:
df = pd.read_excel("survey.xls")
df

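| | Name | Nationality | Sex | Age | Handedness |
|---|---|---|---|---|---|
| 0 | Kathy | USA | Female | 23 | Right |
| 1 | Linda | USA | Female | 18 | Right |
| 2 | Peter | USA | Male | 19 | Right |
| 3 | John | USA | Male | 22 | Left |
| 4 | Fatima | Bangadesh | Female | 31 | Left |
| 5 | Kadir | Bangadesh | Male | 25 | Left |
| 6 | Dhaval | India | Male | 35 | Left |
| 7 | Sudhir | India | Male | 31 | Left |
| 8 | Parvir | India | Male | 37 | Right |
| 9 | Yan | China | Female | 52 | Right |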

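If `survey.xls` is not at hand, the crosstab columns can be rebuilt inline. The sketch below is a stand-in, not the original file. The display above lists ten rows, while the totals in the outputs that follow count twelve respondents, so the last two rows here are inferred from those outputs (two left-handed respondents from China). The `Name` column is omitted because no crosstab in this tutorial uses it.

```python
import pandas as pd

# Stand-in for pd.read_excel("survey.xls").
# Rows 0-9 match the displayed table; rows 10-11 are assumptions inferred
# from the crosstab outputs shown later in the notebook.
df = pd.DataFrame({
    "Nationality": ["USA"] * 4 + ["Bangadesh"] * 2 + ["India"] * 3 + ["China"] * 3,
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Age": [23, 18, 19, 22, 31, 25, 35, 31, 37, 52, 58, 43],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

print(df.shape)
```

With this frame, the `crosstab()` calls in the rest of the tutorial should produce the same counts as the recorded outputs.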

## Using `crosstab()` to Create a Contingency Table
The `crosstab()` function generates a contingency table that shows the frequency distribution of two categorical variables - in this case, `Nationality` and `Handedness`.

**Key Features:**
* **Rows** (`df.Nationality`): One row per distinct nationality.
* **Columns** (`df.Handedness`): One column per handedness value, holding the count of left- or right-handed respondents for each nationality.
* **Result**: A table that summarizes the relationship between the two categorical variables.

This is useful for comparing and analyzing categorical data.

In [6]:
# Creates a contingency table showing the frequency of Handedness for each Nationality
pd.crosstab(df.Nationality, df.Handedness)

| Nationality \ Handedness | Left | Right |
|---|---|---|
| Bangadesh | 2 | 0 |
| China | 2 | 1 |
| India | 2 | 1 |
| USA | 1 | 3 |

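As a rough cross-check of what `crosstab()` counts, the same table can be assembled by hand with `groupby`. This is a sketch for illustration, not how pandas implements `crosstab()`. The DataFrame is rebuilt inline as a stand-in for `survey.xls`, and its last two rows are inferred from the notebook's outputs rather than taken from the displayed table.

```python
import pandas as pd

# Stand-in for survey.xls; the last two China rows are inferred assumptions.
df = pd.DataFrame({
    "Nationality": ["USA"] * 4 + ["Bangadesh"] * 2 + ["India"] * 3 + ["China"] * 3,
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

ct = pd.crosstab(df.Nationality, df.Handedness)

# Count rows per (Nationality, Handedness) pair, then pivot Handedness into
# columns; pairs that never occur (e.g. Bangadesh/Right) are filled with 0.
manual = df.groupby(["Nationality", "Handedness"]).size().unstack(fill_value=0)

print((ct.values == manual.values).all())
```

The `fill_value=0` step matters: without it, combinations that never occur would show up as missing values instead of zero counts.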

## Using `crosstab()` for Sex and Handedness
The `crosstab()` function generates a contingency table that shows the frequency distribution of `Sex` and `Handedness`.

**Key Features:**
* **Rows** (`df.Sex`): Displays the distribution of sexes.
* **Columns** (`df.Handedness`): Shows the frequency of left- or right-handedness for each sex.
* **Result**: A table summarizing the relationship between `Sex` and `Handedness`.

This allows for comparison and analysis of categorical data across these two variables.

In [7]:
# Creates a contingency table showing the frequency of Handedness for each Sex
pd.crosstab(df.Sex, df.Handedness)

| Sex \ Handedness | Left | Right |
|---|---|---|
| Female | 2 | 3 |
| Male | 5 | 2 |

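The result of `crosstab()` is an ordinary DataFrame, so individual counts can be read back with `.loc`. A minimal sketch, assuming a DataFrame rebuilt inline as a stand-in for `survey.xls` (its last two rows are inferred from the notebook's outputs); the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for survey.xls; the last two rows are inferred assumptions.
df = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

ct = pd.crosstab(df.Sex, df.Handedness)

male_left = ct.loc["Male", "Left"]   # a single cell
female_row = ct.loc["Female"]        # one row as a Series
left_column = ct["Left"]             # one column as a Series

print(male_left)
print(female_row)
```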

## Using `crosstab()` with Margins

The `margins=True` argument adds row and column totals (margins) to the contingency table, making it easier to analyze the overall distribution.

**Key Features:**
* **Rows** (`df.Sex`): Displays the distribution of sexes.
* **Columns** (`df.Handedness`): Shows the frequency of handedness.
* **`margins=True`**: Adds row and column totals.
* **Result**: A table with totals for each category and overall totals.

This is useful for quickly summarizing the total counts in each category.

In [8]:
# Creates a contingency table with totals (margins) for rows and columns
pd.crosstab(df.Sex, df.Handedness, margins=True)

| Sex \ Handedness | Left | Right | All |
|---|---|---|---|
| Female | 2 | 3 | 5 |
| Male | 5 | 2 | 7 |
| All | 7 | 5 | 12 |

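The label used for the totals defaults to `All`, and `crosstab()` accepts a `margins_name` argument to change it. The sketch below renames the margin to `Total` (an arbitrary choice for illustration). The DataFrame is a stand-in for `survey.xls`, with its last two rows inferred from the notebook's outputs.

```python
import pandas as pd

# Stand-in for survey.xls; the last two rows are inferred assumptions.
df = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

# Same table as above, but the margin row and column are labelled "Total".
ct = pd.crosstab(df.Sex, df.Handedness, margins=True, margins_name="Total")

# The bottom-right cell is the grand total of all respondents.
grand_total = ct.loc["Total", "Total"]
print(ct)
```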

## Using `crosstab()` with Multiple Variables and Margins

Passing a list of Series as the column argument to `crosstab()` produces multi-level columns. Here, `Sex` forms the rows and each column is a (`Handedness`, `Nationality`) pair, with totals added by `margins=True`.

**Key Features**:
* **Rows** (`df.Sex`): Displays the distribution of sexes.
* **Columns** (`[df.Handedness, df.Nationality]`): Shows a multi-level column distribution for handedness and nationality.
* **`margins=True`**: Adds row and column totals.
* **Result**: A multi-variable contingency table with overall totals.

This is useful for analyzing relationships between multiple categorical variables.

In [9]:
# Creates a contingency table with Sex, Handedness, and Nationality, including totals (margins)
pd.crosstab(df.Sex, [df.Handedness, df.Nationality], margins=True)

| Sex \ (Handedness, Nationality) | Left, Bangadesh | Left, China | Left, India | Left, USA | Right, China | Right, India | Right, USA | All |
|---|---|---|---|---|---|---|---|---|
| Female | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 5 |
| Male | 1 | 1 | 2 | 1 | 0 | 1 | 1 | 7 |
| All | 2 | 2 | 2 | 1 | 1 | 1 | 3 | 12 |

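A table with two column levels can be sliced by either level. The sketch below shows one way to do this with standard pandas indexing, leaving out the margins to keep the selection simple. The DataFrame is a stand-in for `survey.xls`, with its last two rows inferred from the notebook's outputs.

```python
import pandas as pd

# Stand-in for survey.xls; the last two China rows are inferred assumptions.
df = pd.DataFrame({
    "Nationality": ["USA"] * 4 + ["Bangadesh"] * 2 + ["India"] * 3 + ["China"] * 3,
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

ct = pd.crosstab(df.Sex, [df.Handedness, df.Nationality])

# Outer level: all "Left" columns, one per nationality.
left_only = ct["Left"]

# Inner level: the China columns for both handedness values.
china = ct.xs("China", axis=1, level="Nationality")

print(left_only)
print(china)
```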

## Using `crosstab()` with Multiple Row Variables and Margins
The `crosstab()` function allows multiple variables in the rows. This example shows the frequency distribution of `Nationality` and `Sex` against `Handedness`, with totals for each.

**Key Features**:
* **Rows** (`[df.Nationality, df.Sex]`): Displays a multi-level row index for nationality and sex.
* **Columns** (`[df.Handedness]`): Shows handedness distribution.
* **`margins=True`**: Adds row and column totals.
* **Result**: A multi-level contingency table with overall totals.

This is ideal for analyzing relationships between multiple categorical variables in the rows.

In [10]:
# Creates a contingency table with Nationality and Sex as rows, and Handedness as columns, including totals (margins)
pd.crosstab([df.Nationality, df.Sex], [df.Handedness], margins=True)

| Nationality | Sex \ Handedness | Left | Right | All |
|---|---|---|---|---|
| Bangadesh | Female | 1 | 0 | 1 |
| Bangadesh | Male | 1 | 0 | 1 |
| China | Female | 1 | 1 | 2 |
| China | Male | 1 | 0 | 1 |
| India | Male | 2 | 1 | 3 |
| USA | Female | 0 | 2 | 2 |
| USA | Male | 1 | 1 | 2 |
| All | | 7 | 5 | 12 |

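With a two-level row index, `.loc` can pick out one nationality or one (nationality, sex) pair, and grouping on a single level collapses the table back to a simpler one. A minimal sketch without margins, using a stand-in for `survey.xls` whose last two rows are inferred from the notebook's outputs.

```python
import pandas as pd

# Stand-in for survey.xls; the last two China rows are inferred assumptions.
df = pd.DataFrame({
    "Nationality": ["USA"] * 4 + ["Bangadesh"] * 2 + ["India"] * 3 + ["China"] * 3,
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

ct = pd.crosstab([df.Nationality, df.Sex], df.Handedness)

# All rows for one nationality (indexed by Sex).
china = ct.loc["China"]

# A single cell, addressed by the full (Nationality, Sex) row key.
india_male_left = ct.loc[("India", "Male"), "Left"]

# Summing over Sex recovers the Nationality x Handedness table.
by_nationality = ct.groupby(level="Nationality").sum()

print(china)
print(by_nationality)
```

Only observed (nationality, sex) pairs appear as rows, which is why there is no India/Female row.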

## Using `crosstab()` with Row Normalization

The `normalize='index'` argument divides each count by its row total, showing the proportion of each handedness category within each sex.

**Key Features**:
* **Rows** (`df.Sex`): Displays the distribution of sexes.
* **Columns** (`df.Handedness`): Shows the proportion of handedness within each sex category.
* **`normalize='index'`**: Normalizes the values row-wise, displaying the proportions within each sex group.
* **Result**: A normalized table where the sum of each row equals 1 (or 100%).

This is useful for understanding the relative proportions within each group.

In [11]:
# Creates a normalized contingency table by row (Sex)
pd.crosstab(df.Sex, df.Handedness, normalize='index')

| Sex \ Handedness | Left | Right |
|---|---|---|
| Female | 0.4 | 0.6 |
| Male | 0.714286 | 0.285714 |

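Besides `'index'`, the `normalize` argument also accepts `'columns'` (each column sums to 1) and `'all'` (the whole table sums to 1). The sketch below compares these and reproduces the row-wise version by hand as a sanity check. The DataFrame is a stand-in for `survey.xls`, with its last two rows inferred from the notebook's outputs.

```python
import pandas as pd

# Stand-in for survey.xls; the last two rows are inferred assumptions.
df = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

# Share of each sex within each handedness category (columns sum to 1).
by_col = pd.crosstab(df.Sex, df.Handedness, normalize="columns")

# Share of each cell in the whole sample (all cells sum to 1).
overall = pd.crosstab(df.Sex, df.Handedness, normalize="all")

# Row-wise normalization done by hand: divide each row by its own total.
counts = pd.crosstab(df.Sex, df.Handedness)
manual_rows = counts.div(counts.sum(axis=1), axis=0)

print(by_col)
print(overall)
print(manual_rows)
```

Multiplying any of these tables by 100 gives percentages instead of proportions.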

## Using `crosstab()` with Aggregation

The `values` argument specifies a numerical column to aggregate (here, `Age`), and `aggfunc=np.average` computes the average of those values for each combination of `Sex` and `Handedness`.

**Key Features**:
* **Rows** (`df.Sex`): Displays the distribution of sexes.
* **Columns** (`df.Handedness`): Shows handedness.
* **`values=df.Age`**: Uses the `Age` column for aggregation.
* **`aggfunc=np.average`**: Calculates the average age for each combination of sex and handedness.
* **Result**: A table showing the average age for each group.

This is useful for performing summary statistics across categorical groups.

In [13]:
# Creates a contingency table showing the average age for each combination of Sex and Handedness
pd.crosstab(df.Sex, df.Handedness, values=df.Age, aggfunc=np.average)

| Sex \ Handedness | Left | Right |
|---|---|---|
| Female | 44.5 | 31.0 |
| Male | 31.2 | 28.0 |

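The aggregation can also be written with the string `"mean"` in place of `np.average`, and the same table can be obtained from `DataFrame.pivot_table`. The sketch below shows both as alternatives to the call above. The DataFrame is a stand-in for `survey.xls`; its last two rows (ages 58 and 43) are inferred from the averages in the notebook's output.

```python
import pandas as pd

# Stand-in for survey.xls; the last two rows are inferred assumptions.
df = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Age": [23, 18, 19, 22, 31, 25, 35, 31, 37, 52, 58, 43],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

# Mean age per (Sex, Handedness) cell, using a string aggregation name.
mean_age = pd.crosstab(df.Sex, df.Handedness, values=df.Age, aggfunc="mean")

# The same summary via pivot_table, which takes column names, not Series.
via_pivot = df.pivot_table(index="Sex", columns="Handedness",
                           values="Age", aggfunc="mean")

print(mean_age)
print(via_pivot)
```

Other aggregation names such as `"median"`, `"min"`, or `"max"` can be swapped in the same way.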