# Counting with Crosstabs

### Exploring mental health survey data

We will be using the mental health survey data found on [Kaggle datasets][1]. This dataset is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.

[1]: https://www.kaggle.com/osmi/mental-health-in-tech-survey

In [None]:
import pandas as pd
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 200
mh = pd.read_csv('../data/mental_health.csv')
mh.head(3)

### Data Dictionary
The data dictionary describes the survey question behind each column.

In [None]:
mh_dd = pd.read_csv('../data/dictionaries/mental_health_dd.csv')
mh_dd

## Frequency counting with a Series
Previously, we learned how to count the frequency of values in a single column of data (a Series) with the `value_counts` method. The relative frequencies are returned by setting `normalize` to `True`.

In [None]:
mh['country'].value_counts()

In [None]:
mh['country'].value_counts(normalize=True).round(3)

## Counting the mental health occurrences by country
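The Series `value_counts` method counts the values of a single column, so it cannot count combinations of values across two or more columns. (In pandas 1.1 and later, DataFrames also have a `value_counts` method, but it returns a long result much like `groupby`.) Instead, we can use `groupby`, `pivot_table`, or `crosstab`. Let's see an example of each one.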

### `groupby`

In [None]:
mh.groupby(['country', 'treatment']).size()

### `pivot_table`

In [None]:
mh.pivot_table(index='country', columns='treatment', aggfunc='size', fill_value=0)

### `pd.crosstab`

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'])

### Comments on each method

### `groupby`
The `groupby` method is straightforward, but it leaves the data "long", which makes the result harder to read. Also, notice that if a combination of values does not appear in the dataset, such as Austria and Yes, then no row is present in the returned Series. See the "Extra" section below to see how to pivot this.
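
As a rough illustration of the long format and the missing combination, here is a minimal sketch on a small made-up DataFrame (the `toy` rows are an assumption for demonstration, not the survey data). It uses `unstack` with `fill_value=0` as one possible way to pivot the result.

```python
import pandas as pd

# Toy stand-in for the survey data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Canada', 'Canada', 'Canada', 'Germany'],
    'treatment': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
})

# Long format: one row per (country, treatment) pair that actually occurs.
long_counts = toy.groupby(['country', 'treatment']).size()

# Wide format: move 'treatment' into the columns and fill absent pairs with 0.
wide_counts = long_counts.unstack(fill_value=0)

print(long_counts)
print(wide_counts)
```

In this toy data, the pair (Austria, Yes) has no row in `long_counts`, while `wide_counts` shows it as an explicit 0.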

### `pivot_table`
We get the shape of the table that we want with `pivot_table`. Notice that it isn't necessary to specify the `values` parameter when the aggregating function is 'size'. The `fill_value=0` argument matters whenever some combination does not appear in the dataset: without it, that cell would hold a missing value instead of 0.
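
The effect of `fill_value` can be seen in a small sketch on made-up data (the `toy` DataFrame and its `age` column are assumptions for illustration only). It compares the same `pivot_table` call with and without the argument.

```python
import pandas as pd

# Toy stand-in for the survey data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Canada', 'Canada', 'Canada', 'Germany'],
    'treatment': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
    'age': [31, 45, 28, 39, 52, 24],
})

# Without fill_value, combinations that never occur show up as NaN.
without_fill = toy.pivot_table(index='country', columns='treatment',
                               aggfunc='size')

# With fill_value=0, those cells become explicit zero counts.
with_fill = toy.pivot_table(index='country', columns='treatment',
                            aggfunc='size', fill_value=0)

print(without_fill)
print(with_fill)
```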

### `crosstab`
The `crosstab` function is built specifically for this situation. By default, it counts how often each combination of values occurs in the given columns. Unfortunately, it is a function and not a DataFrame method, so we must pass each column as a Series rather than by name.

## Which method to choose
Any of the three methods is acceptable, but for display purposes a pivoted table is easier to read in this situation, so `pivot_table` and `crosstab` are preferable.
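
To check that the three approaches agree on the counts themselves, here is a short sketch on made-up data (the `toy` DataFrame is an assumption, not the survey data). It builds the same table three ways and compares the underlying values.

```python
import pandas as pd

# Toy stand-in for the survey data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Canada', 'Canada', 'Canada', 'Germany'],
    'treatment': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
    'age': [31, 45, 28, 39, 52, 24],
})

# groupby needs an extra unstack step to reach the wide layout.
by_groupby = toy.groupby(['country', 'treatment']).size().unstack(fill_value=0)

# pivot_table and crosstab return the wide layout directly.
by_pivot = toy.pivot_table(index='country', columns='treatment',
                           aggfunc='size', fill_value=0)
by_crosstab = pd.crosstab(index=toy['country'], columns=toy['treatment'])

# Compare the raw counts, ignoring dtype and axis-name differences.
same_counts = (
    (by_groupby.to_numpy() == by_crosstab.to_numpy()).all()
    and (by_pivot.to_numpy() == by_crosstab.to_numpy()).all()
)
print(same_counts)
```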

## Relative frequencies - only available with `crosstab`
The big benefit of using the `crosstab` function is its ability to return relative frequencies with the `normalize` parameter. This isn't easily doable with `groupby` or `pivot_table`. You can normalize over the rows, columns or all the data.

For instance, to find the percentage of people who have sought treatment in each country, you can normalize across each row like this.

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], normalize='index').round(2)
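
For intuition about what `normalize='index'` computes, the sketch below reproduces it by hand on made-up data (the `toy` DataFrame is an assumption, not the survey data): each count is divided by its row total, so every row sums to 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Canada', 'Canada', 'Canada', 'Germany'],
    'treatment': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
})

# Raw counts, then a manual division of each row by its row total.
counts = pd.crosstab(index=toy['country'], columns=toy['treatment'])
manual = counts.div(counts.sum(axis=1), axis=0)

# The built-in equivalent.
built_in = pd.crosstab(index=toy['country'], columns=toy['treatment'],
                       normalize='index')

print(manual.round(2))
print(np.allclose(manual.to_numpy(), built_in.to_numpy()))
```

The same manual division works on the output of `pivot_table`, which is the extra step that `normalize` saves.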

To find the share of each column that each country represents, normalize over the columns:

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], normalize='columns').round(3)

To find the relative frequency against all the data, normalize over everything. For instance, 2.1% of all respondents are Germans who have not received mental health treatment.

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], normalize='all').round(3)

You can add margins (row and column totals) as well.

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], 
            normalize='all', margins=True).round(3)

### Commonly called contingency tables
The table of frequency counts is commonly called a [contingency table][2]. We can test whether one group differs from another by applying a chi-squared test to the counts. See the Extra section for how to apply this test in Python.

[2]: https://en.wikipedia.org/wiki/Contingency_table
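
As a minimal sketch of the idea behind the test, the chi-squared statistic can be computed by hand from a table of counts. The counts and the `group_a`/`group_b` labels below are made up for illustration and do not come from the survey. The expected counts assume the row and column variables are independent.

```python
import numpy as np
import pandas as pd

# Made-up contingency table of counts (not taken from the survey data).
observed = pd.DataFrame([[30, 10], [20, 40]],
                        index=['group_a', 'group_b'],
                        columns=['No', 'Yes'])

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = (((observed - expected) ** 2) / expected).to_numpy().sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(chi2, dof)
```

A large statistic relative to the degrees of freedom suggests the groups differ. In practice, a library routine such as `scipy.stats.chi2_contingency` is the usual way to get a p-value as well; note that it applies a continuity correction to 2x2 tables by default, so its statistic may differ from this hand calculation.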

### `crosstab` is almost unnecessary in pandas
It's important to know that `crosstab` and `pivot_table` are extremely similar and `crosstab` would be basically unnecessary if `pivot_table` had an easy way to normalize counts. The only case that necessitates `crosstab` is when creating a contingency table that normalizes the counts. Otherwise, you can use `pivot_table`.

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Do people with a family history of mental illness seek treatment more often than those who do not?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Find the total number and ratio of employees that seek treatment for companies that provide health benefits vs those that do not.</span>

### Exercise 3
<span  style="color:green; font-size:16px">You can provide a list of multiple columns to both the `index` and `columns` parameters of the `crosstab` function. Put country and number of employees in the index and benefits and treatment in the columns. It's probably easier to make separate list variables first.</span>