# Counting with Crosstabs

In this chapter, we explore the pandas `crosstab` function, which produces results very similar to those of the `pivot_table` method but offers a cleaner way to count occurrences, along with extra functionality such as normalization.

### Exploring mental health survey data

We will be using mental health survey data found on [Kaggle datasets][1]. This dataset is from a 2014-2015 survey that measured attitudes toward mental health and the frequency of mental health disorders in the tech workplace. Let's read in the dataset and output the number of rows and columns.

[1]: https://www.kaggle.com/osmi/mental-health-in-tech-survey

In [None]:
import pandas as pd
mh = pd.read_csv('../data/mental_health.csv')
mh.head(3)

In [None]:
mh.shape

### Data Dictionary

The data dictionary will help you understand the question behind the data collected in each column. Some of the column descriptions are longer than the default display width of 50 characters, so we change the `'display.max_colwidth'` option before outputting the data dictionary.

In [None]:
pd.set_option('display.max_colwidth', 100)
mh_dd = pd.read_csv('../data/dictionaries/mental_health_dd.csv')
mh_dd

### Converting columns to categorical

All of the questions appear to have a limited number of discrete answer choices. Counting the unique values in each column confirms this, with age being the only exception.

In [None]:
mh.nunique()

Let's convert all of the string columns (only year and age are numeric) to categorical. Instead of converting each column individually, we can convert an entire selection of the DataFrame and overwrite the old columns at the same time. This will save memory and help performance when grouping.

In [None]:
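# Assign with [] (not .loc) so the columns are replaced and the new dtype sticks
cat_cols = mh.loc[:, 'gender':].columns
mh[cat_cols] = mh[cat_cols].astype('category')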
mh.dtypes
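The memory claim above can be checked on a small scale. The following is a minimal sketch using a made-up Series (the name `answers` and its values are assumptions, not part of the survey data) that compares the memory footprint of repeated strings before and after conversion to categorical.

```python
import pandas as pd

# Hypothetical toy column: a few distinct answers repeated many times
answers = pd.Series(['Yes', 'No', "Don't know"] * 10_000)

# deep=True includes the memory used by the strings themselves
as_strings = answers.memory_usage(deep=True)
as_category = answers.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
```

The categorical version stores each distinct answer once and keeps only small integer codes per row, which is why it tends to be much smaller when values repeat often.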

## Frequency counting with a Series

Previously, we learned how to count the frequency of values of a single column of data as a Series with the `value_counts` method. Let's review this by finding the number of survey respondents by country. 

In [None]:
mh['country'].value_counts()

The relative frequencies are returned by setting `normalize` to `True`.

In [None]:
mh['country'].value_counts(normalize=True).round(3)

## Counting the mental health occurrences by country

If we are interested in counting the co-occurrence of values appearing in two or more columns, we can use the DataFrame `value_counts`, `groupby`, or `pivot_table` methods. Let's see examples of counting the occurrences of seeking mental health treatment (the `'treatment'` column) by country.
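Of the three methods mentioned, the DataFrame `value_counts` method is the most direct. Here is a minimal sketch on a made-up DataFrame (the `toy` data is an assumption used only for illustration) that counts each country and treatment pair.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey columns
toy = pd.DataFrame({
    'country': ['US', 'US', 'UK', 'US', 'UK', 'FR'],
    'treatment': ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No'],
})

# Count each (country, treatment) pair; the result is a Series
# with a two-level index, sorted by count in descending order
pair_counts = toy.value_counts(['country', 'treatment'])
print(pair_counts)
```

The result is in long format, with one row per pair, rather than the wide table that `pivot_table` and `crosstab` produce.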

### Counting frequency with `groupby`

Use both country and treatment as grouping columns and then aggregate any column with `size`.

In [None]:
mh.groupby(['country', 'treatment']).agg(count=('age', 'size'))

### Unobserved categories still appear in the result

Even if one of the combinations of country and treatment does not exist (such as France and Yes), it will still appear in the result. This is because both of these columns were converted to categorical and, by default, every combination of categories appears in the result (this has been the default in older pandas versions; newer releases may default to `observed=True`). To keep only the groups that are actually **observed** in the data, set the `observed` parameter to `True`. The result no longer includes France with Yes, reducing the total number of groups from 16 to 15.

In [None]:
len(mh.groupby(['country', 'treatment'], observed=True).size())
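The effect of `observed` is easier to see on a tiny example. This sketch uses a made-up DataFrame (the `toy` data is an assumption) with three countries and two treatment answers, where only four of the six possible combinations occur.

```python
import pandas as pd

# Hypothetical toy data; both columns are converted to categorical
toy = pd.DataFrame({
    'country': ['US', 'US', 'UK', 'FR'],
    'treatment': ['Yes', 'No', 'Yes', 'No'],
}).astype('category')

# observed=False keeps every combination of categories (3 x 2 = 6)
all_groups = toy.groupby(['country', 'treatment'], observed=False).size()

# observed=True keeps only the combinations present in the data
seen_groups = toy.groupby(['country', 'treatment'], observed=True).size()

print(len(all_groups), len(seen_groups))  # 6 4
```

Passing `observed` explicitly, as done here, avoids depending on the default of whichever pandas version is installed.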

### Counting frequency with `pivot_table`

Alternatively, we can count frequencies using `pivot_table` just as we did in the previous chapter.

In [None]:
mh.pivot_table(index='country', columns='treatment', aggfunc='size')

## Counting frequency with the `crosstab` function

The `crosstab` function is built specifically for counting co-occurrences of values between two or more columns. The name comes from **cross tabulation**, the more generic term used in data analysis outside of pandas. The resulting tables are also known as [contingency tables][2].

Unfortunately, `crosstab` is a function and NOT a method. This means it is not bound to any DataFrame object and must be accessed directly from `pd`. It shares many parameter names with the `pivot_table` method and is used similarly. Because it is not bound to a DataFrame, you must pass it the Series themselves rather than column names as strings. By default, it computes the size of each group, so there is no need to set the `aggfunc` parameter.

[2]: https://en.wikipedia.org/wiki/Contingency_table

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'])
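To compare the two approaches side by side, here is a minimal sketch on a made-up DataFrame (the `toy` data is an assumption, chosen so that every country and treatment combination occurs at least once). It builds the same table of counts with `pd.crosstab` and with `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data in which every combination appears
toy = pd.DataFrame({
    'country': ['US', 'US', 'UK', 'US', 'UK', 'UK'],
    'treatment': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

# crosstab takes the Series themselves
ct = pd.crosstab(index=toy['country'], columns=toy['treatment'])

# pivot_table takes column names and needs aggfunc='size' to count
pt = toy.pivot_table(index='country', columns='treatment', aggfunc='size')

print(ct)
print(pt)
```

Both tables have countries in the index, treatment answers in the columns, and the same counts in each cell.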

### Relative frequencies - only available with `crosstab`

The result is identical to what the `pivot_table` method produced, so you might be wondering why there is a need to know about this function at all. The big benefit of `crosstab` is its ability to return relative frequencies with the `normalize` parameter, which isn't easily doable with `groupby` or `pivot_table`. The `crosstab` function allows you to normalize over the rows, the columns, or all of the data. For instance, to find the relative frequency of people who have sought treatment in each country, you can normalize across each row like this. Each row should sum to 100%.

In [None]:
pd.crosstab(index=mh['country'], 
            columns=mh['treatment'], 
            normalize='index').round(3) * 100
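To make clear what `normalize='index'` computes, the sketch below reproduces it by hand on a made-up DataFrame (the `toy` data is an assumption): each count is divided by the total of its row.

```python
import pandas as pd

# Hypothetical toy data: 4 respondents from the US, 2 from the UK
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'UK', 'UK'],
    'treatment': ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'],
})

counts = pd.crosstab(index=toy['country'], columns=toy['treatment'])

# Manual version: divide every row by its own total
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in version
built_in = pd.crosstab(index=toy['country'], columns=toy['treatment'],
                       normalize='index')

print(built_in)
```

The manual version takes two steps and an explicit `axis` argument, which is the bookkeeping that the `normalize` parameter handles for you.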

Set the `normalize` parameter to the string `'columns'` to return the relative frequency in the other direction. The returned DataFrame informs us that of all the respondents seeking treatment, 15.4% were from the United Kingdom.

In [None]:
pd.crosstab(index=mh['country'], 
            columns=mh['treatment'], 
            normalize='columns').round(3) * 100

It's possible to find the relative frequency against all of the data by setting the `normalize` parameter to `'all'`. From the returned DataFrame, 2.2% of all respondents are from Germany and have not received mental health treatment.

In [None]:
pd.crosstab(index=mh['country'], 
            columns=mh['treatment'], 
            normalize='all').round(3) * 100

### Adding margins

You can add margins as well by setting the `margins` parameter to `True`. Here, we go back to raw counts and add margins for all rows and columns.

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], margins=True)

When normalizing the data, the margins calculated depend on the direction of the normalization. Here, we add margins when normalizing down the columns. We can use this margin to determine the degree to which each country is overrepresented (or underrepresented) in each treatment category. For instance, 65.7% of the respondents were from the United States. Of those respondents seeking treatment, 68.9% were from the United States, informing us that respondents from the United States were overrepresented in that category.

In [None]:
pd.crosstab(index=mh['country'], columns=mh['treatment'], 
            normalize='columns', margins=True).round(3)
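The following sketch shows the shape of this result on a made-up DataFrame (the `toy` data is an assumption). When normalizing down the columns, the margin is added as an extra `All` column that holds each country's share of all respondents, and no `All` row is added.

```python
import pandas as pd

# Hypothetical toy data: 4 respondents from the US, 2 from the UK
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'UK', 'UK'],
    'treatment': ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'],
})

table = pd.crosstab(index=toy['country'], columns=toy['treatment'],
                    normalize='columns', margins=True)

# Compare a country's share within one answer to its overall share
us_share_of_yes = table.loc['US', 'Yes']   # 3 of the 4 'Yes' answers
us_share_overall = table.loc['US', 'All']  # 4 of the 6 respondents

print(table)
```

In this toy data, the United States accounts for a larger share of the `Yes` answers than of all respondents, which is the same overrepresentation reading described above.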

## Normalizing other aggregations

While `pd.crosstab` is most often used for counting frequencies, it's possible to supply it with another column to aggregate. Let's read in the City of Houston employee dataset for this example.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Here, we calculate the total salary by department and sex with `pd.crosstab`, passing the aggregating column, `salary`, as a Series to the `values` parameter and setting `aggfunc` to the string aggregation name, `'sum'`.

In [None]:
pd.crosstab(index=emp['dept'], columns=emp['sex'],
            values=emp['salary'], aggfunc='sum')

This is typically done with the `pivot_table` method, which produces the exact same result.

In [None]:
emp.pivot_table(index='dept', columns='sex', values='salary', aggfunc='sum')

As we saw above, `crosstab` has the ability to normalize over rows, columns, and the entire table. It's possible to normalize over any aggregation provided, not just frequency (the default). Here, we find the percentage of the total female (and male) salaries by department by setting `normalize` to `'columns'`. It informs us that, out of the total of female salaries, 3.8% are from the fire department, 24.8% are from the police department, and so on. It provides a distribution of each column over the index. Each column totals 100%.

In [None]:
pd.crosstab(index=emp['dept'], columns=emp['sex'],
            values=emp['salary'], aggfunc='sum', 
            normalize='columns').round(3) * 100
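For comparison, here is a sketch of the same calculation done with `pivot_table` followed by a manual division. It uses a made-up DataFrame (`emp_toy` and its values are assumptions, not the Houston data) so that the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
emp_toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police'],
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'salary': [40_000, 60_000, 60_000, 40_000],
})

# Step 1: total salary by department and sex
totals = emp_toy.pivot_table(index='dept', columns='sex',
                             values='salary', aggfunc='sum')

# Step 2: divide each column by its own total
manual = totals / totals.sum()

# crosstab does both steps in one call
built_in = pd.crosstab(index=emp_toy['dept'], columns=emp_toy['sex'],
                       values=emp_toy['salary'], aggfunc='sum',
                       normalize='columns')

print(built_in)
```

Dividing by `totals.sum()` works for column normalization because pandas aligns the column totals with the column labels; normalizing over rows would need `div` with `axis=0` instead.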

### `crosstab` is almost unnecessary in pandas

The `crosstab` function and the `pivot_table` method are very similar. In fact, `crosstab` would be unnecessary if `pivot_table` had an easy way to normalize the values across groups. Since it does not, `crosstab` is still valuable.

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Do people with a family history of mental illness seek treatment more often than those who do not?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Find the total number and ratio of employees that seek treatment for companies that provide health benefits vs those that do not.</span>

### Exercise 3
<span  style="color:green; font-size:16px">You can provide a list of multiple columns to both the `index` and `columns` parameters of the `crosstab` function. Put country and number of employees in the index and benefits and treatment in the columns. It's probably easier to make separate list variables first.</span>

### Exercise 4

<span style="color:green; font-size:16px">Read in the bikes dataset and find the distribution of total trip duration by gender and events. Normalize over all groups. You should be able to answer the question, "From the total of all trip durations, what percent were done by males on a clear day?".</span>