# Crosstabs

`pandas.crosstab()` is another function that builds summary tables.

It builds a cross-tabulation table that shows how often each combination of groups appears in the data.



## Import the data

This demo looks only at the following subset of car manufacturers:
>
> "toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"
>

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

import pandas as pd
import seaborn as sns

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv(filepath_or_buffer='../Data/cars.csv',
                     header=None, names=headers, na_values="?" )

# Define the list of makes (manufacturers) that we want to review
models = ["toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"]

# Create a copy of the data with only these 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()

## How many different body styles these car makers made

The crosstab function can operate on NumPy arrays, Series or columns in a DataFrame.

Here, `df.make` is set to be the crosstab index and `df.body_style` is the crosstab’s columns.

Pandas then counts the occurrences of each combination behind the scenes.

For example, in this data set Volvo makes 8 sedans and 3 wagons.

In [None]:
pd.crosstab(df['make'], df['body_style'])

### Use a groupby followed by an unstack to get the same results

In [None]:
df.groupby(['make', 'body_style'])['body_style'].count().unstack().fillna(0)
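
The equivalence is easier to verify on data small enough to count by hand. The following is a minimal sketch, assuming a made-up `toy` frame (hypothetical values, not rows from `cars.csv`); it uses `size()` with `unstack(fill_value=0)` as one possible way to get integer counts from the groupby.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv), small enough to count by hand
toy = pd.DataFrame({
    "make": ["honda", "honda", "honda", "volvo", "volvo"],
    "body_style": ["sedan", "sedan", "wagon", "sedan", "wagon"],
})

# One call: rows are makes, columns are body styles, cells are counts
by_crosstab = pd.crosstab(toy["make"], toy["body_style"])

# Same counts: size of each (make, body_style) group, pivoted into columns
by_groupby = toy.groupby(["make", "body_style"]).size().unstack(fill_value=0)

print(by_crosstab)
# Compare the two tables cell by cell
print((by_crosstab.to_numpy() == by_groupby.to_numpy()).all())
```

Both tables have one row per make and one column per body style, so they can be compared cell by cell.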

### Use a pivot_table to get the same results

In [None]:
df.pivot_table(index=df['make'], columns=df['body_style'], aggfunc={'body_style':len}, fill_value=0)

## Add Subtotals

Use the `margins` keyword to add row and column subtotals; `margins_name` sets their label:

In [None]:
pd.crosstab(index=df['make'], columns=df['num_doors'], margins=True, margins_name="Total")
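
To see what the subtotals contain, here is a minimal sketch on a made-up `toy` frame (hypothetical values, not rows from `cars.csv`). The extra `Total` row and column hold the column sums and row sums, and the corner cell holds the overall row count.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv)
toy = pd.DataFrame({
    "make": ["honda", "honda", "honda", "volvo", "volvo"],
    "num_doors": ["two", "four", "four", "four", "four"],
})

table = pd.crosstab(index=toy["make"], columns=toy["num_doors"],
                    margins=True, margins_name="Total")

print(table)
# The bottom-right cell is the number of rows in the toy frame
print(table.loc["Total", "Total"] == len(toy))
```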

## Add aggregation

Use the `aggfunc` parameter to aggregate values instead of counting rows, and specify the column to aggregate with the `values` parameter.

In [None]:
pd.crosstab(index=df['make'], columns=df['body_style'], values=df.curb_weight, aggfunc='mean').round(0)
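
A minimal sketch of the same call on a made-up `toy` frame (hypothetical weights, not rows from `cars.csv`) shows how each cell is computed. Combinations with no rows have nothing to average, so they come out as `NaN` rather than 0.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv)
toy = pd.DataFrame({
    "make": ["honda", "honda", "honda", "volvo"],
    "body_style": ["sedan", "sedan", "wagon", "sedan"],
    "curb_weight": [2000, 2200, 2400, 3000],
})

# Each cell is the mean curb_weight of the rows with that make and body style
means = pd.crosstab(index=toy["make"], columns=toy["body_style"],
                    values=toy["curb_weight"], aggfunc="mean")

print(means)
# No Volvo wagons in the toy data, so that cell is missing
print(means.isna().loc["volvo", "wagon"])
```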

## Normalize

Use `normalize` to show the fraction of the time each combination occurs.

In [None]:
pd.crosstab(index=df['make'], columns=df['body_style'], normalize=True)

>
>The table above shows that 2.3% of the total population are Toyota hardtops and 6.25% are Volvo sedans.
>

### Normalize on columns only

In [None]:
pd.crosstab(index=df['make'], columns=df['body_style'], normalize='columns')

>
> This table shows that 50% of the convertibles are made by Toyota and the other 50% by Volkswagen.
>

### Normalize on rows only

In [None]:
pd.crosstab(index=df['make'], columns=df['body_style'], normalize='index')

>
> The above table shows that of the Mitsubishi cars in this dataset, 69.23% are hatchbacks and the remainder (30.77%) are sedans.
>
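
The three `normalize` modes differ only in what each count is divided by. The following is a minimal sketch on a made-up `toy` frame (hypothetical values, not rows from `cars.csv`) that checks which sums come out as 1 in each mode.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv)
toy = pd.DataFrame({
    "make": ["honda", "honda", "honda", "volvo", "volvo"],
    "body_style": ["sedan", "sedan", "wagon", "sedan", "wagon"],
})

# Divide every count by the grand total: the whole table sums to 1
overall = pd.crosstab(toy["make"], toy["body_style"], normalize=True)

# Divide by the column total: each column sums to 1
by_col = pd.crosstab(toy["make"], toy["body_style"], normalize="columns")

# Divide by the row total: each row sums to 1
by_row = pd.crosstab(toy["make"], toy["body_style"], normalize="index")

print(overall.to_numpy().sum())
print(by_col.sum(axis=0))
print(by_row.sum(axis=1))
```

In the toy data, 2 of the 5 rows are Honda sedans, so that cell is 0.4 in `overall`, while it is 2/3 in both `by_col` (2 of 3 sedans) and `by_row` (2 of 3 Hondas).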

# Grouping

An extremely useful feature is that you can pass in multiple dataframe columns and pandas does all the grouping for you.

For instance, to see how the data is distributed by front wheel drive (fwd) and rear wheel drive (rwd), include the `drive_wheels` column in the list passed as the second argument to `crosstab`.

In [None]:
cols = [ df['body_style'], df['drive_wheels'] ]

pd.crosstab(index=df['make'], columns=cols)
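
Passing a list of columns produces a table whose columns have two levels. A minimal sketch on a made-up `toy` frame (hypothetical values, not rows from `cars.csv`) shows the resulting structure and how one top-level group can be selected.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv)
toy = pd.DataFrame({
    "make": ["honda", "honda", "honda", "volvo"],
    "body_style": ["sedan", "sedan", "wagon", "sedan"],
    "drive_wheels": ["fwd", "fwd", "fwd", "rwd"],
})

table = pd.crosstab(index=toy["make"],
                    columns=[toy["body_style"], toy["drive_wheels"]])

# The columns are a two-level MultiIndex: (body_style, drive_wheels)
print(table.columns.nlevels)

# Selecting a top-level key returns the sub-table for that body style
print(table["sedan"])
```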

# Group the index

Specify `rownames` and `colnames` to label the row and column levels in the output. This is purely for display purposes but can be useful if the column names in the dataframe are not very specific.

Pass `dropna=False` to make sure that all rows and columns are included even if they contain only 0’s.

If it were not included, the final Volvo, two door row would be omitted from the table.


In [None]:
cols = [ df['body_style'], df['drive_wheels'] ]
idx  = [ df['make'], df['num_doors'] ]

pd.crosstab(index=idx, columns=cols,
            rownames=['Auto Manufacturer', "Doors"],
            colnames=['Body Style', "Drive Type"],
            dropna=False)
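
The effect of `dropna=False` can be sketched on a made-up `toy` frame (hypothetical values, not rows from `cars.csv`) in which one make and door-count combination never occurs. The row counts in the comments assume the behaviour described above, where unobserved combinations of a grouped index are kept as rows of zeros.

```python
import pandas as pd

# Hypothetical toy data (not from cars.csv): there is no two-door volvo
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo"],
    "num_doors": ["two", "four", "four"],
    "body_style": ["sedan", "wagon", "sedan"],
})

idx = [toy["make"], toy["num_doors"]]

# Default: only the (make, num_doors) pairs that occur in the data
default = pd.crosstab(index=idx, columns=toy["body_style"])

# dropna=False: every pairing of the two index levels, zeros where unobserved
full = pd.crosstab(index=idx, columns=toy["body_style"], dropna=False)

print(len(default), len(full))
print(full.loc[("volvo", "two")])
```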

# Visualizing

Create a heatmap using the `seaborn.heatmap()` function.

In [None]:
cols = [ df['body_style'], df['drive_wheels'] ]
idx  = [ df['make'], df['num_doors'] ]
crosstab = pd.crosstab(index=idx, columns=cols)

sns.heatmap(data=crosstab, cmap="YlGnBu", annot=True, cbar=True)

# Cheat Sheet

<img style="float: center;" width="1440" src="../Images/crosstab.png">