# Pandas: Crosstab - cross tabulation of two (or more) factors

## Resources

* [pandas.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)
* [Pivot table](https://en.wikipedia.org/wiki/Pivot_table)
* [imdb 5000 movie dataset](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset)

## Official Pandas doc

> Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
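
The two modes described in the quote can be sketched on a tiny made-up table. The `toy` DataFrame and its `director`, `country` and `score` columns are assumptions for illustration only, not part of the movie dataset:

```python
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B"],
    "country": ["USA", "UK", "USA", "USA", "UK"],
    "score": [7.0, 8.0, 6.0, 7.0, 9.0],
})

# Default mode: count how often each (director, country) pair occurs
counts = pd.crosstab(toy["director"], toy["country"])

# With values + aggfunc: aggregate another column instead of counting
mean_score = pd.crosstab(
    toy["director"], toy["country"],
    values=toy["score"], aggfunc="mean",
)

print(counts)
print(mean_score)
```

The first call gives a frequency table; the second gives the mean `score` for each pair.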

## Pivot Table

> A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

> Pivot tables are a technique in data processing.

## Use cases

* Data summary
* Data aggregation
* Grouping
* Quick Reports
* Data patterns

## Step 1: Import Pandas and read data

In [1]:
import pandas as pd
df = pd.read_csv("../csv/movie_metadata.csv")

## Step 2: Select data for the crosstab

In [2]:
df.head()

|  | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| 4 |  | Doug Walker |  |  | 131.0 |  | Rob Walker | 131.0 |  | Documentary | ... |  |  |  |  |  |  | 12.0 | 7.1 |  | 0 |


In [3]:
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| color | Color | Color | Color | Color |  |
| director_name | James Cameron | Gore Verbinski | Sam Mendes | Christopher Nolan | Doug Walker |
| num_critic_for_reviews | 723 | 302 | 602 | 813 |  |
| duration | 178 | 169 | 148 | 164 |  |
| director_facebook_likes | 0 | 563 | 0 | 22000 | 131 |
| actor_3_facebook_likes | 855 | 1000 | 161 | 23000 |  |
| actor_2_name | Joel David Moore | Orlando Bloom | Rory Kinnear | Christian Bale | Rob Walker |
| actor_1_facebook_likes | 1000 | 40000 | 11000 | 27000 | 131 |
| gross | 7.60506e+08 | 3.09404e+08 | 2.00074e+08 | 4.48131e+08 |  |
| genres | Action\|Adventure\|Fantasy\|Sci-Fi | Action\|Adventure\|Fantasy | Action\|Adventure\|Thriller | Action\|Thriller | Documentary |


In [4]:
df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [5]:
df2 = df.iloc[[2, 4, 9, 12, 13, 14, 20, 23, 25, 30,  34, 50, 79], :]

## Step 3: Create a crosstab table

In [6]:
# simple usage
pd.crosstab(df2['director_name'], df2['country'])

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |
| Baz Luhrmann | 1 | 0 | 0 | 0 | 0 |
| Brett Ratner | 0 | 1 | 0 | 0 | 0 |
| David Yates | 0 | 0 | 0 | 1 | 0 |
| Gore Verbinski | 0 | 0 | 0 | 0 | 2 |
| Jon Favreau | 0 | 0 | 0 | 1 | 0 |
| Marc Forster | 0 | 0 | 0 | 1 | 0 |
| Peter Jackson | 0 | 0 | 2 | 0 | 1 |
| Sam Mendes | 0 | 0 | 0 | 2 | 0 |


In [7]:
# change row and column names
pd.crosstab(df2['director_name'], df2['country'], rownames=['director'], colnames=['country'])

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director |  |  |  |  |  |
| Baz Luhrmann | 1 | 0 | 0 | 0 | 0 |
| Brett Ratner | 0 | 1 | 0 | 0 | 0 |
| David Yates | 0 | 0 | 0 | 1 | 0 |
| Gore Verbinski | 0 | 0 | 0 | 0 | 2 |
| Jon Favreau | 0 | 0 | 0 | 1 | 0 |
| Marc Forster | 0 | 0 | 0 | 1 | 0 |
| Peter Jackson | 0 | 0 | 2 | 0 | 1 |
| Sam Mendes | 0 | 0 | 0 | 2 | 0 |


## Crosstab: normalize or show percentages per row or overall

In [8]:
# Show percentage - global - normalize=True
pd.crosstab(df2['director_name'], df2['country'], normalize=True)

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |
| Baz Luhrmann | 0.083333 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brett Ratner | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 |
| David Yates | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 |
| Gore Verbinski | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 |
| Jon Favreau | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 |
| Marc Forster | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 |
| Peter Jackson | 0.0 | 0.0 | 0.166667 | 0.0 | 0.083333 |
| Sam Mendes | 0.0 | 0.0 | 0.0 | 0.166667 | 0.0 |


In [9]:
# Show percentage - per index - normalize='index'
pd.crosstab(df2['director_name'], df2['country'], normalize='index')

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |
| Baz Luhrmann | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brett Ratner | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| David Yates | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Gore Verbinski | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Jon Favreau | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Marc Forster | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Peter Jackson | 0.0 | 0.0 | 0.666667 | 0.0 | 0.333333 |
| Sam Mendes | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
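
To see what each `normalize` option divides by, the fractions can be recomputed by hand from the raw counts. This is only a sketch on a made-up `toy` DataFrame (hypothetical `director` and `country` columns). It also covers `normalize='columns'`, which the cells above do not use:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B"],
    "country": ["USA", "UK", "USA", "USA", "UK"],
})

counts = pd.crosstab(toy["director"], toy["country"])

# normalize=True: every cell divided by the grand total
overall = counts / counts.to_numpy().sum()
# normalize='index': every cell divided by its row total
per_row = counts.div(counts.sum(axis=1), axis=0)
# normalize='columns': every cell divided by its column total
per_col = counts.div(counts.sum(axis=0), axis=1)

# Compare the manual results with what crosstab returns
checks = {
    "all": np.allclose(
        overall, pd.crosstab(toy["director"], toy["country"], normalize=True)),
    "index": np.allclose(
        per_row, pd.crosstab(toy["director"], toy["country"], normalize="index")),
    "columns": np.allclose(
        per_col, pd.crosstab(toy["director"], toy["country"], normalize="columns")),
}
print(checks)
```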


In [10]:
# Show total - margins=True
pd.crosstab(df2['director_name'], df2['country'], margins=True)

| country | Australia | Canada | New Zealand | UK | USA | All |
| --- | --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |  |
| Baz Luhrmann | 1 | 0 | 0 | 0 | 0 | 1 |
| Brett Ratner | 0 | 1 | 0 | 0 | 0 | 1 |
| David Yates | 0 | 0 | 0 | 1 | 0 | 1 |
| Gore Verbinski | 0 | 0 | 0 | 0 | 2 | 2 |
| Jon Favreau | 0 | 0 | 0 | 1 | 0 | 1 |
| Marc Forster | 0 | 0 | 0 | 1 | 0 | 1 |
| Peter Jackson | 0 | 0 | 2 | 0 | 1 | 3 |
| Sam Mendes | 0 | 0 | 0 | 2 | 0 | 2 |
| All | 1 | 1 | 2 | 5 | 3 | 12 |
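
The margin row and column are plain sums of the table body. The sketch below checks this on a made-up `toy` DataFrame (hypothetical `director` and `country` columns) and renames the default `All` label with the `margins_name` parameter:

```python
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B"],
    "country": ["USA", "UK", "USA", "USA", "UK"],
})

# margins_name replaces the default "All" label
with_totals = pd.crosstab(
    toy["director"], toy["country"],
    margins=True, margins_name="Total",
)

# Strip the margins to get back the plain frequency table
body = with_totals.drop(index="Total", columns="Total")

# The "Total" column holds row sums, the "Total" row holds column sums
rows_match = (body.sum(axis=1) == with_totals.loc[body.index, "Total"]).all()
cols_match = (body.sum(axis=0) == with_totals.loc["Total", body.columns]).all()
grand_total = with_totals.loc["Total", "Total"]

print(with_totals)
print(rows_match, cols_match, grand_total)
```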


In [11]:
# Combining totals and percentage
pd.crosstab(df2['director_name'], df2['country'], margins=True, normalize=True)

| country | Australia | Canada | New Zealand | UK | USA | All |
| --- | --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |  |
| Baz Luhrmann | 0.083333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.083333 |
| Brett Ratner | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 | 0.083333 |
| David Yates | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.083333 |
| Gore Verbinski | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 |
| Jon Favreau | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.083333 |
| Marc Forster | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.083333 |
| Peter Jackson | 0.0 | 0.0 | 0.166667 | 0.0 | 0.083333 | 0.25 |
| Sam Mendes | 0.0 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.166667 |
| All | 0.083333 | 0.083333 | 0.166667 | 0.416667 | 0.25 | 1.0 |


In [12]:
# Combining totals and percentage per row
pd.crosstab(df2['director_name'], df2['country'], margins=True,  normalize='index')

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |
| Baz Luhrmann | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brett Ratner | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| David Yates | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Gore Verbinski | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Jon Favreau | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Marc Forster | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Peter Jackson | 0.0 | 0.0 | 0.666667 | 0.0 | 0.333333 |
| Sam Mendes | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| All | 0.083333 | 0.083333 | 0.166667 | 0.416667 | 0.25 |


## Pandas crosstab with multiple columns

In [13]:
pd.crosstab([df2['director_name'], df2['genres']], df2['country'])

|  | country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- | --- |
| director_name | genres |  |  |  |  |  |
| Baz Luhrmann | Drama\|Romance | 1 | 0 | 0 | 0 | 0 |
| Brett Ratner | Action\|Adventure\|Fantasy\|Sci-Fi\|Thriller | 0 | 1 | 0 | 0 | 0 |
| David Yates | Adventure\|Family\|Fantasy\|Mystery | 0 | 0 | 0 | 1 | 0 |
| Gore Verbinski | Action\|Adventure\|Fantasy | 0 | 0 | 0 | 0 | 1 |
| Gore Verbinski | Action\|Adventure\|Western | 0 | 0 | 0 | 0 | 1 |
| Jon Favreau | Adventure\|Drama\|Family\|Fantasy | 0 | 0 | 0 | 1 | 0 |
| Marc Forster | Action\|Adventure | 0 | 0 | 0 | 1 | 0 |
| Peter Jackson | Action\|Adventure\|Drama\|Romance | 0 | 0 | 1 | 0 | 0 |
| Peter Jackson | Adventure\|Fantasy | 0 | 0 | 1 | 0 | 1 |
| Sam Mendes | Action\|Adventure\|Thriller | 0 | 0 | 0 | 2 | 0 |
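
Passing a list of columns produces a `MultiIndex` on the rows, and only the combinations that actually occur in the data show up, as in the table above. A minimal sketch, assuming a made-up `toy` DataFrame with hypothetical `director`, `genre` and `country` columns:

```python
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B"],
    "genre": ["Drama", "Drama", "Action", "Drama", "Action"],
    "country": ["USA", "UK", "USA", "USA", "UK"],
})

# Two row factors, one column factor
table = pd.crosstab([toy["director"], toy["genre"]], toy["country"])

print(table)
print(table.index.names)   # names of the two row levels

# Select the sub-table for a single director via the first index level
print(table.loc["B"])
```

Director `A` never appears with genre `Action` in this toy data, so no `(A, Action)` row is created.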


## Simulate pandas crosstab with groupby

In [14]:
cols = ['director_name', 'country']
df2.groupby(cols)[cols].count()

|  |  | director_name | country |
| --- | --- | --- | --- |
| director_name | country |  |  |
| Baz Luhrmann | Australia | 1 | 1 |
| Brett Ratner | Canada | 1 | 1 |
| David Yates | UK | 1 | 1 |
| Gore Verbinski | USA | 2 | 2 |
| Jon Favreau | UK | 1 | 1 |
| Marc Forster | UK | 1 | 1 |
| Peter Jackson | New Zealand | 2 | 2 |
| Peter Jackson | USA | 1 | 1 |
| Sam Mendes | UK | 2 | 2 |
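
The groupby result above is in long format, with one row per observed pair. To get the same wide layout as `crosstab`, one option is `size()` followed by `unstack(fill_value=0)`. A sketch on a made-up `toy` DataFrame (hypothetical `director` and `country` columns):

```python
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B", "C"],
    "country": ["USA", "UK", "USA", "USA", "UK", "UK"],
})

# Count rows per (director, country) pair, then pivot country into columns;
# fill_value=0 fills pairs that never occur (here: C / USA)
via_groupby = (
    toy.groupby(["director", "country"])
       .size()
       .unstack(fill_value=0)
)

via_crosstab = pd.crosstab(toy["director"], toy["country"])

print(via_groupby)
print((via_groupby.to_numpy() == via_crosstab.to_numpy()).all())
```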


## Pandas crosstab using values from another column

In [15]:
import numpy as np
pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.average)

| country | Australia | Canada | New Zealand | UK | USA |
| --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |
| Baz Luhrmann | 7.3 |  |  |  |  |
| Brett Ratner |  | 6.8 |  |  |  |
| David Yates |  |  |  | 7.5 |  |
| Gore Verbinski |  |  |  |  | 6.9 |
| Jon Favreau |  |  |  | 7.8 |  |
| Marc Forster |  |  |  | 6.7 |  |
| Peter Jackson |  |  | 7.35 |  | 7.9 |
| Sam Mendes |  |  |  | 7.3 |  |


In [17]:
import numpy as np
pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.average, margins=True)

| country | Australia | Canada | New Zealand | UK | USA | All |
| --- | --- | --- | --- | --- | --- | --- |
| director_name |  |  |  |  |  |  |
| Baz Luhrmann | 7.3 |  |  |  |  | 7.3 |
| Brett Ratner |  | 6.8 |  |  |  | 6.8 |
| David Yates |  |  |  | 7.5 |  | 7.5 |
| Gore Verbinski |  |  |  |  | 6.9 | 6.9 |
| Jon Favreau |  |  |  | 7.8 |  | 7.8 |
| Marc Forster |  |  |  | 6.7 |  | 6.7 |
| Peter Jackson |  |  | 7.35 |  | 7.9 | 7.533333 |
| Sam Mendes |  |  |  | 7.3 |  | 7.3 |
| All | 7.3 | 6.8 | 7.35 | 7.32 | 7.233333 | 7.258333 |
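
With `values` and `aggfunc`, `crosstab` behaves much like `DataFrame.pivot_table`. The sketch below compares the two on a made-up `toy` DataFrame (hypothetical `director`, `country` and `score` columns). It uses the string `"mean"` as the aggregation, which is an assumption of this example and not the `np.average` call used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the movie dataset)
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B", "C"],
    "country": ["USA", "UK", "USA", "USA", "UK", "UK"],
    "score": [7.0, 8.0, 6.0, 7.0, 9.0, 5.0],
})

# crosstab takes the arrays (Series) directly
ct = pd.crosstab(
    toy["director"], toy["country"],
    values=toy["score"], aggfunc="mean",
)

# pivot_table takes column names of the DataFrame
pt = toy.pivot_table(
    index="director", columns="country",
    values="score", aggfunc="mean",
)

# Pairs with no data (here: C / USA) are NaN in both results
same = np.allclose(ct.to_numpy(), pt.to_numpy(), equal_nan=True)
print(ct)
print(same)
```

`crosstab` is convenient when the factors are separate arrays or Series; `pivot_table` is convenient when everything already lives in one DataFrame.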
