# CS-6570 Lecture 3

**Dylan Zwick**

*Weber State University*

In this lecture, we'll go through a whirlwind introduction to Pandas, the most commonly used data management library in Python. We will not cover, or even come close to covering, the entire library. Also, there are many aspects and facets of Pandas that you'll learn and internalize only by using it. However, the lecture should - ideally - give you a starting point.

First, some notational conventions. The library name "pandas" is almost always abbreviated as pd when it's imported, and pd is used to reference it afterwards. The same idea applies, mutatis mutandis, to NumPy, PyPlot, Seaborn, and statsmodels:

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

### Pandas Basics

The dataset with which we'll play around is the "royal line" dataset, which was created from public sources and contains family history information about Elizabeth II, the Queen of England at the time the dataset was compiled. Note a few things about the command below:
* It uses the read_csv command from pandas, which is used to read in "comma separated value" files. This is a very common format for storing tabular data, as it's not tied to a particular program like, for example, Excel files are. However, Pandas also has functionality for reading in pretty much any type of data format commonly found in practice.
* Currently, I'm in a folder called "Lecture Notes", which is a subfolder of the folder "CS-6570". Another subfolder of "CS-6570" is "Datasets", where the royal_line.csv data file can be found. My command starts in the "Lecture Notes" folder, goes one up, then goes into the "Datasets" folder, then accesses the data file. Your command may be different depending on where you decide to place your data file.
* The basic data object in Pandas is a "dataframe", which is created by the read_csv command. Typically, a dataframe is denoted with the abbreviation "df", which we do here.

In [7]:
df = pd.read_csv('../Datasets/royal_line.csv')
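
The relative path described above can also be assembled piece by piece, which may be easier to adapt if your folders are arranged differently. This is only a sketch: the folder names are the ones from this lecture's layout, and `data_path` is a name introduced here for illustration.

```python
from pathlib import Path

# Start in the current folder ("Lecture Notes"), go one level up with '..',
# then into "Datasets", then to the file itself.
data_path = Path('..') / 'Datasets' / 'royal_line.csv'
print(data_path)  # ../Datasets/royal_line.csv on Linux/macOS

# pd.read_csv accepts a Path object as well as a string, so this would work too:
# df = pd.read_csv(data_path)
```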

We can then take a look at this dataframe using the "print" command, which will by default print the first five and last five rows of the dataframe.

In [9]:
print(df)

        ID               first_name last_name sex             title  \
0        1                 Victoria   Hanover   F  Queen of England   
1        2  Albert Augustus Charles       NaN   M            Prince   
2        3   Victoria Adelaide Mary       NaN   F    Princess Royal   
3        4               Edward_VII    Wettin   M   King of England   
4        5          Alice Maud Mary       NaN   F          Princess   
...    ...                      ...       ...  ..               ...   
3004  3005                    Emily   Scobell   F               NaN   
3005  3006             John Sanford   Scobell   M               Sir   
3006  3007                    James  Cartland   M               NaN   
3007  3008                    Flora       NaN   F               NaN   
3008  3009                      NaN  Cartland   F               NaN   

       birth_date                          birth_place   death_date  \
0     24 MAY 1819     Kensington,Palace,London,England  22 JAN 1901   
1    

If we just want to check out the first $n$ rows, we can use the "head" function:

In [11]:
print(df.head())
print(df.head(12))

   ID               first_name last_name sex             title   birth_date  \
0   1                 Victoria   Hanover   F  Queen of England  24 MAY 1819   
1   2  Albert Augustus Charles       NaN   M            Prince  26 AUG 1819   
2   3   Victoria Adelaide Mary       NaN   F    Princess Royal  21 NOV 1840   
3   4               Edward_VII    Wettin   M   King of England   9 NOV 1841   
4   5          Alice Maud Mary       NaN   F          Princess  25 APR 1843   

                           birth_place   death_date  \
0     Kensington,Palace,London,England  22 JAN 1901   
1  Schloss Rosenau,Near Coburg,Germany  14 DEC 1861   
2     Buckingham,Palace,London,England   5 AUG 1901   
3     Buckingham,Palace,London,England   6 MAY 1910   
4     Buckingham,Palace,London,England  14 DEC 1878   

                           death_place  
0  Osborne House,Isle of Wight,England  
1     Windsor Castle,Berkshire,England  
2   Friedrichshof,Near,Kronberg,Taunus  
3     Buckingham,Palace,London

And similarly the "tail" function returns the last $n$ rows:

In [13]:
print(df.tail(15))

        ID             first_name last_name sex  title  birth_date  \
2994  2995       Bertram (Bertie)  Cartland   M  Major         NaN   
2995  2996  Mary Hamilton (Polly)   Scobell   F    NaN  5 SEP 1877   
2996  2997                 Ronald  Cartland   M    NaN  3 JAN 1907   
2997  2998         Anthony (Tony)  Cartland   M    NaN  4 JAN 1912   
2998  2999                  Edith  Palairet   F    NaN         NaN   
2999  3000              Mary Anne  Hamilton   F    NaN         NaN   
3000  3001                 Andrew  Hamilton   M    NaN         NaN   
3001  3002                 George   Scobell   M    NaN         NaN   
3002  3003               Melloney   Scobell   F    NaN         NaN   
3003  3004                    NaN   Scobell   M    NaN         NaN   
3004  3005                  Emily   Scobell   F    NaN         NaN   
3005  3006           John Sanford   Scobell   M    Sir        1879   
3006  3007                  James  Cartland   M    NaN         NaN   
3007  3008          

You may have noticed in the printed dataframe an additional column of numbers located to the far left of the data. For instance, if we just call df.head() with the default option (which is 5), we get:

In [15]:
df.head()

,ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
0,1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
1,2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
2,3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
3,4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
4,5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"


That first column is an index column created by Pandas for the dataframe. It starts at $0$ and enumerates from there. Note that this column is *not* in our original CSV; Pandas creates it when the file is read.

In this example, our original csv already has an index column, called ID, that starts at 1, so this additional data column is a bit redundant. We can specify an index column when we read in the data using the "index_col" parameter in read_csv.

In [17]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')

Now, the index column is the "ID" column from the original dataset.

In [19]:
df.head()

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
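
To see the effect of "index_col" without needing the data file, here is a small self-contained sketch. The three-row CSV text is a made-up stand-in for royal_line.csv, and the names `csv_text`, `default_df`, and `indexed_df` are introduced only for this illustration.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for royal_line.csv.
csv_text = """ID,first_name,sex
1,Victoria,F
2,Albert Augustus Charles,M
3,Victoria Adelaide Mary,F
"""

# Without index_col, Pandas creates its own index starting at 0
# and keeps ID as an ordinary column.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index.tolist())    # [0, 1, 2]

# With index_col='ID', the ID column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col='ID')
print(indexed_df.index.tolist())    # [1, 2, 3]
print(indexed_df.columns.tolist())  # ['first_name', 'sex']
```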


If we only wanted to see the "title" column, we could restrict our investigation to that as follows:

In [21]:
df['title'].head()

ID
1    Queen of England
2              Prince
3      Princess Royal
4     King of England
5            Princess
Name: title, dtype: object

We can specify more than one column as follows. Note the *double brackets*. Think of this as the outer brackets accessing the dataframe, while the inner brackets specify a list.

In [23]:
df[['title', 'first_name']].tail()

ID,title,first_name
3005,,Emily
3006,Sir,John Sanford
3007,,James
3008,,Flora
3009,,
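
The difference between single and double brackets shows up in the type of object you get back. The sketch below uses a tiny made-up dataframe (`small_df` is a name introduced here, not part of the lecture's dataset) to make that visible.

```python
import pandas as pd

# Hypothetical two-row stand-in for the royal_line dataframe.
small_df = pd.DataFrame({
    'first_name': ['Victoria', 'Albert Augustus Charles'],
    'title': ['Queen of England', 'Prince'],
})

one_column = small_df['title']                   # single brackets: a Series
also_one_column = small_df[['title']]            # double brackets: a one-column DataFrame
two_columns = small_df[['title', 'first_name']]  # columns come back in the order listed

print(type(one_column).__name__)       # Series
print(type(also_one_column).__name__)  # DataFrame
print(two_columns.columns.tolist())    # ['title', 'first_name']
```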


In [24]:
df.columns

Index(['first_name', 'last_name', 'sex', 'title', 'birth_date', 'birth_place',
       'death_date', 'death_place'],
      dtype='object')

Note this returns an Index object that behaves like an iterable list, so you could, for example, go through the columns with a for loop.
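
As a quick sketch of looping over the columns, the snippet below walks through df.columns on a small made-up dataframe and reports how many distinct values each column holds. The dataframe and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the royal_line dataframe.
small_df = pd.DataFrame({
    'first_name': ['Victoria', 'Edward_VII'],
    'last_name': ['Hanover', 'Wettin'],
    'sex': ['F', 'M'],
})

# Each item produced by the loop is a column name (a string here).
for column_name in small_df.columns:
    n_unique = small_df[column_name].nunique()
    print(f"{column_name}: {n_unique} distinct value(s)")

# If you need an ordinary Python list of names, convert explicitly.
column_list = list(small_df.columns)
```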

You can also find out more information about a dataframe using the "info" function:

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3009 entries, 1 to 3009
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   first_name   2984 non-null   object
 1   last_name    1142 non-null   object
 2   sex          2996 non-null   object
 3   title        1398 non-null   object
 4   birth_date   1734 non-null   object
 5   birth_place  486 non-null    object
 6   death_date   1692 non-null   object
 7   death_place  449 non-null    object
dtypes: object(8)
memory usage: 211.6+ KB
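
The "Non-Null Count" column above tells you how many entries are present; sometimes it's handier to count what's missing instead. One way to do that, sketched here on a small made-up dataframe, is to combine "isna" with "sum" (or "mean" for the fraction missing).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with some missing entries.
small_df = pd.DataFrame({
    'first_name': ['Victoria', 'Albert Augustus Charles', np.nan],
    'last_name': ['Hanover', np.nan, np.nan],
})

# isna() gives True/False per cell; summing counts the True values per column.
missing_counts = small_df.isna().sum()
print(missing_counts)

# Taking the mean instead gives the fraction of missing entries per column.
missing_share = small_df.isna().mean()
print(missing_share)
```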


### Dropping or Removing Data

There are many reasons why you might want to drop or remove data. For example, it could be that only some data is relevant to your analysis. Or, it could be that some data is insufficient or corrupt, and leaving it in would lead to incorrect conclusions. Also, sometimes certain columns aren't of interest to the analysis in question.

If we want to drop entire columns, we can use the "drop" function:

In [29]:
df.drop(columns = ['birth_place', 'death_place'])

ID,first_name,last_name,sex,title,birth_date,death_date
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,22 JAN 1901
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,14 DEC 1861
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,5 AUG 1901
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,6 MAY 1910
5,Alice Maud Mary,,F,Princess,25 APR 1843,14 DEC 1878
...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,
3006,John Sanford,Scobell,M,Sir,1879,
3007,James,Cartland,M,,,
3008,Flora,,F,,,


However, while the dataframe above only has six columns, if we display df again we get:

In [31]:
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,,,
3006,John Sanford,Scobell,M,Sir,1879,,,
3007,James,Cartland,M,,,,,
3008,Flora,,F,,,,,


What?!? I thought we dropped two!

What's going on here is that the drop command creates a new dataframe as its output. It *does not* modify the original dataframe. So, for example, we could say:

In [33]:
df2 = df.drop(columns = ['birth_place', 'death_place'])
df2

ID,first_name,last_name,sex,title,birth_date,death_date
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,22 JAN 1901
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,14 DEC 1861
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,5 AUG 1901
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,6 MAY 1910
5,Alice Maud Mary,,F,Princess,25 APR 1843,14 DEC 1878
...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,
3006,John Sanford,Scobell,M,Sir,1879,
3007,James,Cartland,M,,,
3008,Flora,,F,,,


Here, df2 is the dataframe with those two columns dropped, while df is the original, unchanged dataframe.

If we want to actually make the change to the original dataframe, we can do this with the "inplace" parameter.

In [35]:
df.drop(columns = ['birth_place', 'death_place'], inplace = True)
df

ID,first_name,last_name,sex,title,birth_date,death_date
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,22 JAN 1901
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,14 DEC 1861
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,5 AUG 1901
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,6 MAY 1910
5,Alice Maud Mary,,F,Princess,25 APR 1843,14 DEC 1878
...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,
3006,John Sanford,Scobell,M,Sir,1879,
3007,James,Cartland,M,,,
3008,Flora,,F,,,


This is the same as:

In [37]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
df = df.drop(columns = ['birth_place', 'death_place'])
df

ID,first_name,last_name,sex,title,birth_date,death_date
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,22 JAN 1901
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,14 DEC 1861
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,5 AUG 1901
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,6 MAY 1910
5,Alice Maud Mary,,F,Princess,25 APR 1843,14 DEC 1878
...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,
3006,John Sanford,Scobell,M,Sir,1879,
3007,James,Cartland,M,,,
3008,Flora,,F,,,


You can also drop rows by indicating specific indices:

In [39]:
df.drop(index=[4,5,6], inplace=True)

Or, by using df.index, which looks up row labels by position (starting at $0$), so df.index[0] always refers to the first row regardless of how the index is numbered.

In [41]:
df.drop(df.index[0], inplace=True)
df.drop(df.index[0], inplace=True)

Each line drops the first row of the dataframe, whatever that first row might be. So, these two lines together drop the first two rows. We can also use standard Python indexing and slicing notation to specify indices here.

In [43]:
df

ID,first_name,last_name,sex,title,birth_date,death_date
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,5 AUG 1901
7,Helena Augusta Victoria,,F,Princess,25 MAY 1846,9 JUN 1923
8,Louise Caroline Alberta,,F,Princess,18 MAR 1848,3 DEC 1939
9,Arthur William Patrick,,M,Prince,1 MAY 1850,16 JAN 1942
10,Leopold George Duncan,,M,Prince,7 APR 1853,28 MAR 1884
...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,
3006,John Sanford,Scobell,M,Sir,1879,
3007,James,Cartland,M,,,
3008,Flora,,F,,,
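
The remark about slicing can be made concrete with a short sketch. Instead of calling drop twice with df.index[0], a slice of df.index picks out several positions at once. The small dataframe and variable names below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in whose index starts at 1, like the ID column.
small_df = pd.DataFrame(
    {'first_name': ['Victoria', 'Albert', 'Vicky', 'Edward_VII', 'Alice']},
    index=pd.Index([1, 2, 3, 4, 5], name='ID'),
)

# small_df.index[:2] holds the labels of the first two rows (by position).
without_first_two = small_df.drop(small_df.index[:2])
print(without_first_two.index.tolist())  # [3, 4, 5]

# Negative positions work too: index[-1] is the label of the last row.
without_last = small_df.drop(small_df.index[-1])
print(without_last.index.tolist())       # [1, 2, 3, 4]

# iloc selects by position directly and gives the same rows as the first drop.
same_rows = small_df.iloc[2:]
```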


The "drop_duplicates" function can be used to drop duplicate rows, while the "dropna" function drops every row that includes at least one NA entry. Be careful with this one, as it could potentially drop a lot of rows!

In [45]:
df.dropna(inplace=True)
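
Because a bare dropna can remove so much, it's often worth narrowing what counts as "missing enough to drop." The sketch below, on a made-up four-row dataframe, compares drop_duplicates with a few dropna variants using the "how" and "subset" parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one exact duplicate row, one partly missing row, one fully missing row.
small_df = pd.DataFrame({
    'first_name': ['Victoria', 'Victoria', 'Flora', np.nan],
    'last_name': ['Hanover', 'Hanover', np.nan, np.nan],
})

deduplicated = small_df.drop_duplicates()               # removes the repeated Victoria row
any_missing = small_df.dropna()                         # drops rows with at least one NA
all_missing = small_df.dropna(how='all')                # drops only rows that are entirely NA
by_first_name = small_df.dropna(subset=['first_name'])  # only checks the first_name column

print(len(deduplicated), len(any_missing), len(all_missing), len(by_first_name))  # 3 2 3 3
```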

### Adding, Modifying Data, and Mapping

Suppose we have a dataframe with NA values, and instead of dropping them we want to fill them with some values we determine. Here are a few ways to do that:

In [48]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# replace ALL NA entries with a fixed value:
df.fillna(0, inplace=True)
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,0,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,0,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,0,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,0,0,0,0,0
3006,John Sanford,Scobell,M,Sir,1879,0,0,0
3007,James,Cartland,M,0,0,0,0,0
3008,Flora,0,F,0,0,0,0,0


In [49]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# replace the first 2 NA entries in each column with a fixed value:
df.fillna(0, limit=2, inplace=True)
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,0,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,0,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,,,
3006,John Sanford,Scobell,M,Sir,1879,,,
3007,James,Cartland,M,,,,,
3008,Flora,,F,,,,,


In [50]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# replace ALL NA first names with a fixed value:
df['first_name'] = df['first_name'].fillna('no first name')
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,,,
3006,John Sanford,Scobell,M,Sir,1879,,,
3007,James,Cartland,M,,,,,
3008,Flora,,F,,,,,


In [51]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# replace NA entries in specific columns with column-specific values:
values = {'first_name': 'no_first_name', 'last_name': 'no_last_name', 'sex': 'no_sex', 'title': 'no_title', 'birth_date': 'no_birth_date', 'birth_place': 'no_birth_place', 'death_date': 'no_death_date', 'death_place': 'no_death_place'}
df.fillna(value=values, inplace=True)
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,no_last_name,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,no_last_name,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,no_last_name,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,no_title,no_birth_date,no_birth_place,no_death_date,no_death_place
3006,John Sanford,Scobell,M,Sir,1879,no_birth_place,no_death_date,no_death_place
3007,James,Cartland,M,no_title,no_birth_date,no_birth_place,no_death_date,no_death_place
3008,Flora,no_last_name,F,no_title,no_birth_date,no_birth_place,no_death_date,no_death_place


In [52]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# ffill (also known as pad): going from the first row to the last, propagate the most recent
# non-NA value forward until the next valid value
df.ffill(inplace=True)
df  

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,Hanover,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,Hanover,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,Wettin,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,Major,4 JAN 1912,"Florence,Italy",BEF 1877,"Nr Cassel,France"
3006,John Sanford,Scobell,M,Sir,1879,"Florence,Italy",BEF 1877,"Nr Cassel,France"
3007,James,Cartland,M,Sir,1879,"Florence,Italy",BEF 1877,"Nr Cassel,France"
3008,Flora,Cartland,F,Sir,1879,"Florence,Italy",BEF 1877,"Nr Cassel,France"


In [53]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
# bfill (also known as backfill): like ffill, but going from the last row to the first
df.bfill(inplace=True)
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England"
2,Albert Augustus Charles,Wettin,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England"
3,Victoria Adelaide Mary,Wettin,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus"
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England"
5,Alice Maud Mary,Windsor,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany"
...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,Sir,1879,,ABT 1911,
3006,John Sanford,Scobell,M,Sir,1879,,ABT 1911,
3007,James,Cartland,M,,,,ABT 1911,
3008,Flora,Cartland,F,,,,ABT 1911,
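
Forward and backward filling are easier to follow on a handful of values than on three thousand rows. Here is a small sketch on a made-up Series (`titles` is an illustrative name) showing both directions, plus the "limit" parameter, which caps how many consecutive NA entries get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps.
titles = pd.Series(['Queen', np.nan, np.nan, 'King', np.nan])

forward = titles.ffill()         # each gap takes the last valid value above it
backward = titles.bfill()        # each gap takes the next valid value below it
limited = titles.ffill(limit=1)  # fill at most one NA after each valid value

print(forward.tolist())   # ['Queen', 'Queen', 'Queen', 'King', 'King']
print(backward.tolist())  # the final entry stays NA: nothing comes after it
print(limited.tolist())   # the second NA in the first gap stays NA
```

Note that the final entry survives a backward fill as NA, which matches the trailing NA entries still visible in the bfill output above.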


We can also create new columns from existing ones. For example:

In [55]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
df['full_name'] = df['first_name'] + ' ' + df['last_name']
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place,full_name
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England",Victoria Hanover
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England",
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus",
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England",Edward_VII Wettin
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany",
...,...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,,,,Emily Scobell
3006,John Sanford,Scobell,M,Sir,1879,,,,John Sanford Scobell
3007,James,Cartland,M,,,,,,James Cartland
3008,Flora,,F,,,,,,


This illustrates a problem. Any time one of the values is NaN, the result of the string concatenation is also NaN. How could we get around this? Well, we could create our own specific function that handles this, and then apply that to our dataframe:

In [57]:
def create_full_name(row):
    if isinstance(row['first_name'], str) and isinstance(row['last_name'], str):  # both first_name and last_name are strings
        result = row['first_name'] + ' ' + row['last_name']
    elif isinstance(row['first_name'], str):  # only first_name is a string
        result = row['first_name']
    elif isinstance(row['last_name'], str):  # only last_name is a string
        result = row['last_name']
    else:  # neither first_name nor last_name are strings, they are both NaN
        result = np.nan 
    return result

df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')

df['full_name'] = df.apply(create_full_name, axis=1)
df

ID,first_name,last_name,sex,title,birth_date,birth_place,death_date,death_place,full_name
1,Victoria,Hanover,F,Queen of England,24 MAY 1819,"Kensington,Palace,London,England",22 JAN 1901,"Osborne House,Isle of Wight,England",Victoria Hanover
2,Albert Augustus Charles,,M,Prince,26 AUG 1819,"Schloss Rosenau,Near Coburg,Germany",14 DEC 1861,"Windsor Castle,Berkshire,England",Albert Augustus Charles
3,Victoria Adelaide Mary,,F,Princess Royal,21 NOV 1840,"Buckingham,Palace,London,England",5 AUG 1901,"Friedrichshof,Near,Kronberg,Taunus",Victoria Adelaide Mary
4,Edward_VII,Wettin,M,King of England,9 NOV 1841,"Buckingham,Palace,London,England",6 MAY 1910,"Buckingham,Palace,London,England",Edward_VII Wettin
5,Alice Maud Mary,,F,Princess,25 APR 1843,"Buckingham,Palace,London,England",14 DEC 1878,"Darmstadt,,,Germany",Alice Maud Mary
...,...,...,...,...,...,...,...,...,...
3005,Emily,Scobell,F,,,,,,Emily Scobell
3006,John Sanford,Scobell,M,Sir,1879,,,,John Sanford Scobell
3007,James,Cartland,M,,,,,,James Cartland
3008,Flora,,F,,,,,,Flora


The "apply" method calls the specified function once per row or once per column, depending on the "axis" option. You could also use a Python lambda function to define the function inline if needed. The option "axis = 1" passes each row to the function, while the default, "axis = 0", passes each column. For example:

In [59]:
# Create a dataframe that is a 6 x 2 array formed from a list of 12 numbers ordered from 0 to 11.
df = pd.DataFrame(np.arange(12).reshape(6,2), columns = ['column 1', 'column 2'])
print(df)

   column 1  column 2
0         0         1
1         2         3
2         4         5
3         6         7
4         8         9
5        10        11


In [60]:
# Create a new Series that holds the maximum value of each column in the dataframe we just created.
column_maxes = df.apply(lambda column: column.max())
print(column_maxes)

column 1    10
column 2    11
dtype: int64
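
Returning to the full-name column built earlier with a row-wise function: apply with "axis = 1" is flexible, but it calls a Python function once per row. For this particular task, one possible alternative is to work on whole columns at once. The sketch below is an assumption about how you might do that, not the approach used in this lecture, and it runs on a small made-up dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering all four cases: both names, first only, last only, neither.
small_df = pd.DataFrame({
    'first_name': ['Victoria', 'Albert Augustus Charles', np.nan, np.nan],
    'last_name': ['Hanover', np.nan, 'Cartland', np.nan],
})

# Treat missing names as empty strings so that concatenation never yields NaN.
combined = small_df['first_name'].fillna('') + ' ' + small_df['last_name'].fillna('')

# Remove the stray leading or trailing space left when one name was missing.
stripped = combined.str.strip()

# Keep non-empty results; where both names were missing, put NaN back.
small_df['full_name'] = stripped.where(stripped != '')
print(small_df['full_name'].tolist())
```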


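There are three main functions used to create or change data in dataframes: apply, map, and applymap. Roughly, "apply" works on whole rows or columns of a dataframe, "map" works on each element of a single column (a Series), and "applymap" works on each element of a whole dataframe. (In pandas 2.1 and later, "applymap" is deprecated in favor of the dataframe's own "map" method.)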

In [62]:
df = pd.DataFrame(np.arange(8).reshape(4,2), columns = ['column 1', 'column 2'])
print(df)

   column 1  column 2
0         0         1
1         2         3
2         4         5
3         6         7


In [63]:
print(df.apply(np.max))

column 1    6
column 2    7
dtype: int64


In [64]:
print(df.apply(np.max, axis = 1))

0    1
1    3
2    5
3    7
dtype: int64


In [65]:
print(df['column 1'].map(lambda x: x*2))

0     0
1     4
2     8
3    12
Name: column 1, dtype: int64


In [66]:
print(df.applymap(lambda x: x*2))

   column 1  column 2
0         0         2
1         4         6
2         8        10
3        12        14


### Changing Datatypes of Series or Columns

The datatypes for our "royal_line" examples have all been 'objects' because every column has had data that's been interpreted as a string. This is a general, default datatype that is quite encompassing in what it can handle. However, there are some functions, like maximum or average, that make sense for certain types of numeric data, but not for general data, and if we try to apply these functions to objects we'll have a bad time.

In [69]:
# Let's create a simple dataframe with three columns containing different types of data:
df = pd.DataFrame({'ints': [1,2,3,4], 'strings': ['a','b','c','d'], 'floats': [1.1, '2.2', '3.3', 4]})
print(df)
print(df.dtypes)

   ints strings floats
0     1       a    1.1
1     2       b    2.2
2     3       c    3.3
3     4       d      4
ints        int64
strings    object
floats     object
dtype: object


Here, the second and third columns are interpreted as objects because both contain strings (the values 2.2 and 3.3 in the floats column were entered as strings).

To convert these to a different datatype, we can convert a single column, or multiple columns using a dictionary.

In [71]:
df['floats'] = df['floats'].astype(float)
print(df.dtypes)

ints         int64
strings     object
floats     float64
dtype: object


In [72]:
convert_dict = {'ints': int, 'strings': str, 'floats': float}
df = df.astype(convert_dict)
print(df.dtypes)

ints         int64
strings     object
floats     float64
dtype: object


In [73]:
# The following command would also work:
df['ints'] = df['ints'].astype(float)
print(df)
print(df.dtypes)

   ints strings  floats
0   1.0       a     1.1
1   2.0       b     2.2
2   3.0       c     3.3
3   4.0       d     4.0
ints       float64
strings     object
floats     float64
dtype: object


In [74]:
# But this one won't:
df['strings'] = df['strings'].astype(int)

ValueError: invalid literal for int() with base 10: 'a'
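
When a conversion like the one above fails because only some entries are bad, one option is pd.to_numeric with errors='coerce', which turns anything it can't parse into NaN instead of raising an error. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical column where one entry isn't a number.
mixed = pd.Series(['1', '2', 'a', '4.5'])

# errors='coerce' replaces unparseable entries with NaN rather than raising ValueError.
numeric = pd.to_numeric(mixed, errors='coerce')
print(numeric.tolist())  # [1.0, 2.0, nan, 4.5]
print(numeric.dtype)     # float64
```

Whether silently turning bad entries into NaN is acceptable depends on the analysis, so it's worth checking how many values were coerced.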

Pandas has many built-in conversion functions (to_datetime, to_timedelta, to_numeric, etc.), but you'll sometimes encounter data that's formatted in such a way that it can't be converted directly to the format you want using one of them. To deal with this, you sometimes need to write your own conversion function.

For example, if we check out the 'birth_date' column in our royal_line dataset, we see:

In [113]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
print(df['birth_date'])

ID
1       24 MAY 1819
2       26 AUG 1819
3       21 NOV 1840
4        9 NOV 1841
5       25 APR 1843
           ...     
3005            NaN
3006           1879
3007            NaN
3008            NaN
3009            NaN
Name: birth_date, Length: 3009, dtype: object


A lot of NaN. OK, let's remove these and see what we get:

In [115]:
df.dropna(subset = ['birth_date'], inplace=True)
print(df['birth_date'])

ID
1       24 MAY 1819
2       26 AUG 1819
3       21 NOV 1840
4        9 NOV 1841
5       25 APR 1843
           ...     
2989    11 OCT 1937
2996     5 SEP 1877
2997     3 JAN 1907
2998     4 JAN 1912
3006           1879
Name: birth_date, Length: 1734, dtype: object


If we then try to convert these values to datetimes we get:

In [117]:
df['birth_date'] = pd.to_datetime(df['birth_date']) # This fails

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1292-01-01 00:00:00 present at position 123

This call fails for two reasons. First, there are entries in the dataset formatted like the following: ABT 751. This notation means that the family history experts believe the person was born about (ABT) the year 751. Second, Pandas stores timestamps with nanosecond resolution by default, which only supports a range of roughly 584 years, from around 1677 to 2262, so a date such as 1292-01-01 triggers the out of bounds nanosecond timestamp error shown above.
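We can check both problems directly. The sketch below prints the bounds of the default Timestamp type and then parses two sample strings written in the style of the birth_date column (the series `sample` is a hypothetical stand-in, not read from the file). Passing errors='coerce' makes to_datetime return NaT for entries that don't match the format rather than raising:

```python
import pandas as pd

# Bounds of the default nanosecond-resolution Timestamp
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# Hypothetical sample values in the style of the birth_date column
sample = pd.Series(['24 MAY 1819', 'ABT 751'])

# Entries that don't match 'day month-abbreviation year' become NaT
parsed = pd.to_datetime(sample, format='%d %b %Y', errors='coerce')
print(parsed)
```

Coercing throws away the approximate year in 'ABT 751', which is one reason to prefer a custom function here.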

To get around these issues, we'll write and then apply our own function. Note that the function passes NaN values through rather than dropping them.

In [157]:
def get_year(x):
    if pd.isna(x):
        year_result = np.nan  # if the birth date is missing (NaN) then return nan
    else:  # checking a number of edge cases in the data and stripping it out:
        if "ABT" in x:  # for example: ABT  1775  
            x = x[3:]
            x = x.strip()
        if "/" in x:  #  For example: 1775/1776
            x = x[:x.find('/')]
        num_spaces = x.count(' ')
        if num_spaces == 0:  # only has the year
            year_result = int(x)
        elif num_spaces == 1:  # example: FEB 1337
            x = x[x.rfind(' ') + 1:]  # 'rfind' finds the last space. The 'r' stands for 'reverse.'
            if x.isnumeric():
                year_result = int(x)
            else:  # This could happen if there is only a day and month, like '10 JAN'
                year_result = np.nan
        elif num_spaces == 2:  # example: 16 FEB 1337
            x = x[x.rfind(' ') + 1:]  # 'rfind' finds the last space. The 'r' stands for 'reverse.'
            year_result = int(x)
        else:
            year_result = np.nan  # There are a few other strange dates that aren't worth our time to fix, so just return nan for those.
    return year_result

df['birth_year'] = df['birth_date'].map(get_year)

print(df['birth_year'])

ID
1       1819.0
2       1819.0
3       1840.0
4       1841.0
5       1843.0
         ...  
2989    1937.0
2996    1877.0
2997    1907.0
2998    1912.0
3006    1879.0
Name: birth_year, Length: 1734, dtype: float64
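As an aside, many of the same cases can be handled without a Python-level function by using a regular expression with the str.extract method. This is an alternative sketch, not the approach used above, and it's only been checked against the handful of made-up strings shown here (the pattern and the `dates` series are assumptions for illustration):

```python
import pandas as pd

# Made-up strings covering the formats discussed above
dates = pd.Series(['24 MAY 1819', 'ABT 751', '1775/1776', 'FEB 1337', None, '10 JAN'])

# Capture a 3- or 4-digit year at the end of the string,
# optionally followed by a '/second-year' suffix that we ignore
years = dates.str.extract(r'(\d{3,4})(?:/\d+)?\s*$')[0]

# Strings with no match (and missing values) come back as NaN
birth_year = pd.to_numeric(years)
print(birth_year)
```

The custom function is easier to extend with special cases, while the regular expression is shorter, so which is preferable depends on how messy the column is.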


### Conditionals in Dataframes and Series

Conditionals are a very useful feature of Pandas. Comparing a series or a dataframe column against a value typically produces a Pandas Boolean series (or a NumPy array of Booleans) with one True or False entry per row.

For example, consider the following code that produces True if the birth_year column (calculated above) is greater than or equal to 1990, and False otherwise.

In [160]:
boolean_mask = (df.birth_year >= 1990)
print(boolean_mask)

ID
1       False
2       False
3       False
4       False
5       False
        ...  
2989    False
2996    False
2997    False
2998    False
3006    False
Name: birth_year, Length: 1734, dtype: bool


We can then use this to, for example, only print the entries for which the boolean is True.

In [162]:
print(df[boolean_mask][['first_name', 'last_name', 'birth_year']])

                   first_name last_name  birth_year
ID                                                 
2958  Eugenie Victoria Helena   Windsor      1990.0
2961                      NaN    Mowatt      1990.0
2963                    Kitty       NaN      1991.0


We can combine Boolean expressions using the logical operators & ("and"), | ("or"), and ~ ("not"). For example:

In [164]:
print(df[(df.birth_year >= 1500) & (df.title.str.contains('Queen'))][['first_name', 'title', 'birth_year']])

                          first_name             title  birth_year
ID                                                                
1                           Victoria  Queen of England      1819.0
27            Victoria Eugenie Ena""    Queen of Spain      1887.0
30                Mary_of_Teck (May)             Queen      1867.0
52       Elizabeth_II Alexandra Mary  Queen of England      1926.0
76                 Sophie of_Prussia   Queen of Greece      1870.0
96    Marie of_Saxe-Coburg and_Gotha  Queen of Romania      1875.0
610                     Mergrethe_II  Queen of Denmark      1940.0
656              Emma of_Netherlands      Queen Regent      1858.0
657        Wilhelmina of_Netherlands             Queen      1880.0
659           Juliana of_Netherlands             Queen      1909.0
661           Beatrix of_Netherlands             Queen      1938.0
682        Maria Cristina of_Austria    Queen of Spain      1858.0
693                             Anne  Queen of England      16
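Here's a small sketch of all three operators on a made-up four-row dataframe (the name `toy` and its contents are for illustration only). Two details are worth noting: each condition needs its own parentheses when written inline, and str.contains returns NaN for missing titles unless we pass na=False.

```python
import numpy as np
import pandas as pd

# A tiny made-up dataframe with a missing title and a missing birth year
toy = pd.DataFrame({'first_name': ['Victoria', 'Albert', 'Elizabeth_II', 'Flora'],
                    'title': ['Queen of England', 'Prince', 'Queen of England', np.nan],
                    'birth_year': [1819, 1819, 1926, np.nan]})

is_queen = toy['title'].str.contains('Queen', na=False)  # missing titles count as False
born_after_1900 = toy['birth_year'] >= 1900              # comparisons with NaN are False

both = toy[is_queen & born_after_1900]    # "and": queens born in 1900 or later
either = toy[is_queen | born_after_1900]  # "or": queens, or anyone born in 1900 or later
not_queen = toy[~is_queen]                # "not": everyone who isn't a queen

print(both['first_name'].tolist())
print(either['first_name'].tolist())
print(not_queen['first_name'].tolist())
```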

### loc and iloc Functions

One of the most common tasks for data scientists is filtering the information to more efficiently derive actionable insights about the data. We've seen the "head" and "tail" functions, which provide a quick, truncated view of the beginning or end, respectively, of the dataframe or series. But what if you're interested in examining results that are not necessarily at the very beginning or end of the dataset?

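For this purpose, loc is designed to access rows and columns by label. In contrast, iloc accesses rows and columns by integer position (0 for the first row, 1 for the second, and so on); the "i" stands for "integer". Although they're often called functions, both are used with square brackets rather than parentheses.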

A quick example of the difference is illustrated below, where both commands do the same thing:

In [167]:
print(df.loc[1])

first_name                                Victoria
last_name                                  Hanover
sex                                              F
title                             Queen of England
birth_date                             24 MAY 1819
birth_place       Kensington,Palace,London,England
death_date                             22 JAN 1901
death_place    Osborne House,Isle of Wight,England
birth_year                                  1819.0
Name: 1, dtype: object


In [168]:
print(df.iloc[0])

first_name                                Victoria
last_name                                  Hanover
sex                                              F
title                             Queen of England
birth_date                             24 MAY 1819
birth_place       Kensington,Palace,London,England
death_date                             22 JAN 1901
death_place    Osborne House,Isle of Wight,England
birth_year                                  1819.0
Name: 1, dtype: object


However, the following will produce an error:

In [170]:
print(df.loc[0])

KeyError: 0

Because there is no row with label 0.

Now, row indices (labels) don't need to be unique. For example:

In [203]:
df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'], index=['cat', 42, 'stone', 42, 12345])  # Five rows, each with an associated index label
print(df)

       A  B
cat    0  1
42     2  3
stone  4  5
42     6  7
12345  8  9


The index "42" appears twice, and some indices are numbers, while some are strings. Let's look at some examples:

In [205]:
print(df.loc[12345])

A    8
B    9
Name: 12345, dtype: int64


In [206]:
print(df.loc['stone'])

A    4
B    5
Name: stone, dtype: int64


In [207]:
print(df.loc[42])

    A  B
42  2  3
42  6  7


In [208]:
print(df.loc['A'])

KeyError: 'A'

In [236]:
print(df.loc['cat':'stone'])

       A  B
cat    0  1
42     2  3
stone  4  5


In [237]:
print(df.loc[['cat','stone']])

       A  B
cat    0  1
stone  4  5


In [238]:
print(df.loc['stone', 'B'])

5


In [239]:
print(df.loc[df['A'] > 3])

       A  B
stone  4  5
42     6  7
12345  8  9
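Because duplicate labels change what loc returns (a series for a unique label, a dataframe for a repeated one), it can be worth checking the index before relying on it. Here's a minimal sketch that rebuilds the same five-row dataframe under the hypothetical name `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'],
                    index=['cat', 42, 'stone', 42, 12345])

# is_unique is False because the label 42 appears twice
print(demo.index.is_unique)

# A unique label gives a Series; a repeated label gives a DataFrame
print(type(demo.loc['cat']).__name__)
print(type(demo.loc[42]).__name__)

# Keep only the first row for each label
deduped = demo[~demo.index.duplicated(keep='first')]
print(deduped)
```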


Now let's take a look at some iloc examples:

In [241]:
print(df.iloc[0])

A    0
B    1
Name: cat, dtype: int64


In [242]:
print(df.iloc[0:3])

       A  B
cat    0  1
42     2  3
stone  4  5


In [243]:
print(df.iloc[[0,2,4]])

       A  B
cat    0  1
stone  4  5
12345  8  9


In [244]:
print(df.iloc[0,1])

1


In [245]:
print(df.iloc[0:3,1])

cat      1
42       3
stone    5
Name: B, dtype: int64
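One difference that's easy to miss in the examples above: a loc slice includes its end label, while an iloc slice excludes its end position, just like ordinary Python slicing. Here's a short sketch on a made-up dataframe with the hypothetical name `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'],
                    index=['a', 'b', 'c', 'd', 'e'])

by_label = demo.loc['a':'c']   # rows 'a', 'b', and 'c' (end label included)
by_position = demo.iloc[0:2]   # rows at positions 0 and 1 (end position excluded)

print(by_label)
print(by_position)
```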


Returning to the royal family history data as an example, let's create a new column named "era". The "era" column signifies if a person was born in one of three distinct time periods: 'ancient', 'middle_years', or 'modern'. The following creates a new column and initially assigns the value 'unknown' to every entry within it:

In [247]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)
df['era'] = 'unknown'

In [248]:
print(df)

                   first_name last_name sex             title   birth_date  \
ID                                                                           
1                    Victoria   Hanover   F  Queen of England  24 MAY 1819   
2     Albert Augustus Charles       NaN   M            Prince  26 AUG 1819   
3      Victoria Adelaide Mary       NaN   F    Princess Royal  21 NOV 1840   
4                  Edward_VII    Wettin   M   King of England   9 NOV 1841   
5             Alice Maud Mary       NaN   F          Princess  25 APR 1843   
...                       ...       ...  ..               ...          ...   
3005                    Emily   Scobell   F               NaN          NaN   
3006             John Sanford   Scobell   M               Sir         1879   
3007                    James  Cartland   M               NaN          NaN   
3008                    Flora       NaN   F               NaN          NaN   
3009                      NaN  Cartland   F               NaN   

The next question is how to divide the birth years. If we check out their maximum and minimum values, we get:

In [250]:
print(f"The earliest year = {df['birth_year'].min()} and the latest year = {df['birth_year'].max()}.")

The earliest year = 686.0 and the latest year = 1991.0.


So, 686 is the earliest year, and 1991 is the latest. This is a difference of 1991 - 686 = 1305 years, which, divided by 3, gives us 435 years per era. So, the "ancient" royals are those born between 686 and 1121, the "middle_years" royals are those born between 1122 and 1555, and the "modern" royals are those born after 1555. (Not all that modern!) We can assign these three eras with the following code:

In [252]:
df.loc[df['birth_year'] < 1122, 'era'] = 'ancient'  # 686 – 1121
df.loc[(df['birth_year'] >= 1122) & (df['birth_year'] <= 1555), 'era'] = 'middle_years'  # 1122 – 1555
df.loc[df['birth_year'] > 1555, 'era'] = 'modern'  # after 1555
print(df)

                   first_name last_name sex             title   birth_date  \
ID                                                                           
1                    Victoria   Hanover   F  Queen of England  24 MAY 1819   
2     Albert Augustus Charles       NaN   M            Prince  26 AUG 1819   
3      Victoria Adelaide Mary       NaN   F    Princess Royal  21 NOV 1840   
4                  Edward_VII    Wettin   M   King of England   9 NOV 1841   
5             Alice Maud Mary       NaN   F          Princess  25 APR 1843   
...                       ...       ...  ..               ...          ...   
3005                    Emily   Scobell   F               NaN          NaN   
3006             John Sanford   Scobell   M               Sir         1879   
3007                    James  Cartland   M               NaN          NaN   
3008                    Flora       NaN   F               NaN          NaN   
3009                      NaN  Cartland   F               NaN   

We could have also done this using a custom function and the "map" utility.
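Here's a minimal sketch of what that might look like. The function name `get_era` and the sample years are made up for illustration; the cutoffs match the ones used above.

```python
import numpy as np
import pandas as pd

def get_era(year):
    """Map a birth year to an era label, using the same cutoffs as above."""
    if pd.isna(year):
        return 'unknown'
    if year < 1122:
        return 'ancient'
    if year <= 1555:
        return 'middle_years'
    return 'modern'

# Made-up birth years on both sides of each cutoff, plus a missing value
birth_year = pd.Series([686.0, 1121.0, 1122.0, 1555.0, 1556.0, 1991.0, np.nan])
era = birth_year.map(get_era)
print(era)
```

On the real dataframe this would be df['era'] = df['birth_year'].map(get_era), which replaces both the initial 'unknown' assignment and the three loc statements.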

### Reshaping with Pivot, Pivot_Table, Groupby, and Transpose

Frequently it is convenient or informative to restructure data contained in a dataframe, effectively organizing the data into a different shape or format. This section will cover the most common reshaping functions provided by Pandas.

The pivot function reshapes data into a cross-tabular format: the distinct values of one column become the row index, the distinct values of another column become the column headers, and a third column supplies the cell values. For example:

In [256]:
df = pd.DataFrame({'Car': [1, 1, 2, 2],
                   'Type': ['new', 'used', 'new', 'used'],
                   'Price': [10, 5, 12, 7]})
print(df)

   Car  Type  Price
0    1   new     10
1    1  used      5
2    2   new     12
3    2  used      7


Calling the pivot function on this dataframe will rework the data into a more compact and usable format. In the example above, we want to reshape the data such that each car is represented on a single row. A use case for this particular reorganization would be a car salesperson who needs to quickly view all the prices of a given car for the different 'Type' categories.

In [258]:
p = df.pivot(index='Car', columns='Type', values='Price')
print(p)

Type  new  used
Car            
1      10     5
2      12     7


The pivot function only works if there is at most one entry per cell in the result. Suppose we have the following dataframe:

In [260]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})
print(df)

   Car  Type  Price
0    1   new     10
1    1  used      5
2    2   new     12
3    2  used      7
4    2  used      6


Invoking the pivot function on this dataframe will generate an error:

In [262]:
p = df.pivot(index='Car', columns='Type', values='Price')

ValueError: Index contains duplicate entries, cannot reshape

The reason for this is there are two price entries for the used version of car 2. In this case what we could do is use the "pivot_table" function with an aggregator, which specifies how to combine values when more than 1 occurs.

In [273]:
p = df.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')
print(p)

Type   new  used
Car             
1     10.0   5.0
2     12.0   6.5
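To find out in advance whether plain pivot would fail, one option is to look for rows that share the same (index, columns) pair using the duplicated method. Here's a sketch that rebuilds the car data under the hypothetical name `cars`:

```python
import pandas as pd

cars = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                     'Type': ['new', 'used', 'new', 'used', 'used'],
                     'Price': [10, 5, 12, 7, 6]})

# keep=False marks every row in a duplicated group, not just the repeats
dupes = cars[cars.duplicated(subset=['Car', 'Type'], keep=False)]
print(dupes)
```

If `dupes` is empty, pivot is safe to use; otherwise we need pivot_table with an aggregator, or we need to clean up the duplicates first.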


Pivot tables can result in immensely complex tabular formats with multiple indexes, multiple columns, and various aggregation functions specified. Today, we demonstrate only the basic single-index, single-column case.

Here's another example of a pivot_table using our royal family history dataset:

In [276]:
df = pd.read_csv('../Datasets/royal_line.csv', index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)

df.dropna(inplace=True, subset=['title', 'sex', 'birth_year'])
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc='mean')
print(p)

sex                    F       M
title                           
Admiral              NaN  1759.0
Admiral Sir          NaN  1881.0
Archduchess       1825.0     NaN
Archduke             NaN  1844.6
Baron                NaN  1901.0
...                  ...     ...
Tsarina           1859.5     NaN
Vicount Althorp      NaN  1964.0
Vicount Linley       NaN  1961.0
Viscount             NaN  1913.8
Viscount Hampden     NaN  1877.5

[171 rows x 2 columns]


Here are two more examples. The first fills blank entries in the resulting pivot table after aggregation with $0$ instead of NaN, and uses two aggregate functions, mean and count.

In [278]:
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

                    mean         count   
sex                    F       M     F  M
title                                    
Admiral              0.0  1759.0     0  1
Admiral Sir          0.0  1881.0     0  1
Archduchess       1825.0     0.0     1  0
Archduke             0.0  1844.6     0  5
Baron                0.0  1901.0     0  4
...                  ...     ...   ... ..
Tsarina           1859.5     0.0     2  0
Vicount Althorp      0.0  1964.0     0  1
Vicount Linley       0.0  1961.0     0  1
Viscount             0.0  1913.8     0  5
Viscount Hampden     0.0  1877.5     0  2

[171 rows x 4 columns]


The second is similar to the first, but instead declares two indexes and two columns producing a much more complicated, nested output result.

In [280]:
p = df.pivot_table(index=['title', 'first_name'], columns=['sex', 'last_name'], values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

                                                    mean                \
sex                                                    F                 
last_name                                Armstrong-Jones Ashley Baring   
title            first_name                                              
Admiral          Hugh                                  0      0      0   
Admiral Sir      Alexander                             0      0      0   
Baron            Carl Marten                           0      0      0   
                 Nicholas                              0      0      0   
                 Paul                                  0      0      0   
...                                                  ...    ...    ...   
Viscount         George Earl_of_Harewood               0      0      0   
                 Henry George Charles                  0      0      0   
                 Louis de_la_Torre                     0      0      0   
                 Mervyn Powerscourt   

The groupby function's recasting of information is very similar to that of the pivot_table function. In general, the main difference is how the resulting output is shaped. Note it's a common mistake to create a group object without specifying an aggregating function like mean, sum, or std.

Consider the following:

In [282]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})

g = df.groupby(by='Car')
print(g)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd0b6b0b010>


That's not particularly helpful. However, if we group by 'Car' and invoke the 'mean' function, we obtain a more useful result.

In [284]:
g = df.groupby(by='Car').mean()
print(g)

        Price
Car          
1    7.500000
2    8.333333


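In the version of Pandas used for these notes, this call also emits a warning (not reproduced here). That warning basically tells us that we tried to take the mean of 'Type', which doesn't make sense, so it was tossed out. In Pandas 2.0 and later the same call raises a TypeError instead, unless you pass numeric_only=True to mean or select the 'Price' column first. If instead of mean we wanted to use count, which works on any column type, we'd get: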

In [286]:
g = df.groupby(by='Car').count()
print(g)

     Type  Price
Car             
1       2      2
2       3      3
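If we only want an aggregate for particular columns, we can say so explicitly. The sketch below shows two ways to take the mean price per car without involving the 'Type' column (the name `cars` is a hypothetical copy of the dataframe above):

```python
import pandas as pd

cars = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                     'Type': ['new', 'used', 'new', 'used', 'used'],
                     'Price': [10, 5, 12, 7, 6]})

# Option 1: select the column first; the result is a Series indexed by Car
mean_price = cars.groupby(by='Car')['Price'].mean()
print(mean_price)

# Option 2: aggregate only the numeric columns; the result is a DataFrame
mean_numeric = cars.groupby(by='Car').mean(numeric_only=True)
print(mean_numeric)
```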


If we wanted to group by both 'Car' and 'Type', and use two different aggregation functions, we could do that:

In [288]:
g = df.groupby(by=['Car','Type']).agg(['mean','count'])
print(g)

         Price      
          mean count
Car Type            
1   new   10.0     1
    used   5.0     1
2   new   12.0     1
    used   6.5     2
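To see how close groupby and pivot_table really are, we can group by both columns, take the mean price, and then move the 'Type' level of the index into the columns with unstack. This sketch (again using a hypothetical copy named `cars`) compares the result with the pivot table from earlier:

```python
import pandas as pd

cars = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                     'Type': ['new', 'used', 'new', 'used', 'used'],
                     'Price': [10, 5, 12, 7, 6]})

# Group, aggregate, then pivot the inner index level ('Type') into columns
via_groupby = cars.groupby(by=['Car', 'Type'])['Price'].mean().unstack()

# The same table built directly with pivot_table
via_pivot = cars.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')

print(via_groupby)
print(via_pivot)
print((via_groupby.values == via_pivot.values).all())
```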


The transpose function (or simply T) transposes a dataframe, swapping its rows and columns. For example:

In [290]:
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
print(f'Original\n{df}')

df = df.transpose()  # or df.T

print(f'\nTransposed:\n{df}')

Original
   A  B
0  0  1
1  2  3
2  4  5

Transposed:
   0  1  2
A  0  2  4
B  1  3  5
