# DataFrame

The DataFrame object of the pandas package represents a table of data. Each column is a Series, and all columns share a common index (the row labels).
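As a minimal sketch of this structure, the following builds a small table by hand (three students copied from the data used later in this notebook) and checks that a column is a Series sharing the table's index. The variable names `small` and `col` are only for illustration.

```python
import pandas as pd

# Build a small table from a dict of columns; the index holds the row labels.
small = pd.DataFrame(
    {"hw1": [2.0, 10.0, 9.0], "hw2": [4.0, 10.0, 1.0]},
    index=["Demetria", "Dorian", "Garland"],
)

col = small["hw1"]                      # one column of the table
print(type(col).__name__)               # Series
print(col.index.equals(small.index))    # True: the column shares the table's index
```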

In [78]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
import numpy as np

## Create a DataFrame

### From a file

Place the data file in the same folder as the ipynb file. Then, read it as follows:

In [2]:
df = pd.read_csv("students.csv")  # a relative path is resolved against the working directory, normally the folder containing this .ipynb file

In [3]:
df

,Name,hw1,hw2,program
0,Demetria,2.0,4.0,MSIS
1,Dorian,10.0,10.0,MSIS
2,Garland,9.0,1.0,MSIS
3,Iluminada,2.0,,MBA
4,Jeannine,6.0,7.0,MSIS
5,Jenny,8.0,,
6,John,,10.0,MSIS
7,Luci,7.0,7.0,MSIS
8,Mercy,5.0,6.0,MSIS
9,Michael,6.0,10.0,MBA


By default, the index is 0, 1, ...

Let us set the index as the column "Name".

In [4]:
df.set_index("Name", inplace=True)

In [5]:
df

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Michael,6.0,10.0,MBA
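Reading the file and setting the index can also be combined by passing `index_col` to `read_csv`. The sketch below reads from an in-memory string instead of students.csv so that it runs on its own; the three rows are copied from the table above, and the exact layout of the real file is an assumption.

```python
import io
import pandas as pd

# Stand-in for students.csv (assumption: the file has a header row like this one).
csv_text = """Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Iluminada,2.0,,MBA
"""

# index_col="Name" uses the Name column as the index while reading.
students = pd.read_csv(io.StringIO(csv_text), index_col="Name")
print(students.index.tolist())   # ['Demetria', 'Dorian', 'Iluminada']
```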


## index, columns, values

<b>index</b> returns the index labels

In [6]:
df.index

Index([u'Demetria', u'Dorian', u'Garland', u'Iluminada', u'Jeannine', u'Jenny',
       u'John', u'Luci', u'Mercy', u'Michael', u'Shelby'],
      dtype='object', name=u'Name')

In [7]:
df.index[2]

'Garland'

<b>columns</b> returns the list of column names (as an index object)

In [8]:
df.columns

Index([u'hw1', u'hw2', u'program'], dtype='object')

<b>values</b> returns a (2-dimensional) ndarray of values

In [9]:
df.values

array([[2.0, 4.0, 'MSIS'],
       [10.0, 10.0, 'MSIS'],
       [9.0, 1.0, 'MSIS'],
       [2.0, nan, 'MBA'],
       [6.0, 7.0, 'MSIS'],
       [8.0, nan, nan],
       [nan, 10.0, 'MSIS'],
       [7.0, 7.0, 'MSIS'],
       [5.0, 6.0, 'MSIS'],
       [6.0, 10.0, 'MBA'],
       [1.0, 10.0, 'MSIS']], dtype=object)
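Because the table mixes numbers and text, the array above has dtype=object. A small sketch (two rows copied from the data above) suggests that selecting the numeric columns first gives a float array. It uses `to_numpy()`, which the pandas documentation recommends over `values` in recent versions.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 10.0], "hw2": [4.0, 10.0], "program": ["MSIS", "MSIS"]},
    index=["Demetria", "Dorian"],
)

mixed = small.to_numpy()                    # numbers and strings together
grades = small[["hw1", "hw2"]].to_numpy()   # numeric columns only

print(mixed.dtype)    # object
print(grades.dtype)   # float64
print(grades.shape)   # (2, 2)
```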

## df.iloc[x, y]

Access using the positional index. 
<ul>
<li><b>x</b> is the information needed to select the rows: an integer position, a slice of positions, or a list of positions</li>
<li><b>y (optional)</b> is the information needed to select the columns: an integer position, a slice of positions, or a list of positions</li>
</ul>

Access one row by specifying a positional index

In [10]:
df.iloc[2,:]

hw1           9
hw2           1
program    MSIS
Name: Garland, dtype: object

In [11]:
df.iloc[2]

hw1           9
hw2           1
program    MSIS
Name: Garland, dtype: object

Access one column by specifying positional index of the column

In [12]:
df.iloc[:,1]

Name
Demetria      4.0
Dorian       10.0
Garland       1.0
Iluminada     NaN
Jeannine      7.0
Jenny         NaN
John         10.0
Luci          7.0
Mercy         6.0
Michael      10.0
Shelby       10.0
Name: hw2, dtype: float64

Access one specific value

In [13]:
df.iloc[2,1]

1.0

Access a subset of rows and of columns

In [14]:
df.iloc[:5,-2:]

Name,hw2,program
Demetria,4.0,MSIS
Dorian,10.0,MSIS
Garland,1.0,MSIS
Iluminada,,MBA
Jeannine,7.0,MSIS
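The slicing rules of iloc follow those of Python lists: the end position is excluded and negative positions count from the end. A short sketch on a four-row table (rows copied from the data above; `small` is an illustrative name):

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 10.0, 9.0, 2.0], "hw2": [4.0, 10.0, 1.0, float("nan")]},
    index=["Demetria", "Dorian", "Garland", "Iluminada"],
)

first_two = small.iloc[:2]         # positions 0 and 1; position 2 is excluded
picked = small.iloc[[0, 3], [1]]   # lists of positions select specific rows/columns
last_row = small.iloc[-1]          # negative positions count from the end

print(first_two.index.tolist())    # ['Demetria', 'Dorian']
print(picked.shape)                # (2, 1)
print(last_row.name)               # Iluminada
```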


## df.loc[x, y]

Access using the index labels. 
<ul>
<li><b>x</b> is the information needed to select the rows: label index, range of index labels, or boolean masks</li>
<li><b>y (optional)</b> is the information needed to select the columns: label index, range of index labels, or boolean masks</li>
</ul>

Access one specific value by specifying index label and column name

In [15]:
df.loc['Garland','hw2']

1.0

Access one row by specifying index label

In [16]:
df.loc['Garland',:]

hw1           9
hw2           1
program    MSIS
Name: Garland, dtype: object

Or, more simply:

In [17]:
df.loc['Garland']

hw1           9
hw2           1
program    MSIS
Name: Garland, dtype: object
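One difference from iloc is worth a quick check: a range of labels in loc includes both endpoints, while a range of positions in iloc excludes the end. The sketch below uses a four-row table copied from the data above.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 10.0, 9.0, 2.0], "hw2": [4.0, 10.0, 1.0, float("nan")]},
    index=["Demetria", "Dorian", "Garland", "Iluminada"],
)

by_label = small.loc["Dorian":"Garland"]   # both endpoints are included
by_position = small.iloc[1:2]              # the end position is excluded

print(by_label.index.tolist())      # ['Dorian', 'Garland']
print(by_position.index.tolist())   # ['Dorian']
```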

Access one column by specifying the column name

In [18]:
df.loc[:,'hw1']

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

Or, more simply:

In [19]:
df.hw1

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

In [20]:
df['hw1']

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

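Common mistake: trying to get Luci's whole row with df['Luci']. Square brackets on a DataFrame look up a column name, so this raises a KeyError.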

In [21]:
df['Luci']

KeyError: 'Luci'

In [22]:
# the correct way
df.loc['Luci',:]

hw1           7
hw2           7
program    MSIS
Name: Luci, dtype: object
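The same behaviour can be reproduced on a two-row table. This is only a sketch; `found_column` is an illustrative flag, not part of pandas.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [7.0, 5.0], "hw2": [7.0, 6.0]},
    index=["Luci", "Mercy"],
)

try:
    small["Luci"]            # looks for a column named 'Luci'
    found_column = True
except KeyError:
    found_column = False     # no such column, so pandas raises KeyError

row = small.loc["Luci"]      # looks for the row labelled 'Luci'
print(found_column)          # False
print(row["hw1"])            # 7.0
```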

Select those students whose name starts with 'J'

In [23]:
mask = (df.index >= 'J') & (df.index < 'K')
mask

array([False, False, False, False,  True,  True,  True, False, False,
       False, False], dtype=bool)

In [24]:
df.loc[mask,:]

Name,hw1,hw2,program
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
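An alternative way to build the same kind of mask is the string accessor of the index, which may read more directly than the two comparisons. A sketch on three names taken from the data above:

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [6.0, 8.0, 7.0]},
    index=pd.Index(["Jeannine", "Jenny", "Luci"], name="Name"),
)

mask = small.index.str.startswith("J")   # boolean array, one entry per row
print(mask.tolist())                     # [True, True, False]
print(small.loc[mask].index.tolist())    # ['Jeannine', 'Jenny']
```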


## Problems

#### Retrieve Shelby's hw1 grade

In [25]:
df.loc['Shelby','hw1']

1.0

#### Retrieve Shelby's information

In [26]:
df.loc['Shelby',:]

hw1           1
hw2          10
program    MSIS
Name: Shelby, dtype: object

#### Who obtained the highest grade in hw2? Note that there are ties

In [27]:
df.loc[:,'hw2'].nlargest(2)

Name
Dorian    10.0
John      10.0
Name: hw2, dtype: float64

nlargest(2) returns only two students, but four are tied for the highest grade. Because of the ties, we need to proceed as follows:

In [28]:
# first, find the series of hw2 grades
df.hw2

Name
Demetria      4.0
Dorian       10.0
Garland       1.0
Iluminada     NaN
Jeannine      7.0
Jenny         NaN
John         10.0
Luci          7.0
Mercy         6.0
Michael      10.0
Shelby       10.0
Name: hw2, dtype: float64

In [29]:
# second, get the max value
df.hw2.max()

10.0

In [30]:
# third, create a mask that selects who has the max grade
mask = df.hw2 == df.hw2.max()
mask

Name
Demetria     False
Dorian        True
Garland      False
Iluminada    False
Jeannine     False
Jenny        False
John          True
Luci         False
Mercy        False
Michael       True
Shelby        True
Name: hw2, dtype: bool

In [31]:
# fourth, use the mask to select a subset of entries
df.hw2[mask]

Name
Dorian     10.0
John       10.0
Michael    10.0
Shelby     10.0
Name: hw2, dtype: float64

All together

In [32]:
df.loc[df.hw2 == df.hw2.max()]

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
John,,10.0,MSIS
Michael,6.0,10.0,MBA
Shelby,1.0,10.0,MSIS
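As a possible shortcut, and assuming a reasonably recent pandas version in which `Series.nlargest` accepts `keep="all"`, the ties can be kept without building the mask by hand. The four grades below are copied from the hw2 column above.

```python
import pandas as pd

hw2 = pd.Series(
    [4.0, 10.0, 1.0, 10.0],
    index=["Demetria", "Dorian", "Garland", "John"],
    name="hw2",
)

top = hw2.nlargest(1, keep="all")   # keeps every entry tied for the largest value
print(sorted(top.index))            # ['Dorian', 'John']
```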


#### Find those students who obtained the same score in hw1 and in hw2.

In [33]:
df.hw1

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

In [34]:
df.hw2

Name
Demetria      4.0
Dorian       10.0
Garland       1.0
Iluminada     NaN
Jeannine      7.0
Jenny         NaN
John         10.0
Luci          7.0
Mercy         6.0
Michael      10.0
Shelby       10.0
Name: hw2, dtype: float64

In [35]:
df.hw1 == df.hw2

Name
Demetria     False
Dorian        True
Garland      False
Iluminada    False
Jeannine     False
Jenny        False
John         False
Luci          True
Mercy        False
Michael      False
Shelby       False
dtype: bool

In [36]:
df.loc[df.hw1 == df.hw2,:]

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Luci,7.0,7.0,MSIS
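Missing grades never compare as equal, which is why students with a NaN are excluded from the result above. The sketch below illustrates this; "Zoe" is a hypothetical student (not in the data) with both grades missing.

```python
import pandas as pd

names = ["Dorian", "John", "Zoe"]   # "Zoe" is hypothetical, for illustration only
hw1 = pd.Series([10.0, float("nan"), float("nan")], index=names)
hw2 = pd.Series([10.0, 10.0, float("nan")], index=names)

same = hw1 == hw2
print(same.tolist())   # [True, False, False]: NaN == NaN is False
```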


#### Find the average hw1 score of those students who got a hw2 score greater than 5.

Make the filter on the rows: select the rows where hw2 > 5

In [37]:
df.hw2 > 5

Name
Demetria     False
Dorian        True
Garland      False
Iluminada    False
Jeannine      True
Jenny        False
John          True
Luci          True
Mercy         True
Michael       True
Shelby        True
Name: hw2, dtype: bool

Apply the filter and retrieve only hw1

In [38]:
df.loc[df.hw2 > 5,'hw1']

Name
Dorian      10.0
Jeannine     6.0
John         NaN
Luci         7.0
Mercy        5.0
Michael      6.0
Shelby       1.0
Name: hw1, dtype: float64

Compute the mean of the hw1 grade obtained by the students who got >5 in hw2

In [39]:
df.loc[df.hw2 > 5,'hw1'].mean()

5.833333333333333
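Note that John's missing hw1 grade does not affect the result: by default, mean skips NaN values. A small sketch with three of the grades above:

```python
import pandas as pd

hw1 = pd.Series([10.0, 6.0, float("nan")], index=["Dorian", "Jeannine", "John"])

print(hw1.mean())               # 8.0, the average of 10 and 6
print(hw1.mean(skipna=False))   # nan, because one value is missing
print(hw1.count())              # 2 non-missing values
```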

## sort_values()

Sort the rows of the table by the values of one or more columns (parameter <b>by</b>). The result is a new DataFrame; df itself is not modified.

Sorting by one column

In [40]:
df.sort_values(by='hw1',ascending=False)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Jenny,8.0,,
Luci,7.0,7.0,MSIS
Jeannine,6.0,7.0,MSIS
Michael,6.0,10.0,MBA
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Iluminada,2.0,,MBA
Shelby,1.0,10.0,MSIS


Sorting by more columns. For example, by hw1 descending and, in case of ties, by hw2 ascending

In [41]:
df.sort_values(by=['hw1', 'hw2'], ascending=[False, True])

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Jenny,8.0,,
Luci,7.0,7.0,MSIS
Jeannine,6.0,7.0,MSIS
Michael,6.0,10.0,MBA
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Iluminada,2.0,,MBA
Shelby,1.0,10.0,MSIS
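Rows with a missing value in the sort column are placed at the end by default, whether the order is ascending or descending. The parameter `na_position` changes this, as the following sketch on three students suggests:

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, float("nan"), 7.0]},
    index=["Demetria", "John", "Luci"],
)

default_order = small.sort_values(by="hw1", ascending=False)
nan_first = small.sort_values(by="hw1", ascending=False, na_position="first")

print(default_order.index.tolist())   # ['Luci', 'Demetria', 'John']
print(nan_first.index.tolist())       # ['John', 'Luci', 'Demetria']
```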


## sort_index

In [42]:
df.sort_index()

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Michael,6.0,10.0,MBA
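Since df is already ordered by name, the output above looks unchanged. A sketch on a deliberately unordered three-row table makes the effect visible, including the descending variant:

```python
import pandas as pd

small = pd.DataFrame({"hw1": [7.0, 2.0, 9.0]}, index=["Luci", "Demetria", "Garland"])

ascending = small.sort_index()
descending = small.sort_index(ascending=False)

print(ascending.index.tolist())    # ['Demetria', 'Garland', 'Luci']
print(descending.index.tolist())   # ['Luci', 'Garland', 'Demetria']
print(small.index.tolist())        # ['Luci', 'Demetria', 'Garland'] (original unchanged)
```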


## head and tail

head(n) returns the first n rows; tail(n) returns the last n rows

In [43]:
df.head(4)

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA


In [44]:
df.tail(4)

Name,hw1,hw2,program
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Michael,6.0,10.0,MBA
Shelby,1.0,10.0,MSIS


## Problems

#### Sort the MSIS students by hw2 descending.

Make the filter: select rows where program == 'MSIS'

In [45]:
df.program == 'MSIS'

Name
Demetria      True
Dorian        True
Garland       True
Iluminada    False
Jeannine      True
Jenny        False
John          True
Luci          True
Mercy         True
Michael      False
Shelby        True
Name: program, dtype: bool

Apply filter

In [46]:
df.loc[df.program == 'MSIS',:]

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Jeannine,6.0,7.0,MSIS
John,,10.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Shelby,1.0,10.0,MSIS


Sort resulting rows by hw2 descending

In [47]:
df.loc[df.program == 'MSIS',:].sort_values(by='hw2',ascending=False)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
John,,10.0,MSIS
Shelby,1.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Garland,9.0,1.0,MSIS


#### Show <b>only</b> the field <i>hw1</i> of the four students with the largest hw2 grade (do not use nlargest on the DataFrame: it has bugs)

Find the four students with the largest hw2 grade. Step 1: sort everyone by hw2 descending

In [48]:
df.sort_values(by='hw2',ascending=False)

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
John,,10.0,MSIS
Michael,6.0,10.0,MBA
Shelby,1.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Demetria,2.0,4.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA


Step 2: retain only the top 4 rows and the column 'hw1'

In [49]:
df.sort_values(by='hw2',ascending=False).iloc[:4,:].loc[:,'hw1']

Name
Dorian     10.0
John        NaN
Michael     6.0
Shelby      1.0
Name: hw1, dtype: float64
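The last two selections can be written more compactly by picking the column first and then calling head. This is one possible equivalent, sketched on five students whose hw2 grades are all different so that the order is unambiguous:

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 10.0, 9.0, 6.0, 5.0], "hw2": [4.0, 10.0, 1.0, 7.0, 6.0]},
    index=["Demetria", "Dorian", "Garland", "Jeannine", "Mercy"],
)

# Sort by hw2 descending, keep the hw1 column, then take the top 2 rows.
top_hw1 = small.sort_values(by="hw2", ascending=False)["hw1"].head(2)

print(top_hw1.index.tolist())   # ['Dorian', 'Jeannine']
print(top_hw1.tolist())         # [10.0, 6.0]
```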

## mean, min, max, etc

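Aggregate functions are applied column by column (axis=0, the default) or row by row (axis=1). Numeric aggregators use only the numeric columns; in recent pandas versions (2.0 and later) this has to be requested explicitly with numeric_only=True, otherwise a text column such as program raises an error.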

The average for each hw

In [50]:
df.mean(numeric_only=True)

hw1    5.600000
hw2    7.222222
dtype: float64

The average for each student

In [51]:
df.mean(axis=1, numeric_only=True)

Name
Demetria      3.0
Dorian       10.0
Garland       5.0
Iluminada     2.0
Jeannine      6.5
Jenny         8.0
John         10.0
Luci          7.0
Mercy         5.5
Michael       8.0
Shelby        5.5
dtype: float64
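The other aggregators follow the same pattern. As a sketch, min and max are applied below to the two grade columns of three students from the data above; like mean, they skip missing values by default.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 9.0, 2.0], "hw2": [4.0, 1.0, float("nan")]},
    index=["Demetria", "Garland", "Iluminada"],
)
grades = small[["hw1", "hw2"]]

lowest_per_hw = grades.min()              # one value per column
best_per_student = grades.max(axis=1)     # one value per row

print(lowest_per_hw.tolist())      # [2.0, 1.0]
print(best_per_student.tolist())   # [4.0, 9.0, 2.0]
```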

## Problems

#### Compute the spread (i.e., highest minus lowest hw grade) of each student

With only two homeworks, the spread is the absolute value of hw1 - hw2. First, compute the difference hw1 - hw2 for each student

In [52]:
df.hw1 - df.hw2

Name
Demetria    -2.0
Dorian       0.0
Garland      8.0
Iluminada    NaN
Jeannine    -1.0
Jenny        NaN
John         NaN
Luci         0.0
Mercy       -1.0
Michael     -4.0
Shelby      -9.0
dtype: float64

Then, take the absolute value

In [53]:
(df.hw1 - df.hw2).abs()

Name
Demetria     2.0
Dorian       0.0
Garland      8.0
Iluminada    NaN
Jeannine     1.0
Jenny        NaN
John         NaN
Luci         0.0
Mercy        1.0
Michael      4.0
Shelby       9.0
dtype: float64

#### Who has the largest spread?

Find the largest of the series obtained in the previous step

In [55]:
((df.hw1 - df.hw2).abs()).nlargest(1)

Name
Shelby    9.0
dtype: float64
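A variant that would still work with more than two homeworks is to subtract the row-wise minimum from the row-wise maximum. The sketch below does this on four students from the data above. The two approaches differ for a student with a missing grade: max and min skip NaN, so Jenny gets a spread of 0 here, whereas the subtraction gives NaN.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 9.0, 1.0, 8.0], "hw2": [4.0, 1.0, 10.0, float("nan")]},
    index=["Demetria", "Garland", "Shelby", "Jenny"],
)
grades = small[["hw1", "hw2"]]

spread = grades.max(axis=1) - grades.min(axis=1)   # highest minus lowest per row
abs_diff = (small["hw1"] - small["hw2"]).abs()     # the two-column approach

print(spread.tolist())             # [2.0, 8.0, 9.0, 0.0]
print(spread.idxmax())             # Shelby
print(pd.isna(abs_diff["Jenny"]))  # True
```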

## Modifying DataFrames

Make a copy of the data frame

In [58]:
df2 = df.copy()

### Add rows

A new student has joined. His name is Oliver and he is in the MSIS program; his hw1 is missing and his hw2 score is 8.

In [64]:
import numpy as np
df2.loc['Oliver'] = [np.nan,8,'MSIS']
df2

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Michael,6.0,10.0,MBA


A new student has joined. Her name is Caroline and she got 4 in hw2. She is not in any program yet.

In [67]:
df2.loc['Caroline','hw2'] = 4
df2

Name,hw1,hw2,program
Demetria,2.0,4.0,MSIS
Dorian,10.0,10.0,MSIS
Garland,9.0,1.0,MSIS
Iluminada,2.0,,MBA
Jeannine,6.0,7.0,MSIS
Jenny,8.0,,
John,,10.0,MSIS
Luci,7.0,7.0,MSIS
Mercy,5.0,6.0,MSIS
Michael,6.0,10.0,MBA
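To see the new rows directly, both assignments can be repeated on a two-row table. This is a sketch with rows copied from the data above; assigning to a label that does not exist yet appends a row.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 10.0], "hw2": [4.0, 10.0], "program": ["MSIS", "MSIS"]},
    index=["Demetria", "Dorian"],
)

small.loc["Oliver"] = [np.nan, 8, "MSIS"]   # full row: one value per column
small.loc["Caroline", "hw2"] = 4            # single cell: the other columns become NaN

print(small.index.tolist())                    # ['Demetria', 'Dorian', 'Oliver', 'Caroline']
print(small.loc["Oliver", "hw2"] == 8)         # True
print(small.loc["Caroline"].isna().tolist())   # [True, False, True]
```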


### Add columns

Add an "empty" column <b>hw3</b>

In [79]:
df2 = df.copy()

In [80]:
df2['hw3'] = np.nan
df2

Name,hw1,hw2,program,hw3
Demetria,2.0,4.0,MSIS,
Dorian,10.0,10.0,MSIS,
Garland,9.0,1.0,MSIS,
Iluminada,2.0,,MBA,
Jeannine,6.0,7.0,MSIS,
Jenny,8.0,,,
John,,10.0,MSIS,
Luci,7.0,7.0,MSIS,
Mercy,5.0,6.0,MSIS,
Michael,6.0,10.0,MBA,


### Add calculated columns

In [75]:
df2 = df.copy()

Let's add a column with the final grade. It is computed as 0.2\*hw1 + 0.8\*hw2.

In [77]:
df2['finalGrade'] = 0.2 * df2['hw1'] + 0.8 * df2['hw2']
df2

Name,hw1,hw2,program,finalGrade
Demetria,2.0,4.0,MSIS,3.6
Dorian,10.0,10.0,MSIS,10.0
Garland,9.0,1.0,MSIS,2.6
Iluminada,2.0,,MBA,
Jeannine,6.0,7.0,MSIS,6.8
Jenny,8.0,,,
John,,10.0,MSIS,
Luci,7.0,7.0,MSIS,7.0
Mercy,5.0,6.0,MSIS,5.8
Michael,6.0,10.0,MBA,9.2
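Students with a missing homework get a missing final grade, because any arithmetic with NaN yields NaN. The sketch below shows the same calculation with `assign`, which returns a new DataFrame instead of modifying the existing one. Counting a missing homework as 0 is a hypothetical grading policy, shown only to illustrate `fillna`.

```python
import pandas as pd

small = pd.DataFrame(
    {"hw1": [2.0, 2.0], "hw2": [4.0, float("nan")]},
    index=["Demetria", "Iluminada"],
)

# assign returns a new DataFrame and leaves `small` untouched.
graded = small.assign(finalGrade=0.2 * small["hw1"] + 0.8 * small["hw2"])

# Hypothetical policy: count a missing homework as 0 before weighting.
filled = small.fillna(0)
graded_filled = small.assign(finalGrade=0.2 * filled["hw1"] + 0.8 * filled["hw2"])

print(graded["finalGrade"].isna().tolist())                      # [False, True]
print(round(graded_filled.loc["Iluminada", "finalGrade"], 2))    # 0.4
```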
