# Getting started with <i>pandas</i>

<i>pandas</i> is a Python package that provides data structures for working with relational and labeled data. It is designed to be efficient and intuitive.

The convention is to import pandas as <i>pd</i>.

In [1]:
import pandas as pd

In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The two main classes in pandas are <i>DataFrame</i> and <i>Series</i>. In a nutshell, a DataFrame is a table and a Series is a column. This lecture illustrates the details of the class Series. First, we load the data set in <i>students.csv</i> and store it in a DataFrame called <i>df</i>.

In [3]:
df = pd.read_csv('students.csv', index_col=0)

The method <b>head</b> returns the first 5 rows of the DataFrame by default. This DataFrame has one student per row and three columns: <i>hw1</i> (the grade received on hw1), <i>hw2</i> (the grade received on hw2), and <i>program</i>.

In [4]:
df.head()

            hw1   hw2  program
Name
Dorian     10.0  10.0     MSIS
Jeannine    6.0   7.0     MSIS
Iluminada   2.0   NaN      MBA
Luci        7.0   7.0     MSIS
Jenny       8.0   NaN      NaN


# Series

In this lecture, we will focus mostly on the column <i>hw1</i>. Let's make a Series of hw1 scores.

In [5]:
hw1 = df['hw1']

A Series is a one-dimensional array of data (<b>values</b>) and an associated array of data labels (<b>index</b>). The <b>index</b> is the student name and the <b>value</b> is the score in hw1.

In [6]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64
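To make the two parts of a Series concrete, here is a minimal sketch that builds one by hand instead of reading <i>students.csv</i>. The variable names `names` and `grades` are illustrative and not part of the lecture; only the first three students shown above are typed in.

```python
import pandas as pd

# Hand-typed stand-in for the first three rows of hw1 (an assumption:
# the real lecture data comes from students.csv).
names = pd.Index(['Dorian', 'Jeannine', 'Iluminada'], name='Name')
grades = pd.Series([10.0, 6.0, 2.0], index=names, name='hw1')

print(grades.index.tolist())   # the labels
print(grades.values.tolist())  # the data
print(grades['Jeannine'])      # look up one value by its label
```

The index and the values have the same length, and each label points to exactly one value.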

The length of hw1, which counts every row, including the one with a missing grade:

In [7]:
len(hw1)

11

## index and values

The attribute <b>index</b> returns the labels as an Index object, and <b>values</b> returns the data as a NumPy ndarray.

In [8]:
hw1.index

Index([u'Dorian', u'Jeannine', u'Iluminada', u'Luci', u'Jenny', u'Demetria',
       u'Michael', u'Garland', u'Shelby', u'Mercy', u'John'],
      dtype='object', name=u'Name')

In [9]:
hw1.values

array([ 10.,   6.,   2.,   7.,   8.,   2.,   6.,   9.,   1.,   5.,  nan])

You can retrieve elements of an <i>ndarray</i> by position, as you would with a regular Python list.

In [10]:
hw1.values[3]

7.0

## describe

The method <b>describe</b> reports summary statistics of the Series values.

In [11]:
hw1.describe()

count    10.000000
mean      5.600000
std       3.098387
min       1.000000
25%       2.750000
50%       6.000000
75%       7.750000
max      10.000000
Name: hw1, dtype: float64
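The count reported by <b>describe</b> is 10 even though the Series has 11 rows, because missing values are left out of the statistics. The sketch below reproduces that difference; the grades are retyped from the output above (an assumption, since the file itself is not loaded here) and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# hw1 grades retyped from the lecture output; John's grade is missing.
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan], name='hw1')

n_rows = len(hw1)             # every row, missing or not
n_valid = hw1.count()         # only the non-missing grades
n_missing = hw1.isna().sum()  # how many grades are missing

print(n_rows, n_valid, n_missing)
```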

## Aggregate functions (max, min, mean, ...)

We can call several aggregate functions on a Series object, which summarize its values.

The average grade among all students

In [12]:
hw1.mean()

5.5999999999999996

The median grade 

In [13]:
hw1.median()

6.0

The minimum and maximum grade among all students

In [26]:
hw1.min()

1.0

In [27]:
hw1.max()

10.0

The sum of all grades

In [28]:
hw1.sum()

56.0
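All the aggregates above silently skip John's missing grade. As a hedged illustration, the sketch below shows the `skipna` parameter, which controls this behavior; the grades are retyped from the lecture output and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# hw1 grades retyped from the lecture output; the last one is missing.
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan], name='hw1')

default_mean = hw1.mean()             # missing values are skipped
strict_mean = hw1.mean(skipna=False)  # a single NaN makes the result NaN
strict_sum = hw1.sum(skipna=False)

print(default_mean, strict_mean, strict_sum)
```

Passing `skipna=False` is a simple way to notice that a column has gaps before trusting its summary statistics.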

## <i>s.iloc[...]</i>: position-based selection 

Selects rows by position, using an integer or a slice. It works like indexing a Python list, with one difference: a slice returns a Series that keeps the index labels of the selected rows.

#### Using one index value

Access the fourth value (position 3, since positions start at 0). It returns a single value.

In [29]:
hw1.iloc[3]

7.0

#### Using slices

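Retrieve all elements from the 3rd (included) to the 7th (excluded), that is, positions 2 through 5. It returns a Series. <b>Caution!</b> Depending on the pandas version, the result can be a view of the original data rather than a copy, so call <i>copy()</i> before modifying it.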

In [30]:
hw1.iloc[2:6]

Name
Iluminada    2.0
Luci         7.0
Jenny        8.0
Demetria     2.0
Name: hw1, dtype: float64
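Because a slice may share data with the original Series, a safe habit is to take an explicit copy before changing anything. The sketch below illustrates this with the first six grades retyped from the lecture output; `subset` is an illustrative name.

```python
import pandas as pd

# First six hw1 grades, retyped from the lecture output.
hw1 = pd.Series(
    [10.0, 6.0, 2.0, 7.0, 8.0, 2.0],
    index=['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria'],
    name='hw1',
)

subset = hw1.iloc[2:5].copy()  # explicit copy, independent of hw1
subset.iloc[0] = 0.0           # change Iluminada's grade in the copy only

print(subset['Iluminada'])     # the modified copy
print(hw1['Iluminada'])        # the original grade is untouched
```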

## <i>s[...]</i>: index-based selection 

Selects rows using the index labels (a label value, a slice of label values, or a Boolean selection). It is like accessing a dictionary, with one big difference: we can also access the values using <b>slices</b> and <b>Boolean selection</b>.

#### Using a label value

Find Luci's hw1 grade.

In [31]:
hw1['Luci']

7.0

#### Using a slice of label values (rarely used)

Find the grades from Luci's to Michael's. Unlike a positional slice, a label slice includes both endpoints.

In [32]:
hw1['Luci':'Michael']

Name
Luci        7.0
Jenny       8.0
Demetria    2.0
Michael     6.0
Name: hw1, dtype: float64
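The two kinds of slices treat their end point differently, which is easy to get wrong. The sketch below contrasts them on the first seven grades from the lecture output. It uses `.loc`, the explicit label-based accessor, which this lecture does not cover; the variable names are illustrative.

```python
import pandas as pd

# First seven hw1 grades, retyped from the lecture output.
hw1 = pd.Series(
    [10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0],
    index=['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
           'Demetria', 'Michael'],
    name='hw1',
)

by_position = hw1.iloc[3:6]           # positions 3, 4, 5: the end is excluded
by_label = hw1.loc['Luci':'Michael']  # the end label is included

print(list(by_position.index))
print(list(by_label.index))
```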

#### Using Boolean selection

We can pass a Boolean mask (a list or a Series) to indicate which grades we want to retrieve.

<b>With a list</b>: 

In [33]:
v = [True, False, True, True, False, True, False, True, True, False, False]

In [34]:
hw1[v]

Name
Dorian       10.0
Iluminada     2.0
Luci          7.0
Demetria      2.0
Garland       9.0
Shelby        1.0
Name: hw1, dtype: float64

## Boolean selection

The comparison operators >, <, >=, <=, ==, != can be used to create a Series of Booleans that identifies the elements whose value satisfies a certain condition.

<b>Problem</b>: Find the students whose grade is greater than or equal to 6

First, create a boolean Series

In [39]:
hw1 >= 6

Name
Dorian        True
Jeannine      True
Iluminada    False
Luci          True
Jenny         True
Demetria     False
Michael       True
Garland       True
Shelby       False
Mercy        False
John         False
Name: hw1, dtype: bool

Second, select only those students who have a "True" in the boolean Series above

In [40]:
hw1[hw1 >= 6]

Name
Dorian      10.0
Jeannine     6.0
Luci         7.0
Jenny        8.0
Michael      6.0
Garland      9.0
Name: hw1, dtype: float64

## problems

What is Michael's hw1 score?

In [42]:
hw1['Michael']

6.0

Select the "last" student of the Series (i.e., the one reported last). Make sure to retrieve both the name and the grade.

In [43]:
hw1.iloc[len(hw1)-1:]

Name
John   NaN
Name: hw1, dtype: float64

In [44]:
hw1.iloc[-1:]

Name
John   NaN
Name: hw1, dtype: float64

Compute the average hw1 grade among those students whose grade is less than or equal to 6


In [47]:
hw1[hw1 <= 6].mean()

3.6666666666666665

(together) Select those students whose hw1 score is less than 5 or greater than 9


In [41]:
hw1[(hw1 < 5) | (hw1 > 9)]

Name
Dorian       10.0
Iluminada     2.0
Demetria      2.0
Shelby        1.0
Name: hw1, dtype: float64
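Conditions are combined with `|` (or), `&` (and), and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch, with the grades retyped from the lecture output and illustrative variable names:

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

low_or_high = hw1[(hw1 < 5) | (hw1 > 9)]  # either condition holds
middle = hw1[(hw1 >= 5) & (hw1 <= 9)]     # both conditions hold
graded = hw1[~hw1.isna()]                 # negation: drop missing grades

print(list(low_or_high.index))
print(list(middle.index))
print(len(graded))
```

John never appears in `low_or_high` or `middle`: any comparison with a missing value evaluates to False.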

## More Series methods

### rank

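Ranks each row based on its value: the smallest value gets rank 1, tied values share the average of their ranks, and missing values stay NaN.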


In [48]:
hw1.rank()

Name
Dorian       10.0
Jeannine      5.5
Iluminada     2.5
Luci          7.0
Jenny         8.0
Demetria      2.5
Michael       5.5
Garland       9.0
Shelby        1.0
Mercy         4.0
John          NaN
Name: hw1, dtype: float64

### idxmax and idxmin

Return the index label of the row with the maximum and the minimum value, respectively.


In [49]:
hw1.idxmax()

'Dorian'

In [50]:
hw1.idxmin()

'Shelby'
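When several rows share the extreme value, <b>idxmax</b> and <b>idxmin</b> report only the first one. The sketch below uses a small made-up Series (not the lecture data) to show this, together with a Boolean selection that lists every tied row.

```python
import numpy as np
import pandas as pd

# Hypothetical grades with a tie for the top score and one missing value.
s = pd.Series([9.0, 4.0, 9.0, np.nan], index=['Ann', 'Bob', 'Cal', 'Dee'])

first_best = s.idxmax()     # label of the first occurrence of the maximum
all_best = s[s == s.max()]  # every row that reaches the maximum
worst = s.idxmin()          # missing values are ignored

print(first_best, list(all_best.index), worst)
```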

### sort_values

Sort by values


In [51]:
hw1.sort_values()

Name
Shelby        1.0
Iluminada     2.0
Demetria      2.0
Mercy         5.0
Jeannine      6.0
Michael       6.0
Luci          7.0
Jenny         8.0
Garland       9.0
Dorian       10.0
John          NaN
Name: hw1, dtype: float64
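By default the sort is ascending and missing values go to the end. As a sketch of the two most common variations, the example below uses the `ascending` and `na_position` parameters on the grades retyped from the lecture output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

best_first = hw1.sort_values(ascending=False)       # highest grade on top
missing_first = hw1.sort_values(na_position='first')  # NaN rows on top

print(best_first.index[0], best_first.index[-1])
print(missing_first.index[0])
```

Note that the missing grade stays at the end of `best_first` as well: `ascending` changes the order of the values only, not the position of NaN.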

### sort_index

Sort by index

In [52]:
hw1.sort_index()

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

### nlargest and nsmallest

Finds the n items with largest or smallest value


In [53]:
hw1.nlargest(3)

Name
Dorian     10.0
Garland     9.0
Jenny       8.0
Name: hw1, dtype: float64

In [54]:
hw1.nsmallest(3)

Name
Shelby       1.0
Iluminada    2.0
Demetria     2.0
Name: hw1, dtype: float64
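Ties at the cutoff are broken by order of appearance, so asking for the 2 smallest grades would keep Iluminada and drop Demetria even though both scored 2. The sketch below shows the `keep` parameter, which can keep all tied rows instead; the grades are retyped from the lecture output and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

two_lowest = hw1.nsmallest(2)               # ties cut off by appearance
with_ties = hw1.nsmallest(2, keep='all')    # may return more than 2 rows

print(list(two_lowest.index))
print(list(with_ties.index))
```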

### head and tail

Returns the first (or last) rows according to the positional index


In [55]:
hw1.head(3)

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Name: hw1, dtype: float64

In [56]:
hw1.tail(3)

Name
Shelby    1.0
Mercy     5.0
John      NaN
Name: hw1, dtype: float64

## problems

Explore the parameters of the method "rank" to solve this question. Find the rank of each student (1=best, 10=worst) and deal with ties in the way that makes most sense to you.

In [None]:
hw1.rank(ascending=False,method='min')
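To see what the tie-handling options do, the sketch below puts three values of the `method` parameter side by side. It is one possible way to compare them, not the only answer to the question; the grades are retyped from the lecture output and the name `comparison` is illustrative.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

# ascending=False gives rank 1 to the highest grade.
comparison = pd.DataFrame({
    'grade': hw1,
    'average': hw1.rank(ascending=False),               # ties share the mean rank
    'min': hw1.rank(ascending=False, method='min'),     # ties share the best rank
    'dense': hw1.rank(ascending=False, method='dense'), # like 'min', but no gaps
})
print(comparison)
```

With `method='min'` two students tied for 5th place both get rank 5 and the next student gets rank 7, which matches how sports rankings are usually reported.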

Who got the 4th highest grade?

In [None]:
hw1.sort_values(ascending=False).iloc[3:4]

In [None]:
hw1.nlargest(4).nsmallest(1)

Retrieve the row of the person who comes last in alphabetical order.

In [None]:
hw1.sort_index()[-1:]

In [None]:
hw1.sort_index().iloc[-1:]

Among those whose name starts with ‘J’, who got the highest grade?

In [None]:
hw1[(hw1.index >= 'J') & (hw1.index < 'K')].nlargest(1)
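The comparison above works because strings are compared alphabetically. An alternative that some readers find easier to read is the string accessor of the index, sketched below; `.str.startswith` is not covered in this lecture, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
         'Michael', 'Garland', 'Shelby', 'Mercy', 'John']
hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan],
                index=pd.Index(names, name='Name'), name='hw1')

starts_with_j = hw1.index.str.startswith('J')  # Boolean mask, one entry per row
top_j = hw1[starts_with_j].nlargest(1)

print(list(hw1[starts_with_j].index))
print(top_j)
```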

## Operations on one Series

### Operations between a scalar and a Series

Operations between a Series and a scalar are performed element-wise on the values.

<b>Example</b>: It's Christmas time! As a gift, we want to increase everyone's grade by 5. What will the new grades be?

In [27]:
hw1 + 5

Name
Dorian       15.0
Jeannine     11.0
Iluminada     7.0
Luci         12.0
Jenny        13.0
Demetria      7.0
Michael      11.0
Garland      14.0
Shelby        6.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64

What if we wanted to multiply each grade by 2?

In [14]:
hw1 * 2

Name
Dorian       20.0
Jeannine     12.0
Iluminada     4.0
Luci         14.0
Jenny        16.0
Demetria      4.0
Michael      12.0
Garland      18.0
Shelby        2.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64

### abs

Returns the absolute value of all values

In [16]:
hw1.abs() # not very helpful for hw scores, as they are all already positive

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

### astype

Sometimes it is useful to convert a Series to another type, for instance a numeric Series into a Series of strings (str) or a Series of text into dates (np.datetime64). Here is how to convert a Series of floats to a Series of strings. The built-in str is used because the alias np.str is no longer available in recent NumPy versions.

In [24]:
import numpy as np

In [25]:
hw1.astype(str)

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          nan
Name: hw1, dtype: object

In [26]:
hw1.astype(str) + 'aaa'

Name
Dorian       10.0aaa
Jeannine      6.0aaa
Iluminada     2.0aaa
Luci          7.0aaa
Jenny         8.0aaa
Demetria      2.0aaa
Michael       6.0aaa
Garland       9.0aaa
Shelby        1.0aaa
Mercy         5.0aaa
John          nanaaa
Name: hw1, dtype: object
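The opposite direction, from text to numbers or dates, is just as common when data arrives as strings. The sketch below uses made-up values (not the lecture data) and `pd.to_datetime`, a pandas function for parsing dates that is used here in place of an `astype` call.

```python
import pandas as pd

# Hypothetical text data, as it might arrive from a file.
grades_text = pd.Series(['10.0', '6.0', '2.0'])
dates_text = pd.Series(['2021-01-15', '2021-02-01'])

grades = grades_text.astype(float)  # text -> numbers
dates = pd.to_datetime(dates_text)  # text -> dates

print(grades.sum())                    # arithmetic now works
print((dates[1] - dates[0]).days)      # so do date differences
```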

## Operations between two Series

Operations between two Series are performed element-wise on those elements with the same index label.

Let's create a Series of the hw2 grades. Remember that we have a dataframe object, <i>df</i>

In [17]:
hw2 = df['hw2']

As a reminder, here is hw1. Each of its grades is combined with the hw2 grade that has the same index label.

In [28]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

<b>Example</b>: Compute everyone's average grade across hw1 and hw2. The result is NaN whenever either grade is missing.

In [29]:
(hw1 + hw2) / 2

Name
Dorian       10.0
Jeannine      6.5
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      3.0
Michael       8.0
Garland       5.0
Shelby        5.5
Mercy         5.5
John          NaN
dtype: float64
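Matching happens by label, not by position, so the two Series do not need to be in the same order or even contain the same labels. The sketch below uses two small made-up Series to show this, along with the `add` method and its `fill_value` parameter, which treats a label missing on one side as a given value instead of producing NaN.

```python
import pandas as pd

# Hypothetical Series with different order and partly different labels.
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['z', 'y', 'w'])

total = a + b                      # labels present on one side only give NaN
filled = a.add(b, fill_value=0)    # a missing side counts as 0 instead

print(total)
print(filled)
```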

## problems

<p>The average grade of hw1 is too low. We want to normalize it to 8. To this end, do the following <b>in one single command</b>:
<ol>
<li>decrease everyone's grade by the average grade (this will set the new average to 0)</li>
<li>increase everyone's grade by 8</li>
</ol>
</p>
<p>Note that some students’ grades might become greater than 10 – don’t worry about it.</p>

In [22]:
newHW1 = hw1 + 8 - hw1.mean()
newHW1

Name
Dorian       12.4
Jeannine      8.4
Iluminada     4.4
Luci          9.4
Jenny        10.4
Demetria      4.4
Michael       8.4
Garland      11.4
Shelby        3.4
Mercy         7.4
John          NaN
Name: hw1, dtype: float64
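A quick way to check the solution is to confirm that the new average really is 8 and that the missing grade is still missing. The sketch below does this on the grades retyped from the lecture output; the name `new_hw1` is illustrative.

```python
import numpy as np
import pandas as pd

hw1 = pd.Series([10, 6, 2, 7, 8, 2, 6, 9, 1, 5, np.nan], name='hw1')

# Subtracting the mean centers the grades at 0; adding 8 moves the center to 8.
new_hw1 = hw1 - hw1.mean() + 8

print(round(new_hw1.mean(), 10))  # the new average
print(new_hw1.isna().sum())       # the missing grade is still missing
```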

Compute the average grade between hw1 and hw2 of each student. Which student has the average closest to 6.7?


In [30]:
((hw1 + hw2) / 2 - 6.7).abs().nsmallest(1)

Name
Jeannine    0.2
dtype: float64