# Getting started with <i>pandas</i>

"pandas" is a Python package providing data structures to work on relational and labeled data. It is designed to be efficient and intuitive.

The convention is to import pandas as <i>pd</i>.

In [1]:
import pandas as pd

In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
%autosave 0

Autosave disabled


The two main classes in pandas are <i>DataFrame</i> and <i>Series</i>. In a nutshell, a DataFrame is a table and a Series is a column. This lecture illustrates the details of the class Series. First, we load the data set in <i>students.csv</i> and store it in a DataFrame called <i>df</i>.

In [4]:
df = pd.read_csv('students.csv')

In [5]:
df

,Name,hw1,hw2,program
0,Dorian,10.0,10.0,MSIS
1,Jeannine,6.0,7.0,MSIS
2,Iluminada,2.0,,MBA
3,Luci,7.0,7.0,MSIS
4,Jenny,8.0,,
5,Demetria,2.0,4.0,MSIS
6,Michael,6.0,10.0,MBA
7,Garland,9.0,1.0,MSIS
8,Shelby,1.0,10.0,MSIS
9,Mercy,5.0,6.0,MSIS


Let's use the students' names (column **Name**, or column position **0**) as the index for easy identification.

In [6]:
df = pd.read_csv('students.csv', index_col="Name")

In [7]:
df

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS


In [8]:
df = pd.read_csv('students.csv', index_col=0)

In [9]:
df

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS
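
As a rough, self-contained sketch of the two calls above: the snippet below reads a small in-memory stand-in for <i>students.csv</i> and checks that <i>index_col="Name"</i> and <i>index_col=0</i> give the same result. The three rows and the names <i>csv_text</i>, <i>by_name</i>, and <i>by_position</i> are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for students.csv (only three rows, for illustration).
csv_text = """Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
"""

# Select the index column by name, then by position.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Name")
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Both calls should produce the same DataFrame, indexed by student name.
print(by_name.equals(by_position))
print(by_name.index.name)
```

Selecting by name is usually easier to read, while selecting by position still works if the column is later renamed.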


In [10]:
type(df)

pandas.core.frame.DataFrame

The method <b>head</b> returns the first 5 rows of the DataFrame by default. This DataFrame has one student per row and three columns: <i>hw1</i> (the grade received on hw1), <i>hw2</i> (the grade received on hw2), and <i>program</i>.

In [11]:
df.head()

Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,


# Series

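In this lecture, we will focus mostly on the column <i>hw1</i>. Let's make a Series of hw1 scores.

You can access a column of <i>df</i> using either **[' ']** or **.** notation. The **.** notation works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute.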

In [12]:
hw1 = df['hw1']

In [13]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [14]:
hw1 = df.hw1

In [15]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [16]:
type(hw1)

pandas.core.series.Series

A Series is a one-dimensional array of data (<b>values</b>) and an associated array of data labels (<b>index</b>). The <b>index</b> is the student name and the <b>value</b> is the score in hw1.

In [17]:
type(hw1.values)

numpy.ndarray
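
To make the values-plus-index picture concrete, here is a minimal sketch that builds a Series by hand instead of reading it from <i>students.csv</i>. The three grades and the variable name <i>grades</i> are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical grades, typed in by hand instead of read from students.csv.
grades = pd.Series(
    [10.0, 6.0, 2.0],
    index=pd.Index(["Dorian", "Jeannine", "Iluminada"], name="Name"),
    name="hw1",
)

print(grades.index)    # the labels
print(grades.values)   # the data, as a NumPy array
print(len(grades))     # number of entries
```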

The length of hw1

In [18]:
len(hw1)

11

## index and values

The attribute <b>index</b> returns the labels as an Index object, and <b>values</b> returns the data as an ndarray.

In [19]:
hw1.index

Index(['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny', 'Demetria',
       'Michael', 'Garland', 'Shelby', 'Mercy', 'John'],
      dtype='object', name='Name')

In [20]:
hw1.values

array([10.,  6.,  2.,  7.,  8.,  2.,  6.,  9.,  1.,  5., nan])

You can retrieve elements of an <i>ndarray</i> by position, as you would with a regular Python list.

In [21]:
hw1.values[2]

2.0

## describe

The method <b>describe</b> reports summary statistics of the Series values.

In [22]:
hw1.describe()

count    10.000000
mean      5.600000
std       3.098387
min       1.000000
25%       2.750000
50%       6.000000
75%       7.750000
max      10.000000
Name: hw1, dtype: float64

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b>: To bring up the online help for a particular function, type the function name, then press <b>shift-tab</b>. Pressing it once brings up a small window. Pressing it twice expands the window into a bigger one.
</div>

In [23]:
hw1.describe()

count    10.000000
mean      5.600000
std       3.098387
min       1.000000
25%       2.750000
50%       6.000000
75%       7.750000
max      10.000000
Name: hw1, dtype: float64

## Aggregate functions (max, min, mean, ...)

An aggregate function performs a calculation on a set of values, and returns a single value.

We can call several aggregate functions on a Series object, which summarize its values.

The average grade among all students

In [24]:
hw1.mean()

5.6

The median grade 

In [25]:
hw1.median()

6.0

The minimum and maximum grade among all students

In [26]:
hw1.min()

1.0

In [27]:
hw1.max()

10.0

The sum of all grades

In [28]:
hw1.sum()

56.0
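
Note that <i>len(hw1)</i> is 11 while <i>describe</i> reports a count of 10: the aggregate functions skip missing values (NaN) by default. The sketch below illustrates this on a tiny made-up Series (the name <i>s</i> and its values are for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing grade.
s = pd.Series([10.0, 6.0, np.nan], index=["Dorian", "Jeannine", "John"])

print(len(s))                # counts every row, including the NaN
print(s.count())             # counts only the non-missing values
print(s.mean())              # average of the non-missing values
print(s.sum(skipna=False))   # NaN, because the missing value is not skipped
```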

<div class="alert alert-block alert-info"> 
<b>Tech Note</b>: To see which functions and data attributes are available for an object (in this case <i>hw1</i>, a <i>Series</i>), type <i>hw1.</i> and then press <i>tab</i>.
</div>

hw1.

<div class="alert alert-block alert-info"> 
<b>Tech Note</b>: 
    You can also use <i>tab</i> for <b>auto-completion</b>. Type the beginning of a function name, then press <i>tab</i>. It will complete the function name or bring up a pop-up window listing the matching choices.
</div>

hw1.me

## <i>s.iloc[...]</i>: position-based selection 

Selects rows by position (using an integer or a slice). It works like accessing the elements of a list, including <b>slices</b>, except that a slice returns a Series that keeps the index labels.

#### Using one index value

Access the fourth value (position 3). It returns a single value.

In [29]:
hw1.iloc[3]

7.0

In [30]:
hw1.values[3]

7.0

In [31]:
hw1[3]

7.0

<b>Caution!</b> <i>hw1[3]</i> works here only because the index holds strings, so pandas falls back to a positional lookup. Recent pandas versions deprecate this fallback, so prefer <i>hw1.iloc[3]</i> for position-based access.

#### Using slices

Retrieve all elements from the 3rd (included) to the 7th (excluded), that is, positions 2 through 5. It returns a Series. <b>Caution!</b> It returns a view, not a copy.

In [32]:
hw1.iloc[2:6]

Name
Iluminada    2.0
Luci         7.0
Jenny        8.0
Demetria     2.0
Name: hw1, dtype: float64

In [33]:
type(hw1.iloc[2:6])

pandas.core.series.Series

## <i>s[...]</i>: index-based selection 

Selects rows using the index (using a label value, a slice of label values, or a Boolean selection). It is like accessing a dictionary of elements, with one big difference: we can access the values using <b>slices</b> and <b>boolean selection</b>.

#### Using a label value

Find Luci's hw1 grade.

In [34]:
hw1['Luci']

7.0

#### Using a slice of label values (rarely used)

Find the grades from Luci's to Michael's. Unlike a positional slice, a label slice includes both endpoints.

In [35]:
hw1['Luci':'Michael']

Name
Luci        7.0
Jenny       8.0
Demetria    2.0
Michael     6.0
Name: hw1, dtype: float64

In [36]:
hw1[3:7]

Name
Luci        7.0
Jenny       8.0
Demetria    2.0
Michael     6.0
Name: hw1, dtype: float64
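
The two slices above give the same rows, but they follow different rules. As a small sketch (with a made-up Series <i>s</i>), the explicit accessors <i>.loc</i> (labels) and <i>.iloc</i> (positions) make the difference visible: a label slice includes its right endpoint, a positional slice does not.

```python
import pandas as pd

# Hypothetical Series with student names as labels.
s = pd.Series(
    [10.0, 6.0, 2.0, 7.0, 8.0],
    index=["Dorian", "Jeannine", "Iluminada", "Luci", "Jenny"],
)

by_label = s.loc["Jeannine":"Luci"]   # label slice: Luci is included
by_position = s.iloc[1:3]             # positional slice: stops before position 3 (Luci)

print(list(by_label.index))
print(list(by_position.index))
```

Using <i>.loc</i> and <i>.iloc</i> explicitly avoids any doubt about whether a key is read as a label or as a position.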

#### Using Boolean selection

We can pass a Boolean mask (a list or a Series) to indicate which grades we want to retrieve.

<b>With a list</b>: 

In [37]:
v = [True, False, True, True, False, True, False, True, True, False, False]

In [38]:
type(v)

list

In [39]:
hw11 = hw1[v]
hw11

Name
Dorian       10.0
Iluminada     2.0
Luci          7.0
Demetria      2.0
Garland       9.0
Shelby        1.0
Name: hw1, dtype: float64

## Boolean selection

The comparison operators >, <, >=, <=, ==, != can be used to create a Series of booleans that identifies the elements whose value satisfies a certain condition.

<b>Problem</b>: Find the students whose grade is greater than or equal to 6

First, create a boolean Series

In [40]:
hw1 >= 6

Name
Dorian        True
Jeannine      True
Iluminada    False
Luci          True
Jenny         True
Demetria     False
Michael       True
Garland       True
Shelby       False
Mercy        False
John         False
Name: hw1, dtype: bool

Second, select only those students who have a "True" in the boolean Series above

In [41]:
hw1[hw1 >= 6]

Name
Dorian      10.0
Jeannine     6.0
Luci         7.0
Jenny        8.0
Michael      6.0
Garland      9.0
Name: hw1, dtype: float64
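
Notice that John, whose grade is missing, gets <i>False</i> in the boolean Series. Any comparison with NaN evaluates to False, so a missing grade is neither "at least 6" nor "below 6". A minimal sketch, using a made-up Series <i>s</i>, of how to find the missing entries with <i>isna</i>:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing grade.
s = pd.Series([10.0, 2.0, np.nan], index=["Dorian", "Iluminada", "John"])

passed = s >= 6      # John is False: NaN >= 6 is False
failed = s < 6       # John is False here too
missing = s.isna()   # True only for the missing grade

print(s[missing])
```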

## problems - in class exercise

What is Michael's hw1 score?

In [42]:
hw1['Michael']

6.0

In [43]:
hw1[['Michael']]

Name
Michael    6.0
Name: hw1, dtype: float64

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b> : Python uses the <b>[ ] </b>operator for both indexing and for constructing list literals. The outer [  ] in hw1[ ['Michael'] ] is performing the indexing, and the inner is creating a list.
</div>

Select the "last" student of the Series (i.e., the one reported last) without using <i>tail</i>. Make sure to retrieve both the name and the grade.

In [44]:
hw1.iloc[-1]

nan

In [45]:
hw1.iloc[[-1]]

Name
John   NaN
Name: hw1, dtype: float64

In [46]:
hw1.iloc[-1:]

Name
John   NaN
Name: hw1, dtype: float64

In [47]:
hw1.iloc[len(hw1) - 1:]

Name
John   NaN
Name: hw1, dtype: float64

Compute the average hw1 grade among those students whose grade is less than or equal to 6


In [48]:
hw1[hw1 <= 6].mean()

3.6666666666666665

(together) Select those students whose hw1 score is less than 5 or greater than 9


In [96]:
hw1[(hw1 < 5) | (hw1 > 9)]

Name
Dorian       10.0
Iluminada     2.0
Demetria      2.0
Shelby        1.0
Name: hw1, dtype: float64
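
The parentheses around each condition are required because <i>|</i> and <i>&</i> bind more tightly than the comparison operators in Python. The sketch below (made-up Series <i>s</i>) also shows <i>~</i> to negate a mask and <i>between</i> as one possible shortcut for a closed range; neither is needed for the exercise above.

```python
import pandas as pd

# Hypothetical grades without missing values.
s = pd.Series(
    [10.0, 6.0, 2.0, 7.0, 1.0],
    index=["Dorian", "Jeannine", "Iluminada", "Luci", "Shelby"],
)

extreme = s[(s < 5) | (s > 9)]        # less than 5 OR greater than 9
middle = s[~((s < 5) | (s > 9))]      # negation of the mask above
same_middle = s[s.between(5, 9)]      # 5 <= grade <= 9, both ends included

print(extreme)
print(middle)
```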

## More Series methods

### rank

Ranks each row by its value, **WITHOUT** reordering the Series. The result holds rank numbers, **NOT** the original values. Rank 1.0 marks the row that comes first among all rows. With the option <i>ascending=False</i>, the row with the highest value gets rank 1.0. By default, tied values share the average of their ranks (for example 5.5), and missing values are left unranked (NaN).

In [50]:
hw1.rank(ascending=False)

Name
Dorian        1.0
Jeannine      5.5
Iluminada     8.5
Luci          4.0
Jenny         3.0
Demetria      8.5
Michael       5.5
Garland       2.0
Shelby       10.0
Mercy         7.0
John          NaN
Name: hw1, dtype: float64
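
The values 5.5 and 8.5 above come from ties. As an illustrative sketch on a made-up Series <i>s</i>, here is how three of the options of the <i>method</i> parameter treat two students tied at 6:

```python
import pandas as pd

# Hypothetical grades with one tie (Jeannine and Michael).
s = pd.Series(
    [10.0, 6.0, 6.0, 2.0],
    index=["Dorian", "Jeannine", "Michael", "Iluminada"],
)

average = s.rank(ascending=False)                  # ties share the mean rank: 2.5
lowest = s.rank(ascending=False, method="min")     # ties share the lowest rank: 2
dense = s.rank(ascending=False, method="dense")    # like "min", but no gap after a tie

print(average.tolist())
print(lowest.tolist())
print(dense.tolist())
```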

### idxmax and idxmin

Find the index label of the row with the maximum (or minimum) value.


In [51]:
hw1

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [52]:
hw1.idxmax()

'Dorian'

In [53]:
hw1.idxmin()

'Shelby'

### sort_values

Sort by values


In [54]:
hw1.sort_values(ascending=False)

Name
Dorian       10.0
Garland       9.0
Jenny         8.0
Luci          7.0
Jeannine      6.0
Michael       6.0
Mercy         5.0
Iluminada     2.0
Demetria      2.0
Shelby        1.0
John          NaN
Name: hw1, dtype: float64
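
In the output above, John's missing grade comes last even though the order is descending. A short sketch with a made-up Series <i>s</i> shows that <i>sort_values</i> returns a new Series and that the parameter <i>na_position</i> controls where missing values go:

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one missing value.
s = pd.Series([6.0, np.nan, 10.0], index=["Jeannine", "John", "Dorian"])

ordered = s.sort_values(ascending=False)                         # NaN goes last by default
nan_first = s.sort_values(ascending=False, na_position="first")  # NaN goes first

print(list(ordered.index))
print(list(nan_first.index))
print(list(s.index))   # the original Series keeps its order
```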

### sort_index

Sort by index

In [55]:
hw1.sort_index()

Name
Demetria      2.0
Dorian       10.0
Garland       9.0
Iluminada     2.0
Jeannine      6.0
Jenny         8.0
John          NaN
Luci          7.0
Mercy         5.0
Michael       6.0
Shelby        1.0
Name: hw1, dtype: float64

In [56]:
hw1.sort_index(ascending=False)

Name
Shelby        1.0
Michael       6.0
Mercy         5.0
Luci          7.0
John          NaN
Jenny         8.0
Jeannine      6.0
Iluminada     2.0
Garland       9.0
Dorian       10.0
Demetria      2.0
Name: hw1, dtype: float64

### nlargest and nsmallest

Finds the n items with the largest or smallest values.


In [57]:
hw1.nlargest(1)

Name
Dorian    10.0
Name: hw1, dtype: float64

In [58]:
hw1.nsmallest(3)

Name
Shelby       1.0
Iluminada    2.0
Demetria     2.0
Name: hw1, dtype: float64
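
When several rows tie for the last place, <i>nlargest</i> and <i>nsmallest</i> keep the rows that appear first. The following sketch (made-up Series <i>s</i>) shows the parameter <i>keep</i>, which can return all tied rows instead:

```python
import pandas as pd

# Hypothetical grades where Iluminada and Demetria tie at 2.
s = pd.Series(
    [1.0, 2.0, 2.0, 5.0],
    index=["Shelby", "Iluminada", "Demetria", "Mercy"],
)

first_two = s.nsmallest(2)               # on a tie, keeps the row that appears first
with_ties = s.nsmallest(2, keep="all")   # keeps every row tied for the last place

print(list(first_two.index))
print(list(with_ties.index))
```

With <i>keep="all"</i> the result can contain more than n rows.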

### head and tail

Returns the first (or last) n rows according to the positional index.


In [59]:
hw1.head(2)

Name
Dorian      10.0
Jeannine     6.0
Name: hw1, dtype: float64

In [60]:
hw1.tail(3)

Name
Shelby    1.0
Mercy     5.0
John      NaN
Name: hw1, dtype: float64

## problems

Explore the parameters of the method "rank" to solve this question. Find the rank of each student (1=best, 10=worst) and deal with ties in the way that makes most sense to you.

In [61]:
hw1.rank(ascending=False, method='min')

Name
Dorian        1.0
Jeannine      5.0
Iluminada     8.0
Luci          4.0
Jenny         3.0
Demetria      8.0
Michael       5.0
Garland       2.0
Shelby       10.0
Mercy         7.0
John          NaN
Name: hw1, dtype: float64

Who got the 4th highest grade? Return both name and grade. (there are multiple ways to solve this)

In [62]:
hw1.nlargest(4).nsmallest(1)

Name
Luci    7.0
Name: hw1, dtype: float64

In [63]:
hw1.sort_values(ascending=False).iloc[[3]]

Name
Luci    7.0
Name: hw1, dtype: float64

In [64]:
hw1[hw1.rank(ascending=False, method='min') == 4]

Name
Luci    7.0
Name: hw1, dtype: float64

Retrieve the row of the person who comes last in alphabetical order.

In [65]:
hw1.sort_index().tail(1)

Name
Shelby    1.0
Name: hw1, dtype: float64

In [66]:
hw1.sort_index(ascending=False).iloc[[0]]

Name
Shelby    1.0
Name: hw1, dtype: float64

Retrieve the name only of the person who comes last in alphabetical order.

In [67]:
hw1.sort_index().tail(1).index[0]

'Shelby'

Retrieve the grade only of the person who comes last in alphabetical order.

In [68]:
hw1.sort_index().tail(1).values[0]

1.0

Among those whose name starts with ‘J’, who got the highest grade?

In [69]:
hw1.index >= 'J'

array([False,  True, False,  True,  True, False,  True, False,  True,
        True,  True])

In [70]:
hw1.index < 'K'

array([ True,  True,  True, False,  True,  True, False,  True, False,
       False,  True])

In [71]:
(hw1.index >= 'J') & (hw1.index < 'K')

array([False,  True, False, False,  True, False, False, False, False,
       False,  True])

In [72]:
hw1[(hw1.index >= 'J') & (hw1.index < 'K')].nlargest(1)

Name
Jenny    8.0
Name: hw1, dtype: float64
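
The range comparison above works because strings are compared alphabetically. A possibly more readable alternative, sketched here on a made-up Series <i>s</i>, is the string accessor <i>.str.startswith</i> on the index. This is one option among several, not the only way to solve the exercise.

```python
import pandas as pd

# Hypothetical grades with student names as labels.
s = pd.Series(
    [10.0, 6.0, 7.0, 8.0],
    index=["Dorian", "Jeannine", "Luci", "Jenny"],
)

starts_with_j = s.index.str.startswith("J")   # boolean array, one entry per label
best_j = s[starts_with_j].nlargest(1)         # highest grade among the J names

print(starts_with_j)
print(best_j)
```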

## Operations on one Series

### Operations between a scalar and a Series

Operations between a Series and a scalar (a real number) are performed element-wise on the values.

<b>Example</b>: It's Christmas time! As a gift, we want to increase everyone's grade by 5. What will the new grades be?

In [73]:
hw1 + 5

Name
Dorian       15.0
Jeannine     11.0
Iluminada     7.0
Luci         12.0
Jenny        13.0
Demetria      7.0
Michael      11.0
Garland      14.0
Shelby        6.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64

What if we wanted to multiply each grade by 2?

In [74]:
hw1 * 2

Name
Dorian       20.0
Jeannine     12.0
Iluminada     4.0
Luci         14.0
Jenny        16.0
Demetria      4.0
Michael      12.0
Garland      18.0
Shelby        2.0
Mercy        10.0
John          NaN
Name: hw1, dtype: float64

### abs

Returns the absolute value of all values

In [75]:
hw1.abs()

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

### astype

Sometimes it is useful to convert a Series to another type. For instance, convert a numeric Series into a Series of strings (<i>str</i>) or convert a Series of text into dates (<i>'datetime64[ns]'</i>). Here is how to convert a Series of floats to a Series of strings. We pass the built-in <i>str</i> because the alias <i>np.str</i> was deprecated in NumPy 1.20 and later removed.

In [76]:
import numpy as np

In [77]:
hw1_str = hw1.astype(str)


In [78]:
hw1_str

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          nan
Name: hw1, dtype: object

In [79]:
hw1_str.iloc[0]

'10.0'

In [80]:
type(hw1_str.iloc[0])

str
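
As a small sketch of what the conversion changes (made-up Series <i>s</i>): once the grades are strings, <i>+</i> concatenates text instead of adding numbers, and <i>astype(float)</i> converts the text back.

```python
import pandas as pd

# Hypothetical grades stored as floats.
s = pd.Series([10.0, 2.0], index=["Dorian", "Iluminada"])

as_text = s.astype(str)          # each value becomes a Python string
back = as_text.astype(float)     # and can be converted back to floats

print(as_text.iloc[0] + "!")     # string concatenation, not arithmetic
print(back.equals(s))
```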

## Operations between two Series

Operations between two Series are performed element-wise on those elements with the same index label.

Let's create a Series of the hw2 grades. Remember that we have a DataFrame object, <i>df</i>.

In [81]:
hw2 = df['hw2']

In [82]:
hw2

Name
Dorian       10.0
Jeannine      7.0
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      4.0
Michael      10.0
Garland       1.0
Shelby       10.0
Mercy         6.0
John         10.0
Name: hw2, dtype: float64

The operation is executed between elements with the same index label

In [83]:
hw1 + hw2

Name
Dorian       20.0
Jeannine     13.0
Iluminada     NaN
Luci         14.0
Jenny         NaN
Demetria      6.0
Michael      16.0
Garland      10.0
Shelby       11.0
Mercy        11.0
John          NaN
dtype: float64

<b>Example</b>: Compute everyone's average grade

In [84]:
(hw1 + hw2) / 2

Name
Dorian       10.0
Jeannine      6.5
Iluminada     NaN
Luci          7.0
Jenny         NaN
Demetria      3.0
Michael       8.0
Garland       5.0
Shelby        5.5
Mercy         5.5
John          NaN
dtype: float64
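
Whenever one of the two grades is missing, the result is NaN. The sketch below uses two made-up Series, <i>a</i> and <i>b</i>, with only partly overlapping labels to show the alignment by label. It also shows the method <i>add</i> with <i>fill_value=0</i>, which treats a missing grade as 0. Whether that is appropriate is a grading decision and an assumption of this example, not something the lecture prescribes.

```python
import pandas as pd

# Hypothetical grades: only Jenny appears in both Series.
a = pd.Series([1.0, 2.0], index=["Dorian", "Jenny"])
b = pd.Series([10.0, 20.0], index=["Jenny", "John"])

total = a + b                      # NaN wherever a label is missing on one side
filled = a.add(b, fill_value=0)    # a missing grade is counted as 0

print(total)
print(filled)
```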

## problems

<p>The average grade of hw1 is too low. We want to normalize it to 8. To this end, do the following <b>in one single command</b>:
<ol>
<li>decrease everyone's grade by the average grade (this will set the new average to 0)</li>
<li>increase everyone's grade by 8</li>
</ol>
</p>
<p>Note that some students’ grade might become greater than 10 – don’t worry about it.</p>

In [85]:
(hw1 - hw1.mean()) + 8

Name
Dorian       12.4
Jeannine      8.4
Iluminada     4.4
Luci          9.4
Jenny        10.4
Demetria      4.4
Michael       8.4
Garland      11.4
Shelby        3.4
Mercy         7.4
John          NaN
Name: hw1, dtype: float64

To verify it:

In [86]:
((hw1 - hw1.mean()) + 8).mean()

8.000000000000002

Compute the average grade between hw1 and hw2 of each student. Which student has the average closest to 6.7?


In [87]:
((((hw1 + hw2) / 2) - 6.7).abs()).nsmallest(1)

Name
Jeannine    0.2
dtype: float64

In [88]:
((((hw1+hw2)/2) - 6.7).abs()).nsmallest(1).index[0]

'Jeannine'

Or, more directly:

In [89]:
(((hw1 + hw2) / 2) - 6.7).abs().idxmin()

'Jeannine'

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b>: Selecting a column with single brackets (for example, df['hw1']) returns a Pandas Series, whereas double brackets [[  ]] (df[ ['hw1'] ]) return a Pandas DataFrame.
</div>

In [90]:
df['hw1']

Name
Dorian       10.0
Jeannine      6.0
Iluminada     2.0
Luci          7.0
Jenny         8.0
Demetria      2.0
Michael       6.0
Garland       9.0
Shelby        1.0
Mercy         5.0
John          NaN
Name: hw1, dtype: float64

In [91]:
type(df['hw1'])

pandas.core.series.Series

In [92]:
df[['hw1']]

Name,hw1
Dorian,10.0
Jeannine,6.0
Iluminada,2.0
Luci,7.0
Jenny,8.0
Demetria,2.0
Michael,6.0
Garland,9.0
Shelby,1.0
Mercy,5.0


In [93]:
type(df[['hw1']])

pandas.core.frame.DataFrame

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b>: We can also use <b>pd.DataFrame(Pandas.Series)</b> or <b>Pandas.Series.to_frame()</b> to convert a Series into a Pandas DataFrame.
</div>

In [94]:
hw1.to_frame()

Name,hw1
Dorian,10.0
Jeannine,6.0
Iluminada,2.0
Luci,7.0
Jenny,8.0
Demetria,2.0
Michael,6.0
Garland,9.0
Shelby,1.0
Mercy,5.0


In [95]:
pd.DataFrame(hw1)

Name,hw1
Dorian,10.0
Jeannine,6.0
Iluminada,2.0
Luci,7.0
Jenny,8.0
Demetria,2.0
Michael,6.0
Garland,9.0
Shelby,1.0
Mercy,5.0
