# Lecture 5-1

# Pandas

## Week 5 Monday

## Miles Chen, PhD

# Pandas

NumPy creates ndarrays that must contain values that are of the same data type.

Pandas creates dataframes. **Each column in a dataframe is an ndarray**. This allows us to have traditional tables of data where each column can be a different data type.
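As a minimal sketch of that difference (the column names and values here are made up for illustration), NumPy coerces mixed inputs to one common dtype, while a DataFrame keeps a separate dtype per column:

```python
import numpy as np
import pandas as pd

# NumPy picks a single dtype that can hold every element,
# so mixing integers and floats upcasts everything to float64.
arr = np.array([1, 2, 3.5])
print(arr.dtype)          # float64

# A DataFrame stores each column separately, so dtypes can differ.
df = pd.DataFrame({"count": [1, 2, 3],
                   "price": [0.5, 1.5, 2.5]})
print(df.dtypes)
```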

Important References:

https://pandas.pydata.org/pandas-docs/stable/reference/series.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [1]:
import numpy as np
import pandas as pd

The basic data structure in pandas is the *series*. You can construct it in a similar fashion to making a numpy array.

The command to make a Series object is

`pd.Series(data, index=index)`

The `index` argument is optional.

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(type(data))

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
<class 'pandas.core.series.Series'>


Automatically created index

In [3]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The series is printed out in a table form. The type is a Pandas Series.

Evaluating `data` on its own produces almost exactly the same output as `print(data)`.

In [4]:
print(data.values)

[0.25 0.5  0.75 1.  ]


In [5]:
print(type(data.values))

<class 'numpy.ndarray'>


The values attribute of the series is a numpy array.

In [6]:
print(data.index)

RangeIndex(start=0, stop=4, step=1)


Range index object whose type is a `pandas.RangeIndex`

In [7]:
print(type(data.index))  # the row names are known as the index

<class 'pandas.core.indexes.range.RangeIndex'>


You can subset a pandas series like other python objects

In [8]:
print(data[1])

0.5


In [9]:
print(type(data[1]))  # when you select only one value, it simplifies the object

<class 'numpy.float64'>


In [10]:
print(data[1:3])

1    0.50
2    0.75
dtype: float64


In [11]:
print(type(data[1:3]))  # slicing / selecting multiple values returns a series

<class 'pandas.core.series.Series'>


You can also subset with an array of positions (fancy indexing). In the cell after that, you specify your own index labels, which appear in the left column.

In [12]:
print(data[np.array([1, 0, 1, 2])])  # You can also do fancy indexing by subsetting w/a numpy array

1    0.50
0    0.25
1    0.50
2    0.75
dtype: float64


In [13]:
# Pandas uses a 0-based index by default. You may also specify the index values
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index = ['a', 'b', 'c', 'd'])
print(data)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


In [14]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [15]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

The index dtype above is `object` because the labels are Python strings, which pandas stores as generic Python objects.

In [16]:
data[1]  # subset with index position

0.5

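A virtual environment pins the package versions you are using, which matters here: recent pandas releases deprecate treating an integer key as a position when the index holds labels, so `data.iloc[1]` is the explicit way to subset by position.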

In [17]:
data["a"]  # subset with index names

0.25

In [18]:
data[0:2]  # slicing behavior is unchanged

a    0.25
b    0.50
dtype: float64

In [19]:
data["a":"c"] # slicing using index names includes the last value

a    0.25
b    0.50
c    0.75
dtype: float64

Slicing based on names (above) goes up to and includes "c"
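A short sketch of the two slicing rules side by side, using the explicit `.iloc` (position) and `.loc` (label) accessors on a Series like the one above:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]     # positions 0 and 1; the stop is excluded
by_label = s.loc["a":"c"]     # labels 'a' through 'c'; the stop is included

print(list(by_position.index))  # ['a', 'b']
print(list(by_label.index))     # ['a', 'b', 'c']
```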

In [20]:
# creating a series from a python dictionary
# remember, dictionary construction uses curly braces {}
samp_dict = {'Tony Stark': "Robert Downey Jr.",
              'Steve Rogers': "Chris Evans",
              'Natasha Romanoff': "Scarlett Johansson",
              'Bruce Banner': "Mark Ruffalo",
              'Thor': "Chris Hemsworth",
              'Clint Barton': "Jeremy Renner"}
samp_series = pd.Series(samp_dict)
samp_series

Tony Stark           Robert Downey Jr.
Steve Rogers               Chris Evans
Natasha Romanoff    Scarlett Johansson
Bruce Banner              Mark Ruffalo
Thor                   Chris Hemsworth
Clint Barton             Jeremy Renner
dtype: object

You can take a dictionary and create a pandas Series from it. The keys become the index and the values remain the values (here those values are the actors' names).

In [21]:
print(samp_series.index) # dtype = object is for strings but allows mixed data types.

Index(['Tony Stark', 'Steve Rogers', 'Natasha Romanoff', 'Bruce Banner',
       'Thor', 'Clint Barton'],
      dtype='object')


In [22]:
samp_series.values

array(['Robert Downey Jr.', 'Chris Evans', 'Scarlett Johansson',
       'Mark Ruffalo', 'Chris Hemsworth', 'Jeremy Renner'], dtype=object)

In [23]:
# ages during the First Avengers film (2012)
# I don't have an exact source, don't get mad at me.
age_dict = {'Thor': 1493,
              'Steve Rogers': 104,
              'Natasha Romanoff': 28,
              'Clint Barton': 41,
              'Tony Stark': 42,
              'Bruce Banner': 42}  # note that the dictionary order is not same here
ages = pd.Series(age_dict)
print(ages)

Thor                1493
Steve Rogers         104
Natasha Romanoff      28
Clint Barton          41
Tony Stark            42
Bruce Banner          42
dtype: int64


In [24]:
# Super Hero Names
hero_dict = {'Thor': np.nan,  # np.nan is the spelling that works in all NumPy versions (np.NaN was removed in NumPy 2.0)
              'Steve Rogers': 'Captain America',
              'Natasha Romanoff': 'Black Widow',
              'Clint Barton': 'Hawkeye',
              'Tony Stark': 'Iron Man',
              'Bruce Banner': 'Hulk'}
hero_names = pd.Series(hero_dict)
print(hero_names)

Thor                            NaN
Steve Rogers        Captain America
Natasha Romanoff        Black Widow
Clint Barton                Hawkeye
Tony Stark                 Iron Man
Bruce Banner                   Hulk
dtype: object


# Creating a DataFrame

There are multiple ways of creating a DataFrame in Pandas. The next few slides show a few.

We can create a DataFrame by providing a dictionary of Series objects. Each dictionary key becomes a column name, and each Series supplies that column's values. The index labels of the Series become the row index.

In [25]:
avengers = pd.DataFrame({'actor': samp_series,
                         'hero name': hero_names,
                         'age': ages})  

# the DataFrame will match the indices and sort them
print(avengers)

                               actor        hero name   age
Bruce Banner            Mark Ruffalo             Hulk    42
Clint Barton           Jeremy Renner          Hawkeye    41
Natasha Romanoff  Scarlett Johansson      Black Widow    28
Steve Rogers             Chris Evans  Captain America   104
Thor                 Chris Hemsworth              NaN  1493
Tony Stark         Robert Downey Jr.         Iron Man    42


Pandas matches up the indices and puts them into alphabetical order (look at the left column). This is because the Series had different orderings.
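The alignment can be seen in a small sketch. The two Series below are hypothetical: they share labels in different orders, and one label appears in only one of them.

```python
import pandas as pd

# Two hypothetical Series with the same labels in different orders,
# plus one label ('z') that appears only in the second Series.
left = pd.Series({"b": 2, "a": 1})
right = pd.Series({"a": 10.0, "z": 30.0, "b": 20.0})

combined = pd.DataFrame({"left": left, "right": right})
print(combined)

# Rows are matched by label, not by position.
# 'z' has no value in `left`, so that cell is NaN.
```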

In [26]:
print(type(avengers))  # this is a DataFrame object

<class 'pandas.core.frame.DataFrame'>


Another option is a **list of dictionaries**, one per row. Each dictionary needs to have the same set of keys; otherwise, NaNs will appear.

In [27]:
data = [{'a': 0, 'b': 0},
        {'a': 1, 'b': 2},
        {'a': 2, 'b': 5}]  
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 5}]

A column for `a` and a column for `b`.

In [28]:
print(pd.DataFrame(data, index = [1, 2, 3]))

   a  b
1  0  0
2  1  2
3  2  5


In [29]:
data2 = [{'a': 0, 'b': 0},
         {'a': 1, 'b': 2},
         {'a': 2, 'c': 5}] # mismatch of keys. NAs will appear
data2

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'c': 5}]

In [30]:
pd.DataFrame(data2)  # if the index argument is not supplied, it defaults to integer index start at 0

|   | a | b | c |
|---|---|---|---|
| 0 | 0 | 0.0 | NaN |
| 1 | 1 | 2.0 | NaN |
| 2 | 2 | NaN | 5.0 |


NaN for where there aren't values, like how we only have 1 `c` value.
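A side effect worth noticing is the dtype change: NaN is a floating-point value, so a column that contains one is stored as float64. The sketch below rebuilds the same list of dictionaries and inspects the result.

```python
import pandas as pd

records = [{"a": 0, "b": 0},
           {"a": 1, "b": 2},
           {"a": 2, "c": 5}]
frame = pd.DataFrame(records)

# Column 'a' is complete, so it stays an integer column.
# Columns 'b' and 'c' each contain a NaN, which is a float,
# so pandas stores those columns as float64.
print(frame.dtypes)
print(frame.isna().sum())   # number of missing values per column
```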

You can convert a dictionary to a DataFrame. The keys form column names, and the values are lists/arrays of values. **The arrays need to be of the same length.**

In [31]:
data3 = {'a': [1, 2, 3],
         'b': ['x','y','z']} 
data3

{'a': [1, 2, 3], 'b': ['x', 'y', 'z']}

In [32]:
pd.DataFrame(data3)

|   | a | b |
|---|---|---|
| 0 | 1 | x |
| 1 | 2 | y |
| 2 | 3 | z |


Evaluating a DataFrame as the last expression in a cell renders it as a formatted table with the column names and index in bold, as seen above. This is different from `print`, which produces monospace plain text.

## Q1: A

In [34]:
data4 = {'a': [1, 2, 3, 4],
         'b': ['x','y','z']} # arrays are not of the same length
pd.DataFrame(data4)

ValueError: arrays must all be same length

Note that the error on the last line lets you easily know what went wrong.
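If the ragged lengths are intentional, one possible workaround (a sketch, not the only option) is to wrap each list in a Series first. The columns are then aligned by index label and the short one is padded with NaN.

```python
import pandas as pd

data4 = {"a": [1, 2, 3, 4],
         "b": ["x", "y", "z"]}

try:
    pd.DataFrame(data4)
except ValueError as err:
    print("ValueError:", err)

# Wrapping each list in a Series gives every value an index label,
# so pandas can align the columns and pad the short one with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in data4.items()})
print(padded)
```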

Turn a 2D NumPy array (matrix) into a DataFrame by adding column names and optionally index values.

In [35]:
data = np.random.randint(10, size = 10).reshape((5,2))
print(data)

[[1 3]
 [7 8]
 [8 8]
 [5 3]
 [4 9]]


In [36]:
print(pd.DataFrame(data, columns = ["x","y"], index = ['a','b','c','d','e']))

   x  y
a  1  3
b  7  8
c  8  8
d  5  3
e  4  9


Column names are specified via `columns = ` and indices via `index = `

# Subsetting the DataFrame

In a DataFrame, the `.columns` attribute shows the column names and the `.index` attribute shows the row names.

In [37]:
print(avengers)

                               actor        hero name   age
Bruce Banner            Mark Ruffalo             Hulk    42
Clint Barton           Jeremy Renner          Hawkeye    41
Natasha Romanoff  Scarlett Johansson      Black Widow    28
Steve Rogers             Chris Evans  Captain America   104
Thor                 Chris Hemsworth              NaN  1493
Tony Stark         Robert Downey Jr.         Iron Man    42


In [38]:
print(avengers.columns)

Index(['actor', 'hero name', 'age'], dtype='object')


Your column names are printed/accessed above. Below are the indices.

In [39]:
print(avengers.index)

Index(['Bruce Banner', 'Clint Barton', 'Natasha Romanoff', 'Steve Rogers',
       'Thor', 'Tony Stark'],
      dtype='object')


You can select a column using dot notation or with single square brackets.

In [40]:
avengers.actor  # extracting the column

Bruce Banner              Mark Ruffalo
Clint Barton             Jeremy Renner
Natasha Romanoff    Scarlett Johansson
Steve Rogers               Chris Evans
Thor                   Chris Hemsworth
Tony Stark           Robert Downey Jr.
Name: actor, dtype: object

In [41]:
avengers["hero name"] # if there's a space in the column name, you'll need to use square brackets

Bruce Banner                   Hulk
Clint Barton                Hawkeye
Natasha Romanoff        Black Widow
Steve Rogers        Captain America
Thor                            NaN
Tony Stark                 Iron Man
Name: hero name, dtype: object

You can't use dot notation when a column name contains spaces; use square brackets instead, as above.

In [42]:
type(avengers.actor)

pandas.core.series.Series

The selected column is a Pandas Series and can be subset accordingly.

In [43]:
avengers.actor[1] # 0 based indexing

'Jeremy Renner'

In [44]:
avengers.actor[avengers.age == 42]

Bruce Banner         Mark Ruffalo
Tony Stark      Robert Downey Jr.
Name: actor, dtype: object

`avengers.age == 42` gives a boolean Series with one True/False value per row, which is the mask being used above.
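To make the mask visible, here is a small sketch that uses a stand-in `ages` Series rather than the full `avengers` DataFrame:

```python
import pandas as pd

ages = pd.Series({"Bruce Banner": 42, "Clint Barton": 41, "Tony Stark": 42})

mask = ages == 42          # element-wise comparison
print(mask)                # a boolean Series sharing the index of `ages`
print(ages[mask])          # keeps only the rows where the mask is True
```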

In [45]:
avengers["hero name"]['Steve Rogers']

'Captain America'

In [46]:
avengers["hero name"]['Steve Rogers':'Tony Stark']

Steve Rogers    Captain America
Thor                        NaN
Tony Stark             Iron Man
Name: hero name, dtype: object

# `.loc`
The `.loc` attribute can be used to subset the DataFrame using the index names.

Returns the row "Thor" as a Series, with the column names as its index (which is why it prints vertically):

In [47]:
avengers.loc['Thor'] # subset based on location to get a row

actor        Chris Hemsworth
hero name                NaN
age                     1493
Name: Thor, dtype: object

In [48]:
print(type(avengers.loc['Thor']))
print(type(avengers.loc['Thor'].values))  # the values are of mixed type but are still a numpy array
# this is possible because the array has dtype object: each element is a reference to a Python object

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>


This is an object-dtype array, not a structured array: each element is a reference to an arbitrary Python object, so a string, a NaN, and an integer can sit side by side. The `dtype: object` in the output of `avengers.loc['Thor']` confirms this.
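A quick check of what kind of array a mixed-type row holds. This is a sketch with a one-row stand-in frame, not the `avengers` DataFrame itself.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"actor": ["Chris Hemsworth"], "age": [1493]},
                     index=["Thor"])
row = frame.loc["Thor"]

# A row mixes a string and an integer, so the Series falls back to
# the generic `object` dtype; the array has no per-field layout.
print(row.dtype)                 # object
print(row.values.dtype.names)    # None: a structured array would list field names
print([type(v).__name__ for v in row.values])
```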

In [49]:
print(avengers.loc[ : ,'age']) # subset based on location to get a column

Bruce Banner          42
Clint Barton          41
Natasha Romanoff      28
Steve Rogers         104
Thor                1493
Tony Stark            42
Name: age, dtype: int64


`print(avengers.age)` gives the same result as above.

In [51]:
print(type(avengers.loc[:,'age']))  #the object is a pandas series
print(type(avengers.loc[:,'age'].values))

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>


In [52]:
avengers.loc['Steve Rogers','age']  # you can provide a pair of 'coordinates' to get a particular value

104

In [54]:
avengers.loc['Steve Rogers':'Thor', 'hero name':'age']

|   | hero name | age |
|---|---|---|
| Steve Rogers | Captain America | 104 |
| Thor | NaN | 1493 |


# `.iloc`

The `.iloc` attribute can be used to **subset the DataFrame using the index position (zero-indexed)**.

In [55]:
avengers.iloc[3,] # subset based on index location

actor            Chris Evans
hero name    Captain America
age                      104
Name: Steve Rogers, dtype: object

In [56]:
avengers.iloc[0, 1] # pair of coordinates

'Hulk'

# Assignment with `.loc` and `.iloc`
The `.loc` and `.iloc` attributes can be used in conjunction with assignment.

In [57]:
avengers

|   | actor | hero name | age |
|---|---|---|---|
| Bruce Banner | Mark Ruffalo | Hulk | 42 |
| Clint Barton | Jeremy Renner | Hawkeye | 41 |
| Natasha Romanoff | Scarlett Johansson | Black Widow | 28 |
| Steve Rogers | Chris Evans | Captain America | 104 |
| Thor | Chris Hemsworth | NaN | 1493 |
| Tony Stark | Robert Downey Jr. | Iron Man | 42 |


In [58]:
# set values individually
avengers.loc['Thor', 'age'] = 1500
avengers.loc['Thor', 'hero name'] = 'Thor'
avengers

|   | actor | hero name | age |
|---|---|---|---|
| Bruce Banner | Mark Ruffalo | Hulk | 42 |
| Clint Barton | Jeremy Renner | Hawkeye | 41 |
| Natasha Romanoff | Scarlett Johansson | Black Widow | 28 |
| Steve Rogers | Chris Evans | Captain America | 104 |
| Thor | Chris Hemsworth | Thor | 1500 |
| Tony Stark | Robert Downey Jr. | Iron Man | 42 |


Now Thor's hero name is `Thor` and his age is `1500`.

In [59]:
# assign multiple values at once
avengers.loc['Thor', ['hero name', 'age']] = [np.nan, 1493]  # np.nan works in all NumPy versions
avengers

|   | actor | hero name | age |
|---|---|---|---|
| Bruce Banner | Mark Ruffalo | Hulk | 42 |
| Clint Barton | Jeremy Renner | Hawkeye | 41 |
| Natasha Romanoff | Scarlett Johansson | Black Widow | 28 |
| Steve Rogers | Chris Evans | Captain America | 104 |
| Thor | Chris Hemsworth | NaN | 1493 |
| Tony Stark | Robert Downey Jr. | Iron Man | 42 |


We've changed the values back to their original values for Thor.

# `.loc` vs `.iloc` with numeric index

The following DataFrame has a numeric index, but it starts at 1 instead of 0.

In [60]:
data = [{'a': 11, 'b': 2},
        {'a': 12, 'b': 4},
        {'a': 13, 'b': 6}]  
df = pd.DataFrame(data, index = [1, 2, 3])
df

|   | a | b |
|---|---|---|
| 1 | 11 | 2 |
| 2 | 12 | 4 |
| 3 | 13 | 6 |


Below gives you the first row of the DataFrame above. Note how this DataFrame's index starts at 1, not 0.

In [61]:
df.loc[1, :] # .loc always uses the actual index.

a    11
b     2
Name: 1, dtype: int64

In [None]:
df.iloc[1, :] # .iloc always uses the position using a 0-based index.

In [62]:
df.iloc[3, :] # using a position that doesn't exist results in an exception. 

IndexError: single positional indexer is out-of-bounds
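The contrast can be checked directly. The sketch below rebuilds an equivalent DataFrame with index `[1, 2, 3]` and compares a label lookup with a positional lookup, then shows which exception each accessor raises for a key that does not exist.

```python
import pandas as pd

df = pd.DataFrame({"a": [11, 12, 13], "b": [2, 4, 6]}, index=[1, 2, 3])

print(df.loc[1, "a"])    # label 1 is the first row     -> 11
print(df.iloc[1, 0])     # position 1 is the second row -> 12

# Label 0 does not exist, and position 3 is past the end.
for description, lookup in [("df.loc[0]", lambda: df.loc[0]),
                            ("df.iloc[3]", lambda: df.iloc[3])]:
    try:
        lookup()
    except (KeyError, IndexError) as err:
        print(description, "raised", type(err).__name__)
```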

## Q2: E

# Boolean subsetting examples with `.loc`

In [63]:
# select avengers whose age is less than 50 and greater than 40
# select the columns 'hero name' and 'age'
avengers.loc[ (avengers.age < 50) & (avengers.age > 40), ['hero name', 'age']]

|   | hero name | age |
|---|---|---|
| Bruce Banner | Hulk | 42 |
| Clint Barton | Hawkeye | 41 |
| Tony Stark | Iron Man | 42 |


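Note the bitwise and operator, `&`, which is what you want to use because it combines the two conditions element by element. DON'T use `and`: it tries to reduce each whole Series to a single True/False value and raises a `ValueError`. Wrap each comparison in parentheses, because `&` binds more tightly than `<` and `>`.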
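A small sketch of both behaviours, using a stand-in `age` Series (the labels are only for illustration):

```python
import pandas as pd

age = pd.Series([42, 41, 28, 104],
                index=["Hulk", "Hawkeye", "Black Widow", "Captain America"])

# `&` combines the two boolean Series element by element.
# The parentheses are required because `&` binds tighter than `<` and `>`.
in_forties = (age > 40) & (age < 50)
print(age[in_forties])

# `and` asks Python for a single True/False value for each whole Series,
# which pandas refuses to guess.
try:
    (age > 40) and (age < 50)
except ValueError as err:
    print("ValueError:", err)
```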

In [64]:
# Use the index of the DataFrame, treat it as a string, and select rows that start with B
avengers.loc[ avengers.index.str.startswith('B'), : ]

|   | actor | hero name | age |
|---|---|---|---|
| Bruce Banner | Mark Ruffalo | Hulk | 42 |


`.str` gives access to string-based functions, like `startswith` that's used here

In [65]:
# Use the index of the DataFrame, treat it as a string,
# find the character capital R. Find returns -1 if it does not find the letter
# We select rows that did not result in -1, which means it does contain a capital R
avengers.loc[ avengers.index.str.find('R') != -1, : ]

|   | actor | hero name | age |
|---|---|---|---|
| Natasha Romanoff | Scarlett Johansson | Black Widow | 28 |
| Steve Rogers | Chris Evans | Captain America | 104 |


`.str.find` returns the position of the first match if found, otherwise -1.
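For a plain "does it contain this text" test, `.str.contains` is an alternative that returns booleans directly. The sketch below compares the two on a stand-in index; `regex=False` is passed so that the pattern is treated as literal text.

```python
import pandas as pd

names = pd.Index(["Bruce Banner", "Natasha Romanoff", "Steve Rogers", "Thor"])

positions = names.str.find("R")                 # -1 where 'R' is absent
has_r = names.str.contains("R", regex=False)    # booleans, same selection

print(list(positions))
print(list(names[has_r]))
```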

# Other commonly used DataFrame attributes

In [66]:
avengers.T  # the transpose

|   | Bruce Banner | Clint Barton | Natasha Romanoff | Steve Rogers | Thor | Tony Stark |
|---|---|---|---|---|---|---|
| actor | Mark Ruffalo | Jeremy Renner | Scarlett Johansson | Chris Evans | Chris Hemsworth | Robert Downey Jr. |
| hero name | Hulk | Hawkeye | Black Widow | Captain America | NaN | Iron Man |
| age | 42 | 41 | 28 | 104 | 1493 | 42 |


In [67]:
avengers.dtypes # the data types contained in the DataFrame

actor        object
hero name    object
age           int64
dtype: object

This tells you the data type of each column.

In [None]:
avengers.shape # shape

# Importing Data with pd.read_csv()

In [68]:
# Titanic Dataset
url = 'https://assets.datacamp.com/production/course_1607/datasets/titanic_sub.csv'
titanic = pd.read_csv(url)

This is how we'll mostly be "creating" DataFrames, rather than with the methods shown previously.
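`pd.read_csv` also accepts a local file path or a file-like object, which is handy for trying it out without a network connection. The sketch below reads a tiny made-up CSV from memory; the column names and values are hypothetical.

```python
import io
import pandas as pd

# A tiny hypothetical CSV held in memory; `pd.read_csv` accepts a
# file path, a URL, or any file-like object such as this one.
csv_text = """id,score,group
1,0.5,x
2,,y
3,2.5,x
"""
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.isna().sum())   # the empty score field is read as NaN
```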

In [69]:
titanic

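|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |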


In [70]:
titanic.shape

(891, 11)

891 rows and 11 columns

In [71]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [73]:
titanic.index

RangeIndex(start=0, stop=891, step=1)

In [72]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


This gives us summary information, like how many values are NOT missing (`Non-Null Count`)
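The same counts can be computed directly. This sketch uses a small hypothetical frame in place of the Titanic data to show how `notna()` and `isna()` relate to the `Non-Null Count` column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic-style columns.
frame = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                      "Cabin": [None, "C85", None, None]})

non_null = frame.notna().sum()   # corresponds to the Non-Null Count column
missing = frame.isna().sum()     # number of rows minus the non-null count

print(non_null)
print(missing)
```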

In [74]:
titanic.describe()  # displays summary statistics of the numeric variables

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Above gives only numeric summary information. Note how `Pclass` is categorical but stored as 1, 2, 3.
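One way to handle such a column, shown here as a sketch on hypothetical class codes rather than the real `Pclass` data, is to count the values or convert the column to the `category` dtype so that the summary treats the codes as labels.

```python
import pandas as pd

# Hypothetical class codes standing in for the Pclass column.
pclass = pd.Series([3, 1, 3, 1, 3, 2], name="Pclass")

# Counts per class are more informative than a mean of the codes.
counts = pclass.value_counts().sort_index()
print(counts)

# As a categorical, describe() reports count, unique, top, and freq.
as_category = pclass.astype("category")
summary = as_category.describe()
print(summary)
```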

## Q3: A