# Pandas Introduction Part 1

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

This notebook has examples which illustrate the basics of Pandas.

Another notebook (TODO: link) will discuss Pandas in more depth.

In [2]:
import pandas as pd
import numpy as np

## Components of a DataFrame 
A DataFrame has:
1. column labels (identifiers)
2. row labels (identifiers)
3. values (cells)

In [3]:
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']

df = pd.DataFrame(data=data, columns=columns, index=index)
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 1 | 2 |
| Row2 | 10 | 21 | 22 |
| Row3 | 20 | 31 | 32 |


In [4]:
# 3 rows and 3 columns
rows, cols = df.shape
print(f'Rows: {rows} Cols: {cols}')

Rows: 3 Cols: 3


In [5]:
# data as nested list
df.values.tolist()

[[0, 1, 2], [10, 21, 22], [20, 31, 32]]

In [6]:
# column IDs as list
df.columns.tolist()

['Col1', 'Col2', 'Col3']

In [7]:
# row IDs as list
df.index.tolist()

['Row1', 'Row2', 'Row3']

## DataFrame Column and Row Labels

df.columns and df.index are always of type pd.Index (or a subclass of pd.Index). Note that a DataFrame has no df.rows attribute; the row labels are held in df.index.

In [8]:
# get column labels
df.columns

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [9]:
isinstance(df.columns, pd.Index)

True

In [10]:
# get row labels
df.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [11]:
isinstance(df.index, pd.Index)

True

## Distinguish Between pd.Index and df.index

**pd.Index:** This is a class.  
**df.columns:** This is an instance of pd.Index (or subclass)  
**df.index:**   This is an instance of pd.Index (or subclass)

df.index contains the row labels
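
As a small sketch of this distinction, the snippet below rebuilds the 3x3 DataFrame twice and checks the concrete types of its indexes. The variable names `labeled` and `default` are illustrative and not part of the original notebook.

```python
import pandas as pd

data = [[0, 1, 2], [10, 21, 22], [20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']

# explicit row labels: df.index is an instance of pd.Index
labeled = pd.DataFrame(data, columns=columns, index=['Row1', 'Row2', 'Row3'])

# no row labels given: pandas builds a RangeIndex, a subclass of pd.Index
default = pd.DataFrame(data, columns=columns)

print(type(labeled.index).__name__)
print(type(default.index).__name__)
print(issubclass(pd.RangeIndex, pd.Index))
```

In both cases `isinstance(df.index, pd.Index)` holds, which is why code that only needs "an index" can usually treat the two the same way.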

### DataFrame Column Values

In [12]:
# col values, using [] with column name
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [13]:
# the column is an instance of pd.Series
isinstance(df['Col1'], pd.Series)

True

In [14]:
# The Series holds values of type int64
df['Col1'].dtype

dtype('int64')

In [15]:
# when selecting columns, the index for the columns is the index for the DataFrame
df['Col1'].index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [16]:
# are the Indexes equal?
df['Col1'].index.equals(df.index)

True

In [17]:
# are the Index values equal?
df['Col1'].index == df.index

array([ True,  True,  True])

In [18]:
# are all the Index values equal?
(df['Col1'].index == df.index).all()

True

### **.loc\[rows, cols\]**

Where rows is one of:
* a single row label
* a list of row labels
* a slice of row labels in index order, inclusive of the slice end

Where cols is one of:
* a single column label
* a list of column labels
* a slice of column labels in index order, inclusive of the slice end

In [28]:
# equivalent to df['Col1'] above
df.loc[:, 'Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [29]:
s1 = df['Col1']
s2 = df.loc[:, 'Col1']
print(f'is Series: {isinstance(s1, pd.Series)}')
print(f'is Series: {isinstance(s2, pd.Series)}')

is Series: True
is Series: True


In [30]:
# are the Series equal
s1.equals(s2)

True

In [31]:
# are the Series values equal?
s1 == s2

Row1    True
Row2    True
Row3    True
Name: Col1, dtype: bool

In [32]:
# are all Series values equal?
(s1 == s2).all()

True

In [57]:
# do they refer to the same object in memory?
# (True here because of pandas' internal caching; the result can differ between pandas versions)
s1 is s2

True

### DataFrame Row Values

In [33]:
# row values with .loc[]
df.loc['Row1', :]

Col1    0
Col2    1
Col3    2
Name: Row1, dtype: int64

In [34]:
# the 2nd argument can be skipped if you want all columns
df.loc['Row1']

Col1    0
Col2    1
Col3    2
Name: Row1, dtype: int64

In [35]:
isinstance(df.loc['Row1'], pd.Series)

True

In [36]:
# The row holds values of type int64
s = df.loc['Row1']
s.dtype

dtype('int64')

### Selection of Values using Both Row and Column **Labels**
Use **df.loc\[row_labels, col_labels\]** or **df.loc\[boolean_series, col_labels\]**
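
The boolean form is not shown in the cells that follow, so here is a minimal sketch of it. The names `mask` and `subset` are illustrative; the DataFrame is the same 3x3 example used above.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'],
                  index=['Row1', 'Row2', 'Row3'])

# boolean Series with the same index as df: True marks the rows to keep
mask = df['Col1'] >= 10

# rows where mask is True, restricted to two columns by label
subset = df.loc[mask, ['Col2', 'Col3']]
print(subset)
```

Only Row2 and Row3 satisfy the condition, so `subset` holds those two rows and the two requested columns.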

In [26]:
# a single label for both references exactly 1 cell, return: single value
df.loc['Row2', 'Col2']

21

In [27]:
# a list of labels may reference more than 1 cell, return: DataFrame
df.loc[['Row2'], ['Col2']]

| | Col2 |
|---|---|
| Row2 | 21 |


In [28]:
df.loc[['Row2', 'Row3'], ['Col1', 'Col2']]

| | Col1 | Col2 |
|---|---|---|
| Row2 | 10 | 21 |
| Row3 | 20 | 31 |


In [29]:
# same labels specified in different order
df.loc[['Row3', 'Row2'], ['Col2', 'Col1']]

| | Col2 | Col1 |
|---|---|---|
| Row3 | 31 | 20 |
| Row2 | 21 | 10 |


In [30]:
# .loc[] is inclusive with slices!
df.loc['Row2':'Row3','Col1':'Col2']

| | Col1 | Col2 |
|---|---|---|
| Row2 | 10 | 21 |
| Row3 | 20 | 31 |


In [31]:
# .loc[] slices must follow the order of the labels in the index!
# this reversed slice matches nothing, so the result is an empty DataFrame
df.loc['Row3':'Row2','Col2':'Col1']

In [56]:
# is_monotonic was removed in pandas 2.0; is_monotonic_increasing is the replacement
df.index.is_monotonic_increasing

True

In [32]:
# it is a KeyError if the label is not in the column index
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

KeyError: 'Col99'


In [33]:
# it is a KeyError if the label is not in the row index
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

KeyError: 'Row99'


### Pandas Operations Modify In-Place or Return Modified Values
By default, most Pandas operations return a new, modified object and leave the underlying DataFrame or Series unchanged.  Many operations also accept a keyword parameter, inplace, which when set to True causes the operation to modify the DataFrame or Series itself.  For very large DataFrames where memory is a concern, inplace=True may be worth trying, although it does not always avoid an internal copy.  Otherwise, it is usually safer to keep the default of inplace=False and assign the return value to a variable.
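
A minimal sketch of the two styles, using sort_values() on a small made-up DataFrame (the data and the names `before` and `sorted_df` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [20, 0, 10]})

# default (inplace=False): a new sorted DataFrame is returned, df is unchanged
sorted_df = df.sort_values('Col1')
before = df['Col1'].tolist()
print(before)                       # original order
print(sorted_df['Col1'].tolist())   # sorted copy

# inplace=True: df itself is re-ordered
df.sort_values('Col1', inplace=True)
print(df['Col1'].tolist())
```

Assigning the result, as in the first call, keeps the original available for comparison, which is one reason the default is often the safer choice.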

### Working with the Default Index
The default index creates a range of integers that match the row position.

Even though the default index matches the row position, it should still be thought of as "row ID", not "row position".
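
To see why the distinction matters, here is a small sketch, assuming a row is removed with drop(). The name `kept` is illustrative. After the removal, the remaining rows keep their original labels, so the labels no longer match the positions.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'])

# remove the row labeled 0; the other rows keep labels 1 and 2
kept = df.drop(index=0)
print(kept.index.tolist())

# label 1 is still valid, although that row is now the first row
print(kept.loc[1, 'Col1'])

# label 0 no longer exists, even though there is still a first row
try:
    kept.loc[0]
except KeyError as err:
    print(f'KeyError: {err}')
```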

In [38]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [39]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [40]:
# this is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [41]:
df.loc[1:2, 'Col1':'Col2']

| | Col1 | Col2 |
|---|---|---|
| 1 | 10 | 21 |
| 2 | 20 | 31 |


In [42]:
df.loc[[1, 2], ['Col1', 'Col2']]

| | Col1 | Col2 |
|---|---|---|
| 1 | 10 | 21 |
| 2 | 20 | 31 |


In [43]:
# re-sort the DataFrame in descending index order
# sort_index() has inplace=False by default, so assign the result to a variable
df = df.sort_index(ascending=False)
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 2 | 20 | 31 | 32 |
| 1 | 10 | 21 | 22 |
| 0 | 0 | 1 | 2 |


In [44]:
# slice depends on the sort order of the DataFrame
# there are no row labels that match in order from 1 to 2 inclusive
df.loc[1:2, ['Col1', 'Col2']]

| | Col1 | Col2 |
|---|---|---|


In [45]:
# list based selection is independent of the sort order of the DataFrame
df.loc[[1, 2], ['Col1', 'Col2']]

| | Col1 | Col2 |
|---|---|---|
| 1 | 10 | 21 |
| 2 | 20 | 31 |


### Selection of Values by Row and Column **Position**
Use **df.iloc\[row_positions, col_positions\]**

Slices work as is normal for Python: the end position is excluded.  Only **.loc\[\]** interprets slices differently, by including the slice end.
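
As a short sketch of positional selection on both axes, the snippet below uses the labeled 3x3 DataFrame from the start of the notebook. The names `cell`, `by_position` and `by_label` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'],
                  index=['Row1', 'Row2', 'Row3'])

# single positions on both axes: one cell
cell = df.iloc[1, 1]

# position slices exclude the end, like list slicing: rows 0-1, columns 0-1
by_position = df.iloc[0:2, 0:2]

# label slices include the end: Row1-Row2, Col1-Col2
by_label = df.loc['Row1':'Row2', 'Col1':'Col2']

print(cell)
print(by_position.equals(by_label))
```

The two slices select the same four cells, even though the positional slice is written with an end of 2 and the label slice ends at the second label.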

In [46]:
df = df.sort_index()
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [47]:
df.loc[1:2]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [48]:
# position happens to match label now ...
df.iloc[1:]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [49]:
# shift every label in the default index by 10
df.index += 10
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 10 | 0 | 1 | 2 |
| 11 | 10 | 21 | 22 |
| 12 | 20 | 31 | 32 |


In [50]:
# no row labels match 1 to 2 inclusive in order
df.loc[1:2]

| | Col1 | Col2 | Col3 |
|---|---|---|---|


In [51]:
# there are row positions 1:2
df.iloc[1:2]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 11 | 10 | 21 | 22 |


## Boolean Series from Value Comparison

In [48]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [49]:
# relational operator on series produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [53]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [54]:
df.loc[criteria]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 10 | 21 | 22 |
| 2 | 20 | 31 | 32 |


In [52]:
# combine two boolean Series with &: both conditions must hold
criteria1 = df['Col2'] - df['Col1'] == 11
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df[filter_rows]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 1 | 10 | 21 | 22 |


In [53]:
# in one line, requires () around expressions
filter_rows = (df['Col2'] - df['Col1'] == 11) & (df['Col1'] < 20)
df[filter_rows]

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| 1 | 10 | 21 | 22 |
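
The parentheses in the one-line form are needed because of Python operator precedence. The sketch below, using the same 3x3 data, shows what happens without them. The `raised` flag is only there to record that the exception occurred.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'])

# without parentheses, & is evaluated before == and <, so Python sees
# a chained comparison and tries to convert a Series to a single bool
raised = False
try:
    df['Col2'] - df['Col1'] == 11 & df['Col1'] < 20
except ValueError as err:
    raised = True
    print(f'ValueError: {err}')

# with parentheses, each comparison produces its own boolean Series first
filter_rows = (df['Col2'] - df['Col1'] == 11) & (df['Col1'] < 20)
print(filter_rows.tolist())
```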


In [54]:
# boolean series constructed from columns, has the same index as the dataframe
filter_rows.index.equals(df.index)

True

## Pandas Indexing with **df\[filter\]** and **df\[filter1\]\[filter2\]**

Pandas makes the use of **df\[filter\]** convenient.  Depending on the filter's datatype, index and values, rows or columns will be selected.  Similarly, **df\[filter1\]\[filter2\]** applies two selections in sequence, each of which selects rows or columns.
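
A minimal sketch of how the same **\[\]** syntax selects columns or rows depending on what is passed in. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'])

one_col = df['Col1']              # column label -> Series (one column)
two_cols = df[['Col1', 'Col3']]   # list of column labels -> DataFrame (columns)
some_rows = df[df['Col1'] > 0]    # boolean Series -> DataFrame (rows)
row_slice = df[0:2]               # slice -> DataFrame (rows)

print(type(one_col).__name__)
print(two_cols.columns.tolist())
print(some_rows.index.tolist())
print(row_slice.index.tolist())
```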

Whether Pandas is selecting rows or columns should be apparent from the context of the code.  If it is not apparent, then **.loc\[row_filter, col_filter\]** or **.iloc\[row_positions, col_positions\]** should be used.

**.loc\[row_filter, col_filter\]** and **.iloc\[row_positions, col_positions\]** provide all the functionality that **\[filter\]** provides and more.  The choice of which to use is a matter of code clarity.
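
As a sketch of that equivalence, the chained form below is rewritten as a single **.loc\[\]** call and as a single **.iloc\[\]** call. The names `mask`, `chained`, `explicit` and `positional` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 21, 22], [20, 31, 32]],
                  columns=['Col1', 'Col2', 'Col3'])

mask = df['Col1'] > 0

chained = df[mask]['Col2']        # two separate selections: rows, then a column
explicit = df.loc[mask, 'Col2']   # one selection naming rows and column together
print(chained.equals(explicit))

# positional equivalent: rows at positions 1 and 2, column at position 1
positional = df.iloc[1:3, 1]
print(explicit.equals(positional))
```

All three return the same Series here; the single-call forms state the row and column selection in one place, which tends to make the intent easier to read.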

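On large DataFrames, **.loc\[row_filter, col_filter\]** and **.iloc\[row_positions, col_positions\]** may perform better than **df\[filter\]**, particularly compared with the chained form **df\[filter1\]\[filter2\]**, which builds an intermediate result for the first selection before applying the second.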