# PANDAS LIBRARY

`Another Open Source Python Library used for working with labeled 1- and 2-dimensional Data`

Here's what the People behind Pandas say about it (https://pandas.pydata.org/about/, as of March 2021):

**Mission** <br>

"pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. <br> Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language."

**Library Highlights** <br>

* A fast and efficient `DataFrame` object for data manipulation with integrated indexing <br>
<br>
* Tools for `reading and writing data` between in-memory data structures and different formats: <br> CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format

[...]

* Intelligent `label-based` slicing, fancy indexing, and subsetting of large data sets

[...]


* `Aggregating or transforming data` with a powerful group by engine allowing split-apply-combine operations on data sets <br>
<br>
* High performance `merging and joining` of data sets

[...]


---

`Import Pandas Library`

In [7]:
import pandas as pd

`Creating Pandas Series for 1-dimensional Data (aka Column Vector)` 

In [8]:
# Series from Integer/Float/String
series = pd.Series('A', index=[1, 2, 3, 4, 5])
print(series)
print(type(series))

1    A
2    A
3    A
4    A
5    A
dtype: object
<class 'pandas.core.series.Series'>


In [9]:
# Series from List
list_for_series = [x for x in range(0, 50+1, 5)]
series = pd.Series(list_for_series, index=None)
print(series)

0      0
1      5
2     10
3     15
4     20
5     25
6     30
7     35
8     40
9     45
10    50
dtype: int64


In [10]:
# Series from Dictionary
dict_for_series = {'A': 1, 'B': 2, 'C': 3}
series = pd.Series(dict_for_series, index=None)
print(series)

A    1
B    2
C    3
dtype: int64


In [11]:
# Series from NumPy Array
import numpy as np
array_for_series = np.random.rand(5)
series = pd.Series(array_for_series, index=None, name='Random')
print(series)

0    0.604163
1    0.983262
2    0.642115
3    0.022003
4    0.508260
Name: Random, dtype: float64


`Creating Pandas DataFrame for 2-dimensional Data (aka Matrix)` 

In [12]:
# DataFrame from Dictionary of Lists
dict_lists_for_df = {
    'Capital': ['A', 'B', 'C'],
    'Small': ['a', 'b', 'c']
}

df = pd.DataFrame(dict_lists_for_df, index=None)

display(df)

Unnamed: 0,Capital,Small
0,A,a
1,B,b
2,C,c


In [13]:
# DataFrame from Dictionary of NumPy Arrays
dict_arrays_for_df = {
    'Zeros': np.zeros(5),
    'Ones': np.ones(5)
}

df = pd.DataFrame(dict_arrays_for_df, index=None)

display(df)

Unnamed: 0,Zeros,Ones
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,0.0,1.0


In [14]:
# DataFrame from 2D-Array
array2D_for_df = np.eye(10, dtype=np.int64)

list_columns = ['Column_' + str(i) for i in range(0,10)]

df = pd.DataFrame(array2D_for_df, index=None, columns=list_columns)

display(df)

Unnamed: 0,Column_0,Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8,Column_9
0,1,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,0,0,0,1,0,0
8,0,0,0,0,0,0,0,0,1,0
9,0,0,0,0,0,0,0,0,0,1


In [15]:
# DataFrame from named Series
series_for_df = pd.Series(np.random.rand(5), index=None, name='Random')

df = pd.DataFrame(series_for_df, index=None)

display(df)

Unnamed: 0,Random
0,0.069368
1,0.650606
2,0.603694
3,0.02756
4,0.54685


In [16]:
# Empty DataFrame with named Columns
df = pd.DataFrame(index=None, columns=['Column1', 'Column2', 'Column3'])
display(df)

Unnamed: 0,Column1,Column2,Column3


`Indexing (= Selecting Data from) a DataFrame` using `loc` and `iloc`

`Note:` `loc` is used for `label-based` and `iloc` is used for `integer-based Indexing` 

In [17]:
n = 10
m = 5

list_idx = ['Row_' + str(i) for i in range(0,n)]
list_col = ['Col_' + str(i) for i in range(0,m)]

df = pd.DataFrame(np.random.randint(100, size=(n,m)), index=list_idx, columns=list_col)

display(df)

# Select Row by Label (loc)
series = df.loc['Row_3', :]
print(' ')
print('Row_3 as Pandas Series')
print(series)
print(type(series))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Row_3 as Pandas Series
Col_0    32
Col_1    89
Col_2    91
Col_3    78
Col_4    79
Name: Row_3, dtype: int64
<class 'pandas.core.series.Series'>


In [18]:
display(df)

# Select Row by Integer (iloc)
series = df.iloc[4, :]
print(' ')
print('Row_4 as Pandas Series')
print(series)
print(type(series))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Row_4 as Pandas Series
Col_0    92
Col_1    53
Col_2    38
Col_3    31
Col_4    97
Name: Row_4, dtype: int64
<class 'pandas.core.series.Series'>


In [19]:
display(df)

# Select Column by Label (loc)
series = df.loc[:, 'Col_4']
print(' ')
print('Col_4 as Pandas Series')
print(series)
print(type(series))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Col_4 as Pandas Series
Row_0    64
Row_1    51
Row_2     5
Row_3    79
Row_4    97
Row_5    85
Row_6    46
Row_7    96
Row_8    81
Row_9    88
Name: Col_4, dtype: int64
<class 'pandas.core.series.Series'>


In [20]:
display(df)

# Select Column by Integer (iloc)
series = df.iloc[:, -1]
print(' ')
print('Col_4 as Pandas Series')
print(series)
print(type(series))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Col_4 as Pandas Series
Row_0    64
Row_1    51
Row_2     5
Row_3    79
Row_4    97
Row_5    85
Row_6    46
Row_7    96
Row_8    81
Row_9    88
Name: Col_4, dtype: int64
<class 'pandas.core.series.Series'>


In [21]:
display(df)

# Select multiple Rows and Columns by Label (loc)
df_sel = df.loc['Row_0':'Row_4', ['Col_1', 'Col_3']]
print(' ')
print('Rows 0-4 and Cols 1,3 as Pandas DataFrame')
display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Rows 0-4 and Cols 1,3 as Pandas DataFrame


Unnamed: 0,Col_1,Col_3
Row_0,59,33
Row_1,67,63
Row_2,94,9
Row_3,89,78
Row_4,53,31


<class 'pandas.core.frame.DataFrame'>


In [22]:
display(df)

# Select multiple Rows and Columns by Integer (iloc)
df_sel = df.iloc[:3, -3:]
print(' ')
print('First three Rows and last three Columns as Pandas DataFrame')
display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
First three Rows and last three Columns as Pandas DataFrame


Unnamed: 0,Col_2,Col_3,Col_4
Row_0,76,33,64
Row_1,69,63,51
Row_2,25,9,5


<class 'pandas.core.frame.DataFrame'>


In [23]:
display(df)

# Select single Value by Label (loc)
value = df.loc['Row_8', 'Col_3']
print(' ')
print('Single Value in Row_8 and Col_3 as NumPy Integer')
print(value)
print(type(value))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Single Value in Row_8 and Col_3 as NumPy Integer
24
<class 'numpy.int64'>


In [24]:
display(df)

# Select single Value by Integer (iloc)
value = df.iloc[1,0]
print(' ')
print('Single Value in Row_1 and Col_0 as NumPy Integer')
print(value)
print(type(value))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
Single Value in Row_1 and Col_0 as NumPy Integer
97
<class 'numpy.int64'>
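
One difference between `loc` and `iloc` that the cells above do not spell out is how slices treat their end point. The following is a minimal sketch on a small, made-up frame (`df_demo` is a hypothetical name, not part of the notebook above): label slices include the stop label, while integer slices exclude the stop position, as Python lists do.

```python
import numpy as np
import pandas as pd

# Small frame with string row labels, mirroring the naming used above
rows = ['Row_' + str(i) for i in range(5)]
cols = ['Col_' + str(i) for i in range(3)]
df_demo = pd.DataFrame(np.arange(15).reshape(5, 3), index=rows, columns=cols)

# loc slices by label and includes BOTH end points
by_label = df_demo.loc['Row_1':'Row_3', :]

# iloc slices by position and excludes the stop position
by_position = df_demo.iloc[1:3, :]

print(by_label.index.tolist())     # ['Row_1', 'Row_2', 'Row_3']
print(by_position.index.tolist())  # ['Row_1', 'Row_2']
```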


`Selecting Rows` using `head` and `tail`

In [25]:
# Select first three Rows
df_sel = df.head(3)
display(df_sel)

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5


In [26]:
# Select last two Rows
df_sel = df.tail(2)
display(df_sel)

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


`Selecting Rows` using `Boolean Masking`

In [27]:
display(df)

# Select Rows
df_sel = df[df['Col_0'] > 50]
print(' ')
print('All Rows with Col_0 > 50')
display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
All Rows with Col_0 > 50


Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96


<class 'pandas.core.frame.DataFrame'>


`Boolean Masking with multiple Conditions`

`Note: You must wrap each Condition in Parentheses, because & and | bind more tightly than Comparison Operators`

In [28]:
display(df)

# Select Rows
df_sel = df[(df['Col_0'] > 50) & (df['Col_1'] < 30)]
print(' ')
print('All Rows with Col_0 > 50 AND Col_1 < 30')
display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_0,55,59,76,33,64
Row_1,97,67,69,63,51
Row_2,98,94,25,9,5
Row_3,32,89,91,78,79
Row_4,92,53,38,31,97
Row_5,65,23,31,56,85
Row_6,52,73,74,6,46
Row_7,86,33,26,1,96
Row_8,15,73,64,24,81
Row_9,41,71,69,85,88


 
All Rows with Col_0 > 50 AND Col_1 < 30


Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
Row_5,65,23,31,56,85


<class 'pandas.core.frame.DataFrame'>
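
The following sketch illustrates why each condition needs its own pair of parentheses, and adds `|` (OR) and `~` (NOT) next to `&`. The frame `df_demo` and its values are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'Col_0': [55, 65, 32], 'Col_1': [59, 23, 10]})

# Each comparison is wrapped in parentheses because & and | bind
# more tightly than > and < in Python
mask_and = (df_demo['Col_0'] > 50) & (df_demo['Col_1'] < 30)
mask_or = (df_demo['Col_0'] > 50) | (df_demo['Col_1'] < 30)
mask_not = ~(df_demo['Col_0'] > 50)

print(mask_and.tolist())  # [False, True, False]
print(mask_or.tolist())   # [True, True, True]
print(mask_not.tolist())  # [False, False, True]

# Without parentheses Python evaluates 50 & df_demo['Col_1'] first and then
# builds a chained comparison, which needs a single True/False and fails
raised = False
try:
    df_demo[df_demo['Col_0'] > 50 & df_demo['Col_1'] < 30]
except ValueError as err:
    raised = True
    print('ValueError:', err)
```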


`Simplified Column Selection`

In [29]:
series = df['Col_1']

print(series)
print(type(series))

Row_0    59
Row_1    67
Row_2    94
Row_3    89
Row_4    53
Row_5    23
Row_6    73
Row_7    33
Row_8    73
Row_9    71
Name: Col_1, dtype: int64
<class 'pandas.core.series.Series'>


In [30]:
df_sel = df[['Col_1', 'Col_2']]

display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_1,Col_2
Row_0,59,76
Row_1,67,69
Row_2,94,25
Row_3,89,91
Row_4,53,38
Row_5,23,31
Row_6,73,74
Row_7,33,26
Row_8,73,64
Row_9,71,69


<class 'pandas.core.frame.DataFrame'>


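`Notice: It is good Practice to use copy() whenever you plan to modify the Result of indexing a DataFrame` <br> `...no matter if you use loc, iloc, Boolean Masking or a simplified Version of Indexing. The Copy is independent of the original DataFrame and avoids the SettingWithCopyWarning`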

In [31]:
# Example
df_sel = df[['Col_1', 'Col_2']].copy()

display(df_sel)
print(type(df_sel))

Unnamed: 0,Col_1,Col_2
Row_0,59,76
Row_1,67,69
Row_2,94,25
Row_3,89,91
Row_4,53,38
Row_5,23,31
Row_6,73,74
Row_7,33,26
Row_8,73,64
Row_9,71,69


<class 'pandas.core.frame.DataFrame'>
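
As a small sketch of what the explicit `copy()` guarantees, the selection below is modified while the original frame stays untouched. `df_demo` is a made-up frame, not the one from the cells above.

```python
import pandas as pd

df_demo = pd.DataFrame({'Col_1': [59, 67, 94], 'Col_2': [76, 69, 25]})

# Explicit copy: an independent object that can be modified freely
df_sel = df_demo[['Col_1']].copy()
df_sel['Col_1'] = 0

print(df_demo['Col_1'].tolist())  # [59, 67, 94]  (original unchanged)
print(df_sel['Col_1'].tolist())   # [0, 0, 0]
```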


`Selecting the Index from a DataFrame`

In [32]:
index = df.index

print(index)
print(type(index))

Index(['Row_0', 'Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5', 'Row_6', 'Row_7',
       'Row_8', 'Row_9'],
      dtype='object')
<class 'pandas.core.indexes.base.Index'>


`Resetting the Index of a DataFrame`

In [33]:
df_new = df.reset_index()

display(df_new)

Unnamed: 0,index,Col_0,Col_1,Col_2,Col_3,Col_4
0,Row_0,55,59,76,33,64
1,Row_1,97,67,69,63,51
2,Row_2,98,94,25,9,5
3,Row_3,32,89,91,78,79
4,Row_4,92,53,38,31,97
5,Row_5,65,23,31,56,85
6,Row_6,52,73,74,6,46
7,Row_7,86,33,26,1,96
8,Row_8,15,73,64,24,81
9,Row_9,41,71,69,85,88


In [34]:
# Get rid of the old Index
df_new = df.reset_index(drop=True)

display(df_new)

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
0,55,59,76,33,64
1,97,67,69,63,51
2,98,94,25,9,5
3,32,89,91,78,79
4,92,53,38,31,97
5,65,23,31,56,85
6,52,73,74,6,46
7,86,33,26,1,96
8,15,73,64,24,81
9,41,71,69,85,88


`Returning the Dimensions of a DataFrame`

In [35]:
# Number of Dimensions
print('# of Dimensions:', df.ndim)

# Number of Rows
print('# of Rows:', len(df))

# Number of Columns
print('# of Columns:', len(df.columns))

# Shape
print('Shape:', df.shape)

# Number of Items (Size)
print('Number of Items (Size):', df.size)

# of Dimensions: 2
# of Rows: 10
# of Columns: 5
Shape: (10, 5)
Number of Items (Size): 50


`Checking whether a DataFrame is empty`

In [36]:
print(df.empty)

False


`Sort DataFrame by Column(s)`

In [37]:
# Create new DataFrame
list_cols = ['x' + str(i) for i in range(1, 10+1)]

df = pd.DataFrame(np.random.randint(3, size=(10, 10)), index=None, columns=list_cols)

display(df)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,1,0,1,0,1,2,1,0,0,2
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
3,1,0,1,1,2,0,2,1,0,2
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0


In [38]:
# Sort by x1 ascending
df_sorted = df.sort_values(by=['x1'], ascending=True)
display(df_sorted)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0
0,1,0,1,0,1,2,1,0,0,2
3,1,0,1,1,2,0,2,1,0,2
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
4,2,0,1,2,0,2,2,0,1,0


In [39]:
# Sort by x9 descending 
df_sorted = df.sort_values(by=['x9'], ascending=False)
display(df_sorted)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
2,2,1,1,0,2,1,1,0,2,0
5,1,0,2,2,2,2,0,2,2,1
1,2,1,2,1,2,2,2,2,1,1
4,2,0,1,2,0,2,2,0,1,0
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0
0,1,0,1,0,1,2,1,0,0,2
3,1,0,1,1,2,0,2,1,0,2


In [40]:
# Sort by x2 descending and x3 ascending
df_sorted = df.sort_values(by=['x2', 'x3'], ascending=[False, True])
display(df_sorted)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
8,0,2,2,0,2,2,0,1,1,2
6,1,1,0,1,1,0,2,1,1,0
2,2,1,1,0,2,1,1,0,2,0
7,1,1,1,0,0,2,0,1,1,1
9,0,1,1,1,1,2,0,2,1,0
1,2,1,2,1,2,2,2,2,1,1
0,1,0,1,0,1,2,1,0,0,2
3,1,0,1,1,2,0,2,1,0,2
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1


`Dropping Duplicates`

In [41]:
# Duplicates including all Columns
print('# Rows before dropping Duplicates:', len(df))
df_dropped = df.drop_duplicates()
print('# Rows after dropping Duplicates:', len(df_dropped))
display(df_dropped)

# Rows before dropping Duplicates: 10
# Rows after dropping Duplicates: 10


Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,1,0,1,0,1,2,1,0,0,2
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
3,1,0,1,1,2,0,2,1,0,2
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0


In [42]:
# Duplicates including a Subset of Columns
print('# Rows before dropping Duplicates:', len(df))
df_dropped = df.drop_duplicates(subset=['x1', 'x2', 'x3'], keep='first')
print('# Rows after dropping Duplicates:', len(df_dropped))
display(df_dropped)

# Rows before dropping Duplicates: 10
# Rows after dropping Duplicates: 9


Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,1,0,1,0,1,2,1,0,0,2
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0
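
Before dropping rows it can help to see which ones would be removed. A minimal sketch using `duplicated`, which returns a Boolean Series with the same `subset` and `keep` parameters as `drop_duplicates`; the frame `df_demo` is made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'x1': [1, 2, 1, 1],
    'x2': [0, 1, 0, 0],
    'x3': [1, 2, 1, 5],
})

# True for every row that repeats an earlier row within the subset
is_dup = df_demo.duplicated(subset=['x1', 'x2'], keep='first')

# keep=False flags all members of a duplicate group, including the first
is_dup_all = df_demo.duplicated(subset=['x1', 'x2'], keep=False)

print(is_dup.tolist())      # [False, False, True, True]
print(is_dup_all.tolist())  # [True, False, True, True]

# Rows that drop_duplicates(subset=['x1', 'x2']) would keep
print(df_demo[~is_dup])
```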


`Creating new Columns`

In [43]:
# New Column with constant Value
df['n1'] = np.nan  # np.NaN was removed in NumPy 2.0
df['n2'] = 3
df['n3'] = -1

display(df)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,n1,n2,n3
0,1,0,1,0,1,2,1,0,0,2,,3,-1
1,2,1,2,1,2,2,2,2,1,1,,3,-1
2,2,1,1,0,2,1,1,0,2,0,,3,-1
3,1,0,1,1,2,0,2,1,0,2,,3,-1
4,2,0,1,2,0,2,2,0,1,0,,3,-1
5,1,0,2,2,2,2,0,2,2,1,,3,-1
6,1,1,0,1,1,0,2,1,1,0,,3,-1
7,1,1,1,0,0,2,0,1,1,1,,3,-1
8,0,2,2,0,2,2,0,1,1,2,,3,-1
9,0,1,1,1,1,2,0,2,1,0,,3,-1


In [44]:
# New Column derived from other Columns
df['n4'] = df['n2'] + df['n3']
df['n5'] = df['x1']**2 - 2*df['x2'] + 1
df['n6'] = np.ceil((df['n2'] - df['x1']) / 2).astype(int)
df['n7'] = np.where(
    (df['x2'] > 0) | (df['x3'] > 0),
    True, False
)

display(df)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,n1,n2,n3,n4,n5,n6,n7
0,1,0,1,0,1,2,1,0,0,2,,3,-1,2,2,1,True
1,2,1,2,1,2,2,2,2,1,1,,3,-1,2,3,1,True
2,2,1,1,0,2,1,1,0,2,0,,3,-1,2,3,1,True
3,1,0,1,1,2,0,2,1,0,2,,3,-1,2,2,1,True
4,2,0,1,2,0,2,2,0,1,0,,3,-1,2,5,1,True
5,1,0,2,2,2,2,0,2,2,1,,3,-1,2,2,1,True
6,1,1,0,1,1,0,2,1,1,0,,3,-1,2,0,1,True
7,1,1,1,0,0,2,0,1,1,1,,3,-1,2,0,1,True
8,0,2,2,0,2,2,0,1,1,2,,3,-1,2,-3,2,True
9,0,1,1,1,1,2,0,2,1,0,,3,-1,2,-1,2,True


`Dropping Columns` using `drop`

In [45]:
df_dropped = df.drop(columns=['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'])

display(df_dropped)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,1,0,1,0,1,2,1,0,0,2
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
3,1,0,1,1,2,0,2,1,0,2
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0


`Dropping Columns` using `loc` (`Negative Selection`)

In [46]:
df_dropped = df.loc[:, ~df.columns.str.startswith('n')]

display(df_dropped)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,1,0,1,0,1,2,1,0,0,2
1,2,1,2,1,2,2,2,2,1,1
2,2,1,1,0,2,1,1,0,2,0
3,1,0,1,1,2,0,2,1,0,2
4,2,0,1,2,0,2,2,0,1,0
5,1,0,2,2,2,2,0,2,2,1
6,1,1,0,1,1,0,2,1,1,0
7,1,1,1,0,0,2,0,1,1,1
8,0,2,2,0,2,2,0,1,1,2
9,0,1,1,1,1,2,0,2,1,0


`Aggregation Functions`

In [47]:
dict_for_df = {
    'x1': [-1, 0, 1],
    'x2': [ 0, 0, 0],
    'x3': [ 1, 0, np.nan],
    'x4': [ 0, 2, 0],
    'x5': [ 1, -2, 0]
}

df = pd.DataFrame(dict_for_df, index=None)

display(df)

# Maximum per Column
print(' ')
print('Maximum per Column:')
print(df.max(axis=0))

# Minimum per Row
print(' ')
print('Minimum per Row:')
print(df.min(axis=1))

# Mean per Column
print(' ')
print('Mean per Column:')
print(df.mean(axis=0))                # NaN Values are skipped by default
print(df.mean(axis=0, skipna=False))  # NaN propagates into the Result

Unnamed: 0,x1,x2,x3,x4,x5
0,-1,0,1.0,0,1
1,0,0,0.0,2,-2
2,1,0,,0,0


 
Maximum per Column:
x1    1.0
x2    0.0
x3    1.0
x4    2.0
x5    1.0
dtype: float64
 
Minimum per Row:
0   -1.0
1   -2.0
2    0.0
dtype: float64
 
Mean per Column:
x1    0.000000
x2    0.000000
x3    0.500000
x4    0.666667
x5   -0.333333
dtype: float64
x1    0.000000
x2    0.000000
x3         NaN
x4    0.666667
x5   -0.333333
dtype: float64


`Creating new Columns` using `Aggregation Functions`

In [53]:
# Maximum of Column 'x4'
df['ColMax_x4'] = df.loc[:, 'x4'].max(axis=0)

# ...or
df['ColMax_x4_v2'] = df.max(axis=0).loc['x4']

# Maximum per Row over Columns 'x1' to 'x5'
df['RowMaxAll'] = df.loc[:, 'x1':'x5'].max(axis=1)

# Maximum per Row over Columns 'x1' and 'x2'
df['RowMax_x1_x2'] = df.loc[:, ['x1', 'x2']].max(axis=1)

display(df)

Unnamed: 0,x1,x2,x3,x4,x5,ColMax_x4,ColMax_x4_v2,RowMaxAll,RowMax_x1_x2
0,-1,0,1.0,0,1,2,2.0,1.0,0
1,0,0,0.0,2,-2,2,2.0,2.0,0
2,1,0,,0,0,2,2.0,1.0,1


`Creating new Columns` using `Shift Function`

In [60]:
import datetime
import numpy as np

array_for_df = np.random.randint(100, size=(10, 3))

indices_for_df = pd.date_range(datetime.date.today(), periods=10).tolist()

columns_for_df = ['A', 'B', 'C']

df = pd.DataFrame(array_for_df, index=indices_for_df, columns=columns_for_df)

display(df)

Unnamed: 0,A,B,C
2021-03-28,13,76,24
2021-03-29,76,16,91
2021-03-30,59,65,87
2021-03-31,68,85,56
2021-04-01,29,37,47
2021-04-02,65,59,25
2021-04-03,70,40,81
2021-04-04,40,10,94
2021-04-05,17,86,76
2021-04-06,59,38,8


In [63]:
# Create Lag 1-Values for Column A
df_shift = df.copy()

df_shift['A_lag1'] = df_shift['A'].shift(1)
    
display(df_shift)

Unnamed: 0,A,B,C,A_lag1
2021-03-28,13,76,24,
2021-03-29,76,16,91,13.0
2021-03-30,59,65,87,76.0
2021-03-31,68,85,56,59.0
2021-04-01,29,37,47,68.0
2021-04-02,65,59,25,29.0
2021-04-03,70,40,81,65.0
2021-04-04,40,10,94,70.0
2021-04-05,17,86,76,40.0
2021-04-06,59,38,8,17.0


In [67]:
# Create Lag 1 and 2-Values for Column B
df_shift = df.copy()

for lag in [1,2]:
    df_shift['B_lag' + str(lag)] = df_shift['B'].shift(lag)
    
display(df_shift)

Unnamed: 0,A,B,C,B_lag1,B_lag2
2021-03-28,13,76,24,,
2021-03-29,76,16,91,76.0,
2021-03-30,59,65,87,16.0,76.0
2021-03-31,68,85,56,65.0,16.0
2021-04-01,29,37,47,85.0,65.0
2021-04-02,65,59,25,37.0,85.0
2021-04-03,70,40,81,59.0,37.0
2021-04-04,40,10,94,40.0,59.0
2021-04-05,17,86,76,10.0,40.0
2021-04-06,59,38,8,86.0,10.0


In [70]:
# Create Lead 1, 2 and 3-Values for Columns A, B and C
df_shift = df.copy()

for col in df_shift.columns:
    for lead in [-1, -2, -3]:
        df_shift[col + '_lead' + str(-lead)] = df_shift[col].shift(lead)
    
display(df_shift)

Unnamed: 0,A,B,C,A_lead1,A_lead2,A_lead3,B_lead1,B_lead2,B_lead3,C_lead1,C_lead2,C_lead3
2021-03-28,13,76,24,76.0,59.0,68.0,16.0,65.0,85.0,91.0,87.0,56.0
2021-03-29,76,16,91,59.0,68.0,29.0,65.0,85.0,37.0,87.0,56.0,47.0
2021-03-30,59,65,87,68.0,29.0,65.0,85.0,37.0,59.0,56.0,47.0,25.0
2021-03-31,68,85,56,29.0,65.0,70.0,37.0,59.0,40.0,47.0,25.0,81.0
2021-04-01,29,37,47,65.0,70.0,40.0,59.0,40.0,10.0,25.0,81.0,94.0
2021-04-02,65,59,25,70.0,40.0,17.0,40.0,10.0,86.0,81.0,94.0,76.0
2021-04-03,70,40,81,40.0,17.0,59.0,10.0,86.0,38.0,94.0,76.0,8.0
2021-04-04,40,10,94,17.0,59.0,,86.0,38.0,,76.0,8.0,
2021-04-05,17,86,76,59.0,,,38.0,,,8.0,,
2021-04-06,59,38,8,,,,,,,,,
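
A typical use of lagged values is the change from one row to the next. The sketch below builds that difference from `shift(1)` and compares it with `diff(1)`, which computes the same quantity directly. The frame `df_demo` reuses the first four values of column A from the output above purely for illustration.

```python
import pandas as pd

index = pd.date_range('2021-03-28', periods=4)
df_demo = pd.DataFrame({'A': [13, 76, 59, 68]}, index=index)

# Change versus the previous row, built from the lag-1 value
df_demo['A_change'] = df_demo['A'] - df_demo['A'].shift(1)

# diff(1) computes the same quantity directly
df_demo['A_diff'] = df_demo['A'].diff(1)

print(df_demo)
```

The first row has no predecessor, so both new columns hold NaN there and become float columns.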


`Grouping DataFrames`

In [110]:
import datetime
import numpy as np

dates  = pd.date_range(datetime.date(2020,1,1), periods=365).tolist()
months = pd.date_range(datetime.date(2020,1,1), periods=365).month.tolist()
sales  = np.random.randint(100000, size=(365))

dict_for_df = {
    'Month': months,
    'Sales': sales
}

df = pd.DataFrame(dict_for_df, index=dates)

display(df)

Unnamed: 0,Month,Sales
2020-01-01,1,21895
2020-01-02,1,41329
2020-01-03,1,47579
2020-01-04,1,4545
2020-01-05,1,22767
2020-01-06,1,75953
2020-01-07,1,35102
2020-01-08,1,1505
2020-01-09,1,96986
2020-01-10,1,23399


In [115]:
# Group Sales by Month
df_grouped = df.groupby('Month').agg({'Sales': ['sum', 'mean']})

display(df_grouped)

Month,Sales (sum),Sales (mean)
1,1579092,50938.451613
2,1452323,50080.103448
3,1837565,59276.290323
4,1717689,57256.3
5,1525379,49205.774194
6,1579367,52645.566667
7,1526572,49244.258065
8,1500627,48407.322581
9,1341074,44702.466667
10,1568074,50583.032258
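
Passing a dictionary of lists to `agg` yields two-level column labels (`Sales` / `sum`, `Sales` / `mean`). If flat column names are preferred, named aggregation is one option. A minimal sketch on made-up data; the output names `sales_sum` and `sales_mean` are chosen here only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'Sales': [100, 300, 50, 150, 100],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
df_grouped = df_demo.groupby('Month').agg(
    sales_sum=('Sales', 'sum'),
    sales_mean=('Sales', 'mean'),
)

print(df_grouped)
print(df_grouped.columns.tolist())  # ['sales_sum', 'sales_mean']
```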


`Concatenating DataFrames` with identical Columns

In [2]:
# Two DataFrames with identical Columns
dict1 = {
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4']
}

dict2 = {
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2'],
    'C': ['C0', 'C1', 'C2']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4


Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


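Use `append` (deprecated since pandas 1.4 and removed in pandas 2.0; the next two Cells only run on older Versions)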

In [3]:
# Will give duplicate Index Values
df_joined = df1.append(df2)

display(df_joined)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


In [4]:
# Will "ignore" Index
df_joined = df1.append(df2, ignore_index=True)

display(df_joined)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4
5,A0,B0,C0
6,A1,B1,C1
7,A2,B2,C2


Or use `concat` instead

In [6]:
# Concatenating along Axis 0 (= Appending)
list_df_to_concat = [df1, df2]

df_joined = pd.concat(list_df_to_concat, axis=0, ignore_index=True)

display(df_joined)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4
5,A0,B0,C0
6,A1,B1,C1
7,A2,B2,C2
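
Since `DataFrame.append` is no longer available from pandas 2.0 on, the sketch below shows how both `append` variants from above map onto `pd.concat`. The two small frames are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2'], 'B': ['B2']})

# Counterpart of the former df1.append(df2): original index labels are kept
df_keep = pd.concat([df1, df2])

# Counterpart of the former df1.append(df2, ignore_index=True)
df_renumbered = pd.concat([df1, df2], ignore_index=True)

print(df_keep.index.tolist())        # [0, 1, 0]
print(df_renumbered.index.tolist())  # [0, 1, 2]
```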


`Concatenating DataFrames` with identical Rows

In [7]:
# Two DataFrames with identical Rows 
dict1 = {
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4']
}

dict2 = {
    'D': ['D0', 'D1', 'D2', 'D3', 'D4'],
    'E': ['E0', 'E1', 'E2', 'E3', 'E4']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4


Unnamed: 0,D,E
0,D0,E0
1,D1,E1
2,D2,E2
3,D3,E3
4,D4,E4


Use `concat`

In [8]:
# Concatenate along Axis 1
df_joined = pd.concat([df1, df2], axis=1)

display(df_joined)

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,E0
1,A1,B1,C1,D1,E1
2,A2,B2,C2,D2,E2
3,A3,B3,C3,D3,E3
4,A4,B4,C4,D4,E4


`Concatenating DataFrames` with different Rows and Columns

In [9]:
# Two DataFrames with different Rows and Columns
dict1 = {
    'X': ['X0', 'X1', 'X2', 'X3', 'X4'],
    'Y': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4']
}

dict2 = {
    'X': ['X0', 'X1', 'X2', 'X3'],
    'Y': ['Y0', 'Y1', 'Y2', 'Y3'],
    'Z': ['Z0', 'Z1', 'Z2', 'Z3']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

Unnamed: 0,X,Y,A
0,X0,Y0,A0
1,X1,Y1,A1
2,X2,Y2,A2
3,X3,Y3,A3
4,X4,Y4,A4


Unnamed: 0,X,Y,Z
0,X0,Y0,Z0
1,X1,Y1,Z1
2,X2,Y2,Z2
3,X3,Y3,Z3


Use `concat` along Axis 0

In [181]:
# Will append Rows and fill non-existing Columns with NaN
df_joined = pd.concat([df1, df2], axis=0, ignore_index=True, sort=True)

display(df_joined)

Unnamed: 0,A,X,Y,Z
0,A0,X0,Y0,
1,A1,X1,Y1,
2,A2,X2,Y2,
3,A3,X3,Y3,
4,A4,X4,Y4,
5,,X0,Y0,Z0
6,,X1,Y1,Z1
7,,X2,Y2,Z2
8,,X3,Y3,Z3


`Merging DataFrames`

`Left Join`

In [42]:
# Fact Table
df = pd.DataFrame(dict({'city_id': np.random.randint(1, 4,    size=(5)),
                        'x1'     : np.random.randint(0, 1000, size=(5)),
                        'x2'     : np.random.randint(0, 1000, size=(5)),
                        'x3'     : np.random.randint(0, 1000, size=(5)),}), 
                  index=None)
# Dimension Table
df_city = pd.DataFrame(dict({'city_id': [1, 2, 3], 
                             'city_text': ['Berlin', 'Hamburg', 'München']}), 
                       index=None)

display(df)
display(df_city)

Unnamed: 0,city_id,x1,x2,x3
0,2,215,534,6
1,3,740,220,372
2,3,146,548,722
3,3,263,285,455
4,1,357,91,894


Unnamed: 0,city_id,city_text
0,1,Berlin
1,2,Hamburg
2,3,München


In [43]:
# Merge
# merge returns a new DataFrame containing the Columns of both Tables
df = pd.merge(df, df_city, how='left', on='city_id')

df.sort_index(axis=1, inplace=True)

display(df)

Unnamed: 0,city_id,city_text,x1,x2,x3
0,2,Hamburg,215,534,6
1,3,München,740,220,372
2,3,München,146,548,722
3,3,München,263,285,455
4,1,Berlin,357,91,894
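
In the example above every `city_id` of the fact table exists in the dimension table. The sketch below, on made-up data with hypothetical names (`df_fact`, `df_merged`), shows what a left join does with a key that has no match, and uses `indicator=True` to add a `_merge` column recording where each row came from.

```python
import pandas as pd

df_fact = pd.DataFrame({'city_id': [2, 3, 4], 'x1': [215, 740, 146]})
df_city = pd.DataFrame({'city_id': [1, 2, 3],
                        'city_text': ['Berlin', 'Hamburg', 'München']})

# Left join: every row of df_fact is kept; city_id 4 has no match,
# so its city_text is NaN and _merge is 'left_only'
df_merged = pd.merge(df_fact, df_city, how='left', on='city_id', indicator=True)

print(df_merged)
```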


`Inner Join`

In [5]:
df1 = pd.DataFrame(dict({'customer_id': [1, 2, 3, 4, 5],
                         'sex': ['male', 'male', 'female', 'male', 'female']}), 
                        index=None)

df2 = pd.DataFrame(dict({'customer_id': [3, 5, 6, 7, 8],
                         'age': [39, 24, 63, 43, 50]}), index=None)

df_joined = pd.merge(df1, df2, how='inner', left_on='customer_id', right_on='customer_id')

print('Original DataFrames')
display(df1, df2)
print(' ')
print('Inner Join on Customer ID')
display(df_joined)

Original DataFrames


Unnamed: 0,customer_id,sex
0,1,male
1,2,male
2,3,female
3,4,male
4,5,female


Unnamed: 0,customer_id,age
0,3,39
1,5,24
2,6,63
3,7,43
4,8,50


 
Inner Join on Customer ID


Unnamed: 0,customer_id,sex,age
0,3,female,39
1,5,female,24


`Displaying DataFrames`

In [93]:
df = pd.DataFrame(np.zeros((100, 50), dtype=np.int64), index=None)

pd.options.display.max_rows=10
pd.options.display.max_columns=20

display(df)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [94]:
pd.options.display.max_rows=None
pd.options.display.max_columns=None

display(df)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
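
Assigning to `pd.options.display` changes the setting for the rest of the session. When a limit should apply to a single output only, `pd.option_context` is an alternative. A minimal sketch; `print` is used instead of `display` so that it also runs outside a notebook.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.zeros((100, 50), dtype=np.int64))

default_max_rows = pd.get_option('display.max_rows')

# Inside the with-block the limits apply; afterwards the previous values return
with pd.option_context('display.max_rows', 10, 'display.max_columns', 20):
    inside_max_rows = pd.get_option('display.max_rows')
    print(df_demo)

print(inside_max_rows)                                        # 10
print(pd.get_option('display.max_rows') == default_max_rows)  # True
```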
