# PANDAS LIBRARY

`Another Open Source Python Library used for working with 1- and 2-dimensional (tabular) Data`

Here's what the People behind Pandas say about it (https://pandas.pydata.org/about/, as of March 2021):

**Mission** <br>

"pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. <br> Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language."

**Library Highlights** <br>

* A fast and efficient `DataFrame` object for data manipulation with integrated indexing <br>
<br>
* Tools for `reading and writing data` between in-memory data structures and different formats: <br> CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format

[...]

* Intelligent `label-based` slicing, fancy indexing, and subsetting of large data sets

[...]


* `Aggregating or transforming data` with a powerful group by engine allowing split-apply-combine operations on data sets <br>
<br>
* High performance `merging and joining` of data sets

[...]


---

`Import Pandas Library`

In [None]:
import pandas as pd

`Creating Pandas Series for 1-dimensional Data (aka Column Vector)` 

In [None]:
# Series from a Scalar (Integer/Float/String): the Value is repeated for every Index Label
series = pd.Series('A', index=[1, 2, 3, 4, 5])
print(series)
print(type(series))

In [None]:
# Series from List
list_for_series = list(range(0, 50+1, 5))
series = pd.Series(list_for_series, index=None)
print(series)

In [None]:
# Series from Dictionary
dict_for_series = {'A': 1, 'B': 2, 'C': 3}
series = pd.Series(dict_for_series, index=None)
print(series)

In [None]:
# Series from NumPy Array
import numpy as np
array_for_series = np.random.rand(5)
series = pd.Series(array_for_series, index=None, name='Random')
print(series)

`Creating Pandas DataFrame for 2-dimensional Data (aka Matrix)` 

In [None]:
# DataFrame from Dictionary of Lists
dict_lists_for_df = {
    'Capital': ['A', 'B', 'C'],
    'Small': ['a', 'b', 'c']
}

df = pd.DataFrame(dict_lists_for_df, index=None)

display(df)

In [None]:
# DataFrame from Dictionary of NumPy Arrays
dict_arrays_for_df = {
    'Zeros': np.zeros(5),
    'Ones': np.ones(5)
}

df = pd.DataFrame(dict_arrays_for_df, index=None)

display(df)

In [None]:
# DataFrame from 2D-Array
array2D_for_df = np.eye(10, dtype=np.int64)

list_columns = ['Column_' + str(i) for i in range(0,10)]

df = pd.DataFrame(array2D_for_df, index=None, columns=list_columns)

display(df)

In [None]:
# DataFrame from named Series
series_for_df = pd.Series(np.random.rand(5), index=None, name='Random')

df = pd.DataFrame(series_for_df, index=None)

display(df)

In [None]:
# Empty DataFrame with named Columns
df = pd.DataFrame(index=None, columns=['Column1', 'Column2', 'Column3'])
display(df)

`Indexing (= Selecting Data from) a DataFrame` using `loc` and `iloc`

`Note:` `loc` is used for `label-based` and `iloc` is used for `integer-based Indexing` 

In [None]:
n = 10
m = 5

list_idx = ['Row_' + str(i) for i in range(0,n)]
list_col = ['Col_' + str(i) for i in range(0,m)]

df = pd.DataFrame(np.random.randint(100, size=(n,m)), index=list_idx, columns=list_col)

display(df)

# Select Row by Label (loc)
series = df.loc['Row_3', :]
print(' ')
print('Row_3 as Pandas Series')
print(series)
print(type(series))

In [None]:
display(df)

# Select Row by Integer (iloc); Position 4 is the fifth Row, i.e. Row_4
series = df.iloc[4, :]
print(' ')
print('Row_4 as Pandas Series')
print(series)
print(type(series))

In [None]:
display(df)

# Select Column by Label (loc)
series = df.loc[:, 'Col_4']
print(' ')
print('Col_4 as Pandas Series')
print(series)
print(type(series))

In [None]:
display(df)

# Select Column by Integer (iloc)
series = df.iloc[:, -1]
print(' ')
print('Col_4 as Pandas Series')
print(series)
print(type(series))

In [None]:
display(df)

# Select multiple Rows and Columns by Label (loc)
df_sel = df.loc['Row_0':'Row_4', ['Col_1', 'Col_3']]
print(' ')
print('Rows 0-4 and Cols 1,3 as Pandas DataFrame')
display(df_sel)
print(type(df_sel))

In [None]:
display(df)

# Select multiple Rows and Columns by Integer (iloc)
df_sel = df.iloc[:3, -3:]
print(' ')
print('First three Rows and last three Columns as Pandas DataFrame')
display(df_sel)
print(type(df_sel))
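
One difference between the two cells above is easy to miss: a `loc` slice includes both end labels, while an `iloc` slice excludes the stop position, like a plain Python list. The sketch below illustrates this on a small hand-made frame; `df_demo`, `by_label` and `by_position` are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical toy frame (not the random one used above)
df_demo = pd.DataFrame(
    {'Col_0': [10, 11, 12, 13], 'Col_1': [20, 21, 22, 23]},
    index=['Row_0', 'Row_1', 'Row_2', 'Row_3'],
)

# loc slices by label and includes BOTH endpoints
by_label = df_demo.loc['Row_0':'Row_2', :]

# iloc slices by position and excludes the stop position
by_position = df_demo.iloc[0:2, :]

print(list(by_label.index))     # Row_0, Row_1 and Row_2
print(list(by_position.index))  # Row_0 and Row_1 only
```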

In [None]:
display(df)

# Select single Value by Label (loc)
value = df.loc['Row_8', 'Col_3']
print(' ')
print('Single Value in Row_8 and Col_3 as NumPy Integer')
print(value)
print(type(value))

In [None]:
display(df)

# Select single Value by Integer (iloc)
value = df.iloc[1,0]
print(' ')
print('Single Value in Row_1 and Col_0 as NumPy Integer')
print(value)
print(type(value))

`Selecting Rows` using `head` and `tail`

In [None]:
# Select first three Rows
df_sel = df.head(3)
display(df_sel)

In [None]:
# Select last two Rows
df_sel = df.tail(2)
display(df_sel)

`Selecting Rows` using `Boolean Masking`

In [None]:
display(df)

# Select Rows
df_sel = df[df['Col_0'] > 50]
print(' ')
print('All Rows with Col_0 > 50')
display(df_sel)
print(type(df_sel))

`Boolean Masking with multiple Conditions`

`Note: You must wrap each Condition in Parentheses (Round Brackets), because & and | bind more tightly than Comparison Operators`

In [None]:
display(df)

# Select Rows
df_sel = df[(df['Col_0'] > 50) & (df['Col_1'] < 30)]
print(' ')
print('All Rows with Col_0 > 50 AND Col_1 < 30')
display(df_sel)
print(type(df_sel))
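
The cell above combines conditions with `&`. As a small sketch on made-up data (the names `df_demo`, `mask_and`, `mask_or` and `mask_not` are assumptions for illustration), the same pattern works with `|` for OR and `~` for NOT, and it shows why each condition needs its own pair of parentheses.

```python
import pandas as pd

# Hypothetical toy frame with fixed values so the result is easy to check
df_demo = pd.DataFrame({'Col_0': [60, 70, 20], 'Col_1': [10, 40, 5]})

# Element-wise AND, OR and NOT on Boolean Series
mask_and = (df_demo['Col_0'] > 50) & (df_demo['Col_1'] < 30)
mask_or = (df_demo['Col_0'] > 50) | (df_demo['Col_1'] < 30)
mask_not = ~(df_demo['Col_0'] > 50)

print(df_demo[mask_and])
print(df_demo[mask_or])
print(df_demo[mask_not])

# Without parentheses, & is evaluated before > and <, so the line below
# would be read as a chained comparison on whole Series and is expected
# to fail with a ValueError about an ambiguous truth value:
# df_demo[df_demo['Col_0'] > 50 & df_demo['Col_1'] < 30]
```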

`Simplified Column Selection`

In [None]:
series = df['Col_1']

print(series)
print(type(series))

In [None]:
df_sel = df[['Col_1', 'Col_2']]

display(df_sel)
print(type(df_sel))

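`Notice: It is good Practice to call copy() on a Selection you intend to modify` <br> `...no matter if you use loc, iloc, Boolean Masking or a simplified Version of Indexing. The original DataFrame then stays unchanged, and Pandas has no Reason to raise a SettingWithCopyWarning`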

In [None]:
# Example
df_sel = df[['Col_1', 'Col_2']].copy()

display(df_sel)
print(type(df_sel))
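
To see what `copy()` buys you, here is a minimal sketch on a hypothetical toy frame (`df_demo` and `df_sel_demo` are names made up for this example): the selection is modified, and the original values stay as they were.

```python
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({'Col_1': [1, 2, 3], 'Col_2': [4, 5, 6]})

# Take an explicit copy of the selection, then modify only the copy
df_sel_demo = df_demo[['Col_1']].copy()
df_sel_demo['Col_1'] = df_sel_demo['Col_1'] * 100

print(df_demo['Col_1'].tolist())      # original values are unchanged
print(df_sel_demo['Col_1'].tolist())  # only the copy was scaled
```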

`Selecting the Index from a DataFrame`

In [None]:
index = df.index

print(index)
print(type(index))

`Resetting the Index of a DataFrame`

In [None]:
df_new = df.reset_index()

display(df_new)

In [None]:
# Get rid of the old Index
df_new = df.reset_index(drop=True)

display(df_new)

`Returning the Dimensions of a DataFrame`

In [None]:
# Number of Dimensions
print('# of Dimensions:', df.ndim)

# Number of Rows
print('# of Rows:', len(df))

# Number of Columns
print('# of Columns:', len(df.columns))

# Shape
print('Shape:', df.shape)

# Number of Items (Size)
print('Number of Items (Size):', df.size)

`Checking whether a DataFrame is empty`

In [None]:
print(df.empty)

`Sort DataFrame by Column(s)`

In [None]:
# Create new DataFrame
list_cols = ['x' + str(i) for i in range(1, 10+1)]

df = pd.DataFrame(np.random.randint(3, size=(10, 10)), index=None, columns=list_cols)

display(df)

In [None]:
# Sort by x1 ascending
df_sorted = df.sort_values(by=['x1'], ascending=True)
display(df_sorted)

In [None]:
# Sort by x9 descending 
df_sorted = df.sort_values(by=['x9'], ascending=False)
display(df_sorted)

In [None]:
# Sort by x2 descending and x3 ascending
df_sorted = df.sort_values(by=['x2', 'x3'], ascending=[False, True])
display(df_sorted)

`Dropping Duplicates`

In [None]:
# Duplicates including all Columns
print('# Rows before dropping Duplicates:', len(df))
df_dropped = df.drop_duplicates()
print('# Rows after dropping Duplicates:', len(df_dropped))
display(df_dropped)

In [None]:
# Duplicates including a Subset of Columns
print('# Rows before dropping Duplicates:', len(df))
df_dropped = df.drop_duplicates(subset=['x1', 'x2', 'x3'], keep='first')
print('# Rows after dropping Duplicates:', len(df_dropped))
display(df_dropped)
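
Before dropping anything, it can help to look at which rows would be treated as duplicates. The sketch below uses `duplicated`, which returns a Boolean mask; the frame `df_demo` and the mask names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame with repeated (x1, x2) combinations
df_demo = pd.DataFrame({'x1': [0, 0, 1, 0], 'x2': [2, 2, 2, 2], 'x3': [1, 1, 0, 5]})

# keep='first' marks every repetition after the first occurrence
mask_later = df_demo.duplicated(subset=['x1', 'x2'], keep='first')

# keep=False marks ALL members of a duplicated group
mask_all = df_demo.duplicated(subset=['x1', 'x2'], keep=False)

print(df_demo[mask_later])
print(df_demo[mask_all])

# Negating the first mask gives the same rows as drop_duplicates
df_kept = df_demo[~mask_later]
print(df_kept)
```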

`Creating new Columns`

In [None]:
# New Column with constant Value
df['n1'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all Versions
df['n2'] = 3
df['n3'] = -1

display(df)

In [None]:
# New Column derived from other Columns
df['n4'] = df['n2'] + df['n3']
df['n5'] = df['x1']**2 - 2*df['x2'] + 1
df['n6'] = np.ceil((df['n2'] - df['x1']) / 2).astype(int)
df['n7'] = (df['x2'] > 0) | (df['x3'] > 0)

display(df)

`Dropping Columns` using `drop`

In [None]:
df_dropped = df.drop(columns=['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'])

display(df_dropped)

`Dropping Columns` using `loc` (`Negative Selection`)

In [None]:
df_dropped = df.loc[:, ~df.columns.str.startswith('n')]

display(df_dropped)

`Aggregation Functions`

In [None]:
dict_for_df = {
    'x1': [-1, 0, 1],
    'x2': [ 0, 0, 0],
    'x3': [ 1, 0, np.nan],
    'x4': [ 0, 2, 0],
    'x5': [ 1, -2, 0]
}

df = pd.DataFrame(dict_for_df, index=None)

display(df)

# Maximum per Column
print(' ')
print('Maximum per Column:')
print(df.max(axis=0))

# Minimum per Row
print(' ')
print('Minimum per Row:')
print(df.min(axis=1))

# Mean per Column
print(' ')
print('Mean per Column (NaN skipped, the Default):')
print(df.mean(axis=0))
print(' ')
print('Mean per Column (skipna=False, so x3 becomes NaN):')
print(df.mean(axis=0, skipna=False))

`Creating new Columns` using `Aggregation Functions`

In [None]:
# Maximum of Column 'x4'
df['ColMax_x4'] = df.loc[:, 'x4'].max(axis=0)

# ...or
df['ColMax_x4_v2'] = df.max(axis=0).loc['x4']

# Maximum per Row over Columns 'x1' to 'x5'
df['RowMaxAll'] = df.loc[:, 'x1':'x5'].max(axis=1)

# Maximum per Row over Columns 'x1' and 'x2'
df['RowMax_x1_x2'] = df.loc[:, ['x1', 'x2']].max(axis=1)

display(df)

`Creating new Columns` using `Shift Function`

In [None]:
import datetime
import numpy as np

array_for_df = np.random.randint(100, size=(10, 3))

indices_for_df = pd.date_range(datetime.date.today(), periods=10).tolist()

columns_for_df = ['A', 'B', 'C']

df = pd.DataFrame(array_for_df, index=indices_for_df, columns=columns_for_df)

display(df)

In [None]:
# Create Lag 1-Values for Column A
df_shift = df.copy()

df_shift['A_lag1'] = df_shift['A'].shift(1)
    
display(df_shift)

In [None]:
# Create Lag 1 and 2-Values for Column B
df_shift = df.copy()

for lag in [1,2]:
    df_shift['B_lag' + str(lag)] = df_shift['B'].shift(lag)
    
display(df_shift)

In [None]:
# Create Lead 1, 2 and 3-Values for Columns A, B and C
df_shift = df.copy()

for col in df_shift.columns:
    for lead in [-1, -2, -3]:
        df_shift[col + '_lead' + str(-lead)] = df_shift[col].shift(lead)
    
display(df_shift)
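
A common reason to create lag columns is to compare each value with the previous one. The following sketch, on a hypothetical four-day series with fixed values (`df_demo`, `A_change` and `df_demo_clean` are assumed names), computes the day-over-day change from a lag column and drops the first row, which has no predecessor.

```python
import pandas as pd

# Hypothetical toy data with a daily DatetimeIndex
df_demo = pd.DataFrame(
    {'A': [10, 13, 11, 15]},
    index=pd.date_range('2021-03-01', periods=4),
)

# shift(1) moves values one row down, so each row sees yesterday's value
df_demo['A_lag1'] = df_demo['A'].shift(1)

# Change versus the previous day; the first row is NaN because nothing precedes it
df_demo['A_change'] = df_demo['A'] - df_demo['A_lag1']

# Rows with NaN from shifting are often removed before further analysis
df_demo_clean = df_demo.dropna()

print(df_demo)
print(df_demo_clean)
```

For this particular calculation, `df_demo['A'].diff()` gives the same numbers in one step.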

`Grouping DataFrames`

In [None]:
import datetime
import numpy as np

# 2020 is a Leap Year, so a full Year needs 366 Days
dates  = pd.date_range(datetime.date(2020,1,1), periods=366).tolist()
months = pd.date_range(datetime.date(2020,1,1), periods=366).month.tolist()
sales  = np.random.randint(100000, size=(366))

dict_for_df = {
    'Month': months,
    'Sales': sales
}

df = pd.DataFrame(dict_for_df, index=dates)

display(df)

In [None]:
# Group Sales by Month
df_grouped = df.groupby('Month').agg({'Sales': ['sum', 'mean']})

display(df_grouped)
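
Passing a dictionary of lists to `agg` produces two-level column names such as `('Sales', 'sum')`. If flat column names are easier to work with, named aggregation is one option. The sketch below uses made-up sales figures; `df_demo`, `df_named`, `sales_sum` and `sales_mean` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data with fixed values
df_demo = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'Sales': [100, 200, 50, 150, 100],
})

# Named aggregation: new_column=(source_column, function)
df_named = df_demo.groupby('Month').agg(
    sales_sum=('Sales', 'sum'),
    sales_mean=('Sales', 'mean'),
).reset_index()  # turn the group key back into a regular column

print(df_named)
```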

`Concatenating DataFrames` with identical Columns

In [None]:
# Two DataFrames with identical Columns
dict1 = {
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4']
}

dict2 = {
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2'],
    'C': ['C0', 'C1', 'C2']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

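Use `concat` (the older `DataFrame.append` was deprecated in Pandas 1.4 and removed in Pandas 2.0)

In [None]:
# Will give duplicate Index Values
df_joined = pd.concat([df1, df2])

display(df_joined)

In [None]:
# Will "ignore" Index
df_joined = pd.concat([df1, df2], ignore_index=True)

display(df_joined)

The same Call with the List and the Axis spelled out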

In [None]:
# Concatenating along Axis 0 (= Appending)
list_df_to_concat = [df1, df2]

df_joined = pd.concat(list_df_to_concat, axis=0, ignore_index=True)

display(df_joined)

`Concatenating DataFrames` with identical Rows

In [None]:
# Two DataFrames with identical Rows 
dict1 = {
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4']
}

dict2 = {
    'D': ['D0', 'D1', 'D2', 'D3', 'D4'],
    'E': ['E0', 'E1', 'E2', 'E3', 'E4']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

Use `concat`

In [None]:
# Concatenate along Axis 1
df_joined = pd.concat([df1, df2], axis=1)

display(df_joined)

`Concatenating DataFrames` with different Rows and Columns

In [None]:
# Two DataFrames with different Rows and Columns
dict1 = {
    'X': ['X0', 'X1', 'X2', 'X3', 'X4'],
    'Y': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4']
}

dict2 = {
    'X': ['X0', 'X1', 'X2', 'X3'],
    'Y': ['Y0', 'Y1', 'Y2', 'Y3'],
    'Z': ['Z0', 'Z1', 'Z2', 'Z3']
}

df1 = pd.DataFrame(dict1, index=None)
df2 = pd.DataFrame(dict2, index=None)

display(df1, df2)

Use `concat` along Axis 0

In [None]:
# Will append Rows and fill non-existing Columns with NaN
df_joined = pd.concat([df1, df2], axis=0, ignore_index=True, sort=True)

display(df_joined)
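
By default `concat` keeps the union of all columns and fills the gaps with NaN, as seen above. As a sketch under the assumption that only the shared columns are wanted, `join='inner'` keeps just the columns that appear in every frame. The frames `df1_demo` and `df2_demo` are small hypothetical stand-ins for `df1` and `df2`.

```python
import pandas as pd

# Hypothetical toy frames sharing the columns X and Y
df1_demo = pd.DataFrame({'X': ['X0', 'X1'], 'Y': ['Y0', 'Y1'], 'A': ['A0', 'A1']})
df2_demo = pd.DataFrame({'X': ['X2'], 'Y': ['Y2'], 'Z': ['Z2']})

# Default join='outer': union of columns, missing cells become NaN
df_outer = pd.concat([df1_demo, df2_demo], axis=0, ignore_index=True)

# join='inner': only columns present in both frames survive
df_inner = pd.concat([df1_demo, df2_demo], axis=0, ignore_index=True, join='inner')

print(df_outer)
print(df_inner)
```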

`Merging DataFrames`

`Left Join`

In [None]:
# Fact Table
df = pd.DataFrame(dict({'city_id': np.random.randint(1, 4,    size=(5)),
                        'x1'     : np.random.randint(0, 1000, size=(5)),
                        'x2'     : np.random.randint(0, 1000, size=(5)),
                        'x3'     : np.random.randint(0, 1000, size=(5)),}), 
                  index=None)
# Dimension Table
df_city = pd.DataFrame(dict({'city_id': [1, 2, 3], 
                             'city_text': ['Berlin', 'Hamburg', 'München']}), 
                       index=None)

display(df)
display(df_city)

In [None]:
# Merge: the Result is a new DataFrame that keeps every Row of df
# and adds the matching city_text from df_city
df = pd.merge(df, df_city, how='left', on='city_id')

df.sort_index(axis=1, inplace=True)

display(df)

`Inner Join`

In [None]:
df1 = pd.DataFrame(dict({'customer_id': [1, 2, 3, 4, 5],
                         'sex': ['male', 'male', 'female', 'male', 'female']}), 
                        index=None)

df2 = pd.DataFrame(dict({'customer_id': [3, 5, 6, 7, 8],
                         'age': [39, 24, 63, 43, 50]}), index=None)

df_joined = pd.merge(df1, df2, how='inner', left_on='customer_id', right_on='customer_id')

print('Original DataFrames')
display(df1, df2)
print(' ')
print('Inner Join on Customer ID')
display(df_joined)
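
To compare the join types side by side, the sketch below merges the same two customer tables with each value of `how` and counts the resulting rows. It also uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column showing where each row came from. The names `df1_demo`, `df2_demo`, `row_counts` and `df_outer` are assumptions for this example.

```python
import pandas as pd

# Same values as the customer tables above, under hypothetical names
df1_demo = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5],
                         'sex': ['male', 'male', 'female', 'male', 'female']})
df2_demo = pd.DataFrame({'customer_id': [3, 5, 6, 7, 8],
                         'age': [39, 24, 63, 43, 50]})

# Row count per join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1_demo, df2_demo, how=how, on='customer_id')
    row_counts[how] = len(merged)

print(row_counts)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
df_outer = pd.merge(df1_demo, df2_demo, how='outer', on='customer_id', indicator=True)
print(df_outer)
```

Only customers 3 and 5 appear in both tables, which is why the inner join returns two rows.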

`Displaying DataFrames`

In [None]:
df = pd.DataFrame(np.zeros((100, 50), dtype=np.int64), index=None)

pd.options.display.max_rows=10
pd.options.display.max_columns=20

display(df)

In [None]:
pd.options.display.max_rows=None
pd.options.display.max_columns=None

display(df)
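
Setting `pd.options.display...` changes the behaviour for the rest of the session. If the limits should apply to a single output only, `pd.option_context` is one way to scope them. The sketch below uses `print` instead of `display` so it also runs outside a notebook; `df_demo`, `before`, `inside` and `after` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical wide and long frame
df_demo = pd.DataFrame(np.zeros((100, 50), dtype=np.int64))

before = pd.get_option('display.max_rows')

# The options only apply inside the with-block
with pd.option_context('display.max_rows', 10, 'display.max_columns', 20):
    inside = pd.get_option('display.max_rows')
    print(df_demo)

# After the block, the previous setting is restored automatically
after = pd.get_option('display.max_rows')
print(before, inside, after)
```

To return to the library defaults after changing the options globally, `pd.reset_option('display.max_rows')` can be used.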