<a href="https://colab.research.google.com/github/sayyed-uoft/fullstackai/blob/main/04_Data_Analysis_with_Pandas_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Strata.ai - Artificial Intelligence Certificate 

# Module 1: Data Science for AI

# Data Analysis with Pandas - Part 1

## Learning Outcome

- Introduction to the Data Analysis process
- Introduction to the most common Data Analysis package in Python (pandas)
- Learn how to work with two powerful data structures in pandas (Series & DataFrame)

## Topics
- [Data Analysis Process](#process)
- [Pandas' Data Structures](#data-structures)
- [Series](#series)
- [DataFrame](#dataframe)



<a id="process"></a>
## Data Analysis Process

<center><img src="attachment:image.png" width="70%"></center>

<a id="pandas"></a>
## Introduction to Pandas

- The most common Data Analysis open-source package in Python.
- Contains data structures and data manipulation tools designed to make **data cleaning and analysis** fast and easy.
- Designed for working with tabular or heterogeneous data (NumPy, by contrast, is best suited for working with homogeneous numerical array data).
- Usually used in tandem with other common packages: NumPy, statsmodels, scikit-learn, and matplotlib.

<a id="data-structures"></a>
## Pandas' Data Structures

Pandas provides two powerful data structures: 

- **Series:** a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
- **DataFrame:** represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
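
To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages`, `people`, and `age_column` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([25, 30, 35], index=["Jack", "Sara", "Paula"])

# A DataFrame: several columns (possibly of different types) sharing one index
people = pd.DataFrame(
    {"Age": [25, 30, 35], "Is Employed": [True, False, True]},
    index=["Jack", "Sara", "Paula"],
)

# Selecting a single column of a DataFrame gives back a Series
age_column = people["Age"]
print(type(age_column).__name__)   # Series
print((age_column == ages).all())  # True: same labels, same values
print(people.dtypes)               # each column keeps its own dtype
```

Each column keeps its own data type, while all columns share the row labels.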

<a id="series"></a>
## Series

- It contains a **sequence of values** of the **same type**
- It also contains an **index** (data labels associated with the values)

<center><img src="attachment:image.png" width="60%"></center>

In [None]:
import pandas as pd # import the pandas package ("pd" is the conventional alias)

my_series = pd.Series([0, 2, -3, 5, 7]) # simplest way of creating a Series 
my_series

0    0
1    2
2   -3
3    5
4    7
dtype: int64

In [None]:
# let's use "display" to show multiple outputs and get more info
from IPython.display import display 

display(my_series.values) # the "values" of the Series
display(my_series.index) # "index" is created automatically and it's like "range(5)" 
display(my_series.dtype) # The Series' data type

print(list(my_series.index)) # convert to a list and print it 

array([ 0,  2, -3,  5,  7])

RangeIndex(start=0, stop=5, step=1)

dtype('int64')

[0, 1, 2, 3, 4]


In [None]:
# Assigning the "index" explicitly
my_series = pd.Series(['a', 'b', 'c', 'd'], 
                      index=['first', 'second', 'third', 'fourth']) 
display(my_series.values, my_series.index) 

# Alternative way
my_series = pd.Series(['a', 'b', 'c', 'd'])
my_series.index = ['first', 'second', 'third', 'fourth']
display(my_series.index)

# Yet another way (from a dictionary)
my_series = pd.Series({'first': 'a', 'second': 'b', 'third': 'c', 'fourth': 'd'})
display(my_series.values, my_series.index)

array(['a', 'b', 'c', 'd'], dtype=object)

Index(['first', 'second', 'third', 'fourth'], dtype='object')

Index(['first', 'second', 'third', 'fourth'], dtype='object')

array(['a', 'b', 'c', 'd'], dtype=object)

Index(['first', 'second', 'third', 'fourth'], dtype='object')

In [None]:
# Accessing by index
is_employed = pd.Series([True, False, True], index=['Jack', 'Sara', 'Paula'])

display(is_employed)
display(is_employed['Paula']) # selecting a value
display(is_employed[['Paula', 'Jack']]) # selecting a set of values

Jack      True
Sara     False
Paula     True
dtype: bool

True

Paula    True
Jack     True
dtype: bool

You can assign a **name** to a Series and to its index (this will be useful later):

In [None]:
is_employed = pd.Series([True, False, True], index=['Jack', 'Sara', 'Paula'])
is_employed.name = "Is Employed"
is_employed.index.name = "Person"

display(is_employed)

Person
Jack      True
Sara     False
Paula     True
Name: Is Employed, dtype: bool

A pandas **Series** behaves much like a NumPy array. You can use NumPy functions or NumPy-like operations on it.

In [None]:
import numpy as np
my_series = pd.Series([0, 2, -3], index=['Jack', 'Sara', 'Paula'])

display(my_series * 5) # element-wise arithmetic operation
display(np.exp(my_series)) # element-wise function
display('Sum: {}'.format(np.sum(my_series))) # aggregate function
display(my_series[my_series < 0]) # comparison and boolean filtering

Jack      0
Sara     10
Paula   -15
dtype: int64

Jack     1.000000
Sara     7.389056
Paula    0.049787
dtype: float64

'Sum: -1'

Paula   -3
dtype: int64

Pandas matches elements of two Series by their indexes: 

In [None]:
# Two Series with the same labels in different orders
salaries = pd.Series([65_000, 0, 78_000], index=['Jack', 'Sara', 'Paula'])
bonuses =  pd.Series([10_000, 18_000, 0], index=['Jack', 'Paula', 'Sara'])

display(salaries + bonuses)

Jack     75000
Paula    96000
Sara         0
dtype: int64

In [None]:
# Check if an index exists
my_series = pd.Series([0, 2, -3], index=['a', 'b', 'c'])
display('c' in my_series) # similar to a dictionary

# Iteration
for value in my_series: # iterating values
    print(value)
    
    
for idx in my_series.index: # iterating index
    print("{} = {}".format(idx, my_series[idx]))

True

0
2
-3
a = 0
b = 2
c = -3


### Handling missing values:

In [None]:
# Example with missing values
salaries = pd.Series([65_000, 0, 78_000], 
                     index=['Jack', 'Sara', 'Paula']) # Alex is missing here
bonuses =  pd.Series([15_000, 10_000, 18_000], 
                     index=['Alex', 'Jack', 'Paula']) # Sara is missing here
totals = salaries + bonuses
display(totals)

# NaN = Not a Number = missing/NA value

Alex         NaN
Jack     75000.0
Paula    96000.0
Sara         NaN
dtype: float64
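
When a missing label should count as zero rather than produce `NaN`, one option is the `add` method with `fill_value`. The sketch below reuses the same data; the names `plain` and `filled` are illustrative.

```python
import pandas as pd

salaries = pd.Series([65_000, 0, 78_000], index=["Jack", "Sara", "Paula"])
bonuses = pd.Series([15_000, 10_000, 18_000], index=["Alex", "Jack", "Paula"])

# Plain "+" yields NaN wherever a label exists on only one side
plain = salaries + bonuses

# add(..., fill_value=0) substitutes 0 for the missing side before adding
filled = salaries.add(bonuses, fill_value=0)
print(filled)
```

Whether zero is a sensible stand-in depends on the data: here it treats "no recorded salary" as a salary of 0, which may or may not be what you want.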

In [None]:
# Checking missing values
display(pd.isnull(totals)) # flag each element that is missing
display(pd.notnull(totals)) # flag each element that is not missing
display(totals.isnull()) # the equivalent Series method

Alex      True
Jack     False
Paula    False
Sara      True
dtype: bool

Alex     False
Jack      True
Paula     True
Sara     False
dtype: bool

Alex      True
Jack     False
Paula    False
Sara      True
dtype: bool
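
The boolean masks above are usually a means to an end, such as counting or removing missing entries. A small sketch, rebuilding `totals` by hand so the block runs on its own (`n_missing`, `present`, and `dropped` are illustrative names):

```python
import numpy as np
import pandas as pd

totals = pd.Series([np.nan, 75_000.0, 96_000.0, np.nan],
                   index=["Alex", "Jack", "Paula", "Sara"])

n_missing = totals.isnull().sum()   # True counts as 1, so this counts the NaNs
present = totals[totals.notnull()]  # keep only the non-missing entries
dropped = totals.dropna()           # same result using a dedicated method

print(n_missing)  # 2
print(present)
```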

### Sorting a Series:

In [None]:
display(totals.sort_values(ascending=True)) # sort values in an ascending order
display(totals.sort_values(ascending=False, na_position='first')) # desc. order and NA values first 
display(totals.sort_index(ascending=True))

Jack     75000.0
Paula    96000.0
Alex         NaN
Sara         NaN
dtype: float64

Alex         NaN
Sara         NaN
Paula    96000.0
Jack     75000.0
dtype: float64

Alex         NaN
Jack     75000.0
Paula    96000.0
Sara         NaN
dtype: float64

<a id="dataframe"></a>
## DataFrame

- **DataFrame** represents tabular data (an ordered collection of columns)
- Each **column** is a **Series** that can be a different value type (numeric, string, boolean, etc.)
- It has both a **row** and **column index**. All columns share the same index.

<center><img src="attachment:image.png" width="70%"></center>

In [None]:
# Creating a DataFrame 
data = {
    'Person': ['Jack', 'Sara', 'Paula', 'Alex'],
    'Is Employed': [True, False, True, True],
    'Salary': [65_000, 0, 78_000, None],
    'Bonus': [10_000, 0, 18_000, 15_000]
}
df = pd.DataFrame(data)
print(df)
df # IPython/Jupyter renders a prettier (browser-friendly) table

  Person  Is Employed   Salary  Bonus
0   Jack         True  65000.0  10000
1   Sara        False      0.0      0
2  Paula         True  78000.0  18000
3   Alex         True      NaN  15000


Unnamed: 0,Person,Is Employed,Salary,Bonus
0,Jack,True,65000.0,10000
1,Sara,False,0.0,0
2,Paula,True,78000.0,18000
3,Alex,True,,15000


In [None]:
# Change the order of the columns
df = pd.DataFrame(data, columns=['Person', 'Salary', 'Is Employed', 'Bonus'])
display(df)

# Index (Row index) is assigned automatically (sequential numbers)
display(df.index)

Unnamed: 0,Person,Salary,Is Employed,Bonus
0,Jack,65000.0,True,10000
1,Sara,0.0,False,0
2,Paula,78000.0,True,18000
3,Alex,,True,15000


RangeIndex(start=0, stop=4, step=1)

In [None]:
# Setting the index at creation 
df = pd.DataFrame(data, 
                  columns=['Salary', 'Is Employed', 'Bonus', 'Age'],
                  index=data['Person']
                 )
display(df) # The column "Age" doesn't exist in "data", it is created with NaN values

Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,65000.0,True,10000,
Sara,0.0,False,0,
Paula,78000.0,True,18000,
Alex,,True,15000,


In [None]:
#  Alternative method (list of lists)
df = pd.DataFrame([
    [65_000, True, 10_000],
    [0, False, 0],
    [78_000, True, 18_000],
    [None, True, 15_000]
], 
    columns=['Salary', 'Is Employed', 'Bonus'],
    index=['Jack', 'Sara', 'Paula', 'Alex']    
)

display(df)

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Sara,0.0,False,0
Paula,78000.0,True,18000
Alex,,True,15000


In [None]:
# Yet another method (nested dictionary)
df = pd.DataFrame({
    'Salary': {'Jack': 65_000, 'Sara': 0, 'Paula': 78_000},
    'Is Employed': {'Jack': True, 'Sara': False, 'Paula': True, 'Alex': True},
    'Bonus': {'Jack': 10_000, 'Sara': 0, 'Paula': 18_000, 'Alex': 15_000},
    'Age': {'Jack': 25},
})

df.index.name = 'Person' # setting the row index name 
df.columns.name = 'HR Details' # setting the column index name 

display(df)

HR Details,Salary,Is Employed,Bonus,Age
Person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Jack,65000.0,True,10000,25.0
Sara,0.0,False,0,
Paula,78000.0,True,18000,
Alex,,True,15000,


In [None]:
# Getting DataFrame values
display(df.values)

array([[65000.0, True, 10000, 25.0],
       [0.0, False, 0, nan],
       [78000.0, True, 18000, nan],
       [nan, True, 15000, nan]], dtype=object)

In [None]:
# Get the DataFrame's columns
display(df.columns)

# Accessing a column
display(df.Salary)
display(df['Is Employed'])

# Accessing a row
display(df.loc['Jack'])

Index(['Salary', 'Is Employed', 'Bonus', 'Age'], dtype='object', name='HR Details')

Person
Jack     65000.0
Sara         0.0
Paula    78000.0
Alex         NaN
Name: Salary, dtype: float64

Person
Jack      True
Sara     False
Paula     True
Alex      True
Name: Is Employed, dtype: bool

HR Details
Salary         65000
Is Employed     True
Bonus          10000
Age               25
Name: Jack, dtype: object

In [None]:
# Broadcasting a value to a column
df['Age'] = 0

# Creating a new column by assignment
df['Gender'] = 'Unknown'

# Creating calculated columns
df['Total Wage'] = df.Salary + df.Bonus
df['Low Income'] = df['Total Wage'] > 80000 # note: True marks totals ABOVE 80,000, so the column name is misleading; NaN compares as False

display(df)

HR Details,Salary,Is Employed,Bonus,Age,Gender,Total Wage,Low Income
Person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Jack,65000.0,True,10000,0,Unknown,75000.0,False
Sara,0.0,False,0,0,Unknown,0.0,False
Paula,78000.0,True,18000,0,Unknown,96000.0,True
Alex,,True,15000,0,Unknown,,False


In [None]:
# Assigning values to a column
df['Age'] = [25, 30, 35, 19]

# Assigning some values to a column
df['Gender'] = pd.Series(['Male', 'Female'], index=['Jack', 'Paula'])

display(df)

HR Details,Salary,Is Employed,Bonus,Age,Gender,Total Wage,Low Income
Person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Jack,65000.0,True,10000,25,Male,75000.0,False
Sara,0.0,False,0,30,,0.0,False
Paula,78000.0,True,18000,35,Female,96000.0,True
Alex,,True,15000,19,,,False


In [None]:
# Deleting a column
del df['Total Wage']
display(df)

HR Details,Salary,Is Employed,Bonus,Age,Gender,Low Income
Person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Jack,65000.0,True,10000,25,Male,False
Sara,0.0,False,0,30,,False
Paula,78000.0,True,18000,35,Female,True
Alex,,True,15000,19,,False


In [None]:
# Transposing a DataFrame
display(df.T)

Person,Jack,Sara,Paula,Alex
HR Details,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Salary,65000,0,78000,
Is Employed,True,False,True,True
Bonus,10000,0,18000,15000
Age,25,30,35,19
Gender,Male,,Female,
Low Income,False,False,True,False


### Index Object

In [None]:
data = {
    'Person': ['Jack', 'Sara', 'Paula', 'Alex'],
    'Is Employed': [True, False, True, True],
    'Salary': [65_000, 0, 78_000, None],
    'Bonus': [10_000, 0, 18_000, 15_000]
}
df = pd.DataFrame(data)  # Autogenerated index as RangeIndex
display(df.index)

RangeIndex(start=0, stop=4, step=1)

In [None]:
# Set a column as the index
df = df.set_index('Person') # returns the changed DataFrame
display(df)
display(df.index)

# Reset the index
df = df.reset_index() # returns the changed DataFrame
display(df.index)

Unnamed: 0_level_0,Is Employed,Salary,Bonus
Person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Jack,True,65000.0,10000
Sara,False,0.0,0
Paula,True,78000.0,18000
Alex,True,,15000


Index(['Jack', 'Sara', 'Paula', 'Alex'], dtype='object', name='Person')

RangeIndex(start=0, stop=4, step=1)

In [None]:
# Index object has array-like behavior 
df = df.set_index('Person')
display('Paula' in df.index) # check an element

display([idx for idx in df.index]) # iteration

display(df.index.append(pd.Index(['Joe', 'Janet']))) # append creates a new Index object

# But it is immutable
try:
    df.index[0] = 'Joe'
except Exception as e:
    print(e)
    

True

['Jack', 'Sara', 'Paula', 'Alex']

Index(['Jack', 'Sara', 'Paula', 'Alex', 'Joe', 'Janet'], dtype='object')

Index does not support mutable operations
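
Because an Index cannot be modified element by element, labels are changed by building a new Index. A minimal sketch of two common ways, using a small stand-in DataFrame (`renamed` and `relabeled` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"Bonus": [10_000, 0]},
                  index=pd.Index(["Jack", "Sara"], name="Person"))

# rename() maps old labels to new ones and returns a new DataFrame
renamed = df.rename(index={"Jack": "Joe"})

# Alternatively, replace the whole Index object on a copy
relabeled = df.copy()
relabeled.index = pd.Index(["Joe", "Sara"], name="Person")

print(list(renamed.index))  # ['Joe', 'Sara']
print(list(df.index))       # ['Jack', 'Sara'] (the original is unchanged)
```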


### Re-arranging (re-indexing) data in Series and DataFrame:

In [None]:
# Re-indexing a Series
s = pd.Series(['Jack', 'Sara', 'Paula', 'Alex'])
display(s)

display(s.reindex([1, 2, 0, 3]))

0     Jack
1     Sara
2    Paula
3     Alex
dtype: object

1     Sara
2    Paula
0     Jack
3     Alex
dtype: object

In [None]:
# Filling gaps while re-indexing (a simple form of interpolation)
s = pd.Series(['Jack', 'Sara'], index=[0, 3])
display(s)

display(s.reindex([0, 1, 2, 3, 4], method='backfill')) # Use next valid observation to fill gap
display(s.reindex([0, 1, 2, 3, 4], method='pad')) # Propagate last valid observation forward to next valid
display(s.reindex([0, 1, 2, 3, 4], method='nearest')) # Use nearest valid observations to fill gap

0    Jack
3    Sara
dtype: object

0    Jack
1    Sara
2    Sara
3    Sara
4     NaN
dtype: object

0    Jack
1    Jack
2    Jack
3    Sara
4    Sara
dtype: object

0    Jack
1    Jack
2    Sara
3    Sara
4    Sara
dtype: object
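
Besides the `method` options shown above, `reindex` also accepts `fill_value` (a constant for new labels) and `limit` (a cap on how many consecutive gaps a fill method may bridge). A short sketch; the placeholder string "Unknown" is an arbitrary choice.

```python
import pandas as pd

s = pd.Series(["Jack", "Sara"], index=[0, 3])

# Use a constant for every label that did not exist before
filled = s.reindex([0, 1, 2, 3, 4], fill_value="Unknown")

# Forward-fill, but across at most one consecutive missing label
limited = s.reindex([0, 1, 2, 3, 4], method="pad", limit=1)

print(filled)
print(limited)  # label 2 stays NaN because it is two steps away from label 0
```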

In [None]:
# Re-indexing DataFrame Rows
df = pd.DataFrame(data, 
                  columns=['Salary', 'Is Employed', 'Bonus'],
                  index=data['Person'])
display(df)
display(df.reindex(['Sara', 'Alex', 'Paula', 'Jack', 'Julio'])) # the row for 'Julio' will be all 'NaN'

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Sara,0.0,False,0
Paula,78000.0,True,18000
Alex,,True,15000


Unnamed: 0,Salary,Is Employed,Bonus
Sara,0.0,False,0.0
Alex,,True,15000.0
Paula,78000.0,True,18000.0
Jack,65000.0,True,10000.0
Julio,,,


In [None]:
# Re-indexing columns
df.reindex(columns=['Bonus', 'Salary', 'Age', 'Is Employed']) # the 'Age' column will be all 'NaN'

Unnamed: 0,Bonus,Salary,Age,Is Employed
Jack,10000,65000.0,,True
Sara,0,0.0,,False
Paula,18000,78000.0,,True
Alex,15000,,,True


### Dropping rows or columns

In [None]:
# Dropping elements from a Series
s = pd.Series(['Jack', 'Sara', 'Paula', 'Alex'], index=['a', 'b', 'c', 'd'])
display(s.drop('d'))
display(s.drop(['a', 'c']))

a     Jack
b     Sara
c    Paula
dtype: object

b    Sara
d    Alex
dtype: object

In [None]:
# Dropping rows from a DataFrame
display(df)
display(df.drop(['Jack', 'Alex']))
df.drop('Sara', inplace=True) # in place dropping
display(df)

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Sara,0.0,False,0
Paula,78000.0,True,18000
Alex,,True,15000


Unnamed: 0,Salary,Is Employed,Bonus
Sara,0.0,False,0
Paula,78000.0,True,18000


Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Paula,78000.0,True,18000
Alex,,True,15000


In [None]:
# Dropping columns
display(df.drop('Bonus', axis=1)) # axis = 1 or axis = 'columns'
display(df.drop(['Salary', 'Is Employed'], axis='columns'))

Unnamed: 0,Salary,Is Employed
Jack,65000.0,True
Paula,78000.0,True
Alex,,True


Unnamed: 0,Bonus
Jack,10000
Paula,18000
Alex,15000


### Selecting and Filtering

In [None]:
# Select one or more columns
display(df)
display(df.Salary) # select one column
display(df[['Salary', 'Bonus']]) # select more than one column

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Paula,78000.0,True,18000
Alex,,True,15000


Jack     65000.0
Paula    78000.0
Alex         NaN
Name: Salary, dtype: float64

Unnamed: 0,Salary,Bonus
Jack,65000.0,10000
Paula,78000.0,18000
Alex,,15000


In [None]:
# Select using row indexes 
display(df.loc['Jack']) # select one row by index
display(df.loc[['Jack', 'Alex']]) # select more than one row by index

Salary         65000
Is Employed     True
Bonus          10000
Name: Jack, dtype: object

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Alex,,True,15000


In [None]:
# Select using row numbers
display(df.iloc[0]) # selecting one row by row number
display(df.iloc[[0, 2]]) # select more than one row by list of row numbers
display(df.iloc[0:2]) # select more than one row by a range
display(df[0:2]) # alternative method

Salary         65000
Is Employed     True
Bonus          10000
Name: Jack, dtype: object

Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Alex,,True,15000


Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Paula,78000.0,True,18000


Unnamed: 0,Salary,Is Employed,Bonus
Jack,65000.0,True,10000
Paula,78000.0,True,18000


In [None]:
# Select based on row and column indexes
display(df.loc['Jack', 'Salary']) # one row and one column (one cell)
display(df.loc[['Jack', 'Paula'], ['Salary', 'Bonus']]) # multiple rows and/or columns
display(df.loc['Jack', :'Is Employed']) # using ranges
display(df.iloc[0, 0]) # one cell by row and column numbers
display(df.iloc[0:2, [0, 2] ]) # multiple rows and/or columns (list or range)

65000.0

Unnamed: 0,Salary,Bonus
Jack,65000.0,10000
Paula,78000.0,18000


Salary         65000
Is Employed     True
Name: Jack, dtype: object

65000.0

Unnamed: 0,Salary,Bonus
Jack,65000.0,10000
Paula,78000.0,18000


In [None]:
# Filter based on the values of a column
display(df[df.Salary > 70_000]) # Filter high paid 
display(df.loc[df.Salary > 70_000, 'Bonus']) # Filter high paid (show only one column) 
display(df[~df['Is Employed']]) # Filter unemployed 

Unnamed: 0,Salary,Is Employed,Bonus
Paula,78000.0,True,18000


Paula    18000
Name: Bonus, dtype: int64

Unnamed: 0,Salary,Is Employed,Bonus


In [None]:
# Filter based on the values of the entire data frame
df2 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('abcd'))
display(df2 % 2 == 0) # Boolean DataFrame
display(df2[df2 % 2 == 0]) # Filtering by boolean DataFrame

Unnamed: 0,a,b,c,d
0,True,False,True,False
1,True,False,True,False
2,True,False,True,False


Unnamed: 0,a,b,c,d
0,0,,2,
1,4,,6,
2,8,,10,
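
Filters can also be combined. The sketch below assumes a stand-in DataFrame with the same three rows used in this section; the thresholds are arbitrary. Note that each condition needs its own parentheses, and that `&`, `|`, and `~` replace Python's `and`, `or`, and `not`.

```python
import pandas as pd

df = pd.DataFrame(
    {"Salary": [65_000.0, 78_000.0, None],
     "Is Employed": [True, True, True],
     "Bonus": [10_000, 18_000, 15_000]},
    index=["Jack", "Paula", "Alex"],
)

# Both conditions must hold (a NaN salary compares as False)
high_both = df[(df["Salary"] > 60_000) & (df["Bonus"] > 12_000)]

# At least one condition must hold
either = df[(df["Salary"] > 70_000) | (df["Bonus"] >= 15_000)]

# Row filter and column selection in a single .loc call
bonus_without_salary = df.loc[df["Salary"].isnull(), "Bonus"]

print(list(high_both.index))  # ['Paula']
print(list(either.index))     # ['Paula', 'Alex']
print(bonus_without_salary)
```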


### Data Alignment

In [None]:
# Aligning two Series during arithmetic operations
salaries = pd.Series([65_000, 78_000], index=['Jack', 'Paula'])
bonuses = pd.Series([10_000, 18_000, 15_000], index=['Alex', 'Paula', 'Jack'])

display(salaries + bonuses) # alignment + addition

# Aligning two DataFrames during arithmetic operations
df1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('abc'), index=['Jack', 'Paula'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('bcd'), index=['Alex', 'Paula', 'Jack'])

display(df1 - df2) # alignment + subtraction

Alex         NaN
Jack     80000.0
Paula    96000.0
dtype: float64

Unnamed: 0,a,b,c,d
Alex,,,,
Jack,,-5.0,-5.0,
Paula,,1.0,1.0,


In [None]:
# Fill the missing values before the aligned operation
display(df1.add(df2, fill_value=0)) # addition with fill value (fills with 0 before addition)

# other functions: sub, div, mul, floordiv, pow
# reverse functions: radd, rsub, rdiv, rmul, rfloordiv, rpow
display(1/df1)
display(df1.rdiv(1))
display((df1 + df2).rdiv(1, fill_value=0))

Unnamed: 0,a,b,c,d
Alex,,0.0,1.0,2.0
Jack,0.0,7.0,9.0,8.0
Paula,3.0,7.0,9.0,5.0


Unnamed: 0,a,b,c
Jack,inf,1.0,0.5
Paula,0.333333,0.25,0.2


Unnamed: 0,a,b,c
Jack,inf,1.0,0.5
Paula,0.333333,0.25,0.2


Unnamed: 0,a,b,c,d
Alex,inf,inf,inf,inf
Jack,inf,0.142857,0.111111,inf
Paula,inf,0.142857,0.111111,inf


### Broadcasting

In [None]:
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['col1', 'col2', 'col3'])
display(df)
display(df + 5) # adding 5 to all the data frame elements

Unnamed: 0,col1,col2,col3
0,0,1,2
1,3,4,5


Unnamed: 0,col1,col2,col3
0,5,6,7
1,8,9,10


In [None]:
display(df - df.iloc[1]) # subtracting the second row from all rows
display(df.sub(df['col3'], axis='index')) # subtracting the third column from every column
display(df - pd.Series([3, 1], index=['col1', 'col4'])) # broadcasting and alignment 

Unnamed: 0,col1,col2,col3
0,-3,-3,-3
1,0,0,0


Unnamed: 0,col1,col2,col3
0,-2,-1,0
1,-2,-1,0


Unnamed: 0,col1,col2,col3,col4
0,-3.0,,,
1,0.0,,,
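
A common practical use of broadcasting is centering data. The following sketch, on the same 2x3 frame, subtracts column means and then row means; `centered` and `row_centered` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["col1", "col2", "col3"])

# df.mean() is a Series labelled by column name, so it is matched against
# the columns and broadcast down the rows
centered = df - df.mean()

# To broadcast a per-row Series across the columns, match on the row index
row_centered = df.sub(df.mean(axis="columns"), axis="index")

print(centered)      # every column now has mean 0
print(row_centered)  # every row now has mean 0
```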


### Element-wise Functions (ufuncs)

In [None]:
# All NumPy ufuncs work on pandas objects too:
display(np.sin(df))
display(np.abs(df - 10))

Unnamed: 0,col1,col2,col3
0,0.0,0.841471,0.909297
1,0.14112,-0.756802,-0.958924


Unnamed: 0,col1,col2,col3
0,10,9,8
1,7,6,5


### Aggregation and Statistics Methods


| Method | Description | Method | Description | 
| :-- | :-- | :-- | :-- | 
| info | Prints a concise summary of a DataFrame | prod | Returns the product of the values for the requested axis |
| describe | Generates descriptive statistics for a Series or a DataFrame column | var | Returns unbiased variance over requested axis |
| count | Counts non-NA cells for each column or row | std | Returns sample standard deviation over requested axis |
| min, max | Returns the minimum/maximum of the values for the requested axis | skew | Returns unbiased skew over requested axis |
| argmin, argmax | Returns int position of the smallest/largest value  | kurt | Returns unbiased kurtosis over requested axis |
| idxmin, idxmax | Returns index of first occurrence of minimum/maximum over requested axis | cumsum | Returns cumulative sum over a DataFrame or Series axis |
| quantile | Returns values at the given quantile over requested axis | cummin, cummax | Returns cumulative minimum/maximum over a DataFrame or Series axis |
| sum | Returns the sum of the values for the requested axis | cumprod | Returns cumulative product over a DataFrame or Series axis |
| mean | Returns the mean of the values for the requested axis | diff | Returns the first discrete difference of elements |
| median | Returns the median of the values for the requested axis | pct_change | Returns the percentage change between the current and a prior element |
| mad | Returns the mean absolute deviation of the values for the requested axis (removed in pandas 2.0) | apply | Applies a function along an axis of the DataFrame |
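
A few of the methods in the table that are not demonstrated in the cells below, sketched on a small Series holding the bonus values used throughout this notebook (the variable names are illustrative):

```python
import pandas as pd

bonus = pd.Series([10_000, 0, 18_000, 15_000],
                  index=["Jack", "Sara", "Paula", "Alex"])

top_label = bonus.idxmax()     # label of the largest value
running = bonus.cumsum()       # running total in index order
median = bonus.quantile(0.5)   # the 50% quantile, i.e. the median
steps = bonus.diff()           # change from the previous element (first is NaN)

print(top_label)      # Paula
print(list(running))  # [10000, 10000, 28000, 43000]
print(median)         # 12500.0
print(steps)
```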

In [None]:
df = pd.DataFrame({
    'Salary': {'Jack': 65_000, 'Sara': 0, 'Paula': 78_000},
    'Is Employed': {'Jack': True, 'Sara': False, 'Paula': True, 'Alex': True},
    'Bonus': {'Jack': 10_000, 'Sara': 0, 'Paula': 18_000, 'Alex': 15_000},
    'Age': {'Jack': 25},
})

display(df)

Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,65000.0,True,10000,25.0
Sara,0.0,False,0,
Paula,78000.0,True,18000,
Alex,,True,15000,


In [None]:
# Information
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Jack to Alex
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Salary       3 non-null      float64
 1   Is Employed  4 non-null      bool   
 2   Bonus        4 non-null      int64  
 3   Age          1 non-null      float64
dtypes: bool(1), float64(2), int64(1)
memory usage: 132.0+ bytes


In [None]:
# Descriptive statistics
df.describe()

Unnamed: 0,Salary,Bonus,Age
count,3.0,4.0,1.0
mean,47666.666667,10750.0,25.0
std,41789.153298,7889.866919,
min,0.0,0.0,25.0
25%,32500.0,7500.0,25.0
50%,65000.0,12500.0,25.0
75%,71500.0,15750.0,25.0
max,78000.0,18000.0,25.0


In [None]:
display(df.sum()) # calculate the sum of each column (True = 1, False = 0; NaN is skipped)
display(df.sum(axis='columns')) # calculate the sum of each row
display(df.loc[:, df.columns != 'Is Employed'].mean()) # calculate the mean of each column (excluding 'Is Employed')
display(df.mean(axis='columns', skipna=False)) # NaN is not skipped: every row containing a NaN yields NaN

Salary         143000.0
Is Employed         3.0
Bonus           43000.0
Age                25.0
dtype: float64

Jack     75026.0
Sara         0.0
Paula    96001.0
Alex     15001.0
dtype: float64

Salary    47666.666667
Bonus     10750.000000
Age          25.000000
dtype: float64

Jack     18756.5
Sara         NaN
Paula        NaN
Alex         NaN
dtype: float64

In [None]:
# Apply a function
display(df.apply(lambda col: col.max() - col.min())) # apply a lambda function to each column
display(df.apply(lambda row: np.divide(row.max(), row.min()), axis='columns')) # apply a lambda function to each row

Salary         78000.0
Is Employed        1.0
Bonus          18000.0
Age                0.0
dtype: float64


Jack     65000.0
Sara         NaN
Paula    78000.0
Alex     15000.0
dtype: float64
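
`apply` works the same way with a named function, which can be easier to read and reuse than a lambda. A minimal sketch on a reduced stand-in frame; `value_range` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame(
    {"Salary": [65_000.0, 0.0, 78_000.0], "Bonus": [10_000, 0, 18_000]},
    index=["Jack", "Sara", "Paula"],
)

def value_range(values):
    """Return the spread between the largest and smallest value."""
    return values.max() - values.min()

per_column = df.apply(value_range)               # called once per column
per_row = df.apply(value_range, axis="columns")  # called once per row

print(per_column)  # Salary 78000.0, Bonus 18000.0
print(per_row)     # Jack 55000.0, Sara 0.0, Paula 60000.0
```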

### Some Other Useful Functions

In [None]:
# Unique values for a Series
display(df['Is Employed'].unique())

# Count unique values
display(df['Is Employed'].value_counts())

array([ True, False])

True     3
False    1
Name: Is Employed, dtype: int64

In [None]:
s = pd.Series(['Jack', 'Paula', 'Alex', 'Sara'])
display(s.isin(['Jack', 'Pamela'])) # check if elements of a Series are in a list
display(df.isin([False, 0])) # check if elements of a DataFrame are in a list

0     True
1    False
2    False
3    False
dtype: bool

Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,False,False,False,False
Sara,True,True,True,False
Paula,False,False,False,False
Alex,False,False,False,False


In [None]:
df.fillna(0) # fills missing values by a value (returns a new DataFrame)

Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,65000.0,True,10000,25.0
Sara,0.0,False,0,0.0
Paula,78000.0,True,18000,0.0
Alex,0.0,True,15000,0.0
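
A single constant is not always the right replacement for every column. `fillna` also accepts a dictionary that maps column names to fill values. The sketch below fills a missing salary with the column mean and a missing age with 0; both choices are assumptions made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Salary": {"Jack": 65_000, "Sara": 0, "Paula": 78_000},
    "Bonus": {"Jack": 10_000, "Sara": 0, "Paula": 18_000, "Alex": 15_000},
    "Age": {"Jack": 25},
})

# One fill value per column; columns not listed are left untouched
filled = df.fillna({"Salary": df["Salary"].mean(), "Age": 0})

print(filled)
print(df["Age"].isnull().sum())  # 3: the original DataFrame is unchanged
```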


### Sorting

In [None]:
display(df.sort_index(ascending=False)) # sorting the row index 
display(df.sort_index(axis='columns')) # sort the column labels

Unnamed: 0,Salary,Is Employed,Bonus,Age
Sara,0.0,False,0,
Paula,78000.0,True,18000,
Jack,65000.0,True,10000,25.0
Alex,,True,15000,


Unnamed: 0,Age,Bonus,Is Employed,Salary
Jack,25.0,10000,True,65000.0
Sara,,0,False,0.0
Paula,,18000,True,78000.0
Alex,,15000,True,


In [None]:
display(df.Salary.sort_values()) # sorting a Series
display(df.sort_values(by='Salary')) # sorting DataFrame by one column
display(df.sort_values(by=['Age', 'Salary'], ascending=False)) # sorting DataFrame by multiple columns

Sara         0.0
Jack     65000.0
Paula    78000.0
Alex         NaN
Name: Salary, dtype: float64

Unnamed: 0,Salary,Is Employed,Bonus,Age
Sara,0.0,False,0,
Jack,65000.0,True,10000,25.0
Paula,78000.0,True,18000,
Alex,,True,15000,


Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,65000.0,True,10000,25.0
Paula,78000.0,True,18000,
Sara,0.0,False,0,
Alex,,True,15000,
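
When sorting by several columns, `ascending` may also be a list with one direction per column. A short sketch on a reduced stand-in frame (the name `ordered` is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Is Employed": [True, False, True, True],
     "Bonus": [10_000, 0, 18_000, 15_000]},
    index=["Jack", "Sara", "Paula", "Alex"],
)

# Unemployed first (False sorts before True), then largest bonus first
ordered = df.sort_values(by=["Is Employed", "Bonus"], ascending=[True, False])

print(list(ordered.index))  # ['Sara', 'Paula', 'Alex', 'Jack']
```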


In [None]:
display(df.Salary.rank()) # ranking a Series based on its values
display(df.rank()) # ranking within each DataFrame column
display(df.rank(axis=1)) # ranking within each DataFrame row

Jack     2.0
Sara     1.0
Paula    3.0
Alex     NaN
Name: Salary, dtype: float64

Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,2.0,3.0,2.0,1.0
Sara,1.0,1.0,1.0,
Paula,3.0,3.0,4.0,
Alex,,3.0,3.0,


Unnamed: 0,Salary,Is Employed,Bonus,Age
Jack,4.0,1.0,3.0,2.0
Sara,2.0,2.0,2.0,
Paula,3.0,1.0,2.0,
Alex,,1.0,2.0,


In [None]:
# Breaking the ties
display(df['Is Employed'].rank(ascending=False, method='first')) # assign ranks in the order the values appear in the data
display(df['Is Employed'].rank(ascending=False, method='max')) # use the maximum rank for the whole group
display(df['Is Employed'].rank(ascending=False, method='min')) # use the minimum rank for the whole group
display(df['Is Employed'].rank(ascending=False, method='average')) # assign the average rank to each entry in the equal group

Jack     1.0
Sara     4.0
Paula    2.0
Alex     3.0
Name: Is Employed, dtype: float64

Jack     3.0
Sara     4.0
Paula    3.0
Alex     3.0
Name: Is Employed, dtype: float64

Jack     1.0
Sara     4.0
Paula    1.0
Alex     1.0
Name: Is Employed, dtype: float64

Jack     2.0
Sara     4.0
Paula    2.0
Alex     2.0
Name: Is Employed, dtype: float64
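
One more tie-breaking option worth knowing is `method='dense'`, which behaves like `'min'` but leaves no gaps after a tied group. The sketch below uses a made-up `scores` Series so that the difference is visible.

```python
import pandas as pd

# Hypothetical scores with a tie for first place
scores = pd.Series([90, 75, 90, 60], index=["Jack", "Sara", "Paula", "Alex"])

min_rank = scores.rank(ascending=False, method="min")      # ties share the lowest rank, then a gap
dense_rank = scores.rank(ascending=False, method="dense")  # ties share a rank, no gap afterwards

print(list(min_rank))    # [1.0, 3.0, 1.0, 4.0]
print(list(dense_rank))  # [1.0, 2.0, 1.0, 3.0]
```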