Pandas provides high-performance, easy-to-use data structures and data analysis tools for Python.
Think of it as an extremely powerful version of Excel with a lot more features.

**1) pandas Data structures**

Series and DataFrame are the two main data structures in pandas.

**1.1) Series**
A Series is a 1-D array-like object that contains values and an array of labels (the index) associated with those values.
A Series is built on top of the NumPy array object.
A Series can hold any arbitrary Python object.

In [126]:
import numpy as np
import pandas as pd

In [127]:
print('numpy version: ', np.__version__)
print('pandas version: ', pd.__version__)

numpy version:  1.23.5
pandas version:  1.5.2


We can create a Series from a list, a NumPy array or a dictionary.
Let's create these objects and convert them into pandas Series!

In [128]:
my_labels = ['x', 'y', 'z']
my_data = [100, 200, 300]

In [129]:
# We can use pd.Series (with Capital S) for conversion
pd.Series(data = my_data)

0    100
1    200
2    300
dtype: int64

The left-hand column (0, 1, 2) is the automatically generated index for the values 100, 200 and 300.
We can also specify our own index labels and then grab the respective values using those labels.

In [130]:
pd.Series(data = my_data, index = my_labels)

x    100
y    200
z    300
dtype: int64
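
As a small illustration of grabbing values through a custom index, the sketch below rebuilds the labelled Series from the cell above and reads one value both by label and by integer position. The names `labelled`, `by_label` and `by_position` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Same data and labels as in the cells above
my_labels = ['x', 'y', 'z']
my_data = [100, 200, 300]
labelled = pd.Series(data=my_data, index=my_labels)

by_label = labelled['y']        # look up a value by its index label
by_position = labelled.iloc[1]  # look up the same value by integer position
print(by_label, by_position)
```

Using `.iloc` for positions keeps the intent explicit when the index holds labels rather than numbers.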

Series using NumPy arrays

In [131]:
# Let's create NumPy array from my_data and then Series from that
# array
my_array = np.array(my_data)
pd.Series(data = my_array)

0    100
1    200
2    300
dtype: int32

Series using dictionary

In [132]:
# Let's create a dictionary my_dict
my_dict = {'x': 100, 'y': 200, 'z': 300}
pd.Series(my_dict)

x    100
y    200
z    300
dtype: int64

A Series can hold a wide variety of object types. Let's see some examples.

In [133]:
# Let's pass 'my_labels' as data now
pd.Series(data = my_labels)

0    x
1    y
2    z
dtype: object

In [134]:
# We can pass a list of built-in functions!
pd.Series([min, max, sum, print])

0      <built-in function min>
1      <built-in function max>
2      <built-in function sum>
3    <built-in function print>
dtype: object

Grabbing data from Series:
The index is the key thing to understand in a Series. Pandas uses the index labels (numbers or names) for fast lookup.
Looking up a value by its index label works much like looking up a key in a hash table or dictionary.

In [135]:
# Creating three dictionaries dict_1, dict_2, dict_3
dict_1 = {'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dict_2 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dict_3 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700, 'Jasper': 1000}

In [136]:
# Create pandas series from the dictionaries
ser1 = pd.Series(dict_1)
ser2 = pd.Series(dict_2)
ser3 = pd.Series(dict_3)

In [137]:
ser1

Toronto      500
Calgary      200
Vancouver    300
Montreal     700
dtype: int64

In [138]:
# Grabbing a value from a Series works much like a dictionary lookup
ser1['Calgary']

200
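
The dictionary analogy extends a little further. The sketch below, which rebuilds `ser1` so that it runs on its own, shows membership tests against the index and a lookup with a default value. The variable names `has_calgary`, `has_jasper` and `jasper_or_default` are illustrative only.

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

has_calgary = 'Calgary' in ser1            # `in` tests the index labels, not the values
has_jasper = 'Jasper' in ser1
jasper_or_default = ser1.get('Jasper', 0)  # .get returns the default instead of raising KeyError
print(has_calgary, has_jasper, jasper_or_default)
```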

Basic operations on Series are usually aligned on the index.
(A Series built from a dictionary takes its index from the dictionary keys.)

For example, when we add ser1 + ser2, pandas matches up the values by index label.
For Calgary, Montreal and Vancouver it adds the values, whereas Toronto has no match in ser2, so the result there is NaN. Because NaN is a floating-point value, the result has dtype float64.

In [139]:
ser4 = ser1 + ser2
ser4

Calgary       400.0
Montreal     1400.0
Toronto         NaN
Vancouver     600.0
dtype: float64
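
If a missing label should count as zero rather than produce NaN, one option is the `add` method with `fill_value`. The following is a minimal sketch, assuming that treating an absent city as 0 is what you want; the name `filled` is introduced here for illustration.

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
ser2 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

# Labels present in only one Series are treated as 0 in the other before adding
filled = ser1.add(ser2, fill_value=0)
print(filled)
```

Whether 0 is a sensible substitute depends on the data; for many datasets keeping the NaN is the more honest result.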

In [140]:
ser3 # Calgary, Vancouver, Montreal, Jasper appear in the order they were passed in

Calgary       200
Vancouver     300
Montreal      700
Jasper       1000
dtype: int64

In [141]:
ser5 = ser4 + ser3
ser5

Calgary       600.0
Jasper          NaN
Montreal     2100.0
Toronto         NaN
Vancouver     900.0
dtype: float64

Good to know!
Below are some commonly used Series methods and attributes for data processing.

In [142]:
# isnull() - detect missing data
ser4.isnull()

Calgary      False
Montreal     False
Toronto       True
Vancouver    False
dtype: bool

In [143]:
# notnull() - detect existing (non-null) values
# pd.notnull(ser5) is same as ser5.notnull()
ser5.notnull()

Calgary       True
Jasper       False
Montreal      True
Toronto      False
Vancouver     True
dtype: bool
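
The boolean Series returned by `notnull()` can be used directly as a filter. The sketch below rebuilds `ser5` from the same dictionaries so that it is self-contained, then keeps only the non-missing entries in two equivalent ways. The names `present` and `dropped` are illustrative.

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
ser2 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
ser3 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700, 'Jasper': 1000})
ser5 = (ser1 + ser2) + ser3

present = ser5[ser5.notnull()]  # keep entries where the mask is True
dropped = ser5.dropna()         # built-in shortcut with the same result
print(present)
```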

In [144]:
# head(), tail()
# to view a small sample of a Series or DataFrame,
# use head() and tail()
# The default number of elements to display is 5, but you
# can pass in a custom number
ser1.head(1)

Toronto    500
dtype: int64

In [145]:
ser1.tail(1)

Montreal    700
dtype: int64

In [146]:
# axes - returns a list containing the row axis labels (the index)
ser1.axes

[Index(['Toronto', 'Calgary', 'Vancouver', 'Montreal'], dtype='object')]

In [147]:
# values - returns the values/data as a NumPy array
ser1.values

array([500, 200, 300, 700], dtype=int64)

In [148]:
# size - returns the number of elements in the series
ser1.size

4

In [149]:
# empty - True if the series is empty
ser1.empty

False

**1.2) DataFrame**
A very simple way to think about a DataFrame is as a bunch of Series put together so that they share the same index.
A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (string, numeric, boolean etc.).
A DataFrame has both a row index and a column index. It can be thought of as a dictionary of Series all sharing the same row index.
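
To make the "dictionary of Series" picture concrete, here is a minimal sketch that builds a DataFrame from two of the Series used earlier. The variable name `cities` and the column names `'ser1'` and `'ser2'` are chosen for this illustration.

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
ser2 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

# Each dictionary key becomes a column; rows are aligned on the index labels
cities = pd.DataFrame({'ser1': ser1, 'ser2': ser2})
print(cities)
```

Toronto appears only in `ser1`, so its entry in the `ser2` column is NaN, the same alignment behaviour seen when adding Series.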

Let's create two lists of labels:
index: for rows 'r1 to r10'
columns: for columns 'c1 to c10'

In [150]:
index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

In [151]:
index

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [152]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

Let's start with a simple example. First, let's create a matrix using arange() and reshape().

In [153]:
array_2d = np.arange(0, 100).reshape(10, 10)
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [154]:
# Now, let's create our first DataFrame using index, columns and
# array_2d
df = pd.DataFrame(data = array_2d, index = index, columns = columns)

In [155]:
df

     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r1    0   1   2   3   4   5   6   7   8    9
r2   10  11  12  13  14  15  16  17  18   19
r3   20  21  22  23  24  25  26  27  28   29
r4   30  31  32  33  34  35  36  37  38   39
r5   40  41  42  43  44  45  46  47  48   49
r6   50  51  52  53  54  55  56  57  58   59
r7   60  61  62  63  64  65  66  67  68   69
r8   70  71  72  73  74  75  76  77  78   79
r9   80  81  82  83  84  85  86  87  88   89
r10  90  91  92  93  94  95  96  97  98   99


**df** is our first dataframe. We have columns, c1 to c10 and their corresponding rows, r1 to r10.
Each column is actually a pandas series, sharing a common index, which is the row labels.

Grabbing Columns from dataframe - grabbing data

In [156]:
# Grabbing a single column
df['c1']

r1      0
r2     10
r3     20
r4     30
r5     40
r6     50
r7     60
r8     70
r9     80
r10    90
Name: c1, dtype: int32

Did you notice? The output above looks like a Series, right?
The returned Series has the same index as the DataFrame.

In [157]:
type(df['c1'])

pandas.core.series.Series

In [158]:
# We can grab more than one column
# simply pass in the list of columns you need!
df[['c1', 'c10']]

     c1  c10
r1    0    9
r2   10   19
r3   20   29
r4   30   39
r5   40   49
r6   50   59
r7   60   69
r8   70   79
r9   80   89
r10  90   99


df.<column_name> can be used to grab a column as well. 
e.g. df.c1, df.c2 etc

In [159]:
df.c5

r1      4
r2     14
r3     24
r4     34
r5     44
r6     54
r7     64
r8     74
r9     84
r10    94
Name: c5, dtype: int32
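
Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical frame `demo` (not part of the original notebook) with one column named `count` and one containing a space to show where bracket access is the safer choice.

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with a DataFrame method, 'unit price' contains a space
demo = pd.DataFrame({'count': [1, 2, 3], 'unit price': [9.5, 3.0, 4.25]})

via_brackets = demo['count']   # always returns the column as a Series
via_attribute = demo.count     # resolves to the DataFrame.count method, not the column
price_total = demo['unit price'].sum()  # names with spaces are only reachable with brackets

print(type(via_brackets).__name__, callable(via_attribute), price_total)
```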

Adding a new column to dataframe

Let's add a new column 'new' into our dataframe df by adding any two existing columns using a simple "+" operator!

In [160]:
df['new'] = df['c1'] + df['c2']
df

     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  new
r1    0   1   2   3   4   5   6   7   8    9    1
r2   10  11  12  13  14  15  16  17  18   19   21
r3   20  21  22  23  24  25  26  27  28   29   41
r4   30  31  32  33  34  35  36  37  38   39   61
r5   40  41  42  43  44  45  46  47  48   49   81
r6   50  51  52  53  54  55  56  57  58   59  101
r7   60  61  62  63  64  65  66  67  68   69  121
r8   70  71  72  73  74  75  76  77  78   79  141
r9   80  81  82  83  84  85  86  87  88   89  161
r10  90  91  92  93  94  95  96  97  98   99  181


Deleting column from dataframe

**drop()** - We can delete any column from a dataframe using the drop() method.
A few important parameters to consider:
- labels: the column name to drop; to drop more than one column, pass a list of column names
- axis: default value is 0, which refers to rows. To drop a column, we need to pass axis = 1
- inplace: default is False, in which case drop() returns a new dataframe and leaves the original unchanged. Pass True to modify the dataframe itself. This default makes sure that we don't delete a column by mistake
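
The effect of leaving `inplace` at its default can be seen in the following sketch. It uses a small hypothetical frame `small` (not the notebook's `df`) and also shows the `columns=` keyword, an alternative to passing `axis = 1`.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame used only for this illustration
small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

trimmed = small.drop('c3', axis=1)        # returns a new DataFrame; 'small' is unchanged
same_result = small.drop(columns=['c3'])  # equivalent spelling without axis

print(list(small.columns))    # the original still has all three columns
print(list(trimmed.columns))
```

Assigning the returned frame to a name, as done with `trimmed`, is a common alternative to `inplace = True`.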

In [161]:
# Quick checks!
df.shape

(10, 11)

So we have 10 rows and 11 columns. Let's try deleting the "new" column.

In [162]:
df.drop(['new'], axis = 1, inplace = True)
df

     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r1    0   1   2   3   4   5   6   7   8    9
r2   10  11  12  13  14  15  16  17  18   19
r3   20  21  22  23  24  25  26  27  28   29
r4   30  31  32  33  34  35  36  37  38   39
r5   40  41  42  43  44  45  46  47  48   49
r6   50  51  52  53  54  55  56  57  58   59
r7   60  61  62  63  64  65  66  67  68   69
r8   70  71  72  73  74  75  76  77  78   79
r9   80  81  82  83  84  85  86  87  88   89
r10  90  91  92  93  94  95  96  97  98   99


Grabbing rows from dataframe

We can retrieve rows by label or by position with loc and iloc
**loc**: access rows by their label(s)
**iloc**: access rows by their integer position(s)

In [163]:
df.loc[['r2', 'r3']]

    c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r2  10  11  12  13  14  15  16  17  18   19
r3  20  21  22  23  24  25  26  27  28   29


In [164]:
# Using iloc
df.iloc[[1, 2]]

    c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r2  10  11  12  13  14  15  16  17  18   19
r3  20  21  22  23  24  25  26  27  28   29
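
Both accessors also accept slices, with one difference worth knowing: a label slice with `loc` includes the end label, while a position slice with `iloc` excludes the end position, as in ordinary Python slicing. The sketch below rebuilds `df` so that it runs on its own; `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()
df = pd.DataFrame(data=np.arange(0, 100).reshape(10, 10), index=index, columns=columns)

by_label = df.loc['r2':'r4']   # label slice: r4 is included
by_position = df.iloc[1:3]     # position slice: position 3 (r4) is excluded

print(list(by_label.index))
print(list(by_position.index))
```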


Grabbing a single element from a dataframe

In [165]:
# we need to tell the location of the element [row, col]
df.loc['r1', 'c1']

0

In [166]:
# another element, which is at [r2, c10]
df.loc['r2', 'c10']

19
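
For single values, pandas also offers the scalar accessors `at` (by label) and `iat` (by integer position). This is a minimal sketch alongside the `loc` call above, with `df` rebuilt so the block is self-contained; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()
df = pd.DataFrame(data=np.arange(0, 100).reshape(10, 10), index=index, columns=columns)

value_by_label = df.at['r2', 'c10']  # row label, column label
value_by_position = df.iat[1, 9]     # row position, column position (both zero-based)
print(value_by_label, value_by_position)
```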

Grabbing a subset of a dataframe.

In [167]:
# We can grab a subset by passing list of required
# rows and list of required columns
df.loc[['r1', 'r2'], ['c1', 'c2']]

    c1  c2
r1   0   1
r2  10  11


In [168]:
# another example
df.loc[['r2', 'r5'], ['c3', 'c4']]

    c3  c4
r2  12  13
r5  42  43


Conditional selection or masking

In [169]:
# this is similar to numpy bool masking!
df > 5

        c1     c2     c3     c4     c5     c6     c7     c8     c9    c10
r1   False  False  False  False  False  False   True   True   True   True
r2    True   True   True   True   True   True   True   True   True   True
r3    True   True   True   True   True   True   True   True   True   True
r4    True   True   True   True   True   True   True   True   True   True
r5    True   True   True   True   True   True   True   True   True   True
r6    True   True   True   True   True   True   True   True   True   True
r7    True   True   True   True   True   True   True   True   True   True
r8    True   True   True   True   True   True   True   True   True   True
r9    True   True   True   True   True   True   True   True   True   True
r10   True   True   True   True   True   True   True   True   True   True


In [170]:
df != 0

        c1     c2     c3     c4     c5     c6     c7     c8     c9    c10
r1   False   True   True   True   True   True   True   True   True   True
r2    True   True   True   True   True   True   True   True   True   True
r3    True   True   True   True   True   True   True   True   True   True
r4    True   True   True   True   True   True   True   True   True   True
r5    True   True   True   True   True   True   True   True   True   True
r6    True   True   True   True   True   True   True   True   True   True
r7    True   True   True   True   True   True   True   True   True   True
r8    True   True   True   True   True   True   True   True   True   True
r9    True   True   True   True   True   True   True   True   True   True
r10   True   True   True   True   True   True   True   True   True   True


In [171]:
# let's create a bool_mask for values that are divisible by 3
bool_mask = (df % 3) == 0
df[bool_mask]

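       c1    c2    c3    c4    c5    c6    c7    c8    c9   c10
r1    0.0   NaN   NaN   3.0   NaN   NaN   6.0   NaN   NaN   9.0
r2    NaN   NaN  12.0   NaN   NaN  15.0   NaN   NaN  18.0   NaN
r3    NaN  21.0   NaN   NaN  24.0   NaN   NaN  27.0   NaN   NaN
r4   30.0   NaN   NaN  33.0   NaN   NaN  36.0   NaN   NaN  39.0
r5    NaN   NaN  42.0   NaN   NaN  45.0   NaN   NaN  48.0   NaN
r6    NaN  51.0   NaN   NaN  54.0   NaN   NaN  57.0   NaN   NaN
r7   60.0   NaN   NaN  63.0   NaN   NaN  66.0   NaN   NaN  69.0
r8    NaN   NaN  72.0   NaN   NaN  75.0   NaN   NaN  78.0   NaN
r9    NaN  81.0   NaN   NaN  84.0   NaN   NaN  87.0   NaN   NaN
r10  90.0   NaN   NaN  93.0   NaN   NaN  96.0   NaN   NaN  99.0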
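
Indexing a dataframe with a boolean dataframe behaves like the `where` method: values are kept where the mask is True and replaced with NaN elsewhere, which is also why the integers turn into floats. The sketch below checks that equivalence and counts the matches per row; `masked` and `per_row` are names introduced for this illustration.

```python
import numpy as np
import pandas as pd

index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()
df = pd.DataFrame(data=np.arange(0, 100).reshape(10, 10), index=index, columns=columns)

bool_mask = (df % 3) == 0
masked = df.where(bool_mask)     # keep values where the mask is True, NaN elsewhere
per_row = bool_mask.sum(axis=1)  # True counts as 1, so this counts multiples of 3 per row

print(masked.equals(df[bool_mask]))
print(per_row)
```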


Good to know:
It's not common to use such a masking operation on an entire dataframe.
We usually apply a mask to a column or to rows instead.
For example, we don't want rows with NaN values.
What to do? Let's have a look at an example.

In [172]:
df

     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r1    0   1   2   3   4   5   6   7   8    9
r2   10  11  12  13  14  15  16  17  18   19
r3   20  21  22  23  24  25  26  27  28   29
r4   30  31  32  33  34  35  36  37  38   39
r5   40  41  42  43  44  45  46  47  48   49
r6   50  51  52  53  54  55  56  57  58   59
r7   60  61  62  63  64  65  66  67  68   69
r8   70  71  72  73  74  75  76  77  78   79
r9   80  81  82  83  84  85  86  87  88   89
r10  90  91  92  93  94  95  96  97  98   99


We want to select rows based on a condition on one column.
For example, we want to grab all the rows for which the value in column c1 is greater than 11.

In [173]:
df[['c1']] > 11

        c1
r1   False
r2   False
r3    True
r4    True
r5    True
r6    True
r7    True
r8    True
r9    True
r10   True


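So we don't want r1 and r2, as the condition is False there. Masking with a whole dataframe would fill those positions with NaN (see above), whereas a boolean Series built from a single column drops the rows entirely.
Let's filter the rows based on the condition on column c1 (c1 > 11).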

In [174]:
df[df['c1'] > 11] # df[boolean_mask]

     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
r3   20  21  22  23  24  25  26  27  28   29
r4   30  31  32  33  34  35  36  37  38   39
r5   40  41  42  43  44  45  46  47  48   49
r6   50  51  52  53  54  55  56  57  58   59
r7   60  61  62  63  64  65  66  67  68   69
r8   70  71  72  73  74  75  76  77  78   79
r9   80  81  82  83  84  85  86  87  88   89
r10  90  91  92  93  94  95  96  97  98   99


In [175]:
# Let's store our result
bool_ser = df['c1'] > 11
result = df[bool_ser]
result['c1']

r3     20
r4     30
r5     40
r6     50
r7     60
r8     70
r9     80
r10    90
Name: c1, dtype: int32
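
Row filters can also combine several conditions. The sketch below is an illustration that goes slightly beyond the cells above: it uses `&` (and) and `|` (or) with each condition wrapped in parentheses, which is required because these operators bind more tightly than the comparisons. The names `both` and `either` are chosen here.

```python
import numpy as np
import pandas as pd

index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()
df = pd.DataFrame(data=np.arange(0, 100).reshape(10, 10), index=index, columns=columns)

# Rows where c1 is greater than 11 AND c10 is less than 80
both = df[(df['c1'] > 11) & (df['c10'] < 80)]

# Rows where c1 is less than 11 OR greater than 80
either = df[(df['c1'] < 11) | (df['c1'] > 80)]

print(list(both.index))
print(list(either.index))
```

Python's `and` and `or` keywords do not work element-wise on Series, so the `&` and `|` operators are used instead.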