

## Basics of Pandas 
1. [Commonly used functions](#1-Commonly-used-functions)
2. [Series](#2-Series)
3. [Dataframes](#3-Data-Frames)
4. [Index Objects](#4-Index-objects)
5. [Data Selections](#5-Data-Selections)
   - [Boolean Maskings](#51-Boolean-Maskings)
   - [Implicit Explicit Selection](#52-Implict-Explicit-Selection)
   - [Operations with Dataframes](#53-Operations-with-Dataframes)



In [11]:
import numpy as np 
import pandas as pd 

#### 1 Commonly used functions  

In [18]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

df.dtypes # dtype of each column 

A    int64
B    int64
dtype: object

In [29]:
df.info() # concise summary: index range, non-null counts, dtype of each column, and memory usage 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes


In [28]:
df.shape  # shape of the DataFrame => recall that shape = "length" of each dimension, here (rows, cols) 

(3, 2)

In [24]:
df.size # size of the DataFrame => recall that size = "volume" of the DataFrame, i.e. rows * cols 

6

In [27]:
df.describe() # common statistics of numerical columns 

         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


In [31]:
df.columns # columns of the df (features)

Index(['A', 'B'], dtype='object')

#### Helpful Reminder 


`shape` => length of each side, as (rows, columns) 

`size` => volume of the shape, i.e. rows × columns 
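
The reminder above can be checked directly. The following is a minimal sketch that rebuilds the same small `df` and compares `size` with the product of the two `shape` entries; `n_rows` and `n_cols` are illustrative names, not part of pandas.

```python
import pandas as pd

# Same toy DataFrame as in the cells above.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape is a tuple with the length of each dimension: (rows, columns).
n_rows, n_cols = df.shape

# size is the total number of cells, i.e. the "volume".
total_cells = df.size

# len() only counts rows.
row_count = len(df)

print(df.shape, total_cells, row_count)
```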

--------------------------------------------------------------------------

In [33]:
df.rows # raises AttributeError: a DataFrame has no `rows` attribute; row labels are exposed through df.index instead 

AttributeError: 'DataFrame' object has no attribute 'rows'

In [37]:
df.index # gets the index of the df; in this case it is the default integer RangeIndex 

RangeIndex(start=0, stop=3, step=1)

#### 2 Series 

In [2]:
# the most basic structure: a pandas Series is a glorified array, in the sense that it carries an explicit index 
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data 

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data 

# another way to think about a Series is as a dict of key-value pairs (index label -> value) 

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [7]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)

population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [10]:
population['California':'New York'] # slicing uses the index labels (explicit) rather than positions, and unlike positional slicing the end label is included 

# Series => dict-like construction + array-like slicing 

California    38332521
Texas         26448193
New York      19651127
dtype: int64
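
The "dict plus array" description above can be tried out side by side. This is a small sketch using a shortened, three-state version of the `population` Series (shortened only to keep the example compact); the variable names are illustrative.

```python
import pandas as pd

# Shortened version of the population Series above (three states only).
population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
})

# Dict-like behaviour: membership tests and lookups use the index labels.
has_texas = 'Texas' in population
texas_pop = population['Texas']
labels = list(population.keys())

# Array-like behaviour: positional slicing and vectorized arithmetic.
first_two = population.iloc[0:2]
in_millions = population / 1e6

print(has_texas, texas_pop)
print(labels)
print(first_two)
print(in_millions.round(1))
```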

#### 3 Data Frames 

A DataFrame is a generalization of the Series: instead of one dimension (index + value), there are now multiple features (columns) that share one index.

The most intuitive way to build one is therefore to align several Series on the same index.

In [41]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

area_dict = {
    'California': 423967, 
    'Texas': 695662, 
    'New York': 141297,
    'Florida': 170312, 
    'Illinois': 149995
}

states = pd.DataFrame({'population': population, 'area': area_dict}) # here we pass one Series and one dict; pandas aligns both on the same index 
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
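
Alignment on the index is easiest to see when the inputs do not share exactly the same labels. The sketch below is an illustration under that assumption: it reuses figures from the dictionaries above, but deliberately leaves Florida out of `population` and New York out of `area`.

```python
import pandas as pd

# Two Series whose indexes only partly overlap.
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
area = pd.Series({'California': 423967, 'Texas': 695662, 'Florida': 170312})

# pandas aligns on the union of the labels; missing entries become NaN.
states = pd.DataFrame({'population': population, 'area': area})
print(states)

# Boolean Series marking the rows that have no area value.
missing_area = states['area'].isna()
print(missing_area)
```

Because NaN is a floating-point value, columns with missing entries are stored as floats rather than integers.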


In [46]:
states['population'] # note that this selects the first COLUMN ('population') together with its associated index, rather than the first row 

# Conventionally, on a 2D array, mat[0] returns the first ROW, since indexing is mat[row][col]

# A DataFrame instead follows the intuition that columns = features: 
# indexing with a column name returns that feature, and the index is kept so each value still says which row it refers to

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
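
To make the column-versus-row contrast concrete, here is a minimal sketch on a two-state version of `states` (shortened for brevity). It compares plain `[]` selection, label-based row selection with `loc`, and the NumPy-style view of the same data.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# [] with a column name returns that feature for every row.
column = states['population']

# A single row is selected by its index label through loc.
row = states.loc['California']

# The underlying NumPy array behaves like mat[row][col]: [0] is the first ROW.
first_row_array = states.to_numpy()[0]

# Membership tests on a DataFrame look at the column labels, not the index.
is_column = 'population' in states
is_row_label_a_column = 'California' in states

print(column)
print(row)
print(first_row_array)
```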

#### 4 Index objects 

The Index is the foundation of everything in pandas, since it is what we generally use to refer to the data, rather than the features.

In [83]:
ind = pd.Index([2, 3, 5, 7, 11])
ind 

Index([2, 3, 5, 7, 11], dtype='int64')

In [50]:
# An Index supports many of the same operations as an array (only the last expression's value is displayed) 

ind[1]

ind[:1]

ind.size
ind.shape
ind.ndim # number of dimensions
ind.dtype


dtype('int64')

In [52]:
ind[1] = 0 # not allowed: an Index is immutable, so its labels cannot be changed in place

TypeError: Index does not support mutable operations
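
Since an Index cannot be modified in place, the usual workaround is to build a new one. The sketch below shows one possible way, using `Index.delete` and `Index.insert`; it is an illustration, not the only approach.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Item assignment fails because Index objects are immutable.
try:
    ind[1] = 0
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)

# delete() and insert() each return a NEW Index; the original is untouched.
new_ind = ind.delete(1).insert(1, 0)

print(list(ind))
print(list(new_ind))
```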

#### 5 Data Selections

In [53]:
# For a Series, think of it as the key-value pairs of a dict 

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [None]:
for key, val in data.items(): # very similar to dicts ! 
    print("key", key)
    print("val", val, "\n")

key a
val 0.25 

key b
val 0.5 

key c
val 0.75 

key d
val 1.0 



In [58]:
# explicit: slice by the NAME (label) of the index 
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [59]:
# implicit: slice by positional index. Note that even when the index holds other labels such as 'a', 'b', 'c', ..., each element still inherently has an integer position attached to it.
data[0:2]

a    0.25
b    0.50
dtype: float64

##### 5.1 Boolean Maskings

In [None]:
data[(data > 0.3) & (data < 0.8)] # masking 

# Note that masking is a technique to FILTER a Series or DataFrame. 
# In this example, we are using a BOOLEAN MASK: each condition returns a BOOLEAN SERIES, and only the elements where the mask is True are kept. 

b    0.50
c    0.75
dtype: float64

In [64]:
(data > 0.3)

a    False
b     True
c     True
d     True
dtype: bool

In [65]:
(data < 0.8)

a     True
b     True
c     True
d    False
dtype: bool

In [68]:
# Since the masks are boolean, we can just combine them with bitwise operators: this returns the elements where AND gives True for both conditions 
data[(data > 0.3) & (data < 0.8)] 

# The Python keywords and/or/not only work on single truth values, so they fail on a boolean Series. Hence we must use the element-wise bitwise operators &, |, ~.

b    0.50
c    0.75
dtype: float64
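
The difference between the keyword and bitwise operators can be seen by running both. A minimal sketch on the same `data` Series, with illustrative variable names:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# `and` asks for the truth value of a whole Series, which is ambiguous.
try:
    data[(data > 0.3) and (data < 0.8)]
    and_worked = True
except ValueError as err:
    and_worked = False
    print("ValueError:", err)

# Element-wise versions: & (and), | (or), ~ (not).
# The parentheses matter because & and | bind tighter than comparisons.
both = data[(data > 0.3) & (data < 0.8)]
either = data[(data < 0.3) | (data > 0.8)]
negated = data[~(data > 0.3)]

print(both)
print(either)
print(negated)
```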

##### 5.2 Implicit Explicit Selection

In [None]:
# Previously we saw that:
# implicit = using the default positional (0, 1, 2, ...) indexing 
# explicit = using the actual labels of the index (which may or may not be integers)

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# But an integer index creates a source of confusion: when indexing a single element, the number is treated as an explicit label. 
# When slicing, the same numbers are treated as IMPLICIT positions.

In [None]:
data[1] # explicit index when indexing 

'a'

In [71]:
data[1:3] # implicit when slicing: 1:3 means positions 1 and 2, not the labels 1 and 3 

3    b
5    c
dtype: object

In [76]:
# To solve this, we use loc and iloc 
# loc = explicit (labels) 
# iloc = implicit (positions) 
data.loc[1] # explicitly say that we use the VALUE of the index, which is 1:a 

'a'

In [77]:
data.loc[1:3] # explicitly say that we use the VALUES of the index, which are 1:a + 3:b (the end label is included)

1    a
3    b
dtype: object

In [79]:
data.iloc[1] # explicitly say that we use the implicit POSITION of the element, which here refers to the second element in the Series 

'b'

In [80]:
data.iloc[1:3]

3    b
5    c
dtype: object
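
One detail worth isolating from the cells above is how the two accessors treat the end of a slice. A short sketch on the same `data` Series:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc slices by label and INCLUDES the end label.
by_label = data.loc[1:3]

# iloc slices by position and EXCLUDES the end position, like a Python list.
by_position = data.iloc[1:3]

print(by_label)
print(by_position)
```

The same `1:3` therefore selects different elements depending on the accessor, which is why spelling out `loc` or `iloc` removes the ambiguity.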

In [82]:
"""
As a rule of thumb: Explicit > Implicit
"""


'\nAs a rule of thumb: Explicit > Implicit\n'

##### 5.3 Operations with Dataframes

This section covers commonly used operations with DataFrames.
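
As one possible illustration (the specific operations are an assumption, since the section does not list any), the sketch below derives a new column from two existing ones and then filters on it. It uses a two-state version of the `states` data from earlier; `density` and `dense` are hypothetical names.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# Column arithmetic is element-wise and aligned on the shared index.
states['density'] = states['population'] / states['area']

# A boolean mask built from a column filters whole rows.
dense = states[states['density'] > 50]

print(states)
print(dense)
```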