### <center> Pandas
<center> library used for working with datasets

- Pandas has functions for analyzing, cleaning, exploring & manipulating data
- The name "pandas" refers to both ` "Panel Data" ` and ` "Python Data Analysis" `
- It was created by ` Wes McKinney ` in 2008

#### Package Overview
- pandas is a Python package that provides fast, flexible and expressive data structures designed to make working with ` "relational" ` or ` "labelled" ` data easy & intuitive
- It has two primary data structures: ` Series ` & `DataFrame`

#### Things that Pandas does well: 
- easy handling of `missing data` (represented as NaN)
- ` size mutability `: columns can be inserted & deleted from DataFrames
- automatic & explicit ` data alignment `
- powerful & flexible ` group by ` functionality: to perform ` split-apply-combine ` operations on datasets, for both ` aggregating & transforming ` data
- Intelligent label-based ` slicing, fancy indexing & subsetting ` of large datasets
- Intuitive ` merging & joining ` datasets
- flexible ` reshaping ` and pivoting of datasets
- ` hierarchical labelling ` of axes
- ` time series-specific functionality `: date range generation & frequency conversion, moving window statistics, date shifting and lagging
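
Two of the points above, missing-data handling and split-apply-combine, can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook; the `sales` frame and its column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the names and values are invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10.0, np.nan, 30.0, 5.0],
    }
)

# Missing data: NaN marks the missing value, fillna() replaces it.
filled = sales["units"].fillna(0.0)

# Split-apply-combine: split the rows by region, sum each group,
# then combine the per-group results into one Series.
# sum() skips NaN by default.
totals = sales.groupby("region")["units"].sum()

print(filled)
print(totals)
```

Here `totals` is indexed by region, so `totals["north"]` gives the sum for that group.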

In [1]:
# importing pandas
import numpy as np  # NumPy is commonly imported alongside pandas
import pandas as pd # pd is standard pandas alias

### <center> <font color = 'orange'> Basic data structures in pandas

In [2]:
# Series
# 1D labelled array holding data of any type

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
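
The "labelled" part of a Series matters most when two Series are combined. The sketch below, with made-up labels and values, shows that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two Series with partly overlapping labels (illustrative values only).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Values are aligned on their labels before adding. A label that
# appears in only one of the two Series yields NaN in the result.
total = a + b

print(total)
```

Only the shared labels `y` and `z` get a numeric result; `x` and `w` come out as NaN.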

In [3]:
# DataFrame
# 2D data structure that holds data like a 2D array 
# or a table with rows & columns

df1 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df1

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
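
One way to check that each column of `df1` keeps its own type is to inspect `df.dtypes`, which the cells above do not show. The frame is rebuilt here so the sketch runs on its own. The exact names reported for the datetime and string columns can differ between pandas versions, so they are not spelled out.

```python
import numpy as np
import pandas as pd

# Same construction as df1 above, repeated so this block is self-contained.
df1 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# dtypes is a Series mapping each column label to that column's dtype.
dtypes = df1.dtypes
print(dtypes)
```

Scalars such as `1.0` and `"foo"` are broadcast to the length of the longest column, which is why every column ends up with four rows.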


### <center> <font color = 'orange'> Viewing data

In [4]:
# df.head()
# first n rows of the dataframe (n defaults to 5)

df1.head()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [5]:
# df.tail()
# last n rows of the dataframe (n defaults to 5)

df1.tail()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
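
Because `df1` has only four rows, `head()` and `tail()` both return the whole frame. A slightly longer, made-up frame makes the effect of the row-count argument visible:

```python
import pandas as pd

# Hypothetical ten-row frame, used only to make the row counts visible.
frame = pd.DataFrame({"n": range(10)})

first_three = frame.head(3)   # rows at positions 0, 1, 2
last_two = frame.tail(2)      # rows at positions 8, 9
default_head = frame.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```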


In [6]:
# df.index
# the row labels of the dataframe, returned as an Index object

df1.index

Index([0, 1, 2, 3], dtype='int64')

In [7]:
# df.columns
# the column labels of the dataframe, returned as an Index object
df1.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [8]:
# df.to_numpy()
# to return a numpy representation of the underlying data

df1.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

In [9]:
# df.describe()
# quick statistic summary of data

df1.describe()

Unnamed: 0,A,B,C,D
count,4.0,4,4.0,4.0
mean,1.0,2013-01-02 00:00:00,1.0,3.0
min,1.0,2013-01-02 00:00:00,1.0,3.0
25%,1.0,2013-01-02 00:00:00,1.0,3.0
50%,1.0,2013-01-02 00:00:00,1.0,3.0
75%,1.0,2013-01-02 00:00:00,1.0,3.0
max,1.0,2013-01-02 00:00:00,1.0,3.0
std,0.0,,0.0,0.0


### ***NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column***
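
This is why `df1.to_numpy()` above came back with `dtype=object`: NumPy has to pick a single dtype that can hold every column. A small sketch with two made-up frames shows the difference between all-numeric and mixed columns.

```python
import pandas as pd

# Illustrative frames: one with numeric columns only, one with mixed columns.
numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# int64 and float64 columns share a common numeric type, so the array is float.
numeric_array = numeric.to_numpy()

# Numbers and strings have no common numeric type, so NumPy falls back to object.
mixed_array = mixed.to_numpy()

print(numeric_array.dtype)
print(mixed_array.dtype)
```

Converting a mixed frame is therefore usually more expensive and less useful than converting a homogeneous one.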

#### Sorting and Transposing



In [10]:
# df.T
# transposing data

df1.T

Unnamed: 0,0,1,2,3
A,1.0,1.0,1.0,1.0
B,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00
C,1.0,1.0,1.0,1.0
D,3,3,3,3
E,test,train,test,train
F,foo,foo,foo,foo


In [11]:
# df.sort_index()
# sorts by the index labels along an axis (row labels by default)

df1.sort_index()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [12]:
# df.sort_values()
# sorts rows by the values of one or more columns

df1.sort_values(by = "B")

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
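
Every row of `df1` has the same values and the index is already ordered, so the sorted outputs above look unchanged. The sketch below uses a small made-up frame with a shuffled index to make the effect of each call visible.

```python
import pandas as pd

# Hypothetical frame with a shuffled index and distinct values.
scores = pd.DataFrame(
    {"name": ["c", "a", "b"], "score": [2, 3, 1]},
    index=[2, 0, 1],
)

# Sort rows by their index labels (ascending by default).
by_index = scores.sort_index()

# Sort rows by the values in one column, largest first.
by_score_desc = scores.sort_values(by="score", ascending=False)

# axis=1 sorts the column labels instead of the row labels.
cols_reversed = scores.sort_index(axis=1, ascending=False)

print(by_index)
print(by_score_desc)
print(cols_reversed)
```

All three calls return a new frame and leave `scores` unchanged.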


### <center> <font color = 'orange'> Getting items

In [13]:
# df["column_label"]
# selects a single column and returns it as a Series

df1['A']

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

In [14]:
# df[start: stop]
# slices rows by position
# the stop position is excluded

df1[0: 2]

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo


### <center> <font color = 'orange'> Selection by label

- df.loc[]
- df.at[]

In [15]:
# df.loc[ ]
# a list of row labels returns a DataFrame, even for a single label

df1.loc[[0]]

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo


In [16]:
df1.loc[:, ['A', 'B']]

Unnamed: 0,A,B
0,1.0,2013-01-02
1,1.0,2013-01-02
2,1.0,2013-01-02
3,1.0,2013-01-02


In [17]:
# label slicing
# both endpoints are included

df1.loc[0: 1, ["A", "B"]]

Unnamed: 0,A,B
0,1.0,2013-01-02
1,1.0,2013-01-02


In [18]:
# selecting single row and column label returns a scalar

df1.loc[0, "A"]

1.0
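
`df.at[]` is listed at the top of this section but not demonstrated. As a rough sketch on a made-up frame with string row labels: it reads or writes a single value by row and column label, and for a scalar it returns the same result as `df.loc[]`.

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only.
frame = pd.DataFrame(
    {"A": [1.0, 2.0], "B": ["x", "y"]},
    index=["r1", "r2"],
)

# Both return the scalar stored at row "r2", column "A".
via_loc = frame.loc["r2", "A"]
via_at = frame.at["r2", "A"]

# at[] can also assign a single value in place.
frame.at["r1", "B"] = "z"

print(via_loc, via_at)
print(frame)
```

`at[]` only accepts a single row label and a single column label, which is what lets it be a fast path for scalar access.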

### <center> <font color = 'orange'> Selection by Position
- df.iloc[]

In [19]:
# a single integer position returns that row as a Series
df1.iloc[3]

A                    1.0
B    2013-01-02 00:00:00
C                    1.0
D                      3
E                  train
F                    foo
Name: 3, dtype: object

In [20]:
df1.iloc[1: 3, 0: 2]

Unnamed: 0,A,B
1,1.0,2013-01-02
2,1.0,2013-01-02


In [21]:
# lists of integer position locations

df1.iloc[[1, 2, 3], [0, 2]]

Unnamed: 0,A,C
1,1.0,1.0
2,1.0,1.0
3,1.0,1.0


In [22]:
# for slicing rows explicitly

df1.iloc[1: 3, :]

Unnamed: 0,A,B,C,D,E,F
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo


In [23]:
# for slicing columns explicitly
df1.iloc[: , 1: 3]

Unnamed: 0,B,C
0,2013-01-02,1.0
1,2013-01-02,1.0
2,2013-01-02,1.0
3,2013-01-02,1.0


In [24]:
# for getting a value explicitly

df1.iloc[1, 1]

Timestamp('2013-01-02 00:00:00')
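
The difference between label slicing and position slicing is easy to miss when the index is `0, 1, 2, 3`, as it is in `df1`. The sketch below uses a made-up frame with string labels to separate the two. It also uses `df.iat[]`, the positional counterpart of `df.at[]`, which does not appear in the cells above.

```python
import pandas as pd

# Illustrative frame with string labels so labels and positions differ.
frame = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label slice with loc: both endpoints are included -> rows a, b, c.
by_label = frame.loc["a":"c"]

# Position slice with iloc: the stop position is excluded -> rows a, b.
by_position = frame.iloc[0:2]

# iat reads a single value by integer row and column position.
single = frame.iat[1, 0]

print(by_label)
print(by_position)
print(single)
```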

### <center> <font color = 'orange'> Boolean Indexing

In [25]:
# keeps only the rows where the condition is True
df1[df1["A"] > 0]

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
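
Since every value in column `A` of `df1` is positive, the filter above keeps all rows. A sketch with made-up values shows rows actually being dropped, and how two conditions are combined with `&`. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical values chosen so that the mask removes some rows.
frame = pd.DataFrame({"A": [1, -2, 3, -4], "B": [10, 20, 30, 40]})

# Element-wise AND of two boolean Series.
mask = (frame["A"] > 0) & (frame["B"] > 15)

# Only rows where the mask is True are kept; the original labels are preserved.
selected = frame[mask]

print(mask)
print(selected)
```

Use `|` for OR and `~` for NOT in the same way. The Python keywords `and`, `or` and `not` do not work element-wise on a Series.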


In [28]:
# isin() method for filtering

df2 = df1.copy()

df2["G"] = ["one", "two", "three", "four"]

df2 = df2[df2["G"].isin(["two", "four"])]

df2

Unnamed: 0,A,B,C,D,E,F,G
1,1.0,2013-01-02,1.0,3,train,foo,two
3,1.0,2013-01-02,1.0,3,train,foo,four
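
The same `isin()` mask can be inverted with `~` to keep the rows that are not in the list. A minimal sketch on a made-up single-column frame:

```python
import pandas as pd

# Illustrative frame reusing the labels from column G above.
frame = pd.DataFrame({"G": ["one", "two", "three", "four"]})

wanted = ["two", "four"]

# Rows whose value is in the list.
kept = frame[frame["G"].isin(wanted)]

# ~ flips the boolean mask: rows whose value is NOT in the list.
dropped = frame[~frame["G"].isin(wanted)]

print(kept)
print(dropped)
```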
