
## Introduction to Pandas

Pandas is an open-source library built mainly for working with relational or labeled data. It provides data structures and operations for manipulating numerical tables and time series, and it is built on top of the NumPy library. For in-memory data it is generally fast, and its high-level API keeps common analysis tasks short.

Advantages

1. Fast and efficient for manipulating and analyzing data.
2. Data from different file objects can be loaded.
3. Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
4. Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
5. Data set merging and joining.
6. Flexible reshaping and pivoting of data sets.
7. Provides time-series functionality.


Pandas provides two main data structures for manipulating data:

Series \
DataFrame  

### Series [ pd.Series() ]

A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique, but they must be of a hashable type. The object supports both integer-based and label-based indexing and provides many methods for operations involving the index.
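
As a minimal sketch of the points above, the following builds a Series from a dictionary and reads it by label and by position. The fruit names and prices are made-up illustration data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index labels
prices = pd.Series({"apple": 1.2, "banana": 0.5, "cherry": 3.0})

by_label = prices["banana"]      # label-based access
by_position = prices.iloc[0]     # position-based access (first element)

# Labels may repeat; selecting a repeated label returns every match
dupes = pd.Series([1, 2, 3], index=["x", "y", "x"])
matches = dupes["x"]

print(by_label, by_position)
print(matches.tolist())
```

Selecting a repeated label returns a Series rather than a single value, which is worth keeping in mind when an index is not unique.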


### DataFrame  [ pd.DataFrame() ]

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, data is aligned in a table of rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
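
The three components and the size mutability described above can be sketched as follows. The employee records and the labels `e1` to `e3` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical employee records, used only for illustration
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [34, 25, 20]},
    index=["e1", "e2", "e3"],
)

print(df.values)    # the data
print(df.index)     # the row labels
print(df.columns)   # the column labels

# Size mutability: add one column, then delete another
df["salary"] = [20000, 40000, 35000]
del df["age"]
print(df)
```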

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
ser = pd.Series([1,2,3,4,5])  ## 0-4 --> index( by default )  1-5 --> values
print(ser)
print()
print('values :', ser.values)
print()
print('index :', ser.index)

0    1
1    2
2    3
3    4
4    5
dtype: int64

values : [1 2 3 4 5]

index : RangeIndex(start=0, stop=5, step=1)


In [None]:
# slicing
print(ser[0:4])
print()
print(ser > 3)  ## element-wise comparison: returns a boolean Series
print()
print(ser[ser > 3])

0    1
1    2
2    3
3    4
dtype: int64

0    False
1    False
2    False
3     True
4     True
dtype: bool

3    4
4    5
dtype: int64
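
Boolean masks like `ser > 3` can also be combined. The sketch below assumes the same five-element Series and uses the element-wise operators `&`, `|` and `~`; each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

# Combine conditions element-wise: & (and), | (or), ~ (not)
middle = ser[(ser > 1) & (ser < 5)]
edges = ser[(ser == 1) | (ser == 5)]
not_big = ser[~(ser > 3)]

print(middle.tolist())
print(edges.tolist())
print(not_big.tolist())
```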


In [None]:
ser = pd.Series([1,2,3,4,5])  ## 0-4 --> index( by default )  1-5 --> values
print(ser)
print()

print(2 in ser)         ## `in` on a Series checks the index labels, not the values
print(2 in ser.values)  ## whether 2 is present in the values (Boolean)
print()

print(2 in ser.index)   ## whether 2 is present in the index (Boolean); same check as `2 in ser`

0    1
1    2
2    3
3    4
4    5
dtype: int64

True
True

True
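
With the default integer index the difference between checking labels and checking values is easy to miss, because 2 happens to be both. A small sketch with string labels (illustrative data only) makes it visible:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

in_labels = "a" in s            # "a" is an index label
value_as_label = 10 in s        # 10 is a value, not a label
in_values = 10 in s.values      # checks the underlying array
via_isin = s.isin([10]).any()   # vectorised check against the values

print(in_labels, value_as_label, in_values, via_isin)
```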


In [None]:
ser1 = pd.Series([30,10,50,40,20], index=['a','f','c','z','e'])  ## you can give your own indexes
print(ser1)
print()

## sort by index label
print(ser1.sort_index())
print()

## sort by value (ascending order by default)
print(ser1.sort_values())

a    30
f    10
c    50
z    40
e    20
dtype: int64

a    30
c    50
e    20
f    10
z    40
dtype: int64

f    10
e    20
a    30
z    40
c    50
dtype: int64


In [None]:
ser2 = pd.Series([10,50, np.nan,60,30,15], index=['e','t','s', 'a', 'l', 'g'])
print(ser2)
print()

print(ser2.sort_index())

print()

print(ser2.sort_values())

e    10.0
t    50.0
s     NaN
a    60.0
l    30.0
g    15.0
dtype: float64

a    60.0
e    10.0
g    15.0
l    30.0
s     NaN
t    50.0
dtype: float64

e    10.0
g    15.0
l    30.0
t    50.0
a    60.0
s     NaN
dtype: float64
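
As the output shows, `sort_values()` places NaN last by default. A short sketch of two common variations, using the `ascending` and `na_position` parameters of `sort_values()` on the same data:

```python
import numpy as np
import pandas as pd

ser2 = pd.Series([10, 50, np.nan, 60, 30, 15], index=list("etsalg"))

# Largest first; NaN is still placed last
desc = ser2.sort_values(ascending=False)

# Ascending order, but with missing values moved to the front
nan_first = ser2.sort_values(na_position="first")

print(desc.index.tolist())
print(nan_first.index.tolist())
```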


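### rank() 
rank() assigns each value its position in sorted order; with the default na_option='keep', missing values keep a rank of NaN. The signature is shown for DataFrame, and Series.rank() accepts the same parameters:

DataFrame.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)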

In [None]:
ser2 = pd.Series([10,50, np.nan,60,30,15], index=['e','t','s', 'a', 'l', 'g'])
print(ser2.rank(ascending=True ,method='dense'))

e    1.0
t    4.0
s    NaN
a    5.0
l    3.0
g    2.0
dtype: float64
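
The `method` parameter only matters when values tie, which the example above does not show. The following sketch uses a hypothetical Series with one tie to compare four of the available methods:

```python
import pandas as pd

scores = pd.Series([10, 20, 20, 30])  # hypothetical scores with one tie

avg = scores.rank(method="average")   # tied values share the mean of their ranks
low = scores.rank(method="min")       # tied values share the lowest rank
dense = scores.rank(method="dense")   # like "min", but with no gaps afterwards
first = scores.rank(method="first")   # ties broken by order of appearance

print(pd.DataFrame({"average": avg, "min": low, "dense": dense, "first": first}))
```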


In [None]:
ser3 = pd.Series([30,10,50,40,20,20], index=['a','f','c','z','e','e'])  ## you can give your own indexes
print(ser3)
print()

print(ser3['e'])       ## ser3['e'] and ser3.loc['e'] give the same output
print()

print(ser3.loc['e'])

print()

print('iloc[4] :',ser3.iloc[4])   ## loc selects by index label, iloc by integer position
print()

print(ser3.index.is_unique)   ## False here, because the label 'e' appears twice


a    30
f    10
c    50
z    40
e    20
e    20
dtype: int64

e    20
e    20
dtype: int64

e    20
e    20
dtype: int64

iloc[4] : 20

False
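
The difference between `loc` and `iloc` is clearest when the labels are themselves integers in a different order. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=[2, 0, 1])

by_label = s.loc[0]       # the row whose label is 0
by_position = s.iloc[0]   # the first row, whatever its label is

# Slices differ too: loc includes the end label, iloc excludes the end position
t = pd.Series([1, 2, 3], index=["a", "b", "c"])
label_slice = t.loc["a":"b"]
position_slice = t.iloc[0:1]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```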


In [None]:
ser4 = pd.Series([30,10,50,40,40,20], index=['a','f','c','z','e','e'])  ## you can give your own indexes

print(type(ser4.loc['e'].values.tolist()))

print(ser4.loc['e'].values.tolist())

<class 'list'>
[40, 20]
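
Note that `.values.tolist()` works here because `'e'` is a repeated label, so `loc` returns a Series. For a unique label, `loc` returns a scalar instead. One way to get a Series in both cases, sketched below, is to pass the label inside a list:

```python
import pandas as pd

ser4 = pd.Series([30, 10, 50, 40, 40, 20], index=["a", "f", "c", "z", "e", "e"])

unique_hit = ser4.loc["a"]        # unique label: a single value
dup_hit = ser4.loc["e"]           # repeated label: a Series
always_series = ser4.loc[["a"]]   # list of labels: always a Series

print(type(dup_hit).__name__, type(always_series).__name__)
print(always_series.tolist())
```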


In [None]:
print(ser4[ser4 == 30])

# element-wise comparison: returns a boolean Series
print(ser4 == 30)

a    30
dtype: int64
a     True
f    False
c    False
z    False
e    False
e    False
dtype: bool


In [None]:
# Two index levels (a MultiIndex)
ser5 = pd.Series([30,10,50,40,35,60], index=[['i1', 'i1', 'i2', 'i2', 'i3', 'i3'],['a','f','c','z','r', 'y']])
ser5

i1  a    30
    f    10
i2  c    50
    z    40
i3  r    35
    y    60
dtype: int64

In [None]:
print(ser5.index)
print(ser5.values)

MultiIndex([('i1', 'a'),
            ('i1', 'f'),
            ('i2', 'c'),
            ('i2', 'z'),
            ('i3', 'r'),
            ('i3', 'y')],
           )
[30 10 50 40 35 60]


In [None]:
print(ser5.loc['i1'])
print(ser5.loc['i1']['a'])
print(ser5.loc['i1', 'f'])
print('-------------')

print(ser5.loc['i2'])
print('-------------')

print(ser5.loc['i3'])

a    30
f    10
dtype: int64
30
10
-------------
c    50
z    40
dtype: int64
-------------
r    35
y    60
dtype: int64
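
The `loc` calls above select on the outer level first. To select on the inner level directly, one option is `xs()` with named levels. This is only a sketch; the level names `outer` and `inner` are assumptions introduced here and do not appear in the original Series.

```python
import pandas as pd

ser5 = pd.Series(
    [30, 10, 50, 40, 35, 60],
    index=[["i1", "i1", "i2", "i2", "i3", "i3"], ["a", "f", "c", "z", "r", "y"]],
)

# Hypothetical level names, added so the levels can be referred to by name
ser5 = ser5.rename_axis(["outer", "inner"])

# Cross-section: every row whose inner label is "c"
inner_c = ser5.xs("c", level="inner")

# Distinct labels of the outer level
outer_labels = ser5.index.get_level_values("outer").unique().tolist()

print(inner_c)
print(outer_labels)
```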


In [None]:
print(ser5.iloc[:5])

i1  a    30
    f    10
i2  c    50
    z    40
i3  r    35
dtype: int64


In [None]:
ser6 = pd.Series([30,10,50,40,35,60], index=['a','f','c','z','r', 'y'])
print(ser6)


a    30
f    10
c    50
z    40
r    35
y    60
dtype: int64


In [None]:
ser6.iloc[0] = None     ## None is stored as NaN, so the Series is upcast from int64 to float64

ser6.iloc[1]= np.nan

print(ser6)

a     NaN
f     NaN
c    50.0
z    40.0
r    35.0
y    60.0
dtype: float64
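
The dtype changed from int64 to float64 because NaN is a floating-point value. Some newer pandas releases may warn about, or refuse, this kind of implicit upcast on assignment, so a more explicit variant (a sketch, not the notebook's original approach) is to convert to float first:

```python
import numpy as np
import pandas as pd

ints = pd.Series([30, 10, 50])
print(ints.dtype)                  # an integer dtype

# Convert explicitly, then assign the missing value
floats = ints.astype("float64")
floats.iloc[0] = np.nan
print(floats.dtype)

# A None inside numeric input is likewise stored as NaN in a float Series
from_none = pd.Series([1, None, 3])
print(from_none.dtype)
```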


In [None]:
ser6.loc['c'] = 34

ser6.loc['z']= np.nan

print(ser6)

a     NaN
f     NaN
c    34.0
z     NaN
r    35.0
y    60.0
dtype: float64


In [None]:
print(pd.isnull(ser6))
print("----------------")
print(pd.isnull(ser6).sum())

a     True
f     True
c    False
z     True
r    False
y    False
dtype: bool
----------------
3


In [None]:
ser6.dropna()

c    34.0
r    35.0
y    60.0
dtype: float64

In [None]:
ser6[ser6.notnull()]   ## ser6.dropna()  and ser6[ser6.notnull()]  return the same result

c    34.0
r    35.0
y    60.0
dtype: float64
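
Dropping is not the only way to handle missing values; they can also be replaced with `fillna()`. The sketch below rebuilds the same Series state and fills the gaps with a constant and with the mean of the non-missing values (the choice of fill value is an assumption that depends on the analysis).

```python
import numpy as np
import pandas as pd

ser6 = pd.Series([np.nan, np.nan, 34, np.nan, 35, 60], index=list("afczry"))

filled_zero = ser6.fillna(0)             # replace NaN with a constant
filled_mean = ser6.fillna(ser6.mean())   # mean() skips NaN by default
non_missing = ser6.count()               # number of non-NaN entries

print(filled_zero.tolist())
print(filled_mean.tolist())
print(non_missing)
```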

In [None]:
age = [34,25,20,37,28]
sal = [20000, 40000, 35000, 25000, 15000]

age_sal = pd.Series(sal, index = age)
age_sal

34    20000
25    40000
20    35000
37    25000
28    15000
dtype: int64

In [None]:
# Assign a name to the Series (its values) and to its index
age_sal.name = 'Emp_Sal'
age_sal.index.name = 'Emp_Age'

In [None]:
age_sal

Emp_Age
34    20000
25    40000
20    35000
37    25000
28    15000
Name: Emp_Sal, dtype: int64

In [None]:
age_sal_drop = age_sal.drop([34,25])  ## drop entries by index label (ages 34 and 25), not by value
age_sal_drop

Emp_Age
20    35000
37    25000
28    15000
Name: Emp_Sal, dtype: int64
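
`drop()` works on index labels and returns a new Series, leaving the original unchanged. To remove entries by value instead, a boolean mask is one option. A short sketch on the same data, where the 25000 threshold is an arbitrary illustration:

```python
import pandas as pd

age_sal = pd.Series(
    [20000, 40000, 35000, 25000, 15000],
    index=[34, 25, 20, 37, 28],
    name="Emp_Sal",
)
age_sal.index.name = "Emp_Age"

by_label = age_sal.drop([34, 25])       # remove the rows labelled 34 and 25
by_value = age_sal[age_sal >= 25000]    # keep rows whose salary is at least 25000

print(by_label.index.tolist())
print(by_value.index.tolist())
print(len(age_sal))                     # the original Series is untouched
```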