## PANDAS: INTRODUCTION

### PANDAS: Data Manipulation

> It is often said that 80% of data analysis is spent cleaning and preparing data. This section covers a small but important part of that work: data manipulation and cleaning with Pandas.

### PANDAS: Data Structures

*Pandas provides two main data structures -*

* **Series -** It is a one-dimensional labeled array capable of holding any data type (e.g. integers, strings, floating point numbers, Python objects etc.). The axis labels are collectively referred to as the index.

* **Data Frame -** It is a two-dimensional labeled data structure with columns of potentially different types. We can think of it like a spreadsheet, a SQL table, or a dictionary of Series objects that share one index.
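
The relationship between the two structures can be sketched as follows. The fruit names and values are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([100, 220, 450], index=["apple", "banana", "orange"])

# A DataFrame: several columns sharing one index; each column can have its own dtype.
fruits = pd.DataFrame(
    {"price": [100, 220, 450], "organic": [True, False, True]},
    index=["apple", "banana", "orange"],
)

print(prices.ndim, fruits.ndim)        # number of dimensions of each structure
print(type(fruits["price"]).__name__)  # selecting a single column gives back a Series
print(fruits.dtypes)                   # one dtype per column
```

Selecting a single column from the DataFrame returns a Series that carries the same index labels as the frame.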

### PANDAS: Series Data Structure

**Definition of Series data structure -**

pandas.core.series.Series(data, index, dtype, copy)

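**data:** the values, given as an ndarray, list, dictionary, scalar constant etc.<br>
**index:** the axis labels; they must be hashable and match the length of the data, but they do not have to be unique (a later cell uses duplicate labels)<br>
**dtype:** the data type of the values; it is inferred from the data when not given<br>
**copy:** whether to copy the input data; it only has an effect when the Series is defined from a one-dimensional ndarray or another Series<br>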
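
A minimal sketch of how `index` and `dtype` change the result. The values and labels below are illustrative assumptions.

```python
import pandas as pd

# Without dtype, pandas infers an integer dtype for this data.
inferred = pd.Series([1, 2, 3])

# dtype forces the element type of the values.
as_float = pd.Series([1, 2, 3], dtype="float64")

# index supplies the labels; it must have the same length as the data.
labeled = pd.Series([1, 2, 3], index=["x", "y", "z"], dtype="float64")

print(inferred.dtype, as_float.dtype)
print(labeled["y"])  # look a value up by its label
```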

In [1]:
# importing required modules
import pandas as pd
import numpy as np

In [2]:
# creating empty Series
s = pd.Series(dtype = 'float64')   # explicit dtype: the default for an empty Series differs across pandas versions
print (s, len(s), type(s), id(s))

Series([], dtype: float64) 0 <class 'pandas.core.series.Series'> 2465717728592



In [3]:
# creating Series from ndarray
nddata = np.array(['aaa', 'bbb', 'ccc', 'ddd'])
print (nddata, type(nddata))
s = pd.Series(data = nddata)
print (s, type(s))

['aaa' 'bbb' 'ccc' 'ddd'] <class 'numpy.ndarray'>
0    aaa
1    bbb
2    ccc
3    ddd
dtype: object <class 'pandas.core.series.Series'>


In [7]:
nddata = np.array([100, 200, 400, 500, 350])
print (nddata, type(nddata))
s = pd.Series(data = nddata, copy = False)
print (s, type(s))
nddata[2] = 99999
print (nddata, type(nddata))
print (s, type(s))

[100 200 400 500 350] <class 'numpy.ndarray'>
0    100
1    200
2    400
3    500
4    350
dtype: int32 <class 'pandas.core.series.Series'>
[  100   200 99999   500   350] <class 'numpy.ndarray'>
0      100
1      200
2    99999
3      500
4      350
dtype: int32 <class 'pandas.core.series.Series'>


In [8]:
nddata = np.array([100, 200, 400, 500, 350])
print (nddata, type(nddata))
s = pd.Series(data = nddata, copy = True)
print (s, type(s))
nddata[2] = 99999
print (nddata, type(nddata))
print (s, type(s))

[100 200 400 500 350] <class 'numpy.ndarray'>
0    100
1    200
2    400
3    500
4    350
dtype: int32 <class 'pandas.core.series.Series'>
[  100   200 99999   500   350] <class 'numpy.ndarray'>
0    100
1    200
2    400
3    500
4    350
dtype: int32 <class 'pandas.core.series.Series'>


In [12]:
nddata = np.array(['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg'])
print (nddata, type(nddata))
s = pd.Series(data = nddata, index = [100, 101, 130, 120, 101, 303, 404])
print (s, type(s))
print (s[100])
print (s[101])
print (s[404])

['aaa' 'bbb' 'ccc' 'ddd' 'eee' 'fff' 'ggg'] <class 'numpy.ndarray'>
100    aaa
101    bbb
130    ccc
120    ddd
101    eee
303    fff
404    ggg
dtype: object <class 'pandas.core.series.Series'>
aaa
101    bbb
101    eee
dtype: object
ggg


In [14]:
nddata = np.array(['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg'])
print (nddata, type(nddata))
s = pd.Series(data = nddata, index = ['a', 'b', 'd', 'c', 'h', 'e', 'y'])
print (s, type(s))
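# .iloc selects by position, [] selects by label; s[0] on a string-labeled Series is deprecated
print (s.iloc[0], s['a'])
print (s.iloc[1], s['b'])
print (s.iloc[4], s['h'])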

['aaa' 'bbb' 'ccc' 'ddd' 'eee' 'fff' 'ggg'] <class 'numpy.ndarray'>
a    aaa
b    bbb
d    ccc
c    ddd
h    eee
e    fff
y    ggg
dtype: object <class 'pandas.core.series.Series'>
aaa aaa
bbb bbb
eee eee
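
When the labels are themselves integers, label lookup and positional lookup can give different answers, so it may help to be explicit. The sketch below uses a small made-up Series whose integer labels run in reverse order.

```python
import pandas as pd

# Integer labels deliberately stored in reverse order.
s = pd.Series(["aaa", "bbb", "ccc"], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0 (the last one)
by_position = s.iloc[0]  # the element at position 0 (the first one)

print(by_label, by_position)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent clear regardless of the index type.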


In [15]:
# creating Series from dictionary
dictdata = {'apple':100, 'banana':220, 'orange':450, 'pineapple':320}
print (dictdata, type(dictdata))
s = pd.Series(data = dictdata)
print (s, type(s))

{'apple': 100, 'banana': 220, 'orange': 450, 'pineapple': 320} <class 'dict'>
apple        100
banana       220
orange       450
pineapple    320
dtype: int64 <class 'pandas.core.series.Series'>


In [18]:
dictdata = {'apple':100, 'banana':220, 'orange':450, 'pineapple':320}
print (dictdata, type(dictdata))
s = pd.Series(data = dictdata, index = ['apple', 'orange', 'orange', 'apple', 'orange', 'pineapple', 'banana'])
print (s, type(s))
print (s['banana'])
print (s['orange'])
print (s['apple'])

{'apple': 100, 'banana': 220, 'orange': 450, 'pineapple': 320} <class 'dict'>
apple        100
orange       450
orange       450
apple        100
orange       450
pineapple    320
banana       220
dtype: int64 <class 'pandas.core.series.Series'>
220
orange    450
orange    450
orange    450
dtype: int64
apple    100
apple    100
dtype: int64


In [19]:
# creating Series from scalar
s = pd.Series(data = 5, index = [0, 1, 2, 3, 4, 5, 6])
print (s, type(s))

0    5
1    5
2    5
3    5
4    5
5    5
6    5
dtype: int64 <class 'pandas.core.series.Series'>


In [25]:
# creating Series from list
listdata = ['Monday', 'Friday', 'Saturday', 'Tuesday']
print (listdata, type(listdata))
s = pd.Series(data = listdata)
print (s, type(s))
s = pd.Series(data = listdata, index = ['1st', '2nd', '3rd', '4th'])
print (s, type(s))
s = pd.Series(data = listdata, index = ['3rd', '1st', '4th', '2nd'])
print (s, type(s))
print (s.sort_values())
print (s.sort_index())

['Monday', 'Friday', 'Saturday', 'Tuesday'] <class 'list'>
0      Monday
1      Friday
2    Saturday
3     Tuesday
dtype: object <class 'pandas.core.series.Series'>
1st      Monday
2nd      Friday
3rd    Saturday
4th     Tuesday
dtype: object <class 'pandas.core.series.Series'>
3rd      Monday
1st      Friday
4th    Saturday
2nd     Tuesday
dtype: object <class 'pandas.core.series.Series'>
1st      Friday
3rd      Monday
4th    Saturday
2nd     Tuesday
dtype: object
1st      Friday
2nd     Tuesday
3rd      Monday
4th    Saturday
dtype: object


In [32]:
listdata = ['Monday', 'Friday', 'Saturday', 'Tuesday', 'Wednesday', 'Sunday']
print (listdata, type(listdata))
s = pd.Series(data = listdata)
print (s, type(s))
print (s[2])
print (s[:3])
print (s[3:])
print (s[2:4])

['Monday', 'Friday', 'Saturday', 'Tuesday', 'Wednesday', 'Sunday'] <class 'list'>
0       Monday
1       Friday
2     Saturday
3      Tuesday
4    Wednesday
5       Sunday
dtype: object <class 'pandas.core.series.Series'>
Saturday
0      Monday
1      Friday
2    Saturday
dtype: object
3      Tuesday
4    Wednesday
5       Sunday
dtype: object
2    Saturday
3     Tuesday
dtype: object


### PANDAS: Data Frame Structure

### Create Data Frame

In [76]:
column_names = ['name', 'age', 'gender', 'job']
data_list = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]   # list of lists
user1 = pd.DataFrame(data = data_list, columns = column_names)
user1

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |


In [77]:
data_dict = {'name':['eric', 'paul'], 'age':[22, 58], 'gender':['M', 'F'], 'job':['student', 'manager']}   # dictionary
print (data_dict)
user2 = pd.DataFrame(data = data_dict)
user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |


In [78]:
data_dict = dict(name = ['peter', 'julie'], age = [33, 44], gender = ['M', 'F'], job = ['engineer', 'scientist'])   # dictionary
print (data_dict)
user3 = pd.DataFrame(data = data_dict)
user3

{'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}


|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | peter | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |


In [79]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal'], 'emp_age':[40, 50, 45]}
emp_id = [100, 101, 102]
df = pd.DataFrame(data = data_dict, index = emp_id)
print (df.columns)
print (df.index)
df

Index(['emp_name', 'emp_age'], dtype='object')
Int64Index([100, 101, 102], dtype='int64')


|  | emp_name | emp_age |
|---|---|---|
| 100 | Amal | 40 |
| 101 | Kamal | 50 |
| 102 | Bimal | 45 |


In [80]:
df = df.reset_index()
df

|  | index | emp_name | emp_age |
|---|---|---|---|
| 0 | 100 | Amal | 40 |
| 1 | 101 | Kamal | 50 |
| 2 | 102 | Bimal | 45 |
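
By default `reset_index` names the new column `index`. A possible way to control this is sketched below; the frame `emp` and the column name `emp_id` are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, indexed by employee id as in the cell above.
emp = pd.DataFrame({"emp_name": ["Amal", "Kamal"]}, index=[100, 101])

# Name the index first so the new column is called "emp_id" instead of "index".
named = emp.rename_axis("emp_id").reset_index()

# drop=True discards the old labels instead of turning them into a column.
dropped = emp.reset_index(drop=True)

print(named.columns.tolist())
print(dropped.index.tolist())
```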


### Concatenate Data Frame

In [81]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
result = pd.concat([user1, user2], ignore_index = True)
result

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |


In [82]:
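# chained form; replaces user1.append(user2).append(user3, ...), as append was removed in pandas 2.0
result = pd.concat([pd.concat([user1, user2]), user3], ignore_index = True)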
result

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [83]:
users = pd.concat([user1, user2, user3], ignore_index = True)
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


### Join Data Frame

In [88]:
data_dict = dict(name = ['alice', 'john', 'eric', 'julie', 'alex'], height = [165, 180, 175, 171, 169])
print (data_dict)
user4 = pd.DataFrame(data = data_dict)
user4

{'name': ['alice', 'john', 'eric', 'julie', 'alex'], 'height': [165, 180, 175, 171, 169]}


|  | name | height |
|---|---|---|
| 0 | alice | 165 |
| 1 | john | 180 |
| 2 | eric | 175 |
| 3 | julie | 171 |
| 4 | alex | 169 |


In [89]:
# only matching rows
merge_inner = pd.merge(users, user4, on = "name", how = "inner")
merge_inner

|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165 |
| 1 | john | 26 | M | student | 180 |
| 2 | eric | 22 | M | student | 175 |
| 3 | julie | 44 | F | scientist | 171 |


In [90]:
merge_inner = pd.merge(users, user4, on = "name")   # by default inner join
merge_inner

|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165 |
| 1 | john | 26 | M | student | 180 |
| 2 | eric | 22 | M | student | 175 |
| 3 | julie | 44 | F | scientist | 171 |


In [91]:
# all rows from both the tables
merge_outer = pd.merge(users, user4, on = "name", how = "outer")
merge_outer

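|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19.0 | F | student | 165.0 |
| 1 | john | 26.0 | M | student | 180.0 |
| 2 | eric | 22.0 | M | student | 175.0 |
| 3 | paul | 58.0 | F | manager | NaN |
| 4 | peter | 33.0 | M | engineer | NaN |
| 5 | julie | 44.0 | F | scientist | 171.0 |
| 6 | alex | NaN | NaN | NaN | 169.0 |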


In [92]:
# all rows from left table and matching rows from the right table
merge_left = pd.merge(users, user4, on = "name", how = "left")
merge_left

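|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165.0 |
| 1 | john | 26 | M | student | 180.0 |
| 2 | eric | 22 | M | student | 175.0 |
| 3 | paul | 58 | F | manager | NaN |
| 4 | peter | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |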


In [93]:
# all rows from right table and matching rows from the left table
merge_right = pd.merge(users, user4, on = "name", how = "right")
merge_right

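|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19.0 | F | student | 165 |
| 1 | john | 26.0 | M | student | 180 |
| 2 | eric | 22.0 | M | student | 175 |
| 3 | julie | 44.0 | F | scientist | 171 |
| 4 | alex | NaN | NaN | NaN | 169 |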
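
To see which table each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two small made-up frames (`left` and `right` are hypothetical names) rather than the notebook's `users` and `user4`.

```python
import pandas as pd

left = pd.DataFrame({"name": ["alice", "paul"], "age": [19, 58]})
right = pd.DataFrame({"name": ["alice", "alex"], "height": [165, 169]})

# An outer join keeps every name; _merge records the origin of each row.
merged = pd.merge(left, right, on="name", how="outer", indicator=True)

print(merged[["name", "_merge"]])
```

Rows found only in the left frame are marked `left_only`, rows found only in the right frame `right_only`, and matched rows `both`.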


### Summarizing the Data Frame

In [94]:
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [105]:
print (users.dtypes)
print (users.columns)
print (users.index)
print (f"This DataFrame with shape = {users.shape} and row count = {users.shape[0]} and column count = {users.shape[1]}")
print (users.values, type(users.values))

name      object
age        int64
gender    object
job       object
dtype: object
Index(['name', 'age', 'gender', 'job'], dtype='object')
RangeIndex(start=0, stop=6, step=1)
This DataFrame with shape = (6, 4) and row count = 6 and column count = 4
[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']
 ['eric' 22 'M' 'student']
 ['paul' 58 'F' 'manager']
 ['peter' 33 'M' 'engineer']
 ['julie' 44 'F' 'scientist']] <class 'numpy.ndarray'>


In [106]:
users.describe()

|  | age |
|---|---|
| count | 6.0 |
| mean | 33.666667 |
| std | 14.895189 |
| min | 19.0 |
| 25% | 23.0 |
| 50% | 29.5 |
| 75% | 41.25 |
| max | 58.0 |


In [107]:
users.describe(include = ['object'])

|  | name | gender | job |
|---|---|---|---|
| count | 6 | 6 | 6 |
| unique | 6 | 2 | 4 |
| top | julie | F | student |
| freq | 1 | 3 | 3 |


In [108]:
users.describe(include = 'all')

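|  | name | age | gender | job |
|---|---|---|---|---|
| count | 6 | 6.0 | 6 | 6 |
| unique | 6 | NaN | 2 | 4 |
| top | julie | NaN | F | student |
| freq | 1 | NaN | 3 | 3 |
| mean | NaN | 33.666667 | NaN | NaN |
| std | NaN | 14.895189 | NaN | NaN |
| min | NaN | 19.0 | NaN | NaN |
| 25% | NaN | 23.0 | NaN | NaN |
| 50% | NaN | 29.5 | NaN | NaN |
| 75% | NaN | 41.25 | NaN | NaN |
| max | NaN | 58.0 | NaN | NaN |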


In [109]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    6 non-null      object
 1   age     6 non-null      int64 
 2   gender  6 non-null      object
 3   job     6 non-null      object
dtypes: int64(1), object(3)
memory usage: 320.0+ bytes


### Column Selection

In [110]:
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [112]:
print (users.gender, type(users.gender))

0    F
1    M
2    M
3    F
4    M
5    F
Name: gender, dtype: object <class 'pandas.core.series.Series'>


In [115]:
print (users['gender'], type(users['gender']))

0    F
1    M
2    M
3    F
4    M
5    F
Name: gender, dtype: object <class 'pandas.core.series.Series'>


In [117]:
users[['name', 'gender', 'job']]

|  | name | gender | job |
|---|---|---|---|
| 0 | alice | F | student |
| 1 | john | M | student |
| 2 | eric | M | student |
| 3 | paul | F | manager |
| 4 | peter | M | engineer |
| 5 | julie | F | scientist |


In [118]:
my_cols = ['name', 'gender', 'job']
users[my_cols]

|  | name | gender | job |
|---|---|---|---|
| 0 | alice | F | student |
| 1 | john | M | student |
| 2 | eric | M | student |
| 3 | paul | F | manager |
| 4 | peter | M | engineer |
| 5 | julie | F | scientist |


### Row Selection

In [119]:
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [124]:
print (users.iloc[2])
print (users.loc[2])

name         eric
age            22
gender          M
job       student
Name: 2, dtype: object
name         eric
age            22
gender          M
job       student
Name: 2, dtype: object


In [136]:
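# a row is a Series labeled by column name, so use .iloc for position and [] for label
print (users.iloc[4].iloc[3], users.iloc[4]['job'], users.loc[4].iloc[3], users.loc[4]['job'])
print (users.iloc[4, 3], users.iloc[4]['job'], users.loc[4].iloc[3], users.loc[4, 'job'])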
print (users.shape, users.shape[0], users.shape[1], len(users), users.ndim, users.size)

engineer engineer engineer engineer
engineer engineer engineer engineer
(6, 4) 6 4 6 2 24


In [137]:
df = users.copy()
df

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [143]:
df = users.copy()
for i in range(df.shape[0]):
    # write to the frame itself; modifying a row pulled out with df.iloc[i]
    # triggers SettingWithCopyWarning
    df.loc[i, 'age'] += 100
df

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 119 | F | student |
| 1 | john | 126 | M | student |
| 2 | eric | 122 | M | student |
| 3 | paul | 158 | F | manager |
| 4 | peter | 133 | M | engineer |
| 5 | julie | 144 | F | scientist |
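
The same update can usually be written without a loop by operating on the whole column at once. A minimal sketch on a small made-up frame (`people` is a hypothetical name):

```python
import pandas as pd

people = pd.DataFrame({"name": ["alice", "john"], "age": [19, 26]})

# Work on a copy so the original frame stays unchanged.
updated = people.copy()

# Column arithmetic is applied to every row in one step.
updated["age"] = updated["age"] + 100

print(updated)
```

Column-wise operations like this tend to be both shorter and faster than iterating over rows.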


### Row Selection / Filtering