# **Introduction to Pandas**
* **It is often said that about 80% of data analysis effort goes into cleaning and preparing data. To get a grip on this activity, this section focuses on a small but important part of data manipulation and cleaning with Pandas.**

## **Built-in Data Structures in Pandas**
**There are two main data structures in Pandas -**
* **Series -** A one-dimensional labeled array capable of holding any data type (integer, string, float, Python objects etc.). The axis labels are collectively referred to as the index.
* **Data Frame -** A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a dictionary of Series objects that share one index.
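
As a rough illustration of how the two structures relate (the names `ages` and `people` and the sample values are invented for this sketch), selecting a single column of a Data Frame gives back a Series that carries the same index:

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
ages = pd.Series([45, 47, 46], index=["amal", "kamal", "bimal"])

# A Data Frame: several columns, each of which is itself a Series.
people = pd.DataFrame(
    {"age": [45, 47, 46], "city": ["Pune", "Delhi", "Agra"]},
    index=["amal", "kamal", "bimal"],
)

age_column = people["age"]           # selecting one column returns a Series
print(type(ages).__name__)           # Series
print(type(people).__name__)         # DataFrame
print(type(age_column).__name__)     # Series
print(age_column.index.equals(people.index))
```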

In [1]:
# importing all required modules
import pandas as pd
import numpy as np

## **Series Data Structure**
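**pandas.Series(data, index, dtype, copy)**
* **data -** the values; can be an ndarray, list, dictionary, scalar constant etc.
* **index -** the axis labels. Labels must be hashable, and for array-like data there must be as many labels as values. They do not have to be unique. If omitted, the index defaults to 0, 1, ..., n-1.
* **dtype -** the data type of the values. If omitted, it is inferred from the data.
* **copy -** whether to copy the input data. It is False by default in the pandas version used here, and it only affects Series or one-dimensional ndarray inputs.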
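
The parameters above can be tried out in a small sketch like the following. The variable names are made up for illustration, and the inferred dtype of `labeled` depends on the platform's default integer type, so it is printed rather than assumed:

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# data + index: labels are attached to the values in order.
labeled = pd.Series(data=values, index=["a", "b", "c"])

# dtype: request a specific element type instead of the inferred one.
as_float = pd.Series(data=values, index=["a", "b", "c"], dtype="float64")

# An empty Series with an explicit dtype does not rely on any default dtype.
empty = pd.Series(dtype="float64")

print(labeled.dtype, as_float.dtype, len(empty))
```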

In [3]:
# creating empty Series
s = pd.Series()
print (s, type(s))

Series([], dtype: object) <class 'pandas.core.series.Series'>


In [7]:
arr_data = np.array([100, 200, 400, 500, 300])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, copy = False)
print (s, type(s))

[100 200 400 500 300] <class 'numpy.ndarray'>
0    100
1    200
2    400
3    500
4    300
dtype: int64 <class 'pandas.core.series.Series'>


In [9]:
s[2] = 999
print(s)
print(arr_data)

0    100
1    200
2    999
3    500
4    300
dtype: int64
[100 200 999 500 300]


In [10]:
arr_data[3] = 888
print(s)
print(arr_data)

0    100
1    200
2    999
3    888
4    300
dtype: int64
[100 200 999 888 300]


In [11]:
arr_data = np.array([100, 200, 400, 500, 300])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, copy = True)
print (s, type(s))

[100 200 400 500 300] <class 'numpy.ndarray'>
0    100
1    200
2    400
3    500
4    300
dtype: int64 <class 'pandas.core.series.Series'>


In [12]:
s[2] = 999
print (s)
print (arr_data)

0    100
1    200
2    999
3    500
4    300
dtype: int64
[100 200 400 500 300]


In [13]:
arr_data[3] = 888
print (s)
print (arr_data)

0    100
1    200
2    999
3    500
4    300
dtype: int64
[100 200 400 888 300]
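
The cells above infer sharing from side effects. A more direct check, sketched here under the assumption that `Series.to_numpy()` exposes the underlying buffer for a plain integer Series, is to ask NumPy whether the two arrays overlap in memory. Whether `copy=False` actually shares the buffer can vary with the pandas version, so that result is only printed:

```python
import numpy as np
import pandas as pd

arr = np.array([100, 200, 400, 500, 300])

shared = pd.Series(arr, copy=False)       # may reuse arr's buffer
independent = pd.Series(arr, copy=True)   # gets its own buffer

# np.shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(arr, shared.to_numpy()))
print(np.shares_memory(arr, independent.to_numpy()))

arr[0] = -1
print(independent.iloc[0])   # the copied Series still holds the old value
```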


In [14]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, len(arr_data))
s = pd.Series(data = arr_data)
print (s)
print (s[0], s[2])

['apple' 'banana' 'cherry' 'pineapple'] 4
0        apple
1       banana
2       cherry
3    pineapple
dtype: object
apple cherry


In [15]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, len(arr_data))
s = pd.Series(data = arr_data, index = [100, 103, 101, 103])
print (s)
print (s[100], type(s[100]))
print (s[103], type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] 4
100        apple
103       banana
101       cherry
103    pineapple
dtype: object
apple <class 'str'>
103       banana
103    pineapple
dtype: object <class 'pandas.core.series.Series'>


In [16]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, len(arr_data))
s = pd.Series(data = arr_data, index = [100, 103, 100, 103])
print (s)
print (s[100])
print (s[103])

['apple' 'banana' 'cherry' 'pineapple'] 4
100        apple
103       banana
100       cherry
103    pineapple
dtype: object
100     apple
100    cherry
dtype: object
103       banana
103    pineapple
dtype: object
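
Because a repeated label changes the return type of a lookup from a single value to a Series, it can help to test for duplicates first. A minimal sketch using `Index.is_unique` and `Index.duplicated` (the sample Series is a shortened version of the one above):

```python
import pandas as pd

s = pd.Series(["apple", "banana", "cherry"], index=[100, 103, 100])

print(s.index.is_unique)               # False: label 100 appears twice
print(s.index.duplicated().tolist())   # flags repeats after the first occurrence

one = s.loc[103]     # unique label: a single value
many = s.loc[100]    # repeated label: a Series of all matches
print(type(one).__name__, type(many).__name__)
```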


In [17]:
import warnings
warnings.filterwarnings('ignore')

In [18]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, len(arr_data))
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])
print (s)
print (s['fruit-1'], s.iloc[0])   # same element by label and by position
print (s['fruit-3'], s.iloc[2])

['apple' 'banana' 'cherry' 'pineapple'] 4
fruit-1        apple
fruit-2       banana
fruit-3       cherry
fruit-4    pineapple
dtype: object
apple apple
cherry cherry
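
Label-based and position-based access can be kept apart explicitly with `.loc` and `.iloc`. The sketch below (sample data shortened from the cell above) also shows one difference that is easy to miss: label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

s = pd.Series(["apple", "banana", "cherry"],
              index=["fruit-1", "fruit-2", "fruit-3"])

by_label = s.loc["fruit-3"]                 # lookup by label
by_position = s.iloc[2]                     # lookup by 0-based position
label_slice = s.loc["fruit-1":"fruit-2"]    # end label is included
position_slice = s.iloc[0:2]                # end position is excluded

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```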


In [20]:
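# create a Series from a dictionary: the keys become the index labels
# note: '455' is a string, so the mixed values are stored with dtype object
dict_data = {'apple':104, 'banana':306, 'pineapple': '455', 'coconut':897}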
print (dict_data, len(dict_data))
s = pd.Series(dict_data)
print (s)

{'apple': 104, 'banana': 306, 'pineapple': '455', 'coconut': 897} 4
apple        104
banana       306
pineapple    455
coconut      897
dtype: object
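
The `object` dtype above comes from the single string value `'455'`. If the intent is a numeric column, one possible cleanup, sketched here with a shortened dictionary, is `pd.to_numeric`, which parses numeric strings:

```python
import pandas as pd

s = pd.Series({"apple": 104, "banana": 306, "pineapple": "455"})
print(s.dtype)                # object, because of the string "455"

numeric = pd.to_numeric(s)    # parses "455" and returns a numeric Series
print(numeric.dtype)
print(numeric.sum())
```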


In [21]:
dict_data = {'apple':104, 'banana':306, 'pineapple': '455', 'coconut':897}
print (dict_data, len(dict_data))
s = pd.Series(dict_data, index = ['pineapple', 'apple', 'cherry', 'apple', 'coconut', 'mango'])
print (s)

{'apple': 104, 'banana': 306, 'pineapple': '455', 'coconut': 897} 4
pineapple    455
apple        104
cherry       NaN
apple        104
coconut      897
mango        NaN
dtype: object
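
Labels requested in `index` that are missing from the dictionary come back as `NaN`. A short sketch of how such gaps might be detected and handled, using an all-integer dictionary invented for this example (note that introducing `NaN` turns the integers into floats):

```python
import pandas as pd

stock = {"apple": 104, "banana": 306, "coconut": 897}
s = pd.Series(stock, index=["apple", "cherry", "coconut"])

print(s.dtype)                    # float64, since NaN is a float
print(s.isna())                   # True only for "cherry"
print(s.dropna())                 # drop labels that have no value
print(s.fillna(0).astype(int))    # or substitute a default value
```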


In [23]:
# create a Series from a scalar
s = pd.Series(50, index = [0, 1, 2, 3, 4, 5])
print (s)
s = pd.Series("abc", index = [0, 2, 1, 3, 2, 3, 5, 4, 6])
print (s)

0    50
1    50
2    50
3    50
4    50
5    50
dtype: int64
0    abc
2    abc
1    abc
3    abc
2    abc
3    abc
5    abc
4    abc
6    abc
dtype: object


In [24]:
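# create a Series from a list
s = pd.Series(data = [101, 103, 102, 105, 104], index = ('red', 'blue', 'brown', 'black', 'white'))
print (s)
print (s['blue'], s.iloc[1])      # access by label and by position
print (s.sort_index())            # sort by index label
print (s.sort_values())           # sort by value
print (s.argmax(), s.argmin())    # positions of the largest and smallest values
print (s.count())                 # number of non-null values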

red      101
blue     103
brown    102
black    105
white    104
dtype: int64
103 103
black    105
blue     103
brown    102
red      101
white    104
dtype: int64
red      101
brown    102
blue     103
white    104
black    105
dtype: int64
3 0
5
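
`argmax` and `argmin` return integer positions. When the label is what is needed, `idxmax` and `idxmin` are the label-based counterparts, as this small sketch on the same data shows:

```python
import pandas as pd

s = pd.Series([101, 103, 102, 105, 104],
              index=["red", "blue", "brown", "black", "white"])

print(s.argmax(), s.idxmax())   # position 3, label "black"
print(s.argmin(), s.idxmin())   # position 0, label "red"
```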


## **Data Frame Data Structure**

In [25]:
data_dict = {'emp_name':['Amal','Kamal','Bimal','Shymal'], 'emp_age':[45, 47, 46, 49]}
emp_id = [1001, 1002, 1003, 1004]
df = pd.DataFrame(data = data_dict)
print (df)
df

  emp_name  emp_age
0     Amal       45
1    Kamal       47
2    Bimal       46
3   Shymal       49


Unnamed: 0,emp_name,emp_age
0,Amal,45
1,Kamal,47
2,Bimal,46
3,Shymal,49


In [26]:
data_dict = {'emp_name':['Amal','Kamal','Bimal','Shymal'], 'emp_age':[45, 47, 46, 49]}
emp_id = [1001, 1002, 1003, 1004]
df = pd.DataFrame(data = data_dict, index = emp_id)
df

Unnamed: 0,emp_name,emp_age
1001,Amal,45
1002,Kamal,47
1003,Bimal,46
1004,Shymal,49


In [27]:
data_dict = {'emp_name':['Amal','Kamal','Bimal','Shymal'], 'emp_age':[45, 47, 46, 49]}
emp_id = [1001, 1002, 1003, 1004]
df = pd.DataFrame(data = data_dict, index = emp_id)
df = df.reset_index()
print (df)
df = df.rename(columns = {"index":"emp_id", 'emp_name':'emp_fullname'})
df

   index emp_name  emp_age
0   1001     Amal       45
1   1002    Kamal       47
2   1003    Bimal       46
3   1004   Shymal       49


Unnamed: 0,emp_id,emp_fullname,emp_age
0,1001,Amal,45
1,1002,Kamal,47
2,1003,Bimal,46
3,1004,Shymal,49
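
An alternative to renaming the generic `index` column afterwards is to give the index a name up front. This is a sketch, not the only way to do it: when the index is named, `reset_index` uses that name for the new column, and `set_index` moves the column back.

```python
import pandas as pd

emp_id = pd.Index([1001, 1002, 1003], name="emp_id")   # the index carries a name
df = pd.DataFrame({"emp_name": ["Amal", "Kamal", "Bimal"],
                   "emp_age": [45, 47, 46]}, index=emp_id)

flat = df.reset_index()                # new column is called "emp_id" directly
restored = flat.set_index("emp_id")    # move the column back into the index

print(list(flat.columns))
print(restored.index.name)
```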


## **Different ways to create a Data Frame**

In [29]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]  # list of lists
user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame(data = user_data, columns = user_columns)

user1

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student


In [30]:
user_data = dict(name = ['eric', 'paul'], age = [22, 58], gender = ['M', 'F'], job = ['student', 'manager'])
print (user_data)
user2 = pd.DataFrame(data = user_data)

user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


Unnamed: 0,name,age,gender,job
0,eric,22,M,student
1,paul,58,F,manager


In [31]:
user_data = {'name':['peter', 'julie'], 'age':[33, 44], 'gender':['M', 'F'], 'job':['engineer', 'scientist']}
user3 = pd.DataFrame(data = user_data)

user3

Unnamed: 0,name,age,gender,job
0,peter,33,M,engineer
1,julie,44,F,scientist
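
Besides a list of lists and a dictionary of lists, a Data Frame can also be built from a list of dictionaries, where each dictionary is one row. In this sketch (sample records invented for illustration) a key missing from a row becomes a missing value:

```python
import pandas as pd

# Each dict is one row; the keys become column names.
records = [
    {"name": "alice", "age": 19, "job": "student"},
    {"name": "paul", "age": 58},    # "job" is missing for this row
]
df = pd.DataFrame(records)

print(df)
print(df["job"].isna().tolist())
```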


## **Concatenate Data Frames**

In [32]:
users = pd.concat([user1, user2])
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
0,eric,22,M,student
1,paul,58,F,manager


In [33]:
users = pd.concat([user1, user2, user3])
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
0,eric,22,M,student
1,paul,58,F,manager
0,peter,33,M,engineer
1,julie,44,F,scientist


In [34]:
users = pd.concat([user1, user2], ignore_index=True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager


In [35]:
users = pd.concat([user1, user2, user3], ignore_index=True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist
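
Without `ignore_index=True` the concatenated frame keeps the original row labels, so a label such as `0` matches several rows. If it matters which input each row came from, the `keys` argument of `pd.concat` adds an outer index level instead. A small sketch with two invented frames:

```python
import pandas as pd

a = pd.DataFrame({"name": ["alice", "john"], "age": [19, 26]})
b = pd.DataFrame({"name": ["eric", "paul"], "age": [22, 58]})

plain = pd.concat([a, b])            # index is 0, 1, 0, 1
print(plain.index.is_unique)         # False
print(len(plain.loc[0]))             # label 0 matches two rows

keyed = pd.concat([a, b], keys=["first", "second"])   # adds an outer index level
print(keyed.loc["second"])           # only the rows that came from b
```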


In [36]:
# convert a DataFrame to a NumPy ndarray
arr_data = user1.to_numpy()
print (arr_data, type(arr_data))
arr_data = np.array(user1)
print (arr_data, type(arr_data))
arr_data = np.array(user1[['name', 'gender']])
print (arr_data, type(arr_data))

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'>
[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'>
[['alice' 'F']
 ['john' 'M']] <class 'numpy.ndarray'>
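
The arrays above mix strings and numbers, so NumPy has to store them as generic Python objects. Selecting only numeric columns before converting gives a numeric array that supports arithmetic. A sketch with an invented `height` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["alice", "john"],
                   "age": [19, 26],
                   "height": [165.0, 168.5]})

mixed = df.to_numpy()                         # strings + numbers: object array
numeric = df[["age", "height"]].to_numpy()    # int + float: float64 array

print(mixed.dtype, numeric.dtype)
print(numeric.mean(axis=0))                   # column means of the numeric array
```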


## **Join Data Frames**

In [37]:
dict_data = dict(name = ['alice','john','eric','julie','rita'], height = [165, 168, 170, 180, 167])
user4 = pd.DataFrame(data = dict_data)
user4

Unnamed: 0,name,height
0,alice,165
1,john,168
2,eric,170
3,julie,180
4,rita,167


In [38]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [39]:
# inner join: only the rows whose key appears in both data frames
merge_inner = pd.merge(users, user4, on = 'name', how = 'inner')
merge_inner

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165
1,john,26,M,student,168
2,eric,22,M,student,170
3,julie,44,F,scientist,180


In [40]:
# inner join again: how defaults to 'inner' when it is omitted
merge_inner = pd.merge(users, user4, on = 'name')
merge_inner

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165
1,john,26,M,student,168
2,eric,22,M,student,170
3,julie,44,F,scientist,180


In [41]:
# outer join: all rows from both data frames, whether or not they match
merge_outer = pd.merge(users, user4, on = 'name', how = 'outer')
merge_outer

Unnamed: 0,name,age,gender,job,height
0,alice,19.0,F,student,165.0
1,eric,22.0,M,student,170.0
2,john,26.0,M,student,168.0
3,julie,44.0,F,scientist,180.0
4,paul,58.0,F,manager,
5,peter,33.0,M,engineer,
6,rita,,,,167.0


In [42]:
print (merge_outer.size)

35


In [43]:
merge_outer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    7 non-null      object 
 1   age     6 non-null      float64
 2   gender  6 non-null      object 
 3   job     6 non-null      object 
 4   height  5 non-null      float64
dtypes: float64(2), object(3)
memory usage: 408.0+ bytes
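
The null counts above show that some rows had no partner, but not which side each row came from. Passing `indicator=True` to `pd.merge` adds a `_merge` column that records this. A sketch with two small invented frames:

```python
import pandas as pd

left = pd.DataFrame({"name": ["alice", "paul"], "age": [19, 58]})
right = pd.DataFrame({"name": ["alice", "rita"], "height": [165, 167]})

merged = pd.merge(left, right, on="name", how="outer", indicator=True)
print(merged)

# Map each name to the side(s) it was found on.
origin = merged.set_index("name")["_merge"].astype(str).to_dict()
print(origin)
```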


In [44]:
# left outer join: all rows from the left data frame and only matching rows from the right data frame
merge_left_outer = pd.merge(users, user4, on = 'name', how = 'left')
merge_left_outer

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165.0
1,john,26,M,student,168.0
2,eric,22,M,student,170.0
3,paul,58,F,manager,
4,peter,33,M,engineer,
5,julie,44,F,scientist,180.0


In [45]:
# right outer join: all rows from the right data frame and only matching rows from the left data frame
merge_right_outer = pd.merge(users, user4, on = 'name', how = 'right')
merge_right_outer

Unnamed: 0,name,age,gender,job,height
0,alice,19.0,F,student,165
1,john,26.0,M,student,168
2,eric,22.0,M,student,170
3,julie,44.0,F,scientist,180
4,rita,,,,167
