# **PANDAS: INTRODUCTION**
> It is often said that 80% of data analysis is spent cleaning and preparing data. This section focuses on a small but important part of that work: data manipulation and cleaning with Pandas.
## **Data Structures**
**Pandas has two main data structures -**<br>
* **Series -** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

* **Data Frame -** A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.
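
To make the relationship between the two structures concrete, here is a minimal sketch. The fruit prices and stock counts are made-up illustration data, not part of the original notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'banana', 'cherry'])
stock = pd.Series([10, 150, 42], index=['apple', 'banana', 'cherry'])

# A DataFrame: each column is a Series, and all columns share one index.
fruit = pd.DataFrame({'price': prices, 'stock': stock})

print(type(fruit['price']))  # selecting one column gives back a Series
print(fruit.shape)           # (number of rows, number of columns)
print(fruit.dtypes)          # each column keeps its own dtype
```

Selecting a single column returns a Series, which is why most Series operations shown below also apply column by column to a DataFrame.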

## **Series Data Structure:**
**pandas.core.series.Series(data, index, dtype, copy)**<br>
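* **data -** the values; accepts an ndarray, list, scalar constant, dictionary, etc.<br>
* **index -** the axis labels. Values must be hashable and match the length of the data; they do not have to be unique. Defaults to 0, 1, ..., n-1 when omitted.<br>
* **dtype -** the data type of the resulting Series; inferred from the data when omitted.<br>
* **copy -** whether to copy the input data; its default value is False. It only affects Series or one-dimensional ndarray inputs.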
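
The cells below exercise `data`, `index` and `copy`. The `dtype` parameter can be sketched as follows; the variable names are illustrative only.

```python
import pandas as pd

# Without dtype, pandas infers an integer type from the values.
inferred = pd.Series([1, 2, 3])

# With dtype, the values are converted when the Series is built.
as_float = pd.Series([1, 2, 3], dtype='float64')

print(inferred.dtype)
print(as_float.dtype)
print(as_float.iloc[0])
```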

In [1]:
# importing required modules
import pandas as pd
import numpy as np

In [3]:
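# creating an empty Series (warnings are silenced for the rest of the notebook)
import warnings
warnings.filterwarnings('ignore')
# pass dtype explicitly: the default dtype of an empty Series differs between pandas versions
s = pd.Series(dtype = 'float64')
print (s, type(s))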

Series([], dtype: float64) <class 'pandas.core.series.Series'>


In [6]:
# create a Series from a ndarray
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data)
print (s)
print (type(s), s[0], s[3])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
0        apple
1       banana
2       cherry
3    pineapple
dtype: object
<class 'pandas.core.series.Series'> apple pineapple


In [9]:
arr_data = np.array([100, 300, 200, 600, 500])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, copy = False)
print (s)
s[0] = 999; arr_data[2] = 888
print (arr_data, type(arr_data))
print (s)

[100 300 200 600 500] <class 'numpy.ndarray'>
0    100
1    300
2    200
3    600
4    500
dtype: int32
[999 300 888 600 500] <class 'numpy.ndarray'>
0    999
1    300
2    888
3    600
4    500
dtype: int32


In [10]:
arr_data = np.array([100, 300, 200, 600, 500])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, copy = True)
print (s)
s[0] = 999; arr_data[2] = 888
print (arr_data, type(arr_data))
print (s)

[100 300 200 600 500] <class 'numpy.ndarray'>
0    100
1    300
2    200
3    600
4    500
dtype: int32
[100 300 888 600 500] <class 'numpy.ndarray'>
0    999
1    300
2    200
3    600
4    500
dtype: int32


In [17]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))

s = pd.Series(data = arr_data, index = [100, 101, 102, 103])
print (s)
print (s[100], type(s[100]), s[103], type(s[103]))

s = pd.Series(data = arr_data, index = [100, 101, 100, 103])
print (s)
print (s[100], type(s[100]))
print (s[103], type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
100        apple
101       banana
102       cherry
103    pineapple
dtype: object
apple <class 'str'> pineapple <class 'str'>
100        apple
101       banana
100       cherry
103    pineapple
dtype: object
100     apple
100    cherry
dtype: object <class 'pandas.core.series.Series'>
pineapple <class 'str'>
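
As the output above shows, a label lookup returns a scalar when the label is unique and a Series when it is repeated. A small sketch of how one might detect and handle repeated labels, reusing the same data (`deduped` is an illustrative name):

```python
import numpy as np
import pandas as pd

arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
s = pd.Series(data=arr_data, index=[100, 101, 100, 103])

# Check whether every label occurs only once.
print(s.index.is_unique)

# Boolean mask marking the second and later occurrences of each label.
print(s.index.duplicated())

# Passing a list to .loc always returns a Series, even for a unique label.
print(s.loc[[103]])

# Keep only the first row for each label.
deduped = s[~s.index.duplicated(keep='first')]
print(deduped)
```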


In [15]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])
print (s)
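# use .iloc for positional access; s[0] on a string-labeled index is deprecated in recent pandas
print (s['fruit-1'], s.iloc[0], s['fruit-3'], s.iloc[2])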

fruit-1        apple
fruit-2       banana
fruit-3       cherry
fruit-4    pineapple
dtype: object
apple apple cherry cherry
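
The cell above reaches the same element by label and by position. The explicit accessors `.loc` (labels) and `.iloc` (positions) make the intent unambiguous; a short sketch on the same data:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'pineapple'],
              index=['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])

by_label = s.loc['fruit-3']    # lookup by index label
by_position = s.iloc[2]        # lookup by integer position
print(by_label, by_position)

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = s.loc['fruit-1':'fruit-3']
position_slice = s.iloc[0:2]
print(len(label_slice), len(position_slice))
```

The difference in slice endpoints is a common source of off-by-one surprises when switching between the two accessors.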


In [21]:
# create a Series from a dictionary
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
print (dict_data)
s = pd.Series(dict_data)
print (s)
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])
print (s)
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut', 'lime'])
print (s)
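# positional access via .iloc (s[0] on a string-labeled index is deprecated in recent pandas)
print (s.iloc[0], s.iloc[2])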

{'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435}
apple      100
banana     202
coconut    450
mango      435
dtype: int64
banana     202
mango      435
apple      100
coconut    450
dtype: int64
banana     202.0
mango      435.0
apple      100.0
coconut    450.0
lime         NaN
dtype: float64
202.0 100.0
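
In the last Series above, the label `lime` has no entry in the dictionary, so pandas fills it with NaN and the dtype changes from int64 to float64. A sketch of a few ways to deal with that, assuming a fill value of 0 is acceptable for the data at hand:

```python
import pandas as pd

dict_data = {'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435}
wanted = ['banana', 'mango', 'apple', 'coconut', 'lime']

# Labels missing from the dictionary become NaN.
s = pd.Series(dict_data, index=wanted)
print(s.isna())

# Option 1: drop the missing entries.
present = s.dropna()

# Option 2: reindex with an explicit fill value (0 is an assumption here).
filled = pd.Series(dict_data).reindex(wanted, fill_value=0)
print(filled)
```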


In [25]:
# create a Series from a scalar
s = pd.Series(data = 5, index = [0, 1, 2, 3, 4])
print (s)
s = pd.Series(5, index = [0, 1, 2, 0, 1, 2])
print (s)
print (s[0])

0    5
1    5
2    5
3    5
4    5
dtype: int64
0    5
1    5
2    5
0    5
1    5
2    5
dtype: int64
0    5
0    5
dtype: int64


In [29]:
# Create a Series from a list
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s['blue'], s.iloc[1])  # by label, then by position
print ()
print (s[0:4])
print ()
print (s[-5:-1])

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
303 303

red      101
blue     303
brown    202
black    404
dtype: int64

red      101
blue     303
brown    202
black    404
dtype: int64


In [30]:
print (s[['brown', 'red', 'silver', 'blue']])

brown     202
red       101
silver    505
blue      303
dtype: int64


In [31]:
print (s.sort_values())

red       101
brown     202
blue      303
black     404
silver    505
dtype: int64


In [32]:
print (s.sort_index())

black     404
blue      303
brown     202
red       101
silver    505
dtype: int64


In [37]:
print (s)
print (s.argmin(), s.argmax(), s.count())
print (s.min(), s.max(), s.mean(), s.sum(), len(s))

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
0 4 5
101 505 303.0 1515 5
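
`argmin()` and `argmax()` return integer positions (0 and 4 above). When the label is wanted instead, `idxmin()` and `idxmax()` can be used; a short sketch on the same Series:

```python
import pandas as pd

s = pd.Series([101, 303, 202, 404, 505],
              index=['red', 'blue', 'brown', 'black', 'silver'])

print(s.argmax(), s.idxmax())  # position vs. label of the maximum
print(s.argmin(), s.idxmin())  # position vs. label of the minimum

# describe() gathers the common summary statistics in one call.
summary = s.describe()
print(summary)
```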


## **Data Frame Data Structure:**

### **Create DataFrame**

In [45]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
print (data_dict)
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
df

{'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age': [34, 35, 45, 43]}


|  | emp_name | emp_age |
|---|---|---|
| 100 | Amal | 34 |
| 101 | Kamal | 35 |
| 102 | Bimal | 45 |
| 103 | Shyamal | 43 |


In [48]:
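# reset_index moves the current index into a column named 'index'
# (the 'level_0' column in the output below indicates the cell had already been run once on this df)
df.reset_index(inplace=True)
print (df)
df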

   level_0  index emp_name  emp_age
0        0    100     Amal       34
1        1    101    Kamal       35
2        2    102    Bimal       45
3        3    103  Shyamal       43


|  | level_0 | index | emp_name | emp_age |
|---|---|---|---|---|
| 0 | 0 | 100 | Amal | 34 |
| 1 | 1 | 101 | Kamal | 35 |
| 2 | 2 | 102 | Bimal | 45 |
| 3 | 3 | 103 | Shyamal | 43 |
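
Because `inplace=True` modifies `df` directly, re-running the cell keeps adding columns. A sketch of a variant that returns new frames instead, so it gives the same result on every run; the index name `emp_id` and the variable names are choices made for this example.

```python
import pandas as pd

data_dict = {'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'],
             'emp_age': [34, 35, 45, 43]}
df = pd.DataFrame(data=data_dict, index=[100, 101, 102, 103])
df.index.name = 'emp_id'  # a named index becomes a named column on reset

# Returns a new frame; df itself is left untouched.
flat = df.reset_index()
print(flat)

# drop=True discards the old index instead of keeping it as a column.
dropped = df.reset_index(drop=True)
print(dropped)

# set_index is the inverse operation: move a column back into the index.
restored = flat.set_index('emp_id')
print(restored)
```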


### **Creating user1, user2, user3**

In [59]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]
user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame(data = user_data, columns = user_columns)
user1

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |


In [64]:
user_data = dict(name = ['eric', 'paul'], age = [22, 58], gender = ['M', 'F'], job = ['student', 'manager'])
print (user_data)
user2 = pd.DataFrame(data = user_data)
user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |


In [51]:
user_data = {'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}
print (user_data)
user3 = pd.DataFrame(data = user_data)
user3

{'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}


|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | peter | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |


### **Concatenate DataFrame**

In [65]:
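# originally user1.append(user2); DataFrame.append was removed in pandas 2.0, pd.concat is the equivalent
users = pd.concat([user1, user2])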
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |


In [53]:
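# originally user1.append(user2, ignore_index = True); ignore_index renumbers the rows 0..n-1
users = pd.concat([user1, user2], ignore_index = True)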
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |


In [54]:
# originally users.append(user3, ignore_index=True)
users = pd.concat([users, user3], ignore_index=True)
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [66]:
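# originally user1.append(user2).append(user3, ignore_index = True); nested concat mirrors the chained calls
users = pd.concat([pd.concat([user1, user2]), user3], ignore_index = True)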
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [67]:
users = pd.concat([user1, user2, user3])
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |
| 0 | peter | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |


In [68]:
users = pd.concat([user1, user2, user3], ignore_index=True)
users

|  | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |
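
Besides `ignore_index`, `pd.concat` offers two other ways of dealing with the repeated row labels seen above. The sketch below is illustrative; the key names `batch1` and `batch2` are invented for the example.

```python
import pandas as pd

user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, 'F', 'student'],
                      ['john', 26, 'M', 'student']], columns=user_columns)
user2 = pd.DataFrame([['eric', 22, 'M', 'student'],
                      ['paul', 58, 'F', 'manager']], columns=user_columns)

# keys= adds an outer index level recording which frame each row came from.
users = pd.concat([user1, user2], keys=['batch1', 'batch2'])
print(users)
print(users.loc['batch2'])

# verify_integrity=True refuses to build a result with duplicate labels.
overlap_error = None
try:
    pd.concat([user1, user2], verify_integrity=True)
except ValueError as exc:
    overlap_error = exc
print(overlap_error)
```

Using `keys` keeps the provenance of each row, while `verify_integrity` is useful as a guard when duplicate labels would be a bug.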


### **DataFrame to NumPy ndarray**

In [71]:
arr_data = user1[['name', 'job']].to_numpy()
print (arr_data, type(arr_data))

[['alice' 'student']
 ['john' 'student']] <class 'numpy.ndarray'>


In [72]:
arr_data = user1.to_numpy()
print (arr_data, type(arr_data))

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'>


In [73]:
arr_data = np.array(user1)
print (arr_data, type(arr_data))

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'>
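
The arrays printed above mix strings and integers, so NumPy has to fall back to the generic `object` dtype. Selecting only numeric columns yields a numeric array; a brief sketch with the same `user1` data:

```python
import numpy as np
import pandas as pd

user1 = pd.DataFrame([['alice', 19, 'F', 'student'],
                      ['john', 26, 'M', 'student']],
                     columns=['name', 'age', 'gender', 'job'])

mixed = user1.to_numpy()            # mixed column types -> object array
ages = user1['age'].to_numpy()      # a single numeric column -> integer array
ages_float = user1[['age']].to_numpy(dtype='float64')  # request a dtype explicitly

print(mixed.dtype, ages.dtype, ages_float.dtype)
print(ages_float.shape)             # double brackets keep the 2-D shape
```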


### **Join DataFrame**

In [75]:
dict_data = dict(name = ['alice', 'john', 'eric', 'julie', 'anderson'], height = [165, 180, 175, 171, 169])
print (dict_data)
user4 = pd.DataFrame(data = dict_data)
user4

{'name': ['alice', 'john', 'eric', 'julie', 'anderson'], 'height': [165, 180, 175, 171, 169]}


|  | name | height |
|---|---|---|
| 0 | alice | 165 |
| 1 | john | 180 |
| 2 | eric | 175 |
| 3 | julie | 171 |
| 4 | anderson | 169 |


In [76]:
# inner join: only rows whose name appears in both data frames
merge_inner = pd.merge(users, user4, on = "name", how = "inner")
merge_inner

|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165 |
| 1 | john | 26 | M | student | 180 |
| 2 | eric | 22 | M | student | 175 |
| 3 | julie | 44 | F | scientist | 171 |


In [78]:
# inner join again: how defaults to "inner" when it is omitted
merge_inner = pd.merge(users, user4, on = "name")
merge_inner

|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165 |
| 1 | john | 26 | M | student | 180 |
| 2 | eric | 22 | M | student | 175 |
| 3 | julie | 44 | F | scientist | 171 |


In [79]:
# outer join: all rows from both data frames; cells without a match become NaN
merge_outer = pd.merge(users, user4, on = "name", how = "outer")
merge_outer

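|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19.0 | F | student | 165.0 |
| 1 | john | 26.0 | M | student | 180.0 |
| 2 | eric | 22.0 | M | student | 175.0 |
| 3 | paul | 58.0 | F | manager | NaN |
| 4 | peter | 33.0 | M | engineer | NaN |
| 5 | julie | 44.0 | F | scientist | 171.0 |
| 6 | anderson | NaN | NaN | NaN | 169.0 |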


In [80]:
# left outer join: All rows from the left data frame and matching rows from the right data frame
merge_left = pd.merge(users, user4, on = "name", how = "left")
merge_left

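|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19 | F | student | 165.0 |
| 1 | john | 26 | M | student | 180.0 |
| 2 | eric | 22 | M | student | 175.0 |
| 3 | paul | 58 | F | manager | NaN |
| 4 | peter | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |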


In [81]:
# right outer join: Only matching rows from the left data frame and all rows from the right data frame
merge_right = pd.merge(users, user4, on = "name", how = "right")
merge_right

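|  | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | alice | 19.0 | F | student | 165 |
| 1 | john | 26.0 | M | student | 180 |
| 2 | eric | 22.0 | M | student | 175 |
| 3 | julie | 44.0 | F | scientist | 171 |
| 4 | anderson | NaN | NaN | NaN | 169 |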
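
When comparing the four join types, it can help to see which side each row came from. The `indicator=True` option of `pd.merge` adds a `_merge` column for that purpose. The sketch below rebuilds trimmed-down versions of `users` and `user4` (only the `name`, `age` and `height` columns) so that it runs on its own.

```python
import pandas as pd

users = pd.DataFrame({'name': ['alice', 'john', 'eric', 'paul', 'peter', 'julie'],
                      'age': [19, 26, 22, 58, 33, 44]})
user4 = pd.DataFrame({'name': ['alice', 'john', 'eric', 'julie', 'anderson'],
                      'height': [165, 180, 175, 171, 169]})

# _merge is 'both', 'left_only' or 'right_only' for every row of the result.
merged = pd.merge(users, user4, on='name', how='outer', indicator=True)
print(merged)

# Rows present only in the left frame, i.e. users without a recorded height.
left_only = merged[merged['_merge'] == 'left_only']
print(left_only['name'].tolist())
```

Filtering on `_merge` is also a quick way to check for unexpected unmatched keys before settling on an inner or left join.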
