# **PANDAS: INTRODUCTION**
> It is often said that 80% of data analysis is spent cleaning and preparing data. This section focuses on a small but important part of that work: data manipulation and cleaning with Pandas.
## **Data Structures**
**Pandas provides two core data structures:**<br>
* **Series -** A one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

* **Data Frame -** A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.
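
As a small sketch of how the two structures relate (the names `prices` and `table` are illustrative and not part of the original notebook), selecting a single column of a DataFrame gives back a Series that carries the same index:

```python
import pandas as pd

# A Series: one-dimensional values with labels
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'banana', 'cherry'])

# A DataFrame: two-dimensional, columns may have different types
table = pd.DataFrame({'price': prices, 'in_stock': [True, False, True]})

col = table['price']                    # a single column is a Series
print(type(col).__name__)
print(col.index.equals(table.index))    # the column shares the row labels
```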

## **Series Data Structure:**
**pandas.core.series.Series(data, index, dtype, copy)**<br>
* **data -** the values; accepts an ndarray, list, scalar constant, dictionary, etc.<br>
* **index -** the axis labels; values must be hashable but need not be unique. If omitted, it defaults to 0, 1, ..., n-1.<br>
* **dtype -** the data type of the values; it is inferred from the data if not given.<br>
* **copy -** whether to copy the input data; its default value is False. It only affects Series or one-dimensional ndarray inputs.
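
A minimal sketch of these parameters used together, assuming a plain Python list as input (the labels `'a'`, `'b'`, `'c'` are made up for illustration):

```python
import pandas as pd

# data: a plain list; index: custom labels; dtype: force floating point
s = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'], dtype='float64')

print(s)
print(s.dtype, list(s.index))
```

Without `dtype`, the same list would be stored as integers; passing it explicitly overrides the inference.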

In [1]:
# importing required modules
import pandas as pd
import numpy as np

In [3]:
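# creating an empty Series
import warnings
warnings.filterwarnings('ignore')
# pass dtype explicitly: the default dtype of an empty Series differs across pandas versions
s = pd.Series(dtype = 'float64')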
print (s, type(s))

Series([], dtype: float64) <class 'pandas.core.series.Series'>


In [5]:
# create a Series from a ndarray
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data)
print (s)
print (type(s), s[0], s[3])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
0        apple
1       banana
2       cherry
3    pineapple
dtype: object
<class 'pandas.core.series.Series'> apple pineapple


In [6]:
arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(data = arr_data, copy = False)
print (arr_data)
print (s)
s[0] = 999; arr_data[2] = 888
print (arr_data)
print (s)

[100 300 200 600 500]
0    100
1    300
2    200
3    600
4    500
dtype: int32
[999 300 888 600 500]
0    999
1    300
2    888
3    600
4    500
dtype: int32


In [7]:
arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(data = arr_data, copy = True)
print (arr_data)
print (s)
s[0] = 999; arr_data[2] = 888
print (arr_data)
print (s)

[100 300 200 600 500]
0    100
1    300
2    200
3    600
4    500
dtype: int32
[100 300 888 600 500]
0    999
1    300
2    200
3    600
4    500
dtype: int32
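
The two cells above can also be checked without mutating anything. The sketch below uses `np.shares_memory` to ask whether a Series still points at the original array's buffer. With `copy=True` a new buffer is always allocated; with `copy=False` the outcome may depend on the pandas version and its copy-on-write settings, so no result is assumed for that case.

```python
import numpy as np
import pandas as pd

arr = np.array([100, 300, 200])
shared = pd.Series(arr, copy=False)
copied = pd.Series(arr, copy=True)

# copy=True: the Series owns its own data, independent of arr
print(np.shares_memory(arr, copied.to_numpy()))

# copy=False: the Series may reuse arr's buffer (version-dependent)
print(np.shares_memory(arr, shared.to_numpy()))
```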


In [9]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data)
print (s)
s = pd.Series(data = arr_data, index = [100, 101, 102, 103])
print (s)
print (s[100], type(s[100]), s[103], type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
0        apple
1       banana
2       cherry
3    pineapple
dtype: object
100        apple
101       banana
102       cherry
103    pineapple
dtype: object
apple <class 'str'> pineapple <class 'str'>


In [16]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, index = [100, 101, 100, 103])
print (s)
print (s[100], type(s[100]), s[103], type(s[103]))
s[100] = 'lemon'
print (s)
print (s[100])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
100        apple
101       banana
100       cherry
103    pineapple
dtype: object
100     apple
100    cherry
dtype: object <class 'pandas.core.series.Series'> pineapple <class 'str'>
100        lemon
101       banana
100        lemon
103    pineapple
dtype: object
100    lemon
100    lemon
dtype: object
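
Because a duplicated label makes `s[100]` return a Series instead of a single value, it can help to test for duplicates up front. A short sketch, reusing the same data as the cell above:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'pineapple'],
              index=[100, 101, 100, 103])

# False here, because label 100 appears twice
print(s.index.is_unique)

# Boolean mask marking every occurrence of a repeated label
dup_mask = s.index.duplicated(keep=False)
print(s[dup_mask])
```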


In [14]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data)
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])
print (s)
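# use .iloc for positional access; integer keys in s[...] on a non-integer index are deprecated in recent pandas
print (s['fruit-1'], s.iloc[0], s['fruit-3'], s.iloc[2])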

['apple' 'banana' 'cherry' 'pineapple']
fruit-1        apple
fruit-2       banana
fruit-3       cherry
fruit-4    pineapple
dtype: object
apple apple cherry cherry
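
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. The sketch below (the Series `s2` with integer labels is an illustrative addition) shows why the distinction matters once the labels are themselves integers:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry'],
              index=['fruit-1', 'fruit-2', 'fruit-3'])
by_label = s.loc['fruit-2']     # look up by index label
by_position = s.iloc[1]         # look up by integer position
print(by_label, by_position)

# With integer labels, .loc and .iloc answer different questions
s2 = pd.Series(['a', 'b'], index=[10, 20])
print(s2.loc[10], s2.iloc[0])
```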


In [19]:
# create a Series from a dictionary
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
print (dict_data)
s = pd.Series(data = dict_data)
print (s)
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])

{'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435}
apple      100
banana     202
coconut    450
mango      435
dtype: int64


In [23]:
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])
print (s)
s = pd.Series(data = dict_data, index = ['banana', 'lime', 'coconut', 'mango', 
                                         'guava', 'apple', 'mango', 'apple', 'coconut'])
print (s)
print (s['banana'], s['lime'], s.iloc[4], s.iloc[5])

banana     202
mango      435
apple      100
coconut    450
dtype: int64
banana     202.0
lime         NaN
coconut    450.0
mango      435.0
guava        NaN
apple      100.0
mango      435.0
apple      100.0
coconut    450.0
dtype: float64
202.0 nan nan 100.0
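
Labels in `index` that have no key in the dictionary become `NaN`, which is also why the dtype switches from `int64` to `float64`. A minimal sketch for locating and dropping such entries, using a shortened version of the dictionary above:

```python
import pandas as pd

dict_data = {'apple': 100, 'banana': 202}
s = pd.Series(dict_data, index=['banana', 'lime', 'apple'])

# Labels that were requested but not present in the dictionary
missing = s[s.isna()].index.tolist()
print(missing)

# Drop the missing entries and restore an integer dtype
clean = s.dropna().astype('int64')
print(clean)
```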


In [25]:
# create a Series from a scalar
s = pd.Series(data = 5, index = [0, 1, 2, 3, 4])
print (s)
s = pd.Series(5, index = [0, 1, 2, 0, 1, 2])
print (s)

0    5
1    5
2    5
3    5
4    5
dtype: int64
0    5
1    5
2    5
0    5
1    5
2    5
dtype: int64


In [27]:
# Create a Series from a list
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s['blue'], s.iloc[1])
print (s[0:4])
print (s[-5:-1])
print (s[['brown', 'red', 'silver', 'blue']])

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
303 303
red      101
blue     303
brown    202
black    404
dtype: int64
red      101
blue     303
brown    202
black    404
dtype: int64
brown     202
red       101
silver    505
blue      303
dtype: int64
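
One detail worth illustrating alongside the slices above: slicing by label with `.loc` includes the end label, while slicing by position with `.iloc` excludes the end position. A sketch with the same data:

```python
import pandas as pd

s = pd.Series([101, 303, 202, 404, 505],
              index=['red', 'blue', 'brown', 'black', 'silver'])

by_label = s.loc['red':'black']   # end label 'black' is included
by_position = s.iloc[0:3]         # end position 3 is excluded

print(len(by_label), len(by_position))
```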


In [30]:
print (s)
print (s.sort_values())
print (s.sort_index())

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
red       101
brown     202
blue      303
black     404
silver    505
dtype: int64
black     404
blue      303
brown     202
red       101
silver    505
dtype: int64
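
Both sorting methods return a new Series and leave the original untouched. A short sketch, additionally showing the `ascending` parameter for descending order:

```python
import pandas as pd

s = pd.Series([101, 303, 202, 404, 505],
              index=['red', 'blue', 'brown', 'black', 'silver'])

# Largest value first; s itself keeps its original order
desc = s.sort_values(ascending=False)
print(desc)
print(s.index.tolist())
```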


In [34]:
print (s)
print (s.argmin(), s.argmax(), s.count(), len(s))

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
0 4 5 5
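
`argmin` and `argmax` return integer positions. When the label is what is needed, `idxmin` and `idxmax` can be used instead, as in this sketch with the same Series:

```python
import pandas as pd

s = pd.Series([101, 303, 202, 404, 505],
              index=['red', 'blue', 'brown', 'black', 'silver'])

# Labels of the smallest and largest values
print(s.idxmin(), s.idxmax())

# Equivalent lookup going through the position
print(s.index[s.argmax()])
```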


## **Data Frame Data Structure:**

## **Create DataFrame**

In [37]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict)
df

| | emp_name | emp_age |
|---|---|---|
| 0 | Amal | 34 |
| 1 | Kamal | 35 |
| 2 | Bimal | 45 |
| 3 | Shyamal | 43 |


In [44]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
print (df)
df

    emp_name  emp_age
100     Amal       34
101    Kamal       35
102    Bimal       45
103  Shyamal       43


| | emp_name | emp_age |
|---|---|---|
| 100 | Amal | 34 |
| 101 | Kamal | 35 |
| 102 | Bimal | 45 |
| 103 | Shyamal | 43 |


In [40]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
df = df.reset_index()
df

| | index | emp_name | emp_age |
|---|---|---|---|
| 0 | 100 | Amal | 34 |
| 1 | 101 | Kamal | 35 |
| 2 | 102 | Bimal | 45 |
| 3 | 103 | Shyamal | 43 |


In [43]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
df.reset_index(inplace = True)
df

| | index | emp_name | emp_age |
|---|---|---|---|
| 0 | 100 | Amal | 34 |
| 1 | 101 | Kamal | 35 |
| 2 | 102 | Bimal | 45 |
| 3 | 103 | Shyamal | 43 |
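
After `reset_index`, the old index lands in a column called `index`. Two common variations are sketched below, under the assumption that a more descriptive column name (`emp_id`, chosen here for illustration) is wanted, or that the old index should simply be discarded:

```python
import pandas as pd

data_dict = {'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'],
             'emp_age': [34, 35, 45, 43]}
df = pd.DataFrame(data=data_dict, index=[100, 101, 102, 103])

# Name the index first so the new column is called 'emp_id'
named = df.rename_axis('emp_id').reset_index()
print(named.columns.tolist())

# drop=True discards the old index instead of keeping it as a column
dropped = df.reset_index(drop=True)
print(dropped.columns.tolist())
```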


## **Creating three users: user1, user2 and user3**

In [57]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]
user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame(data = user_data, columns = user_columns)
user1

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |


In [62]:
user_data = dict(name = ['eric', 'paul'], age = [22, 58], gender = ['M', 'F'], job = ['student', 'manager'])
print (user_data)
user2 = pd.DataFrame(data = user_data)
user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


| | name | age | gender | job |
|---|---|---|---|---|
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |


In [48]:
user_data = {'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}
user3 = pd.DataFrame(data = user_data)
user3

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | peter | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |
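
The three users above are built from a list of rows, a `dict(...)` call, and a dictionary literal. A further option, sketched here as an illustration rather than something the original notebook uses, is a list of per-row dictionaries; the keys become the column names:

```python
import pandas as pd

# Row-oriented input: a list of lists plus explicit column names
by_rows = pd.DataFrame(data=[['alice', 19, 'F', 'student'],
                             ['john', 26, 'M', 'student']],
                       columns=['name', 'age', 'gender', 'job'])

# Record-oriented input: one dictionary per row
records = [
    {'name': 'alice', 'age': 19, 'gender': 'F', 'job': 'student'},
    {'name': 'john', 'age': 26, 'gender': 'M', 'job': 'student'},
]
by_records = pd.DataFrame(records)

print(by_records)
```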


## **Concatenation of DataFrames**

In [63]:
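# Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# on newer versions use pd.concat, as in the later cells of this section
users = user1.append(user2)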
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |


In [64]:
users = user1.append(user2, ignore_index=True)
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |


In [65]:
users = users.append(user3, ignore_index=True)
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [66]:
users = user1.append(user2).append(user3, ignore_index = True)
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [68]:
users = pd.concat([user1, user2, user3])
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 0 | eric | 22 | M | student |
| 1 | paul | 58 | F | manager |
| 0 | peter | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |


In [70]:
users = pd.concat([user1, user2, user3], ignore_index = True)
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |
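
`ignore_index=True` throws away the information about which frame each row came from. If that origin matters, one option is the `keys` parameter of `pd.concat`, which adds an outer index level. The sketch below uses trimmed-down versions of `user1` and `user2`; the key strings are illustrative:

```python
import pandas as pd

user1 = pd.DataFrame({'name': ['alice', 'john'], 'age': [19, 26]})
user2 = pd.DataFrame({'name': ['eric', 'paul'], 'age': [22, 58]})

# keys adds an outer index level naming the source of each row
users = pd.concat([user1, user2], keys=['user1', 'user2'])
print(users)

# Select all rows that came from the second frame
print(users.loc['user2'])
```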


## **DataFrame to NumPy ndarray**

In [71]:
users

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | alice | 19 | F | student |
| 1 | john | 26 | M | student |
| 2 | eric | 22 | M | student |
| 3 | paul | 58 | F | manager |
| 4 | peter | 33 | M | engineer |
| 5 | julie | 44 | F | scientist |


In [73]:
arr_data = users[['name', 'job']].to_numpy()
print (arr_data, type(arr_data), arr_data.shape)

[['alice' 'student']
 ['john' 'student']
 ['eric' 'student']
 ['paul' 'manager']
 ['peter' 'engineer']
 ['julie' 'scientist']] <class 'numpy.ndarray'> (6, 2)


In [74]:
arr_data = users.to_numpy()
print (arr_data, type(arr_data), arr_data.shape)

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']
 ['eric' 22 'M' 'student']
 ['paul' 58 'F' 'manager']
 ['peter' 33 'M' 'engineer']
 ['julie' 44 'F' 'scientist']] <class 'numpy.ndarray'> (6, 4)


In [76]:
arr_data = np.array(users)
print (arr_data, type(arr_data), arr_data.shape)

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']
 ['eric' 22 'M' 'student']
 ['paul' 58 'F' 'manager']
 ['peter' 33 'M' 'engineer']
 ['julie' 44 'F' 'scientist']] <class 'numpy.ndarray'> (6, 4)
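
A NumPy array has a single dtype, so converting a DataFrame with mixed column types yields an `object` array, whereas a purely numeric selection keeps a numeric dtype. A minimal sketch with a two-row stand-in for `users`:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({'name': ['alice', 'john'], 'age': [19, 26]})

mixed = users.to_numpy()            # strings and integers together
ages = users[['age']].to_numpy()    # numeric column only

print(mixed.dtype, mixed.shape)
print(ages.dtype, ages.shape)
```

Selecting the numeric columns before converting avoids ending up with an `object` array where arithmetic is slower and less predictable.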
