## **Introduction to Pandas**

> It is often said that 80% of data analysis is spent on cleaning and preparing data. This article focuses on a small but important aspect of that work: data manipulation and cleaning with Pandas.

### **Data Structures in Pandas**

Pandas provides two core data structures:

* **Series -** A one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).
* **Data Frame -** A two-dimensional labeled data structure with columns of potentially different data types. We can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.
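
As a minimal sketch of the two structures, the following builds two Series and combines them into a Data Frame. The names `labels`, `prices`, `stock`, and `fruits` are illustrative and not part of Pandas.

```python
import pandas as pd

labels = ["apple", "banana", "mango"]

# A Series: one-dimensional values, each paired with an index label.
prices = pd.Series([1.5, 0.25, 3.0], index=labels)
stock = pd.Series([10, 150, 4], index=labels)

# A DataFrame: columns of possibly different dtypes sharing one index.
# Passing a dict of Series makes each key a column name.
fruits = pd.DataFrame({"price": prices, "stock": stock})

print(fruits)
print(type(fruits["price"]))  # selecting one column gives back a Series
```

Each column keeps its own dtype (floating point for `price`, integer for `stock`), which is what "columns of potentially different data types" means in practice.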

### **Series Data Structure:**

**pandas.core.series.Series(data, index, dtype, copy)**

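* **data -** The values to store. It accepts several forms, such as an ndarray, a list, a scalar constant, or a dictionary.
* **index -** The labels for the values. Labels must be hashable; they do not have to be unique, although unique labels make lookups unambiguous. When the data carries no labels of its own, the index defaults to 0, 1, ..., n-1.
* **dtype -** The data type of the values. When omitted, it is inferred from the data.
* **copy -** Whether to copy the input data. Its default value is False in the runs shown here, and it only affects Series or one-dimensional ndarray inputs.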
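
The four parameters can be sketched together in one call. This is only an illustration: the variable names and values below are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

s = pd.Series(
    data=values,            # the values to store
    index=["a", "b", "c"],  # one hashable label per value
    dtype="float64",        # convert the integers to floats
    copy=True,              # give the Series its own copy of the data
)

# Because the Series owns a copy, changing the array leaves it untouched.
values[0] = -1

print(s)
```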

In [1]:
# importing the required modules
import numpy as np
import pandas as pd

In [3]:
# create an empty Series; the explicit dtype avoids the version-dependent default dtype
import warnings
warnings.filterwarnings('ignore')
s = pd.Series(dtype='float64')
print (s, type(s))

Series([], dtype: float64) <class 'pandas.core.series.Series'>


In [6]:
# create a Series from a ndarray
arr_data = np.array(['apple', 'guava', 'banana', 'pineapple', 'coconut'])
s = pd.Series(data = arr_data)
print (s, type(s))
print (s[0], s[1], s[4])

0        apple
1        guava
2       banana
3    pineapple
4      coconut
dtype: object <class 'pandas.core.series.Series'>
apple guava coconut


In [7]:
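# a homogeneous integer array gives an integer Series (int32 here; the width is platform-dependent)
arr_data = np.array([100, 400, 500, 200, 700])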
s = pd.Series(data = arr_data)
print (s, type(s))
print (s[0], s[1], s[4])

0    100
1    400
2    500
3    200
4    700
dtype: int32 <class 'pandas.core.series.Series'>
100 400 700


In [8]:
# booleans mixed with integers are coerced by NumPy to integers (True -> 1, False -> 0)
arr_data = np.array([100, 400, 500, True, 200, False, 700])
s = pd.Series(data = arr_data)
print (s, type(s))
print (s[0], s[1], s[4])

0    100
1    400
2    500
3      1
4    200
5      0
6    700
dtype: int32 <class 'pandas.core.series.Series'>
100 400 200


In [9]:
# a single float makes NumPy upcast every element to float64
arr_data = np.array([100., 400, 500, True, 200, False, 700])
s = pd.Series(data = arr_data)
print (s, type(s))
print (s[0], s[1], s[4])

0    100.0
1    400.0
2    500.0
3      1.0
4    200.0
5      0.0
6    700.0
dtype: float64 <class 'pandas.core.series.Series'>
100.0 400.0 200.0


In [10]:
# a single string makes NumPy convert every element to a string, so the Series has dtype object
arr_data = np.array([100., 400, 500, True, 200, False, 700, 'Amit'])
s = pd.Series(data = arr_data)
print (s, type(s))
print (s[0], s[1], s[4])

0    100.0
1      400
2      500
3     True
4      200
5    False
6      700
7     Amit
dtype: object <class 'pandas.core.series.Series'>
100.0 400 200
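
The coercion in the last cell happens in NumPy, before Pandas sees the data. As a small sketch (the names `mixed`, `from_array`, and `from_list` are illustrative), compare building a Series from an ndarray with building it directly from a list:

```python
import numpy as np
import pandas as pd

mixed = [100.0, 400, True, "Amit"]

# NumPy needs one dtype per array, so every element becomes a string.
from_array = pd.Series(np.array(mixed))

# A plain list skips that step, so pandas keeps the original Python objects.
from_list = pd.Series(mixed)

print([type(v).__name__ for v in from_array])
print([type(v).__name__ for v in from_list])
```

If the original types matter, passing the list directly is the safer choice, since the values printed as `400` or `True` in the array-based Series are really text.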


In [13]:
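# without copy=True the Series shares memory with the array in this run, so a change to one shows up in the other
arr_data = np.array([100, 400, 500, 200, 700])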
s = pd.Series(data = arr_data)
print (arr_data, type(arr_data))
print (s, type(s))

arr_data[2] = 50000
s[1] = 40000
print (arr_data, type(arr_data))
print (s, type(s))

[100 400 500 200 700] <class 'numpy.ndarray'>
0    100
1    400
2    500
3    200
4    700
dtype: int32 <class 'pandas.core.series.Series'>
[  100 40000 50000   200   700] <class 'numpy.ndarray'>
0      100
1    40000
2    50000
3      200
4      700
dtype: int32 <class 'pandas.core.series.Series'>


In [14]:
# copy=False (the default in this run) behaves the same way: the array and the Series share data
arr_data = np.array([100, 400, 500, 200, 700])
s = pd.Series(data = arr_data, copy=False)
print (arr_data, type(arr_data))
print (s, type(s))

arr_data[2] = 50000
s[1] = 40000
print (arr_data, type(arr_data))
print (s, type(s))

[100 400 500 200 700] <class 'numpy.ndarray'>
0    100
1    400
2    500
3    200
4    700
dtype: int32 <class 'pandas.core.series.Series'>
[  100 40000 50000   200   700] <class 'numpy.ndarray'>
0      100
1    40000
2    50000
3      200
4      700
dtype: int32 <class 'pandas.core.series.Series'>


In [15]:
# copy=True gives the Series its own data, so the two changes stay independent
arr_data = np.array([100, 400, 500, 200, 700])
s = pd.Series(data = arr_data, copy=True)
print (arr_data, type(arr_data))
print (s, type(s))

arr_data[2] = 50000
s[1] = 40000
print (arr_data, type(arr_data))
print (s, type(s))

[100 400 500 200 700] <class 'numpy.ndarray'>
0    100
1    400
2    500
3    200
4    700
dtype: int32 <class 'pandas.core.series.Series'>
[  100   400 50000   200   700] <class 'numpy.ndarray'>
0      100
1    40000
2      500
3      200
4      700
dtype: int32 <class 'pandas.core.series.Series'>


In [28]:
# index labels: default positions, then custom integer labels, then duplicate labels
arr_data = np.array(['apple', 'guava', 'banana', 'pineapple', 'coconut'])
print (arr_data, type(arr_data))

s = pd.Series(arr_data)
print (s, type(s), s[1], s[3])

s = pd.Series(arr_data, index = [100, 101, 102, 103, 104])
print (s, type(s), s[101], s[103])

s = pd.Series(arr_data, index = [105, 101, 103, 101, 103])
print (s, type(s))
print (s[101])
print (s[103])

['apple' 'guava' 'banana' 'pineapple' 'coconut'] <class 'numpy.ndarray'>
0        apple
1        guava
2       banana
3    pineapple
4      coconut
dtype: object <class 'pandas.core.series.Series'> guava pineapple
100        apple
101        guava
102       banana
103    pineapple
104      coconut
dtype: object <class 'pandas.core.series.Series'> guava pineapple
105        apple
101        guava
103       banana
101    pineapple
103      coconut
dtype: object <class 'pandas.core.series.Series'>
101        guava
101    pineapple
dtype: object
103     banana
103    coconut
dtype: object
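
Since duplicate labels change what a lookup returns (a single value or a Series), it can help to check the index first. A minimal sketch, using a shortened version of the data above and the illustrative names `unique_item` and `repeated`:

```python
import pandas as pd

s = pd.Series(["apple", "guava", "banana"], index=[101, 103, 101])

print(s.index.is_unique)     # whether every label occurs exactly once
print(s.index.duplicated())  # marks each repeat after the first occurrence

unique_item = s.loc[103]  # label occurs once: a single value
repeated = s.loc[101]     # label occurs twice: a Series holding both matches

print(unique_item)
print(repeated)
```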


In [29]:
# string labels as the index
arr_data = np.array(['apple', 'guava', 'banana', 'pineapple', 'coconut'])
print (arr_data, type(arr_data))

s = pd.Series(arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4', 'fruit-5'])
print (s, type(s))
print (s['fruit-1'], s['fruit-3'])
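# positional access: use .iloc, since plain s[0] on a non-integer index is deprecated in recent pandas
print (s.iloc[0], s.iloc[2], s.iloc[1])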

['apple' 'guava' 'banana' 'pineapple' 'coconut'] <class 'numpy.ndarray'>
fruit-1        apple
fruit-2        guava
fruit-3       banana
fruit-4    pineapple
fruit-5      coconut
dtype: object <class 'pandas.core.series.Series'>
apple banana
apple banana guava
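
Label-based and position-based access are easiest to tell apart with the explicit `.loc` and `.iloc` accessors. A short sketch with integer labels that deliberately differ from the positions (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

s = pd.Series(["apple", "guava", "banana"], index=[102, 100, 101])

by_label = s.loc[100]    # the value whose label is 100
by_position = s.iloc[0]  # the first value, whatever its label

print(by_label)
print(by_position)
```

Using the accessors makes the intent explicit and avoids the ambiguity of `s[0]` when the index itself contains integers.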


In [31]:
# create a Series from a dictionary
dict_data = {'apple':100, 'banana':202, 'coconut':301, 'mango':709}
s = pd.Series(data = dict_data)
print (s, type(s))
print (s['banana'], s['mango'])

apple      100
banana     202
coconut    301
mango      709
dtype: int64 <class 'pandas.core.series.Series'>
202 709


In [33]:
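# with a dict, index selects and orders the keys; a label missing from the dict ('lime') becomes NaN,
# which forces the dtype to float64, and duplicated labels such as 'mango' return a Series on lookup
dict_data = {'apple':100, 'banana':202, 'coconut':301, 'mango':709}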
s = pd.Series(data = dict_data, index = ['coconut', 'apple', 'mango', 'coconut', 'mango', 'coconut', 'banana', 'lime'])
print (s, type(s))
print (s['banana'], s['mango'])

coconut    301.0
apple      100.0
mango      709.0
coconut    301.0
mango      709.0
coconut    301.0
banana     202.0
lime         NaN
dtype: float64 <class 'pandas.core.series.Series'>
202.0 mango    709.0
mango    709.0
dtype: float64
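
The NaN for a missing key and the resulting switch to float64 can be handled explicitly. A minimal sketch, assuming a smaller dictionary named `stock` (illustrative, not from the cells above):

```python
import pandas as pd

stock = {"apple": 100, "banana": 202}

# 'lime' is not a key of the dict, so its value is NaN and the dtype becomes float64.
s = pd.Series(stock, index=["banana", "lime"])

missing = s.isna()                  # True where no value was found
clean = s.dropna().astype("int64")  # drop the gaps, then restore integers

print(s)
print(missing)
print(clean)
```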


In [34]:
# create a Series from a scalar
s = pd.Series(5, index = [0, 1, 2, 3, 4, 5, 6, 7])
print (s, type(s))

0    5
1    5
2    5
3    5
4    5
5    5
6    5
7    5
dtype: int64 <class 'pandas.core.series.Series'>


In [36]:
# create a Series from a list
list_data = [101, 303, 202, 505, 404]
s = pd.Series(data = list_data, index = ['red', 'blue', 'purple', 'white', 'brown'])
print (s, type(s))

red       101
blue      303
purple    202
white     505
brown     404
dtype: int64 <class 'pandas.core.series.Series'>


### **Data Frame Data Structure:**