# Pandas 

Pandas is an open-source Python library that provides high-performance data manipulation
and analysis tools built on its powerful data structures. The name Pandas is derived from
"panel data", an econometrics term for multidimensional data sets.

In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mostly used for data munging and preparation and contributed
very little to data analysis itself. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin
of the data: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial
domains, including finance, economics, statistics, and analytics.

### Key Features of Pandas
<ul>
    <li>Fast and efficient DataFrame object with default and customized indexing.</li>
    <li>Tools for loading data into in-memory data objects from different file formats.</li>
    <li>Data alignment and integrated handling of missing data.</li>
    <li>Reshaping and pivoting of data sets.</li>
    <li>Label-based slicing, indexing and subsetting of large data sets.</li>
    <li>Columns from a data structure can be deleted or inserted.</li>
    <li>Group by data for aggregation and transformations.</li>
    <li>High performance merging and joining of data.</li>
    <li>Time Series functionality.</li>
</ul>     

Pandas deals with the following three data structures:
<ul>
    <li> Series</li>
    <li> DataFrame</li>
    <li> Panel</li>
</ul> 
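These data structures are built on top of NumPy arrays, which makes them fast. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so current releases provide only Series and DataFrame.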

### Mutability
All Pandas data structures are value mutable (their contents can be changed). All except Series are also size mutable; a Series is size immutable.

Note − DataFrame is widely used and one of the most important data structures. Panel is used much less.
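The difference between value mutability and size mutability can be illustrated with a minimal sketch. The Series `s` and DataFrame `df` below are hypothetical examples, not objects defined elsewhere in this tutorial:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s["a"] = 10                  # value mutation: an existing element is overwritten

df = pd.DataFrame({"one": [1, 2, 3]})
df["two"] = df["one"] * 2    # size mutation: a new column is inserted in place
del df["one"]                # an existing column is removed in place

print(s)
print(df)
```

After these statements, `s` still has three elements, while `df` has gained one column and lost another.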

# Introduction to Data Structures

## Series
A Series is a one-dimensional, array-like structure with homogeneous data. For example, a Series can hold a collection of integers such as 10, 23, 56, …

### Key Points
<ul>
    <li>Homogeneous data</li>
    <li>Size Immutable</li>
    <li>Values of Data Mutable</li>
</ul>

## DataFrame

DataFrame is a two-dimensional array with heterogeneous data. 

### Key Points
<ul>
    <li>Heterogeneous data</li>
    <li>Size Mutable</li>
    <li>Data Mutable</li>
</ul>

## Panel
Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to represent graphically, but it can be pictured as a container of DataFrames.

### Key Points
<ul>
    <li>Heterogeneous data</li>
    <li>Size Mutable</li>
    <li>Data Mutable</li>
</ul>
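
Panel is not available in current pandas releases, but the "container of DataFrames" idea can be approximated by stacking DataFrames under a two-level row index. The following is only a sketch of that approach; the tables `item1` and `item2` are hypothetical:

```python
import pandas as pd

# Two hypothetical 2-D tables that share the same row and column labels.
item1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])
item2 = pd.DataFrame({"x": [5, 6], "y": [7, 8]}, index=["r0", "r1"])

# Stack them under an outer index level; the dict keys become that level.
stacked = pd.concat({"item1": item1, "item2": item2})

print(stacked)
# Selecting one outer label returns the corresponding 2-D table.
print(stacked.loc["item2"])
```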


# Python Pandas - Series
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

### pandas.Series
A pandas Series can be created using the following constructor −

pandas.Series(data, index, dtype, copy)


A series can be created using various inputs like −

<ul>
    <li>Array</li>
    <li>Dict</li>
    <li>Scalar value or constant</li>
</ul>

In [2]:
# Import the pandas library under the alias pd, and NumPy under the alias np
import pandas as pd
import numpy as np

### Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., [0, 1, 2, …, len(array) - 1].

In [7]:
data=np.array(['a','b','c','d'])
s=pd.Series(data)
s

0    a
1    b
2    c
3    d
dtype: object

In [8]:
# Here we pass custom index values, which then appear in the output.
data=np.array(['a','b','c','d'])
s=pd.Series(data,index=[11,12,13,14])
s

11    a
12    b
13    c
14    d
dtype: object

### Create a Series from dict

A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index; recent pandas versions keep the keys in insertion order, while older versions sorted them. If an index is passed, the values in data corresponding to the labels in the index are pulled out.

In [10]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
s

a    0.0
b    1.0
c    2.0
dtype: float64

In [11]:
# Observe: the index order is preserved and the missing element is filled with NaN (Not a Number).
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
s

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

### Create a Series from Scalar

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [13]:
s = pd.Series(5, index=[0, 1, 2, 3])
s

0    5
1    5
2    5
3    5
dtype: int64

### Accessing Data from Series with Position
Data in a Series can be accessed by position, similarly to an ndarray.

#### Example 1
Retrieve the first element. Counting starts from zero, so the first element is stored at position zero, the second at position one, and so on.

In [15]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

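# Retrieve the first element by position. .iloc makes the positional intent explicit;
# plain s[0] on a labelled Series relies on a fallback that newer pandas versions deprecate.
s.iloc[0]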

1

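#### Example 2
Retrieve the first three elements in the Series. If only a start position is given before the : (as in s[2:]), all items from that position onwards are extracted. If only a stop position is given after the : (as in s[:3]), all items before that position are extracted. If two positions are given with a : between them, the items between the two positions are extracted (not including the stop position).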



In [18]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the first three elements
s[0:3]

a    1
b    2
c    3
dtype: int64

In [19]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the last three elements
s[-3:]

c    3
d    4
e    5
dtype: int64

### Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.

#### Example 1
Retrieve a single element using index label value.



In [20]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])

1


In [22]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve multiple elements.
# If a requested label is not in the index, a KeyError is raised.
s[['a','c','d']]

a    1
c    3
d    4
dtype: int64
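
The KeyError behaviour for a missing label can be sketched as follows. The label 'f' and the default value -1 are arbitrary choices made for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Indexing with a label that is absent from the index raises KeyError.
try:
    s["f"]
    missing_raised = False
except KeyError:
    missing_raised = True

# Series.get returns a default value instead of raising.
fallback = s.get("f", default=-1)

print(missing_raised, fallback)
```

Using `get` is a reasonable choice when a missing label is an expected case rather than an error.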

# Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

#### Features of DataFrame
<ul>
<li>Columns can be of different types</li>
<li>Size – Mutable</li>
<li>Labeled axes (rows and columns)</li>
<li>Arithmetic operations can be performed on rows and columns</li>
</ul>






#### pandas.DataFrame
A pandas DataFrame can be created using the following constructor −

pandas.DataFrame(data, index, columns, dtype, copy)

# Create DataFrame
A pandas DataFrame can be created using various inputs like −
<ul>
<li>Lists</li>
<li>dict</li>
<li>Series</li>
<li>NumPy ndarrays</li>
<li>Another DataFrame</li>
</ul>

### Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.

In [26]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


### Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.

In [33]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
df

   0
0  1
1  2
2  3
3  4
4  5


In [34]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


### Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

In [37]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


### Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are taken as column names.

#### Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.

In [38]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
df

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [41]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two row labels, with column labels that match the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b','c'])

# Two row labels, with one column label (b1) that is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1','c'])
print(df1)
print(df2)

        a   b     c
first   1   2   NaN
second  5  10  20.0
        a  b1     c
first   1 NaN   NaN
second  5 NaN  20.0


### Note − 
Observe that df2 is created with a column label (b1) that is not a dictionary key, so that column is filled with NaN. df1, by contrast, is created with column labels that match the dictionary keys, so NaN appears only where a dictionary lacks a key (c in the first row).

### Column Selection
Selecting a single column from a DataFrame by its label returns that column as a Series.

In [42]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [43]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

### Column Addition
A new column can be added to an existing DataFrame by assigning to a new column label.

In [46]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
# Add a new column to an existing DataFrame by assigning a Series to a new column label

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)


Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


### Column Deletion
Columns can be deleted with del or removed with pop, as the following example shows.


In [47]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN


In [49]:
# Using the del statement
print ("Deleting the first column using the del statement:")
del df['one']
print (df)

Deleting the first column using the del statement:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN


In [50]:
# Using the pop method
print ("Deleting another column using the pop method:")
df.pop('two')
print (df)

Deleting another column using the pop method:
   three
a   10.0
b   20.0
c   30.0
d    NaN


### Row Selection, Addition, and Deletion
The following examples cover row selection, addition, and deletion, beginning with selection.

### Selection by Label
Rows can be selected by passing a row label to the loc indexer.

In [55]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)
print (df.loc['b'])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
one    2.0
two    2.0
Name: b, dtype: float64


### Selection by integer location
Rows can be selected by passing an integer position to the iloc indexer.



In [59]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df.iloc[3])

one    NaN
two    4.0
Name: d, dtype: float64


### Slice Rows
Multiple rows can be selected using the : (slice) operator.

In [60]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)
print (df[2:4])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
   one  two
c  3.0    3
d  NaN    4
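
Slicing by position and slicing by label treat the stop value differently, which is worth seeing side by side. A minimal sketch, using a hypothetical single-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_position = df.iloc[1:3]    # positions 1 and 2; the stop position is excluded
by_label = df.loc["b":"d"]    # labels b through d; the stop label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```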


### Addition of Rows
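Add new rows to a DataFrame by concatenating it with another DataFrame using pd.concat, which places the new rows at the end. (Older tutorials use DataFrame.append for this; that method was deprecated in pandas 1.4 and removed in pandas 2.0.)

In [61]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
df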

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
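
The combined result above keeps the original row labels, so 0 and 1 each appear twice. If unique labels are preferred, one option is to pass `ignore_index=True` to `pd.concat`. A minimal sketch; the name `combined` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# ignore_index=True discards the original row labels and renumbers from 0.
combined = pd.concat([df, df2], ignore_index=True)

print(combined)
```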


### Deletion of Rows

Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows are dropped.

In the example above, the row labels are duplicated. Let us drop a label and see how many rows are dropped.

In [63]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])

# Drop rows with label 0.
# Two rows are dropped because both carry the label 0.
df = df.drop(0)

print (df)

   a  b
1  3  4
1  7  8


# Python Pandas - Basic Functionality
By now, we have learnt about the three Pandas data structures and how to create them. We will mainly focus on DataFrame objects because of their importance in real-world data processing, and also discuss a few of the other data structures.

In [7]:
# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print (s)

0    1.383566
1    0.628647
2    0.117377
3    1.659509
dtype: float64


### empty
Returns a Boolean value indicating whether the object is empty. True indicates that the object is empty.

In [8]:
s = pd.Series(np.random.randn(4))
print ("Is the Object empty?")
print (s.empty)

Is the Object empty?
False


### ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.

In [10]:
s = pd.Series(np.random.randn(4))
print (s)

print ("The dimensions of the object:")
s.ndim

0   -1.642980
1   -1.099684
2   -1.083159
3    0.002185
dtype: float64
The dimensions of the object:


1

### size
Returns the number of elements in the Series.

In [11]:
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print (s)
print ("The size of the object:")
print (s.size)

0    1.029386
1    1.248727
2    1.569909
3    1.803823
dtype: float64
The size of the object:
4


### Head & Tail
To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.

head() returns the first n rows (observe the index values). By default it returns five elements, but you may pass a custom number.

In [15]:
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)
print('------------------------')
print ("The first two rows of the data series:")
print (s.head(2))

The original series is:
0   -0.735031
1   -1.130519
2   -0.044270
3   -0.294953
dtype: float64
------------------------
The first two rows of the data series:
0   -0.735031
1   -1.130519
dtype: float64


tail() returns the last n rows (observe the index values). By default it returns five elements, but you may pass a custom number.

In [16]:
print ("The last two rows of the data series:")
print (s.tail(2))

The last two rows of the data series:
2   -0.044270
3   -0.294953
dtype: float64


# DataFrame Basic Functionality
Let us now look at the basic functionality of DataFrame. The following sections list the important attributes and methods.

In [17]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our DataFrame is:")
print (df)

Our DataFrame is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80


### T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will interchange.

In [20]:
print(df.T)

           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8


### axes
Returns the list of row axis labels and column axis labels.

In [22]:
print ("Row axis labels and column axis labels are:")
df.axes

Row axis labels and column axis labels are:


[RangeIndex(start=0, stop=7, step=1),
 Index(['Name', 'Age', 'Rating'], dtype='object')]

### dtypes
Returns the data type of each column.



In [23]:
print ("The data types of each column are:")
df.dtypes

The data types of each column are:


Name       object
Age         int64
Rating    float64
dtype: object

### empty

Returns a Boolean value indicating whether the object is empty; True indicates that the object is empty.

In [25]:
print ("Is the object empty?")
df.empty

Is the object empty?


False

### ndim
Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.



In [31]:
print ("The dimension of the object is:")
df.ndim

The dimension of the object is:


2

### shape
Returns a tuple representing the dimensionality of the DataFrame: a tuple (a, b), where a is the number of rows and b is the number of columns.



In [32]:
print ("The shape of the object is:")
df.shape

The shape of the object is:


(7, 3)

### size
Returns the number of elements in the DataFrame.

In [34]:
print ("The total number of elements in our object is:")
df.size

The total number of elements in our object is:


21
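
The attributes ndim, shape, and size are related: size equals the product of the entries in shape. A short sketch on a hypothetical two-row DataFrame shows the relationship:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "James"], "Age": [25, 26], "Rating": [4.23, 3.24]})

rows, cols = df.shape
print(df.ndim, df.shape, df.size)
print(df.size == rows * cols)
```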

### values
Returns the actual data in the DataFrame as a NumPy ndarray.

In [35]:
df.values

array([['Tom', 25, 4.23],
       ['James', 26, 3.24],
       ['Ricky', 25, 3.98],
       ['Vin', 23, 2.56],
       ['Steve', 30, 3.2],
       ['Smith', 29, 4.6],
       ['Jack', 23, 3.8]], dtype=object)

### Head & Tail
To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows and tail() returns the last n rows (observe the index values). By default five rows are returned, but you may pass a custom number.

In [36]:
df.head()

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20


In [37]:
df.tail()

    Name  Age  Rating
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80


# Python Pandas - Descriptive Statistics

A large number of methods compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:

DataFrame − “index” (axis=0, default), “columns” (axis=1)
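
Specifying the axis by name rather than by integer can be sketched as follows. The DataFrame and its columns `x` and `y` are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_totals = df.sum(axis="index")     # same as axis=0: one result per column
row_totals = df.sum(axis="columns")   # same as axis=1: one result per row

print(col_totals.tolist())
print(row_totals.tolist())
```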

Let us create a DataFrame and use this object throughout this chapter for all the operations.

In [48]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}  


#Create a DataFrame
df = pd.DataFrame(d)
print (df)

      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65


### sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

In [49]:
# Each column is summed individually (strings are concatenated).
df.sum()

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object

In [50]:
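# axis=0 works down the rows (one result per column); axis=1 works across the columns (one result per row).
# numeric_only=True skips the string column Name; recent pandas versions may raise a TypeError without it.
df.sum(axis=1, numeric_only=True)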

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64

### mean()
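Returns the average value.

In [51]:
# numeric_only=True restricts the calculation to numeric columns; recent pandas versions may raise a TypeError on the Name column without it.
df.mean(numeric_only=True)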

Age       31.833333
Rating     3.743333
dtype: float64

### std()
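Returns the Bessel-corrected (sample) standard deviation of the numerical columns.

In [52]:
# numeric_only=True restricts the calculation to numeric columns; recent pandas versions may raise a TypeError on the Name column without it.
df.std(numeric_only=True)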

Age       9.232682
Rating    0.661628
dtype: float64
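
The correction applied by std() is controlled by its ddof argument. The following sketch compares the pandas default with the population standard deviation, using the Age values from this chapter; the variable names are chosen here for illustration:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

sample_std = ages.std()             # pandas default: ddof=1 (divide by n - 1)
population_std = ages.std(ddof=0)   # divide by n instead

# NumPy's default is ddof=0, so it matches the population value.
print(np.isclose(population_std, np.std(ages.to_numpy())))
print(sample_std > population_std)
```

The different defaults in pandas and NumPy are a common source of small discrepancies when results from the two libraries are compared.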

# Functions & Description
Let us now look at the functions available for descriptive statistics in Pandas. The following table lists the important functions −

| Sr.No. | Function  | Description                        |
|--------|-----------|------------------------------------|
| 1      | count()   | Number of non-null observations    |
| 2      | sum()     | Sum of values                      |
| 3      | mean()    | Mean of values                     |
| 4      | median()  | Median of values                   |
| 5      | mode()    | Mode of values                     |
| 6      | std()     | Standard deviation of the values   |
| 7      | min()     | Minimum value                      |
| 8      | max()     | Maximum value                      |
| 9      | abs()     | Absolute value                     |
| 10     | prod()    | Product of values                  |
| 11     | cumsum()  | Cumulative sum                     |
| 12     | cumprod() | Cumulative product                 |

### Note − 
Since DataFrame is a heterogeneous data structure, not every function works on all of its columns.

Functions like sum() and cumsum() work with both numeric and character (or string) data elements without any error. Though in practice character aggregations are rarely used, these functions do not throw any exception.

Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data because such operations cannot be performed.
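
One way to avoid such exceptions is to select the numeric columns first. The following is a sketch under that assumption, using a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "James", "Ricky"], "Age": [25, 26, 25]})

# Keep only the numeric columns before calling a numeric-only function.
numeric = df.select_dtypes(include="number")
running_product = numeric.cumprod()

print(running_product)
```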

### Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.

In [54]:
df.describe()

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


This function gives the count, mean, standard deviation, minimum, maximum, and quartiles (from which the IQR can be derived). By default it excludes the character columns and summarizes only the numeric columns. 'include' is the argument used to specify which columns should be considered for summarizing. It takes a list of values; by default, 'number'.
<ul>
<li>object − Summarizes String columns</li>
<li>number − Summarizes Numeric columns</li>
<li>all − Summarizes all columns together (Should not pass it as a list value)</li>
</ul>
Now, use the following statement in the program and check the output −



In [55]:
df.describe(include=['object'])

          Name
count       12
unique      12
top     Gasper
freq         1


In [58]:
df.describe(include='all')

          Name        Age     Rating
count       12  12.000000  12.000000
unique      12        NaN        NaN
top     Gasper        NaN        NaN
freq         1        NaN        NaN
mean       NaN  31.833333   3.743333
std        NaN   9.232682   0.661628
min        NaN  23.000000   2.560000
25%        NaN  25.000000   3.230000
50%        NaN  29.500000   3.790000
75%        NaN  35.500000   4.132500
max        NaN  51.000000   4.800000
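
The describe() output does not include the interquartile range directly, but it can be derived from the 25% and 75% rows. A minimal sketch using the Rating values from this chapter; the names `summary` and `iqr` are chosen for illustration:

```python
import pandas as pd

ratings = pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])

summary = ratings.describe()
iqr = summary["75%"] - summary["25%"]   # interquartile range

print(round(iqr, 4))
```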

