## Python Pandas Introduction 

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the main features of Pandas and how to use them in practice.

## Python Pandas - Series
A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.
### pandas.Series
A pandas Series can be created using the following constructor:

pandas.Series(data, index, dtype, copy)


In [2]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)

0    a
1    b
2    c
3    d
dtype: object


### Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used as the index; recent pandas versions keep them in insertion order, while older versions sorted them. If an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels with no matching key get NaN.

In [3]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)

a    0.0
b    1.0
c    2.0
dtype: float64
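
The second case described above, passing an explicit index together with a dict, can be sketched as follows. The label order and the extra label 'd' are chosen here purely for illustration.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# Values are pulled out by label, in the order given by index.
# 'd' is not a key in data, so its value is NaN.
s_indexed = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s_indexed)
```

The resulting Series follows the order of the index argument rather than the order of the dict.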


### Create a Series from Scalar
If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [4]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)


0    5
1    5
2    5
3    5
dtype: int64


### Accessing Data from Series with Position
Data in the series can be accessed similar to that in an ndarray.

Retrieve the first three elements in the Series. A slice written as :n extracts the items before position n, while n: extracts all items from position n onwards. If two positions are given with a : between them, the items between them are returned, not including the stop position.

In [6]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three elements
print (s[:3])

a    1
b    2
c    3
dtype: int64
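
The other slice forms can be sketched in the same way. This example uses .iloc to make the positional intent explicit, which is a choice made here rather than something the example above requires.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first_three = s.iloc[:3]   # positions 0, 1, 2
last_three = s.iloc[-3:]   # from the third-to-last position to the end
middle = s.iloc[1:3]       # positions 1 and 2; the stop position is excluded

print(first_three.tolist(), last_three.tolist(), middle.tolist())
```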


## Python Pandas - DataFrame


A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

### Features of DataFrame
- Columns can be of different types
- Size is mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns

### Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.

In [8]:
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print (df)

Empty DataFrame
Columns: []
Index: []


In [11]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


### Column Selection, Addition, and Deletion

#### Column Selection

In [13]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df ['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


#### Column Addition 

In [14]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print (df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


#### Column Deletion

In [19]:
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)

# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)


Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
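
Both del and pop modify the DataFrame in place. As a minimal sketch with a small frame made up for illustration, the difference between pop, which also returns the removed column, and drop(columns=...), which leaves the original untouched, looks like this:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# pop removes 'two' from df in place and returns it as a Series
popped = df.pop('two')

# drop returns a new DataFrame; df itself keeps the 'three' column
trimmed = df.drop(columns=['three'])

print(popped.tolist())
print(list(df.columns), list(trimmed.columns))
```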


### Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. Let us begin with the concept of selection.

#### Selection by Label
Rows can be selected by passing a row label to the loc indexer.

In [20]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df.loc['b'])

one    2.0
two    2.0
Name: b, dtype: float64


#### Selection by integer location
Rows can be selected by passing an integer location to the iloc indexer.

In [21]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print( df.iloc[2])

one    3.0
two    3.0
Name: c, dtype: float64


#### Slice Rows
Multiple rows can be selected using the : operator.

In [22]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df[2:4])

   one  two
c  3.0    3
d  NaN    4


#### Addition of Rows
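Add new rows to a DataFrame using the pd.concat function, which places the rows of the second frame after those of the first. The older DataFrame.append method was deprecated in pandas 1.4 and removed in pandas 2.0.

In [26]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])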
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8




#### Deletion of Rows
Use index label to delete or drop rows from a DataFrame. If label is duplicated, then multiple rows will be dropped.



In [31]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print (df)

   a  b
1  3  4
1  7  8





## Python Pandas - Basic Functionality

#### axes
Returns a list containing the row axis labels (the index) of the Series.
#### empty
Returns a Boolean indicating whether the object is empty. True indicates that the object is empty.
#### ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.

#### size
Returns the number of elements in the Series.
#### values
Returns the actual data in the Series as an array.



In [32]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The axes are:")
print (s.axes)

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


In [33]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("Is the Object empty?")
print (s.empty)

Is the Object empty?
False


In [35]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print ("The dimensions of the object:")
print (s.ndim)

0    0.989530
1   -0.979176
2    1.116517
3   -0.338629
dtype: float64
The dimensions of the object:
1


In [36]:
import pandas as pd
import numpy as np

#Create a series with 2 random numbers
s = pd.Series(np.random.randn(2))
print(s)
print ("The size of the object:")
print( s.size)

0   -2.174941
1    1.876833
dtype: float64
The size of the object:
2


In [37]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print ("The actual data series is:")
print (s.values)


0    0.161757
1   -0.027505
2   -1.431016
3   -1.748477
dtype: float64
The actual data series is:
[ 0.16175749 -0.02750458 -1.43101648 -1.74847696]
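
The same attributes also exist on a DataFrame. The following sketch, with a small frame made up for illustration, shows how the values differ for a two-dimensional object.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

print(df.axes)          # a list with two entries: the row index and the column index
print(df.empty)         # False, because the frame holds data
print(df.ndim)          # 2
print(df.size)          # 6, i.e. rows * columns
print(df.values.shape)  # (3, 2)
```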


## Python Pandas - Descriptive Statistics
A large number of methods compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer.
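
As a minimal sketch of the axis argument and of the difference between an aggregation and a same-size result, using a small numeric frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = df.sum(axis=0)          # one value per column; same as axis='index'
row_totals = df.sum(axis='columns')  # one value per row; same as axis=1
running = df.cumsum()                # cumulative sum down each column, same shape as df

print(col_totals.tolist())
print(row_totals.tolist())
print(running)
```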



### sum()

Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

In [38]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.sum())

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object


### mean()

Returns the average of the values for the requested axis.

In [39]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
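# numeric_only=True skips the non-numeric Name column (required in pandas 2.0 and later)
print (df.mean(numeric_only=True))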

Age       31.833333
Rating     3.743333
dtype: float64




### Summarizing Data

The describe() function computes a summary of statistics pertaining to the DataFrame columns.



In [41]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe())


             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
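
By default describe() reports only the numeric columns, which is why Name is absent above. A sketch of including the text column as well, assuming include='all' suits the data at hand (the shortened frame below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Tom'],
    'Age': [25, 26, 25, 23],
})

# include='all' adds unique, top and freq rows for non-numeric columns
summary = df.describe(include='all')
print(summary)
```

Statistics that do not apply to a column, such as mean for Name, are reported as NaN.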


## Python Pandas - Iteration
The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dict-like convention of iterating over the keys of the object, which are its column labels.

### Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

In [42]:
import pandas as pd
import numpy as np
 
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })

for col in df:
   print (col)

A
x
y
C
D


### items()
Iterates over the columns as (key, value) pairs, with the column label as the key and the column values as a Series object. This method was formerly called iteritems(), which was removed in pandas 2.0.

In [43]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.items():
   print (key,value)

col1 0   -0.144780
1   -1.137252
2   -0.393201
3   -1.174424
Name: col1, dtype: float64
col2 0    0.629921
1   -1.064675
2    0.668694
3    0.057245
Name: col2, dtype: float64
col3 0    1.282125
1    0.557263
2    1.126450
3    1.067517
Name: col3, dtype: float64


### iterrows()
iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

In [44]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print (row_index,row)

0 col1    0.551029
col2   -0.985834
col3    0.618071
Name: 0, dtype: float64
1 col1    2.599006
col2   -0.252242
col3    1.258656
Name: 1, dtype: float64
2 col1   -0.011138
col2    0.522679
col3    1.641390
Name: 2, dtype: float64
3 col1    0.407409
col2   -1.555207
col3    1.101620
Name: 3, dtype: float64


### itertuples()
itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [45]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=-1.0400342914540386, col2=-0.3785273190765454, col3=-2.124693807546158)
Pandas(Index=1, col1=1.0535845207329122, col2=-0.17211623490409356, col3=0.452007221262738)
Pandas(Index=2, col1=-0.021424167147544984, col2=-0.0709675406625094, col3=-0.055019914409245675)
Pandas(Index=3, col1=-0.2938931814370565, col2=-2.1312326054115616, col3=0.14717690449656037)
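
The named tuples can be read by field name. The sketch below, on a small frame made up for illustration, also shows the index and name parameters of itertuples() for getting plain tuples.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]})

totals = []
for row in df.itertuples():
    # Fields are available by column name; row.Index holds the row label
    totals.append((row.Index, row.col1 + row.col2))
print(totals)

# index=False omits the row label, name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
print(plain)
```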


## Python Pandas - Sorting

There are two kinds of sorting available in Pandas:

1. By label

2. By actual value




### By Label
A DataFrame can be sorted by its labels using the sort_index() method, which accepts the axis and the sort order as arguments. By default, sorting is done on row labels in ascending order.

In [47]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index()
print (sorted_df)

       col2      col1
0  0.034127  0.249576
1 -1.120308 -0.294553
2 -1.527101 -0.138440
3 -0.077729  0.629723
4  0.671189  0.975154
5 -1.544846 -0.714451
6  1.317441  0.175971
7 -0.111541  0.873286
8 -0.877007 -0.414802
9 -1.496553 -1.149032


#### Order of Sorting
By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

In [49]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)


       col2      col1
9 -1.068057  0.694881
8  0.211590 -0.612372
7 -1.045698 -0.941659
6  1.615111  0.500149
5  0.602361 -0.201479
4 -1.224870  1.053665
3  1.029309 -0.836148
2 -0.068003  0.752866
1 -1.514852 -0.025829
0 -1.237865 -0.601210


#### Sort the Columns
By passing axis=1, the sorting is done on the column labels instead. The default, axis=0, sorts by row labels. Let us consider the following example to understand the same.

In [51]:
import pandas as pd
import numpy as np
 
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
 
sorted_df=unsorted_df.sort_index(axis=1)

print( sorted_df)

       col1      col2
1 -0.624134  0.311273
4 -0.612044  0.135751
6  0.854400  0.367433
2 -0.052941  1.511061
3  0.107984 -0.078167
5  0.101007 -1.225813
9  0.673503 -1.130567
8  1.454890 -0.015976
0  0.635424  0.590550
7  0.291630  0.060108


### By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument naming the DataFrame column (or list of columns) whose values determine the order.

In [53]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

print (sorted_df)

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
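
In the output above, the three rows with col1 equal to 1 are not ordered by col2. A sketch of breaking such ties by passing a list of columns to 'by', using the same data:

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# Sort by col1 first; ties in col1 are ordered by col2
sorted_df = unsorted_df.sort_values(by=['col1', 'col2'])
print(sorted_df)

# ascending accepts one flag per column: col1 ascending, col2 descending
mixed = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(mixed)
```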


## Python Pandas - Working with Text Data

#### lower()


In [55]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print (s.str.lower())

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object


#### upper()

In [56]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print (s.str.upper())


0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object


#### len()


In [57]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print (s.str.len())

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64


#### strip()

In [59]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s)
print ("After Stripping:")
print (s.str.strip())

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After Stripping:
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object


#### split(pattern)

In [60]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print ("Split Pattern:")
print (s.str.split(' '))

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object


#### count(pattern)

In [61]:
import pandas as pd
 
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print ("The number of 'm's in each string:")
print( s.str.count('m'))


The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64


#### repeat(value)

In [62]:
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print (s.str.repeat(2))


0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object
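
The string methods return a new Series, so they can be chained. The following sketch combines strip() and lower() and adds str.contains(), a method not covered above, with na=False chosen here so that missing values count as no match.

```python
import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', np.nan, 'Alber@t'])

# Each .str call returns a Series, so calls can be chained; NaN stays NaN
cleaned = s.str.strip().str.lower()
print(cleaned)

# regex=False treats '@' as a literal; na=False maps missing values to False
has_at = s.str.contains('@', regex=False, na=False)
print(has_at)
```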


## Python Pandas - Indexing and Selecting Data

The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. 

### .loc
The .loc indexer provides purely label-based indexing. When slicing with .loc, both the start and the stop labels are included. Integers are valid labels, but they refer to the label and not the position.

In [63]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print (df.loc[:,'A'])

a   -1.279543
b    0.678393
c   -0.145238
d    1.050149
e    0.537356
f    1.775951
g    0.965558
h   -2.277329
Name: A, dtype: float64
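
As a sketch of label-based slicing, the following example uses deterministic values (np.arange instead of random numbers, an assumption made here so the result is predictable) and selects a range of rows together with a list of columns.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  index=list('abcdefgh'), columns=['A', 'B', 'C', 'D'])

# Rows 'b', 'c' and 'd': with .loc the stop label 'd' is part of the result
subset = df.loc['b':'d', ['A', 'C']]
print(subset)

# A Boolean Series can also be passed to .loc to filter rows
large_a = df.loc[df['A'] > 20]
print(large_a)
```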


### .iloc
The .iloc indexer provides purely integer-position-based indexing. As in Python and NumPy, positions are 0-based, and the stop position of a slice is excluded.

In [64]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select the first four rows
print (df.iloc[:4])

          A         B         C         D
0 -0.052872  0.052454  0.713103  0.746756
1  0.014312  0.833862 -1.229130 -0.872991
2  0.835159  1.711911 -0.078685  0.190433
3 -1.091687 -0.018216 -0.584016 -1.548482
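
Positions can be given for rows and columns at the same time. A minimal sketch, again with deterministic values chosen for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=['A', 'B', 'C', 'D'])

block = df.iloc[1:3, 0:2]       # rows at positions 1-2, columns at positions 0-1
picked = df.iloc[[0, 7], [3]]   # lists of positions pick specific rows and columns
cell = df.iloc[2, 1]            # two integers return a single value

print(block)
print(picked)
print(cell)
```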


## Python Pandas - Statistical Functions

Statistical methods help in understanding and analyzing the behavior of data. We will now learn a few statistical functions that can be applied to Pandas objects.

### Percent Change
Both Series and DataFrame have the method pct_change(). It compares every element with its prior element and computes the fractional change, so 1.0 corresponds to a 100% increase.

In [86]:
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print (s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change())


0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
          0          1
0       NaN        NaN
1  0.251574  -3.957328
2 -1.886184  -0.834899
3  0.263831 -11.858683
4 -1.436333  -1.166894


### Covariance
Covariance is computed between two Series. The Series object has a method cov to compute the covariance with another Series. NA values are excluded automatically.

In [87]:
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))

0.570865906363806
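
To make the computation concrete, the sketch below works out the sample covariance by hand on two short Series made up for illustration and compares it with cov(), which by default normalizes by n - 1.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 5.0, 9.0])

# Sample covariance: sum of the products of deviations from the mean,
# divided by (n - 1)
manual = ((s1 - s1.mean()) * (s2 - s2.mean())).sum() / (len(s1) - 1)

print(manual)
print(s1.cov(s2))
```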


### Correlation
Correlation measures the relationship between any two arrays of values (Series). There are multiple methods to compute the correlation: pearson (the default) measures linear association, while spearman and kendall are rank-based.

In [89]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print( frame['a'].corr(frame['b']))
print (frame.corr())

0.1380369972602199
          a         b         c         d         e
a  1.000000  0.138037 -0.083512 -0.720666 -0.392037
b  0.138037  1.000000 -0.385885  0.308000 -0.354674
c -0.083512 -0.385885  1.000000 -0.028248  0.337160
d -0.720666  0.308000 -0.028248  1.000000  0.091474
e -0.392037 -0.354674  0.337160  0.091474  1.000000
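
The method is selected with the method parameter of corr(). As a sketch, the made-up data below is monotonic but not linear, so the rank-based spearman coefficient is 1 while the pearson coefficient is lower.

```python
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [1.0, 8.0, 27.0, 64.0, 125.0],   # y = x ** 3
})

pearson = frame.corr().loc['x', 'y']                     # default method
spearman = frame.corr(method='spearman').loc['x', 'y']   # compares ranks only

print(pearson)
print(spearman)
```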
