# 1) Pandas

Pandas is an open-source Python library that provides high-performance tools for data manipulation and analysis, built around its own data structures.

The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

## Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Grouping of data for aggregation and transformation.
- High-performance merging and joining of data.
- Time Series functionality.

## Pandas deals with the following three data structures
- Series
- DataFrame
- Panel (removed from current pandas releases, see section 4)

These data structures are built on top of NumPy arrays, which means they are fast.

## Dimension & Description
The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. 

For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

|Data Structure|Dimensions|Description|
|-|-|-|
|Series|1|1D labeled homogeneous array, size-immutable.|
|DataFrame|2|General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.|
|Panel|3|General 3D labeled, size-mutable array.|

Building and handling arrays of two or more dimensions is tedious, because the user has to keep the orientation of the data set in mind when writing functions. Pandas data structures reduce that mental effort.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
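
As a small illustration of that point, the sketch below holds the same numbers in a NumPy array and in a DataFrame. The row and column labels (alice, bob, math, physics) are invented for this example.

```python
import numpy as np
import pandas as pd

# Raw 2D array: the reader must remember that axis 0 runs over rows.
scores = np.array([[90, 80], [70, 60]])
column_means_np = scores.mean(axis=0)

# Same data with labels (hypothetical names chosen for this sketch).
df = pd.DataFrame(scores, index=["alice", "bob"], columns=["math", "physics"])
column_means_df = df.mean()                  # one mean per column, labeled by column name
alice_physics = df.loc["alice", "physics"]   # address a cell by row and column label

print(column_means_np)
print(column_means_df)
print(alice_physics)
```

With labels, the result of the mean carries the column names, so there is no need to recall which axis was reduced.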

# 2) Python Pandas - Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

A pandas Series can be created using the following constructor:

In [None]:
pandas.Series( data, index, dtype, copy)
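
A minimal sketch of how the first three constructor arguments fit together, with values and labels made up for the example. The copy argument is left at its default here.

```python
import pandas as pd

# data: the values; index: one label per value; dtype: the target element type.
s = pd.Series(data=[1, 2, 3], index=["x", "y", "z"], dtype="float64")

print(s)
print(s.index.tolist())
print(s.dtype)
```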

In [5]:
# The simplest Series is an empty one.
# Import the pandas library under the alias pd.
import pandas as pd
# Recent pandas versions give an empty Series dtype object by default, so pass the dtype explicitly.
s = pd.Series(dtype='float64')
print(s)

Series([], dtype: float64)



In [7]:
# Create a Series from ndarray
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0    a
1    b
2    c
3    d
dtype: object


In [9]:
# We passed the index values here. Now we can see the customized indexed values in the output.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [11]:
# Create a Series from dict
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [13]:
# Passing index
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
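
The NaN in the output above marks a label with no matching key. One possible way to detect and handle such entries, reusing the same data, is sketched below. The placeholder value -1.0 is an arbitrary choice for the example.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

missing = s.isna()        # boolean Series, True where the value is NaN
print(missing)
print(s.dropna())         # copy of s without the missing entries
print(s.fillna(-1.0))     # copy of s with NaN replaced by a placeholder
```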


In [15]:
# Create a Series from Scalar
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

0    5
1    5
2    5
3    5
dtype: int64


In [16]:
# Accessing Data from Series with Position
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element by position (plain s[0] on a labeled index is deprecated in recent pandas)
print(s.iloc[0])

1


In [18]:
# Retrieve the first three elements in the Series.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three elements
print(s[:3])

a    1
b    2
c    3
dtype: int64


In [20]:
# Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three elements
print(s[-3:])

c    3
d    4
e    5
dtype: int64


In [21]:
# Retrieve Data Using Label (Index)
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])

1


In [23]:
# Retrieve multiple elements using a list of index label values.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print(s[['a','c','d']])

a    1
c    3
d    4
dtype: int64
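
Position-based and label-based access can be spelled out with the iloc and loc indexers. The sketch below uses integer labels (the values 100 to 103 from the earlier example) to show why the distinction matters.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], index=[100, 101, 102, 103])

first_by_position = s.iloc[0]   # position 0, regardless of the labels
first_by_label = s.loc[100]     # the element labeled 100
print(first_by_position, first_by_label)

# With integer labels, plain brackets look up labels, so s[0] is not the first element.
try:
    s[0]
except KeyError as exc:
    print("KeyError:", exc)
```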


In [None]:
# If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# 'f' is not an index label, so this lookup raises KeyError
print(s['f'])
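
A sketch of two ways to deal with a missing label: catching the KeyError, or using the get method with a default. The default value 0 is an arbitrary choice for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

try:
    value = s['f']
except KeyError:
    value = None
    print("label 'f' is not in the index")

fallback = s.get('f', default=0)   # returns the default instead of raising
print(fallback)
print('f' in s.index)              # membership test on the index
```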

# 3) Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame:
- Columns can be of different types
- Size-mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns

A pandas DataFrame can be created using the following constructor:

In [None]:
pandas.DataFrame( data, index, columns, dtype, copy)

In [27]:
# Create an Empty DataFrame
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [29]:
# Create a DataFrame from Lists
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

   0
0  1
1  2
2  3
3  4
4  5


In [31]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [32]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
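# Cast only the numeric column: recent pandas versions reject dtype=float here because the names cannot be converted.
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})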
print(df)

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


## 3.1) Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, the index defaults to range(n), where n is the array length.

In [33]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [34]:
# Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
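
To illustrate the length rule stated at the start of this section, the sketch below deliberately breaks it in two ways and collects the resulting error messages. The shortened data is made up for the example.

```python
import pandas as pd

errors = []

# Columns of different lengths cannot form a rectangular table.
try:
    pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34]})
except ValueError as exc:
    errors.append(str(exc))

# An index with the wrong number of labels is rejected as well.
try:
    pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]}, index=['rank1'])
except ValueError as exc:
    errors.append(str(exc))

for message in errors:
    print(message)
```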


## 3.2) Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.

In [36]:
# The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

# Note: NaN (Not a Number) fills the positions where a key is missing.

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [37]:
# The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [38]:
# The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two column labels that both match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two column labels, one of which ('b1') matches no dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


## 3.3) Create a DataFrame from Dict of Series

A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of all the Series indexes passed.

Note: the Series for column 'one' has no label 'd', so the result holds NaN at that position.

Let us now look at column selection, addition, and deletion through examples.

In [39]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


## 3.4) Column Selection

We will understand this by selecting a column from the DataFrame.

In [40]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


## 3.5) Column Addition
We will understand this by adding a new column to an existing data frame.

In [41]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN
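
Bracket assignment always places the new column last. As a sketch of two alternatives, insert places a column at a chosen position and assign returns a new DataFrame instead of modifying the existing one. The column names zero and total are invented for the example.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# insert(loc, column, value) modifies df in place; loc=0 makes it the first column.
df.insert(0, 'zero', [0, 0, 0, 0])

# assign returns a new DataFrame and leaves df untouched.
df_with_sum = df.assign(total=df['one'] + df['two'])

print(df)
print(df_with_sum)
```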


## 3.6) Column Deletion
Columns can be deleted or popped; let us take an example to understand how.

In [42]:
# Using the previous DataFrame, we will delete a column
# with the del statement
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)

# using the del statement
print ("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
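
Both del and pop change the DataFrame in place. A sketch of a non-mutating alternative is drop with the columns argument, shown here next to pop to make the difference visible.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)

trimmed = df.drop(columns=['one', 'two'])   # new DataFrame; df itself is unchanged
print(trimmed)

popped = df.pop('three')                    # removes 'three' from df and returns it as a Series
print(popped)
print(df)
```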


## 3.7) Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. 

Let us begin with the concept of selection.

### 3.7.1) Selection by Label

Rows can be selected by passing a row label to the loc indexer.

In [43]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

one    2.0
two    2.0
Name: b, dtype: float64


### 3.7.2) Selection by integer location
Rows can be selected by passing an integer location to the iloc indexer.

In [44]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

one    3.0
two    3.0
Name: c, dtype: float64


### 3.7.3) Slice Rows
Multiple rows can be selected with slice notation, using the ':' operator.

In [45]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

   one  two
c  3.0    3
d  NaN    4
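
Slices can also be written with iloc and loc. A point worth checking in a sketch: a position slice excludes its stop, while a label slice includes it.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_position = df.iloc[1:3]    # positions 1 and 2; the stop position is excluded
by_label = df.loc['b':'c']    # labels 'b' through 'c'; the stop label is included

print(by_position)
print(by_label)
```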


### 3.7.4) Addition of Rows
Add new rows to a DataFrame with pd.concat, which places the rows of the second frame after those of the first. The older DataFrame.append method was removed in pandas 2.0.

In [46]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# concat keeps each frame's original index labels
df = pd.concat([df, df2])
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
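
The repeated labels 0 and 1 in the output above come from keeping each frame's original index. If fresh labels are preferred, one option is the ignore_index argument of pd.concat, sketched here on the same two frames.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```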


### 3.7.5) Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows are dropped.

In the example above, the labels are duplicated. Let us drop a label and see how many rows get dropped.

In [47]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# concat keeps the duplicated labels 0 and 1
df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)

   a  b
1  3  4
1  7  8
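
Because drop works on labels, it cannot remove just one of two rows that share a label. A possible workaround, sketched below, is to select rows by position with a boolean mask. Position 2 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df = pd.concat([pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b']),
                pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])])

# The labels are 0, 1, 0, 1. Keep every row except the one at position 2.
keep = np.arange(len(df)) != 2
result = df[keep]
print(result)
```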


# 4) Python Pandas - Xarray
Panel has been removed from pandas. Xarray is an alternative for labeled N-dimensional data and is left to explore later.


# 5) Python Pandas - Basic Functionality

## Series Basic Functionality
|Sr.No.|	Attribute or Method & Description|
|-|-|
|1|axes<br> Returns a list of the row axis labels|
|2|dtype<br> Returns the dtype of the object.|
|3|empty<br> Returns True if series is empty.|
|4|ndim<br> Returns the number of dimensions of the underlying data, by definition 1.|
|5|size<br> Returns the number of elements in the underlying data.|
|6|values<br> Returns the Series as ndarray.|
|7|head()<br> Returns the first n rows.|
|8|tail()<br> Returns the last n rows.|

In [9]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

0   -1.458714
1    0.475064
2    0.456089
3   -0.015047
dtype: float64


In [10]:
# axes
# Returns the list of the labels of the series.
s = pd.Series(np.random.randn(4))
print("The axes are:")
print(s.axes)

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


In [11]:
# empty
# Returns the Boolean value saying whether the Object is empty or not. True indicates that the object is empty.
s = pd.Series(np.random.randn(4))
print("Is the Object empty?")
print(s.empty)

Is the Object empty?
False


In [12]:
# ndim
# Returns the number of dimensions of the object.
# By definition, a Series is a 1D data structure, so it returns 1.
s = pd.Series(np.random.randn(4))
print ("The dimensions of the object:")
print(s.ndim)

The dimensions of the object:
1


In [13]:
# size
# Returns the size(length) of the series.
s = pd.Series(np.random.randn(2))
print("The size of the object:")
print(s.size)

The size of the object:
2


In [17]:
# values
# Returns the actual data in the series as an array.
s = pd.Series(np.random.randn(4))
print ("The actual data series is:")
print(s.values)

The actual data series is:
[ 2.12879776  1.5501582   0.34871063 -1.58515656]


In [20]:
# Head & Tail
# To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)
print ("The first two rows of the data series:")
print (s.head(2))
print ("The last two rows of the data series:")
print (s.tail(2))

The original series is:
0   -0.526445
1   -0.602228
2    0.022171
3   -1.246760
dtype: float64
The first two rows of the data series:
0   -0.526445
1   -0.602228
dtype: float64
The last two rows of the data series:
2    0.022171
3   -1.246760
dtype: float64
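
A short sketch of how the n argument of head() and tail() behaves, using a Series of the integers 0 to 9 so the results are easy to read.

```python
import pandas as pd

s = pd.Series(range(10))

print(s.head())      # no argument: the first 5 elements
print(s.tail(3))     # the last 3 elements
print(s.head(-2))    # negative n: everything except the last 2 elements
```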


## DataFrame Basic Functionality

|Sr.No.	| Attribute or Method & Description|
|-|-|
|1	|T <br> Transposes rows and columns.|
|2	|axes <br> Returns a list with the row axis labels and column axis labels as the only members.|
|3	|dtypes <br> Returns the dtypes in this object.|
|4	|empty <br> True if the DataFrame is entirely empty (no items), that is, if any of the axes has length 0.|
|5 |ndim <br> Number of axes / array dimensions.|
|6 |shape <br> Returns a tuple representing the dimensionality of the DataFrame.|
|7 |size <br> Number of elements in the NDFrame.|
|8 |values <br> Numpy representation of NDFrame.|
|9 |head() <br> Returns the first n rows.|
|10 |tail() <br> Returns last n rows.|

In [21]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our DataFrame is:")
print (df)

Our DataFrame is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
