
# Advanced Programming in Python
   ------

## Table of Contents

1. [Introduction to Numpy](#section1)<br/>
     - 1.1. [Speed Comparison between Numpy and Python Lists](#section101)<br/> 
     - 1.2. [Importing the package](#section102)<br/>
     - 1.3. [Creating Numpy Arrays](#section103)<br/>
     - 1.4. [Checking the Attributes of Arrays](#section104)<br/>
     - 1.5. [Array Initialization](#section105)<br/>
     - 1.6. [Array Initialization using Random Numbers](#section106)<br/>
     - 1.7. [Numpy Indexing](#section107)<br/>
          - 1.7.1. [Array Slicing](#arr_slice)<br/>
          - 1.7.2. [Conditional Indexing in an Array](#conditionalarray)<br/>
     - 1.8. [Numpy Array Operations](#arroperation)<br>
          - 1.8.1. [Numpy Broadcasting](#section108)<br/>
          - 1.8.2. [Numpy Mathematical Functions](#section109)<br/>
          - 1.8.3. [Array Manipulation](#section110)<br>
               - 1.8.3.1. [Change in Structure or Shape of Array](#reshapearray)<br>
               - 1.8.3.2. [Merging and Splitting Array](#concatarray)<br><br>
     
2. [Pandas](#section2)<br>
     - 2.1. [Importing the Package](#section201)<br>
     - 2.2. [Series](#section202)<br>
          - 2.2.1 [Series Indexing](#section203)<br>
     - 2.3. [DataFrames](#section204)<br>
          - 2.3.1. [Loading Files Into DataFrames](#section205)<br>
          - 2.3.2. [DataFrames Attributes](#section206)<br>
          - 2.3.3. [Selection, Addition and Deletion](#section207)<br>
          - 2.3.4. [Indexing In DataFrames](#section208)<br>
          - 2.3.5. [Merging, Concatenating and Appending](#section212)<br>
          - 2.3.6. [Conditionals In DataFrames](#section209)<br>
          - 2.3.7. [MultiIndex DataFrame](#section210)<br>
          - 2.3.8. [GroupBy](#section211)<br>
          - 2.3.9. [Operations in DataFrame](#section213)<br>
     - 2.4. [Time Series In Pandas](#section214)





<a id=section1></a>

### 1. Introduction to Numpy
**Numpy** is a Python library for working with large, multi-dimensional arrays and matrices. It provides
a large collection of mathematical functions that operate on these arrays.

Let's briefly compare Numpy arrays with Python lists.<br>
Numpy is usually much faster than Python lists for numerical work, for several reasons.

- It was designed for __efficient data storage__. All the elements of a numpy array are stored __sequentially__ in memory with a fixed width for each value. A list, on the other hand, stores pointers to objects kept elsewhere in memory. _The number of separate reads the computer has to do is smaller for numpy_.

- Numpy arrays have a __uniform datatype__, unlike lists. With a list, the interpreter has to check the type of each element before operating on it. Numpy avoids this check entirely.

- Numpy has __optimized functions__ for many mathematical operations on arrays and matrices, which is why they are faster than the equivalent element-by-element operations on lists.
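
The storage points above can be checked directly. The following is a minimal sketch; the variable names are illustrative and the dtype is fixed explicitly so that the sizes do not depend on the platform.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)   # explicit dtype: 8 bytes per element

# Every element of the array occupies the same number of bytes, stored back to back.
print(arr.itemsize)                      # bytes per element
print(arr.nbytes)                        # total bytes of raw data = size * itemsize

# A list stores pointers; each int is a separate Python object with its own overhead.
print(sys.getsizeof(values[-1]))         # size of one int object (implementation dependent)

# A list may mix types, but NumPy converts all elements to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                       # float64
```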

<a id=section101></a>

### 1.1. Speed comparison between Numpy and Python Lists

In [39]:
import time
import numpy as np

size_of_vec = 1000000

def pure_python_version():                                                # This function will return the time for python calculation
    time_python = time.time()                                             # Start time before operation
    my_list1 = range(size_of_vec)                                         # Creating a range of 1000000 values (lazy in Python 3, not a list)
    my_list2 = range(size_of_vec)
    sum_list = [my_list1[i] + my_list2[i] for i in range(len(my_list1))]  # Calculating the sum
    return time.time() - time_python                                      # Return Current time - start time

def numpy_version():                                                      # This function will return the time for numpy calculation
    time_numpy = time.time()                                              # Start time before operation
    my_arr1 = np.arange(size_of_vec)                                      # Creating a numpy array of 1000000 values
    my_arr2 = np.arange(size_of_vec)
    sum_array = my_arr1 + my_arr2                                         # Calculate the sum
    return time.time() - time_numpy                                       # Return current time - start time


python_time = pure_python_version()                                       # Time taken for Python expression
numpy_time = numpy_version()                                              # Time taken for numpy operation
print("Pure Python version {:0.4f}".format(python_time))
print("Numpy version {:0.4f}".format(numpy_time))
print("Numpy is in this example {:0.4f} times faster!".format(python_time/numpy_time))

Pure Python version 0.4478
Numpy version 0.0070
Numpy is in this example 64.1312 times faster!


__Takeaways__<br>
- In this run Numpy was about 64 times faster than the pure Python version. The exact factor depends on the machine, the array size and the operation.<br/>
- It is also more convenient when handling large datasets at once.
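
A single `time.time()` measurement can be noisy. As a hedged alternative, the sketch below uses the standard library `timeit` module to repeat each operation and keeps data creation outside the timed code. The function names and the array size are assumptions chosen for illustration.

```python
import timeit
import numpy as np

n = 100_000
list_a, list_b = list(range(n)), list(range(n))
arr_a, arr_b = np.arange(n), np.arange(n)

def add_lists():
    # Element-by-element addition in pure Python
    return [x + y for x, y in zip(list_a, list_b)]

def add_arrays():
    # Vectorized addition in NumPy
    return arr_a + arr_b

# Both approaches must produce the same values before their speed is compared.
same = add_lists() == add_arrays().tolist()

# timeit calls each function 20 times and returns the total time in seconds.
list_time = timeit.timeit(add_lists, number=20)
array_time = timeit.timeit(add_arrays, number=20)
print("List comprehension: {:0.4f} s".format(list_time))
print("Numpy addition:     {:0.4f} s".format(array_time))
```

The measured ratio will differ between machines, so no fixed speed-up is claimed here.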

<a id=section102></a>  

### 1.2. Importing the package

Import the numpy module with the alias np, so that we don't have to type the longer name repeatedly.

In [48]:
import numpy as np                                     

__Takeaways__<br>
__import numpy as np__ creates an _alias_ for the module, so you can write np instead of numpy every time you call it.

 <a id=section103></a>

### 1.3. Creating numpy arrays
The key feature of numpy is its __N-dimensional array__, or __ndarray__, which has the following properties:<br/>
- It is a fast and flexible container for large datasets in Python.
- It lets you carry out mathematical operations on __whole blocks of data__ at once.
- It is a generic multi-dimensional container for data, where each element is of the __same type__. Below we will see some examples of ndarrays.
- Below is an image showing 1D, 2D and 3D arrays.<br/>
![image.png](attachment:image.png)

In [40]:
my_list = [1,2,3,4,5]                                # This is an example of python list
my_list

[1, 2, 3, 4, 5]

In [41]:
type(my_list)                                        # Type function is used to know the type of Python objects

list

In [42]:
arr = np.array(my_list)                              # This is a one dimensional array
arr
print(arr)

[1 2 3 4 5]


In [43]:
type(arr)                                                    

numpy.ndarray

In [44]:
my_mat = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]] # Creating a 2D list or list of lists
print(my_mat)

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]


In [45]:
mat = np.array(my_mat)                               # This is a 2 dimensional array
mat     

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [46]:
print(mat)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [47]:
type(mat)

numpy.ndarray

__Takeaways__<br>
- Numpy arrays support indexing and iteration much like Python lists, but they differ in the operations they offer and in how fast those operations run. <br>
- The important difference when working with Numpy arrays is that they are more __convenient and fast__ for numerical work.
- ndarrays are widely used for __handling images__, which are stored as large matrices of numbers. Image operations are easier and faster on numpy arrays than on Python lists.
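
The examples above cover 1D and 2D arrays. As a small sketch, a 3D array can be built the same way from a list of 2D lists, and the element type can be requested with the `dtype` parameter. The names `cube` and `as_float` are illustrative.

```python
import numpy as np

# Two 2x3 blocks stacked into one 3D array
cube = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[7, 8, 9], [10, 11, 12]]])
print(cube.ndim)     # 3
print(cube.shape)    # (2, 2, 3)

# dtype can be requested at creation time
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float)      # [1. 2. 3.]
```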

<a id=section104></a>

### 1.4. Checking the attributes of array
Once you know how to create an array, you will want to know the __shape, dimensionality and datatype__ of its elements.<br>
Numpy arrays expose these as attributes.

In [58]:
arr.shape                                               # This gives the shape of the array

(5,)

In [59]:
mat.shape                                               # This gives the number of rows and columns of the array.

(4, 3)

**The shape attribute gives the size of the array along each dimension**

In [60]:
arr.ndim                                                # This is a 1 dimensional array

1

In [61]:
mat.ndim                                                # This is a 2 dimensional array or matrix

2

__The ndim attribute gives the number of dimensions__

In [62]:
arr.dtype                                             # The datatype of the elements (int32 here; the default integer type depends on the platform)

dtype('int32')

In [63]:
mat.dtype

dtype('int32')

__The dtype attribute describes the layout of each element: its type, size in bytes and byte order__

__Takeaways__<br>
We have seen how easy it is to use the _shape, ndim and dtype_ attributes to check the _shape, number of dimensions and datatype_ of an array.
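
A few related attributes are worth knowing as well. The sketch below rebuilds the 4 x 3 matrix with an explicit `dtype` (an assumption made so the byte counts are the same on every platform) and converts it with `astype`.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=np.int32)

print(mat.size)        # total number of elements: 12
print(mat.itemsize)    # bytes per element: 4 for int32
print(mat.nbytes)      # 48 bytes of raw data

mat_float = mat.astype(np.float64)   # astype returns a converted copy
print(mat_float.dtype)               # float64
print(mat.dtype)                     # int32, the original is unchanged
```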

<a id=section105></a>

### 1.5. Array Initialization

Here we will learn to generate _arrays_ with a chosen __step size__ and arrays filled with preset values.

In [64]:
np.arange(0,10)     

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Generate a numpy array with the numbers from 0 up to, but not including, 10. No step size is given, so the default step size of 1 is used.

In [65]:
np.arange(0,10,2)     

array([0, 2, 4, 6, 8])

As you can see, a numpy array with the numbers from 0 up to (but not including) 10 in steps of 2 is generated.<br/>
__NOTE:__ __arange(start, stop, step)__ takes 3 parameters:
 start and stop specify the range of the array (stop itself is excluded), while step defines the distance between two consecutive values.

In [66]:
np.zeros(4)     

array([0., 0., 0., 0.])

Generate a numpy array of 4 zeros

In [67]:
np.zeros((3,3))     

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

Generate a 3*3 null matrix

In [68]:
np.ones((3,3))     

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Generate a 3*3 matrix filled with ones

In [6]:
np.linspace(0,3,10)     

array([0.        , 0.33333333, 0.66666667, 1.        , 1.33333333,
       1.66666667, 2.        , 2.33333333, 2.66666667, 3.        ])

Generate 10 evenly spaced points from 0 to 3, with both endpoints included.<br>
**linspace** is often used to produce the x-values for **high resolution plots**

In [70]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Generate an identity matrix of size 3*3<br>
This is used in **singular value decomposition(SVD)**<br>

__Takeaways__<br>
- __numpy.linspace()__ returns __evenly spaced numbers__ over a specified interval<br>
- __numpy.eye__ is used to generate an __identity matrix__<br>
- __numpy.zeros__ generates a matrix with all elements __0__.<br>
So, as you can see, numpy has many built-in functions, each facilitating various mathematical operations on different types of data.
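
The difference between `arange` (stop excluded) and `linspace` (stop included by default) is easy to miss, so here is a short side-by-side sketch. `np.full` is shown as one more initializer; it does not appear in the examples above.

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)                   # stop value 1 is excluded
spaced = np.linspace(0, 1, 5)                     # stop value 1 is included
open_ended = np.linspace(0, 1, 4, endpoint=False) # endpoint=False excludes the stop value
sevens = np.full((2, 3), 7)                       # 2x3 array filled with the value 7

print(stepped)
print(spaced)
print(open_ended)
print(sevens)
```

For non-integer steps `linspace` is usually the safer choice, because the number of points is stated explicitly instead of being derived from floating point arithmetic.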

<a id=section106></a>

### 1.6. Array Initialization using Random Numbers

Here we are going to see how to populate an array with random numbers.<br>
Generate an array of 5 random numbers drawn uniformly from the interval [0, 1)

In [50]:
np.random.rand(5)    

array([0.30656669, 0.73203548, 0.7246578 , 0.96050271, 0.91170044])

Generate a 2D matrix where elements are random

In [49]:
np.random.rand(2,2)     

array([[0.35538364, 0.97302251],
       [0.2161639 , 0.82892921]])

Generate a 4*4 matrix whose elements are drawn from the standard normal distribution (mean 0, variance 1).

In [73]:
np.random.randn(4,4)     

array([[-0.07138326, -0.45044096, -1.05820819,  0.10041897],
       [-1.02867588,  0.91236814, -0.69569482, -0.16977258],
       [ 1.75264026, -0.11137428, -1.29100504,  0.92692509],
       [ 1.3948626 , -1.45848797,  0.89973743,  0.26236084]])

Generate a random array of 9 elements and convert it into 3*3 matrix using reshape

In [74]:
np.random.rand(9).reshape(3,3)   

array([[0.97910912, 0.21576191, 0.95511979],
       [0.74788062, 0.02560652, 0.54124288],
       [0.62897362, 0.80066166, 0.35138283]])

**Takeaways**<br>
- In various applications (like __assigning weights in Artificial Neural Networks__) arrays need to be initialised randomly.
- For this purpose there are various predefined functions in Numpy, and we have just seen how to use the rand, randn and reshape functions.
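
Random results change on every run, which makes them hard to reproduce. One possible approach, sketched below, is NumPy's `Generator` interface created with `np.random.default_rng`; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # seeded generator for reproducible results

uniform = rng.random(5)                       # uniform on [0, 1), comparable to np.random.rand(5)
normal = rng.standard_normal((4, 4))          # standard normal, comparable to np.random.randn(4, 4)
dice = rng.integers(1, 7, size=10)            # integers from 1 to 6 (upper bound excluded)

# Re-creating the generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(uniform, rng_again.random(5)))   # True
```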

<a id=section107></a>

### 1.7. Numpy Indexing
Now that we have learned how to create arrays in Numpy, let's see how they are indexed and how we can access their elements.

In [52]:
my_arr = np.arange(0,11)                                 # It will return an array from 0 to 10
my_arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

**Accessing one element**

In [76]:
my_arr[10]                                               # It will return element at index 10

10

__Takeaways__<br>
We just saw that we can access _individual elements_ of an array by calling them by their _indices_.

<a id='arr_slice'></a>

### 1.7.1. Array Slicing
To access __more than one element__ of the array use slicing.

**Accessing a list of elements**

In [77]:
my_arr[1:5]                                             # It will return the elements from index 1 up to, but not including, index 5

array([1, 2, 3, 4])

In [78]:
my_arr[8:]                                              # It will return all elements from index 8 and beyond

array([ 8,  9, 10])

In [79]:
my_arr[:6]                                              # It will return all elements from index 0 up to and including index 5

array([0, 1, 2, 3, 4, 5])

Numpy arrays are __mutable__. You can change the values of the array in place, for example by assigning to a slice.

In [53]:
my_arr[0:5] = -5                                         
my_arr

array([-5, -5, -5, -5, -5,  5,  6,  7,  8,  9, 10])

Let us create a 2 dimensional array

In [54]:
arr_2d = np.array([[0,1,2],[3,4,5],[6,7,8],[9,10,11]])    

In [55]:
arr_2d

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

**Accessing a sub-array from the entire array**

In [83]:
arr_2d[0][2]                                              # Element in row 0 and column 2

2

In [84]:
arr_2d[0:2,0:2]                                           # Elements in rows 0 and 1 and columns 0 and 1

array([[0, 1],
       [3, 4]])

In [85]:
arr_2d[:2,1:]                                             # Elements from rows 0 and 1 and columns 1 and 2

array([[1, 2],
       [4, 5]])

In the figure below you can see how multiple elements can be accessed from one dimensional and two dimensional arrays.

![image.png](attachment:image.png)

__Takeaways__<br>
- Through these examples we have seen that we can access __one or multiple elements__ of an array using __indexing and slicing__.
- When needed, we can also extract _subarrays_ from an array.
- Note that in the first example, my_arr[ 1:5 ], the first element extracted is at index 1 and the last element extracted is at index 5-1=4.
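
One detail worth adding: a basic slice of a Numpy array is a view onto the same data, not a copy. The sketch below, with illustrative variable names, shows that writing to a slice changes the original array and that `copy()` avoids this.

```python
import numpy as np

original = np.arange(0, 11)

view = original[0:5]                 # a slice is a view onto the same data
view[:] = -5                         # writes through to `original`
print(original)                      # the first five values are now -5

independent = original[5:8].copy()   # copy() allocates separate storage
independent[:] = 0
print(original[5:8])                 # unchanged: [5 6 7]
```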

<a id=conditionalarray></a>

### 1.7.2. Conditional Indexing in an Array
Here we filter the elements of an array based on a condition. The steps are:
- First, create a __boolean array__ by applying a comparison operator to the array.<br/>
- Then pass this boolean array as the _index of the original array_ to return the filtered elements.

In [56]:
arr_cond = np.arange(0,11)

In [57]:
arr_cond>5                                                # This will return a boolean array

array([False, False, False, False, False, False,  True,  True,  True,
        True,  True])

In [58]:
bool_arr = arr_cond > 5                                   # Indexing with this boolean array returns the elements whose value is > 5
arr_cond[bool_arr]

array([ 6,  7,  8,  9, 10])

In [89]:
arr_cond[arr_cond > 5]                                   # This is the same thing without using another object.

array([ 6,  7,  8,  9, 10])

**Takeaways**<br>
Use conditional indexing to filter out unwanted values, such as __nulls or outliers__, from an array.<br/>
The table below lists the comparison operators available in Python. <br> 
![image.png](attachment:image.png)
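
Conditions can also be combined. A minimal sketch follows; the arrays `middle`, `capped` and `data` are illustrative. Note that Numpy uses `&`, `|` and `~` for element-wise logic, and that each condition needs its own parentheses.

```python
import numpy as np

arr_cond = np.arange(0, 11)

# Combine conditions with & (and), | (or), ~ (not)
middle = arr_cond[(arr_cond > 2) & (arr_cond < 8)]
print(middle)                                  # [3 4 5 6 7]

# np.where keeps the array length and picks a value per element
capped = np.where(arr_cond > 5, 5, arr_cond)   # values above 5 are replaced by 5
print(capped)

# Missing values stored as NaN can be filtered with np.isnan
data = np.array([1.0, np.nan, 3.0])
clean = data[~np.isnan(data)]
print(clean)                                   # [1. 3.]
```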

<a id=arroperation></a>

### 1.8. Numpy Array Operations
Here we will see how to perform different 
_arithmetic operations_ on Numpy arrays.

<a id=section108></a>

### 1.8.1. Numpy Broadcasting
Broadcasting describes how Numpy treats arrays of different shapes during arithmetic operations.
The term is used because the smaller array is stretched, or *broadcast*, across the larger array so that they have compatible shapes.

**Broadcasting Rules:**
Starting from the last axis and working backwards, Numpy compares the array dimensions:
* If the two dimensions are __equal__, you can __continue__
* If one of the __dimensions is 1__, __stretch__ it to match the other one
* When one of the __shapes runs out of dimensions__ (because it has fewer dimensions than the other shape), Numpy will __assign 1__ in the comparison process until the other shape's dimensions run out as well.
* If the two dimensions __differ and neither is 1__, the arrays are incompatible and a ValueError is raised.


In [59]:
arr_2d = np.array([[1,2,3],[5,6,7],[8,9,10],[12,13,14]])    # Creating numpy array
arr_2d

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 8,  9, 10],
       [12, 13, 14]])

In [91]:
arr_2d.shape

(4, 3)

In [2]:
scaler = 3                                                  # a scalar value
arr_1d = np.array([10,10,10])                               # array with a different shape
arr_1d                                                        

array([10, 10, 10])

In [1]:
arr_1d.shape

(3,)

In [94]:
arr_2d + scaler                                             # operation with a scalar

array([[ 4,  5,  6],
       [ 8,  9, 10],
       [11, 12, 13],
       [15, 16, 17]])

arr_2d --> 4 * 3

scaler --> 1 * 1<br>
Here the scalar is stretched and added to all elements of the array. 

In [95]:
arr_2d + arr_1d                                              # operation with an array of a different shape

array([[11, 12, 13],
       [15, 16, 17],
       [18, 19, 20],
       [22, 23, 24]])

arr_2d --> 4 * 3

arr_1d --> 1 * 3<br>
In this case the array with fewer dimensions, arr_1d, is stretched so that it matches the dimensions of arr_2d, <br/>
and then the addition is performed. Let's understand it better with a diagram. 

**Illustrated Example of Broadcasting**

<img src="https://github.com/insaid2018/Term-1/blob/master/Images/broadcasting.png?raw=true">

From the above image you can see how and when the data is stretched to carry out the mathematical operations. Here we have performed addition on arrays of varying dimensions. Always remember that the data is stretched according to the __broadcasting rules.__

**Now let's look at an array with a different shape.**

In [60]:
# arr --> 1 * 4
arr = np.array([[1,1,1,1]])
arr_2d + arr

ValueError: operands could not be broadcast together with shapes (4,3) (1,4) 

**If the dimensions are not compatible, Numpy raises a ValueError**

__Takeaways__<br>
Keep a note of the __shape or dimension of the arrays__ before applying any mathematical operation and make sure they satisfy the _Numpy broadcasting rules_, otherwise a ValueError will be raised.
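
Shapes can be checked before any arithmetic is attempted. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available, and shows one way to make a 4-element array compatible with the 4 x 3 array by reshaping it into a column. The name `column` is illustrative.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [5, 6, 7], [8, 9, 10], [12, 13, 14]])   # shape (4, 3)

# (4, 3) and (1, 4): the trailing dimensions 3 and 4 differ and neither is 1, so this fails.
try:
    np.broadcast_shapes((4, 3), (1, 4))
except ValueError as err:
    print("Incompatible:", err)

# (4, 3) and (4, 1): the trailing 1 is stretched to 3, so this works.
column = np.array([1, 1, 1, 1]).reshape(4, 1)
print(np.broadcast_shapes(arr_2d.shape, column.shape))   # (4, 3)

result = arr_2d + column                                 # 1 is added to every element
print(result)
```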

<a id=section109></a>

### 1.8.2. Numpy Mathematical Functions
Here you can see some commonly used built-in numpy functions for mathematical operations.
These functions are optimized and fast for large arrays.

In [62]:
arr = np.arange(1,11)  # Let's first create a numpy array
arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [98]:
arr.min()                                             # Minimum value of array

1

In [99]:
arr.max()                                             # Maximum of the array

10

**Min and Max** are used to calculate the **range of array**

In [100]:
arr.argmin()                                          # Index position of minimum of array

0

In [101]:
arr.argmax()                                          # Index position of maximum of array

9

**Argmax** can be used to find the predicted class from the output of a **softmax layer in Neural Networks**

In [63]:
np.sqrt(arr)                                         # To calculate square root of all elements in an array

array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798,
       2.44948974, 2.64575131, 2.82842712, 3.        , 3.16227766])

**sqrt** is used, for example, when calculating the **Root Mean Squared Error**

In [103]:
arr.mean()                                           # To calculate mean of all the values in an array

5.5

**Missing values** of a column can be replaced by the __mean__ of that column in some datasets

In [64]:
np.exp(arr)                                          # To calculate exponential value of each element in an array

array([2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 5.45981500e+01,
       1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
       8.10308393e+03, 2.20264658e+04])

The **Exponential** function is used in **Activation Functions of Neural Networks**, such as sigmoid and softmax<br>

__Takeaways__<br>
We have seen that Numpy has methods for various mathematical operations, such as:<br/>
- finding the _minimum and maximum_,
- calculating _average values_,
- finding the exponential of each element of an array.

There are more functions available to calculate various statistical parameters.
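
The uses mentioned above (RMSE, softmax, argmax) can be put together in a few lines. The following is only a sketch: the target values, predictions and scores are made-up numbers, and the `axis` argument is shown because it is needed as soon as the array has more than one dimension.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6]])
print(mat.sum(axis=0))    # column sums: [5 7 9]
print(mat.mean(axis=1))   # row means: [2. 5.]

# Root mean squared error between hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)

# Softmax of hypothetical scores; subtracting the maximum avoids overflow in exp
scores = np.array([1.0, 3.0, 2.0])
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()
print(probs.argmax())     # index of the most probable class: 1
```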

<a id=section110></a>

### 1.8.3. Array Manipulation
Here we will see some numpy functions that change the structure (shape) of an array.

<a id='reshapearray'></a>

### 1.8.3.1. Change in Structure or Shape of Array

**Reshape Function**

In [67]:
arr = np.arange(0,16)                                # Create a 1D array of 16 elements; reshape will change its dimensions
print(arr)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


In [68]:
arr_2D = arr.reshape(4,4) 
arr_2D

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

**Flatten Function**

In [9]:
arr_2D.flatten()                                      # Flatten is used to convert a 2D array to 1D array

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

This is used in **Convolutional Neural Networks** to flatten a 2D image into a 1D vector of pixel values

**Transpose Function**

In [71]:
arr_2D.transpose()                                    # Transpose is used to convert the rows into columns and vice-versa

array([[ 0,  4,  8, 12],
       [ 1,  5,  9, 13],
       [ 2,  6, 10, 14],
       [ 3,  7, 11, 15]])
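
Two further details, sketched here with illustrative names: `reshape` accepts -1 for one dimension and infers it from the array size, and `ravel` is an alternative to `flatten` that avoids copying when the memory layout allows it.

```python
import numpy as np

arr = np.arange(0, 16)

auto = arr.reshape(2, -1)        # -1 lets Numpy infer the missing dimension
print(auto.shape)                # (2, 8)

grid = arr.reshape(4, 4)
flat_copy = grid.flatten()       # flatten always returns a copy
flat_view = grid.ravel()         # ravel returns a view when the memory layout allows it

print(np.shares_memory(grid, flat_copy))          # False
print(np.shares_memory(grid, flat_view))          # True for this contiguous array
print(np.array_equal(grid.T, grid.transpose()))   # .T is shorthand for transpose()
```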

<a id='concatarray'></a>

### 1.8.3.2. Merging and Splitting Arrays

In [72]:
arr_x = np.array([[1,2,3,4],[5,6,7,8]])                # Let's create 2 arrays
arr_y = np.array([[21,22,23,24],[25,26,27,28]])

**Concatenate** is used to *join* 2 arrays either along rows or columns

In [73]:
np.concatenate((arr_x, arr_y), axis=1)                 # Join 2 arrays along columns

array([[ 1,  2,  3,  4, 21, 22, 23, 24],
       [ 5,  6,  7,  8, 25, 26, 27, 28]])

Set __axis = 1__ to merge arrays __columnwise__; set axis = 0 to merge them rowwise.

**This is used to add new features into a dataset**

In [74]:
arr_z = np.concatenate((arr_x, arr_y), axis=0)         # Join 2 arrays along rows
arr_z

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [21, 22, 23, 24],
       [25, 26, 27, 28]])

After joining arrays, let's look at ways of splitting them.

**Horizontal Split using hsplit**

In [77]:
np.hsplit(arr_z, 2)                                   # It will split the array into 2 equal halves along the columns

[array([[ 1,  2],
        [ 5,  6],
        [21, 22],
        [25, 26]]), array([[ 3,  4],
        [ 7,  8],
        [23, 24],
        [27, 28]])]

**Vertical Split using vsplit**

In [112]:
np.vsplit(arr_z, 2)                                   # It will split the array into 2 equal halves along the rows

[array([[1, 2, 3, 4],
        [5, 6, 7, 8]]), array([[21, 22, 23, 24],
        [25, 26, 27, 28]])]

Vertical Splits can be used in **cross-validation in Machine Learning** <br>

__Takeaways__<br>
Using the _concatenate_ function we can _merge_ arrays columnwise and rowwise. Arrays can also be split horizontally and vertically using _hsplit_ and _vsplit_.
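
Numpy also offers shorthand functions for the same operations. The sketch below shows `vstack` and `hstack`, splitting at explicit column positions, and `array_split` for sizes that do not divide evenly. The variable names are illustrative.

```python
import numpy as np

arr_x = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr_y = np.array([[21, 22, 23, 24], [25, 26, 27, 28]])

stacked_rows = np.vstack((arr_x, arr_y))     # same result as concatenate(..., axis=0)
stacked_cols = np.hstack((arr_x, arr_y))     # same result as concatenate(..., axis=1)

# Split at explicit column positions: column 0, columns 1-2, and column 3
left, middle, right = np.hsplit(stacked_rows, [1, 3])
print(left.shape, middle.shape, right.shape)  # (4, 1) (4, 2) (4, 1)

# array_split tolerates sections that do not divide evenly
parts = np.array_split(np.arange(10), 3)
print([len(p) for p in parts])                # [4, 3, 3]
```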

__Conclusion__<br>
Numpy is an open source add-on module for Python.<br>
- By using NumPy you can __speed up__ your workflow and _interface with other packages_ in the Python ecosystem that use NumPy under the hood.
- A growing number of scientific and mathematical Python packages use NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays.
- It provides common __mathematical and numerical routines__ as pre-compiled, fast functions. 
- It provides _basic routines_ for manipulating __large arrays and matrices__ of numeric data.

__Key Features__<br/>
- NumPy arrays have a __fixed size__ decided at the time of creation. _Changing the size of an ndarray will create a new array and delete the original._
- The elements in a NumPy array are all required to be of the __same data type__, and thus will be the same size in memory.
- NumPy arrays facilitate __advanced mathematical__ and other types of __operations__ on large amounts of data.

<a id=section2></a>

### 2. Pandas

**Pandas** is a library that provides a fast and efficient __DataFrame object__ for data manipulation. 
- It can read and write data in a variety of formats such as CSV and text files, Excel, JSON and SQL databases.<br>
- It provides __high performance DataFrame operations__ for numerical and time series data.<br>
- It is highly optimized for performance, and most of its critical code is written in Cython or C.
- **Pandas** is built on top of **numpy**.

<a id=section201></a>


### 2.1. Importing the package

Just as we imported the Numpy package, we will import pandas with the alias __pd__.

In [78]:
import pandas as pd

<a id=section202></a>

### 2.2. Series

A Series is a __1D labelled array__ capable of storing __any datatype__ (int, float, string).

- The axis labels are collectively called the index. They need not be unique, but they must be of a _hashable type_.
- An object is hashable if it has a hash value which never changes during its lifetime.
- Most of Python’s immutable built-in objects are hashable (a tuple, for instance, is hashable only if all of its elements are), while mutable containers (such as lists or dictionaries) are not.

Take a look at the below example of Pandas series. <br/>![image.png](attachment:image.png)

In [79]:
import numpy as np

countries = ['India','France','England']                                   # Creating two lists containing name of the country and its capital
capitals = ['New Delhi','Paris','London']                                      
arr = np.array(capitals)                                                   # Making an array with the list capitals
dicts = {'a':10, 'b':20, 'c':30, 'd':40, 'e':50, 'f':60, 'g':70, 'h':80}   # Defining a dictionary

**Creating a series from a list**

In [80]:
pd.Series(data = capitals)                                                 # Creating Pandas Series

0    New Delhi
1        Paris
2       London
dtype: object

__Setting custom labels for our series__

In [81]:
pd.Series(data = capitals, index = countries)                              # Modifying series by adding index

India      New Delhi
France         Paris
England       London
dtype: object

**Creating a Series from a Numpy array**

In [82]:
pd.Series(arr)                                                 

0    New Delhi
1        Paris
2       London
dtype: object

**Creating a Series from a Dictionary**

In [83]:
pd.Series(dicts)                                                

a    10
b    20
c    30
d    40
e    50
f    60
g    70
h    80
dtype: int64

__Takeaways__<br>
Pandas series can be created from various inputs. 
- Through the above examples we learnt how to create a pandas series from a __list, array and dictionary__.
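
A Series can also carry a name and an explicit dtype, and it can be created from a single scalar. The sketch below uses made-up scores for the three countries; the names `scores` and `flags` are illustrative.

```python
import pandas as pd

countries = ['India', 'France', 'England']

# Hypothetical scores; name and dtype are set at creation time
scores = pd.Series([10, 20, 30], index=countries, name='score', dtype='float64')
print(scores.name)                 # score
print(scores.index.tolist())       # ['India', 'France', 'England']

# A scalar is repeated for every label in the index
flags = pd.Series(0, index=['a', 'b', 'c'])
print(flags.tolist())              # [0, 0, 0]

# The values are available as a Numpy array
values = scores.to_numpy()
print(type(values).__name__)       # ndarray
```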

<a id=section203></a>

### 2.2.1. Series Indexing
Here you will see how the elements of the series are indexed and how we can access any element using its index.

In [84]:
my_series = pd.Series(dicts)    

In [85]:
my_series.shape                                                          # This Series has 8 elements

(8,)

**Accessing a single element using its position**

In [86]:
my_series.iloc[0]                                              # Positions start at 0; iloc selects by position

10

In [87]:
my_series[:3]                                                  # Accessing the first 3 elements in series

a    10
b    20
c    30
dtype: int64

**Accessing a group of elements using index**

In [88]:
my_series[3:7]                                                 # Accessing the elements at positions 3 to 6 (position 7 is excluded)

d    40
e    50
f    60
g    70
dtype: int64

In [89]:
my_series['a']                                                 # Accessing the data using labels

10

In [90]:
my_series[['a','c','e']]                                       # Accessing multiple elements in a series using labels

a    10
c    30
e    50
dtype: int64

In [91]:
my_series + my_series                                          # Performing element wise mathematical operations on series

a     20
b     40
c     60
d     80
e    100
f    120
g    140
h    160
dtype: int64

__Takeaways__<br>
Just as we accessed one or multiple elements of Numpy arrays, we can do the same here using __positions or labels__.<br/> In the last example we saw that _elementwise mathematical operations_ can be performed on series.
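
Pandas provides two explicit accessors that make the intent clear: `iloc` selects by position and `loc` selects by label. The sketch below also shows that arithmetic between two Series aligns on labels rather than positions. The Series `other` is a made-up example.

```python
import pandas as pd

my_series = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40})

print(my_series.iloc[0])          # by position: 10
print(my_series.loc['a'])         # by label: 10
print(my_series.iloc[1:3])        # positions 1 and 2 (end excluded)
print(my_series.loc['b':'c'])     # labels 'b' through 'c' (end included)

# Arithmetic aligns on labels; labels present in only one Series give NaN
other = pd.Series({'c': 1, 'd': 2, 'e': 3})
total = my_series + other
print(total)
```

Because of this alignment, adding two Series with different indexes never pairs up unrelated rows; the unmatched labels simply become NaN.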

<a id=section204></a>

### 2.3. DataFrames

DataFrames are __2D data structures__ holding data in a tabular format.

- Data is arranged in rows (labelled by the index) and columns, and each column can store a __different datatype__ such as int, string, float or boolean.

- They are highly _flexible_ and offer a lot of mathematical functions.

Shown below is a simple example to make things clearer.<br/>
![image.png](attachment:image.png)

In [92]:
df = pd.DataFrame()                                          # Creating an Empty DataFrame
print(df)

Empty DataFrame
Columns: []
Index: []


**Creating a DataFrame from a List**

In [93]:
fruits = ['Apple','Banana','Coconut','Dates']
fruits_df = pd.DataFrame(fruits, columns=['Fruit'])         # columns is used to set the column name
fruits_df

|   | Fruit |
|---|---|
| 0 | Apple |
| 1 | Banana |
| 2 | Coconut |
| 3 | Dates |


**Creating a DataFrame from Nested Lists**

In [94]:
people = [['Rick',60, 'O+'], ['Morty', 10, 'O+'], ['Summer', 45,'A-'], ['Beth',18,'B+']]
people_df = pd.DataFrame(people, columns=['Name','Age', 'Blood Group'])
people_df

|   | Name | Age | Blood Group |
|---|---|---|---|
| 0 | Rick | 60 | O+ |
| 1 | Morty | 10 | O+ |
| 2 | Summer | 45 | A- |
| 3 | Beth | 18 | B+ |


**Creating a DataFrame from a Dictionary**

In [95]:
people = {'Name':['Rick', 'Morty', 'Summer', 'Beth'], 'Age':[60,10,45,18], 'Blood Group':['O+','O+','A-','B+']}
people_df = pd.DataFrame(people)       
people_df

|   | Name | Age | Blood Group |
|---|---|---|---|
| 0 | Rick | 60 | O+ |
| 1 | Morty | 10 | O+ |
| 2 | Summer | 45 | A- |
| 3 | Beth | 18 | B+ |


You can observe that the dictionary keys have automatically become the column names<br> 

__Takeaways__<br>
From the above examples you got familiar with creating dataframes from various inputs like<br/>
- __lists__
- __nested lists__
- __dictionaries__<br/>

Also note that pandas dataframes are __mutable__ and potentially __heterogeneous tabular data structures__.
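
One more input format that is common in practice is a list of dictionaries, one per row. The sketch below reuses two of the names from the examples above; the missing key in the first record is deliberate, to show how pandas fills the gap.

```python
import pandas as pd

records = [
    {'Name': 'Rick', 'Age': 60},
    {'Name': 'Morty', 'Age': 10, 'Blood Group': 'O+'},   # the extra key becomes a column
]
records_df = pd.DataFrame(records)
print(records_df)
print(records_df.dtypes)          # one dtype per column
print(records_df.isna().sum())    # the missing 'Blood Group' for Rick is stored as NaN

# A custom row index can be supplied, as with Series
indexed_df = pd.DataFrame({'Age': [60, 10]}, index=['Rick', 'Morty'])
print(indexed_df.loc['Morty', 'Age'])   # 10
```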

<a id=section205></a>

### 2.3.1. Loading files into DataFrame

DataFrames can load data from many types of files such as CSV, JSON, Excel sheets and text files. Let's learn how to do this one by one, starting with:

- **Comma Separated Values or CSV**

In [96]:
csv_df = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/supermarkets.csv')                               # read_csv is used to read csv file
csv_df

|   | ID | Address | City | State | Country | Name | Employees |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3666 21st St | San Francisco | CA 94114 | USA | Madeira | 8 |
| 1 | 2 | 735 Dolores St | San Francisco | CA 94119 | USA | Bready Shop | 15 |
| 2 | 3 | 332 Hill St | San Francisco | California 94114 | USA | Super River | 25 |
| 3 | 4 | 3995 23rd St | San Francisco | CA 94114 | USA | Ben's Shop | 10 |
| 4 | 5 | 1056 Sanchez St | San Francisco | California | USA | Sanchez | 12 |
| 5 | 6 | 551 Alvarado St | San Francisco | CA 94114 | USA | Richvalley | 20 |


- **JSON**

In [97]:
json_df = pd.read_json('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/supermarkets.json')                             # read_json is used to read json file
json_df

|   | Address | City | Country | Employees | ID | Name | State |
|---|---|---|---|---|---|---|---|
| 0 | 3666 21st St | San Francisco | USA | 8 | 1 | Madeira | CA 94114 |
| 1 | 735 Dolores St | San Francisco | USA | 15 | 2 | Bready Shop | CA 94119 |
| 2 | 332 Hill St | San Francisco | USA | 25 | 3 | Super River | California 94114 |
| 3 | 3995 23rd St | San Francisco | USA | 10 | 4 | Ben's Shop | CA 94114 |
| 4 | 1056 Sanchez St | San Francisco | USA | 12 | 5 | Sanchez | California |
| 5 | 551 Alvarado St | San Francisco | USA | 20 | 6 | Richvalley | CA 94114 |


- **Excel Sheets**<br>
Here we use an additional parameter __sheet_name__, which selects the sheet to read (0 is the first sheet).

In [98]:
import pandas as pd
excel_df = pd.read_excel('https://github.com/insaid2018/Term-1/blob/master/Data/Casestudy/supermarkets.xlsx?raw=true', sheet_name=0)              # read_excel is used to read excel file
excel_df

Unnamed: 0,ID,Address,City,State,Country,Supermarket Name,Number of Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


- **Text file with semicolon-separated values**

In [99]:
txt_df = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/supermarkets-semi-colons.txt', sep=';')            # sep sets the character that separates the values
txt_df               

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


- **CSV file from the web**

In [135]:
web_df = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/supermarkets.csv')            # pass the url of the csv file as a string
web_df

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


__Takeaways__<br>
We have seen how to load data from various types of files:
- for __csv__ files use __read_csv__,
- for __json__ files use __read_json__,
- for __excel__ files use __read_excel__.<br/>
We also used the __sep__ argument of __read_csv__ to specify the __delimiter__ that separates the values in a text file.
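
The effect of __sep__ is easiest to see by reading the same text with and without it. The sketch below uses a small in-memory string (a made-up sample, not the real supermarkets file) so that it runs without a network connection.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for a semicolon-separated file.
raw_text = "ID;Name;Employees\n1;Madeira;8\n2;Bready Shop;15\n"

# With the default sep=',' every line is read as one single column.
wrong_df = pd.read_csv(io.StringIO(raw_text))
print(wrong_df.shape)

# Passing sep=';' splits each line into the three intended columns.
right_df = pd.read_csv(io.StringIO(raw_text), sep=';')
print(right_df.shape)
print(right_df)
```

If a loaded dataframe unexpectedly has only one column, a wrong delimiter is a likely cause.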

<a id=section206></a>

### 2.3.2. Attributes of a DataFrame
After loading your DataFrame you may be interested to know its __columns, shape, datatypes__.<br/> 
Pandas provides several attributes and methods to check these properties of a DataFrame.

**Checking the number of rows and columns**

In [136]:
people_df.shape                                                           # shape attribute gives the dimensions as (rows, columns)

(4, 3)

**Checking the datatypes of elements in each column**

In [137]:
print(people_df.dtypes)                                                   # dtypes attribute gives the datatype of each column

Name           object
Age             int64
Blood Group    object
dtype: object


**Checking the column names, number of records and datatype of records**

In [138]:
people_df.info()                                                            

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Name           4 non-null object
Age            4 non-null int64
Blood Group    4 non-null object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes


**Checking the number of records of each column**

In [139]:
people_df.count()                                       

Name           4
Age            4
Blood Group    4
dtype: int64

**Checking the index of the dataframe**

In [140]:
people_df.index                                         

RangeIndex(start=0, stop=4, step=1)

**Checking the list of all columns**

In [141]:
people_df.columns                                       

Index(['Name', 'Age', 'Blood Group'], dtype='object')

**Create a new df with Name set as Index**

In [142]:
new_people_df = people_df.set_index('Name')                
new_people_df

Unnamed: 0_level_0,Age,Blood Group
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Rick,60,O+
Morty,10,O+
Summer,45,A-
Beth,18,B+


__Takeaways__<br>
We have seen how to check various dataframe **attributes** such as the <br/>
 - number of rows and columns, 
 - datatypes of elements in each column, 
 - index, 
 - number of records in each column,
 
using the corresponding attribute or method for each task.
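
One detail worth seeing in code: `shape` is an attribute (no parentheses) and counts every row, while `count()` is a method and counts only the non-missing values in each column. The sketch below uses a made-up `sample_df` with one missing age to show the difference.

```python
import pandas as pd

# Hypothetical sample with one missing age (not part of the notebook's data).
sample_df = pd.DataFrame({'Name': ['Rick', 'Morty', 'Beth'],
                          'Age': [60, None, 18]})

n_rows, n_cols = sample_df.shape      # attribute: all rows and columns
non_null = sample_df.count()          # method: non-missing values per column

print(n_rows, n_cols)
print(non_null)
```
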

<a id=section207></a>

### 2.3.3. Selection, Addition and Deletion

Here we will see how to perform basic DataFrame operations like <br>
* Selecting a row or a column<br>
* Adding a column to an existing dataframe. Adding a row is covered later, in the section on appending.<br>
* Deleting a row or column


- __Selecting a specific column__

In [143]:
people_df['Blood Group']                            # Column Selection

0    O+
1    O+
2    A-
3    B+
Name: Blood Group, dtype: object

In [144]:
people_df['Score'] = [10,9,7,6]                          # Adding a new column
people_df

Unnamed: 0,Name,Age,Blood Group,Score
0,Rick,60,O+,10
1,Morty,10,O+,9
2,Summer,45,A-,7
3,Beth,18,B+,6


In [145]:
people_df['Sum'] = people_df['Score'] + people_df['Age'] # Addition of two columns. You can perform any math operation. 
people_df

Unnamed: 0,Name,Age,Blood Group,Score,Sum
0,Rick,60,O+,10,70
1,Morty,10,O+,9,19
2,Summer,45,A-,7,52
3,Beth,18,B+,6,24


These processes are generally used in __Feature Engineering__ where we combine 2 columns to create more meaningful features.

- **Column Deletion using drop**<br>
- Two of its parameters are used here, __axis__ and __inplace__:
    - __axis=1__ drops __columns__,
    - __axis=0__ drops __rows__ (the default), <br>
    - __inplace=True__ __modifies__ the df itself instead of returning a new one.<br>

In [146]:
people_df.drop('Sum', axis=1, inplace=True)             # Drop Sum and modify the dataframe
people_df

Unnamed: 0,Name,Age,Blood Group,Score
0,Rick,60,O+,10
1,Morty,10,O+,9
2,Summer,45,A-,7
3,Beth,18,B+,6


**Dropping a column is commonly used to separate the target (Y) labels from the features of a dataframe**

In [147]:
del people_df['Score']                                  # Column Deletion using del
people_df

In [148]:
new_people_df = people_df.set_index('Name')             # Creates a new df with Name set as Index    
new_people_df

Unnamed: 0_level_0,Age,Blood Group
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Rick,60,O+
Morty,10,O+
Summer,45,A-
Beth,18,B+


In [149]:
people_df.drop(['Blood Group'], axis=1)                 # Dropping 'Blood Group' column (returns a new df, people_df is unchanged)

Unnamed: 0,Name,Age
0,Rick,60
1,Morty,10
2,Summer,45
3,Beth,18


In [150]:
new_people_df.drop('Summer', axis=0)                    # Row Deletion using Row Index

Unnamed: 0_level_0,Age,Blood Group
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Rick,60,O+
Morty,10,O+
Beth,18,B+


__Takeaways__<br>
We have seen how to use the __drop__ function to delete rows and columns. To delete a _column_ set _axis=1_; to delete a _row_ set _axis=0_.<br/> In practice, rows and columns are usually deleted when the data has __too many null values__ or __outliers__.
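
A common source of confusion is that `drop` returns a new dataframe unless `inplace=True` is passed. The sketch below, on a made-up `scores_df`, shows this and also the `columns=` and `index=` keywords, which are an alternative to remembering which axis number is which.

```python
import pandas as pd

# Hypothetical small frame used only for this illustration.
scores_df = pd.DataFrame({'Name': ['Rick', 'Morty'],
                          'Age': [60, 10],
                          'Score': [10, 9]})

# drop returns a new DataFrame by default; scores_df is left untouched.
without_score = scores_df.drop('Score', axis=1)
print(list(scores_df.columns))
print(list(without_score.columns))

# columns= and index= say explicitly what is being dropped.
without_age = scores_df.drop(columns=['Age'])
without_first_row = scores_df.drop(index=[0])
print(without_first_row)
```
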

<a id=section208></a>

### 2.3.4. Indexing in DataFrame
There are 2 types of indexing in Pandas<br>
* By integer position (row or column number) using iloc
* By label (row label or column name) using loc

In [108]:
market_df = pd.read_csv('http://pythonhow.com/supermarkets.csv')
market_df

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


In [101]:
market_df = market_df.set_index('ID')                     # Use the ID column as index

Indexing using iloc

In [102]:
market_df.iloc[0:4,0:5]                                   # Rows at positions 0-3 and columns at positions 0-4 (the end of an iloc slice is excluded)

Unnamed: 0_level_0,Address,City,State,Country,Name
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,3666 21st St,San Francisco,CA 94114,USA,Madeira
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop
3,332 Hill St,San Francisco,California 94114,USA,Super River
4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop


In [104]:
market_df.iloc[:,0:5]                                     # All rows, columns at positions 0-4

Unnamed: 0_level_0,Address,City,State,Country,Name
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,3666 21st St,San Francisco,CA 94114,USA,Madeira
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop
3,332 Hill St,San Francisco,California 94114,USA,Super River
4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop
5,1056 Sanchez St,San Francisco,California,USA,Sanchez
6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley


In [155]:
market_df.iloc[0:4,:]                                     # Rows at positions 0-3, all columns

Unnamed: 0_level_0,Address,City,State,Country,Name,Employees
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
3,332 Hill St,San Francisco,California 94114,USA,Super River,25
4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10


All the above operations can also be done with loc, which selects by label. Unlike iloc, a loc slice includes its end label.

In [105]:
market_df.loc[1:4,"Address"]                              # Returns ID 1-4 and column Address

ID
1      3666 21st St
2    735 Dolores St
3       332 Hill St
4      3995 23rd St
Name: Address, dtype: object

In [106]:
market_df.loc[1:4,"Address":"Country"]                    # Returns ID 1-4 and columns Address to Country

Unnamed: 0_level_0,Address,City,State,Country
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,3666 21st St,San Francisco,CA 94114,USA
2,735 Dolores St,San Francisco,CA 94119,USA
3,332 Hill St,San Francisco,California 94114,USA
4,3995 23rd St,San Francisco,CA 94114,USA


In [107]:
market_df.loc[:,"State":]                                 # Return all the columns from State onwards and all the rows

Unnamed: 0_level_0,State,Country,Name,Employees
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,CA 94114,USA,Madeira,8
2,CA 94119,USA,Bready Shop,15
3,California 94114,USA,Super River,25
4,CA 94114,USA,Ben's Shop,10
5,California,USA,Sanchez,12
6,CA 94114,USA,Richvalley,20


__Takeaways__<br>
You can use these operations if you want to __separate features__ into _numerical and categorical columns_.
- As we have seen, use _iloc_ for indexing by _position_ and _loc_ for indexing by _label_.
- You can also perform __train-test-split__ on the dataframe.
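
The difference between the two slicing rules is easy to miss when the index labels are themselves integers, as with the ID index above. The sketch below applies the same slice `1:3` with both indexers on a made-up `shops_df` whose labels start at 1.

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1, like the ID index above.
shops_df = pd.DataFrame(
    {'Name': ['Madeira', 'Bready Shop', 'Super River', "Ben's Shop"],
     'Employees': [8, 15, 25, 10]},
    index=[1, 2, 3, 4])
shops_df.index.name = 'ID'

by_position = shops_df.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = shops_df.loc[1:3]       # labels 1, 2 and 3; the end label is included

print(by_position)
print(by_label)
```

The same slice therefore returns two rows with `iloc` and three rows with `loc`.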

<a id=section212></a>

### 2.3.5. Merging, Concatenating and Appending

In the previous section we saw how to add rows or columns.<br>
Here we will see how to merge two dataframes. 


In [159]:
df1 = pd.DataFrame({                                       # Let's create 2 dataframes
    'id':[1,2,3,4,5],
    'name':['a','b','c','d','e'],
    'sub':['sub1','sub2','sub3','sub4','sub5']
})
df2 = pd.DataFrame({
    'id':[1,2,3,4,5],
    'name':['b','c','d','e','f'],
    'sub':['sub3','sub4','sub5','sub6','sub7']
})

### Concatenating 2 DataFrames
This is used to join 2 dataframes along _rows or columns_

In [160]:
pd.concat([df1, df2], axis=0)                              # Joining two DataFrame along the rows

Unnamed: 0,id,name,sub
0,1,a,sub1
1,2,b,sub2
2,3,c,sub3
3,4,d,sub4
4,5,e,sub5
0,1,b,sub3
1,2,c,sub4
2,3,d,sub5
3,4,e,sub6
4,5,f,sub7


In [161]:
pd.concat([df1, df2], axis=1)                              # Joining 2 DataFrame along the columns

Unnamed: 0,id,name,sub,id.1,name.1,sub.1
0,1,a,sub1,1,b,sub3
1,2,b,sub2,2,c,sub4
2,3,c,sub3,3,d,sub5
3,4,d,sub4,4,e,sub6
4,5,e,sub5,5,f,sub7


These methods are widely used in **Feature Engineering.**

### Merging 2 DataFrames

This is used to join two dataframes based on any **column** as **key**.

In [162]:
pd.merge(left=df1, right=df2, on='sub')                    # Joining 2 DataFrame using 'sub' as key

Unnamed: 0,id_x,name_x,sub,id_y,name_y
0,3,c,sub3,1,b
1,4,d,sub4,2,c
2,5,e,sub5,3,d


In [163]:
pd.merge(left=df1, right=df2, on='id')   

Unnamed: 0,id,name_x,sub_x,name_y,sub_y
0,1,a,sub1,b,sub3
1,2,b,sub2,c,sub4
2,3,c,sub3,d,sub5
3,4,d,sub4,e,sub6
4,5,e,sub5,f,sub7


The picture below shows how two dataframes are merged.<br/>
![image.png](attachment:image.png)
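
By default `merge` performs an inner join, which is why only the keys present in both dataframes appeared in the results above. The sketch below, on two made-up frames `left_df` and `right_df`, contrasts this with an outer join using the `how` parameter.

```python
import pandas as pd

# Hypothetical frames that share only the key 'sub3'.
left_df = pd.DataFrame({'id': [1, 2, 3], 'sub': ['sub1', 'sub2', 'sub3']})
right_df = pd.DataFrame({'id': [4, 5, 6], 'sub': ['sub3', 'sub4', 'sub5']})

# Inner join (the default): only keys present in both frames survive.
inner = pd.merge(left_df, right_df, on='sub', how='inner')

# Outer join: every key from either frame is kept; gaps are filled with NaN.
outer = pd.merge(left_df, right_df, on='sub', how='outer')

print(inner)
print(outer)
```

`how` also accepts `'left'` and `'right'`, which keep all keys from one side only.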

### Appending a row on a DataFrame
Appending adds the rows of one DataFrame to the end of another.<br>

In [164]:
df3 = pd.DataFrame({'id':[10],                             # Let's represent the new row as a one-row DataFrame
                    'name':['z'], 
                    'sub':['sub10']})
df3

Unnamed: 0,id,name,sub
0,10,z,sub10


In [165]:
df1

Unnamed: 0,id,name,sub
0,1,a,sub1
1,2,b,sub2
2,3,c,sub3
3,4,d,sub4
4,5,e,sub5


In [166]:
df4 = df1
df5 = df3

pd.concat([df4, df5])                                      # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result

Unnamed: 0,id,name,sub
0,1,a,sub1
1,2,b,sub2
2,3,c,sub3
3,4,d,sub4
4,5,e,sub5
0,10,z,sub10


You can observe that the appended row kept its original index label 0, so the index now contains a duplicate. Let's fix this by renumbering the index.

In [167]:
df4 = df1
df5 = df3

pd.concat([df4, df5], ignore_index=True)                   # ignore_index=True renumbers the rows from 0

Unnamed: 0,id,name,sub
0,1,a,sub1
1,2,b,sub2
2,3,c,sub3
3,4,d,sub4
4,5,e,sub5
5,10,z,sub10


**These functions are useful when we want to create a DataFrame by combining 2 datasets**

__Takeaways__<br>
We learned how to _concatenate, merge and append_ dataframes using corresponding functions.

<a id=section209></a>

### 2.3.6. Conditionals in DataFrame
- Conditionals are used to perform **comparisons** on the records of a DataFrame.<br>
- The output of a comparison is of **boolean** datatype.<br>
- You can use this boolean output to filter records from the DataFrame.


In [168]:
market_df['Employees'] >=15                                                       # This returns a Boolean Series

ID
1    False
2     True
3     True
4    False
5    False
6     True
Name: Employees, dtype: bool

In [169]:
market_df[market_df['Employees'] >=15]                                            # This returns all the rows for which the condition is True

Unnamed: 0_level_0,Address,City,State,Country,Name,Employees
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
3,332 Hill St,San Francisco,California 94114,USA,Super River,25
6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


In [170]:
market_df[(market_df['Employees'] >=15) & (market_df['State'] != 'CA 94119')]     # It will return rows where both the conditions are satisfied.

Unnamed: 0_level_0,Address,City,State,Country,Name,Employees
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
3,332 Hill St,San Francisco,California 94114,USA,Super River,25
6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


__Takeaways__<br>
As you have seen, conditional operators help us access _selected rows_ of the dataframe that satisfy a condition. <br>They also help us get a clearer picture of the data by combining conditions on several features, which is why they are widely used in **Feature Engineering**.<br>
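
Besides `&` (and), boolean Series can be combined with `|` (or) and inverted with `~` (not), and `isin` tests membership in a list of values. The sketch below uses a made-up `stores_df`; each condition must be wrapped in parentheses or stored in a variable first, because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
stores_df = pd.DataFrame({'Name': ['Madeira', 'Bready Shop', 'Super River', 'Sanchez'],
                          'Employees': [8, 15, 25, 12]})

large = stores_df['Employees'] >= 15
named = stores_df['Name'].isin(['Madeira', 'Super River'])

either = stores_df[large | named]      # OR: at least one condition holds
both = stores_df[large & named]        # AND: both conditions hold
not_large = stores_df[~large]          # NOT: invert the boolean mask

print(either['Name'].tolist())
print(both['Name'].tolist())
print(not_large['Name'].tolist())
```
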

<a id=section210></a>

### 2.3.7. Multi Index DataFrame

Until now you have seen DataFrames with a single index. Let's see how we can use more than one index level to gain better insights.

In [109]:
Company=['Google','Google','Google','Microsoft','Microsoft','Microsoft']
Year = [2008,2009,2010,2008,2009,2010]
Revenue = [11,15,16,9,12,14]
Employee = [300,400,500,350,450,550]

In [111]:
list(zip(Revenue, Employee))                                                        # zip pairs up the values of Revenue and Employee, taking one from each list at a time

[(11, 300), (15, 400), (16, 500), (9, 350), (12, 450), (14, 550)]

In [112]:
list(zip(Company, Year))

[('Google', 2008),
 ('Google', 2009),
 ('Google', 2010),
 ('Microsoft', 2008),
 ('Microsoft', 2009),
 ('Microsoft', 2010)]

In [113]:
hier_index = list(zip(Company, Year))                                               # These pair values will be our 2 indices

In [114]:
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

MultiIndex(levels=[['Google', 'Microsoft'], [2008, 2009, 2010]],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [115]:
multi_index_df = pd.DataFrame(data = list(zip(Revenue, Employee)), index = hier_index, columns=['Revenue','Employee'])
multi_index_df

Unnamed: 0,Unnamed: 1,Revenue,Employee
Google,2008,11,300
Google,2009,15,400
Google,2010,16,500
Microsoft,2008,9,350
Microsoft,2009,12,450
Microsoft,2010,14,550


In [116]:
multi_index_df.index.names =['Company','Year']                                      # Rename our indices to Company and Year
multi_index_df 

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue,Employee
Company,Year,Unnamed: 2_level_1,Unnamed: 3_level_1
Google,2008,11,300
Google,2009,15,400
Google,2010,16,500
Microsoft,2008,9,350
Microsoft,2009,12,450
Microsoft,2010,14,550


In [178]:
multi_index_df.loc['Google']                                                        # Accessing data using our first index

Unnamed: 0_level_0,Revenue,Employee
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
2008,11,300
2009,15,400
2010,16,500


In [179]:
multi_index_df.loc['Google'].loc[2009]                                              # Accessing data using first index and then second index

Revenue      15
Employee    400
Name: 2009, dtype: int64

In [180]:
multi_index_df.xs('Microsoft', level='Company')                                     # Accessing data based on Index level (either from first or second index)

Unnamed: 0_level_0,Revenue,Employee
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
2008,9,350
2009,12,450
2010,14,550


In [181]:
multi_index_df.xs(2010, level='Year')                                               # Accessing data based on Index level

Unnamed: 0_level_0,Revenue,Employee
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
Google,16,500
Microsoft,14,550


Let's observe a multi-level DataFrame on another dataset

In [121]:
import seaborn as sns                                                               # Import seaborn library for dataset
tips = sns.load_dataset('tips')
tips.head()                                                                         # head() shows the top 5 records of the dataframe

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [183]:
tips.shape

(244, 7)

In [184]:
tips.index                                                                         # This dataset has a numeric index running from 0 to 243

RangeIndex(start=0, stop=244, step=1)

In [185]:
tip = tips.head(10)                                                                # Let's create a smaller dataset from the first 10 rows
tip

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2


In [186]:
tip.set_index(['sex'])                                                             # A categorical column is a common choice for an index

Unnamed: 0_level_0,total_bill,tip,smoker,day,time,size
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Female,16.99,1.01,No,Sun,Dinner,2
Male,10.34,1.66,No,Sun,Dinner,3
Male,21.01,3.5,No,Sun,Dinner,3
Male,23.68,3.31,No,Sun,Dinner,2
Female,24.59,3.61,No,Sun,Dinner,4
Male,25.29,4.71,No,Sun,Dinner,4
Male,8.77,2.0,No,Sun,Dinner,2
Male,26.88,3.12,No,Sun,Dinner,4
Male,15.04,1.96,No,Sun,Dinner,2
Male,14.78,3.23,No,Sun,Dinner,2


In [123]:
multi_index_tips = tips.set_index(['sex','size'])                                   # Let's set 'sex' and 'size' as our index

In [124]:
multi_index_tips.sort_index(inplace = True)                                        # Sort the rows by the index levels ('sex', then 'size') in ascending order
multi_index_tips


Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time
sex,size,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Male,1,8.58,1.92,Yes,Fri,Lunch
Male,2,23.68,3.31,No,Sun,Dinner
Male,2,8.77,2.00,No,Sun,Dinner
Male,2,15.04,1.96,No,Sun,Dinner
Male,2,14.78,3.23,No,Sun,Dinner
Male,2,10.27,1.71,No,Sun,Dinner
Male,2,15.42,1.57,No,Sun,Dinner
Male,2,21.58,3.92,No,Sun,Dinner
Male,2,17.92,4.08,No,Sat,Dinner
Male,2,19.82,3.18,No,Sat,Dinner


In [189]:
multi_index_tips.loc['Male']                                                       # Accessing records if index is Male

Unnamed: 0_level_0,total_bill,tip,smoker,day,time
size,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,23.68,3.31,No,Sun,Dinner
2,8.77,2.0,No,Sun,Dinner
2,15.04,1.96,No,Sun,Dinner
2,14.78,3.23,No,Sun,Dinner
3,10.34,1.66,No,Sun,Dinner
3,21.01,3.5,No,Sun,Dinner
4,25.29,4.71,No,Sun,Dinner
4,26.88,3.12,No,Sun,Dinner


In [190]:
multi_index_tips.xs(2, level='size')                                               # Accessing records if serving size is 2 (From second index)

Unnamed: 0_level_0,total_bill,tip,smoker,day,time
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Male,23.68,3.31,No,Sun,Dinner
Male,8.77,2.0,No,Sun,Dinner
Male,15.04,1.96,No,Sun,Dinner
Male,14.78,3.23,No,Sun,Dinner
Female,16.99,1.01,No,Sun,Dinner


__Takeaways__<br>
When dealing with multiple indices we have seen how to access data using the __loc__ indexer and the __xs__ method. We have also seen how to _create an index_ from a _categorical variable_.
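
Instead of chaining two `.loc` calls, both index levels can be given at once as a tuple. The sketch below rebuilds a smaller version of the Company/Year example under the made-up name `revenue_df`.

```python
import pandas as pd

# Hypothetical, shortened version of the Company/Year example.
index = pd.MultiIndex.from_tuples(
    [('Google', 2008), ('Google', 2009), ('Microsoft', 2008), ('Microsoft', 2009)],
    names=['Company', 'Year'])
revenue_df = pd.DataFrame({'Revenue': [11, 15, 9, 12]}, index=index)

# One tuple selects on both levels at once; the second argument picks the column.
google_2009 = revenue_df.loc[('Google', 2009), 'Revenue']

# xs selects on an inner level without naming the outer one.
year_2008 = revenue_df.xs(2008, level='Year')

print(google_2009)
print(year_2008)
```
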

<a id=section211></a>

### 2.3.8. Groupby

Groupby allows you to group together _rows_ based on a _column_ and perform an aggregate function on them.<br/>
This is quite a handy tool if you don't want to change the index of the DataFrame.

In [125]:
byTime = tips.groupby('time')                                                    # Let's analyse the metrics based on time of meal (Lunch, Dinner)

In [192]:
byTime.count()                                                                   # Number of meals for Lunch and Dinner

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,size
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Lunch,68,68,68,68,68,68
Dinner,176,176,176,176,176,176


We can evaluate the average bill, tip and serving size at Lunch and Dinner

In [193]:
byTime.mean(numeric_only=True)                                                   # Average bill, tip and size; numeric_only=True skips the non-numeric columns

Unnamed: 0_level_0,total_bill,tip,size
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Lunch,17.168676,2.728088,2.411765
Dinner,20.797159,3.10267,2.630682


We can evaluate the total bill, tip and serving size at Lunch and Dinner

In [194]:
byTime.sum(numeric_only=True)                                                    # Total bill, tip and size; numeric_only=True skips the non-numeric columns

Unnamed: 0_level_0,total_bill,tip,size
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Lunch,1167.47,185.51,164
Dinner,3660.3,546.07,463


Take a quick look at the image below to get a clear picture of how _groupby_ works.
![image.png](attachment:image.png)

__Takeaways__<br>
As we have seen in the examples above, the __groupby__ method groups rows together by the values of a column, after which we can calculate aggregates such as the _count_, _average_ and _sum_ of the other features.
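
Several aggregates can also be computed in a single call with `agg`. The sketch below uses a small made-up `meals_df` rather than the tips dataset, so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical data; the column names mimic the tips dataset.
meals_df = pd.DataFrame({'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
                         'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
                         'tip': [1.0, 3.0, 4.0, 5.0, 6.0]})

# Split the rows by 'time', then compute several aggregates of one column.
summary = meals_df.groupby('time')['total_bill'].agg(['count', 'mean', 'sum'])
print(summary)

# A dictionary assigns a different aggregate to each column.
per_column = meals_df.groupby('time').agg({'total_bill': 'sum', 'tip': 'mean'})
print(per_column)
```
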

<a id=section213></a>

### 2.3.9. Operations on a DataFrame
Here are some basic operations you can perform on a DataFrame

In [195]:
tips.head()                                                                     # Observe the first 5 rows of the dataframe

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [196]:
tips['size'].unique()                                                           # Observe the unique values of tips['size']

array([2, 3, 4, 1, 6, 5], dtype=int64)

In [197]:
tips['size'].nunique()                                                          # Observe the number of unique values of tips['size']

6

In [198]:
tips['size'].value_counts()                                                     # Observe how many times each unique value of tips['size'] occurs

2    156
3     38
4     37
5      5
6      4
1      4
Name: size, dtype: int64

### Applying a function on DataFrame
We can apply any function on the elements of a dataframe

In [199]:
people = [['Rick',60, 'O+'], ['Morty', 10, 'O+'], ['Summer', 45,'A-'], ['Beth',18,'B+']]
people_df = pd.DataFrame(people, columns=['Name','Score', 'Blood Group'])
people_df

Unnamed: 0,Name,Score,Blood Group
0,Rick,60,O+
1,Morty,10,O+
2,Summer,45,A-
3,Beth,18,B+


In [200]:
def times2(x):                                                                # We are going to apply this function
    return x * 2

In [201]:
people_df['Score'].apply(times2)                                              # Applying times2() function on a column of dataframe

0    120
1     20
2     90
3     36
Name: Score, dtype: int64

In [202]:
people_df['Score'].apply(lambda x: x * 2)                                     # Applying lambda function on a column of dataframe

0    120
1     20
2     90
3     36
Name: Score, dtype: int64

In [203]:
people_df.sort_values('Score')                                                # Sorting the records based on a column

Unnamed: 0,Name,Score,Blood Group
1,Morty,10,O+
3,Beth,18,B+
2,Summer,45,A-
0,Rick,60,O+


__Takeaways__<br>
In this manner we can apply any function to a _column_ of a dataframe. <br>
These functions are widely used in **Feature Engineering**
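
For simple arithmetic such as doubling a column, `apply` is not strictly needed: the operation can be written directly on the column, which pandas evaluates for all elements at once. The sketch below compares both on a made-up `scores` Series and then shows a case where `apply` is the more natural tool.

```python
import pandas as pd

# Hypothetical scores, matching the values used in the example above.
scores = pd.Series([60, 10, 45, 18], name='Score')

doubled_apply = scores.apply(lambda x: x * 2)   # calls the function once per element
doubled_vector = scores * 2                     # operates on the whole column at once

print(doubled_apply.equals(doubled_vector))

# apply remains convenient when the logic is not simple arithmetic.
grades = scores.apply(lambda x: 'pass' if x >= 40 else 'fail')
print(grades.tolist())
```
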

<a id=section214></a>

### 2.4. Time Series in Pandas
Here we will explore the __DateTime__ functionality provided by Pandas and see how it helps in analysing time series data.<br/>
Shown below is time series data of the pollution in South Africa.
![image.png](attachment:image.png)

In [127]:
air_df = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/AirQualityUCI.csv')                                      # Import the Dataset

In [128]:
air_df.describe(include = 'all')

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,9357,9357,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0
unique,391,24,,,,,,,,,,,,,
top,4/30/2004,21:00:00,,,,,,,,,,,,,
freq,24,390,,,,,,,,,,,,,
mean,,,-34.207524,1048.990061,-159.090093,1.865683,894.595276,168.616971,794.990168,58.148873,1391.479641,975.072032,9.778305,39.48538,-6.837604
std,,,77.65717,329.83271,139.789093,41.380206,342.333252,257.433866,321.993552,126.940455,467.210125,456.938184,43.203623,51.216145,38.97667
min,,,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0
25%,,,0.6,921.0,-200.0,4.0,711.0,50.0,637.0,53.0,1185.0,700.0,10.9,34.1,0.6923
50%,,,1.5,1053.0,-200.0,7.9,895.0,141.0,794.0,96.0,1446.0,942.0,17.2,48.6,0.9768
75%,,,2.6,1221.0,-200.0,13.6,1105.0,284.0,960.0,133.0,1662.0,1255.0,24.1,61.9,1.2962


In [129]:
air_df.head()                                                                  # Observe the columns and rows of the dataset

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888


The info method shows the datatype and the number of non-null values of each column.

In [261]:
air_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
Date             9357 non-null object
Time             9357 non-null object
CO(GT)           9357 non-null float64
PT08.S1(CO)      9357 non-null int64
NMHC(GT)         9357 non-null int64
C6H6(GT)         9357 non-null float64
PT08.S2(NMHC)    9357 non-null int64
NOx(GT)          9357 non-null int64
PT08.S3(NOx)     9357 non-null int64
NO2(GT)          9357 non-null int64
PT08.S4(NO2)     9357 non-null int64
PT08.S5(O3)      9357 non-null int64
T                9357 non-null float64
RH               9357 non-null float64
AH               9357 non-null float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB


The __Date and Time__ columns have the _object_ datatype. We need to convert them to __datetime__ format.<br>
First convert Date to datetime. The format argument specifies that the strings are laid out as MM/DD/YYYY.

In [262]:
air_df['Date'] = pd.to_datetime(air_df['Date'], format='%m/%d/%Y')     

Checking the __unique values__ in air_df[ 'Time' ]; there are 24 of them, one for each hour of the day.

In [263]:
air_df['Time'].unique()     

array(['18:00:00', '19:00:00', '20:00:00', '21:00:00', '22:00:00',
       '23:00:00', '0:00:00', '1:00:00', '2:00:00', '3:00:00', '4:00:00',
       '5:00:00', '6:00:00', '7:00:00', '8:00:00', '9:00:00', '10:00:00',
       '11:00:00', '12:00:00', '13:00:00', '14:00:00', '15:00:00',
       '16:00:00', '17:00:00'], dtype=object)

In [264]:
air_df[air_df['NMHC(GT)']==-200].count()

Date             8443
Time             8443
CO(GT)           8443
PT08.S1(CO)      8443
NMHC(GT)         8443
C6H6(GT)         8443
PT08.S2(NMHC)    8443
NOx(GT)          8443
PT08.S3(NOx)     8443
NO2(GT)          8443
PT08.S4(NO2)     8443
PT08.S5(O3)      8443
T                8443
RH               8443
AH               8443
dtype: int64

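About 90% of the values of NMHC(GT) are -200 (8443 of 9357 rows); -200 is the value this dataset uses to tag missing readings, so we drop this column. Note that drop returns a new DataFrame: air_df itself keeps the column unless the result is assigned back or inplace=True is passed.

In [265]:
air_df.drop(['NMHC(GT)'], axis=1)                     # Returns a new df without NMHC(GT); air_df is unchanged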

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,2004-03-10,18:00:00,2.6,1360,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,2004-03-10,19:00:00,2.0,1292,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,2004-03-10,20:00:00,2.2,1402,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,2004-03-10,21:00:00,2.2,1376,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,2004-03-10,22:00:00,1.6,1272,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888
5,2004-03-10,23:00:00,1.2,1197,4.7,750,89,1337,96,1393,949,11.2,59.2,0.7848
6,2004-03-11,0:00:00,1.2,1185,3.6,690,62,1462,77,1333,733,11.3,56.8,0.7603
7,2004-03-11,1:00:00,1.0,1136,3.3,672,62,1453,76,1333,730,10.7,60.0,0.7702
8,2004-03-11,2:00:00,0.9,1094,2.3,609,45,1579,60,1276,620,10.7,59.7,0.7648
9,2004-03-11,3:00:00,0.6,1010,1.7,561,-200,1705,-200,1235,501,10.3,60.2,0.7517


Converting __Time__ to pandas __datetime__ with format HH:MM:SS. Since these strings contain no date, pandas fills in the default date 1900-01-01.

In [266]:
air_df['Time'] = pd.to_datetime(air_df['Time'], format = '%H:%M:%S')     

In [267]:
air_df.dtypes                                         # Final check for our date and time column

Date             datetime64[ns]
Time             datetime64[ns]
CO(GT)                  float64
PT08.S1(CO)               int64
NMHC(GT)                  int64
C6H6(GT)                float64
PT08.S2(NMHC)             int64
NOx(GT)                   int64
PT08.S3(NOx)              int64
NO2(GT)                   int64
PT08.S4(NO2)              int64
PT08.S5(O3)               int64
T                       float64
RH                      float64
AH                      float64
dtype: object
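Because the format `%H:%M:%S` contains no date, pandas fills in the default date 1900-01-01 when it builds the timestamps. A minimal sketch with made-up time strings (the names `times`, `parsed` and `time_only` are illustrative only) shows this, along with `.dt.time` as one way to keep just the time of day:

```python
import pandas as pd

# Hypothetical time strings in the same HH:MM:SS layout as the Time column
times = pd.Series(['18:00:00', '19:00:00', '00:00:00'])

# Parsing a time-only format yields full timestamps dated 1900-01-01
parsed = pd.to_datetime(times, format='%H:%M:%S')
print(parsed)

# .dt.time drops the placeholder date and returns datetime.time objects
time_only = parsed.dt.time
print(time_only)
```

Keeping the full timestamps is convenient for the `.dt` accessor, while `time_only` has dtype `object` and no longer supports it.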

In [268]:
year = air_df.Date.dt.year                            # Extracting Year from Date column
print(year.head())

0    2004
1    2004
2    2004
3    2004
4    2004
Name: Date, dtype: int64


In [269]:
month = air_df.Date.dt.month                          # Extracting Month from Date column
print(month.head())

0    3
1    3
2    3
3    3
4    3
Name: Date, dtype: int64


In [270]:
month.nunique()                                       # Counting the number of distinct months

12

In [271]:
day = air_df.Date.dt.day                              # Extracting Day from Date column
print(day.head())    

0    10
1    10
2    10
3    10
4    10
Name: Date, dtype: int64


In [272]:
day.nunique()                                         # Counting the number of distinct days of the month

31

In [273]:
week = air_df.Date.dt.isocalendar().week              # Extracting ISO week number (dt.week was removed in pandas 2.0)
print(week.head())

0    11
1    11
2    11
3    11
4    11
Name: week, dtype: UInt32


The __indices of the weekdays__ are listed below.<br>
0 = Monday<br>1 = Tuesday<br>2 = Wednesday<br>3 = Thursday<br>4 = Friday<br>5 = Saturday<br>6 = Sunday

In [274]:
day_of_week = air_df.Date.dt.dayofweek                 # Extracting the day of the week number
print(day_of_week.head())

0    2
1    2
2    2
3    2
4    2
Name: Date, dtype: int64


In [275]:
day_name = air_df.Date.dt.day_name()                   # Extracting the name of the day (weekday_name was removed in pandas 1.0)
print(day_name.head())

0    Wednesday
1    Wednesday
2    Wednesday
3    Wednesday
4    Wednesday
Name: Date, dtype: object


In [276]:
day_of_year = air_df.Date.dt.dayofyear                 # Extracting the day of the year
print(day_of_year.head())

0    70
1    70
2    70
3    70
4    70
Name: Date, dtype: int64


In [277]:
hour = air_df.Time.dt.hour                             # Extracting the hour from time
print(hour.head())

0    18
1    19
2    20
3    21
4    22
Name: Time, dtype: int64


In [278]:
hour.nunique()                                         # Counting the number of distinct hours

24

In [279]:
minute = air_df.Time.dt.minute                         # Extracting the minutes from the time
print(minute.head())

0    0
1    0
2    0
3    0
4    0
Name: Time, dtype: int64


In [280]:
second = air_df.Time.dt.second                         # Extracting the seconds from the time
print(second.head())

0    0
1    0
2    0
3    0
4    0
Name: Time, dtype: int64
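The individual extractions above can also be collected into one table. The following is a minimal sketch on three made-up dates; the names `dates` and `features` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical dates; the first one matches the first row of the dataset
dates = pd.Series(pd.to_datetime(['2004-03-10', '2004-03-11', '2005-01-01']))

# One column per date component, all taken from the .dt accessor
features = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'day_of_week': dates.dt.dayofweek,   # 0 = Monday ... 6 = Sunday
    'day_name': dates.dt.day_name(),
    'day_of_year': dates.dt.dayofyear,
})
print(features)
```

A frame like this can be joined back to the original data when the components are needed for grouping or filtering.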


Performing conditional operations on Time. <br>
Let's count the number of records before 9 a.m.

In [281]:
timestamp = pd.to_datetime("09:00:00", format='%H:%M:%S')   
air_df[air_df['Time'] < timestamp].shape

(3510, 15)

There are 3510 records where the time is before 9 a.m.
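For a cutoff that falls on a whole hour, comparing the extracted hour gives the same rows as comparing against a timestamp. The following is a minimal sketch on made-up times; `toy`, `cutoff`, `by_timestamp` and `by_hour` are illustrative names.

```python
import pandas as pd

# Hypothetical Time column parsed the same way as in the notebook
toy = pd.DataFrame({
    'Time': pd.to_datetime(['08:00:00', '09:00:00', '18:00:00', '03:00:00'],
                           format='%H:%M:%S')
})

# Approach 1: compare against a timestamp built with the same format
cutoff = pd.to_datetime('09:00:00', format='%H:%M:%S')
by_timestamp = toy[toy['Time'] < cutoff]

# Approach 2: compare the extracted hour directly
by_hour = toy[toy['Time'].dt.hour < 9]

print(by_timestamp.equals(by_hour))
```

The timestamp comparison only works because both sides share the same placeholder date; the hour comparison does not depend on that.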

Performing conditional operations on Date. <br>
Let's count the number of records before 01/01/2005.

In [282]:
datestamp = pd.to_datetime("01/01/2005", format='%d/%m/%Y')

In [283]:
air_df[air_df['Date'] < datestamp].shape

(7110, 15)

There are 7110 records before January 1, 2005.
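When only the count is needed, the boolean mask can be summed directly instead of building the filtered frame. The following is a minimal sketch on four made-up dates (`toy` and `cutoff` are illustrative names); it also shows that the strict `<` excludes the cutoff day itself.

```python
import pandas as pd

# Hypothetical dates on either side of the cutoff
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2004-12-30', '2004-12-31', '2005-01-01', '2005-01-02'])
})
cutoff = pd.Timestamp('2005-01-01')

# True counts as 1, so the sum is the number of matching rows
n_before = int((toy['Date'] < cutoff).sum())
n_up_to = int((toy['Date'] <= cutoff).sum())

print(n_before, n_up_to)
```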

__Takeaways__<br>
- Once a column is in __datetime format__, we can extract components such as __year, month, hour, and minute__ from it.
- We can also retrieve derived information such as the name of the day and the day of the year.
- These components help us understand the data better, which in turn helps in drawing conclusions and insights.

__Conclusion__<br/>
Python is a great language for data analysis, primarily because of its ecosystem of data-centric packages.<br/>
- __Pandas__ is one of those packages, and it makes importing and analyzing data much easier.
- It is an __open-source__, BSD-licensed Python library providing __high-performance__, __easy-to-use data structures__ and data analysis tools for the Python programming language.
- Python with Pandas is used in a wide range of _academic and commercial domains_, including __finance, economics, statistics, analytics__, etc.
- It builds on packages like __NumPy and matplotlib__ to give you a single, convenient place to do most of your data analysis and visualization work.

__Key Features__<br/>
- __Fast and efficient DataFrame object__ with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of __missing data__.
- __Reshaping and pivoting__ of data sets.
- __Label-based slicing__, __indexing and subsetting__ of large data sets.
- Columns from a data structure can be _deleted or inserted_.
- __Groupby__ data for aggregation and transformations.
- High performance __merging and joining__ of data.
- __Time Series__ functionality.

### The End