### Reference: 
https://github.com/LearnDataSci/article-resources

### Agenda:
1. Why use Pandas?
2. Pandas with other libraries 
3. Pre-requisites 
4. Pandas Installation 
5. Introduction to Data Structures
6. Python Pandas - Series
7. Pandas Operations in Series 
8. Type Conversion
9. Broadcasting using Arithmetic Operations 

# Python Pandas Introduction

<img src="the-rise-in-popularity-of-pandas.png" width=500px />

## Why Use Pandas?

DATA SCIENCE = DATA + ALGORITHM 

Pandas handles the DATA side. Its key features:
1. Data Cleaning.
    - Removing missing values and filtering rows or columns by some criteria.

2. Data Transforming.
    - Converting data from one format or structure into another.

3. Data Analyzing.
    - What's the average, median, max, or min of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?

4. Data Visualization.
    - Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.

5. Data Storage.
    - Store the cleaned, transformed data back into a CSV file, another file format, or a database.


<b>Conclusion: </b>

Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is one of the most practical tools for building it.

## Pandas with other libraries 

1. Pandas is built on top of the **NumPy** package, meaning a lot of the structure of NumPy is used or replicated in Pandas. 

2. Data in pandas is often used to feed statistical analysis in **SciPy**.

3. It is also passed to plotting functions from **Matplotlib**.

4. It serves as input to machine learning algorithms in **Scikit-learn**.


## Pre-requisites 

1. Python basics, e.g. lists, tuples, dictionaries, functions, and iteration.

2. Python NumPy (Recommended)

## Pandas Installation 

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

`conda install pandas`

OR 

`pip install pandas`

Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

In [1]:
!pip install pandas





The `!` at the beginning of a line runs that line as a shell command, as if it were typed in a terminal.

## How to import Pandas?

Because pandas is used so often, it is usually imported under the shorter alias `pd`:

In [2]:
import pandas as pd
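
Because pandas behaviour can vary between releases, it can be useful to confirm which version was actually installed. A minimal check (the variable name is only illustrative):

```python
import pandas as pd

# Version string of the installed pandas package
installed_version = pd.__version__
print(installed_version)
```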

# Introduction to Data Structures
Pandas deals with the following two data structures:

1. Series (1-dimensional data)
2. DataFrame (2-dimensional, tabular data)

## Core components of pandas: Series and DataFrames

The primary two components of pandas are the `Series` and `DataFrame`. 

A `Series` is essentially a column, and a `DataFrame` is a two-dimensional table made up of a collection of Series.

<img src="series-and-dataframe.png" width=600px />

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
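
As a minimal sketch of that similarity, the example below calls `mean()` and `fillna()` on both objects. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
s = pd.Series([1.0, np.nan, 3.0])
df = pd.DataFrame({"apples": [1.0, np.nan, 3.0], "oranges": [4.0, 5.0, np.nan]})

# mean() skips NaN by default: a scalar for a Series, one value per column for a DataFrame
series_mean = s.mean()
frame_means = df.mean()

# fillna() is available on both objects
s_filled = s.fillna(0)
df_filled = df.fillna(0)

print(series_mean)
print(frame_means)
```

The same method name returns a single number for the Series and a Series of per-column results for the DataFrame.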

You'll see how these components work when we start working with data below, beginning with the `Series`.

# Python Pandas - Series

In [3]:
import pandas as pd
import numpy as np

### Create an Empty Series

In [4]:
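# Pass dtype explicitly: the default dtype of an empty Series differs between pandas versions
a = pd.Series(dtype="float64")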
print(a)

Series([], dtype: float64)


### Create a Series from ndarray

In [5]:
lst = np.array(['a','b','c','d','e'])
b = pd.Series(lst)
print(b)

0    a
1    b
2    c
3    d
4    e
dtype: object


In [6]:
lst = np.array(['a','b','c','d','e'])
b = pd.Series(lst,index=[50,51,52,53,54])
print(b)

50    a
51    b
52    c
53    d
54    e
dtype: object


### Create a Series from dict

In [7]:
# Dictionary keys are used to construct index.
lst = {'a' : 0, 'b' : 1, 'c' : 2,'d' : 3}
c = pd.Series(lst)
print(c)

a    0
b    1
c    2
d    3
dtype: int64


In [8]:
# The order given in `index` is kept, and a label missing from the dict ('e') is filled with NaN (Not a Number).
lst = {'a' : 0, 'b' : 1, 'c' : 2,'d' : 3}
c = pd.Series(lst,index=['b','c','d','a','e'])
print(c)


b    1.0
c    2.0
d    3.0
a    0.0
e    NaN
dtype: float64
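
The dtype changes from `int64` to `float64` in the output above because NaN is a floating-point value. The sketch below shows one way to locate the missing label and get back to integers; the variable names are illustrative.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
c = pd.Series(data, index=['b', 'c', 'd', 'a', 'e'])

# Boolean mask that is True only where the value is missing (label 'e')
missing = c.isna()

# Drop the gap, after which the values fit in an integer dtype again
restored = c.dropna().astype('int64')

print(missing)
print(restored)
```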


### Create a Series from Scalar

In [9]:
# If data is a scalar value, an index must be provided. 
d = pd.Series(3,index=[0, 1, 2, 3, 4])
print(d)

0    3
1    3
2    3
3    3
4    3
dtype: int64


## Pandas Operations in Series 

In [10]:
# Marks of ten students, stored in a list
marks = [35,67,34,89,12,55,83,56,90,99]

### Creating series

In [11]:
ser = pd.Series(marks)
print(ser)

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: int64


### Accessing the Elements

In [12]:
# Accessing using iloc
ser.iloc[5]

55

In [13]:
# Accessing using loc
ser.loc[5]

55
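
Both calls return 55 here only because the default index (0 to 9) coincides with the positions. The sketch below uses a hypothetical custom index of roll numbers to show that `iloc` selects by position while `loc` selects by label.

```python
import pandas as pd

marks = [35, 67, 34, 89, 12, 55, 83, 56, 90, 99]

# Hypothetical roll numbers 101..110 used as index labels
ser_labeled = pd.Series(marks, index=range(101, 111))

by_position = ser_labeled.iloc[5]   # sixth element, regardless of labels
by_label = ser_labeled.loc[105]     # element whose label is 105 (the fifth element)

print(by_position)  # 55
print(by_label)     # 12
```

With this index, `ser_labeled.loc[5]` would raise a `KeyError`, since no label 5 exists.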

### describe

In [14]:
print(ser)
print(ser.describe())

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: int64
count    10.000000
mean     62.000000
std      28.763403
min      12.000000
25%      40.000000
50%      61.500000
75%      87.500000
max      99.000000
dtype: float64


### count

In [15]:
print(ser.count())
print(len(ser))

10
10


### maximum

In [16]:
ser.max()

99

### minimum

In [17]:
ser.min()

12

### sum of all numbers

In [18]:
ser.sum()

620

### mean

In [19]:
ser.mean()

62.0

### median

In [20]:
ser.median()

61.5

### standard deviation

In [21]:
ser.std()

28.763402673072832

### variance

In [22]:
ser.var()

827.3333333333334

### mad - Mean Absolute Deviation

In [23]:
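# Series.mad() was removed in pandas 2.0; compute the mean absolute deviation directly
(ser - ser.mean()).abs().mean()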

23.6

### percentile
To calculate percentiles, pass a list of quantiles to `quantile()`. Each value must lie between 0 and 1:
* 0 means 0%
* 1 means 100%

In [24]:
ser.quantile([0,0.25,0.50,0.75,1])

0.00    12.0
0.25    40.0
0.50    61.5
0.75    87.5
1.00    99.0
dtype: float64

### Filter Conditions

In [25]:
# Filter students by marks, e.g. Grade B: 80 <= marks < 90
print(ser)
bgrade = ser.loc[(ser >= 80) & (ser < 90)]
print(bgrade)

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: int64
3    89
6    83
dtype: int64
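
The expression inside `loc[...]` is itself a boolean Series, and that mask is what drives the selection. The short sketch below makes the intermediate mask visible; the names `mask` and `others` are illustrative.

```python
import pandas as pd

ser = pd.Series([35, 67, 34, 89, 12, 55, 83, 56, 90, 99])

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and <.
mask = (ser >= 80) & (ser < 90)
bgrade = ser[mask]

# ~ negates a mask: every student who is NOT in grade B
others = ser[~mask]

print(mask.sum())       # number of students in grade B
print(bgrade.tolist())
```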


## Type Conversion

In [26]:
print(ser)
print(ser.dtype)

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: int64
int64


### Convert to String 

In [27]:
ser.astype(str)

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: object

### Convert to float 

In [28]:
ser.astype(float)

0    35.0
1    67.0
2    34.0
3    89.0
4    12.0
5    55.0
6    83.0
7    56.0
8    90.0
9    99.0
dtype: float64

### Series with multiple data types

In [29]:
# A Series stays homogeneous: mixed values are all stored under the common dtype `object`
multidt = pd.Series([35,67,"C",89,12,True,83.50,"Python",90,False])
multidt

0        35
1        67
2         C
3        89
4        12
5      True
6      83.5
7    Python
8        90
9     False
dtype: object
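
With dtype `object`, the Series stores references to ordinary Python objects, so each element keeps its own type even though the Series reports a single dtype. A quick sketch to check this (the list name `element_types` is illustrative):

```python
import pandas as pd

multidt = pd.Series([35, 67, "C", 89, 12, True, 83.50, "Python", 90, False])

# Collect the Python type name of every element
element_types = [type(value).__name__ for value in multidt]

print(multidt.dtype)
print(element_types)
```

Vectorized arithmetic on such a Series falls back to slow per-element Python operations, which is one reason to keep columns to a single real type where possible.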

## Broadcasting using Arithmetic Operations 

In [30]:
# Broadcasting in Series
print(ser)
print("**********")
print(ser*2)

0    35
1    67
2    34
3    89
4    12
5    55
6    83
7    56
8    90
9    99
dtype: int64
**********
0     70
1    134
2     68
3    178
4     24
5    110
6    166
7    112
8    180
9    198
dtype: int64


In [31]:
# Creating two series for arithmetic operation
ser1 = pd.Series([10,20,30,40,50])
ser2 = pd.Series([5,15,25,35,45])

### Addition

In [32]:
ser1 + ser2

0    15
1    35
2    55
3    75
4    95
dtype: int64

### Subtraction

In [33]:
ser1 - ser2

0    5
1    5
2    5
3    5
4    5
dtype: int64

### Multiplication

In [34]:
ser1 * ser2

0      50
1     300
2     750
3    1400
4    2250
dtype: int64

### Division

In [35]:
ser1 / ser2

0    2.000000
1    1.333333
2    1.200000
3    1.142857
4    1.111111
dtype: float64

### Series with unequal length

In [36]:
# Creating two series for arithmetic operation
ser3 = pd.Series([10,20,30,40,50,60])
ser4 = pd.Series([5,15,25,35,45])

In [37]:
ser3+ser4

0    15.0
1    35.0
2    55.0
3    75.0
4    95.0
5     NaN
dtype: float64
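
The NaN appears because pandas aligns the two Series on their index labels before adding, and label 5 exists only in `ser3`. If a missing partner should count as zero instead, one option is the `add()` method with `fill_value`, sketched below. The `shuffled` Series is a made-up example to show that alignment goes by label, not position.

```python
import pandas as pd

ser3 = pd.Series([10, 20, 30, 40, 50, 60])
ser4 = pd.Series([5, 15, 25, 35, 45])

# Treat a label that is missing from one side as 0 before adding
total = ser3.add(ser4, fill_value=0)
print(total)

# Alignment is by label: a Series with a reversed index still pairs up correctly
shuffled = pd.Series([45, 35, 25, 15, 5], index=[4, 3, 2, 1, 0])
aligned = pd.Series([10, 20, 30, 40, 50]) + shuffled
print(aligned)
```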

### Same example using NumPy

In [38]:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([5, 15, 25, 35, 45])

In [39]:
arr1+arr2

ValueError: operands could not be broadcast together with shapes (6,) (5,) 
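
NumPy pairs elements purely by position and shape, so shapes (6,) and (5,) are incompatible. Broadcasting does succeed when one operand is a scalar or has a dimension of size 1. A brief sketch, with illustrative variable names:

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([5, 15, 25, 35, 45])

# A scalar is broadcast to every element
doubled = arr1 * 2

# Shapes (6, 1) and (5,) broadcast to (6, 5): every pairwise sum
grid = arr1[:, np.newaxis] + arr2

# Shapes (6,) and (5,) cannot be broadcast, so NumPy raises ValueError
try:
    arr1 + arr2
except ValueError as err:
    message = str(err)

print(doubled)
print(grid.shape)
print(message)
```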