<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Introduction to Pandas</h1>
</div>

© Copyright Machine Learning Plus

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. What is pandas?</h2>
</div>

Pandas is Python's most popular library for data manipulation and analysis. Data analysts and scientists often work with data in formats such as .csv and .xlsx files. Pandas makes it easy to import, process, and analyse such data.

The __DataFrame__ is a popular data structure modeled after the data frame object from the R programming language, extended to take advantage of Python's object-oriented capabilities.

Pandas is built on top of NumPy, which is best known for its arrays.

__Built By:__

_Wes McKinney_ and Team.

Development started in 2008 at AQR Capital Management, and the project was open sourced in 2009.

Today it is actively developed and maintained by contributors throughout the world.

Pandas does its work through two primary data structures:
1. Series
2. DataFrame


One column of a __DataFrame__ when extracted becomes a __Series__.

__Why Pandas?__

1. Handles all sorts of data manipulation.
2. The default library for handling tabular data.
3. Compatible with ML libraries like scikit-learn.
4. Excellent documentation.
5. Wide adoption, so it is easy to find a solution for almost anything.
6. Built-in plotting for data analysis.
7. Ability to handle multiple data types in the same DataFrame.

More here: https://pandas.pydata.org/about/

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. DataFrame</h2>
</div>


By convention, `numpy` is imported as `np` and `pandas` as `pd`. Nearly all pandas code and documentation follows this convention.

In [1]:
# Import convention
import numpy as np
import pandas as pd

In [2]:
# Create Numpy Array or List
np.random.seed(100)
arr = np.random.randint(0,100,(5,3))
arr

array([[ 8, 24, 67],
       [87, 79, 48],
       [10, 94, 52],
       [98, 53, 66],
       [98, 14, 34]], dtype=int32)

Create a DataFrame from the array. Default column names and row indexes (0, 1, 2, ...) are generated.

In [3]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2
0,8,24,67
1,87,79,48
2,10,94,52
3,98,53,66
4,98,14,34


In [4]:
type(df)

pandas.core.frame.DataFrame

You can supply your own row and column labels as well.

In [5]:
rownames = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
columnnames = ['Jan', 'Feb', 'Mar']

In [6]:
df = pd.DataFrame(arr, index=rownames, columns=columnnames)
df

Unnamed: 0,Jan,Feb,Mar
Mon,8,24,67
Tue,87,79,48
Wed,10,94,52
Thu,98,53,66
Fri,98,14,34
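
Once a DataFrame has custom labels, rows and columns can be looked up by name. A minimal sketch, which types in the array values shown above by hand rather than regenerating them randomly (the names `tue_feb` and `jan` are only for illustration):

```python
import numpy as np
import pandas as pd

# Same values as the array printed earlier, typed in by hand
arr = np.array([[ 8, 24, 67],
                [87, 79, 48],
                [10, 94, 52],
                [98, 53, 66],
                [98, 14, 34]])

df = pd.DataFrame(arr,
                  index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                  columns=['Jan', 'Feb', 'Mar'])

# The labels are stored on the DataFrame itself
print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels

# .loc looks up by label: row 'Tue', column 'Feb'
tue_feb = df.loc['Tue', 'Feb']
print(tue_feb)

# Selecting one column by name returns a Series indexed by the row labels
jan = df['Jan']
print(jan['Fri'])
```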


### How to create a DataFrame from a Dictionary?

In [7]:
mydict = {
    'Jan' : [1, 2, 3, 4, 5],
    'Feb' : [10, 20, 30, 40, 50],
    'Mar' : [15, 25, 35, 45, 55],
}
  
# dataframe from dict
df = pd.DataFrame(mydict, index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
df

Unnamed: 0,Jan,Feb,Mar
Mon,1,10,15
Tue,2,20,25
Wed,3,30,35
Thu,4,40,45
Fri,5,50,55


Since Python 3.7, a dictionary preserves the insertion order of its keys, and pandas creates the columns in that order. To get the columns in a different order, pass the `columns` argument explicitly.

In [8]:
mydict = {
    'Jan' : [1, 2, 3, 4, 5],
    'Feb' : [10, 20, 30, 40, 50],
    'Mar' : [15, 25, 35, 45, 55],
}
  
# dataframe from dict
df = pd.DataFrame(mydict, 
                  index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                  columns=['Mar', 'Jan', 'Feb'])
df

Unnamed: 0,Mar,Jan,Feb
Mon,15,1,10
Tue,25,2,20
Wed,35,3,30
Thu,45,4,40
Fri,55,5,50
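
As a small sketch of column ordering, assuming Python 3.7 or later (where dictionaries keep their insertion order): the columns first follow the order of the keys, and an existing DataFrame can also be reordered by selecting its columns with a list. The name `reordered` is only for illustration.

```python
import pandas as pd

mydict = {
    'Jan': [1, 2, 3],
    'Feb': [10, 20, 30],
    'Mar': [15, 25, 35],
}

# Columns follow the insertion order of the dictionary keys
df = pd.DataFrame(mydict)
print(df.columns.tolist())

# Reorder an existing DataFrame by selecting columns in the order you want
reordered = df[['Mar', 'Jan', 'Feb']]
print(reordered.columns.tolist())
```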


__Reading Data From Files__

Data can come in various file formats, and pandas supports importing from many different sources.

One of the most common file formats is CSV. You can use `pd.read_csv` to import it.

In [9]:
df = pd.read_csv('Datasets/ToothGrowth.csv')


In [None]:
df

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


To see only the top 5 rows, use `df.head()`. Likewise, `df.tail()` returns the bottom 5 rows.

In [None]:
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


See bottom 5 rows

In [None]:
df.tail()

Unnamed: 0,len,supp,dose
55,30.9,OJ,2.0
56,26.4,OJ,2.0
57,27.3,OJ,2.0
58,29.4,OJ,2.0
59,23.0,OJ,2.0
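
Both methods also accept the number of rows to return. A brief sketch on a small stand-in DataFrame (not the ToothGrowth data; the column names `x` and `y` are made up for illustration):

```python
import pandas as pd

# A small stand-in DataFrame with 10 rows
df = pd.DataFrame({'x': range(10), 'y': list('abcdefghij')})

first_three = df.head(3)   # first 3 rows instead of the default 5
last_two = df.tail(2)      # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```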


`df.shape` gives the dimensions of the DataFrame as a tuple: (number of rows, number of columns).

In [None]:
df.shape

(60, 3)
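
Because the shape is an ordinary tuple, it can be unpacked into separate variables. A minimal sketch on a hand-made stand-in DataFrame (the names `n_rows` and `n_cols` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'len': [4.2, 11.5, 7.3], 'supp': ['VC', 'VC', 'OJ']})

n_rows, n_cols = df.shape  # unpack (rows, columns)
print(n_rows, n_cols)

print(len(df))   # len() gives the number of rows
print(df.size)   # size gives the total number of cells (rows * columns)
```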

To get the underlying NumPy array behind the DataFrame, use the `.values` attribute (the pandas documentation recommends the equivalent `df.to_numpy()` method).

In [None]:
df.values

array([[4.2, 'VC', 0.5],
       [11.5, 'VC', 0.5],
       [7.3, 'VC', 0.5],
       [5.8, 'VC', 0.5],
       [6.4, 'VC', 0.5],
       [10.0, 'VC', 0.5],
       [11.2, 'VC', 0.5],
       [11.2, 'VC', 0.5],
       [5.2, 'VC', 0.5],
       [7.0, 'VC', 0.5],
       [16.5, 'VC', 1.0],
       [16.5, 'VC', 1.0],
       [15.2, 'VC', 1.0],
       [17.3, 'VC', 1.0],
       [22.5, 'VC', 1.0],
       [17.3, 'VC', 1.0],
       [13.6, 'VC', 1.0],
       [14.5, 'VC', 1.0],
       [18.8, 'VC', 1.0],
       [15.5, 'VC', 1.0],
       [23.6, 'VC', 2.0],
       [18.5, 'VC', 2.0],
       [33.9, 'VC', 2.0],
       [25.5, 'VC', 2.0],
       [26.4, 'VC', 2.0],
       [32.5, 'VC', 2.0],
       [26.7, 'VC', 2.0],
       [21.5, 'VC', 2.0],
       [23.3, 'VC', 2.0],
       [29.5, 'VC', 2.0],
       [15.2, 'OJ', 0.5],
       [21.5, 'OJ', 0.5],
       [17.6, 'OJ', 0.5],
       [9.7, 'OJ', 0.5],
       [14.5, 'OJ', 0.5],
       [10.0, 'OJ', 0.5],
       [8.2, 'OJ', 0.5],
       [9.4, 'OJ', 0.5],
       [16.5, 'OJ', 0
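
Because this DataFrame mixes numbers and strings, the resulting array has the generic `object` dtype. A brief sketch on a small stand-in DataFrame, using `to_numpy()` in place of `.values`; the names `mixed` and `numeric` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'len': [4.2, 11.5],
                   'supp': ['VC', 'OJ'],
                   'dose': [0.5, 1.0]})

mixed = df.to_numpy()                      # numbers and strings together
numeric = df[['len', 'dose']].to_numpy()   # numeric columns only

print(mixed.dtype)    # generic object dtype
print(numeric.dtype)  # a proper floating-point dtype
```

Selecting only the numeric columns first keeps the array in a numeric dtype, which is usually what downstream NumPy code expects.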

You can import data from text files as well, as long as you specify the separator correctly.

In [None]:
df = pd.read_table('Datasets/ToothGrowth.txt', sep=',')
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
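
The effect of the separator can be tried without any file on disk. A minimal sketch, assuming a tab-separated snippet held in memory via `io.StringIO` (the text and variable names are made up for illustration):

```python
import io
import pandas as pd

# Tab-separated text held in memory instead of a file on disk
tsv_text = "len\tsupp\tdose\n4.2\tVC\t0.5\n11.5\tVC\t0.5\n"

# read_table splits on tabs by default
df_tab = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result when told the separator
df_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tab)
print(df_tab.equals(df_csv))
```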


__You can directly read a file from the internet__

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/ToothGrowth.csv")
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


__From your clipboard as well__

In [None]:
df = pd.read_clipboard(sep="\t")
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


In [None]:
df

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


Besides this, pandas also supports reading a variety of other formats such as __pickle__, __fwf__ (fixed width format), __Excel__, __JSON, HTML tables, HDF Store, Feather, Parquet, ORC, SAS, SPSS, Stata, SQL queries and Google BigQuery__.
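
Most of these readers follow the `pd.read_<format>` naming pattern, with a matching `to_<format>` writer on the DataFrame. As one hedged example, here is a JSON round trip done entirely in memory; the data and the name `restored` are made up for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Bunny', 'Sunny'], 'age': [25, 30]})

# Write the rows out as a JSON string: one object per row
json_text = df.to_json(orient='records')
print(json_text)

# Read the JSON text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```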

### Mini Challenge

Convert the following lists to a Pandas DataFrame with two columns and an index.

1. 
```python
index = [1,2,3,4,5]
col1 = list('abcde')
col2 = list('pqrst')
```

In [None]:
# Solution 1
import pandas as pd
index = [1,2,3,4,5]
col1 = list('abcde')
col2 = list('pqrst')

pd.DataFrame({'col1':col1, 'col2':col2}, index=index)

2.
```python
# column names: 'name' and 'age'
lst = [['Bunny', 25], ['Sunny', 30], 
       ['Funny', 26], ['Hunny', 22]] 
```

In [None]:
# Solution 2
lst = [['Bunny', 25], 
       ['Sunny', 30], 
       ['Funny', 26], 
       ['Hunny', 22]] 

df = pd.DataFrame(lst, columns=['name', 'age'])
print(df)

In [None]:
https://git.io/Jsvl6

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Series and its relation with Dataframe</h2>
</div>

A `Series` is a one-dimensional data structure that stores a single column of data. You can think of a Series as one column extracted from a `DataFrame`.

A Series is very similar to a NumPy array; the main difference is that it has an index label for each observation.

In [None]:
import numpy as np
import pandas as pd

__Relationship between a Series and a DataFrame__

If you extract any given column from a DataFrame, the resulting object is a Series. 

In [None]:
df = pd.DataFrame(np.random.randint(1,100, (5,4)), columns=list('abcd'))
df

Unnamed: 0,a,b,c,d
0,25,16,61,59
1,17,10,94,87
2,3,28,5,32
3,2,14,84,5
4,92,60,68,8


In [None]:
df['a']

0    25
1    17
2     3
3     2
4    92
Name: a, dtype: int32

In [None]:
type(df['a'])

pandas.core.series.Series

You can then slice the Series to get specific elements.

In [None]:
df['a'][0:3]

0    25
1    17
2     3
Name: a, dtype: int32
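
`df['a'][0:3]` first selects the column and then slices it. The same selection can also be written in a single step with `.iloc` (by position) or `.loc` (by label). A sketch on a hand-typed stand-in DataFrame; note that `.loc` slices include the end label, while `.iloc` slices exclude the end position:

```python
import pandas as pd

df = pd.DataFrame({'a': [25, 17, 3, 2, 92], 'b': [16, 10, 28, 14, 60]})

by_position = df['a'].iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:2, 'a']       # labels 0 through 2 (end included)

print(by_position.tolist())
print(by_label.tolist())
```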

To get the NumPy array, use `.values`.

In [None]:
df['a'][0:3].values

array([25, 17,  3])

You can further convert it to a list.

In [None]:
df['a'][0:3].values.tolist()

[25, 17, 3]

__Create a standalone Series object__

In [None]:
data = np.arange(10)
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

ser = pd.Series(data=data, name='numbers')  # name is optional. 
ser

# If you don't provide an index, pandas creates a default one starting at 0

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: numbers, dtype: int32

In [None]:
ser = pd.Series(data=data, index=index, name='numbers')  # name is optional. 
ser

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: numbers, dtype: int32

In [None]:
type(ser)

pandas.core.series.Series

Series support vectorized operations. For example, to multiply every item by 2, you don't have to write a for-loop; just multiply the Series by 2.

In [None]:
ser * 2

a     0
b     2
c     4
d     6
e     8
f    10
g    12
h    14
i    16
j    18
Name: numbers, dtype: int32
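
The same idea applies to other arithmetic and comparison operators. A short sketch on a smaller stand-in Series, with the equivalent explicit loop shown for comparison (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5), index=list('abcde'), name='numbers')

# Vectorized: the operation is applied to every element at once
doubled = ser * 2

# The equivalent explicit loop, for comparison
doubled_loop = pd.Series([x * 2 for x in ser], index=ser.index)

# Operators can be combined, and comparisons return a boolean Series
squared_plus_one = ser ** 2 + 1
is_even = ser % 2 == 0

print(doubled.tolist())
print(squared_plus_one.tolist())
print(is_even.tolist())
```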

__Extract an item__

In [None]:
ser['b']

1

To extract more than one item, put all the item labels in a list and pass that list as the argument.

Passing the labels separated by a comma, as in the commented-out line below, won't work: a Series is a one-dimensional object, so it accepts only one key.

In [None]:
# This won't work
# ser['a', 'b']

So, pass the labels as a list, inside an extra pair of square brackets.

In [None]:
ser[['a', 'b']]

a    0
b    1
Name: numbers, dtype: int32
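
A small sketch of both cases. The comma-separated form is read as a single tuple key, which recent pandas versions reject with a `KeyError` (the exact exception type is an assumption here, so the code catches `ValueError` as well). The names `failed`, `picked` and `picked_loc` are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(10), index=list('abcdefghij'), name='numbers')

# ser['a', 'b'] is read as one tuple key ('a', 'b'), which is not in the index
try:
    ser['a', 'b']
    failed = False
except (KeyError, ValueError) as exc:
    failed = True
    print(type(exc).__name__)

# A list of labels selects several items, in the order given
picked = ser[['c', 'a']]
print(picked.tolist())

# .loc accepts the same list and makes label-based selection explicit
picked_loc = ser.loc[['c', 'a']]
print(picked_loc.tolist())
```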

You can extract the index as well.

In [None]:
# method 1
ser.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

In [None]:
# method 2
ser.keys()

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

Also, if you extract one column from a DataFrame, it becomes a Series. So, you can think of a DataFrame as a column-wise arrangement of Series.

You can __create a `Series` from a `dict` as well__
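
This relationship works in both directions. A minimal sketch, using made-up values, that builds a DataFrame from two Series and then pulls one column back out:

```python
import pandas as pd

jan = pd.Series([1, 2, 3], index=['Mon', 'Tue', 'Wed'])
feb = pd.Series([10, 20, 30], index=['Mon', 'Tue', 'Wed'])

# Each Series becomes one column; the shared index becomes the row labels
df = pd.DataFrame({'Jan': jan, 'Feb': feb})
print(df)

# Pulling a column back out gives a Series again
col = df['Jan']
print(type(col).__name__)
print(col.tolist())
```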

In [None]:
d1 = {'a': 0, 'b':1, 'c':3}
d2 = {'b': 0, 'c':1, 'd':3}

In [None]:
ser1 = pd.Series(d1)
ser2 = pd.Series(d2)
ser1

a    0
b    1
c    3
dtype: int64

In [None]:
ser2

b    0
c    1
d    3
dtype: int64

__Addition__

In [None]:
ser1 + ser2

a    NaN
b    1.0
c    4.0
d    NaN
dtype: float64

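Addition aligns the two Series on their index labels, so labels present in only one Series ('a' and 'd') produce `NaN`. To treat the missing values as zero instead, use the `add` method with `fill_value=0`.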

In [None]:
ser1.add(ser2, fill_value=0)

a    0.0
b    1.0
c    4.0
d    3.0
dtype: float64
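
The other arithmetic methods accept `fill_value` in the same way. A hedged sketch with `sub` and `mul`, reusing the two dictionaries from above (the names `diff` and `prod` are illustrative):

```python
import pandas as pd

ser1 = pd.Series({'a': 0, 'b': 1, 'c': 3})
ser2 = pd.Series({'b': 0, 'c': 1, 'd': 3})

# Labels present in only one Series are treated as 0 before subtracting
diff = ser1.sub(ser2, fill_value=0)
print(diff.to_dict())

# For multiplication, 1 is the neutral fill value
prod = ser1.mul(ser2, fill_value=1)
print(prod.to_dict())
```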

### Mini Challenge

For the given series, compute the differences between successive elements.

__Input__
```python
np.random.seed(101)
ser = pd.Series(np.random.randint(1,100, 10))
ser
#> 0    96
#> 1    12
#> 2    82
#> 3    71
#> 4    64
#> 5    88
#> 6    76
#> 7    10
#> 8    78
#> 9    41
#> dtype: int32
```

__Desired Output__

```
#> [-84,  70, -11,  -7,  24, -12, -66,  68, -37]
```

In [None]:
np.random.seed(101)
ser = pd.Series(np.random.randint(1,100, 10))
ser

0    96
1    12
2    82
3    71
4    64
5    88
6    76
7    10
8    78
9    41
dtype: int32

If you just subtract the two slices, the result will not be as intended, because pandas aligns the indexes before subtracting.

In [None]:
ser[1:] - ser[:-1]

0    NaN
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
dtype: float64
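
The zeros appear because each label that exists in both slices is subtracted from itself, and the labels that exist in only one slice become `NaN`. A sketch on the first four values of the series, showing that resetting the index makes the two slices line up by position instead (the variable names are illustrative):

```python
import pandas as pd

ser = pd.Series([96, 12, 82, 71])

later = ser.iloc[1:]      # labels 1, 2, 3
earlier = ser.iloc[:-1]   # labels 0, 1, 2

# Aligned by label: label 1 is subtracted from label 1, and so on
aligned = later - earlier
print(aligned.tolist())

# Dropping the old labels makes both pieces line up by position
by_position = later.reset_index(drop=True) - earlier.reset_index(drop=True)
print(by_position.tolist())
```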

So convert to a NumPy array first and then do the subtraction.

In [None]:
ser.values[1:] - ser.values[:-1]

array([-84,  70, -11,  -7,  24, -12, -66,  68, -37])

In [None]:
out = ser.values[1:] - ser.values[:-1]
out.tolist()

[-84, 70, -11, -7, 24, -12, -66, 68, -37]

Or simply use the built-in `diff()` method.

Don't worry about this new method at the moment; we will cover all the functions you need to know as we go through the course.

In [None]:
ser.diff()

0     NaN
1   -84.0
2    70.0
3   -11.0
4    -7.0
5    24.0
6   -12.0
7   -66.0
8    68.0
9   -37.0
dtype: float64
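
To get from this result to the desired list of integers, the leading `NaN` has to be dropped and the values cast back to integers. A minimal sketch, with the series values typed in by hand from the output above (the names `out` and `out_shift` are illustrative):

```python
import pandas as pd

# Values copied from the series printed above
ser = pd.Series([96, 12, 82, 71, 64, 88, 76, 10, 78, 41])

# diff() leaves NaN in the first slot, which forces a float dtype.
# Drop it and cast back to integers to match the desired output.
out = ser.diff().dropna().astype(int).tolist()
print(out)

# Equivalent formulation: subtract a copy shifted down by one position
out_shift = (ser - ser.shift(1)).dropna().astype(int).tolist()
print(out_shift)
```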