### Pandas

The name pandas is derived from "panel data".

pandas is an open-source Python library for data analysis, built on top of NumPy.

**Series**

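A Series is similar to a one-dimensional NumPy array, but every value is paired with an index label. The index can hold names (for example strings) instead of plain integer positions.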

In [3]:
import numpy as np
import pandas as pd

In [4]:
labels = ['a','b','c']
mylist = [10,20,30]
arr = np.array(mylist)
arr

array([10, 20, 30])

In [5]:
### Create a dictionary

d = {'a':10,'b':20,'c':30}
d

{'a': 10, 'b': 20, 'c': 30}

In [6]:
pd.Series(data=mylist)

0    10
1    20
2    30
dtype: int64

In [7]:
pd.Series(arr)

### Same values as the list version; only the dtype differs, because the Series keeps the dtype of the NumPy array (int32 here)

0    10
1    20
2    30
dtype: int32
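The `int64` versus `int32` difference above comes from the input data rather than from the values themselves: the integer dtype NumPy picks for `arr` can vary with platform and NumPy version. A minimal sketch, assuming you want the same dtype regardless of how the Series is built (the variable names `ser_from_list` and `ser_from_arr` are only illustrative):

```python
import numpy as np
import pandas as pd

mylist = [10, 20, 30]

# NumPy chooses a default integer dtype for this array; it may be int32 or int64.
arr = np.array(mylist)

# Requesting the dtype explicitly removes that dependence.
ser_from_list = pd.Series(mylist, dtype="int64")
ser_from_arr = pd.Series(arr, dtype="int64")

print(ser_from_list.dtype, ser_from_arr.dtype)
# equals() compares values, index, and dtype.
print(ser_from_list.equals(ser_from_arr))
```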

In [8]:
pd.Series(data=arr, index=labels)

### A Series with a named index. The data can be numeric, categorical, or mixed.

a    10
b    20
c    30
dtype: int32
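The dictionary `d` created earlier can also be passed to `pd.Series` directly. As a small sketch (the name `ser_from_dict` is only illustrative), the keys become the index labels and the values become the data:

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Dictionary keys are used as the index, values as the data.
ser_from_dict = pd.Series(d)

print(ser_from_dict)
print(ser_from_dict['b'])  # look up a value by its label
```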

In [10]:
pd.Series(data = ['a',0.5,'c'])

0      a
1    0.5
2      c
dtype: object

In [12]:
ser1 = pd.Series(data = [1,2,3,4], index=['USA','Germany','USSR','Japan'])
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [14]:
### Look up a value by its index label

ser1['USSR']

3

In [15]:
ser2 = pd.Series(data = [1,4,5,6], index=['USA','Germany','Italy','Japan'])
ser2

USA        1
Germany    4
Italy      5
Japan      6
dtype: int64

In [16]:
### Arithmetic between Series is aligned on the index labels, not on position. Labels present in only one Series give NaN, which is also why the result is float64.

ser1 + ser2

Germany     6.0
Italy       NaN
Japan      10.0
USA         2.0
USSR        NaN
dtype: float64
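If the `NaN` entries for labels that appear in only one Series are not wanted, one option is the `add` method with `fill_value`, which treats a missing label as the given value. A minimal sketch, assuming a missing country should count as 0 (the name `total` is only illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 4, 5, 6], index=['USA', 'Germany', 'Italy', 'Japan'])

# Labels missing from one Series are treated as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)

print(total)
# Italy is now 5.0 and USSR is 3.0; the shared labels are summed as before.
```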

**DataFrames**

A DataFrame is a collection of Series that share the same index.
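To make the "Series sharing one index" idea concrete, here is a small sketch that builds a DataFrame from two Series. The names `w`, `x`, and `small_df` and the values are made up for illustration:

```python
import pandas as pd

index = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# Each dictionary key becomes a column name; the Series supply the rows.
small_df = pd.DataFrame({'W': w, 'X': x})

print(small_df)
# Both columns carry the same index as the DataFrame itself.
print(small_df['W'].index.equals(small_df['X'].index))
```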

In [20]:
from numpy.random import randn
np.random.seed(101)

rand_mat = randn(5,4)
rand_mat

array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])

In [21]:
df = pd.DataFrame(data=rand_mat)
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [22]:
### The index and the columns can both be given labels. A quick way to build a list of labels is to split a string:
'A B C D E'.split()

['A', 'B', 'C', 'D', 'E']

In [23]:
df = pd.DataFrame(data=rand_mat, index='A B C D E'.split())
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [24]:
df = pd.DataFrame(data=rand_mat, index='A B C D E'.split(), columns='W X Y Z'.split())
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


**Grab data from a DataFrame**

In [25]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [26]:
type(df['W'])

### Note: each column of a DataFrame is a Series

pandas.core.series.Series

In [27]:
mylist = ['W','Y']
df[mylist]

### Passing a list of column names returns a DataFrame. Equivalent: df[['W','Y']]

|   | W | Y |
|---|---|---|
| A | 2.70685 | 0.907969 |
| B | 0.651118 | -0.848077 |
| C | -2.018168 | 0.528813 |
| D | 0.188695 | -0.933237 |
| E | 0.190794 | 2.605967 |


In [28]:
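### Attribute access also works, but it is not recommended: it fails for column names that are not valid Python identifiers or that clash with DataFrame methods.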

df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
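A short sketch of why bracket access is the safer habit. The DataFrame `demo` and its column names are made up for illustration: one column shares its name with the `DataFrame.count` method and another contains a space.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

# Attribute access works for a simple column name.
print(type(demo.W))

# demo.count is the DataFrame.count method, not the 'count' column.
print(callable(demo.count))  # True

# Bracket access always returns the column, whatever its name.
print(demo['count'])
print(demo['my col'])  # demo.my col would be a syntax error
```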

In [36]:
### Create a new column from existing ones:

df['NEW'] = df['W'] + df['Y']
df

|   | W | X | Y | Z | NEW |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


In [37]:
### Remove a column. drop can remove either rows or columns, so the axis has to be specified: axis=0 (the default) for rows, axis=1 for columns.

df.drop('NEW', axis=1)

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [38]:
df

### df still contains NEW: drop returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.

|   | W | X | Y | Z | NEW |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


In [39]:
df.drop('NEW', axis=1, inplace=True)
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
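The same `drop` method removes rows when `axis=0` is used. As an alternative to `inplace=True`, the result can simply be assigned back to the variable. A minimal sketch on a small made-up DataFrame (`demo`, `without_row`, and `without_col` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                    index=['A', 'B', 'C'], columns=['W', 'X'])

# axis=0 (the default) drops a row by its index label.
without_row = demo.drop('A', axis=0)

# The columns keyword is an alternative to demo.drop('X', axis=1).
without_col = demo.drop(columns='X')

# Reassigning makes the change stick without inplace=True.
demo = demo.drop('C', axis=0)

print(without_row)
print(without_col)
print(demo)
```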
