<center><font size="50"> <b> Pandas </b> </font></center>

From the [pandas GitHub](https://github.com/pandas-dev/pandas) page:

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be a fundamental high-level building block for doing practical, real world data analysis in Python.

In [1]:
import pandas as pd

## The two most fundamental data structures in pandas
- Series
- DataFrame

## Series

A one-dimensional, array-like object with an associated array of labels (its index).

In [2]:
import numpy as np
np.random.seed(1)
random_int = np.random.randint(1, 100, 5)
series = pd.Series(random_int)

In [3]:
series

0    38
1    13
2    73
3    10
4    76
dtype: int32

In [4]:
series.index

RangeIndex(start=0, stop=5, step=1)

If we don't specify the index, then a default index starting from 0 is created.
We can also create an index with labels.

In [6]:
series = pd.Series(random_int, index=['a', 'b', 'c', 'd', 'e'])

In [7]:
series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [8]:
series.values

array([38, 13, 73, 10, 76])

## We can use boolean filtering (indexing) and math operations

In [9]:
series[series > 60]

c    73
e    76
dtype: int32

In [10]:
np.sqrt(series)

a    6.164414
b    3.605551
c    8.544004
d    3.162278
e    8.717798
dtype: float64
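
Boolean conditions can also be combined before filtering. The sketch below is illustrative only: it writes the values printed above as literals, and the names mask, selected and doubled are introduced just for this example.

```python
import pandas as pd

# Same values and labels as the series shown above, written out literally
series = pd.Series([38, 13, 73, 10, 76], index=['a', 'b', 'c', 'd', 'e'])

# Combine conditions with & (and), | (or), ~ (not); each condition needs parentheses
mask = (series > 20) & (series < 75)
selected = series[mask]
print(selected)

# Arithmetic is applied element by element and keeps the index
doubled = series * 2
print(doubled)
```

Python's own `and` / `or` do not work element-wise on a Series, which is why the bitwise operators are used here.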

We can also create a Series from a Python dict; the keys become the index (**it is a common theme in Python libraries to use dict**).

In [12]:
sdata = {'Colorado': 5.6, 'Utah': 3.1, 'Nevada': 2.9}
state_ser= pd.Series(sdata)

state_ser

Colorado    5.6
Utah        3.1
Nevada      2.9
dtype: float64
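
When a dict is combined with an explicit index, pandas looks up each index label in the dict. A small sketch of that behaviour, assuming the same sdata dict as above; the states list and the name aligned are made up for illustration.

```python
import pandas as pd

sdata = {'Colorado': 5.6, 'Utah': 3.1, 'Nevada': 2.9}

# The explicit index selects and orders the dict entries.
# 'Texas' is not a key in sdata, so its value becomes NaN;
# 'Nevada' is not in the index, so it is left out.
states = ['Utah', 'Texas', 'Colorado']
aligned = pd.Series(sdata, index=states)
print(aligned)
```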

The series object itself and its index have a name attribute

In [14]:
state_ser.index.name= 'Population in Millions'
state_ser.name = 'State'

In [15]:
state_ser

Population in Millions
Colorado    5.6
Utah        3.1
Nevada      2.9
Name: State, dtype: float64

We can use labels to select values.

In [16]:
state_ser[['Colorado', 'Utah']]

Population in Millions
Colorado    5.6
Utah        3.1
Name: State, dtype: float64

In real datasets there will typically be some missing values for an attribute. Let's add a state with a missing value.

In [17]:
state_ser['Texas'] = np.nan

In [18]:
state_ser

Population in Millions
Colorado    5.6
Utah        3.1
Nevada      2.9
Texas       NaN
Name: State, dtype: float64

# Checking for missing values (isna, isnull, notnull)

In [19]:
pd.isna(state_ser)

Population in Millions
Colorado    False
Utah        False
Nevada      False
Texas        True
Name: State, dtype: bool

In [20]:
pd.isnull(state_ser)

Population in Millions
Colorado    False
Utah        False
Nevada      False
Texas        True
Name: State, dtype: bool

In [21]:
# looks like isnull is an alias for isna
pd.isnull

<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>
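
The heading also lists notnull. Here is a minimal sketch of the complementary check together with two common follow-ups, dropna and fillna. The Series ser and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series({'Colorado': 5.6, 'Utah': 3.1, 'Texas': np.nan})

# isnull and isna are two names for the same function object
print(pd.isnull is pd.isna)

present = ser.notna()        # element-wise complement of isna (notnull is its alias)
without_gaps = ser.dropna()  # drop the missing entries
with_zero = ser.fillna(0.0)  # or replace them with a chosen value

print(present)
print(without_gaps)
print(with_zero)
```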

# DataFrame

Used to represent tabular (2D) data.
- It has both a row index and a column index.
- It can be thought of as a collection (dict) of Series sharing the same index.
- Hierarchical indexing can be used to represent higher-dimensional data.
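
The "dict of Series sharing the same index" view can be sketched as follows. The two Series and their values are assumptions chosen for illustration; they deliberately cover different years to show how the shared index is built.

```python
import pandas as pd

vandalism = pd.Series({'2007': 33, '2008': 69, '2009': 48})
drug_abuse = pd.Series({'2008': 60, '2009': 61, '2010': 67})

# Each Series becomes a column. The row index is the union of both indexes,
# and a label missing from one Series shows up as NaN in that column.
frame = pd.DataFrame({'vandalism': vandalism, 'drug abuse': drug_abuse})
print(frame)

# Selecting one column gives back a Series on the shared index
column = frame['vandalism']
print(type(column))
```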

In [22]:
#Creating a DataFrame from a dictionary
crime = {
    'years':['2007','2008','2009','2010'],
    'vandalism':[33,69,48,44],
    'drug abuse':[46,60,61,67],
    'liquor laws':[86,81,76,86]
}
crime_df = pd.DataFrame(crime)
crime_df

Unnamed: 0,years,vandalism,drug abuse,liquor laws
0,2007,33,46,86
1,2008,69,60,81
2,2009,48,61,76
3,2010,44,67,86


In [24]:
print(crime_df)

  years  vandalism  drug abuse  liquor laws
0  2007         33          46           86
1  2008         69          60           81
2  2009         48          61           76
3  2010         44          67           86


Note that in a notebook pandas renders a DataFrame as an HTML table, while print gives a plain-text output.

# Some properties of pandas dataframes and operations on dataframes (slicing, dicing, descriptive statistics)

In [23]:
crime_df.columns

Index(['years', 'vandalism', 'drug abuse', 'liquor laws'], dtype='object')

In [25]:
crime_df.index

RangeIndex(start=0, stop=4, step=1)

In [26]:
crime_df.dtypes

years          object
vandalism       int64
drug abuse      int64
liquor laws     int64
dtype: object

In [27]:
crime_df.isna()

Unnamed: 0,years,vandalism,drug abuse,liquor laws
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False


In [28]:
crime_df.head(2)

Unnamed: 0,years,vandalism,drug abuse,liquor laws
0,2007,33,46,86
1,2008,69,60,81


In [29]:
# How do we view the bottom two rows?
crime_df.tail(2)

Unnamed: 0,years,vandalism,drug abuse,liquor laws
2,2009,48,61,76
3,2010,44,67,86


In [30]:
crime_df.sample(2)

Unnamed: 0,years,vandalism,drug abuse,liquor laws
3,2010,44,67,86
2,2009,48,61,76


In [32]:
# How do we get the underlying 2D NumPy array?
crime_df.values


array([['2007', 33, 46, 86],
       ['2008', 69, 60, 81],
       ['2009', 48, 61, 76],
       ['2010', 44, 67, 86]], dtype=object)

A real dataset usually has a lot of columns. We can choose the column order and supply our own index values.

In [33]:
crime.keys()

dict_keys(['years', 'vandalism', 'drug abuse', 'liquor laws'])

In [34]:
pd.DataFrame(crime, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'], index =list('abcd'))

Unnamed: 0,years,liquor laws,drug abuse,vandalism
a,2007,86,46,33
b,2008,81,60,69
c,2009,76,61,48
d,2010,86,67,44


In [35]:
# or, if we have already read the dataframe, reorder its columns
pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'])

Unnamed: 0,years,liquor laws,drug abuse,vandalism
0,2007,86,46,33
1,2008,81,60,69
2,2009,76,61,48
3,2010,86,67,44


In [36]:
# or we want years to be the index
crime_df = pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'] )
crime_df.set_index('years')

Unnamed: 0_level_0,liquor laws,drug abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007,86,46,33
2008,81,60,69
2009,76,61,48
2010,86,67,44


In [30]:
crime_df

Unnamed: 0,years,liquor laws,drug abuse,vandalism
0,2007,86,46,33
1,2008,81,60,69
2,2009,76,61,48
3,2010,86,67,44


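What happened? set_index returned a new DataFrame with years as the index. crime_df itself is unchanged because we did not assign the result back to it.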
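
A minimal sketch of this behaviour on a small made-up frame (the names df, indexed and before are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'years': ['2007', '2008'], 'vandalism': [33, 69]})

indexed = df.set_index('years')   # returns a new DataFrame
before = df.index                 # the original still has its default RangeIndex

print(before)
print(indexed.index)

# Assigning the result back is an alternative to modifying the frame in place
df = df.set_index('years')
print(df.index)
```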

In [37]:
# Use inplace to modify the data frame
crime_df.set_index('years', inplace=True)
crime_df

Unnamed: 0_level_0,liquor laws,drug abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007,86,46,33
2008,81,60,69
2009,76,61,48
2010,86,67,44


# Slicing and dicing DataFrame ([], loc, iloc)

In [38]:
crime_df[['drug abuse', 'vandalism']]

Unnamed: 0_level_0,drug abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1
2007,46,33
2008,60,69
2009,61,48
2010,67,44


## slicing or selecting data with a boolean array

In [39]:
crime_df[crime_df['vandalism']>40]

Unnamed: 0_level_0,liquor laws,drug abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2008,81,60,69
2009,76,61,48
2010,86,67,44


In [40]:
#or use attribute access
crime_df.vandalism

years
2007    33
2008    69
2009    48
2010    44
Name: vandalism, dtype: int64

In [41]:
# Can we use drug abuse as an attribute to access this column?
crime_df.columns

Index(['liquor laws', 'drug abuse', 'vandalism'], dtype='object')

In [42]:
type(crime_df["drug abuse"])

pandas.core.series.Series

Attribute access only works when the column name is a valid Python identifier, and "drug abuse" contains a space. Let's change it.

**Search for a pandas function and use it to rename drug abuse to *drug_abuse***

In [43]:
#Write code here
crime_df.rename(columns={"drug abuse":"drug_abuse"}, inplace=True)
crime_df
crime_df.drug_abuse

years
2007    46
2008    60
2009    61
2010    67
Name: drug_abuse, dtype: int64

## Rows can be retrieved using loc and iloc

## loc
- loc selects by label (index value)
- it also supports conditional lookup with a boolean mask

In [45]:
crime_df

Unnamed: 0_level_0,liquor laws,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007,86,46,33
2008,81,60,69
2009,76,61,48
2010,86,67,44


In [46]:
# using label
series_2010 =crime_df.loc[['2010'], ['drug_abuse', 'vandalism']]
series_2010

Unnamed: 0_level_0,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1
2010,67,44


In [47]:
# Conditional row selection
crime_df.loc[crime_df.drug_abuse>50]

Unnamed: 0_level_0,liquor laws,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2008,81,60,69
2009,76,61,48
2010,86,67,44


<font color = "red">Selecting with lists of labels returns a copy, not a view </font>

In [51]:
# This does not modify crime_df: series_2010 is a separate object
series_2010.drug_abuse = 2.0

In [49]:
crime_df

Unnamed: 0_level_0,liquor laws,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007,86,46,33
2008,81,60,69
2009,76,61,48
2010,86,67,44
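
To change a value in the original frame, assign through loc on the original in a single step. The sketch below uses a small stand-in frame with the same numbers; the explicit .copy() and the name value_after_subset_edit are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'drug_abuse': [46, 60, 61, 67],
                   'vandalism': [33, 69, 48, 44]},
                  index=['2007', '2008', '2009', '2010'])

# An explicit copy makes it clear that subset is independent of df
subset = df.loc[['2010'], ['drug_abuse', 'vandalism']].copy()
subset['drug_abuse'] = 2
value_after_subset_edit = df.loc['2010', 'drug_abuse']
print(value_after_subset_edit)        # df is untouched

# Row label and column label in one loc call write into df itself
df.loc['2010', 'drug_abuse'] = 2
print(df.loc['2010', 'drug_abuse'])
```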


## iloc
Use it for integer, position-based indexing.

In [52]:
crime_df

Unnamed: 0_level_0,liquor laws,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007,86,46,33
2008,81,60,69
2009,76,61,48
2010,86,67,44


In [53]:
crime_df.iloc[1:3, 1:3]

Unnamed: 0_level_0,drug_abuse,vandalism
years,Unnamed: 1_level_1,Unnamed: 2_level_1
2008,60,69
2009,61,48


In [54]:
crime_df.T

years,2007,2008,2009,2010
liquor laws,86,81,76,86
drug_abuse,46,60,61,67
vandalism,33,69,48,44


# Reindex
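reindex creates a new DataFrame that conforms to a new index. Labels that were not in the original get missing values, and the original DataFrame is left unchanged.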

In [55]:
df = pd.DataFrame(np.arange(12).reshape((4,3)), index=[0, 3 ,5 ,9], columns=['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
0,0,1,2
3,3,4,5
5,6,7,8
9,9,10,11


In [56]:
# row reindexing
df.reindex(range(10))

Unnamed: 0,a,b,c
0,0.0,1.0,2.0
1,,,
2,,,
3,3.0,4.0,5.0
4,,,
5,6.0,7.0,8.0
6,,,
7,,,
8,,,
9,9.0,10.0,11.0


In [48]:
# column reindexing
df.reindex(columns=['c', 'b'])

Unnamed: 0,c,b
0,2,1
3,5,4
5,8,7
9,11,10


In [57]:
df

Unnamed: 0,a,b,c
0,0,1,2
3,3,4,5
5,6,7,8
9,9,10,11
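
The NaN rows introduced by reindexing can be avoided by telling reindex how to fill them. A hedged sketch using the fill_value and method parameters of reindex; the names filled and carried are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[0, 3, 5, 9], columns=['a', 'b', 'c'])

# New rows receive a constant instead of NaN
filled = df.reindex(range(10), fill_value=0)

# New rows copy the closest preceding row (requires a sorted index)
carried = df.reindex(range(10), method='ffill')

print(filled.loc[1].tolist())
print(carried.loc[4].tolist())
```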


# drop row or column

In [58]:
data_df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Ohio', 'Colorado', 'Utah', 'New York'],
                      columns=['one', 'two', 'three', 'four'])
data_df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [59]:
data_df.drop(['Utah'])

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,12,13,14,15


# To drop a column use axis=1; axis=0 (rows) is the default

In [60]:
data_df.drop(['one', 'three'], axis=1)

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15
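
As an alternative to the axis argument, drop also accepts index= and columns= keywords, which name the axis explicitly. A minimal sketch on the same data (trimmed is an illustrative name):

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                       index=['Ohio', 'Colorado', 'Utah', 'New York'],
                       columns=['one', 'two', 'three', 'four'])

# Rows and columns can be dropped in one call
trimmed = data_df.drop(index=['Utah'], columns=['one', 'three'])
print(trimmed)

# drop returns a new DataFrame; data_df keeps all rows and columns
print(data_df.shape)
```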


# Arithmetic operations and elementwise NumPy functions

In [61]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [62]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                    columns=list('abcde'))
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [63]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,
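
The + operator aligns on both row and column labels and yields NaN wherever a label exists in only one frame. If a missing cell should instead count as a chosen value, the add method with fill_value is one option. A sketch, with total as an illustrative name:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# A cell missing from one frame is treated as 0 before adding
total = df1.add(df2, fill_value=0)
print(total)
```

Here every cell exists in at least one of the two frames, so the result contains no NaN. A cell missing from both would still be NaN.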


In [64]:
np.exp(df1)

Unnamed: 0,a,b,c,d
0,1.0,2.718282,7.389056,20.085537
1,54.59815,148.413159,403.428793,1096.633158
2,2980.957987,8103.083928,22026.465795,59874.141715


# applying lambda function to a dataframe

In [66]:
# apply a function to each column (axis=0, the default)
df1.apply(lambda x: x.max())

a     8.0
b     9.0
c    10.0
d    11.0
dtype: float64

In [67]:
# or apply it to each row with axis=1 (equivalently axis='columns')
df1.apply(lambda x: x.max(), axis='columns')

0     3.0
1     7.0
2    11.0
dtype: float64
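
The axis argument decides what the function receives: with axis=0 it is called once per column, with axis=1 once per row. The sketch below also shows a function that returns a Series, which makes apply return a DataFrame. The names col_max, row_max and spread are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

col_max = df1.apply(lambda col: col.max())           # one result per column
row_max = df1.apply(lambda row: row.max(), axis=1)   # one result per row

# For simple reductions the built-in methods give the same answer
print(col_max.equals(df1.max()))
print(row_max.equals(df1.max(axis=1)))

# Returning a Series per column produces a DataFrame of results
spread = df1.apply(lambda col: pd.Series({'min': col.min(), 'max': col.max()}))
print(spread)
```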

# map for element-wise functions (called applymap before pandas 2.1)

In [69]:
# DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement
df1.map(lambda x: int(x))



Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [70]:
df['a'].map(lambda x: x**2)

0     0
3     9
5    36
9    81
Name: a, dtype: int64

# Summarizing and Computing Descriptive Statistics

In [71]:
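# Use a frame with missing values so that the NaN handling of each statistic is visible
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3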


In [83]:
df['name'] = ['Sam', 'Tim', 'John', 'Chris']
df

Unnamed: 0,one,two,name
a,1.4,,Sam
b,7.1,-4.5,Tim
c,,,John
d,0.75,-1.3,Chris


In [84]:
df.sum()

one                9.25
two                -5.8
name    SamTimJohnChris
dtype: object

In [86]:
# mean() works differently now than when the video was recorded: the string column 'name' is no longer skipped, so this raises a TypeError
df.mean()

TypeError: Could not convert ['SamTimJohnChris'] to numeric
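
One way to get the column means despite the string column is to restrict the computation to numeric columns. A minimal sketch, assuming the same values as the frame above; means and means_selected are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3],
                   'name': ['Sam', 'Tim', 'John', 'Chris']},
                  index=['a', 'b', 'c', 'd'])

# Ask mean to ignore non-numeric columns
means = df.mean(numeric_only=True)
print(means)

# Or select the numeric columns first
means_selected = df.select_dtypes(include='number').mean()
print(means_selected)
```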

In [87]:
df.max()

one     7.1
two    -1.3
name    Tim
dtype: object

In [89]:
# Can you guess a method that gives summary statistics?
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [91]:
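# Without parentheses this only displays the bound method; call df.info() to get the column summary
df.info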

<bound method DataFrame.info of     one  two   name
a  1.40  NaN    Sam
b  7.10 -4.5    Tim
c   NaN  NaN   John
d  0.75 -1.3  Chris>

In [92]:
# Recreate the numeric frame (with NaNs) without the 'name' column
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [93]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [94]:
df.sum(skipna=False)

one   NaN
two   NaN
dtype: float64
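
By default NaN values are skipped, which means a row or column made up only of NaN sums to 0. The min_count parameter of sum changes that. A sketch on the same values, summing across rows; row_sum and row_sum_strict are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# NaN is skipped, so the all-NaN row 'c' sums to 0.0
row_sum = df.sum(axis=1)
print(row_sum)

# Require at least one non-NaN value; otherwise the result is NaN
row_sum_strict = df.sum(axis=1, min_count=1)
print(row_sum_strict)
```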

# Aside: a quick way to scrape tables from web pages

In [95]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Malnutrition', header=0)
# This did not work here: the request failed with the SSL error below. The page has also been modified since the video was recorded.

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1006)>

In [96]:
tables[2]

NameError: name 'tables' is not defined