<center><font size="50"> <b> Pandas </b> </font></center>

From the [pandas GitHub](https://github.com/pandas-dev/pandas) page:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

In [6]:
import pandas as pd

## The two most fundamental data structures in pandas
- Series
- DataFrame

## Series

A one-dimensional, array-like object with an associated array of labels (the index).

In [7]:
import numpy as np
np.random.seed(1)
random_int = np.random.randint(1, 100, 5)
series = pd.Series(random_int)

In [8]:
series

0    38
1    13
2    73
3    10
4    76
dtype: int64

In [9]:
series.index

RangeIndex(start=0, stop=5, step=1)

If we don't supply an index, a default integer index starting from 0 is created.
We can also supply an index of labels.

In [10]:
series = pd.Series(random_int, index=['a', 'b', 'c', 'd', 'e'])

In [11]:
series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [12]:
series.values

array([38, 13, 73, 10, 76])
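
With labels in place, a value can be looked up either by its label or by its position. A minimal sketch, reusing the same values and labels as above (the variable name `labeled` is only for illustration):

```python
import pandas as pd

# Same values and labels as the Series built above.
labeled = pd.Series([38, 13, 73, 10, 76], index=['a', 'b', 'c', 'd', 'e'])

by_label = labeled['c']           # label-based lookup
by_label_loc = labeled.loc['c']   # explicit label-based lookup
by_position = labeled.iloc[2]     # position-based lookup (third element)

print(by_label, by_label_loc, by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the labels themselves are integers.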

## We can use boolean filtering (indexing) and math operations

In [13]:
series[series > 60]

c    73
e    76
dtype: int64

In [14]:
np.sqrt(series)

a    6.164414
b    3.605551
c    8.544004
d    3.162278
e    8.717798
dtype: float64
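
Several conditions can also be combined into one boolean mask. The sketch below is illustrative only (the names `values`, `mid_range` and `doubled` are not from the notebook); it assumes the same five values as above.

```python
import pandas as pd

values = pd.Series([38, 13, 73, 10, 76], index=['a', 'b', 'c', 'd', 'e'])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mid_range = values[(values > 10) & (values < 75)]

# Arithmetic is applied element-wise and the index labels are preserved.
doubled = values * 2

print(mid_range)
print(doubled)
```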

Or we can use a Python dict to create a Series (**accepting a dict is a common theme in Python libraries**).

In [15]:
sdata = {'Colorado': 5.6, 'Utah': 3.1, 'Nevada': 2.9}
state_ser = pd.Series(sdata)

state_ser

Colorado    5.6
Utah        3.1
Nevada      2.9
dtype: float64

Both the Series object itself and its index have a name attribute.

In [16]:
# the index holds the states; the values are populations in millions
state_ser.index.name = 'State'
state_ser.name = 'Population in Million'

In [17]:
state_ser

State
Colorado    5.6
Utah        3.1
Nevada      2.9
Name: Population in Million, dtype: float64

We can use labels to index values.

In [18]:
state_ser[['Colorado', 'Utah']]

State
Colorado    5.6
Utah        3.1
Name: Population in Million, dtype: float64

In a real dataset, some values of an attribute will be missing. Let's add a state with a missing value.

In [19]:
state_ser['Texas'] = np.nan

In [20]:
state_ser

State
Colorado    5.6
Utah        3.1
Nevada      2.9
Texas       NaN
Name: Population in Million, dtype: float64

# Checking for missing values (isna, isnull, notnull)

In [21]:
pd.isna(state_ser)

State
Colorado    False
Utah        False
Nevada      False
Texas        True
Name: Population in Million, dtype: bool

In [22]:
pd.isnull(state_ser)

State
Colorado    False
Utah        False
Nevada      False
Texas        True
Name: Population in Million, dtype: bool

In [23]:
# looks like isnull is an alias for isna
pd.isnull

<function pandas.core.dtypes.missing.isna(obj)>
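
The complementary check, `notna` (also available as `notnull`), is handy as a filter. A small sketch under the assumption of a two-entry stand-in Series (`population` is an illustrative name):

```python
import numpy as np
import pandas as pd

population = pd.Series({'Colorado': 5.6, 'Texas': np.nan})

# Keep only the entries that are not missing.
present = population[pd.notna(population)]

# Booleans sum as 0/1, so this counts the missing entries.
n_missing = int(population.isna().sum())

print(present)
print(n_missing)
```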

# DataFrame

Used to represent tabular (2D) data.
- It has both a row index and a column index.
- It can be thought of as a collection (dict) of Series sharing the same index.
- Hierarchical indexing can be used for higher-dimensional data.
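
The "dict of Series" view can be made concrete. The sketch below is only an illustration with made-up variable names: it builds a DataFrame from two Series that share an index, then pulls one column back out as a Series.

```python
import pandas as pd

# Two Series sharing the same row labels.
vandalism = pd.Series([33, 69], index=['2007', '2008'])
drug_abuse = pd.Series([46, 60], index=['2007', '2008'])

# Each dict key becomes a column name; the shared index becomes the row index.
frame = pd.DataFrame({'vandalism': vandalism, 'drug abuse': drug_abuse})

# Selecting a single column gives back a Series with the same index.
column = frame['vandalism']

print(frame)
print(type(column))
```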

In [24]:
#Creating a DataFrame from a dictionary
crime = {
    'years':['2007','2008','2009','2010'],
    'vandalism':[33,69,48,44],
    'drug abuse':[46,60,61,67],
    'liquor laws':[86,81,76,86]
}
crime_df = pd.DataFrame(crime)
crime_df

|   | years | vandalism | drug abuse | liquor laws |
|---|---|---|---|---|
| 0 | 2007 | 33 | 46 | 86 |
| 1 | 2008 | 69 | 60 | 81 |
| 2 | 2009 | 48 | 61 | 76 |
| 3 | 2010 | 44 | 67 | 86 |


Note that pandas renders the table in a nice HTML format.

# Some properties of a pandas DataFrame

In [25]:
crime_df.columns

Index(['years', 'vandalism', 'drug abuse', 'liquor laws'], dtype='object')

In [26]:
crime_df.index

RangeIndex(start=0, stop=4, step=1)

In [27]:
crime_df.dtypes

years          object
vandalism       int64
drug abuse      int64
liquor laws     int64
dtype: object

In [28]:
crime_df.isna()

|   | years | vandalism | drug abuse | liquor laws |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |


In [29]:
crime_df.head(2)

|   | years | vandalism | drug abuse | liquor laws |
|---|---|---|---|---|
| 0 | 2007 | 33 | 46 | 86 |
| 1 | 2008 | 69 | 60 | 81 |


In [30]:
# How do we view the bottom two rows?


In [31]:
# How do we get the underlying 2D NumPy array?
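
One possible answer to the two questions above, sketched on a small stand-in frame (`small_df` is a hypothetical name, not part of the notebook): `tail` mirrors `head`, and `to_numpy()` returns the values as a NumPy array.

```python
import pandas as pd

small_df = pd.DataFrame({'vandalism': [33, 69, 48, 44],
                         'drug abuse': [46, 60, 61, 67]})

bottom_two = small_df.tail(2)    # last two rows
as_array = small_df.to_numpy()   # 2D NumPy array (the older spelling is .values)

print(bottom_two)
print(as_array.shape)
```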


A real dataset often has a lot of columns. We can choose the column order and supply index values.

In [32]:
crime.keys()

dict_keys(['years', 'vandalism', 'drug abuse', 'liquor laws'])

In [33]:
pd.DataFrame(crime, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'], index =list('abcd'))

|   | years | liquor laws | drug abuse | vandalism |
|---|---|---|---|---|
| a | 2007 | 86 | 46 | 33 |
| b | 2008 | 81 | 60 | 69 |
| c | 2009 | 76 | 61 | 48 |
| d | 2010 | 86 | 67 | 44 |


In [34]:
# or, if we have already created the DataFrame
pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'])

|   | years | liquor laws | drug abuse | vandalism |
|---|---|---|---|---|
| 0 | 2007 | 86 | 46 | 33 |
| 1 | 2008 | 81 | 60 | 69 |
| 2 | 2009 | 76 | 61 | 48 |
| 3 | 2010 | 86 | 67 | 44 |


In [35]:
# or we want years to be the index
crime_df = pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'] )
crime_df.set_index('years')

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2007 | 86 | 46 | 33 |
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |


In [36]:
crime_df

|   | years | liquor laws | drug abuse | vandalism |
|---|---|---|---|---|
| 0 | 2007 | 86 | 46 | 33 |
| 1 | 2008 | 81 | 60 | 69 |
| 2 | 2009 | 76 | 61 | 48 |
| 3 | 2010 | 86 | 67 | 44 |


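What happened? We just set the index, yet `crime_df` is unchanged: `set_index` returns a new DataFrame and leaves the original as it was.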

In [37]:
# Use inplace=True to modify the DataFrame itself
crime_df.set_index('years', inplace=True)
crime_df

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2007 | 86 | 46 | 33 |
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |
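
An alternative to `inplace=True` is to assign the returned frame to a name. The sketch below assumes a two-row stand-in frame with illustrative names; it also shows `reset_index`, which moves the index back into an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({'years': ['2007', '2008'], 'vandalism': [33, 69]})

indexed = df.set_index('years')    # returns a new DataFrame; df is untouched
restored = indexed.reset_index()   # turns the index back into a 'years' column

print(df.columns.tolist())
print(indexed.index.name)
print(restored.columns.tolist())
```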


# Slicing and dicing a DataFrame ([], loc, iloc)

In [38]:
crime_df[['drug abuse', 'vandalism']]

| years | drug abuse | vandalism |
|---|---|---|
| 2007 | 46 | 33 |
| 2008 | 60 | 69 |
| 2009 | 61 | 48 |
| 2010 | 67 | 44 |


## Slicing or selecting data with a boolean array

In [39]:
crime_df[crime_df['vandalism']>40]

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |


In [40]:
#or use attribute access
crime_df.vandalism

years
2007    33
2008    69
2009    48
2010    44
Name: vandalism, dtype: int64

In [41]:
# try to use drug abuse as a property to access this column
crime_df.drug abuse

SyntaxError: invalid syntax (<ipython-input-41-f842929e3d66>, line 2)

Attribute access only works when the column name is a valid Python identifier. Let's change it.

**Search for the pandas function that renames columns and use it to rename drug abuse to *drug_abuse***

In [42]:
#Write code here
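
One possible answer, sketched on a small stand-in frame rather than on `crime_df` itself (`frame` and `renamed` are illustrative names): `DataFrame.rename` with a `columns` mapping returns a new frame with the renamed column, after which attribute access works.

```python
import pandas as pd

frame = pd.DataFrame({'drug abuse': [46, 60], 'vandalism': [33, 69]})

# rename returns a new DataFrame; the original keeps its column names.
renamed = frame.rename(columns={'drug abuse': 'drug_abuse'})

# The new name is a valid identifier, so attribute access is possible.
via_attribute = renamed.drug_abuse

print(renamed.columns.tolist())
print(via_attribute.tolist())
```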

## Rows can be retrieved using loc and iloc

## loc
- loc selects by label (index value)
- it also supports conditional (boolean) lookup

In [43]:
#
crime_df

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2007 | 86 | 46 | 33 |
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |


In [44]:
# using labels (the column is still named 'drug abuse', with a space)
series_2010 = crime_df.loc[['2010'], ['drug abuse', 'vandalism']]
series_2010

| years | drug abuse | vandalism |
|---|---|---|
| 2010 | 67 | 44 |

In [45]:
# Conditional row selection
crime_df.loc[crime_df['drug abuse'] > 50]

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |

<font color = "red">Selecting with a list of labels returns a copy, not a view </font>

In [46]:
# the selection above is a copy, so this assignment leaves crime_df unchanged
# (pandas may emit a SettingWithCopyWarning here)
series_2010['drug abuse'] = 1.0

In [47]:
crime_df

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2007 | 86 | 46 | 33 |
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |
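
Whether a selection shares data with its parent is easy to get wrong, so it can help to be explicit. A minimal sketch, assuming a small stand-in frame with illustrative names: take an explicit `.copy()` when an independent subset is wanted, and assign through `.loc` on the original frame when the original should change.

```python
import pandas as pd

frame = pd.DataFrame({'drug abuse': [61, 67], 'vandalism': [48, 44]},
                     index=['2009', '2010'])

# An explicit copy: changes to it never reach the original frame.
subset = frame.loc[['2010'], ['vandalism']].copy()
subset['vandalism'] = 0
unchanged = frame.loc['2010', 'vandalism']

# To modify the original, assign through .loc on the original itself.
frame.loc['2010', 'vandalism'] = 0

print(unchanged)
print(frame)
```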


## iloc
Use it for integer-location-based (positional) indexing.

In [48]:
crime_df

| years | liquor laws | drug abuse | vandalism |
|---|---|---|---|
| 2007 | 86 | 46 | 33 |
| 2008 | 81 | 60 | 69 |
| 2009 | 76 | 61 | 48 |
| 2010 | 86 | 67 | 44 |


In [49]:
crime_df.iloc[1:3, 1:3]

| years | drug abuse | vandalism |
|---|---|---|
| 2008 | 60 | 69 |
| 2009 | 61 | 48 |
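
One difference between the two indexers is easy to miss: an `iloc` slice excludes its stop position, like a regular Python slice, while a `loc` slice includes its stop label. A short sketch on a stand-in frame (names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'vandalism': [33, 69, 48, 44]},
                     index=['2007', '2008', '2009', '2010'])

by_position = frame.iloc[1:3]          # positions 1 and 2; position 3 is excluded
by_label = frame.loc['2008':'2010']    # labels '2008' through '2010', inclusive

print(by_position)
print(by_label)
```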


In [50]:
crime_df.T

| years | 2007 | 2008 | 2009 | 2010 |
|---|---|---|---|---|
| liquor laws | 86 | 81 | 76 | 86 |
| drug abuse | 46 | 60 | 61 | 67 |
| vandalism | 33 | 69 | 48 | 44 |


# Reindex
Creates a new DataFrame conformed to a new index; labels that were not in the original index get missing values (NaN).

In [51]:
df = pd.DataFrame(np.arange(12).reshape((4,3)), index=[0, 3 ,5 ,9], columns=['a', 'b', 'c'])
df

|   | a | b | c |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 3 | 3 | 4 | 5 |
| 5 | 6 | 7 | 8 |
| 9 | 9 | 10 | 11 |


In [52]:
# row reindexing
df.reindex(range(10))

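|   | a | b | c |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 2.0 |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | 3.0 | 4.0 | 5.0 |
| 4 | NaN | NaN | NaN |
| 5 | 6.0 | 7.0 | 8.0 |
| 6 | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN |
| 9 | 9.0 | 10.0 | 11.0 |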
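
The gaps introduced by reindexing do not have to stay as NaN. The sketch below, on a smaller stand-in frame with illustrative names, uses two `reindex` parameters: `fill_value` supplies a constant, and `method='ffill'` carries the previous row forward.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((2, 3)),
                  index=[0, 2], columns=['a', 'b', 'c'])

# New labels 1 and 3 are filled with a constant instead of NaN.
filled = df.reindex(range(4), fill_value=0)

# New labels take the values of the closest preceding existing row.
forward = df.reindex(range(4), method='ffill')

print(filled)
print(forward)
```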


In [53]:
# column reindexing
df.reindex(columns=['c', 'b'])

|   | c | b |
|---|---|---|
| 0 | 2 | 1 |
| 3 | 5 | 4 |
| 5 | 8 | 7 |
| 9 | 11 | 10 |


# Dropping rows or columns

In [54]:
data_df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Ohio', 'Colorado', 'Utah', 'New York'],
                      columns=['one', 'two', 'three', 'four'])
data_df

|   | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


In [55]:
data_df.drop(['Utah'])

|   | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| New York | 12 | 13 | 14 | 15 |


# To drop columns use axis=1; axis=0 is the default

In [56]:
data_df.drop(['one', 'three'], axis=1)

|   | two | four |
|---|---|---|
| Ohio | 1 | 3 |
| Colorado | 5 | 7 |
| Utah | 9 | 11 |
| New York | 13 | 15 |
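
`drop` also accepts the `index=` and `columns=` keywords, which some readers find clearer than remembering the axis numbers. A sketch on a stand-in frame (names are illustrative); like the calls above, it returns a new frame and leaves the original intact.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=['Ohio', 'Utah'],
                    columns=['one', 'two', 'three'])

without_col = data.drop(columns=['one'])   # same as data.drop(['one'], axis=1)
without_row = data.drop(index=['Utah'])    # same as data.drop(['Utah'])

print(without_col)
print(without_row)
```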


# Arithmetic operations and element-wise NumPy array operations


In [57]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df1

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 2.0 | 3.0 |
| 1 | 4.0 | 5.0 | 6.0 | 7.0 |
| 2 | 8.0 | 9.0 | 10.0 | 11.0 |


In [58]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                    columns=list('abcde'))
df2

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 |
| 2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |


In [59]:
df1 + df2

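|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
| 1 | 9.0 | 11.0 | 13.0 | 15.0 | NaN |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |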
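
The NaN values appear wherever a row or column label exists in only one of the two frames. When a missing side should count as zero instead, the `add` method with `fill_value` can be used. A minimal sketch with small stand-in frames (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
right = pd.DataFrame({'a': [10.0, 20.0, 30.0],
                      'b': [1.0, 1.0, 1.0]}, index=[0, 1, 2])

# Plain addition: labels present on only one side produce NaN.
plain = left + right

# With fill_value, a value missing on one side is treated as 0 before adding.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```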


In [60]:
np.exp(df1)

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 1.0 | 2.718282 | 7.389056 | 20.085537 |
| 1 | 54.59815 | 148.413159 | 403.428793 | 1096.633158 |
| 2 | 2980.957987 | 8103.083928 | 22026.465795 | 59874.141715 |


# Applying a lambda function to a DataFrame

In [61]:
# apply a function to each column (the default, axis=0)
df1.apply(lambda x: x.max())

a     8.0
b     9.0
c    10.0
d    11.0
dtype: float64

In [62]:
# or apply it to each row with axis=1 (or axis='columns')
df1.apply(lambda x: x.max(), axis='columns')

0     3.0
1     7.0
2    11.0
dtype: float64
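
The axis argument is easy to mix up, so here is a small sketch (stand-in frame, illustrative names) in which the two directions give visibly different results: with the default axis the function receives one column at a time, and with `axis=1` it receives one row at a time.

```python
import pandas as pd

frame = pd.DataFrame({'a': [0.0, 4.0], 'b': [1.0, 5.0]})

# Default axis=0: the lambda sees each column; the result is indexed by column name.
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the lambda sees each row; the result is indexed by row label.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```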

# applymap for element-wise functions (deprecated in favor of DataFrame.map since pandas 2.1)

In [63]:
df1.applymap(lambda x: int(x))

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |


In [64]:
df['a'].map(lambda x: x**2)

0     0
3     9
5    36
9    81
Name: a, dtype: int64

# Summarizing and Computing Descriptive Statistics

In [65]:
df = pd.DataFrame(np.arange(8).reshape(4,2),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])
df

|   | one | two |
|---|---|---|
| a | 0 | 1 |
| b | 2 | 3 |
| c | 4 | 5 |
| d | 6 | 7 |


In [66]:
df['name'] = ['Sam', 'Tim', 'John', 'Chris']
df

|   | one | two | name |
|---|---|---|---|
| a | 0 | 1 | Sam |
| b | 2 | 3 | Tim |
| c | 4 | 5 | John |
| d | 6 | 7 | Chris |


In [67]:
df.sum()

one                  12
two                  16
name    SamTimJohnChris
dtype: object

In [68]:
# numeric_only=True skips the string column (needed in pandas 2.0 and later)
df.mean(numeric_only=True)

one    3.0
two    4.0
dtype: float64

In [69]:
df.max()

one       6
two       7
name    Tim
dtype: object

In [70]:
# Can you guess the method that returns summary statistics?
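
One candidate answer is `describe`, which reports count, mean, standard deviation, min, quartiles and max for the numeric columns. A sketch on a stand-in frame with the same numeric values as above (`numbers` and `summary` are illustrative names):

```python
import pandas as pd

numbers = pd.DataFrame({'one': [0, 2, 4, 6], 'two': [1, 3, 5, 7]},
                       index=['a', 'b', 'c', 'd'])

# One row per statistic, one column per numeric column.
summary = numbers.describe()

print(summary)
```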

In [71]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])
df

|   | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 7.1 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |


In [72]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [73]:
df.sum(skipna=False)

one   NaN
two   NaN
dtype: float64
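
The same `skipna` behaviour applies to other reductions such as `mean`, and `count` reports how many non-missing values were actually used. A small sketch reusing the values of column `one` (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

one = pd.Series([1.4, 7.1, np.nan, 0.75], index=['a', 'b', 'c', 'd'])

mean_skip = one.mean()                  # NaN is skipped: average of three values
n_used = one.count()                    # number of non-missing values
mean_strict = one.mean(skipna=False)    # any NaN makes the result NaN

print(mean_skip, n_used, mean_strict)
```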

# Aside: a quick way to scrape tables from web pages

In [74]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Malnutrition', header=0)

In [75]:
tables[2]

|   | Degree of PEM | % of desired body weight for age and sex |
|---|---|---|
| 0 | Normal | 90–100% |
| 1 | Mild: Grade I (1st degree) | 75–89% |
| 2 | Moderate: Grade II (2nd degree) | 60–74% |
| 3 | Severe: Grade III (3rd degree) | <60% |
| 4 | SOURCE:"Serum Total Protein and Albumin Levels... | SOURCE:"Serum Total Protein and Albumin Levels... |
