
# Even More Pandas Fundamentals

A few more fundamentals for our introduction to `pandas`:

- Adding/Multiplying DataFrames

- `apply` and `applymap`

- Summarizing/descriptive statistics

- Variable name binding

<br>

<img src="panda3.png" alt="Panda!" style="width:375px;"/>

<br>

In [None]:
#Once again, get our libraries
####

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

### Adding/Multiplying DataFrames

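Recall that with Series, addition acts something like an outer join on the index labels: the result carries the union of the labels, and any label present in only one operand yields NaN.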

This is similarly true for DataFrames:

In [73]:
#Example:
df1 = pd.DataFrame(np.arange(9).reshape(3,3), columns = list('bcd'),
                                              index = list('ABC'))

df2 = pd.DataFrame(np.arange(9).reshape(3,3), columns = list('bde'),
                                              index = list('DAB'))

display(df1)
display(df2)


Unnamed: 0,b,c,d
A,0,1,2
B,3,4,5
C,6,7,8


Unnamed: 0,b,d,e
D,0,1,2
A,3,4,5
B,6,7,8


In [74]:
#Now Add:

#Only where index and column match do we get something not NaN
#Everything else is NaN, but we take union of all indices and columns
df1 + df2

Unnamed: 0,b,c,d,e
A,3.0,,6.0,
B,9.0,,12.0,
C,,,,
D,,,,


In [75]:
#Use the add method to supply a fill value:
#fill_value stands in for a cell that is found in one object but not the other;
#cells missing from both objects stay NaN:

df1.add(df2, fill_value = 0)

Unnamed: 0,b,c,d,e
A,3.0,1.0,6.0,5.0
B,9.0,4.0,12.0,8.0
C,6.0,7.0,8.0,
D,0.0,,1.0,2.0
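
The NaNs that remain in the output above can be traced cell by cell. The following is a minimal sketch that rebuilds the same `df1` and `df2`; the name `total` is only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("bcd"), index=list("ABC"))
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("bde"), index=list("DAB"))

total = df1.add(df2, fill_value=0)

# (A, c) exists only in df1, so df2's side is filled with 0: 1 + 0
print(total.loc["A", "c"])

# (C, e) exists in neither frame (row C is only in df1, column e only in df2),
# so there is nothing to fill and the cell stays NaN
print(total.loc["C", "e"])

# Replacing the leftover NaNs is a separate step
print(total.fillna(0))
```

If zeros are wanted everywhere, chaining `.fillna(0)` after the addition is one option.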


#### Some other arithmetic methods...

- Note that the prefix `r` flips the arguments: `df.div(4)` computes `df / 4`, while `df.rdiv(4)` computes `4 / df`

- Methods:
    - `add`, `radd`
    - `sub`, `rsub`
    - `div`, `rdiv`
    - `floordiv`, `rfloordiv`
    - `mul`, `rmul`
    - `pow`, `rpow`

Example:

In [76]:
df1

Unnamed: 0,b,c,d
A,0,1,2
B,3,4,5
C,6,7,8


In [77]:
df1.div(4)

Unnamed: 0,b,c,d
A,0.0,0.25,0.5
B,0.75,1.0,1.25
C,1.5,1.75,2.0


In [78]:
df1.rdiv(4)

Unnamed: 0,b,c,d
A,inf,4.0,2.0
B,1.333333,1.0,0.8
C,0.666667,0.571429,0.5
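
To make the "flipped arguments" idea concrete, here is a small sketch comparing `sub` and `rsub` with their operator forms. The names `left` and `right` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("bcd"), index=list("ABC"))

# df1.sub(10) computes df1 - 10, while df1.rsub(10) computes 10 - df1
left = df1.sub(10)
right = df1.rsub(10)

# Each method matches the operator form with the operands in that order
print(left.equals(df1 - 10))
print(right.equals(10 - df1))
```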


### Function Application and Mapping

NumPy ufuncs also work on pandas objects:

In [80]:
df = pd.DataFrame((np.random.randn(4,3).round(2)*100).astype(int),
                  columns = list('ABC'),
                  index = list('abcd'))

display(df)

display(np.abs(df))


Unnamed: 0,A,B,C
a,-161,135,-80
b,84,28,83
c,-40,41,-81
d,-8,66,-120


Unnamed: 0,A,B,C
a,161,135,80
b,84,28,83
c,40,41,81
d,8,66,120
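
Other ufuncs behave the same way, keeping the index and column labels. A short sketch with fixed data instead of random draws (the values are taken from the first two rows of the output above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-161, 135, -80], [84, 28, 83]],
                  columns=list("ABC"), index=list("ab"))

# Unary ufunc: sign of every element, labels preserved
signs = np.sign(df)

# Binary ufunc with a scalar: replace negative values with 0
clipped = np.maximum(df, 0)

print(signs)
print(clipped)
```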


### `apply` method:

Applies a function to each column or each row of a DataFrame, similar to `apply` in R:

In [82]:
df = pd.DataFrame(np.arange(12).reshape(4,3),
                  columns = list('ABC'),
                  index = list('abcd'))

display(df)

#By default, applies down the columns:
df.apply(np.mean)

Unnamed: 0,A,B,C
a,0,1,2
b,3,4,5
c,6,7,8
d,9,10,11


A    4.5
B    5.5
C    6.5
dtype: float64

In [86]:
#Can also apply across the columns, i.e. along the row:
df.apply(np.mean, axis='columns')

a     1.0
b     4.0
c     7.0
d    10.0
dtype: float64

In [89]:
#Can use our own functions. Recall lambda keyword:
f = lambda x: x.max() - x.min()

df.apply(f, axis=0)

A    9
B    9
C    9
dtype: int32

In [90]:
#The function can also return a Series with multiple values instead of a single scalar:
f = lambda x: pd.Series([x.min(), x.max(), x.mean()], index=['min', 'max', 'mean'])

df.apply(f)

Unnamed: 0,A,B,C
min,0.0,1.0,2.0
max,9.0,10.0,11.0
mean,4.5,5.5,6.5
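
For this particular pattern (several named reductions per column), `DataFrame.agg` with a list of method names gives the same table. This is an alternative to the `apply` approach above, not a replacement for it; the names `via_apply` and `via_agg` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list("ABC"), index=list("abcd"))

# The apply version from above: one Series of statistics per column
via_apply = df.apply(
    lambda x: pd.Series([x.min(), x.max(), x.mean()], index=["min", "max", "mean"])
)

# agg with a list of reduction names: one row per reduction
via_agg = df.agg(["min", "max", "mean"])

print(via_agg)
```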


In [91]:
#Note: apply is not usually necessary for common array statistical functions:
######

display(np.mean(df, axis='index'))

np.mean(df, axis='columns')

A    4.5
B    5.5
C    6.5
dtype: float64

a     1.0
b     4.0
c     7.0
d    10.0
dtype: float64
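
The same point holds for the range function used earlier: it can be written with built-in reductions, skipping `apply` entirely. A minimal sketch (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list("ABC"), index=list("abcd"))

# Range of each column via apply and a Python-level function
range_apply = df.apply(lambda x: x.max() - x.min())

# Same result from the built-in reductions
range_builtin = df.max() - df.min()

# Built-in reductions accept the same axis argument as apply
row_means = df.mean(axis="columns")

print(range_builtin)
print(row_means)
```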

### `apply` on a Series: Element-wise operation

In [92]:
df['A']

a    0
b    3
c    6
d    9
Name: A, dtype: int32

In [93]:
df['A'].apply(lambda x: x**3)

a      0
b     27
c    216
d    729
Name: A, dtype: int64

### `applymap` method

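To use an element-wise Python function with a DataFrame, use `applymap` (in pandas 2.1 and later this method is named `DataFrame.map`, and `applymap` is deprecated):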

In [94]:
#Square every value:
f = lambda x: x**2

df.applymap(f)

Unnamed: 0,A,B,C
a,0,1,4
b,9,16,25
c,36,49,64
d,81,100,121
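
A sketch that runs on both older and newer pandas picks whichever element-wise method is available. The helper name `elementwise` is made up for this example. For plain arithmetic such as squaring, the vectorized expression `df**2` gives the same values; an element-wise Python function is mainly useful for things like string formatting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list("ABC"), index=list("abcd"))

# Assumption: DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# Square every value with a Python function, one element at a time
squared = elementwise(lambda x: x**2)

# Vectorized equivalent, no Python-level call per element
vectorized = df**2

# A case where a per-element Python function is a natural fit: zero-padded labels
labels = elementwise(lambda x: f"{x:03d}")

print(squared.equals(vectorized))
print(labels)
```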


### Summarizing and Computing Descriptive Statistics

- pandas objects have a set of common mathematical and statistical methods
- These are usually reductions or summary statistics: they yield a single value for a Series, or a Series of values computed from the rows or columns of a DataFrame
- Built-in handling for missing data

In [95]:
#Simple example:
df = pd.DataFrame([[1, np.nan], [2, 3], [4, np.nan], [5,6]],
                  index = list('abcd'),
                  columns = ['one', 'two'])
df

Unnamed: 0,one,two
a,1,
b,2,3.0
c,4,
d,5,6.0


In [96]:
#Sum down the columns, ignoring NaNs:
df.sum()

one    12.0
two     9.0
dtype: float64

In [100]:
#Sum across the columns, i.e. by index:
df.sum(axis='columns') #Or axis = 1

a     1.0
b     5.0
c     4.0
d    11.0
dtype: float64

In [101]:
#Pass skipna=False to propagate NaNs instead of skipping them:
df.sum(axis=1, skipna=False)

a     NaN
b     5.0
c     NaN
d    11.0
dtype: float64
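
The `skipna` switch works the same way for other reductions such as `mean`. One related detail worth knowing: the sum of an all-NaN Series is 0 by default, and `min_count` can be used to get NaN instead. A minimal sketch (the Series `all_missing` is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [2, 3], [4, np.nan], [5, 6]],
                  index=list("abcd"), columns=["one", "two"])

# NaNs skipped: column 'two' averages only 3 and 6
mean_skip = df.mean()

# NaNs propagated: any NaN in a column makes its mean NaN
mean_strict = df.mean(skipna=False)

# With every value missing, sum() skips them all and returns 0.0
all_missing = pd.Series([np.nan, np.nan])
default_sum = all_missing.sum()

# min_count=1 requires at least one non-missing value, otherwise the result is NaN
guarded_sum = all_missing.sum(min_count=1)

print(mean_skip, mean_strict, default_sum, guarded_sum, sep="\n")
```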

In [103]:
#idxmin and idxmax return index labels where min and max values attained:

display(df.max())

display(df.idxmax())

one    5.0
two    6.0
dtype: float64

one    d
two    d
dtype: object

In [104]:
#Series only: argmin and argmax return the integer positions where the min and max are attained
df['one'].argmax()

3
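
The position from `argmax` and the label from `idxmax` point at the same element, which the following sketch checks on the same data. It also shows `idxmax` along the columns axis, which returns, for each row, the column holding that row's maximum.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [2, 3], [4, np.nan], [5, 6]],
                  index=list("abcd"), columns=["one", "two"])

s = df["one"]

pos = s.argmax()    # integer position, for use with iloc
label = s.idxmax()  # index label, for use with loc

# Both routes reach the same value
print(s.index[pos] == label)
print(s.iloc[pos], s.loc[label])

# Row-wise: which column holds each row's maximum (NaNs are skipped)
row_winner = df.idxmax(axis="columns")
print(row_winner)
```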

In [107]:
df

Unnamed: 0,one,two
a,1,
b,2,3.0
c,4,
d,5,6.0


In [108]:
#Can do cumulative sums and products (the output below is the cumulative product):
#df.cumsum()
df.cumprod()

Unnamed: 0,one,two
a,1,
b,2,3.0
c,8,
d,40,18.0


In [109]:
#And a bunch of summary statistics:
df.describe()

Unnamed: 0,one,two
count,4.0,2.0
mean,3.0,4.5
std,1.825742,2.12132
min,1.0,3.0
25%,1.75,3.75
50%,3.0,4.5
75%,4.25,5.25
max,5.0,6.0


#### Methods:
- `count`
- `describe`
- `min`, `max`
- `argmin`, `argmax` (Series only)
- `idxmin`, `idxmax`
- `quantile`
- `sum`
- `mean`
- `median`
- `mad` (mean absolute deviation from mean; removed in pandas 2.0)
- `prod`
- `var`
- `std`
- `skew`
- `kurt`
- `cumsum`
- `cummin`, `cummax`
- `cumprod`
- `diff` (first arithmetic difference)
- `pct_change`
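
A few of the less familiar methods from this list, tried on a small made-up Series (the values are arbitrary). Because `mad` is not available in recent pandas versions, the sketch computes the mean absolute deviation by hand.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 9, 18], index=list("abcd"))

# First difference: each value minus the previous one (first entry is NaN)
d = s.diff()

# Relative change from the previous value (first entry is NaN)
pc = s.pct_change()

# Running maximum
cm = s.cummax()

# Median via quantile
q50 = s.quantile(0.5)

# Mean absolute deviation from the mean, written out explicitly
mad_manual = (s - s.mean()).abs().mean()

print(d, pc, cm, q50, mad_manual, sep="\n")
```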

### Finally Note Again: Variable name binding

In [110]:
#Remake our old df:
####

data = {'state': ['Arizona', 'Arizona', 'Arizona', 'Arizona',
                  'California', 'California', 'California', 'California', 'Iowa', 'Iowa', 'Iowa', 'Iowa'],
        'year': [2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022],
        'area planted': [637, 573, 616, 630, 2983, 2621, 2550, 2274, 23935, 24330, 24330, 24150]}
        
df = pd.DataFrame(data)
df

Unnamed: 0,state,year,area planted
0,Arizona,2019,637
1,Arizona,2020,573
2,Arizona,2021,616
3,Arizona,2022,630
4,California,2019,2983
5,California,2020,2621
6,California,2021,2550
7,California,2022,2274
8,Iowa,2019,23935
9,Iowa,2020,24330


In [111]:
df2 = df

df2 is df

True

In [112]:
df2.loc[df2['state'] == 'Arizona', 'year'] = [1,2,3,4]
df2

Unnamed: 0,state,year,area planted
0,Arizona,1,637
1,Arizona,2,573
2,Arizona,3,616
3,Arizona,4,630
4,California,2019,2983
5,California,2020,2621
6,California,2021,2550
7,California,2022,2274
8,Iowa,2019,23935
9,Iowa,2020,24330


In [113]:
df

Unnamed: 0,state,year,area planted
0,Arizona,1,637
1,Arizona,2,573
2,Arizona,3,616
3,Arizona,4,630
4,California,2019,2983
5,California,2020,2621
6,California,2021,2550
7,California,2022,2274
8,Iowa,2019,23935
9,Iowa,2020,24330
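
Both names refer to one object, so the change made through `df2` shows up in `df`. When an independent DataFrame is wanted, `copy()` creates one. A minimal sketch on a two-row frame (the names `alias` and `independent` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"state": ["Arizona", "Iowa"], "year": [2019, 2019]})

alias = df               # a second name bound to the same object
independent = df.copy()  # a new object with its own data

# Modify through the alias
alias.loc[alias["state"] == "Arizona", "year"] = 1

print(alias is df)        # same object
print(independent is df)  # different object
print(df)                 # shows the change made through alias
print(independent)        # still has the original years
```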


In [117]:
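#Passing a DataFrame to a function binds the same object to the parameter name,
#so in-place changes made inside the function are visible to the caller: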

def do_stuff(df_in):
    #Try with and without:
    #df_in = df_in.copy()
    
    df_in['state'] = 0
    
    return df_in
    
df2 = do_stuff(df)

df

Unnamed: 0,state,year,area planted
0,0,1,637
1,0,2,573
2,0,3,616
3,0,4,630
4,0,2019,2983
5,0,2020,2621
6,0,2021,2550
7,0,2022,2274
8,0,2019,23935
9,0,2020,24330


In [118]:
df2

Unnamed: 0,state,year,area planted
0,0,1,637
1,0,2,573
2,0,3,616
3,0,4,630
4,0,2019,2983
5,0,2020,2621
6,0,2021,2550
7,0,2022,2274
8,0,2019,23935
9,0,2020,24330


In [119]:
df is df2

True
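
With the `copy()` line enabled, the function works on its own DataFrame and the caller's object is left alone. A sketch of that variant on a small frame; `zero_out_state` is a hypothetical name, not part of the notebook above.

```python
import pandas as pd

def zero_out_state(df_in):
    """Return a modified copy, leaving the caller's DataFrame untouched."""
    df_out = df_in.copy()  # work on a separate object
    df_out["state"] = 0
    return df_out

df = pd.DataFrame({"state": ["Arizona", "Iowa"], "year": [2019, 2019]})
df2 = zero_out_state(df)

print(df2 is df)  # the function returned a different object
print(df)         # original 'state' values are intact
print(df2)        # 'state' column set to 0
```

The trade-off is extra memory for the copy, which is usually acceptable unless the DataFrame is very large.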