# M06 Notes

## New Edition of Pandas Book

There is a new edition of the book we are using for NumPy and Pandas:

- [Python for Data Analysis, 3E](https://wesmckinney.com/book/pandas-basics)

Shout-out to Daniel Stornetta!

## NumPy Slices

![image.png](attachment:f739e8d6-eeb0-4992-94e9-60bd6ad9fe51.png)

In [1]:
import numpy as np

In [2]:
def inspect(a):
    print('Structure:')
    print(a)
    print('Shape:', a.shape, 'Axes:', len(a.shape))

In [3]:
a1 = np.random.randn(3,3)

In [4]:
inspect(a1)

Structure:
[[ 0.94189714  0.25292708 -0.53694544]
 [-0.38919961 -1.06228332 -0.57043245]
 [-0.07302494 -2.18720499 -1.47706283]]
Shape: (3, 3) Axes: 2


In [5]:
inspect(a1[2])

Structure:
[-0.60758386 -2.84371596  1.15749854]
Shape: (3,) Axes: 1


In [8]:
inspect(a1[[2]])

Structure:
[[-0.60758386 -2.84371596  1.15749854]]
Shape: (1, 3) Axes: 2


In [6]:
inspect(a1[2, :])

Structure:
[-0.60758386 -2.84371596  1.15749854]
Shape: (3,) Axes: 1


In [7]:
inspect(a1[2:])

Structure:
[[-0.60758386 -2.84371596  1.15749854]]
Shape: (1, 3) Axes: 2


In [9]:
inspect(a1[[2], :])

Structure:
[[-0.60758386 -2.84371596  1.15749854]]
Shape: (1, 3) Axes: 2


In [10]:
inspect(a1[2:, :])

Structure:
[[-0.60758386 -2.84371596  1.15749854]]
Shape: (1, 3) Axes: 2


In [11]:
inspect(a1[2, 2])

Structure:
1.1574985440672065
Shape: () Axes: 0


In [12]:
inspect(a1[2, [2]])

Structure:
[1.15749854]
Shape: (1,) Axes: 1


In [13]:
inspect(a1[[2], 2])

Structure:
[1.15749854]
Shape: (1,) Axes: 1


In [14]:
inspect(a1[[2], [2]])

Structure:
[1.15749854]
Shape: (1,) Axes: 1


In [15]:
inspect(a1[2:, 2:])

Structure:
[[1.15749854]]
Shape: (1, 1) Axes: 2


In [16]:
inspect(a1[:, 2])

Structure:
[-2.23444565  0.09570946  1.15749854]
Shape: (3,) Axes: 1


In [17]:
inspect(a1[:, [2]])

Structure:
[[-2.23444565]
 [ 0.09570946]
 [ 1.15749854]]
Shape: (3, 1) Axes: 2


In [18]:
inspect(a1[:, 1:])

Structure:
[[ 0.68627847 -2.23444565]
 [ 0.61797039  0.09570946]
 [-2.84371596  1.15749854]]
Shape: (3, 2) Axes: 2
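
The pattern in the cells above: an integer index removes the axis it is applied to, while a slice or a list keeps that axis, even when only one element is selected. Below is a minimal sketch that checks this on a small deterministic array. The name `a` and the use of `np.arange` instead of random values are choices made here for reproducibility; they are not part of the original cells.

```python
import numpy as np

# Deterministic 3x3 array so the shapes can be checked reproducibly
a = np.arange(9).reshape(3, 3)

# An integer index drops the axis it is applied to
row_1d = a[2]            # one row, shape (3,)
col_1d = a[:, 2]         # one column, shape (3,)
scalar = a[2, 2]         # single element, shape ()

# A slice or a list keeps the axis, even if it selects a single element
row_2d_slice = a[2:]     # shape (1, 3)
row_2d_list = a[[2]]     # shape (1, 3)
col_2d = a[:, [2]]       # shape (3, 1)
corner_2d = a[2:, 2:]    # shape (1, 1)

# Two lists are paired element by element: one result per (row, col) pair
paired = a[[2], [2]]     # shape (1,)

for label, result in [
    ('a[2]', row_1d), ('a[:, 2]', col_1d), ('a[2, 2]', scalar),
    ('a[2:]', row_2d_slice), ('a[[2]]', row_2d_list), ('a[:, [2]]', col_2d),
    ('a[2:, 2:]', corner_2d), ('a[[2], [2]]', paired),
]:
    print(label, np.shape(result))
```

A practical consequence: use `a[:, [2]]` rather than `a[:, 2]` when downstream code expects a 2D column.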


## Pandas 

- Pandas assumes a 2D world: rows and columns.
- Use [Xarray](https://docs.xarray.dev/en/stable/index.html) for data with more dimensions.
- Pandas is the _lingua franca_ for tabular data in Python.

## Pandas Indexing

In [5]:
import pandas as pd

### `-1`

In [6]:
ser = pd.Series(np.arange(3.))

In [7]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [12]:
# ser[-1]  # Raises KeyError: there is no label -1 in the index

In [9]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [10]:
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [11]:
ser2[-1]

2.0

Why?

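In the first case the index holds integers, so `-1` is interpreted as an index label, and no such label exists. In the second case the index holds strings, so the integer `-1` cannot be a label and is treated as a position, which returns the last element.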
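
To avoid depending on how plain `[]` interprets an integer, the explicit indexers can be used instead: `.iloc` is always positional and `.loc` is always label-based. The sketch below rebuilds the two Series from the cells above; the variable names holding the results are introduced here for illustration. Newer pandas releases also warn about the positional fallback seen with `ser2[-1]`, so `.iloc[-1]` is the safer spelling.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                           # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])   # string index

# .iloc is always positional, so -1 means "last element" for both Series
last_of_ser = ser.iloc[-1]
last_of_ser2 = ser2.iloc[-1]

# .loc is always label-based
by_label = ser2.loc['c']

# Plain [] on an integer index looks up the *label* -1, which does not exist
try:
    ser[-1]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(last_of_ser, last_of_ser2, by_label, raised_key_error)
```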

### Arithmetic

Arithmetic between Series aligns on index labels automatically. A label that appears in only one operand produces `NaN` in the result.

In [27]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [28]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [29]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [30]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
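
If the `NaN` values are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one operand as the given value. A minimal sketch using the same `s1` and `s2` as above; the name `total` is introduced here.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# A label missing from one Series is treated as 0 instead of producing NaN,
# so 'd' keeps its value from s1 and 'f', 'g' keep their values from s2
total = s1.add(s2, fill_value=0)
print(total)
```

The result still covers the union of both indexes (`a`, `c`, `d`, `e`, `f`, `g`); only the handling of the non-overlapping labels changes.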

The same alignment applies to DataFrames, on both the row index and the columns:

In [117]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [118]:
df1

|  | b | c | d |
|---|---|---|---|
| Ohio | 0.0 | 1.0 | 2.0 |
| Texas | 3.0 | 4.0 | 5.0 |
| Colorado | 6.0 | 7.0 | 8.0 |


In [119]:
df2

|  | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |


In [120]:
df1 + df2

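|  | b | c | d | e |
|---|---|---|---|---|
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0 | NaN | 12.0 | NaN |
| Utah | NaN | NaN | NaN | NaN |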


In [124]:
(df1 + df2).fillna(0).style.background_gradient(axis=None, cmap='YlGnBu')

|  | b | c | d | e |
|---|---|---|---|---|
| Colorado | 0.0 | 0.0 | 0.0 | 0.0 |
| Ohio | 3.0 | 0.0 | 6.0 | 0.0 |
| Oregon | 0.0 | 0.0 | 0.0 | 0.0 |
| Texas | 9.0 | 0.0 | 12.0 | 0.0 |
| Utah | 0.0 | 0.0 | 0.0 | 0.0 |
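
Calling `fillna(0)` after the addition is not the same as filling during the addition. The sketch below contrasts the two using the same `df1` and `df2`; the names `zeroed` and `filled` are introduced here. With `add(..., fill_value=0)`, a cell present in only one frame keeps its value, and a cell missing from both frames stays `NaN`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Fill after adding: every cell that did not overlap becomes 0
zeroed = (df1 + df2).fillna(0)

# Fill while adding: a cell found in only one frame is added to 0,
# so it keeps its value; a cell found in neither frame remains NaN
filled = df1.add(df2, fill_value=0)

print(zeroed.loc['Ohio', 'c'])      # overlap missing, replaced by 0
print(filled.loc['Ohio', 'c'])      # value from df1 is preserved
print(filled.loc['Colorado', 'e'])  # missing from both frames
```

Which behaviour is appropriate depends on whether a missing cell should mean "zero" or "unknown" in the analysis.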


## Anatomy of a Data Frame

<img src="https://pynative.com/wp-content/uploads/2021/02/dataframe.png" width="50%" height="50%"/>

<img src="https://miro.medium.com/max/700/1*KOBhtOeFntu6CyJUsCdN0g.jpeg" width="50%" height="50%"/>

In [97]:
import seaborn as sns

In [101]:
# sns.get_dataset_names()

In [176]:
data_set = 'iris'
# data_set = 'penguins'
df = sns.load_dataset(data_set)
df.index.name = 'obs_id'

In [177]:
# df = pd.read_csv("iris_data.csv").set_index('obs_id')

In [178]:
df.groupby('species').agg(['mean', 'median', 'count'])

| species | sepal_length mean | sepal_length median | sepal_length count | sepal_width mean | sepal_width median | sepal_width count | petal_length mean | petal_length median | petal_length count | petal_width mean | petal_width median | petal_width count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 5.006 | 5.0 | 50 | 3.428 | 3.4 | 50 | 1.462 | 1.5 | 50 | 0.246 | 0.2 | 50 |
| versicolor | 5.936 | 5.9 | 50 | 2.77 | 2.8 | 50 | 4.26 | 4.35 | 50 | 1.326 | 1.3 | 50 |
| virginica | 6.588 | 6.5 | 50 | 2.974 | 3.0 | 50 | 5.552 | 5.55 | 50 | 2.026 | 2.0 | 50 |
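
Passing a list of functions to `agg` produces columns with two levels: the measurement and the statistic. Below is a minimal sketch of how such columns can be selected. It uses a tiny made-up table (`toy`, with illustrative values only) rather than the iris data so that it runs without downloading anything.

```python
import pandas as pd

# Tiny made-up stand-in for the iris data; the values are illustrative only
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.2, 6.4, 6.6],
    'petal_length': [1.4, 1.6, 5.4, 5.6],
})

stats = toy.groupby('species').agg(['mean', 'median', 'count'])

# The columns form a two-level MultiIndex: (measurement, statistic)
n_levels = stats.columns.nlevels

# All statistics for one measurement
sepal_stats = stats['sepal_length']

# One statistic for one measurement, selected with a tuple key
sepal_mean = stats[('sepal_length', 'mean')]

# One statistic across all measurements
means = stats.xs('mean', axis=1, level=1)

print(n_levels)
print(sepal_stats)
print(sepal_mean)
print(means)
```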


In [149]:
df['species'].value_counts().to_frame('n')

| species | n |
|---|---|
| setosa | 50 |
| versicolor | 50 |
| virginica | 50 |


In [182]:
# df.values
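
The parts of a data frame shown in the diagrams above can be inspected directly: the row labels (`index`), the column labels (`columns`), and the underlying values. Below is a minimal sketch on a two-row made-up frame (`toy`, values illustrative only), again to avoid depending on a download. It uses `to_numpy()`, which returns the same kind of array as the commented-out `df.values`.

```python
import pandas as pd

# Two-row made-up frame; the values are illustrative only
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'species': ['setosa', 'setosa'],
})
toy.index.name = 'obs_id'

row_labels = toy.index             # row labels, here a RangeIndex named 'obs_id'
col_labels = toy.columns           # column labels
values = toy.to_numpy()            # 2D NumPy array holding the cell values
one_column = toy['sepal_length']   # a single column is a Series with the same index

print(row_labels)
print(col_labels)
print(values.shape)
print(type(one_column).__name__)
```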