
# Getting Started with Pandas

- The `pandas` library is our fundamental library for working with tabular data (data frames).

- pandas is often used with the numerical computing tools NumPy and SciPy, analytical libraries like scikit-learn, and data visualization libraries such as matplotlib.

- pandas adopts parts of NumPy's style for array-based computing and data processing without `for` loops.
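
To make the "without `for` loops" point concrete, here is a minimal sketch comparing an explicit loop with the equivalent vectorized expression. The name `prices` and its values are invented for illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
prices = pd.Series([10.0, 12.5, 8.0])

# Loop version: build the result one element at a time
looped = pd.Series([p * 1.1 for p in prices])

# Vectorized version: one expression applied to the whole Series at once
vectorized = prices * 1.1

print(looped)
print(vectorized)
```

Both versions produce the same numbers; the vectorized form is shorter and usually faster on large data.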

<br>

<img src="panda.png" alt="Panda!" style="width:375px;"/>

<br>

In [61]:
#Standard way to import
####

import pandas as pd

#Grab numpy and matplotlib.pyplot as well:
import numpy as np
import matplotlib.pyplot as plt

## Series

The two fundamental data structures in pandas are the

1. **Series**
2. **DataFrame**

A **Series** is a 1-D array-like object consisting of a sequence of values (with types similar to NumPy types) plus an array of data labels called its ***index***.

In [62]:
#Make simple series from only an array of data
s = pd.Series([1, 3, 5, 7, 9])

s

0    1
1    3
2    5
3    7
4    9
dtype: int64

Check out default index (plus values):

In [63]:
s.values

array([1, 3, 5, 7, 9], dtype=int64)

In [64]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [65]:
list(s.index)

[0, 1, 2, 3, 4]

Let's give everything a label using the index:

In [66]:
#We'll use:
list('abcde')

['a', 'b', 'c', 'd', 'e']

In [67]:
s = pd.Series([1, 3, 5, 7, 9], index = list('abcde')) #vs. ['a', 'b', 'c', 'd', 'e'])

#Vs:
#s = pd.Series([1, 3, 5, 7, 9], index = range(1,6))

display(s)

s.index

a    1
b    3
c    5
d    7
e    9
dtype: int64

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

#### Can use either numeric index or labels in index to select values in Series:

- But possible ambiguity when labels are numeric

In [70]:
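#By integer position:
#(recent pandas versions warn about integer positions in plain [] on a labeled Series; s.iloc[0] is the explicit form)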
s[0]

1

In [71]:
#By label:
s['a']

1

In [72]:
#Note that this is right-inclusive, unlike most of Python:
s['a':'d']

a    1
b    3
c    5
d    7
dtype: int64

In [74]:
#Note must use a list if getting multiple specific indices
s[['a', 'b', 'd']]

a    1
b    3
d    7
dtype: int64

In [76]:
## Note if we have a numeric index that does NOT align with the integer location:
s = pd.Series([1, 3, 5, 7, 9], index = range(3,8))

s

3    1
4    3
5    5
6    7
7    9
dtype: int64
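
The ambiguity mentioned above can be sketched as follows, assuming the same Series with labels 3 through 7. The variable names `by_label` and `by_position` are introduced here for illustration only.

```python
import pandas as pd

# Same Series as above: integer labels 3..7 that do not match positions 0..4
s = pd.Series([1, 3, 5, 7, 9], index=range(3, 8))

by_label = s.loc[3]      # label 3 is the first element
by_position = s.iloc[3]  # position 3 is the fourth element

print(by_label, by_position)  # 1 7
```

Being explicit with `loc` or `iloc` removes any doubt about which of the two lookups is meant.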

### We can also use `iloc` and `loc`:

- Strongly recommended for Series
- Needed for DataFrames when selecting rows by label or position (plain `[]` selects columns, apart from row slices)

#### `iloc` selects by integer position:

In [84]:
#Remake s real quick:
s = pd.Series([1, 3, 5, 7, 9], index = list('abcde'))
s

a    1
b    3
c    5
d    7
e    9
dtype: int64

In [85]:
s.iloc[[0,2,3]]

a    1
c    5
d    7
dtype: int64

#### `loc` selects by index label:

In [86]:
s.loc['a':'c']

a    1
b    3
c    5
dtype: int64

In [89]:
#But note the difference: iloc slices exclude the end position, loc slices include the end label
s.iloc[0:2]

a    1
b    3
dtype: int64

### Boolean Masking

Similar to NumPy...

**Note that masking, filtering, scalar multiplication, and math functions all preserve index-value link**

In [90]:
s > 3

a    False
b    False
c     True
d     True
e     True
dtype: bool

In [91]:
s[s > 3]

c    5
d    7
e    9
dtype: int64

In [92]:
#OR (better):
s.loc[s > 3]

c    5
d    7
e    9
dtype: int64

### Arithmetic and universal functions:

- Similar to NumPy
- *Arithmetic operations between series automatically align by index label!*

In [93]:
s * 3

a     3
b     9
c    15
d    21
e    27
dtype: int64

In [94]:
np.exp(s)

a       2.718282
b      20.085537
c     148.413159
d    1096.633158
e    8103.083928
dtype: float64

In [95]:
#Note index alignment:

#What if:
####

s2 = pd.Series([1, 3, 5, 7, 9], index = list(reversed('abcde')))

s2

e    1
d    3
c    5
b    7
a    9
dtype: int64

In [96]:
s

a    1
b    3
c    5
d    7
e    9
dtype: int64

In [97]:
s + s2

a    10
b    10
c    10
d    10
e    10
dtype: int64

In [98]:
s * s2

a     9
b    21
c    25
d    21
e     9
dtype: int64

#### Note: Series can also be thought of as an ordered dictionary: It is a mapping of index values to data values.

- Series can be used in many contexts where a dict might be used

In [99]:
#See if index/key in s?
'a' in s

True

In [101]:
'f' in s

False
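
Beyond the `in` checks above, a few other dict-style operations carry over to a Series. The following is a minimal sketch; the variable names `missing`, `pairs`, and `as_dict` are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=list('abcde'))

# get() returns a default instead of raising KeyError, like dict.get()
missing = s.get('f', 0)

# items() yields (label, value) pairs, like dict.items()
pairs = list(s.items())

# to_dict() converts the Series to a plain dict
as_dict = s.to_dict()

print(missing)
print(pairs[:2])
print(as_dict['c'])
```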

In [102]:
#Values?
5 in s.values

True

In [103]:
s.values

array([1, 3, 5, 7, 9], dtype=int64)

In [104]:
#Note boolean masking by a list of values: Use `isin()`
######

s.isin([5,7])

a    False
b    False
c     True
d     True
e    False
dtype: bool

In [107]:
#Better to use loc:
######
s.loc[s.isin([5,7])]

#Not in!
#s.loc[~s.isin([5,7])]

c    5
d    7
dtype: int64

#### Can directly create a Series from data stored in a dictionary object:

In [108]:
#Area harvested for grain in 2020 in 1,000 acres, selected states; USDA Acreage Report
#https://usda.library.cornell.edu/concern/publications/j098zb09z?locale=en
####
sdata = {'Arizona': 29, 'Ohio': 3300, 'Texas': 1810, 'Oregon': 65, 'Iowa': 12900}

s = pd.Series(sdata)
s

Arizona       29
Ohio        3300
Texas       1810
Oregon        65
Iowa       12900
dtype: int64

In [None]:
#Note, equiv to:
#sdata = [29, 3300, 1810, 65, 12900]
#pd.Series(sdata, index = ['Arizona', 'Ohio', 'Texas', 'Oregon', 'Iowa'])

In [109]:
s.values

array([   29,  3300,  1810,    65, 12900], dtype=int64)

In [110]:
s.index

Index(['Arizona', 'Ohio', 'Texas', 'Oregon', 'Iowa'], dtype='object')

- Can set order of index in resulting series:

In [111]:
states = ['Arizona', 'California', 'Iowa', 'Ohio', 'Texas']

s2 = pd.Series(sdata, index = states)
s2

Arizona          29.0
California        NaN
Iowa          12900.0
Ohio           3300.0
Texas          1810.0
dtype: float64

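* Note that 'Oregon' was dropped because it is not in `states`, and 'California' got a `NaN` value because it has no entry in `sdata` (which is also why the dtype became float)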

Can detect missing data with `isnull` and `notnull`:

In [112]:
pd.isnull(s2)

Arizona       False
California     True
Iowa          False
Ohio          False
Texas         False
dtype: bool

In [113]:
pd.notnull(s2)

Arizona        True
California    False
Iowa           True
Ohio           True
Texas          True
dtype: bool

In [114]:
#Also have isnull() and notnull() methods (vs. functions)
#Plus isna() and notna() methods

s2.notnull()

Arizona        True
California    False
Iowa           True
Ohio           True
Texas          True
dtype: bool

In [115]:
s2.isna()

Arizona       False
California     True
Iowa          False
Ohio          False
Texas         False
dtype: bool

In [117]:
#Throw in an any() method (can also try all()):
s2.isna().any()

True

In [118]:
s2.isna().all()

False

#### As above, arithmetic operations automatically align by index label:

In [119]:
s

Arizona       29
Ohio        3300
Texas       1810
Oregon        65
Iowa       12900
dtype: int64

In [120]:
s2

Arizona          29.0
California        NaN
Iowa          12900.0
Ohio           3300.0
Texas          1810.0
dtype: float64

In [121]:
s + s2

Arizona          58.0
California        NaN
Iowa          25800.0
Ohio           6600.0
Oregon            NaN
Texas          3620.0
dtype: float64

In [124]:
#To only keep the ones that aren't NaN:
(s + s2).dropna()


Arizona       58.0
Iowa       25800.0
Ohio        6600.0
Texas       3620.0
dtype: float64
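
As an alternative to dropping the `NaN` rows, the `add()` method accepts a `fill_value` for labels that are missing on one side. This is a sketch under the assumption that treating a missing state as 0 is acceptable for the analysis; the name `total` is introduced here for illustration.

```python
import pandas as pd

sdata = {'Arizona': 29, 'Ohio': 3300, 'Texas': 1810, 'Oregon': 65, 'Iowa': 12900}
s = pd.Series(sdata)
s2 = pd.Series(sdata, index=['Arizona', 'California', 'Iowa', 'Ohio', 'Texas'])

# Treat a label that is missing from one side as 0 instead of producing NaN
total = s.add(s2, fill_value=0)

print(total)
```

California stays `NaN` because it is missing on both sides (absent from `s`, `NaN` in `s2`), while Oregon keeps its value of 65 from `s`.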

### Naming Stuff

In [125]:
#Name the Series
s.name = 'Grain'

#Name the index:
s.index.name = 'State'

In [126]:
s

State
Arizona       29
Ohio        3300
Texas       1810
Oregon        65
Iowa       12900
Name: Grain, dtype: int64

#### Altering the index?

In [127]:
#Can alter index by in-place assignment:

#Re-make our simpler example:
s = pd.Series([1, 3, 5, 7, 9])
s

0    1
1    3
2    5
3    7
4    9
dtype: int64

In [128]:
s.index = ['A', 'B', 'C', 'D', 'E']
s.index.name = 'Name!'

s

Name!
A    1
B    3
C    5
D    7
E    9
dtype: int64

In [129]:
#Try to rename a single index label?
#Not allowed directly: Index objects are immutable, so the following raises an error
s.index[1] = 'Whoo!'

TypeError: Index does not support mutable operations

In [131]:
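#Do this instead: rename() maps old labels to new ones and returns a new Series
#(avoids touching private attributes such as s.index._data)
s = s.rename(index={'B': 'Whoo!'})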

s

Name!
A        1
Whoo!    3
C        5
D        7
E        9
dtype: int64

### DataFrame

Rectangular table of data, where each column can be a different data type.

- Both row and column index

- Can also think of as a dictionary of Series that all share the same index (if you really want to)
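
The "dictionary of Series" view can be sketched directly. The `area` values reuse a few of the grain figures from earlier, while the `region` labels and the name `df_demo` are assumptions added for illustration only.

```python
import pandas as pd

# Each column starts life as its own Series; note the different label order
area = pd.Series({'Arizona': 29, 'Ohio': 3300, 'Iowa': 12900})
region = pd.Series({'Arizona': 'West', 'Iowa': 'Midwest', 'Ohio': 'Midwest'})  # illustrative labels

# Dict keys become column names; the Series are aligned on their index labels
df_demo = pd.DataFrame({'area': area, 'region': region})

print(df_demo)

# Each column pulled back out is a Series sharing the DataFrame's index
print(df_demo['area'].index.equals(df_demo['region'].index))
```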

In [132]:
#Let's make a DataFrame from a dict of equal-length lists:
#Index is assigned automatically

#Principal Crops Area Planted
#USDA Acreage Report
#https://usda.library.cornell.edu/concern/publications/j098zb09z?locale=en

data = {'state': ['Arizona', 'Arizona', 'Arizona', 'Arizona',
                  'California', 'California', 'California', 'California', 'Iowa', 'Iowa', 'Iowa', 'Iowa'],
        'year': [2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022],
        'area planted': [637, 573, 616, 630, 2983, 2621, 2550, 2274, 23935, 24330, 24330, 24150]}
        
df = pd.DataFrame(data)
df

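|  | state | year | area planted |
|---|---|---|---|
| 0 | Arizona | 2019 | 637 |
| 1 | Arizona | 2020 | 573 |
| 2 | Arizona | 2021 | 616 |
| 3 | Arizona | 2022 | 630 |
| 4 | California | 2019 | 2983 |
| 5 | California | 2020 | 2621 |
| 6 | California | 2021 | 2550 |
| 7 | California | 2022 | 2274 |
| 8 | Iowa | 2019 | 23935 |
| 9 | Iowa | 2020 | 24330 |
| 10 | Iowa | 2021 | 24330 |
| 11 | Iowa | 2022 | 24150 |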


In [133]:
#Grab a column:
df['area planted']

0       637
1       573
2       616
3       630
4      2983
5      2621
6      2550
7      2274
8     23935
9     24330
10    24330
11    24150
Name: area planted, dtype: int64

In [134]:
df.state

0        Arizona
1        Arizona
2        Arizona
3        Arizona
4     California
5     California
6     California
7     California
8           Iowa
9           Iowa
10          Iowa
11          Iowa
Name: state, dtype: object

In [139]:
#We have head() and tail() methods:
###

df.head(3)

|  | state | year | area planted |
|---|---|---|---|
| 0 | Arizona | 2019 | 637 |
| 1 | Arizona | 2020 | 573 |
| 2 | Arizona | 2021 | 616 |


In [142]:
df.tail(2)

|  | state | year | area planted |
|---|---|---|---|
| 10 | Iowa | 2021 | 24330 |
| 11 | Iowa | 2022 | 24150 |


In [143]:
#Note print vs. display:

print(df)
display(df)

         state  year  area planted
0      Arizona  2019           637
1      Arizona  2020           573
2      Arizona  2021           616
3      Arizona  2022           630
4   California  2019          2983
5   California  2020          2621
6   California  2021          2550
7   California  2022          2274
8         Iowa  2019         23935
9         Iowa  2020         24330
10        Iowa  2021         24330
11        Iowa  2022         24150


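|  | state | year | area planted |
|---|---|---|---|
| 0 | Arizona | 2019 | 637 |
| 1 | Arizona | 2020 | 573 |
| 2 | Arizona | 2021 | 616 |
| 3 | Arizona | 2022 | 630 |
| 4 | California | 2019 | 2983 |
| 5 | California | 2020 | 2621 |
| 6 | California | 2021 | 2550 |
| 7 | California | 2022 | 2274 |
| 8 | Iowa | 2019 | 23935 |
| 9 | Iowa | 2020 | 24330 |
| 10 | Iowa | 2021 | 24330 |
| 11 | Iowa | 2022 | 24150 |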


- **Note some basic attributes**

In [144]:
type(df)

pandas.core.frame.DataFrame

In [145]:
len(df)

12

In [146]:
df.size

36

In [147]:
df.shape

(12, 3)

In [148]:
df.ndim

2

#### Can add our own index:

In [149]:
df = pd.DataFrame(data, columns = ['state', 'year', 'area planted'],
            index = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve'])

df

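|  | state | year | area planted |
|---|---|---|---|
| one | Arizona | 2019 | 637 |
| two | Arizona | 2020 | 573 |
| three | Arizona | 2021 | 616 |
| four | Arizona | 2022 | 630 |
| five | California | 2019 | 2983 |
| six | California | 2020 | 2621 |
| seven | California | 2021 | 2550 |
| eight | California | 2022 | 2274 |
| nine | Iowa | 2019 | 23935 |
| ten | Iowa | 2020 | 24330 |
| eleven | Iowa | 2021 | 24330 |
| twelve | Iowa | 2022 | 24150 |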


#### Setting and resetting index:

In [166]:
df2 = df.set_index('state')

df2

| state | year | area planted |
|---|---|---|
| Arizona | 2019 | 637 |
| Arizona | 2020 | 573 |
| Arizona | 2021 | 616 |
| Arizona | 2022 | 630 |
| California | 2019 | 2983 |
| California | 2020 | 2621 |
| California | 2021 | 2550 |
| California | 2022 | 2274 |
| Iowa | 2019 | 23935 |
| Iowa | 2020 | 24330 |
| Iowa | 2021 | 24330 |
| Iowa | 2022 | 24150 |


In [167]:
df2 = df2.reset_index(drop=False)

df2

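|  | state | year | area planted |
|---|---|---|---|
| 0 | Arizona | 2019 | 637 |
| 1 | Arizona | 2020 | 573 |
| 2 | Arizona | 2021 | 616 |
| 3 | Arizona | 2022 | 630 |
| 4 | California | 2019 | 2983 |
| 5 | California | 2020 | 2621 |
| 6 | California | 2021 | 2550 |
| 7 | California | 2022 | 2274 |
| 8 | Iowa | 2019 | 23935 |
| 9 | Iowa | 2020 | 24330 |
| 10 | Iowa | 2021 | 24330 |
| 11 | Iowa | 2022 | 24150 |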
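
The `drop` argument of `reset_index` decides whether the old index survives as a column. Here is a minimal sketch on a two-row frame; `df_demo`, `indexed`, `kept`, and `dropped` are names introduced for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'state': ['Arizona', 'Iowa'], 'year': [2019, 2019],
                        'area planted': [637, 23935]})
indexed = df_demo.set_index('state')

kept = indexed.reset_index(drop=False)    # 'state' comes back as a column
dropped = indexed.reset_index(drop=True)  # 'state' is discarded

print(list(kept.columns))     # ['state', 'year', 'area planted']
print(list(dropped.columns))  # ['year', 'area planted']
```

Neither `set_index` nor `reset_index` changes the original frame here; each returns a new DataFrame.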


#### Getting columns:

In [171]:
#Get columns:
my_var = 'area planted'

#This gives me a Series:
df[my_var]

#This gives me a DataFrame:
#df[[my_var]]

#Or this:
#pd.DataFrame(df[my_var])

one         637
two         573
three       616
four        630
five       2983
six        2621
seven      2550
eight      2274
nine      23935
ten       24330
eleven    24330
twelve    24150
Name: area planted, dtype: int64
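
The difference between single and double brackets noted in the comments above can be checked directly. This is a small sketch on a two-row frame; `df_demo`, `as_series`, and `as_frame` are names introduced for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'state': ['Arizona', 'Iowa'], 'area planted': [637, 23935]})

as_series = df_demo['area planted']    # single brackets: Series
as_frame = df_demo[['area planted']]   # list of column names: DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```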

In [174]:
#Pass a list of columns:
df[['state', 'year']]

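|  | state | year |
|---|---|---|
| one | Arizona | 2019 |
| two | Arizona | 2020 |
| three | Arizona | 2021 |
| four | Arizona | 2022 |
| five | California | 2019 |
| six | California | 2020 |
| seven | California | 2021 |
| eight | California | 2022 |
| nine | Iowa | 2019 |
| ten | Iowa | 2020 |
| eleven | Iowa | 2021 |
| twelve | Iowa | 2022 |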


#### Getting rows:

In [175]:
#A bare slice selects rows by position:

df[0:3]

#But better to use iloc and loc...

|  | state | year | area planted |
|---|---|---|---|
| one | Arizona | 2019 | 637 |
| two | Arizona | 2020 | 573 |
| three | Arizona | 2021 | 616 |


In [178]:
#Note:
df.iloc[0:3]

#vs:
#df.iloc[0:1]

|  | state | year | area planted |
|---|---|---|---|
| one | Arizona | 2019 | 637 |
| two | Arizona | 2020 | 573 |
| three | Arizona | 2021 | 616 |
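
Both `iloc` and `loc` also accept a second argument that selects columns. The following sketch, on a four-row copy of the Arizona data, shows the same selection made by position and by label; `df_demo`, `by_position`, and `by_label` are names introduced for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'state': ['Arizona', 'Arizona', 'Arizona', 'Arizona'],
                        'year': [2019, 2020, 2021, 2022],
                        'area planted': [637, 573, 616, 630]},
                       index=['one', 'two', 'three', 'four'])

# Positions: rows 0-2 and columns 0-1 (the end of each slice is excluded)
by_position = df_demo.iloc[0:3, 0:2]

# Labels: the row slice includes 'three'; columns are given by name
by_label = df_demo.loc['one':'three', ['state', 'year']]

print(by_position.equals(by_label))  # True
```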
