4 October 2023

# `pandas`

Review core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

- package built to wrangle and analyze tabular data
- built on `NumPy` 
- core tool for data analysis in Python

In [1]:
import pandas as pd
import numpy as np

## Series

`pandas.Series`:
- one of the core data structures in `pandas`
- a 1D array of *indexed* data
- will be the columns of the `pandas.DataFrame`

#### Creating a `pandas` series

There are several ways to create a series; for now we will use:

```
s = pd.Series(data, index=index)
```

- `data` is a `NumPy` array (or a list of objects)
- `index` is a list of index labels with the same length as `data`

In [4]:
np.arange(3) #makes an array of consecutive integers

array([0, 1, 2])

In [8]:
#we can use this to create a pandas Series
pd.Series(np.arange(3), index = ['a','b','c'])

a    0
b    1
c    2
dtype: int64

`index` is an optional parameter; by default the index is the integers starting at 0.

In [11]:
# Create a series from a list of strings with default index
pd.Series(['EDS220', 'EDS222', 'EDS223', 'EDS242'])

0    EDS220
1    EDS222
2    EDS223
3    EDS242
dtype: object

#### Operations on series

Arithmetic operations work element-wise on series, and so do most `NumPy` functions.

In [15]:
s = pd.Series([98,73,65], index = ['Andy', 'Beth', 'Carolina'])
print(s, '\n') #\n adds an empty line

#divide each element in the series by 10
print(s/10)

Andy        98
Beth        73
Carolina    65
dtype: int64 

Andy        9.8
Beth        7.3
Carolina    6.5
dtype: float64
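The cell above only shows division, so here is a minimal sketch of the `NumPy` side of that claim. It applies `np.sqrt` to the same scores; the names `scores` and `roots` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same data as the series s above (the name `scores` is just for illustration)
scores = pd.Series([98, 73, 65], index=['Andy', 'Beth', 'Carolina'])

# A NumPy function is applied element by element,
# and the result is still a Series with the same index
roots = np.sqrt(scores)
print(roots)
```

The result keeps the labels `Andy`, `Beth`, `Carolina`, so values stay attached to their index after the operation.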


##### Conditionals in a series:

In [16]:
s > 70

Andy         True
Beth         True
Carolina    False
dtype: bool

This is simple but important: boolean series like this one are used to select data from data frames.
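As a quick sketch of what selecting with a conditional looks like, the boolean series can be passed inside square brackets to keep only the rows where it is `True`. The names `mask` and `above_70` are hypothetical, chosen for this example.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andy', 'Beth', 'Carolina'])

# Boolean series: True where the value is greater than 70
mask = s > 70

# Keep only the elements where the mask is True
above_70 = s[mask]
print(above_70)
```

Only `Andy` and `Beth` remain, since `Carolina` has a value of 65.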

#### Attributes and Methods of a Series

Two examples of identifying missing values:

- missing values are represented by `np.nan` (the alias `np.NaN` was removed in NumPy 2.0)
- `NaN` is a float value, not a separate type

In [18]:
type(np.nan)

float

In [20]:
# series with NAs in it
s2 = pd.Series([1, 2, np.nan, 4, np.nan])
s2

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

`hasnans` is an *attribute* of a `pandas` series; it is `True` if the series contains any NAs

In [21]:
s2.hasnans

True

`isna()` is a *method* of a series; it returns a boolean series indicating which elements are NAs

In [22]:
s2.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

`bool`: either `True` or `False`
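Because `isna()` returns a boolean series, it combines with the conditional selection shown earlier. A small sketch, assuming we want to count the NAs and then keep the non-missing values (the names `n_missing` and `s2_clean` are made up here):

```python
import numpy as np
import pandas as pd

s2 = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1 and False as 0, so summing the mask counts the NAs
n_missing = s2.isna().sum()
print(n_missing)

# ~ negates the boolean series, keeping only the elements that are not NA
s2_clean = s2[~s2.isna()]
print(s2_clean)
```

Note that `s2_clean` keeps the original index labels (0, 1, 3) rather than renumbering them.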

## Dataframes

`pandas.DataFrame` 

- most used object in `pandas`
- tabular data
- each column is a `pandas.Series`

#### Creating a `pandas.DataFrame`

**dictionaries**: sets of key-value pairs
```
{ 'key1' : value1,
  'key2' : value2
 }
```
- when creating a data frame from a dictionary, the keys become the column names and the values become the data in each column

In [25]:
d = { 'col_name_1' : np.arange(3),
    'col_name_2' : [3.1,3.2,3.3]
    }
d

{'col_name_1': array([0, 1, 2]), 'col_name_2': [3.1, 3.2, 3.3]}

In [28]:
df = pd.DataFrame(d)

df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
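To check the point that each column is a `pandas.Series`, here is a short sketch that pulls one column out of the data frame with square brackets and inspects its type. The variable name `col` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})

# Selecting a single column by name returns a Series
col = df['col_name_1']
print(type(col))

# Each column carries its own data type
print(df.dtypes)
```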


#### In place operations

Rename the data frame's columns using the method `rename`
- takes a dictionary as input, mapping old names to new names:

```
{'old_name_1' : 'new_name_1',
 'old_name_2' : 'new_name_2'
}
```

In [35]:
col_names = {'col_name_1' : 'col1',
            'col_name_2' : 'col2'}

df.rename(columns = col_names) # returns a new data frame; df itself is unchanged!

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3


In [32]:
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3


Calling `rename` does not change the column names of `df` itself; it creates a new object as output. To keep the change, assign the result back to `df`:

In [34]:
df = df.rename(columns = col_names)
df

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3
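For comparison, `rename` also accepts an `inplace=True` argument, which modifies the data frame directly and returns `None`. This is only a sketch of that alternative, not what the notes above use; reassigning with `df = df.rename(...)` is usually easier to follow. The name `result` is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})
col_names = {'col_name_1': 'col1',
             'col_name_2': 'col2'}

# With inplace=True, df itself is modified and nothing is returned
result = df.rename(columns=col_names, inplace=True)

print(result)            # None
print(list(df.columns))  # ['col1', 'col2']
```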
