# SRM 641 Week 3 Pandas

## By the end of this week, you should be able to:

- Understand why and how Pandas can be useful for data processing and analysis
- Describe the main Pandas data structures
- Use Pandas to load data, access subsets of data, describe, and summarize


## Pandas

Last week we focused on NumPy and its ndarray object, which provides efficient storage and manipulation of dense typed arrays in Python. NumPy is great for performing math operations on arrays and matrices, whereas Pandas is excellent for wrangling, processing, and understanding spreadsheet-like data.

This week, you will learn about the Pandas library. Pandas is a newer package built on top of NumPy that provides an efficient implementation of a `DataFrame`, a data structure for working with tabular data in Python. Pandas DataFrames are composed of rows and columns that can have header names, and the columns can be of different types (e.g. the first column containing integers and the second column containing text strings). Each value in a pandas DataFrame is referred to as a cell, which has a specific row index and column index within the tabular structure.
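As a minimal sketch of the ideas in the paragraph above, the following builds a small DataFrame whose two columns have different types and then reads a single cell by its row and column index. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical example data: one integer column and one text column
df = pd.DataFrame({"count": [3, 7, 1], "label": ["low", "high", "low"]})

# Each column keeps its own data type
print(df.dtypes)

# A cell is identified by a row index and a column index:
# .loc uses labels, .iloc uses integer positions
cell_by_label = df.loc[1, "label"]
cell_by_position = df.iloc[1, 1]
print(cell_by_label, cell_by_position)
```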

Pandas, and in particular its `Series` and `DataFrame` objects, builds on the NumPy array structure and provides efficient support for the "data munging" tasks that occupy much of a data scientist's time.

In [1]:
# To begin, import numpy and pandas

import numpy as np
import pandas as pd 

To get started with pandas, you will need to get comfortable with its two workhorse data structures: `Series` and `DataFrame`. 


## Series

A Series is a one-dimensional array-like object containing a sequence of values of the same type (the available types are similar to NumPy's) and an associated array of data labels, called its index.

In [3]:
# Let's create a Series from a one-dimensional list of values

obj = pd.Series([2, 4, -8, 10])

In [4]:
# View the object

obj

0     2
1     4
2    -8
3    10
dtype: int64

The string representation of a Series object shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.
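A short sketch of what that default index looks like when inspected directly (the variable name `obj` follows the cell above):

```python
import pandas as pd

obj = pd.Series([2, 4, -8, 10])

# The default index is a RangeIndex running from 0 to N - 1
print(obj.index)        # RangeIndex(start=0, stop=4, step=1)
print(list(obj.index))  # [0, 1, 2, 3]
print(len(obj))         # 4
```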

Often, you'll want to create a Series with an index identifying each data point with a label:

In [5]:
obj2 = pd.Series([2, 4, -8, 10], index=["d", "b", "a", "c"])

In [6]:
# View

obj2

d     2
b     4
a    -8
c    10
dtype: int64

You can use labels in the index when selecting single values or a set of values:

In [8]:
obj2["d"]

2
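The cell above selects a single value. As a sketch of selecting a set of values, you can pass a list of labels or a Boolean condition inside the brackets; the variable names `subset` and `positive` are chosen here for illustration.

```python
import pandas as pd

obj2 = pd.Series([2, 4, -8, 10], index=["d", "b", "a", "c"])

# Passing a list of labels returns a new Series in the requested order
subset = obj2[["c", "a", "d"]]
print(subset)

# A Boolean condition keeps only the entries where it is True
positive = obj2[obj2 > 0]
print(positive)
```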

Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [9]:
# Create the dictionary
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# Create a Series

obj3 = pd.Series(sdata)


In [10]:
#View the object
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [11]:
# A Series can be converted back to a dictionary with its to_dict method

obj3.to_dict()


{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

Normally the index in the resulting Series will respect the order of the keys according to the dictionary's keys method, which depends on the key insertion order. You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [12]:
states = ["California", "Ohio", "Oregon", "Texas", "Colorado"]

obj4 = pd.Series(sdata, index=states)

In [13]:
# View

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Colorado          NaN
dtype: float64

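California and Colorado are not keys in `sdata`, so their values appear as NaN (Not a Number), which pandas uses to mark missing or NA values. Because NaN is a floating-point value, the data type of the Series also changes from `int64` to `float64`.
There are several useful methods for detecting, removing, and replacing null values in Pandas data structures: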

- `isnull()`: Generate a boolean mask indicating missing values
- `notnull()`: Opposite of isnull()
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed

The `isna` and `notna` functions in pandas (`isnull` and `notnull` are aliases for them) can be used to detect missing data:

In [14]:
# Check if missing

pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
Colorado       True
dtype: bool

In [15]:
# Check if not missing

pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
Colorado      False
dtype: bool

In [16]:
# Can also check as instance methods

obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
Colorado       True
dtype: bool
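The cells above cover detection only. As a brief sketch of the other two methods from the list, `dropna()` and `fillna()` both return new objects and leave the original Series unchanged. Filling with `0` is an arbitrary choice for illustration, not a recommendation for real data.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas", "Colorado"]
obj4 = pd.Series(sdata, index=states)

# Remove the entries whose value is missing
dropped = obj4.dropna()
print(dropped)

# Replace missing values with a constant (0 is an arbitrary choice here)
filled = obj4.fillna(0)
print(filled)

# The original Series still contains its two missing values
print(obj4.isna().sum())  # 2
```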

Both the Series object itself and its index have a `name` attribute, which integrates with other areas of pandas functionality:

In [17]:
obj4.name = "population"

In [18]:
obj4.index.name = "state"

In [19]:
# View
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Colorado          NaN
Name: population, dtype: float64
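One place where these names show up, sketched below, is when a Series is turned into a DataFrame with `reset_index()`: the index name and the Series name become the column names. The variable name `as_frame` is chosen here for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas", "Colorado"]
obj4 = pd.Series(sdata, index=states)
obj4.name = "population"
obj4.index.name = "state"

# The index becomes a regular column named after index.name,
# and the values become a column named after the Series name
as_frame = obj4.reset_index()
print(as_frame.columns.tolist())  # ['state', 'population']
```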

Here are some commonly used attributes of Series objects and what each one returns:

- `name`: The name of the Series object
- `dtype`: The data type of the Series object
- `shape`: Dimensions of the Series object in a tuple of the form (number of rows,)
- `index`: The Index object that is part of the Series object
- `values`: The data in the Series object

In [21]:
# Check the name

obj4.name

'population'

In [20]:
# Check the shape of the Series object

obj4.shape

(5,)

In [22]:
# Check the index

obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas', 'Colorado'], dtype='object', name='state')

In [23]:
# Check the values

obj4.values

array([   nan, 35000., 16000., 71000.,    nan])

## DataFrame

Having a Series object for each column is an improvement over the NumPy representation; however, with separate Series we still have a problem when we want to sort the whole table by one column or pull out an entire row. The DataFrame gives us a representation of a table formed from many Series objects that make up the columns and a shared Index object that labels the rows. The DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index.
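To make the "dictionary of Series sharing the same index" picture concrete, here is a minimal sketch. The two Series and their numbers are hypothetical toy values, not real statistics.

```python
import pandas as pd

# Two Series with the same labels, given in a different order (toy values)
pop = pd.Series({"Ohio": 11.8, "Nevada": 3.1})
area = pd.Series({"Nevada": 110.6, "Ohio": 44.8})

# The DataFrame aligns both columns on the shared row labels
table = pd.DataFrame({"pop": pop, "area": area})
print(table)

# Grab an entire row by its label
ohio_row = table.loc["Ohio"]
print(ohio_row)

# Sort all columns together based on the values in one column
by_area = table.sort_values("area")
print(by_area)
```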

In [25]:
# Let's create a dataframe

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}

frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and the columns are placed according to the order of the keys in `data`.

In [27]:
# View the dataframe labeled frame

frame

| | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
| 6 | Colorado | 2002 | 3.0 |
| 7 | Colorado | 2003 | 3.2 |


In [29]:
# Inspect the dataframe

# For large DataFrames, the head method selects only the first five rows:

frame.head()

| | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


In [30]:
# Check the last five rows

frame.tail()

| | state | year | pop |
|---|---|---|---|
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
| 6 | Colorado | 2002 | 3.0 |
| 7 | Colorado | 2003 | 3.2 |
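Both `head` and `tail` default to five rows but also accept the number of rows to return. A small sketch using the same `data` dictionary as above:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}
frame = pd.DataFrame(data)

# First three rows
first_three = frame.head(3)
print(first_three)

# Last two rows
last_two = frame.tail(2)
print(last_two)
```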


We can check the type of the underlying data with `dtypes` (note that it is not `dtype` as with Series and Index objects since each column will have its own data type):

In [31]:
frame.dtypes

state     object
year       int64
pop      float64
dtype: object

We can get the underlying data with the `values` attribute:

In [33]:
frame.values

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2],
       ['Colorado', 2002, 3.0],
       ['Colorado', 2003, 3.2]], dtype=object)
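The `dtype=object` in the output above appears because a NumPy array holds a single data type, so mixing text and numbers forces the most general one. As a sketch, selecting only the numeric columns first gives a numeric array. This uses `to_numpy()`, the method that the pandas documentation recommends over `values`.

```python
import numpy as np
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}
frame = pd.DataFrame(data)

# Text and numbers together: the array falls back to the generic object type
mixed = frame.to_numpy()
print(mixed.dtype)

# Integer and float columns only: the integers are upcast to float64
numeric = frame[["year", "pop"]].to_numpy()
print(numeric.dtype)
print(numeric.shape)  # (8, 2)
```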

We can isolate the columns with the columns attribute. Notice that the columns are actually an Index object just on a different axis (columns are the horizontal index while rows are the vertical index).

In [34]:
frame.columns

Index(['state', 'year', 'pop'], dtype='object')

In [35]:
# If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
    
pd.DataFrame(data, columns=["year", "state", "pop"])

| | year | state | pop |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |
| 6 | 2002 | Colorado | 3.0 |
| 7 | 2003 | Colorado | 3.2 |


In [36]:
# If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

In [37]:
# View

frame2

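| | year | state | pop | debt |
|---|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 | NaN |
| 1 | 2001 | Ohio | 1.7 | NaN |
| 2 | 2002 | Ohio | 3.6 | NaN |
| 3 | 2001 | Nevada | 2.4 | NaN |
| 4 | 2002 | Nevada | 2.9 | NaN |
| 5 | 2003 | Nevada | 3.2 | NaN |
| 6 | 2002 | Colorado | 3.0 | NaN |
| 7 | 2003 | Colorado | 3.2 | NaN |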


A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [38]:
# Check the column names
frame2.columns

# Select a column using dictionary-like notation
frame2["state"]

0        Ohio
1        Ohio
2        Ohio
3      Nevada
4      Nevada
5      Nevada
6    Colorado
7    Colorado
Name: state, dtype: object

In [39]:
# Check the column using the dot attribute notation

frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
6    2002
7    2003
Name: year, dtype: int64
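One caveat with dot notation, sketched below: it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The `pop` column in this very example is such a clash, because `DataFrame.pop` is a method, so the bracket form is the safer habit.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# frame2.pop is the DataFrame.pop method, not the "pop" column
print(callable(frame2.pop))  # True

# Bracket notation always returns the column
pop_column = frame2["pop"]
print(pop_column.tolist())

# For a non-clashing name, both notations give the same Series
print(frame2.year.equals(frame2["year"]))  # True
```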

Rows can also be retrieved by label or by integer position with the special `loc` and `iloc` attributes: `loc` performs label-based indexing and `iloc` performs integer-position-based indexing. Since a DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integer positions (`iloc`).

In [41]:
frame2.loc[7] # select the row whose index label is 7

year         2003
state    Colorado
pop           3.2
debt          NaN
Name: 7, dtype: object

In [42]:
frame2.iloc[2] # select the row at integer position 2 (the third row)

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object
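Because `frame2` uses the default integer index, its labels and positions happen to coincide. The sketch below uses a hypothetical letter index (`"a"` through `"h"`, not part of the original example) to make the difference visible, and also selects rows and columns together.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}

# Hypothetical letter labels so that labels and positions differ
lettered = pd.DataFrame(data, columns=["year", "state", "pop"], index=list("abcdefgh"))

# Label "c" and integer position 2 refer to the same row
row_by_label = lettered.loc["c"]
row_by_position = lettered.iloc[2]
print(row_by_label.equals(row_by_position))  # True

# Rows and columns together: label slices include the end label
by_label = lettered.loc["a":"c", ["year", "state"]]
print(by_label)

# Integer slices exclude the end position, as in NumPy
by_position = lettered.iloc[0:2, 0:2]
print(by_position)
```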

Columns can be modified by assignment. For example, the empty debt column could be assigned a scalar value or an array of values:

In [43]:
frame2["debt"] = 16.5

In [44]:
# view

frame2

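| | year | state | pop | debt |
|---|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 | 16.5 |
| 1 | 2001 | Ohio | 1.7 | 16.5 |
| 2 | 2002 | Ohio | 3.6 | 16.5 |
| 3 | 2001 | Nevada | 2.4 | 16.5 |
| 4 | 2002 | Nevada | 2.9 | 16.5 |
| 5 | 2003 | Nevada | 3.2 | 16.5 |
| 6 | 2002 | Colorado | 3.0 | 16.5 |
| 7 | 2003 | Colorado | 3.2 | 16.5 |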
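The cell above shows the scalar case. As a sketch of the array case, the assigned array must have the same length as the DataFrame. If you assign a Series instead, its labels are aligned to the DataFrame's index and any rows without a match become missing. The debt numbers below are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# Assign an array: its length must match the number of rows (8 here)
frame2["debt"] = np.arange(8.0)
array_debt = frame2["debt"].tolist()
print(array_debt)

# Assign a Series: values are matched on index labels 2, 4, and 5;
# every other row gets NaN
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame2["debt"] = val
print(frame2)
```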


Assigning a column that doesn’t exist will create a new column.

The `del` keyword will delete columns like with a dictionary. As an example, first add a new column of Boolean values where the state column equals "Ohio", then delete the column:

In [45]:
# Create a new column called eastern that is True where state equals "Ohio" and False otherwise

frame2["eastern"] = frame2["state"] == "Ohio"

In [46]:
# View

frame2

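| | year | state | pop | debt | eastern |
|---|---|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 | 16.5 | True |
| 1 | 2001 | Ohio | 1.7 | 16.5 | True |
| 2 | 2002 | Ohio | 3.6 | 16.5 | True |
| 3 | 2001 | Nevada | 2.4 | 16.5 | False |
| 4 | 2002 | Nevada | 2.9 | 16.5 | False |
| 5 | 2003 | Nevada | 3.2 | 16.5 | False |
| 6 | 2002 | Colorado | 3.0 | 16.5 | False |
| 7 | 2003 | Colorado | 3.2 | 16.5 | False |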


In [47]:
# Delete the new column (it was only added for illustration)

del frame2["eastern"]

In [49]:
# View if the column is deleted

frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Here are some commonly used attributes of DataFrame objects and what each one returns:

- `dtypes`: The data types of each column
- `shape`: Dimensions of the DataFrame object in a tuple of the form (number of rows, number of columns)
- `index`: The Index object along the rows of the DataFrame object
- `columns`: The names of the columns (as an Index object)
- `values`: The data in the DataFrame object
- `empty`: Whether the DataFrame object is empty (`True` or `False`)
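A quick sketch of the attributes from this list that were not demonstrated earlier, using the same `frame2` construction as above:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Colorado", "Colorado"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 3.0, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# (number of rows, number of columns)
print(frame2.shape)  # (8, 4)

# The row index: a default RangeIndex here
print(frame2.index)  # RangeIndex(start=0, stop=8, step=1)

# empty is False for a DataFrame with data and True for one without
print(frame2.empty)          # False
print(pd.DataFrame().empty)  # True
```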

Resources:

- https://wesmckinney.com/book/accessing-data
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
