UM MSBA - BGEN632

# Week 8: Advanced Data Manipulation

Last week, we covered approaches for filtering and querying data. In this tutorial, we will go over more advanced filtering and querying with `pandas` in Python. To start, let's set up our notebook.

### Notebook Setup

In [16]:
# import modules
import pandas as pd
import os
import numpy as np

In [18]:
# set working directory
os.chdir("/Users/obn/Documents/GitHub/UM-BGEN632/week8labs/data")  # add your filepath
os.getcwd()  # confirm change

'/Users/obn/Documents/GitHub/UM-BGEN632/week8labs/data'

In [36]:
# load data
ozone_df = pd.read_table("ozone.data.txt")
ozone_df.info()  # quick inspect

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   rad     111 non-null    int64  
 1   temp    111 non-null    int64  
 2   wind    111 non-null    float64
 3   ozone   111 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 3.6 KB


## Advanced Data Manipulation in Python
Many of the operations found in pandas mimic those found in R's [tidyverse](https://www.tidyverse.org/) collection of packages. For example, pandas provides close equivalents to the functions in dplyr, a core tidyverse package designed to support data wrangling tasks:

| `dplyr` | `pandas` |
|:---:|:---:|
| select | [filter](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html) |
| filter | [query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) |
| arrange | [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) |
| mutate | [assign](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html) |
| rename | [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) |
| summarize | [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) |
| group_by | [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) |
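
Several functions in the table (`rename`, `assign`, `groupby`, and `agg`) can be combined in a single expression. The cell below is a minimal sketch on a small hand-built frame rather than the lab data: `sample_df`, the `temp_band` column, and the 80-degree cutoff are assumptions made for illustration, while the `rad`, `temp`, and `wind` values are copied from rows that appear in this tutorial's output.

```python
import numpy as np
import pandas as pd

# Hand-built illustration frame (not the full ozone data set)
sample_df = pd.DataFrame({
    "rad":  [267, 285, 51, 284, 320, 291],
    "temp": [92, 84, 79, 72, 73, 90],
    "wind": [6.3, 6.3, 6.3, 20.7, 16.6, 13.8],
})

summary = (sample_df
    .rename(columns={"rad": "radiation"})  # dplyr: rename
    .assign(temp_band=lambda d: np.where(d["temp"] >= 80, "hot", "mild"))  # dplyr: mutate (cutoff is arbitrary)
    .groupby("temp_band")  # dplyr: group_by
    .agg(mean_wind=("wind", "mean"), max_radiation=("radiation", "max"))  # dplyr: summarize
)

print(summary)
```

The `agg` call uses named aggregation: each keyword becomes an output column, and each tuple gives the source column and the function to apply to it.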

<br><br>
In addition to these functions, both libraries let you *pipe* operations together. Piping passes the output of one line of code as the input to the next line of code. In R this is done with a pipe operator; in pandas the same effect comes from chaining method calls. Piping is useful for organizing multiple steps that should be run in sequence.

For those familiar with R, we can use the pipe operator `%>%` to create streamlined, simple code like so:

```R
ozone_df %>% 
    select(rad, temp, wind) %>%  # select desired columns
    filter(wind == 6.3) %>%  # keep rows based on condition
    head()  # display first n rows
```

In the code above, the output of each line is passed to the line below it. The advantage is that we do *not need to assign each intermediate output to a variable*. Assigning a value to a variable is one of the most fundamental aspects of programming, but skipping the intermediate assignments here reduces the complexity of the code and the amount of text we type, and it gives the code a clean appearance.

The equivalent code in Python is provided in the cell below. 

In [49]:
(ozone_df
.filter(["rad", "temp", "wind"])  # select desired columns
.query("wind == 6.3")  # keep rows based on condition
.head()  # display first n rows
)

| | rad | temp | wind |
|:---:|:---:|:---:|:---:|
| 39 | 267 | 92 | 6.3 |
| 47 | 285 | 84 | 6.3 |
| 69 | 51 | 79 | 6.3 |
| 80 | 237 | 96 | 6.3 |
| 81 | 188 | 94 | 6.3 |
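
For contrast, the chain above can also be written with one intermediate variable per step. The sketch below runs both styles on a small hand-built frame and checks that they agree. The names `sample_df`, `selected`, `filtered`, `stepwise`, and `chained` are made up for this illustration, and the `ozone` values are placeholders rather than values from the real file.

```python
import pandas as pd

# Hand-built illustration frame; the ozone column holds placeholder values
sample_df = pd.DataFrame({
    "rad":   [267, 285, 51, 284, 320, 291],
    "temp":  [92, 84, 79, 72, 73, 90],
    "wind":  [6.3, 6.3, 6.3, 20.7, 16.6, 13.8],
    "ozone": [0, 0, 0, 0, 0, 0],
})

# Step-by-step style: every intermediate result gets its own name
selected = sample_df.filter(["rad", "temp", "wind"])
filtered = selected.query("wind == 6.3")
stepwise = filtered.head()

# Chained style: each method is called on the result of the previous one
chained = (sample_df
    .filter(["rad", "temp", "wind"])
    .query("wind == 6.3")
    .head()
)

print(stepwise.equals(chained))
```

Both versions do the same work. The chained form simply avoids naming `selected` and `filtered`, which are never used again.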


You can see the similarities between the R code and the Python code above. Note that we wrapped the entire Python expression in parentheses `()`; otherwise, we would need to place a backslash at the end of every line except the last, like so:

```Python
ozone_df \
.filter(["rad", "temp", "wind"]) \
.query("wind == 6.3") \
.head() 
```

Another difference between pandas and tidyverse is the number of (single or double) quotes. tidyverse lets you write column names bare wherever it can, whereas pandas receives column names and query expressions as ordinary Python strings, which must be quoted. Could the pandas developers have designed it differently? Sure. Yet, it does keep things Pythonic.
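
To make the role of the quotes concrete, here is a rough sketch of the same row filter written three ways. The frame `sample_df` and the result names are assumptions for illustration only; the values are copied from rows shown in this tutorial's output.

```python
import pandas as pd

# Hand-built illustration frame (not the full ozone data set)
sample_df = pd.DataFrame({
    "rad":  [267, 285, 51, 284, 320, 291],
    "temp": [92, 84, 79, 72, 73, 90],
    "wind": [6.3, 6.3, 6.3, 20.7, 16.6, 13.8],
})

# 1. query(): the whole condition is a quoted string that pandas parses
by_query = sample_df.query("wind == 6.3")

# 2. Boolean mask: only the column name is quoted
by_mask = sample_df[sample_df["wind"] == 6.3]

# 3. Attribute access: no quotes, but it only works for column names that
#    are valid Python identifiers and do not clash with DataFrame methods
by_attr = sample_df[sample_df.wind == 6.3]

print(by_query.equals(by_mask) and by_mask.equals(by_attr))
```

All three return the same rows, so the choice is mostly about readability within a chain. `query` tends to read closest to the dplyr `filter` call.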

Let's use a different example that adds in more complexity. Here is an example with R code:

```R
ozone_df %>% 
    select(rad, temp, wind) %>%  # select desired columns
    filter(temp %in% 62:90) %>%  # keep rows based on value in specified range
    mutate(rad_wind = rad * wind) %>%  # create a new column based on mathematical operation applied to two other columns
    arrange(desc(rad_wind)) %>%  # sort the data largest to smallest based on value in new column
    head()  # display first n rows
```

Okay, now we'll convert the above example over to Python:

In [67]:
temp_range = list(range(62, 91))  # temperatures 62 through 90; referenced inside query() with @

(ozone_df
.filter(['rad', 'temp', 'wind'])  # select desired columns
.query('temp in @temp_range')  # keep rows based on value in specified range
.assign(rad_wind = lambda df: df.rad * df.wind)  # create a new column from the product of two columns in the filtered data
.sort_values(by = 'rad_wind', ascending = False)  # sort the data largest to smallest based on value in new column
.head(6)  # display first n rows
)

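| | rad | temp | wind | rad_wind |
|:---:|:---:|:---:|:---:|:---:|
| 29 | 284 | 72 | 20.7 | 5878.8 |
| 17 | 320 | 73 | 16.6 | 5312.0 |
| 25 | 291 | 90 | 13.8 | 4015.8 |
| 73 | 259 | 77 | 15.5 | 4014.5 |
| 93 | 259 | 76 | 15.5 | 4014.5 |
| 11 | 334 | 64 | 11.5 | 3841.0 |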
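
Two details of the Python version deserve a closer look. Inside a `query` string, the `@` prefix tells pandas to look up a Python variable instead of a column. Separately, `assign` accepts a callable, which receives the data frame as it stands at that point in the chain. The sketch below shows both on a small hand-built frame. The names `sample_df`, `temp_range`, and the result variables are assumptions for illustration, and the row values are copied from this tutorial's output.

```python
import pandas as pd

# Hand-built illustration frame (not the full ozone data set)
sample_df = pd.DataFrame({
    "rad":  [267, 285, 51, 284, 320, 291],
    "temp": [92, 84, 79, 72, 73, 90],
    "wind": [6.3, 6.3, 6.3, 20.7, 16.6, 13.8],
})

temp_range = list(range(62, 91))  # 62 through 90 inclusive

# Three ways to keep rows with temp between 62 and 90 (temp holds whole numbers here)
with_at = sample_df.query("temp in @temp_range")  # @ refers to the Python variable
with_bounds = sample_df.query("temp >= 62 and temp <= 90")  # explicit bounds
with_between = sample_df[sample_df["temp"].between(62, 90)]  # inclusive on both ends by default

# A callable passed to assign() receives the already-filtered frame
result = (sample_df
    .query("temp in @temp_range")
    .assign(rad_wind=lambda d: d["rad"] * d["wind"])
    .sort_values(by="rad_wind", ascending=False)
)

print(result)
```

The list-membership form mirrors R's `%in% 62:90` most directly. The bounds-based forms would also keep non-integer temperatures inside the range, which the list form would drop.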
