<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module9_pandas2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://images.pexels.com/photos/1661535/pexels-photo-1661535.jpeg?cs=srgb&dl=pexels-diana-silaraja-1661535.jpg&fm=jpg' width=700>  
Photo by Diana Silaraja production from Pexels

# APEX Faculty Training, Module 9: Pandas Part 2

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Feb 20, 2022  

**Learning outcomes**  
1. Learn to read in spreadsheet data ("dataframes") in Python with the Pandas library.
2. Learn to manipulate the contents of a dataframe with Pandas methods.

## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`

<font color='red'>Exercise 0</font>  
Before you start with this module, run the cell below in order for the rest of the exercises to work. Again, you will see no output. These lines simply import the Pandas library and read the csv file into the `my_df` variable as a dataframe, as you learned in the last module.

In [1]:
import pandas as pd
filepath = "https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/state_pop.csv"
my_df = pd.read_csv(filepath)

## 2. Smallest or Largest Observations
Pandas has simple methods for determining the smallest or largest set of 'n' observations in a dataframe.  

Smallest: `df_name.nsmallest(n, 'col_name')`  
Largest: `df_name.nlargest(n, 'col_name')`  

The `n` in these expressions represents an integer value of the desired set size. 
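As a quick illustration, here is a minimal sketch of both methods. It uses a small made-up dataframe (`fruit_df`, with invented prices) rather than the state population data, so the names and values are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical toy data, not the state_pop.csv file used in this module
fruit_df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'price': [1.20, 0.50, 3.00, 2.75, 4.10],
})

# The two rows with the lowest prices, ordered from lowest to highest
cheapest = fruit_df.nsmallest(2, 'price')

# The two rows with the highest prices, ordered from highest to lowest
priciest = fruit_df.nlargest(2, 'price')

print(cheapest)
print(priciest)
```

Notice that each result is itself a dataframe containing all of the original columns, just with fewer rows.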

<font color='red'>Exercise 1</font>  
The cell below subsets the dataframe to show the state with the highest population. Run the cell to see the output.

In [None]:
my_df.nlargest(1, 'totalPop')

<font color='red'>Exercise 2</font>  
Now subset the dataframe to show the five states with the smallest Hispanic population.

## 3. Random Selection
The `sample()` method can be used to subset a random sample of rows. The generic syntax for this method is:  

`df_name.sample(n = x)`  

where x is an integer indicating the desired sample size.  
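The sketch below shows `sample()` on a small made-up dataframe (`scores_df` is an assumption for illustration). It also uses the optional `random_state` argument, which is not covered above but can be handy when you want the same "random" rows every time you run a cell.

```python
import pandas as pd

# Hypothetical toy data for illustration only
scores_df = pd.DataFrame({
    'student': ['Ana', 'Ben', 'Cam', 'Dee', 'Eli', 'Fay'],
    'score': [88, 92, 75, 64, 97, 81],
})

# Three rows chosen at random; the rows may differ from run to run
random_rows = scores_df.sample(n = 3)

# Fixing random_state makes the same rows come back on every run
repeatable_rows = scores_df.sample(n = 3, random_state = 42)
same_rows_again = scores_df.sample(n = 3, random_state = 42)

print(random_rows)
print(repeatable_rows)
```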

<font color='red'>Exercise 3</font>  
Using the syntax shown above, display a sample of six rows of the `my_df` dataframe.

## 4. Subsetting Variables (Columns)
We've already learned the Attribute method and the Label method for selecting a single column. To review:  

Attribute method: `df_name.col_name`  
Label method: `df_name['col_name']`  

But what if you want to subset a specific selection of columns? The generic syntax for this is:  

`df_name[['col_1', 'col_2', 'etc.']]`

Notice the double brackets here. This is different from the Label approach, which uses only one set of brackets with a single column name inside. The inner brackets form a Python list of column names, and the outer brackets do the subsetting.
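To see the difference between single and double brackets, here is a small sketch using a made-up dataframe (`pets_df` and its columns are assumptions for illustration). Single brackets with one column name return a Series, while double brackets return a dataframe.

```python
import pandas as pd

# Hypothetical toy data for illustration only
pets_df = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Kiwi'],
    'species': ['dog', 'cat', 'bird'],
    'age': [4, 7, 2],
})

# Single brackets with one column name: the result is a Series
one_column = pets_df['name']

# Double brackets with a list of column names: the result is a dataframe
two_columns = pets_df[['name', 'age']]

print(type(one_column))
print(type(two_columns))
print(two_columns)
```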

<font color='red'>Exercise 4</font>  
In the cell below, subset the `my_df` dataframe to show just the `state` and `totalPop` columns.

## 5. Subsetting Rows and Columns
Just as we can subset by either rows or columns, we can also subset by both at the same time. To do this, we use `loc`, which is followed by square brackets rather than parentheses. The generic syntax goes like this:  

`df_name.loc[boolean_expression, ['col_1', 'col_2', 'etc.']]`

<font color='red'>Exercise 5</font>  
An example of this process is to display only the `state` and `totalPop` columns for states with a total population greater than 15 million. Run the cell below to see the output.

In [None]:
my_df.loc[my_df.totalPop > 15000000, ['state', 'totalPop']]
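The Boolean expression does not have to compare a single column to a number; it can also be built from arithmetic on columns. The sketch below illustrates this with a made-up dataframe (`classes_df` and its columns are assumptions, not part of the state data), selecting classes that are less than half full.

```python
import pandas as pd

# Hypothetical toy data for illustration only
classes_df = pd.DataFrame({
    'course': ['Bio 1', 'Chem 1', 'Math 30', 'Phys 50'],
    'enrolled': [20, 45, 12, 30],
    'capacity': [50, 50, 40, 30],
})

# Boolean Series: True for classes that are less than half full
is_under_half = classes_df.enrolled / classes_df.capacity < 0.5

# Keep the rows where the condition is True, and only two of the columns
under_half = classes_df.loc[is_under_half, ['course', 'enrolled']]

print(under_half)
```

Storing the condition in its own variable is optional, but it can make a long `loc` expression easier to read.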

<font color='red'>Exercise 5 (continued)</font>  
Now try something a bit harder. In the cell below, write a line of code to display the `state` and `hispPop` columns for states with less than a 3% Hispanic population.

## 6. Sorting Dataframes
So far we've seen how to choose which rows or columns to display, but it's often useful to view all of the data sorted by a particular column. By default, the observations appear in order of their index (the number in the leftmost column). Our example dataframe happens to already have the rows sorted alphabetically by state name, but what if we wanted to see the data sorted by the total population?

### 6a. Ascending Order
The `sort_values()` method provides exactly this functionality. To call this method, follow this syntax:  

`df_name.sort_values(by = 'col_name')`

<font color='red'>Exercise 6</font>  
Try sorting the `my_df` dataframe by the `totalPop` column. 

### 6b. Descending Order
By default, the `sort_values()` method sorts in ascending order. To sort in descending order, one more argument is necessary:  

`df_name.sort_values(by = 'col_name', ascending = False)`  

---
### <font color='blue'>Reminder</font>
The keyword `False` does not have quotation marks surrounding it because it is a Boolean literal, not a string.

---
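Here is a minimal sketch of both sort directions, using a made-up dataframe (`trails_df` and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
trails_df = pd.DataFrame({
    'trail': ['North Loop', 'Creek Path', 'Ridge Walk', 'Lake Trail'],
    'miles': [6.1, 4.7, 7.3, 3.2],
})

# Ascending order is the default: shortest trail first
shortest_first = trails_df.sort_values(by = 'miles')

# ascending = False reverses the order: longest trail first
longest_first = trails_df.sort_values(by = 'miles', ascending = False)

print(shortest_first)
print(longest_first)
```

In the printed output, the numbers in the leftmost column stay attached to their rows, so they no longer appear in counting order after sorting.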

### 6c. Viewing vs. Changing
It is important to understand that the `sort_values()` method does not change the original dataframe; it returns a new, sorted dataframe and leaves the original as it was. To verify that the data haven't changed, we can simply look at the first few rows of the original dataframe.

<font color='red'>Exercise 7</font>  
Run the cell below to display the first five rows of the dataframe to verify that the original order hasn't changed.

In [None]:
my_df.head()

If you actually want the dataframe to change according to the `sort_values()` method, you'll need to include yet another argument in the parentheses: `inplace = True`.
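The difference between viewing and changing can be sketched with a small made-up dataframe (`temps_df` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
temps_df = pd.DataFrame({
    'city': ['Ames', 'Bend', 'Cary'],
    'high_temp': [71, 64, 80],
})

# Without inplace, the sorted result is returned and temps_df is unchanged
sorted_copy = temps_df.sort_values(by = 'high_temp')
order_before = list(temps_df['city'])
print(temps_df.head())

# With inplace = True, temps_df itself is reordered and nothing is returned
temps_df.sort_values(by = 'high_temp', inplace = True)
order_after = list(temps_df['city'])
print(temps_df.head())
```

Another common way to keep the sorted result is to assign it back to the same variable, as in `temps_df = temps_df.sort_values(by = 'high_temp')`.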

<font color='red'>Exercise 8</font>  
Sort the `my_df` dataframe again by total population in ascending order (default) as you did in Exercise 6, but include the above argument this time to actually change the dataframe. Then, add a line and display the first five rows again to ensure that the dataframe has changed.

### 6d. Sorting by More than One Column
A dataframe can also be sorted by two or more columns. The dataframe we've been working with isn't a great example for this, but say you wanted to sort a dataframe of college students first by their year in school and then by last name. Doing so would look something like this:  

`students_df.sort_values(by = ['year', 'last_name'])`

This line of code would display the whole dataframe of students sorted by year and then by last name within each year. 

Therefore, the generic syntax to sort first by 'col_1' and then by 'col_2', etc., is:  

`df_name.sort_values(by = ['col_1', 'col_2', 'etc.'])`
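To make the student example concrete, here is a sketch with a made-up `students_df` (the names and years are invented for illustration). It also shows that `ascending` accepts a list, with one `True` or `False` per column, in case you want a different direction for each column.

```python
import pandas as pd

# Hypothetical toy data for illustration only
students_df = pd.DataFrame({
    'last_name': ['Nguyen', 'Garcia', 'Smith', 'Adams', 'Lee'],
    'year': [2, 1, 2, 1, 2],
})

# Sort by year first, then alphabetically by last name within each year
by_year_then_name = students_df.sort_values(by = ['year', 'last_name'])

# Year in descending order, last name in ascending order
mixed_order = students_df.sort_values(by = ['year', 'last_name'],
                                      ascending = [False, True])

print(by_year_then_name)
print(mixed_order)
```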

## All done!
You've finished learning about the Pandas library! Next, you'll learn how you can create and import your own data rather than relying on existing datasets.