<a href="https://colab.research.google.com/github/wintera71/BEACO2N-Modules/blob/main/Lesson%202%3A%20Introduction%20to%20Pandas/IN_CLASS_Introduction_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BEACO2N Notebook 2a: Introduction to Pandas**

Notebook developed by: *Skye Pickett, Alec Morgan, Lan Dinh, Su Min Park, Amy Castillo*


### Learning Outcomes
Working through this notebook, you will learn about:
  1. The `DataFrame` and `Series` data structures of the *pandas* library
  1. Importing CSV data into a *pandas* `DataFrame`
  1. Accessing and manipulating data within a `DataFrame` and `Series`



## Table of Contents
1. Welcome to Pandas
2. Pandas Structure
> 2.1 Series
<br> 2.2 DataFrames
3. Importing to DataFrames
4. Manipulating DataFrames

*Note: In this notebook, there are some more advanced topics that are "optional". This means you can just read over these sections; don't worry about fully understanding these parts unless you are really interested. They may be useful later in the course, but for now they are not necessary, so feel free to just skim the parts labelled "Optional"!*






<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## 1. Welcome to Pandas

[*Pandas*](http://pandas.pydata.org/) is a column-oriented data analysis Application Programming Interface (API). It's a great tool for handling and analyzing input data, and many machine learning (ML) frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.



**Reminder:** every time we see a code cell, we should run it to see what it outputs. It's good practice to also run text cells; that way you're in the habit of running everything as you work down the notebook. To run a cell:


*   Click the **Play icon** in the left gutter of the cell;
*   Type **Shift+Enter** or **Shift+Return** to run the cell and move focus to the next cell (a new cell is added if none exists)

The line `import pandas as pd` imports the pandas library and gives it the alias pd, which is a common convention in Python. The line `pd.__version__` prints the version number of the pandas library.<br>**Run the cell below!** It will print out the version number below the cell.


In [None]:
from __future__ import print_function

import pandas as pd
pd.__version__

'2.2.2'

***
## 2. Pandas Structure

The primary data structures in *pandas* are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`. Series have similar properties and look similar to *lists* (covered in *Intro to Colab* notebook).


### 2.1 Series
One way to create a `Series` is to construct a `Series` object. This is done by using the *Series* function call from the Pandas package (*pd*).

For example, run the code cell below.

In [None]:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

Unnamed: 0,0
0,San Francisco
1,San Jose
2,Sacramento


San Francisco, San Jose, and Sacramento appear to be the values of a column! The above is a *Series*, and *Series* are what make up each column of a *DataFrame*.

Let's move on to a slightly more complex use case. In the above example, we input `['San Francisco', 'San Jose', 'Sacramento']` in the parentheses of the `pd.Series(___)` function to create a series of city names.
1. Let's do that again below, but this time, let's save this series by giving it a name! We need to assign it to a *variable* (review *Intro to Colab* notebook if you don't remember the term variables). Let's call the variable `city_names`. (See cell below)

2. Next, let's make a series of population sizes of each of those cities. We can make this series with the same format (putting the values in a comma-separated list between brackets). We've assigned this series to the variable `population`. Run the cell below to see what the series, `population`, looks like.


In [None]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
print(population)

0     852469
1    1015785
2     485199
dtype: int64
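
As a small illustration of working with a `Series`, the sketch below reads one value by position and applies arithmetic to the whole column. The names `first_city_population` and `population_in_thousands` are made up for this example and do not appear elsewhere in the notebook.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

# Read a single value by its position (0 is the first row)
first_city_population = population.iloc[0]
print(first_city_population)

# Arithmetic is applied to every value in the Series at once
population_in_thousands = population / 1000
print(population_in_thousands)

# len() gives the number of values, just like with a list
print(len(city_names))
```

Unlike a plain list, dividing a `Series` by a number divides every value, which is one reason a `Series` is convenient for data analysis.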


### 2.2 DataFrames
**Now we have two series! Let's put them into a DataFrame!**

For a `DataFrame`, we need the name of the series (so your code knows what values the column should be made up of) *and* a name for the column. Similarly to how we made a series with `pd.Series`, we make DataFrames with the `pd.DataFrame` function. `DataFrame` objects can be created by putting in something called a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) in between the parentheses. To understand what a Python dictionary is, think of a real life dictionary: a book full of words and their corresponding definitions. <br>The idea here is the same: a comma-separated list where each component of the list has a name (also called "key") and a corresponding value (which in our case will be the name of the series).
>Format: pd.DataFrame({**_<mark style="background-color: red;">"column_name1": column1_values</mark>_**, <mark style="background-color: yellow;">**"column_name2": column2_values,** ...}</mark>)



In the cell below, we have the column name "City name" with values of the series `city_names` and the column name "Population" with the values of the series `population`. (Remember we assigned both series to these variable names in the cells above).<br>
**Run this cell to see the DataFrame we made!**

In [None]:
pd.DataFrame({'City name': city_names, 'Population': population})

Unnamed: 0,City name,Population
0,San Francisco,852469
1,San Jose,1015785
2,Sacramento,485199


*Reminder:* We put the names of the columns `"City name"` and `"Population"` in quotations because they are *strings*. If we tried the code statement above without quotations, we would get a NameError because it would try to reference variables we haven't made! Note: There's no difference between single quotes ' ' and double quotes " ". They just need to match, so a string can be `"hello"` or `'hello'`, but not `"hello'`.
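
To make the dictionary idea concrete, here is a minimal sketch that builds the dictionary first and then passes it to `pd.DataFrame`. The variable names `columns` and `cities` are hypothetical and chosen only for this example.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

# A dictionary pairs each key (the column name) with a value (the column's data)
columns = {'City name': city_names, 'Population': population}
print(list(columns.keys()))

# Passing the dictionary to pd.DataFrame creates one column per key
cities = pd.DataFrame(columns)

# Selecting one column by its name gives back a Series
print(type(cities['Population']))
print(cities['Population'])
```

Note that the column name inside the square brackets is a string, so it needs quotation marks just like the keys of the dictionary.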

*The code cell below is optional/advanced. Skip to "Your turn!" if you're not interested in this section.*

You can also create `DataFrame` objects by specifying the rows. For example:

Note: we have to wrap our data in brackets `[]` to be able to pass it into the `data` argument of the `pd.DataFrame` function!

In [None]:
pd.DataFrame(data = [('San Francisco', 852469), ('San Jose', 1015785), ('Sacramento', 485199)],
             columns = ['City name', 'Population'])

Unnamed: 0,City name,Population
0,San Francisco,852469
1,San Jose,1015785
2,Sacramento,485199
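
For the curious, the sketch below builds the same table both ways, column by column and row by row, and compares the results. The names `by_column` and `by_row` are made up for this example, and plain lists are used in place of the `Series` variables to keep it self-contained.

```python
import pandas as pd

# Column-wise: each dictionary key is a column name
by_column = pd.DataFrame({'City name': ['San Francisco', 'San Jose', 'Sacramento'],
                          'Population': [852469, 1015785, 485199]})

# Row-wise: each tuple is one row, and `columns` names the positions
by_row = pd.DataFrame(data=[('San Francisco', 852469),
                            ('San Jose', 1015785),
                            ('Sacramento', 485199)],
                      columns=['City name', 'Population'])

# Compare the two tables column by column
print(by_column['City name'].tolist() == by_row['City name'].tolist())
print(by_column['Population'].tolist() == by_row['Population'].tolist())
```

Which form to use mostly depends on how your data arrives: a dictionary is natural when you already have whole columns, and a list of rows is natural when you collect one record at a time.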


***

## 3. Importing to DataFrames

More often than creating our own DataFrames from scratch, we will load an entire file into a `DataFrame`.

Let's import a file and save it as a DataFrame in the cell below. **The data below is about Bay Area fine particulate matter concentrations from August 2021** (a wildfire period). The link we are pulling the data from is saved in the `pm_data` variable. We use the `pm_data` variable inside of the function **`pd.read_csv`** to be able to load the data and create a `DataFrame` that we label as the variable, `wildfire_pm`, to stand for **"wildfire particulate matter concentrations"**. Run the cell below and notice what we're doing as you'll do it solo later on!

In [None]:
# Holds the link we are pulling data from
pm_data = "http://128.32.208.8/node/31/measurements_all/csv?name=Fred%20T.%20Korematsu%20Elementary%20School&interval=60&variables=pm2_5&start=2021-08-01%2012:00:00&end=2021-08-31%2023:00:00&chart_type=measurement"

# Using the pm_data variable, we load the data into a DataFrame
wildfire_pm = pd.read_csv(pm_data, on_bad_lines='skip')
wildfire_pm

Unnamed: 0,local_timestamp,epoch,datetime,node_file_id,pm2_5,node_id
0,2021-08-01 12:00:00,1.627844e+09,2021-08-01 19:00:00,2155678,1.89882,31
1,2021-08-01 13:00:00,1.627848e+09,2021-08-01 20:00:00,2155715,1.57583,31
2,2021-08-01 14:00:00,1.627852e+09,2021-08-01 21:00:00,2155750,2.00943,31
3,2021-08-01 15:00:00,1.627855e+09,2021-08-01 22:00:00,2155827,1.52706,31
4,2021-08-01 16:00:00,1.627859e+09,2021-08-01 23:00:00,2155877,1.22749,31
...,...,...,...,...,...,...
724,2021-08-31 19:00:00,1.630462e+09,2021-09-01 02:00:00,2219863,4.32151,31
725,2021-08-31 20:00:00,1.630465e+09,2021-09-01 03:00:00,2219870,5.30118,31
726,2021-08-31 21:00:00,1.630469e+09,2021-09-01 04:00:00,2219962,4.74171,31
727,2021-08-31 22:00:00,1.630472e+09,2021-09-01 05:00:00,2220001,3.55556,31
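
If the link is unavailable, `pd.read_csv` can be tried on a small piece of CSV text instead. The sketch below is only an illustration: the text is a few values copied from the output above (two columns kept), and the names `csv_text` and `sample_pm` are made up for this example.

```python
import io
import pandas as pd

# A few rows in CSV form: a header line, then one line per record
csv_text = """local_timestamp,pm2_5
2021-08-01 12:00:00,1.89882
2021-08-01 13:00:00,1.57583
2021-08-01 14:00:00,2.00943
"""

# io.StringIO lets pd.read_csv treat the string as if it were a file
sample_pm = pd.read_csv(io.StringIO(csv_text))
print(sample_pm)

# .shape reports (number of rows, number of columns)
print(sample_pm.shape)
```

The first line of the CSV becomes the column names, and every following line becomes one row of the `DataFrame`.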


Let's now make sure we understand our data. Use the `.columns` attribute (no parentheses needed) to list the column names of the `wildfire_pm` DataFrame we just created.

In [None]:
wildfire_pm.columns

Index(['local_timestamp', 'epoch', 'datetime', 'node_file_id', 'pm2_5',
       'node_id'],
      dtype='object')

Below are column descriptions: *It is always important to know what our data represents!*
* local_timestamp: Pacific time at the node
* epoch: Unix epoch time
* datetime: UTC time at the node
* node_file_id: Name of the file the data is stored in for that hour
* pm2_5: PM2.5 concentration in ug/m^3
* node_id: Each node is assigned a node identification number
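
To see how the three time columns relate, the sketch below converts one epoch value into UTC and Pacific time. The value `1627844400` is an assumption: it is the whole-second epoch consistent with the first row's rounded `1.627844e+09` and its `datetime` of `2021-08-01 19:00:00`. The variable names are illustrative only.

```python
import pandas as pd

# Unix epoch time counts seconds since 1970-01-01 00:00:00 UTC
epoch_seconds = 1627844400  # assumed exact value for the first row

# Convert the epoch value to a UTC timestamp
utc_time = pd.to_datetime(epoch_seconds, unit='s', utc=True)
print(utc_time)

# Convert the same moment to Pacific time
pacific_time = utc_time.tz_convert('America/Los_Angeles')
print(pacific_time)
```

All three columns describe the same moment; they only differ in how it is written down.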

***
## 4. Manipulating DataFrames
We only want to look at the `local_timestamp` and the `pm2_5` values because most of the other columns are only metadata. The following cell cleans and formats the data for you. You **do not** need to understand the code and what it does, but feel free to read through the comments below if you're interested. Simply **run the cell** below.

In [None]:
# Formats the timestamp column
wildfire_pm['timestamp'] = pd.to_datetime(wildfire_pm['local_timestamp'], format='%Y-%m-%d %H:%M:%S')
wildfire_pm.index=wildfire_pm['timestamp']

# Drop all columns except pm2_5 (the timestamp is kept as the index)
wildfire_pm = wildfire_pm.drop(['local_timestamp','epoch','datetime','node_id','node_file_id', 'timestamp'],axis = 1)

# Renaming the columns of the dataframe
wildfire_pm = wildfire_pm.rename(columns={'pm2_5': 'pm'})
wildfire_pm

Unnamed: 0_level_0,pm
timestamp,Unnamed: 1_level_1
2021-08-01 12:00:00,1.89882
2021-08-01 13:00:00,1.57583
2021-08-01 14:00:00,2.00943
2021-08-01 15:00:00,1.52706
2021-08-01 16:00:00,1.22749
...,...
2021-08-31 19:00:00,4.32151
2021-08-31 20:00:00,5.30118
2021-08-31 21:00:00,4.74171
2021-08-31 22:00:00,3.55556


*NOTE: If you get an error that says `KeyError: 'local_timestamp'`, that means you ran the cell above twice. Running a cell twice isn't generally a problem, but this cell drops the `local_timestamp` column, so on the second run pandas can no longer find it. To fix this, rerun the cell that defines `wildfire_pm` (the first code cell in Section 3), then run the cell right above this note once, and you'll have a proper DataFrame again!*

Now, `wildfire_pm` looks like what got printed above!
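
If you'd like to follow the cleaning steps more slowly, here is a sketch of the same three steps on a tiny stand-in table. The name `sample` is made up for this example, and its values are copied from the first rows of the data.

```python
import pandas as pd

# A tiny stand-in for the raw data
sample = pd.DataFrame({
    'local_timestamp': ['2021-08-01 12:00:00', '2021-08-01 13:00:00'],
    'node_id': [31, 31],
    'pm2_5': [1.89882, 1.57583],
})

# 1. Turn the text timestamps into real datetime values and use them as the index
sample['timestamp'] = pd.to_datetime(sample['local_timestamp'], format='%Y-%m-%d %H:%M:%S')
sample.index = sample['timestamp']

# 2. Drop the columns we no longer need
sample = sample.drop(['local_timestamp', 'node_id', 'timestamp'], axis=1)

# 3. Rename the remaining column
sample = sample.rename(columns={'pm2_5': 'pm'})
print(sample)

# 'local_timestamp' is gone now, which is why repeating step 1 raises a KeyError
print('local_timestamp' in sample.columns)
```

Using the timestamp as the index means each row is labelled by the time of its measurement rather than by a row number.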

Another useful function is **`DataFrame.head()`**, which displays the first 5 records of a `DataFrame` by default:

In [None]:
wildfire_pm.head()

Unnamed: 0_level_0,pm
timestamp,Unnamed: 1_level_1
2021-08-01 12:00:00,1.89882
2021-08-01 13:00:00,1.57583
2021-08-01 14:00:00,2.00943
2021-08-01 15:00:00,1.52706
2021-08-01 16:00:00,1.22749


For the `.head(n)` function, the default value (if no n is given) is 5 rows. You can input any number `n` to display the first `n` rows. Try it out for yourself!

In [None]:
wildfire_pm.head(...) # Replace ... with any number

Unnamed: 0_level_0,pm
timestamp,Unnamed: 1_level_1
2021-08-01 12:00:00,1.89882
2021-08-01 13:00:00,1.57583
2021-08-01 14:00:00,2.00943
2021-08-01 15:00:00,1.52706


The opposite function of `.head` is `DataFrame.tail()`, which displays the **last** 5 records of a `DataFrame` by default:

In [None]:
wildfire_pm.tail()

Unnamed: 0_level_0,pm
timestamp,Unnamed: 1_level_1
2021-08-31 19:00:00,4.32151
2021-08-31 20:00:00,5.30118
2021-08-31 21:00:00,4.74171
2021-08-31 22:00:00,3.55556
2021-08-31 23:00:00,4.61466


Like `.head`, `.tail` accepts any number `n` to display the last `n` rows. For example, `wildfire_pm.tail(1)` would show the very last row of the `wildfire_pm` DataFrame, and the cell below shows the last 3 rows.

In [None]:
wildfire_pm.tail(3)

Unnamed: 0_level_0,pm
timestamp,Unnamed: 1_level_1
2021-08-31 21:00:00,4.74171
2021-08-31 22:00:00,3.55556
2021-08-31 23:00:00,4.61466
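
To practice `.head` and `.tail` away from the wildfire data, here is a short sketch on a made-up table. The name `numbers` and its values are invented for this example.

```python
import pandas as pd

# A small made-up DataFrame with 8 rows
numbers = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60, 70, 80]})

# .shape reports (number of rows, number of columns)
print(numbers.shape)

# First 2 rows and last 2 rows
print(numbers.head(2))
print(numbers.tail(2))

# With no number given, .head() returns 5 rows
print(len(numbers.head()))
```

Checking `.shape` together with `.head()` and `.tail()` is a quick way to confirm that a file loaded completely.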


***
#### You've finished the **Introduction to Pandas *In Class* notebook** and are ready to begin the **Introduction to Pandas *Student Exploration* notebook**! Good job!

***
***