## Organizing Data 
Performing initial exploratory analysis of large data sets is becoming a critical skill for civil and environmental engineers. Data sets keep growing, which makes it harder to determine how the data affect our designs and decisions as engineers. If we can quickly explore a data set to determine whether it is useful to us, we can quickly and effectively tell our clients what they want to know. This lab teaches you how to organize data, including filtering, grouping, aggregating and categorizing. Let's get started!

#### Start every notebook the same:
1. Check your working directory has the data you need
2. Check you have all the relevant packages loaded
3. Load any relevant data sets

In [5]:
### Import packages
import pandas as pd

### Load data
air_data = pd.read_csv('AirQuality_Daily_StudentVersion.csv')

We may also want to make sure that the data set we imported is a dataframe. A dataframe is a pandas data structure in which each column is treated as a separate variable that can then be manipulated. `pd.read_csv` already returns a dataframe, but if we ever need to coerce a data set into a dataframe, we can use the following syntax. Note that this syntax overwrites the `air_data` we created above! Also, the `pd.` before the function tells Python that this function comes from the pandas package.

In [6]:
air_data = pd.DataFrame(air_data)
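
As a quick sanity check, here is a minimal sketch of coercing data into a dataframe and confirming its type. The dictionary `raw`, the name `demo`, the site name "Site B" and all numbers are made up for illustration; they are not taken from the lab CSV.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
raw = {
    "sensor.name": ["Swnphd-mccook", "Site B"],
    "pm2.5_atm": [4.2, 7.9],
}

# Coerce the dictionary of columns into a dataframe
demo = pd.DataFrame(raw)

# Two ways to confirm that we really have a dataframe
print(type(demo))
print(isinstance(demo, pd.DataFrame))
```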

Next, we can use the `print` function to examine our CSV. This gives us a chance to see what types of data are present and which variables we will be able to manipulate.

In [1]:
#print(air_data)

We have many different ways of checking that our data is properly loaded. It's always good practice to try at least one method before starting our analysis. Here are the methods we can use:
1. Use `print` to show the entire data set
2. Append `.columns` to the end of a data set to look at the column names
3. Append `.head()` to view the top rows of the dataset (default is 5)
4. Append `.tail()` to view the bottom rows of the dataset (default is 5)
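
The checks listed above can be sketched on a small stand-in dataframe. The name `demo`, the site name "Site B" and the numbers are hypothetical; with the real lab data you would append the same calls to `air_data`.

```python
import pandas as pd

# Hypothetical 8-row stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook"] * 4 + ["Site B"] * 4,
    "pm2.5_atm": [4.0, 5.5, 6.1, 3.8, 9.2, 8.7, 10.4, 7.9],
})

print(demo.columns)   # column names
print(demo.head())    # top rows (5 by default)
print(demo.tail(3))   # bottom rows; pass a number to change the default of 5
```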

## Summarizing Data

Next, let's look at how to group data. This is an important skill when you want to examine data in a final table that summarizes a statistic or characteristic of your data.

1.) Append `.groupby()` to the end of a dataframe variable and, inside the parentheses, use square brackets [] to list the columns you want to group by

2.) Then, add `.size()` to the end of the statement to show the number of observations in each group

3.) Use `print` to show this new dataframe
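
The three steps above can be sketched as follows. This is a minimal illustration on a hypothetical dataframe called `demo`; "Site B" and the numbers are invented.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook", "Swnphd-mccook",
                    "Site B", "Site B"],
    "pm2.5_atm": [4.0, 9.5, 6.1, 12.3, 7.7],
})

# Steps 1 and 2: group by a column, then count the observations in each group
counts = demo.groupby(["sensor.name"]).size()

# Step 3: show the result
print(counts)
```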

In [1]:
### Use this space for grouping data





Now, we may want to use the data we just generated in future analyses. Therefore, we need to save that data to a variable. We will add `.reset_index(name = '')` to the end of the previous code and put the name of the new column within the quote marks. We will assign this new dataframe to a dataframe called `summary` and then print the results.

This code will give us the number of rows associated with each `sensor.name` in the PurpleAir data set.
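
Here is a minimal sketch of that pattern on a hypothetical dataframe `demo` (invented values). The column name `count` and the variable `summary_demo` are our own choices for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook", "Swnphd-mccook",
                    "Site B", "Site B"],
    "pm2.5_atm": [4.0, 9.5, 6.1, 12.3, 7.7],
})

# .size() returns a Series; .reset_index(name=...) turns it back into a
# dataframe and names the new column that holds the group sizes
summary_demo = demo.groupby(["sensor.name"]).size().reset_index(name="count")
print(summary_demo)
```

Because `summary_demo` is an ordinary dataframe again, its columns can be used in later filtering or plotting steps.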

In [2]:
## Add the index here





What about grouping by multiple columns? That is 100% doable as long as each column name is spelled correctly and listed within [ ]. Let's try this below.
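
A minimal sketch of grouping by two columns at once, assuming a hypothetical dataframe `demo`. The pairing of `monitor_index` values with sensor names is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook", "Swnphd-mccook",
                    "Site B", "Site B"],
    "monitor_index": [1, 1, 2, 3, 3],
})

# List every grouping column inside the square brackets
pair_counts = demo.groupby(["sensor.name", "monitor_index"]).size()
print(pair_counts)
```

Each unique combination of the two columns becomes its own group.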

In [6]:
## Group with multiple names



### Lab Activity #1: Grouping
In the code block below, group your original `air_data` by `monitor_index`, assign the result to a new dataframe called `monitor_grouped`, and show how many observations there are in each group.

In [7]:
### Enter your response to Lab Activity #1 in this cell



The code above only summarizes the total number of samples.  What if we want to know some more in depth statistics about this data?  

The function `.agg()` is a pandas function that aggregates data.  Note that because it is a pandas function, `.agg()` will only work if you have imported the correct packages!  

`.agg()` can compute summary statistics such as max, min, mean and median. For this function to work, you need to tell `.agg()` what you want the new data column to be called, which column to take the statistic of, and which statistic you want (for example `'max'`). For example, if we want to take the max of the column "pm2.5_atm" and name this new data column "maximum", our code would look like this:
`.agg(maximum = ('pm2.5_atm', 'max'))`
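
In recent pandas versions this is usually written as a named aggregation of the form `new_name=(column, statistic)`. The sketch below applies it to a hypothetical dataframe `demo` with invented values; `stat_demo` is just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook", "Swnphd-mccook",
                    "Site B", "Site B"],
    "pm2.5_atm": [4.0, 9.5, 6.1, 12.3, 7.7],
})

# Named aggregation: new column "maximum" holds the max of "pm2.5_atm"
stat_demo = (
    demo.groupby(["sensor.name"])
        .agg(maximum=("pm2.5_atm", "max"))
        .reset_index()
)
print(stat_demo)
```

More statistics can be added by listing more `name=(column, statistic)` pairs inside `.agg()`.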

Try this out using `air_data`: group by `sensor.name`, assign this new summary statistic data to a dataframe called `stat_summary`, and print this new dataframe to the console.

In [8]:
### Create your stat_summary here





What if we wanted to group by a different column? How would the code look different if you wanted to group by the altitude instead of the sensor name?

In [9]:
## Try grouping by a different column here




### Lab Activity 2: Aggregating
In the code block below, show you can create a statistical summary (including the mean, max, min, and median) of the concentration of VOCs, grouped by `sensor.name`.

In [10]:
####Enter your response to Lab Activity #2 in this cell







### Filtering
Next, let's examine how to filter data so that we only keep the observations relevant to our intended analysis.

First, we can filter our data by using indices and the AND (&) or OR (|) operators. Let's try filtering for only the `sensor.name` "Swnphd-mccook". The column we want to index is 'sensor.name', and the condition we want keeps only the rows containing the string 'Swnphd-mccook'. Therefore, we can write:

In [11]:
## Indexing filter method





Notice that the variable name (sensor.name) and the string ('Swnphd-mccook') must both be in quotes.  We also use a double equals sign (==) when we are asking Python to find a specific case within a column.  A single equal sign (=) is used to assign new variables.

Also notice that we can print only the `sensor.name` column, by indexing the filtered dataframe with `['sensor.name']`, as a way to check that we have properly filtered our data.
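
A minimal sketch of the indexing filter on a hypothetical dataframe `demo` ("Site B" and the numbers are invented):

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Site B", "Swnphd-mccook", "Site B"],
    "pm2.5_atm": [4.0, 12.3, 6.1, 7.7],
})

# The comparison builds a True/False value for every row;
# indexing with it keeps only the rows marked True
mccook = demo[demo["sensor.name"] == "Swnphd-mccook"]

# Print just the filtered column to check the result
print(mccook["sensor.name"])
```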

There are other ways to filter in pandas. For example, the dataframe method `.query()` can be used to select the McCook sensors:
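
A minimal sketch of the `.query()` approach, assuming pandas 1.0 or later and a hypothetical dataframe `demo` with invented values. Because the column name `sensor.name` contains a dot, it is wrapped in backticks inside the query string.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Site B", "Swnphd-mccook", "Site B"],
    "pm2.5_atm": [4.0, 12.3, 6.1, 7.7],
})

# The whole condition is one string; backticks protect the dotted column name
mccook_q = demo.query('`sensor.name` == "Swnphd-mccook"')
print(mccook_q)
```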

In [12]:
## Query filter method






Now, what if we want to apply more than one filter?

What if we want only the McCook location AND systems with an altitude greater than 2000 ft? We can use the & sign to represent AND when filtering a dataframe. This means that if we want 'sensor.name' == "Swnphd-mccook" AND also want 'sensor.altitude' > 2000, each condition goes in its own parentheses with & between them, and we would write the following:
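
A minimal sketch of combining conditions, on a hypothetical dataframe `demo`. The site names other than "Swnphd-mccook" and all altitudes are invented. The OR version is included only for comparison.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook", "Site B", "Site B", "Site C"],
    "sensor.altitude": [2600, 2600, 2400, 2400, 1500],
})

is_mccook = demo["sensor.name"] == "Swnphd-mccook"
is_high = demo["sensor.altitude"] > 2000

# AND: both conditions must be True (note the parentheses when written inline)
both = demo[(demo["sensor.name"] == "Swnphd-mccook") & (demo["sensor.altitude"] > 2000)]

# OR: at least one condition must be True
either = demo[is_mccook | is_high]

print(both)
print(either)
```

The parentheses matter because `&` and `|` are evaluated before `==` and `>` in Python.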

In [13]:
## Multiple filters here







### Lab Activity #3: Data Types
Take note!  Python did a good job here and read in the `sensor.altitude` as a numerical value so we could perform the filter.  What would we need to do if Python was not recognizing the `sensor.altitude` as a number?  Enter your response in the cell below:

Student response to Lab Activity #3:




## Putting Groups, Filters and Aggregate together
Let's combine what we've just learned.  

Our client wants all the results grouped by `sensor.name`.  They also want summary statistics for `pm10.0_atm`, `pm2.5_atm` and `voc`, including `max`, `min`, `mean` and `median`.  For practice, let's also say that we only want temperatures above 50 F.  How would we put all of these elements together?
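
One possible way to chain these steps is sketched below on a hypothetical dataframe `demo`. All values are invented, and the column name `temperature` is an assumption; check `air_data.columns` for the actual name in the lab data.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented;
# the "temperature" column name is an assumption)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook"] * 3 + ["Site B"] * 3,
    "temperature": [45, 60, 72, 55, 48, 66],
    "pm10.0_atm": [12.0, 8.0, 10.0, 20.0, 30.0, 16.0],
    "pm2.5_atm": [6.0, 4.0, 5.0, 9.0, 15.0, 7.0],
    "voc": [100.0, 80.0, 90.0, 120.0, 150.0, 110.0],
})

# 1. Filter: keep only rows warmer than 50 F
warm = demo[demo["temperature"] > 50]

# 2. Group by sensor, 3. pick the columns, 4. aggregate with four statistics
stats = (
    warm.groupby(["sensor.name"])[["pm10.0_atm", "pm2.5_atm", "voc"]]
        .agg(["max", "min", "mean", "median"])
)
print(stats)
```

Passing a list of statistic names to `.agg()` produces two-level column labels, for example `("pm10.0_atm", "mean")`.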


In [54]:
### Put it all together!





### Lab Activity #4: Put Groups, Filters and Aggregate together
Try it yourself for this scenario:
1. We only want humidity > 20
2. We only want altitude > 3000 ft
3. We want a summary of max, min, mean, median for `pm10.0_atm`
4. We want this data grouped by `monitor_index`

In [14]:
### Student Response to Lab Activity #4






## Formatting as a Table

Lastly, let's learn how to create formatted tables, sorted in ascending or descending order, that can be copied and pasted into our reports to our clients. A "Pivot Table" (similar to Excel's) is a quick way to combine many of the functions we just learned.
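
A minimal sketch using `pd.pivot_table` and `.sort_values()` on a hypothetical dataframe `demo`. The site names other than "Swnphd-mccook" and all numbers are invented, and the choice of the mean of `pm2.5_atm` is only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lab data (values are invented)
demo = pd.DataFrame({
    "sensor.name": ["Swnphd-mccook", "Swnphd-mccook",
                    "Site B", "Site B", "Site C", "Site C"],
    "pm2.5_atm": [4.0, 6.0, 9.0, 11.0, 2.0, 4.0],
})

# index = rows of the table, values = column to summarize, aggfunc = statistic
table = pd.pivot_table(demo, index="sensor.name", values="pm2.5_atm", aggfunc="mean")

# Order the table from highest to lowest mean (ascending=True reverses it)
table = table.sort_values("pm2.5_atm", ascending=False)
print(table)
```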

In [15]:
##Create a pivot table!





### Lab Activity 5: Create a Summary Table
Create a summary Pivot Table that shows data grouped by `sensor.name`, and presents the max and min `pm10.0_atm` value for each location.  Enter your response in the space below.

In [17]:
### Student Response to Lab Activity #5

