## Exploring data with Pandas - Part 1
Pandas is an exceptional tool for working with tabular data. Here we review the basics of what a dataframe is, how we get data into a dataframe, and how we can explore those data when in a dataframe format.
### Topics
1. [Basic form of a dataframe](#1.-The-DataFrame)
2. [Loading data into a dataframe](#2.-Loading-data-into-a-dataframe)
3. [Viewing and inspecting data properties](#3.-Viewing-and-inspecting-data-properties)
4. [Selecting columns](#4.-Selecting-columns)
5. [Descriptive statistics](#5.-Descriptive-Statistics)
6. [Some basic plots](#6.-Some-Basic-Plots)

In [None]:
#Import the pandas package, conventionally imported as "pd"
import pandas as pd

## 1. The DataFrame
We'll begin by exploring the key elements of the DataFrame object. Some notions are self-evident, e.g., data are stored in rows and columns, much like a spreadsheet. Others are more nuanced: implicit and explicit indices, tables vs. views, and some others.

Let's start by looking at two ways DataFrames can be created.

#### DataFrame as a list of lists
First, a DataFrame can be considered as a list of lists. Below we see an example where we have 4 sub-lists, each containing 3 items (e.g. the first list is `['Joe', 22, True]`). Each of these 4 sub-lists becomes a _row_ in the resulting DataFrame, and each item in a given list falls into a _column_.

In [None]:
#Creating a simple data frame as a list of lists
df = pd.DataFrame([['Joe',22,True],
                   ['Bob',25,False],
                   ['Sue',28,False],
                   ['Ken',24,True]],
                  index = [10,20,40,30],
                  columns = ['Name','Age','IsStudent']
                 )
#Reveal the type of object created
type(df)

In [None]:
#Display the resulting data frame
df

A few key points here: 
* First is that each of the sub-lists has the same number of elements (3) and the same data types as the other sub-lists. Otherwise we'd end up with missing data or "coerced" data types.
* Second is that we also explicitly specify an **index** for the rows (`index = [10,20,40,30]`). The index allows us to identify a specific row.
* Likewise, we explicitly set column names with `columns = ['Name','Age','IsStudent']`, and yes, these allow us to identify specific columns in our DataFrame.
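
The first point can be checked with a small sketch. The two rows below are made-up values, and the second row is deliberately one item short; under these assumptions pandas fills the gap with a missing value rather than raising an error.

```python
import pandas as pd

# Hypothetical rows: the second sub-list is missing its third item
ragged_df = pd.DataFrame([['Joe', 22, True],
                          ['Bob', 25]],
                         columns=['Name', 'Age', 'IsStudent'])

# Count the missing values in each column
print(ragged_df.isna().sum())
```

Only the `IsStudent` column should report a missing value, which is why keeping sub-lists the same length matters.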

#### DataFrame as a collection of dictionaries
Another way to build (and think of) a DataFrame is as a set of dictionaries where each dictionary is a column of data, with the dictionary's key being the column name and its value being a list of values:

In [None]:
#Creating a data frame as dictionaries of lists
df = pd.DataFrame({"Name":['Joe','Bob','Sue','Ken'],
                   "Age":[22,25,28,24],
                   "IsStudent":[True,False,False,True]},
                  index = [10,20,40,30]
                 )
df

### Meh, so what...
What does this reveal? List of lists vs set of dictionaries? Well, it explains how you can extract elements from the DataFrame. **Thinking of a DataFrame as a list of lists**, getting the value in the 2nd column of the 3rd row is equivalent to getting the 2nd item in the 3rd list.

We can get that value using the DataFrame's `iloc` indexer (short for integer location), passing the zero-based row and column positions of the value we want.

In [None]:
#Get the 2nd item from the 3rd row; recalling Python is zero-based
df.iloc[2,1]

And if we hop over to **thinking of a DataFrame as a set of dictionaries**, we can target a specific value by specifying the column (the dictionary) and then the index of the row we want. The row, however, is referred to by the index *we* assigned, not its implicit index generated by the order in which it was entered.

In [None]:
#Get the value in the 'Name' column corresponding to the row with an index of '20'
df['Name'][20]

We can also do quick math on our data. If we wanted to calculate each person's approximate age in days:

In [None]:
df['Age_days'] = df['Age'] * 365
df

_We'll return to how we extract data from a DataFrame, but for now just soak in the fact that values in a DataFrame can be referenced by their implicit location (i.e. their row, column coordinates) and by their explicit column name and row index._
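
As a minimal sketch of that distinction, the block below rebuilds the small example DataFrame and fetches the same value two ways. It uses the `loc` indexer, which is not shown above but is the label-based counterpart to `iloc`; the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the small example DataFrame used above
people_df = pd.DataFrame({'Name': ['Joe', 'Bob', 'Sue', 'Ken'],
                          'Age': [22, 25, 28, 24],
                          'IsStudent': [True, False, False, True]},
                         index=[10, 20, 40, 30])

# Explicit lookup: the row labelled 20 and the column labelled 'Name'
by_label = people_df.loc[20, 'Name']

# Implicit lookup: second row, first column (zero-based positions)
by_position = people_df.iloc[1, 0]

print(by_label, by_position)
```

Both lookups land on the same cell, because the row labelled `20` happens to be the second row entered.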

---
## 2. Loading data into a dataframe
More than likely we'll be reading in data vs entering it manually, so let's review how files are read into a Pandas DataFrame. Pandas can read many formats: Excel files, HTML tables, JSON, etc. But let's concentrate on the simplest one, the CSV file, and discuss the key parameters involved.

In the `data` folder within our workspace is a file named `surveys.csv`, which holds the data we'll use. If you're curious, this dataset is part of the Portal Teaching data, a subset of the data from Ernest et al., [Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA](http://www.esapubs.org/archive/ecol/E090/118/default.htm).

The dataset is stored as a `.csv` file: each row holds information for a single animal, and the columns represent:

| Column | Description |
| :--- | :--- |
|record_id |	Unique id for the observation |
|month| 	month of observation |
|day |	day of observation |
|year |	year of observation |
|plot_id |	ID of a particular plot |
|species_id |	2-letter code |
|sex |	sex of animal (“M”, “F”) |
|hindfoot_length |	length of the hindfoot in mm |
|weight |	weight of the animal in grams |

Below, we read in this file, saving the contents to the variable `surveys_df`.

In [None]:
#Read in the surveys.csv file
surveys_df = pd.read_csv('../data/surveys.csv')

In [None]:
#View the dataframe just by typing the variable name
surveys_df

Pandas can read data stored in many different formats. CSV is one of the more common ones, but it can also handle txt, JSON, HTML, MS Excel, HDF5, Stata, SAS, and certain SQL formats. Each has its own `read_*()` function (e.g. `read_excel()`, `read_json()`) and is fully documented.

* Run the following to get documentation for the `read_csv()` function:

In [None]:
#Display help on the read_csv() function
pd.read_csv?

Full documentation can be found simply via a web search on "[pandas read_csv](https://www.google.com/search?q=read_csv)".

>Note all the options available to read in a simple CSV file. Think back to some of the other files we've read into Python using its file object. How might the following options have helped? 
* `delimiter` or `sep` to read in tab-delimited files...
* `comment` to ignore lines (or trailing text) marked with a comment character...
* `skiprows` to skip metadata rows at the top of the file...
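
These options can be tried without a file on disk. The sketch below feeds a made-up, tab-delimited text snippet (with two hypothetical metadata lines) to `read_csv()` through `io.StringIO`, once using `comment` and once using `skiprows`.

```python
import io
import pandas as pd

# Made-up file contents: two metadata lines, then tab-delimited data
raw_text = (
    "# Hypothetical metadata line\n"
    "# Another metadata line\n"
    "site\tcount\n"
    "A\t3\n"
    "B\t5\n"
)

# Option 1: treat lines starting with '#' as comments
comment_df = pd.read_csv(io.StringIO(raw_text), sep="\t", comment="#")

# Option 2: skip the first two rows outright
skip_df = pd.read_csv(io.StringIO(raw_text), sep="\t", skiprows=2)

print(comment_df)
print(comment_df.equals(skip_df))
```

Both calls should yield the same two-row table here; `comment` is the safer choice when the number of metadata lines varies from file to file.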

I've never used many of these modifiers, but it's important to know they exist and how they are implemented. I often use a bit of trial and error when first applying them. 

#### Using the `dtype` modifier 
One important modifier that is easy to overlook is `dtype`. This modifier allows us to override the default data type that Pandas assigns to a column when it's imported into a dataframe.

Note that in our `surveys_df` dataframe, we have two numeric columns that contain _nominal_ data: `record_id` and `plot_id`. It's possible these numeric labels may have leading zeros. (Think ZIP codes or HUC codes...). As such, we'd want to be sure to import these values as _strings_, not _integers_ as Pandas would do by default. 

We do this using the `dtype` modifier, passing a dictionary of column name:format for each variable we want to ensure is imported under our control:

In [None]:
#Read in the surveys.csv file
surveys_df = pd.read_csv(
    '../data/surveys.csv',
    dtype={'record_id':'str','plot_id':'str'})
surveys_df

In [None]:
surveys_df.dtypes
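
To see why this matters, here is a small sketch using made-up ZIP-style codes (not part of the surveys data). Read with the defaults, the codes become integers and lose their leading zeros; read with `dtype`, they survive as text.

```python
import io
import pandas as pd

# Made-up CSV text with codes that begin with zeros
csv_text = "zip_code,count\n00501,4\n02134,7\n"

# Default import: pandas infers an integer column
default_df = pd.read_csv(io.StringIO(csv_text))

# Controlled import: force the code column to be read as strings
string_df = pd.read_csv(io.StringIO(csv_text), dtype={'zip_code': 'str'})

print(default_df['zip_code'].tolist())
print(string_df['zip_code'].tolist())
```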

---
## 3. Viewing and inspecting data properties
We've already seen that typing in the dataframe's variable name will display a nicely formatted snapshot of the data, truncated if the dataframe is too big. 

### Viewing data with `head()`, `tail()`, and `sample()`
Some handy commands to show snippets of our dataframe are `head()`, `tail()`, and `sample()`. We can pass in a number as the argument to each of these to display a set number of records from our dataset. (`head()` and `tail()` default to 5 rows; `sample()` defaults to 1.)

In [None]:
#Use the head() command to view the first 5 rows of the dataframe
surveys_df.head()

► What do you think the tail function does? What about sample? How might you find out?

In [None]:
#Try out the tail command


In [None]:
#Try out the sample command (it returns 1 random record unless you specify a number)


### Inspecting our dataframe's properties
Here, we'll create a second dataframe by reading in a dataset stored online. And then we'll apply the following commands to explore properties of this dataset:
* Revealing size attributes with `len()`, `shape`, and `size`
* Revealing columns included with `columns`
* Revealing the index with `index`
* Revealing the dataframe's data types with `dtypes` and `info()`

In [None]:
#Read in HUC12 land cover data from EPA's EnviroAtlas dataset
data_url = 'https://github.com/ENV859/EnviroAtlasData/blob/main/LandCover.csv?raw=true'
land_df = pd.read_csv(data_url)
land_df.head()

In [None]:
#Pass our dataframe into the len() function
len(land_df)

In [None]:
#Reveal the shape of the dataframe
land_df.shape

In [None]:
#Reveal the size of the dataframe
land_df.size

► Can you deduce what properties `len()`, `shape`, and `size` reveal? 
* How many rows does the dataframe have?
* How many columns?
* How many total values are in this table? 

In [None]:
#Show a list of columns in the dataframe
land_df.columns.values

In [None]:
#Show the dataframe's index
land_df.index

In [None]:
#Show the data types of each column 
land_df.dtypes

In [None]:
#Show more information on the dataframe and on each column
land_df.info()

---
## Check your understanding:
1. Using the commands described above, answer the following questions regarding the **surveys_df** dataframe.
 * How many records (rows) are in the dataset?
 * How many columns are in the dataset?
 * What are the column names?
 * What data type do the columns use?
 * How many total values are stored in this dataframe?
 * What are the indices used in this dataframe?
 * How many non-null values are found in the `weight` column?

In [None]:
# How many records (rows) are in the dataset?


In [None]:
# How many columns are in the dataset?


In [None]:
# What are the column names?


In [None]:
# What data type do the columns use?


In [None]:
# How many total values are stored in this dataframe?


In [None]:
# What are the indices used in this dataframe?


In [None]:
# How many non-null values are found in the 'weight' column?


2. Re-read EnviroAtlas data into the **land_df** dataframe, but this time ensure the `HUC_12` column is read as a **string**, not as an **integer (int64)**.
 * Ensure that the data type of the column was read in as an "object" (i.e. a string) not "int64".

In [None]:
# Re-read EnviroAtlas data into the land_df dataframe, ensuring HUC_12 is a string
data_url = 'https://github.com/ENV859/EnviroAtlasData/blob/main/LandCover.csv?raw=true'
land_df = pd.read_csv(data_url) #<-- Modify this line

In [None]:
# Ensure that the data type of the column was read in as an "object" (i.e. a string) not "int64"


---
## 4. Selecting columns
We can select specific columns based on the column names. The basic syntax is `dataframe[column name]`, where the column name can be a single column name or a _list_ of column names. Let’s start by selecting two columns from our `surveys_df` dataframe, `species_id` and `hindfoot_length`:

In [None]:
#Subset just the two columns from the full dataframe
selection = surveys_df[['species_id','hindfoot_length']]
selection.head()

In [None]:
#Reveal what is returned - it's a dataframe
type(selection)

In [None]:
#Reveal the shape of the new dataframe
selection.shape

**Note**: if we select just one column, the object returned is a **series**, not a dataframe. 
>A **series** is simply one column of data. However, it has a different set of properties and methods. 

In [None]:
one_col = surveys_df['hindfoot_length']
one_col.head()

In [None]:
type(one_col)
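
The difference comes from what goes inside the brackets, not from the number of columns. As a rough sketch with a made-up two-row table, a bare column name returns a series, while a one-item _list_ of names returns a one-column dataframe.

```python
import pandas as pd

# Made-up values standing in for the surveys data
toy_df = pd.DataFrame({'species_id': ['DM', 'PF'],
                       'hindfoot_length': [35.0, 15.0]})

# A single column name returns a Series
as_series = toy_df['hindfoot_length']

# A list holding one column name returns a DataFrame
as_frame = toy_df[['hindfoot_length']]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```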

>**NOTE**: You can also retrieve a column using a different syntax:
```python
surveys_df.hindfoot_length
```
This syntax works only if the column name is a valid name for a Python variable (e.g. the column name should not contain whitespace). The syntax `dataframe["column"]` works for all kinds of column names, so we recommend using this approach. Also, things may get ugly if you have a column name that conflicts with a property of your dataframe. For example, what if your column name was "`shape`"?
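
The `shape` question can be answered with a small sketch. The dataframe below is made up, with one column named `shape` and another whose name contains a space; dot syntax reaches the dataframe's own attribute, while bracket syntax reaches the column.

```python
import pandas as pd

# Made-up table: one column name clashes with a DataFrame attribute,
# the other contains whitespace
demo_df = pd.DataFrame({'shape': ['round', 'square'],
                        'hind foot': [31, 28]})

# Dot syntax returns the DataFrame's dimensions, not the column
print(demo_df.shape)

# Bracket syntax always returns the column
print(demo_df['shape'].tolist())
print(demo_df['hind foot'].tolist())
```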

## 5. Descriptive Statistics
Pandas DataFrames and Series contain useful methods for getting summary statistics. Available methods include `mean()`, `median()`, `min()`, `max()`, and `std()` (the standard deviation).

We could, for example, check the mean hindfoot length in our input data. We check the mean for a single column (Series):

In [None]:
#Compute the mean hindfoot length
surveys_df['hindfoot_length'].mean()
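
One behavior worth knowing when reading these numbers: by default, these summary methods skip missing values. The sketch below uses three made-up lengths, one of them missing, and compares the default with `skipna=False`.

```python
import pandas as pd

# Made-up measurements; None becomes a missing value (NaN)
lengths = pd.Series([30.0, 34.0, None])

# Default: the missing value is ignored
mean_default = lengths.mean()

# Opt out of skipping: any missing value makes the result missing
mean_strict = lengths.mean(skipna=False)

print(mean_default, mean_strict)
```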

Try computing some other summary stats for either of our two numeric columns (`hindfoot_length` or `weight`) using the methods mentioned above. 

In [None]:
# Compute the count of weight records


In [None]:
# Compute the standard deviation of weights


In [None]:
# Compute the median weight


In [None]:
# What were the first and last years of the survey?
first_year = None #<-- Replace None with your code
last_year = None #<-- Replace None with your code
print(f"The study spanned from {first_year} to {last_year}")

A few more complex ones are **quantiles** and **correlations**:

→ First, percentiles using the `quantile()` function, where we pass in the fraction (between 0 and 1) to compute:

In [None]:
#Compute the 25th percentile of weight
surveys_df['weight'].quantile(0.25)

In [None]:
#Change the above to compute the 50th percentile; does it equal the median?
surveys_df['weight'].median() == surveys_df['weight'].quantile(0.5)

→ Now to compute **pairwise correlations among variables**. Here we'll use the HUC12 land cover data and display pairwise correlation values among its variables (using the default Pearson method) as a table.

In [None]:
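#Create a pairwise correlation table among variables in the land_df dataframe
# (numeric_only=True skips any non-numeric columns, e.g. HUC_12 if read as a string)
land_df.corr(numeric_only=True)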

**TIP**: Using visualization tricks that we'll touch on later, we can style our table, making it much more informative...

In [None]:
#Save the correlation table to a variable (numeric columns only)
correlation_table = land_df.corr(numeric_only=True)
#Show the table, with styling
correlation_table.style.background_gradient(cmap = 'YlGnBu')

Pandas also has a `describe()` function that quickly generates descriptive stats for all numeric variables in your dataset.

In [None]:
#Generate summary stats on all numeric columns in the data
land_df.describe()

#### Categorical data
Descriptive statistics are useful for numeric data. For categorical data, we have other means for summarizing data:
* `nunique()`: lists the number of unique values in each column of a dataframe (or in a series)
* `unique()` : lists the unique values occurring in the supplied series
* `value_counts()` : lists the number of records associated with each unique value

_→In exploring these, consider the type of object they return and what you can do with that object..._

In [None]:
#Reveal how many unique values in each field in the surveys_df dataframe
surveys_df.nunique()

In [None]:
#Reveal how many unique values occur just in the 'species_id' series
surveys_df['species_id'].nunique()

In [None]:
#List the unique values in the species_id field
surveys_df['species_id'].unique()

In [None]:
#List how many records are associated with each unique month value
surveys_df['month'].value_counts()
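
On the question of what these methods return: `value_counts()` hands back a series, so anything a series can do is available on the result. The sketch below uses a short made-up list of species codes to look up one count, re-sort by label, and find the most common value.

```python
import pandas as pd

# Made-up species codes standing in for a column of the surveys data
species = pd.Series(['DM', 'DM', 'PF', 'DM', 'PF', 'NL'])

# value_counts() returns a Series indexed by the unique values
counts = species.value_counts()
print(type(counts).__name__)

# Look up the count for one value by its label
print(counts['DM'])

# Re-order the counts by label instead of by frequency
print(counts.sort_index())

# Find the label with the highest count
print(counts.idxmax())
```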

---
### <font color='red'>*Challenge* - Counts and Lists from Data </font>


1. Create a list of unique **plot IDs** found in the surveys data (much like we did above with the species id values). Call it `plot_names`. How many unique plots are there in the data? How many unique species are in the data?

1. What is the difference between `len(plot_names)` and `surveys_df['plot_id'].nunique()`?


In [None]:
# Challenge 1


In [None]:
# Challenge 2


## 6. Some Basic Plots
We'll revisit visualizations, but as a teaser here are some quick and easy plots of our data. 

* First a histogram of all the `PFOR` (percent forest) values from the EnviroAtlas data. 

In [None]:
# Create a histogram of values in the PFOR column, in 10 bins
land_df['PFOR'].hist(bins=10);

* And now, we'll generate histograms of the `weight` values in the `surveys_df` dataframe, one for each month. We'll also increase the size of the figure to 20 x 20 inches.

In [None]:
#Plot distributions of weights by month
surveys_df.hist(column='weight',by='month',figsize=(20,20));

In [None]:
#And finally box plots of weight by species
surveys_df.boxplot(column='weight',by='species_id',figsize=(20,10));

## Summary
While this is a very quick introduction to the Pandas dataframe and series objects, I'm hopeful that you now appreciate the utility of structuring data in a dataframe, and that you have some command of how to generate and explore dataframes using the Pandas package. 

Actions you should now be capable of include:
* Creating a dataframe from a list of lists (of equal sizes)
* Creating a dataframe from a set of dictionaries (of equal lengths)
* Reading data from a CSV file into a Pandas dataframe
* Viewing and inspecting properties of the data in your dataframe
* Selecting specific columns in your dataframe into a new object
* Computing descriptive statistics from your data
* Generating a few basic plots

### Next up: 
Next we will explore how we can effectively process data stored in dataframes. 