`pandas` is an alternative to Excel for managing tabular data. An excellent introduction is [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

The two most likely ways to turn data that is not a dataframe into a dataframe are (1) converting a dictionary into a dataframe and (2) loading a delimited text file into `pandas`.

# Data basics

<b>Computation learning objectives:</b>
- Recognize data formats that are appropriate for use in a `pandas` DataFrame (e.g. dictionaries, .csv files)
- Understand how to query rows and columns of DataFrames

<b>Geoscience learning objectives:</b>
- Understand how environmental data is stored and manipulated in tabular formats
- Perform a query on a dataset to examine basic trends in the data

<b>Previous skills leveraged:</b>
- Building and extracting data from dictionaries
- Operating on data with `numpy` or built-in operators
- Performing simple statistical analyses, like finding minimum and maximum values in a dataset

<b>Real-world context:</b>
- Many environmental and geoscientific data are collected in tabular format (rows and columns), with each row representing an observation in either time or space.
- That data will likely need to be "cleaned" (e.g. it has missing values or is in the wrong units), and if the dataset is large, it may be challenging to perform those tasks in Excel.
- Getting data into `pandas` DataFrames will help you quickly visualize your data (in the next module!)

<b>Tips for success:</b>
- placeholder
- placeholder
- placeholder

## Previous module review: Plotting

Write in your own words (or discuss with a partner) the elements of a "publication-worthy" plot, and list a few `matplotlib` functions that you might call to make those plots.

<i>Your text here</i>

# Dataframes from dictionaries

## But what's a dictionary again?

Remember from two weeks ago that [dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) (`dict`) are used for holding metadata attributes (e.g., instrument specifications, geographic coordinates) associated with environmental measurements. They are also used for organizing and accessing information, such as looking up the list of properties of a data point by its name.

Let us return to that fake weather data:

In [None]:
day = [1, 2, 3, 4, 5]
temp = [60, 65, 68, 61, 58]
humidity = [20, 25, 32, 28, 25]

If we wanted to store these lists of numbers in a dictionary, we would (1) create a dictionary, (2) create keys for the dictionary (e.g. `'day'`), and (3) assign values to those keys.

In [None]:
site_data = {}

site_data['day'] = day
site_data['temp'] = temp
site_data['humidity'] = humidity

site_data


The most important things you need to know right now about dictionaries are that every key has to be unique and that values are looked up by key, not by position. Unlike a list or array, where you can access the 2nd or 3rd element, you cannot ask a dictionary for its 2nd or 3rd entry; you have to give it a key.
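
As a minimal sketch of those two rules, the cell below rebuilds the same fake weather dictionary so that it runs on its own. The variable names `temps`, `second_temp`, `lookup_failed`, and `n_keys` are only for illustration.

```python
# Same fake weather data as above, rebuilt here so the example runs on its own
site_data = {
    'day': [1, 2, 3, 4, 5],
    'temp': [60, 65, 68, 61, 58],
    'humidity': [20, 25, 32, 28, 25],
}

# Values are looked up by key
temps = site_data['temp']

# The list stored under a key can still be indexed by position
second_temp = site_data['temp'][1]

# Asking the dictionary itself for "position 1" fails, because 1 is not a key
try:
    site_data[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Keys are unique: assigning to an existing key replaces its value
# instead of adding a second 'humidity' entry
site_data['humidity'] = [21, 25, 32, 28, 25]
n_keys = len(site_data)
```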

## Example dictionary to pandas DataFrame

In [None]:
import pandas as pd

In [None]:
pd.DataFrame.from_dict(site_data, orient='columns')

`pandas` knows to take your data and make the dictionary keys the names of columns in a DataFrame. Great!
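
Once the data is in a DataFrame, you can start pulling out columns and rows. The cell below is a small sketch using the same fake weather data; the threshold of 60 and the names `df`, `first_row`, and `warm_days` are made up for illustration.

```python
import pandas as pd

# Same fake weather data as above, rebuilt here so the example runs on its own
site_data = {
    'day': [1, 2, 3, 4, 5],
    'temp': [60, 65, 68, 61, 58],
    'humidity': [20, 25, 32, 28, 25],
}

# pd.DataFrame() also accepts a dictionary of lists directly
df = pd.DataFrame(site_data)

# The dictionary keys became the column names
column_names = list(df.columns)

# Select one column by name (this returns a pandas Series)
temps = df['temp']

# Select one row by its position with .iloc
first_row = df.iloc[0]

# Keep only the rows where a condition is true
warm_days = df[df['temp'] > 60]
```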

So when are you likely to encounter or use dictionaries?
- You're manually entering data.
- You're generating data in your code (e.g., from a model).
- You want to quickly test or demonstrate something.

# Pandas from delimited data

CSV (Comma-Separated Values) files are a common way to store and share data. You might get them from a weather station or instrument, a website, another piece of software, or a person who has recorded data directly into a `.csv` file (or into Excel, which can save files as `.csv`).
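
"Delimited" just means that some character separates the values on each line. As a rough sketch, the cell below reads the same made-up table twice, once separated by commas and once by tabs. The text contents are invented for illustration, and `io.StringIO` is used only so that `pandas` can read a string as if it were a file.

```python
import io

import pandas as pd

# Made-up file contents: a header line followed by one line per observation
csv_text = "day,temp,humidity\n1,60,20\n2,65,25\n3,68,32\n"

# The same table, but with tabs instead of commas between values
tsv_text = csv_text.replace(",", "\t")

# read_csv assumes commas unless told otherwise
from_csv = pd.read_csv(io.StringIO(csv_text))

# The sep argument tells read_csv which delimiter to expect
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

Both calls should produce the same three-row DataFrame, with column names taken from the first line of the text.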

In [None]:
# Load temperature data from a CSV file

# The two dots before the module_02 folder say "Hey, go up a directory, then look in module_02"
weather_data = pd.read_csv('../module_02/williamsburg_meteo.csv')

# Peek at the data and particularly the column names
weather_data.head()

We can create new columns and either fill them with a single value or compute them by performing an operation on an existing column:

In [None]:
weather_data['QC'] = 'good' # a pretend "quality control column"

weather_data['datetime'] = pd.to_datetime(weather_data['DATE']) # the pd.to_datetime() just reads the dates as a specific type of data that plots well for time series

weather_data['PRCP_cm'] = weather_data['PRCP'] * 2.54 # convert inches to centimeters

weather_data.head()
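
To see what each of those three lines does without needing the CSV file, here is a small sketch on a made-up table. The `DATE` and `PRCP` column names are borrowed from the cell above, but the values are invented.

```python
import pandas as pd

# Made-up rows that reuse the DATE and PRCP column names from the cell above
df = pd.DataFrame({
    'DATE': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'PRCP': [0.0, 0.5, 1.0],  # pretend these are inches
})

# A single value is repeated down every row of the new column
df['QC'] = 'good'

# The date text is converted into datetime values
df['datetime'] = pd.to_datetime(df['DATE'])

# Arithmetic on a column is applied to every row at once
df['PRCP_cm'] = df['PRCP'] * 2.54

# Check that the new column really holds datetimes
is_datetime = pd.api.types.is_datetime64_any_dtype(df['datetime'])

# Datetime columns expose pieces of the date through the .dt accessor
days_of_month = df['datetime'].dt.day.tolist()
```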

## Mini-assignment 1

Precipitation and temperature data were originally given in imperial units. Create new columns where temperature values are given in metric units (degrees Celsius).

In [None]:
# your code here

## Mini-assignment 2

Print the **date** of the highest recorded daily rainfall in the record. Consult [the docs](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats) or Google. Note that you will have to look at the value of one column to get the value in another column.

In [None]:
# your code here

# For the capstone

Many types of geoscientific data can be formatted as a table: any time series (like discharge or temperature) or field and/or laboratory data (like concentrations of certain elements in a core or soil sample) can find its way into a `pandas` DataFrame. In fact, `pandas` has many excellent built-in functions for [time series analysis](https://pandas.pydata.org/docs/user_guide/timeseries.html). We will also see later on that geospatial data can be stored in a `pandas`-like DataFrame using the [`geopandas`](https://geopandas.org/en/stable/index.html) package. It is therefore very likely that your capstone project will involve some sort of tabular dataset, either one that you create or one that you find in the wild.