# Pandas and data manipulation

You've learned how to work with data stored in CSV and JSON files using "plain" Python, but there are many libraries created to make this easier.  One of the most commonly used ones is [Pandas](https://pandas.pydata.org).  Pandas has some excellent [tutorial documentation](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html), which is useful for understanding Pandas beyond the basics we're doing here.

To get started with Pandas, just import it like any other library.  We'll use `pd` as an abbreviation, the way we use `np` for `numpy`.

In [None]:
import pandas as pd

Now let's load in a familiar data file:

In [None]:
data = pd.read_csv("../project/bls_fatalities_summary.csv")

Ok, that was easy.  But what is in `data`, and how do we access it?

* What is the type of `data`?
* What does `data.columns` return?
* What does `data["year"]` return?
* What about `data["year"] == 2015`?
* What about `data[data["year"] == 2015]`?
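If you want to compare your answers against something small, here is a minimal sketch of the same expressions on a toy table. Only the `year` column name comes from the file above; the `count` column and all of the numbers are made up for illustration.

```python
import pandas as pd

# Toy table with made-up numbers; "count" is a hypothetical column.
toy = pd.DataFrame({
    "year": [2014, 2015, 2015, 2016],
    "count": [10, 20, 30, 40],
})

kind = type(toy)              # the class of the object read_csv-style tables use
names = list(toy.columns)     # column labels as a plain list
years = toy["year"]           # a single column comes back as a Series
mask = toy["year"] == 2015    # element-wise comparison gives a boolean Series
rows_2015 = toy[mask]         # indexing with the mask keeps rows where it is True

print(kind)
print(names)
print(mask.tolist())
print(rows_2015)
```

The last step is usually called boolean indexing: the comparison builds a column of `True`/`False` values, and that column then selects rows.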

*Challenge*: Read this [tutorial page](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) and try some other indexing methods.

In [None]:
# Your code here...

We can also create new data columns using existing columns and math operations:

In [None]:
data['total'] = data['year'] + 5

data # This will cause Jupyter/pandas to print a nice table

Change the code above to calculate the total number of workplace fatalities.

What happens when you do math with the `NaN` values?
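To check your answer, here is a small sketch on toy data (the column names `a` and `b` are hypothetical, not columns from the fatalities file). It contrasts adding columns with `+` against the `sum()` method, which by default skips `NaN` values.

```python
import numpy as np
import pandas as pd

# Toy columns with one missing value each (made-up numbers).
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [10.0, np.nan, 30.0],
})

# Element-wise addition: a row with a NaN in either column gives NaN.
toy["a_plus_b"] = toy["a"] + toy["b"]

# Row-wise sum over the two original columns: NaN is skipped by default.
toy["row_sum"] = toy[["a", "b"]].sum(axis=1)

print(toy)
```

Which behavior you want depends on what a missing value means in your data, so it is worth deciding that before computing a total.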

*Challenge*:
* You can get an explicit `NaN` (aka `nan`) value in Python using `math.nan`.  What is the result of `math.nan == math.nan`?  Why?  (And what about `math.nan is math.nan`?)
* Under what conditions do you get a `NaN`?  Is it the same as infinity?  [Wikipedia](https://en.wikipedia.org/wiki/NaN) is a good starting place for understanding this.


## Data cleaning

We need some way to filter out the `NaN` values.  We can find the rows where a particular column is `NaN` using the `isna()` function:

In [None]:
data['Multiple races (non-Hispanic)'].isna()

* Write code to set the `NaN` values to 0.
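One possible approach, sketched on toy data so you can adapt it to the real table: `isna()` gives a boolean mask, `~` inverts it, and `fillna()` returns a copy with the missing values replaced. The `group_a` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value (made-up numbers).
toy = pd.DataFrame({
    "year": [2014, 2015, 2016],
    "group_a": [5.0, np.nan, 7.0],
})

missing = toy["group_a"].isna()       # True where the value is NaN
complete_rows = toy[~missing]         # keep only rows without a NaN in group_a
filled = toy["group_a"].fillna(0)     # new Series with NaN replaced by 0

print(complete_rows)
print(filled.tolist())
```

Note that `fillna()` does not change `toy` itself; you would need to assign the result back to the column to keep it.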

## Plotting

Now that we can read a CSV file with pandas and extract particular columns of interest, it should also be easy to plot data.  Make a plot showing "Hispanic or Latino" deaths as a percentage of the total number of workplace deaths.

Note that pandas has some built-in plotting functions, but these are just wrappers around matplotlib.  For now, I strongly encourage you to keep using the matplotlib functions you know and (hopefully) love.
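As a rough sketch of the pattern (not the solution for the real file), the example below computes a percentage column from two toy columns and plots it with matplotlib. The column names `group_a` and `total` and all of the numbers are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy table with made-up numbers; "group_a" and "total" are hypothetical columns.
toy = pd.DataFrame({
    "year": [2014, 2015, 2016],
    "group_a": [20, 30, 60],
    "total": [100, 120, 200],
})

# Percentage of the total, computed element-wise for every row.
percent = 100 * toy["group_a"] / toy["total"]

# pandas columns can be passed straight to matplotlib.
fig, ax = plt.subplots()
ax.plot(toy["year"], percent, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Share of total (%)")
ax.set_title("Toy example: one group as a percentage of the total")
```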

## Solar data

In the next project, we'll be looking at the state of rooftop solar installation in the United States.  Let's start just by exploring the data, which comes from Google's [Project Sunroof](https://sunroof.withgoogle.com/data-explorer/).

* Try looking at the solar opportunity and installation trends for some different places.  What do you observe?
* Is "going solar" a good option for these places?  Why or why not?

The data is available as a CSV file, which you can download from Google: https://storage.googleapis.com/project-sunroof/csv/latest/project-sunroof-census_tract.csv

You'll need to upload this to JupyterLab, ideally in the `inclass` folder.


In [None]:
solar = pd.read_csv("project-sunroof-census_tract.csv")

Try doing the following things with the data:
* Select all of the rows corresponding to a particular state.
* What tract has the most solar generation potential?
* How many tracts have less than 80% coverage by Project Sunroof?
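Here is a sketch of the three kinds of operations on a tiny stand-in table. The column names (`region_name`, `state_name`, `percent_covered`, `yearly_sunlight_kwh_total`) and the 0 to 100 scale for coverage are assumptions, so check `solar.columns` and a few rows of the real data before relying on them.

```python
import pandas as pd

# Toy stand-in for the real file. Column names and values are assumptions
# for illustration; inspect solar.columns for the actual names.
toy_solar = pd.DataFrame({
    "region_name": ["A", "B", "C", "D"],
    "state_name": ["Ohio", "Ohio", "Texas", "Texas"],
    "percent_covered": [95.0, 60.0, 79.9, 80.0],
    "yearly_sunlight_kwh_total": [1.0e6, 2.5e6, 4.0e6, 3.0e6],
})

# Rows for one state, using a boolean mask.
ohio = toy_solar[toy_solar["state_name"] == "Ohio"]

# idxmax() gives the index label of the largest value; loc looks up that row.
best_label = toy_solar["yearly_sunlight_kwh_total"].idxmax()
best_tract = toy_solar.loc[best_label, "region_name"]

# Summing a boolean Series counts the True values.
n_low_coverage = int((toy_solar["percent_covered"] < 80).sum())

print(ohio)
print(best_tract, n_low_coverage)
```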