# P1 - Experience with Pandas



In this project, you will be working with the Global Land Temperature Data set. 

    "Berkeley Earth provides high-resolution land and ocean time series data and gridded temperature data. Our
    peer-reviewed methodology incorporates more temperature observations than other available products, and often
    has better coverage. Global datasets begin in 1850, with some land-only areas reported back to 1750. The 
    newest generation of our products are augmented by machine learning techniques to improve the spatial 
    resolution. This allows Berkeley Earth to provide the most comprehensive, high-resolution instrumental
    temperature data product available."
    -- https://berkeleyearth.org/data/

The data we will be working with is available online at: 
https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Complete_TAVG_daily.txt

Be sure to investigate and understand the data, format, and descriptions provided.



### Autograder Setup

You will have access to a few public tests for the project. Note that when you submit, the autograder will run additional "hidden" tests on your solutions. 

Always make sure that you are answering the question as asked.  Do not rely on passing the public tests to ensure that you have correctly or completely answered the problems. 
 

### Project Setup 

You should use the following libraries to complete this assignment:

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

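import os
# Upgrade otter-grader only on the JupyterHub server (home directory /home/jovyan).
# .get() avoids a KeyError on systems where HOME is not set.
if os.environ.get("HOME") == '/home/jovyan':
    !pip install --upgrade otter-grader

import otter
grader = otter.Notebook()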

## 1. (20 pts) Get the Data

Read in the data from the link provided.  



### 1a. Load the data

Write a function that reads the data into a DataFrame object.  
Inside the function, use print statements to report the number of rows, the number of columns, and the data type of each column.

The function includes a few arguments that may be helpful for using the [`read_csv` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) 

Use the same column names as given in the data, but replace each space with an underscore (`_`).
    
*Hint: Column names should be coded as `Date_Number`, `Year`, `Month`, `Day`, `Day_of_Year`, `Anomaly`.*
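
As a minimal sketch of how these `read_csv` arguments fit together, the example below parses a small in-memory text rather than the project file. The toy text, its `#` comment marker, the whitespace separator, and the column names `a`, `b`, `c` are all assumptions made for illustration; inspect the real file to decide which values apply there.

```python
import io

import pandas as pd

# Toy text standing in for a data file (NOT the project data).
toy_text = """# a header comment line
# another comment line
1 2.5 x
2 3.5 y
3 4.5 z
"""

toy = pd.read_csv(
    io.StringIO(toy_text),
    names=["a", "b", "c"],  # hypothetical column names
    sep=r"\s+",             # one or more whitespace characters
    comment="#",            # ignore everything after this marker
)

print("Number of Rows: " + str(toy.shape[0]))
print("Number of Columns: " + str(toy.shape[1]))
print("Column Data Types:")
print(toy.dtypes)
```

Because `names` is supplied and no header row is requested, every non-comment line is treated as data.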

In [None]:
def p1_q1a(url, column_names=None, separator=',', paramA=None):
    '''
    - Description - 
    Read in data from URL to a DataFrame object 
    Report the number of rows and columns with a print statement.
    Additionally, print the data types of each column
    Use pandas.read_csv(...):
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

    - Inputs - 
    url: location of dataset file, filepath_or_buffer
    column_names: list of column names to add to DataFrame (default is None)
    separator: delimiter to use (default is a comma)
    paramA: additional parameter that may be passed to read_csv (default is None)

    - Outputs - 
    df: return object is a DataFrame

    - Print Statements you should include - 
    print("Number of Rows: ...")
    print("Number of Columns: ...")
    print("Column Data Types:")
    print(Data Types)
    '''

    df = pd.read_csv(...)
    print("Number of Rows: " + ...)
    print("Number of Columns: " + ...)
    print("Column Data Types:")
    ...
    return df


url = 'https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Complete_TAVG_daily.txt'
cnames = ...
separator = ...   
argA = ...

climate = p1_q1a(url, cnames, separator, argA)
climate

In [None]:
grader.check("p1_q1a")

<!-- BEGIN QUESTION -->

### 1b. Understand the Data

What are the column names and what do they correspond to? *Use this Markdown cell to describe each succinctly (< 15 words per column)*

Also, provide example code showing how to detect whether there are missing data in each column. This should be a single line of code. *You can add a code cell to run the code and determine whether there are missing entries, but then delete the cell you added.*

State whether there are any missing entries in the DataFrame. 
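
One common way to count missing entries per column is `DataFrame.isna()` followed by `.sum()`. The sketch below applies it to a made-up frame; the frame `toy` and its columns are assumptions for illustration, not part of the project data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in column "x"
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", "c"]})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```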

* `Date_Number` is 
* `Year` is 
* `Month` is 
* `Day` is 
* `Day_of_Year` is 
* `Anomaly` is

Code for missing entries for each column 
```
# Add example code

```

Missing data is ...

<!-- END QUESTION -->

## 2. (8 pts) Manipulate Data

### 2a.  Add Temperature Column 

Add a new column to your DataFrame. 

The new column `Temp_C` adds the anomaly information to the estimated average temperature given in the data description (use the value and ignore the +/- part).  
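
A minimal sketch of adding a scalar baseline to a column, using a made-up frame. The value of `baseline_c` here is a placeholder assumption and is not the value from the data description; read the real value from the file's description.

```python
import pandas as pd

# Made-up anomalies in degrees Celsius
toy = pd.DataFrame({"Anomaly": [-0.5, 0.0, 0.75]})

# Hypothetical baseline, NOT the value from the data description
baseline_c = 10.0

# Scalar addition is applied to every row of the column
toy["Temp_C"] = toy["Anomaly"] + baseline_c
print(toy)
```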

In [None]:
# Create new column "Temp_C" 
...

climate['Temp_C'].mean()

In [None]:
grader.check("p1_q2a")

### 2b. Add Temperature in Fahrenheit 

While much of the world uses Celsius, the US uses Fahrenheit.  Add a column `Temp_F` to the data that reports each temperature value in Fahrenheit. 

$$ \text{Temp}_F = \text{Temp}_C \times \frac{9}{5} + 32 $$
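
The formula can be checked on a few well-known reference points. This is a small sketch; the helper name `celsius_to_fahrenheit` is an assumption introduced for illustration.

```python
import pandas as pd


def celsius_to_fahrenheit(temp_c):
    """Apply Temp_F = Temp_C * 9/5 + 32 to a number or a pandas Series."""
    return temp_c * 9 / 5 + 32


# Freezing point, boiling point, and the value where both scales agree
reference_c = pd.Series([0.0, 100.0, -40.0])
reference_f = celsius_to_fahrenheit(reference_c)
print(reference_f.tolist())
```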

In [None]:
climate['Temp_F'] = ...
climate['Temp_F'].mean()

In [None]:
grader.check("p1_q2b")

## 3. (37 pts) Calculate Statistics and Create Visualizations 
Pandas DataFrames have several methods for manipulation, aggregation, and calculation of meaningful statistics. The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is very helpful to understand attributes and methods. Matplotlib [pyplot](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.html) integrates well with both pandas DataFrames and numpy arrays for visualization tasks.

### 3a. Calculate the mean temp by year.

Create an `Index` of `years` and a 1-dimensional array of corresponding `mean_temp_year` holding the mean temperature for each year.  

Use the `Temp_F` column (temperature in Fahrenheit).  

*Hint: Use methods like `.groupby()` and `.mean()` to get the proper data, then use the `.keys()` method to get an index, and `.values` for an array.*
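
The methods named in the hint can be sketched on a made-up frame as follows. The frame and the names `toy_years` and `toy_means` are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: two observations per year
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "Temp_F": [50.0, 52.0, 54.0, 58.0],
})

# Group rows by year, then average the temperature within each group
yearly = toy.groupby("Year")["Temp_F"].mean()

toy_years = yearly.keys()   # Index of group labels
toy_means = yearly.values   # NumPy array of group means
print(toy_years)
print(toy_means)
```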



In [None]:
years = ...
mean_temp_year = ...
print(years[0:5])
print(mean_temp_year[0:5])

In [None]:
grader.check("p1_q3a")

<!-- BEGIN QUESTION -->

### 3b. Plot `mean_temp_year` vs. `years`

Plot the mean temperature per year vs. years.  

Select an appropriate plot type.  

Be sure to include a plot title, x-axis label, and y-axis label. 
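
For reference, a title and axis labels can be set on a Matplotlib `Axes` as in this sketch. The data are made up and the line plot is only one possible choice; choosing the plot type for the real data is left to you.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for years and yearly means
x = np.arange(2000, 2010)
y = np.linspace(50.0, 52.0, num=x.size)

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Toy yearly mean temperature")
ax.set_xlabel("Year")
ax.set_ylabel("Mean temperature (°F)")
```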

In [None]:
# Plot for each year (x-axis) vs. the "mean_temp_year" value (y-axis)



<!-- END QUESTION -->

### 3c. Calculate the mean anomaly by month
Create an `Index` of `months` and an array of `mean_anomaly_month` with the mean anomaly for each month. 

In [None]:
months = ...
mean_anomaly_month = ...

print('Months index:\n', months[0:5])
print('Means array:\n', mean_anomaly_month[0:5])

In [None]:
grader.check("p1_q3c")

<!-- BEGIN QUESTION -->

### 3d. Plot `mean_anomaly_month` vs. `months`

Select an appropriate type of plot. 

Be sure to include a plot title, x-axis label, y-axis label, and black horizontal line at 0 mean anomaly on the plot. 

Label the months with 3 letter abbreviations: Jan, Feb, Mar, Apr, ...
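
A sketch of the two plot details requested above, a horizontal reference line and abbreviated month labels, on made-up values. The bar plot is just one option, and `calendar.month_abbr` follows the current locale, so the labels are English only under an English or default locale.

```python
import calendar

import matplotlib.pyplot as plt
import numpy as np

month_numbers = np.arange(1, 13)
toy_values = np.linspace(-0.3, 0.3, num=12)  # made-up monthly values

# calendar.month_abbr[0] is an empty string, so skip it
month_labels = list(calendar.month_abbr)[1:]

fig, ax = plt.subplots()
ax.bar(month_numbers, toy_values)
ax.axhline(0, color="black", linewidth=1)  # horizontal line at zero
ax.set_xticks(month_numbers)
ax.set_xticklabels(month_labels)
ax.set_title("Toy monthly values")
ax.set_xlabel("Month")
ax.set_ylabel("Value")
```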

In [None]:
# Plot the mean_anomaly_month vs month 
#  Make sure there is a horizontal line for 0 mean anomaly on the plot
#  Label months with 3 letter abbreviations: Jan, Feb, Mar, Apr, ...


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### 3e. Plot distribution of the Anomaly data.

Create an overlapping density plot of the Anomaly data for three time periods: 

* before the average time period, labeled 'Before Ave.'
* during the average time period, labeled 'Ave. Period'
* after the average time period, labeled 'After Ave.'

Ensure that the density plots are normalized independently. 

Be sure to include a plot title, x-axis label, and y-axis label.

**HINT** You may want to use [seaborn's `kdeplot`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).  You can also add a new column to the data specifying the time periods. 

In [None]:
# Plot a density plot of the anomaly data 



<!-- END QUESTION -->

## 4. (25 pts) Get Additional Data and Explore

Let's also add information on the daily land-average max and min temperatures. 



### 4a. Load the additional data 

First, load the data from the provided URLs.  Use the `pandas` `read_csv` function, specifying the proper parameters. 

Use the same column names as in Question 1A. 

In [None]:
url_max = 'https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Complete_TMAX_daily.txt'
url_min = 'https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Complete_TMIN_daily.txt'

climate_max = ...
climate_min = ...

print(climate_max.head())
print(climate_min.head())

In [None]:
grader.check("p1_q4a")

### 4b. Create single data frame

Create a single DataFrame with the following columns: `Date_Number`, `Year`, `Month`, `Day`, `Day_of_Year`, `Temp_AVG`, `Temp_MIN`, and `Temp_MAX`.  The temperatures should be reported in degrees Celsius.  For the new DataFrames, `climate_min` and `climate_max`, you will need to calculate temperature from the anomaly data.  

The DataFrame should only contain observations when all temperature values (avg, min, max) are available.  

You can assume that the date indices are the same for each data set and not missing any days. 

**Note:** You should not use the `merge` or `join` functions. 
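
One way to combine columns without `merge` or `join` is to build a new DataFrame from the individual columns and then drop incomplete rows. The sketch below uses made-up frames and assumes, as stated above, that all frames share the same row index; the names `toy_avg`, `toy_min`, and `toy_max` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frames that share the same row index
toy_avg = pd.DataFrame({"Date_Number": [1, 2, 3], "Temp": [10.0, 11.0, np.nan]})
toy_min = pd.DataFrame({"Date_Number": [1, 2, 3], "Temp": [5.0, np.nan, 7.0]})
toy_max = pd.DataFrame({"Date_Number": [1, 2, 3], "Temp": [15.0, 16.0, 17.0]})

# Columns are aligned on the row index when the new frame is built
combined = pd.DataFrame({
    "Date_Number": toy_avg["Date_Number"],
    "Temp_AVG": toy_avg["Temp"],
    "Temp_MIN": toy_min["Temp"],
    "Temp_MAX": toy_max["Temp"],
})

# Keep only rows where all three temperatures are present
complete = combined.dropna(subset=["Temp_AVG", "Temp_MIN", "Temp_MAX"])
print(complete)
```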

In [None]:
# Create new DataFrame `full_data` 
full_data = ...

full_data.head()

In [None]:
grader.check("p1_q4b")

<!-- BEGIN QUESTION -->

### 4c. Plot mean, min, and max anomalies 

Create a plot showing, for each year, the mean of the average temperature, the minimum of the minimum temperature, and the maximum of the maximum temperature. 

Only plot the years since 1950.  

Consider using the [`agg` function](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html#) after doing a `groupby` operation.  Explore how to use `agg` to simplify your code. 
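
A sketch of named aggregation with `agg`, which applies a different function to each column in a single call. The frame is made up, and treating "since 1950" as including 1950 is an assumption of this example.

```python
import pandas as pd

# Made-up data in the column layout described in 4b
toy = pd.DataFrame({
    "Year": [1949, 1950, 1950, 1951, 1951],
    "Temp_AVG": [8.0, 9.0, 11.0, 10.0, 12.0],
    "Temp_MIN": [2.0, 3.0, 1.0, 4.0, 5.0],
    "Temp_MAX": [14.0, 15.0, 18.0, 16.0, 17.0],
})

# Assumption: "since 1950" includes the year 1950
recent = toy[toy["Year"] >= 1950]

# Named aggregation: output_column=(input_column, function)
yearly = recent.groupby("Year").agg(
    Temp_AVG=("Temp_AVG", "mean"),
    Temp_MIN=("Temp_MIN", "min"),
    Temp_MAX=("Temp_MAX", "max"),
)
print(yearly)
```

The result is indexed by `Year`, so calling `yearly.plot()` would draw one line per column.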



In [None]:
# Create the requested plot 



<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## 5. Bonus 

Here is a bonus question. 

#### Bonus Plot 

The plot should show, for each month, the average daily temperature difference (max - min) since 1950. 

Label the months with 3 letter abbreviations: Jan, Feb, Mar, Apr, ...

Choose an appropriate type of plot for this type of information. 


In [None]:
# Create the requested plot 



<!-- END QUESTION -->

## Congratulations! You have finished P1! 

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Once you submit this file to the P1 assignment on Gradescope, Gradescope will automatically submit a PDF file with some of your answers to the P1 - Figures assignment (making them easier to grade). 

**Important**: Please check that your responses were generated and submitted correctly to the P1 - Figures Assignment. 

**You are responsible for ensuring your submission follows our requirements and that the PDF for P1 - Figures answers was generated/submitted correctly. We will not grant regrade requests or extensions for submissions that don't follow instructions.** If you encounter any difficulties with the submission, contact course staff well ahead of the deadline. 

Make sure you have run all cells in your notebook **in order** before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

## Submission


In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)