# Problem Set 01

This is your first **problem set** for CSS 120.

It will cover the topics discussed in the first two weeks of class, including:

- Loading data
- Descriptive statistics
- Understanding the environmental data you are analyzing

## Things to keep in mind

### Independent work

Note that unlike labs, problem sets should be **completed independently**.

### Note on hidden tests

Your problem set will be graded with an **auto-grader**. Visible tests can be seen directly in the Jupyter notebook, and can be identified by their `assert` statements.

Some questions in this problem set also have **hidden tests**: in addition to the `assert` statements you can see in the code, the auto-grader runs tests you cannot see. Their purpose is to check whether your code **generalizes** to other cases. The problem set will tell you when and where there is a hidden test, even though you can't see it.
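To make the idea of generalization concrete, here is a small sketch. The function `fahrenheit_to_celsius` and both tests are hypothetical and are not part of this problem set. They only illustrate how a visible test and a hidden-style test might differ.

```python
def fahrenheit_to_celsius(temp_f):
    """Hypothetical function that a question might ask you to write."""
    return (temp_f - 32) * 5 / 9


# Visible test: the kind of check you can read in the notebook.
assert fahrenheit_to_celsius(32) == 0

# Hidden-style test: same function, different input.
# A hard-coded `return 0` would pass the visible test but fail this one.
assert fahrenheit_to_celsius(212) == 100
```

Code that computes its answer from the input, rather than hard-coding one expected value, should pass both kinds of test.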

### Q1. Loading Data (3pt)

In the problem set folder, I added a dataset of daily sea surface temperature measurements taken by the Scripps Institution of Oceanography.

The dataset is named `table_15.csv` (its original file name).

You have to:

1. Import pandas using the alias `pd` (1pt).
2. Load the dataset using `pandas`, naming it `dat` (2pt).
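As a rough illustration of what loading a CSV with pandas looks like, the sketch below reads a tiny in-memory table instead of a file. The column names and values are made up for this example and are not taken from `table_15.csv`.

```python
import io

import pandas as pd

# Toy CSV text standing in for a file on disk (made-up values).
toy_csv = io.StringIO(
    "Year,Month,Day,Temp\n"
    "2000,1,1,15.2\n"
    "2000,1,2,\n"
    "2000,1,3,15.6\n"
)

# read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(toy_csv)

print(toy.shape)                 # (number of rows, number of columns)
print(toy["Temp"].isna().sum())  # the empty field is read as a missing value
```

Checking `shape` right after loading is a quick way to confirm that the file was read as expected.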

In [1]:
import numpy as np

In [2]:
### BEGIN SOLUTION
import pandas as pd
dat = pd.read_csv("table_15.csv")
### END SOLUTION

In [3]:
assert pd

In [4]:
assert isinstance(dat, pd.DataFrame)

In [5]:
assert dat.shape == (37538, 6)

In [6]:
dat.head()

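| Unnamed: 0 | Date (PST) | Year | Month | Day | Sea Surface Temperature (C) | Sea Surface Temperature Flag |
|---|---|---|---|---|---|---|
| 0 | 8/22/16 | 1916 | 8 | 22 | 19.5 | 0 |
| 1 | 8/23/16 | 1916 | 8 | 23 | 19.9 | 0 |
| 2 | 8/24/16 | 1916 | 8 | 24 | 19.7 | 0 |
| 3 | 8/25/16 | 1916 | 8 | 25 | 19.7 | 0 |
| 4 | 8/26/16 | 1916 | 8 | 26 | 19.5 | 0 |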


### Q2. Descriptive Statistics (4pt)

Compute:

1. The mean temperature (0.5pt)
1. The standard deviation of the temperature (0.5pt)
1. The 0.005 quantile of the temperature, i.e. the 0.5th percentile (0.5pt)
1. The 0.995 quantile of the temperature, i.e. the 99.5th percentile (0.5pt)
1. The number of missing temperature values (1pt)
1. The average temperature in 1969 (1pt)
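The pandas methods involved can be sketched on a toy table. The data and the `toy_*` names below are invented for illustration. Note that `quantile` expects a fraction between 0 and 1, and that `mean` and `std` skip missing values by default.

```python
import numpy as np
import pandas as pd

# Made-up data: four observations, one of them missing.
toy = pd.DataFrame({
    "Year": [1968, 1969, 1969, 1969],
    "Temp": [10.0, 12.0, np.nan, 14.0],
})
temp = toy["Temp"]

toy_mean = temp.mean()           # 12.0, the NaN is skipped
toy_std = temp.std()             # 2.0, sample standard deviation (ddof=1)
toy_median = temp.quantile(0.5)  # 12.0, quantile takes a fraction, not a percent
toy_missing = temp.isna().sum()  # 1

# Filter rows by year, then average the selected column.
toy_mean_1969 = toy.loc[toy["Year"] == 1969, "Temp"].mean()  # 13.0
```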

In [7]:
# Variables to fill
mean_temp = ...
stdev_temp = ...
perc0_005 = ...
perc0_995 = ...
missing_temp = ...
avg_temp_1969 = ...

In [8]:
### BEGIN SOLUTION
mean_temp = dat["Sea Surface Temperature (C)"].mean()
stdev_temp = dat["Sea Surface Temperature (C)"].std()
perc0_005 = dat["Sea Surface Temperature (C)"].quantile(0.005)
perc0_995 = dat["Sea Surface Temperature (C)"].quantile(0.995)
missing_temp = dat["Sea Surface Temperature (C)"].isna().sum()
avg_temp_1969 = dat.loc[dat["Year"] == 1969, "Sea Surface Temperature (C)"].mean()
### END SOLUTION

In [9]:
# Mean temperature
assert mean_temp ** 2 > 296

In [10]:
# Standard deviation temperature
assert dat["Sea Surface Temperature Flag"].sum() / 100 < stdev_temp

In [11]:
# The 0.005 percentile (0.5pt) and the 0.995 percentile (0.5pt)
assert perc0_995 / perc0_005 > 3.833764 ** 0.49879

In [12]:
# How many missing data in the temperature (1pt).
assert missing_temp < (dat.shape[0] / 30) and missing_temp > (dat.shape[0] / 31)

In [13]:
# The average temperature in the year of 1969 (1pt)
assert avg_temp_1969 / mean_temp > stdev_temp / 3 and avg_temp_1969 / mean_temp < stdev_temp / 2

### Q3. Missing Data Imputation (3pt)

Create a function (1pt) that:

1. Receives a pandas Series
1. Finds the missing values that fall between two dates with data
1. Imputes each of them with the average of its neighbors: $ \dfrac{\text{temp}_{\text{day after}} + \text{temp}_{\text{day before}}}{2} $

A few rules:

1. If the missing value is on the first or the last date, do nothing (0.5pt).
1. If there are two or more consecutive missing values, skip them (0.5pt).

Use the placeholder provided.

After the imputation:

1. What is the mean temperature? (0.5pt)
1. How many missing observations remain? (0.5pt)

In [14]:
# Placeholder for the stats
mean_temp_imp = ...
nmissing_imp = ...

In [15]:
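def my_imputer(x):
    x = x.copy() # Believe me, it is going to help
    ### BEGIN SOLUTION
    # Skip the first and last positions: they have no neighbor on one side
    for i in range(1, len(x) - 1):
        # Impute only an isolated gap, i.e. both neighbors are observed
        if pd.isna(x.iloc[i]) and not (pd.isna(x.iloc[i-1]) or pd.isna(x.iloc[i+1])):
            x.iloc[i] = (x.iloc[i-1] + x.iloc[i+1]) / 2
    return x
    ### END SOLUTION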

# Compute the stats
### BEGIN SOLUTION
mean_temp_imp = my_imputer(dat["Sea Surface Temperature (C)"]).mean()
nmissing_imp = my_imputer(dat["Sea Surface Temperature (C)"]).isna().sum()
### END SOLUTION
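
The loop in `my_imputer` is not the only way to apply these rules. As a sketch of an alternative, and not the intended solution, the same logic can be written without a loop using `Series.shift` and `Series.mask`. The name `impute_isolated_gaps` is hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd


def impute_isolated_gaps(x):
    """Vectorized sketch: fill a missing value only when both neighbors exist."""
    before = x.shift(1)   # value on the previous row (NaN for the first row)
    after = x.shift(-1)   # value on the next row (NaN for the last row)
    # A gap is "isolated" when it is missing but both neighbors are observed.
    # The first and last rows never qualify because one shifted neighbor is NaN.
    isolated = x.isna() & before.notna() & after.notna()
    # mask() replaces values where the condition is True and returns a new Series.
    return x.mask(isolated, (before + after) / 2)


example = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan, 3, np.nan])
result = impute_isolated_gaps(example)
print(result.tolist())  # [nan, 1.0, nan, nan, 2.0, 2.5, 3.0, nan]
```

Only the gap between 2 and 3 is filled. The leading and trailing gaps and the two consecutive gaps are left untouched, which matches the rules stated for this question.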

In [16]:
# function
assert my_imputer(pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan, 3, np.nan])).mean() >= 4.25 / 2

In [17]:
# if missing is in the first and the last date, then do nothing (0.5pt).
assert my_imputer(pd.Series([np.nan, 2, np.nan])).isna().sum() == pd.Series([1, 2, 3]).mean()

In [18]:
# If there are two or more consecutive missing data, then just skip (0.5pt).
assert my_imputer(pd.Series([1, np.nan, np.nan, 2])).mean() == pd.Series([1, 2]).mean()

In [19]:
# What is the mean temperature? (0.5pt)
assert mean_temp_imp / mean_temp - 1 > 0.0005 and mean_temp_imp / mean_temp - 1 < 0.0048

In [20]:
# How many missing observations remain? (0.5pt)
assert nmissing_imp ** 0.1 > 1.78 and nmissing_imp ** 0.11 < 2.0

### Submit!

Once you've completed all the cells above (saving regularly):

- Click "Validate". This runs a check to determine whether you've passed all visible tests.
- Once the assignment is validated, a "Submit" option appears next to where the assignment is stored in your directory. Click it.
- The assignment will then appear under your "Submitted Assignments" section.

If you have any trouble accessing or submitting the assignment, please check in with your TA!