> When data contains missing values, we can remove any row containing even one missing value, but that may be too heavy-handed and may also remove useful data. One alternative is interpolation: replacing NaN with plausible values. The values may be wrong, but they will be roughly in the right ballpark.
>
> In this exercise, we load some basic temperature data from New York City from the end of 2018 and the start of 2019. We then simulate a simple recurring equipment failure at 3:00 and 6:00 a.m., preventing us from getting temperature readings at those hours. How well does interpolation help us, and how far off are the interpolated mean and median calculations from the original, true values? Here are the steps I want you to take:
>
> 1. Load the temperature data from New York City (from the nyc-temps.txt file) into a series. The measurements are in degrees Celsius.
>
> 2. Create a data frame with two columns: temp, with the temperatures, and hour, representing the hours at which the measurements were taken. The hour values should be 0, 3, 6, 9, 12, 15, 18, and 21, repeated for all 728 data points.
>
> 3. Calculate the mean and median values. These are the real values, which we hope to replicate via interpolation.
>
> 4. Set all values from 3:00 and 6:00 a.m. to NaN.
>
> 5. Interpolate the values with the interpolate method.
>
> 6. What are the mean and median of the interpolated data frame? Are they similar to the real values? Why or why not?

SOLUTION

In [25]:
import pandas as pd
from pandas import Series
from pandas import DataFrame

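# Load the readings into a Series. No header=None is passed here, so
# read_csv treats the first line of the file as the column name
# (hence "Name: 772.768" in the output below).
nyc_temp = pd.read_csv('./data/nyc-temps.txt').squeeze()
nyc_temp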

0         8.9360
1       169.6320
2       212.7410
3       260.4970
4       293.8110
          ...   
1386    276.3320
1387    267.1790
1388    223.9820
1389    157.6720
1390     12.8763
Name: 772.768, Length: 1391, dtype: float64
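
The `Name: 772.768` in the output above suggests that `read_csv` consumed the first reading as a column header. Here is a minimal sketch of the difference, using a small in-memory stand-in for the file. The four values in `raw` are made up for illustration and are not taken from `nyc-temps.txt`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a text file with one value per line
raw = "5.0\n4.0\n3.0\n6.0\n"

# Default behavior: the first line becomes the column name
with_header = pd.read_csv(io.StringIO(raw)).squeeze()

# header=None: every line is kept as data
no_header = pd.read_csv(io.StringIO(raw), header=None).squeeze()

print(len(with_header), repr(with_header.name))
print(len(no_header))
```

With the real file, passing `header=None` in the same way would presumably keep the first reading in the series instead of using it as the name.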

In [28]:
nyc_temp = pd.read_csv('./data/nyc-temps.txt').squeeze()

# Hours cycle through 0, 3, ..., 21, one value per reading
a = [i * 3 % 24 for i in range(len(nyc_temp))]
print(a[:16])  # show the first two days only

df = DataFrame({'temp': nyc_temp,
                'hour': a})
df

[0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21]
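
The hour column can also be built without an explicit loop, by doing the same modulo arithmetic on a NumPy range. This is only a sketch. The `temps` values are made up, and `toy` is a hypothetical name standing in for the real data frame.

```python
import numpy as np
import pandas as pd

# Made-up readings, ten values (one full day plus two readings)
temps = pd.Series([2.0, 1.0, 0.0, 3.0, 6.0, 8.0, 7.0, 4.0, 3.0, 2.0])

# One reading every 3 hours, wrapping around at midnight
hours = (np.arange(len(temps)) * 3) % 24

toy = pd.DataFrame({'temp': temps, 'hour': hours})
print(toy)
```

Using `len(temps)` rather than a hard-coded count keeps the hour column the same length as the data if the file changes.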

          temp  hour
0       8.9360     0
1     169.6320     3
2     212.7410     6
3     260.4970     9
4     293.8110    12
...        ...   ...
1386  276.3320     6
1387  267.1790     9
1388  223.9820    12
1389  157.6720    15
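
Step 3 asks for the mean and median before any values are removed. Here is a sketch on a small made-up frame (`toy` and its values are hypothetical). With the real data, the same two calls would be applied to `df['temp']` at this point, and the results kept for comparison later.

```python
import pandas as pd

# Made-up temperatures for a single day, one reading every 3 hours
toy = pd.DataFrame({'temp': [2.0, 1.0, 0.0, 3.0, 6.0, 8.0, 7.0, 4.0],
                    'hour': [0, 3, 6, 9, 12, 15, 18, 21]})

# Keep the "true" statistics before any values are removed
true_mean = toy['temp'].mean()
true_median = toy['temp'].median()
print(true_mean, true_median)
```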


In [33]:
import numpy as np

# Simulate the equipment failure: blank out the 3:00 and 6:00 a.m. readings
df.loc[df['hour'].isin([3, 6]), 'temp'] = np.nan
df

          temp  hour
0       8.9360     0
1          NaN     3
2          NaN     6
3     260.4970     9
4     293.8110    12
...        ...   ...
1386       NaN     6
1387  267.1790     9
1388  223.9820    12
1389  157.6720    15
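
As a sanity check on the masking step, two of the eight daily readings are removed, so about a quarter of the `temp` column should now be NaN. Here is a sketch on made-up data (`toy` is a hypothetical frame covering two full days):

```python
import numpy as np
import pandas as pd

# Two made-up days of readings, one every 3 hours
hours = [i * 3 % 24 for i in range(16)]
toy = pd.DataFrame({'temp': np.arange(16, dtype=float), 'hour': hours})

# Same masking as above: drop the 3:00 and 6:00 a.m. readings
toy.loc[toy['hour'].isin([3, 6]), 'temp'] = np.nan

n_missing = toy['temp'].isna().sum()       # number of NaN readings
share_missing = toy['temp'].isna().mean()  # fraction of NaN readings
print(n_missing, share_missing)
```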


In [35]:
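# Fill each NaN by linear interpolation (the default method) between
# the nearest valid readings before and after it
df = df.interpolate()
df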

            temp  hour
0       8.936000     0
1      92.789667     3
2     176.643333     6
3     260.497000     9
4     293.811000    12
...          ...   ...
1386  278.834000     6
1387  267.179000     9
1388  223.982000    12
1389  157.672000    15
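
Step 6 asks how the interpolated mean and median compare with the true ones. Here is a sketch of that comparison on made-up data. The `toy` values are hypothetical and chosen so that the coldest readings fall at 3:00 and 6:00 a.m. The results for the real file may differ.

```python
import numpy as np
import pandas as pd

# Two made-up days with a dip in the early morning
temps = [2.0, 1.0, 0.0, 3.0, 6.0, 8.0, 7.0, 4.0,
         3.0, 2.0, 1.0, 4.0, 7.0, 9.0, 8.0, 5.0]
hours = [i * 3 % 24 for i in range(len(temps))]
toy = pd.DataFrame({'temp': temps, 'hour': hours})

# True statistics, computed before removing anything
true_mean, true_median = toy['temp'].mean(), toy['temp'].median()

# Remove the 3:00 and 6:00 a.m. readings, then interpolate linearly
toy.loc[toy['hour'].isin([3, 6]), 'temp'] = np.nan
filled = toy.interpolate()
interp_mean, interp_median = filled['temp'].mean(), filled['temp'].median()

print(f"mean:   true={true_mean:.3f}  interpolated={interp_mean:.3f}")
print(f"median: true={true_median:.3f}  interpolated={interp_median:.3f}")
```

In this toy case the interpolated mean comes out higher than the true mean. Linear interpolation draws a straight line from the midnight reading to the 9:00 a.m. reading, so it skips over the early-morning dip. If the real temperatures also bottom out in the early morning, a similar upward bias would be a plausible explanation for any gap in the real results.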
