<a href="https://colab.research.google.com/github/stevengregori92/Project-Interpolation-Impact/blob/main/Project_Interpolation_Impact.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Measure Interpolation Impact

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- The goal of the project is to see how large an impact interpolation can have on results.
- The focus is mainly on step 2.
- To see the impact, we will fit a simple model and compare its scores.
- The project will not go into the details of steps 3 to 5.

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/weather-predict.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` on the data to check that everything looks as expected

In [None]:
data = pd.read_parquet('weather-predict.parquet')
data.head()

Datetime,Pressure,Pressure+24h
2006-04-01 00:00:00,1015.13,1015.68
2006-04-01 01:00:00,1015.63,1015.41
2006-04-01 02:00:00,1015.94,1014.98
2006-04-01 03:00:00,1016.41,1015.18
2006-04-01 04:00:00,1016.51,1014.7


In [None]:
len(data)

96418

In [None]:
data.shape

(96418, 2)

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Check the data types
- This step tells you whether a numeric column is stored as a non-numeric type.
- Get the data types by ```.dtypes```

In [None]:
data.dtypes

Pressure        float64
Pressure+24h    float64
dtype: object
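
Here both columns are already `float64`, so nothing needs converting. As a minimal sketch of what the check would catch, the hypothetical toy frame below (not part of the project data) stores one column as text and converts it with `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical toy frame: 'Pressure' was read as text, so pandas gives it
# a non-numeric dtype (object or string, depending on the pandas version).
toy = pd.DataFrame({
    'Pressure': ['1015.13', '1015.63', 'n/a'],
    'Pressure+24h': [1015.68, 1015.41, 1014.98],
})
print(toy.dtypes)

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy['Pressure'] = pd.to_numeric(toy['Pressure'], errors='coerce')
print(toy.dtypes)
```

After the conversion the unparseable entry shows up as a NaN, which the null-value check in the next step would then count.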

### Step 2.b: Check the length, null-values, and zero values
- Check the length
    - HINT: Use `len()`
- Check the number of null-values
    - HINT: Use `.isna().sum()`
- Check the number of zero-values
    - HINT: Use `(data == 0).sum()`

In [None]:
len(data), len(data.dropna())

(96418, 96380)

In [None]:
data.isna().sum()

Pressure         0
Pressure+24h    38
dtype: int64

In [None]:
(data == 0).sum()

Pressure        1288
Pressure+24h    1288
dtype: int64

### Step 2.c: Baseline
- Check the correlation to get a baseline measure of what we have before any cleaning
    - HINT: Use `corr()`

In [None]:
data.corr()['Pressure'].sort_values(ascending = False)

Pressure        1.000000
Pressure+24h    0.419074
Name: Pressure, dtype: float64
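
One plausible reason for a low baseline is that placeholder zeros act as extreme outliers next to pressure readings around 1015. The sketch below uses synthetic, made-up values (not the project data) to illustrate how a single zero in each column can pull an otherwise perfect Pearson correlation down:

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly correlated columns (values are made up).
x = np.linspace(1005.0, 1025.0, 21)
clean = pd.DataFrame({'Pressure': x, 'Pressure+24h': x + 1.0})

# Overwrite one reading per column with a placeholder zero.
with_zeros = clean.copy()
with_zeros.loc[3, 'Pressure'] = 0.0
with_zeros.loc[15, 'Pressure+24h'] = 0.0

corr_clean = clean.corr().loc['Pressure', 'Pressure+24h']
corr_zeros = with_zeros.corr().loc['Pressure', 'Pressure+24h']
print(corr_clean, corr_zeros)
```

In this toy case two zeros out of 21 rows are enough to destroy the correlation, because each zero lies roughly a thousand units away from every other value.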

### Step 2.d: Prepare data
- We know `Pressure+24h` has NaN and 0 values.
- These are not correct values and we cannot use them in our model.
- Create a `dataset` without these rows.
    - HINT: Use filters like `data[data['Pressure+24h'] != 0]` and `dropna()`

In [None]:
dataset = data[data['Pressure+24h'] != 0].dropna()

In [None]:
len(dataset)

95092

### Step 2.e: Check the size and zero values
- Check the size of the datasets `data` and `dataset`
- Check how many zero-values each dataset has

In [None]:
data.size, dataset.size

(192836, 190184)
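
The zero-value check for both datasets can be sketched as follows. The frame `toy_data` is a hypothetical stand-in with made-up values; on the real data you would apply the same `(… == 0).sum()` expression to `data` and `dataset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data`; values are made up.
toy_data = pd.DataFrame({
    'Pressure':     [1015.1, 0.0, 1015.9, 1016.4, 1016.5],
    'Pressure+24h': [1015.7, 1015.4, 0.0, np.nan, 1014.7],
})

# Same filtering as Step 2.d: drop rows where the target is 0 or NaN.
toy_dataset = toy_data[toy_data['Pressure+24h'] != 0].dropna()

# Count zeros per column, side by side for both frames.
zero_counts = pd.DataFrame({
    'data': (toy_data == 0).sum(),
    'dataset': (toy_dataset == 0).sum(),
})
print(zero_counts)
```

In this toy example the filter removes every zero from `Pressure+24h`, but a zero in `Pressure` survives because the filter only looks at the target column. That is the kind of leftover value a later `replace(0, np.nan)` would pick up.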

### Step 2.f: Check the correlation
- For fun, check the correlation of `dataset`
- Then do the same after interpolating the 0 values
    - HINT: Apply `replace` and `interpolate`
- Does the result surprise you?
- Notice how much interpolation improves the correlation (from about 0.08 to about 0.79 in the outputs below)

In [None]:
dataset.corr()

,Pressure,Pressure+24h
Pressure,1.0,0.083047
Pressure+24h,0.083047,1.0


In [None]:
dataset.replace(0, np.nan).interpolate().corr()

,Pressure,Pressure+24h
Pressure,1.0,0.79447
Pressure+24h,0.79447,1.0
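
To see what `replace(0, np.nan).interpolate()` does to individual values, here is a minimal sketch on a made-up series. With the default `method='linear'`, pandas fills each gap on a straight line between the neighbouring valid values, treating the rows as equally spaced:

```python
import numpy as np
import pandas as pd

# Made-up readings where 0.0 marks a missing measurement.
toy = pd.Series([1010.0, 0.0, 1014.0, 0.0, 0.0, 1008.0])

# Step 1: mark zeros as missing. Step 2: fill the gaps linearly.
filled = toy.replace(0, np.nan).interpolate()
print(filled.tolist())  # [1010.0, 1012.0, 1014.0, 1012.0, 1010.0, 1008.0]
```

The filled values are estimates, not measurements, so the improvement in correlation partly reflects how smooth the interpolated stretches are.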


### Step 2.g: Linear Regression Function
- Create a function `regression_score` to calculate the R-squared (R²) score
- It should take the independent features X and the dependent feature y
- Then split them into training and testing sets.
- Fit the model on the training set.
- Predict on the test set.
- Return the R-squared score of the predictions against the true test values

In [None]:
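def regression_score(X, y):
    # Hold out 20% of the rows for testing; fixed seed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    y_pred = lin.predict(X_test)
    # r2_score expects (y_true, y_pred); swapping them changes the score.
    return r2_score(y_test, y_pred)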
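
The argument order of `r2_score` matters: scikit-learn documents the signature as `r2_score(y_true, y_pred)`, and the score is not symmetric, because the denominator is the variance of whichever array is passed first. A minimal sketch with made-up numbers, where the predictions have far less spread than the true values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.4, 2.5, 2.5, 2.6])  # made-up predictions with little spread

score = r2_score(y_true, y_pred)    # true values first: roughly 0.116
swapped = r2_score(y_pred, y_true)  # arguments reversed: roughly -220
print(score, swapped)
```

When a model predicts values that barely vary, passing the predictions first divides the same residuals by a tiny variance, which can produce very large negative scores.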

### Step 2.h: Test baseline
- Test the `regression_score` function on `dataset`

In [None]:
regression_score(dataset[['Pressure']], dataset['Pressure+24h'])

-152.7581067291715

### Step 2.i: Test on interpolated dataset
- Make an interpolated dataset
- Get the result (from `regression_score`) for interpolated dataset

In [None]:
dataset_interpolated = dataset.replace(0, np.nan).interpolate()

regression_score(dataset_interpolated[['Pressure']], dataset_interpolated['Pressure+24h'])

0.4042254234015955