
**Data Cleaning and Pre-processing**



---

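This notebook cleans and preprocesses the daily energy demand data, then fits a baseline model. The steps are:

- Checking for and removing missing and null values
- Splitting the data into training and validation sets
- Saving the cleaned splits to CSV
- Fitting a persistence model as the baseline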



---



# Importing Dependencies

In [131]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error

# Loading the Data

In [109]:
data = pd.read_csv('/content/DailyEnergyDemandData.csv')

In [110]:
data.shape

(2440, 2)

In [111]:
data

Unnamed: 0,Date,Demand
0,2015-07-01,20243
1,2015-07-02,19275
2,2015-07-03,17682
3,2015-07-04,16523
4,2015-07-05,17620
...,...,...
2435,2022-03-01,18069
2436,2022-03-02,16941
2437,2022-03-03,17313
2438,2022-03-04,17953


In [112]:
data['Demand'] = data['Demand'].astype(int)

In [113]:
data

Unnamed: 0,Date,Demand
0,2015-07-01,20243
1,2015-07-02,19275
2,2015-07-03,17682
3,2015-07-04,16523
4,2015-07-05,17620
...,...,...
2435,2022-03-01,18069
2436,2022-03-02,16941
2437,2022-03-03,17313
2438,2022-03-04,17953


In [114]:
data.columns = ['Date', 'Demand']


# Data Cleaning



> Removing Missing and Null Values





In [115]:
data.isnull().sum()

Date      0
Demand    0
dtype: int64

In [116]:
# Treat common placeholder strings as missing values so that isna() can detect them
data.replace({'?': np.nan, '': np.nan, ' ': np.nan, '-': np.nan, '--': np.nan}, inplace=True)

In [117]:
data.isna().sum()

Date      0
Demand    0
dtype: int64

In [118]:
data.dropna(axis = 0, inplace= True)
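
The real dataset contains no missing values, so the cleaning cells above have no visible effect. As an illustration only, here is a minimal sketch of the same replace-then-drop pattern on a toy frame; the frame `toy` and its placeholder entries are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows use placeholder strings instead of a demand value.
toy = pd.DataFrame({
    "Date": ["2015-07-01", "2015-07-02", "2015-07-03", "2015-07-04"],
    "Demand": ["20243", "?", " ", "16523"],
})

# Before replacement the placeholders are ordinary strings, so isna() finds nothing.
missing_before = int(toy["Demand"].isna().sum())

# Map the placeholders to NaN, mirroring the mapping used in the notebook.
cleaned = toy.replace({"?": np.nan, "": np.nan, " ": np.nan, "-": np.nan, "--": np.nan})
missing_after = int(cleaned["Demand"].isna().sum())

# Drop incomplete rows, then restore an integer dtype.
cleaned = cleaned.dropna(axis=0)
cleaned["Demand"] = cleaned["Demand"].astype(int)

print(missing_before, missing_after, len(cleaned))
```

Running the null check both before and after the replacement, as the notebook does, is what reveals placeholders that pandas would otherwise treat as valid values.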

# Data Preprocessing





## Split Data into Training and Validation Sets

In [121]:
data['Date'] = pd.to_datetime(data['Date'])

In [123]:
dataset = data[data['Date'].dt.year != 2022]
validation = data[data['Date'].dt.year == 2022]
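
The year-based filter works here because 2022 is the last year in the data. A possible alternative, sketched below under that assumption, is to split on an explicit cutoff date and verify that the two parts are disjoint and chronologically ordered. The names `toy`, `cutoff`, `train`, and `holdout` are illustrative; the six demand values are copied from the outputs shown in this notebook.

```python
import pandas as pd

# Six consecutive days straddling the 2021/2022 boundary.
dates = pd.date_range("2021-12-29", "2022-01-03", freq="D")
toy = pd.DataFrame({
    "Date": dates,
    "Demand": [17162, 16732, 15852, 15223, 15914, 18620],
})

# Everything strictly before the cutoff is training data; the rest is held out.
cutoff = pd.Timestamp("2022-01-01")
train = toy[toy["Date"] < cutoff]
holdout = toy[toy["Date"] >= cutoff]

# Sanity checks: no row is lost or duplicated, and training precedes the holdout.
assert len(train) + len(holdout) == len(toy)
assert train["Date"].max() < holdout["Date"].min()

print(len(train), len(holdout))
```

An explicit cutoff keeps the split correct if later years are appended to the file, whereas `!= 2022` would silently move them into the training set.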

## Set 'Date' Column as Index

In [124]:
dataset = dataset.set_index('Date')
validation = validation.set_index('Date')

## Converting Data into Series

In [125]:
dataset = dataset.squeeze()
dataset

Date
2015-07-01    20243
2015-07-02    19275
2015-07-03    17682
2015-07-04    16523
2015-07-05    17620
              ...  
2021-12-27    17661
2021-12-28    17164
2021-12-29    17162
2021-12-30    16732
2021-12-31    15852
Name: Demand, Length: 2376, dtype: int64

In [126]:
validation = validation.squeeze()
validation

Date
2022-01-01    15223
2022-01-02    15914
2022-01-03    18620
2022-01-04    18918
2022-01-05    18288
              ...  
2022-03-01    18069
2022-03-02    16941
2022-03-03    17313
2022-03-04    17953
2022-03-05    16260
Name: Demand, Length: 64, dtype: int64

In [127]:
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))


Dataset 2376, Validation 64


# Saving Cleaned Data to CSV

In [129]:
# data.to_csv("DailyEnergyDemandDataCleaned.csv")
dataset.to_csv('energydemand_training.csv')
validation.to_csv('energydemand_validation.csv')
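
Because `Series.to_csv` writes a header row by default, any later read of these files has to account for it. The following is a minimal sketch of the round trip, assuming default `to_csv` settings; it uses an in-memory buffer and a three-row series (values from the table above) instead of the real files.

```python
import io

import pandas as pd

# Small stand-in for the training series, with the same index and column names.
series = pd.Series(
    [20243, 19275, 17682],
    index=pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-03"]),
    name="Demand",
)
series.index.name = "Date"

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
series.to_csv(buffer)
text = buffer.getvalue()
first_line = text.splitlines()[0]  # the header row written by to_csv

# Read it back: header=0 consumes the header row, and squeeze("columns")
# turns the resulting one-column DataFrame back into a Series.
restored = pd.read_csv(
    io.StringIO(text), header=0, index_col=0, parse_dates=True
).squeeze("columns")

print(first_line)
print(restored.dtype)
```

Reading with `header=None` instead would treat the header row as data and leave the demand values as strings.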

# Fitting the Training Data to the Baseline 'Persistence Model'

In [137]:
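# evaluate a persistence model
# The training CSV was saved with a header row, so read it with header=0.
# The squeeze= keyword was removed from read_csv in pandas 2.0;
# calling .squeeze('columns') on the result is the replacement.
series = pd.read_csv('energydemand_training.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# prepare data: use the first half to seed the history and the second half for testing
X = series.values
print(X)
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]

# walk-forward validation
history = list(train)
predictions = list()
for i in range(len(test)):
    # predict: the forecast is simply the most recent observation
    yhat = history[-1]
    predictions.append(yhat)
    # observation: reveal the true value and add it to the history
    obs = test[i]
    history.append(obs)
    print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))

# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

[20243 19275 17682 ... 17162 16732 15852]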
>Predicted=14815.000, Expected=17100.000
>Predicted=17100.000, Expected=18033.000
>Predicted=18033.000, Expected=17690.000
>Predicted=17690.000, Expected=17839.000
>Predicted=17839.000, Expected=16712.000
>Predicted=16712.000, Expected=15642.000
>Predicted=15642.000, Expected=16302.000
>Predicted=16302.000, Expected=17307.000
>Predicted=17307.000, Expected=18574.000
>Predicted=18574.000, Expected=19497.000
>Predicted=19497.000, Expected=19439.000
>Predicted=19439.000, Expected=16546.000
>Predicted=16546.000, Expected=14753.000
>Predicted=14753.000, Expected=14414.000
>Predicted=14414.000, Expected=16404.000
>Predicted=16404.000, Expected=16023.000
>Predicted=16023.000, Expected=16201.000
>Predicted=16201.000, Expected=16426.000
>Predicted=16426.000, Expected=16254.000
>Predicted=16254.000, Expected=14941.000
>Predicted=14941.000, Expected=15087.000
>Predicted=15087.000, Expected=16615.000
>Predicted=16615.000, Expected=16345.000
...

The RMSE value for the base model is 1072.812
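
For reference, the persistence forecast and its error can be written out explicitly. The notation is an assumption introduced here, not the notebook's: $y_t$ is the observed demand on test day $t$, $\hat{y}_t$ is the forecast, $n$ is the number of test days, and $y_0$ is the last value of the training half.

```latex
\hat{y}_t = y_{t-1},
\qquad
\mathrm{RMSE}
  = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}
  = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - y_{t-1}\right)^2}
```

Under this model the RMSE is the root mean square of the day-to-day change in demand, expressed in the same units as the Demand column.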


---

The actual model must achieve a lower RMSE than the persistence model to be worth using.
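
To compare later models against this baseline on equal terms, the walk-forward loop can be condensed into a small helper. This is a sketch under the same assumptions as the cell above (a 50/50 split, one-step-ahead forecasts); the function name `persistence_rmse` and its `train_fraction` parameter are hypothetical, and the six sample values are copied from the table at the top of the notebook purely as test input.

```python
from math import sqrt

import numpy as np


def persistence_rmse(values, train_fraction=0.5):
    """RMSE of a one-step persistence forecast over the tail of `values`."""
    values = np.asarray(values, dtype="float64")
    split = int(len(values) * train_fraction)
    if split < 1 or split >= len(values):
        raise ValueError("need at least one training and one test observation")
    actual = values[split:]
    # Each forecast is the observation immediately before the one being predicted.
    predicted = values[split - 1:-1]
    return sqrt(np.mean((actual - predicted) ** 2))


# Six sample demand values, used here only to exercise the function.
demand = [20243, 19275, 17682, 16523, 17620, 18069]
score = persistence_rmse(demand)
print("RMSE: %.3f" % score)
```

Slicing the array with a one-step offset yields the same forecasts as appending each observation to a history list, without the explicit loop.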