<a href="https://colab.research.google.com/github/ssebadduka/climate_change_predict_sarima/blob/main/Climate_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'daily-climate-time-series-data:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F312121%2F636393%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240806%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240806T094247Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D454b686c3aae1aad305f137ae627f82214aa9a371157af7ad0564ef29e914d87e7cf06bc3a942777c6b3858da1c02389cca998b8c17920f75636f5484ac26e1c9a9a451d002efe2e3effb4736eb69dbfda3717cf96109c27058b9907a90bc6e2e3029439443669cd46dd7d670d1f915b51168fafb2a4fcb097b68783d1f513af23075a0afd0fa2da4ae46a3677be595e9c7841ca95b9b669e55e230141e7eb0f9ed51ccd2941f6af42a24f0c692ccbc698ea60dd109dfc5ebf86cbc29e6b0bb765ccaa339484f1c937b464c5c2a32e2f929d00e1afc7201314859d6f76b0ccb8d6ddb33d7d36d5e87e0b03cda311aa22d60e9d8ebbe88528b8ec2ac565a41a37'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
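              # Flush so the whole archive is on disk before reopening it by name,
              # and avoid shadowing the tarfile module with the archive handle
              tfile.flush()
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)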
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow import nn

In [None]:
# Standardizing the size of all images
mpl.rcParams['figure.figsize'] = (15,9)
mpl.rcParams['font.size'] = 15

# 1. Data Exploration

This dataset comes in two files, one for training and one for testing. To simplify things, we are going to merge them into a single dataset, which makes the data easier to explore and manipulate.

In [None]:
df_train = pd.read_csv('../input/daily-climate-time-series-data/DailyDelhiClimateTrain.csv')
df_train.head()

In [None]:
df_test = pd.read_csv('../input/daily-climate-time-series-data/DailyDelhiClimateTest.csv')
df_test.head()

In [None]:
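# Merging the datasets into one
df = pd.concat((df_test, df_train), ignore_index=True)
df['date'] = pd.to_datetime(df['date'])
# Sort chronologically and rebuild the index so labels match row positions
df = df.sort_values(by=['date']).reset_index(drop=True)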

In [None]:
# Visualizing the behavior of all columns
columns = df.drop(columns=['date']).columns
for column in columns:
    sns.lineplot(x=df['date'], y=df[column])
    plt.title(column)
    plt.show()

All the columns look quite stable over time, which is good for us and our models!

In [None]:
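# Function to convert our dataset into a time series dataset:
# each sample holds the `window` rows before position i, and its target
# is the value at position i + offset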
def create_window(target, feature, window=1, offset=0):
    feature_new, target_new = [], []
    feature_np = feature.to_numpy()
    target_np = target.to_numpy()
    for i in range(window, target.shape[0] - offset):
        feature_list = feature_np[i - window:i]
        feature_new.append(feature_list.reshape(window, feature_np.shape[1]))
        target_new.append(target_np[i+offset].reshape(1))
    return np.array(feature_new), np.array(target_new)

The function above is simple but very useful. With it we can create our time series, varying the window size, the offset, and the features we want in our new dataset.
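
To make the windowing concrete, here is a minimal sketch on a toy series. It restates the same idea with NumPy only; `make_windows` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
import numpy as np

def make_windows(values, window=1, offset=0):
    """NumPy-only restatement of the windowing idea (hypothetical helper)."""
    values = np.asarray(values)
    features, targets = [], []
    for i in range(window, len(values) - offset):
        features.append(values[i - window:i])  # the `window` steps before i
        targets.append(values[i + offset])     # the value we want to predict
    return np.array(features), np.array(targets)

toy = np.arange(6)                      # 0, 1, 2, 3, 4, 5
toy_x, toy_y = make_windows(toy, window=3)
print(toy_x)   # rows: [0 1 2], [1 2 3], [2 3 4]
print(toy_y)   # [3 4 5]

# With offset=1 the target moves one step further into the future
_, toy_y_ahead = make_windows(toy, window=3, offset=1)
print(toy_y_ahead)  # [4 5]
```

Each row of `toy_x` is paired with the value that comes right after it, so a window of 3 on six points yields three samples.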

In [None]:
# Scale the whole dataset (not including the date)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.drop(columns='date'))
df_scaled = pd.DataFrame(df_scaled, columns=df.drop(columns='date').columns)

# Set the window to 10
window = 10
feature_columns = ['humidity', 'wind_speed', 'meanpressure', 'meantemp']

# Create a window with all the columns as features (excluding the date)
feature, target = create_window(df_scaled['meantemp'],df_scaled[feature_columns], window=window)
print(feature[0])
print(target[0])
print(df_scaled.head(12))

Here we have our first time series, using all the columns and a window of 10.

In [None]:
# Function to create train and test datasets
def train_test(feature, target, perc_train = 0.9):
    size_train = int(len(feature) * perc_train)

    x_train = feature[0:size_train]
    y_train = target[0:size_train]

    x_test = feature[size_train: len(feature)]
    y_test = target[size_train: len(feature)]

    return x_train, x_test, y_train, y_test

Here we have a function to create the train and test datasets. We use a custom function because we want to split the dataset without any shuffling, just a clean cut, so that we preserve the temporal order of the data.
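
As a small illustration of the clean cut, the sketch below splits a toy sequence the same way and checks that every training step comes before every test step. `chronological_split` is a hypothetical stand-in that mirrors the logic described above.

```python
import numpy as np

def chronological_split(x, y, perc_train=0.9):
    """Split without shuffling: the first part trains, the rest tests."""
    cut = int(len(x) * perc_train)
    return x[:cut], x[cut:], y[:cut], y[cut:]

steps = np.arange(10)      # stands in for ten consecutive days
labels = steps * 2         # made-up targets, one per day

tr_x, te_x, tr_y, te_y = chronological_split(steps, labels, perc_train=0.8)
print(tr_x)  # [0 1 2 3 4 5 6 7]
print(te_x)  # [8 9]

# No test step precedes a training step, so the model never sees the future
print(tr_x.max() < te_x.min())  # True
```

A shuffled split would mix later days into the training set, which makes the test score look better than a real forecast would be.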

In [None]:
x_train, x_test, y_train, y_test = train_test(feature, target)

# Visualize the train and test data
sns.lineplot(x=df['date'].iloc[window:len(y_train) + window], y=y_train[:,0], label='Train')
sns.lineplot(x=df['date'].iloc[window + len(y_train):], y=y_test[:,0], label='Test')

# 2. Prediction

In [None]:
# Create a standard model using LSTM
def model_lstm(x_shape):

    model = keras.Sequential()
    model.add(keras.layers.LSTM(64, input_shape=(x_shape[1], x_shape[2])))
    model.add(keras.layers.Dense(units=1))

    model.compile(loss='mean_squared_error', optimizer='rmsprop')
    return model

We create a simple model because our dataset is not too complex, so a simple neural network will do.

We wrap it in a function because we reuse this same model later.
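
To give a sense of how small this model is, its parameter count can be worked out by hand. The sketch below uses the usual count for a Keras LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The helper names are hypothetical, and the feature counts are the ones used in this notebook: 4 features, or 1 when only the target is used.

```python
def lstm_param_count(units, n_features):
    # 4 gates, each with input weights, recurrent weights and one bias per unit
    return 4 * (units * (n_features + units) + units)

def dense_param_count(units, n_inputs):
    # One weight per input per unit, plus one bias per unit
    return units * n_inputs + units

all_features_total = lstm_param_count(64, 4) + dense_param_count(1, 64)
target_only_total = lstm_param_count(64, 1) + dense_param_count(1, 64)

print(all_features_total)  # 17729
print(target_only_total)   # 16961
```

These totals can be compared with the output of `model.summary()`. Note that the window length does not appear in the formula, because the LSTM reuses the same weights at every time step.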

## 2.1 Using all the features

For this test case we will use all the features in our time series.

In [None]:
model = model_lstm(x_train.shape)
model.summary()

In [None]:
result = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=50)

Based on the training and validation loss, our solution looks pretty good!

In [None]:
# Function to print the results of the fit process
def print_loss(result):
    plt.plot(result.history['loss'])
    plt.plot(result.history['val_loss'])
    plt.legend(['Train', 'Test'])
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.show()

# Function to print the y_predicted compared with the y_test
def print_test_x_prediction(y_test, y_predict, df_date, train_size, window=0):
    sns.lineplot(x=df_date.iloc[train_size + window:], y=y_test[:,0], label = 'Test')
    sns.lineplot(x=df_date.iloc[train_size + window:], y=y_predict[:,0], label = 'Predict')
    plt.show()

In [None]:
y_predict = model.predict(x_test)

print_loss(result)
print_test_x_prediction(y_test, y_predict, df['date'], len(y_train), window=window)

Well, it looks like our model tracks the test data very closely.
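
The plot gives a visual impression. One way to put a number on it is the root mean squared error (RMSE), converted back to the original temperature units. The sketch below is an illustration only: the mean, standard deviation and predictions are made-up values, not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up scaling constants and scaled values, for illustration only
temp_mean, temp_std = 25.0, 7.0
y_true_scaled = np.array([[0.0], [1.0], [-1.0]])
y_pred_scaled = np.array([[0.1], [0.8], [-1.1]])

# Undo the standardization: x = z * std + mean
y_true_deg = y_true_scaled * temp_std + temp_mean
y_pred_deg = y_pred_scaled * temp_std + temp_mean

rmse_scaled = rmse(y_true_scaled, y_pred_scaled)
rmse_deg = rmse(y_true_deg, y_pred_deg)
print(rmse_scaled, rmse_deg)
```

Because standardization is linear, the RMSE in degrees is just the scaled RMSE multiplied by the standard deviation. With the notebook's fitted `StandardScaler`, the constants would come from `scaler.mean_` and `scaler.scale_` at the position of `meantemp` among the scaled columns.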

## 2.2 Using only the target

Here we will use only the target value in our time series and ignore all the other features.

In [None]:
feature, target = create_window(df_scaled['meantemp'], df_scaled[['meantemp']], window=10)

x_train, x_test, y_train, y_test = train_test(feature, target)

model = model_lstm(x_train.shape)
result = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=50)

In [None]:
y_predict = model.predict(x_test)

print_loss(result)
print_test_x_prediction(y_test, y_predict, df['date'], len(y_train), window=window)

It looks like this solution gave a slightly better result than the last one. The difference is not large, but it is there.

This most likely happens because we have a simple dataset, and the temperatures of the previous days are already enough to make a good prediction. The other features confuse the model more than they help, so less is more in this case!
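
One way to check how much recent temperatures already explain is a persistence baseline, which simply predicts that the next value equals the last value in the window. This is not part of the original notebook. The sketch below runs on a synthetic random walk, and `persistence_forecast` is a hypothetical helper.

```python
import numpy as np

def persistence_forecast(x_windows):
    """Predict the next value as the last value of each window.

    Expects shape (samples, window, 1), as in the single-feature setup.
    """
    return x_windows[:, -1, 0].reshape(-1, 1)

# Synthetic random walk standing in for a scaled temperature series
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))

window = 10
x = np.stack([series[i - window:i] for i in range(window, len(series))])[..., None]
y = series[window:].reshape(-1, 1)

baseline_pred = persistence_forecast(x)
baseline_mse = float(np.mean((y - baseline_pred) ** 2))
print(x.shape, y.shape, baseline_mse)
```

With the notebook's variables, `persistence_forecast(x_test)` could be compared against `y_test` in the same way. A model is only adding value if its test loss is lower than this baseline's.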

In [None]:
# Create a dense neural network
def model_dense(x_shape):
    model = keras.Sequential()
    model.add(keras.layers.Dense(64, input_dim=x_shape[1], activation=nn.relu))
    model.add(keras.layers.Dense(64,activation=nn.relu))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

This function creates a simple dense neural network, so we can compare it against the LSTM model we tested earlier.

## 2.3 Not using time series

Here we don't use time series, just the features and, to help, the date separated into month, year and day.

In [None]:
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['day'] = df['date'].dt.day

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.drop(columns=['date']))
df_scaled = pd.DataFrame(df_scaled, columns=df.drop(columns=['date']).columns)

feature_columns = ['humidity', 'wind_speed', 'meanpressure', 'month', 'year', 'day']

feature, target = df_scaled[feature_columns], df_scaled['meantemp']

feature, target = np.array(feature), np.array(target).reshape(-1,1)

x_train, x_test, y_train, y_test = train_test(feature, target)

model = model_dense(x_train.shape)
model.summary()

In [None]:
result = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=50)

In [None]:
y_predict = model.predict(x_test)

print_loss(result)
print_test_x_prediction(y_test, y_predict, df['date'], len(y_train), window=0)

Well, it is not a bad result, but it does not compare with the previous ones. This shows the power of time series for forecasting.

## 2.4 Using time series, but not with LSTM

Here we are going to use time series, but with our dense model, to see whether the good results come from LSTM or from the dataset itself.
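
A dense layer expects one flat vector per sample, so each window has to be flattened first. The sketch below shows that reshape on a tiny made-up array; the shapes are chosen only for illustration.

```python
import numpy as np

# Two samples, a window of 3 time steps, 2 features per step (made-up values)
windows = np.arange(2 * 3 * 2).reshape(2, 3, 2)

# Collapse the (window, features) axes into a single axis per sample
flat = windows.reshape(-1, windows.shape[1] * windows.shape[2])

print(windows.shape)  # (2, 3, 2)
print(flat.shape)     # (2, 6)
print(flat[0])        # [0 1 2 3 4 5]
```

After flattening, each lagged value becomes its own input column. The time order survives only as column position, which the dense network is free to learn from.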

In [None]:
feature, target = create_window(df_scaled['meantemp'], df_scaled[['meantemp']], window=10)

feature = feature.reshape(-1, feature.shape[1] * feature.shape[2])
x_train, x_test, y_train, y_test = train_test(feature, target)

model = model_dense(x_train.shape)
result = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=50)

In [None]:
y_predict = model.predict(x_test)

print_loss(result)
print_test_x_prediction(y_test, y_predict, df['date'], len(y_train), window=window)

Wow, a great result, and without LSTM. This shows us that time series is more a concept than a specific algorithm like LSTM or GRU.

## 2.5 Time Series with Linear Regression

And to reinforce the previous idea, let's test with a simple Linear Regression and see the result.

In [None]:
from sklearn.linear_model import LinearRegression

model_linear_reg = LinearRegression().fit(x_train, y_train)
y_predict = model_linear_reg.predict(x_test)

print_test_x_prediction(y_test, y_predict, df['date'], len(y_train), window=window)

Well, I think this shows that time series are very powerful even when not used with their best-known algorithm.
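
A linear regression on lagged values is essentially an autoregressive model. The sketch below illustrates this on a synthetic, noise-free series built from two made-up coefficients, and recovers them with ordinary least squares through `np.linalg.lstsq`. It is a toy check under those assumptions, not a result from this dataset.

```python
import numpy as np

# Made-up weights on x[t-2] and x[t-1] used to generate the series
true_coefs = np.array([0.3, 0.5])

series = [1.0, 2.0]
for _ in range(40):
    series.append(true_coefs[0] * series[-2] + true_coefs[1] * series[-1])
series = np.array(series)

# Lag matrix: column 0 holds x[t-2], column 1 holds x[t-1]
lags = 2
X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
y = series[lags:]

# Ordinary least squares without an intercept
fitted_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(fitted_coefs, 6))  # should be close to the made-up weights
```

On real, noisy data the fit will not be exact, but the structure is the same: each window column gets one weight, and the prediction is their weighted sum.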

# 3. Conclusion

Well, this concludes this notebook. Any suggestions or tips are always welcome!