# Svalbard Climate Change Exploration

## Svalbard Climate Change Project Motivation ##
This is my first data science project, my first time using Python, and my first Kaggle project. I am looking to improve all my skills and would appreciate any comments and suggestions.

## Context ##
I read an article in the Guardian, __"Here's What Happens When You Try to Replicate Climate Contrarian Papers" (https://www.theguardian.com/environment/climate-consensus-97-per-cent/2015/aug/25/heres-what-happens-when-you-try-to-replicate-climate-contrarian-papers)__.
I became interested in the problem of predicting climate, and specifically in seeing whether I could practice forecasting climate with the data that was used. I decided to investigate one of the studies mentioned in the article, by Humlum et al., __Identifying natural contributions to late Holocene climate change (https://www.researchgate.net/publication/232402119_Identifying_natural_contributions_to_late_Holocene_climate_change)__, to see if I could use the dataset for my own exploration. One of the interesting things in Humlum's paper was a prediction that the temperature would drop from 2015 to 2017 and then rise again. Since the paper was written, we now have some data on part of the period he forecast.

I am not making a case for or against climate change; I wanted to use this project to learn about data science and predictive modeling. I found the paper interesting. Unlike what its critics suggest, it did not 'deny' climate change but proposed that there are potentially long-term cyclic patterns in climate caused by solar and lunar effects on the Earth. __Critics of their model (http://static-content.springer.com/esm/art%3A10.1007%2Fs00704-015-1597-5/MediaObjects/704_2015_1597_MOESM1_ESM.pdf)__ have suggested that the method was based on a biased selection of data followed by curve fitting. However, I personally found the creation of the model using wavelet analysis interesting, and I leave judgement on the process to others who are better qualified than I am to weigh in.

I am not a climatologist, nor am I using the more sophisticated climate forecasting models. I was surprised that this data set isn't being used by others.

## The Data and Background ##
I extracted a text file from the NASA climate database for the Svalbard and Isfjord weather stations. It has monthly, seasonal, and annual recorded temperatures.

The data represents a combination of two or more weather stations that are close to each other on an island governed by Norway. Their time periods barely overlap, so together they give data for over 100 years. I obtained the data from NASA and combined data from 2 weather stations:

- Isfjord Radio: (78.1 N, 13.6 E), 1912 - 1976
- Svalbard Luft: (78.2 N, 15.5 E), 1977 - 2017 (partial)

My understanding is that these stations have moved several times, though they stayed close to each other. The Isfjord Radio weather station was moved to Svalbard and is relatively close (47 km). The area also has one of the longest human-recorded weather records in the Arctic region and could be representative of any climate changes. It is fairly isolated from the local effects of human-generated air quality changes.

### Missing Data ###
One problem is missing data. Missing readings are represented in the file with a temperature of 999.90. I had to figure out what to do with them, which for me was challenging.
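
A minimal sketch of how the 999.90 placeholders could be flagged and counted per year. The rows and the `JAN` and `FEB` column names are invented for illustration; only `YEAR`, `metANN` and the 999.90 marker come from this notebook.

```python
import io
import pandas as pd

# Tiny made-up sample (values are invented, not the real Svalbard readings)
sample_csv = io.StringIO(
    "YEAR,JAN,FEB,metANN\n"
    "2003,-10.5,-12.0,-5.1\n"
    "2004,999.90,-11.0,999.90\n"
    "2005,-9.0,999.90,-4.3\n"
)
sample_df = pd.read_csv(sample_csv)

# Every column except YEAR holds temperatures; 999.90 marks a missing reading
temp_cols = [c for c in sample_df.columns if c != "YEAR"]
sample_df[temp_cols] = sample_df[temp_cols].mask(sample_df[temp_cols] >= 999)

# Count the missing readings in each year
missing_per_year = sample_df[temp_cols].isna().sum(axis=1)
missing_by_year = {int(y): int(n) for y, n in zip(sample_df["YEAR"], missing_per_year)}
print(missing_by_year)
```

Turning the placeholder into NaN first means pandas skips those cells automatically when it computes means, which avoids accidentally averaging in a temperature of 999.90.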

##  The Plan ##
This is a data exploration. The plan is to start simple with annual temperatures and see if I can model annual temperature change using linear regression. I also plan to look at monthly and seasonal cycles. I am also interested in attempting to use wavelet analysis to replicate some of the results. Another interest of mine is geomagnetic pole reversals... a future project.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Outline of the planned data exploration
1. Load data
2. Extract annualized data
3. Clean it - deal with missing readings
4. Solve for a linear equation
5. Plot and solve for r squared
6. Try to optimize a fitted polynomial curve to the data
7. Try using Support Vector Regression to fit a curve (though this may be overfitting)
8. Do a few predictions

In [None]:
#First read in the data and visually check it.
climate_df = pd.read_csv('../input/svalbard-climate-1912-2017.csv', header=0)
print("climate data frame", climate_df)


Notice that 2004, 2005 and 2017 have a lot of missing data. I think I will toss out 2017 as it is incomplete and fill in the missing data for the rest. I read this article https://www.niwa.co.nz/our-science/climate/information-and-resources/nz-temp-record/temperature-trends-from-raw-data/technical-note-on-the-treatment-of-missing-data and used the technique of removing the missing values and averaging over the remaining ones to fill in the gaps.
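
A rough sketch of those two steps (dropping the incomplete final year, then filling gaps with the mean of the known values) in pandas. The rows are invented; the `YEAR` and `metANN` column names and the 999.90 marker are the ones used in this notebook.

```python
import pandas as pd

# Invented rows; only the column names and the 999.90 marker come from the notebook
sample_df = pd.DataFrame({
    "YEAR": [2014, 2015, 2016, 2017],
    "metANN": [-3.0, 999.90, -2.0, 999.90],
})

# Toss out the incomplete final year
sample_df = sample_df[sample_df["YEAR"] != 2017].copy()

# Hide the 999.90 marker, then fill the gaps with the mean of the known years
annual = sample_df["metANN"].mask(sample_df["metANN"] >= 999)
sample_df["metANN"] = annual.fillna(annual.mean())
print(sample_df)
```

Filling with the overall mean is simple, but it flattens any trend in the filled years, so it is probably only reasonable when few values are missing.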

First I will explore the data by metANN, which is the average annual temperature.

In [None]:
#Extract X and Y arrays.  Each X row is [1, year] (the 1 is for the intercept), Y is average annual temperature in the metANN column
X=[]
Y=[]

for year in climate_df.YEAR:
    X.append([1, year])

#I am not sure if this is the best way to replace the missing data marked with 999.90 values
#calculate mean of the known data
mean_metANN = np.mean(climate_df.metANN[climate_df.metANN<999])
#fill in missing data with mean of known data
for temp in climate_df.metANN:
    if temp < 999:
        Y.append(temp)
    else:
        Y.append(mean_metANN)

In [None]:
#convert to an np array to plot
X= np.asarray(X)
Y= np.asarray(Y)


In [None]:
#create solve function
def solve_w (X_s, Y_s):
    w_solve = np.linalg.solve(np.dot(X_s.T, X_s), np.dot(X_s.T, Y_s))
    Yhat_solve = np.dot(X_s, w_solve)
    return w_solve, Yhat_solve
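
As a sanity check on the idea behind `solve_w`, here is a small sketch on synthetic data where the answer is known in advance. It solves the normal equations `(X^T X) w = X^T y` directly and compares the result with `np.polyfit`. The `_demo` names and the line `y = 2 + 0.5 t` are made up for this example.

```python
import numpy as np

# Synthetic line y = 2 + 0.5 * t with no noise, so the exact answer is known
t = np.arange(10, dtype=float)
X_demo = np.column_stack([np.ones_like(t), t])   # columns: intercept term, t
y_demo = 2.0 + 0.5 * t

# Normal equations: (X^T X) w = X^T y
w_demo = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# np.polyfit returns the highest power first, i.e. [slope, intercept]
slope, intercept = np.polyfit(t, y_demo, 1)
print("normal equations:", w_demo)
print("polyfit:", intercept, slope)
```

Both approaches should recover an intercept of 2 and a slope of 0.5 up to rounding error.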

In [None]:
#create plot function (probably don't really need this)
#define a function to plot data with a label
def plot_it (X, Y, lab, mark="-", col='blue'):
    plt.plot(X, Y, label = lab, linestyle=mark, color=col)

In [None]:
#create r-squared function
# determine how good the model is by computing the r-squared (1 is a perfect fit)
def calc_r2(X, Y, Yhat):
    d1 = Y - Yhat
    d2 = Y - Y.mean()
    r2 = 1 - d1.dot(d1) / d2.dot(d2)
    return r2
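
To get a feel for what r-squared values mean, this sketch evaluates the same formula on three hand-made predictions. The function `r_squared` and the numbers are made up for illustration and mirror the calculation in `calc_r2`.

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_demo, y_demo)                            # predictions equal the data
r2_mean = r_squared(y_demo, np.full_like(y_demo, y_demo.mean()))  # always predict the mean
r2_bad = r_squared(y_demo, y_demo[::-1])                          # predictions in reverse order
print(r2_perfect, r2_mean, r2_bad)
```

A perfect prediction scores 1, always predicting the mean scores 0, and a prediction worse than the mean goes negative, so r-squared is not limited to the range 0 to 1.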

In [None]:
#Ok, solve for Yhat (predicted value) and also calculate least square

w, Yhat = solve_w(X, Y)

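#for fun, also try the lin alg least squares method, I think this should be similar or the same as Yhat
A = np.vstack([X[:,1], np.ones(len(X))]).T
#rcond=None uses the current default cutoff and avoids a FutureWarning in newer numpy versions
m, c = np.linalg.lstsq(A, Y, rcond=None)[0]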


In [None]:
#now plot everything

fig = plt.figure(figsize=(8,8))
plot_it(X[:,1], Y, "Y", "-", 'blue')
plot_it(X[:,1], Yhat, "Yhat", "dashdot", 'red')
plot_it(X[:,1], X[:,1] * m + c, "Least Square", "--", 'green')
#plt.plot(X[:,1], Y, label="Y")
plt.xlabel("Year")
plt.ylabel("Temperature (C)")
plt.legend()
plt.show()

In [None]:
#Calculate r-squared
r2 = calc_r2(X, Y, Yhat)
print ("the r-squared is:", r2)

In [None]:
#doesn't look like a very good fit
#Let's look at moving averages to see how well they can fit

In [None]:
#define a function returning an array of moving averages; each value is the mean of the n readings ending at that point
#n values are trimmed from each end so the result lines up with X[n:-n]

def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n:-n] / n
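
For comparison, here is a sketch of a centered moving average, where each value averages the readings on both sides of a point instead of only the readings before it. The function name `centered_moving_average` and the `half_width` parameter are made up for this example; it is an alternative, not what the function above does.

```python
import numpy as np

def centered_moving_average(a, half_width=2):
    """Average each point with half_width neighbours on either side."""
    window = 2 * half_width + 1
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits
    return np.convolve(a, kernel, mode="valid")

series = np.arange(10, dtype=float)            # 0, 1, ..., 9
smoothed = centered_moving_average(series, half_width=2)
print(smoothed)
```

The result has `2 * half_width` fewer points than the input and lines up with `series[half_width:-half_width]`, which matches how the slices `X[periods:-periods]` are used in this notebook.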



In [None]:
#now try with moving averages
periods = 5
Y_ma = moving_average(Y, periods)

#solve and plot
#print("shape X", X.shape, "X[periods:-periods]", X[periods:-periods].shape, "Y_ma shape", Y_ma.shape)
w, Yhat_ma = solve_w(X[periods:-periods], Y_ma)


#plot the raw data, the moving average, and the line fitted to the moving average
fig = plt.figure(figsize=(8,8))
plot_it(X[:,1], Y, "Y")
plot_it(X[periods:-periods,1], Y_ma, "Y moving average", "solid", "green")
plot_it(X[periods:-periods,1], Yhat_ma, "Yhat moving average prediction", "dashdot", "red")
plt.xlabel("Year")
plt.ylabel("Temperature (C)")
plt.legend()
plt.show()
r2 = calc_r2(X[periods:-periods,1], Y[periods:-periods], Y_ma)
print ("the r-squared of moving average is:", r2)
r2 = calc_r2(X[periods:-periods,1], Y[periods:-periods], Yhat_ma)
print ("the r-squared of prediction from moving average is:", r2)

In [None]:
#Now try fitting a polynomial
import warnings
warnings.filterwarnings("ignore")

#now with different equations
plt.figure(figsize=(8,8))
plot_it(X[:,1], Y, "Y", "solid", "blue")

r_array = []
dim_array=[]
for dim in range(1, 50, 1):
    z = np.polyfit(X[:,1], Y, dim)
    p = np.poly1d(z)
    #plot Y and predicted Y
    plot_it(X[:,1], p(X[:, 1]), dim, "solid", "green")
    r = calc_r2(X, Y, p(X[:, 1]))
    dim_array.append(dim)
    r_array.append(r)
plt.xlabel("Year")
plt.ylabel("Temperature (C)")
#plt.legend()
plt.show()

In [None]:
#plot r squared of polynomials and find the degree with the highest r squared
plt.xlabel("polynomial degree")
plt.ylabel("r squared")
plot_it(dim_array, r_array, "r2")
plt.legend()
plt.show()
r_max = max(r_array)
#r_array.index() gives a 0-based position, so look up the matching degree in dim_array
max_dim = dim_array[r_array.index(r_max)]
print("maximum is dim: ", max_dim, "with r squared: ", r_max)
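
Because r-squared measured on the same points used for fitting tends to keep rising as the degree grows, one way to check for overfitting is to hold some years out. The sketch below is only an illustration on synthetic data (a made-up trend plus random noise, not the Svalbard record). It uses `numpy.polynomial.Polynomial.fit`, which rescales the years internally and so behaves better numerically than fitting raw year values.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
years = np.arange(1912, 2017, dtype=float)
# Synthetic temperatures: slow trend plus noise (invented for this sketch)
temps = -6.0 + 0.02 * (years - 1912) + rng.normal(0.0, 1.0, years.size)

# Hold out every other year so the score is measured on unseen points
train = np.arange(years.size) % 2 == 0
test = ~train

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

degrees = list(range(1, 9))
train_r2, test_r2 = [], []
for deg in degrees:
    poly = Polynomial.fit(years[train], temps[train], deg)
    train_r2.append(r_squared(temps[train], poly(years[train])))
    test_r2.append(r_squared(temps[test], poly(years[test])))

best_degree = degrees[int(np.argmax(test_r2))]
print("degree with best held-out r squared:", best_degree)
```

If the held-out r-squared stops improving or drops while the training r-squared keeps climbing, the extra degrees are probably fitting noise.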

In [None]:
#It looks like a polynomial of about degree 36 is our best fit using linear algebra
z = np.polyfit(X[:,1], Y, max_dim)
p = np.poly1d(z)

#plot Y and predicted Y with polynomial
plot_it(X[:,1], Y, "original annual data", 'solid', 'blue')
plot_it(X[:,1], p(X[:, 1]), "poly max dim", "dashed", "green")
plt.legend()
plt.show()
print("r squared for dimension max dim is: ", calc_r2(X, Y, p(X[:, 1])))
print("The equation coefficients are: ", p)

In [None]:
#test with 2017, just for fun
p(2017)

In [None]:
#Now try with support vector regression
from sklearn.svm import SVR
#scikit-learn expects a 2D array of shape (n_samples, n_features); np.matrix is deprecated
x = X[:,1].reshape(-1, 1)
y = Y


In [None]:
#SVR kernels
# #############################################################################
# Fit regression model
svr_rbf = SVR(kernel='rbf', C=100.0, gamma=0.1)
svr_lin = SVR(kernel='linear', C=100.0)
# this is very slow when I try several degrees svr_poly = SVR(kernel='poly', C=10, degree=1)
#I would like to understand why polyfit seems so much better?? So I use that instead.
y_lin = svr_lin.fit(x, y).predict(x)
y_rbf = svr_rbf.fit(x,y).predict(x)
#y_poly = svr_poly.fit(x, y).predict(x)
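
One possible reason the polynomial kernel is so slow here is that the year values (around 2000) are not scaled, and support vector machines are sensitive to the scale of their inputs. The sketch below, on synthetic data, standardizes the years before fitting. This is a guess at the cause rather than a confirmed diagnosis, and the `C` and `degree` values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
years = np.arange(1912, 2017, dtype=float).reshape(-1, 1)
# Synthetic temperatures: slow trend plus noise (invented for this sketch)
temps = -6.0 + 0.02 * (years.ravel() - 1912) + rng.normal(0.0, 1.0, years.shape[0])

# Standardize the years to zero mean and unit variance before the SVR sees them
scaled_poly_svr = make_pipeline(StandardScaler(), SVR(kernel="poly", C=10, degree=3))
scaled_poly_svr.fit(years, temps)
pred = scaled_poly_svr.predict(years)
print(pred[:5])
```

Putting the scaler inside the pipeline means the same scaling is applied automatically when predicting for new years.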


In [None]:
#plot the different models

# #############################################################################
# Look at the results
lw = 2
plt.figure(figsize=(10,5))
plt.scatter(X[:,1], y, color='darkorange', label='data')
plt.plot(X[:,1], y_rbf, color='navy', lw=lw, label='RBF model')
plt.plot(X[:,1], y_lin, color='c', lw=lw, label='Linear model')
#plt.plot(X[:,1], y_poly, color='cornflowerblue', lw=lw, label='Polynomial model')
plt.plot(X[:,1], p(X[:, 1]), color='cornflowerblue', lw=lw, label='Polynomial model')
plt.xlabel('Year')
plt.ylabel('Temperature (C)')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
print("r squared:  y_rbf",calc_r2(X[:,1], Y, y_rbf ))
print("r squared:  y_lin",calc_r2(X[:,1], Y, y_lin ))
#print("r squared:  y_poly",calc_r2(X[:,1], Y, y_poly ))
print("r squared:  Polyfit",calc_r2(X[:,1], Y, p(X[:,1])))

## Conclusion and Future Work
It looks like Support Vector Regression gives the best fit for the current data, though I am not sure whether it is overfitting the data. I am interested in developing my skills for future work.

In [None]:
#Now some fun.  Make some predictions.  (I know this is not really a valid method)
#scikit-learn expects a 2D array of shape (n_samples, n_features)
x_predict = np.arange(2016, 2025).reshape(-1, 1)
y_predict = svr_rbf.fit(x,y).predict(x_predict)


In [None]:
#now put it together and plot - past and future
fig = plt.figure(figsize=(10,8))
plt.plot(x,y)
plt.plot(x_predict, y_predict, color="blue", linestyle='dashed')
plt.show()