# Week 21 - Formative Exercise

This week you're given a scenario below, you must select appropriate techniques to consider, train at least two models based on the scenario and then finally evaluate your models using reasonable metrics. 

## Scenario
You are working as a data scientist at the Met Office.

Wales is particularly succeptible to climate change which is increasing the frequency of both flooding and drought events. Luckily temperature and rainfall records are well kept for four stations in Wales.

To help identify the future state of the climate in Wales you have been asked to conduct one of the two tasks below, choose one which interests you the most.

### Task 1: Rainfall Prediction
The first option is to produce a model capable of predicting the rainfall in June of 2100.

### Task 2: Temperature Prediction
The second option is to produce a model capable of predicting the temperature in September of 2150.

## Dataset

You have been provided a weather dataset for four weather stations across Wales, these are:
+ 0: Valley
+ 1: Cardiff
+ 2: Ross-on-Wye
+ 3: Aberporth

<div>
<img src="wales.png" width="250"/>
</div>

The dataset is provided in a Numpy array format which can be loaded as follows:

`dataset = np.loadtxt("weather_data.csv", delimiter=",")`

Once you have loaded the dataset it will look like a numpy array consisting of 5 columns which are as follows:
1. Weather Station (a numerical indicator as defined above)
2. Year of the reading
3. Month of the reading
4. Temperature (Degrees Celsius)
5. Rainfall (mm)

You do not have to use all the data; however, you should consider carefully which data you need to successfully train and test a model to answer the questions.

## Recommended Steps

To help you answer the questions here are a list of steps I recommend you take to tackle each question individually:
1. Load the data from the `weather_data.csv` file.
2. Extract only the inputs and outputs you need.
3. Split the data into training and testing.
4. Select a suitable modelling approach.
5. Implement and fit the model.
6. Evaluate the model's performance.
7. (Optional) Try a second modelling approach to compare performance!

## Tips and Tricks

Now you've got a handle on the metrics necessary for evaluating tasks I recommend you make use of the sklearn metrics library so you don't need to implement them yourself! Take a look through the [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) documentation, you'

Task 1: Rainfall Prediction

In [133]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import root_mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

In [134]:
dataset = np.loadtxt('weather_data.csv', delimiter=',')
# Weather station, Year, Month, Temperature, Rainfall
print(dataset)


[[0.000e+00 1.930e+03 1.200e+01 8.600e+00 1.303e+02]
 [0.000e+00 1.931e+03 1.000e+00 8.000e+00 6.620e+01]
 [0.000e+00 1.931e+03 2.000e+00 7.600e+00 6.060e+01]
 ...
 [3.000e+00 2.022e+03 1.000e+01 1.560e+01 1.074e+02]
 [3.000e+00 2.022e+03 1.100e+01 1.190e+01 1.182e+02]
 [3.000e+00 2.022e+03 1.200e+01 7.600e+00 8.060e+01]]


In [135]:
#From the dataset, get rid of the row that doesn't have valye 6 in the 3rd collumn
june_dataset = dataset[dataset[:, 2] == 5]

print(dataset.shape)
# Split the dataset into train and test
train_x, test_x, train_y, test_y = train_test_split(june_dataset[:,:4], june_dataset[:,-1], test_size=0.3)

# Create linear regression model
linear_model = LinearRegression()

# Train the model
linear_model.fit(train_x, train_y)

# Test the model
y_pred = linear_model.predict(test_x)

# Evaluate the model
print(mean_squared_error(test_y, y_pred))
print(mean_absolute_error(test_y, y_pred))
print(root_mean_squared_error(test_y, y_pred))

(3719, 5)
860.019539116975
24.790568455301287
29.326089734517538


In [136]:
# Split the dataset into train and test
train_x, test_x, train_y, test_y = train_test_split(june_dataset[:,:4], june_dataset[:,-1], test_size=0.3)

# Create linear regression model
linear_model = KNeighborsRegressor(n_neighbors = 5)

# Train the model
linear_model.fit(train_x, train_y)

# Test the model
y_pred = linear_model.predict(test_x)

# Evaluate the model
print(mean_squared_error(test_y, y_pred))
print(mean_absolute_error(test_y, y_pred))
print(root_mean_squared_error(test_y, y_pred))

968.5627182795698
24.13677419354839
31.121740283595482


In [137]:
september_dataset = dataset[dataset[:, 2] == 8]

# Assuming dataset is your numpy array
train = np.concatenate((september_dataset[:, 0:3], september_dataset[:, 4].reshape(-1, 1)), axis=1)  # Features: Weather station, Year, Month, Rainfall
test = dataset[:, 3]  # Target: Temperature

# Split the dataset into train and test
train_x, test_x, train_y, test_y = train_test_split(train, test, test_size=0.3)

# Create random forest model
linear_model = LinearRegression()

# Train the model
linear_model.fit(train_x, train_y)

# Test the model
y_pred = linear_model.predict(test_x)

# Evaluate the model
print(mean_squared_error(test_y, y_pred))
print(mean_absolute_error(test_y, y_pred))
print(root_mean_squared_error(test_y, y_pred))

ValueError: Found input variables with inconsistent numbers of samples: [310, 3719]