# Aim of The Project
The aim of the project is to predict weather phenomena with a multiple linear regression model. We are particularly interested in predicting daily average temperatures. The predictions are based on values such as rainfall, the amount of UV radiation, and the maximum and minimum temperatures. We found a suitable dataset containing weather data from an airport in Perth, Australia, covering the years 1944 to 2016. With this dataset we set out to implement our model as well as we could.


# Evaluation of Tutorial Used For Learning
Before starting work on our own model, we decided to go through some tutorials and learn about linear regression. We found several good tutorials on the subject, but after a brief discussion we settled on one tutorial that we would work through as a group. [The tutorial we used](https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f) taught us how to build linear regression models with the scikit-learn Python library. The tutorial was divided into two parts: simple linear regression and multiple linear regression. Since our model was going to rely on multiple variables, the second part was more useful for us. The first part was also very helpful, with examples of how the matplotlib and pandas libraries work.

In general the tutorial was of high quality, and we can recommend it to anyone who is interested in building their own linear regression models or in learning about machine learning and the Python libraries used for it.


In [None]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as seabornInstance 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

# Describing the dataset

The original dataset is a CSV file, wa_weather_1944_till_2016.csv. We use the pandas library to read
the contents of the file with the read_csv() function.

In [None]:
dataset = pd.read_csv('wa_weather_1944_till_2016.csv')

### Shape

We use the shape attribute of the dataset to get the number of rows and columns. The dataset contains 26543 rows and 9 columns: year, month, day, rainfall amount in mm, the minimum and maximum temperatures in degrees Celsius, the daily average temperature, the daily temperature range and the amount of ultraviolet radiation.

In [None]:
dataset.shape

### Describe

The describe function gives an overview of the dataset with summary statistics for each attribute, e.g. the count, mean, min and max values. The attributes are:
* Year - Year of the measurement,
* Month - Month of the measurement,
* Day - Day of the measurement,
* Rainfall - Amount of rain,
* Min. temperature - Minimum temperature that day,
* Max. temperature - Maximum temperature that day,
* Daily avg. - Average temperature that day,
* Daily range - Daily range of temperature that day,
* UV/MJ m*m - amount of UV radiation that day

In [None]:
dataset.describe()

# Preparing the dataset

We check whether the dataset contains any null values with isnull().any(), which returns one boolean value per column, and then remove the rows that contain null values with dropna().

In [None]:
dataset.isnull().any()

In [None]:
dataset = dataset.dropna()
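
The cleaning step can also be tried on a small example. This is a minimal sketch on a hypothetical toy DataFrame rather than the real file. The column names `rainfall` and `daily_avg` are assumed for illustration. It shows what `isnull().any()` returns and how many rows `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset is read from the CSV file.
toy = pd.DataFrame({
    "rainfall": [0.0, 1.2, np.nan, 0.4],
    "daily_avg": [18.5, np.nan, 21.0, 19.3],
})

# One boolean per column: True if the column holds at least one null value.
has_nulls = toy.isnull().any()

# Number of null values per column, useful for judging how much data is lost.
null_counts = toy.isnull().sum()

# Drop every row that contains at least one null value.
cleaned = toy.dropna()

print(has_nulls)
print(null_counts)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Comparing the row counts before and after is a quick way to confirm that only a small share of the data was dropped.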

# Evaluation of results

The algorithm predicted the test data fairly accurately. The table below compares the actual values of the test data (left column) with the predicted values (right column). As can be seen, the predicted values fall within the range of the actual test values.

### Algorithm performance

To evaluate the results more rigorously we use an evaluation metric to determine the accuracy of the algorithm. The Python scikit-learn library is used to calculate the root mean squared error (RMSE), i.e. the square root of the mean of the squared errors. The calculation gives an RMSE of 1.52. Dividing it by the mean of daily_avg (18.22) expresses the error as a percentage of the mean.

(1.52 / 18.22) * 100 ≈ 8.3%

An error of roughly 8% of the mean is good enough for a fairly accurate prediction of the daily average temperature.
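
The calculation can be sketched with plain NumPy. The helper names `rmse` and `rmse_as_percent_of_mean` and the toy values are hypothetical and only illustrate the formula. With scikit-learn the same figure is typically obtained as `np.sqrt(metrics.mean_squared_error(y_test, y_pred))`, though the exact call in our notebook may differ.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def rmse_as_percent_of_mean(rmse_value, mean_value):
    """Express an RMSE as a percentage of the mean of the target variable."""
    return rmse_value / mean_value * 100

# Hypothetical values, only to show the RMSE calculation itself.
toy_actual = [20.0, 22.0, 18.0]
toy_predicted = [21.0, 20.0, 19.0]
toy_rmse = rmse(toy_actual, toy_predicted)

# The two figures from the evaluation: RMSE 1.52 and daily_avg mean 18.22.
reported_percent = rmse_as_percent_of_mean(1.52, 18.22)

print(f"toy RMSE: {toy_rmse:.3f}")
print(f"RMSE relative to the mean: {reported_percent:.1f}%")
```

Relating the RMSE to the mean makes the error easier to interpret, since an RMSE on its own is expressed in the units of the target (here degrees Celsius).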

# Analysis process
After we had decided which dataset to use and what to predict with it, we started by setting up a git repository and a common development environment. Our development environment is Jupyter Notebook running inside a conda virtual environment. This made it easy for us to install and manage the libraries we wanted to use.

The actual analysis started with looking at our dataset and figuring out what to use it for. Since the data is about weather, we wanted to try to predict the daily average temperature based on other attributes such as the amount of UV radiation and rainfall, taking the time of year into consideration.
With our goal set, we started preparing the data for our multiple linear regression model. Preparing the data meant making sure that the dataset did not contain any bad values that could affect our end result. Our method of cleaning the data was to drop the rows with bad values from the dataset. Luckily there were not many such values.

With our cleaned-up data we were ready to make the training and testing datasets. Before making them we had to decide which variables to use in our model. For our purposes these variables were rainfall, UV radiation and max temperature, which we used to predict the daily average temperature. The data was split into 80% training data and 20% testing data. The training data was used to fit our multiple linear regression model, and the testing data was used to test it.
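
The 80/20 split described above can be sketched as follows. The data here is synthetic, and the column names `rainfall`, `uv` and `max_temp` as well as the `random_state` value are assumptions made for illustration, not the exact names or settings of our notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data; column names are assumptions.
rng = np.random.default_rng(0)
n_rows = 100
toy = pd.DataFrame({
    "rainfall": rng.uniform(0, 20, n_rows),
    "uv": rng.uniform(5, 35, n_rows),
    "max_temp": rng.uniform(15, 40, n_rows),
})
toy["daily_avg"] = (
    0.8 * toy["max_temp"]
    - 0.1 * toy["rainfall"]
    + 0.05 * toy["uv"]
    + rng.normal(0, 1, n_rows)
)

# Predictor variables (X) and the variable we want to predict (y).
X = toy[["rainfall", "uv", "max_temp"]]
y = toy["daily_avg"]

# test_size=0.2 keeps 20% of the rows for testing; random_state makes
# the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Keeping the test rows out of the training step is what lets the later comparison say something about how the model handles data it has not seen.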

Once we had our datasets ready, the training data was fitted to a regression model. The model training method takes the variables that we know as one dataset and the variable we want to predict as a separate dataset. The training function, in a nutshell, finds the best-fitting line through the mess of data points, and from that line we can make our predictions. With this method we obtained coefficients that tell us how each variable affects the value we are trying to predict, in this case the daily average temperature. With our very own model we then predicted values for the testing dataset and compared the predictions to the actual values. By comparing them against each other we were able to determine how good our predictions were.
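
The fitting and comparison steps can be sketched with scikit-learn's `LinearRegression`. The data is again synthetic, and the column names, the generating weights and the `Actual`/`Predicted` labels are assumptions for illustration. The sketch shows the general pattern rather than our exact notebook code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data; column names are assumptions.
rng = np.random.default_rng(0)
n_rows = 200
features = pd.DataFrame({
    "rainfall": rng.uniform(0, 20, n_rows),
    "uv": rng.uniform(5, 35, n_rows),
    "max_temp": rng.uniform(15, 40, n_rows),
})
target = (
    0.8 * features["max_temp"]
    - 0.1 * features["rainfall"]
    + 0.05 * features["uv"]
    + rng.normal(0, 1, n_rows)
)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Fit the model: known variables as one dataset, target as another.
model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per predictor: the change in the predicted target for a
# one-unit change in that predictor, holding the others fixed.
coefficients = pd.Series(model.coef_, index=features.columns)

# Predict on the held-out rows and put actual and predicted side by side.
y_pred = model.predict(X_test)
comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": y_pred})

print(coefficients)
print(comparison.head())
```

On this synthetic data the fitted coefficients should land close to the weights used to generate the target, which is a useful sanity check before trusting the coefficients on real data.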