# Univariate Linear Regression

This is part of a series on AI by ISTE Manipal. Read more posts on AI at:
https://instagram.com/istemanipal?igshid=eb1cyeqm3pvr

# Import all necessary libraries
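For this tutorial we use numpy for array manipulation and operations, matplotlib to visualize our data, pandas to load the dataset, and scikit-learn's train_test_split to split it into a training set and a testing set.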


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split#just to split dataset into training set and validation set
%matplotlib inline

# Import all the datasets
We use the read_csv() function from the pandas library to load the dataset (a comma-separated values file) into a variable.

In [None]:
path="../input/weatherww2/Summary of Weather.csv"#filepath for the dataset
data=pd.read_csv(path)#read dataset(csv file) from our filepath into a dataframe
data2=pd.read_csv("../input/weatherww2/Weather Station Locations.csv")

**Take a look at our data**
This is helpful as we need to figure out how our training set looks

In [None]:
data.head()

Let's analyse our dataset further before working on the model.
We will use the shape property to get the dimensions of the dataset. Then we will look at summary statistics of the columns, such as the mean, minimum and maximum values. These are important to understand because they directly affect the choice of parameters we make later in the model.

In [None]:
print(data.shape)

In [None]:
data.describe()#use a built-in method to get summary statistics for each numeric column

# Visualize the data
This is perhaps the most important step in exploratory data analysis, i.e. visualisation of the data.

In [None]:
data.plot(x="MinTemp",y="MaxTemp",style='o')
plt.xlabel('MinTemp')
plt.ylabel('MaxTemp')
plt.show()

iloc is used to extract values from a dataframe by integer position.
For more usage, read the official documentation at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

to_numpy() converts a pandas dataframe (or a single column) to a numpy array.
reshape with arguments -1 and 1 simply tells numpy to convert a 1D array to a 2D array with one column. Don't worry, it's not changing any values; it just turns an (n, ) array into an (n,1) array so that we can work with it in the matrix operations later.
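To make the effect of reshape(-1,1) concrete, here is a minimal sketch on a toy array. The values and the names `a` and `col` are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Toy 1D array standing in for one temperature column (made-up values)
a = np.array([10.0, 12.5, 15.0])
print(a.shape)    # (3,)

# -1 lets numpy infer the number of rows; 1 fixes a single column
col = a.reshape(-1, 1)
print(col.shape)  # (3, 1)
```

The values are untouched; only the shape changes.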

In [None]:
df=data[["MinTemp","MaxTemp"]]#select MinTemp and MaxTemp by name, so x is MinTemp and y is MaxTemp regardless of column order
x=df.iloc[:,0].to_numpy().reshape(-1,1)
y=df.iloc[:,1].to_numpy().reshape(-1,1)

In [None]:
#Just a sanity check here
print(x.shape)
print(y.shape)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

To understand how train_test_split works, see: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
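As a rough intuition for what such a split does, here is a numpy-only sketch. It is not scikit-learn's implementation; `simple_split` and the toy arrays are hypothetical names used only for this illustration.

```python
import numpy as np

def simple_split(x, y, test_size=0.2, seed=0):
    """Shuffle row indices, then slice them into test and train parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))              # random ordering of the rows
    n_test = int(np.ceil(test_size * len(x)))  # number of held-out rows
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 samples, with y tied to x so we can check the pairs stay aligned
x_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * x_demo + 1

x_tr, x_te, y_tr, y_te = simple_split(x_demo, y_demo, test_size=0.2, seed=0)
print(x_tr.shape, x_te.shape)  # (8, 1) (2, 1)
```

The key point is that x and y are indexed with the same shuffled indices, so every input stays paired with its own target.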

In [None]:
#For a sanity check, print the shapes of the datasets
print("Size of training set:\n",x_train.shape)
print(y_train.shape)
print("Size of testing set:\n",x_test.shape)
print(y_test.shape)

# Initialise parameters for the model
The initial weights for the model are set to 0. However, we can also use random numbers to initialise the parameters. (Try it yourself! Does changing the initial value of the parameters have any effect on the final output? If yes, how would you describe the change? If no, what causes the invariance?)

In [None]:
params=np.zeros((2,1))

The params vector consists of both the weight and the bias: the first row is the bias and the second row is the weight.
Note that it is not necessary to put the bias in the parameter vector. We have done so for convenience. In fact, when we graduate to neural networks, we will handle the bias separately.
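The sketch below shows why folding the bias into the parameter vector works: if each input row is prefixed with a 1, a single matrix product gives bias + weight * x. The numbers and the names `x_demo`, `X_demo` and `params_demo` are made up for illustration.

```python
import numpy as np

# Two toy inputs (made-up values)
x_demo = np.array([[2.0], [4.0]])

# Prefix a column of ones so the first parameter acts as the bias
X_demo = np.hstack((np.ones_like(x_demo), x_demo))  # [[1, 2], [1, 4]]

# Hypothetical parameters: bias = 10 (first row), weight = 0.5 (second row)
params_demo = np.array([[10.0], [0.5]])

# Each prediction is 1*bias + x*weight
pred = X_demo @ params_demo
print(pred)  # 10 + 0.5*2 = 11 and 10 + 0.5*4 = 12
```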

In [None]:
iterations=60000
learning_rate=0.001

Try adjusting the learning rate and see if the line fits the data better!
Can we adjust the learning rate depending on how we are doing? For example, we could speed it up when the loss is large and slow it down as we approach the minimum loss.
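One simple way to explore this question is a decaying learning rate that starts large and shrinks over time. The sketch below uses time-based decay on synthetic data; this is a swapped-in variant for experimentation, not the method used in this notebook, and `gradient_descent_decay`, `lr0` and `decay` are hypothetical names.

```python
import numpy as np

def gradient_descent_decay(x, y, w, lr0, iterations, decay=0.001):
    """Gradient descent where the step size shrinks as lr0 / (1 + decay * t)."""
    for t in range(iterations):
        lr = lr0 / (1.0 + decay * t)       # large steps early, small steps late
        grad = x.T @ (x @ w - y) / len(y)  # gradient of the squared-error cost
        w = w - lr * grad
    return w

# Synthetic data generated from y = 3 + 2x (assumption, for illustration only)
x_demo = np.linspace(0, 1, 20).reshape(-1, 1)
X_demo = np.hstack((np.ones_like(x_demo), x_demo))
y_demo = 3.0 + 2.0 * x_demo

w_demo = gradient_descent_decay(X_demo, y_demo, np.zeros((2, 1)), lr0=0.5, iterations=5000)
print(w_demo.ravel())  # should be close to the true bias 3 and weight 2
```

A schedule like this only depends on the iteration count; reacting to the loss itself, as the question above suggests, would need the cost to be tracked inside the loop.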

In [None]:
x_train=np.hstack((np.ones_like(x_train),x_train))#simply add a column of 1s in the training set 

**Define a function computeCost which will calculate the cost for a given dataset and weights**

In [None]:
def computeCost(x,y,w):
    temp=np.dot(x,w)-y
    return np.sum(np.power(temp,2))/(2*len(y))
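The function above computes half the mean squared error, J(w) = (1/(2m)) * sum((x·w − y)²), where m is the number of samples. As a small worked check, the sketch below repeats the same definition on two made-up samples where the result can be verified by hand.

```python
import numpy as np

def computeCost(x, y, w):
    # Same definition as in the notebook: half the mean squared residual
    temp = np.dot(x, w) - y
    return np.sum(np.power(temp, 2)) / (2 * len(y))

# Two toy samples, already prefixed with a column of ones (made-up values)
X_demo = np.array([[1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([[1.0], [2.0]])
w_zero = np.zeros((2, 1))

# With all-zero parameters every prediction is 0, so the residuals are -1 and -2:
# (1 + 4) / (2 * 2) = 1.25
cost_demo = computeCost(X_demo, y_demo, w_zero)
print(cost_demo)  # 1.25
```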

Computing the initial cost when the model is untrained

In [None]:
J=computeCost(x_train,y_train,params)
print(J)

# Gradient Descent
Finally, apply gradient descent to adjust the parameters. For more information about gradient descent, see this post: https://www.instagram.com/p/CC0Puu3BCuZ/?utm_source=ig_web_copy_link

In [None]:
def gradientDescent(x,y,w,learning_rate,iterations):
    for i in range(iterations):
        temp=np.dot(x,w)-y
        temp=np.dot(x.T,temp)
        w=w-(learning_rate/len(y))*temp
    return w
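For reference, the update inside the loop can be written out as follows. Here $X$ is the training matrix with its column of ones, $m$ is `len(y)` and $\alpha$ is `learning_rate`; the symbols are our notation for the code above.

$$
J(w) = \frac{1}{2m}\,\lVert Xw - y \rVert^2,
\qquad
\nabla_w J(w) = \frac{1}{m}\, X^\top (Xw - y),
\qquad
w \leftarrow w - \alpha\, \nabla_w J(w)
$$

The first `temp` in the code is the residual $Xw - y$, the second is $X^\top (Xw - y)$, and the final line applies the step scaled by $\alpha / m$.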

In [None]:
params=gradientDescent(x_train,y_train,params,learning_rate,iterations)

In [None]:
print(params)

Compute the final cost once we have trained the model.

Try changing the initial parameters, the learning rate and the number of iterations, and see how they affect the initial and final cost values.

In [None]:
print(computeCost(x_train,y_train,params))
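One way to judge whether gradient descent has converged is to compare its parameters with the exact least-squares solution. The sketch below does this on synthetic data using numpy's `np.linalg.lstsq`; this comparison is our addition, not part of the original tutorial, and the data and names are made up.

```python
import numpy as np

# Synthetic data generated from y = 4 + 1.5x (assumption, for illustration only)
x_demo = np.linspace(0, 10, 50).reshape(-1, 1)
X_demo = np.hstack((np.ones_like(x_demo), x_demo))  # column of ones for the bias
y_demo = 4.0 + 1.5 * x_demo

# Exact least-squares parameters; the first row is the bias, the second the weight
w_exact, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(w_exact.ravel())  # close to [4.0, 1.5]
```

In the notebook, the same call on x_train (after the column of ones is added) and y_train would give the values that params should approach as training continues.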

# Visualize the curve 
Plot a straight line according to the parameters calculated by the model.

In [None]:
plt.scatter(x_train[:,1],y_train)
plt.plot(x_train[:,1],np.dot(x_train,params),color='red')#plot predictions against the feature column only, not the column of 1s
plt.xlabel("Minimum Temperature")
plt.ylabel("Maximum Temperature")
plt.show()

As the next step, we have to validate the parameters using the testing dataset (x_test, y_test). Stay tuned for the tutorial on hyperparameter tuning and validation.
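Until then, here is a minimal sketch of what a first validation step might look like: compute the same cost on held-out data. The important detail is that the test inputs need the same column of ones that was added to the training inputs. All values and names below are made up for illustration.

```python
import numpy as np

def compute_cost(x, y, w):
    # Half the mean squared residual, as in computeCost above
    residual = x @ w - y
    return np.sum(residual ** 2) / (2 * len(y))

# Toy held-out data lying exactly on y = 1 + 2x (made-up values)
x_test_demo = np.array([[0.0], [1.0], [2.0]])
y_test_demo = np.array([[1.0], [3.0], [5.0]])

# Hypothetical trained parameters: bias = 1, weight = 2
params_demo = np.array([[1.0], [2.0]])

# Add the column of ones before predicting, exactly as for the training set
x_test_b = np.hstack((np.ones_like(x_test_demo), x_test_demo))
test_cost = compute_cost(x_test_b, y_test_demo, params_demo)
print(test_cost)  # 0.0, because the toy data fit the line exactly
```

On real data the test cost will not be zero; comparing it with the training cost is a common first check for overfitting.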