# <u>Simple Linear Regression</u>
## Part 1 - Principles
#### 1. Definition

Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous variables. It concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective "simple" refers to the fact that the outcome variable is related to a single predictor.

The formula for a simple linear regression is
$y=β_1+β_2x$
- **y** is the predicted value of the dependent variable for any given value of the independent variable (**x**)
- **β<sub>1</sub>** is the intercept, the predicted value of **y** when **x** is 0
- **β<sub>2</sub>** is the regression coefficient (the slope) – how much we expect **y** to change when **x** increases by one unit
- **x** is the independent variable

#### 2. Least Squares
The method of least squares is used to determine the line of best fit for a set of data, that is, to find the optimal values of **β<sub>1</sub>** and **β<sub>2</sub>**. It is a standard approach in regression analysis for approximating the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squared residuals, where a residual is the difference between an observed value and the value predicted by the model.
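
Written out for the one-predictor model, the quantity being minimized and its well-known closed-form minimizer look as follows. The symbols $n$, $x_i$, $y_i$, $\bar{x}$ and $\bar{y}$ (number of samples, individual observations, and sample means) are notation introduced here for illustration and do not appear elsewhere in this tutorial.

$$
S(\beta_1, \beta_2) = \sum_{i=1}^{n} \left( y_i - \beta_1 - \beta_2 x_i \right)^2
$$

Setting the partial derivatives of $S$ with respect to $\beta_1$ and $\beta_2$ to zero gives

$$
\hat{\beta}_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \, \bar{x}
$$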


[Here](https://en.wikipedia.org/wiki/Linear_least_squares#Example) you can learn more about the principle of the least squares method.
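
To make the idea concrete, here is a minimal sketch of least squares for a single predictor in plain Python. It uses the standard closed-form result (slope = covariance of x and y divided by variance of x). The toy data and the function name `fit_simple_linear_regression` are made up for illustration; this is not how the notebook fits its model, which relies on scikit-learn.

```python
# Toy data (hypothetical): points lying exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_simple_linear_regression(xs, ys):
    """Return (beta_1, beta_2) that minimize the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_2 = s_xy / s_xx                # slope
    beta_1 = y_mean - beta_2 * x_mean   # intercept
    return beta_1, beta_2


beta_1, beta_2 = fit_simple_linear_regression(xs, ys)
print('intercept (beta_1):', beta_1)
print('slope (beta_2):', beta_2)
```

Because the toy points lie exactly on a line, the fit recovers that line's intercept and slope; with noisy data the residuals would be non-zero and the line would be the best compromise.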

## Part 2 - Code Implementation
> NOTE: The complete code can be found in *code.ipynb*
#### 1. Import the Relevant Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#### 2. Load the Dataset

In [None]:
data = pd.read_csv('datasets/Salary_Data.csv')

#### 3. Explore the Dataset
The dataset describes the relationship between salary and work experience.
##### 3.1 Display the first 5 rows of the data

In [None]:
data.head()

##### 3.2 Show summary information about the columns in the data

In [None]:
data.info()

#### 4. Split the Dataset into Train and Test Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['YearsExperience']], data['Salary'], test_size=0.20, random_state=0)
print('data.shape is ', data.shape)
print('X_train.shape is ', X_train.shape)
print('X_test.shape is ', X_test.shape)

#### 5. Model Initialization and Training

In [None]:
SLR = linear_model.LinearRegression()
SLR.fit(X_train.values, y_train)
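
After fitting, the estimated parameters can be read back from the model, which connects the code to the formula in Part 1: `intercept_` plays the role of β<sub>1</sub> and `coef_[0]` the role of β<sub>2</sub>. The sketch below uses made-up toy data rather than *Salary_Data.csv* so that it runs on its own; in the notebook the same attributes would be read from `SLR`.

```python
import numpy as np
from sklearn import linear_model

# Toy data (hypothetical, not from Salary_Data.csv): points on y = 1 + 2x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (n_samples, 1)
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = linear_model.LinearRegression()
model.fit(X, y)

intercept = model.intercept_  # estimated beta_1
slope = model.coef_[0]        # estimated beta_2 (one entry per feature)
print('intercept: %.2f, slope: %.2f' % (intercept, slope))
```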

#### 6. Predictions
##### 6.1 Use the trained model to predict the Y value for a given X value (sklearn requires X to be a two-dimensional array)

In [None]:
y_temp = SLR.predict([[9.5]])
# predict returns an array with one value per input row, so take the first element
print('when YearsExperience=9.5, the predicted value of Salary is %f' % y_temp[0])
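
The two-dimensional requirement means the input has one row per sample and one column per feature. As a small sketch (the values in `years` are hypothetical), a flat list of X values can be turned into that shape with NumPy's `reshape`:

```python
import numpy as np

# Hypothetical X values, stored as a flat 1-D array
years = np.array([1.5, 9.5, 12.0])
print(years.shape)    # 1-D: one axis only

# reshape(-1, 1) keeps all values and arranges them as a single column:
# one row per sample, one column for the single feature
X_new = years.reshape(-1, 1)
print(X_new.shape)

# A single value written as a nested list has the same layout: 1 row, 1 column
single = np.array([[9.5]])
print(single.shape)
```

An array shaped like `X_new` can then be passed to `predict` to get one prediction per row.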

##### 6.2 Use the trained model to predict the Y values for the test set

In [None]:
y_pred = SLR.predict(X_test.values)
print('The predicted values in the test set are:', y_pred)

#### 7. Plot the Actual and Predicted Values

In [None]:
# scatter plot of the actual values (whole dataset)
data.plot.scatter(x='YearsExperience', y='Salary')
# regression line drawn through the predictions for the test set
plt.plot(X_test, y_pred, color="red", linewidth=3)
plt.show()

#### 8. Model Performance Evaluation
RMSE and R2 score are metrics for assessing the performance of regression machine learning models.
##### 8.1 RMSE
Root Mean Squared Error (RMSE) is a measure of how accurate the predictions are. It's the square root of the mean squared error between the predicted and actual values. 

The closer RMSE is to 0, the more accurate the model is. However, RMSE is expressed in the same units as the target being predicted, so there is no general rule for what counts as a ‘good’ value; it can only be judged in the context of the dataset you are working with. An RMSE of 1,000 for a house price prediction model would most likely be considered good because house prices tend to be over $100,000. The same RMSE of 1,000 for a height prediction model would be terrible, since the average height is around 175 cm.

In [None]:
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE is: %.2f' %RMSE)
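
To show what the metric computes and why its scale follows the target, here is a minimal sketch in plain Python. The helper name `rmse` and all numbers are made up for illustration; the notebook itself uses `mean_squared_error` from scikit-learn.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical actual and predicted values; every prediction is off by 1
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 6.0, 6.0, 10.0]
error = rmse(y_true, y_pred)
print('RMSE: %.2f' % error)

# The same data expressed in units 1000 times smaller (e.g. metres vs. km):
# the predictions are equally good, but the RMSE is 1000 times larger
y_true_scaled = [v * 1000 for v in y_true]
y_pred_scaled = [v * 1000 for v in y_pred]
error_scaled = rmse(y_true_scaled, y_pred_scaled)
print('RMSE after rescaling: %.2f' % error_scaled)
```

The two results describe the same quality of fit, which is why an RMSE value should always be read against the typical magnitude of the target.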

##### 8.2 R2 Score
R-squared (R2) is a measure of fit: it indicates how much of the variation in the dependent variable is explained by the independent variable(s) in the model.

The R2 score ranges from 1 down to negative values for under-performing models. A score of 1 is perfect and indicates that all the variance is explained by the independent variables. A score of 0 corresponds to a model that always predicts the mean of the observed values, and a negative score means the model does worse than that.

In [None]:
R2_Score = r2_score(y_test, y_pred)
print('R2 score is: %.2f' %R2_Score)
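
The range of the score is easier to see from its definition, R2 = 1 − SS<sub>res</sub> / SS<sub>tot</sub>, where SS<sub>res</sub> is the sum of squared residuals and SS<sub>tot</sub> is the sum of squared deviations of the actual values from their mean. The sketch below implements that definition in plain Python; the helper name `r2` and the numbers are hypothetical, and the notebook itself uses `r2_score` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual values (their mean is 6.0)
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, [3.0, 5.0, 7.0, 9.0])   # predictions match exactly
baseline = r2(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
poor = r2(y_true, [9.0, 7.0, 5.0, 3.0])      # trend predicted backwards

print('perfect: %.2f, baseline: %.2f, poor: %.2f' % (perfect, baseline, poor))
```

The three cases cover the landmarks of the scale: a perfect fit, a model no better than predicting the mean, and a model that is worse than predicting the mean.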

## Part 3 - Practice
Use the simple linear regression algorithm to analyze the dataset ***data.csv***. The values of X and Y in this dataset are randomly generated according to some rule. Answer the following questions:

1. When x = 100, what is the predicted value of y?
2. Draw a scatter plot of all the data in the dataset together with the straight line of the prediction model.
3. What are the RMSE and R2 score of the prediction model? Analyze the accuracy and fit of the model based on these two values.

**HINT**: You can make use of the file *"code.ipynb"* and properly modify it to complete the practice.