# **Beginner's Guide to Regression Analysis and Plot Interpretations**
![](https://expertsystem.com/wp-content/uploads/2017/03/machine-learning-definition.jpeg)

## Table Of Contents:
* **Understanding Regression**
* **How Does Regression Work?**
* **Types of Algorithms**
* **Testing of Algorithms**

## Algorithms that we will consider:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Support Vector Regression
5. Decision Tree Regression
6. Random Forest Regression

## Let's Start by Understanding What Regression Is
**Regression is a technique used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). Linear regression is parametric in nature because it makes certain assumptions about the data set, such as a linear relationship between the variables. If the data set follows those assumptions, regression gives good results. Otherwise, it struggles to provide convincing accuracy.**

## How Does Regression Work?
**Regression is a part of supervised learning, which means that we train our model on given training data and the model tries to learn the relationship between the dependent and the independent variables. It does this using a function that maps the independent variables to the dependent variable. Once the model is trained and the error is minimised, we can make predictions on testing data as well.**
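
To make "minimising the error" concrete, here is a minimal sketch that compares two candidate functions by their mean squared error. The data values and the names `candidate_a` and `candidate_b` are made up for illustration and are not part of the notebook's datasets.

```python
# Toy training data (made up for illustration): y is roughly 2 * x
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [2.1, 3.9, 6.2, 7.8]

def mean_squared_error(predict, xs, ys):
    """Average squared gap between the predictions and the true values."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Two hypothetical candidate functions mapping x to y
def candidate_a(x):
    return 2.0 * x

def candidate_b(x):
    return 1.0 * x + 1.0

error_a = mean_squared_error(candidate_a, x_train, y_train)
error_b = mean_squared_error(candidate_b, x_train, y_train)

# The candidate with the smaller error fits this data better
print("error_a =", error_a, "error_b =", error_b)
```

Training a regression model amounts to searching for the function with the smallest such error on the training data.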

## We can apply a machine learning model by following six steps:
* Identifying Problem
* Analysing Data
* Preparing Data
* Evaluating Algorithm
* Improving Results
* Presenting Results

## Simple Linear Regression
**It is a basic and commonly used type of predictive analysis. It is used to explain the relationship between one dependent variable and one independent variable. 
y = a*X + b where:**
* y – Dependent Variable
* X – Independent variable
* b – intercept
* a – Slope
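
For a single independent variable, the slope and intercept that minimise the squared error have a closed form. The sketch below computes them in plain Python on made-up data; the function name `fit_simple_linear` is a hypothetical helper, not something the notebook defines.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope a and intercept b for y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - a * mean_x
    return a, b

# Made-up points that lie exactly on y = 2*x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_simple_linear(xs, ys)
print("slope =", a, "intercept =", b)
```

On noisy data the same formulas give the line of best fit rather than an exact match.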

## Preparing Data:

In [None]:
# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
# Training the Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
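# For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
accuracy = regressor.score(X_test, y_test)
print('Accuracy (R^2) = '+ str(accuracy))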
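
The `score` method of scikit-learn regressors reports the coefficient of determination (R^2). As a rough sketch of what that number means, the following plain-Python function computes it by hand; `r2_score_manual` and the sample values are illustrative assumptions.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r2_score_manual(y_true, y_pred)
print("R^2 =", r2)
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of `y_true`, and negative values mean it does worse than that.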

In [None]:
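# 'seaborn' was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')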
plt.scatter(X_test, y_test, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_train, regressor.predict(X_train), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Multiple Linear Regression
**It is also a basic and commonly used type of predictive analysis. It is used to explain the relationship between one dependent variable and two or more independent variables. 
y = b + a1*X1 + a2*X2 + a3*X3 + a4*X4 + ... where:**
* y – Dependent Variable
* X1, X2, X3, X4, ... – Independent variables
* b – Intercept
* a1, a2, a3, a4, ... – Slopes
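
The equation above is a weighted sum of the features plus the intercept. A minimal sketch with made-up numbers (the helper `predict_multiple` is hypothetical):

```python
def predict_multiple(intercept, slopes, features):
    """Compute y = b + a1*X1 + a2*X2 + ... for one row of features."""
    return intercept + sum(a * x for a, x in zip(slopes, features))

# Made-up intercept, slopes and one data row
b = 10.0
slopes = [2.0, -1.0, 0.5]
row = [3.0, 4.0, 8.0]

prediction = predict_multiple(b, slopes, row)  # 10 + 6 - 4 + 4
print("prediction =", prediction)
```

Each slope tells how much the prediction changes when its feature increases by one unit while the other features stay fixed.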

## Preparing Data:

In [None]:
dataset = pd.read_csv('../input/insurance/insurance.csv')
print(dataset)

In [None]:
X = dataset.iloc[:, :-1].copy() # Independent Variables (copied so the encoding below does not write into a view of dataset)
y = dataset.iloc[:, -1] # Dependent Variable

In [None]:
# We have to encode the dataset because some columns contain text.
# For the 'sex' and 'smoker' columns we apply Label Encoding, as there are only 2 categories
# For 'region' we apply OneHot Encoding, as there are more than 2 categories

# Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X.iloc[:, 1] = le.fit_transform(X.iloc[:, 1])
X.iloc[:, 4] = le.fit_transform(X.iloc[:, 4])

# OneHot Encoding:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [5])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
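
To show the difference between the two encodings, here is a small plain-Python sketch. The sample values and the helper names `label_encode` and `one_hot_encode` are assumptions for illustration; the scikit-learn encoders handle many more cases.

```python
def label_encode(values):
    """Map each category to an integer, in sorted order of the categories."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Create one 0/1 column per category, in sorted order of the categories."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

# Illustrative sample values
smoker = ["yes", "no", "no", "yes"]
region = ["southwest", "southeast", "northwest", "southeast"]

smoker_codes, smoker_categories = label_encode(smoker)
region_rows, region_categories = one_hot_encode(region)
print(smoker_categories, smoker_codes)
print(region_categories, region_rows)
```

One-hot encoding avoids implying an order between categories, which integer labels would do when there are more than two of them.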

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
# Training the Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
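# For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
accuracy = regressor.score(X_test, y_test)
print('Accuracy (R^2) = '+ str(accuracy))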

## Polynomial Regression
**It extends linear regression by adding powers of the independent variable as extra features, so it can model a curved relationship between one dependent variable and one independent variable. 
y = b + a1*X + a2*X^2 + a3*X^3 + a4*X^4 + ... where:**
* y – Dependent Variable
* X – Independent variable
* b – Intercept
* a1, a2, a3, a4, ... – Coefficients of the powers of the independent variable
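
One way to read the polynomial equation is as ordinary linear regression on expanded features: each input X is turned into the list [1, X, X^2, ...]. The sketch below shows that expansion for a single variable; the helper names and coefficients are made up for illustration.

```python
def polynomial_features(x, degree):
    """Expand one value x into [x^0, x^1, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]

def predict_polynomial(coefficients, x):
    """coefficients[0] is the intercept b; coefficients[p] multiplies x^p."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(c * f for c, f in zip(coefficients, features))

# Made-up coefficients for y = 1 + 0*X + 2*X^2
coefficients = [1.0, 0.0, 2.0]
features = polynomial_features(3.0, 2)
prediction = predict_polynomial(coefficients, 3.0)
print(features, prediction)
```

Because the model is still linear in its coefficients, the same least-squares fitting used for linear regression applies.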

In [None]:
dataset = pd.read_csv('../input/polynomialregressioncsv/polynomial-regression.csv')
X = dataset.iloc[:, :-1] # Independent Variable
y = dataset.iloc[:, -1] # Dependent Variable

In [None]:
# Training the Model
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 5)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

In [None]:
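# 'seaborn' was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
plt.scatter(X, y, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Data')
# Reuse the already-fitted transformer instead of refitting it
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color = 'blue', label='Model Plot')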
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Support Vector Regression
**SVR gives us the flexibility to define how much error is acceptable in our model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data. It uses the following constraint:
|y - (aX + b)| <= e, where:**
* e - maximum error
![](https://miro.medium.com/max/1212/1*bSZn9bK43MaA5vVDamRQ2A.png)
The points on or outside the margin are the Support Vectors.
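
The idea of an acceptable error can be sketched with the epsilon-insensitive loss, which ignores any error smaller than e and penalises only the part that exceeds it. The function name and numbers below are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the margin of width epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Made-up examples with epsilon = 0.5
inside = epsilon_insensitive_loss(5.0, 5.25, 0.5)   # error 0.25 is within the margin
outside = epsilon_insensitive_loss(5.0, 6.5, 0.5)   # error 1.5 exceeds the margin by 1.0
print(inside, outside)
```

Points with zero loss lie strictly inside the margin and do not influence the fitted function.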

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable
y_train = y_train.reshape(len(y_train),1)

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable
y_test = y_test.reshape(len(y_test),1)

In [None]:
# Scaling X and y
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) # reuse the training mean and standard deviation
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test) # reuse the training mean and standard deviation
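
Standard scaling subtracts the mean and divides by the standard deviation. A common convention is to learn both statistics from the training data only and reuse them for the test data, so that the test set stays unseen. The sketch below shows this in plain Python with made-up values; `fit_scaler` and `transform` are hypothetical helpers.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from the values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardise values using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Made-up training and test values
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)              # statistics come from the training data only
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)   # the test data reuses the training statistics
print(mean, std, scaled_test)
```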

In [None]:
# Training the Model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, y_train.ravel()) # SVR expects a 1-D target array

In [None]:
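# For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
accuracy = regressor.score(X_test, y_test)
print('Accuracy (R^2) = '+ str(accuracy))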

In [None]:
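# Sort by input so the fitted curve is drawn from left to right
order = np.argsort(X_test[:, 0])
X_plot = sc_X.inverse_transform(X_test[order])
# predict() returns a 1-D array; the scaler needs a 2-D column to invert the scaling
y_plot = sc_y.inverse_transform(regressor.predict(X_test[order]).reshape(-1, 1))
plt.scatter(sc_X.inverse_transform(X_test), sc_y.inverse_transform(y_test), color = 'red',
           marker = 'o', s = 35, alpha = 0.5, label = 'Test data')
plt.plot(X_plot, y_plot, color = 'blue', label='Model Plot')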
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Decision Tree Regression
**A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, the associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.**
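
To illustrate how a dataset is broken into subsets, the sketch below searches for the single threshold that best separates one input variable into two groups, measured by the sum of squared errors around each group's mean. This is a simplified, assumed version of the splitting step (it tries the data values themselves as thresholds); the helper names and data are made up.

```python
def sum_squared_error(values):
    """Squared error of the values around their own mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Try each distinct x (except the smallest) as a threshold and keep the best."""
    best_threshold, best_error = None, float("inf")
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error

# Made-up data with two clearly separated groups
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]
threshold, error = best_split(xs, ys)
print("split at x <", threshold, "with error", error)
```

A full tree repeats this search inside each subset until a stopping rule is met, and each leaf predicts the mean of the training targets that reach it.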

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

In [None]:
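# For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
accuracy = regressor.score(X_test, y_test)
print('Accuracy (R^2) = '+ str(accuracy))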

In [None]:
X_grid = np.arange(X_test.min(), X_test.max(), 0.01) # scalar bounds; the built-in min/max would return 1-element arrays
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X_test, y_test, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Random Forest Regression
**Random Forest combines multiple decision trees and, for regression, averages their predictions. Some individual trees may predict well while others may not, but the average over all trees is usually more accurate and more stable than any single tree.**
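
The combining step can be sketched as follows: each tree is trained on a random resample of the data, and the forest's prediction is the mean of the trees' predictions. To keep the example short, each "tree" here is a one-split stump with a fixed threshold, which is a simplifying assumption; the helper names and data are made up.

```python
import random

def fit_stump(xs, ys, threshold):
    """A one-split 'tree': predicts the mean of y on each side of the threshold."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    overall = sum(ys) / len(ys)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x < threshold else right_mean

def fit_forest(xs, ys, n_trees, threshold, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], threshold))
    return trees

def predict_forest(trees, x):
    """The forest's prediction is the average of the trees' predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Made-up data with two clearly separated groups
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]
trees = fit_forest(xs, ys, n_trees=10, threshold=6.0)
low = predict_forest(trees, 2.0)
high = predict_forest(trees, 11.0)
print(low, high)
```

Averaging smooths out the quirks of individual trees, which is why a forest tends to be more stable than a single tree.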

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)

In [None]:
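# For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
accuracy = regressor.score(X_test, y_test)
print('Accuracy (R^2) = '+ str(accuracy))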

In [None]:
X_grid = np.arange(X_test.min(), X_test.max(), 0.01) # scalar bounds; the built-in min/max would return 1-element arrays
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X_test, y_test, c = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Thank you very much for your attention to my work. I wish you good datasets for research!

![](https://i.pinimg.com/originals/4f/92/fe/4f92fe4ee07e79bc3495e41bb5ae1bd3.gif)