# **Beginner's Guide to Regression Analysis and Visualizing Interpretations**
![](https://lh3.googleusercontent.com/proxy/G1A8T4niNrin0vIFFJUTpq3e0ZjfHnhnwSgdYwNJGTDax7PgALUObiFeV8tXBI95_v_pHdNup7BiRdxERfMnolvGLuOHlED2g3O1M8jZoMyhgxgYqeOg0ib6wtQVJy86GMI-RFYL2eLdYDtPIkFxe35x5feREsFXX9zHw_OzRem5lm-573yrKK1oOuPd76vY1JA)

## Table Of Contents:
* **Understanding Regression**
* **How Does Regression Work?**
* **Types of Algorithms**
* **Testing of Algorithms**

## Algorithms that we will consider:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Support Vector Regression
5. Decision Tree Regression
6. Random Forest Regression

## What is Regression?

Regression is a technique used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). Linear regression in particular is parametric in nature: it assumes a specific functional form for the relationship in the data set. If the data set follows those assumptions, regression gives very good results. Otherwise, it struggles to provide convincing accuracy.

## How Does Regression Work?

Regression is a part of supervised learning, which means that we train our model on given training data and the model tries to learn the relationship between the dependent and the independent variables. It does this using a function that maps the independent variables to the dependent variable. When the model is trained and the error is minimised, we are able to make predictions on testing data as well.

## The 7 Steps of Machine Learning

### 1 Data Collection
   * The quantity and quality of the data dictate how accurate the model can be
   * The outcome of this step is generally a representation of data which we will use for training
   * Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this step
 
### 2 Data Preparation
   * Wrangle data and prepare it for training
   * Clean that which may require it (remove duplicates, correct errors, deal with missing values, normalization, data type conversions, etc.)
   * Randomize data, which erases the effects of the particular order in which we collected and/or otherwise prepared our data
   * Visualize data to help detect relevant relationships between variables or class imbalances (bias alert!), or perform other exploratory analysis
   * Split into training and evaluation sets (Good train/eval split? 80/20, 70/30, or similar, depending on domain, data availability, dataset particulars, etc.)
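
The randomize and split steps above can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data (the arrays, the seed and the 80/20 ratio are assumptions chosen for the example), not the exact procedure used later in this notebook:

```python
import numpy as np

# Hypothetical toy data: 10 samples, one feature, y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Shuffle the row indices so the collection order no longer matters.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))

# Keep 80% of the shuffled rows for training and the rest for evaluation.
split_at = int(0.8 * len(X))
train_idx, eval_idx = indices[:split_at], indices[split_at:]
X_train, y_train = X[train_idx], y[train_idx]
X_eval, y_eval = X[eval_idx], y[eval_idx]

print(X_train.shape, X_eval.shape)
```

scikit-learn's `train_test_split` performs the same shuffle and split in a single call.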
 
### 3 Choose a Model
   * Different algorithms are for different tasks; choose the right one
 
### 4 Train the Model
   * The goal of training is to answer a question or make a prediction correctly as often as possible
   * Linear regression example: algorithm would need to learn values for m (or W) and b (x is input, y is output)
   * Each iteration of process is a training step
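
To make "learning m and b" concrete, here is a rough sketch of gradient descent on the mean squared error. The data, the learning rate and the number of steps are assumptions picked for this toy problem; scikit-learn's `LinearRegression` solves the same problem directly rather than iteratively.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 3x + 2.
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0

m, b = 0.0, 0.0        # initial guesses for slope and intercept
learning_rate = 0.5    # assumed value, chosen for this toy problem

for step in range(2000):                 # each pass is one training step
    error = (m * x + b) - y              # prediction error for every sample
    grad_m = 2.0 * np.mean(error * x)    # derivative of the MSE w.r.t. m
    grad_b = 2.0 * np.mean(error)        # derivative of the MSE w.r.t. b
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(round(m, 3), round(b, 3))
```

After enough steps, `m` and `b` should approach the values 3 and 2 that generated the data.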
 
### 5 Evaluate the Model
   * Uses some metric or combination of metrics to "measure" objective performance of model
   * Test the model against previously unseen data
   * This unseen data is meant to be somewhat representative of model performance in the real world, but still helps tune the model (as opposed to test data, which does not)
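
As an illustration of what such a metric looks like for regression, the sketch below computes the mean squared error, its square root, and the coefficient of determination (R²) by hand. The `y_true` and `y_pred` arrays are made-up values, not results from this notebook.

```python
import numpy as np

# Hypothetical true values and model predictions on held-out data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

mse = np.mean((y_true - y_pred) ** 2)             # mean squared error
rmse = np.sqrt(mse)                               # same units as y

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination

print(mse, rmse, r2)
```

The `score` method of scikit-learn regressors returns this R² value, so a score of 1.0 means a perfect fit and a score near 0 means the model does no better than predicting the mean.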
 
### 6 Parameter Tuning
   * This step refers to hyperparameter tuning, which is an "artform" as opposed to a science
   * Tune model parameters for improved performance
   * Simple model hyperparameters may include: number of training steps, learning rate, initialization values and distribution, etc.
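
One common way to tune hyperparameters is an exhaustive grid search with cross-validation. The following is only a sketch: the synthetic data and the candidate values for `C` and `epsilon` are assumptions made for illustration, and other search strategies may suit a real problem better.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical noisy sine data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Candidate hyperparameter values (assumed for illustration).
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.5]}

# Fit one model per combination and score each with 5-fold cross-validation.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```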
 
### 7 Make Predictions
   * Further (test set) data, which have been withheld from the model until this point (and for which the true target values are known), are used to test the model; this gives a better approximation of how the model will perform in the real world

## Simple Linear Regression

It is a basic and commonly used type of predictive analysis. The regression estimates are used to explain the relationship between one dependent variable and a single independent variable.

$$y = \beta + \alpha*X $$ 

where:

* $y$ – Dependent Variable
* $X$ – Independent variable
* $\beta$ – intercept
* $\alpha$ – Slope
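
For a single independent variable, the least-squares estimates of the slope and intercept have a closed form: the slope is the covariance of $X$ and $y$ divided by the variance of $X$, and the intercept follows from the means. A small sketch on made-up data (the arrays are assumptions chosen so the answer is easy to check):

```python
import numpy as np

# Hypothetical data that lies exactly on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Slope: sum of co-deviations divided by sum of squared deviations of x.
alpha = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the fitted line passes through the point of means.
beta = y.mean() - alpha * x.mean()

print(alpha, beta)   # prints 2.0 1.0
```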

## Data Collection & Data Preparation:

In this section, you may have to collect data sets from different sources, from the internet or your personal repository. You also need to do data cleaning if necessary, then separate training data and testing data. Here, I already have a separate data set, so no separation of data is required.

In [None]:
# Importing Libraries
import numpy as np                  # working with array 
import pandas as pd                 # import data set
import matplotlib.pyplot as plt     # for visualization
          

In [None]:
# Import Training Data Set
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values)   # Dependent Variable

In [None]:
# Import Testing Data Set
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values)   # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values)     # Dependent Variable

## Training the Model

In this section, a linear model is fitted to the training data using scikit-learn's `LinearRegression` class.

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
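
Once fitted, a `LinearRegression` object exposes the learned intercept ($\beta$) and slope ($\alpha$) as attributes. The sketch below uses a small made-up dataset (`X_demo`, `y_demo` and `demo_model` are hypothetical names) so that it does not depend on the CSV files above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 4 + 1.5x exactly.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 4.0 + 1.5 * X_demo.ravel()

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.intercept_)   # estimate of beta
print(demo_model.coef_)        # estimate of alpha, one entry per feature
```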

In [None]:
# score() returns the coefficient of determination (R^2) for regressors
r2 = regressor.score(X_test, y_test)
print('R^2 score = ' + str(r2))

In [None]:
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
plt.scatter(X_test, y_test, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_train, regressor.predict(X_train), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Multiple Linear Regression

Multiple linear regression extends simple linear regression to several predictors. The regression estimates are used to explain the relationship between one dependent variable and two or more independent variables.

$$Y = \beta + \alpha_1*X_1 + \alpha_2*X_2 + \cdots + \alpha_n*X_n$$ 

where:
* $Y:$ Dependent Variable
* $X_1, X_2, \cdots, X_n:$ Independent variables
* $\beta:$ Intercept
* $\alpha_1, \alpha_2, \cdots, \alpha_n:$ Slopes 
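
In matrix form, the intercept and slopes can be estimated together by least squares on a design matrix whose first column is all ones. This is a sketch on synthetic data; the feature values and the "true" coefficients are assumptions made so the result can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # three hypothetical features
true_alpha = np.array([2.0, -1.0, 0.5])   # assumed slopes
y = 4.0 + X @ true_alpha                  # assumed intercept beta = 4, no noise

# Prepend a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
beta, alpha = coeffs[0], coeffs[1:]

print(beta, alpha)
```

Because the data here are noise-free, the recovered `beta` and `alpha` should match the assumed values up to floating-point error.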

## Preparing Data:

In [None]:
dataset = pd.read_csv('../input/insurance/insurance.csv')
print(dataset)

In [None]:
X = dataset.iloc[:, :-1].copy() # Independent Variables (copied so the encoding below does not write into a view of dataset)
y = dataset.iloc[:, -1]  # Dependent Variable

In [None]:
# The dataset contains text columns, so they must be encoded as numbers.
# For the 'sex' and 'smoker' columns we apply label encoding, as there are only 2 categories.
# For 'region' we apply one-hot encoding, as there are more than 2 categories.

# Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X.iloc[:, 1] = le.fit_transform(X.iloc[:, 1])
X.iloc[:, 4] = le.fit_transform(X.iloc[:, 4])

# OneHot Encoding:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [5])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
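
To see what the two encodings produce, here is a small sketch on a hypothetical four-row table (the `toy` DataFrame is invented for illustration and only mimics the 'smoker' and 'region' columns):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical rows that mimic the 'smoker' and 'region' columns.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northeast", "southeast", "southwest"],
})

# Label encoding: classes are sorted alphabetically, so 'no' -> 0, 'yes' -> 1.
smoker_codes = LabelEncoder().fit_transform(toy["smoker"])

# One-hot encoding: one 0/1 column per region, in alphabetical order.
encoder = OneHotEncoder()
region_onehot = encoder.fit_transform(toy[["region"]]).toarray()

print(smoker_codes)
print(encoder.categories_[0])
print(region_onehot)
```

One-hot encoding avoids implying an order between regions, which a single integer code would do.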

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
# Training the Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
# score() returns the coefficient of determination (R^2) for regressors
r2 = regressor.score(X_test, y_test)
print('R^2 score = ' + str(r2))

## Polynomial Regression

Polynomial regression models the relationship between the dependent variable and the independent variable as an $n$-th degree polynomial. The model is still linear in its coefficients, so it can be fitted with ordinary linear regression on the expanded features $X, X^2, \cdots, X^n$.

$$y = \beta + \alpha_1*X + \alpha_2*X^2 + \cdots+ \alpha_n*X^n$$

where:
* $y:$ Dependent Variable
* $X, X^2, \cdots, X^n:$ Powers of the independent variable
* $\beta:$ Intercept
* $\alpha_1, \alpha_2, \cdots, \alpha_n:$ Coefficients of the powers of the independent variable
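
The feature expansion itself can be seen on a tiny example. The sketch below applies scikit-learn's `PolynomialFeatures` to two made-up inputs (`X_demo` is a hypothetical array); each row becomes $[1, X, X^2, X^3]$, where the leading 1 is the bias column:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical input values, one feature each.
X_demo = np.array([[2.0], [3.0]])

# degree=3 produces the columns 1, X, X^2, X^3 for every row.
expanded = PolynomialFeatures(degree=3).fit_transform(X_demo)

print(expanded)
```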

In [None]:
dataset = pd.read_csv('../input/polynomialregressioncsv/polynomial-regression.csv')
X = dataset.iloc[:, :-1] # Independent Variable
y = dataset.iloc[:, -1] # Dependent Variable

In [None]:
# Training the Model
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 5)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

In [None]:
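plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
plt.scatter(X, y, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Data')
# Use transform() here: poly_reg was already fitted on X above
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color = 'blue', label='Model Plot')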
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Support Vector Regression

SVR gives us the flexibility to define how much error is acceptable in our model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data. It uses the following constraint:

$$|y - aX| \le e$$

where:
* $e$ - maximum error

![](https://miro.medium.com/max/1212/1*bSZn9bK43MaA5vVDamRQ2A.png)

The points that lie on or outside the margin are the support vectors.
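
The idea of an acceptable error can be expressed as the epsilon-insensitive loss: predictions within $e$ of the target cost nothing, and the penalty grows linearly beyond that. The sketch below is an illustration on made-up numbers; the function name and arrays are hypothetical, and $e$ is called `epsilon` here to match the parameter name of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Penalty is zero inside the tube and grows linearly outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 2.95, 5.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

Only the second and fourth predictions miss by more than 0.1, so only they contribute to the loss.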

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable
y_train = y_train.reshape(len(y_train),1)

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable
y_test = y_test.reshape(len(y_test),1)

In [None]:
# Scaling X and y
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)   # reuse the training statistics; do not refit on test data
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)   # reuse the training statistics; do not refit on test data

In [None]:
# Training the Model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, y_train.ravel())  # SVR expects a 1-D target array

In [None]:
# score() returns the coefficient of determination (R^2) for regressors
r2 = regressor.score(X_test, y_test)
print('R^2 score = ' + str(r2))

In [None]:
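# Sort by input so the line plot is drawn from left to right
order = X_test[:, 0].argsort()
X_sorted = X_test[order]
y_pred = regressor.predict(X_sorted).reshape(-1, 1)   # inverse_transform needs a 2-D array
plt.scatter(sc_X.inverse_transform(X_test), sc_y.inverse_transform(y_test), color = 'red',
           marker = 'o', s = 35, alpha = 0.5, label = 'Test data')
plt.plot(sc_X.inverse_transform(X_sorted), sc_y.inverse_transform(y_pred),
           color = 'blue', label='Model Plot')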
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Decision Tree Regression

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the associated decision tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes; in regression, each leaf predicts a numeric value.
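
How a single decision node might be chosen can be sketched as follows: try every threshold between neighbouring inputs and keep the one that minimises the squared error when each side is predicted by its own mean. This is a simplified illustration on made-up data (`best_split` is a hypothetical helper), not the exact routine inside `DecisionTreeRegressor`.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x that minimises the total squared error
    when each side of the split is predicted by its own mean."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_error = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between identical values
        left, right = y_sorted[:i], y_sorted[i:]
        error = (np.sum((left - left.mean()) ** 2)
                 + np.sum((right - right.mean()) ** 2))
        if error < best_error:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            best_error = error
    return best_threshold, best_error

# Hypothetical step-shaped data: y jumps from about 1 to about 5 between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.1, 4.9])

threshold, error = best_split(x, y)
print(threshold)   # prints 3.5
```

A full tree repeats this search recursively on each resulting subset until a stopping rule is met.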

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values)   # Dependent Variable

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

In [None]:
# score() returns the coefficient of determination (R^2) for regressors
r2 = regressor.score(X_test, y_test)
print('R^2 score = ' + str(r2))

In [None]:
# Use the array's own min/max so np.arange receives scalars, then make a column vector
X_grid = np.arange(X_test.min(), X_test.max(), 0.01).reshape(-1, 1)
plt.scatter(X_test, y_test, color = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Random Forest Regression

Random forest combines multiple decision trees. For regression, each tree predicts a value and the forest averages those predictions. Some individual trees may predict poorly, but the average over all trees is usually more accurate and less prone to overfitting than any single tree.
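
The way the trees are combined can be checked directly: for scikit-learn's `RandomForestRegressor`, the forest's prediction is the mean of the predictions of its individual trees, which are available in the `estimators_` attribute. The data below are synthetic and the names `X_demo`, `y_demo` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical noisy sine data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Predict two new points with every tree, then average across trees.
X_new = np.array([[2.5], [7.0]])
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])

print(per_tree.mean(axis=0))
print(forest.predict(X_new))
```

The two printed arrays should agree up to floating-point error, while the rows of `per_tree` show how much the individual trees disagree.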

In [None]:
Training_Dataset = pd.read_csv("../input/random-linear-regression/train.csv")
Training_Dataset = Training_Dataset.dropna()
X_train = np.array(Training_Dataset.iloc[:, :-1].values) # Independent Variable
y_train = np.array(Training_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
Testing_Dataset = pd.read_csv("../input/random-linear-regression/test.csv")
Testing_Dataset = Testing_Dataset.dropna()
X_test = np.array(Testing_Dataset.iloc[:, :-1].values) # Independent Variable
y_test = np.array(Testing_Dataset.iloc[:, 1].values) # Dependent Variable

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)

In [None]:
# score() returns the coefficient of determination (R^2) for regressors
r2 = regressor.score(X_test, y_test)
print('R^2 score = ' + str(r2))

In [None]:
# Use the array's own min/max so np.arange receives scalars, then make a column vector
X_grid = np.arange(X_test.min(), X_test.max(), 0.01).reshape(-1, 1)
plt.scatter(X_test, y_test, c = 'red', marker = 'o', s = 35, alpha = 0.5,
          label = 'Test data')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue', label='Model Plot')
plt.title('Predicted Values vs Inputs')
plt.xlabel('Inputs')
plt.ylabel('Predicted Values')
plt.legend(loc = 'upper left')
plt.show()

## Thank you very much for your attention. I hope you find good datasets for your mini-research!

![](https://i.pinimg.com/originals/40/12/1a/40121a3616ecf2439a5b04d733b6f437.gif)