**Foreword:** I kept this notebook while I was learning Linear Regression. It is the first part of a three-notebook series that aims to explain the process of applying Linear Regression to a dataset. You can find all parts at the links below:

[**1. First Linear Regression Model**](https://www.kaggle.com/salmankhi/my-first-regression-mode)

[**2. Multiple Variable Linear Regression Model**](https://www.kaggle.com/salmankhi/multiple-linear-regression)

[**3. Linear Regression on Categorical Data**](https://www.kaggle.com/salmankhi/regression-on-categorical-data)

**What is Regression?**

It is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one variable (the dependent variable, usually denoted by Y) and a series of other variables (independent variables, denoted by x1, x2, x3,...,xk). Regression on its own does not prove that the relationship is causal.

Among the different types of regression models, the most basic one is the **Simple Linear Regression Model**, and in this notebook we will use it to predict the dependent variable.

A **Linear Regression Model** is a linear approximation of the relationship between two or more variables.

Linear Regression Model: (for population) <center>Y = β0 + β1*x1 + error</center>

Here is the Simple Linear Regression Model equation: (for sample) <center>y_hat = b0 + b1*x1</center>

where,
- y_hat - dependent variable (approximated); when we write this equation, y has a hat (^) on it
- x1 - independent variable (regressor)
- β0/b0 - constant
- β1/b1 - factor defining the effect of x1 on y (coefficient of the regressor)
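
To make the sample equation concrete, here is a minimal sketch that evaluates y_hat for a few inputs. The coefficient values `b0 = 2.0` and `b1 = 0.5` and the inputs are made up for illustration; they are not estimated from any data set.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
b0 = 2.0   # constant
b1 = 0.5   # coefficient of the regressor x1

# A few made-up values of the independent variable
x1 = np.array([0.0, 2.0, 4.0])

# Sample regression equation: y_hat = b0 + b1 * x1
y_hat = b0 + b1 * x1
print(y_hat)  # [2. 3. 4.]
```

Each unit increase in x1 changes y_hat by b1, and b0 is the value of y_hat when x1 is 0.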

**Now for the example:** we have some real estate data comprising two columns. One column holds the prices of some houses and the other holds their sizes. These two (sizes and prices) seem to have a causal relationship. We will build a model that estimates the price of a house of a given size.

## 1. Importing Required Libraries

import **libraryname** as **alias**

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data manipulation and processing
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # more attractive visualizations
import statsmodels.api as sm  # will help us apply regression model

In [None]:
sns.set()  # (optional) makes all matplotlib visualizations use seaborn's default styling

## 2. Reading Data
**pd.read_csv()** reads a CSV file into a DataFrame; its argument is the location of the CSV file.

In [None]:
data = pd.read_csv("../input/real-estate-prices-and-sizes/real_estate_price_size.csv")  # saving all the data in variable 'data'
data.head()  # shows the first 5 rows of the data when no argument is given
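
As a small sketch of what `pd.read_csv()` returns, the example below reads made-up CSV text from memory instead of a file. The three rows are invented for illustration; only the column names `price` and `size` match the notebook.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "price,size\n100000,500\n150000,750\n200000,1000\n"

# read_csv accepts a path or a file-like object and returns a DataFrame
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)            # DataFrame
print(sample.shape)                     # (3, 2)
print(type(sample["price"]).__name__)   # Series
```

Selecting a single column, as in `sample["price"]`, gives a Series.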

**Note:** The relationship on which we are applying regression should make sense. Here, for example, it is more sensible to think that the prices of houses depend on their sizes, not the other way around.

Hence, size will be the independent variable (x1) here and price the dependent one (y).

## 3. Saving Dependent and Independent Variables
We will be saving both in the form of 'Series'.

In [None]:
y = data["price"]  # saving variable 'price' in y
x1 = data["size"]  # saving variable 'size' in x1

## 4. Scatter Plot
Plotting a scatter chart with the independent variable on the x-axis and the dependent variable on the y-axis.

Matplotlib (its pyplot module, imported as plt) will be used for this step.

In [None]:
plt.scatter(x1, y)  # draws a scatter plot: scatter(<var on x-axis>, <var on y-axis>)
plt.xlabel("Sizes", fontsize = 20)  # labelling x-axis
plt.ylabel("Prices", fontsize = 20)  # labelling y-axis
plt.show()  # displays the graph with the properties given above

It can be seen from the graph that there is a strong relationship between the prices and sizes of the houses: the bigger the house, the more expensive it is.
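
One way to put a number on "strong" is the Pearson correlation coefficient. The sketch below uses made-up sizes and prices, not the notebook's data set, and is only meant to show the calculation.

```python
import numpy as np

# Made-up sizes and prices, not the notebook's data set
sizes = np.array([500.0, 650.0, 800.0, 950.0, 1100.0])
prices = np.array([210000.0, 250000.0, 275000.0, 320000.0, 345000.0])

# Pearson correlation coefficient: values near +1 or -1 indicate a strong linear relationship
r = np.corrcoef(sizes, prices)[0, 1]
print(round(r, 3))
```

On the notebook's own data, the same call would be `np.corrcoef(x1, y)[0, 1]`.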

## 5. Adding a Regression Line (Applying Linear Regression on Data Set)
We have yet to apply the regression model to this data. Once we do, there will be a regression line on the scatter plot that approximates the linear relationship between the two variables. That line will be as close as possible to all the points taken together.

statsmodels will be used to apply regression on the data!

In [None]:
# adding a constant: this adds a column of 1s with the same length as x1
x = sm.add_constant(x1)

# fitting the model using OLS(Ordinary Least Squares) method, with dependent 'y' and independent 'x'
results = sm.OLS(y, x).fit()

# printing the summary obtained on the application of the OLS regression
# summaries are the strong suit of statsmodels
results.summary()
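
As a rough sketch of what adding a constant and fitting by least squares amount to, here is a NumPy-only version on made-up data. The data values and the names `X`, `b0_closed` and `b1_closed` are illustrative assumptions, and statsmodels does not necessarily use the same numerical routine internally.

```python
import numpy as np

# Made-up data lying near the line y = 3 + 2x (not the notebook's data set)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.0])

# Equivalent of adding a constant: a column of ones placed next to x1
X = np.column_stack([np.ones_like(x1), x1])

# Least squares solution: the b0, b1 that minimize the sum of squared residuals
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# Closed-form estimates for simple linear regression, for comparison
b1_closed = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)
b0_closed = y.mean() - b1_closed * x1.mean()

print(b0, b1)
print(b0_closed, b1_closed)
```

Both approaches should agree up to floating-point error, giving roughly b0 = 3.09 and b1 = 1.97 for this made-up data.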

**Also note** that there are various methods for estimating a regression. OLS, which is used here, is one of the most common methods for estimating the linear regression equation. Other methods include:
- Generalized Least Squares
- Maximum Likelihood Estimation
- Bayesian Regression
- Kernel Regression
- Gaussian Process Regression

## 6. Interpretation of the Summary
The last function used, summary(), shows the results of applying OLS regression to the data set. These results contain a Model Summary, a Coefficient Table and some other information. We will look at some of the important values:
- dependent variable: price
- Model, Method - OLS, Least Squares
- constant coefficient - b0 (1.019e+05 here)
- size coefficient - b1 (effect of size on the price) (223.1787 here)
- standard error - (the lower, the better) (1.19e+04)
- t-statistic - a coefficient divided by its standard error
- P>|t| - p-value for the H0 that size does not affect price (0.000 means we can reject this H0)
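
To show how the standard error and t-statistic of the slope relate to each other, here is a minimal sketch using the textbook formulas for simple linear regression. The data is made up, so the numbers will not match the summary above.

```python
import numpy as np

# Made-up data, not the notebook's data set
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.0])
n = len(x1)

# OLS estimates for simple linear regression
sxx = np.sum((x1 - x1.mean()) ** 2)
b1 = np.sum((x1 - x1.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x1.mean()

# Residual variance with n - 2 degrees of freedom (two estimated coefficients)
residuals = y - (b0 + b1 * x1)
s2 = np.sum(residuals ** 2) / (n - 2)

# Standard error of the slope and its t-statistic for H0: slope = 0
se_b1 = np.sqrt(s2 / sxx)
t_b1 = b1 / se_b1

print(se_b1, t_b1)
```

A small standard error relative to the coefficient gives a large t-statistic, which in turn gives a small value in the P>|t| column.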

## 7. Plotting Regression Line on the Graph
Before we go ahead and plot the regression line on the graph, it should be understood that the line on the scatter plot is not the regression itself; it is only a graphical representation of it.

Part of this code is the same as in the scatter plot earlier; we just add a line to that graph.

In [None]:
plt.scatter(x1, y)
y_hat = 1.019e+05 + (223.1787 * x1)  # substituting b0 and b1 in the Simple Linear Regression Model equation
fig = plt.plot(x1, y_hat, lw = 4, c = "orange", label = "Regression Line")  # plotting the line
plt.xlabel("Size", fontsize = 20)
plt.ylabel("Price", fontsize = 20)
plt.legend()  # shows the "Regression Line" label on the graph
plt.show()

We can use this equation, y_hat = 1.019e+05 + (223.1787 * x1), to approximate the price of a house by substituting its size for x1.
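
As a small sketch of that substitution, the helper below plugs a size into the equation. The function name `predict_price` and the size of 750 are made up for illustration.

```python
# Coefficients as read from the summary
b0 = 1.019e+05
b1 = 223.1787

def predict_price(size):
    """Approximate price for a given size using y_hat = b0 + b1 * size."""
    return b0 + b1 * size

# 750 is a made-up size chosen for illustration; the result is about 269284
print(predict_price(750))
```

In the notebook itself, the unrounded coefficients are available as `results.params`, which avoids retyping rounded values.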

## 8. Decomposition of Variability
Three measures help us analyze our model.

- Sum of Squares Total (SST):

    It is a measure of the dispersion of the data set.
    To find it
    1. We take the square of the difference between each sample-y and mean-y
    2. And then take the sum of all the squared differences
    (sample-y is each observation and mean-y is the mean of all observations)


- Sum of Squares Regression (SSR):

    It is a measure of how dispersed our approximated-y values are around mean-y.
    To find it
    1. We take the square of the difference between each approximated-y and mean-y
    2. And then take the sum of all the squared differences
    (approximated-y is the value we get by plugging x into the linear regression equation)
    
- Sum of Squares Error (SSE):

    It is a measure of how far our approximated-y values are from sample-y.
    To find it
    1. We take the square of the difference between each sample-y and its approximated-y
    2. And then take the sum of all the squared differences
    
The three measures are connected by SST = SSR + SSE. For a perfect Linear Regression Model:
- SST = SSR and SSE = 0

In general, a smaller SSE implies a stronger model.
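
The three measures can be sketched in a few lines of NumPy. The data below is made up, and the fit uses the closed-form OLS estimates for one regressor rather than statsmodels.

```python
import numpy as np

# Made-up data, not the notebook's data set
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.0])

# OLS fit (closed form for one regressor)
b1 = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)
b0 = y.mean() - b1 * x1.mean()
y_hat = b0 + b1 * x1

sst = np.sum((y - y.mean()) ** 2)      # total variability of the observations
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained variability (errors)

print(sst, ssr, sse)
print(np.isclose(sst, ssr + sse))  # True
```

The check on the last line holds for an OLS fit that includes a constant: the total variability splits into the explained and unexplained parts.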

## 9. R-Squared
R-squared is the ratio of SSR to SST.
- R-squared = 1 => SSR = SST, meaning that x explains the entire variability of y
- R-squared = 0 => x explains NONE of the variability of y
- R-squared ranges from 0 to 1

A low value of R-squared does not necessarily mean that the model is wrong; it just suggests that there might be other variables that also affect y and are not incorporated in the model.
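
The two extreme cases can be checked with a short sketch. The function name `r_squared` and the data are made up for illustration; the function uses the form 1 - SSE / SST, which equals SSR / SST for an OLS fit with a constant.

```python
import numpy as np

def r_squared(y, y_hat):
    """R-squared computed as 1 - SSE / SST."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

# Made-up observations
y = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y, y))                          # 1.0, predictions match every observation
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0, predicting the mean explains nothing
```

For the model fitted in this notebook, statsmodels exposes the same quantity as `results.rsquared`.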