## Linear Regression Analysis with sklearn

In this notebook, we will apply linear regression step by step to the following:
1. Simple Linear Regression
2. Multivariate Linear Regression
3. Feature Selection

## Importing Libraries

In [None]:
import numpy as np, pandas as pd  # multiple libraries can be imported on a single line
import matplotlib.pyplot as plt, seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression  # importing the required class from sklearn

## 1. Simple Linear Regression

### 1.1. Loading Data

In [None]:
data1 = pd.read_csv("../input/real-estate-price-size/real_estate_price_size.csv")
data1.head()

### 1.2. Assigning Dependent and Independent Variables

In [None]:
x1 = data1["size"]  # independent variable
y1 = data1["price"]  # dependent variable

### 1.3. Reshaping x to make regression on it possible

In [None]:
print(x1.shape)

A shape of (100,) shows that x1 is a 1-D vector, but sklearn expects the features as a 2-D matrix. We will use reshape() to convert it.

In [None]:
x1_matrix = x1.values.reshape(x1.values.size, 1)  # x1.values.size is the total number of elements (100), giving shape (100, 1)
x1_matrix.shape
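
As a small illustration of the reshaping step, the sketch below uses a made-up four-element array (not the real-estate data) to show how a 1-D vector becomes a single-column matrix. Passing `-1` lets NumPy infer the number of rows, which is an alternative to spelling out the size explicitly.

```python
import numpy as np

# Hypothetical sizes, used only to illustrate the shapes involved
sizes = np.array([650.0, 720.0, 810.0, 905.0])
print(sizes.shape)  # shape of the 1-D vector

# -1 asks NumPy to infer the row count; 1 fixes a single column
sizes_matrix = sizes.reshape(-1, 1)
print(sizes_matrix.shape)  # shape of the 2-D single-column matrix
```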

### 1.4. Visualization
This step is not part of fitting the regression, but it is very useful for assessing whether linear regression is appropriate for the data.

In [None]:
plt.scatter(x1_matrix, y1)
plt.xlabel("Size", fontsize = 20)
plt.ylabel("Price", fontsize = 20)
plt.show()

### 1.5. Applying Regression

In [None]:
reg1 = LinearRegression()  # making an instance of class LinearRegression
reg1.fit(x1_matrix, y1)  # applying regression on given data

**R-Squared $(R^2)$:**

In [None]:
reg1.score(x1_matrix, y1)

**Coefficient:**

In [None]:
reg1.coef_

**Intercept:**

In [None]:
reg1.intercept_
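
To see what the coefficient and intercept represent, here is a minimal sketch that fits the same kind of line with NumPy's least-squares solver instead of sklearn. The data is synthetic (generated exactly from y = 5 + 3x), so every name and number in it is an assumption made for illustration only.

```python
import numpy as np

# Synthetic data that lies exactly on the line y = 5 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 5.0 + 3.0 * x

# Design matrix with a column of ones so the solver also estimates the intercept
design = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)

# A prediction is simply intercept + slope * new_x
new_x = np.array([6.0, 10.0])
predictions = intercept + slope * new_x
print(intercept, slope, predictions)
```

In sklearn terms, `slope` plays the role of `reg1.coef_[0]` and `intercept` the role of `reg1.intercept_`.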

### 1.6. Predicting for New Houses

In [None]:
reg1.predict([[750], [500]])  # takes a 2-D array or DataFrame of sizes and returns the predicted prices

## 2. Multivariate Linear Regression

### 2.1. Loading Data

In [None]:
data2 = pd.read_csv("../input/real-estate-price-size-year/real_estate_price_size_year.csv")
data2.head(3)

### 2.2. Assigning Dependent and Independent Variables

In [None]:
x2 = data2[["size", "year"]]  # features, another word for regressors or independent variables
y2 = data2["price"]  # target, another word for dependent variable

In [None]:
x2.shape  # no reshape is needed here: selecting two columns already gives a 2-D matrix

### 2.3. Visualization

In [None]:
plt.scatter(x2["size"], y2)
plt.xlabel("Size", fontsize = 20)
plt.ylabel("Price", fontsize = 20)
plt.title("Price - Size")
plt.show()
plt.scatter(x2["year"], y2)
plt.xlabel("Year", fontsize = 20)
plt.ylabel("Price", fontsize = 20)
plt.title("Price - Year")
plt.show()

### 2.4. Applying Regression

In [None]:
reg2 = LinearRegression()
reg2.fit(x2, y2)

**R-Squared $(R^2)$:**

In [None]:
r2 = reg2.score(x2, y2)
r2

**Adjusted R-Squared $(R^2_{adj})$:**

Unlike statsmodels, sklearn has no built-in method for the adjusted R-squared, but we can compute it manually with this formula:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where
- $R^2$ (`r2`) is the R-squared
- $n$ (`x2.shape[0]`) is the number of observations
- $p$ (`x2.shape[1]`) is the number of features

In [None]:
n = x2.shape[0]
p = x2.shape[1]
print(n, p)

In [None]:
adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adj_r2
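
The same formula can be wrapped in a small reusable helper. This is only a sketch: the function name `adjusted_r2` and the input values are hypothetical, and the guard against a non-positive denominator is an added assumption rather than something the notebook requires.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared from R-squared, observation count and feature count."""
    dof = n_obs - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than features + 1")
    return 1 - (1 - r2) * (n_obs - 1) / dof

# Hypothetical inputs chosen only to exercise the formula
example = adjusted_r2(0.80, 100, 2)
print(round(example, 4))
```

With the notebook's variables, the call would be `adjusted_r2(r2, x2.shape[0], x2.shape[1])`. For a fixed R-squared, adding features lowers the adjusted value, which is the penalty the formula is designed to apply.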

**Coefficient:**

In [None]:
reg2.coef_

**Intercept:**

In [None]:
reg2.intercept_

### 2.5. Predicting Targets

In [None]:
reg2.predict([[750, 2009], [640, 2015]])

## 3. Feature Selection
Feature selection is an important part of multivariate linear regression. When we have many features, it helps us decide which of them actually contribute to the variability of the target, since some features may play no role in explaining the dependent variable.

For feature selection, we can use the following techniques:
1. Feature Selection using F Regression
2. Feature Selection and Standardization

### 3.1. Feature Selection using F Regression
F regression fits a simple linear regression of the dependent variable on each feature separately (i.e. n features => n regressions).
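
To make the per-feature regressions concrete, the sketch below computes the F-statistic of each simple regression by hand, using the identity F = r² / (1 - r²) · (n - 2), where r is the correlation between one feature and the target. It is a simplified stand-in written with NumPy on synthetic data, not sklearn's implementation, and the helper name `univariate_f_statistics` is hypothetical.

```python
import numpy as np

def univariate_f_statistics(X, y):
    """F-statistic of a simple linear regression of y on each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    f_stats = []
    for column in X.T:
        r = np.corrcoef(column, y)[0, 1]  # Pearson correlation with the target
        f_stats.append(r**2 / (1 - r**2) * (n - 2))  # 1 and n - 2 degrees of freedom
    return np.array(f_stats)

# Synthetic data: the first feature drives y, the second is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(size=50)
f_demo = univariate_f_statistics(X_demo, y_demo)
print(f_demo)
```

A large F-statistic means the feature on its own explains a large share of the target's variance; the matching p-value is obtained from the F distribution with 1 and n - 2 degrees of freedom.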

In [None]:
from sklearn.feature_selection import f_regression  # importing the f_regression function

In [None]:
# we will reuse the data from the multivariate regression example
f_regression(x2, y2)  # returns two arrays, each holding one value per feature

The first array holds the F-statistic of each regression and the second array holds the corresponding p-values.

In [None]:
f_statistics, p_values = f_regression(x2, y2)  # unpacking the F-statistics and the p-values from a single call

In [None]:
p_values

The p-values are the most useful here: they tell us whether each feature, taken on its own, has a statistically significant linear relationship with the target.

**Creating a Summary**

Unlike statsmodels, sklearn does not have a method that summarizes the whole regression model for us, but we can build our own summary.

In [None]:
summary2 = pd.DataFrame(["Size", "Year"], columns = ["features"])  # making a dataframe with all features listed in it
summary2["coeff"] = reg2.coef_  # adding a column for coefficients and placing their value for each feature
summary2["p-values"] = p_values.round(3)  # adding p-values for all features
summary2  # printing summary

Looking at this summary, and remembering that features with a p-value greater than 0.05 are considered insignificant for the model, we should disregard the 'year' feature. But we will keep it. Why? Our next technique for feature selection gives the answer.
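
For reference, the 0.05 rule can be applied mechanically with a boolean mask. The p-values in this sketch are hypothetical placeholders, not results from the notebook; in practice they would come from `f_regression`.

```python
import numpy as np

feature_names = np.array(["size", "year"])
# Hypothetical p-values for illustration only
example_p_values = np.array([0.001, 0.40])

alpha = 0.05  # conventional significance threshold
significant = example_p_values < alpha

print(feature_names[significant])   # features that pass the threshold
print(feature_names[~significant])  # features that the rule would drop
```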

### 3.2. Feature Selection and Standardization
Standardization here means feature scaling. Along with feature selection, this technique can serve other purposes too.

Different features have different ranges of values. Feature scaling puts them on a common scale, so that the features and their weights in the model can be compared.

To standardize, we subtract the feature's mean from each of its values and then divide by the feature's standard deviation. sklearn has a class for this.
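
The calculation described above can be sketched directly in NumPy. The feature matrix here is made up for illustration, and the sketch uses the population standard deviation (`ddof=0`), which, as far as I know, is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature matrix: column 0 is a size, column 1 is a year
features = np.array([[650.0, 2006.0],
                     [720.0, 2009.0],
                     [810.0, 2015.0],
                     [905.0, 2018.0]])

column_means = features.mean(axis=0)
column_stds = features.std(axis=0)  # population std (ddof=0)

# z-score: subtract each column's mean, then divide by its std
features_scaled = (features - column_means) / column_stds

print(features_scaled.mean(axis=0))  # approximately 0 for every column
print(features_scaled.std(axis=0))   # approximately 1 for every column
```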

In [None]:
from sklearn.preprocessing import StandardScaler  # importing the class used for feature scaling

In [None]:
scaler = StandardScaler()  # instantiating StandardScaler class
scaler.fit(x2)  # fitting standardization on feature data

In [None]:
x2_scaled = scaler.transform(x2)  # transforming feature data into standardized feature data
x2_scaled  # will print standardized features

**Applying Regression on Standardized Data**

Same as we did earlier.

In [None]:
reg3 = LinearRegression()
reg3.fit(x2_scaled, y2)

**Making Regression Summary**

In [None]:
summary3 = pd.DataFrame(["Bias", "Size", "Year"], columns = ["Features"])  # bias is the term used for intercept
summary3["coeff"] = [reg3.intercept_, reg3.coef_[0], reg3.coef_[1]]
summary3

Here we can see the weight of each feature on a common scale and compare them. Previously, with unstandardized features, we could see each feature's coefficient but could not compare them, because the features had different ranges.

This also explains why we did not remove 'year' as a feature despite its much larger p-value. The reason lies in the features' weights: size has almost 6 times more weight than year, so year's impact on the predictions is small. Linear regression does not look at p-values when fitting; a feature that explains little of the target's variation simply ends up with a small weight.
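
One way to see why standardized weights are comparable is that each one equals the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this on synthetic data with a small least-squares helper; the helper `ols`, the data, and the coefficients used to generate it are all assumptions for illustration.

```python
import numpy as np

def ols(X, y):
    """Return (intercept, coefficients) of an ordinary least-squares fit."""
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]

# Synthetic data with two features on very different scales
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.uniform(400, 1000, 60), rng.uniform(2000, 2020, 60)])
y_raw = 200.0 * X_raw[:, 0] + 2500.0 * X_raw[:, 1] + rng.normal(0, 1000, 60)

stds = X_raw.std(axis=0)
X_std = (X_raw - X_raw.mean(axis=0)) / stds

_, raw_coefs = ols(X_raw, y_raw)
_, std_coefs = ols(X_std, y_raw)

# Standardized weight = raw coefficient * feature standard deviation
print(raw_coefs * stds)
print(std_coefs)
```

A standardized weight therefore reads as the change in the target per one standard deviation of the feature, which puts all features on the same footing.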

**Predicting Targets with Standardized Features**

In [None]:
new_houses3 = pd.DataFrame([[643, 2015], [600, 2008], [800, 2021], [300, 2004], [600, 2018]], columns = ["size", "year"])  # same column names as x2, which the scaler was fitted on
reg3.predict(new_houses3).round(2)  # gives values far higher than any price in the sample

The reason for these strange predictions is that we trained the model on standardized features, but here we passed unstandardized values of the independent variables.

In [None]:
standardized_new_houses3 = scaler.transform(new_houses3)  # transforming new data into standardized form

In [None]:
new_houses3["predictions"] = reg3.predict(standardized_new_houses3).round(0)  # predicting values and saving them in df

In [None]:
new_houses3
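
One way to avoid forgetting the scaling step at prediction time is to bundle the scaler and the regression into a single sklearn pipeline. This is an optional alternative to the manual steps above, sketched on synthetic data; the variable names and the numbers used to generate the data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the size/year features
rng = np.random.default_rng(2)
X_train = np.column_stack([rng.uniform(400, 1000, 80), rng.uniform(2000, 2020, 80)])
y_train = 200.0 * X_train[:, 0] + 2500.0 * X_train[:, 1] + rng.normal(0, 1000, 80)

# The pipeline scales inside fit() and predict(), so raw values can be passed directly
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

X_new = np.array([[643.0, 2015.0], [600.0, 2008.0]])
pipeline_predictions = model.predict(X_new)

# Same result as scaling by hand and then predicting
scaler_step = model.named_steps["standardscaler"]
regression_step = model.named_steps["linearregression"]
manual_predictions = regression_step.predict(scaler_step.transform(X_new))
print(pipeline_predictions, manual_predictions)
```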

**What if we had removed year as it was not as significant as size?**

In [None]:
reg_simple = LinearRegression()
x_simple_matrix = x2_scaled[:, 0].reshape(-1, 1)  # keeping only the first (standardized size) column as a single-column matrix
reg_simple.fit(x_simple_matrix, y2)
reg_simple.predict(standardized_new_houses3[:, 0].reshape(-1, 1)).round()  # same column selection for the new data

These predictions differ from the ones we got with both features, but not by much. So keeping 'year' in the model did little harm: the regression assigned a smaller weight to the less significant feature.