# Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables, respectively). The case of one explanatory variable is called simple linear regression.
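
For the single-variable case, the least-squares line can be written in closed form. The following is a minimal sketch on made-up numbers (not the housing data used below); `fit_simple_line` is a hypothetical helper introduced only for illustration.

```python
def fit_simple_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (illustrative values only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_line(xs, ys)
print(slope, intercept)
```

With several explanatory variables the same idea applies, but the coefficients are found with matrix methods, which is what `LinearRegression` from sklearn does for us later in this notebook.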



# Description
This notebook uses a real-estate dataset concerning housing values. The target is the `Y house price of unit area` column.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'  # fully transparent figure background

%matplotlib inline

In [None]:
df=pd.read_csv('/kaggle/input/real-estate-price-prediction/Real estate.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

No missing entries.

In [None]:
df.describe()

In [None]:
df.isnull().values.any()

In [None]:
sns.pairplot(df,diag_kind='kde')

In [None]:
len(df.columns)

In [None]:
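# displot is a figure-level function and creates its own figure,
# so the size is set through height/aspect instead of plt.figure
sns.displot(df['Y house price of unit area'], bins=30, kde=True, height=5, aspect=1.6)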

In [None]:
plt.figure(figsize=(16,10))
for i, col in enumerate(df.columns):
    plt.subplot(4, 4, i + 1)
    # pass the column as x explicitly; newer seaborn treats the first positional argument as `data`
    sns.boxplot(x=df[col])
plt.tight_layout()
plt.show()

In [None]:
sns.heatmap(df.corr(), annot=True,cmap='Greens')

# Determine the Features & Target Variable

In [None]:
X = df.drop(['No', 'Y house price of unit area'],axis=1)
y = df['Y house price of unit area']

# Split the Dataset to Train & Test
To split the data we will be using train_test_split from sklearn.

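train_test_split randomly divides the data into a training set and a testing set according to the given ratio. Below, test_size=0.2 holds out 20% of the rows for testing, and random_state fixes the shuffle so that the split is reproducible.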
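
To make the idea of a random split concrete, here is a conceptual sketch in plain Python. It is not how sklearn implements `train_test_split`; `simple_split` is a hypothetical helper and the data is a toy list of row ids.

```python
import random


def simple_split(rows, test_size=0.2, seed=101):
    """Shuffle row positions and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


rows = list(range(10))  # ten toy "rows"
train, test = simple_split(rows, test_size=0.2)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and rerunning with the same seed gives the same split.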

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model= LinearRegression()

In [None]:
model.fit(X_train, y_train)

# Coefficient Matrix
In linear algebra, a coefficient matrix is a matrix consisting of the coefficients of the variables in a set of linear equations. The matrix is used in solving systems of linear equations.
In mathematics, a system of linear equations is a collection of one or more linear equations involving the same set of variables. For the fitted model, model.coef_ holds one coefficient per feature.
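
In sklearn, a fitted `LinearRegression` stores one coefficient per feature in `coef_` and the constant term in `intercept_`. A prediction is the intercept plus the coefficient-weighted sum of the feature values. The sketch below shows that arithmetic with made-up numbers; the coefficients, intercept and feature values are hypothetical and do not come from the housing model.

```python
def predict_one(coefficients, intercept, features):
    """Linear prediction: intercept + sum of coefficient * feature value."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


coefficients = [0.5, -2.0, 1.5]  # hypothetical fitted coefficients
intercept = 10.0                 # hypothetical intercept
features = [4.0, 1.0, 2.0]       # one hypothetical row of feature values

prediction = predict_one(coefficients, intercept, features)
print(prediction)  # 10 + 0.5*4 - 2.0*1 + 1.5*2 = 13.0
```

Read this way, each coefficient is the change in the predicted price for a one-unit increase in that feature, with the other features held fixed.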

In [None]:
pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])

In [None]:
y_pred=model.predict(X_test)

# Evaluating the Model
Model evaluation is an integral part of the model development process. It helps to find the model that best represents our data and to estimate how well the chosen model will work in the future. Evaluating model performance on the data used for training is not acceptable in data science, because it easily produces overoptimistic estimates and overfitted models. There are two common methods of evaluating models: Hold-Out and Cross-Validation. To avoid overfitting, both methods evaluate performance on data not seen by the model during training. In the Hold-Out method, a (usually large) dataset is randomly divided into three subsets:

Training set: a subset of the dataset used to build predictive models.

Validation set: a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.

Test set (unseen examples): a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
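
Cross-Validation, the second method mentioned above, rotates the held-out part so that every row is used for validation exactly once. The following is a minimal sketch of how k-fold index sets could be generated in plain Python. `k_fold_indices` is a hypothetical helper, there is no shuffling, and it is not sklearn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `val_idx`, and the k scores averaged.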

In [None]:
from sklearn import metrics
MAE= metrics.mean_absolute_error(y_test, y_pred)
MSE= metrics.mean_squared_error(y_test, y_pred)
RMSE=np.sqrt(MSE)

pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics'])
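
The three metrics computed in the cell above can also be worked out by hand. This small sketch does so on four made-up values (not the housing predictions) to show what each number means.

```python
import math

y_true = [10.0, 20.0, 30.0, 40.0]  # made-up observed values
y_hat = [12.0, 18.0, 33.0, 40.0]   # made-up predictions

errors = [t - p for t, p in zip(y_true, y_hat)]   # -2, 2, -3, 0

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error: 7 / 4
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error: 17 / 4
rmse = math.sqrt(mse)                             # root of MSE, in the units of y

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target, which is why they can be compared against the mean of the target column. MSE is in squared units and penalises large errors more heavily.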

In [None]:
df['Y house price of unit area'].mean()

In [None]:
test_residuals=y_test-y_pred
test_residuals

To compare the shapes of the training and testing sets, use the following piece of code:

In [None]:
print("shape of original dataset :", df.shape)
print("shape of input - training set", X_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", X_test.shape)
print("shape of output - testing set", y_test.shape)

# Residuals
In statistics and optimization, errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "theoretical value". The error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean), and the residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean). The distinction is most important in regression analysis, where the concepts are sometimes called the regression errors and regression residuals and where they lead to the concept of studentized residuals.

What is a Residual in Regression?
When you perform simple linear regression (or any other type of regression analysis), you get a line of best fit. The data points usually don’t fall exactly on this regression equation line; they are scattered around. A residual is the vertical distance between a data point and the regression line. Each data point has one residual. They are:

Positive if they are above the regression line,
Negative if they are below the regression line,
Zero if the regression line actually passes through the point.

As residuals are the difference between any data point and the regression line, they are sometimes called “errors.” Error in this context doesn’t mean that there’s something wrong with the analysis; it just means that there is some unexplained difference. In other words, the residual is the error that isn’t explained by the regression line.

The residual (e) can also be expressed with an equation: e = y − ŷ, that is, the observed value (y) minus the predicted value (ŷ). In a scatter plot, the data points are the observed values, while the regression line gives the predictions.
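
As a small illustration of the sign convention, the sketch below computes residuals as observed minus predicted on three made-up pairs of values (not taken from the housing data). `position` is a hypothetical helper.

```python
observed = [50.0, 42.0, 37.5]   # made-up observed values
predicted = [47.0, 42.0, 40.0]  # made-up predictions from some fitted line

# residual = observed - predicted
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]


def position(residual):
    """Describe where the data point lies relative to the regression line."""
    if residual > 0:
        return "above the line"
    if residual < 0:
        return "below the line"
    return "on the line"


labels = [position(r) for r in residuals]
for r, label in zip(residuals, labels):
    print(r, label)
```

This is the same convention as `test_residuals = y_test - y_pred` used earlier in the notebook.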

In [None]:
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel('Y-Test')
plt.ylabel('Y-Pred')

In [None]:
sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='c', ls='--')

In [None]:
sns.displot(test_residuals, bins=25, kde=True)

# Let us try SVR now.

In [None]:
from sklearn.svm import LinearSVR
svr = LinearSVR(max_iter=15000)

In [None]:
from sklearn.model_selection import GridSearchCV
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
params ={'max_iter':[5000*i for i in range(1,10)], 'C':[.1*i for i in range(1,6)]}

In [None]:
clf = GridSearchCV(svr, params)

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.best_params_

In [None]:
clf.best_estimator_

In [None]:
clf.best_estimator_.score(X_test, y_test)

In [None]:
svr_pred = clf.best_estimator_.predict(X_test)
svr_residual = y_test - svr_pred  # observed minus predicted, same convention as test_residuals

In [None]:
plt.scatter(x= svr_pred, y=y_test)

In [None]:
sns.displot(svr_residual, kde=True)  # distplot is deprecated in seaborn; displot is its replacement

It looks like this model is doing reasonably well: the residual density peaks around zero, which suggests that most predictions are not far from the observed values. To really know how well it is doing, we would need to test its predictions on more data. Hope it helps. Thanks!