# **What is a Linear Function?**

To start with, we will look at simple **Linear Regression** to test how well a single variable, 'sqft_living', predicts a house's 'price'.

* Y : The response/dependent variable
* X : The predictor/independent variable

The result of linear regression is a function that predicts the response variable Y ('price') from the predictor variable X ('sqft_living'):

**Yhat = a + bX**

* a refers to the **intercept** of the regression line, in other words **the value of Y when X is 0.**
* b refers to the **slope** of the regression line, in other words **the amount by which Y changes when X increases by 1 unit.**
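
To make the roles of a and b concrete, here is a minimal sketch that computes both with the closed-form least-squares formulas. The four data points are made up for illustration (they are not taken from the housing data) and lie exactly on the line price = 50000 + 200 * sqft.

```python
import numpy as np

# Hypothetical toy data, not from kc_house_data.csv
X = np.array([1000.0, 1500.0, 2000.0, 2500.0])          # living area in sqft
Y = np.array([250000.0, 350000.0, 450000.0, 550000.0])  # price

# Slope: b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Intercept: a = mean(Y) - b * mean(X)
a = Y.mean() - b * X.mean()

def predict(x):
    """Yhat = a + bX"""
    return a + b * x

print("intercept a:", a)
print("slope b:", b)
print("Yhat for 3000 sqft:", predict(3000.0))
```

Here b says how much the predicted price rises per extra square foot, and a is the predicted price at 0 sqft, which is mostly a mathematical anchor rather than a realistic house.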

First, let's look at the data and see why we picked 'sqft_living' as X and 'price' as Y.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

data = pd.read_csv("/kaggle/input/housesalesprediction/kc_house_data.csv", parse_dates=['date'])

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
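# In this dataset yr_renovated is 0 for houses that were never renovated,
# so a plain subtraction would give large negative values for them.
data['years_renovated'] = np.where(data['yr_renovated'] > 0, data['yr_renovated'] - data['yr_built'], 0)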

In [None]:
numeric_data = ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']
numeric_analysis = data[numeric_data].corr()  # pairwise correlations between the numeric columns
sns.heatmap(numeric_analysis)

In [None]:
y_data = data['price']
x_data = data.drop('price', axis=1)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.20, random_state=1)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train[['sqft_living']],y_train)


In [None]:
print("R-squared is:", lr.score(x_test[['sqft_living']], y_test))

In [None]:
from sklearn.metrics import mean_squared_error
y_hat = lr.predict(x_test[['sqft_living']])
mse = mean_squared_error(y_test, y_hat)

print("The mean squared error is: ", mse)
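
For reference, both metrics above can be computed by hand. The sketch below uses small made-up arrays (not the housing data) to show the arithmetic; `y_true` and `y_pred` are hypothetical names standing in for `y_test` and `y_hat`.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Mean squared error: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)
print("R-squared:", r2_manual)
```

R-squared is unitless and compares the model against always predicting the mean, while the MSE is in squared price units, so the two numbers answer different questions.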

**Regression Plot**

When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.

This plot combines the scattered data points (a scatter plot) with the fitted linear regression line drawn through them. It gives a reasonable picture of the relationship between the two variables: the strength of the correlation and its direction (positive or negative).

In [None]:
plt.figure(figsize=(12,10))
sns.regplot(x='sqft_living',y='price',data=data)

**Residual Plot**

A good way to check whether the variance of the errors stays constant is to use a residual plot.

What is a residual?

The difference between the observed value (Y) and the predicted value (Yhat) is called the residual (e = Y - Yhat). When we look at a regression plot, the residual is the vertical distance from the data point to the fitted regression line.
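
As a minimal sketch of that definition, the snippet below fits a line to a few made-up points with `np.polyfit` and computes the residuals. The data is hypothetical and only meant to show the arithmetic.

```python
import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with deg=1 returns the coefficients as [slope, intercept]
b, a = np.polyfit(x, y, deg=1)

y_hat = a + b * x      # predicted values on the fitted line
residuals = y - y_hat  # e = Y - Yhat, one residual per data point

print(residuals)
print("sum of residuals:", residuals.sum())
```

For a least-squares line with an intercept, the residuals sum to zero up to floating-point error, which is why a residual plot is centered on the horizontal axis.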

So what is a residual plot?

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

What do we pay attention to when looking at a residual plot?

We look at the spread of the residuals:

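- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread residuals show no leftover pattern and roughly constant variance, which suggests the line has captured the relationship. A curve or a funnel shape in the residuals suggests that a non-linear model or a transformation of the variables may fit better.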
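
One rough numeric companion to eyeballing the plot is to compare the spread of the residuals for small and large values of X. The sketch below is only an illustration on synthetic data: the helper `spread_ratio` and both data sets are made up here, and this is not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 2000
x = np.linspace(1.0, 10.0, n)

# Two synthetic data sets scattered around the same line y = 2 + 3x
y_const = 2 + 3 * x + rng.normal(0.0, 1.0, n)     # noise with constant variance
y_grow = 2 + 3 * x + rng.normal(0.0, 1.0, n) * x  # noise that grows with x

def spread_ratio(x, y):
    """Std of residuals in the upper half of x divided by std in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    upper = x > np.median(x)
    return residuals[upper].std() / residuals[~upper].std()

ratio_const = spread_ratio(x, y_const)
ratio_grow = spread_ratio(x, y_grow)
print("constant-variance data:", ratio_const)
print("growing-variance data:", ratio_grow)
```

A ratio near 1 matches the "randomly spread" case, while a ratio well above 1 corresponds to the funnel shape where the residuals fan out as X increases.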

In [None]:
plt.figure(figsize=(12,10))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.residplot(x='sqft_living', y='price', data=data)

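When the residuals are not randomly spread, one remedy is to transform the variables, for example by taking logarithms (see https://stattrek.com/regression/linear-transformation.aspx?tutorial=ap). Below, both 'sqft_living' and 'price' are log-transformed before fitting.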

In [None]:
newy_train = np.log(y_train)
newy_test = np.log(y_test)

In [None]:
lmr = LinearRegression()
lmr.fit(np.log(x_train[["sqft_living"]]), newy_train)

In [None]:
# The model was trained on log(sqft_living), so the test input must be log-transformed too
newy_hat = lmr.predict(np.log(x_test[['sqft_living']]))
lmr.score(np.log(x_test[['sqft_living']]), newy_test)
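
Because this model is fit on log(price), its predictions are on the log scale and need `np.exp` to get back to prices. The sketch below illustrates the round trip on made-up data that follows an exact power law; the numbers and the helper `predict_price` are hypothetical.

```python
import numpy as np

# Hypothetical data following price = 300 * sqft^0.9 exactly
sqft = np.array([500.0, 1000.0, 2000.0, 4000.0])
price = 300.0 * sqft ** 0.9

# Fit a straight line in log-log space: log(price) = intercept + slope * log(sqft)
slope, intercept = np.polyfit(np.log(sqft), np.log(price), deg=1)

def predict_price(sqft_value):
    """Predict on the log scale, then undo the log with exp."""
    log_price = intercept + slope * np.log(sqft_value)
    return np.exp(log_price)

print("slope in log-log space:", slope)
print("predicted price for 3000 sqft:", predict_price(3000.0))
```

In a log-log fit the slope reads as an elasticity: a 1% increase in living area goes with roughly a slope-percent change in price. Note also that the R-squared reported for the log model is measured on log(price), so it is not directly comparable to the R-squared of the untransformed model.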

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

added_features = ['sqft_living','grade', 'sqft_above', 'sqft_living15','bathrooms','view','sqft_basement','lat','waterfront','yr_built','bedrooms','years_renovated']
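X_data = data[added_features]
Y_data = data['price']
X_train,X_test,Y_train,Y_test = train_test_split(X_data,Y_data,test_size=0.20, random_state=1)

# Fit the scaler on the training split only, so no test-set statistics leak into training
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape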

In [None]:
lr_final = LinearRegression()
lr_final.fit(X_train, Y_train)

In [None]:
lr_final.score(X_test,Y_test)

In [None]:
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0))
reg.fit(X_train, Y_train)

In [None]:
reg.score(X_test, Y_test)

In [None]:
from sklearn import linear_model
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train, Y_train)

In [None]:
lasso.score(X_test, Y_test)

In [None]:
from sklearn.linear_model import SGDRegressor
clf = SGDRegressor(eta0=0.1, penalty="l2", max_iter=100)
clf.fit(X_train, Y_train)

In [None]:
clf.score(X_test, Y_test)

In [None]:
from xgboost import XGBRegressor

# reg_lambda (L2 regularization) was misspelled as reg_lamda
my_model = XGBRegressor(subsample=0.2, gamma=1000, reg_alpha=0.8, reg_lambda=0.8, n_estimators=1000, learning_rate=0.06)
my_model.fit(X_train, Y_train, early_stopping_rounds=5, 
             eval_set=[(X_test, Y_test)], verbose=False)

In [None]:
predictions = my_model.predict(X_test)

In [None]:
print("R2 : " + str(my_model.score(X_test, Y_test)))

In [None]:
from sklearn.metrics import explained_variance_score

explained_variance_score(Y_test, predictions)