 <h1> Welcome to this Kernel </h1>

<h4>About this Notebook</h4>
<p>We often use <b>model development</b> to predict future observations from the data we have.</p>
<p>A model helps us understand how different variables relate to one another and how those variables can be used to predict a result.</p>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
path = "../input/housesalesprediction/kc_house_data.csv"
df = pd.read_csv(path)

In [None]:
df.head(9)

In [None]:
df.describe()

In [None]:
cdf = df[['id','price','bedrooms','bathrooms','sqft_living','floors']]

In [None]:
cdf.head()

In [None]:
viz = cdf[['id','floors','sqft_living','bedrooms','bathrooms','price']]

In [None]:
viz.hist()  # histogram of each selected column
plt.show()

**Now, let's plot each of these features against the price to see how linear their relationship is:**

In [None]:
plt.scatter(cdf.bedrooms, cdf.price,  color='blue')
plt.xlabel("The number of bedrooms")
plt.ylabel("The price in USD")
plt.show()

In [None]:
plt.scatter(cdf.bathrooms, cdf.price,  color='blue')
plt.xlabel("The number of bathrooms")
plt.ylabel("The price in USD")
plt.show()

In [None]:
plt.scatter(cdf.sqft_living, cdf.price,  color='blue')
plt.xlabel("The living area in square feet")
plt.ylabel("The price in USD")
plt.show()

#### Creating train and test dataset
A train/test split divides the dataset into a training set and a testing set that are mutually exclusive. You train the model on the training set and evaluate it on the testing set.
This gives a more realistic estimate of out-of-sample accuracy, because the testing set is not part of the data used to train the model.

We know the true outcome of every data point in the testing set, which makes it well suited for evaluation. And since the model has never seen this data, it has no knowledge of those outcomes, so this is genuine out-of-sample testing.

Let's split our dataset into train and test sets, with roughly 80% of the data for training and the remaining 20% for testing. We create a mask to select random rows using the __np.random.rand()__ function:

In [None]:
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
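
The random mask gives a split that is only approximately 80/20 and that changes on every run. As a minimal sketch of an alternative, assuming scikit-learn's `train_test_split` is acceptable here, the example below produces an exact and repeatable split. The small `demo` DataFrame is made up for illustration and is not the King County data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data, not the King County data set
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft_living": rng.integers(500, 4000, size=100),
    "price": rng.integers(100_000, 900_000, size=100),
})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

print(len(demo_train), len(demo_test))
```

Fixing `random_state` means the evaluation numbers stay the same between runs, which makes it easier to compare models.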

<h2 id="simple_regression">I. Simple Regression Model</h2>
Linear regression fits a linear model with coefficients $\theta = (\theta_1, ..., \theta_n)$ to minimize the residual sum of squares between the observed values of the dependent variable y in the dataset and the values predicted by the linear approximation.

#### **Train data distribution**
We use the sklearn package to model the data.

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
train_x = np.asanyarray(train[['sqft_living']])
train_y = np.asanyarray(train[['price']])
reg.fit(train_x,train_y)
# The coefficients
print ('Coefficients: ', reg.coef_)
print('Intercept:', reg.intercept_)

 __Coefficient__ and __Intercept__ are the parameters of the fitted line in simple linear regression.
Because simple linear regression has only two parameters, the intercept and the slope of the line, sklearn can estimate them directly from our data.
Note that all of the training data must be available to compute these parameters.
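
To make the two parameters concrete, here is a minimal sketch that computes the slope and intercept with the standard closed-form least-squares formulas and compares them with scikit-learn's result. The toy arrays `x` and `y` are hypothetical and chosen to lie exactly on the line y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for a single predictor
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Same fit via scikit-learn (which expects a 2-D feature array)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, intercept)
print(model.coef_[0], model.intercept_)
```

Both approaches need every training point, since the means and sums run over the whole sample.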

#### **Plot outputs**

We can plot the fitted line over the data:

In [None]:
plt.scatter(train.sqft_living, train.price,  color='blue')
# Draw the fitted line: price = slope * sqft_living + intercept
plt.plot(train_x, reg.coef_[0][0]*train_x + reg.intercept_[0], '-r')
plt.xlabel("The living area in square feet")
plt.ylabel("The price in USD")
plt.show()

#### Evaluation
We compare the actual values with the predicted values to measure how well a regression model performs. Evaluation metrics play a key role in model development because they show where the model needs improvement.

There are several model evaluation metrics. Let's compute MAE, MSE and R-squared on the test set:
<ul>
    <li> Mean Absolute Error (MAE): the mean of the absolute values of the errors. It is the easiest metric to interpret because it is simply the average size of the error.</li>
    <li> Mean Squared Error (MSE): the mean of the squared errors. It is often preferred over MAE when large errors matter most, because squaring makes an error's contribution grow quadratically with its size.</li>
    <li> Root Mean Squared Error (RMSE): the square root of the MSE, which expresses the error in the same units as the target. </li>
    <li> R-squared is not an error, but a popular measure of how well the model fits. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).</li>
</ul>
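
As a small worked example of these definitions, the sketch below computes each metric by hand with NumPy and checks it against `sklearn.metrics`. The four-element arrays `y_true` and `y_pred` are hypothetical values picked so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true                 # [-1, 0, 1, 2]

mae = np.mean(np.abs(errors))            # (1 + 0 + 1 + 2) / 4
mse = np.mean(errors ** 2)               # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

Note that the scikit-learn functions take the true values as the first argument and the predictions as the second.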

In [None]:
from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['sqft_living']])
test_y = np.asanyarray(test[['price']])
test_y_hat = reg.predict(test_x)

print("Mean absolute error (MAE): %.2f" % np.mean(np.absolute(test_y_hat - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_hat - test_y) ** 2))
# r2_score expects the true values first, then the predictions
print("R2-score: %.2f" % r2_score(test_y, test_y_hat))

<h2 id="multiple_regression_model">II. Multiple Regression Model</h2>

<p>What if we want to predict the price using more than one variable?</p>

<p>If we want to use more variables in our model to predict the price, we can use <b>Multiple Linear Regression</b>.
Multiple Linear Regression is very similar to Simple Linear Regression, but it models the relationship between one continuous response or target (dependent) variable and <b>two or more</b> predictor (independent) variables.
Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but the results generalize to any number of predictors:</p>

$$
\begin{aligned}
Y &: \text{target variable}\\
X_1 &: \text{predictor variable 1}\\
X_2 &: \text{predictor variable 2}\\
X_3 &: \text{predictor variable 3}\\
X_4 &: \text{predictor variable 4}
\end{aligned}
$$

$$
\begin{aligned}
a &: \text{intercept}\\
b_1 &: \text{coefficient of variable 1}\\
b_2 &: \text{coefficient of variable 2}\\
b_3 &: \text{coefficient of variable 3}\\
b_4 &: \text{coefficient of variable 4}
\end{aligned}
$$

The predicted value $\hat{Y}$ is given by:
$$
\hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$

In reality, several variables influence the price. When more than one independent variable is present, the process is called multiple linear regression. For example, we can predict the price from the number of bedrooms, bathrooms and floors together with sqft_living. Conveniently, multiple linear regression is a direct extension of the simple linear regression model.

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
x = np.asanyarray(train[['bedrooms','bathrooms','sqft_living','floors']])
y = np.asanyarray(train[['price']])
reg.fit(x,y)
# The coefficients
print ('Intercept: ', reg.intercept_)
print ('Coefficients: ', reg.coef_)

As mentioned before, the __Coefficients__ and the __Intercept__ are the parameters of the fit. 
Because this is a multiple linear regression with 5 parameters (the intercept and the four coefficients of the hyperplane), sklearn can estimate them from our data. Scikit-learn uses the plain Ordinary Least Squares method to solve this problem.

#### Ordinary Least Squares (OLS)
OLS is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the target dependent variable and the values predicted by the linear function. In other words, it tries to minimize the sum of squared errors (SSE), or equivalently the mean squared error (MSE), between the target variable (y) and our predicted output ($\hat{y}$) over all samples in the dataset.

OLS can find the best parameters using one of the following methods:

- Solving for the model parameters analytically using closed-form equations
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton’s Method, etc.)
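
The two routes can be compared on a small problem. The sketch below is an illustration under stated assumptions, not a description of scikit-learn's internals: it builds hypothetical noise-free data with a known intercept and coefficients, solves the least-squares problem directly with `np.linalg.lstsq`, and then recovers the same parameters with plain gradient descent on the MSE. The learning rate and iteration count are arbitrary choices that happen to work for this data.

```python
import numpy as np

# Hypothetical noise-free data: y = 4 + 3*x1 - 2*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]

# Prepend a column of ones so the first parameter acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# 1) Direct least-squares solution
theta_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

# 2) Gradient descent on the mean squared error
theta_gd = np.zeros(A.shape[1])
learning_rate = 0.1
for _ in range(2000):
    gradient = 2.0 / len(y) * A.T @ (A @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print(theta_closed)
print(theta_gd)
```

The direct solution is exact in one step, while gradient descent approaches it iteratively, which matters mostly when the dataset is too large for a direct solve.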

 <h2 id="prediction">Prediction</h2>

In [None]:
x = np.asanyarray(test[['bedrooms','bathrooms','sqft_living','floors']])
y = np.asanyarray(test[['price']])
# Predict from the NumPy array, matching the array type used to fit the model
y_hat = reg.predict(x)
print("Mean squared error (MSE): %.2f"
      % np.mean((y_hat - y) ** 2))

# reg.score returns the R-squared score: 1 is perfect prediction
print('Variance score (R2): %.2f' % reg.score(x, y))

__Explained variance regression score:__  
If $\hat{y}$ is the estimated target output, y the corresponding (correct) target output, and Var is the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$\texttt{explainedVariance}(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}$  
The best possible score is 1.0; lower values are worse. Note that `reg.score` returns the R-squared score, which coincides with the explained variance when the residuals have zero mean.
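
The explained variance and R-squared are closely related but not identical. The sketch below uses hypothetical predictions that are off by a constant amount, to show that the explained variance ignores a constant bias while R-squared penalizes it. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Hypothetical values: every prediction is too high by exactly 1
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = y_true + 1.0

# Explained variance from the formula: 1 - Var(y - y_hat) / Var(y)
ev_manual = 1 - np.var(y_true - y_pred) / np.var(y_true)

ev_sklearn = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(ev_manual, ev_sklearn, r2)
```

Here the residuals are constant, so their variance is zero and the explained variance is perfect, while R-squared is lower because it also counts the bias as error.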

<h3>Thanks for completing this lesson!</h3>

<h4>Author:  <a href="https://www.linkedin.com/in/ibrahim-bahbah-491435172/">Ibrahim BAHBAH</a></h4>
<p><a href="https://www.linkedin.com/in/ibrahim-bahbah-491435172/i">Ibrahim BAHBAH</a> is an ambitious data science student striving to apply a data-driven approach to problem solving.</p>