<a href="https://colab.research.google.com/github/waelrash1/DL/blob/master/Linear%20regression%20in%20python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistical models from statsmodels.api

---
## Simple and Multiple Linear Regression


In [None]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn import datasets ## imports datasets from scikit-learn
import matplotlib.pyplot as plt



In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
data = datasets.load_boston() ## loads Boston dataset from datasets library

In [None]:
# Only the last expression of a cell is displayed automatically, so print both
print(data.feature_names)
print(data.DESCR)

In [None]:
# define the data/predictors as the pre-set feature names  
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

In [None]:
df

In [None]:

target

## Simple OLS without constant (intercept/bias)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

X = df["RM"]
y = target["MEDV"]
#colors = np.random.rand(506)

plt.scatter(X,y, alpha=0.4)
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.title('Boston housing')
plt.show()

In [None]:
## Without a constant

# Note the argument order: sm.OLS takes the response y first, then the predictors X
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

**Interpreting the Table**: This is a long table. The header lists the dependent variable, the model, and the method. OLS stands for Ordinary Least Squares; “Least Squares” means we fit the regression line that minimizes the sum of squared vertical distances between the observations and the line. Date and Time are self-explanatory, as is the number of observations. Df Residuals and Df Model are the degrees of freedom: “the number of values in the final calculation of a statistic that are free to vary.”

The coefficient of 3.6534 means that as the RM variable increases by 1, the predicted value of MEDV increases by 3.6534. A few other important values are the R-squared, the proportion of variance our model explains (statsmodels reports the uncentered version when the model has no constant, so it is not directly comparable with the R-squared of a model that includes one); the standard error, which is the standard deviation of the sampling distribution of the coefficient estimate; the t score and p-value for the hypothesis test that the coefficient is zero (RM has a statistically significant p-value); and the 95% confidence interval for the RM coefficient (meaning we are 95% confident that the true coefficient of RM lies between 3.548 and 3.759).
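
The quantities in the summary table can be reproduced by hand for a line through the origin. The following is a minimal sketch on synthetic data (an assumption; it does not use the Boston dataset), and it uses the large-sample value 1.96 for the interval where statsmodels uses the exact t distribution, so the bounds are only approximate.

```python
import numpy as np

# Synthetic data (an assumption, not the Boston dataset): y is roughly 3.5 * x.
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=200)
y = 3.5 * x + rng.normal(0.0, 2.0, size=200)

# Least-squares slope for a line through the origin: sum(x*y) / sum(x*x).
slope = np.sum(x * y) / np.sum(x * x)

# Residual variance estimate with n - 1 degrees of freedom (one fitted parameter).
residuals = y - slope * x
n = x.size
sigma2 = np.sum(residuals ** 2) / (n - 1)

# Standard error of the slope, its t statistic, and an approximate 95% interval.
std_err = np.sqrt(sigma2 / np.sum(x * x))
t_stat = slope / std_err
ci_low, ci_high = slope - 1.96 * std_err, slope + 1.96 * std_err

print(f"slope={slope:.4f} std_err={std_err:.4f} t={t_stat:.1f}")
print(f"approx. 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

The interval describes uncertainty about the slope coefficient, not about the values of the predictor itself.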

---



## Simple OLS with constant (intercept/bias)

In [None]:
X = df["RM"] ## X usually means our input variables (or independent variables)
y = target["MEDV"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

**Interpreting the Table**: With the constant term the coefficients are different. Without a constant we force the model to go through the origin; now we have a y-intercept of -34.67, and the slope of the RM predictor changed from 3.6534 to 9.1021.
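
To see why the slope changes so much, here is a small sketch on synthetic data (an assumption; the true intercept of -30 and slope of 9 are chosen only for illustration). Prepending a column of ones is the same idea that `sm.add_constant` implements; the fit itself uses NumPy's `np.linalg.lstsq` rather than statsmodels.

```python
import numpy as np

# Synthetic data (an assumption): a line with a large negative intercept.
rng = np.random.default_rng(1)
x = rng.uniform(4.0, 9.0, size=200)
y = -30.0 + 9.0 * x + rng.normal(0.0, 3.0, size=200)

# Through the origin: the design matrix is a single column of x.
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
slope_no_const = coef_no_const[0]

# With an intercept: prepend a column of ones to the design matrix.
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)
intercept, slope_const = coef_const

print(f"no constant:   slope={slope_no_const:.3f}")
print(f"with constant: intercept={intercept:.3f} slope={slope_const:.3f}")
```

Forcing the line through the origin makes the slope absorb the missing intercept, which is why it comes out much flatter than the slope of the model with a constant.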

## Multiple Regression

In [None]:
X = df[["RM", "LSTAT"]]
X

In [None]:
import pandas as pd
import seaborn as sns
X = df[["RM", "LSTAT"]]
y = target["MEDV"]
# one DataFrame for visualisation
df2=X.assign(MEDV=y)


sns.pairplot(df2,diag_kind = 'kde')
plt.show()

In [None]:
# Multiple Regression
X = df[["RM", "LSTAT"]]
X = sm.add_constant(X)
y = target["MEDV"]
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()

**Interpreting the Output**: We can see here that this model has a much higher R-squared value than the single-predictor model with a constant: 0.639, meaning that this model explains 63.9% of the variance in our dependent variable. Adding variables to a regression model never lowers R², so a higher value alone does not show that the new variable is useful. We can see that both RM and LSTAT are statistically significant in predicting (or estimating) the median house value. Not surprisingly, as RM increases by 1 with LSTAT held fixed, MEDV increases by 5.0948, and when LSTAT increases by 1 with RM held fixed, MEDV decreases by 0.6424. As you may remember, LSTAT is the percentage of lower-status population, and unfortunately we can expect that it will lower the median value of houses. By the same logic, the more rooms in a house, usually the higher its value will be.
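
The behaviour of R² when predictors are added can be checked directly. The sketch below uses synthetic data (an assumption) and a hypothetical helper `r_squared`; it adds a feature that is unrelated to the response by construction and compares R² with adjusted R², which penalizes the extra parameter.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include a ones column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n, k = X.shape  # k counts the intercept column as a parameter
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return r2, adj_r2

# Synthetic data (an assumption): y depends on x1 only.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)  # unrelated to y by construction
y = 2.0 + 5.0 * x1 + rng.normal(size=n)

ones = np.ones(n)
r2_small, adj_small = r_squared(np.column_stack([ones, x1]), y)
r2_big, adj_big = r_squared(np.column_stack([ones, x1, noise_feature]), y)

print(f"one predictor:  R2={r2_small:.4f} adjusted={adj_small:.4f}")
print(f"two predictors: R2={r2_big:.4f} adjusted={adj_big:.4f}")
```

R² for the larger model is at least as high as for the smaller one even though the extra feature carries no information, whereas adjusted R² may fall; this is why the summary table reports both.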

# Linear Regression in SKLearn


In [None]:
from sklearn import linear_model


In [None]:
from sklearn import datasets ## imports datasets from scikit-learn
data = datasets.load_boston() ## loads Boston dataset from datasets library

In [None]:
# define the data/predictors as the pre-set feature names  
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

In [None]:
X = df
y = target["MEDV"]

In [None]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [None]:
predictions = lm.predict(X)  # yhat
print(predictions[0:5])

In [None]:
#R-squared
from sklearn.metrics import r2_score,mean_squared_error
# Both metrics take (y_true, y_pred); R-squared is not symmetric in its arguments
print(r2_score(y, predictions), mean_squared_error(y, predictions))
#lm.score(X,y)

In [None]:
data.feature_names

In [None]:
lm.coef_ # coefficients


In [None]:
lm.intercept_ #Intercept

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Residual plot (residual = observed - predicted)
plt.scatter(predictions, y - predictions, c='g', s=40, alpha=.4)
plt.hlines(y=0, xmin=0, xmax=50,alpha=.4)
plt.title('Residual plot')
plt.ylabel('Residual')