**Predicting Housing Prices for regions in the USA.**
The data contains the following columns:

**'Avg. Area Income':** Average income of residents of the city the house is located in.

**'Avg. Area House Age':** Average age of houses in the same city.

**'Avg. Area Number of Rooms':** Average number of rooms for houses in the same city.

**'Avg. Area Number of Bedrooms':** Average number of bedrooms for houses in the same city.

**'Area Population':** Population of the city the house is located in.

**'Price':** Price that the house sold at.

**'Address':** Address of the house.

In [None]:
#Import all necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#Import the Dataset
df=pd.read_csv('../input/usa-housing/USA_Housing.csv')

In [None]:
#Data Overview
df.head()

In [None]:
print('The dataset has {} rows and {} columns'.format(*df.shape))

In [None]:
df.info()

In [None]:
df.columns

In [None]:
#EDA
#pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure
sns.pairplot(df)

In [None]:
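plt.figure(figsize=(8,5),dpi=150)
#distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['Price'], kde=True, stat='density', color='purple', edgecolor='yellow', linewidth=3)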


The **Price** column looks approximately normally distributed.
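
A histogram is only a visual check. As a rough numerical complement, skewness and excess kurtosis close to zero are consistent with a normal shape. The sketch below uses synthetic data as a stand-in for `df['Price']`; the mean and standard deviation are made-up values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Price']; location and scale are assumptions.
rng = np.random.default_rng(0)
price = pd.Series(rng.normal(loc=1.2e6, scale=3.5e5, size=5000), name='Price')

# For a normal distribution both statistics are close to 0.
skewness = price.skew()
excess_kurtosis = price.kurt()
print(f'skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}')
```

On the real data, the same two calls can be applied directly to `df['Price']`.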

In [None]:
#Correlations between the numeric columns ('Address' is text, so it is excluded)
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='Blues')

In [None]:
#Determine the Features & Target Variable
X=df[['Avg. Area Income','Avg. Area House Age','Avg. Area Number of Rooms','Avg. Area Number of Bedrooms','Area Population']]
y=df['Price']


We split the data into a set of *features (X)* and a *label (y)*.

**Starting Linear Regression model:**

In [None]:
#Split the Dataset to Train & Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Next we split the data into train and test sets, build the model on the training data, and evaluate it on the test data.
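
To make the effect of `test_size=0.3` concrete, here is a minimal sketch on toy arrays (placeholders, not the housing data). The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (placeholder for X and y).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# test_size=0.3 holds out 30% of the rows; the remaining 70% are used for fitting.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.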

In [None]:
#Train the Model
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X_train, y_train)
pd.DataFrame(model.coef_, index=X.columns, columns=['coefficient'])

In [None]:
#predicting Test Data
y_pred= model.predict(X_test)
pd.DataFrame({'Y_Test': y_test,'Y_Pred':y_pred})[:5]

We predict a value for each row of the test set and compare it with the true label to see how accurate the model is.

In [None]:
#Evaluating the Model
from sklearn import metrics
MAE_linear=metrics.mean_absolute_error(y_test , y_pred)
MSE_linear=metrics.mean_squared_error(y_test , y_pred)
RMSE_linear=np.sqrt(MSE_linear)
pd.DataFrame([MAE_linear,MSE_linear,RMSE_linear], index=['MAE_linear','MSE_linear','RMSE_linear'],columns=['Quantity'])

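These metrics summarize the size of the prediction errors: MAE is the average absolute error, MSE is the average squared error, and RMSE is its square root, expressed in the same units as Price.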
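
As a small worked example, the three metrics can be computed by hand with NumPy and checked against scikit-learn. The numbers below are made up for illustration and are unrelated to the housing data.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat          # [-10, 10, -30, 0]
mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error, penalizes large misses more
rmse = np.sqrt(mse)              # back in the units of the target

# The manual results agree with scikit-learn's implementations.
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_hat))
print(mae, mse, rmse)
```

Because the errors are squared before averaging, the single miss of 30 dominates the MSE, which is why RMSE comes out larger than MAE here.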

In [None]:
#Residuals:
test_residual=y_test-y_pred
sns.scatterplot(x=y_test,y=y_pred,color='green' ,s=200)
plt.ylabel('y_pred')
plt.xlabel('y_test')
plt.title('Predicted vs. actual Price')

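The points lie close to a straight line, so the predictions track the actual values and a linear relationship is a reasonable assumption.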

In [None]:
sns.scatterplot(x=y_test,y=test_residual,s=200)
plt.axhline(y=0,color='red',ls='--')

The residuals are scattered randomly around zero with no specific pattern, which suggests the model works well.
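
The same idea can be checked numerically. The sketch below fits a linear model to synthetic data (the slope, intercept, and noise level are arbitrary assumptions) and looks at two summary numbers: the mean of the residuals and their correlation with the fitted values. On the training data of an ordinary least squares fit with an intercept both are zero up to rounding; on a held-out test set, as in this notebook, they should only be close to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 1))
# Linear signal plus Gaussian noise; the coefficients are arbitrary.
y_demo = 3.0 * X_demo[:, 0] + 5.0 + rng.normal(0, 1, size=500)

demo_model = LinearRegression().fit(X_demo, y_demo)
fitted = demo_model.predict(X_demo)
residuals = y_demo - fitted

# A well-specified linear fit leaves residuals centered on zero
# and uncorrelated with the fitted values.
resid_mean = residuals.mean()
resid_corr = np.corrcoef(fitted, residuals)[0, 1]
print(resid_mean, resid_corr)
```

A clear trend or funnel shape in the residual plot would show up as a pattern that these two numbers alone cannot capture, so the plot remains the primary check.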

**Starting Polynomial Model**

In [None]:
# Preprocessing
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter=PolynomialFeatures(degree=2, include_bias=False)
poly_features=polynomial_converter.fit_transform(X)


In [None]:
# Split the Data to Train & Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)


In [None]:
# Train the Model
from sklearn.linear_model import LinearRegression
polymodel=LinearRegression()
polymodel.fit(X_train, y_train)


In [None]:
# Predicting Test Data
y_pred=polymodel.predict(X_test)
pd.DataFrame({'Y_Test': y_test,'Y_Pred':y_pred, 'Residuals':(y_test-y_pred) }).head(5)

In [None]:
# Evaluating the Model
from sklearn import metrics
MAE_Poly = metrics.mean_absolute_error(y_test,y_pred)
MSE_Poly = metrics.mean_squared_error(y_test,y_pred)
RMSE_Poly = np.sqrt(MSE_Poly)

pd.DataFrame([MAE_Poly, MSE_Poly, RMSE_Poly], index=['MAE_Poly', 'MSE_Poly', 'RMSE_Poly'], columns=['metrics'])

In [None]:
# Adjusting Model Parameters
# Train List of RMSE per degree
train_RMSE_list=[]
#Test List of RMSE per degree
test_RMSE_list=[]

for d in range(1,14):
    
    #Preprocessing
    #create poly data set for degree (d)
    polynomial_converter= PolynomialFeatures(degree=d, include_bias=False)
    poly_features= polynomial_converter.fit_transform(X)
    
    #Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)
    
    #Train the Model
    polymodel=LinearRegression()
    polymodel.fit(X_train, y_train)
    
    #Predicting on both Train & Test Data
    y_train_pred=polymodel.predict(X_train)
    y_test_pred=polymodel.predict(X_test)
    
    #Evaluating the Model
    
    #RMSE of Train set
    train_RMSE=np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
    
    #RMSE of Test Set
    test_RMSE=np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
    
    #Append the RMSE to the Train and Test List
    
    train_RMSE_list.append(train_RMSE)
    test_RMSE_list.append(test_RMSE)

In [None]:
pd.DataFrame({'train_RMSE_list': train_RMSE_list,'test_RMSE_list':test_RMSE_list})[:5]

In [None]:
#Plot the Polynomial degree VS RMSE

plt.plot(range(1,14), train_RMSE_list[:13], label='Train RMSE')
plt.plot(range(1,14), test_RMSE_list[:13], label='Test RMSE')

plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE')
plt.legend()

The plot shows that raising the polynomial degree does not improve the test RMSE, so Linear Regression is better than Polynomial Regression for this data.

**Starting Regularization:**
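
One reason higher degrees tend to overfit is how quickly the number of generated features grows. The sketch below counts the output columns of `PolynomialFeatures` for five input features, matching the number of columns in `X`; the zero-filled array is only a placeholder.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Five placeholder feature columns, the same count as in X.
X_demo = np.zeros((1, 5))

feature_counts = {}
for degree in (1, 2, 3, 5):
    converter = PolynomialFeatures(degree=degree, include_bias=False)
    converter.fit(X_demo)
    feature_counts[degree] = converter.n_output_features_

print(feature_counts)  # {1: 5, 2: 20, 3: 55, 5: 251}
```

With more columns than the data can support, the training RMSE can keep falling while the test RMSE stalls or rises.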

In [None]:
#Preprocessing
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter= PolynomialFeatures(degree=1, include_bias=False)
poly_features= polynomial_converter.fit_transform(X)


In [None]:
# Split the Data to Train & Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [None]:
# Scaling the Data
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train)
X_train= scaler.transform(X_train)
X_test= scaler.transform(X_test)

In [None]:
#Regularization
#1- Ridge Regression
#Train the Model
from sklearn.linear_model import Ridge
ridge_model= Ridge(alpha=10)
ridge_model.fit(X_train, y_train)


In [None]:
#predict Test Data
y_pred= ridge_model.predict(X_test)

In [None]:
#Evaluating the Model
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE= mean_absolute_error(y_test, y_pred)
MSE= mean_squared_error(y_test, y_pred)
RMSE= np.sqrt(MSE)
pd.DataFrame([MAE, MSE, RMSE], index=['MAE_Ridge', 'MSE_Ridge', 'RMSE_Ridge'], columns=['metrics'])

In [None]:
#Ridge Regression (choosing an alpha value with cross-validation)
#Train the Model
from sklearn.linear_model import RidgeCV
ridge_cv_model=RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')
ridge_cv_model.fit(X_train, y_train)

In [None]:
ridge_cv_model.alpha_

In [None]:
#Predicting Test Data
y_pred_ridge= ridge_cv_model.predict(X_test)

In [None]:
MAE_ridge= mean_absolute_error(y_test, y_pred_ridge)
MSE_ridge= mean_squared_error(y_test, y_pred_ridge)
RMSE_ridge= np.sqrt(MSE_ridge)
pd.DataFrame([MAE_ridge, MSE_ridge, RMSE_ridge], index=['MAE_ridge_CV', 'MSE_ridge_CV', 'RMSE_ridge_CV'], columns=['Ridge Metrics'])

In [None]:
ridge_cv_model.coef_

In [None]:
#2- Lasso Regression
from sklearn.linear_model import LassoCV
lasso_cv_model= LassoCV(eps=0.01, n_alphas=100, cv=5)
lasso_cv_model.fit(X_train, y_train)


In [None]:
lasso_cv_model.alpha_

In [None]:
y_pred_lasso= lasso_cv_model.predict(X_test)
MAE_Lasso= mean_absolute_error(y_test, y_pred_lasso)
MSE_Lasso= mean_squared_error(y_test, y_pred_lasso)
RMSE_Lasso= np.sqrt(MSE_Lasso)


In [None]:
pd.DataFrame([MAE_Lasso, MSE_Lasso, RMSE_Lasso], index=['MAE_Lasso', 'MSE_Lasso', 'RMSE_Lasso'], columns=['Lasso Metrics'])

In [None]:
lasso_cv_model.coef_

In [None]:
#3- Elastic Net
from sklearn.linear_model import ElasticNetCV
elastic_model= ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],cv=5, max_iter=100000)
elastic_model.fit(X_train, y_train)



In [None]:
elastic_model.l1_ratio_

In [None]:
y_pred_elastic=elastic_model.predict(X_test)

In [None]:
MAE_Elastic= mean_absolute_error(y_test, y_pred_elastic)
MSE_Elastic= mean_squared_error(y_test, y_pred_elastic)
RMSE_Elastic= np.sqrt(MSE_Elastic)

In [None]:
pd.DataFrame([MAE_Elastic, MSE_Elastic, RMSE_Elastic], index=['MAE_Elastic', 'MSE_Elastic', 'RMSE_Elastic'], columns=['Elastic Metrics'])

In [None]:
elastic_model.coef_

So Linear Regression is the best model for this dataset.
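
To compare models side by side, the test metrics can be collected into a single table. The sketch below does this on synthetic data from `make_regression`, used as a stand-in for the housing data; the `candidates` dictionary and the `comparison` table are illustrative names, and the ranking it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 5 features, linear signal plus noise.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

candidates = {
    'Linear': LinearRegression(),
    'RidgeCV': RidgeCV(alphas=(0.1, 1.0, 10.0)),
    'LassoCV': LassoCV(cv=5),
    'ElasticNetCV': ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1], cv=5),
}

rows = []
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, estimator.predict(X_te)))
    rows.append({'model': name, 'test_RMSE': rmse})

# Lowest test RMSE first.
comparison = pd.DataFrame(rows).sort_values('test_RMSE').reset_index(drop=True)
print(comparison)
```

In this notebook the equivalent table could be built from the values already computed, such as `RMSE_linear`, `RMSE_ridge`, `RMSE_Lasso`, and `RMSE_Elastic`.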