# House Price Prediction with Linear Regression

![](https://i.imgur.com/3sw1fY9.jpg)

We predict the price of a house from its various features, using the correlations among those features and a regularized linear regression model (Ridge).

Steps Include:

1. Downloading and exploring the data
2. Preparing the dataset for training
3. Training a linear regression model
4. Making predictions and evaluating the model

## Step 1 - Download and Explore the Data

Let's load the data from the file `train.csv` into a Pandas data frame.

In [None]:
!pip install numpy pandas matplotlib seaborn plotly opendatasets jovian --quiet

In [None]:
import pandas as pd
prices_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
prices_df

In [None]:
prices_df.info()

How many rows and columns does the dataset contain?

In [None]:
n_rows = prices_df.shape[0]
n_cols = len(prices_df.columns)
print('The dataset contains {} rows and {} columns.'.format(n_rows, n_cols))

Let's understand the correlation between the features by computing the correlation matrix and visualizing the dataset.

In [None]:
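# Correlations are only defined for numeric columns, so select those first
prices_df.select_dtypes(include='number').corr()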

Next, we find the features most strongly correlated with `SalePrice`. We take absolute values because both direct (positive) and inverse (negative) relationships are important.

In [None]:
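# Absolute correlation of every numeric column with SalePrice
abs_corr = prices_df.select_dtypes(include='number').corr().abs()['SalePrice']
# Keep only the strong correlations (SalePrice itself appears with a value of 1.0)
top_corr = abs_corr[abs_corr > 0.65].to_frame()
top_corr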
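
To see why the absolute value matters, here is a minimal sketch on made-up data (the column names and numbers below are illustrative assumptions, not values from `train.csv`). A feature that falls as the price rises has a strongly negative correlation and would be missed by a plain `> 0.65` filter on the signed values.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    'price':   [100, 150, 200, 250, 300],
    'quality': [1, 2, 3, 4, 5],        # rises with price
    'age':     [50, 40, 30, 20, 10],   # falls as price rises
    'noise':   [3, 1, 4, 1, 5],        # only weakly related to price
})

# Signed correlation of each feature with the target
signed_corr = toy_df.corr()['price'].drop('price')

# Filtering on the absolute value keeps strong negative relationships too
strong_corr = signed_corr[signed_corr.abs() > 0.65]

print(signed_corr.round(2))
print(list(strong_corr.index))
```

Here `age` is kept even though its signed correlation is negative, while `noise` is dropped.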

In [None]:
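import matplotlib.pyplot as plt
import seaborn as sns
x = prices_df['OverallQual']
y = prices_df['SalePrice']
# Scatter plot of above-ground living area against sale price
plt.scatter(prices_df['GrLivArea'], y)
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')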

In [None]:
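# Distribution of sale price for each overall quality rating
# (recent seaborn versions require x and y to be passed as keyword arguments)
sns.boxplot(x=x, y=y)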

## Step 2 - Preparing the Dataset for Training

Before we can train the model, we need to prepare the dataset. Here are the steps we'll follow:

1. Identify the input and target column(s) for training the model.
2. Identify numeric and categorical input columns.
3. [Impute](https://scikit-learn.org/stable/modules/impute.html) (fill) missing values in numeric columns
4. [Scale](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) values in numeric columns to a $(0,1)$ range.
5. [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) categorical data into one-hot vectors.
6. Split the dataset into training and validation sets.

### Identify Inputs and Targets

While the dataset contains 81 columns, not all of them are useful for modeling. Note the following:

- The first column `Id` is a unique ID for each house and isn't useful for training the model.
- The last column `SalePrice` contains the value we need to predict i.e. it's the target column.
- Data from all the other columns (except the first and the last column) can be used as inputs to the model.

In [None]:
prices_df

In [None]:
# Identify the input columns (a list of column names)
input_cols = list(prices_df.columns[1:-1])

In [None]:
# Identify the name of the target column (a single string, not a list)
target_col = prices_df.columns[-1]

In [None]:
print(list(input_cols))

In [None]:
len(input_cols)

In [None]:
print(target_col)

In [None]:
inputs_df = prices_df[input_cols].copy()
targets = prices_df[target_col]

In [None]:
inputs_df

In [None]:
targets

### Identify Numeric and Categorical Data

The next step in data preparation is to identify numeric and categorical columns. We can do this by looking at the data type of each column.

In [None]:
import numpy as np
numeric_cols = inputs_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = inputs_df.select_dtypes(include='object').columns.tolist()

In [None]:
print(list(numeric_cols))

In [None]:
print(list(categorical_cols))

### Impute Numerical Data

Some of the numeric columns in our dataset contain missing values (`nan`).

In [None]:
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

In [None]:
from sklearn.impute import SimpleImputer
# 1. Create the imputer
imputer = SimpleImputer(strategy='median')
# 2. Fit the imputer to the numeric columns
imputer.fit(inputs_df[numeric_cols])
list(imputer.statistics_)

In [None]:
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

In [None]:
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]
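
As a small illustration of what the median strategy does, here is a sketch on a made-up column (the values are assumptions chosen to include one outlier). The missing entry is replaced by the median of the observed values, which is much less affected by the outlier than the mean would be.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One toy column with a missing value and an outlier (invented values)
toy_column = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

toy_imputer = SimpleImputer(strategy='median')
toy_filled = toy_imputer.fit_transform(toy_column)

# Median of the observed values 1, 2, 4, 100
print(toy_imputer.statistics_)
# The nan in the third row is replaced by that median
print(toy_filled.ravel())
```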

### Scale Numerical Values

The numeric columns in our dataset have varying ranges. 

In [None]:
inputs_df[numeric_cols].describe().loc[['min', 'max']]

A good practice is to [scale numeric features](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) to a small range of values e.g. $(0,1)$. Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.


Let's scale the numeric values to the $(0,1)$ range using `MinMaxScaler` from `sklearn.preprocessing`.

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Create the scaler
scaler = MinMaxScaler()
scaler.fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])

In [None]:
inputs_df[numeric_cols].describe().loc[['min', 'max']]
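
The arithmetic behind min-max scaling can be sketched by hand. With the default settings, each value is transformed as `(x - min) / (max - min)`, using the minimum and maximum of its own column. The values below are made up for illustration.

```python
import numpy as np

# Toy column of areas (invented values)
toy_values = np.array([0.0, 500.0, 1500.0, 2000.0])

# Subtract the column minimum, then divide by the column range
toy_scaled = (toy_values - toy_values.min()) / (toy_values.max() - toy_values.min())

print(toy_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.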

### Encode Categorical Columns

Our dataset contains several categorical columns, each with a different number of categories.

In [None]:
inputs_df[categorical_cols].nunique().sort_values(ascending=False)

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
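
A minimal sketch of the idea on a toy column, using `pd.get_dummies` purely for illustration (this is a swapped-in helper, not the `OneHotEncoder` used in this notebook, and the category values are made up):

```python
import pandas as pd

# Toy categorical column (invented values)
toy_streets = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

# One new 0/1 column per unique category
toy_one_hot = pd.get_dummies(toy_streets, columns=['Street'], dtype=int)

print(toy_one_hot)
```

Each row has a 1 in exactly one of the new columns, marking the category that row belongs to.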

Let's encode the categorical columns in the dataset as one-hot vectors using `OneHotEncoder` from `sklearn.preprocessing`.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# 1. Create the encoder (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# 2. Fit the encoder to the categorical columns
encoder.fit(inputs_df[categorical_cols])

In [None]:
# 3. Generate column names for each category
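# (`get_feature_names_out` replaces the older `get_feature_names`, removed in scikit-learn 1.2)
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))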
encoded_cols

In [None]:
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])
inputs_df

### Training and Validation Set

Finally, let's split the dataset into a training and validation set. We'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [None]:
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df[numeric_cols + encoded_cols], 
                                                                        targets, 
                                                                        test_size=0.25, 
                                                                        random_state=42)
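
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on toy arrays (the data is made up). With 8 rows and `test_size=0.25`, 2 rows go to validation and 6 to training, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy inputs and targets: 8 rows (invented values)
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

# Same arguments, same random_state: the split is reproduced exactly
X_train_again, X_val_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

print(X_train_toy.shape, X_val_toy.shape)
print(np.array_equal(X_val_toy, X_val_again))
```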

In [None]:
train_inputs

In [None]:
train_targets

In [None]:
val_inputs

In [None]:
val_targets

## Step 3 - Train a Linear Regression Model

We're now ready to train the model. Linear regression is a commonly used technique for solving [regression problems](https://jovian.ai/aakashns/python-sklearn-logistic-regression/v/66#C6). In a linear regression model, the target is modeled as a linear combination (or weighted sum) of input features. The predictions from the model are evaluated using a loss function like the Root Mean Squared Error (RMSE).


Here's a visual summary of how a linear regression model is structured:

<img src="https://i.imgur.com/iTM2s5k.png" width="480">

However, plain linear regression doesn't generalize very well when we have a large number of input columns with collinearity, i.e. when the values in one column are highly correlated with values in other column(s). This is because it only tries to fit the training data as closely as possible, which can produce large, unstable weights for the correlated columns.

Instead, we'll use Ridge Regression, a variant of linear regression that uses a technique called L2 regularization to introduce another loss term that forces the model to generalize better. Learn more about ridge regression here: https://www.youtube.com/watch?v=Q81RR3yKn30
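
The effect of the extra loss term can be sketched with NumPy on synthetic data. This is a simplified illustration, not scikit-learn's implementation: it assumes no intercept and solves the objective `||y - Xw||^2 + alpha * ||w||^2` in closed form. The helper name `fit_linear` and the generated data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly identical (highly collinear) toy features
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X_toy = np.column_stack([x1, x2])
y_toy = 3 * x1 + rng.normal(scale=0.1, size=50)

def fit_linear(X, y, alpha):
    """Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = fit_linear(X_toy, y_toy, alpha=0.0)   # ordinary least squares
w_ridge = fit_linear(X_toy, y_toy, alpha=1.0)   # with the L2 penalty

print('plain weights:', w_plain, 'norm:', np.linalg.norm(w_plain))
print('ridge weights:', w_ridge, 'norm:', np.linalg.norm(w_ridge))
```

The penalty shrinks the weight vector, so the ridge weights always have a smaller norm than the unregularized ones, while their combined effect on the prediction stays close to the true coefficient of 3.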

In [None]:
from sklearn.linear_model import Ridge
# Create the model
model = Ridge()
# Fit the model using inputs and targets
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)

## Step 4 - Make Predictions and Evaluate Your Model

The model is now trained, and we can use it to generate predictions for the training and validation inputs. We can evaluate the model's performance using the RMSE (root mean squared error) loss function.

In [None]:
from sklearn.metrics import mean_squared_error
train_preds = model.predict(train_inputs[numeric_cols + encoded_cols])
train_preds

In [None]:
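# Take the square root of the MSE to get the RMSE
# (the `squared=False` argument has been removed from recent scikit-learn versions)
train_rmse = mean_squared_error(train_targets, train_preds) ** 0.5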

In [None]:
print('The RMSE loss for the training set is $ {}.'.format(train_rmse))

In [None]:
val_preds = model.predict(val_inputs)
val_preds

In [None]:
# Square root of the MSE, as for the training set
val_rmse = mean_squared_error(val_targets, val_preds) ** 0.5
print('The RMSE loss for the validation set is $ {}.'.format(val_rmse))
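
For reference, the RMSE computation can be written out by hand. The prices below are made up for illustration. Each error is squared, the squares are averaged, and the square root brings the result back to dollars. Because of the squaring, one large error counts for more than several small ones, which is why the RMSE here is larger than the mean absolute error.

```python
import numpy as np

# Toy actual and predicted prices (invented values)
toy_actual = np.array([200000.0, 150000.0, 300000.0])
toy_predicted = np.array([210000.0, 140000.0, 330000.0])

toy_errors = toy_predicted - toy_actual

# Root mean squared error: sqrt(mean(error^2))
toy_rmse = np.sqrt(np.mean(toy_errors ** 2))

# Mean absolute error, for comparison
toy_mae = np.mean(np.abs(toy_errors))

print('RMSE:', toy_rmse)
print('MAE:', toy_mae)
```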

### Feature Importance

Let's look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

In [None]:
weights = model.coef_
weights_df = pd.DataFrame({
    'columns': train_inputs.columns,
    'weight': weights
}).sort_values('weight', ascending=False)
weights_df
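
Sorting by the signed weight puts large negative weights at the bottom of the table, even though they influence the prediction as strongly as large positive ones. One possible alternative is to rank by absolute value, sketched below on a made-up table (the feature names and weights are hypothetical, not output from the model above).

```python
import pandas as pd

# Hypothetical weights, invented for illustration
toy_weights_df = pd.DataFrame({
    'columns': ['feature_a', 'feature_b', 'feature_c'],
    'weight': [50000.0, 12000.0, -90000.0],
})

# Rank by magnitude so strong negative weights are not buried at the bottom
toy_weights_df['abs_weight'] = toy_weights_df['weight'].abs()
toy_ranked = toy_weights_df.sort_values('abs_weight', ascending=False)

print(toy_ranked)
```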

### Saving the model

Let's save the model (along with other useful objects) to disk, so that we can use it for making predictions without retraining.

In [None]:
import joblib
house_price_predictor = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}
joblib.dump(house_price_predictor, 'house_price_predictor.joblib')
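
Reading the objects back later works the same way in reverse, with `joblib.load`. The sketch below round-trips a small stand-in dictionary through a temporary file. The dictionary contents and file name are placeholders for illustration; with the real file you would load `'house_price_predictor.joblib'` and read `'model'`, `'imputer'` and the other keys from the result.

```python
import os
import tempfile

import joblib

# Stand-in for the real dictionary of model objects (placeholder contents)
toy_bundle = {'numeric_cols': ['a', 'b'], 'target_col': 'price'}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_bundle.joblib')
    joblib.dump(toy_bundle, toy_path)       # write to disk
    toy_restored = joblib.load(toy_path)    # read it back

print(toy_restored['numeric_cols'])
```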

### Making Predictions on the Test Set

Now that we have trained and evaluated our model, let's try it out on the `test.csv` dataset.

In [None]:
testing_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
testing_data

We now have to transform the test data in the same way as the training data, reusing the imputer, scaler and encoder that were already fitted above.

In [None]:
testing_data[numeric_cols] = imputer.transform(testing_data[numeric_cols])
testing_data[numeric_cols] = scaler.transform(testing_data[numeric_cols])
# Pass the data frame itself (not `.values`) so the column names match those seen during fitting
testing_data[encoded_cols] = encoder.transform(testing_data[categorical_cols])
X_input = testing_data[numeric_cols + encoded_cols]
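
One detail worth knowing when reusing a fitted encoder on new data: with `handle_unknown='ignore'`, a category that was not present during fitting is encoded as all zeros instead of raising an error. Here is a small sketch on toy data (the category values are made up, and the encoder's default sparse output is converted with `.toarray()` for printing).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test columns (invented values)
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
toy_test = pd.DataFrame({'Street': ['Grvl', 'Dirt']})   # 'Dirt' was never seen

toy_encoder = OneHotEncoder(handle_unknown='ignore')
toy_encoder.fit(toy_train)

# Columns follow the sorted training categories: Grvl, Pave
toy_encoded = toy_encoder.transform(toy_test).toarray()

print(toy_encoder.categories_)
print(toy_encoded)
```

The row for `'Dirt'` contains only zeros, so the model treats it as belonging to none of the known categories.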

In [None]:
testing_data

Here are our final predictions for the test dataset.

In [None]:
test_preds = model.predict(X_input)
# Keep the same `Id` column name as in the input file
outp = {'Id': testing_data['Id'],
        'SalePrice': test_preds}

In [None]:
out_df = pd.DataFrame(outp)
out_df

In [None]:
out_df.to_csv('output.csv',index=False)