<h1 style="text-align: justify">🛑 In this analysis I build a model that predicts house prices from the numerical variables and a single categorical variable, location</h1>

# 1 | Data Cleaning and Preparation

# Import Libraries


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
import warnings
warnings.filterwarnings('ignore')
matplotlib.rcParams["figure.figsize"] = (16,8)

# 1| Load Dataset 📰

In [None]:
df = pd.read_csv('C:/Users/Mr Eugene/Downloads/Bengaluru_House_Data.csv')

In [None]:
df.head()

# 2| Dataset Exploration

In [None]:
df.shape

In [None]:
area_types_count = df['area_type'].value_counts(normalize=True)

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
area_types_count.plot(kind='bar', ax=ax)
plt.xlabel('Area Type')
plt.ylabel('Relative frequency')
plt.title('Area type counts');

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
(
    df.sort_values(by='price', ascending=False).head(20)
    .groupby('area_type')['area_type'].count()
    .plot(kind='bar', ax=ax)
)
plt.xlabel("Area Type")
plt.ylabel("Frequency")
plt.title("Top 20 expensive houses count (Area Type representation)")
plt.legend();

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
(
    df.sort_values(by='price', ascending=True).head(20)
    .groupby('area_type')['area_type'].count()
    .plot(kind='bar', ax=ax)
)
plt.xlabel("Area Type")
plt.ylabel("Frequency")
plt.title("20 Least expensive houses count (Area Type representation)")
plt.legend();

In [None]:
df['availability'].value_counts()

In [None]:
df['size'].unique()

# 3| Data Cleaning and Dataset Transformation

In [None]:
df.isnull().sum()

In [None]:
numerical_features = df.select_dtypes(include=[int, float])

In [None]:
numerical_features.isnull().sum() > 0

**Missing values in these numerical features can be filled with the column mean rounded to an integer. Rows where a categorical feature is null will be dropped from the dataset.**

In [None]:
numerical_features.apply(lambda x: round(np.mean(x)))

In [None]:
df['bath'] = df['bath'].fillna(3)

In [None]:
df['balcony'] = df['balcony'].fillna(2)

In [None]:
df.isnull().sum()
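
The fill values 3 and 2 used above are typed in by hand from the rounded means. A minimal sketch of deriving them programmatically, using a small made-up frame (the `toy` frame and its values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "bath": [2.0, 3.0, np.nan, 4.0],
    "balcony": [1.0, np.nan, 2.0, 3.0],
})

# Rounded mean of each numeric column, used as that column's fill value
fill_values = toy.mean().round()

# Passing a Series to fillna fills each column with its own value
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

This keeps the fill values in step with the data if the dataset is ever refreshed.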

In [None]:
df.shape

Nothing has been removed from the dataset yet

In [None]:
df['size'].unique()

In [None]:
df['bhk'] = df['size'].str.split(' ', expand=True)[0]
df['bhk'] = df['bhk'].astype(float)

In [None]:
df.head()

Now that `bhk` has been extracted from `size`, we can remove the `size` feature from the dataset

In [None]:
df = df.drop(columns='size')

In [None]:
df.head()

In [None]:
df['bhk'].unique()

The `bhk` feature was already converted from object to float above; the cell below confirms the dtype

In [None]:
df['bhk'].astype(float)

We have to create a new dataset that leaves out entries where `bhk` is null

In [None]:
df_bhk = df[~df['bhk'].isnull()].copy()

In [None]:
df_bhk.head()

In [None]:
df_bhk.isnull().sum()

Let me change the `total_sqft` data type to float. First, a look at the entries that are written as ranges

In [None]:
df_bhk[df_bhk['total_sqft'].apply(lambda x: '-' in x )].head()

Now let's create a new `total_sqft` feature.
Before doing so, note that some rows carry unusual `total_sqft` values in other units, such as *34.46Sq. Meter* and *4125Perch*. <span style="font-size: 1.5rem">This is strange 😅</span>

The conversion function below turns such values into missing values

In [None]:
# Convert total_sqft strings to float; a range "a - b" becomes its midpoint
def convert_to_float(num):
    try:
        tokenized = num.split('-')
        if len(tokenized) == 2:
            return (float(tokenized[0]) + float(tokenized[1])) / 2
        return float(num)
    except (ValueError, AttributeError):
        # Unparseable values (e.g. other units) or non-strings become missing
        return None

In [None]:
df_bhk['total_sqft'] = df_bhk['total_sqft'].apply(convert_to_float)
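
To see how this kind of conversion behaves on plain numbers, ranges and unit-tagged strings, here is a standalone sketch. `parse_sqft` is a hypothetical stand-in that mirrors the logic of `convert_to_float`; the two unit-tagged samples are the ones quoted in the text, and the other two are invented.

```python
def parse_sqft(value):
    """Return a float area, the midpoint of an 'a - b' range, or None."""
    try:
        parts = value.split('-')
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except (ValueError, AttributeError):
        # Anything that is not a number or a two-part range is treated as missing
        return None

samples = ["1200", "1000 - 1500", "34.46Sq. Meter", "4125Perch"]
parsed = [parse_sqft(s) for s in samples]
print(parsed)
```

Returning `None` rather than raising means the unit-tagged rows end up as missing values in the column and can be filtered out later.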

Let's check the row at index `30`, which originally had a dirty `total_sqft` value

In [None]:
df_bhk.iloc[30]

As shown above, `total_sqft` is now **2475.0**

<p style="font-size: 1.5rem; line-height: 1; font-style: italic; text-align: justify">When buying a new house, many features are considered, such as its <span style="color: green">location, dimension (square feet), number of rooms, architecture, bathrooms, etc.</span></p>

<p>Now we are going to compare those features against house price to check whether there is a relationship.</p>

# Location vs Price

In [None]:
df_loc = df_bhk.copy()

In [None]:
df_loc.head()

In [None]:
df_loc['location'].value_counts()

<p style="font-size: 1.5rem">It seems that location can cause high-cardinality issues. It is time to check for under-represented locations and group them.</p>

In [None]:
less_than_10 = df_loc['location'].value_counts() < 10
less_than_10 = less_than_10[less_than_10]
less_than_10

In [None]:
df_loc['location'] = df_loc['location'].apply(lambda x: 'Other' if x in less_than_10 else x )
df_loc['location'].nunique()

Location has now been reduced to **254** unique values
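
The same grouping of rare categories can be sketched on a toy Series. The labels and the threshold of 2 are invented for illustration (the notebook uses 10), and `min_count`, `rare` and `grouped` are hypothetical names.

```python
import pandas as pd

# Invented locations; a threshold of 2 stands in for the notebook's 10
locations = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
min_count = 2

counts = locations.value_counts()
rare = counts[counts < min_count].index  # labels seen fewer than min_count times

# Keep frequent labels, replace the rest with a single 'Other' bucket
grouped = locations.where(~locations.isin(rare), "Other")
print(grouped.value_counts())
```

Using `isin` with `where` avoids a Python-level `apply` and makes the membership test explicit.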

In [None]:
price_mean = df_loc['price'].mean()
plt.figure(figsize=(10, 6))
sns.scatterplot(x='location', y='price', data=df_loc, alpha=.5)
plt.axhline(price_mean, color='red', linestyle='--', label=f'Price Mean (y = {price_mean:.2f})')
plt.ylabel('House Price')
plt.xlabel('House Location')
plt.xticks([])
plt.legend();

<p style="font-size: 1rem">In most locations, house prices are below <b>1000.</b> What does this tell us? The points above that level are potential outliers.</p>
<br/>
<p style="font-size: 1rem">Another takeaway is that the mean is about 112, so those outliers appear to be skewing our data.</p>
Maybe we can exclude houses priced above 1000. But first, let me look at the summary statistics


In [None]:
df_loc['price'].describe()

<p style="font-size: 1.3rem">It is tempting to drop some data points. But wait 🤔 we cannot treat data points as outliers based on price only. A better measure is price per square foot. Price alone cannot give us useful information.</p>

# Price vs Square Feet

In [None]:
df_loc['price_per_sqft']  = round(df_loc['price']*100_000/df_loc['total_sqft'], 2)
df_loc.head()
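
The factor of `100_000` suggests that `price` is recorded in lakhs (1 lakh = 100,000 rupees). That reading is an assumption, since the notebook does not state the unit. Under it, a worked example with invented numbers looks like this:

```python
# Assumption: price is in lakhs of rupees (1 lakh = 100,000)
price_lakhs = 60.0    # invented example price
total_sqft = 1200.0   # invented example area

price_rupees = price_lakhs * 100_000
price_per_sqft = round(price_rupees / total_sqft, 2)
print(price_per_sqft)
```

Normalising by area puts a small flat and a large villa on the same scale, which is why it is a more useful outlier measure than raw price.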

Let's use price per square foot to check how price is spread across the dataset

In [None]:
# Function that formats y axis to use M for millions and K for thousands
def ticker_values_formatter(value, ticker_num):
    if value >= 1e6:
        return f'{int(value/1e6)}M'
    elif value >=1e3:
        return f'{int(value/1e3)}K'
    else:
        return f'{int(value)}'

In [None]:
price_per_sqft_mean = df_loc['price_per_sqft'].mean()
plt.figure(figsize=(10, 6))
sns.scatterplot(x='location', y='price_per_sqft', data=df_loc, alpha=.5)
plt.axhline(price_per_sqft_mean,
            color='red', linestyle='--', 
            label=f'Price per sqft mean (y = {price_per_sqft_mean:.2f})'
           )
plt.ylabel('House Price Per Square Foot')
plt.xlabel('House Location')
plt.xticks([])
plt.gca().yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(ticker_values_formatter))
plt.gca().yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(1e6))
plt.legend();

<div>
    <p style="font-size: 1.3rem">From the plot above, nearly all values sit close to <span style="color: green">0</span> on the million scale, while the <span style="color: green">mean</span> is about <span style="color: green">7,912</span> 😱</p>

<p style="font-size: 1.3rem;">We can see one data point at about 12 million and another above 2 million</p>
</div>

# REMOVING OUTLIERS

In [None]:
df_no_extreme = df_loc.copy()

In [None]:
df_no_extreme.shape

In [None]:
df_cleaned_1 = df_no_extreme[df_no_extreme['price_per_sqft'] < 100_000]
df_cleaned_1.shape

In [None]:
price_per_sqft_mean = df_cleaned_1['price_per_sqft'].mean()
plt.figure(figsize=(10, 6))
sns.scatterplot(x='location', y='price_per_sqft', data=df_cleaned_1, alpha=.5)
plt.axhline(price_per_sqft_mean,
            color='red', linestyle='--', 
            label=f'Price per sqft mean (y = {price_per_sqft_mean:.2f})'
           )
plt.ylabel('House Price Per Square Foot')
plt.xlabel('House Location')
plt.xticks([])
plt.gca().yaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(ticker_values_formatter))
plt.gca().yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(100_000))
plt.legend();

I'm going to remove values that fall below the 10th percentile or above the 90th percentile

In [None]:
below_p90 = df_cleaned_1['price_per_sqft'] < np.percentile(df_cleaned_1['price_per_sqft'], 90)
above_p10 = df_cleaned_1['price_per_sqft'] > np.percentile(df_cleaned_1['price_per_sqft'], 10)

In [None]:
df_cleaned_2 = df_cleaned_1[above_p10 & below_p90]

In [None]:
df_cleaned_2.shape
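
Percentile-based trimming like the step above can be wrapped in a small helper. `trim_by_percentile` and its `lower`/`upper` parameters are hypothetical names; the defaults mirror the 10 and 90 passed to `np.percentile` in the code, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def trim_by_percentile(frame, column, lower=10, upper=90):
    """Keep rows whose `column` lies strictly between two percentiles."""
    low, high = np.percentile(frame[column], [lower, upper])
    return frame[(frame[column] > low) & (frame[column] < high)]

# Invented values 1..100 so the cut-offs are easy to reason about
toy = pd.DataFrame({"price_per_sqft": np.arange(1, 101, dtype=float)})
trimmed = trim_by_percentile(toy, "price_per_sqft")
print(len(toy), len(trimmed))
print(trimmed["price_per_sqft"].min(), trimmed["price_per_sqft"].max())
```

Keeping the cut-offs as parameters makes it easy to compare a tighter trim (10/90) with a looser one (1/99) and see how many rows each removes.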

In [None]:
plt.hist(df_cleaned_2['price_per_sqft'])

<p style="font-size: 2rem">We could go deeper into the effect <span style="font-style: italic">bhk</span> has on the price of a house. However, I am going straight to encoding the categorical features that I'll be using for model training</p>

# 4 | Building and Training a Model

It's time to encode the categorical features, because machine learning algorithms cannot work with raw text directly

In [None]:
categorical_feat = df_cleaned_2.select_dtypes("object").drop(['society', 'area_type', 'availability'], axis='columns').columns.to_list()
categorical_feat

In [None]:
dumies = pd.get_dummies(df_cleaned_2[categorical_feat])
dumies.head(3)

In [None]:
df_cleaned_3 = (
    pd.concat(
        [
            df_cleaned_2.drop(columns=['society', 'area_type', 'availability']),
            dumies.drop('location_Other',axis = 'columns')
        ],
        axis='columns')
)
df_cleaned_3.head()

Next, we are going to drop the original `categorical features`, since their encoded values are now available for model training

In [None]:
df_ready = df_cleaned_3.drop(categorical_feat, axis='columns')
df_ready.head()
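
The encode, concatenate and drop sequence above can be seen end to end on a tiny frame. The rows and labels below are invented; dropping `location_Other` mirrors the notebook's choice of `Other` as the reference category.

```python
import pandas as pd

# Invented rows; 'Other' plays the role of the grouped rare locations
toy = pd.DataFrame({
    "location": ["A", "Other", "B", "A"],
    "bhk": [2.0, 3.0, 2.0, 4.0],
})

# One indicator column per location value
dummies = pd.get_dummies(toy["location"], prefix="location")

# Drop the original text column and one indicator (the reference category)
encoded = pd.concat(
    [toy.drop(columns="location"), dummies.drop(columns="location_Other")],
    axis="columns",
)
print(encoded.columns.to_list())
```

A row from the reference category is then represented by all remaining indicator columns being zero (or `False`), which avoids perfectly collinear columns in a linear model.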

In [None]:
df_ready.select_dtypes(include=[int, float]).corr()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(df_ready.select_dtypes(include=[int, float]).corr(), ax=ax);

In [None]:
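# Note: X still contains price_per_sqft, which is derived from price (the target),
# so the scores below may be optimistic; consider dropping it as well
X = df_ready.drop('price', axis='columns')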
X.head()

In [None]:
y = df_ready['price']
y.head()

# Building baseline

In [None]:
y_mean = round(np.mean(y), 2)
y_baseline = [y_mean] * len(y)

In [None]:
from sklearn.metrics import mean_absolute_error

Now it is time to split the dataset into training and testing sets

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=10)

In [None]:
mean_absolute_error(y, y_baseline)
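
A mean baseline predicts the same value for every house, and its MAE is the average absolute distance from that value. Here is a small sketch with invented prices. As a deliberate variation, it takes the mean from a training portion only and scores it on a held-out portion, whereas the cell above uses all of `y`.

```python
import numpy as np

y_train_toy = np.array([40.0, 60.0, 80.0, 100.0])  # invented training prices
y_test_toy = np.array([50.0, 90.0])                # invented held-out prices

# The baseline "model" is just the training mean
baseline_value = y_train_toy.mean()

# MAE: mean of absolute differences between truth and the constant prediction
baseline_mae = np.mean(np.abs(y_test_toy - baseline_value))
print(baseline_value, baseline_mae)
```

Any fitted model should achieve a lower MAE than this on the same held-out rows to be worth keeping.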

# Fitting Linear Regression Model

Now we fit a linear regression model on the training data

In [None]:
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()
# Fit the model
reg_model.fit(X_train, y_train)

In [None]:
# Check the score on the training data set
train_y_pred_score = round(reg_model.score(X_train, y_train), 2)
train_y_pred_score

In [None]:
# Check the score on the test data set
test_y_pred_score = round(reg_model.score(X_test, y_test), 2)
test_y_pred_score
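
`LinearRegression.score` returns the coefficient of determination, R². A minimal sketch of that formula on invented numbers (`y_true` and `y_hat` are illustrative names):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # invented targets
y_hat = np.array([2.5, 5.5, 7.0, 9.0])   # invented predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

A value close to 1 means the predictions explain most of the variance in the target; a large gap between the training and test scores would hint at overfitting.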

In [None]:
reg_model.intercept_

In [None]:
reg_model.coef_[:5]

In [None]:
# Reuse y_train's index so predictions line up with the real values
y_pred = pd.Series(reg_model.predict(X_train), index=y_train.index).round(2)

In [None]:
print(pd.Series(y_pred[:5]))
pd.Series(y_train[:5])

<p style="font-size: 2rem">👆👆 Comparing predictions with real values, the model seems to fit the training data well. </p>

In [None]:
round(mean_absolute_error(y_train, y_pred), 2)

<p style="font-size: 2rem">We are beating the initial baseline, which was around 45. The MAE is now around 8 💪. </p>
<p style="font-size: 2rem">You can run the same check on the test dataset 💪. </p>
Let's visualize the model's prediction residuals

In [None]:
residuals = np.asarray(y_pred) - np.asarray(y_train)  # subtract by position, not by index label
fig, ax= plt.subplots(figsize=(10, 5))
sns.histplot(residuals, kde=True, bins=30, ax=ax)
plt.title("Model Prediction Residuals distribution");
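
One thing to watch when computing residuals: pandas aligns two Series on their index labels when subtracting, not on row position. After a shuffled train/test split the target keeps its original labels, so predictions wrapped in a fresh Series need the same index. A small sketch with invented values:

```python
import pandas as pd

# y_true keeps shuffled labels, as after train_test_split; values are invented
y_true = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
preds = [11.0, 19.0, 33.0]  # predictions in the same row order as y_true

# Default RangeIndex 0..2: subtraction pairs values by label, mixing up rows
misaligned = pd.Series(preds) - y_true

# Same index as y_true: each prediction is paired with its own row
aligned = pd.Series(preds, index=y_true.index) - y_true

print(misaligned.sort_index().tolist())
print(aligned.tolist())
```

With real data the mismatch is usually worse, because many labels exist in only one of the two Series and produce `NaN` residuals.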

In [None]:
# For linear regression coefficients
coefficients = pd.Series(reg_model.coef_, index=X.columns)
coefficients.sort_values().plot(kind='barh', figsize=(8, 6))
plt.title('Feature Importance')
plt.yticks([])
plt.show()

<p style="font-size: 1.5rem">Let's use cross-validation to check model performance. Later on, I will use grid search, which makes it possible to compare more than one machine learning model at a time</p>

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

In [None]:
cv_1 = KFold(n_splits=5, shuffle=False, random_state=None)
cv_2 = KFold(n_splits=5, shuffle=True, random_state=42)
print(f'Without shuffling: {cross_val_score(LinearRegression(), X, y, cv=cv_1)}')
print(f'With shuffling: {cross_val_score(LinearRegression(), X, y, cv=cv_2)}')

In [None]:
#Using ShuffleSplit method
cv_s = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
cross_val_score(LinearRegression(), X, y, cv=cv_s)
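
The difference between the shuffled and unshuffled `KFold` settings is easiest to see by printing the test indices of each fold. This sketch uses ten invented row positions in place of the housing data:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten invented row positions

plain = KFold(n_splits=5, shuffle=False)
shuffled = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the test indices of every fold for both strategies
plain_folds = [test.tolist() for _, test in plain.split(data)]
shuffled_folds = [test.tolist() for _, test in shuffled.split(data)]

print(plain_folds)     # contiguous blocks in original row order
print(shuffled_folds)  # rows drawn from across the data
```

If the rows happen to be ordered in some meaningful way, contiguous folds can each cover an unrepresentative slice, which is one reason the shuffled scores may look more stable.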

# 5 | Model Evaluation

Like the author of the notebook I adapted this from, I am going to use grid search to evaluate different machine learning algorithms and check which performs best.

I am going to test **Ridge, Lasso, LinearRegression, and DecisionTreeRegressor.**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.tree import DecisionTreeRegressor

def grid_search_func(X,y):
    """
    Function: grid_search_func
    Uses GridSearchCV to try several machine learning algorithms and find which one works best.
    Parameters:
        X: predictors
        y: target variable (to be predicted)
    Return:
        A data frame with three columns: 'model', 'best_score', 'best_params'
    """
    alg_config = {
        'LinearRegression' : {
            'model' : LinearRegression(),
            'params' : {
            }
        },
        'Ridge': {
            'model': Ridge(),
            'params':{
                'alpha': [1, 2]
            }
        },
        'Lasso' : {
            'model' : Lasso(),
            'params' : {
                'alpha' : [1,2],
                'selection' : ['random', 'cyclic']
            }
        },
        'decision_tree' : {
            'model' : DecisionTreeRegressor(),
            'params' : {
                'criterion' : ['squared_error', 'friedman_mse'],
                'splitter' : ['best', 'random']
            }
        }
    }

    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

    for model, config in alg_config.items():
        grid_search = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        print(grid_search)
        grid_search.fit(X,y)
        print('finished fitting the model')
        scores.append({
            'model': model,
            'best_score': grid_search.best_score_,
            'best_params': grid_search.best_params_
        })
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

In [None]:
#Testing function
model_perfomance = grid_search_func(X, y)
model_perfomance

In [None]:
model_perfomance

# 6| COMMUNICATION

In [None]:
from ipywidgets import widgets

Building a function that receives a location and returns a prediction. Square feet, bath and bhk (bedrooms, hall, kitchen) are taken from an existing row rather than passed in

In [399]:
def predict_price(location):
    """
    Generate a prediction for one sample house in the given location column
    Returns a string with the prediction and the real value,
    or None if no row matches
    """
    try:
        # Pick the first row whose one-hot column for this location is set
        sample = X[X[location] == 1].iloc[[0]]
        y_pred = round(reg_model.predict(sample)[0], 2)
        return f'Prediction: {y_pred}, Real value: {y.loc[sample.index[0]]}'
    except (KeyError, IndexError):
        return None


In [395]:
#Initialize output area
output = widgets.Output()

In [396]:
def on_dropdown_value_change(change):
    location = change['new']
    with output:
        output.clear_output()
        prediction = predict_price(location)
        # print inside the output context so the text shows up in the widget
        print(f"{location}: {prediction}")

In [397]:
options = [location for location in X.columns.to_list() if location.startswith('location')]
dropdown = widgets.Dropdown(
    options=options,
    value=options[49],
    description="Location Dropdown"
)

In [398]:
from IPython.display import display
display(dropdown, output)
# Attach the observer function, which runs when the dropdown value changes
dropdown.observe(on_dropdown_value_change, names='value')

Dropdown(description='Location Dropdown', index=49, options=('location_ Devarachikkanahalli', 'location_1st Bl…

Output()