<h1 style="color:green" >Machine Learning Big Projects Series: </h1>

<img height=200 width=600 src="2.jpg" />

The ML Big Project Series aims to create a comprehensive end-to-end application by showcasing a diverse range of machine learning skills and concepts through a series of engaging and practical projects. These projects provide a hands-on learning experience for experimenting with various machine learning techniques and their real-world applications. The focus is on practical implementation and learning through experimentation, making the process both enjoyable and informative.

# Table of Contents
    
- [Data Science Regression Project: Predicting Home Prices in Bangalore](#0.0.)
    - [1.0. Data Pre-processing:](#1.0.)
        - [1.1. Data Cleaning:](#1.1.)
        - [1.2. Feature Engineering: ](#1.2.)
        - [1.3. Dimensionality Reduction: ](#1.3.)
        - [1.4. Outlier Removal Using Business Logic: ](#1.4.)
        - [1.5. Outlier Removal Using Standard Deviation and Mean:](#1.5.)
        - [1.6. Outlier Removal Using Bathrooms Feature:](#1.6.)
        - [1.7. Use One Hot Encoding For Location:](#1.7.)
    - [2.0. Model Building:](#2.0.)
        - [2.1. Use K Fold cross validation to measure accuracy of our LinearRegression model:](#2.1.)
        - [2.2. Find best model using GridSearchCV:](#2.2.)
        - [2.3. Test the model for few properties: ](#2.3.)
    - [3.0. Export the tested model to a pickle file:](#3.0.)
        - [3.1. Export location and column information to a file that will be useful later on in our prediction application:](#3.1.)

<h1 style="color:green" >Data Science Regression Project: Predicting Home Prices in Bangalore</h1><a id='0.0.'></a>

Project Brief: This project aims to develop a real estate price prediction website using a step-by-step process. The first step involves building a model using scikit-learn and linear regression, using the Bangalore home prices dataset from Kaggle.com. The second step is to develop a Python Flask server that will use the saved model to serve HTTP requests. The third component is a website built using HTML, CSS, and JavaScript that allows users to input a home's square footage, number of bedrooms, and other relevant features, and retrieve the predicted selling price by calling the Flask server.

Throughout the project, we will cover a range of data science concepts, such as data cleaning, outlier detection and removal, feature engineering, dimensionality reduction, hyperparameter tuning using GridSearchCV, and k-fold cross-validation. We will also utilize various tools and technologies, including Python, NumPy, Pandas, Matplotlib, Jupyter Notebook, PyCharm, Python Flask, and HTML/CSS/JavaScript. The ultimate goal is to create an accurate predictive model that can estimate the value of a home based on its unique features, to assist buyers and sellers in making informed decisions.

Dataset credits: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

<h3>1.0. Data Pre-processing:</h3><a id='1.0.'></a>

In [None]:
# Load the data into a dataframe
bhp = pd.read_csv("bengaluru_house_prices.csv")
bhp.head()

In [None]:
bhp.shape

In [None]:
bhp.columns

In [None]:
bhp["area_type"].unique()

In [None]:
bhp["area_type"].value_counts()

In [None]:
# Drop unnecessary features to build the model
bhp = bhp.drop(['area_type','availability','society','balcony'],axis = 'columns')
bhp.shape

In [None]:
bhp.head()

<h3 style='color:grey'> 1.1. Data Cleaning: </h3><a id='1.1.'></a>

In [1]:
# Count missing values in each column
bhp.isnull().sum()

In [None]:
bhp.shape

In [None]:
# Drop missing values
bhp2 = bhp.dropna()
bhp2.isnull().sum()

In [None]:
bhp2.shape

<h3 style='color:grey'>1.2. Feature Engineering: </h3><a id='1.2.'></a>

In [None]:
# Display unique values in 'size' column
bhp2['size'].unique()

In [None]:
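# Add a new feature 'bhk' (number of bedrooms), parsed from the leading integer in 'size'
bhp2 = bhp2.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
bhp2['bhk'] = bhp2['size'].apply(lambda x: int(x.split(' ')[0]))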
bhp2.head()

In [None]:
bhp2.bhk.unique()

In [None]:
# Display rows where 'bhk' value is greater than 20
bhp2[bhp2.bhk>20]

In [None]:
# Explore 'total_sqft' feature
bhp2['total_sqft'].unique()

In [None]:
# Define a function that checks whether a value can be converted to float
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
# Display rows where 'total_sqft' is not a float
bhp2[~bhp2['total_sqft'].apply(is_float)].head(10)

The above data indicates that the total square footage can be presented as a range, such as in the case of 2100-2850. To handle such scenarios, we can compute the average of the minimum and maximum values in the range. Additionally, there may be instances, such as with 34.46 square meters, where unit conversion is required to convert the area to square footage. However, for the sake of simplicity, we will remove such corner cases.

In [None]:
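# Define a function to convert 'total_sqft' strings to float; a range such as '2100 - 2850' becomes its midpoint
def convert_sqft_to_num(x):
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0])+float(tokens[1]))/2
        return float(x)
    except ValueError:
        # Values with units (e.g. '34.46Sq. Meter') cannot be parsed; they become None and are dropped later
        return None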

In [None]:
# Test the function with a single value
convert_sqft_to_num('2100')

In [None]:
# Test the function with a range value
convert_sqft_to_num('2100 - 2850')

In [None]:
# Test the function with a corner case
convert_sqft_to_num('34.46Sq. Meter')

In [None]:
# Make a copy of the dataframe
bhp3 = bhp2.copy()
# Convert 'total_sqft' values to float
bhp3.total_sqft = bhp3.total_sqft.apply(convert_sqft_to_num)
# Remove rows with null values in 'total_sqft' column
bhp3 = bhp3[bhp3.total_sqft.notnull()]
bhp3.head()

In [None]:
# Display a specific row
bhp3.loc[30]

The above information indicates that the total_sqft is 2475, which is obtained by taking the average of the range 2100-2850, that is, ((2100+2850)/2).
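The notebook drops values that carry a unit, such as 34.46Sq. Meter. If those rows were worth keeping, the conversion could be handled along the following lines. This is only a sketch: `sqft_with_units` and `SQM_TO_SQFT` are hypothetical names that are not part of the notebook, and only the `Sq. Meter` suffix is handled.

```python
import re

SQM_TO_SQFT = 10.7639  # square feet in one square metre

def sqft_with_units(text):
    """Hypothetical helper: parse plain numbers, 'a - b' ranges and 'Sq. Meter' values."""
    text = text.strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*Sq\. Meter", text)
    if match:
        # Convert square metres to square feet
        return float(match.group(1)) * SQM_TO_SQFT
    parts = text.split("-")
    try:
        if len(parts) == 2:
            # A range is replaced by its midpoint, as in convert_sqft_to_num
            return (float(parts[0]) + float(parts[1])) / 2
        return float(text)
    except ValueError:
        # Any other unit is still treated as unparseable
        return None

print(sqft_with_units("2100 - 2850"))     # 2475.0
print(sqft_with_units("34.46Sq. Meter"))  # roughly 370.9
print(sqft_with_units("4125Perch"))       # None
```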

In [None]:
# Add a new feature called 'price_per_sqft'
bhp4 = bhp3.copy()
# 'price' is in lakh rupees (1 lakh = 100,000), so multiply by 100000 to get rupees before dividing by the area
bhp4['price_per_sqft'] = bhp4['price']*100000/bhp4['total_sqft']
bhp4.head()
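The factor of 100000 is there because the price column is expressed in lakh rupees (1 lakh = 100,000 rupees). A quick arithmetic check with made-up figures, using a hypothetical helper name:

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

def price_per_sqft(price_lakh, total_sqft):
    """Rupees per square foot for a price given in lakh rupees."""
    return price_lakh * LAKH / total_sqft

# Illustrative figures, not taken from the dataset: 50 lakh for 1,250 sqft
print(price_per_sqft(50, 1250))  # 4000.0
```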

In [None]:
# Get the statistics of 'price_per_sqft'
bhp4['price_per_sqft'].describe()

In [None]:
# Save the pre-processed data to a new CSV file
bhp4.to_csv("bhp.csv",index=False)

In [None]:
len(bhp4.location.unique())

The location column is a categorical variable with a large number of distinct values. One-hot encoding it directly would create one column per location, which leads to the curse of dimensionality. We therefore need to reduce the number of location categories before encoding.

In [None]:
# Examine the 'location' column which is a categorical variable with high cardinality
# remove leading/trailing white spaces from each location value
bhp4.location = bhp4.location.apply(lambda x: x.strip())
# get the frequency of each location
location_stats = bhp4['location'].value_counts(ascending=False)
location_stats

In [None]:
# get the total number of data points
location_stats.values.sum()

In [None]:
# get the number of locations with more than 10 data points
len(location_stats[location_stats>10])

In [None]:
# get the total number of unique locations
len(location_stats)

In [None]:
# get the number of locations with less than or equal to 10 data points
len(location_stats[location_stats<=10])

<h3 style="color:grey">1.3. Dimensionality Reduction: </h3><a id='1.3.'></a>

To decrease the number of categories, we label any location that has 10 or fewer data points as "other". This greatly reduces the number of categories, so the later one-hot encoding produces fewer dummy columns.

In [None]:
# Perform dimensionality reduction by tagging locations with 10 or fewer data points as 'other'
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10

In [None]:
len(bhp4.location.unique())

In [None]:
bhp4.location = bhp4.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(bhp4.location.unique())

In [None]:
bhp4.head(10)
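The same grouping of rare categories can be tried on a toy example. The sketch below uses made-up location labels and a threshold of 2 rather than 10, and replaces the `apply` call with `isin`, which is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Toy data: made-up location labels, with a threshold of 2 instead of 10
toy = pd.DataFrame({"location": ["A", "A", "A", "B", "B", "C", "D"]})
counts = toy["location"].value_counts()
rare = counts[counts <= 2].index  # labels with 2 or fewer rows

# Keep the label where it is not rare, otherwise replace it with 'other'
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "other")
print(toy["location"].value_counts())
```

Only `A` keeps its own label here; `B`, `C` and `D` are merged into `other`.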

<h3 style="color:grey">1.4. Outlier Removal Using Business Logic: </h3><a id='1.4.'></a>

During a discussion with a business manager who has expertise in real estate, it was mentioned that a bedroom typically needs at least 300 sqft. For instance, a 2 bhk apartment should be at least 600 sqft. A 2 bhk apartment with only 400 sqft is therefore suspicious and can be removed as an outlier. To remove such outliers, we set the minimum threshold to 300 sqft per bhk.

In [None]:
# Remove outliers using business logic - remove data points where total sqft per bhk is less than 300
# Display the data points that satisfy the condition
bhp4[bhp4.total_sqft/bhp4.bhk<300].head()

Looking at the data points above, it is clear that there are errors in the dataset, such as a 6 bhk apartment with only 1020 sqft and an 8 bhk apartment with only 600 sqft. These data points can be safely removed as they do not align with typical real estate standards.

In [None]:
bhp4.shape

In [None]:
# remove the data points that satisfy the condition and create a new dataset
bhp5 = bhp4[~(bhp4.total_sqft/bhp4.bhk<300)]
bhp5.shape

<h3 style='color:grey'>1.5. Outlier Removal Using Standard Deviation and Mean: </h3><a id='1.5.'></a>

In [None]:
# Remove outliers using standard deviation and mean
bhp5.price_per_sqft.describe()

The summary shows a very wide spread in price per square foot, from a minimum of 267 Rs/sqft to a maximum of 12000000 Rs/sqft. To address this, we remove outliers per location, keeping only the rows whose price per square foot lies within one standard deviation of that location's mean.

In [None]:
# Define a function to remove the outliers per location using mean and one standard deviation
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out

# Remove the outliers using the above function and create a new dataset
bhp6 = remove_pps_outliers(bhp5)
bhp6.shape
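The same filter can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's code: `remove_pps_outliers_vectorized` is a hypothetical name, the data is made up, and unlike the loop version it keeps rows in their original order instead of grouping them by location.

```python
import pandas as pd

def remove_pps_outliers_vectorized(df):
    """Keep rows within one standard deviation of their location's mean price_per_sqft."""
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    # ddof=0 gives the population standard deviation, which is np.std's default
    std = grouped.transform(lambda s: s.std(ddof=0))
    keep = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    return df[keep].reset_index(drop=True)

# Toy data with one obviously extreme value in location "A"
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 3,
    "price_per_sqft": [4000, 4100, 3900, 4050, 20000, 6000, 6100, 5900],
})
print(remove_pps_outliers_vectorized(toy))
```

Note that a one-standard-deviation band is strict: in location `B` only the middle value survives, even though the three values are close together.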

Let us analyze the prices of 2 BHK and 3 BHK properties for a given location.

In [None]:
# Define a function to plot the scatter chart for a given location, showing the prices of 2 BHK and 3 BHK properties
def plot_scatter_chart(df,location):
    # Filter data for 2 BHK and 3 BHK properties in the given location
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    # Set figure size and plot the scatter chart for both types of properties
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()

# Plot scatter chart for 'Rajaji Nagar' before removing the BHK-based outliers ('Hebbal' follows in the next cell)
plot_scatter_chart(bhp6,"Rajaji Nagar")

In [None]:
plot_scatter_chart(bhp6,"Hebbal")

We need to ensure that the property prices are consistent for a given location. If the price of a 3 bedroom apartment is less than that of a 2 bedroom apartment with the same square footage in the same location, we should remove those properties as they are inconsistent. To achieve this, we will create a dictionary of statistics per number of bedrooms (BHK) for each location, such as shown below. Then, we can remove the 2 BHK apartments whose price_per_sqft is less than the mean price_per_sqft of the 1 BHK apartments in the same location.
```
{
    '1' : {
        'mean': 4000,
        'std': 2000,
        'count': 34
    },
    '2' : {
        'mean': 4300,
        'std': 2300,
        'count': 22
    },
}
```
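A dictionary of this shape can be built for a single location as sketched below. The data is made up, and the sketch skips the minimum-count check, so it only illustrates the comparison of 2 BHK rows against the 1 BHK mean.

```python
import pandas as pd

# Toy data for a single, made-up location
toy = pd.DataFrame({
    "bhk": [1, 1, 1, 2, 2, 2],
    "price_per_sqft": [4000, 4200, 3800, 3500, 4500, 4700],
})

# Mean, population standard deviation and row count per bedroom count
bhk_stats = {
    bhk: {"mean": group.mean(), "std": group.std(ddof=0), "count": len(group)}
    for bhk, group in toy.groupby("bhk")["price_per_sqft"]
}

# 2 BHK rows priced below the 1 BHK mean are the ones that would be flagged
one_bhk_mean = bhk_stats[1]["mean"]
flagged = toy[(toy["bhk"] == 2) & (toy["price_per_sqft"] < one_bhk_mean)]
print(bhk_stats)
print(flagged)
```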

In [None]:
# Define a function to remove outliers from the dataset based on price_per_sqft
def remove_bhk_outliers(df):
    # Create an empty array to store the indices of data points to be excluded from the dataset
    exclude_indices = np.array([])
    # Group the data by location and bhk and calculate the mean, standard deviation, and count of each group
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        # Iterate through each bhk group for the current location and exclude data points whose price_per_sqft is less than the mean price_per_sqft of the previous bhk group (if it exists and has more than 5 data points)
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    # Remove the excluded data points from the dataset and return the cleaned dataset
    return df.drop(exclude_indices,axis='index')

# Remove outliers from the dataset and store the cleaned dataset in a new variable
bhp7 = remove_bhk_outliers(bhp6)
bhp7.shape

In [None]:
# Plot scatter chart for 'Rajaji Nagar' and 'Hebbal' after removing outliers
plot_scatter_chart(bhp7,"Rajaji Nagar")

In [None]:
plot_scatter_chart(bhp7,"Hebbal")

Comparing these charts with the ones plotted before the call shows the effect of remove_bhk_outliers. In locations with enough 2 BHK data, the 3 BHK listings (green) whose price per square foot fell below the mean of the 2 BHK listings (blue) in the same location are no longer present.

In [None]:
# Plot histogram of price_per_sqft to visualize the effect of removing outliers
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)
plt.hist(bhp7.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

<h3 style='color:grey'>1.6. Outlier Removal Using Bathrooms Feature: </h3><a id='1.6.'></a>

In [None]:
# Identify and remove outliers from the dataset based on the number of bathrooms
bhp7.bath.unique()

In [None]:
# Plot histogram of number of bathrooms to visualize the distribution
plt.hist(bhp7.bath,rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
# Identify the outliers where the number of bathrooms is greater than 10 and print those rows
bhp7[bhp7.bath>10]

In [None]:
# According to the business manager, having more than 2 extra bathrooms compared to bedrooms is unusual.
# Display the rows where the number of bathrooms is greater than the number of bedrooms + 2
bhp7[bhp7.bath>bhp7.bhk+2]

During a conversation, the business manager pointed out that the number of bathrooms in a home should be at most the number of bedrooms plus one guest bathroom; for a four-bedroom home that means at most five. Any higher value indicates an outlier or a data error and should be removed, so we keep only the rows where bath < bhk + 2.

In [None]:
bhp8 = bhp7[bhp7.bath<bhp7.bhk+2]
bhp8.shape

In [None]:
bhp8.head()

In [None]:
# Drop the 'size' and 'price_per_sqft' columns as they are no longer required
bhp9 = bhp8.drop(['size','price_per_sqft'],axis='columns')
bhp9.head()

<h3 style='color:grey'>1.7. Use One Hot Encoding For Location: </h3><a id='1.7.'></a>

In [None]:
# Use one hot encoding to convert the categorical 'location' column into numerical data for the model
dummies = pd.get_dummies(bhp9.location)
dummies.head(10)

In [None]:
# Concatenate the one-hot encoded 'location' columns with the original dataframe 'bhp9'
# Drop the 'other' column: it is redundant because a row with zeros in every location column already identifies 'other'
bhp10 = pd.concat([bhp9,dummies.drop('other',axis='columns')],axis='columns')
bhp10.head()

In [None]:
# Drop the 'location' column as we have now converted it to numerical data
bhp11 = bhp10.drop('location',axis='columns')
bhp11.head()
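The three encoding steps above can be seen end to end on a toy frame. This is a sketch with made-up values; it shows why dropping the `other` column loses no information.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "B", "other", "A"],
    "price": [50.0, 75.0, 40.0, 55.0],
})
dummies = pd.get_dummies(toy["location"])

# Drop 'other' so it becomes the baseline: a row of all zeros means 'other'
encoded = pd.concat(
    [toy.drop(columns="location"), dummies.drop(columns="other")],
    axis="columns",
)
print(encoded)
```

The third row, whose location was `other`, has a zero (or `False`, depending on the pandas version) in both `A` and `B`.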

<h3 style='color:grey'>2.0. Model Building: </h3><a id='2.0.'></a>

In [None]:
bhp11.shape

In [None]:
# Separate the features (X) from the target variable (price)
X = bhp11.drop(['price'],axis='columns')
X.head()

In [None]:
X.shape

In [None]:
y = bhp11.price
y.head()

In [None]:
len(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [None]:
# Use Linear Regression to build the model and fit it on the training data
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)

# Evaluate the model on the testing data; score() returns the R^2 coefficient of determination
lr_clf.score(X_test,y_test)

<h3 style='color:grey'>2.1. Use K Fold cross validation to measure accuracy of our LinearRegression model: </h3><a id='2.1.'></a>

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(LinearRegression(), X, y, cv=cv)

Across the 5 random train/test splits, the model's R² score is consistently above 0.8 (80%). While this result is satisfactory, we will check whether other regression algorithms perform better, using GridSearchCV to compare them.
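Note that `ShuffleSplit` draws independent random splits rather than partitioning the data into folds, so it is not K-fold cross-validation in the strict sense. The sketch below contrasts it with `KFold` on synthetic data; the names `X_demo` and `y_demo` are placeholders, and the scores say nothing about the housing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic regression data, used only to make the sketch runnable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# ShuffleSplit draws 5 independent random 80/20 splits; test sets may overlap
shuffle_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# KFold partitions the rows into 5 folds; each row is tested exactly once
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, cv in [("ShuffleSplit", shuffle_cv), ("KFold", kfold_cv)]:
    scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)  # R^2 per split
    print(name, scores.round(3))
```

Either splitter can be passed as the `cv` argument, so switching between them is a one-line change.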

<h3 style='color:grey'>2.2. Find best model using GridSearchCV: </h3><a id='2.2.'></a>

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# Define the algorithms to test along with their respective hyperparameters
def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                'copy_X': [True, False],
                'n_jobs': [1, -1]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['squared_error','friedman_mse'],  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
                'splitter': ['best','random']
            }
        }
    }
    # Initialize an empty list to store the results
    scores = []
    # Define the cross-validation method and number of splits
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    # Iterate through each algorithm and perform a grid search to find the best hyperparameters
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    # Convert the results to a pandas dataframe and return it
    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

# Call the function to find the best model
find_best_model_using_gridsearchcv(X,y)

After analyzing the results, we can conclude that LinearRegression has the highest score among the algorithms tested. Therefore, we will use it as our model.

<h3 style='color:grey'> 2.3. Test the model for few properties:  </h3><a id='2.3.'></a>

In [None]:
# Define a function to predict the price given the location, sqft, bath and bhk
def predict_price(location,sqft,bath,bhk):
    # Find the one-hot column for the location; the result is empty if the location is unknown
    matches = np.where(X.columns==location)[0]

    # Create a feature vector of zeros with length equal to the number of columns
    x = np.zeros(len(X.columns))
    # Set the values for the sqft, bath and bhk features
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    # If the location has its own column, set it to 1; otherwise the row is treated as 'other'
    if len(matches) > 0:
        x[matches[0]] = 1

    # Use the trained model to predict the price
    return lr_clf.predict([x])[0]

In [None]:
# Test the model for a few properties
predict_price('1st Phase JP Nagar',1000, 2, 2)

In [None]:
predict_price('1st Phase JP Nagar',1000, 3, 3)

In [None]:
predict_price('Indira Nagar',1000, 2, 2)

In [None]:
predict_price('Indira Nagar',1000, 3, 3)

<h3 style='color:grey'>3.0. Export the tested model to a pickle file: </h3><a id='3.0.'></a>

In [None]:
# Export the tested model to a pickle file
import pickle
with open('banglore_home_prices_model.pickle','wb') as f:
    pickle.dump(lr_clf,f)

<h3 style='color:grey'>3.1. Export location and column information to a file that will be useful later on in our prediction application: </h3><a id='3.1.'></a>

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))
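The prediction application can use the saved column list to rebuild the model's input row. The following is a minimal sketch: `build_feature_vector` is a hypothetical helper, and it assumes, as `predict_price` does, that the first three columns are the square footage, bathrooms and bhk. The inline JSON string stands in for the contents of `columns.json`.

```python
import json

def build_feature_vector(data_columns, location, sqft, bath, bhk):
    """Hypothetical helper: rebuild the model's input row from the saved column names."""
    x = [0.0] * len(data_columns)
    x[0], x[1], x[2] = sqft, bath, bhk  # assumes columns start with total_sqft, bath, bhk
    key = location.lower()              # column names were lower-cased when saved
    if key in data_columns:
        x[data_columns.index(key)] = 1.0
    return x

# Stand-in for the contents of columns.json; the real file lists every location column
saved = json.loads('{"data_columns": ["total_sqft", "bath", "bhk", "indira nagar", "hebbal"]}')
print(build_feature_vector(saved["data_columns"], "Indira Nagar", 1000, 2, 2))
# [1000, 2, 2, 1.0, 0.0]
```

A location that is missing from the list leaves every location entry at zero, which corresponds to the dropped `other` category.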