![](https://i.imgur.com/ZghLZVU.png)

![](https://i.imgur.com/j6FuBvD.jpg)

![](https://i.imgur.com/thaKJ1N.png)

### What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor — the price?

### Now with the lingering impact of demonetization, the enforcement of the Real Estate (Regulation and Development) Act (RERA), and the lack of trust in property developers in the city, housing units sold across India in 2017 dropped by 7 percent. In fact, the property prices in Bengaluru fell by almost 5 percent in the second half of 2017, said a study published by property consultancy Knight Frank.

### For example, for a potential homeowner, over 9,000 apartment projects and flats for sale are available in the range of ₹42-52 lakh, followed by over 7,100 apartments that are in the ₹52-62 lakh budget segment, says a report by property website Makaan. According to the study, there are over 5,000 projects in the ₹15-25 lakh budget segment followed by those in the ₹34-43 lakh budget category.

### Buying a home, especially in a city like Bengaluru, is a tricky choice. While the major factors are usually the same for all metros, there are others to be considered for the Silicon Valley of India. With its millennial crowd, vibrant culture, great climate and a slew of job opportunities, it is difficult to ascertain the price of a house in Bengaluru.

![](https://i.imgur.com/Gd1CZrN.png)

### By analyzing this Bengaluru house data, we will estimate the approximate price of a house.

![](https://i.imgur.com/vl7xtxF.png)

## Columns:

    area_type
    availability
    location
    size
    society
    total_sqft
    bath
    balcony
    price

![](https://i.imgur.com/VQf9nV2.png)

### 1. The cost of a wrong price estimate can be high.
### 2. There are strict latency concerns.
### 3. From this project we will be able to understand how house prices depend on other factors.

![](https://i.imgur.com/WPXJm4c.png)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from prettytable import PrettyTable

#### Reading the Data from the CSV file

In [None]:
df = pd.read_csv("../input/bengaluru-house-price-data/Bengaluru_House_Data.csv")
df.head()

![](https://i.imgur.com/8nbAftN.png)

#### Printing the shape of the dataframe

In [None]:
df.shape

#### Column names in the dataset


In [None]:
df.columns

#### Information about the dataset

In [None]:
df.info()

#### Summary statistics of the dataset

In [None]:
df.describe()

In [None]:
df.describe().T

#### Checking whether there are any null values

In [None]:
df.isnull()

#### Counting the null values in each column

In [None]:
df.isnull().sum()

#### Performing Group by operation on Area Type

In [None]:
df.groupby("area_type")["area_type"].agg("count")

#### Checking what different "Area Types" are present in the Dataset

In [None]:
df['area_type'].unique()

![](https://i.imgur.com/NoGC4qm.png)

In [None]:
df.area_type.value_counts().plot(kind='bar')

In [None]:
df['bath'].unique()

In [None]:
df.bath.value_counts().plot(kind='bar')

In [None]:
(df["area_type"].value_counts()).plot.pie(autopct="%.1f%%", shadow=True,rotatelabels=True, wedgeprops={'linewidth': 6}, radius=2)
plt.show()

In [None]:
plt.figure(figsize = (20,8))
ax=sns.countplot(x = 'balcony', data = df)
plt.xticks(rotation = 90)
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va = 'bottom', color = 'black')

In [None]:
import warnings
warnings.filterwarnings('ignore')

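# distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(df, hue='area_type', height=6).map(sns.histplot, 'bath', kde=True, stat='density').add_legend()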
plt.show()

In [None]:
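# distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(df, hue='area_type', height=6).map(sns.histplot, 'balcony', kde=True, stat='density').add_legend()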
plt.show()

#### Dropping less important features

In [None]:
df = df.drop(["area_type", "society","balcony", "availability"], axis = "columns")

In [None]:
df.shape

#### Dropping null values

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

In [None]:
df.shape

![](https://i.imgur.com/bZn8N2J.png)

#### Listing the unique values of the size feature

In [None]:
df["size"].unique()

#### From the above we can clearly see that the number of bedrooms is written in two different ways: one uses BHK and the other uses Bedroom. So we create a new column called BHK that keeps only the number and discards the unit (BHK, Bedroom).

In [None]:
df['BHK'] = df["size"].apply(lambda x: int(x.split(" ")[0]))
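
As a quick illustration of the parsing step above, here is a minimal sketch on a few made-up `size` strings. The sample values are assumptions for demonstration, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values mimicking the notations found in the `size` column
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "3 BHK"])

# Keep only the leading integer, the same way the lambda above does
bhk = sizes.apply(lambda s: int(s.split(" ")[0]))
print(bhk.tolist())
```

Splitting on the first space works only because every value starts with the bedroom count, so a value without a leading number would raise a `ValueError`.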

In [None]:
df.head()

In [None]:
plt.figure(figsize = (20,8))
ax=sns.countplot(x = 'BHK', data = df)
plt.xticks(rotation = 90)
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va = 'bottom', color = 'black')

In [None]:
df.total_sqft.unique()

In [None]:
sns.boxplot(x = 'BHK', y = 'price', data = df)
plt.show()

In [None]:
sns.violinplot(x='BHK', y = 'bath', data = df)
plt.show()

#### Exploring total_sqft feature

In [None]:
def is_float(x):
    # True when x can be parsed as a plain number, False otherwise
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
df[~df["total_sqft"].apply(is_float)].head(10)

#### From the above we can see that total_sqft can be a range (say, 3090-5002). For such cases we can take the average of the minimum and maximum values in the range. There are other cases, such as 34.46Sq. Meter, which could be converted to square feet using unit conversion. To keep things simple, we are going to drop such corner cases.

#### Converting Sq.Ft to Number

In [None]:
def convert_sqft_to_number(x):
    # Ranges such as "3090 - 5002" become their midpoint; anything unparseable becomes None
    tokens = x.split("-")
    try:
        if len(tokens) == 2:
            return (float(tokens[0])+float(tokens[1]))/2
        return float(x)
    except ValueError:
        return None

In [None]:
df = df.copy()
df["total_sqft"] = df["total_sqft"].apply(convert_sqft_to_number)
df.head(10)
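
The notebook drops values such as 34.46Sq. Meter. If one preferred to keep them, a variant of the converter might look like the sketch below. The helper name `to_sqft` and the regular expression are assumptions for illustration, and only the "Sq. Meter" spelling is handled.

```python
import re

SQM_TO_SQFT = 10.7639  # 1 square metre is roughly 10.7639 square feet

def to_sqft(text):
    """Return the area in square feet, or None if the text cannot be parsed."""
    text = text.strip()
    # Range such as "3090 - 5002": use the midpoint
    parts = text.split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    # Value given in square metres such as "34.46Sq. Meter"
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*Sq\.\s*Meter", text)
    if match:
        return float(match.group(1)) * SQM_TO_SQFT
    # Plain number, or something we do not understand
    try:
        return float(text)
    except ValueError:
        return None

print(to_sqft("3090 - 5002"))     # 4046.0
print(to_sqft("34.46Sq. Meter"))
print(to_sqft("4125Perch"))       # None
```

Other units that may appear in the column (the "4125Perch" string here is just a stand-in for any unsupported unit) would still come back as `None` and be dropped later.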

#### Here, we are adding a new feature called price per square foot. The price column is in lakh rupees, so we multiply by 100000 to get rupees.

In [None]:
df = df.copy()
df["price_per_sqft"] = df["price"]*100000/df["total_sqft"]
df.head()
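
To make the arithmetic concrete, here is a small worked example on two made-up listings. The numbers are assumptions, chosen only so the result is easy to check by hand.

```python
import pandas as pd

# Made-up listings: price in lakh rupees, area in square feet
sample = pd.DataFrame({"price": [50.0, 120.0], "total_sqft": [1000.0, 1500.0]})

# 1 lakh = 100,000 rupees, so 50 lakh / 1000 sqft = 5000 rupees per sqft
sample["price_per_sqft"] = sample["price"] * 100000 / sample["total_sqft"]
print(sample)  # price_per_sqft is 5000.0 and 8000.0
```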

#### Here, we reduce the number of categories in the categorical location feature. Locations with 10 or fewer data points are grouped into a single 'other' category, which keeps the number of distinct locations manageable.

In [None]:
df.location = df.location.apply(lambda x: x.strip())
location_stats = df['location'].value_counts(ascending=False)
location_stats

In [None]:
len(location_stats[location_stats<=10])

In [None]:
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10

In [None]:
df.location = df.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(df.location.unique())
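
The grouping step can be sketched on a toy Series as follows. The location names and the threshold of 1 (instead of 10) are assumptions chosen to keep the example small.

```python
import pandas as pd

# Toy data: three listings in one location, two in another, one in a third
locations = pd.Series(["Whitefield"] * 3 + ["Hebbal"] * 2 + ["Yelahanka"])

counts = locations.value_counts()
# Toy threshold: a location with at most 1 listing counts as rare
rare = counts[counts <= 1].index

# Replace every rare location with the single label 'other'
grouped = locations.where(~locations.isin(rare), "other")
print(grouped.value_counts())
```

Using `isin` with `where` is a vectorized alternative to the `apply` with a lambda used above, and both give the same labels.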

In [None]:
df.head()

#### Here we will discard some more data. Normally a bedroom needs about 300 sqft (i.e. a 2 BHK apartment is at least 600 sqft). A 400 sqft apartment with 2 BHK therefore looks suspicious and can be removed as an outlier. We will remove such outliers by setting the minimum threshold per BHK to 300 sqft.

In [None]:
df[df.total_sqft/df.BHK<300].head()

In [None]:
df = df[~(df.total_sqft/df.BHK<300)]
df.shape
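
A tiny worked example of the 300 sqft per bedroom rule, using made-up rows:

```python
import pandas as pd

# Made-up homes: the second one has only 200 sqft per bedroom
homes = pd.DataFrame({"total_sqft": [1200.0, 400.0, 900.0], "BHK": [2, 2, 3]})

sqft_per_bhk = homes["total_sqft"] / homes["BHK"]  # 600, 200, 300

# Keep rows that are NOT below the threshold; exactly 300 is still kept
kept = homes[~(sqft_per_bhk < 300)]
print(kept)
```

Writing the filter as a negation also keeps rows where `total_sqft` is missing, because a comparison with NaN evaluates to False.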

#### Here we find that the minimum price per sqft is 267 Rs/sqft whereas the maximum is 12000000, which shows a wide variation in property prices. We should remove outliers per location, keeping only the rows that lie within one standard deviation of that location's mean.

In [None]:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out
df = remove_pps_outliers(df)
df.shape
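
The same per-location filter can also be written without an explicit loop. The sketch below is one possible vectorized variant on toy data, not the notebook's own implementation. The toy values are assumptions, and `ddof=0` is used because that is NumPy's default for `np.std`.

```python
import pandas as pd

# Toy data: each location has three similar prices and one extreme value
toy = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "price_per_sqft": [100.0, 110.0, 120.0, 500.0, 300.0, 310.0, 320.0, 900.0],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")                   # per-row mean of its location
std = grouped.transform(lambda s: s.std(ddof=0))   # population std, like np.std

# Keep rows inside (mean - std, mean + std] for their own location
mask = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```

On this toy data the two extreme values (500 and 900) fall outside one standard deviation and are removed, while the other six rows are kept.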

#### Plotting the Scatter Chart for 2 BHK and 3 BHK properties

In [None]:
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.BHK==2)]
    bhk3 = df[(df.location==location) & (df.BHK==3)]
    matplotlib.rcParams['figure.figsize'] = (8,6)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()
    
plot_scatter_chart(df,"Rajaji Nagar")

#### Plotting the histogram for Price Per Square Feet vs Count

In [None]:
plt.hist(df.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

#### Plotting the histogram for Number of bathrooms vs Count

In [None]:
plt.hist(df.bath,rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
df[df.bath>10]

#### It is unusual for a home to have more than 2 bathrooms beyond its number of bedrooms. So we inspect such rows and then discard them.

In [None]:
df[df.bath>df.BHK+2]

In [None]:
df = df[~(df.bath>df.BHK+2)]
df.shape

In [None]:
df.head()

#### Using One Hot Encoding for Location

In [None]:
dummies = pd.get_dummies(df.location)
dummies.head()

#### Concatenating both the dataframes together

In [None]:
df = pd.concat([df,dummies.drop('other',axis='columns')],axis='columns')
df.head()
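
The effect of one-hot encoding and of dropping the 'other' column can be seen on a three-row toy frame. The location names and prices are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["Hebbal", "other", "Whitefield"],
    "price": [60.0, 40.0, 90.0],
})

# One indicator column per distinct location
dummies = pd.get_dummies(toy["location"])

# Drop 'other' so it becomes the baseline: an 'other' row has all indicators off
encoded = pd.concat(
    [toy.drop(columns="location"), dummies.drop(columns="other")],
    axis="columns",
)
print(encoded)
```

Dropping one indicator column avoids perfectly collinear features in the linear model, since the dropped category is implied when all remaining indicators are zero.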

In [None]:
df = df.drop('location',axis='columns')
df.head()

In [None]:
X = df.drop(['price'],axis='columns')
X.head()

In [None]:
y = df.price
y.head()

In [None]:
X = X.drop(['size'],axis='columns')
X.head()

In [None]:
X = X.drop(['price_per_sqft'],axis='columns')
X.head()

In [None]:
X.shape

![](https://i.imgur.com/lUO8zIo.png)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)

In [None]:
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(LinearRegression(), X, y, cv=cv)
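
To see what `ShuffleSplit` does, the sketch below runs it on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration and have nothing to do with the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(100, 1))             # made-up areas
y_toy = 0.05 * X_toy[:, 0] + rng.normal(0, 5, size=100)   # made-up prices

# Five independent random 80/20 splits of the same 100 rows
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in cv.split(X_toy):
    print(len(train_idx), len(test_idx))  # 80 20 on every split

# One R^2 score per split
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv)
print(scores)
```

Unlike K-fold, the test sets of different splits may overlap, because each split reshuffles the full dataset.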

Here we are using Grid Search CV for 3 different types of Regression models:

<b>1. Linear Regression</b>

<b>2. Lasso Regression</b>

<b>3. Decision Tree Regression</b>

![](https://i.imgur.com/EKH21H1.png)

In [None]:
def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                # 'normalize' was removed in scikit-learn 1.2; tune 'fit_intercept' instead
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed to 'squared_error' and removed in scikit-learn 1.2
                'criterion' : ['squared_error','friedman_mse'],
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

![](https://i.imgur.com/dGf4FQS.png)

#### Comparing all the models

In [None]:
find_best_model_using_gridsearchcv(X,y)

![](https://i.imgur.com/ZfSCUnS.png)

In [None]:
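def predict_price(location,sqft,bath,bhk):
    # Columns of X are total_sqft, bath, BHK, followed by one dummy column per location
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk

    # Locations without a dummy column (e.g. those grouped into 'other') keep all zeros
    matches = np.where(X.columns==location)[0]
    if len(matches) > 0:
        x[matches[0]] = 1

    return lr_clf.predict([x])[0]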
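
To make the layout of the input vector explicit, here is a minimal sketch that builds it for a hypothetical set of columns. The column names and the helper `build_features` are assumptions for illustration, and no model is involved.

```python
import numpy as np

# Hypothetical column layout: three numeric features, then location dummies
columns = np.array(["total_sqft", "bath", "BHK", "Hebbal", "Whitefield"])

def build_features(location, sqft, bath, bhk):
    """Return the feature vector for one home, in the order of `columns`."""
    x = np.zeros(len(columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    # Switch on the dummy for this location, if such a column exists
    matches = np.where(columns == location)[0]
    if len(matches) > 0:
        x[matches[0]] = 1
    return x

print(build_features("Whitefield", 1000, 2, 2))  # [1000. 2. 2. 0. 1.]
print(build_features("Unknown", 1000, 2, 2))     # [1000. 2. 2. 0. 0.]
```

A location that has no dummy column is encoded with every indicator set to zero, which is the same encoding as the 'other' baseline.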

#### Here we are predicting the house prices based on Location, Size, Bathroom, and BHK

In [None]:
predict_price('1st Phase JP Nagar',1000, 2, 2)

In [None]:
predict_price('1st Phase JP Nagar',200, 3, 3)

In [None]:
predict_price('Indira Nagar',5000, 2, 2)

In [None]:
predict_price('Indira Nagar',1500, 3, 3)

## References and Acknowledgements

<b>Bengaluru House price data</b> - https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

<b>CodeBasics YouTube Channel</b> - https://www.youtube.com/channel/UCh9nVJoWXmFb7sLApWGcLPQ