# House Price Clustering and Content Based Recommendation System

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[About Data Set](#about)**
3. **[Data Preparation](#prep)**
    - 3.1 - **[Read Data](#read)**
    - 3.2 - **[Analysing Missing Values](#miss)**
    - 3.3 - **[Removing Outliers](#outliers)**
4. **[Exploratory Data Analysis](#eda)**
5. **[Hierarchical Clustering](#hier)**
6. **[K Means Clustering](#kmeans)**
7. **[Density Based Clustering](#DBScan)**
8. **[Principal Component Analysis](#PCA)**
9. **[Recommendation System](#RS)**
10. **[Application](#app)**



<a id="lib"></a>
## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.impute import KNNImputer
from scipy.cluster.hierarchy import linkage,dendrogram,cut_tree
from sklearn.decomposition import KernelPCA
import sys

# Print full arrays instead of truncating them
np.set_printoptions(threshold=sys.maxsize)

plt.rcParams['figure.figsize']=[12,8]

<a id="about"></a>
## 2. About the Dataset
##### AREA_TYPE : The type of the area where the property is located
##### AVAILABILITY : Whether the property is available currently or not
##### LOCATION : Location where the property is situated
##### SIZE : Number of Bedrooms
##### SOCIETY : The type of society where the property is located
##### TOTAL_SQFT : Total Square feet of the property
##### BATH : Number of Bathrooms
##### BALCONY : Number of Balconies
##### PRICE : Price of the property in Lakhs

<a id="prep"></a>
## 3. Data Preparation

<a id="read"></a>
## 3.1 Read Data

In [None]:
df=pd.read_csv('../input/bangalore-house-price/Bengaluru_House_Data.csv')
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.info()


<a id="miss"></a>
## 3.2 Analysing Missing Values

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull())

In [None]:
(df.isnull().sum()/len(df))*100

## Categorical Variables

In [None]:
# About 40% of the values in 'society' are missing, so the column is dropped
df.drop('society',axis=1,inplace=True)
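
The same decision can be made programmatically rather than by reading the percentages by eye. The sketch below is illustrative only: `drop_sparse_columns` and the toy frame are hypothetical, and the 0.4 cut-off simply mirrors the 40% figure mentioned in the comment above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.4):
    """Drop columns whose share of missing values is >= max_missing (hypothetical helper)."""
    missing_share = frame.isnull().mean()  # fraction of missing values per column
    to_drop = missing_share[missing_share >= max_missing].index
    return frame.drop(columns=to_drop)

# Made-up data, not taken from the real dataset
toy = pd.DataFrame({
    "society": ["A", None, None, "B", None],   # 60% missing
    "bath": [2.0, 1.0, np.nan, 3.0, 2.0],      # 20% missing
    "price": [38.0, 30.6, 55.0, 120.0, 62.0],  # complete
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Columns below the threshold, such as `bath` here, are kept and can be imputed later.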

In [None]:
# Fill missing categorical values with the most frequent value (mode)
df['location']=df['location'].fillna(df['location'].mode()[0])
df['area_type']=df['area_type'].fillna(df['area_type'].mode()[0])

In [None]:
# Keep only the leading number from strings such as '2 BHK'; np.number is abstract, so cast to float
df['size']=df['size'].str.split(' ',expand=True)[0].astype(float)
df['bath']=df['bath'].astype(float)

In [None]:
# Encode availability as 1 (ready to move) or 0 (otherwise); kept as object so it is treated as categorical
df['availability']=(df['availability']=='Ready To Move').astype(int).astype(object)
df['availability'].head()

In [None]:
df.isnull().sum() 

## Numerical Variables

In [None]:
df_cat=df.select_dtypes(object)
df_cat.head(1)

In [None]:
df_num=df.select_dtypes(np.number)
df_num.head(1)

In [None]:
imputer = KNNImputer()

# fit on the dataset
imputer.fit(df_num)

# transform the dataset
df_num_impute = pd.DataFrame(imputer.transform(df_num),columns=df_num.columns)
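
To see what `KNNImputer` does with a missing value, here is a minimal sketch on a made-up array. It assumes `n_neighbors=2` for readability, whereas the cell above uses the library default. The missing entry is replaced by the mean of that column over the nearest rows, with distances computed on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows (by the first feature) are [1, 2] and [3, 6],
# so the gap is filled with the mean of 2 and 6
print(X_filled[1])
```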

In [None]:
df=pd.concat([df_cat,df_num_impute],axis=1)

In [None]:
sns.heatmap(df.isnull())

In [None]:
(df.isnull().sum()/len(df))*100

<a id="outliers"></a>
## 3.3 Removing Outliers

In [None]:
df.boxplot()

In [None]:
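# Compute the IQR on the numeric columns only
num_cols = df.select_dtypes(np.number).columns
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
IQR = q3-q1

# Drop rows with any value above the upper fence (q3 + 1.5 * IQR); low-side values are kept
df = df[~(df[num_cols] > (q3 + (1.5 * IQR))).any(axis=1)]

df.head()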

In [None]:
df.boxplot()

In [None]:
df.shape
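
The IQR rule is easier to follow on a single small column. The sketch below uses made-up prices and, as a variation, computes both fences, whereas the cell above filters on the upper fence only.

```python
import pandas as pd

# Made-up prices in lakhs; 400 is an obvious outlier
prices = pd.Series([30, 35, 38, 40, 42, 45, 50, 400], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # lower fence
upper = q3 + 1.5 * iqr  # upper fence

kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper)
print(kept.tolist())
```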

<a id="eda"></a>
## 4. Exploratory Data Analysis

In [None]:
sns.violinplot(y = df.price)

In [None]:
# Top 10 locations by number of ready-to-move properties
top_available=df[df['availability'] == 1].groupby('location')['availability'].agg('count').sort_values(ascending=False).head(10)
plt.bar(x=top_available.index,height=top_available)
plt.xticks(rotation=45)
plt.ylabel('Number of Available Properties')

In [None]:
sns.countplot(x=df['area_type'])
plt.xticks(rotation = 45)
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x=df['availability'])
plt.show()

In [None]:
plt.figure(figsize=(15,5))
sns.scatterplot(x="total_sqft", y="price", data=df)
plt.show()

In [None]:
df10 = df.copy()
# Imputed bedroom counts can be fractional, so round them before grouping
df10['size'] = np.round(df10['size'])
# Mean price per bedroom count, top 10 by mean price
avg_price_by_size=df10.groupby('size')['price'].agg('mean').sort_values(ascending=False).head(10)
plt.bar(x=avg_price_by_size.index,height=avg_price_by_size)
plt.ylabel('Average Price in Lakhs')
plt.xlabel('Number of Bedrooms')

In [None]:
# Costliest locations in Bangalore

In [None]:
plt.plot(df.groupby(by='location')['price'].agg('mean').sort_values(ascending=False).head(10))
plt.xlabel('Locations in Bangalore')
plt.ylabel('Average Price in Lakhs')
plt.xticks(rotation=45)

In [None]:
# Costliest price per sq. ft. in Bangalore

In [None]:
df['per_sqft'] = df['price']/df['total_sqft']

In [None]:
plt.plot(df.groupby(by='location')['per_sqft'].agg('mean').sort_values(ascending=False).head(10))
plt.xticks(rotation = 45)
plt.xlabel('Locations in Bangalore')
plt.ylabel('Average Price / Sq.ft in Lakhs')

In [None]:
df = df.drop('per_sqft',axis=1)

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True),annot=True,cbar=False)
plt.show()

## Standardisation

In [None]:
df.head()

In [None]:
data_cat=df.select_dtypes(exclude=np.number)
data_num=df.select_dtypes(np.number)
data_cat=data_cat.reset_index()

In [None]:
ss=StandardScaler()
data_num_scaled=pd.DataFrame(ss.fit_transform(data_num),columns=data_num.columns)
data_num_scaled=data_num_scaled.reset_index()
data_num_scaled.head()

In [None]:
data_scaled=pd.concat([data_num_scaled,data_cat],axis=1).drop('index',axis=1)
data_scaled.head()
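
As a quick check on what `StandardScaler` computes, the sketch below compares it with a manual z-score, $z = (x - \mu) / \sigma$, on made-up values. The two columns only stand in for `total_sqft` and `price`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: columns stand in for total_sqft and price
X = np.array([[1000.0, 38.0],
              [1200.0, 51.0],
              [1500.0, 62.0],
              [2600.0, 120.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (ddof=0), as StandardScaler does
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and unit variance, so distance-based methods such as K-Means are not dominated by the column with the largest raw range.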

## Encoding

In [None]:
# Reference Table For Displaying Reason for Selecting
data_encoded_refernce=pd.get_dummies(df,columns=['area_type','location'])
data_encoded_refernce.head()

In [None]:
data_encoded_scaled=pd.get_dummies(data_scaled,columns=['area_type','location'])
data_encoded_scaled.head()

<a id="hier"></a>
## 5. Hierarchical Clustering

In [None]:
d1 = data_encoded_scaled.copy()

In [None]:
data_cluster=data_encoded_refernce.copy()

In [None]:
cls=d1[['total_sqft','price']]
mergings=linkage(cls,method='ward',metric='euclidean')
dendrogram(mergings,truncate_mode='lastp')
plt.show()

In [None]:
cluster=cut_tree(mergings,n_clusters=3)

In [None]:
cluster_cut = pd.Series(cut_tree(mergings,n_clusters=3).reshape(-1))

In [None]:
cluster_cut.value_counts()

In [None]:
d1['cluster']=cluster

In [None]:
data_cluster['cluster']=cluster

In [None]:
sns.scatterplot(y=df.price,x=df['total_sqft'],hue=data_cluster['cluster'],palette='deep')
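
The `linkage` and `cut_tree` steps can be sketched on six made-up points that form two obvious groups. This is only an illustration of the two calls used above, not a replacement for the analysis on the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Made-up 2-D points: three near the origin, three near (10, 10)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Ward linkage merges the pair of clusters that least increases within-cluster variance
mergings = linkage(points, method="ward", metric="euclidean")

# cut_tree returns an (n, 1) array; flatten it to get one label per point
labels = cut_tree(mergings, n_clusters=2).reshape(-1)
print(labels)
```

Flattening with `reshape(-1)` gives a one-dimensional label array, which is convenient when the labels are stored as a DataFrame column.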

<a id="kmeans"></a>
## 6. K Means Clustering

In [None]:
ssd = []
for k in range(1,10):
    kmeans = KMeans(n_clusters=k,random_state=4)
    kmeans.fit(cls)
    ssd.append(kmeans.inertia_)

In [None]:
plt.plot(range(1,10),ssd,marker='*',color='b')

In [None]:
from sklearn.metrics import silhouette_score

score = []
for k in range(2,10):
    kmeans = KMeans(n_clusters=k,random_state=4)
    kmeans.fit(cls)
    labels = kmeans.labels_
    # Score the labels on the same features the model was fitted on
    sil = silhouette_score(cls,labels)
    score.append(sil)

In [None]:
plt.plot(range(2,10),score,marker='*',color='r')
plt.ylabel('Average silhouette score')
plt.xlabel('no of clusters')
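
The elbow and silhouette checks can be tried on synthetic data where the right answer is known. The sketch below assumes three well-separated blobs generated with `make_blobs`. The data and the variable names (`inertia`, `silhouette`, `best_k`) are illustrative and not part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=4)

inertia = {}
silhouette = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=4).fit(X)
    inertia[k] = km.inertia_                         # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, km.labels_)  # mean silhouette coefficient

# Pick the k with the highest average silhouette score
best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

Inertia keeps falling as k grows, so the elbow is read off by eye. The silhouette score has a clear maximum, which is why the two plots are usually read together.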

In [None]:
def svisualizer(x, ncluster):
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    import numpy as np
    from matplotlib import cm
    from sklearn.metrics import silhouette_samples

    km = KMeans(n_clusters=ncluster, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=0)
    y_km = km.fit_predict(x)

    cluster_labels = np.unique(y_km)
    n_clusters = cluster_labels.shape[0]
    silhouette_vals = silhouette_samples(x, y_km, metric='euclidean')
    y_ax_lower, y_ax_upper = 0, 0

    yticks = []
    for i, c in enumerate(cluster_labels):
        c_silhouette_vals = silhouette_vals[y_km==c]
        c_silhouette_vals.sort()
        y_ax_upper += len(c_silhouette_vals)
        color = cm.jet(i / n_clusters)
        plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)

        yticks.append((y_ax_lower + y_ax_upper) / 2)
        y_ax_lower += len(c_silhouette_vals)

    silhouette_avg = np.mean(silhouette_vals)
    plt.axvline(silhouette_avg, color="red", linestyle="--") 

    plt.yticks(yticks, cluster_labels + 1)
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette coefficient')

    plt.tight_layout()
    plt.show()	

In [None]:
svisualizer(cls,2)

In [None]:
svisualizer(cls,3)

In [None]:
svisualizer(cls,4)

In [None]:
model=KMeans(n_clusters=3,random_state=10)
cluster_kmeans=model.fit_predict(data_encoded_refernce)
data_cluster['kmeans_cluster']=cluster_kmeans
data_cluster.head()

In [None]:
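# First six (non-dummy) columns plus the K-Means label, selected by name rather than a hard-coded position
df3 = data_cluster[list(data_cluster.columns[:6])+['kmeans_cluster']]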

In [None]:
df4 = df3.groupby(by='kmeans_cluster')
df4[['size','bath','balcony','total_sqft','price']].mean()

In [None]:
df4[['size','bath','balcony','total_sqft','price']].mean().plot.bar()
plt.show()

In [None]:
def cluster_plot(data, nclusters):
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    X = data.copy()
    cols = list(X.columns)
    km = KMeans(n_clusters=nclusters, init='random', n_init=10, max_iter=300, tol=1e-04, random_state=0)
    y_km = km.fit_predict(X)


    # Visualize it:
    plt.figure(figsize=(8, 6))
    plt.scatter(X.iloc[:,0], X.iloc[:,1], c=km.labels_.astype(float))

    # plot the centroids
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=250, marker='*', c='red', label='centroids')
    plt.xlabel(cols[0])
    plt.ylabel(cols[1])
    plt.legend(scatterpoints=1)
    plt.grid()
    plt.show()

In [None]:
cluster_plot(cls,3)

<a id="DBScan"></a>
## 7. Density Based Clustering (DBSCAN)

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
dbscan = DBSCAN(eps=0.25,min_samples=4)

dbscan.fit(cls)

pd.Series(dbscan.labels_).value_counts()

In [None]:
# Collapse every cluster label to 0 so that only noise (-1) versus clustered (0) remains
lbl = pd.Series(dbscan.labels_)
lbl.loc[lbl >= 0] = 0
lbl.value_counts()

In [None]:
plt.scatter(cls['total_sqft'],cls['price'],c=lbl)
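
DBSCAN marks points that do not belong to any dense region with the label `-1`. The sketch below shows this on made-up points; the `eps` and `min_samples` values are chosen for the toy data only and are not tuned for the housing data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: five packed closely together and one far away
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [0.1, 0.1], [0.05, 0.05], [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = db.labels_

# -1 marks noise; non-negative integers are cluster ids
is_noise = labels == -1
print(labels)
print(points[is_noise])
```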

<a id="PCA"></a>
## 8. Principal Component Analysis (PCA)

In [None]:
from sklearn.decomposition import PCA

In [None]:
d2 = data_num_scaled.drop(['index'],axis=1)

In [None]:
d2.head()

In [None]:
pca=PCA()
pca.fit(d2)

print(np.cumsum(pca.explained_variance_ratio_*100))

In [None]:
pca=PCA(n_components=2)
pca.fit_transform(d2)

print(pca.explained_variance_ratio_*100)
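
One common way to read the cumulative explained variance is to keep the smallest number of components that reaches a chosen threshold. The sketch below assumes a 95% threshold and synthetic, standardised data with two perfectly correlated columns. Both the threshold and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: the second column is an exact multiple of the first, the third is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base, 2.0 * base, rng.normal(size=200)])
X_scaled = StandardScaler().fit_transform(X)

pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

Because two of the three columns carry the same information, two components are enough to reach the threshold here.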


In [None]:
components = pca.components_.T
pd.DataFrame(components,index=d2.columns,columns=['PC1','PC2'])

In [None]:
data_pca=pd.DataFrame(pca.transform(d2),columns=['PC1','PC2'])
data_pca

In [None]:
model=KMeans(n_clusters=3,random_state=10)
cluster_kmeans=model.fit_predict(data_pca)
data_pca['km_cluster_pca']=cluster_kmeans

In [None]:
#cluster_kmeans=model.fit_predict(data_pca).reshape(-1,1)

In [None]:
ssd = []
for k in range(1,10):
    kmeans = KMeans(n_clusters=k,random_state=4)
    # Fit on the two components only, not on the cluster-label column added above
    kmeans.fit(data_pca[['PC1','PC2']])
    ssd.append(kmeans.inertia_)

plt.plot(range(1,10),ssd,marker='*',color='b')
plt.axhline(15000)
plt.show()

In [None]:
sns.scatterplot(x=data_pca['PC1'],y=data_pca['PC2'],hue=data_pca['km_cluster_pca'])

<a id="RS"></a>
## 9. Recommendation System

## Content Based Filter

In [None]:
model=NearestNeighbors(metric='cosine')
model.fit(data_encoded_scaled)

In [None]:
selected=7

In [None]:
data_encoded_scaled.iloc[selected].values[0:8]

In [None]:
dist,index=model.kneighbors(data_encoded_scaled.iloc[selected].values.reshape(1,-1),n_neighbors=6)
index

In [None]:
suggest=[]
for x in index:
    suggest.append(x)
suggest

In [None]:
data_encoded_refernce = data_encoded_refernce.reset_index()
data_encoded_refernce = data_encoded_refernce.drop('index',axis=1)

## Example 1

In [None]:
# Apartment the user showed interest in
r=data_encoded_refernce.iloc[[selected]][data_encoded_refernce.iloc[[selected]]>0].dropna(axis=1)
r

In [None]:
# Availability - 1           --> Ready to move
# Size         - 2           --> 2 BHK apartment
# Total SqFt   - 1000        --> 1000 sq. ft. apartment
# Price        - 38          --> Apartment priced at 38 lakhs
# bath         - 2           --> Apartment with 2 bathrooms
# balcony      - 1           --> Apartment with 1 balcony
# location     - JP Nagar    --> Apartment near JP Nagar

In [None]:
# Similar recommendations (the first row is the selected apartment itself)
col=r.columns
suggestions=data_encoded_refernce.iloc[index[0]][col]
suggestions

## Example 2

In [None]:
selected=125

data_encoded_scaled.iloc[selected].values[0:8]

dist,index=model.kneighbors(data_encoded_scaled.iloc[selected].values.reshape(1,-1),n_neighbors=6)
index
suggest=[]
for x in index:
    suggest.append(x)
suggest

In [None]:
# Apartment the user showed interest in
r=data_encoded_refernce.iloc[[selected]][data_encoded_refernce.iloc[[selected]]>0].dropna(axis=1)
r

In [None]:
# Size        - 2                           --> 2 BHK apartment
# Total SqFt  - 1020                        --> 1020 sq. ft. apartment
# Price       - 30.6                        --> Apartment priced at 30.6 lakhs
# bath        - 2                           --> Apartment with 2 bathrooms
# balcony     - 1                           --> Apartment with 1 balcony
# location    - Electronic City Phase II    --> Apartment near Electronic City Phase II

In [None]:
# Similar recommendations (the first row is the selected apartment itself)
col=r.columns
suggestions=data_encoded_refernce.iloc[index[0]][col]
suggestions

The recommendations returned in both examples are apartments in the same location with a similar number of bedrooms, bathrooms and balconies, offered at a similar price.
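
The lookup steps used in the two examples can be wrapped in a small function. The sketch below is a minimal illustration on made-up, already scaled feature vectors. The `recommend` helper and the toy frame are hypothetical and not part of the notebook. It fits `NearestNeighbors` with the cosine metric, asks for one extra neighbour, and drops the selected row from the result.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(features, selected, n_recommendations=2):
    """Return positions of the rows most similar to `selected` (hypothetical helper)."""
    model = NearestNeighbors(metric="cosine")
    model.fit(features.to_numpy())
    query = features.iloc[[selected]].to_numpy()
    # Ask for one extra neighbour because the query row matches itself at distance 0
    _, idx = model.kneighbors(query, n_neighbors=n_recommendations + 1)
    return [int(i) for i in idx[0] if i != selected][:n_recommendations]

# Made-up scaled features for five listings
toy = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [0.9, 0.1, 1.0],
     [-1.0, 1.0, 0.0],
     [1.0, 0.0, 0.8],
     [-0.9, 1.0, 0.1]],
    columns=["size", "bath", "price"],
)

print(recommend(toy, selected=0))
```

Cosine distance compares the direction of the feature vectors rather than their magnitude, so listings with a similar profile rank highest.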

<a id="app"></a>
## 10. Application

Suggesting products with similar features is useful on any website that sells something. It gives customers a variety of relevant options to explore and, by showcasing the range the seller has available, improves the chance of turning a visitor into a customer.