Q1. In order to predict house price based on several characteristics, such as location, square footage,
number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this
situation would be the best to employ?

Dataset link: https://drive.google.com/file/d/1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0/view?usp=share_link

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [35]:
data = pd.read_csv('/content/Bengaluru_House_Data.csv')

In [36]:
data.shape

(13320, 9)

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [38]:
data = data.drop(columns=['area_type', 'availability', 'society', 'balcony'])
data.head()

                   location       size  total_sqft  bath  price
0  Electronic City Phase II      2 BHK        1056   2.0  39.07
1          Chikka Tirupathi  4 Bedroom        2600   5.0  120.0
2               Uttarahalli      3 BHK        1440   2.0   62.0
3        Lingadheeranahalli      3 BHK        1521   3.0   95.0
4                  Kothanur      2 BHK        1200   2.0   51.0


In [39]:
data.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [40]:
data = data.dropna()

In [41]:
data.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

In [42]:
data['bhk'] = data['size'].str.split().str.get(0).astype(int)

In [43]:
data['bhk'].unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18])

In [44]:
data[data['bhk']>20]

                       location        size  total_sqft  bath  price  bhk
1718  2Electronic City Phase II      27 BHK        8000  27.0  230.0   27
4684                Munnekollal  43 Bedroom        2400  40.0  660.0   43


In [45]:
data['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

In [46]:
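def convert_range(x):
    # Convert a total_sqft string to a float. A range such as '1133 - 1384'
    # is replaced by the midpoint of its two endpoints.
    temp = x.split('-')
    try:
        if len(temp) == 2:
            return (float(temp[0]) + float(temp[1])) / 2
        return float(x)
    except ValueError:
        # Entries that cannot be parsed as numbers become missing values.
        return None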

In [47]:
data['total_sqft'] = data['total_sqft'].apply(convert_range)

In [48]:
data['price_per_sqft'] = data['price']*100000 / data['total_sqft']

In [49]:
data

                       location       size  total_sqft  bath   price  bhk  price_per_sqft
0      Electronic City Phase II      2 BHK      1056.0   2.0   39.07    2     3699.810606
1              Chikka Tirupathi  4 Bedroom      2600.0   5.0  120.00    4     4615.384615
2                   Uttarahalli      3 BHK      1440.0   2.0   62.00    3     4305.555556
3            Lingadheeranahalli      3 BHK      1521.0   3.0   95.00    3     6245.890861
4                      Kothanur      2 BHK      1200.0   2.0   51.00    2     4250.000000
...                         ...        ...         ...   ...     ...  ...             ...
13315                Whitefield  5 Bedroom      3453.0   4.0  231.00    5     6689.834926
13316             Richards Town      4 BHK      3600.0   5.0  400.00    4    11111.111111
13317     Raja Rajeshwari Nagar      2 BHK      1141.0   2.0   60.00    2     5258.545136
13318           Padmanabhanagar      4 BHK      4689.0   4.0  488.00    4    10407.336319


In [50]:
data['location'].value_counts()

Whitefield           534
Sarjapur  Road       392
Electronic City      302
Kanakpura Road       266
Thanisandra          233
                    ... 
Vidyapeeta             1
Maruthi Extension      1
Okalipura              1
Old Town               1
Abshot Layout          1
Name: location, Length: 1304, dtype: int64

In [51]:
data['location'] = data['location'].apply(lambda x : x.strip())
location_count = data['location'].value_counts()

In [52]:
location_count

Whitefield                        535
Sarjapur  Road                    392
Electronic City                   304
Kanakpura Road                    266
Thanisandra                       236
                                 ... 
Vasantapura main road               1
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
Abshot Layout                       1
Name: location, Length: 1293, dtype: int64

In [53]:
location_count_less_10 = location_count[location_count<=10]
location_count_less_10

Naganathapura                     10
Sadashiva Nagar                   10
Nagappa Reddy Layout              10
BTM 1st Stage                     10
Sector 1 HSR Layout               10
                                  ..
Vasantapura main road              1
Bapuji Layout                      1
1st Stage Radha Krishna Layout     1
BEML Layout 5th stage              1
Abshot Layout                      1
Name: location, Length: 1052, dtype: int64

In [54]:
data['location'] = data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)

In [55]:
data['location'].value_counts()

other                 2881
Whitefield             535
Sarjapur  Road         392
Electronic City        304
Kanakpura Road         266
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: location, Length: 242, dtype: int64

In [56]:
data.describe()

        total_sqft      bath       price       bhk  price_per_sqft
count      13200.0   13246.0     13246.0   13246.0         13200.0
mean     1573.1051  2.692586  112.389392  2.801902        7893.298
std    1266.432547  1.341506  149.076587  1.295758        106728.1
min            1.0       1.0         8.0       1.0        267.8298
25%         1100.0       2.0        50.0       2.0        4230.769
50%         1280.0       2.0        72.0       3.0        5416.667
75%         1685.0       3.0       120.0       3.0        7307.692
max        52272.0      40.0      3600.0      43.0      12000000.0


In [57]:
(data['total_sqft']/data['bhk']).describe()

count    13200.000000
mean       581.040216
std        396.942188
min          0.250000
25%        473.333333
50%        553.333333
75%        626.666667
max      26136.000000
dtype: float64

In [58]:
data = data[((data['total_sqft']/data['bhk'])>=300)]

In [59]:
data.head()

                   location       size  total_sqft  bath  price  bhk  price_per_sqft
0  Electronic City Phase II      2 BHK      1056.0   2.0  39.07    2     3699.810606
1          Chikka Tirupathi  4 Bedroom      2600.0   5.0  120.0    4     4615.384615
2               Uttarahalli      3 BHK      1440.0   2.0   62.0    3     4305.555556
3        Lingadheeranahalli      3 BHK      1521.0   3.0   95.0    3     6245.890861
4                  Kothanur      2 BHK      1200.0   2.0   51.0    2          4250.0


In [60]:
data.shape

(12456, 7)

In [61]:
def remove_pps_outliers(df):
    # Within each location, keep only the rows whose price_per_sqft lies
    # within one standard deviation of that location's mean.
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out

In [62]:
data = remove_pps_outliers(data)

In [63]:
data.shape

(10222, 7)

In [64]:
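def remove_bhk_outliers(df):
    # Within each location, drop n-bedroom listings whose price_per_sqft is
    # below the mean price_per_sqft of the (n-1)-bedroom listings, provided
    # that smaller-bedroom group has more than 5 rows.
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        # Collect per-bhk statistics for this location.
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        # Compare each bhk group against the group with one fewer bedroom.
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')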

In [65]:
data = remove_bhk_outliers(data)

In [66]:
data.shape

(7303, 7)

In [67]:
data.describe()

        total_sqft      bath      price       bhk  price_per_sqft
count       7303.0    7303.0     7303.0    7303.0          7303.0
mean   1494.385818  2.452691  98.785051  2.504176     6125.221728
std     857.157929  1.007056  92.900584  0.923655     2424.906442
min          300.0       1.0       10.0       1.0          1300.0
25%         1100.0       2.0       50.0       2.0     4583.333333
50%         1260.0       2.0      73.31       2.0     5670.926518
75%         1680.0       3.0      112.0       3.0     6903.382228
max        30000.0      16.0     2200.0      16.0    24509.803922


In [68]:
data = data.drop(columns=['size', 'price_per_sqft'])

In [69]:
data.sample(10)

                location  total_sqft  bath  price  bhk
227   8th Phase JP Nagar      1160.0   2.0   95.0    2
5193          Nagarbhavi       884.0   2.0   36.0    1
6881         Uttarahalli       850.0   2.0   35.0    2
1931     Electronic City      1360.0   2.0   75.0    3
9426               other      1464.0   2.0  135.0    3
5499          R.T. Nagar      1560.0   3.0  125.0    3
6955         Uttarahalli       900.0   2.0   35.0    2
7060        Vasanthapura      1037.0   2.0  36.28    2
4723  Kumaraswami Layout      1081.0   2.0   60.0    2
8810               other      1170.0   2.0   86.0    2


In [70]:
X = data.drop(columns=['price'])
y = data['price']

In [71]:
from sklearn.model_selection import train_test_split

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [73]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [74]:
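# One-hot encode 'location' and pass the numeric columns through unchanged.
# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False.
column_trans = make_column_transformer((OneHotEncoder(sparse_output=False), ['location']), remainder='passthrough')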

In [75]:
scaler = StandardScaler()

In [76]:
svr = SVR()

In [77]:
pipe = make_pipeline(column_trans, scaler, svr)

In [78]:
pipe.fit(X_train, y_train)



In [79]:
y_pred = pipe.predict(X_test)

In [80]:
r2_score(y_test, y_pred)

0.17683618329051776
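
The cell above reports only R-squared. As a hedged sketch of how the common regression metrics could be computed side by side, the snippet below applies scikit-learn's metric functions to a handful of made-up prices. The arrays `y_true_demo` and `y_pred_demo` are illustrative names and values, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only (not drawn from the Bengaluru data).
y_true_demo = np.array([50.0, 72.0, 120.0, 95.0])
y_pred_demo = np.array([55.0, 70.0, 110.0, 100.0])

mae = mean_absolute_error(y_true_demo, y_pred_demo)  # average absolute error, in price units
mse = mean_squared_error(y_true_demo, y_pred_demo)   # average squared error, in squared price units
rmse = np.sqrt(mse)                                   # square root of MSE, back in price units
r2 = r2_score(y_true_demo, y_pred_demo)               # unitless share of variance explained

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

In the notebook itself, the same calls could be applied to `y_test` and `y_pred`.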

Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as
your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price
of a house as accurately as possible?

In [None]:
# MSE measures the average squared difference between predicted and actual prices, so a lower MSE means the predictions are closer to the true values.
# R-squared measures how much of the variance in the dependent variable the model explains; it is a relative, unitless score and does not directly state how large the prediction errors are.
# For predicting the actual price as accurately as possible, MSE is the more appropriate metric.
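
A small sketch of the difference between the two metrics: rescaling the target changes MSE but leaves R-squared untouched, which is one way to see that R-squared does not report error size. The prices are made up, and treating `price` as lakhs of rupees is an assumption suggested by the `price*100000` conversion earlier in the notebook. The helpers `mse` and `r2` are written here for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared residuals.
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares).
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative prices in lakhs (assumed unit) and the same values in rupees.
y_true_lakh = np.array([40.0, 60.0, 95.0, 120.0])
y_pred_lakh = np.array([45.0, 58.0, 90.0, 130.0])
y_true_rs = y_true_lakh * 100_000
y_pred_rs = y_pred_lakh * 100_000

print("MSE (lakhs^2): ", mse(y_true_lakh, y_pred_lakh))
print("MSE (rupees^2):", mse(y_true_rs, y_pred_rs))
print("R2 (lakhs):    ", r2(y_true_lakh, y_pred_lakh))
print("R2 (rupees):   ", r2(y_true_rs, y_pred_rs))
```

MSE grows by the square of the conversion factor, while R-squared is the same in both units.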

Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate
regression metric to use with your SVM model. Which metric would be the most appropriate in this
scenario?

In [None]:
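# When dealing with a dataset that has a significant number of outliers, Mean Absolute Error (MAE) would be the most appropriate regression metric to use with an SVM model. MAE weights every error linearly, whereas MSE and RMSE square the errors and are therefore dominated by a few large residuals.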
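
To illustrate how differently the two kinds of metric react to an outlier, the sketch below compares MAE and MSE on a set of hypothetical residuals before and after one extreme value is added. All numbers and the helper names `mae` and `mse` are made up for the illustration.

```python
import numpy as np

def mae(residuals):
    # Mean of absolute residuals.
    return float(np.mean(np.abs(residuals)))

def mse(residuals):
    # Mean of squared residuals.
    return float(np.mean(residuals ** 2))

# Hypothetical residuals (actual minus predicted) for four ordinary points.
residuals = np.array([2.0, -3.0, 1.0, -2.0])
# The same residuals plus one extreme outlier.
residuals_with_outlier = np.append(residuals, 100.0)

mae_ratio = mae(residuals_with_outlier) / mae(residuals)
mse_ratio = mse(residuals_with_outlier) / mse(residuals)

print(f"MAE grows by a factor of {mae_ratio:.1f}")
print(f"MSE grows by a factor of {mse_ratio:.1f}")
```

A single outlier inflates MSE far more than MAE, because its error enters squared.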

Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best
metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values
are very close. Which metric should you choose to use in this case?

In [None]:
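# RMSE is the square root of MSE, so the two metrics always rank models in the same order and either one could be used to evaluate the SVM regression model with a polynomial kernel. RMSE is usually the more convenient one to report because it is expressed in the same units as the target variable.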
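
As a minimal sketch of the relationship between the two metrics, the snippet below scores two hypothetical sets of predictions (named `model_a` and `model_b` purely for illustration) with both MSE and RMSE and checks which one each metric prefers.

```python
import numpy as np

y_true = np.array([50.0, 72.0, 120.0, 95.0])
# Predictions from two hypothetical models; the values are made up.
preds = {
    "model_a": np.array([55.0, 70.0, 110.0, 100.0]),
    "model_b": np.array([52.0, 71.0, 118.0, 96.0]),
}

# MSE is the mean squared residual; RMSE is its square root.
mse_scores = {name: float(np.mean((y_true - p) ** 2)) for name, p in preds.items()}
rmse_scores = {name: float(np.sqrt(m)) for name, m in mse_scores.items()}

best_by_mse = min(mse_scores, key=mse_scores.get)
best_by_rmse = min(rmse_scores, key=rmse_scores.get)

print("MSE: ", mse_scores, "-> best:", best_by_mse)
print("RMSE:", rmse_scores, "-> best:", best_by_rmse)
```

Because the square root is monotonic, the model with the lower MSE also has the lower RMSE.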

Q5. You are comparing the performance of different SVM regression models using different kernels (linear,
polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most
appropriate if your goal is to measure how well the model explains the variance in the target variable?

In [None]:
# R-squared would be the most appropriate evaluation metric to use if your goal is to measure how well the model explains the variance in the target variable. 
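
A hedged sketch of such a comparison: the snippet below fits `SVR` with linear, polynomial, and RBF kernels on a small synthetic dataset and reports R-squared on a held-out split. The data, the default hyperparameters, and the variable names are assumptions made for the illustration. The resulting scores say nothing about the house-price data.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic, non-linear toy data (not the Bengaluru dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit one scaled SVR per kernel and record its test-set R-squared.
scores = {}
for kernel in ("linear", "poly", "rbf"):
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    model.fit(Xd_train, yd_train)
    scores[kernel] = r2_score(yd_test, model.predict(Xd_test))

for kernel, score in scores.items():
    print(f"{kernel:>6}: R^2 = {score:.3f}")
```

Since R-squared is unitless, the three scores can be compared directly, and the kernel with the highest value explains the most variance on the held-out data.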