<a href="https://colab.research.google.com/github/soumo99/AI_ML_Projects/blob/main/Real_Estate_Regression_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


Loading the dataset

In [None]:
data_1 = pd.read_csv('/content/Bengaluru_House_Data.csv')
data_1.shape

In [None]:
data_1.head(10)

In [None]:
data_1.info()

Counting the rows in each area type category

In [None]:
data_1.groupby('area_type')['area_type'].agg('count')

Dropping the extra columns 

In [None]:
data_2 = data_1.drop(['area_type','society','balcony','availability'],axis='columns')
data_2.head(10)

Checking for null values in the dataframe 

In [None]:
data_2.isnull().sum()

Dropping the rows with NA values. The dataset has many rows, so dropping them should not affect the result much; otherwise we would fill those values with the mean or median.

In [None]:
data_3 = data_2.dropna()
data_3.isnull().sum()

In [None]:
data_3.shape
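The two options mentioned above, dropping rows versus filling with the median, can be compared on a small sketch. The toy frame and its values are made up for illustration; only the column names `bath` and `price` come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are not taken from the real dataset.
toy = pd.DataFrame({'bath': [2.0, np.nan, 3.0, 2.0],
                    'price': [50.0, 60.0, np.nan, 40.0]})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each numeric column with its own median.
filled = toy.fillna(toy.median(numeric_only=True))

print(dropped.shape)
print(filled)
```

Dropping keeps only the complete rows, while filling keeps every row at the cost of inventing values, so it is mainly worth considering when rows are scarce.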

The unique function returns the distinct values of a particular column in the dataset

In [None]:
data_3['size'].unique()

Creating a new column named bhk by taking each value x of the size column and extracting only its leading number.

In [None]:
data_3['bhk'] = data_3['size'].apply(lambda x:int(x.split(' ')[0]))
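To make the extraction step concrete, here is a small sketch on made-up size strings written in the same style as the dataset. The helper name `extract_bhk` is hypothetical and only wraps the same split-and-convert logic as the lambda.

```python
import pandas as pd

# Made-up examples of size strings.
sizes = pd.Series(['2 BHK', '4 Bedroom', '1 RK', '3 BHK'])

def extract_bhk(size):
    # The token before the first space is the room count.
    return int(size.split(' ')[0])

bhk = sizes.apply(extract_bhk)
print(bhk.tolist())
```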

In [None]:
data_3.head()

In [None]:
data_3['bhk'].unique()

In [None]:
data_3[data_3.bhk>20]

In [None]:
data_3.total_sqft.unique()

Trying to convert each total square feet value to float; values that cannot be converted fall into the except block.

In [None]:
def is_float(x):
  try:
    float(x)
  except (TypeError, ValueError):
    # float() raises one of these when x is not a plain number
    return False
  return True

In [None]:
data_3[~data_3['total_sqft'].apply(is_float)] # ~ negates the mask, so this shows the rows whose total_sqft cannot be converted to float
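The effect of the negated mask is easier to see on a handful of values. This is a minimal sketch with made-up strings; it redefines `is_float` locally, with an explicit exception list, so that it runs on its own.

```python
import pandas as pd

def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# Made-up values: two plain numbers, one range, one value with a unit.
values = pd.Series(['1056', '2100 - 2850', '34.46Sq. Meter', '1200.5'])

mask = values.apply(is_float)
not_convertible = values[~mask]
print(not_convertible.tolist())
```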

Data cleaning process started 

Converting range values to a single value by averaging the two ends; all other values are converted to float directly.

In [None]:
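def convert_sqft_to_num(x):
  tokens = x.split('-')
  try:
    # A range such as '2100-2500' is replaced by the average of its two ends
    if len(tokens) == 2:
      return (float(tokens[0])+float(tokens[1]))/2
    return float(x)
  except ValueError:
    # Anything else, e.g. '34.46Sq. Meter', cannot be converted
    return None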

In [None]:
convert_sqft_to_num('2166')

In [None]:
convert_sqft_to_num('2100-2500')

In [None]:
convert_sqft_to_num('34.46Sq. Meter')

In [None]:
data_4 = data_3.copy()
data_4['total_sqft'] = data_4['total_sqft'].apply(convert_sqft_to_num)
data_4.head(10)

In [None]:
data_4.loc[410]

In [None]:
data_4.loc[30]

Calculating price per square foot

In [None]:
data_5 = data_4.copy()
data_5['price_per_sqft'] = data_5['price']*100000 / data_5['total_sqft']

In [None]:
data_5.head(10)
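As a worked example of the formula, the sketch below applies the same expression to two made-up listings. The numbers are illustrative only.

```python
import pandas as pd

# Made-up listings: price in the dataset's units, area in square feet.
toy = pd.DataFrame({'price': [50.0, 120.0],
                    'total_sqft': [1000.0, 1500.0]})

# Same expression as in the notebook: scale the price by 100000, then divide by area.
toy['price_per_sqft'] = toy['price'] * 100000 / toy['total_sqft']
print(toy)
```

For the first row this gives 50 * 100000 / 1000 = 5000, and for the second 120 * 100000 / 1500 = 8000.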

Checking the number of unique locations 

In [None]:
data_5.location.unique()


In [None]:
len(data_5.location.unique())

In [None]:
# Removing leading and trailing whitespace from the location names
data_5.location = data_5.location.apply(lambda x:x.strip())

# Counting the datapoints per location, sorted in descending order
location_stats = data_5.groupby('location')['location'].agg('count').sort_values(ascending=False)
location_stats

Checking how many locations have 10 or fewer datapoints.
These will be grouped together as other locations.

In [None]:
len(location_stats[location_stats <= 10])

In [None]:
location_stats_less_than_10 = location_stats[location_stats <= 10]
location_stats_less_than_10

In [None]:
len(data_5.location.unique())

Relabelling the locations with 10 or fewer datapoints as other

In [None]:
data_5.location = data_5.location.apply(lambda x:'other' if x in location_stats_less_than_10 else x)

In [None]:
len(data_5.location.unique())
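The whole relabelling step can be sketched on a toy frame. This version uses `value_counts` and `isin` instead of the `groupby` and lambda from the cells above, and a threshold of 1 instead of 10 so that the toy data stays small; the location strings and the `threshold` name are illustrative.

```python
import pandas as pd

# Toy locations with stray whitespace, as in the raw data.
toy = pd.DataFrame({'location': ['Whitefield ', 'Whitefield', ' Hebbal',
                                 'Yelahanka', 'Whitefield']})

# Strip whitespace so that identical names are counted together.
toy['location'] = toy['location'].str.strip()

counts = toy['location'].value_counts()
threshold = 1  # the notebook uses 10
rare = counts[counts <= threshold].index

# Keep frequent locations as they are and replace the rare ones with 'other'.
toy['location'] = toy['location'].where(~toy['location'].isin(rare), 'other')
print(toy['location'].tolist())
```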

In [None]:
data_5.head(20)

Outlier detection and removal

In [None]:
data_5[data_5.total_sqft/data_5.bhk < 300].head()

In [None]:
data_6 = data_5[~(data_5.total_sqft/data_5.bhk < 300)]
data_6.shape
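The rule used here treats a listing as suspicious when it has less than 300 square feet per bedroom. A minimal sketch on made-up rows shows which ones the filter removes:

```python
import pandas as pd

# Made-up listings: the first one claims 6 bedrooms in 1020 square feet.
toy = pd.DataFrame({'total_sqft': [1020.0, 600.0, 1500.0],
                    'bhk': [6, 1, 3]})

sqft_per_bhk = toy['total_sqft'] / toy['bhk']

# Same form as in the notebook: drop rows below 300 square feet per bedroom.
kept = toy[~(sqft_per_bhk < 300)]
print(sqft_per_bhk.tolist())
print(kept)
```

The first row works out to 170 square feet per bedroom and is dropped, while the other two (600 and 500) are kept.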

Checking the price per square foot

In [None]:
data_6.price_per_sqft.describe()

Writing a function that removes extreme cases, such as the maximum price, by keeping only the rows within one standard deviation of the mean price per square foot for each location

In [None]:
def remove_pps_outlier(df):
  df_out = pd.DataFrame()
  for key,subdf in df.groupby('location'):
    m = np.mean(subdf.price_per_sqft)
    st = np.std(subdf.price_per_sqft)
    reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
    df_out = pd.concat([df_out,reduced_df],ignore_index = True)
  return df_out


data_7 = remove_pps_outlier(data_6)
data_7.shape
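The same per-location filter can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's function: it uses `groupby(...).transform` to broadcast each location's mean and population standard deviation (`ddof=0`) back onto the rows. The toy locations and prices are made up.

```python
import pandas as pd

# Made-up data: location A has one extreme value, location B is tightly clustered.
toy = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 3,
    'price_per_sqft': [4000.0, 5000.0, 5100.0, 4900.0, 9000.0,
                       3000.0, 3100.0, 2900.0],
})

grouped = toy.groupby('location')['price_per_sqft']
mean = grouped.transform('mean')
std = grouped.transform(lambda s: s.std(ddof=0))

# Keep rows in the interval (mean - std, mean + std] of their own location.
pps = toy['price_per_sqft']
kept = toy[(pps > mean - std) & (pps <= mean + std)]
print(kept)
```

In location A only the 9000 row is removed. In location B the band is narrow, so both 2900 and 3100 fall outside it, which shows how aggressive a one standard deviation cut can be on small groups.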

Plotting a scatter plot to compare the prices of 2 and 3 bedroom homes in a given location.

In [None]:
import matplotlib.pyplot as plt

def plot_scatter_chart(df,location):
  bhk_2 = df[(df.location == location) & (df.bhk == 2)]
  bhk_3 = df[(df.location == location) & (df.bhk == 3)]
  plt.rcParams['figure.figsize'] = (15,10)
  plt.scatter(bhk_2.total_sqft,bhk_2.price, color = 'blue' , label = '2 BHK', s = 50)
  plt.scatter(bhk_3.total_sqft,bhk_3.price, color = 'red' , label = '3 BHK', s = 50, marker  = '*')
  # xlabel and ylabel are functions; assigning to them would overwrite them and set no label
  plt.xlabel('Total Square Feet Area')
  plt.ylabel('Price')
  plt.title(location)
  plt.legend()

plot_scatter_chart(data_7,'Rajaji Nagar')