## House Price Prediction Challenge

Overview: Welcome to the House Price Prediction Challenge, where you will test your regression skills by designing an algorithm that accurately predicts house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned only with the size (square feet) of the house; various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that explain buyer behavior.


This dataset has been collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict the prices as accurately as possible. The competition also leaves plenty of room for feature engineering and for practicing advanced regression techniques such as Random Forest, Deep Neural Nets, and various ensembling techniques.



[Click here to Get Dataset](https://www.kaggle.com/datasets/anmolkumar/house-price-prediction-challenge)

### Import Libraries


In [1]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# magic command: render plots inline (a comment on the same line as the magic raises a UsageError)
%matplotlib inline


### Read Data - Extract Some info and Statistics

In [None]:
data = pd.read_csv('train.csv')

data.head()

### Explore Data

In [None]:
data.info()

In [None]:
data.describe()

#### There are some outliers in our data, as the minimum and maximum of SQUARE_FT and TARGET(PRICE_IN_LACS) show





#### Check whether there are any null values

In [None]:
data.isnull().sum()

#### There are no null values in our data

### Data Cleaning and Handling the Outliers - Exploratory Data Analysis

In [None]:
# Handling the outliers - 1 - using filtering


sns.boxplot(x=data['SQUARE_FT'])  # first way to check for outliers (boxplot)

In [None]:
# sns.histplot(x=data['SQUARE_FT'])  # second way to check for outliers (histogram)

In [None]:
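# keep only the rows whose area lies between the chosen lower and upper bounds
data = data[(data['SQUARE_FT'] > 430) & (data['SQUARE_FT'] < 2.550688e+03)].copy()  # copy so later column assignments act on an independent frame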

data

In [None]:
data.describe()

In [None]:
sns.boxplot(x=data['SQUARE_FT'])

In [None]:
data.shape

#### The data now looks more reasonable: far fewer outliers remain
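The bounds on SQUARE_FT in the filtering step were chosen by hand. As a hedged alternative, the sketch below derives bounds from the interquartile range (Tukey's rule). The `square_ft` values are made up for illustration and are not taken from the dataset, and the 1.5 multiplier is a common convention rather than something this notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the SQUARE_FT column; the values are invented for illustration.
square_ft = pd.Series([450, 800, 950, 1100, 1200, 1350, 1500, 1800, 2100, 250000])

# Interquartile range: the spread of the middle 50% of the values.
q1 = square_ft.quantile(0.25)
q3 = square_ft.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag anything further than 1.5 * IQR from the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the computed bounds.
filtered = square_ft[(square_ft >= lower) & (square_ft <= upper)]
print(lower, upper, len(filtered))
```

Data-driven bounds like these adapt when the data changes, whereas hard-coded thresholds have to be revisited by hand.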

### Data Visualization

In [None]:
# ways to analyze (visualize) the data: univariate analysis (one feature at a time), bivariate analysis (two features), multivariate analysis (multiple features)

##### Univariate analysis

In [None]:

sns.countplot(x=data['UNDER_CONSTRUCTION'])  # already built or still under construction

In [None]:
sns.countplot(x=data['BHK_OR_RK'])  # type of room

In [None]:
data['BHK_OR_RK'].value_counts()  # only 4 rows differ; almost all the data is of type BHK

In [None]:
# the BHK_OR_RK feature has almost no variance, so it has little effect on prediction - drop it
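One way to back up the "almost no variance" observation with a number is to look at the share of the most frequent category. The sketch below uses a made-up column. The 95% cut-off (`NEAR_CONSTANT_THRESHOLD`) is a hypothetical choice for illustration, not a rule from this notebook.

```python
import pandas as pd

# Made-up column that mimics a heavily imbalanced categorical feature.
bhk_or_rk = pd.Series(["BHK"] * 96 + ["RK"] * 4)

# normalize=True returns each category's share instead of its raw count;
# value_counts sorts in descending order, so the first entry is the dominant one.
shares = bhk_or_rk.value_counts(normalize=True)
dominant_share = shares.iloc[0]

# Hypothetical cut-off: treat the feature as near-constant above 95%.
NEAR_CONSTANT_THRESHOLD = 0.95
is_near_constant = dominant_share > NEAR_CONSTANT_THRESHOLD

print(shares)
print(is_near_constant)
```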

In [None]:
sns.countplot(x=data['POSTED_BY'])  # who posted the listing

In [None]:
sns.countplot(x=data['READY_TO_MOVE'])

In [None]:
sns.countplot(x=data['RERA'])

In [None]:
sns.countplot(x=data['BHK_NO.'])

In [None]:
data['READY_TO_MOVE'].value_counts()

In [None]:
sns.countplot(x=data['RESALE'])

##### Bivariate analysis


In [None]:
# using a scatter plot
# SQUARE_FT vs TARGET(PRICE_IN_LACS)
sns.scatterplot(x=data['SQUARE_FT'], y=data['TARGET(PRICE_IN_LACS)'])

In [None]:
# there is no strong relationship between them
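A scatter plot gives a visual impression. A correlation coefficient can put a number on how strongly two columns move together. The sketch below uses an invented `toy` frame, not the real data. It computes the Pearson coefficient (linear association) and a Spearman-style rank correlation (monotonic association), the latter obtained by correlating the ranks.

```python
import pandas as pd

# Made-up values: price grows with area, but not along a straight line.
toy = pd.DataFrame({"area": range(1, 11)})
toy["price"] = toy["area"] ** 3

# Pearson measures how well a straight line describes the relationship.
pearson = toy["area"].corr(toy["price"])

# Spearman is the Pearson correlation of the ranks, so it only
# asks whether the relationship is monotonic.
spearman = toy["area"].rank().corr(toy["price"].rank())

print(round(pearson, 3), round(spearman, 3))
```

When the rank-based value is noticeably higher than the linear one, the relationship may be monotonic but curved, which a linear model would capture poorly.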

In [None]:
sns.scatterplot(x=data['BHK_NO.'], y=data['TARGET(PRICE_IN_LACS)'])

In [None]:
sns.scatterplot(x=data['READY_TO_MOVE'], y=data['TARGET(PRICE_IN_LACS)'])

In [None]:
sns.scatterplot(x=data['POSTED_BY'], y=data['TARGET(PRICE_IN_LACS)'])

In [None]:
sns.scatterplot(x=data['RESALE'], y=data['TARGET(PRICE_IN_LACS)'])

In [None]:
sns.scatterplot(x=data['UNDER_CONSTRUCTION'], y=data['TARGET(PRICE_IN_LACS)'])

##### Multivariate analysis

In [None]:
#using pairplot
sns.pairplot(data)

In [None]:
plt.figure(figsize=[15, 8])
plt.title("Correlation between the numeric features")
sns.heatmap(data.corr(numeric_only=True), annot=True)  # numeric_only skips text columns such as ADDRESS

### Finding and Answering Questions (Q & A)

#### What is the total price of the properties posted by each seller type (Dealer, Owner, Builder)?

In [None]:
data.groupby('POSTED_BY')['TARGET(PRICE_IN_LACS)'].sum()  # select the column first so text columns are not summed

In [None]:
data.groupby('POSTED_BY')['TARGET(PRICE_IN_LACS)'].sum().plot.bar()

In [None]:
#data.groupby('POSTED_BY')['TARGET(PRICE_IN_LACS)'].sum().plot.pie()

#### How many properties were posted by each seller type (Dealer, Owner, Builder)?

In [None]:
data['POSTED_BY'].value_counts()

In [None]:
data['POSTED_BY'].value_counts().plot.bar()

#### What is the mean price of a property according to its number of rooms?

In [None]:
data.groupby('BHK_NO.')['TARGET(PRICE_IN_LACS)'].mean()  # select the column first; averaging text columns fails

In [None]:
data.groupby('BHK_NO.')['TARGET(PRICE_IN_LACS)'].mean().plot.bar()
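A group mean can be misleading when a group holds only a few rows or a few extreme prices. As a small sketch on an invented `toy` frame (the numbers are not from the dataset), several aggregates can be computed in one pass with `agg`:

```python
import pandas as pd

# Invented listings: number of rooms and price.
toy = pd.DataFrame({
    "BHK_NO.": [1, 2, 2, 3, 3, 3],
    "TARGET(PRICE_IN_LACS)": [30.0, 50.0, 70.0, 90.0, 110.0, 100.0],
})

# Mean, median and group size side by side for each room count.
summary = toy.groupby("BHK_NO.")["TARGET(PRICE_IN_LACS)"].agg(["mean", "median", "count"])
print(summary)
```

Showing the count next to the mean makes it easier to judge how much weight each bar in the chart deserves.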

### Feature Engineering and Feature Selection

In [None]:
# feature engineering: we create new features from existing ones; feature selection: we keep only the relevant features

#### 1- Feature Engineering 

In [None]:
data['location'] = data['ADDRESS'].apply(lambda x: str(x).split(',')[-1])  # split returns a list; [-1] picks its last item

data['location']

In [None]:
data.head()

In [None]:
#-----------------------------------------  Done - New Feature Added  --------------------------#
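The same extraction can also be written with pandas' vectorized string methods. This is a sketch on made-up addresses, assuming the last comma-separated part of each address is the location. It also strips surrounding whitespace, which the cell above does not do.

```python
import pandas as pd

# Made-up addresses; the last comma-separated part is taken as the location.
addresses = pd.Series([
    "Ksfc Layout,Bangalore",
    "Vishweshwara Nagar, Mysore",
    "Jigani,Bangalore",
])

# rsplit with n=1 splits once from the right; .str[-1] picks the last part,
# and .str.strip() removes any stray spaces around it.
location = addresses.str.rsplit(",", n=1).str[-1].str.strip()
print(location.tolist())
```

Stripping whitespace keeps " Mysore" and "Mysore" from being treated as two different locations when the column is later one-hot encoded.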

In [None]:
data['location'].value_counts()

In [None]:
#data['location'].unique()

In [None]:
plt.figure(figsize=[15, 10])
# total price for the first five locations (in sorted order of the group keys)
data.groupby('location')['TARGET(PRICE_IN_LACS)'].sum().head(5).plot.bar()

In [None]:
#-------------------------- Here we notice that the location is an important feature --------------------#

#### 2 - Feature Selection

###### 2.1 - Drop the unimportant columns

In [None]:
data.drop(['BHK_OR_RK', 'ADDRESS', 'LATITUDE', 'LONGITUDE'], axis=1, inplace=True)

In [None]:
data.head()

###### 2.2 - Convert categorical features to numerical

In [None]:
data = pd.get_dummies(data, drop_first=True)

In [None]:
data.head()
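To make the effect of `pd.get_dummies(..., drop_first=True)` concrete, here is a minimal sketch on an invented two-column frame. Numeric columns pass through unchanged, each text column becomes one indicator column per category, and the first category (in sorted order) is dropped because the remaining indicators already imply it.

```python
import pandas as pd

# Invented frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "POSTED_BY": ["Owner", "Dealer", "Builder", "Dealer"],
    "SQUARE_FT": [900.0, 1200.0, 1500.0, 1100.0],
})

# One indicator per category, minus the first one ("Builder").
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in both indicator columns therefore stands for the dropped category.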

### Splitting Data into [Train, Test]

In [None]:
X = data.drop('TARGET(PRICE_IN_LACS)', axis=1)
Y = data['TARGET(PRICE_IN_LACS)']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)  # a fixed random_state makes the split reproducible

In [None]:
x_train.head()

##### Feature Scaling 

In [None]:
# Normalization and Standardization ---> two ways to apply scaling

In [None]:
# Normalization - rescales all values into a fixed range (typically [0, 1]); it is sensitive to outliers

In [None]:
# Standardization - subtracts the mean and divides by the standard deviation; StandardScaler does this

In [None]:
# in our case we will use Standardization
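The difference between the two scaling approaches can be seen on a tiny made-up column with one extreme value. This is only an illustrative sketch. `MinMaxScaler` is scikit-learn's min-max normalization and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with one extreme value.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max normalization maps the minimum to 0 and the maximum to 1,
# so the outlier squeezes every other value close to 0.
minmax = MinMaxScaler().fit_transform(values)

# Standardization centers the column on 0 with unit standard deviation;
# the result is not confined to a fixed range.
standard = StandardScaler().fit_transform(values)

print(minmax.ravel().round(3))
print(standard.ravel().round(3))
```

Standardization is still influenced by the extreme value through the mean and standard deviation, which is one reason the outliers were filtered beforehand.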

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)

In [None]:
#x_train_scaled

In [None]:
x_test_scaled = scaler.transform(x_test)  # transform only: the scaler is not refit, because the test data must stay unseen

#x_test_scaled
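The fit-on-train, transform-on-test rule can also be enforced automatically by bundling the scaler and the model in a scikit-learn pipeline. The sketch below runs on synthetic data (the arrays and coefficients are invented) and is one possible arrangement, not the way this notebook is organized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: three features and a noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training data only;
# score() and predict() reuse those statistics on new data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(round(test_r2, 3))
```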

### Machine Learning Model Building

#### 1- Building the model

In [None]:

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

linear_model.fit(x_train_scaled, y_train)

#### 2- Calculate Y Prediction (Predict the results) 

In [None]:
y_predicted = linear_model.predict(x_test_scaled)

In [None]:
y_predicted

#### 3- Calculate the Performance and Accuracy of the model

In [None]:
linear_model.score(x_train_scaled, y_train)  # evaluate the model (R^2 score) on the train data

In [None]:
linear_model.score(x_test_scaled, y_test)  # evaluate the model (R^2 score) on the test data - the model performs poorly: -1.1198126...

In [None]:
# After calculating the score we found that the model is not good for prediction - use another model
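For regressors, `score` returns the coefficient of determination R^2, not a classification accuracy. It compares the model's squared error with that of always predicting the mean, so a negative value means the model does worse than that baseline. The numbers below are invented to show how a negative score arises.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and deliberately poor predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_bad = np.array([40.0, 10.0, 50.0, 0.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred_bad) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# Always predicting the mean scores exactly 0 by construction.
baseline_r2 = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(manual_r2, r2_score(y_true, y_pred_bad), baseline_r2)
```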

#### Using Lasso model

In [None]:

from sklearn.linear_model import Lasso

lasso_model = Lasso()

In [None]:
lasso_model.fit(x_train_scaled, y_train)

In [None]:
y_predicted_lasso = lasso_model.predict(x_test_scaled)

In [None]:
y_predicted_lasso

In [None]:
lasso_model.score(x_train_scaled, y_train)

In [None]:
lasso_model.score(x_test_scaled, y_test)

In [None]:
# another measure: the prediction error, computed with mean_squared_error from sklearn.metrics

from sklearn.metrics import mean_squared_error
m_s_error = mean_squared_error(y_test, y_predicted_lasso)

m_s_error
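The mean squared error is expressed in squared units of the target, which can be hard to interpret. As a sketch on invented numbers, the root mean squared error (RMSE) and mean absolute error (MAE) bring the error back to the units of the target:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented true and predicted prices.
y_true = np.array([50.0, 60.0, 80.0, 120.0])
y_pred = np.array([55.0, 58.0, 70.0, 130.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error

print(mse, round(rmse, 3), mae)
```

RMSE penalizes large misses more heavily than MAE, so a big gap between the two hints at a few badly predicted rows.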

#### Using Ridge Model

In [None]:
from sklearn.linear_model import Ridge

ridge_model = Ridge()

In [None]:
ridge_model.fit(x_train_scaled, y_train)

In [None]:
y_predicted_ridge = ridge_model.predict(x_test_scaled)

y_predicted_ridge

In [None]:
ridge_model.score(x_train_scaled, y_train)

In [None]:
ridge_model.score(x_test_scaled, y_test)

#### Using Random Forest Regressor Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()

In [None]:
rf_model.fit(x_train_scaled, y_train)

In [None]:
y_predicted_rf = rf_model.predict(x_test_scaled)

y_predicted_rf

In [None]:
rf_model.score(x_train_scaled, y_train)

In [None]:
rf_model.score(x_test_scaled, y_test)

Here the Random Forest model is overfitting: it scores about 90 on the train data but only about 46 on the test data. Even so, RandomForestRegressor is the best of the models tried.
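A common way to probe and reduce this kind of overfitting is to constrain the trees and evaluate with cross-validation instead of a single split. The sketch below uses synthetic data. The hyperparameter values (`max_depth=5`, `min_samples_leaf=5`, `n_estimators=50`) are arbitrary assumptions for illustration, not tuned settings for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: a strong linear effect, a mild non-linear one, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

# Shallower trees with larger leaves are less able to memorize noise.
model = RandomForestRegressor(
    n_estimators=50, max_depth=5, min_samples_leaf=5, random_state=0
)

# Five-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.round(3), round(scores.mean(), 3))
```

If the cross-validated scores stay close to the training score, the gap between train and test performance has narrowed.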

In [None]:
newData = pd.DataFrame({'TARGET(PRICE_IN_LACS)': y_predicted_rf})
newData

In [None]:
newData.to_csv('submission.csv')