# Linear Regression Model for House Price Prediction

Using a dataset from Kaggle, I build a linear regression (LR) model with pandas, matplotlib, seaborn, and scikit-learn to predict house prices. The sample rows shown in this notebook are from Washington State, USA (Shoreline, Seattle, Kent). Let's first look at the data and clean it.

## Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

## Load Data

In [None]:
df = pd.read_csv('/content/data.csv')

## Cleaning Data

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float

In [None]:
# Work on an independent copy so the raw df stays untouched
new_data = df.copy()
new_data.head(3)

| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.5 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.5 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.0 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |


In [None]:
df.dtypes

| Column | dtype |
|---|---|
| date | object |
| price | float64 |
| bedrooms | float64 |
| bathrooms | float64 |
| sqft_living | int64 |
| sqft_lot | int64 |
| floors | float64 |
| waterfront | int64 |
| view | int64 |
| condition | int64 |


In [None]:
from sklearn.preprocessing import LabelEncoder

# Drop columns that will not be used as features
new_data.drop(columns=['street','statezip','country','date','sqft_lot','sqft_above','sqft_basement','yr_renovated'], inplace=True)

# Label-encode city: each city name becomes an integer code (this is not one-hot encoding)
categorical_columns = ['city']
label_encoder = LabelEncoder()

# Fit the label encoder and transform the labels to numerical values
for column in categorical_columns:
    new_data[column] = label_encoder.fit_transform(new_data[column])
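
Label encoding gives each city an arbitrary integer, and a linear model will read those integers as an ordered quantity. If one-hot encoding is what you actually want, a minimal sketch with `pd.get_dummies` could look like the following. The `toy` frame and its rows are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "city": ["Seattle", "Kent", "Shoreline", "Seattle"],
    "bedrooms": [5.0, 3.0, 3.0, 2.0],
})

# One 0/1 indicator column per city. drop_first=True drops the first
# category (alphabetically) so the indicators are not collinear with
# the intercept of the regression.
encoded = pd.get_dummies(toy, columns=["city"], drop_first=True, dtype=int)
print(encoded)
```

The trade-off is width: one-hot encoding adds a column per city, while label encoding keeps a single column but imposes an ordering that has no real meaning.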

In [None]:
# Check the column types of the cleaned copy
new_data.dtypes

In [None]:
# Count fully duplicated rows
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
from scipy import stats

# Flag rows whose price lies more than 3 standard deviations above the mean
z_scores = stats.zscore(df['price'])
outliers = df[z_scores > 3]
outliers
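
To make the z-score filter less of a black box, here is a small hand-rolled version on made-up prices (not rows from the dataset). It assumes the default behaviour of `scipy.stats.zscore`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np

# Made-up prices for illustration only: 20 ordinary houses and one very expensive one
prices = np.array([300_000.0, 320_000.0, 340_000.0, 360_000.0, 380_000.0] * 4 + [2_400_000.0])

# z = (x - mean) / std, with the population std (ddof=0)
z = (prices - prices.mean()) / prices.std()

# Same threshold as in the cell above, applied to both tails
outlier_mask = np.abs(z) > 3
print(prices[outlier_mask])
```

Only the last value is flagged here. Note that the mean and standard deviation are themselves pulled up by extreme values, so the z-score rule can miss outliers in small or heavily skewed samples.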

In [None]:
data=df.select_dtypes(include=['number']) #select numerical values
corr_matrix = data.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(7,7))
sb.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Mean price per city, keeping the 10 cities with the highest averages
data2 = df.groupby("city")["price"].mean().nlargest(10)
plt.pie(data2, labels=data2.index, autopct='%1.1f%%')
plt.title('Average Price by City (Top 10)')
plt.show()

In [None]:
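# Box plots only apply to numeric columns, so select them first;
# this also keeps the tick labels aligned with the boxes
numeric_df = df.select_dtypes(include=['number'])
plt.figure(figsize=(10,10))
sb.boxplot(data=numeric_df)
plt.xticks(rotation=45, ha='right')
plt.show()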

The box plot shows outliers in price, bedrooms, bathrooms, sqft_living, floors, waterfront, view, condition, and sqft_above. They are drawn as individual dots beyond the whiskers.
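
The dots in a box plot follow the 1.5 × IQR rule by default: a point is drawn as an outlier when it falls more than 1.5 interquartile ranges below the first quartile or above the third. A rough sketch of that rule follows. The helper name `iqr_outliers` and the prices are my own, not part of the notebook or of seaborn.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the whisker range."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up prices for illustration only
prices = pd.Series([300_000, 320_000, 340_000, 360_000, 380_000, 2_400_000])
mask = iqr_outliers(prices)
print(prices[mask])
```

The same mask could be computed per column of the real data to count how many rows each feature flags, before deciding whether to drop or cap them.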

## Split Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
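# Use the cleaned copy: df still contains text columns that LinearRegression cannot fit
X = new_data.drop('price', axis=1)
y = new_data['price']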

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [None]:
X_train.head(3).reset_index(drop=True)

In [None]:
y_test.head(3).reset_index(drop=True)

## Build Model

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr

In [None]:
lr.fit(X_train, y_train)
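
`mean_squared_error` and `r2_score` were imported earlier, so the natural next step is to score the fitted model on the test split. Here is a self-contained sketch of that step. It uses synthetic stand-in data (a single made-up `sqft` feature with a linear price plus noise), not the Kaggle set, so the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price grows linearly with living area, plus noise
rng = np.random.default_rng(100)
sqft = rng.uniform(500, 4000, size=200).reshape(-1, 1)
price = 150.0 * sqft[:, 0] + 50_000 + rng.normal(0, 20_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    sqft, price, test_size=0.2, random_state=100
)
lr = LinearRegression().fit(X_train, y_train)

# Predict on the held-out rows and compare with the true prices
y_pred = lr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # error in price units
r2 = r2_score(y_test, y_pred)                       # share of variance explained
print(f"RMSE: {rmse:,.0f}  R^2: {r2:.3f}")
```

With the notebook's own `X_test` and `y_test`, the last four lines should work unchanged after `lr.fit(X_train, y_train)`.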