## Exploratory Data Analysis on House Price Prediction by Simran Goyal

The aim of this report is to explore which parameters influence house prices, as a first step towards predicting them.
Our main objective is to build a report of *Data Visualisation* using libraries such as Seaborn and Matplotlib. Here, we visualise the house sales data and identify the parameters that help predict the price.

The parameters in this data set are general and easy to relate to. They include the condition of the house, its location, its market value, the number of bedrooms it has, and so on.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

### Data Description

The data used here is in csv format.
A CSV (Comma-Separated Values) file stores data in a tabular format consisting of rows and columns. The data in every row is separated by commas. It looks like a database or a spreadsheet.
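
As a small illustration of the format, the sketch below parses a two-row CSV held in a string. The values are made up for illustration and are not taken from the data set; only the column names follow the ones used later in this report.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the values are invented for illustration only.
csv_text = """price,bedrooms,sqft_living
300000,3,1500
450000,4,2100
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.dtypes)
```

The first line of the file becomes the column header, and every following line becomes one row.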

The data was collected in Washington State (this can be identified from the zip codes in the data set).
The data set helps us predict house prices from parameters such as the date the house was bought, the number of bathrooms, the area covered by the house and the condition of the house. It is a general data set, observed and collected on a variety of parameters. Most of the parameters are continuous in nature, except the condition and grade of the house, which are ordinal.

### Attribute Information

The various attributes on which the data is observed are:

1. Date it was bought on
2. The market value
3. Number of bedrooms
4. Number of bathrooms
5. The living area in sqft
6. The lot area in sqft
7. The condition of the house
8. The year it was built in
9. The year it was renovated in
10. The zipcode of the area
11. The latitude
12. The longitude

And many more parameters which we will observe and analyse in this report.

![House](download.jpg)

In [2]:
# Importing the required libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns

Let us view the data in a dataframe format.

We use the head function to view the first five rows of the data set.

In [3]:
# Raw string, so that the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Users\Simran Goyal\Desktop\data.csv')
pd.options.display.max_columns = None
df.head()

IOError: File C:\Users\Simran Goyal\Desktop\data.csv does not exist
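
The error above means that the file was not found at the given absolute path. One way to make such a failure easier to diagnose is sketched below. The helper name `load_housing` and the relative file name in the usage comment are assumptions made for this example, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_housing(csv_path):
    """Read the CSV at csv_path, failing early with a readable message."""
    path = Path(csv_path)
    if not path.is_file():
        # Show the fully resolved path so that a wrong working directory is easy to spot.
        raise FileNotFoundError(f"No CSV file found at {path.resolve()}")
    return pd.read_csv(path)


# Hypothetical usage, assuming the file sits next to the notebook:
# df = load_housing("data.csv")
```

Keeping the file next to the notebook and using a relative path also makes the notebook easier to run on another machine.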

We will now observe the shape of the data set using the shape property, which tells us about the rows and columns in the data set.

In [None]:
df.shape

We will now observe the size of the data set using the size property, which gives the total number of values in the data set (the number of rows multiplied by the number of columns).

In [None]:
df.size
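
The difference between the two properties is easiest to see on a tiny frame. The frame below is a made-up example, not the housing data.

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (invented values).
toy = pd.DataFrame({"bedrooms": [3, 4, 2], "price": [300000, 450000, 250000]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (3, 2)
print(toy.size)   # 6
```

To get the number of records (rows) alone, `len(toy)` or `toy.shape[0]` can be used.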

We will now describe the data set using the describe function, which reports summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.

In [None]:
df.describe()

We will now check if our data set contains any null values or not.

In [None]:
df.isnull().sum()

This shows that there is no null value present in the data set.
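
For comparison, this is what the same check reports when a value is actually missing. The frame is a made-up example with one deliberately missing entry.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing bedroom count (invented values).
toy = pd.DataFrame({
    "bedrooms": [3, np.nan, 2],
    "price": [300000, 450000, 250000],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing = toy.isnull().sum()
print(missing)
print("total missing values:", int(missing.sum()))
```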

# Univariate Analysis

'Uni' stands for One. This means that the analysis is done considering one variable. This is the simplest method of analysis.
Univariate analysis is used to describe the data and observe the pattern. Hence, it is also helpful during the bivariate or multivariate analysis.

## Price- Histogram

Here is the histogram for the market price of the houses:

In [None]:
plt.hist(x = 'price', data = df, bins = 30)
sns.despine()

From the above histogram, we can observe:
1. There is skewness in the above distribution.
2. The above data is continuous in nature.
3. The market value of most of the houses is less than 1000000.
4. There are some outliers present in the above graph.
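
Skewness and outliers can also be expressed as numbers instead of being read off the plot. The sketch below uses synthetic, right-skewed values as a stand-in for the `price` column. The 1.5 x IQR fence is a common rule of thumb and an assumption of this example, not something taken from the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); this is not the real data.
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=1000), name="price")

# A skewness above 0 means the right tail is longer than the left one.
skewness = prices.skew()

# Rule-of-thumb outlier fence: anything above Q3 + 1.5 * IQR.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((prices > upper_fence).sum())

print(f"skewness: {skewness:.2f}")
print(f"values above the upper fence: {n_high_outliers}")
```

On the real data, the same lines can be applied to `df['price']` in place of the synthetic series.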

## Bedroom- Count plot

Here is the count plot for the number of bedrooms in the house:

In [None]:
sns.countplot(x = 'bedrooms', data = df)
sns.despine()

From the above count plot, we can infer:
1. Houses with 3 and 4 bedrooms are the most common.
2. There are even houses with 0 bedrooms. The plot does not tell us why, but their lower prices may be one reason some people prefer them.
3. Houses with 4 or fewer bedrooms far outnumber houses with more bedrooms, possibly because bigger houses cost more and are harder to maintain.
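
The bar heights of a count plot can be reproduced as a table with `value_counts`. The bedroom values below are made up for illustration.

```python
import pandas as pd

# Invented bedroom counts standing in for df['bedrooms'].
bedrooms = pd.Series([3, 4, 3, 2, 4, 3, 5, 0], name="bedrooms")

# Count how often each value occurs, ordered by the number of bedrooms.
counts = bedrooms.value_counts().sort_index()

# The same counts as a share of all houses.
share = (counts / counts.sum()).round(3)

print(counts)
print(share)
```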

## Bathroom- Count plot

Here is the count plot for the number of bathrooms in the house:

In [None]:
# Set the figure size before plotting so that it also applies to this plot
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.countplot(x = 'bathrooms', data = df)

From the above count plot, we can infer:

1. Houses with 1-3 bathrooms are the most common.
2. Very few houses have more than 4 bathrooms.

## Living area(in sqft)- Histogram

Here is the histogram for the living area (in sqft) in the house:

In [None]:
plt.hist(x = 'sqft_living', data = df, bins = 'auto')
sns.despine()

From the above histogram, we can infer:

1. The above distribution is skewed.
2. There are a number of outliers present.
3. The distribution is continuous in nature.
4. The living area of most of the houses is less than 4000 sqft.

## Lot area(in sqft)- Histogram
Here is the histogram for the lot area (in sqft) in the house:

In [None]:
plt.hist(x = 'sqft_lot',data=df,bins=20)
sns.despine()

From the above histogram, we can infer:

1. The plot is skewed in nature.
2. Most of the houses have a lot area much less than 250000 sqft.
3. There are outliers in the distribution.

## Floors- Count plot

Here is the count plot for number of floors in the house:

In [None]:
sns.countplot(x = 'floors', data = df)
sns.despine()

From the above count plot, we can infer:

1. The number of floors takes only a small set of discrete values.
2. Most houses have 1-2 floors.

## Waterfront- Count plot

Here is the count plot for waterfront in the house:

In [None]:
sns.countplot(x = 'waterfront', data = df)
sns.despine()

From the above countplot, we infer:

1. There are only two cases in the above distribution: a house either has a waterfront or it does not.
2. Most of the houses do not have a waterfront.
3. Only a few houses have a waterfront.

## View- Count Plot

Here is the count plot for the number of views around the house:

In [None]:
sns.countplot(x = 'view', data = df)
sns.despine()

From the above count plot, we infer:

1. Houses with 0 views around them are by far the most common.
2. Houses with 2 views around them are more common than houses with 1, 3 or 4 views.

## Condition- Count Plot

Here is the count plot stating the condition of the house:

In [None]:
sns.countplot(x = 'condition', data = df)
sns.despine()

From the above count plot, we infer:

1. The above data is ordinal in nature.
2. The condition of most of the houses is mediocre.
3. Houses in bad condition are far fewer than houses in good condition.

## Year built- Distribution Plot

Here is the distribution plot for the built year:

In [None]:
sns.distplot(df['yr_built'])
sns.despine()

From the above distribution plot, we infer:

1. Most of the houses were built between 1960 and the present.
2. The distribution is continuous.
3. The data set contains no houses built before 1900. This alone does not tell us whether buyers are interested in such old houses.

## Zipcode- Histogram

Here is the histogram for the area zipcode:

In [None]:
plt.hist(x = 'zipcode', data =df)
sns.despine()

From the above histogram, we infer:

1. A zipcode is a label for an area rather than a continuous quantity, so the histogram only shows how many houses fall in each range of codes.
2. Most of the houses are from some areas of Hobart and Seattle in Washington.
3. These houses are scattered across the district, and we will see whether location affects the price of the houses.
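
Because a zipcode identifies an area, one way to look at location and price together is to group the houses by zipcode. The sketch below uses invented zipcodes and prices, and the column names `zipcode` and `price` follow the ones used in this report.

```python
import pandas as pd

# Invented zipcodes and prices, for illustration only.
toy = pd.DataFrame({
    "zipcode": [98001, 98001, 98103, 98103, 98103],
    "price": [250000, 300000, 600000, 650000, 700000],
})

# Treat the zipcode as a label (string), not as a number to do arithmetic on.
toy["zipcode"] = toy["zipcode"].astype(str)

# Median price and number of houses per zipcode, most expensive area first.
by_zip = (
    toy.groupby("zipcode")["price"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(by_zip)
```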

## Latitude- Distribution Plot

Here is the distribution plot for the latitude:

In [None]:
sns.distplot(df['lat'])
sns.despine()

From the above distribution plot, we infer:

1. The above distribution is skewed.
2. The data is continuous.
3. There are outliers present.
4. Most houses lie between latitudes 47.5 and 47.7.

## Longitude- Distribution Plot

Here is the distribution plot for longitude:

In [None]:
sns.distplot(df['long'])
sns.despine()

From the above distribution plot, we infer:

1. The above distribution is skewed.
2. The data is continuous.
3. There are outliers present.
4. Most houses lie between longitudes -122.4 and -122.0.

# Bivariate Analysis

'Bi' stands for Two. This means that the analysis is done considering two variables at the same time. It tells us about the dependencies, association and relation between these two specific variables. After analysing their relation, we can infer how strongly they depend on each other.

We will now analyse the correlation between all the considered parameters.

In [None]:
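# Use the numeric columns only; text columns such as a date string cannot be correlated
df.select_dtypes(include='number').corr()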

We use a heatmap to plot the correlation between the various parameters.
Positive values show a direct relation (both parameters increase together), whereas negative values show an inverse relation (one parameter decreases as the other increases).

In [None]:
plt.figure(figsize =(18,12))
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)
plt.show()
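
Since price is the quantity of interest, it can help to pull the `price` column out of the correlation matrix and rank the other parameters by it. The sketch below uses synthetic data in which price is built from the living area by construction. The `noise` and `date` columns are assumptions added only to show an unrelated parameter and a non-numeric column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic data: price is constructed to depend on living area (not real data).
sqft_living = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "sqft_living": sqft_living,
    "price": 200 * sqft_living + rng.normal(0, 50_000, n),
    "noise": rng.normal(0, 1, n),          # unrelated to price
    "date": ["20140502T000000"] * n,       # hypothetical non-numeric column
})

# Keep numeric columns only, then compute the Pearson correlation matrix.
corr = toy.select_dtypes(include="number").corr()

# Correlation of every other parameter with price, strongest first.
ranked = corr["price"].drop("price").sort_values(ascending=False)
print(ranked)
```

A correlation close to 0 suggests little linear relation with price; it does not rule out a non-linear one.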

## Price vs Living Area(sqft)- Reg Plot

Here is the regression plot for price vs living area(sqft):

In [None]:
# Price goes on the y-axis, as it is the quantity we want to predict
sns.regplot(x = 'sqft_living',y = 'price',data = df , x_jitter=0.2, scatter_kws={'alpha':0.2})
sns.despine()

From the above regression plot, we infer:

1. Houses with a living area of less than 4000 sqft are mostly priced below 1000000.
2. The above distribution shows that as the living area increases, the price of the house also increases.
3. This shows that there exists a direct relation between the living area and the price of the house.
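
The line drawn by a regression plot is a least-squares fit, and its slope can be computed directly. The sketch below fits a straight line with `numpy.polyfit` on synthetic data. The slope of 200 per sqft and the noise level are arbitrary assumptions of the example, not estimates from the data set.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic living areas and prices (not the real data).
sqft_living = rng.uniform(500, 4000, 300)
price = 150_000 + 200 * sqft_living + rng.normal(0, 40_000, 300)

# Degree-1 least-squares fit: price is approximately slope * sqft_living + intercept.
slope, intercept = np.polyfit(sqft_living, price, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(sqft_living, price)[0, 1]

print(f"slope: {slope:.1f} per sqft, intercept: {intercept:.0f}, r: {r:.2f}")
```

A positive slope corresponds to the direct relation described above: each extra square foot is associated with a higher price.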

## Bedrooms vs Price- Reg Plot

Here is the regression plot for number of bedrooms vs price:

In [None]:
sns.regplot(x = 'bedrooms',y='price',data = df , x_jitter=0.3, scatter_kws={'alpha':0.3}, color = 'orange')
sns.despine()

From the above regression plot, we infer:

1. Most houses have 2-5 bedrooms.
2. Most of the houses are priced below 1000000, whatever the number of bedrooms.
3. The price of a house is related to the number of bedrooms it has.

## Price vs Latitude- Reg Plot

Here is the regression plot for price vs latitude:

In [None]:
sns.regplot(x = 'lat',y = 'price',data = df , x_jitter=0.2, scatter_kws={'alpha':0.2},color='black')
sns.despine()

From the above regression plot, we infer:

1. There exists a direct relation between the price and the latitudinal location of the house.
2. Most houses in the data set are located between latitudes 47.4 and 47.8.
3. Houses located between latitudes 47.0 and 47.2 are cheaper than houses between latitudes 47.4 and 47.8.

## Price vs Longitude- Reg Plot

Here is the regression plot for price vs longitude:

In [None]:
sns.regplot(x = 'long',y = 'price',data = df , x_jitter=0.2, scatter_kws={'alpha':0.2},color='green')
sns.despine()

From the above regression plot, we infer:

1. There exists a direct relation between the price and the longitudinal location of the house.
2. Most houses in the data set are located between longitudes -122.4 and -122.0.
3. Houses located between longitudes -122.6 and -122.4 are cheaper than houses between longitudes -122.4 and -122.0.

## Condition vs Built Year- Reg Plot

Here is the regression plot for condition vs built year:

In [None]:
sns.regplot(x = 'condition',y = 'yr_built',data = df , x_jitter=0.2, scatter_kws={'alpha':0.2},color='red')
sns.despine()

From the above regression plot, we infer:

1. It shows an inverse relation between the condition of the house and the year it was built in.
2. Most of the houses built after 2000 are in good condition.
3. The plot shows that the houses with bad condition are mostly built before 1980.

## Price vs View- Box Plot

Here is the boxplot for price vs view:

In [None]:
sns.boxplot(x = 'view',y = 'price',data = df )
sns.despine()

From the above boxplot, we infer:

1. There exists a direct relation between the price and the view around the house.
2. Houses with 0 views around them are mostly priced around 1000000.
3. Houses with more views around them are priced higher than houses with fewer views.

## Price vs Waterfront- Box Plot

Here is the boxplot for price vs waterfront:

In [None]:
sns.boxplot(x = 'waterfront',y = 'price',data = df)
sns.despine()

From the above boxplot, we infer:

1. There exists a direct relation between the price and the presence of a waterfront.
2. Most of the houses do not have a waterfront.
3. Houses with a waterfront are priced higher than houses without one.
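
The comparison a box plot makes between groups can be summarised with a grouped median. The sketch below uses invented prices, and the 0/1 coding of `waterfront` is assumed from the two cases seen in the count plot.

```python
import pandas as pd

# Invented prices; waterfront is assumed to be coded 0 (no) and 1 (yes).
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 1, 1],
    "price": [300000, 450000, 380000, 520000, 1200000, 1600000],
})

# Number of houses and median price in each group.
summary = toy.groupby("waterfront")["price"].agg(["count", "median"])
print(summary)
```

The median is the line drawn inside each box, so this table gives the box plot's central values as numbers.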

# Multivariate Analysis

'Multi' stands for Many. This means that the analysis is done considering more than two variables. It is the simultaneous analysis of many variables. It tells us about the dependencies, association and relation between the variables. After analysing their relation, we can infer the level of their dependencies on each other.

Our main aim in this analysis is to predict the price of the houses for sale. Hence, we analyse every aspect in terms of price here.

## Bedrooms vs Price in terms of Condition- Facet grid

Here is the facet grid for price vs bedrooms in terms of the condition:

In [None]:
grid = sns.FacetGrid(df, col='condition',col_wrap=2)
grid.map(plt.scatter,'bedrooms','price',alpha = 0.2)
sns.despine()

From the above graphs, we infer:

1. Where the condition is good, the more bedrooms a house has, the higher its price.
2. For houses in bad condition, even those with more bedrooms are priced low.

## Price vs Living Area in terms of Condition- Facet Grid

Here is the facet grid for price vs living area in terms of condition:

In [None]:
grid = sns.FacetGrid(df, col='condition',col_wrap=2)
grid.map(plt.scatter,'sqft_living','price',alpha = 0.2)
sns.despine()

From the above graphs, we infer:

1. Houses in bad condition are priced low, irrespective of their living area.
2. For houses in good condition, the price increases with the living area.
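
The panel-by-panel comparison can also be put into numbers by computing the correlation between living area and price separately for each condition. The sketch below uses synthetic data. The 1-5 coding of `condition` is an assumption, and because the same relation is built into every group here, the results only demonstrate the mechanics, not the findings above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 400

# Synthetic data (not the real data); condition is assumed to be coded 1 to 5.
toy = pd.DataFrame({
    "condition": rng.integers(1, 6, n),
    "sqft_living": rng.uniform(500, 4000, n),
})
toy["price"] = 200 * toy["sqft_living"] + rng.normal(0, 60_000, n)

# One Pearson correlation per condition value, mirroring one facet each.
per_condition = {
    int(cond): group["sqft_living"].corr(group["price"])
    for cond, group in toy.groupby("condition")
}

for cond, value in sorted(per_condition.items()):
    print(f"condition {cond}: r = {value:.2f}")
```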

## Price vs View in terms of Condition- Facet Grid

Here is the facet grid for price vs view in terms of condition:

In [None]:
grid = sns.FacetGrid(df, col='condition',col_wrap=2)
grid.map(plt.scatter,'view','price',alpha = 0.2)
sns.despine()

From the above graphs, we infer:

1. Houses in bad condition mostly have fewer views, and their prices are correspondingly lower.
2. Houses in good condition have high prices and more views around them.
3. The more views around a house, the higher its price.

# Final Plot Section

This is the final section where three plots are added that are useful in predicting the price of the houses.

## Plot-1
## Price vs View- Box Plot

We have included this graph because the views around a house play a major role in determining its price.
Here is the box plot for price vs view:

In [None]:
sns.boxplot(x = 'view',y = 'price',data = df )
sns.despine()

From the above box plot, we infer:

1. There exists a direct relation between the price and the view around the house.
2. Houses with 0 views around them are mostly priced around 1000000.
3. Houses with more views around them are priced higher than houses with fewer views.

## Plot-2
## Built Year- Distribution Plot

We have included this plot because the year a house was built in helps us to price it well.
Here is the distribution plot for the built year:

In [None]:
sns.distplot(df['yr_built'])
sns.despine()

From the above distribution plot, we infer:

1. Most of the houses were built between 1960 and the present.
2. The distribution is continuous.
3. The data set contains no houses built before 1900. This alone does not tell us whether buyers are interested in such old houses.

## Plot-3
## Living Area vs Price in terms of Condition- Facet Grid

We have included this plot because the living area and the condition together describe the price of the house well.
Here is the facet grid for price vs living area in terms of condition:

In [None]:
grid = sns.FacetGrid(df, col='condition',col_wrap=2)
grid.map(plt.scatter,'sqft_living','price',alpha = 0.2)
sns.despine()

From the above graphs, we infer:

1. Houses in bad condition are priced low, irrespective of their living area.
2. For houses in good condition, the price increases with the living area.

# Summary

From the above plots, we conclude that various parameters are involved in determining the price of a house. For example:
1. The better the condition, the higher the price.
2. The more views around the house, the higher the price.
3. The more bedrooms, the higher the price.
4. The more bathrooms, the higher the price.
5. The larger the living area, the higher the price.
6. The later the built year, the higher the price.
7. A waterfront raises the price.
8. The larger the lot area, the higher the price.

There are many such observations that one can see and infer from the above plots.

The problems faced here are that the floors parameter contains decimal values, which are hard to interpret as a number of floors, and that the house id is of no use in predicting the price of the house.
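
Both issues can be inspected and handled in a few lines. The sketch below uses an invented frame, and the column names `id` and `floors` are assumed to match the data set.

```python
import pandas as pd

# Invented rows; the column names 'id' and 'floors' are assumptions.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "floors": [1.0, 1.5, 2.0, 2.5],
    "price": [300000, 420000, 510000, 640000],
})

# The id only identifies a record, so it is dropped before any modelling.
features = toy.drop(columns=["id"])

# Rows whose floor value is not a whole number.
fractional = toy.loc[toy["floors"] % 1 != 0, "floors"]

print(list(features.columns))
print(fractional.tolist())
```

Whether the fractional values should be kept, rounded or treated as separate categories depends on what they mean in the source data, which this report does not establish.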

So, we conclude that the price of a house can be predicted by the above defined parameters and plots.