In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

## Data Definition 
<br>**SalePrice** - the property's sale price in dollars. This is the target variable that you're trying to predict. 
<br>**MSSubClass**: The building class
<br>**MSZoning:** The general zoning classification
<br>**LotFrontage:** Linear feet of street connected to property
<br>**LotArea:** Lot size in square feet
<br>**Street:** Type of road access
<br>**Alley:** Type of alley access
<br>**LotShape:** General shape of property
<br>**LandContour:** Flatness of the property
<br>**Utilities:** Type of utilities available
<br>**LotConfig:** Lot configuration
<br>**LandSlope:** Slope of property
<br>**Neighborhood:** Physical locations within Ames city limits
<br>**Condition1:** Proximity to main road or railroad
<br>**Condition2:** Proximity to main road or railroad (if a second is present)
<br>**BldgType:** Type of dwelling
<br>**HouseStyle:** Style of dwelling
<br>**OverallQual:** Overall material and finish quality
<br>**OverallCond:** Overall condition rating
<br>**YearBuilt:** Original construction date
<br>**YearRemodAdd:** Remodel date
<br>**RoofStyle:** Type of roof
<br>**RoofMatl:** Roof material
<br>**Exterior1st:** Exterior covering on house
<br>**Exterior2nd:** Exterior covering on house (if more than one material)
<br>**MasVnrType:** Masonry veneer type
<br>**MasVnrArea:** Masonry veneer area in square feet
<br>**ExterQual:** Exterior material quality
<br>**ExterCond:** Present condition of the material on the exterior
<br>**Foundation:** Type of foundation
<br>**BsmtQual:** Height of the basement
<br>**BsmtCond:** General condition of the basement
<br>**BsmtExposure:** Walkout or garden level basement walls
<br>**BsmtFinType1:** Quality of basement finished area
<br>**BsmtFinSF1:** Type 1 finished square feet
<br>**BsmtFinType2:** Quality of second finished area (if present)
<br>**BsmtFinSF2:** Type 2 finished square feet
<br>**BsmtUnfSF:** Unfinished square feet of basement area
<br>**TotalBsmtSF:** Total square feet of basement area
<br>**Heating:** Type of heating
<br>**HeatingQC:** Heating quality and condition
<br>**CentralAir:** Central air conditioning
<br>**Electrical:** Electrical system
<br>**1stFlrSF:** First Floor square feet
<br>**2ndFlrSF:** Second floor square feet
<br>**LowQualFinSF:** Low quality finished square feet (all floors)
<br>**GrLivArea:** Above grade (ground) living area square feet
<br>**BsmtFullBath:** Basement full bathrooms
<br>**BsmtHalfBath:** Basement half bathrooms
<br>**FullBath:** Full bathrooms above grade
<br>**HalfBath:** Half baths above grade
<br>**BedroomAbvGr:** Number of bedrooms above basement level
<br>**KitchenAbvGr:** Number of kitchens
<br>**KitchenQual:** Kitchen quality
<br>**TotRmsAbvGrd:** Total rooms above grade (does not include bathrooms)
<br>**Functional:** Home functionality rating
<br>**Fireplaces:** Number of fireplaces
<br>**FireplaceQu:** Fireplace quality
<br>**GarageType:** Garage location
<br>**GarageYrBlt:** Year garage was built 
<br>**GarageFinish:** Interior finish of the garage
<br>**GarageCars:** Size of garage in car capacity
<br>**GarageArea:** Size of garage in square feet
<br>**GarageQual:** Garage quality
<br>**GarageCond:** Garage condition 
<br>**PavedDrive:** Paved driveway
<br>**WoodDeckSF:** Wood deck area in square feet
<br>**OpenPorchSF:** Open porch area in square feet
<br>**EnclosedPorch:** Enclosed porch area in square feet
<br>**3SsnPorch:** Three season porch area in square feet
<br>**ScreenPorch:** Screen porch area in square feet
<br>**PoolArea:** Pool area in square feet
<br>**PoolQC:** Pool quality
<br>**Fence:** Fence quality
<br>**MiscFeature:** Miscellaneous feature not covered in other categories
<br>**MiscVal:** Value of miscellaneous feature
<br>**MoSold:** Month Sold
<br>**YrSold:** Year Sold
<br>**SaleType:** Type of sale
<br>**SaleCondition:** Condition of sale

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
house=pd.read_csv("../input/train.csv")
house.head(4)

In [None]:
#Display all rows and columns
pd.options.display.max_rows=None
pd.options.display.max_columns=None
#View the observations
house.head(4)

In [None]:
#To check the dimension or no of observations and columns in the dataframe
house.shape

From the output we can understand that there are 1460 observations and 81 columns in the dataframe.

In [None]:
#View the datatypes of variables-
house.dtypes
#In the output we can see the data types for each variable. 

## Summary Statistics

In [None]:
#View the statistics for all the numeric data
house.describe()

We can see from the output that the minimum value for SalePrice is 34900 and the maximum value is 755000.

From the output we can see the count, mean, standard deviation, minimum, maximum, 1st quartile, 2nd quartile (median) and 3rd quartile for every numeric variable.

Display statistics for Categorical Variables

In [None]:
cat_var=house.select_dtypes(include = 'object').describe()
cat_var

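From the output we can see, for each categorical variable, the count of non-null observations, the number of unique values, the most frequent value (top) and its frequency (freq). For example, Alley has only 2 unique values among its non-null observations, because missing values are not counted as a category. Note that MSSubClass is still stored as an integer at this point, so it does not appear in this output; it has 15 unique values and is converted to a categorical variable in the next step.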
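To make the four rows of that summary concrete, here is a minimal sketch on a toy dataframe. The values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy stand-in for a few housing columns (illustrative values only).
toy = pd.DataFrame({
    "MSZoning": pd.Series(["RL", "RL", "RM", "RL"], dtype="object"),
    "Alley": pd.Series(["Grvl", None, None, "Pave"], dtype="object"),
    "LotArea": [8450, 9600, 11250, 9550],
})

# describe() on object columns reports count (non-null), unique, top and freq.
cat_summary = toy.select_dtypes(include="object").describe()
print(cat_summary)
```

The numeric column LotArea is excluded, and the missing Alley entries reduce its count rather than adding a category.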

In [None]:
#Some variables are stored as integers although they are really categorical (their numbers are codes or ratings).
#We need to convert these variables to the object type.
#Converting MSSubClass, OverallQual, and OverallCond to categorical.
house['MSSubClass']=house['MSSubClass'].astype('object')
house['OverallQual']=house['OverallQual'].astype('object')
house['OverallCond']=house['OverallCond'].astype('object')

Here we are converting MSSubClass, OverallQual and OverallCond to categorical variables.

In [None]:
house.dtypes

In this output we can see that MSSubClass, OverallQual and OverallCond have been converted to the object datatype.
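The effect of the conversion can be checked directly: once a column is cast to object, it is no longer selected as numeric. A small sketch on toy data (illustrative values, not the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({"MSSubClass": [20, 60, 20], "LotArea": [8450, 9600, 11250]})

# Before the cast both columns count as numeric.
before = list(toy.select_dtypes(include="number").columns)

# Cast the coded column to object so it is treated as categorical.
toy["MSSubClass"] = toy["MSSubClass"].astype("object")
after = list(toy.select_dtypes(include="number").columns)

print(before)
print(after)
```

This matters later in the notebook, because the converted columns drop out of describe(), the IQR filter and the correlation matrix.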

In [None]:
house.info()

The info function gives a concise summary of the dataframe. Here we can see how many non-null observations there are for each variable out of the total of 1460 observations. It also shows how many int, float and object variables there are in the dataframe.

In [None]:
house.GarageCars.unique()

From the output we can see that there are 5 unique values- ranging from 0 to 4- for Garage Cars in the dataset house.

In [None]:
house.GarageCars.value_counts()

We can see from the output that most garages have a capacity of 2 cars, and that garages with a capacity of 4 cars are very rare.

In [None]:
#Now we'll look for missing values in the dataset
house.isnull().sum()

From the output we can see the count of missing values for each variable. For example, PoolQC has the highest number of missing values, 1453.

In [None]:
# No of missing values in percentage
no_of_na=house.isnull().sum()

percent_of_na=no_of_na*100/house.shape[0]
print(percent_of_na)

In the above output the missing values are shown in percentage.
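The count and the percentage can also be collected into one table that lists only the affected columns. The following is a sketch; the helper name `missing_summary` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column, largest first."""
    count = df.isnull().sum()
    percent = df.isnull().mean() * 100  # mean of a boolean mask = share of True
    summary = pd.DataFrame({"missing_count": count, "missing_percent": percent})
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
result = missing_summary(toy)
print(result)
```

On the real dataframe this would be called as `missing_summary(house)`.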

In [None]:
%matplotlib inline
plt.figure(figsize=(20,10))

colormap=sns.cubehelix_palette(light=1, as_cmap=True, reverse=True)
sns.heatmap(house.isnull(),cmap=colormap)

#The heat map gives another view of where the missing values are

From the above heatmap we see missing values and which columns have missing values. The white lines represent the missing value in the dataframe.

In [None]:
# any() function returns only those columns which have missing values
columns_with_missing_values=house.columns[house.isnull().any()]
print(columns_with_missing_values)

The output above shows the names of the columns which contain missing values.

In [None]:
house[columns_with_missing_values].isnull().sum()

In [None]:
# A figure (fig) contains one or more axes, and each axis holds a plot.
# 1. Define fig and the axes, along with the number of rows and columns.
# 2. Choose the plot type for each axis, e.g. ax1.barh(...) and ax2.barh(...),
# then set the titles and labels with ax1.set_title(...), ax1.set_xlabel(...), and likewise for ax2.
#To hold variable names
labels=[]

#To hold the count of missing values for each variable
valuecount=[]

# To hold the percentage of missing values
percentcount=[]
for col in columns_with_missing_values:
    labels.append(col)
    valuecount.append(house[col].isnull().sum())
    percentcount.append(house[col].isnull().sum()*100/house.shape[0])
ind=np.arange(len(labels))
fig, (ax1,ax2)=plt.subplots(1,2,figsize=(20,18))

rects=ax1.barh(ind, np.array(valuecount),color='blue')
ax1.set_yticks(ind)
ax1.set_yticklabels(labels,rotation='horizontal')
ax1.set_xlabel("Count of missing values")
ax1.set_title("Variables with missing values");


rects=ax2.barh(ind, np.array(percentcount),color='pink')
ax2.set_yticks(ind)
ax2.set_yticklabels(labels, rotation='horizontal')
ax2.set_xlabel("Percentage of missing values")
ax2.set_title("Variables with missing values");

The graph on the left represents the count of missing values while the one on the right shows the percentage of missing values.

In [None]:
#Replacing NA values with their original meaning
#The result is assigned back because fillna(inplace=True) on a selected column may not update the dataframe in newer pandas
house['BsmtQual']=house['BsmtQual'].fillna('No basement')
house['BsmtCond']=house['BsmtCond'].fillna('No basement')
house['BsmtExposure']=house['BsmtExposure'].fillna('No basement')
house['BsmtFinType1']=house['BsmtFinType1'].fillna('No basement')
house['BsmtFinType2']=house['BsmtFinType2'].fillna('No basement')

house['GarageType']=house['GarageType'].fillna('No garage')

#NA in Alley means no alley access; NA in the other garage columns means no garage
house['Alley']=house['Alley'].fillna('No alley access')
house['GarageFinish']=house['GarageFinish'].fillna('No garage')
house['GarageQual']=house['GarageQual'].fillna('No garage')
house['GarageCond']=house['GarageCond'].fillna('No garage')

house['PoolQC']=house['PoolQC'].fillna('No pool')
house['Fence']=house['Fence'].fillna('No fence')

house['MiscFeature']=house['MiscFeature'].fillna('None')
house['FireplaceQu']=house['FireplaceQu'].fillna('No fireplace')

#For observations where GarageType is null, we replace null values in GarageYrBlt with 0
house['GarageYrBlt']=house['GarageYrBlt'].fillna(0)

Here we have replaced the missing values for the categorical variables (and for GarageYrBlt) where NA simply means that the feature is absent. Now we will replace the missing values in the numeric variable LotFrontage with its median.
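The same kind of replacement can be written in one call by passing a column-to-value mapping to `DataFrame.fillna`. This is a sketch on toy data rather than the notebook's own code; the dictionary name `fill_values` and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "GarageType": [np.nan, "Attchd", "Detchd"],
    "GarageYrBlt": [np.nan, 1976.0, 2001.0],
})

# One replacement value per column; columns not listed are left untouched.
fill_values = {"BsmtQual": "No basement", "GarageType": "No garage", "GarageYrBlt": 0}
toy = toy.fillna(value=fill_values)
print(toy)
```

Keeping the mapping in one dictionary makes it easy to review which placeholder each column receives.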

In [None]:
#Replacing the missing values in LotFrontage with the median, which is less sensitive to outliers and skewness than the mean.
medianlotfront=house['LotFrontage'].median()
house['LotFrontage']=house['LotFrontage'].fillna(medianlotfront)
house.isnull().sum()

In [None]:
plt.figure(figsize=(20,10))
colormap=sns.cubehelix_palette(light=1, as_cmap=True, reverse=True)
sns.heatmap(house.isnull(),cmap=colormap)

We can see from the heatmap that there are still a few missing values in MasVnrArea. So we generate a cross tabulation of MasVnrType and MasVnrArea to see where the missing values are.

In [None]:
#Using crosstab to generate the count of MasVnrArea by MasVnrType
print(pd.crosstab(house['MasVnrType'],\
                 house['MasVnrArea'],dropna=False,margins=True))

In [None]:
#We can see wherever MasVnrType=None there MasVnrArea=0
#Except 2 cases where MasVnrArea=1
house['MasVnrType']=house['MasVnrType'].fillna('None')
house['MasVnrArea']=house['MasVnrArea'].fillna(0)

Since most of the observations in MasVnrType are None, we fill its NA values with None, and in MasVnrArea we fill NA with 0.
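Filling with the most frequent category can also be done without hard-coding the value, by looking it up with `mode()`. A minimal sketch on toy data (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only); "None" here is an ordinary string category.
toy = pd.DataFrame({
    "MasVnrType": ["None", "BrkFace", "None", np.nan, "Stone", "None"],
    "MasVnrArea": [0.0, 196.0, 0.0, np.nan, 240.0, 0.0],
})

# mode() ignores missing values and returns the most frequent category first.
most_common_type = toy["MasVnrType"].mode()[0]
toy["MasVnrType"] = toy["MasVnrType"].fillna(most_common_type)

# A house without veneer has zero veneer area.
toy["MasVnrArea"] = toy["MasVnrArea"].fillna(0)
print(toy)
```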

In [None]:
#Generating cross tabulation of electrical and MSSubClass
print(pd.crosstab(house['Electrical'],\
                 house['MSSubClass'],dropna=False, margins=True))

In [None]:
#Let us take a look at the observation where electrical has a missing value
house.loc[house['Electrical'].isnull(),'MSSubClass']

In [None]:
#We note that where Electrical is missing the MSSubClass is 80. We notice that when MSSubClass is 80, the electrical type is
# SBrkr
house['Electrical']=house['Electrical'].fillna('SBrkr')
plt.figure(figsize=(20,10))
colormap=sns.cubehelix_palette(light=1, as_cmap=True, reverse=True)
sns.heatmap(house.isnull(),cmap=colormap)

The heat map generated above shows there are no missing values in the dataframe now. We have replaced all the NA.
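A heat map can hide a single missing cell among 1460 rows, so a numeric check is a useful complement. The helper below is a sketch; its name `columns_still_missing` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def columns_still_missing(df):
    """Return the names of columns that still contain at least one missing value."""
    return df.columns[df.isnull().any()].tolist()

# Toy frames (illustrative values only).
clean = pd.DataFrame({"Electrical": ["SBrkr", "FuseA"], "MasVnrArea": [0.0, 196.0]})
dirty = pd.DataFrame({"Electrical": ["SBrkr", None], "MasVnrArea": [0.0, 196.0]})

print(columns_still_missing(clean))
print(columns_still_missing(dirty))
```

Calling `columns_still_missing(house)` at this point should return an empty list if every NA has been replaced.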

In [None]:
#Creating new Meaningful Variables

#Many variables are not very useful on their own. However, transforming them can add a lot of insight to our analysis.
#A few variables like YearBuilt and YearRemodAdd represent the original construction date and the remodel date respectively.
#If these variables are converted into an age, they tell us how old the building is.

#Importing datetime package for date time operations
import datetime as dt

#Using datetime package to find the current year
current_year=int(dt.datetime.now().year)

#Subtracting the YearBuilt from current year to find out the age of building
building_age=current_year-house['YearBuilt']
building_age

We have computed the age of each building and stored it in a separate variable, building_age (it is not added to the dataframe). To calculate it we subtracted YearBuilt from the current year. The output shows the age of each building, that is, of each observation. Note that the values depend on the year in which the notebook is run.
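A variant worth considering is to measure age relative to the sale year, which gives the same result whenever the notebook is run. This is a sketch and not the approach used above; the column names `AgeAtSale` and `YearsSinceRemodel` are hypothetical and the values are illustrative.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
})

# Age of the building, and time since the last remodel, at the moment of sale.
toy["AgeAtSale"] = toy["YrSold"] - toy["YearBuilt"]
toy["YearsSinceRemodel"] = toy["YrSold"] - toy["YearRemodAdd"]
print(toy)
```

Storing the result as dataframe columns also makes it available to later steps such as the correlation matrix.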

In [None]:
house_col=[col for col in house.columns.values if house[col].dtype=='object']
house_cat=house[house_col]
house_num=house.drop(columns=house_col)

In the above code we are separating the numerical and categorical columns.

In [None]:
house_cat.head(3)

# Numerical and Categorical Features

In [None]:
#Pulling out names of numerical variables by conditioning datatypes
#Not equal to object type

numerical_features=house.dtypes[house.dtypes!="object"].index

print("Number of numerical features",len(numerical_features))
print(numerical_features)

#Pulling out names of categorical variables by conditioning dtypes
#Equal to object type

categorical_features=house.dtypes[house.dtypes=="object"].index
print("Number of categorical features",len(categorical_features))
print(categorical_features)

Here in the output we can see the names of both numerical and categorical variables separately.

In [None]:
f=pd.melt(house, id_vars =['SalePrice'], value_vars =['MSSubClass'])
f

In the above output we can see the value of SalePrice for each value of MSSubClass.
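To see what `pd.melt` does to the shape of the data, here is a small sketch with two value columns on toy data (illustrative values only). Each original row produces one output row per melted column.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [208500, 181500],
    "MSSubClass": [60, 20],
    "LotArea": [8450, 9600],
})

# Wide to long: SalePrice is repeated, column names go to "variable", cells go to "value".
long_form = pd.melt(toy, id_vars=["SalePrice"], value_vars=["MSSubClass", "LotArea"])
print(long_form)
```

This long format is what `sns.FacetGrid` needs in order to draw one panel per variable.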

# Distribution for all Numeric variables

In [None]:
f=pd.melt(house, id_vars =['SalePrice'], value_vars = numerical_features[numerical_features !='SalePrice'])
f

In [None]:
#sharex means share the same x axis, sharey means share the same y axis
#sns.distplot is deprecated in recent seaborn versions; histplot with kde=True draws a similar histogram with a density curve
g = sns.FacetGrid(f, col="variable", col_wrap=5, sharex=False, sharey=False)
g = g.map(sns.histplot,"value", color="red", kde=True, stat="density")
g

In the above output we can see the distribution plot for each numerical variable.

In [None]:
sns.set(font_scale=1.2)

f=pd.melt(house,id_vars=['SalePrice'],value_vars=house_col)

facetobject=sns.FacetGrid(f,col='variable',col_wrap=2,sharex=False,sharey=False,height=6)

facetobject.map(sns.boxplot,"value","SalePrice",palette="Set3")
facetobject.fig.subplots_adjust(wspace=.25,hspace=0.25)

for ax in facetobject.axes.flat:
    plt.setp(ax.get_xticklabels(),rotation=90)

Here we have plotted the boxplot for every categorical variable with respect to the Sale Price. We can see there are many outliers in the boxplot.

# Removing Outliers

In [None]:
#Quantiles are only defined for numeric columns, so we compute them on house_num
q1 = house_num.quantile(0.25,axis=0)
q3 = house_num.quantile(0.75,axis=0)
iqr = q3 - q1
print(iqr)

Here we have calculated the IQR for each numerical variable.

In [None]:
house_num_iq= house_num[~((house_num < (q1 - 1.5 * iqr))|(house_num > (q3 + 1.5 * iqr))).any(axis=1)]
house_num_iq.shape

After calculating the IQR we have removed the outliers. Using the shape attribute we can see that there are 585 observations which are outliers.
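The rule applied above drops a row if any of its numeric values lies more than 1.5 × IQR below the first quartile or above the third quartile. A self-contained sketch on toy data follows; the helper name `drop_iqr_outliers` and the values are assumptions for illustration.

```python
import pandas as pd

def drop_iqr_outliers(df, k=1.5):
    """Keep only rows where every column lies within [q1 - k*iqr, q3 + k*iqr]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    # A row is removed as soon as one of its columns is flagged.
    return df[~outside.any(axis=1)]

# Toy numeric data (illustrative values only); the last LotArea is extreme.
toy = pd.DataFrame({
    "LotArea": [8000, 8500, 9000, 9500, 10000, 200000],
    "GrLivArea": [1200, 1300, 1400, 1500, 1600, 1700],
})
kept = drop_iqr_outliers(toy)
print(kept)
```

Because one flagged column is enough to remove a row, applying the rule across many columns at once can discard a large share of the data.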

In [None]:
house_num.shape

In [None]:
house_num_iq.head()

In [None]:
#Get correlation of numeric variables
df_numerical_features=house.select_dtypes(include=[np.number])

#Storing the correlation between all numeric variables
correlation=df_numerical_features.corr()

#Sort the correlation of all numeric variables with SalePrice (printed, since only the last expression in a cell is displayed)
print(correlation["SalePrice"].sort_values(ascending=False)*100)

#Correlation with heatmap (Seaborn library)
fig, ax=plt.subplots(figsize=(40,40))

#Setting title for the correlation heat map
plt.title("Correlation of numeric features with Sale Price",y=1,size=20)

#cmap-matplotlib colormap name or object - can be used to set the color 
#vmin and vmax is used to anchor the colormap
sns.heatmap(correlation,square=True, vmin=-0.2, vmax=0.8,cmap="YlGnBu",annot=True)

We can see the correlation map above, which also displays the correlation value for each pair of variables. SalePrice tends to increase with garage car capacity (GarageCars). It is also strongly correlated with GarageArea, GrLivArea (above-ground living area in square feet) and TotalBsmtSF (total square feet of basement area).
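With many variables, a ranked list of correlations with the target is often easier to read than the full heat map. The sketch below uses toy data; the values are made up so that the ranking is easy to verify.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "GarageArea": [200, 400, 300, 500],
    "YrSold": [2010, 2008, 2009, 2007],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Pearson correlation of each numeric feature with SalePrice, strongest positive first.
corr_with_price = (
    toy.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value instead would bring strong negative correlations to the top as well.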
