In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train= pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
train

**Some insight of data.**

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
for i in train.columns:
    print(i,' ', train[i].isnull().sum())

In [None]:
#Categorical features
categorical_feats= train.dtypes[train.dtypes =='object']
categorical_feats

In order to understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset. To keep some discipline in the analysis, we can record the following fields for each variable in a spreadsheet:

**Variable** - Variable name.

**Type** - Identification of the variables' type. There are two possible values for this field: 'numerical' or 'categorical'. By 'numerical' we mean variables for which the values are numbers, and by 'categorical' we mean variables for which the values are categories.

**Segment** - Identification of the variables' segment. We can define three possible segments: building, space or location. When we say 'building', we mean a variable that relates to the physical characteristics of the building (e.g. 'OverallQual'). When we say 'space', we mean a variable that reports space properties of the house (e.g. 'TotalBsmtSF'). Finally, when we say a 'location', we mean a variable that gives information about the place where the house is located (e.g. 'Neighborhood').

**Expectation** - Our expectation about the variable influence in 'SalePrice'. We can use a categorical scale with 'High', 'Medium' and 'Low' as possible values.

**Conclusion** - Our conclusions about the importance of the variable, after we give a quick look at the data. We can keep with the same categorical scale as in 'Expectation'.

**Comments** - Any general comments that occurred to us.

While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'. To fill this column, we should read the description of all the variables and, one by one, ask ourselves:

* Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its 'Masonry veneer type'?).

* If so, how important would this variable be? (e.g. What is the impact of having 'Excellent' material on the exterior instead of 'Poor'? And of having 'Excellent' instead of 'Good'?).

* Is this information already described in any other variable? (e.g. If 'LandContour' gives the flatness of the property, do we really need to know the 'LandSlope'?).

After this daunting exercise, we can filter the spreadsheet and look carefully at the variables with a 'High' 'Expectation'. Then, we can rush into some scatter plots between those variables and 'SalePrice', filling in the 'Conclusion' column, which is just the correction of our expectations.
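The spreadsheet does not have to live in Excel. As a rough sketch, it could also be kept as a small pandas DataFrame. The rows and the 'Type' and 'Expectation' values below are illustrative placeholders, not the result of the actual review, and the names `worksheet` and `high_expectation` are made up for this example.

```python
import pandas as pd

# Hypothetical worksheet: one row per variable, columns as described above.
# The 'Type' and 'Expectation' entries are illustrative placeholders.
worksheet = pd.DataFrame(
    [
        ("OverallQual", "categorical", "building", "High"),
        ("YearBuilt", "numerical", "building", "High"),
        ("TotalBsmtSF", "numerical", "space", "High"),
        ("GrLivArea", "numerical", "space", "High"),
        ("Neighborhood", "categorical", "location", "High"),
        ("MasVnrType", "categorical", "building", "Low"),
    ],
    columns=["Variable", "Type", "Segment", "Expectation"],
)
worksheet["Conclusion"] = ""  # to be filled in after looking at the plots
worksheet["Comments"] = ""

# Filter down to the variables we expect to matter most.
high_expectation = worksheet[worksheet["Expectation"] == "High"]
print(high_expectation["Variable"].tolist())
```

Keeping the table in code makes the filtering step repeatable, at the cost of being less convenient to edit by hand than a real spreadsheet.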

I went through this process and concluded that the following variables can play an important role in this problem:

* OverallQual (which is a variable that I don't like because I don't know how it was computed; a funny exercise would be to predict 'OverallQual' using all the other variables available).
* YearBuilt.
* TotalBsmtSF.
* GrLivArea.

I ended up with two 'building' variables ('OverallQual' and 'YearBuilt') and two 'space' variables ('TotalBsmtSF' and 'GrLivArea'). This might be a little bit unexpected as it goes against the real estate mantra that all that matters is 'location, location and location'. It is possible that this quick data examination process was a bit harsh for categorical variables. For example, I expected the 'Neighborhood' variable to be more relevant, but after the data examination I ended up excluding it. Maybe this is related to the use of scatter plots instead of boxplots, which are more suitable for visualizing categorical variables. The way we visualize data often influences our conclusions.
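To make the boxplot point concrete, here is a small sketch on synthetic data. The three neighborhoods and their price levels are invented, so nothing here says anything about the real 'Neighborhood' column. A box plot is essentially a picture of the per-category quartiles, which can also be computed directly:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: three made-up neighborhoods with
# different typical prices. None of these numbers come from the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Neighborhood": np.repeat(["A", "B", "C"], 50),
    "SalePrice": np.concatenate([
        rng.normal(120_000, 15_000, 50),
        rng.normal(180_000, 20_000, 50),
        rng.normal(300_000, 40_000, 50),
    ]),
})

# The quartiles per category are what the box in a box plot is drawn from.
summary = (
    toy.groupby("Neighborhood")["SalePrice"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary.round(0))
```

On the real data, `sns.boxplot(x='Neighborhood', y='SalePrice', data=train)` would show the same kind of comparison graphically.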

However, the main point of this exercise was to think a little about our data and expectations, so I think we achieved our goal. Now it's time for 'a little less conversation, a little more action please'. Let's shake it!

**First things first: analysing 'SalePrice'**

'SalePrice' is the reason for our quest.



In [None]:
train['SalePrice'].describe()

*Minimum value is greater than 0. That's a green signal for us.*

In [None]:
sns.distplot(train['SalePrice'])

**Ah!** With the help of our beloved seaborn, we can conclude that 'SalePrice':

* Deviates from the normal distribution.
* Has appreciable positive skewness.
* Shows peakedness.

In [None]:
print('Skewness :', train['SalePrice'].skew())
print('Kurtosis :', train['SalePrice'].kurt())
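To get a feel for what these two numbers mean, the sketch below compares a synthetic normal sample with a synthetic right-skewed one. Both samples are generated here and are unrelated to the house price data. One detail worth knowing: pandas reports excess kurtosis, so a normal distribution scores close to 0 on both measures.

```python
import numpy as np
import pandas as pd

# Two synthetic samples: one normal, one right-skewed (lognormal).
rng = np.random.default_rng(42)
normal_sample = pd.Series(rng.normal(size=100_000))
skewed_sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=100_000))

# pandas reports *excess* kurtosis, so a normal distribution scores about 0.
print("normal :", round(normal_sample.skew(), 2), round(normal_sample.kurt(), 2))
print("skewed :", round(skewed_sample.skew(), 2), round(skewed_sample.kurt(), 2))
```

Values clearly above 0 for both measures, as in the second sample, point to a long right tail and a sharper peak than the normal distribution.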

**Relationship with numerical variables**

In [None]:
sns.scatterplot(x=train['GrLivArea'],y=train['SalePrice'])
plt.show()

Hmmm... It seems that 'SalePrice' and 'GrLivArea' are really old friends, with a **linear relationship**.

And what about 'TotalBsmtSF'?

In [None]:
sns.scatterplot(x=train['TotalBsmtSF'],y=train['SalePrice'])
plt.show()

'TotalBsmtSF' is also a great friend of 'SalePrice' but this seems a much more emotional relationship! Everything is ok and suddenly, in a **strong linear (exponential?) reaction**, everything changes. Moreover, it's clear that sometimes 'TotalBsmtSF' closes in itself and gives zero credit to 'SalePrice'.

**Relationship with categorical features**

In [None]:
#box plot overallqual/saleprice
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=train)
fig.axis(ymin=0, ymax=800000);

Like all the pretty girls, 'SalePrice' enjoys 'OverallQual'. Note to self: consider whether McDonald's is suitable for the first date.

In [None]:
f, ax = plt.subplots(figsize=(13, 8))
fig = sns.boxplot(x='YearBuilt', y="SalePrice", data=train)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

Although it's not a strong tendency, I'd say that 'SalePrice' is more prone to spend more money on new stuff than on old relics.

**Note**: we don't know if 'SalePrice' is in constant prices. Constant prices try to remove the effect of inflation. If 'SalePrice' is not in constant prices, it should be, so that prices are comparable over the years.
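As a rough sketch of what converting to constant prices could look like, assume the sale year is available in a column such as 'YrSold' and that we have a price index per year. The index values below are made-up placeholders, not real inflation figures, and `price_index` and `SalePriceConstant` are names invented for this example.

```python
import pandas as pd

# Made-up price index by sale year (base year 2010 = 100). These values are
# placeholders for illustration, not real inflation figures.
price_index = pd.Series({2008: 96.0, 2009: 98.0, 2010: 100.0})

sales = pd.DataFrame({
    "YrSold": [2008, 2009, 2010],
    "SalePrice": [192_000.0, 196_000.0, 200_000.0],
})

# Constant price = nominal price * (index in base year / index in sale year).
base_year = 2010
sales["SalePriceConstant"] = (
    sales["SalePrice"] * price_index.loc[base_year] / sales["YrSold"].map(price_index)
)
print(sales)
```

In this toy table the three nominal prices differ only because of the index, so they collapse to the same constant price.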

**In summary**

Stories aside, we can conclude that:

* 'GrLivArea' and 'TotalBsmtSF' seem to be linearly related to 'SalePrice'. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of 'TotalBsmtSF', we can see that the slope of the linear relationship is particularly high.
* 'OverallQual' and 'YearBuilt' also seem to be related to 'SalePrice'. The relationship seems to be stronger in the case of 'OverallQual', where the box plot shows how sales prices increase with the overall quality.
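Statements like 'linearly related' and 'the slope is high' can also be put into numbers. The sketch below does this on synthetic data with a built-in linear trend. The slope of 100 and the noise level are arbitrary choices, so the printed values say nothing about the real columns.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in linear trend plus noise; the slope of 100
# and the noise level are arbitrary choices for illustration.
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, 500)
price = 20_000 + 100 * area + rng.normal(0, 30_000, 500)
toy = pd.DataFrame({"GrLivArea": area, "SalePrice": price})

# Pearson correlation measures how tightly the points follow a line...
r = toy["GrLivArea"].corr(toy["SalePrice"])
# ...while a degree-1 least-squares fit gives the slope of that line.
slope, intercept = np.polyfit(toy["GrLivArea"], toy["SalePrice"], deg=1)
print(f"r = {r:.2f}, slope = {slope:.1f}")
```

The two numbers answer different questions: the correlation says how well a line describes the cloud of points, while the slope says how much 'SalePrice' changes per unit of the feature.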

We just analysed four variables, but there are many others that we should analyse. The trick here seems to be the choice of the right features (feature selection) and not the definition of complex relationships between them (feature engineering).

That said, let's separate the wheat from the chaff.

To explore the universe, we will start with some practical recipes to make sense of our 'plasma soup':

* Correlation matrix (heatmap style).
* 'SalePrice' correlation matrix (zoomed heatmap style).
* Scatter plots between the most correlated variables (move like Jagger style).



**Correlation matrix (heatmap style)**

In [None]:
#correlation matrix
corrmat = train.corr(numeric_only=True)  #skip the object (categorical) columns
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)

At first sight, there are two red colored squares that get my attention. The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables, and the second one refers to the 'GarageX' variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information so multicollinearity really occurs. Heatmaps are great to detect this kind of situation and in problems dominated by feature selection, like ours, they are an essential tool.
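Spotting red squares by eye works, but the same check can be sketched in code by listing every pair of features whose absolute correlation exceeds a threshold. The data below is synthetic ('GarageArea' is deliberately built from 'GarageCars'), and the 0.8 cut-off is an arbitrary assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'GarageArea' is built from 'GarageCars', so the two
# carry nearly the same information. The data is made up for illustration.
rng = np.random.default_rng(2)
cars = rng.integers(0, 4, 300)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": 250 * cars + rng.normal(0, 40, 300),
    "LotArea": rng.normal(10_000, 2_000, 300),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.8]
print(strong)
```

Each pair that shows up in `strong` is a candidate for keeping only one of the two variables.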

Another thing that got my attention was the 'SalePrice' correlations. We can see our well-known 'GrLivArea', 'TotalBsmtSF', and 'OverallQual' saying a big 'Hi!', but we can also see many other variables that should be taken into account. That's what we will do next.

**'SalePrice' correlation matrix (zoomed heatmap style)**

In [None]:
corrmat

In [None]:
corrmat.nlargest(5,'SalePrice')

In [None]:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
#cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(train[cols].corr(), cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
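One thing to keep in mind about the `nlargest` selection above: it ranks by the signed correlation, so a feature with a strong negative correlation would never make the cut. The sketch below shows the difference on a tiny hand-written series. All values are invented, and 'StrongNegative' and 'WeakFeature' are hypothetical names that do not exist in the dataset.

```python
import pandas as pd

# Tiny hand-written correlations with 'SalePrice'; all values are invented.
corr_with_price = pd.Series({
    "SalePrice": 1.0,
    "OverallQual": 0.8,
    "GrLivArea": 0.7,
    "WeakFeature": -0.1,      # hypothetical feature
    "StrongNegative": -0.75,  # hypothetical feature
})

k = 3
top_signed = corr_with_price.nlargest(k).index.tolist()
# Ranking by absolute value also surfaces strong negative relationships.
top_abs = corr_with_price.abs().nlargest(k).index.tolist()
print(top_signed)
print(top_abs)
```

Whether this matters for the real data depends on whether any feature is strongly negatively correlated with 'SalePrice', which the full heatmap can answer.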

In [None]:
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols], height = 2.5)
plt.show();

In [None]:
Total= train.isnull().sum()
Percent= train.isnull().mean()  #fraction of missing values per column
missing_data= pd.concat([Total,Percent],axis=1, keys= ['Total','Percent']).sort_values(by='Percent',ascending=False)
missing_data.head(20)

In [None]:
#drop every column that has at least one missing value
train= train.drop(columns=missing_data[missing_data['Total']>0].index)

In [None]:
train.shape

# **Outliers!**

In [None]:
sns.scatterplot(x=train['GrLivArea'],y=train['SalePrice'])
plt.show()

There are two points that do not follow the crowd. Let's delete them.

In [None]:
train_df= train.sort_values('GrLivArea',ascending= False)
train_df= train_df.drop(train_df[train_df['Id']==1299].index)
train_df= train_df.drop(train_df[train_df['Id']==524].index)
train_df
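Dropping rows by hard-coded 'Id' works, but it hides why those rows were chosen. A possible alternative is to write the rule down as a condition. The sketch below assumes the two odd points are the ones with a very large living area but a low price. The thresholds (area above 4000, price below 300000) are assumptions read off the scatter plot, and the toy rows are made up.

```python
import pandas as pd

# Toy frame with made-up rows; the thresholds below are assumptions read off
# the scatter plot, not values stated anywhere in the data description.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "GrLivArea": [1500, 4700, 5600, 4300],
    "SalePrice": [180_000, 185_000, 160_000, 750_000],
})

# Flag houses that are very large but sold for a low price.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300_000)
print("dropping Ids:", toy.loc[is_outlier, "Id"].tolist())
cleaned = toy[~is_outlier]
```

Note that the large and expensive house (Id 4 in the toy frame) is kept, because it still follows the general trend.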

In [None]:
sns.scatterplot(x=train_df['TotalBsmtSF'],y=train_df['SalePrice'])
plt.show()

We can feel tempted to eliminate some observations (e.g. TotalBsmtSF > 3000) but I suppose it's not worth it. We can live with that, so we'll not do anything.

<h4>In search of normality</h4>

In [None]:
from scipy.stats import norm
sns.distplot(train_df['SalePrice'], fit= norm)
plt.figure()
pro= stats.probplot(train_df['SalePrice'], plot= plt)


The data is skewed and not normally distributed.
Applying a log transformation should bring it closer to a normal distribution.

In [None]:
#applying log transformation
train_df['SalePrice']= np.log(train_df['SalePrice'])

In [None]:
#plot the log-transformed prices, which live in train_df (train still holds the raw values)
sns.distplot(train_df['SalePrice'], fit= norm)
plt.figure()
pro= stats.probplot(train_df['SalePrice'], plot= plt)

It now looks approximately normally distributed.
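The effect of the log transformation can also be checked numerically instead of by eye. The sketch below uses synthetic right-skewed 'prices' with arbitrary parameters, not the real 'SalePrice' column, and compares the skewness before and after. It also shows that `np.exp` undoes the transformation, which is needed whenever values on the log scale have to be reported in the original units.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; the parameters are arbitrary.
rng = np.random.default_rng(3)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5_000))

log_prices = np.log(prices)
print("skew before:", round(prices.skew(), 2))
print("skew after :", round(log_prices.skew(), 2))

# np.exp is the inverse of np.log, so the original values can be recovered.
recovered = np.exp(log_prices)
```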

In [None]:
sns.distplot(train_df['TotalBsmtSF'],fit= norm)
fig= plt.figure()
prob= stats.probplot(train_df['TotalBsmtSF'],plot= plt)

Ok, now we are dealing with the big boss. What do we have here?

* Something that, in general, presents skewness.
* A significant number of observations with value zero (houses without basement).
* A big problem because the value zero doesn't allow us to do log transformations.


To apply a log transformation here, we'll create a variable that captures the effect of having or not having a basement (binary variable). Then, we'll apply a log transformation to all the non-zero observations, ignoring those with value zero. This way we can transform the data without losing the effect of having or not having a basement.

I'm not sure if this approach is correct. It just seemed right to me. That's what I call 'high risk engineering'.

In [None]:
#binary flag: 1 if the house has a basement, 0 otherwise
train_df['Hasbsmt']= (train_df['TotalBsmtSF']>0).astype(int)

In [None]:
#log-transform the non-zero areas only; zeros are mapped to log(1) = 0, so they stay 0
train_df['TotalBsmtSF']= np.log(train_df['TotalBsmtSF'].where(train_df['Hasbsmt']==1, 1))
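To see what this does on a handful of values, here is the same idea on a toy frame with four made-up basement areas. This is only a sketch of the flag-plus-log approach described above, with `toy` and `mask` as illustrative names.

```python
import numpy as np
import pandas as pd

# Four made-up basement areas; two houses have no basement.
toy = pd.DataFrame({"TotalBsmtSF": [0.0, 800.0, 0.0, 1200.0]})

toy["Hasbsmt"] = (toy["TotalBsmtSF"] > 0).astype(int)
mask = toy["Hasbsmt"] == 1
# Log-transform only the rows selected by the flag; the zeros are left alone.
toy.loc[mask, "TotalBsmtSF"] = np.log(toy.loc[mask, "TotalBsmtSF"])
print(toy)
```

After the transformation a value of 0 is ambiguous on its own, since log(1) is also 0. The 'Hasbsmt' flag is what keeps 'no basement' distinguishable.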

In [None]:
sns.distplot(train_df[train_df['TotalBsmtSF']>0]['TotalBsmtSF'],fit= norm)
fig= plt.figure()
prob= stats.probplot(train_df[train_df['TotalBsmtSF']>0]['TotalBsmtSF'],plot= plt)

In [None]:
plt.scatter(train_df['GrLivArea'], train_df['SalePrice']);

End Of the Notebook