The data is still the housing data. This is a different set of explorations by another dude named Pedro Marcelino.



# Goal of this notebook

1. Understand the problem. We'll look at each variable and do a philosophical analysis of its meaning and importance for this problem.
2. Univariable study. We'll just focus on the dependent variable ('SalePrice') and try to know a little bit more about it.
3. Multivariate study. We'll try to understand how the dependent variable and independent variables relate.
4. Basic cleaning. We'll clean the dataset and handle the missing data, outliers and categorical variables.
5. Test assumptions. We'll check if our data meets the assumptions required by most multivariate techniques.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [3]:
#read the data in
fname='../Python EDA/train.csv'
df_train=pd.read_csv(fname)

In [8]:
df_train.columns.tolist()

['Id',
 'MSSubClass',
 'MSZoning',
 'LotFrontage',
 'LotArea',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'KitchenQual',
 'TotRmsAbvGrd',
 'Functional',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'WoodDeckSF',
 'OpenPorchSF'

# 1. So... What can we expect?
To understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset.

In order to have some discipline in our analysis, we can create an Excel spreadsheet with the following columns:

1. Variable - Variable name.
2. Type - Identification of the variable's type. There are two possible values for this field: 'numerical' or 'categorical'. By 'numerical' we mean variables for which the values are numbers, and by 'categorical' we mean variables for which the values are categories.
3. Segment - Identification of the variable's segment. We can define three possible segments: building, space or location. When we say 'building', we mean a variable that relates to the physical characteristics of the building (e.g. 'OverallQual'). When we say 'space', we mean a variable that reports space properties of the house (e.g. 'TotalBsmtSF'). Finally, when we say 'location', we mean a variable that gives information about the place where the house is located (e.g. 'Neighborhood').
4. Expectation - Our expectation about the variable's influence on 'SalePrice'. We can use a categorical scale with 'High', 'Medium' and 'Low' as possible values.
5. Conclusion - Our conclusions about the importance of the variable, after we take a quick look at the data. We can keep the same categorical scale as in 'Expectation'.
6. Comments - Any general comments that occurred to us.

While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'. To fill this column, we should read the description of all the variables and, one by one, ask ourselves:

1. Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its 'Masonry veneer type'?).
2. If so, how important would this variable be? (e.g. What is the impact of having 'Excellent' material on the exterior instead of 'Poor'? And of having 'Excellent' instead of 'Good'?).
3. Is this information already described in any other variable? (e.g. If 'LandContour' gives the flatness of the property, do we really need to know the 'LandSlope'?).


In [18]:
test=pd.DataFrame([[x,df_train[x].dtype] for x in df_train.columns.tolist()])
test.columns=["Variable","Type"]

In [25]:
test.head()

Unnamed: 0,Variable,Type
0,Id,int64
1,MSSubClass,int64
2,MSZoning,object
3,LotFrontage,float64
4,LotArea,int64
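
The 'Type' column above holds raw pandas dtypes, while the spreadsheet plan asks for 'numerical' or 'categorical'. A minimal sketch of one way to derive that label, assuming a purely dtype-based rule is acceptable. The `toy` frame and the `coarse_type` helper are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny stand-in for df_train (made-up values; only the dtypes matter here)
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "MSZoning": ["RL", "RL", "RM"],
    "LotFrontage": [65.0, 80.0, None],
})

def coarse_type(series):
    """Return 'numerical' for numeric dtypes, 'categorical' otherwise."""
    return "numerical" if is_numeric_dtype(series) else "categorical"

# Same shape as the `test` frame: one row per column, with a coarse type label
types = pd.DataFrame(
    [[col, coarse_type(toy[col])] for col in toy.columns],
    columns=["Variable", "Type"],
)
print(types)
```

A dtype rule is only a first pass: 'MSSubClass' is stored as integers, so it would be labelled 'numerical' even though its description ("The building class") suggests it may be better treated as a category.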


In [30]:
fname1="../Python EDA/Col Descriptions.csv"
description=pd.read_csv(fname1,header=None)
description.head()

Unnamed: 0,0,1
0,SalePrice,the property's sale price in dollars. This is ...
1,MSSubClass,The building class
2,MSZoning,The general zoning classification
3,LotFrontage,Linear feet of street connected to property
4,LotArea,Lot size in square feet


In [31]:
description.columns=["Variable","Description"]
description.head()

Unnamed: 0,Variable,Description
0,SalePrice,the property's sale price in dollars. This is ...
1,MSSubClass,The building class
2,MSZoning,The general zoning classification
3,LotFrontage,Linear feet of street connected to property
4,LotArea,Lot size in square feet


In [34]:
test.set_index("Variable").head()

Variable,Type
Id,int64
MSSubClass,int64
MSZoning,object
LotFrontage,float64
LotArea,int64


In [50]:
test["Variable"].dtype

dtype('O')

In [35]:
description.set_index("Variable").head()

Variable,Description
SalePrice,the property's sale price in dollars. This is ...
MSSubClass,The building class
MSZoning,The general zoning classification
LotFrontage,Linear feet of street connected to property
LotArea,Lot size in square feet


In [51]:
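# Join the two data frames on their shared "Variable" column.
# merge() matches on columns; DataFrame.join() matches against the other
# frame's index by default, so join() needs "Variable" set as the index first.
result=description.merge(test,on="Variable")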
result.head()

Unnamed: 0,Variable,Description,Type
0,SalePrice,the property's sale price in dollars. This is ...,int64
1,MSSubClass,The building class,int64
2,MSZoning,The general zoning classification,object
3,LotFrontage,Linear feet of street connected to property,float64
4,LotArea,Lot size in square feet,int64


In [54]:
result["Segment"]=""
result["Expectation"]=""
result["Conclusion"]=""
result.head()

Unnamed: 0,Variable,Description,Type,Segment,Expectation,Conclusion
0,SalePrice,the property's sale price in dollars. This is ...,int64,,,
1,MSSubClass,The building class,int64,,,
2,MSZoning,The general zoning classification,object,,,
3,LotFrontage,Linear feet of street connected to property,float64,,,
4,LotArea,Lot size in square feet,int64,,,


In [57]:
result.to_csv("result.csv")
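
One thing to keep in mind with `to_csv`: by default it also writes the row index as an unnamed first column, which shows up as 'Unnamed: 0' when the file is read back. A small sketch of the difference, using an in-memory buffer and a made-up one-row frame instead of the real `result`:

```python
import io
import pandas as pd

result = pd.DataFrame({"Variable": ["LotArea"], "Type": ["int64"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
result.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False: only the real columns are written
without_index = io.StringIO()
result.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)
```

Whether to pass `index=False` depends on whether the row numbers are useful in the spreadsheet; for a variable checklist like this one they probably are not.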