The data set describes the sale of individual residential property in Ames, Iowa
from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory
variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home
values.

In this notebook we will explore the Ames housing data set. We will focus on:
1. Removing outliers 
2. Dealing with missing data
3. Building and assessing the model

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns

## Raise the display limits (500 rows and 500 columns) so that the output of any command is shown in full
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# read the data 
df = pd.read_csv("../input/ames-housing-data/Ames_Housing_Data.csv")

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.info()

### 1. Checking for outliers
The following example shows why outliers are dangerous: they significantly affect the mean and the standard deviation, and thus the estimators of the model.

| | Data without outlier | Data with outlier |
|--|--|--|
|**Data**| 1, 2, 3, 3, 4, 5, 4 | 1, 2, 3, 3, 4, 5, **400** |
|**Mean**| 3.143 | **59.714** |
|**Median**| 3 | 3 |
|**Standard Deviation**| 1.345 | **150.057** |
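
The figures in the table can be reproduced with a few lines of pandas. This is a minimal sketch: the names `clean` and `with_outlier` are introduced here for illustration, and the standard deviation is the sample standard deviation (the pandas default, `ddof=1`).

```python
import pandas as pd

# Illustrative series matching the table (variable names are hypothetical)
clean = pd.Series([1, 2, 3, 3, 4, 5, 4])
with_outlier = pd.Series([1, 2, 3, 3, 4, 5, 400])

# Collect mean, median and sample standard deviation side by side
summary = pd.DataFrame(
    {
        "without_outlier": [clean.mean(), clean.median(), clean.std()],
        "with_outlier": [with_outlier.mean(), with_outlier.median(), with_outlier.std()],
    },
    index=["mean", "median", "std"],
)

print(summary.round(3))
```

Note that the median does not move at all, which is why it is often preferred as a robust summary when outliers are suspected.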

To see outliers visually, we need a box plot or a scatter plot.
Therefore, let's find the features most correlated with the sale price and plot them against it.

In [None]:
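# numeric_only = True restricts the correlation to numeric columns (needed in recent pandas versions)
df.corr(numeric_only = True)["SalePrice"].sort_values(ascending = False)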

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "Overall Qual", y = "SalePrice");

As we can see, there are some points with very high quality (10/10) but a very low price. Let's explore other features that are highly correlated with the sale price.

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "Gr Liv Area", y = "SalePrice");

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "Total Bsmt SF", y = "SalePrice");

The points with a very high price and a very high living area (in the top-right corner) are not outliers. They make sense because they follow the trend, so they will not hurt our model.

On the other hand, the 3 points in the lower-right corner have a very high living area but a very low price. They are very likely outliers because they do not follow the general trend.



#### Let's now check those points closely

In [None]:
df[(df["SalePrice"] < 200000) & (df["Overall Qual"] > 8)]

In [None]:
df[(df["SalePrice"] < 200000) & (df["Overall Qual"] > 8) & (df["Gr Liv Area"] > 4000)]

In [None]:
drop_index = df[(df["SalePrice"] < 200000) & (df["Overall Qual"] > 8) & (df["Gr Liv Area"] > 4000)].index

In [None]:
df = df.drop(drop_index, axis = 0)

#### Let's now repeat one of the scatter plots that we had before

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "Gr Liv Area", y = "SalePrice");

### 2. Dealing with missing data

In [None]:
df.head()

PID is just an identifier; it carries no predictive value for the model. We can either set it as the index or drop it. Dropping it causes no problems, because the default integer index (0, 1, 2, 3, ...) already identifies each row.

In [None]:
df = df.drop("PID", axis = 1)

In [None]:
df.info()

In [None]:
## let's create a function that can be reused for any future data
def percent_missing_data(df):
    missing_count = df.isna().sum().sort_values(ascending = False)
    missing_percent = 100 * df.isna().sum().sort_values(ascending = False) / len(df)
    missing_count = pd.DataFrame(missing_count[missing_count > 0])
    missing_percent = pd.DataFrame(missing_percent[missing_percent > 0])
    missing_table = pd.concat([missing_count,missing_percent], axis = 1)
    missing_table.columns = ["missing_count", "missing_percent"]
    
    return missing_table
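
To see what the helper returns, here is a self-contained sketch on a small made-up DataFrame. The function body is repeated in a slightly condensed form so the example runs on its own; the `toy` frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

def percent_missing_data(df):
    """Return the count and percentage of missing values for columns that have gaps."""
    missing_count = df.isna().sum().sort_values(ascending=False)
    missing_count = missing_count[missing_count > 0]
    missing_percent = 100 * missing_count / len(df)
    missing_table = pd.concat([missing_count, missing_percent], axis=1)
    missing_table.columns = ["missing_count", "missing_percent"]
    return missing_table

# Hypothetical toy data: 'a' has 1 gap, 'b' has 3 gaps, 'c' is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, "x"],
    "c": [1, 2, 3, 4],
})

toy_missing = percent_missing_data(toy)
print(toy_missing)
```

Columns without any missing values (here `c`) are left out of the table, and the remaining ones are sorted from most to least missing.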

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

In principle, we should go through each feature and decide whether to keep it, fill it, or drop it. When we speak about dropping, we can drop either columns or rows.

For example, Pool QC values are missing for 99.6 percent of houses. This might be due to one of the following:
1. These houses have no pool, and the value should have been recorded as zero rather than NaN.
2. These houses have a pool, but the data is actually missing.

We should go back to the description file and try to understand it better. But for now, let's deal with the columns that have very few missing values.
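
One way to tell the two explanations apart is to cross-check a quality column against its matching size column. The sketch below uses a small made-up frame; it assumes the data set has a `Pool Area` column that is 0 for houses without a pool, which should be confirmed against the description file before running the same check on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real columns
toy = pd.DataFrame({
    "Pool Area": [0, 0, 512, 0, 648],
    "Pool QC": [np.nan, np.nan, "Ex", np.nan, np.nan],
})

qc_missing = toy["Pool QC"].isna()

# Explanation 1: no pool at all (area is 0), so the quality is simply not applicable
no_pool = qc_missing & (toy["Pool Area"] == 0)

# Explanation 2: a pool exists (area > 0) but its quality was never recorded
truly_missing = qc_missing & (toy["Pool Area"] > 0)

print("No pool:", no_pool.sum())
print("Pool present, quality missing:", truly_missing.sum())
```

If almost all gaps fall into the first group, the NaN values encode "no pool" rather than lost data.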

In [None]:
## let's see the features that have less than one percent missing
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.ylim(0,1)
plt.show()

Let's now look at these rows; there might be houses with missing values across all of these features.

In [None]:
percent_nan[percent_nan["missing_percent"] < 1]

In [None]:
index = percent_nan[percent_nan["missing_percent"] < 1].index
for name in index:
    print(df[df["BsmtFin SF 2"].isnull()][name])

In [None]:
df[df["Garage Cars"].isnull()]["Garage Area"]

In [None]:
df = df.dropna(axis = 0, subset = ["Garage Cars"])

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
df[df["BsmtFin SF 1"].isnull()]

It seems that all basement-related features have missing values. If we go back to the data description, we find that NaN actually means that the house does not have a basement. The value is not missing; the house simply has no basement. Therefore, it makes sense to replace NaN in the basement string columns with a string saying that the house has no basement, and to replace NaN in the basement numeric columns with zero.

In [None]:
## basement numeric features ==> fillna 0
bsmt_num_cols = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF','Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath']
df[bsmt_num_cols] = df[bsmt_num_cols].fillna(0)

## basement string features ==> fillna none
bsmt_str_cols =  ['Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2']
df[bsmt_str_cols] = df[bsmt_str_cols].fillna('None')

In [None]:
# now if you check again, you will find no nulls
df[df["BsmtFin SF 1"].isnull()]

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

Electrical still has 1 missing value. Let's look at it closely and decide.

In [None]:
df[df["Electrical"].isnull()]

In [None]:
# You can either fill it with the mode or drop the row; here we drop it
df = df.dropna(axis = 0, subset = ["Electrical"])
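
For comparison, filling with the mode instead of dropping the row could look like the sketch below. It runs on a made-up series (the category labels are only placeholders), not on the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry
electrical = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"], name="Electrical")

# mode() returns a Series because ties are possible, so take the first value
most_common = electrical.mode()[0]
filled = electrical.fillna(most_common)

print(filled.tolist())
```

With a single missing row out of thousands, either choice has a negligible effect on the model; dropping simply avoids inventing a value.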

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

Both "Mas Vnr Area" and "Mas Vnr Type" have less than 1 percent null values. How should we deal with them?

Going back to the data description, we find that there is a category for none, meaning that the house has no "Mas Vnr" (masonry veneer). We can assume that the missing values are also "none" but were mistakenly recorded as NaN.

In [None]:
df[["Mas Vnr Area"]] = df[["Mas Vnr Area"]].fillna(0)
df[["Mas Vnr Type"]] = df[["Mas Vnr Type"]].fillna("None")

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

#### What to do with the rest?
The rest of the features have more than 1% missing data. We need to look at each one carefully and decide how to deal with it. Dropping rows is no longer a viable strategy, so we have two options:
1. Fill in the missing values
2. Drop the feature column

As for the garage features, the data description says that NaN means there is no garage. Therefore, it is reasonable to fill them with zero for numeric features and "None" for text features.

In [None]:
gar_str_cols = ['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond']
df[gar_str_cols] = df[gar_str_cols].fillna('None')

In [None]:
df['Garage Yr Blt'] = df['Garage Yr Blt'].fillna(0)

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

Some of the above features have more than 99 percent missing data. Dropping such sparse features can be the best strategy.

In [None]:
df = df.drop(["Pool QC", "Misc Feature", "Alley", "Fence"], axis = 1)
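
Rather than listing columns by hand, the same idea can be expressed with a threshold. This is a sketch on made-up data; the 80% cut-off and the helper name `drop_sparse_columns` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_percent=80.0):
    """Drop columns whose share of missing values exceeds the threshold (hypothetical helper)."""
    missing_percent = 100 * df.isna().mean()
    to_drop = missing_percent[missing_percent > max_missing_percent].index
    return df.drop(columns=to_drop)

# Hypothetical toy data
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],  # about 83% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],                # about 17% missing
    "complete": [1, 2, 3, 4, 5, 6],
})

reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

A threshold makes the decision explicit and repeatable, but it should not replace reading the data description: a column where NaN means "feature absent" may still be worth keeping after filling.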

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

Now we are left with just two columns. We have to be careful and think this through, because we cannot simply drop the rows or the feature columns: too few values are missing to justify dropping the feature, but too many to drop the rows.

In [None]:
df["Fireplace Qu"].value_counts()

Since it is a categorical variable we can fill missing data with "None"

In [None]:
df["Fireplace Qu"] = df["Fireplace Qu"].fillna("None")

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
df["Lot Frontage"].value_counts()

This one is tricky because it is numeric: we can no longer go back to the description and fill it with a convenient text label.
We will use the Neighborhood feature to estimate the missing values.

Neighborhood: Physical locations within Ames city limits

LotFrontage: Linear feet of street connected to property

We will operate under the assumption that the Lot Frontage is related to what neighborhood a house is in.

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(x = "Neighborhood", y = "Lot Frontage", data = df)
plt.xticks(rotation = 90)
plt.show()

As we can see each category is unique enough to make the assumption that we can impute the LotFrontage based on Neighborhood categories. 

In [None]:
df.groupby("Neighborhood")["Lot Frontage"].mean()

To achieve the intended result, we will use the pandas transform method. It is called on the group-by object and fills in the missing values of each group with that group's mean.

In [None]:
df["Lot Frontage"] = df.groupby("Neighborhood")["Lot Frontage"].transform(lambda value: value.fillna(value.mean()))
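
The sketch below shows the same pattern on a tiny made-up frame. It also illustrates one caveat: if every value in a group is missing, the group mean is itself NaN, so those rows stay unfilled and need a separate fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: neighborhood C has no recorded frontage at all
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "Lot Frontage": [60.0, np.nan, 80.0, 40.0, np.nan, np.nan],
})

# Fill each gap with the mean of its own neighborhood
toy["Lot Frontage"] = (
    toy.groupby("Neighborhood")["Lot Frontage"]
       .transform(lambda value: value.fillna(value.mean()))
)

print(toy)
```

Here the gap in neighborhood A becomes 70.0 and the gap in B becomes 40.0, while the single row in C remains NaN.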

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
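# Any rows still missing belong to neighborhoods with no recorded Lot Frontage at all
# (their group mean is NaN), so fall back to 0
df["Lot Frontage"] = df["Lot Frontage"].fillna(0)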

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

**Yeah! Congratulations! We did it. Nothing is missing anymore!**
 
