### Import the Libraries

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import gaussian_kde, skew, kurtosis

# Data Processing

In [37]:
# Load the training and test sets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print(train.shape)
print(test.shape)

(1460, 81)
(1459, 80)


In [38]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


The target variable is `SalePrice`. Excluding the `Id` column, which we use as the index below, the dataset has 79 features.

In [39]:
train = train.set_index('Id')

In [40]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuilt    

## House Price Distribution

In this section we explore the target variable `SalePrice` and its distribution, identifying potential outliers and deciding how to handle them.

In [41]:
print(train['SalePrice'].describe())

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64


We will plot the distribution of `SalePrice` and calculate the skewness and kurtosis of the data. 

We will also use Kernel Density Estimation (KDE) to estimate the probability density function of the data.

**Brief Theory: Kernel Density Estimation (KDE)**

Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density function (PDF) of a dataset.
Instead of counting points in bins as a histogram does, it places a continuous kernel (e.g., Gaussian) at each data point and averages them to produce a smooth curve.

**Formula**
$$
f(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)
$$

Where:
- $n$: Number of data points
- $x_i$: The $i$-th data point
- $h$: Bandwidth (controls smoothness)
- $K$: Kernel function (e.g., Gaussian kernel)
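To make the formula concrete, here is a minimal sketch that evaluates it directly with a Gaussian kernel and compares the result with `scipy.stats.gaussian_kde`. The data is a synthetic, right-skewed sample rather than the real `SalePrice` column, and `manual_gaussian_kde` is a hypothetical helper written for this illustration. The bandwidth $h$ is read back from the fitted SciPy object so that both computations use the same value.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic, right-skewed stand-in for a price column (illustrative only).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=12.0, sigma=0.4, size=500)


def manual_gaussian_kde(x, data, h):
    """Evaluate f(x) = 1/(n*h) * sum_i K((x - x_i) / h) with a Gaussian K."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h              # shape (len(x), n)
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (data.size * h)


kde = gaussian_kde(sample)

# For 1-D data, the effective bandwidth h is the square root of the
# (1 x 1) kernel covariance stored on the fitted object.
h = float(np.sqrt(kde.covariance[0, 0]))

grid = np.linspace(sample.min(), sample.max(), 200)
manual_vals = manual_gaussian_kde(grid, sample, h)
scipy_vals = kde(grid)

print("Bandwidth h:", h)
print("Max abs difference:", np.max(np.abs(manual_vals - scipy_vals)))
```

The two curves should agree up to floating-point error. A larger $h$ gives a smoother curve, and a smaller $h$ follows the individual data points more closely.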

In [49]:
# Calculate KDE for SalePrice
kde = gaussian_kde(train['SalePrice'])
x_vals = np.linspace(train['SalePrice'].min(), train['SalePrice'].max(), 1000)
kde_vals = kde(x_vals)

# Create the histogram normalized to a probability density,
# so that it shares the same y-axis scale as the KDE curve
fig = px.histogram(train, 
                   x='SalePrice', 
                   nbins=100, 
                   title='SalePrice Distribution with Density',
                   labels={'SalePrice': 'Sale Price'},
                   template='plotly_white',
                   histnorm='probability density')  # Bar areas sum to 1

fig.update_traces(marker_color='blue', opacity=0.75)

# Add the density curve
fig.add_trace(
    go.Scatter(
        x=x_vals,
        y=kde_vals,
        mode='lines',
        line=dict(color='black', width=2),
        name='Density Estimation'
    )
)

fig.update_layout(
    yaxis_title='Density',
    xaxis_title='Sale Price',
    showlegend=True
)

fig.show()

In [43]:
# Calculate skewness and kurtosis of the SalePrice
saleprice_skewness = skew(train['SalePrice'])
saleprice_kurtosis = kurtosis(train['SalePrice'])

print(f"Skewness: {saleprice_skewness}")
print(f"Kurtosis: {saleprice_kurtosis}")

Skewness: 1.880940746034036
Kurtosis: 6.509812011089439
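When reading these numbers, it helps to know which convention `scipy.stats.kurtosis` uses. By default it returns excess (Fisher) kurtosis, for which a normal distribution scores about 0. Passing `fisher=False` gives the Pearson definition, for which a normal distribution scores about 3. The sketch below illustrates the difference on a synthetic normal sample, not on the housing data.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Synthetic normal sample used as a reference point (illustrative only).
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)

excess = kurtosis(normal_sample)                 # Fisher (default): normal is about 0
pearson = kurtosis(normal_sample, fisher=False)  # Pearson: normal is about 3

print(f"Excess (Fisher) kurtosis: {excess:.3f}")
print(f"Pearson kurtosis:         {pearson:.3f}")
print(f"Skewness:                 {skew(normal_sample):.3f}")
```

Under the default convention, any value well above 0 indicates heavier tails than a normal distribution.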


#### Interpretation:

**Skewness:**

- Skewness = $1.88$: The distribution of `SalePrice` is strongly right-skewed, with a long tail of higher-priced houses.

- A right-skewed target can hurt model performance, particularly for models trained with squared-error loss, because a few extreme values dominate the error.

- We may need to apply a transformation (e.g., a log transform) to make the distribution more symmetric.
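As a rough illustration of how a transformation can reduce skewness, the sketch below applies `np.log1p` to a synthetic log-normal sample. The data, the choice of `log1p`, and the variable names are assumptions made for this example. The notebook has not yet decided which transformation, if any, to apply to `SalePrice`.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (illustrative only, not the real data).
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1460)

# log1p computes log(1 + x); expm1 is its inverse.
log_prices = np.log1p(prices)

skew_raw = skew(prices)
skew_log = skew(log_prices)
print(f"Skewness before log transform: {skew_raw:.3f}")
print(f"Skewness after log transform:  {skew_log:.3f}")

# Predictions made on the log scale can be mapped back with expm1.
recovered = np.expm1(log_prices)
```

If a model is trained on the transformed target, its predictions must be passed through the inverse transform before they are interpreted as prices.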

**Kurtosis:**

- Kurtosis = $6.51$: This is excess kurtosis (SciPy's default Fisher definition, under which a normal distribution scores 0), so the distribution is leptokurtic, with a sharper peak and heavier tails than a normal distribution.

- This suggests more extreme values (outliers) in the dataset.

- Outliers can:
    - Skew the training process of sensitive models (e.g., linear regression).

    - Overemphasize extreme values in tree-based models (e.g., random forests, XGBoost).

To handle outliers, we can:

- Remove extreme outliers if they are errors or irrelevant.

- Apply robust models that are less sensitive to outliers (e.g., gradient boosting, robust regression).

In [48]:
fig = px.box(train, y='SalePrice', title='SalePrice Boxplot',
             labels={'SalePrice': 'Sale Price'},
             template='plotly_white')
fig.update_traces(marker_color='blue')
fig.show()

#### How to read the SalePrice Boxplot

- **Median**: The central line in the boxplot represents the median sale price, approximately $163k (the 50% value reported by `describe()`).

- **Interquartile Range (IQR)**: The box spans the middle 50% of the data, from the 25th percentile to the 75th percentile.

- **Whiskers**: Extend to the most extreme data points that lie within 1.5 times the IQR of the box edges (the 25th and 75th percentiles).

- **Outliers**: Points above the upper whisker are considered outliers, representing high-priced properties.
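The whisker rule can be worked through by hand using the quartiles printed by `describe()` earlier (25% = 129975, 75% = 214000). The sketch below computes the resulting fences. `iqr_fences` is a hypothetical helper written for this illustration, and Plotly may compute quartiles with a slightly different method, so the fences drawn in the figure can differ a little from these values.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of SalePrice taken from the describe() output above.
q1, q3 = 129_975.0, 214_000.0
lower, upper = iqr_fences(q1, q3)

print(lower, upper)  # 3937.5 340037.5
```

With these fences, sale prices above roughly $340k are flagged as outliers. The minimum price (34900) lies above the lower fence, so no low-priced houses are flagged.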

#### Interpretation:
The boxplot shows a significant number of outliers in the higher price range, confirming the right-skewness of `SalePrice`.