## Advanced Housing Prices: Feature Engineering + Feature Selection

We will perform the following feature engineering steps:

1. Missing values
2. Temporal variables
3. Categorical variables: remove rare labels
4. Standardise the values of the variables to the same range
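
As a rough illustration of step 4, the sketch below rescales each column to the range [0, 1] with plain pandas arithmetic. The toy frame, the choice of min-max scaling, and the variable names are assumptions made for this example; the notebook cells shown here do not say which scaler is used.

```python
import pandas as pd

# Toy frame standing in for numeric housing columns (illustrative values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
})

# Min-max scaling: (x - min) / (max - min), computed column by column.
col_min = toy.min()
col_range = toy.max() - toy.min()
scaled = (toy - col_min) / col_range

print(scaled.round(3))
```

After this step every column has a minimum of 0 and a maximum of 1, so features measured in very different units become comparable.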

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)

In [56]:
df = pd.read_csv('./datasets/train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [57]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('SalePrice', axis=1), df['SalePrice'], test_size=0.1, random_state=0)



In [58]:
X_train.shape, X_test.shape

((1314, 80), (146, 80))
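
The split above is created, but the cells that follow transform the full `df`. If the intent is to keep information from the test rows out of the preprocessing, one possible arrangement is to learn each statistic on the training rows and reuse it on the test rows. The sketch below shows this for a median; the toy frames and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (illustrative values, not the real data).
train = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# Learn the imputation value from the training rows only.
train_median = train["LotFrontage"].median()

# Apply that same value to both splits.
train_filled = train.fillna({"LotFrontage": train_median})
test_filled = test.fillna({"LotFrontage": train_median})

print(train_median)
print(test_filled)
```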

## Missing Values

In [59]:
# Identify categorical features with missing values
categorical_features_nan = [feature for feature in df.columns if df[feature].isnull().sum() > 0 and df[feature].dtypes == 'O']
missing_categorical_summary = pd.DataFrame({
    'Feature': categorical_features_nan,
    'Missing Values': df[categorical_features_nan].isnull().sum(),
    'Percentage Missing': np.round(df[categorical_features_nan].isnull().mean() * 100, 4)
}).sort_values(by='Percentage Missing', ascending=False)

print("Summary of Missing Values in Categorical Features:")
print(missing_categorical_summary)


Summary of Missing Values in Categorical Features:
                   Feature  Missing Values  Percentage Missing
PoolQC              PoolQC            1453             99.5205
MiscFeature    MiscFeature            1406             96.3014
Alley                Alley            1369             93.7671
Fence                Fence            1179             80.7534
MasVnrType      MasVnrType             872             59.7260
FireplaceQu    FireplaceQu             690             47.2603
GarageType      GarageType              81              5.5479
GarageFinish  GarageFinish              81              5.5479
GarageQual      GarageQual              81              5.5479
GarageCond      GarageCond              81              5.5479
BsmtExposure  BsmtExposure              38              2.6027
BsmtFinType2  BsmtFinType2              38              2.6027
BsmtQual          BsmtQual              37              2.5342
BsmtCond          BsmtCond              37              2.5342
Bsmt

In [60]:
def replace_cat_feature(df, features_nan, replacement_label='Missing'):
    """
    Replace missing values in categorical features with a specified label.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - features_nan: list
        List of categorical features with missing values.
    - replacement_label: str, optional (default='Missing')
        The label to use for replacing missing values.

    Returns:
    - DataFrame
        A copy of the DataFrame with missing values replaced.
    """
    data = df.copy()

    # Check if any specified features have missing values
    features_with_missing = [feature for feature in features_nan if feature in data.columns and data[feature].isnull().any()]

    if features_with_missing:
        data[features_with_missing] = data[features_with_missing].fillna(replacement_label)
        print(f"Missing values in {features_with_missing} replaced with '{replacement_label}'.")
    else:
        print("No specified categorical features with missing values found.")

    return data

# Specify the replacement label
replacement_label = 'NotAvailable'


In [61]:

# Replace missing values in categorical features
df = replace_cat_feature(df, categorical_features_nan, replacement_label)


Missing values in ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature'] replaced with 'NotAvailable'.


In [62]:

# Check if missing values are replaced
print(df[categorical_features_nan].isnull().sum())


Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64


In [63]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,NotAvailable,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,NotAvailable,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,NotAvailable,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,12,2008,WD,Normal,250000


In [64]:
# Identify numerical features with missing values
numerical_with_nan = [feature for feature in df.columns if df[feature].isnull().sum() > 0 and df[feature].dtypes != 'O']

# Create a summary dataframe for missing numerical features
missing_numerical_summary = pd.DataFrame({
    'Feature': numerical_with_nan,
    'Missing Values': df[numerical_with_nan].isnull().sum(),
    'Percentage Missing': np.around(df[numerical_with_nan].isnull().mean() * 100, 4)
}).sort_values(by='Percentage Missing', ascending=False)

print("Summary of Missing Values in Numerical Features:")
print(missing_numerical_summary)


Summary of Missing Values in Numerical Features:
                 Feature  Missing Values  Percentage Missing
LotFrontage  LotFrontage             259             17.7397
GarageYrBlt  GarageYrBlt              81              5.5479
MasVnrArea    MasVnrArea               8              0.5479


In [65]:
def replace_numerical_feature(df, features_with_nan, strategy='median'):
    """
    Replace missing values in numerical features with a specified strategy.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - features_with_nan: list
        List of numerical features with missing values.
    - strategy: str, optional (default='median')
        Imputation strategy. Options: 'mean', 'median', 'mode', or a constant value.

    Returns:
    - DataFrame
        A copy of the DataFrame with missing values replaced.
    """
    data = df.copy()

    # Check if any specified features have missing values
    features_with_missing = [feature for feature in features_with_nan if feature in data.columns and data[feature].isnull().any()]

    if features_with_missing:
        for feature in features_with_missing:
            if strategy == 'mean':
                imputation_value = data[feature].mean()
            elif strategy == 'median':
                imputation_value = data[feature].median()
            elif strategy == 'mode':
                imputation_value = data[feature].mode().iloc[0]
            else:
                imputation_value = strategy

            # Flag the rows that were missing before imputing
            data[feature + 'nan'] = np.where(data[feature].isnull(), 1, 0)
            # Assign the result back; fillna(inplace=True) on a column selection may not update `data`
            data[feature] = data[feature].fillna(imputation_value)

            print(f"Missing values in {feature} replaced with {strategy} ({imputation_value:.4f}).")
    else:
        print("No specified numerical features with missing values found.")

    return data


In [66]:

# Replace missing values in numerical features using median as the default strategy
df = replace_numerical_feature(df, numerical_with_nan, strategy='median')


Missing values in LotFrontage replaced with median (69.0000).
Missing values in MasVnrArea replaced with median (0.0000).
Missing values in GarageYrBlt replaced with median (1980.0000).
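
To make the effect of the indicator column concrete, here is a small worked example on a toy frame (the values are made up for illustration). It mirrors the two steps inside the function: flag the missing rows first, then fill them with the median of the observed values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0, np.nan, 84.0]})

# Record where the value was missing before it is overwritten.
toy["LotFrontagenan"] = toy["LotFrontage"].isnull().astype(int)

# Fill with the median of the observed values (65, 68, 84 -> 68).
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

The indicator keeps the information that a value was imputed, which a model can use if missingness itself is related to the price.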


In [67]:

# Check if missing values are replaced
print(df[numerical_with_nan].isnull().sum())


LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64


In [68]:
df.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,RL,65.0,8450,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,NotAvailable,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2008,WD,Normal,208500,0,0,0
1,2,20,RL,80.0,9600,Pave,NotAvailable,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,5,2007,WD,Normal,181500,0,0,0
2,3,60,RL,68.0,11250,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,9,2008,WD,Normal,223500,0,0,0
3,4,70,RL,60.0,9550,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2006,WD,Abnorml,140000,0,0,0
4,5,60,RL,84.0,14260,Pave,NotAvailable,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,12,2008,WD,Normal,250000,0,0,0
5,6,50,RL,85.0,14115,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,NotAvailable,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,NotAvailable,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,NotAvailable,MnPrv,Shed,700,10,2009,WD,Normal,143000,0,0,0
6,7,20,RL,75.0,10084,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,8,2007,WD,Normal,307000,0,0,0
7,8,60,RL,69.0,10382,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,NotAvailable,NotAvailable,Shed,350,11,2009,WD,Normal,200000,1,0,0
8,9,50,RM,51.0,6120,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,4,2008,WD,Abnorml,129900,0,0,0
9,10,190,RL,50.0,7420,Pave,NotAvailable,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,1,2008,WD,Normal,118000,0,0,0


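## Temporal Variables

Each year column is replaced by the number of years between that year and the year of sale (`YrSold`).

In [69]: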
def transform_temporal_feature(df, temporal_feature, reference_feature='YrSold'):
    """
    Transform a temporal feature by subtracting it from a reference feature.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - temporal_feature: str
        The name of the temporal feature to be transformed.
    - reference_feature: str, optional (default='YrSold')
        The reference feature to subtract the temporal feature from.

    Returns:
    - DataFrame
        A copy of the DataFrame with the transformed temporal feature.
    """
    data = df.copy()

    # Check if the specified temporal feature is present
    if temporal_feature in data.columns:
        data[temporal_feature] = data[reference_feature] - data[temporal_feature]
        print(f"Transformed {temporal_feature} by subtracting from {reference_feature}.")
    else:
        print(f"{temporal_feature} not found in the dataset.")

    return data


In [70]:

# Transform temporal features: 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt'
for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    df = transform_temporal_feature(df, feature)


Transformed YearBuilt by subtracting from YrSold.
Transformed YearRemodAdd by subtracting from YrSold.
Transformed GarageYrBlt by subtracting from YrSold.


In [71]:

# Display the transformed DataFrame
print(df[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].head())


   YearBuilt  YearRemodAdd  GarageYrBlt
0          5             5          5.0
1         31            31         31.0
2          7             6          7.0
3         91            36          8.0
4          8             8          8.0


In [72]:
df[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()

Unnamed: 0,YearBuilt,YearRemodAdd,GarageYrBlt
0,5,5,5.0
1,31,31,31.0
2,7,6,7.0
3,91,36,8.0
4,8,8,8.0
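
After converting years to elapsed years, it can be worth checking for negative values, which would mean an event was recorded after the sale. The sketch below runs that check on a toy frame; the third row is a made-up case included only to trigger it.

```python
import pandas as pd

toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2006],
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 2007],  # last row: remodel after the sale (made-up case)
})

year_cols = ["YearBuilt", "YearRemodAdd"]
for col in year_cols:
    # Same transformation as above: years elapsed before the sale.
    toy[col] = toy["YrSold"] - toy[col]

# Rows where any elapsed-years value is negative deserve a manual look.
negative_rows = toy[(toy[year_cols] < 0).any(axis=1)]
print(negative_rows)
```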


## Numerical Variables
Since several numerical variables are right-skewed, we apply a log transformation to bring their distributions closer to normal.
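
One way to see why a log helps is to compare the skewness of a right-skewed series before and after taking the logarithm. The prices below are illustrative values chosen to include one very large sale; `Series.skew()` is used here as a simple summary.

```python
import numpy as np
import pandas as pd

# Right-skewed toy prices (illustrative values only).
prices = pd.Series([118000, 129900, 140000, 143000, 181500,
                    200000, 208500, 223500, 250000, 755000])

skew_before = prices.skew()
skew_after = np.log(prices).skew()

print("skew before:", round(skew_before, 3))
print("skew after: ", round(skew_after, 3))
```

The logarithm compresses the long right tail, so the skewness drops, although a single extreme value can still leave some skew behind.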

In [73]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,RL,65.0,8450,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,5,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,NotAvailable,Attchd,5.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2008,WD,Normal,208500,0,0,0
1,2,20,RL,80.0,9600,Pave,NotAvailable,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,31,31,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,31.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,5,2007,WD,Normal,181500,0,0,0
2,3,60,RL,68.0,11250,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,7,6,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,7.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,9,2008,WD,Normal,223500,0,0,0
3,4,70,RL,60.0,9550,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,91,36,Gable,CompShg,Wd Sdng,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,8.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2006,WD,Abnorml,140000,0,0,0
4,5,60,RL,84.0,14260,Pave,NotAvailable,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,8,8,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,8.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,12,2008,WD,Normal,250000,0,0,0


In [74]:
def apply_log_transform(df, numerical_features, exception_features=None):
    """
    Apply logarithmic transformation to specified numerical features.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - numerical_features: list
        List of numerical features to be transformed.
    - exception_features: list, optional (default=None)
        List of features to exclude from transformation.

    Returns:
    - DataFrame
        A copy of the DataFrame with the specified numerical features transformed.
    """
    data = df.copy()
    # Avoid a mutable default argument: treat None as "no exclusions"
    exception_features = exception_features or []

    for feature in numerical_features:
        # Check if the feature is in the dataset and not in the exception list
        if feature in data.columns and feature not in exception_features:
            # Check if there are zero or negative values before applying log transformation
            if (data[feature] > 0).all():
                data[feature] = np.log(data[feature])
                print(f"Logarithmic transformation applied to {feature}.")
            else:
                print(f"Cannot apply log transformation to {feature} due to zero or negative values.")
        else:
            print(f"{feature} not found in the dataset or excluded from transformation.")

    return data


In [75]:

# Apply logarithmic transformation to numerical features: 'LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice'
df = apply_log_transform(df, numerical_features=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice'])


Logarithmic transformation applied to LotFrontage.
Logarithmic transformation applied to LotArea.
Logarithmic transformation applied to 1stFlrSF.
Logarithmic transformation applied to GrLivArea.
Logarithmic transformation applied to SalePrice.


In [76]:

# Display the transformed DataFrame
print(df[['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']].head())


   LotFrontage   LotArea  1stFlrSF  GrLivArea  SalePrice
0     4.174387  9.041922  6.752270   7.444249  12.247694
1     4.382027  9.169518  7.140453   7.140453  12.109011
2     4.219508  9.328123  6.824374   7.487734  12.317167
3     4.094345  9.164296  6.867974   7.448334  11.849398
4     4.430817  9.565214  7.043160   7.695303  12.429216
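
The function above skips any feature that contains zeros, because `np.log(0)` is undefined. For such columns, one common alternative (not used in this notebook) is `np.log1p`, which computes log(1 + x). The sketch below applies it to a toy `MasVnrArea` column and reverses it with `np.expm1`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MasVnrArea": [196.0, 0.0, 162.0, 0.0, 350.0]})

# log1p computes log(1 + x), so zeros map to 0 instead of -inf.
toy["MasVnrArea_log1p"] = np.log1p(toy["MasVnrArea"])

# expm1 is the inverse of log1p and recovers the original scale.
recovered = np.expm1(toy["MasVnrArea_log1p"])

print(toy)
```

The same idea applies to the target: a model trained on log `SalePrice` predicts on the log scale, so its predictions need `np.exp` applied before they can be read as prices.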


In [77]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,RL,4.174387,9.041922,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,5,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,6.75227,854,0,7.444249,1,0,2,1,3,1,Gd,8,Typ,0,NotAvailable,Attchd,5.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2008,WD,Normal,12.247694,0,0,0
1,2,20,RL,4.382027,9.169518,Pave,NotAvailable,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,31,31,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,7.140453,0,0,7.140453,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,31.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,5,2007,WD,Normal,12.109011,0,0,0
2,3,60,RL,4.219508,9.328123,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,7,6,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,6.824374,866,0,7.487734,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,7.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,9,2008,WD,Normal,12.317167,0,0,0
3,4,70,RL,4.094345,9.164296,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,91,36,Gable,CompShg,Wd Sdng,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,6.867974,756,0,7.448334,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,8.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2006,WD,Abnorml,11.849398,0,0,0
4,5,60,RL,4.430817,9.565214,Pave,NotAvailable,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,8,8,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,7.04316,1053,0,7.695303,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,8.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,12,2008,WD,Normal,12.429216,0,0,0


## Handling Rare Categorical Labels

For each categorical variable, we replace the labels that appear in less than 1% of the observations with a single `Rare_var` label.
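
A small toy example of the 1% rule, using `value_counts(normalize=True)` to obtain label frequencies. The frame and its counts are made up for illustration: one label covers 0.5% of the rows and is therefore grouped under a shared label.

```python
import pandas as pd

# 200 toy rows: 'Veenker' appears once (0.5%), below the 1% threshold.
toy = pd.DataFrame(
    {"Neighborhood": ["CollgCr"] * 120 + ["NoRidge"] * 79 + ["Veenker"] * 1}
)

# Fraction of rows carried by each label.
freq = toy["Neighborhood"].value_counts(normalize=True)
frequent = freq[freq > 0.01].index

# Keep frequent labels, replace everything else with a shared label.
toy["Neighborhood"] = toy["Neighborhood"].where(
    toy["Neighborhood"].isin(frequent), "Rare_var"
)

print(toy["Neighborhood"].value_counts())
```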

In [78]:
categorical_features=[feature for feature in df.columns if df[feature].dtype=='O']

In [79]:
categorical_features

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SaleCondition']

In [80]:
def handle_rare_categories(df, categorical_features, target_feature='SalePrice', threshold=0.01, replacement_label='Rare_var'):
    """
    Handle rare categories in specified categorical features.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - categorical_features: list
        List of categorical features to handle rare categories.
    - target_feature: str, optional (default='SalePrice')
        Column used only to count the rows in each category; it should have no missing values.
    - threshold: float, optional (default=0.01)
        Fraction of all rows a category must exceed to be kept (0.01 means 1%).
    - replacement_label: str, optional (default='Rare_var')
        The label to use for replacing rare categories.

    Returns:
    - DataFrame
        A copy of the DataFrame with rare categories handled in specified categorical features.
    """
    data = df.copy()

    for feature in categorical_features:
        temp = data.groupby(feature)[target_feature].count() / len(data)
        temp_df = temp[temp > threshold].index
        data[feature] = np.where(data[feature].isin(temp_df), data[feature], replacement_label)

        print(f"Handled rare categories in {feature} using threshold {threshold*100}%.")

    return data


In [81]:

# Handle rare categories in categorical features
df = handle_rare_categories(df, categorical_features)


Handled rare categories in MSZoning using threshold 1.0%.
Handled rare categories in Street using threshold 1.0%.
Handled rare categories in Alley using threshold 1.0%.
Handled rare categories in LotShape using threshold 1.0%.
Handled rare categories in LandContour using threshold 1.0%.
Handled rare categories in Utilities using threshold 1.0%.
Handled rare categories in LotConfig using threshold 1.0%.
Handled rare categories in LandSlope using threshold 1.0%.
Handled rare categories in Neighborhood using threshold 1.0%.
Handled rare categories in Condition1 using threshold 1.0%.
Handled rare categories in Condition2 using threshold 1.0%.
Handled rare categories in BldgType using threshold 1.0%.
Handled rare categories in HouseStyle using threshold 1.0%.
Handled rare categories in RoofStyle using threshold 1.0%.
Handled rare categories in RoofMatl using threshold 1.0%.
Handled rare categories in Exterior1st using threshold 1.0%.
Handled rare categories in Exterior2nd using threshold 1.

In [82]:

# Display the DataFrame with rare categories handled
print(df[categorical_features].head())


  MSZoning Street         Alley LotShape LandContour Utilities LotConfig  \
0       RL   Pave  NotAvailable      Reg         Lvl    AllPub    Inside   
1       RL   Pave  NotAvailable      Reg         Lvl    AllPub       FR2   
2       RL   Pave  NotAvailable      IR1         Lvl    AllPub    Inside   
3       RL   Pave  NotAvailable      IR1         Lvl    AllPub    Corner   
4       RL   Pave  NotAvailable      IR1         Lvl    AllPub       FR2   

  LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle  \
0       Gtl      CollgCr       Norm       Norm     1Fam     2Story     Gable   
1       Gtl     Rare_var      Feedr       Norm     1Fam     1Story     Gable   
2       Gtl      CollgCr       Norm       Norm     1Fam     2Story     Gable   
3       Gtl      Crawfor       Norm       Norm     1Fam     2Story     Gable   
4       Gtl      NoRidge       Norm       Norm     1Fam     2Story     Gable   

  RoofMatl Exterior1st Exterior2nd    MasVnrType ExterQual Ext

In [83]:
df.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,RL,4.174387,9.041922,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,5,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,6.75227,854,0,7.444249,1,0,2,1,3,1,Gd,8,Typ,0,NotAvailable,Attchd,5.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2008,WD,Normal,12.247694,0,0,0
1,2,20,RL,4.382027,9.169518,Pave,NotAvailable,Reg,Lvl,AllPub,FR2,Gtl,Rare_var,Feedr,Norm,1Fam,1Story,6,8,31,31,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,7.140453,0,0,7.140453,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,31.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,5,2007,WD,Normal,12.109011,0,0,0
2,3,60,RL,4.219508,9.328123,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,7,6,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,6.824374,866,0,7.487734,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,7.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,9,2008,WD,Normal,12.317167,0,0,0
3,4,70,RL,4.094345,9.164296,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,91,36,Gable,CompShg,Wd Sdng,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,6.867974,756,0,7.448334,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,8.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,2,2006,WD,Abnorml,11.849398,0,0,0
4,5,60,RL,4.430817,9.565214,Pave,NotAvailable,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,8,8,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,7.04316,1053,0,7.695303,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,8.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,12,2008,WD,Normal,12.429216,0,0,0
5,6,50,RL,4.442651,9.554993,Pave,NotAvailable,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,16,14,Gable,CompShg,VinylSd,VinylSd,NotAvailable,0.0,TA,TA,Rare_var,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,6.679599,566,0,7.216709,1,0,1,1,1,1,TA,5,Typ,0,NotAvailable,Attchd,16.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,NotAvailable,MnPrv,Shed,700,10,2009,WD,Normal,11.8706,0,0,0
6,7,20,RL,4.317488,9.218705,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,3,2,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,7.434848,0,0,7.434848,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,3.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,8,2007,WD,Normal,12.634603,0,0,0
7,8,60,RL,4.234107,9.247829,Pave,NotAvailable,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,36,36,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,7.009409,983,0,7.644919,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,36.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,NotAvailable,NotAvailable,Shed,350,11,2009,WD,Normal,12.206073,1,0,0
8,9,50,RM,3.931826,8.719317,Pave,NotAvailable,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,77,58,Gable,CompShg,BrkFace,Wd Shng,NotAvailable,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,6.929517,752,0,7.480992,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,77.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,4,2008,WD,Abnorml,11.77452,0,0,0
9,10,190,RL,3.912023,8.911934,Pave,NotAvailable,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Rare_var,2fmCon,Rare_var,5,6,69,58,Gable,CompShg,MetalSd,MetalSd,NotAvailable,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,6.981935,0,0,6.981935,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,69.0,RFn,1,205,Rare_var,TA,Y,0,4,0,0,0,0,NotAvailable,NotAvailable,NotAvailable,0,1,2008,WD,Normal,11.67844,0,0,0


In [84]:
def encode_categorical_by_mean(df, categorical_features, target_feature='SalePrice', ascending_order=True):
    """
    Encode categorical features based on the mean of the target variable.

    Parameters:
    - df: DataFrame
        The input DataFrame.
    - categorical_features: list
        List of categorical features to be encoded.
    - target_feature: str, optional (default='SalePrice')
        The target feature used for calculating mean values.
    - ascending_order: bool, optional (default=True)
        Whether to encode in ascending or descending order of mean values.

    Returns:
    - DataFrame
        A copy of the DataFrame with categorical features encoded based on mean values.
    """
    data = df.copy()

    for feature in categorical_features:
        labels_ordered = data.groupby([feature])[target_feature].mean().sort_values(ascending=ascending_order).index
        labels_ordered = {k: i for i, k in enumerate(labels_ordered, 0)}
        data[feature] = data[feature].map(labels_ordered)

        order_type = "ascending" if ascending_order else "descending"
        print(f"Encoded {feature} based on the mean of {target_feature} in {order_type} order.")

    return data
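
To make the encoding rule concrete, here is a minimal sketch on a toy frame. The column names `zone` and `price` and all values are made up for illustration and do not come from the housing data. Each label is replaced by its rank when labels are sorted by the mean of the target, which is what the loop in `encode_categorical_by_mean` does per feature.

```python
import pandas as pd

# Toy data: three hypothetical labels with made-up prices.
toy = pd.DataFrame({
    "zone": ["A", "B", "A", "C", "B", "C"],
    "price": [100, 300, 120, 200, 320, 220],
})

# Mean price per label, sorted from lowest to highest.
means = toy.groupby("zone")["price"].mean().sort_values()

# Rank each label by its position in that ordering: lowest mean -> 0.
mapping = {label: rank for rank, label in enumerate(means.index)}

toy["zone_encoded"] = toy["zone"].map(mapping)
print(mapping)  # {'A': 0, 'C': 1, 'B': 2}
```

Two things are worth keeping in mind with this approach. A label that is absent from the mapping becomes `NaN` after `map`, and because the ranks are derived from the target, the mapping would normally be learned on training rows only and then reused on any held-out data.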


In [85]:

# Encode categorical features based on the mean of 'SalePrice'
df = encode_categorical_by_mean(df, categorical_features)


Encoded MSZoning based on the mean of SalePrice in ascending order.
Encoded Street based on the mean of SalePrice in ascending order.
Encoded Alley based on the mean of SalePrice in ascending order.
Encoded LotShape based on the mean of SalePrice in ascending order.
Encoded LandContour based on the mean of SalePrice in ascending order.
Encoded Utilities based on the mean of SalePrice in ascending order.
Encoded LotConfig based on the mean of SalePrice in ascending order.
Encoded LandSlope based on the mean of SalePrice in ascending order.
Encoded Neighborhood based on the mean of SalePrice in ascending order.
Encoded Condition1 based on the mean of SalePrice in ascending order.
Encoded Condition2 based on the mean of SalePrice in ascending order.
Encoded BldgType based on the mean of SalePrice in ascending order.
Encoded HouseStyle based on the mean of SalePrice in ascending order.
Encoded RoofStyle based on the mean of SalePrice in ascending order.
Encoded RoofMatl based on the mean o

In [86]:

# Display the DataFrame with categorical features encoded
print(df[categorical_features].head())


   MSZoning  Street  Alley  LotShape  LandContour  Utilities  LotConfig  \
0         3       1      2         0            1          1          0   
1         3       1      2         0            1          1          2   
2         3       1      2         1            1          1          0   
3         3       1      2         1            1          1          1   
4         3       1      2         1            1          1          2   

   LandSlope  Neighborhood  Condition1  Condition2  BldgType  HouseStyle  \
0          0            14           2           1         3           5   
1          0            11           1           1         3           3   
2          0            14           2           1         3           5   
3          0            16           2           1         3           5   
4          0            22           2           1         3           5   

   RoofStyle  RoofMatl  Exterior1st  Exterior2nd  MasVnrType  ExterQual  \
0          0     

In [87]:
df.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,3,4.174387,9.041922,1,2,0,1,1,0,0,14,2,1,3,5,7,5,5,5,0,0,10,10,2,196.0,2,3,4,3,3,1,6,706,5,0,150,856,2,4,1,3,6.75227,854,0,7.444249,1,0,2,1,3,1,2,8,4,0,1,4,5.0,2,2,548,2,3,2,0,61,0,0,0,0,0,4,2,0,2,2008,2,3,12.247694,0,0,0
1,2,20,3,4.382027,9.169518,1,2,0,1,1,2,0,11,1,1,3,3,6,8,31,31,0,0,4,3,1,0.0,1,3,2,3,3,4,4,978,5,0,284,1262,2,4,1,3,7.140453,0,0,7.140453,0,1,2,0,3,1,1,6,4,1,3,4,31.0,2,2,460,2,3,2,298,0,0,0,0,0,0,4,2,0,5,2007,2,3,12.109011,0,0,0
2,3,60,3,4.219508,9.328123,1,2,1,1,1,0,0,14,2,1,3,5,7,5,7,6,0,0,10,10,2,162.0,2,3,4,3,3,2,6,486,5,0,434,920,2,4,1,3,6.824374,866,0,7.487734,1,0,2,1,3,1,2,6,4,1,3,4,7.0,2,2,608,2,3,2,0,42,0,0,0,0,0,4,2,0,9,2008,2,3,12.317167,0,0,0
3,4,70,3,4.094345,9.164296,1,2,1,1,1,1,0,16,2,1,3,5,7,5,91,36,0,0,2,4,1,0.0,1,3,1,2,4,1,4,216,5,0,540,756,2,3,1,3,6.867974,756,0,7.448334,1,0,1,0,3,1,2,7,4,1,4,2,8.0,1,3,642,2,3,2,0,35,272,0,0,0,0,4,2,0,2,2006,2,0,11.849398,0,0,0
4,5,60,3,4.430817,9.565214,1,2,1,1,1,2,0,22,2,1,3,5,8,5,8,8,0,0,10,10,2,350.0,2,3,4,3,3,3,6,655,5,0,490,1145,2,4,1,3,7.04316,1053,0,7.695303,1,0,2,1,4,1,2,9,4,1,3,4,8.0,2,3,836,2,3,2,192,84,0,0,0,0,0,4,2,0,12,2008,2,3,12.429216,0,0,0
5,6,50,3,4.442651,9.554993,1,2,1,1,1,0,0,9,2,1,3,1,5,5,16,14,0,0,10,10,1,0.0,1,3,3,3,3,1,6,732,5,0,64,796,2,4,1,3,6.679599,566,0,7.216709,1,0,1,1,1,1,1,5,4,0,1,4,16.0,1,2,480,2,3,2,40,30,0,320,0,0,0,2,1,700,10,2009,2,3,11.8706,0,0,0
6,7,20,3,4.317488,9.218705,1,2,0,1,1,0,0,18,2,1,3,3,8,5,3,2,0,0,10,10,3,186.0,2,3,4,4,3,3,6,1369,5,0,317,1686,2,4,1,3,7.434848,0,0,7.434848,1,0,2,0,3,1,2,7,4,1,4,4,3.0,2,2,636,2,3,2,255,57,0,0,0,0,0,4,2,0,8,2007,2,3,12.634603,0,0,0
7,8,60,3,4.234107,9.247829,1,2,1,1,1,1,0,12,5,1,3,5,7,6,36,36,0,0,6,5,3,240.0,1,3,2,3,3,2,4,859,1,32,216,1107,2,4,1,3,7.009409,983,0,7.644919,1,0,2,1,3,1,1,7,4,2,3,4,36.0,2,2,484,2,3,2,235,204,228,0,0,0,0,4,1,350,11,2009,2,3,12.206073,1,0,0
8,9,50,1,3.931826,8.719317,1,2,0,1,1,0,0,4,0,1,3,1,7,5,77,58,0,0,8,4,1,0.0,1,3,1,2,3,1,5,0,5,0,952,952,2,3,1,1,6.929517,752,0,7.480992,0,0,2,0,2,2,1,8,3,2,3,2,77.0,1,2,468,1,3,2,90,0,205,0,0,0,0,4,2,0,4,2008,2,0,11.77452,0,0,0
9,10,190,3,3.912023,8.911934,1,2,0,1,1,1,0,3,0,0,0,2,5,6,69,58,0,0,4,3,1,0.0,1,3,1,2,3,1,6,851,5,0,140,991,2,4,1,3,6.981935,0,0,6.981935,1,0,1,0,2,2,1,5,4,2,3,4,69.0,2,1,205,3,3,2,0,4,0,0,0,0,0,4,2,0,1,2008,2,3,11.67844,0,0,0


In [88]:
scaling_feature=[feature for feature in df.columns if feature not in ['Id','SalePrice'] ]
len(scaling_feature)

82

In [89]:
scaling_feature

['MSSubClass',
 'MSZoning',
 'LotFrontage',
 'LotArea',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'KitchenQual',
 'TotRmsAbvGrd',
 'Functional',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'WoodDeckSF',
 'OpenPorchSF',
 'Enc

In [90]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,60,3,4.174387,9.041922,1,2,0,1,1,0,0,14,2,1,3,5,7,5,5,5,0,0,10,10,2,196.0,2,3,4,3,3,1,6,706,5,0,150,856,2,4,1,3,6.75227,854,0,7.444249,1,0,2,1,3,1,2,8,4,0,1,4,5.0,2,2,548,2,3,2,0,61,0,0,0,0,0,4,2,0,2,2008,2,3,12.247694,0,0,0
1,2,20,3,4.382027,9.169518,1,2,0,1,1,2,0,11,1,1,3,3,6,8,31,31,0,0,4,3,1,0.0,1,3,2,3,3,4,4,978,5,0,284,1262,2,4,1,3,7.140453,0,0,7.140453,0,1,2,0,3,1,1,6,4,1,3,4,31.0,2,2,460,2,3,2,298,0,0,0,0,0,0,4,2,0,5,2007,2,3,12.109011,0,0,0
2,3,60,3,4.219508,9.328123,1,2,1,1,1,0,0,14,2,1,3,5,7,5,7,6,0,0,10,10,2,162.0,2,3,4,3,3,2,6,486,5,0,434,920,2,4,1,3,6.824374,866,0,7.487734,1,0,2,1,3,1,2,6,4,1,3,4,7.0,2,2,608,2,3,2,0,42,0,0,0,0,0,4,2,0,9,2008,2,3,12.317167,0,0,0
3,4,70,3,4.094345,9.164296,1,2,1,1,1,1,0,16,2,1,3,5,7,5,91,36,0,0,2,4,1,0.0,1,3,1,2,4,1,4,216,5,0,540,756,2,3,1,3,6.867974,756,0,7.448334,1,0,1,0,3,1,2,7,4,1,4,2,8.0,1,3,642,2,3,2,0,35,272,0,0,0,0,4,2,0,2,2006,2,0,11.849398,0,0,0
4,5,60,3,4.430817,9.565214,1,2,1,1,1,2,0,22,2,1,3,5,8,5,8,8,0,0,10,10,2,350.0,2,3,4,3,3,3,6,655,5,0,490,1145,2,4,1,3,7.04316,1053,0,7.695303,1,0,2,1,4,1,2,9,4,1,3,4,8.0,2,3,836,2,3,2,192,84,0,0,0,0,0,4,2,0,12,2008,2,3,12.429216,0,0,0


## Feature Scaling

In [91]:
feature_scale=[feature for feature in df.columns if feature not in ['Id','SalePrice']]

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(df[feature_scale])

In [92]:
scaler.transform(df[feature_scale])

array([[0.23529412, 0.75      , 0.41820812, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.75      , 0.49506375, ..., 0.        , 0.        ,
        0.        ],
       [0.23529412, 0.75      , 0.434909  , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.29411765, 0.75      , 0.42385922, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.75      , 0.434909  , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.75      , 0.47117546, ..., 0.        , 0.        ,
        0.        ]])
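
The values above come from min-max scaling, which maps each column to the range [0, 1] via (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on a toy column. The three values are illustrative, chosen under the assumption that `MSSubClass` spans 20 to 190 in this data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column with illustrative values; 20 and 190 act as min and max.
x = np.array([[20.0], [60.0], [190.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(x)

# Same result computed by hand: (x - min) / (max - min), column-wise.
col_min = x.min(axis=0)
col_max = x.max(axis=0)
manual = (x - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Under that assumption, a value of 60 maps to 40 / 170, roughly 0.2353, which is consistent with the first entry of the array printed above. When a separate test set exists, the usual practice is to call `fit` on the training data only and reuse the fitted scaler with `transform` on the test data.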

In [93]:
# scale the features and add back the Id and SalePrice columns
data = pd.concat([df[['Id', 'SalePrice']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(df[feature_scale]), columns=feature_scale)],
                    axis=1)

In [94]:
data.head()

Unnamed: 0,Id,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,12.247694,0.235294,0.75,0.418208,0.366344,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.636364,0.4,1.0,0.75,1.0,0.666667,0.5,0.036765,0.098361,0.0,0.0,1.0,1.0,0.666667,0.1225,0.666667,1.0,1.0,0.75,0.75,0.25,1.0,0.125089,0.833333,0.0,0.064212,0.140098,1.0,1.0,1.0,1.0,0.356155,0.413559,0.0,0.577712,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.5,1.0,0.0,0.2,0.8,0.046729,0.666667,0.5,0.38646,0.666667,1.0,1.0,0.0,0.111517,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.090909,0.5,0.666667,0.75,0.0,0.0,0.0
1,2,12.109011,0.0,0.75,0.495064,0.391317,1.0,1.0,0.0,0.333333,1.0,0.5,0.0,0.5,0.2,1.0,0.75,0.6,0.555556,0.875,0.227941,0.52459,0.0,0.0,0.4,0.3,0.333333,0.0,0.333333,1.0,0.5,0.75,0.75,1.0,0.666667,0.173281,0.833333,0.0,0.121575,0.206547,1.0,1.0,1.0,1.0,0.503056,0.0,0.0,0.470245,0.0,0.5,0.666667,0.0,0.375,0.333333,0.333333,0.333333,1.0,0.333333,0.6,0.8,0.28972,0.666667,0.5,0.324401,0.666667,1.0,1.0,0.347725,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.363636,0.25,0.666667,0.75,0.0,0.0,0.0
2,3,12.317167,0.235294,0.75,0.434909,0.422359,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.636364,0.4,1.0,0.75,1.0,0.666667,0.5,0.051471,0.114754,0.0,0.0,1.0,1.0,0.666667,0.10125,0.666667,1.0,1.0,0.75,0.75,0.5,1.0,0.086109,0.833333,0.0,0.185788,0.150573,1.0,1.0,1.0,1.0,0.383441,0.41937,0.0,0.593095,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.333333,1.0,0.333333,0.6,0.8,0.065421,0.666667,0.5,0.428773,0.666667,1.0,1.0,0.0,0.076782,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.727273,0.5,0.666667,0.75,0.0,0.0,0.0
3,4,11.849398,0.294118,0.75,0.388581,0.390295,1.0,1.0,0.333333,0.333333,1.0,0.25,0.0,0.727273,0.4,1.0,0.75,1.0,0.666667,0.5,0.669118,0.606557,0.0,0.0,0.2,0.4,0.333333,0.0,0.333333,1.0,0.25,0.5,1.0,0.25,0.666667,0.038271,0.833333,0.0,0.231164,0.123732,1.0,0.75,1.0,1.0,0.399941,0.366102,0.0,0.579157,0.333333,0.0,0.333333,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.333333,0.8,0.4,0.074766,0.333333,0.75,0.45275,0.666667,1.0,1.0,0.0,0.063985,0.492754,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.090909,0.0,0.666667,0.0,0.0,0.0,0.0
4,5,12.429216,0.235294,0.75,0.513123,0.468761,1.0,1.0,0.333333,0.333333,1.0,0.5,0.0,1.0,0.4,1.0,0.75,1.0,0.777778,0.5,0.058824,0.147541,0.0,0.0,1.0,1.0,0.666667,0.21875,0.666667,1.0,1.0,0.75,0.75,0.75,1.0,0.116052,0.833333,0.0,0.20976,0.187398,1.0,1.0,1.0,1.0,0.466237,0.509927,0.0,0.666523,0.333333,0.0,0.666667,0.5,0.5,0.333333,0.666667,0.583333,1.0,0.333333,0.6,0.8,0.074766,0.666667,0.75,0.589563,0.666667,1.0,1.0,0.224037,0.153565,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.5,0.666667,0.75,0.0,0.0,0.0


In [95]:
data.to_csv('./datasets/Featured_Data.csv',index=False)

## Feature Selection

In [96]:
from sklearn.linear_model import Lasso, LassoCV
from sklearn.feature_selection import SelectFromModel

df2=pd.read_csv('./datasets/Featured_Data.csv')

In [97]:
df2.head()

Unnamed: 0,Id,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontagenan,MasVnrAreanan,GarageYrBltnan
0,1,12.247694,0.235294,0.75,0.418208,0.366344,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.636364,0.4,1.0,0.75,1.0,0.666667,0.5,0.036765,0.098361,0.0,0.0,1.0,1.0,0.666667,0.1225,0.666667,1.0,1.0,0.75,0.75,0.25,1.0,0.125089,0.833333,0.0,0.064212,0.140098,1.0,1.0,1.0,1.0,0.356155,0.413559,0.0,0.577712,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.5,1.0,0.0,0.2,0.8,0.046729,0.666667,0.5,0.38646,0.666667,1.0,1.0,0.0,0.111517,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.090909,0.5,0.666667,0.75,0.0,0.0,0.0
1,2,12.109011,0.0,0.75,0.495064,0.391317,1.0,1.0,0.0,0.333333,1.0,0.5,0.0,0.5,0.2,1.0,0.75,0.6,0.555556,0.875,0.227941,0.52459,0.0,0.0,0.4,0.3,0.333333,0.0,0.333333,1.0,0.5,0.75,0.75,1.0,0.666667,0.173281,0.833333,0.0,0.121575,0.206547,1.0,1.0,1.0,1.0,0.503056,0.0,0.0,0.470245,0.0,0.5,0.666667,0.0,0.375,0.333333,0.333333,0.333333,1.0,0.333333,0.6,0.8,0.28972,0.666667,0.5,0.324401,0.666667,1.0,1.0,0.347725,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.363636,0.25,0.666667,0.75,0.0,0.0,0.0
2,3,12.317167,0.235294,0.75,0.434909,0.422359,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.636364,0.4,1.0,0.75,1.0,0.666667,0.5,0.051471,0.114754,0.0,0.0,1.0,1.0,0.666667,0.10125,0.666667,1.0,1.0,0.75,0.75,0.5,1.0,0.086109,0.833333,0.0,0.185788,0.150573,1.0,1.0,1.0,1.0,0.383441,0.41937,0.0,0.593095,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.333333,1.0,0.333333,0.6,0.8,0.065421,0.666667,0.5,0.428773,0.666667,1.0,1.0,0.0,0.076782,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.727273,0.5,0.666667,0.75,0.0,0.0,0.0
3,4,11.849398,0.294118,0.75,0.388581,0.390295,1.0,1.0,0.333333,0.333333,1.0,0.25,0.0,0.727273,0.4,1.0,0.75,1.0,0.666667,0.5,0.669118,0.606557,0.0,0.0,0.2,0.4,0.333333,0.0,0.333333,1.0,0.25,0.5,1.0,0.25,0.666667,0.038271,0.833333,0.0,0.231164,0.123732,1.0,0.75,1.0,1.0,0.399941,0.366102,0.0,0.579157,0.333333,0.0,0.333333,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.333333,0.8,0.4,0.074766,0.333333,0.75,0.45275,0.666667,1.0,1.0,0.0,0.063985,0.492754,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.090909,0.0,0.666667,0.0,0.0,0.0,0.0
4,5,12.429216,0.235294,0.75,0.513123,0.468761,1.0,1.0,0.333333,0.333333,1.0,0.5,0.0,1.0,0.4,1.0,0.75,1.0,0.777778,0.5,0.058824,0.147541,0.0,0.0,1.0,1.0,0.666667,0.21875,0.666667,1.0,1.0,0.75,0.75,0.75,1.0,0.116052,0.833333,0.0,0.20976,0.187398,1.0,1.0,1.0,1.0,0.466237,0.509927,0.0,0.666523,0.333333,0.0,0.666667,0.5,0.5,0.333333,0.666667,0.583333,1.0,0.333333,0.6,0.8,0.074766,0.666667,0.75,0.589563,0.666667,1.0,1.0,0.224037,0.153565,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.5,0.666667,0.75,0.0,0.0,0.0


In [98]:
## Capture the dependent feature as a 1-D Series, the shape scikit-learn expects for y
y_train=df2['SalePrice']


In [99]:
## drop the Id and the dependent feature from the dataset
X_train=df2.drop(['Id','SalePrice'],axis=1)

In [100]:
alphas = [0.001, 0.01, 0.1, 0.5, 1.0]
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=0)
lasso_cv.fit(X_train, y_train)

best_alpha = lasso_cv.alpha_
print(f"Best alpha value: {best_alpha}")


Best alpha value: 0.001
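
The selected alpha, 0.001, is the smallest value in the candidate list, so the search may have stopped at the edge of the grid rather than at a true minimum of the cross-validated error. The sketch below shows one way to check for that with a wider, log-spaced grid. It runs on synthetic data from `make_regression`; the grid bounds, sample sizes and `max_iter` are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem standing in for the real features.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Log-spaced candidates from 1e-4 to 1 (assumed bounds).
alphas = np.logspace(-4, 0, 9)

model = LassoCV(alphas=alphas, cv=5, random_state=0, max_iter=10000)
model.fit(X, y)

# If the winner sits on a grid boundary, the grid is probably too narrow.
at_edge = bool(np.isclose(model.alpha_, [alphas.min(), alphas.max()]).any())

print(f"Best alpha: {model.alpha_}")
print(f"Best alpha on grid boundary: {at_edge}")
```

`model.mse_path_` holds the per-fold error for every candidate, with one row per alpha and one column per fold, and can be inspected to see how flat the error curve is around the chosen value.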




In [101]:

feature_sel_model = SelectFromModel(Lasso(alpha=best_alpha, random_state=0))
feature_sel_model.fit(X_train, y_train)


In [102]:

selected_features = X_train.columns[feature_sel_model.get_support()]
print("Selected Features:")
print(selected_features)


Selected Features:
Index(['MSSubClass', 'MSZoning', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'Condition1', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'MasVnrType',
       'ExterQual', 'Foundation', 'BsmtQual', 'BsmtExposure', 'BsmtUnfSF',
       'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'KitchenQual', 'Functional',
       'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'ScreenPorch', 'YrSold',
       'SaleCondition'],
      dtype='object')


In [103]:
feature_sel_model.get_support()

array([ True,  True, False,  True, False, False,  True,  True, False,
        True, False,  True,  True, False, False, False,  True,  True,
        True,  True,  True, False,  True, False,  True, False,  True,
       False,  True,  True, False,  True, False, False, False, False,
        True, False, False,  True,  True, False,  True,  True, False,
        True,  True, False,  True,  True, False, False,  True, False,
        True,  True,  True,  True, False,  True,  True, False, False,
        True,  True,  True, False, False, False,  True, False, False,
       False, False, False, False,  True, False,  True, False, False,
       False])
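
The boolean array above has one entry per input column: `True` means the column is kept. The sketch below reproduces the mechanism on synthetic data so the link between the Lasso coefficients and the mask is visible. The data, the column names `f0`..`f7` and `alpha=1.0` are assumptions for illustration. With a Lasso estimator and no explicit `threshold`, `SelectFromModel` keeps a feature when the absolute value of its coefficient is at least 1e-5.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 8 features, only 3 of which drive the target.
X_arr, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                           noise=5.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectFromModel(Lasso(alpha=1.0, random_state=0))
selector.fit(X, y)

mask = selector.get_support()        # boolean array, one entry per column
kept = X.columns[mask]               # names of the retained columns
coef = selector.estimator_.coef_     # fitted Lasso coefficients

# Rebuild the mask by hand from the coefficients and the 1e-5 threshold.
manual_mask = np.abs(coef) >= 1e-5

print(list(kept))
print(np.array_equal(mask, manual_mask))
```

Because the cut-off is 1e-5 rather than exactly zero, counting `coef_ == 0` and counting dropped features can differ in principle, although the two usually agree.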

In [104]:
def feature_selection_stats(X_train, feature_sel_model):
    """
    Print statistics about feature selection.

    Parameters:
    - X_train: DataFrame
        The training data.
    - feature_sel_model: sklearn.feature_selection selector object
        The feature selection model.

    Returns:
    - None
    """
    # Get the selected features
    selected_feat = X_train.columns[feature_sel_model.get_support()]

    # Print statistics
    print('Total features: {}'.format(X_train.shape[1]))
    print('Selected features: {}'.format(len(selected_feat)))
    print('Features with coefficients shrunk to zero: {}'.format(
        np.sum(feature_sel_model.estimator_.coef_ == 0)))

    # Print the names of the selected features
    print('\nSelected Feature Names:')
    print(selected_feat)


In [105]:

feature_selection_stats(X_train, feature_sel_model)


Total features: 82
Selected features: 41
Features with coefficients shrunk to zero: 41

Selected Feature Names:
Index(['MSSubClass', 'MSZoning', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'Condition1', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'MasVnrType',
       'ExterQual', 'Foundation', 'BsmtQual', 'BsmtExposure', 'BsmtUnfSF',
       'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'KitchenQual', 'Functional',
       'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'ScreenPorch', 'YrSold',
       'SaleCondition'],
      dtype='object')


In [106]:
def select_features(X, feature_sel_model):
    """
    Select features using a feature selection model.

    Parameters:
    - X: DataFrame
        The input features.
    - feature_sel_model: sklearn.feature_selection selector object
        The feature selection model.

    Returns:
    - DataFrame
        A copy of the input features with only selected features.
    """
    selected_feat = X.columns[feature_sel_model.get_support()]
    return X[selected_feat]

X_train = select_features(X_train, feature_sel_model)


In [107]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,LotShape,LandContour,LotConfig,Neighborhood,Condition1,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,Exterior1st,MasVnrType,ExterQual,Foundation,BsmtQual,BsmtExposure,BsmtUnfSF,HeatingQC,CentralAir,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,KitchenQual,Functional,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,GarageCond,PavedDrive,WoodDeckSF,ScreenPorch,YrSold,SaleCondition
0,0.235294,0.75,0.366344,0.0,0.333333,0.0,0.636364,0.4,0.666667,0.5,0.036765,0.098361,0.0,1.0,0.666667,0.666667,1.0,0.75,0.25,0.064212,1.0,1.0,0.356155,0.413559,0.577712,0.333333,0.666667,0.5,0.666667,1.0,0.0,0.2,0.8,0.666667,0.5,1.0,1.0,0.0,0.0,0.5,0.75
1,0.0,0.75,0.391317,0.0,0.333333,0.5,0.5,0.2,0.555556,0.875,0.227941,0.52459,0.0,0.4,0.333333,0.333333,0.5,0.75,1.0,0.121575,1.0,1.0,0.503056,0.0,0.470245,0.0,0.666667,0.0,0.333333,1.0,0.333333,0.6,0.8,0.666667,0.5,1.0,1.0,0.347725,0.0,0.25,0.75
2,0.235294,0.75,0.422359,0.333333,0.333333,0.0,0.636364,0.4,0.666667,0.5,0.051471,0.114754,0.0,1.0,0.666667,0.666667,1.0,0.75,0.5,0.185788,1.0,1.0,0.383441,0.41937,0.593095,0.333333,0.666667,0.5,0.666667,1.0,0.333333,0.6,0.8,0.666667,0.5,1.0,1.0,0.0,0.0,0.5,0.75
3,0.294118,0.75,0.390295,0.333333,0.333333,0.25,0.727273,0.4,0.666667,0.5,0.669118,0.606557,0.0,0.2,0.333333,0.333333,0.25,0.5,0.25,0.231164,0.75,1.0,0.399941,0.366102,0.579157,0.333333,0.333333,0.0,0.666667,1.0,0.333333,0.8,0.4,0.333333,0.75,1.0,1.0,0.0,0.0,0.0,0.0
4,0.235294,0.75,0.468761,0.333333,0.333333,0.5,1.0,0.4,0.777778,0.5,0.058824,0.147541,0.0,1.0,0.666667,0.666667,1.0,0.75,0.75,0.20976,1.0,1.0,0.466237,0.509927,0.666523,0.333333,0.666667,0.5,0.666667,1.0,0.333333,0.6,0.8,0.666667,0.75,1.0,1.0,0.224037,0.0,0.5,0.75


In [108]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 41 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   float64
 1   MSZoning       1460 non-null   float64
 2   LotArea        1460 non-null   float64
 3   LotShape       1460 non-null   float64
 4   LandContour    1460 non-null   float64
 5   LotConfig      1460 non-null   float64
 6   Neighborhood   1460 non-null   float64
 7   Condition1     1460 non-null   float64
 8   OverallQual    1460 non-null   float64
 9   OverallCond    1460 non-null   float64
 10  YearBuilt      1460 non-null   float64
 11  YearRemodAdd   1460 non-null   float64
 12  RoofStyle      1460 non-null   float64
 13  Exterior1st    1460 non-null   float64
 14  MasVnrType     1460 non-null   float64
 15  ExterQual      1460 non-null   float64
 16  Foundation     1460 non-null   float64
 17  BsmtQual       1460 non-null   float64
 18  BsmtExpo