# <span style="background:rgba(10, 73, 136, 0.66); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> **House sales prediction at Ames, Iowa**
</span>

The goal of this project is to predict house sale prices in Ames, Iowa, from the seventy-nine (79) features provided in the dataset.

In [None]:
import pandas as pd # for data manipulation & cleaning
import numpy as np # statistics and type handling
from colorama import Fore, Style # colorize output

## <span style="background:rgba(136, 134, 10, 0.49); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> I. Let's peek at the dataset
</span>

In [None]:
data = pd.read_csv("../data/train.csv")
data.head().style.highlight_null(color="red")

In [None]:
data.describe().style.highlight_max(
                    color="green"
                    ).highlight_min(
                    color="black"
                    )

In [None]:
data.info()

## <span style="background:rgba(136, 134, 10, 0.49); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> II. **Feature engineering + data cleaning**</span>

In [None]:
# relevant features we will use to train our model
keyFactors = [
        'Id',
        'OverallQual',
        'OverallCond',
        'YearBuilt',
        'YearRemodAdd',
        'GrLivArea', 
        'TotalBsmtSF',
        'LotArea',
        'GarageArea',
        'GarageCars',
        'GarageYrBlt',
        'GarageType',
        'GarageFinish',
        'FullBath', 
        'HalfBath',         
        'BedroomAbvGr',    
        'KitchenAbvGr',    
        'KitchenQual',      
        'ExterQual',        
        'ExterCond',        
        'BsmtCond',         
        'HeatingQC',        
        'Neighborhood',     
        'MSZoning',         
        'Fireplaces',       
        'FireplaceQu',      
        'WoodDeckSF',       
        'OpenPorchSF',      
        'Foundation',       
        'CentralAir',       
        'SaleType',
        'SaleCondition',
        'MiscFeature',        
        # target 
        'SalePrice'

]

In [None]:
df = data[keyFactors].copy()
df.set_index("Id", inplace=True)
df.head().style.highlight_null(color="red").format(na_rep="⚠️missing")

#### **a. Let's add some features !**
Some features are redundant: `Exterior1st`/`Exterior2nd` and `Condition1`/`Condition2` store largely the same values and carry the same information. Each pair can be merged by keeping only the second column.

In [None]:
data[['Exterior1st', 'Exterior2nd', 'Condition1', 'Condition2']]

In [None]:
df.loc[:,'Exterior'] = data['Exterior2nd'].values
df.loc[:,'Condition'] = data['Condition2'].values
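
Before dropping one column of a pair, it can help to measure how often the two columns actually agree. The snippet below is a minimal sketch on a toy frame: the `toy` data and its values are illustrative assumptions, not rows from `train.csv`.

```python
import pandas as pd

# Toy stand-in for the real columns; the values are illustrative only.
toy = pd.DataFrame({
    "Exterior1st": ["VinylSd", "MetalSd", "Wd Sdng", "HdBoard"],
    "Exterior2nd": ["VinylSd", "MetalSd", "Wd Shng", "HdBoard"],
})

# Share of rows where both columns carry the same value.
agreement = (toy["Exterior1st"] == toy["Exterior2nd"]).mean()
print(f"Exterior1st == Exterior2nd on {agreement:.0%} of rows")
```

The same comparison can be run on `data` for both pairs; a share close to 100% supports keeping a single column.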

We can also add the house's `Lifespan`: the number of years between the build year and the sale year.

In [None]:
df["Lifespan"] = np.int64(data["YrSold"] - data["YearBuilt"])
df.fillna({"Lifespan": 0}, inplace=True) # there is no duration when the result is NA
#####
df = df[df.columns.sort_values()] # sorts the columns in alphabetic order
df.tail(3).style.background_gradient(cmap="coolwarm")
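
A detail worth noting: `df` is indexed by `Id` while `data` keeps its default positional index, so the new columns are assigned as plain arrays (`.values`, `np.int64(...)`) rather than as Series. The sketch below illustrates why on a toy frame; the names `raw` and `indexed` and the values are assumptions made for the example.

```python
import pandas as pd

# Toy data: three houses with hypothetical years.
raw = pd.DataFrame({
    "Id": [1, 2, 3],
    "YrSold": [2008, 2007, 2010],
    "YearBuilt": [2003, 1976, 2010],
})
indexed = raw[["Id"]].set_index("Id")  # index is now 1, 2, 3

lifespan = raw["YrSold"] - raw["YearBuilt"]  # Series indexed 0, 1, 2

# Assigning the Series aligns on labels: 0..2 does not match 1..3.
indexed["aligned_by_label"] = lifespan
# Assigning the underlying array aligns on position, as intended.
indexed["aligned_by_position"] = lifespan.to_numpy()

print(indexed)
```

With label alignment the values are shifted by one row and the last house ends up with a missing value, which is the situation the positional assignment avoids.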

#### **b. Data cleaning**

In [None]:
# removing duplicates and checking missing values
df.drop_duplicates(inplace=True)
a = df.isna().sum()
a[a>0]

Samples with **NA** values are houses for which the feature does not exist. Those values are not dropped: they are replaced by `Empty` for text features or `0` for numeric features, and are then encoded later in the process.

In [None]:
# * fill_missing_values replaces the NA with "Empty" (text) or 0 (numeric)
def fill_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    temp = df.copy()
    # use the columns with NA in the frame passed in, not the global `a` built from the training set
    cols_with_empty_values = temp.columns[temp.isna().any()]
    
    for c in cols_with_empty_values:
        if temp[c].dtype == "O":
            temp[c] = temp[c].fillna("Empty")
        else:
            temp[c] = temp[c].fillna(0)
    return temp

df = fill_missing_values(df)
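
The filling rule can be checked on a small example. This is a sketch under the assumption of a two-column toy frame (the values are made up); it uses `pd.api.types.is_numeric_dtype` to choose the fill value, which is a slightly different test from the `dtype == "O"` check above but follows the same rule.

```python
import numpy as np
import pandas as pd

# Toy frame with one text and one numeric column, each with a missing value.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

filled = toy.copy()
for col in filled.columns[filled.isna().any()]:
    # Numeric columns get 0, every other column gets the "Empty" label.
    fill_value = 0 if pd.api.types.is_numeric_dtype(filled[col]) else "Empty"
    filled[col] = filled[col].fillna(fill_value)

print(filled)
```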

In [None]:
# let's check values type for each sample
def check(df: pd.DataFrame):
    temp = df.dropna(axis=0)
    print(Fore.LIGHTYELLOW_EX+"start checking ..."+Style.BRIGHT+Style.RESET_ALL)
    for col in temp.columns:
        if temp[col].dtype == np.int64:
            try:
                np.int64(temp[col])
            except Exception as e:
                print(f"'{col.capitalize()}' feature should have int64 type for all samples")
        elif temp[col].dtype == np.float64:
            try:
                np.float64(temp[col])
            except Exception as e:
                print(f"'{col.capitalize()}' feature should have float64 type for all samples")
        else :
            try:
                np.object_(temp[col])
            except Exception as e:
                print(f"'{col.capitalize()}' feature should have object type for all samples")
    print(Fore.GREEN + "All columns are checked "+ Style.BRIGHT+Style.RESET_ALL)

check(df)
print(f"we have {df.shape[0]} samples and {df.shape[1]} features with the houses id set as index")

Let's repeat the same data manipulation on the `test set`.

In [None]:
## importing data
test_data = pd.read_csv("../data/test.csv")
df_test = test_data[keyFactors[:-1]].copy()
df_test.set_index("Id", inplace=True)

## feature engineering
df_test.loc[:,'Exterior'] = test_data['Exterior2nd'].values
df_test.loc[:,'Condition'] = test_data['Condition2'].values
df_test.loc[:,"Lifespan"] = np.int64(test_data["YrSold"] - test_data["YearBuilt"])
df_test.fillna({"Lifespan": 0}, inplace=True) # there is no duration when the result is NA
df_test = df_test[df_test.columns.sort_values()] # sorts the columns in alphabetic order

# filling missing values (test rows are not de-duplicated, so every Id keeps a prediction)
df_test = fill_missing_values(df_test)
check(df_test)

## <span style="background:rgba(136, 134, 10, 0.49); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> III. **Data exploration**</span>

This section presents the `insights` and highlights how relevant the selected features are for predicting the sale price.  
Features are grouped into `nine (9) categories`:
* **`General features`** : *OverallQual*, *OverallCond*, *YearBuilt*, *YearRemodAdd*  
* **`Surfaces`** : *GrLivArea*, *TotalBsmtSF*, *LotArea*  
* **`Garage`** : *GarageArea*, *GarageCars*, *GarageYrBlt*, *GarageType*, *GarageFinish*  
* **`Rooms and Bathrooms`** : *FullBath*, *HalfBath*, *BedroomAbvGr*, *KitchenAbvGr*  
* **`Quality`** : *KitchenQual*, *ExterQual*, *ExterCond*, *BsmtCond*, *HeatingQC*  
* **`Location`** : *Neighborhood*, *MSZoning*  
* **`Additional Value Features`** : *Fireplaces*, *FireplaceQu*, *WoodDeckSF*, *OpenPorchSF*, *Foundation*, *CentralAir*  
* **`Sales Variables`** : *SaleType*, *SaleCondition*, *MiscFeature*  
* **`Created features`** : *Exterior*, *Lifespan*, *Condition*  

## <span style="background:rgba(136, 134, 10, 0.49); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> **IV. Model Selection**</span>
We named our model **`ImmoSense`**.

* First, let's encode the categorical features as dummy (indicator) columns, one true/false column per category; numeric features are left unchanged.

In [None]:
target = df.copy()["SalePrice"]
df.drop("SalePrice", axis=1, inplace=True)
dummy = pd.get_dummies(df)
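
To make the encoding step concrete, here is a minimal sketch on a toy frame (the two columns and their values are assumptions chosen for illustration). It shows that `pd.get_dummies` keeps the numeric column as is and expands the text column into one indicator column per category.

```python
import pandas as pd

# Toy frame: one numeric feature and one categorical feature.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "CentralAir": ["Y", "N", "Y"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own column (`CentralAir_N`, `CentralAir_Y`), which is why the feature count grows considerably after this step.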

* Let's split it into a train and a test set for training purposes.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(dummy, target, test_size=0.2, shuffle=False)
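
With `shuffle=False`, the split is positional: the first 80% of the rows go to the training set and the last 20% to the test set. The following sketch shows this on ten toy samples (the arrays are illustrative assumptions).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples whose target equals their position.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
print("train targets:", y_tr.tolist())
print("test targets:", y_te.tolist())
```

This keeps the split reproducible without a random seed, at the cost of depending on the row order of the file.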

* let's pick the best model for our study case between:
    * **Linear Regression**
    * **SVR**
    * **Ridge**
    * **Nearest neighbors regression**
    * **Decision trees**

In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
models = {
    "Linear_regression": LinearRegression(),
    "KNR": KNeighborsRegressor(),
    "SVR": SVR(kernel="linear"),
    "Ridge": Ridge(alpha=0.5),
    "Decision": DecisionTreeRegressor()
}    

In [None]:
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score
def check_perf(model_list: dict, x_train, y_train, x_test, y_test) -> pd.DataFrame:
    rmse_tab, mae_tab, r_2_tab, score_tab = [], [], [], []
    
    for mod in model_list.values():
        mod.fit(x_train, y_train) # training the model with the training data
        y_pred = mod.predict(x_test) # prediction on the samples held out for testing
        
        ## --- some metrics to evaluate model's prediction
        rmse_tab.append(round(root_mean_squared_error(y_test, y_pred),2)) 
        mae_tab.append(round(mean_absolute_error(y_test, y_pred), 2)) 
        r_2_tab.append(round(r2_score(y_test, y_pred),3))
        score_tab.append(mod.score(x_train, y_train)) # training score
        
    return pd.DataFrame({
                        "RMSE": rmse_tab,
                        "Mae": mae_tab,
                        "R2": r_2_tab,
                        "Scores": score_tab
                        }, index=model_list.keys())


In [None]:
check_perf(models, x_train, y_train, x_test, y_test).style.background_gradient(cmap="coolwarm")
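
For reference, the three test metrics in the table can be reproduced by hand. The sketch below computes RMSE, MAE, and R² with NumPy on three hypothetical prices (the values are assumptions for the example, not model output).

```python
import numpy as np

# Hypothetical true and predicted sale prices.
y_true = np.array([200_000.0, 150_000.0, 300_000.0])
y_pred = np.array([210_000.0, 140_000.0, 290_000.0])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more strongly
mae = np.mean(np.abs(errors))          # average absolute error, in dollars
# Share of the target variance explained by the predictions.
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

RMSE and MAE are expressed in the unit of the target, so lower is better, while R² is unitless and closer to 1 is better.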

Comparing the metrics above, the **`Ridge`** model has a lower **`mean error`** together with a quite acceptable **`training score`**.   
So the **`ImmoSense`** model will be **`"Ridge"`**. For better results, let's choose the best parameter for **`ImmoSense`**.

In [None]:
from sklearn.model_selection import GridSearchCV
immoSense_test = Ridge()
params = {"alpha": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
gd = GridSearchCV(immoSense_test, param_grid=params, cv=5, n_jobs=-1, scoring="r2")
gd.fit(x_train, y_train)
print("the best alpha parameter is "+ Fore.YELLOW + f"{gd.best_params_}" +Style.BRIGHT+Style.RESET_ALL+ " with a mean cross-validated R2 score of " + Fore.GREEN +f"{gd.best_score_*100:.2f}%" + Style.BRIGHT +Style.RESET_ALL)
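
The search can be tried in isolation on synthetic data. This is a minimal sketch, assuming a made-up linear target and a shorter `alpha` grid than the one above; it only illustrates how `best_params_` and `best_score_` (the mean cross-validated score) are obtained.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validated search over a small, illustrative alpha grid.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because `refit=True` by default, `search.best_estimator_` is already trained on the full data passed to `fit` with the selected `alpha`.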

In [None]:
## Training ImmoSense model with the best params found by the grid search
ImmoSense = Ridge(**gd.best_params_)
ImmoSense.fit(x_train, y_train)
print(Fore.GREEN + "ImmoSense well trained"+ Style.DIM+Style.RESET_ALL)

* Let's peek at the **feature weights**. This study shows how much each chosen feature **impacts** the **model's predictions**.

In [None]:
feat_imp = pd.DataFrame({
    "Features": ImmoSense.feature_names_in_,
    "Coef":np.abs(ImmoSense.coef_)
})

data_categories = { #? dummies features used for model training
    "General features":['OverallCond', 'OverallQual','YearBuilt', 'YearRemodAdd' ],
    
    "Surfaces":["GrLivArea", "TotalBsmtSF", "LotArea"],
    
    "Garage": ['GarageFinish_Empty','GarageFinish_Fin', 'GarageFinish_RFn', 
                'GarageFinish_Unf', 'GarageType_2Types', 'GarageType_Attchd',
                'GarageType_Basment','GarageType_BuiltIn', 'GarageType_CarPort', 
                'GarageType_Detchd','GarageType_Empty'],
    
    "Rooms and bathrooms": ['FullBath','HalfBath','BedroomAbvGr', 'KitchenAbvGr'],
    
    "Quality": ['KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA',
                'ExterCond_Ex', 'ExterCond_Fa', 'ExterCond_Gd', 'ExterCond_Po',
                'ExterCond_TA', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd',
                'ExterQual_TA', 'BsmtCond_Empty', 'BsmtCond_Fa',
                'BsmtCond_Gd', 'BsmtCond_Po', 'BsmtCond_TA',
                'HeatingQC_Ex', 'HeatingQC_Fa', 'HeatingQC_Gd',
                'HeatingQC_Po', 'HeatingQC_TA'],
    
    "Location": ['Neighborhood_Blueste', 'Neighborhood_BrDale',
                'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
                'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
                'Neighborhood_Edwards', 'Neighborhood_Gilbert',
                'Neighborhood_IDOTRR', 'Neighborhood_MeadowV',
                'Neighborhood_Mitchel', 'Neighborhood_NAmes',
                'Neighborhood_NPkVill', 'Neighborhood_NWAmes',
                'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
                'Neighborhood_OldTown', 'Neighborhood_SWISU',
                'Neighborhood_Sawyer', 'Neighborhood_SawyerW',
                'Neighborhood_Somerst', 'Neighborhood_StoneBr',
                'Neighborhood_Timber', 'Neighborhood_Veenker',
                'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM'
                ],
    
    "Additional feature value": ['Fireplaces', 'FireplaceQu_Empty',
                                'FireplaceQu_Ex', 'FireplaceQu_Fa', 'FireplaceQu_Gd',
                                'FireplaceQu_Po', 'FireplaceQu_TA', 'WoodDeckSF',
                                'OpenPorchSF','Foundation_BrkTil','Foundation_CBlock',
                                'Foundation_PConc', 'Foundation_Slab','Foundation_Stone', 
                                'Foundation_Wood', 'CentralAir_N', 'CentralAir_Y'
                                ],
    
    "Sales variable": ['SaleCondition_Abnorml', 'SaleCondition_AdjLand',
                        'SaleCondition_Alloca', 'SaleCondition_Family',
                        'SaleCondition_Normal', 'SaleCondition_Partial', 'SaleType_COD',
                        'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI',
                        'SaleType_ConLw', 'SaleType_New', 'SaleType_Oth', 'SaleType_WD',
                        'MiscFeature_Empty', 'MiscFeature_Gar2', 'MiscFeature_Othr',
                        'MiscFeature_Shed', 'MiscFeature_TenC'],
    
    "Created features":['Exterior_AsbShng', 'Exterior_AsphShn',
                        'Exterior_Brk Cmn', 'Exterior_BrkFace', 'Exterior_CBlock',
                        'Exterior_CmentBd', 'Exterior_HdBoard', 'Exterior_ImStucc',
                        'Exterior_MetalSd', 'Exterior_Other', 'Exterior_Plywood',
                        'Exterior_Stone', 'Exterior_Stucco', 'Exterior_VinylSd',
                        'Exterior_Wd Sdng', 'Exterior_Wd Shng', 'Condition_Artery', 
                        'Condition_Feedr','Condition_Norm', 'Condition_PosA', 
                        'Condition_PosN','Condition_RRAe', 'Condition_RRAn', 
                        'Condition_RRNn']
}
coef_sum_tab = []

for categories_tab in data_categories.values():
    coef_sum_tab.append(
        feat_imp[feat_imp["Features"].isin(categories_tab)]["Coef"].sum()
    )
    
features_weigh = pd.DataFrame({"Weight": np.int64(np.round(coef_sum_tab))}, index=data_categories.keys())
features_weigh.sort_values(ascending=True, inplace=True, by="Weight")
features_weigh.style.background_gradient(cmap="coolwarm").format('{}$')

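Each coefficient represents the change in the predicted target (SalePrice) for a one-unit increase of the feature, all other features being equal: an increase for positive coefficients, a decrease for negative ones. The weights above sum the absolute values of the coefficients in each category, so they measure the overall magnitude of a category's effect rather than its direction.  

For the `Rooms and bathrooms` category, the summed weight is **30,840$**: taken together, one additional unit of each of its four features moves the predicted price by up to that amount. The same reading applies to the other categories.  

What makes the feature weight study more relevant is that it shows how the features we created influence the sale price. The `Created features` category, which describes the exterior material of the house and its surrounding conditions, has a summed weight of **307,113$**.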

## <span style="background:rgba(136, 134, 10, 0.49); border-radius: 15px; padding: 2px 15px; text_shadow: 2px 2px 5px black"> **V. Final prediction**</span>
Let's predict prices for the test set and store them in `data/submission.csv`.

In [None]:
df_test.ffill(inplace=True) # forward-fill any remaining missing values
dummy_test = pd.get_dummies(df_test)
dummy_test = dummy_test.reindex(columns=dummy.columns, fill_value=0)
prices = ImmoSense.predict(dummy_test)
pd.DataFrame({"SalePrice": np.int64(prices)}, index=dummy_test.index).to_csv("../data/submission.csv",sep=",",header=True)
print(Fore.GREEN+"Submission file generated successfully !"+Style.BRIGHT+Style.RESET_ALL)
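
The `reindex` call matters because the test set may contain categories that the training set never saw, and miss some that it did. Here is a minimal sketch on toy frames (the `MSZoning` values are illustrative assumptions) showing how the test columns are forced to match the training columns.

```python
import pandas as pd

# Training data has zones RL, RM, FV; test data has RL and an unseen RH.
train_dummies = pd.get_dummies(pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]}))
test_dummies = pd.get_dummies(pd.DataFrame({"MSZoning": ["RL", "RH"]}))

# Keep exactly the training columns: unseen ones are dropped, missing ones are filled with 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

The model then receives the same columns, in the same order, as those it was trained on; the unseen `MSZoning_RH` column is simply ignored.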