# Notebook: PySpark/Pandas Project
### Raúl Varela Ferrando

## Initialization

Installation of Spark and Java

In [None]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.4/spark-3.2.4-bin-hadoop3.2.tgz
!tar xf spark-3.2.4-bin-hadoop3.2.tgz
!pip install -q findspark

Setting up environment variables.



In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.4-bin-hadoop3.2"

Opening connection with Google Drive.



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We create a variable with the path to our working directory.



In [4]:
input_path = '/content/drive/My Drive/Master Big Data y Data Science/APBD/trabajo_final_APBD_2023/data/{}'
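
The `{}` at the end of `input_path` is a placeholder that `str.format` later fills with a file name. A minimal sketch of that mechanism, using a shortened hypothetical base path (an assumption for illustration, not the Drive path above):

```python
# Hypothetical base directory standing in for the Google Drive path.
base_path = '/content/data/{}'

# str.format substitutes the file name for the {} placeholder.
train_path = base_path.format('train.csv')
test_path = base_path.format('test.csv')

print(train_path)
print(test_path)
```

Keeping the directory in one template means only one line changes if the working directory moves.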

The following chunk will connect the Spark execution engine with the Python environment we are using.



In [5]:
import findspark
findspark.init()

Finally, we initialize the connection with the Spark execution engine.



In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("EntregaAPBD") \
    .getOrCreate()
spark

## Exploring the Dataset

Next, we start an example covering the complete modeling process using a dataset provided by Kaggle, where the goal is to predict housing prices based on their features. We will begin with data cleaning and exploration, and finish with model training and parameter tuning.

In [7]:
train_df = spark.read.csv(path=input_path.format('train.csv'), header=True, inferSchema=True)
test_df = spark.read.csv(path=input_path.format('test.csv'), header=True, inferSchema=True)

In [8]:
train_df.show(5)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

In [9]:
test_df.show(5)

+----+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+
|  Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition2|BldgTy

We convert our Spark DataFrames to pandas-on-Spark DataFrames, which expose the pandas API, for easier handling.



In [10]:
!pip install pyarrow



In [11]:
from pyspark import pandas as ps
train_psdf = train_df.to_pandas_on_spark(index_col='Id')
test_psdf = test_df.to_pandas_on_spark(index_col='Id')



In [12]:
type(train_psdf)

pyspark.pandas.frame.DataFrame

In [13]:
train_psdf.head(3)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500


In [14]:
test_psdf.head(3)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0,TA,TA,CBlock,TA,TA,No,Rec,468,LwQ,144,270,882,GasA,TA,Y,SBrkr,896,0,0,896,0,0,1,0,2,1,TA,5,Typ,0,,Attchd,1961,Unf,1,730,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108,TA,TA,CBlock,TA,TA,No,ALQ,923,Unf,0,406,1329,GasA,TA,Y,SBrkr,1329,0,0,1329,0,0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958,Unf,1,312,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0,TA,TA,PConc,Gd,TA,No,GLQ,791,Unf,0,137,928,GasA,Gd,Y,SBrkr,928,701,0,1629,0,0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997,Fin,2,482,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal


- We see that the train dataset has one extra variable, the response.
- Let's see what happened with the schema.

In [None]:
print("Train DataFrame Info:")
train_psdf.info()



<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MSSubClass     1460 non-null   int32 
 1   MSZoning       1460 non-null   object
 2   LotFrontage    1460 non-null   object
 3   LotArea        1460 non-null   int32 
 4   Street         1460 non-null   object
 5   Alley          1460 non-null   object
 6   LotShape       1460 non-null   object
 7   LandContour    1460 non-null   object
 8   Utilities      1460 non-null   object
 9   LotConfig      1460 non-null   object
 10  LandSlope      1460 non-null   object
 11  Neighborhood   1460 non-null   object
 12  Condition1     1460 non-null   object
 13  Condition2     1460 non-null   object
 14  BldgType       1460 non-null   object
 15  HouseStyle     1460 non-null   object
 16  OverallQual    1460 non-null   int32 
 17  OverallCond    1460 non-null   int32 
 18  YearBuilt      1460 non-n

To obtain summary information about our dataframe, we use the **describe()** method, which reports statistics such as the count, mean, standard deviation, minimum, quartiles and maximum of each numerical column.
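
As a rough illustration of what these statistics mean, the sketch below computes them with Python's standard `statistics` module on a toy list. The values are made up for illustration and are not taken from the dataset; pandas-on-Spark may also compute the percentiles approximately, so this is not a reimplementation of `describe()`.

```python
import statistics

# Toy values chosen for illustration only.
values = [1, 2, 3, 4, 5]

summary = {
    'count': len(values),
    'mean': statistics.mean(values),
    # Sample standard deviation (n - 1 in the denominator).
    'std': statistics.stdev(values),
    'min': min(values),
    'max': max(values),
}

# Quartiles with linear interpolation between neighbouring data points.
q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')
summary.update({'25%': q1, '50%': q2, '75%': q3})

for name, value in summary.items():
    print(name, value)
```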

In [None]:
print("Train DataFrame Describe:")
train_psdf.describe()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,10516.828082,6.099315,5.575342,1971.267808,1984.865753,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,42.300571,9981.264932,1.382997,1.112799,30.202904,20.645407,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,20.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,7540.0,5.0,5.0,1954.0,1967.0,0.0,0.0,223.0,795.0,882.0,0.0,0.0,1128.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1.0,330.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129900.0
50%,50.0,9477.0,6.0,5.0,1973.0,1994.0,383.0,0.0,476.0,991.0,1086.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,11600.0,7.0,6.0,2000.0,2004.0,712.0,0.0,808.0,1298.0,1391.0,728.0,0.0,1776.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,215245.0,10.0,9.0,2010.0,2010.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


- We obtain a list of column names using the **columns** attribute. 

In [17]:
cols = train_psdf.columns
print(len(cols))
cols

80


Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

## Exploration

### Dimensions

Merging both datasets facilitates exploration, but for this, both must have the same columns. The test set does not contain the response variable, so we add a **SalePrice** column filled with **None** as a placeholder; the actual values are contained in another .csv file within the working material.
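
The idea of aligning the columns before stacking the two sets can be sketched in plain Python. The list-of-dicts representation is an assumption made for illustration (the notebook itself works with pandas-on-Spark); the few values used are copied from rows shown in this notebook.

```python
# A few values copied from the rows displayed in this notebook.
train_rows = [
    {'Id': 1, 'LotArea': 8450, 'SalePrice': 208500},
    {'Id': 2, 'LotArea': 9600, 'SalePrice': 181500},
]
test_rows = [
    {'Id': 1461, 'LotArea': 11622},  # no response variable
]

# Add the missing response column as an empty placeholder.
for row in test_rows:
    row.setdefault('SalePrice', None)

# With identical column sets, the rows can simply be stacked.
assert set(train_rows[0]) == set(test_rows[0])
full_rows = train_rows + test_rows

n_rows, n_cols = len(full_rows), len(full_rows[0])
print((n_rows, n_cols))
```

The placeholder keeps the test rows distinguishable afterwards, since they are the only ones with an empty `SalePrice`.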

In [18]:
test_psdf['SalePrice'] = None
test_psdf.head(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1461,20,RH,80,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0,TA,TA,CBlock,TA,TA,No,Rec,468,LwQ,144,270,882,GasA,TA,Y,SBrkr,896,0,0,896,0,0,1,0,2,1,TA,5,Typ,0,,Attchd,1961,Unf,1,730,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,
1462,20,RL,81,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108,TA,TA,CBlock,TA,TA,No,ALQ,923,Unf,0,406,1329,GasA,TA,Y,SBrkr,1329,0,0,1329,0,0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958,Unf,1,312,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,
1463,60,RL,74,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0,TA,TA,PConc,Gd,TA,No,GLQ,791,Unf,0,137,928,GasA,Gd,Y,SBrkr,928,701,0,1629,0,0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997,Fin,2,482,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,
1464,60,RL,78,9978,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,6,1998,1998,Gable,CompShg,VinylSd,VinylSd,BrkFace,20,TA,TA,PConc,TA,TA,No,GLQ,602,Unf,0,324,926,GasA,Ex,Y,SBrkr,926,678,0,1604,0,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,1998,Fin,2,470,TA,TA,Y,360,36,0,0,0,0,,,,0,6,2010,WD,Normal,
1465,120,RL,43,5005,Pave,,IR1,HLS,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,1992,1992,Gable,CompShg,HdBoard,HdBoard,,0,Gd,TA,PConc,Gd,TA,No,ALQ,263,Unf,0,1017,1280,GasA,Ex,Y,SBrkr,1280,0,0,1280,0,0,2,0,2,1,Gd,5,Typ,0,,Attchd,1992,RFn,2,506,TA,TA,Y,0,82,0,0,144,0,,,,0,1,2010,WD,Normal,


In [19]:
train_psdf.head(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### Dataset Merge

In [20]:
full_psdf = ps.concat([train_psdf, test_psdf])
full_psdf.shape

(2919, 80)

In [21]:
train_psdf.shape

(1460, 80)

In [22]:
test_psdf.shape

(1459, 80)

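Now we have our unified dataset. However, some numerical variables (**LotFrontage**, **MasVnrArea** and **GarageYrBlt**) were read as strings because their missing values are stored as the text **NA**, so we cast them to integers. Values that cannot be parsed as numbers become null in the cast.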
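
The effect of casting a string column that contains non-numeric entries can be sketched in plain Python. The helper `to_int_or_none` is hypothetical and only mimics the behaviour seen in this notebook, where unparseable strings end up as nulls after the cast; it is not Spark's implementation.

```python
def to_int_or_none(text):
    """Return int(text), or None when the text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

# Values as they might appear in a string-typed column such as LotFrontage.
raw = ['65', '80', 'NA', '68']
converted = [to_int_or_none(value) for value in raw]

print(converted)  # [65, 80, None, 68]
```

This is why the non-null count of a column can drop after the conversion: the placeholder strings become real nulls.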




In [23]:
for column in ['LotFrontage','MasVnrArea','GarageYrBlt']:
  full_psdf[column]=full_psdf[column].astype('int32')

full_psdf.info()



<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 2919 entries, 1 to 2919
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MSSubClass     2919 non-null   int32 
 1   MSZoning       2919 non-null   object
 2   LotFrontage    2433 non-null   int32 
 3   LotArea        2919 non-null   int32 
 4   Street         2919 non-null   object
 5   Alley          2919 non-null   object
 6   LotShape       2919 non-null   object
 7   LandContour    2919 non-null   object
 8   Utilities      2919 non-null   object
 9   LotConfig      2919 non-null   object
 10  LandSlope      2919 non-null   object
 11  Neighborhood   2919 non-null   object
 12  Condition1     2919 non-null   object
 13  Condition2     2919 non-null   object
 14  BldgType       2919 non-null   object
 15  HouseStyle     2919 non-null   object
 16  OverallQual    2919 non-null   int32 
 17  OverallCond    2919 non-null   int32 
 18  YearBuilt      2919 non-n

## Missing Values

- We have seen that there are missing values, so we will impute them.

In [24]:
full_psdf.isnull().sum()



MSSubClass          0
MSZoning            0
LotFrontage       486
LotArea             0
Street              0
Alley               0
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          0
MasVnrArea         23
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual            0
BsmtCond            0
BsmtExposure        0
BsmtFinType1        0
BsmtFinSF1          0
BsmtFinType2        0
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          0
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
GrLivArea 

As can be observed, most variables report no missing values, which does not match what we see when inspecting the dataset: missing entries are stored as the string **NA**, which **isnull()** does not count. At this point we differentiate between categorical variables for which **NA** is a valid category, as detailed in the **data_description.txt** file, and those for which it is not. In the latter, **NA** marks a true missing value, so we replace it with **None**.
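
The distinction can be illustrated with a small plain-Python sketch. The two column groups and the two toy rows below are assumptions chosen for illustration: **Alley** documents `NA` as "no alley access", while **MSZoning** has no such category.

```python
# Illustrative column groups; the full lists come from data_description.txt.
na_is_category = {'Alley'}     # 'NA' is a legitimate value
na_is_missing = {'MSZoning'}   # 'NA' means the value is unknown

# Toy rows, not taken from the dataset.
rows = [
    {'Alley': 'NA', 'MSZoning': 'RL'},
    {'Alley': 'Pave', 'MSZoning': 'NA'},
]

# Replace 'NA' with None only where it marks a true missing value.
cleaned = [
    {
        col: (None if col in na_is_missing and val == 'NA' else val)
        for col, val in row.items()
    }
    for row in rows
]

# Count the nulls per column after the replacement.
missing_counts = {
    col: sum(row[col] is None for row in cleaned) for col in rows[0]
}
print(missing_counts)  # {'Alley': 0, 'MSZoning': 1}
```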

In [None]:
cat_vars = [
    'MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
    'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
    'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual',
    'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
    'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition'
]

for column in cat_vars:
    full_psdf[column] = full_psdf[column].replace('NA', None)

full_psdf.head(5)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [None]:
full_psdf.isnull().sum()



MSSubClass          0
MSZoning            4
LotFrontage       486
LotArea             0
Street              0
Alley               0
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType          0
MasVnrArea         23
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual            0
BsmtCond            0
BsmtExposure        0
BsmtFinType1        0
BsmtFinSF1          0
BsmtFinType2        0
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
GrLivArea 

After these transformations, we identify additional missing values in variables such as **MSZoning**, **Utilities**, **Exterior1st**, **Exterior2nd**, **Electrical**, **KitchenQual**, **Functional** and **SaleType**, beyond those already found in numerical variables.

To impute the missing values, we follow one of two procedures depending on whether the variable is numerical or categorical. For numerical variables, we replace missing values with the median, which is robust to skew and outliers, since we cannot assume the variables are normally distributed. For categorical variables, we replace them with the most frequent category. We did not conduct a deeper analysis because an initial attempt did not yield well-defined conclusions.
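The two imputation rules can be illustrated on plain Python lists before applying them to the dataframe. This is only a minimal sketch: the `impute` helper and the toy values are hypothetical and are not part of the notebook or the dataset.

```python
from statistics import median, mode

def impute(values, strategy):
    """Fill None entries with the median or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, not taken from the dataset.
lot_frontage = [65, 80, None, 60, 84]
ms_zoning = ["RL", "RM", None, "RL", "RL"]

print(impute(lot_frontage, "median"))  # numerical: median of the observed values
print(impute(ms_zoning, "mode"))       # categorical: most frequent category
```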

In [None]:
ps.set_option('compute.ops_on_diff_frames', True)
num_vars = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
cat_vars = ['MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'Electrical', 'KitchenQual', 'Functional', 'SaleType']

# Numerical variables: fill missing values with the column median.
for column in num_vars:
    median = full_psdf[column].median()
    full_psdf[column] = full_psdf[column].fillna(median)

# Categorical variables: fill missing values with the most frequent category.
for column in cat_vars:
    mode = full_psdf[column].mode()
    full_psdf[column] = full_psdf[column].fillna(mode[0])

In [None]:
print("Null values after imputation:")
full_psdf.isnull().sum()

MSSubClass          0
MSZoning            0
LotFrontage         0
LotArea             0
Street              0
Alley               0
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          0
MasVnrArea          0
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual            0
BsmtCond            0
BsmtExposure        0
BsmtFinType1        0
BsmtFinSF1          0
BsmtFinType2        0
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          0
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
GrLivArea 

As we can see, there are no missing values except for the **SalePrice** variable, which is our target variable. Once the data is preprocessed, we move on to the model creation stage. First, we load the packages we will use.

## Pipelines

In [29]:
import pandas as pd
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import GeneralizedLinearRegression, FMRegressor, RandomForestRegressor
from pyspark.ml.evaluation import Evaluator, RegressionEvaluator

To apply models, we need to structure the data in the features/label format. To automate the process, we will use a Pipeline.



In [None]:
categorical_columns = full_psdf.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_columns)

In [None]:
categorical_columns = [
    "MSSubClass", "MSZoning", "Street", "Alley", "LotShape", "LandContour",
    "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2",
    "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
    "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond",
    "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC",
    "CentralAir", "Electrical", "KitchenQual", "Functional", "FireplaceQu",
    "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PavedDrive",
    "PoolQC", "Fence", "MiscFeature", "SaleType", "SaleCondition",
    # The basement-area columns are indexed as well, because selected_features
    # below expects their "_n" versions.
    "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF"
]

indexers = [
    StringIndexer(inputCol=col, outputCol=f"{col}_n", handleInvalid="keep")
    for col in categorical_columns
]

selected_features = ['MSSubClass_n', 'MSZoning_n','LotFrontage','LotArea', 'Street_n',
                     'Alley_n', 'LotShape_n', 'LandContour_n', 'Utilities_n', 'LotConfig_n', 'LandSlope_n',
                     'Neighborhood_n','Condition1_n', 'Condition2_n','BldgType_n','HouseStyle_n',
                     'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle_n',
                     'RoofMatl_n','Exterior1st_n', 'Exterior2nd_n', 'MasVnrType_n','MasVnrArea',
                     'ExterQual_n', 'ExterCond_n', 'Foundation_n', 'BsmtQual_n', 'BsmtCond_n',
                     'BsmtExposure_n', 'BsmtFinType1_n', 'BsmtFinSF1_n','BsmtFinType2_n',
                     'BsmtFinSF2_n','BsmtUnfSF_n','TotalBsmtSF_n', 'Heating_n', 'HeatingQC_n',
                     'CentralAir_n', 'Electrical_n', '1stFlrSF', '2ndFlrSF','LowQualFinSF',
                     'GrLivArea','FullBath','HalfBath','KitchenQual_n', 'TotRmsAbvGrd', 'Functional_n', 'Fireplaces',
                     'FireplaceQu_n','GarageType_n', 'GarageYrBlt','GarageFinish_n','GarageQual_n',
                     'GarageCond_n', 'PavedDrive_n', 'WoodDeckSF','OpenPorchSF','EnclosedPorch',
                     '3SsnPorch', 'ScreenPorch','PoolArea','PoolQC_n', 'Fence_n', 'MiscFeature_n',
                     'MiscVal', 'MoSold','YrSold', 'SaleType_n', 'SaleCondition_n']
                     
assembler = VectorAssembler(inputCols=selected_features, outputCol="features")
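To make the two kinds of stage more concrete, the following plain-Python sketch mimics what an indexer and the assembler do to a single row. It is an illustration only, not Spark's implementation: the helper names and toy values are hypothetical, and it assumes the default frequency-descending label ordering (ties broken alphabetically here) together with `handleInvalid="keep"`, which places unseen labels in one extra index.

```python
from collections import Counter

def fit_indexer(values):
    """Map each label to an index: the most frequent label gets 0.0."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(mapping, value):
    """Mimic handleInvalid='keep': unseen labels share one extra index."""
    return mapping.get(value, float(len(mapping)))

# Hypothetical training values for one categorical column.
train_zoning = ["RL", "RM", "RL", "FV", "RL", "RM"]
mapping = fit_indexer(train_zoning)

# Assemble one row: the indexed category followed by two numeric features.
row = {"MSZoning": "C (all)", "LotArea": 9600.0, "OverallQual": 6.0}
features = [transform_keep(mapping, row["MSZoning"]), row["LotArea"], row["OverallQual"]]
print(mapping)
print(features)
```

Because the mapping is learned from the training values only, a label that first appears at prediction time still receives a valid index instead of raising an error.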


We create the Pipeline:

In [None]:
# All StringIndexer stages defined above, followed by the VectorAssembler.
pipeline_stages = indexers + [assembler]

pipeline = Pipeline(stages=pipeline_stages)

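Next we fit the pipeline on the training set. To do so, we first convert the pandas-on-Spark dataframe back to a Spark dataframe and then split it into training and test sets, according to whether **SalePrice** is present.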



In [32]:
full_df = full_psdf.to_spark()

In [33]:
full_df.show(5)

+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition2|BldgTy

In [34]:
train_df = full_df.filter(full_df.SalePrice.isNotNull())
test_df = full_df.filter(full_df.SalePrice.isNull())

We train our estimator:



In [35]:
preprocessing_pl = pipeline.fit(train_df)

Once trained, we apply it to the dataset.



In [36]:
feat_train_df = preprocessing_pl.transform(train_df)
feat_train_df = feat_train_df.select(feat_train_df.features, feat_train_df.SalePrice.alias('label'))

## Models

With our dataframe structured correctly, we move on to the models, remembering to split the labeled data into a training part and a held-out part. In this case, we will use the **GeneralizedLinearRegression**, **FMRegressor** and **RandomForestRegressor** models.

In [37]:
train, test = feat_train_df.randomSplit([0.85, 0.25], seed=1234)
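Note that the weights passed to `randomSplit` here add up to 1.10 rather than 1. Spark normalizes the weights when they do not sum to 1, so the effective proportions can be checked with a few lines of plain Python (a small sketch; the variable names are only illustrative):

```python
weights = [0.85, 0.25]

# Divide each weight by the total to obtain the effective proportions.
total = sum(weights)
normalized = [w / total for w in weights]

print([round(w, 4) for w in normalized])
```

The split is therefore roughly 77% for training and 23% for the held-out part, and, as with any random split, the exact row counts vary around those proportions.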

In [38]:
rf = RandomForestRegressor(predictionCol='prediction', maxBins=1200, seed=1234)
fm = FMRegressor(predictionCol='prediction', seed=1234)
glm = GeneralizedLinearRegression(predictionCol='prediction')

Once the models are created, we need to train them.

In [39]:
rf_model = rf.fit(train)
fm_model = fm.fit(train)
glm_model = glm.fit(train)

The **featureImportances** attribute of the fitted Random Forest model gives the relative importance of each variable for predicting the response. As shown below, many variables have an importance of zero, so they contribute nothing to this model's prediction of housing prices.

### Variable Importance

In [40]:
rf_fi = pd.DataFrame({'feature': selected_features, 'importance_rf': rf_model.featureImportances.toArray()})
rf_fi.sort_values(by='importance_rf', ascending=False)

| | feature | importance_rf |
|---|---|---|
| 37 | TotalBsmtSF_n | 0.372815 |
| 36 | BsmtUnfSF_n | 0.228358 |
| 33 | BsmtFinSF1_n | 0.102087 |
| 16 | OverallQual | 0.086950 |
| 11 | Neighborhood_n | 0.076845 |
| ... | ... | ... |
| 13 | Condition2_n | 0.000000 |
| 50 | Functional_n | 0.000000 |
| 7 | LandContour_n | 0.000000 |
| 8 | Utilities_n | 0.000000 |


We can also inspect the **coefficients** attribute of the fitted generalized linear model, which provides insight into how each variable affects the response.

In [41]:
glm_fi = pd.DataFrame({'feature': selected_features, 'coef_glm': glm_model.coefficients.toArray()})
glm_fi.sort_values(by='coef_glm', ascending=False)

| | feature | coef_glm |
|---|---|---|
| 65 | PoolQC_n | 29315.188234 |
| 16 | OverallQual | 16062.063975 |
| 26 | ExterQual_n | 14142.659462 |
| 48 | KitchenQual_n | 6244.639372 |
| 17 | OverallCond | 5987.652449 |
| ... | ... | ... |
| 14 | BldgType_n | -6927.361505 |
| 67 | MiscFeature_n | -7692.619126 |
| 30 | BsmtCond_n | -8083.726757 |
| 4 | Street_n | -36793.072086 |
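Since the model was created with the default settings (Gaussian family with the identity link), each coefficient can be read as the change in the predicted price per unit increase of that variable, holding the others fixed. The sketch below illustrates this with two coefficients from the table; the intercept, the `linear_prediction` helper and the two example houses are hypothetical.

```python
def linear_prediction(intercept, coefs, row):
    """Identity-link prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)

# Two coefficients taken from the table above; everything else is made up.
coefs = {"OverallQual": 16062.063975, "OverallCond": 5987.652449}
intercept = 0.0
house = {"OverallQual": 6, "OverallCond": 5}
better_house = {"OverallQual": 7, "OverallCond": 5}

# Raising OverallQual by one point shifts the prediction by its coefficient.
delta = linear_prediction(intercept, coefs, better_house) - linear_prediction(intercept, coefs, house)
print(round(delta, 2))
```

For the indexed categorical variables (those ending in `_n`), the index reflects how frequent a category is rather than any natural order, so their coefficients should be interpreted with caution.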


The next step is to predict the response variable values in the test set using the trained models.



In [42]:
test_rf = rf_model.transform(test)
test_fm = fm_model.transform(test)
test_glm = glm_model.transform(test)

In [43]:
test_rf.show(10)

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(73,[0,1,2,3,4,7,...| 81000|145588.19537774537|
|(73,[0,1,2,3,5,6,...|159434|190880.20482002525|
|(73,[0,1,2,3,5,9,...|124000|396958.11638410855|
|(73,[0,1,2,3,5,9,...|146000|259827.80964800817|
|(73,[0,1,2,3,5,9,...|208900| 418593.2157856221|
|(73,[0,1,2,3,5,9,...|256000| 500930.9595161291|
|(73,[0,1,2,3,5,9,...|155000| 290122.3575140253|
|(73,[0,1,2,3,5,9,...|110000|158354.76429315336|
|(73,[0,1,2,3,5,11...|172500|  313988.247326519|
|(73,[0,1,2,3,5,11...|163000|375316.07906016486|
+--------------------+------+------------------+
only showing top 10 rows



In [44]:
test_fm.show(10)

+--------------------+------+--------------------+
|            features| label|          prediction|
+--------------------+------+--------------------+
|(73,[0,1,2,3,4,7,...| 81000|-1.29047646986040...|
|(73,[0,1,2,3,5,6,...|159434|  7053461.9364781715|
|(73,[0,1,2,3,5,9,...|124000|   3787070.683827903|
|(73,[0,1,2,3,5,9,...|146000|   7220328.868796626|
|(73,[0,1,2,3,5,9,...|208900|   6810956.118645244|
|(73,[0,1,2,3,5,9,...|256000|   7665180.122098284|
|(73,[0,1,2,3,5,9,...|155000|1.0716601668574752E7|
|(73,[0,1,2,3,5,9,...|110000| -3530688.0832300046|
|(73,[0,1,2,3,5,11...|172500|1.1143175136869239E7|
|(73,[0,1,2,3,5,11...|163000|   3500173.118536112|
+--------------------+------+--------------------+
only showing top 10 rows



In [45]:
test_glm.show(10)

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(73,[0,1,2,3,4,7,...| 81000| 62742.39365498611|
|(73,[0,1,2,3,5,6,...|159434| 90884.38036025228|
|(73,[0,1,2,3,5,9,...|124000|125006.24136060773|
|(73,[0,1,2,3,5,9,...|146000|169172.71241342154|
|(73,[0,1,2,3,5,9,...|208900| 191998.8459166009|
|(73,[0,1,2,3,5,9,...|256000|249226.96352591645|
|(73,[0,1,2,3,5,9,...|155000| 197663.9742821874|
|(73,[0,1,2,3,5,9,...|110000| 97878.59222324309|
|(73,[0,1,2,3,5,11...|172500|210238.30778979877|
|(73,[0,1,2,3,5,11...|163000| 171010.4692046875|
+--------------------+------+------------------+
only showing top 10 rows



### Evaluation


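To evaluate our models, we need to implement our own evaluator, because the required metric is the root mean squared logarithmic error (RMSLE), that is, the RMSE computed on the log-transformed labels and predictions, which RegressionEvaluator does not offer. Below is the Python implementation of this evaluator, provided in the working material.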
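Written out, the metric computed by the `rmsle` method of the evaluator is the following, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price and $n$ the number of rows (this notation is ours, not from the working material):

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1 + y_i) - \log(1 + \hat{y}_i)\bigr)^2}
$$

Working on the logarithmic scale means the metric penalizes relative rather than absolute errors, so a cheap and an expensive house that are both mispredicted by the same percentage contribute equally.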



In [46]:
from pyspark.ml.evaluation import Evaluator
from math import sqrt
from operator import add
import pyspark.sql.functions as F

class RmsleEvaluator(Evaluator):
    '''
    Custom evaluator that computes the root mean squared logarithmic error (RMSLE)
    between the prediction column and the target column. RegressionEvaluator does
    not provide this metric, so it is implemented here.
    '''
    def __init__(self,predictionCol='prediction', targetCol='label'):
        super(RmsleEvaluator, self).__init__()
        self.predictionCol=predictionCol
        self.targetCol=targetCol

    def _evaluate(self, dataset):
        error=self.rmsle(dataset,self.predictionCol,self.targetCol)
        print ("Error: {}".format(error))
        return error

    def isLargerBetter(self):
        return False

    @staticmethod
    def rmsle(dataset,predictionCol,targetCol):
        return sqrt(dataset.select(F.avg((F.log1p(dataset[targetCol]) - F.log1p(dataset[predictionCol])) ** 2)).first()[0])


In [47]:
evaluator = RmsleEvaluator()

Once we have our evaluator, we use the evaluate method to obtain the metrics.



In [48]:
rf_rmsle=evaluator.evaluate(test_rf)
fm_rmsle=evaluator.evaluate(test_fm)
glm_rmsle=evaluator.evaluate(test_glm)

Error: 0.7397955073178369
Error: 3.1084327519828614
Error: 0.19010416345917983


In [50]:
rmsle_dict = {'rf': rf_rmsle, 'fm': fm_rmsle, 'glm': glm_rmsle}
rmsle_dict

{'rf': 0.7397955073178369,
 'fm': 3.1084327519828614,
 'glm': 0.19010416345917983}

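As observed, the generalized linear model fits our data best, with the smallest RMSLE (about 0.19, versus 0.74 for the random forest and 3.11 for the factorization machine). This agrees with the prediction tables above: in the rows shown, the GLM predictions are closest to the labels, the random forest overestimates them, and the factorization machine is off by orders of magnitude, with some negative predictions.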
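The ranking between the random forest and the generalized linear model can be checked by hand on the first three rows of the prediction tables shown earlier. This is a minimal plain-Python sketch of the same formula used by the evaluator; the `rmsle` function here is our own helper, and three rows are of course not representative of the whole held-out set.

```python
from math import log1p, sqrt

def rmsle(labels, predictions):
    """Root mean squared logarithmic error over paired labels and predictions."""
    squared = [(log1p(y) - log1p(p)) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared) / len(squared))

# First three rows of the test_rf and test_glm tables.
labels = [81000, 159434, 124000]
rf_preds = [145588.19537774537, 190880.20482002525, 396958.11638410855]
glm_preds = [62742.39365498611, 90884.38036025228, 125006.24136060773]

print("rf :", rmsle(labels, rf_preds))
print("glm:", rmsle(labels, glm_preds))
```

The factorization machine is left out of this check because several of its predictions are negative, and log(1 + x) is undefined for x ≤ -1.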



In [51]:
spark.stop()

## Additional Notes:

I am aware that I could have automated the pipeline to avoid writing all the String Indexers by formatting the column names of the dataframe. However, due to the pressure of other subjects and the time required by my job, I decided to proceed this way. It is the method I found to make it work, even though it does not strictly follow the standard procedures for addressing a Big Data problem.
