# House-prices-modeling

### 1. Data Setup

In [2]:
import pandas as pd

# Load the datasets. No manual split is needed because Kaggle already provides them separated.
# TODO: replace these absolute paths with relative ones.
train = pd.read_csv('/Users/paulayagoesparza/Downloads/train.csv')
test = pd.read_csv('/Users/paulayagoesparza/Downloads/test.csv')
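
The note about relative paths could be addressed along these lines. This is only a sketch: the `data` folder name and the `load_datasets` helper are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir="data"):
    """Read train.csv and test.csv from a folder relative to the working directory.

    `data_dir` is a hypothetical location; point it at wherever the Kaggle files live.
    """
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test
```

With the files stored next to the notebook in a `data` folder, `train, test = load_datasets()` would replace the two absolute-path calls.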

In [3]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### 2. Feature Selection

Continuous Features:

- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet

Categorical Features:

- Neighborhood: Physical locations within Ames city limits
- BldgType: Type of dwelling
- MSZoning: Identifies the general zoning classification of the sale.

In [4]:
train['LotFrontage']

0       65.0
1       80.0
2       68.0
3       60.0
4       84.0
        ... 
1455    62.0
1456    85.0
1457    66.0
1458    68.0
1459    75.0
Name: LotFrontage, Length: 1460, dtype: float64

In [5]:
train['Neighborhood']

0       CollgCr
1       Veenker
2       CollgCr
3       Crawfor
4       NoRidge
         ...   
1455    Gilbert
1456     NWAmes
1457    Crawfor
1458      NAmes
1459    Edwards
Name: Neighborhood, Length: 1460, dtype: object

### 3. Feature processing

In [6]:
#TRAIN DATASET
# First check if there are missing values

columns = ['LotFrontage', 'LotArea', 'Neighborhood', 'BldgType', 'MSZoning']
for column in columns:
    null_rows = train[train[column].isnull()]
    if null_rows.empty:
        print('No missing values in column', column)
    else:
        print('Missing values in column', column, 'in the rows:')
        print(null_rows)

Missing values in column LotFrontage in the rows:
        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
7        8          60       RL          NaN    10382   Pave   NaN      IR1   
12      13          20       RL          NaN    12968   Pave   NaN      IR2   
14      15          20       RL          NaN    10920   Pave   NaN      IR1   
16      17          20       RL          NaN    11241   Pave   NaN      IR1   
24      25          20       RL          NaN     8246   Pave   NaN      IR1   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1429  1430          20       RL          NaN    12546   Pave   NaN      IR1   
1431  1432         120       RL          NaN     4928   Pave   NaN      IR1   
1441  1442         120       RM          NaN     4426   Pave   NaN      Reg   
1443  1444          30       RL          NaN     8854   Pave   NaN      Reg   
1446  1447          20       RL          NaN    26142   Pave   NaN      IR1   

 

So the column 'LotFrontage' has missing values: 259 rows in total (the number of rows in the DataFrame printed above). I want to replace those values with the mean, which keeps the column mean unchanged, although it does shrink the variance somewhat.
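
A more compact way to count the gaps, and to see what mean imputation does to the spread of a column, might look like the sketch below. The values are made up and only stand in for `LotFrontage`.

```python
import numpy as np
import pandas as pd

# Toy column standing in for LotFrontage (values are made up).
toy = pd.DataFrame({"LotFrontage": [65.0, 80.0, np.nan, 60.0, np.nan, 84.0]})

# Number of missing values per column, without printing whole rows.
missing_counts = toy.isna().sum()
print(missing_counts)

# Mean imputation: the mean is preserved, the standard deviation gets smaller.
filled = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())
print("mean before/after:", toy["LotFrontage"].mean(), filled.mean())
print("std before/after:", toy["LotFrontage"].std(), filled.std())
```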

In [7]:
# calculating the mean
mean_lotfrontage = train['LotFrontage'].mean()

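# replacing the missing values with the mean
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
train['LotFrontage'] = train['LotFrontage'].fillna(mean_lotfrontage)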

In [8]:
# let's do it again to check that it was done correctly
columns = ['LotFrontage']
for column in columns:
    null_rows = train[train[column].isnull()]
    if null_rows.empty:
        print('No missing values in column', column)
    else:
        print('Missing values in column', column, 'in the rows:')
        print(null_rows)

No missing values in column LotFrontage


In [9]:
# TEST DATASET
# First check if there are missing values

columns = ['LotFrontage', 'LotArea', 'Neighborhood', 'BldgType', 'MSZoning']
for column in columns:
    null_rows = test[test[column].isnull()]
    if null_rows.empty:
        print('No missing values in column', column)
    else:
        print('Missing values in column', column, 'in the rows:')
        print(null_rows)

Missing values in column LotFrontage in the rows:
        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
6     1467          20       RL          NaN     7980   Pave   NaN      IR1   
40    1501         160       FV          NaN     2980   Pave   NaN      Reg   
41    1502         160       FV          NaN     2403   Pave   NaN      IR1   
45    1506          20       RL          NaN    10456   Pave   NaN      IR1   
47    1508          50       RL          NaN    18837   Pave   NaN      IR1   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1387  2848          20       RL          NaN    11088   Pave   NaN      Reg   
1390  2851          60       RL          NaN    21533   Pave   NaN      IR2   
1440  2901          20       RL          NaN    50102   Pave   NaN      IR1   
1441  2902          20       RL          NaN     8098   Pave   NaN      IR1   
1448  2909          90       RL          NaN    11836   Pave   NaN      IR1   

 

I will follow the same procedure as with the train dataset. In this case two features have missing values: one continuous (LotFrontage) and one categorical (MSZoning).
I will therefore replace the missing values of the continuous variable with the mean and those of the categorical variable with the mode.

In [10]:
# TEST DATASET

# calculating the mean (continuous variable)
mean_lotfrontage_test = test['LotFrontage'].mean()

# replacing the missing values with the mean
test['LotFrontage'] = test['LotFrontage'].fillna(mean_lotfrontage_test)

# calculating the mode (categorical variable)
# .mode() returns a Series, so take its first entry to get a single value
mode_mszoning_test = test['MSZoning'].mode()[0]
test['MSZoning'] = test['MSZoning'].fillna(mode_mszoning_test)
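
One detail worth illustrating: `Series.mode()` returns a Series, and `fillna` with a Series fills by index alignment, so only rows whose index matches get filled. The toy frames below are made up. The second half shows a common variant in which the fill values are computed on the train set and reused on the test set; that is an alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the train and test frames.
toy_train = pd.DataFrame({"LotFrontage": [60.0, 70.0, 80.0],
                          "MSZoning": ["RL", "RL", "RM"]})
toy_test = pd.DataFrame({"LotFrontage": [65.0, np.nan, 75.0, 70.0],
                         "MSZoning": ["RL", "RL", np.nan, "RM"]})

# Pitfall: mode() is a Series with index 0, so fillna aligns on the index
# and the missing value at index 2 is left untouched.
mode_series = toy_test["MSZoning"].mode()
still_missing = toy_test["MSZoning"].fillna(mode_series).isna().sum()
print("missing after fillna(Series):", still_missing)

# Variant: compute the fill values on the train set and reuse them on the test set.
fill_values = {
    "LotFrontage": toy_train["LotFrontage"].mean(),
    "MSZoning": toy_train["MSZoning"].mode()[0],
}
toy_test_filled = toy_test.fillna(value=fill_values)
print(toy_test_filled)
```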

### Scaling
I don't fully understand this concept yet, nor the methods available to achieve it.
I also don't know which one is the most suitable for my features.
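
For reference, scaling puts numeric features on a comparable range, so that a feature measured in large units (LotArea, in square feet) does not dwarf one measured in small units (LotFrontage, in feet). A minimal sketch with scikit-learn's `StandardScaler` follows. The toy values are made up, and whether standardization is the best choice for these features remains an open question here, not a recommendation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for the two continuous features.
toy_train = pd.DataFrame({"LotFrontage": [65.0, 80.0, 68.0, 60.0],
                          "LotArea": [8450.0, 9600.0, 11250.0, 9550.0]})
toy_test = pd.DataFrame({"LotFrontage": [70.0, 85.0],
                         "LotArea": [9000.0, 12000.0]})

scaler = StandardScaler()

# Learn the mean and standard deviation from the train set only...
train_scaled = scaler.fit_transform(toy_train)
# ...and apply the same transformation to the test set.
test_scaled = scaler.transform(toy_test)

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
```

Plain linear regression gives the same predictions with or without this step; scaling matters more for regularized or distance-based models.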

### Encoding
Encoding features means converting categorical or textual data into numerical representations that can be used as input for machine learning models.

In [11]:
from sklearn.preprocessing import LabelEncoder
#TRAIN DATASET
# I chose to use LabelEncoder for all categorical features.
le = LabelEncoder()

train['Neighborhood'] = le.fit_transform(train['Neighborhood'])
train['BldgType'] = le.fit_transform(train['BldgType'])
train['MSZoning'] = le.fit_transform(train['MSZoning'])

In [15]:
# what the column looks like now
train['MSZoning']

0       3
1       3
2       3
3       3
4       3
       ..
1455    3
1456    3
1457    3
1458    3
1459    3
Name: MSZoning, Length: 1460, dtype: int64

In [13]:
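#TEST DATASET
# I chose to use LabelEncoder for all categorical features.
# Note: this fits a fresh encoder on the test set, so the integer codes only match
# those of the train set if both sets contain exactly the same categories.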
le = LabelEncoder()

test['Neighborhood'] = le.fit_transform(test['Neighborhood'])
test['BldgType'] = le.fit_transform(test['BldgType'])
test['MSZoning'] = le.fit_transform(test['MSZoning'])
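
Fitting separate encoders on train and test can silently give the same category different integers. One way to keep the mapping consistent is to fit once on the train set and reuse that fit on the test set. The sketch below uses scikit-learn's `OrdinalEncoder` rather than `LabelEncoder`, a swapped-in technique, since `LabelEncoder` is intended for target values and cannot handle unseen categories. The toy rows and the choice of -1 for unknown categories are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

categorical = ["Neighborhood", "BldgType", "MSZoning"]

# Made-up rows; "Blueste" appears only in the toy test set.
toy_train = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "NAmes"],
                          "BldgType": ["1Fam", "Duplex", "1Fam"],
                          "MSZoning": ["RL", "RM", "RL"]})
toy_test = pd.DataFrame({"Neighborhood": ["NAmes", "Blueste"],
                         "BldgType": ["Duplex", "1Fam"],
                         "MSZoning": ["RM", "RL"]})

# Categories never seen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the train set only, then reuse the fitted mapping on the test set.
train_encoded = pd.DataFrame(encoder.fit_transform(toy_train[categorical]),
                             columns=categorical, index=toy_train.index)
test_encoded = pd.DataFrame(encoder.transform(toy_test[categorical]),
                            columns=categorical, index=toy_test.index)

print(test_encoded)
```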


### One-hot encoding
`get_dummies`: a one-hot encoding technique that turns each category into its own binary (0/1) column.
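
A small sketch of what `pd.get_dummies` produces, using made-up rows. The `reindex` step is one possible way to give the test set the same columns as the train set and is an assumption of this example.

```python
import pandas as pd

# Made-up rows; the toy test set lacks "RM" and contains "FV", which train lacks.
toy_train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
toy_test = pd.DataFrame({"MSZoning": ["RL", "FV"]})

# One binary column per category.
train_dummies = pd.get_dummies(toy_train, columns=["MSZoning"], dtype=int)

# Align the test columns to the train columns: missing ones are filled with 0,
# and categories unseen in train (here "FV") are dropped.
test_dummies = pd.get_dummies(toy_test, columns=["MSZoning"], dtype=int)
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(test_dummies)
```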

### 4. Model training
Using Linear Regression

In [13]:
from sklearn.linear_model import LinearRegression

# feature selection
X_train = train[['LotFrontage', 'LotArea', 'Neighborhood', 'BldgType', 'MSZoning']]
y_train = train['SalePrice']

model = LinearRegression()

# fit the model 
model.fit(X_train, y_train)

I want to test the algorithm, but to do that I also have to encode the categorical variables of the test dataset.
I get an error because these features contain values that were not in the train dataset: when the encoder tries to assign a numerical value (to replace the string), it fails because the value does not match any of the values it saw in the train dataset.
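
The error described above can be reproduced in isolation. The category lists below are made up; the point is only that a `LabelEncoder` fitted on one set of labels raises a `ValueError` when asked to transform a label it has not seen.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category lists; "Blueste" is not present in the toy train list.
train_neighborhoods = ["CollgCr", "Veenker", "NAmes"]
test_neighborhoods = ["NAmes", "Blueste"]

le = LabelEncoder()
le.fit(train_neighborhoods)

try:
    le.transform(test_neighborhoods)
    error_message = None
except ValueError as exc:
    # LabelEncoder refuses labels it did not see during fit.
    error_message = str(exc)

print(error_message)
```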

In [14]:
# let's test the algorithm

X_test = test[['LotFrontage', 'LotArea', 'Neighborhood', 'BldgType', 'MSZoning']]
#y_test = test['SalePrice']

y_pred = model.predict(X_test)

In [15]:
import numpy as np
from sklearn.metrics import mean_squared_log_error

def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(rmsle, precision)
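
For reference, the quantity this function computes can be written out as follows, where $n$ is the number of samples, $y_i$ are the true values and $\hat{y}_i$ are the predictions. scikit-learn's `mean_squared_log_error` works with $\log(1 + x)$ rather than $\log x$, so zero values are allowed.

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2}
$$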

To evaluate the model I should compare the 'SalePrice' predicted by my model with the true 'SalePrice' of the test dataset. However, the Kaggle test dataset has no 'SalePrice' column, so I cannot compare them. In the cell below I pass `y_pred` as both arguments, so the RMSLE is 0: the predictions are being compared with themselves, and the value says nothing about the quality of the model.

In [16]:
rmsle = compute_rmsle(y_pred, y_pred)
print("RMSLE:", rmsle)

RMSLE: 0.0
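
Since the Kaggle test set has no labels, one common workaround is to hold out part of the train set and compute the RMSLE there. The sketch below runs on synthetic data (all numbers are made up) with only the two continuous features; the 80/20 split and the clipping of negative predictions are assumptions of this example, not steps from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared train DataFrame (all numbers are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "LotFrontage": rng.uniform(40, 100, n),
    "LotArea": rng.uniform(5000, 15000, n),
})
toy["SalePrice"] = (50000 + 1000 * toy["LotFrontage"] + 10 * toy["LotArea"]
                    + rng.normal(0, 10000, n))

X = toy[["LotFrontage", "LotArea"]]
y = toy["SalePrice"]

# Keep 20% of the labelled rows aside for evaluation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_fit, y_fit)

# RMSLE needs non-negative values, so clip any negative prediction.
y_val_pred = np.clip(model.predict(X_val), 0, None)

rmsle = np.sqrt(mean_squared_log_error(y_val, y_val_pred))
print("Validation RMSLE:", round(rmsle, 2))
```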
