<a href="https://colab.research.google.com/github/skadiddles/CCMACLRL_EXERCISES_COM232_/blob/main/Exercise6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 6: Choosing the best-performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission file as the template for your submission format
- Train and compare all of the regression models listed below (KNN, SVM, Decision Tree, Random Forest)

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [None]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Dataset File

In [None]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [None]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt = pd.read_csv(test_url)

In [None]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [None]:
sample_submission_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'

sf = pd.read_csv(sample_submission_url)

In [None]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB
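
Since the submission has to match the sample file, a quick format check before uploading can catch mistakes early. The sketch below is only an illustration: `check_submission_format` is a hypothetical helper (not part of pandas or Kaggle), and the tiny frames are made-up stand-ins for `sf` and a real submission.

```python
import pandas as pd

def check_submission_format(submission, sample):
    """Return True if `submission` has the same columns, row count and Ids as `sample`."""
    return (
        list(submission.columns) == list(sample.columns)
        and len(submission) == len(sample)
        and submission["Id"].equals(sample["Id"])
    )

# Made-up stand-ins for the sample file and two candidate submissions
sample = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 155000.5, 180250.0]})
bad = good.rename(columns={"SalePrice": "Price"})  # wrong column name

print(check_submission_format(good, sample))  # True
print(check_submission_format(bad, sample))   # False
```

With the real data, the same helper could be called as `check_submission_format(submission_df, sf)` once the submission frame exists.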


In [None]:
X = df.drop(columns=["SalePrice", "Id"])
y = df["SalePrice"]

# Impute missing values: median for numeric columns, mode for categorical ones.
# Assign the result back instead of calling fillna(..., inplace=True) on X[col]:
# the chained inplace call works on a temporary copy, raises a pandas warning,
# and stops working in pandas 3.0.
for col in X.select_dtypes(include=["int64", "float64"]).columns:
    X[col] = X[col].fillna(X[col].median())

for col in X.select_dtypes(include=["object"]).columns:
    X[col] = X[col].fillna(X[col].mode()[0])

# One-hot encode the categorical columns
X = pd.get_dummies(X, drop_first=True)

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

def cross_val(model, X, y, k=5):
    """Manual k-fold cross-validation: mean R^2 over k contiguous, unshuffled folds."""
    fold_size = len(X) // k
    scores = []

    for i in range(k):
        start = i * fold_size
        end = start + fold_size

        # Fold i is held out for validation
        x_val = X.iloc[start:end]
        y_val = y.iloc[start:end]

        # All remaining rows are used for training
        x_tr = pd.concat([X.iloc[:start], X.iloc[end:]])
        y_tr = pd.concat([y.iloc[:start], y.iloc[end:]])

        model.fit(x_tr, y_tr)
        scores.append(model.score(x_val, y_val))

    return sum(scores) / len(scores)

# Maps each model name to its test-set R^2 score
score_list = {}

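The `cross_val` helper above slices the rows into `k` contiguous blocks of `len(X) // k` rows. The sketch below isolates just that index arithmetic to show which rows end up in each validation fold; `fold_bounds` and the row count of 23 are illustrative assumptions, not part of the exercise.

```python
def fold_bounds(n_rows, k=5):
    """Return the (start, end) row positions of each validation fold."""
    fold_size = n_rows // k
    return [(i * fold_size, (i + 1) * fold_size) for i in range(k)]

bounds = fold_bounds(23, k=5)
print(bounds)  # [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]

# Rows 20, 21 and 22 are never used for validation because 23 is not divisible by 5
never_validated = 23 - bounds[-1][1]
print(never_validated)  # 3
```

Two consequences follow from this scheme: leftover rows only ever appear in the training part, and because the rows are not shuffled, the folds depend on the order of the rows in the file.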

## 1. Train a KNN Regressor

In [None]:
KNN = KNeighborsRegressor(n_neighbors=22)
KNN.fit(x_train, y_train)
knn_score = KNN.score(x_test, y_test)
score_list["KNN Regressor"] = knn_score
print(f"KNN Test Score is {knn_score}")

KNN Test Score is 0.5980835491666139


- Perform cross validation

In [None]:

knn_cv = cross_val(KNN, X, y, k=5)
print(f"KNN Cross-Validation Score is {knn_cv}")

KNN Cross-Validation Score is 0.6085890609548947
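
As a cross-check on the hand-written helper, scikit-learn's `cross_val_score` with a shuffled `KFold` does the same job. This is a minimal sketch on synthetic data from `make_regression`, used here as a stand-in for the encoded house-price features; the sample size, feature count and `n_neighbors=5` are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the encoded house-price features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Shuffle the rows before splitting so the folds do not depend on file order
folds = KFold(n_splits=5, shuffle=True, random_state=1)

# For regressors the default score is R^2, the same metric as model.score()
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X_demo, y_demo, cv=folds)

print(scores)         # one R^2 value per fold
print(scores.mean())  # comparable to the value returned by cross_val()
```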


## 2. Train an SVM Regressor

In [None]:
SVM = SVR(kernel="rbf", C=100, gamma=0.1)
SVM.fit(x_train, y_train)
svm_score = SVM.score(x_test, y_test)
score_list["SVM Regressor"] = svm_score
print(f"SVM Test Score is {svm_score}")

SVM Test Score is -0.037962385609509264


- Perform cross validation

In [None]:
svm_cv = cross_val(SVM, X, y, k=5)
print(f"SVM Cross-Validation Score is {svm_cv}")

SVM Cross-Validation Score is -0.05202678413481774
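
A likely reason for the near-zero SVM scores is scale: an RBF-kernel `SVR` is sensitive to the magnitude of both the features and the target, and neither is standardized here. The sketch below is an assumption-laden illustration on synthetic data (not the house-price set) that compares the same `SVR` settings with and without scaling, using `StandardScaler` for the features and `TransformedTargetRegressor` for the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with very different feature scales and a price-like target (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=1)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])
y_demo = 180_000 + 500 * y_demo

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Same hyperparameters as the notebook's SVM, fitted on unscaled data
raw_svr = SVR(kernel="rbf", C=100, gamma=0.1).fit(X_tr, y_tr)

# Standardize the features inside a pipeline and the target via TransformedTargetRegressor
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, gamma=0.1)),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

raw_score = raw_svr.score(X_te, y_te)
scaled_score = scaled_svr.score(X_te, y_te)
print(f"raw R^2: {raw_score:.3f}, scaled R^2: {scaled_score:.3f}")
```

Whether scaling helps as much on the real data would need to be checked by running the same comparison on `x_train` and `x_test`.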


## 3. Train a Decision Tree Regressor

In [None]:
DT = DecisionTreeRegressor(random_state=1)
DT.fit(x_train, y_train)
dt_score = DT.score(x_test, y_test)
score_list["Decision Tree Regressor"] = dt_score
print(f"Decision Tree Test Score is {dt_score}")

Decision Tree Test Score is 0.751032883968054


- Perform cross validation

In [None]:
dt_cv = cross_val(DT, X, y, k=5)
print(f"Decision Tree Cross-Validation Score is {dt_cv}")

Decision Tree Cross-Validation Score is 0.7241490299673856


## 4. Train a Random Forest Regressor

In [None]:
RF = RandomForestRegressor(n_estimators=200, random_state=1)
RF.fit(x_train, y_train)
rf_score = RF.score(x_test, y_test)
score_list["Random Forest Regressor"] = rf_score
print(f"Random Forest Test Score is {rf_score}")
rf_cv = cross_val(RF, X, y, k=5)
print(f"Random Forest Cross-Validation Score is {rf_cv}")

## 5. Compare the performance of all regression models

In [None]:
# Keep score_list as a dict. Reassigning it to a list here makes the
# score_list["..."] assignments in the cells above fail with a TypeError
# when they are re-run, and makes this cell fail on its second run.
for alg, score in score_list.items():
    print(f"{alg} Score is {score:.2f}")

## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.
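
One way to make that choice explicit is to rank the recorded scores. The sketch below uses only the test-set R² values printed earlier in this notebook; the Random Forest is left out because no score was recorded for it, so the result only ranks the three models that have one. The name `recorded_scores` is an assumption for this illustration.

```python
# Test-set R^2 values copied from the outputs above (no Random Forest score was recorded)
recorded_scores = {
    "KNN Regressor": 0.5980835491666139,
    "SVM Regressor": -0.037962385609509264,
    "Decision Tree Regressor": 0.751032883968054,
}

# Sort from highest to lowest R^2
ranked = sorted(recorded_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

best_name, best_score = ranked[0]
print(best_name)  # Decision Tree Regressor
```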

In [None]:
# Preprocess the test data in the same way as the training data
# Drop the 'Id' column as it is not a feature
dt_processed = dt.drop(columns=["Id"])

# Handle missing values in the test data (median for numerical, mode for categorical).
# Assign the result back rather than calling fillna(..., inplace=True) on a column:
# the chained inplace call raises a pandas warning and stops working in pandas 3.0.
for col in dt_processed.select_dtypes(include=["int64", "float64"]).columns:
    dt_processed[col] = dt_processed[col].fillna(dt_processed[col].median())

for col in dt_processed.select_dtypes(include=["object"]).columns:
    dt_processed[col] = dt_processed[col].fillna(dt_processed[col].mode()[0])

# Apply one-hot encoding to the test data
dt_processed = pd.get_dummies(dt_processed, drop_first=True)

# Align columns with the training features: add missing columns as 0 and drop extras
train_cols = X.columns
dt_processed = dt_processed.reindex(columns=train_cols, fill_value=0)

# Make predictions using the trained Random Forest model
y_pred = RF.predict(dt_processed)

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'Id': dt['Id'],  # Use the original Id column from the test data
    'SalePrice': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv

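Encoding the training and test frames separately with `drop_first=True` and then aligning them with `reindex` can, in some cases, mislabel rows: if the test data lacks a category that the training data has, a different level may be dropped on each side. The sketch below shows this on a made-up two-column example and one possible alternative (encoding both frames together); the toy values are assumptions, and it is not verified here how often this affects the actual house-price columns.

```python
import pandas as pd

# Made-up example: "Grvl" appears in train but not in test
train = pd.DataFrame({"Street": ["Grvl", "Pave", "Pave"], "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11622, 14267]})

# Separate encoding: test only contains "Pave", so drop_first removes that level
train_enc = pd.get_dummies(train, drop_first=True)
test_naive = pd.get_dummies(test, drop_first=True).reindex(
    columns=train_enc.columns, fill_value=0
)
print(int(test_naive["Street_Pave"].sum()))  # 0: both "Pave" rows are encoded as not "Pave"

# Joint encoding: categories are derived from train and test together
combined = pd.concat([train, test], keys=["train", "test"])
combined_enc = pd.get_dummies(combined, drop_first=True)
test_joint = combined_enc.loc["test"].reindex(columns=train_enc.columns, fill_value=0)
print(int(test_joint["Street_Pave"].sum()))  # 2: both rows are correctly marked as "Pave"
```

Joint encoding only uses the test features, not the target, to determine the set of categories.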