## Introduction
In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

Begin by running the code cell below to load the training data, engineer a few features, and fit a baseline random forest model.



In [None]:
# Import helpful libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the data, and separate the target
iowa_file_path = 'train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice

# Ordinal encoding for exterior quality (higher is better); values missing from the mapping become NaN
quality_mapping = {'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}
home_data['EncodedExteriorQual'] = home_data["ExterQual"].map(quality_mapping)

# Calculate neighborhood frequencies
neighborhood_frequencies = home_data['Neighborhood'].value_counts(normalize=True)

# Map frequencies to neighborhoods
home_data['EncodedNeighborhood_Freq'] = home_data['Neighborhood'].map(neighborhood_frequencies)

# Apply the same encoding to the test data
# test_data['EncodedNeighborhood_Freq'] = test_data['Neighborhood'].map(neighborhood_frequencies)

functional_mapping = {
    'Typ': 1,
    'Min1': 2,
    'Min2': 3,
    'Mod': 4,
    'Maj1': 5,
    'Maj2': 6,
    'Sev': 7,
    'Sal': 8,
    'Missing': 9
}
home_data['EncodedFunctional'] = home_data["Functional"].map(functional_mapping)
home_data['LotArea_Binned'] = pd.cut(home_data['LotArea'], 
                                bins=[0, 5000, 10000, 20000, 50000, np.inf], 
                                labels=['Very Small', 'Small', 'Medium', 'Large', 'Very Large'])

# Ordinal encoding
size_mapping = {'Very Small': 1, 'Small': 2, 'Medium': 3, 'Large': 4, 'Very Large': 5}
home_data['LotArea_Encoded'] = home_data['LotArea_Binned'].map(size_mapping)


# Create X (After completing the exercise, you can return to modify this line!)
features = ['MSSubClass', 'LotArea_Encoded', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 
            '1stFlrSF', '2ndFlrSF', 'EncodedExteriorQual',  'EncodedFunctional', 'EncodedNeighborhood_Freq']


# Select columns corresponding to features, and preview the data
X = home_data[features]
X.head()


# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define a random forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

# Reset the index of val_X and val_y so they align row by row with the predictions
val_X = val_X.reset_index(drop=True)
val_y = val_y.reset_index(drop=True)

# Create the DataFrame
comparison_df = pd.concat([val_X, val_y.rename('Actual'), pd.Series(rf_val_predictions, name='Predicted')], axis=1)
# comparison_df.to_csv('actual_vs_predicted_with_features.csv', index=False)
# print("File saved as 'actual_vs_predicted_with_features.csv'")



# Create a scatter plot to compare actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(val_y, rf_val_predictions, alpha=0.6, color='skyblue', edgecolor='k')
plt.plot([min(val_y), max(val_y)], [min(val_y), max(val_y)], color='red', linewidth=2, label='Ideal Prediction')

# Add labels and title
plt.xlabel("Actual SalePrice", fontsize=12)
plt.ylabel("Predicted SalePrice", fontsize=12)
plt.title("Actual vs Predicted SalePrice", fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()

# Show the plot
plt.show()
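
The frequency encoding used for `Neighborhood` in the cell above can be sketched on toy data. This is a minimal illustration, not part of the exercise: the two helper functions and the neighborhood names `A` to `D` are hypothetical. It shows frequencies being learned from the training column only and then reused for another frame, with categories that were never seen in training falling back to 0.

```python
import pandas as pd


def fit_frequency_encoding(train_column: pd.Series) -> pd.Series:
    """Return the share of training rows that fall in each category."""
    return train_column.value_counts(normalize=True)


def apply_frequency_encoding(column: pd.Series, frequencies: pd.Series) -> pd.Series:
    """Map categories to their training frequency; unseen categories become 0.0."""
    return column.map(frequencies).fillna(0.0)


# Toy data (hypothetical neighborhood names, not taken from the competition files)
toy_train = pd.DataFrame({"Neighborhood": ["A", "A", "B", "C"]})
toy_test = pd.DataFrame({"Neighborhood": ["A", "C", "D"]})

# Learn the frequencies once, from the training column only
toy_frequencies = fit_frequency_encoding(toy_train["Neighborhood"])

# Apply the same lookup table to both frames
toy_train["EncodedNeighborhood_Freq"] = apply_frequency_encoding(
    toy_train["Neighborhood"], toy_frequencies
)
toy_test["EncodedNeighborhood_Freq"] = apply_frequency_encoding(
    toy_test["Neighborhood"], toy_frequencies
)

print(toy_test)
```

Reusing one lookup table keeps the encoded value for a given neighborhood identical in every frame the model sees.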

## Train a model for the competition
The code cell above trains a Random Forest model on train_X and train_y.

In [6]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)

# Fit rf_model_on_full_data on all of the training data
rf_model_on_full_data.fit(X, y)

Now, read the file of "test" data, and apply your model to make predictions.

In [7]:
# path to file you will use for predictions
test_data_path = 'test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)


test_data['Functional'] = test_data['Functional'].fillna('Missing')

test_data['EncodedExteriorQual'] = test_data["ExterQual"].map(quality_mapping)

# Reuse the neighborhood frequencies computed from the training data,
# so the feature means the same thing at training and prediction time.
# Neighborhoods that never appear in the training data get a frequency of 0.
test_data['EncodedNeighborhood_Freq'] = test_data['Neighborhood'].map(neighborhood_frequencies).fillna(0)

test_data['EncodedFunctional'] = test_data["Functional"].map(functional_mapping)
test_data['LotArea_Binned'] = pd.cut(test_data['LotArea'], 
                                bins=[0, 5000, 10000, 20000, 50000, np.inf], 
                                labels=['Very Small', 'Small', 'Medium', 'Large', 'Very Large'])

# Ordinal encoding
size_mapping = {'Very Small': 1, 'Small': 2, 'Medium': 3, 'Large': 4, 'Very Large': 5}
test_data['LotArea_Encoded'] = test_data['LotArea_Binned'].map(size_mapping)


# Create X (After completing the exercise, you can return to modify this line!)
features = ['MSSubClass', 'LotArea_Encoded', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 
            '1stFlrSF', '2ndFlrSF', 'EncodedExteriorQual',  'EncodedFunctional', 'EncodedNeighborhood_Freq']

# Fill the missing basement area with 0 (assign the result back rather than
# calling fillna with inplace=True on a column selection)
test_data['TotalBsmtSF'] = test_data['TotalBsmtSF'].fillna(0)

# Create test_X, which comes from test_data but includes only the columns used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# Check for missing values in the test set
missing_data_test = test_X.isnull().sum()

# Filter columns with missing values (non-zero counts)
missing_test_columns = missing_data_test[missing_data_test > 0]

print("\nMissing values in test data:")
print(missing_test_columns)

# Make the predictions that will be submitted
test_preds = rf_model_on_full_data.predict(test_X)


Missing values in test data:
Series([], dtype: int64)
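
The feature engineering steps are written out twice in this notebook, once for the training data and once for the test data. One way to keep the two copies from drifting apart is a single helper that both frames pass through. The sketch below is an assumption about how that could look, not part of the exercise: `add_engineered_features` is a hypothetical name, the rows are made up, and `labels=False` in `pd.cut` replaces the intermediate text labels used above while producing the same 1 to 5 codes.

```python
import numpy as np
import pandas as pd

# Mappings and bin edges copied from the cells above
QUALITY_MAPPING = {'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}
FUNCTIONAL_MAPPING = {'Typ': 1, 'Min1': 2, 'Min2': 3, 'Mod': 4,
                      'Maj1': 5, 'Maj2': 6, 'Sev': 7, 'Sal': 8, 'Missing': 9}
LOT_AREA_BINS = [0, 5000, 10000, 20000, 50000, np.inf]


def add_engineered_features(df, neighborhood_frequencies):
    """Hypothetical helper: return a copy of df with the encoded columns added."""
    out = df.copy()
    out['EncodedExteriorQual'] = out['ExterQual'].map(QUALITY_MAPPING)
    out['EncodedFunctional'] = out['Functional'].fillna('Missing').map(FUNCTIONAL_MAPPING)
    # Frequencies come from the training data; unseen neighborhoods get 0.0
    out['EncodedNeighborhood_Freq'] = (
        out['Neighborhood'].map(neighborhood_frequencies).fillna(0.0)
    )
    # labels=False returns the integer bin index (0-4), so add 1 to get codes 1-5
    out['LotArea_Encoded'] = pd.cut(out['LotArea'], bins=LOT_AREA_BINS, labels=False) + 1
    return out


# Toy rows (made-up values, not taken from train.csv or test.csv)
toy_train = pd.DataFrame({
    'ExterQual': ['TA', 'Gd', 'Ex'],
    'Functional': ['Typ', 'Min1', 'Typ'],
    'Neighborhood': ['NAmes', 'NAmes', 'Gilbert'],
    'LotArea': [4000, 8450, 25000],
})
toy_test = pd.DataFrame({
    'ExterQual': ['Fa', 'TA'],
    'Functional': [None, 'Mod'],
    'Neighborhood': ['Gilbert', 'Somerst'],
    'LotArea': [12000, 60000],
})

# Frequencies are learned from the training rows and reused for the test rows
toy_frequencies = toy_train['Neighborhood'].value_counts(normalize=True)
toy_train_encoded = add_engineered_features(toy_train, toy_frequencies)
toy_test_encoded = add_engineered_features(toy_test, toy_frequencies)

print(toy_test_encoded)
```

In the notebook, the same helper could be applied to `home_data` and `test_data` with the frequencies computed from `home_data['Neighborhood']`.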


## Generate a submission
Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [8]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('house_price_submission_1.csv', index=False)
print("completed")

completed
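
Before uploading, it can help to sanity-check the frame that was written to disk. The sketch below is one possible check, under stated assumptions: `check_submission` is a hypothetical helper, the ids and prices are made up, and the rules it enforces (the `Id` and `SalePrice` columns used above, ids matching the test data, no missing or non-positive prices) are assumptions about what a valid file looks like, not rules quoted from the competition.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    problems = []
    if list(submission.columns) != ['Id', 'SalePrice']:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test data")
    elif not (submission['Id'].to_numpy() == pd.Series(expected_ids).to_numpy()).all():
        problems.append("Id values differ from the test data")
    if submission['SalePrice'].isnull().any():
        problems.append("SalePrice contains missing values")
    if (submission['SalePrice'] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems


# Toy frames (made-up ids and prices)
good = pd.DataFrame({'Id': [1461, 1462], 'SalePrice': [120000.0, 155000.0]})
bad = pd.DataFrame({'Id': [1461, 1462], 'SalePrice': [120000.0, None]})

print(check_submission(good, [1461, 1462]))
print(check_submission(bad, [1461, 1462]))
```

In the notebook this could be called as `check_submission(output, test_data.Id)` just before `output.to_csv(...)`; an empty list means none of the assumed checks failed.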
