<a href="https://www.kaggle.com/code/uroojmurtaza/exercise-machine-learning-competitions?scriptVersionId=248866404" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---


# Introduction

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

Begin by running the code cell below to set up code checking and the filepaths for the dataset.

In [1]:
# Import useful libraries, packages and modules, and set up code checking

from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *


import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

#import various models to try and compare 

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

Here's some of the code you've written so far. Start by running it again.

In [2]:
######### Use training data ####### 
#######################################


####### Define file path and select features #############

home_data = pd.read_csv('../input/train.csv')
print (home_data.head())
Y=home_data.SalePrice

########## choose data ######### 

X=home_data[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
# Split X and Y into training and validation data

trainX, valX, trainY, valY = train_test_split(X, Y, random_state=1)

# Fit the model on the training split and predict on the validation split
Prediction = RandomForestRegressor(random_state=1).fit(trainX, trainY).predict(valX)

# Compute the mean squared error on the validation split (lower is better)
MeanSquareError = mean_squared_error(valY, Prediction)
print("MSE is:", MeanSquareError)

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   
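
MSE is expressed in squared dollars, which makes it hard to read as a price error. The sketch below shows how the same predictions could also be summarized in dollars with RMSE and MAE. The prices are hypothetical values chosen for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices and predictions in dollars (not taken from the dataset)
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 110_000.0])
y_pred = np.array([190_000.0, 165_000.0, 300_000.0, 120_000.0])

mse = mean_squared_error(y_true, y_pred)   # squared dollars
rmse = np.sqrt(mse)                        # dollars, penalizes large misses more
mae = mean_absolute_error(y_true, y_pred)  # dollars, average absolute miss

print(f"MSE: {mse:.0f}  RMSE: {rmse:.0f}  MAE: {mae:.0f}")
```

RMSE and MAE are both in the units of `SalePrice`, so either can be compared directly against typical house prices.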

In [3]:


# To improve accuracy, try XGBRegressor fitted on all of the training data

# Fit the model, then predict on the same rows it was trained on
Prediction = XGBRegressor(objective="reg:squarederror", random_state=1).fit(X, Y).predict(X)

# We didn't use HistGradientBoostingRegressor here because it had lower accuracy
# RandomForestRegressor doesn't handle NaN values, so we skipped it as well

# Mean squared error on the training rows; it understates the error on unseen data
MeanSquareError = mean_squared_error(Y, Prediction)
print("MSE is:", MeanSquareError)


____

MSE is: 33566892.66843599
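
The MSE above is measured on the same rows the model was fitted on, so it mostly shows how closely the model memorized the training data. A minimal sketch of the gap between training and validation error follows. It uses synthetic data and a random forest as stand-ins (both are assumptions, not the notebook's data or model).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=2.0, size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Error on rows the model has seen vs. rows it has not seen
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
val_mse = mean_squared_error(y_val, model.predict(X_val))

print(f"training MSE:   {train_mse:.2f}")
print(f"validation MSE: {val_mse:.2f}")
```

The validation figure is the one to use when comparing models, since leaderboard scores are computed on data the model has never seen.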




Now, read the file of "test" data, and apply your model to make predictions.

In [4]:
######### Using test data ####### 
#######################################

############# Loading Test and Training Data ##################
#############################################################

home_data = pd.read_csv('../input/train.csv')
Y = home_data.SalePrice
home_data_test = pd.read_csv('../input/test.csv')

########## Preprocessing Data #########
X=home_data[['BsmtExposure','Utilities','RoofMatl','Condition2','PoolQC','MiscFeature','Street','HeatingQC','ExterQual','MasVnrType','MSZoning','KitchenQual','GarageFinish','SaleCondition','SaleType','Fence','GarageCond','FireplaceQu','Functional','BsmtFinType1','BsmtCond','BsmtQual','Foundation','ExterCond','BldgType','Condition1','Neighborhood','LandSlope','LotConfig','LotShape','GarageArea', 'GarageCars','GarageYrBlt','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1','LotFrontage','Id','MSSubClass', 'LotArea','OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd','1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath','BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal','MoSold', 'YrSold']]
X_test=home_data_test[['BsmtExposure','Utilities','RoofMatl','Condition2','PoolQC','MiscFeature','Street','HeatingQC','ExterQual','MasVnrType','MSZoning','KitchenQual','GarageFinish','SaleCondition','SaleType','Fence','GarageCond','FireplaceQu','Functional','BsmtFinType1','BsmtCond','BsmtQual','Foundation','ExterCond','BldgType','Condition1','Neighborhood','LandSlope','LotConfig','LotShape','GarageArea', 'GarageCars','GarageYrBlt','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1','LotFrontage','Id','MSSubClass', 'LotArea','OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd','1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath','BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal','MoSold', 'YrSold']]

#######Preprocessing#########
# code under evaluation, so ignore for now
nan_counts=X_test.isnull().sum()
pd.set_option('display.max_columns',None)
columns_with_nan = nan_counts[nan_counts > 0].index


############## One-Hot Encoding #############
############################################# 

X_encoded = pd.get_dummies(X, columns=['BsmtExposure','Utilities','RoofMatl','Condition2','PoolQC','MiscFeature','Street','HeatingQC','ExterQual','MasVnrType','MSZoning','KitchenQual','GarageFinish','SaleCondition','SaleType','Fence','GarageCond','FireplaceQu','Functional','BsmtFinType1','BsmtCond','BsmtQual','Foundation','ExterCond','BldgType','Condition1','Neighborhood','LandSlope','LotConfig','LotShape'])
X_test_encoded = pd.get_dummies(X_test,columns=['BsmtExposure','Utilities','RoofMatl','Condition2','PoolQC','MiscFeature','Street','HeatingQC','ExterQual','MasVnrType','MSZoning','KitchenQual','GarageFinish','SaleCondition','SaleType','Fence','GarageCond','FireplaceQu','Functional','BsmtFinType1','BsmtCond','BsmtQual','Foundation','ExterCond','BldgType','Condition1','Neighborhood','LandSlope','LotConfig','LotShape'])

# Align the test columns with the training columns (same names, same order);
# dummy columns that are absent from the test data are filled with NaN
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns)


############## Fitting & Predicting #############
#############################################   


# HistGradientBoostingRegressor is used because it handles NaN values natively
test_preds = HistGradientBoostingRegressor().fit(X_encoded, Y).predict(X_test_encoded)

# Check your answer (To get credit for completing the exercise, you must get a "Correct" result!)
step_1.check()
# step_1.solution()

############## Excluded Data #############
##########################################

# Model accuracy decreased with the following columns, so we removed them: 'HeatingQC', 'BsmtExposure', 'ExterQual', 'MasVnrType', 'MSZoning', 'KitchenQual', 'GarageFinish', 'SaleCondition', 'SaleType', 'Fence', 'GarageCond', 'FireplaceQu', 'Functional', 'LotShape', 'LotConfig', 'LandSlope', 'BldgType', 'Foundation', 'ExterCond', 'Condition1', 'BsmtCond', 'BsmtQual', 'BsmtFinType1', [improved the model] 'CentralAir', 'BsmtFinType2', 'GarageType', 'PavedDrive', 'Alley', 'LandContour', 'RoofStyle', [improved the model] 'HouseStyle', 'Exterior2nd', 'Exterior1st', 'Electrical', 'BsmtHalfBath', 'BsmtFullBath', 'MasVnrArea', 'GarageQual', 'Heating'
# Model accuracy remained unchanged after adding the following columns: 'Street', 'MiscFeature', 'PoolQC', 'Condition2', 'RoofMatl'
# Model accuracy increased with ffill for LotFrontage (on X_test only)
# Model accuracy increased with mean for LotFrontage (all cases)
# Model accuracy increased with bfill for LotFrontage (all cases)
# Model accuracy decreased after ffill & bfill of 'Utilities', 'BsmtExposure'
# Model accuracy remained unchanged after ffill & bfill of 'PoolQC', 'MiscFeature'
# Model accuracy remained unchanged after removing 'Utilities'

####Extra Code
#ensuring indices remain same for both x_encoded and x_test_encoded
#X_test_encoded = X_test_encoded.reindex(columns = X_encoded.columns, fill_value=0)  
#X_test_encoded=X_test_encoded.reindex_like(X_encoded)



<span style="color:#33cc33">Correct</span>
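
The column alignment step in the cell above can be illustrated on a toy example. This sketch uses the `fill_value=0` variant that appears in the commented-out "Extra Code", so dummy columns missing from the test data become 0 instead of NaN. The two small frames are hypothetical values, not rows from the competition files.

```python
import pandas as pd

# Toy frames (hypothetical values): the test data lacks the 'Grvl' category
# and contains a category ('Dirt') that never appears in the training data
train_demo = pd.DataFrame({"LotArea": [8450, 9600, 11250], "Street": ["Pave", "Grvl", "Pave"]})
test_demo = pd.DataFrame({"LotArea": [7000, 10000], "Street": ["Pave", "Dirt"]})

train_enc = pd.get_dummies(train_demo, columns=["Street"])
test_enc = pd.get_dummies(test_demo, columns=["Street"])

# Keep exactly the training columns, in the training order.
# Dummy columns missing from the test data become 0; test-only columns are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_aligned)
```

Whether 0 or NaN is the better fill depends on the model: NaN only works with estimators that accept missing values, while 0 states explicitly that the category is absent from the row.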

In [5]:
##################################
######## Ignore Extra Code #######
##################################
##################################


#print (mean)
#print(X_test[columns_with_nan])

#print(nan_counts)

# finding data types of the dataframe 
#print(X_test.dtypes)


# Replace NaN values with the mean using loc
####filling with mean values 
#mean = X.BsmtUnfSF.mean()
#mean1 = X_test.BsmtUnfSF.mean()
#X.loc[X['BsmtUnfSF'].isna(), 'BsmtUnfSF'] = mean
#X_test.loc[X_test['BsmtUnfSF'].isna(),'BsmtUnfSF'] = mean1

####filling with ffill and bfill
#X.loc[:, 'BsmtUnfSF'] = X['BsmtUnfSF'].fillna(method='bfill')
#X.loc[:, 'BsmtExposure'] = X['BsmtExposure'].fillna(method='bfill')
#X_test.loc[:, 'BsmtExposure'] = X_test['BsmtExposure'].fillna(method='bfill')
#finding columns with nan values
# Set display options to show more rows and columns
#X.loc[X['LotFrontage'] == 'NA', 'LotFrontage'] = X['LotFrontage'].replace('NA', X['LotFrontage'].mean())
#print (X.LotFrontage)
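
The mean-fill experiments above compute a separate mean for the training and test data. A common alternative, sketched below under that assumption, is to compute the statistic on the training rows only and reuse it for the test rows, so both sets are filled with the same value. The frames hold hypothetical values for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical LotFrontage values
train_demo = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, 68.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 60.0]})

# Compute the mean on the training rows only (NaN is skipped by default)
train_mean = train_demo["LotFrontage"].mean()

# Fill both frames with the training mean; fillna returns new frames,
# so the originals are left untouched
train_filled = train_demo.fillna({"LotFrontage": train_mean})
test_filled = test_demo.fillna({"LotFrontage": train_mean})

print(train_mean)
print(test_filled)
```

Forward and backward fills depend on row order, which carries no meaning for independent house sales, so a mean or median fill is usually easier to justify. Recent pandas releases also deprecate `fillna(method=...)` in favor of `.ffill()` and `.bfill()`.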

Before submitting, run a check to make sure your `test_preds` have the right format.

# Generate a submission

Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [6]:
# Run code to save house price predictions in the format used for competition scoring

output = pd.DataFrame({'Id': X_test.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
print (output)


        Id      SalePrice
0     1461  125119.075696
1     1462  155171.290751
2     1463  183986.608536
3     1464  185732.070779
4     1465  199321.290690
...    ...            ...
1454  2915   76137.472100
1455  2916   85050.980502
1456  2917  149858.724968
1457  2918  113544.837657
1458  2919  205451.149302

[1459 rows x 2 columns]
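
Before uploading, it can help to verify the submission frame programmatically. The helper below is a hypothetical sketch (`check_submission` is not part of `learntools` or Kaggle's tooling) that checks the two column names used above, the row count, the Ids, and missing predictions.

```python
import pandas as pd

def check_submission(output, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(output.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if len(output) != len(expected_ids):
        problems.append("row count does not match the test set")
    elif list(output["Id"]) != list(expected_ids):
        problems.append("Ids differ from the test set")
    if output["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    return problems

# Hypothetical frame with one missing prediction
demo = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [125000.0, 155000.0, None]})
print(check_submission(demo, [1461, 1462, 1463]))
```

In this notebook the call would look like `check_submission(output, home_data_test.Id)`, and an empty list would mean none of these particular problems were found.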


# Submit to the competition

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on **[this link](https://www.kaggle.com/c/home-data-for-ml-course)**.  Then click on the **Join Competition** button.

![join competition image](https://storage.googleapis.com/kaggle-media/learn/images/axBzctl.png)

Next, follow the instructions below:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Data** tab near the top of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continue Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  To add more features to the data, revisit the first code cell, and change this line of code to include more column names:
```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```

Some features will cause errors because of issues like missing values or non-numeric data types.  Here is a complete list of potential columns that you might like to use, and that won't throw errors:
- 'MSSubClass'
- 'LotArea'
- 'OverallQual' 
- 'OverallCond' 
- 'YearBuilt'
- 'YearRemodAdd' 
- '1stFlrSF'
- '2ndFlrSF' 
- 'LowQualFinSF' 
- 'GrLivArea'
- 'FullBath'
- 'HalfBath'
- 'BedroomAbvGr' 
- 'KitchenAbvGr' 
- 'TotRmsAbvGrd' 
- 'Fireplaces' 
- 'WoodDeckSF' 
- 'OpenPorchSF'
- 'EnclosedPorch' 
- '3SsnPorch' 
- 'ScreenPorch' 
- 'PoolArea' 
- 'MiscVal' 
- 'MoSold' 
- 'YrSold'

Look at the list of columns and think about what might affect home prices.  To learn more about each of these features, take a look at the data description on the **[competition page](https://www.kaggle.com/c/home-data-for-ml-course/data)**.

After updating the code cell above that defines the features, re-run all of the code cells to evaluate the model and generate a new submission file.  
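
One way to judge whether an added feature helps, without spending a submission, is to score each candidate feature list on held-out data. The sketch below uses 5-fold cross-validation, which is a swapped-in technique and not the single train/validation split used earlier. The data is synthetic and `cv_mae` is a hypothetical helper; only the column names are borrowed from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): price depends on both columns
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "LotArea": rng.uniform(5_000, 15_000, n),
    "OverallQual": rng.integers(1, 11, n),
})
price = 10 * demo["LotArea"] + 20_000 * demo["OverallQual"] + rng.normal(0, 10_000, n)

def cv_mae(features):
    """Mean absolute error of a random forest, averaged over 5 folds."""
    model = RandomForestRegressor(n_estimators=50, random_state=1)
    scores = cross_val_score(model, demo[features], price,
                             cv=5, scoring="neg_mean_absolute_error")
    return -scores.mean()  # scikit-learn reports the negated error

for features in (["LotArea"], ["LotArea", "OverallQual"]):
    print(features, round(cv_mae(features)))
```

A feature list with a lower cross-validated error is a reasonable candidate for the next submission, although the leaderboard score can still differ.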


# What's next?

As mentioned above, some of the features will throw an error if you try to use them to train your model.  The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique that often gives better accuracy than Random Forest.

The **[Pandas](https://kaggle.com/Learn/Pandas)** course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/intro-to-Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*