# <center>House Price Prediction</center>
<div style="width:100%;text-align: center;"> <img align = middle src="https://images.pexels.com/photos/186077/pexels-photo-186077.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500" style="height:300px;"> </div>

# Dataset Fields

Description of the data given:

- **SalePrice**: The property's sale price in dollars. This is the target variable that you're trying to predict.
- **MSSubClass**: The building class
- **MSZoning**: The general zoning classification
- **LotFrontage**: Linear feet of street connected to property
- **LotArea**: Lot size in square feet
- **Street**: Type of road access
- **Alley**: Type of alley access
- **LotShape**: General shape of property
- **LandContour**: Flatness of the property
- **Utilities**: Type of utilities available
- **LotConfig**: Lot configuration
- **LandSlope**: Slope of property
- **Neighborhood**: Physical locations within Ames city limits
- **Condition1**: Proximity to main road or railroad
- **Condition2**: Proximity to main road or railroad (if a second is present)
- **BldgType**: Type of dwelling
- **HouseStyle**: Style of dwelling
- **OverallQual**: Overall material and finish quality
- **OverallCond**: Overall condition rating
- **YearBuilt**: Original construction date
- **YearRemodAdd**: Remodel date
- **RoofStyle**: Type of roof
- **RoofMatl**: Roof material
- **Exterior1st**: Exterior covering on house
- **Exterior2nd**: Exterior covering on house (if more than one material)
- **MasVnrType**: Masonry veneer type
- **MasVnrArea**: Masonry veneer area in square feet
- **ExterQual**: Exterior material quality
- **ExterCond**: Present condition of the material on the exterior
- **Foundation**: Type of foundation
- **BsmtQual**: Height of the basement
- **BsmtCond**: General condition of the basement
- **BsmtExposure**: Walkout or garden level basement walls
- **BsmtFinType1**: Quality of basement finished area
- **BsmtFinSF1**: Type 1 finished square feet
- **BsmtFinType2**: Quality of second finished area (if present)
- **BsmtFinSF2**: Type 2 finished square feet
- **BsmtUnfSF**: Unfinished square feet of basement area
- **TotalBsmtSF**: Total square feet of basement area
- **Heating**: Type of heating
- **HeatingQC**: Heating quality and condition
- **CentralAir**: Central air conditioning
- **Electrical**: Electrical system
- **1stFlrSF**: First Floor square feet
- **2ndFlrSF**: Second floor square feet
- **LowQualFinSF**: Low quality finished square feet (all floors)
- **GrLivArea**: Above grade (ground) living area square feet
- **BsmtFullBath**: Basement full bathrooms
- **BsmtHalfBath**: Basement half bathrooms
- **FullBath**: Full bathrooms above grade
- **HalfBath**: Half baths above grade
- **BedroomAbvGr**: Number of bedrooms above basement level
- **KitchenAbvGr**: Number of kitchens
- **KitchenQual**: Kitchen quality
- **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
- **Functional**: Home functionality rating
- **Fireplaces**: Number of fireplaces
- **FireplaceQu**: Fireplace quality
- **GarageType**: Garage location
- **GarageYrBlt**: Year garage was built
- **GarageFinish**: Interior finish of the garage
- **GarageCars**: Size of garage in car capacity
- **GarageArea**: Size of garage in square feet
- **GarageQual**: Garage quality
- **GarageCond**: Garage condition
- **PavedDrive**: Paved driveway
- **WoodDeckSF**: Wood deck area in square feet
- **OpenPorchSF**: Open porch area in square feet
- **EnclosedPorch**: Enclosed porch area in square feet
- **3SsnPorch**: Three season porch area in square feet
- **ScreenPorch**: Screen porch area in square feet
- **PoolArea**: Pool area in square feet
- **PoolQC**: Pool quality
- **Fence**: Fence quality
- **MiscFeature**: Miscellaneous feature not covered in other categories
- **MiscVal**: Dollar value of miscellaneous feature
- **MoSold**: Month sold
- **YrSold**: Year sold
- **SaleType**: Type of sale
- **SaleCondition**: Condition of sale

# Importing the Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

import warnings
warnings.filterwarnings('ignore')

<h3>Printing out a list of all the files in the directory (optional)</h3>

In [None]:
# for dirname, _, filenames in os.walk(os.environ['DSX_PROJECT_DIR']):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

<h3> Reading the train data </h3>

In [None]:
df = pd.read_csv(os.environ['DSX_PROJECT_DIR']+'/datasets/train_houses.csv')
df

# Dataset Information

<h3> Looking at all the columns of the given data </h3>

In [None]:
df.columns

<h3> Identifying the shape of the given dataframe (Rows X Columns)</h3>

In [None]:
df.shape

<h3>Printing the following information about our dataset:</h3>
<ul>
    <li>Column Name</li>
    <li>Count of the non-null data</li>
    <li>Datatype of that column</li>
</ul>

In [None]:
df.info()

<h3>Printing the statistical summary of our dataset such as:</h3>
<ul>
    <li>Count</li>
    <li>Mean</li>
    <li>Standard Deviation</li>
    <li>Minimum</li>
    <li>25<sup>th</sup> Percentile</li>
    <li>50<sup>th</sup> Percentile</li>
    <li>75<sup>th</sup> Percentile</li>
    <li>Maximum</li>     
</ul>

In [None]:
df.describe()

# Identifying Correlations in the Data

In [None]:
plt.figure(figsize = (30, 30))
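# Correlations are only defined for numeric columns, so drop the text columns first
sns.heatmap(df.select_dtypes(include = 'number').corr(), cmap = 'Blues', square = True, annot = True)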
plt.title("Visualizing Correlations", size = 30)
plt.show()

# Selecting Numeric Features
The numerical features whose correlation coefficient with SalePrice is greater than <b>0.50</b> have been selected.

In [None]:
num_cols = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea']
df_num_cols = df[num_cols]
df_num_cols.isnull().sum()
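
The column list above is typed in by hand. As a rough sketch of how the same selection could be derived from the data, the helper below keeps the numeric columns whose correlation with the target exceeds a threshold. The function name `select_correlated_features`, the use of the absolute correlation, and the toy DataFrame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def select_correlated_features(frame, target, threshold=0.5):
    """Return numeric columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    numeric = frame.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].drop(target)
    selected = corr_with_target[corr_with_target.abs() > threshold]
    return selected.sort_values(ascending=False)

# Toy data standing in for the housing DataFrame (hypothetical values).
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)
toy = pd.DataFrame({
    "area": area,
    "noise": rng.normal(size=200),
    "label": ["a", "b"] * 100,  # non-numeric column, ignored by the helper
    "price": 1000 * area + rng.normal(scale=5000, size=200),
})
selected = select_correlated_features(toy, "price")
print(selected)
```

On the notebook's data the call would presumably be `select_correlated_features(df, 'SalePrice')`, run before `df` is reduced to the final features.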

<h3>Creating a pair plot to identify trends and relations in the data</h3>

In [None]:
sns.pairplot(df[num_cols])
plt.show()

<h3>Looking at the non-numeric features of our dataset</h3>

In [None]:
df_nn = df.select_dtypes(include = ['O'])
nn_list = df_nn.columns.tolist()
print("There are", len(nn_list), "non-numeric features in the dataset\n\nThe non-numeric features are:\n", nn_list, "\n\nCounting the number of NA values for our non-numeric data")
df_nn.isna().sum()

# Visualize the Count of the Categorical Data

In [None]:
cat_cols = ['MSZoning', 'BldgType', 'ExterQual', 'CentralAir', 'KitchenQual', 'SaleCondition']

df_cat_cols = df[cat_cols]

# Draw one count plot per selected categorical feature
for col in cat_cols:
    sns.countplot(x = col, data = df_cat_cols)
    plt.show()

# Final Features Selected

In [None]:
final_features = list(num_cols + cat_cols)
print(final_features)

<h3>Final dataset used with only the selected features</h3>

In [None]:
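# Take an explicit copy so that later column assignments do not act on a view of the original frame
df = df[final_features + ['SalePrice']].copy()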
df

# Encoding the Categorical Data
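The categorical columns are converted to numeric values with LabelEncoder, which maps each distinct label in a column to an integer between 0 and n_classes - 1, where n_classes is the number of distinct labels in that column. Repeated labels always receive the same code. Because the encoder is re-fitted for every column, each column gets its own mapping, and the codes follow the sorted order of the labels rather than any quality ranking.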
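
As a small illustration of that mapping, the sketch below encodes a handful of KitchenQual-style labels. The sample list and the name `demo_encoder` are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Sample quality labels (illustrative values only).
quality = ["Gd", "TA", "Ex", "Fa", "TA"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(quality)

print(list(demo_encoder.classes_))  # distinct labels in sorted order
print(list(codes))                  # integer code assigned to each entry

# The mapping is reversible as long as the fitted encoder is kept.
decoded = demo_encoder.inverse_transform(codes)
print(list(decoded))
```

The codes are assigned alphabetically (Ex, Fa, Gd, TA), so they should be read as arbitrary identifiers, not as an ordered quality scale.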

In [None]:
label_encoder = LabelEncoder()
df[cat_cols] = df[cat_cols].apply(label_encoder.fit_transform)
df

<h3>Assigning features to X and the labels to y</h3>

In [None]:
X = df.drop(['SalePrice'], axis = 1)
y = df['SalePrice']
X.head()

In [None]:
y.head()

# Scaling the Data
StandardScaler standardizes a feature by subtracting the mean and then scaling it to unit variance.
<h3>Formula</h3>
<div style="width:100%;text-align: center;"><img align = left src="https://cdn-images-1.medium.com/max/800/0*vQEjz0mvylP--30Q.GIF" style="height:300px;"></div>
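
In symbols, each value x is replaced by z = (x - mean) / standard deviation, computed per column. A minimal check, on made-up numbers, that this matches what StandardScaler produces:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual standardization; StandardScaler uses the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```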

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, index=X.index, columns=X.columns)
print(X, "\n\n\n", "The shape of X is:", X.shape)
print("\n", y, "\n\n\n", "The shape of y is:", y.shape)
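
Note that the cell above fits the scaler on the full dataset before any split, so the validation and test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data, is to wrap the scaler and the model in a pipeline so that scaling is learned only from the rows used for fitting. The data, the variable names and the choice of Ridge are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical stand-in for the housing features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=5.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# make_pipeline names each step after its lowercased class name, hence "ridge__alpha".
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipeline, {"ridge__alpha": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)  # the scaler is re-fitted inside every cross-validation fold

demo_score = search.score(X_te, y_te)
print(search.best_params_, demo_score)
```

With this layout the test rows are only ever passed through `transform`, never through `fit`.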

# Splitting the Data Into Training and Testing Sets

In [None]:
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# produces test split.
x_remaining, X_test, y_remaining, y_test = train_test_split(X, y, test_size=ratio_test, random_state=0)

# adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# produces train and val splits.
X_train, X_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=ratio_val_adjusted, random_state=0)
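
The adjustment is needed because the second split is taken from the remaining 90% of the rows: 0.1 / 0.9 is roughly 0.111, which is 10% of the original data. A quick check on 1000 dummy rows (the `_demo` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

ratio_val_demo, ratio_test_demo = 0.1, 0.1
rows = np.arange(1000).reshape(-1, 1)

# First split off the test set, then carve the validation set out of what remains.
remaining, test_part = train_test_split(rows, test_size=ratio_test_demo, random_state=0)
val_adjusted = ratio_val_demo / (1 - ratio_test_demo)
train_part, val_part = train_test_split(remaining, test_size=val_adjusted, random_state=0)

for name, part in [("train", train_part), ("val", val_part), ("test", test_part)]:
    print(name, len(part), round(len(part) / len(rows), 3))
```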

# Linear Regression

<h4>What is linear regression?</h4>

Linear regression is a statistical method used for predictive analysis. There are two types:
- Simple Linear Regression
- Multiple Linear Regression

<h4>Simple Linear Regression</h4>

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line.

<h4>Formulae for simple linear regression</h4>
<div style="width:100%;text-align: center;"><img align = left src="https://miro.medium.com/max/960/1*jt-pyQQ7bgL2lyganse0nQ.png" style="height:300px;"></div>
<div style="width:100%;text-align: left;"><img align = left src="https://miro.medium.com/max/952/1*RMGN1iWcl7l8iDy9Vk6HGg.png" style="height:300px;"></div>

<h4>Multiple Linear Regression</h4>

Since our dataset has many features, we use multiple linear regression.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.

<h4>Formula</h4>
<div style="width:100%;text-align: center;"><img align = left src="https://csharpcorner-mindcrackerinc.netdna-ssl.com/article/linear-regression2/Images/f_MLR.png" style="height:300px;"></div>
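
Written out, the model is y = b0 + b1*x1 + b2*x2 + ... + bn*xn, where b0 is the intercept and b1 to bn are the coefficients. As a sketch on synthetic data with known coefficients (all values here are invented for the example), LinearRegression recovers them closely:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])

# Target built from a known intercept and coefficients plus a little noise.
y_demo = 4.0 + X_demo @ true_coefs + rng.normal(scale=0.01, size=200)

model = LinearRegression().fit(X_demo, y_demo)
print("intercept:", round(model.intercept_, 2))
print("coefficients:", np.round(model.coef_, 2))
```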

In [None]:
lin_reg = LinearRegression(n_jobs = -1)
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
rmse_lin_reg = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Linear Regression is:", rmse_lin_reg)
print("Linear Regression Score is:", lin_reg.score(X_test, y_test) * 100, "%")
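
The "score" printed above is the coefficient of determination (R squared) returned by `score`, while RMSE is the square root of the mean squared error. A short sketch, on made-up prices, of how both numbers are computed:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted sale prices.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 185000.0])

# RMSE: square root of the average squared error, in the same unit as the target.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R squared: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```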

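# What is GridSearchCV?
GridSearchCV is a scikit-learn class that fits a model for every combination of the hyperparameter values you supply, scores each combination with cross-validation, and then refits the model on the full training data using the best one.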

<h3>Advantage</h3>
GridSearchCV tries every combination of the values passed in the parameter dictionary and evaluates each one with cross-validation, so the search is exhaustive over the grid.

<h3>Disadvantage</h3>
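<span>It is very time consuming, because the number of model fits is the product of the grid sizes multiplied by the number of cross-validation folds.</span>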

<div style="width:100%;text-align: center;"><img align = "left" src="https://c.tenor.com/hHMlCXdEfbEAAAAC/laptop-breaking.gif" style="height:300px;"></div>
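
To get a feel for the cost before launching a search, the number of candidates can be counted with ParameterGrid. The sketch below uses the grid from the Decision Tree Regressor section of this notebook; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

tree_grid = {'max_depth': list(range(2, 10)),
             'splitter': ['best', 'random'],
             'min_samples_leaf': list(range(1, 10)),
             'max_leaf_nodes': list(range(5, 20))}

# One candidate per combination of values; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(tree_grid))
n_folds = 5
print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
```

With the default `refit = True`, one more fit on the full training data follows the search.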

# Ridge Regression
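Ridge regression is linear regression with an L2 penalty on the coefficients. The penalty shrinks the coefficients toward zero, which stabilises the fit when the features suffer from multicollinearity. The strength of the penalty is set by the alpha hyperparameter.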
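
As a rough illustration of the shrinkage, the sketch below fits Ridge with increasing alpha on synthetic data containing two nearly identical features. The data and names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Two nearly identical (collinear) features plus one independent feature.
X_demo = np.hstack([base,
                    base + rng.normal(scale=0.01, size=(200, 1)),
                    rng.normal(size=(200, 1))])
y_demo = 3.0 * base[:, 0] + X_demo[:, 2] + rng.normal(scale=0.1, size=200)

norms = []
for alpha in [0.001, 1, 100]:
    coefs = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms.append(np.linalg.norm(coefs))
    print(alpha, np.round(coefs, 2))
```

The overall size of the coefficient vector falls as alpha grows, and the weight is shared between the two collinear columns instead of being split erratically.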

In [None]:
ridge = Ridge()
param_grid = {'alpha': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 1, 2, 3, 5, 8, 10, 20, 50, 100]}
grid_search_ridge = GridSearchCV(ridge, param_grid, cv = 5)
grid_search_ridge.fit(X_train, y_train)
y_pred = grid_search_ridge.predict(X_test)
rmse_ridge_reg = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Ridge Regression is:", rmse_ridge_reg)
print("Ridge Regression Score is:", grid_search_ridge.score(X_test, y_test) * 100, "%")

# Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.
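
The variable selection effect can be seen on synthetic data where only two of five features influence the target. This is a sketch with invented data; `alpha = 0.5` is an arbitrary choice for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

# Only the first two features carry signal; the remaining three are irrelevant.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print(np.round(lasso_demo.coef_, 2))
```

The irrelevant features receive a coefficient of exactly zero, while the two informative ones are kept but shrunk toward zero.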

In [None]:
lasso = Lasso()
param_grid = {'alpha': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 1, 2, 3, 5, 8, 10, 20, 50, 100]}
grid_search_las = GridSearchCV(lasso, param_grid, cv = 5)
grid_search_las.fit(X_train, y_train)
y_pred = grid_search_las.predict(X_test)
rmse_las_reg = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Lasso Regression is:", rmse_las_reg)
print("Lasso Regression Score is:", grid_search_las.score(X_test, y_test) * 100, "%")

# Elastic Net
Elastic net is a penalized linear regression model that includes both the L1 and L2 penalties during training.

In [None]:
el_net = ElasticNet()
param_grid = {'alpha': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 1, 2, 3, 5, 8, 10, 20, 50, 100],
             'l1_ratio': np.arange(0.0, 1.0, 0.1)}
grid_search_el_net = GridSearchCV(el_net, param_grid, cv = 5)
grid_search_el_net.fit(X_train, y_train)
y_pred = grid_search_el_net.predict(X_test)
rmse_el_net = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Elastic Net is:", rmse_el_net)
print("Elastic Net Score is:", grid_search_el_net.score(X_test, y_test) * 100, "%")

# Support Vector Regression
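Support Vector Regression (SVR) is a supervised machine learning algorithm. The basic idea behind SVR is to find a best-fit function whose predictions stay within a margin of width epsilon around the targets. In SVR, the best fit is the hyperplane whose epsilon margin contains the maximum number of points; only points on or outside the margin contribute to the loss.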
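
The role of epsilon can be illustrated by counting support vectors, which are the training points lying on or outside the margin. A sketch on a noisy sine curve (data and parameter values chosen only for the demonstration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 6, size=100)).reshape(-1, 1)
y_demo = np.sin(x_demo).ravel() + rng.normal(scale=0.1, size=100)

support_counts = {}
for eps in [0.01, 0.5]:
    svr_demo = SVR(kernel="rbf", C=10, epsilon=eps).fit(x_demo, y_demo)
    # support_ holds the indices of the training points used as support vectors.
    support_counts[eps] = len(svr_demo.support_)
    print("epsilon =", eps, "support vectors:", support_counts[eps])
```

A wider margin leaves more points inside it, so fewer of them end up as support vectors.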

In [None]:
svr = SVR()
param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 
              'gamma': ['scale', 'auto'],
              'C': [1, 10, 100],
              'epsilon': [0.01, 0.1, 1, 10]}
grid_search_svr = GridSearchCV(svr, param_grid, cv = 5)
grid_search_svr.fit(X_train, y_train)
y_pred = grid_search_svr.predict(X_test)
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Support Vector Regressor is:", rmse_svr)
print("SVR Score is:", grid_search_svr.score(X_test, y_test) * 100, "%")

# Decision Tree Regressor
A decision tree regressor is a supervised machine learning algorithm. It breaks a dataset down into smaller and smaller subsets while incrementally growing an associated decision tree. The final result is a tree with decision nodes and leaf nodes.
Decision tree regressors normally use the mean squared error (MSE) to decide how to split a node; in scikit-learn every split is binary, producing exactly two sub-nodes.
<div style="width:100%;text-align: center;"><img align = "left" src="https://gdcoder.com/content/images/2019/05/Screen-Shot-2019-05-18-at-03.40.41.png" style="height:300px;"></div>
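
To make the split criterion concrete, the sketch below searches by hand for the threshold that minimises the squared error on a tiny one-feature dataset and compares it with the root split chosen by a depth-1 tree. The data and the helper `split_cost` are invented for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([5.0, 6.0, 5.0, 20.0, 21.0, 19.0])

def split_cost(threshold):
    """Sum of squared errors around each side's mean for a split at `threshold`."""
    left, right = y_demo[x_demo <= threshold], y_demo[x_demo > threshold]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

# Candidate thresholds are the midpoints between consecutive (sorted) x values.
candidates = (x_demo[:-1] + x_demo[1:]) / 2
best_manual = min(candidates, key=split_cost)

# A depth-1 tree makes exactly one split, stored as the root node's threshold.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_demo.reshape(-1, 1), y_demo)
best_sklearn = stump.tree_.threshold[0]
print(best_manual, best_sklearn)
```

Both approaches place the split between the two clusters of points, where the error on each side is smallest.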


In [None]:
dtr = DecisionTreeRegressor(random_state = 0)
param_grid = {'max_depth': list(range(2, 10)),
              'splitter': ['best', 'random'],
              'min_samples_leaf': list(range(1, 10)),
              'max_leaf_nodes': list(range(5, 20))}
grid_search_dtr = GridSearchCV(dtr, param_grid, cv = 5)
grid_search_dtr.fit(X_train, y_train)
y_pred = grid_search_dtr.predict(X_test)
rmse_dtr = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Decision Tree Regressor is:", rmse_dtr)
print("DTR score is:", grid_search_dtr.score(X_test, y_test) * 100, "%")
print("The best parameters of the decision tree regressor are:")
print(grid_search_dtr.best_params_)

# Visualizing the Decision Tree Regressor

In [None]:
# Plot the tuned tree found by the grid search (already refit on X_train with the best parameters)
dot_data = tree.export_graphviz(
    grid_search_dtr.best_estimator_,
    out_file = None, feature_names = list(X_train.columns), filled = True
    )
graph = graphviz.Source(dot_data, format = "jpg")
display(graph)

# Random Forest Regressor
It is an ensemble technique capable of performing regression. The basic idea is to combine the predictions of multiple decision trees to determine the final output rather than relying on an individual decision tree.
<div style="width:100%;text-align: center;"><img align = "left" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/rfc_vs_dt1.png" style="height:300px;"></div>
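
For regression, "combining" means averaging: the forest's prediction is the mean of the predictions of its individual trees. A minimal check on synthetic data (names and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# estimators_ holds the fitted trees; stack their predictions for the first five rows.
per_tree = np.stack([t.predict(X_demo[:5]) for t in forest.estimators_])
forest_pred = forest.predict(X_demo[:5])
print(np.allclose(per_tree.mean(axis=0), forest_pred))
```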

In [None]:
rfr = RandomForestRegressor(random_state = 0)
param_grid = {'n_estimators': list(range(100, 200, 10)),
             'max_depth': list(range(4, 7)),
             'min_samples_split': list(range(2, 4))}
grid_search_rfr = GridSearchCV(rfr, param_grid, cv = 5)
grid_search_rfr.fit(X_train, y_train)
y_pred = grid_search_rfr.predict(X_test)
rmse_rfr_grid = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for Random Forest Regressor is:", rmse_rfr_grid)
print("RFR Score is:", grid_search_rfr.score(X_test, y_test) * 100, "%")

# AdaBoost Regressor
AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.
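
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's prediction after each added regressor. A sketch on a noisy sine curve, with invented data and settings:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=300)

booster = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One training RMSE value per boosting stage.
staged_rmse = [np.sqrt(mean_squared_error(y_demo, pred))
               for pred in booster.staged_predict(X_demo)]
print("after 1 estimator:", round(staged_rmse[0], 3))
print("after all estimators:", round(staged_rmse[-1], 3))
```

Plotting `staged_rmse` against the stage number is one way to judge whether a larger `n_estimators` is still paying off.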

In [None]:
abr = AdaBoostRegressor(random_state = 0)
param_grid = {'n_estimators': list(range(100, 1000, 100)),
             'learning_rate': [0.001, 0.01, 0.1, 1, 10]}
grid_search_abr = GridSearchCV(abr, param_grid, cv = 5)
grid_search_abr.fit(X_train, y_train)
y_pred = grid_search_abr.predict(X_test)
rmse_abr_grid = np.sqrt(mean_squared_error(y_test, y_pred))
print("The Root Mean Squared Error for AdaBoost Regressor is:", rmse_abr_grid)
print("ABR Score is:", grid_search_abr.score(X_test, y_test) * 100, "%")

# Visualizing the Results

In [None]:
data = {'Linear Regression': rmse_lin_reg, 'Ridge Regression': rmse_ridge_reg, 'Lasso Regression': rmse_las_reg, 'Elastic Net': rmse_el_net,
        'Support Vector Regressor': rmse_svr, 'Decision Tree Regressor': rmse_dtr, 'Random Forest Regressor': rmse_rfr_grid,
        'Ada Boost Regressor': rmse_abr_grid}
data = dict(sorted(data.items(), key = lambda x: x[1], reverse = True))
models = list(data.keys())
RMSE = list(data.values())
fig = plt.figure(figsize = (30, 10))
sns.barplot(x = models, y = RMSE)
plt.xlabel("Models Used", size = 20)
plt.xticks(rotation = 30, size = 15)
plt.ylabel("RMSE", size = 20)
plt.yticks(size = 15)
plt.title("RMSE for different models", size = 25)
plt.show()

# Save Test Data

In [None]:
test_run_df = y_val.to_frame().join(X_val)
test_run_df.head()

In [None]:
test_run_df.to_csv(os.environ['DSX_PROJECT_DIR']+'/datasets/test_df_houses_labeled.csv', index = False)

In [None]:
X_val.to_csv(os.environ['DSX_PROJECT_DIR']+'/datasets/test_df_houses_unlabeled.csv', index = False)

# Save Model

In [None]:
model_name = 'Regressor_House_Prices_rfr'

In [None]:
from dsx_ml.ml import save

saved_model_output = save(name = model_name,
                          model = grid_search_rfr,
                          x_test=pd.DataFrame(X_test),
                          y_test=pd.DataFrame(y_test),
                          labelColumn_json = [{"name": "SalePrice", "type": "float"}],
                          algorithm_type = 'Regression',
                          source='housing_prices_regression_models_solution.ipynb',
                          description='Regressor models for a house prices use case'
                         )
saved_model_output

## Make an online scoring prediction

Upon saving a model, an internal online scoring endpoint is automatically created.

Compatible models:
https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=data-machine-learning-models

In [None]:
import os
import requests
import json

In [None]:
with open(f"/user-home/{os.environ['DSX_USER_ID']}/DSX_Projects/{os.environ['DSX_PROJECT_NAME']}/models/{model_name}/metadata.json") as infile:
    metadata_dict = json.load(infile)

In [None]:
print(f"Model Type: {metadata_dict['algorithm']}")
print("Feature(s):")
for feature in metadata_dict['features']:
    print('    '+feature['name'])

print(f"Latest Model Version: {metadata_dict['latestModelVersion']}")
print("Label(s):")
for label in metadata_dict['labelColumns']:
    print('    '+label['name'])

In [None]:
header_online = {'Content-Type': 'application/json', 'Authorization': os.environ['DSX_TOKEN']}

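New data is provided in the following cell. The model was trained on label-encoded and standardized features, so the values must be encoded and scaled in the same way as the training data.

Use this sample data or create your own: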

{"OverallQual":-0.75110125, "YearBuilt":-0.34094461, "YearRemodAdd":-1.07288463, "TotalBsmtSF":-0.3706814 , "1stFlrSF":-0.65456132,"GrLivArea":-1.21558782, "FullBath":-1.02871967, "TotRmsAbvGrd":-0.91833473, "GarageCars":-0.98767408,  "GarageArea":1.1855381 ,"MSZoning": -1.49289614, "BldgType": -0.42662461, "ExterQual": 0.67354752, "CentralAir": 0.27271612,  "KitchenQual": 0.76832738, "SaleCondition": 0.20138321}

In [None]:
payload = [{"OverallQual":-0.75110125, "YearBuilt":-0.34094461, "YearRemodAdd":-1.07288463, "TotalBsmtSF":-0.3706814 , "1stFlrSF":-0.65456132,"GrLivArea":-1.21558782, "FullBath":-1.02871967, "TotRmsAbvGrd":-0.91833473, "GarageCars":-0.98767408,  "GarageArea":1.1855381 ,"MSZoning": -1.49289614, "BldgType": -0.42662461, "ExterQual": 0.67354752, "CentralAir": 0.27271612,  "KitchenQual": 0.76832738, "SaleCondition": 0.20138321}]
print(payload)

In [None]:
scoring_response = requests.post(saved_model_output['scoring_endpoint'], json=payload, headers=header_online)

In [None]:
scoring_response.content

## Make batch online scoring prediction via API

In Watson Studio, an online batch scoring endpoint can be created.  
After running a batch score at least once to generate a script, the API details can be found under {Your project} > Scripts > {your model}: click the three vertical dots on the right, then Test API…

To test a script, a bearer token (accessToken) is needed to authenticate the user. The token lasts for 13 hours and can be retrieved by running:

In [None]:
url_batch_score_auth = 'https://52.116.135.95/v1/preauth/validateAuth'
url_batch_score = 'https://52.116.135.95/dsx-py3-script/ibmdsxuser-1003/1648180381668/batch_score'

In [None]:
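# Replace the quoted placeholders with your credentials; verify=False skips TLS certificate verification
score_auth = requests.get(url_batch_score_auth, auth=('{{your username}}', '{{your password}}'), verify=False)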

In [None]:
accessToken = json.loads(score_auth.content)['accessToken']

In [None]:
header_batch_score = {'Content-Type': 'application/json', 'Authorization': f'Bearer {accessToken}'}

In [None]:
args = {'execution_type': 'DSX', 'target': '/datasets/test_df_houses_batch_score_results.csv', 'source': '/datasets/test_df_houses_unlabeled.csv', 'output_type': 'Localfile', 'output_datasource_type': '', 'sysparm': '', 'remoteHost': '', 'remoteHostImage': '', 'livyVersion': 'livyspark2'}

In [None]:
batch_score_payload = { "relativeScriptPath": "scripts/batch_score_housing_prices_sample.py", "args": args }

In [None]:
batch_scoring_response = requests.post(url_batch_score, json=batch_score_payload, headers=header_batch_score, verify=False)

In [None]:
json.loads(batch_scoring_response.content)

## Make an online scoring prediction via API

Upon deploying in Watson Machine Learning, an online scoring endpoint is automatically created.

In [None]:
url_score = 'https://52.116.135.95/dmodel/v1/python-lab/pyscript/regressor-housing-prices/score'

In [None]:
# Replace the quoted placeholder with your token
header_online_api = {'Content-Type': 'application/json', 'Cache-Control': 'no-cache', 'Authorization': '{{your token}}'}

In [None]:
payload_data = {"args":{"input_json":[{"OverallQual":-0.75110125, "YearBuilt":-0.34094461, "YearRemodAdd":-1.07288463, "TotalBsmtSF":-0.3706814 , "1stFlrSF":-0.65456132,"GrLivArea":-1.21558782, "FullBath":-1.02871967, "TotRmsAbvGrd":-0.91833473, "GarageCars":-0.98767408,  "GarageArea":1.1855381 ,"MSZoning": -1.49289614, "BldgType": -0.42662461, "ExterQual": 0.67354752, "CentralAir": 0.27271612,  "KitchenQual": 0.76832738, "SaleCondition": 0.20138321}]}}
print(payload_data)

In [None]:
scoring_response_api = requests.post(url_score, json=payload_data, headers=header_online_api, verify=False)

In [None]:
json.loads(scoring_response_api.content)

# Creating the Submission File (optional; only needed when submitting to Kaggle for ranking)

In [None]:
test_df = pd.read_csv(os.environ['DSX_PROJECT_DIR']+'/datasets/test_houses.csv')
X = test_df[final_features].copy()
# Caveat: re-fitting the label encoder here only reproduces the training codes
# if every category seen in training also appears in the test data.
X[cat_cols] = X[cat_cols].astype(str)
X[cat_cols] = X[cat_cols].apply(label_encoder.fit_transform)
# Reuse the scaler fitted earlier instead of re-fitting it on the test data
X_scaled = scaler.transform(X)
X = pd.DataFrame(X_scaled, index=X.index, columns=X.columns)
y_pred_rfr = grid_search_rfr.predict(X)
# Pair each test Id with its predicted sale price
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': y_pred_rfr})
# submission.to_csv('submission', index = False)
submission.head()
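
For the encoded categories of new data to mean the same thing as during training, the mapping has to be learned once on the training data and then reused. One possible way to do that, sketched here on toy frames, is scikit-learn's OrdinalEncoder. This is a swapped-in technique, not what the notebook uses, and the sample categories and the name `ordinal_encoder` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training and test frames (illustrative values only).
train_cats = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"],
                           "CentralAir": ["Y", "N", "Y", "Y"]})
test_cats = pd.DataFrame({"MSZoning": ["RM", "RL", "C (all)"],
                          "CentralAir": ["Y", "Y", "N"]})

# Categories unseen during fit are encoded as -1 instead of raising an error
# (these options require scikit-learn 0.24 or newer).
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = ordinal_encoder.fit_transform(train_cats)
test_codes = ordinal_encoder.transform(test_cats)  # reuse the mapping learned on the training data
print(test_codes)
```

A single fitted encoder handles all categorical columns at once, so it can be stored alongside the scaler and the model.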