In [79]:

import numpy as np 
import pandas as pd 
 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv
/kaggle/input/house-prices-advanced-regression-techniques/data_description.txt
/kaggle/input/house-prices-advanced-regression-techniques/train.csv
/kaggle/input/house-prices-advanced-regression-techniques/test.csv


This cell imports the required libraries:

numpy: a library for numerical computing in Python.
pandas: a library for data manipulation and analysis.
os: the standard-library module for working with the file system.

os.walk() traverses the directory tree rooted at /kaggle/input. For each directory it yields a tuple of the directory path, the names of its subdirectories, and the names of its files; the loop ignores the subdirectory names (the _ placeholder).

Inside the loop, os.path.join(dirname, filename) builds the full path of each file, which is then printed. The output above lists the four files in the competition dataset: the training and test CSVs, a sample submission, and a text description of the columns.
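To see the traversal in isolation, here is a minimal sketch that runs os.walk() over a throwaway directory instead of /kaggle/input. The file names a.csv and sub/b.csv are made up for illustration and are not part of the dataset.

```python
import os
import tempfile

# Build a small temporary directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.csv", os.path.join("sub", "b.csv")):
        open(os.path.join(root, rel), "w").close()

    # os.walk yields (dirpath, dirnames, filenames) for every directory
    # under root; only the file names are needed here.
    found = sorted(
        os.path.relpath(os.path.join(dirname, filename), root)
        for dirname, _, filenames in os.walk(root)
        for filename in filenames
    )

print(found)
```

Files in nested directories are reached without any explicit recursion, which is why the same two-level loop works for any Kaggle input layout.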

In [80]:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor



sklearn (scikit-learn) is a popular Python library for machine learning that provides tools for data preprocessing, modeling, and evaluation.

train_test_split is a function in the sklearn.model_selection module that splits a dataset into two subsets: a training set and a testing (or validation) set.

KNeighborsRegressor is a regression model from the sklearn.neighbors module. It is based on the k-nearest neighbors algorithm and predicts continuous numeric values.

Together, train_test_split and KNeighborsRegressor let you split the dataset into training and validation sets, fit a k-nearest neighbors regression model on the training portion, and check its predictions on the held-out portion.

In [81]:
df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
df.shape


(1460, 81)

pd.read_csv reads the training CSV file into a DataFrame, and the .shape attribute returns its dimensions: 1460 rows and 81 columns.

In [82]:
y = df.SalePrice

columns=['MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt', 'GarageArea', 'PoolArea', 'YrSold']
df[columns].isnull().sum()

MSSubClass       0
LotFrontage    259
LotArea          0
YearBuilt        0
GarageArea       0
PoolArea         0
YrSold           0
dtype: int64

The SalePrice column of df is stored in y as the prediction target. The columns list names seven numeric features (MSSubClass, LotFrontage, LotArea, YearBuilt, GarageArea, PoolArea, and YrSold), and df[columns].isnull().sum() counts the missing values in each of them. Only LotFrontage has any: 259 of the 1460 rows.
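The counting step can be sketched on a tiny DataFrame. The values below are invented for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing LotFrontage entries (made-up values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```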

In [83]:
from sklearn.impute import SimpleImputer

inputer = SimpleImputer(missing_values=np.nan, strategy='constant')
inputer = inputer.fit(df[columns].values)

inputed = inputer.transform(df[columns].values)




This code uses SimpleImputer from scikit-learn to handle the missing values in the columns selected from df. With strategy='constant', every missing value is replaced by the value of the fill_value parameter; because fill_value is not set here and the data is numeric, that value is 0. The fit method does not estimate anything from the data for this strategy; it only records the fill value for each column. The transform method then returns a new NumPy array in which every missing LotFrontage entry has been replaced by 0.
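A minimal sketch of what the constant strategy does, on a small made-up array rather than the real training data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with one missing value in each column.
values = np.array([
    [65.0, 8450.0],
    [np.nan, 9600.0],
    [80.0, np.nan],
])

# Same configuration as the notebook: constant strategy, no fill_value given.
imputer = SimpleImputer(missing_values=np.nan, strategy="constant")
filled = imputer.fit_transform(values)

# statistics_ holds the value used for each column; for numeric data
# with the default fill_value this is 0.
print(imputer.statistics_)
print(filled)
```

Filling a lot frontage with 0 treats the house as if it had no street frontage at all, so it is worth checking whether that is the intended meaning of a missing value.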

In [84]:
X = inputed

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=42)


This code splits the preprocessed data X and the target variable y into training and validation sets (train_X, val_X, train_y, and val_y) using the train_test_split function. Because test_size is not specified, 25% of the rows are held out for validation, and random_state=42 makes the shuffle reproducible. Training on one subset and evaluating on another does not prevent overfitting by itself, but it lets you detect it and estimate how well the model generalizes. Note that this notebook does not go on to score the model on val_X and val_y.
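The sketch below shows the default split sizes for a dataset of 1460 rows and one possible way to use the validation set. The data is synthetic, and scoring with mean_absolute_error is an assumption for illustration; the notebook itself does not evaluate the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the 1460 x 7 feature matrix and the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1460, 7))
y = 1000 * X[:, 0] + rng.normal(0, 500, size=1460)

# Default test_size is 0.25, so 365 rows are held out and 1095 remain.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=42)
print(train_X.shape, val_X.shape)

# Fit on the training rows only, then score on the held-out rows.
model = KNeighborsRegressor(n_neighbors=5).fit(train_X, train_y)
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(f"validation MAE: {val_mae:.1f}")
```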

In [85]:
model_knn = KNeighborsRegressor(n_neighbors=5)
model_knn.fit(train_X, train_y)

After fit, model_knn is ready to make predictions on new data. A k-nearest neighbors model does not learn parameters; it stores the training set, and for a given input it finds the 5 closest training points (by Euclidean distance, the default) and predicts the average of their target values.
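The averaging behaviour is easy to verify by hand on a one-feature toy problem. The numbers below are invented, and n_neighbors=2 is used instead of 5 only to keep the arithmetic short.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four made-up training points with one feature each.
train_X = np.array([[1.0], [2.0], [3.0], [10.0]])
train_y = np.array([100.0, 200.0, 300.0, 1000.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(train_X, train_y)

# The two nearest neighbours of 2.4 are the points at 2.0 and 3.0.
query = np.array([[2.4]])
distances, indices = knn.kneighbors(query)
pred = knn.predict(query)

# The prediction is the mean of their targets: (200 + 300) / 2.
print(indices[0], pred[0])
```

Because the prediction depends entirely on distances, features with large numeric ranges (such as LotArea) can dominate the neighbor search when the inputs are not scaled.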

In [86]:
test_data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [87]:
from sklearn.impute import SimpleImputer

inputer = SimpleImputer(missing_values=np.nan, strategy='constant')
inputer = inputer.fit(test_data[columns].values)

inputed = inputer.transform(test_data[columns].values)

This cell repeats the imputation for the same columns of the test_data DataFrame. As with the training data, strategy='constant' without a fill_value replaces every missing numeric value with 0; fit only records that fill value, and transform returns a new array with the missing entries filled. Because the constant strategy ignores the data, fitting a second imputer on the test set gives the same result as reusing the first one. For a data-dependent strategy such as the mean or median, the imputer fitted on the training data should be reused so that both sets are filled with the same values. Either way, the test data must go through the same preprocessing as the training data before the model can make predictions on it.
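As a sketch of the fit-on-train, transform-on-test pattern, the example below uses strategy="median" on made-up values. The median strategy is an illustrative alternative, not what the notebook uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data; the test set has its own missing value.
train_values = np.array([[60.0], [80.0], [np.nan], [70.0]])
test_values = np.array([[np.nan], [90.0]])

# Learn the fill value (median of 60, 80, 70) from the training data only.
imputer = SimpleImputer(strategy="median").fit(train_values)

# Reuse the fitted imputer so the test set is filled with the same value.
test_filled = imputer.transform(test_values)
print(imputer.statistics_, test_filled.ravel())
```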

In [88]:
test_X = inputed

test_preds = model_knn.predict(test_X)

The preprocessed test data test_X is passed to the trained k-nearest neighbors regression model (model_knn), which predicts a sale price (SalePrice) for each house in the test set. The predictions are stored in the variable test_preds.

In [89]:
test_preds

array([262342.6, 223780. , 222760. , ..., 265630.8, 133000. , 161780. ])

In [90]:
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

df = pd.read_csv('submission.csv')
df.head()


     Id  SalePrice
0  1461   262342.6
1  1462   223780.0
2  1463   222760.0
3  1464   189158.0
4  1465   204330.0


This code puts the model's test-set predictions into a DataFrame with Id and SalePrice columns and saves it to a CSV file named 'submission.csv'; index=False keeps the row index out of the file. It then reads the file back and displays the first five rows to check the structure of the submission. Note that reading the file back into df overwrites the training DataFrame loaded earlier. Machine learning competitions typically require predictions to be submitted in a specific format like this for scoring.
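The effect of index=False can be checked with an in-memory round trip. This is a minimal sketch: io.StringIO stands in for the file on disk, and only the first two predictions are included.

```python
import io

import pandas as pd

output = pd.DataFrame({"Id": [1461, 1462],
                       "SalePrice": [262342.6, 223780.0]})

# With index=False the CSV contains only the two data columns.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

# With the default index=True, the row index is written as an unnamed
# first column, which read_csv reports as "Unnamed: 0".
buffer_with_index = io.StringIO()
output.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(roundtrip.columns))
print(list(with_index.columns))
```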