# Housing Price Prediction
## Introduction
In this notebook, we build regularized regression models (Ridge and Lasso) to predict the actual value of prospective properties.

Given the problem statement and the data provided, let's perform the following steps:
1. Data Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Selection
4. Model Building & optimal value of lambda
5. Model Evaluation
6. Interpretation

### 1. Data Preprocessing
#### Load Data & Handle Missing Values

In [29]:
import pandas as pd

# Loading the provided dataset
file_path = 'train.csv'
data = pd.read_csv(file_path)

# Displaying the first few rows of the dataset
data.head()


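|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |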


### 2. Exploratory Data Analysis (EDA)

#### Data Cleaning and Preprocessing
- Explore the dataset to understand the various features.
- Handle missing values, if any.
- Convert categorical variables to numerical format using encoding techniques.

In [30]:
# Preprocessing: Handling missing values and dropping columns with a high percentage of missing values
columns_to_drop = data.columns[data.isnull().mean() > 0.5]
data_cleaned = data.drop(columns=columns_to_drop)

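# Fill remaining missing values: mode for categorical columns, median for numeric ones.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
for column in data_cleaned.columns:
    if data_cleaned[column].dtype == 'object':
        data_cleaned[column] = data_cleaned[column].fillna(data_cleaned[column].mode()[0])
    else:
        data_cleaned[column] = data_cleaned[column].fillna(data_cleaned[column].median())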

# Encode categorical variables using one-hot encoding
data_encoded = pd.get_dummies(data_cleaned.drop('Id', axis=1))
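
To make the cleaning rules above concrete, here is a minimal sketch on a hypothetical four-row frame (the values are invented for illustration and are not taken from `train.csv`). It applies the same 50% drop threshold, the same mode/median fill, and the same one-hot encoding. The names `toy`, `missing_share`, `toy_cleaned`, and `toy_encoded` are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # 25% missing: kept, filled with the median
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 75% missing: dropped
    "MSZoning": ["RL", "RL", None, "RM"],        # 25% missing: kept, filled with the mode
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Drop columns where more than half of the values are missing
toy_cleaned = toy.drop(columns=missing_share[missing_share > 0.5].index)

# Median for numeric columns, most frequent value for everything else
for column in toy_cleaned.columns:
    if pd.api.types.is_numeric_dtype(toy_cleaned[column]):
        fill_value = toy_cleaned[column].median()
    else:
        fill_value = toy_cleaned[column].mode()[0]
    toy_cleaned[column] = toy_cleaned[column].fillna(fill_value)

# One-hot encode the remaining categorical column
toy_encoded = pd.get_dummies(toy_cleaned)

print(missing_share)
print(toy_encoded)
```

Printing `missing_share` before dropping anything is a cheap way to check which columns the 0.5 threshold actually removes.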

### 3. Feature Selection
- Identify the most relevant features based on their correlation with SalePrice.

In [32]:
# Feature Selection: Using correlation with SalePrice
corr_matrix = data_encoded.corr()
saleprice_corr = corr_matrix['SalePrice'].sort_values(ascending=False)
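
The cell above ranks features by signed correlation, so strongly negative correlations end up at the bottom of the list and `SalePrice` itself sits at the top. One possible way to turn the ranking into an actual selection is sketched below on synthetic data: rank by absolute correlation, exclude the target, and keep the top `k`. The data, the column `Noise`, and the choice `k = 2` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration: two informative columns and one pure-noise column
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
quality = rng.integers(1, 11, n).astype(float)
noise_col = rng.normal(0, 1, n)
price = 100 * area + 10_000 * quality + rng.normal(0, 5_000, n)

toy = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": quality,
    "Noise": noise_col,
    "SalePrice": price,
})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute value so strong negative correlations are not pushed to the bottom
ranked = corr_with_target.abs().sort_values(ascending=False)

k = 2  # arbitrary cut-off chosen for this example
top_features = ranked.head(k).index.tolist()

print(ranked)
print(top_features)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking like this is best treated as a first filter rather than a final feature set.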

### 4. Model Building & optimal value of lambda
- Split the dataset into training and test sets.
- Standardize the features.
- Apply Ridge and Lasso regression models.
- Use cross-validation to find the optimal value of lambda (the regularization strength, which scikit-learn calls `alpha`) for Ridge and Lasso regression.

In [48]:
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Splitting the dataset into training and test sets
X = data_encoded.drop('SalePrice', axis=1)
y = data_encoded['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter Tuning: Finding optimal alpha for Ridge and Lasso Regression
parameters = {'alpha': [1e-3, 1e-2, 1, 5, 10, 20, 30, 40, 50]}
ridge_cv = GridSearchCV(Ridge(), parameters, scoring='neg_mean_squared_error', cv=5)
lasso_cv = GridSearchCV(Lasso(max_iter=10000), parameters, scoring='neg_mean_squared_error', cv=5)

ridge_cv.fit(X_train_scaled, y_train)
lasso_cv.fit(X_train_scaled, y_train)
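
After fitting, the chosen value of lambda can be read from `best_params_`, and because the scoring is negative mean squared error, `best_score_` has to be negated before taking a square root. The sketch below shows this on synthetic data from `make_regression` rather than on the housing data; the dataset sizes and the `results` dictionary are assumptions for illustration. For brevity it scales the whole toy set once before cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (not the housing dataset)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_toy = StandardScaler().fit_transform(X_toy)

# Same alpha grid as in the notebook cell above
parameters = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20, 30, 40, 50]}

results = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    search = GridSearchCV(model, parameters, scoring="neg_mean_squared_error", cv=5)
    search.fit(X_toy, y_toy)
    results[name] = {
        # The alpha with the best mean cross-validated score
        "best_alpha": search.best_params_["alpha"],
        # best_score_ is the negated MSE, so flip the sign before the square root
        "cv_rmse": float(np.sqrt(-search.best_score_)),
    }

print(results)
```

If the selected alpha lands on the smallest or largest grid value, it is usually worth extending the grid in that direction and rerunning the search.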

### 5. Model Evaluation
- Evaluate the models on the held-out test set using R-squared and RMSE.

In [47]:
import warnings

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Suppressing warnings
warnings.filterwarnings('ignore')

# Model Evaluation: Using best alpha values
ridge_best = Ridge(alpha=ridge_cv.best_params_['alpha'])
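# Use the same max_iter as in the grid search so the refit has the same chance to converge
lasso_best = Lasso(alpha=lasso_cv.best_params_['alpha'], max_iter=10000)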

ridge_best.fit(X_train_scaled, y_train)
lasso_best.fit(X_train_scaled, y_train)

y_pred_ridge = ridge_best.predict(X_test_scaled)
y_pred_lasso = lasso_best.predict(X_test_scaled)

# Metrics: RMSE and R-squared
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)
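
For reference, both metrics can be computed by hand. RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The following sketch works through four hypothetical prices (invented for illustration, not taken from the test set) using only the standard library.

```python
import math

# Hypothetical true and predicted prices, invented for illustration
y_true = [200_000.0, 150_000.0, 250_000.0, 100_000.0]
y_pred = [210_000.0, 140_000.0, 240_000.0, 120_000.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares: how far predictions are from the true values
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Total sum of squares: how far the true values are from their own mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

rmse = math.sqrt(ss_res / n)   # same units as the target (currency)
r2 = 1 - ss_res / ss_tot       # unitless; 1.0 would be a perfect fit

print(f"RMSE: {rmse:.2f}")  # RMSE: 13228.76
print(f"R2: {r2:.3f}")      # R2: 0.944
```

RMSE is expressed in the same units as `SalePrice`, which makes it easier to judge against typical house prices than R-squared alone.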

### 6. Interpretation
- Interpret the results to identify which variables are significant in predicting house prices and how well they describe the price.
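
Two points help when reading the coefficients. Because the features were standardized, each coefficient is the estimated change in price per one standard deviation of that feature, not per raw unit. In addition, Lasso can shrink coefficients exactly to zero while Ridge only shrinks them toward zero, so counting non-zero Lasso coefficients gives a rough view of which features the model kept. The sketch below illustrates the second point on synthetic data; the dataset and `alpha = 5.0` are arbitrary assumptions, not the values tuned in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 actually drive the target
X_toy, y_toy = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

alpha = 5.0  # arbitrary regularization strength chosen for this illustration

ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_toy, y_toy)

# Ridge keeps every coefficient non-zero; Lasso drops some features entirely
n_ridge_nonzero = int(np.sum(ridge.coef_ != 0))
n_lasso_nonzero = int(np.sum(lasso.coef_ != 0))

print("Non-zero Ridge coefficients:", n_ridge_nonzero)
print("Non-zero Lasso coefficients:", n_lasso_nonzero)
```

On the housing data, the same count could be obtained from `lasso_best.coef_`, assuming the models from the evaluation step are in scope.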

In [39]:
# Identifying Significant Features
ridge_coeffs = pd.DataFrame({'Feature': X.columns, 'Ridge Coefficient': ridge_best.coef_})
lasso_coeffs = pd.DataFrame({'Feature': X.columns, 'Lasso Coefficient': lasso_best.coef_})

coeffs_combined = pd.merge(ridge_coeffs, lasso_coeffs, on='Feature')
coeffs_combined['Ridge Coefficient Absolute'] = coeffs_combined['Ridge Coefficient'].abs()
significant_features = coeffs_combined.sort_values('Ridge Coefficient Absolute', ascending=False).head(10)

print("Ridge RMSE:", rmse_ridge)
print("Ridge R2:", r2_ridge)
print("Lasso RMSE:", rmse_lasso)
print("Lasso R2:", r2_lasso)
print("Significant Features:\n", significant_features)

Ridge RMSE: 31041.653626058294
Ridge R2: 0.8437856852540064
Lasso RMSE: 33725.475236013575
Lasso R2: 0.8156057869489353
Significant Features:
                   Feature  Ridge Coefficient  Lasso Coefficient  \
73   Neighborhood_StoneBr        5021.869609        8445.157620   
141           BsmtQual_Ex        5011.268952        6754.455292   
15              GrLivArea        4836.535374       21470.274992   
3             OverallQual        4131.984263        5270.035147   
22           TotRmsAbvGrd        4027.814109        1080.838503   
129          ExterQual_Ex        3982.976504        4434.607926   
12               1stFlrSF        3885.357388        1399.226723   
11            TotalBsmtSF        3877.777851        9917.137396   
68   Neighborhood_NridgHt        3783.894467        3967.150310   
26             GarageArea        3592.833039       10860.461040   

     Ridge Coefficient Absolute  
73                  5021.869609  
141                 5011.268952  
15               