In [1]:
# ans 1:



# Importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Reading data into a DataFrame from a CSV file
df = pd.read_csv("encoded_data_benguluru1.csv")

# Splitting data into two parts
X = df.drop("price", axis=1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Imputer used to fill missing values before standardization
from sklearn.impute import SimpleImputer

# Identify numeric columns
numeric_columns = X_train.select_dtypes(include=['float64', 'int64']).columns

# Select only numeric columns for standardization
X_train_numeric = X_train[numeric_columns]
X_test_numeric = X_test[numeric_columns]

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_train_numeric_imputed = imputer.fit_transform(X_train_numeric)
X_test_numeric_imputed = imputer.transform(X_test_numeric)

# Standardize the numeric columns
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric_imputed)
X_test_scaled = scaler.transform(X_test_numeric_imputed)


# Starting regressor

# Define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf', 'poly'], 'gamma': ['scale', 'auto']}

# Create an SVR regressor
sv_regressor = SVR()

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=sv_regressor, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Use the best model for prediction
y_pred = best_model.predict(X_test_scaled)


Best Parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}


In [2]:
y_pred

array([ 55.10046664,  99.89961255,  55.10046664, ...,  55.10046664,
        55.10046664, 149.90031332])

In [6]:
!pip install gdown
import gdown

file_url = "https://drive.google.com/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0"
output_file = "encoded_data_benguluru1.csv"

gdown.download(file_url, output_file, quiet=False)
df = pd.read_csv(output_file)

    


Downloading...
From: https://drive.google.com/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0
To: /home/jovyan/work/encoded_data_benguluru1.csv
100%|██████████| 938k/938k [00:00<00:00, 2.29MB/s]


## So the best parameters found by the grid search are:
## Best Parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}

# ans 2:

If your goal is to predict the actual price of a house as accurately as possible, Mean Squared Error (MSE) is a more appropriate evaluation metric than R-squared for an SVM regression model.

Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual values. In the context of predicting house prices, MSE penalizes large errors more heavily than smaller errors. Minimizing MSE implies that you are trying to reduce the overall magnitude of prediction errors, which aligns well with the goal of accurate price prediction.

On the other hand, R-squared (coefficient of determination) measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is valuable for understanding how much variance the model explains, but because it is unitless and relative to the spread of the target, it does not directly convey how far your predictions are from the actual prices.

In summary, for the specific goal of predicting house prices accurately, MSE is a more suitable metric as it directly reflects the accuracy of the predicted values in terms of their closeness to the actual prices.
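To make the difference concrete, here is a minimal sketch in plain Python. The prices and predictions are made-up illustrative values, not rows from the dataset, and the helper functions `mse` and `r2` are hand-written stand-ins for `mean_squared_error` and `r2_score`. It shows that MSE changes with the units of the target while R-squared does not, which is why MSE speaks more directly to the size of the price errors.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions in some unit (illustrative values only)
prices_small = [50.0, 75.0, 100.0, 150.0]
preds_small = [55.0, 70.0, 110.0, 140.0]

# The same prices and predictions expressed in a unit 1000 times smaller
prices_large = [p * 1000 for p in prices_small]
preds_large = [p * 1000 for p in preds_small]

mse_small = mse(prices_small, preds_small)
mse_large = mse(prices_large, preds_large)
r2_small = r2(prices_small, preds_small)
r2_large = r2(prices_large, preds_large)

print(f"MSE in the original unit: {mse_small}")
print(f"MSE in the smaller unit:  {mse_large}")
print(f"R-squared in the original unit: {r2_small:.4f}")
print(f"R-squared in the smaller unit:  {r2_large:.4f}")
```

The MSE grows by a factor of one million (the square of the unit change), whereas R-squared stays the same, so R-squared alone cannot tell you how large the typical price error is.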

In [None]:
# code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Load the data, one-hot encode any remaining categorical columns,
# and drop rows that contain missing values
df = pd.read_csv("encoded_data_benguluru1.csv")
df_encoded = pd.get_dummies(df)
df_encoded = df_encoded.dropna()
X = df_encoded.drop("price", axis=1)
y = df_encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an SVM regression model
sv_regressor = SVR(kernel='linear', C=1.0)
sv_regressor.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = sv_regressor.predict(X_test_scaled)

# Evaluate using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

# Evaluate using R-squared
r_squared = r2_score(y_test, y_pred)
print(f'R-squared: {r_squared}')


# ans 3:

When dealing with a dataset that has a significant number of outliers, Mean Squared Error (MSE) may not be the most appropriate regression metric. MSE is sensitive to outliers because it squares the differences between predicted and actual values, giving more weight to larger errors.

A more robust metric in the presence of outliers is Mean Absolute Error (MAE). MAE is less affected by extreme values since it takes the absolute values of the differences between predicted and actual values. It is calculated as the average of the absolute differences between predictions and true values.

Using MAE as a regression metric is often recommended when dealing with datasets containing outliers because it provides a more balanced measure of the model's accuracy without being heavily influenced by a few extreme observations.

In scikit-learn, you can use `mean_absolute_error` from the `sklearn.metrics` module to calculate the Mean Absolute Error.

Here's an example:

```python
from sklearn.metrics import mean_absolute_error

# Assuming y_true and y_pred are your true and predicted values
mae = mean_absolute_error(y_true, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')
```

Consider using MAE when evaluating your SVM regression model on datasets with significant outliers.
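As a rough illustration of the sensitivity described above, the sketch below computes both metrics by hand on made-up values, once with small errors everywhere and once with a single badly missed prediction. The numbers and the helper functions `mae` and `mse` are assumptions for demonstration only; on real data you would use the scikit-learn functions instead.

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true prices (illustrative values only)
y_true = [50.0, 60.0, 70.0, 80.0, 90.0]

# Predictions with small errors on every point
y_pred_clean = [52.0, 58.0, 71.0, 79.0, 92.0]

# Same predictions, except the last one is off by 50
y_pred_outlier = [52.0, 58.0, 71.0, 79.0, 140.0]

mae_clean, mse_clean = mae(y_true, y_pred_clean), mse(y_true, y_pred_clean)
mae_outlier, mse_outlier = mae(y_true, y_pred_outlier), mse(y_true, y_pred_outlier)

print(f"Clean:        MAE = {mae_clean:.1f}, MSE = {mse_clean:.1f}")
print(f"With outlier: MAE = {mae_outlier:.1f}, MSE = {mse_outlier:.1f}")
print(f"MAE grew by a factor of {mae_outlier / mae_clean:.1f}")
print(f"MSE grew by a factor of {mse_outlier / mse_clean:.1f}")
```

One extreme error multiplies the MAE by 7 here, but multiplies the MSE by roughly 179, because the error is squared before averaging.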

# ans 4:

When MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) are very close, it is generally preferable to choose the RMSE as the evaluation metric, especially for regression tasks. The RMSE is essentially the square root of the MSE and shares the same unit as the target variable. Here's why RMSE is often preferred:

1. **Interpretability:** The RMSE is more interpretable as it is in the same units as the target variable. This makes it easier to convey the magnitude of prediction errors in a way that aligns with the original scale of the data.

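2. **Sensitivity to Outliers:** RMSE is a monotonic transformation of MSE, so both metrics rank models identically and both are driven by large errors, because the errors are squared before averaging. The square root only brings the value back to the scale of the target, so a large error inflates the reported number less dramatically than it does with MSE. If outliers are a real concern, a metric such as MAE is the more robust choice.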

3. **Consistency with the Original Metric:** If your original goal or problem statement was framed in terms of RMSE or closely related metrics, it makes sense to stick with RMSE for consistency.

While both metrics measure the average magnitude of prediction errors and lead to the same ranking of models, RMSE is often preferred for its interpretability.
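The relationship between the two metrics can be checked with a short sketch. The values below are invented for illustration, and `model_a` and `model_b` are hypothetical prediction lists rather than outputs of the SVR models in this notebook. The sketch shows that RMSE is the square root of MSE and that comparing two models gives the same winner under either metric.

```python
import math


def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical true prices and two sets of predictions (illustrative values only)
y_true = [50.0, 75.0, 100.0, 150.0]
model_a = [55.0, 70.0, 110.0, 140.0]   # moderate errors on every point
model_b = [52.0, 74.0, 101.0, 120.0]   # small errors plus one large miss

mse_a, mse_b = mse(y_true, model_a), mse(y_true, model_b)
rmse_a, rmse_b = rmse(y_true, model_a), rmse(y_true, model_b)

print(f"Model A: MSE = {mse_a:.2f}, RMSE = {rmse_a:.2f}")
print(f"Model B: MSE = {mse_b:.2f}, RMSE = {rmse_b:.2f}")
print("Same winner under both metrics:", (mse_a < mse_b) == (rmse_a < rmse_b))
```

The RMSE values read directly as "typical error in price units", which is the practical reason to report RMSE even though model selection would be identical with MSE.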

# ans 5:

If your goal is to measure how well the model explains the variance in the target variable, then R-squared (coefficient of determination) would be the most appropriate evaluation metric.

R-squared provides a measure of the proportion of the variance in the dependent variable that is predictable from the independent variables. Specifically, it quantifies the goodness of fit of the model by comparing the residual sum of squares of the predictions to the total sum of squares of the actual values around their mean. A value of 1 indicates a perfect fit and 0 means the model does no better than always predicting the mean; on held-out data the value can be negative when the model performs worse than that baseline.
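The definition can be worked through by hand. The following is a minimal sketch with made-up values; the helper `r2` is a hand-written stand-in that is expected to agree with scikit-learn's `r2_score` for inputs like these. It covers three cases: a perfect fit, a model that always predicts the mean, and a model that is worse than the mean.

```python
def r2(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the target from its mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true prices (illustrative values only)
y_true = [50.0, 75.0, 100.0, 150.0]
mean_price = sum(y_true) / len(y_true)

r2_perfect = r2(y_true, list(y_true))                 # predictions equal the truth
r2_mean = r2(y_true, [mean_price] * len(y_true))      # always predict the mean
r2_bad = r2(y_true, [150.0, 100.0, 75.0, 50.0])       # predictions in reverse order

print(f"Perfect predictions:  R-squared = {r2_perfect}")
print(f"Mean-only baseline:   R-squared = {r2_mean}")
print(f"Worse than baseline:  R-squared = {r2_bad:.3f}")
```

The third case comes out negative, which is a useful reminder that a low or negative R-squared on the test set signals a model that explains less variance than a constant prediction.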

For SVM regression models with different kernels (linear, polynomial, and RBF), you can use the `r2_score` function from scikit-learn to calculate R-squared. Here's an example:

```python
from sklearn.metrics import r2_score
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assuming X, y are your features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM regression models with different kernels
kernels = ['linear', 'poly', 'rbf']
for kernel in kernels:
    sv_regressor = SVR(kernel=kernel)
    sv_regressor.fit(X_train_scaled, y_train)
    y_pred = sv_regressor.predict(X_test_scaled)
    
    # Calculate R-squared
    r_squared = r2_score(y_test, y_pred)
    print(f'R-squared for {kernel} kernel: {r_squared}')
```

This code demonstrates how to train SVM regression models with different kernels and evaluate their performance using R-squared. Adjust the kernel types and other parameters based on your specific scenario.