<a href="https://www.kaggle.com/code/ziadhamed/data-preprocessing-and-model-selection-regression?scriptVersionId=204571874" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Preprocessing Summary

This notebook walks through the preprocessing steps needed to prepare a small tabular dataset for scikit-learn regression models, then compares several regressors on the result. Each section applies one transformation:

1. **Dataset Importation**: The dataset is loaded with `pandas`.
2. **Feature Selection**: The independent variables (X: `Country`, `Age`, `Salary`) and the dependent variable (Y: `Purchased`) are extracted from the dataset.
3. **Handling Missing Values**: `SimpleImputer` replaces each `NaN` in the numeric columns with the mean of that column.
4. **Encoding Categorical Data**: `LabelEncoder` converts the target to integers, and `OneHotEncoder` expands the `Country` column into one binary column per country.
5. **Train-Test Split**: `train_test_split` from scikit-learn divides the data into training and testing sets.
6. **Feature Scaling**: `StandardScaler` standardizes the numeric features so that each has a mean of 0 and a standard deviation of 1 on the training set.

The diagram below illustrates these preprocessing steps:

![Data Preprocessing Workflow](https://www.techtarget.com/rms/onlineimages/steps_for_data_preprocessing-f_mobile.png)
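
The six steps can also be chained into a single scikit-learn object. The sketch below is one possible arrangement, not the approach used in the cells that follow: it rebuilds the ten rows inline instead of reading `Data.csv`, and it splits first so that the imputer and scaler are fit on the training rows only. The names `numeric_steps` and `preprocess` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Inline copy of the ten rows shown in this notebook (stands in for Data.csv)
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = dataset[["Country", "Age", "Salary"]]
y = (dataset["Purchased"] == "Yes").astype(int)  # 'No' -> 0, 'Yes' -> 1

# Numeric columns: mean imputation, then standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Country is one-hot encoded; Age and Salary go through numeric_steps
preprocess = ColumnTransformer(
    [
        ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("numeric", numeric_steps, ["Age", "Salary"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transformation to the test rows
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the imputer here sees only the training rows, the filled-in values can differ slightly from those produced when the imputer is fit on the full dataset.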


In [1]:
! pip install -U scikit-learn


In [2]:
import pandas as pd

dataset = pd.read_csv('/kaggle/input/dataforpreprocessing/Data.csv')

In [3]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [4]:
# Independent variables (features): every column except the last
X = dataset.iloc[:, :-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [5]:
# Dependent variable (target): the last column, Purchased
Y = dataset.iloc[:, -1].values
Y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [6]:
# Handle missing values: replace NaN in Age and Salary with the column mean
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
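
The two filled-in values can be checked by hand: each is the mean of the nine observed entries in its column. The sketch below reproduces that arithmetic in plain Python; `mean_impute` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

# Age and Salary columns as shown in the dataset printout (nan = missing)
ages = [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, math.nan, 48.0, 50.0, 37.0]
salaries = [72000.0, 48000.0, 54000.0, 61000.0, math.nan,
            58000.0, 52000.0, 79000.0, 83000.0, 67000.0]

def mean_impute(values):
    """Replace every NaN with the mean of the non-missing values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    return [fill if math.isnan(v) else v for v in values]

ages_filled = mean_impute(ages)
salaries_filled = mean_impute(salaries)

print(ages_filled[6])      # 349 / 9
print(salaries_filled[4])  # 574000 / 9
```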


In [7]:
# Handle categorical data: LabelEncoder maps the target 'No' -> 0 and 'Yes' -> 1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)

print(Y)

[0 1 0 0 1 1 0 1 0 1]
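
`LabelEncoder` assigns codes in sorted order of the labels, which is why `'No'` becomes 0 and `'Yes'` becomes 1. A minimal sketch that makes the mapping explicit and shows how to go back to the original strings (the `labels` list is copied from the `Y` printout above):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```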


In [8]:
# One-hot encode the Country column (column 0)
oneHotEncoder = OneHotEncoder()
X_transformed = oneHotEncoder.fit_transform(X[:, [0]]).toarray()
# The remaining columns (Age, Salary) are already a 2-D slice, so no reshape is needed
X_remaining = X[:, 1:]
X_encoded = np.hstack((X_transformed, X_remaining))

print(X_encoded)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [9]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, Y, test_size=0.2, random_state=42)
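
With ten rows, `test_size=0.2` leaves only two rows for testing. An optional variation, not used in this notebook, is to pass `stratify` so that the two test rows keep the same class balance as the full target. A sketch under that assumption, with row indices standing in for the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])  # encoded Purchased column
X = np.arange(len(y)).reshape(-1, 1)           # row indices stand in for features

# stratify=y keeps the 50/50 class balance in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(sorted(y_test.tolist()))
```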


In [10]:
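# Feature scaling: standardize Age and Salary (columns 3 and 4).
# The scaler is fit on the training set only and then reused on the test set.
from sklearn.preprocessing import StandardScaler

sts = StandardScaler()
X_train[:, 3:5] = sts.fit_transform(X_train[:, 3:5])
X_test[:, 3:5] = sts.transform(X_test[:, 3:5])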

print(X_train)

[[1.0 0.0 0.0 -0.7529426005471072 -0.6260377781240918]
 [1.0 0.0 0.0 1.008453807952985 1.0130429500553495]
 [1.0 0.0 0.0 1.7912966561752484 1.8325833141450703]
 [0.0 1.0 0.0 -1.7314961608249362 -1.0943465576039322]
 [1.0 0.0 0.0 -0.3615211764359756 0.42765697570554906]
 [0.0 1.0 0.0 0.22561095973072184 0.05040823668012247]
 [0.0 0.0 1.0 -0.16581046438040975 -0.27480619351421154]
 [0.0 0.0 1.0 -0.013591021670525094 -1.3285009473438525]]
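
Standardization computes z = (x - mean) / std for each column, using the mean and population standard deviation of the training rows. The sketch below redoes that calculation for the Age column in plain Python. The list `train_ages` is an assumption: it is the set of training ages in the row order inferred from the printed `X_train`, which the notebook does not state explicitly.

```python
import statistics

# Training-set ages, in the row order implied by the printed X_train
# (an inference from the output above, not something the notebook states)
train_ages = [35.0, 44.0, 48.0, 30.0, 37.0, 40.0, 38.0, 349 / 9]

mean = statistics.fmean(train_ages)
std = statistics.pstdev(train_ages)  # population std (divide by n), as StandardScaler does

z_scores = [(age - mean) / std for age in train_ages]
print(z_scores[0])  # compare with the first Age value in the printed X_train
```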


In [11]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score



models = [
    ('Linear Regression', LinearRegression()),
    ('Ridge Regression', Ridge()),
    ('Lasso Regression', Lasso()),
    ('Decision Tree', DecisionTreeRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('Support Vector Regression', SVR()),
    ('XGBoost', XGBRegressor(objective='reg:squarederror', random_state=42))
]

# Dictionary to store the MSE for each model
mse_scores = {}

# Evaluate each model (the test set holds only 2 of the 10 rows, so the scores are noisy)
for name, model in models:
    # Fit the model to the training data
    model.fit(X_train, y_train)

    # Predict the target variable on the test data
    y_pred = model.predict(X_test)

    # Calculate the Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Store the MSE score in the dictionary
    mse_scores[name] = mse
    r2 = r2_score(y_test, y_pred)

    print(f"{name}: MSE = {mse:.3f} \n R^2 Score = {r2:.4f}")



# Find the model with the lowest MSE
best_model_name = min(mse_scores, key=mse_scores.get)
best_mse = mse_scores[best_model_name]

print(f"\nBest Model: {best_model_name} with MSE = {best_mse:.3f}")

Linear Regression: MSE = 0.910 
 R^2 Score = -2.6391
Ridge Regression: MSE = 0.746 
 R^2 Score = -1.9827
Lasso Regression: MSE = 0.250 
 R^2 Score = 0.0000
Decision Tree: MSE = 1.000 
 R^2 Score = -3.0000
Random Forest: MSE = 0.661 
 R^2 Score = -1.6440
Support Vector Regression: MSE = 0.557 
 R^2 Score = -1.2266
XGBoost: MSE = 1.005 
 R^2 Score = -3.0184

Best Model: Lasso Regression with MSE = 0.250
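
The scores above are easier to read once the metrics are worked out by hand. The sketch below assumes the two test labels are one 0 and one 1, which is consistent with the reported numbers but is an inference rather than something the notebook prints. Under that assumption, a model that predicts a constant 0.5 gets exactly the Lasso scores, and a model that gets both rows completely wrong gets the Decision Tree scores. The helpers `mse` and `r2` are written out here for illustration and follow the standard definitions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_test = [0, 1]  # assumed: one 'No' and one 'Yes' in the two-row test set

# A constant prediction equal to the mean of the test labels
constant_pred = [0.5, 0.5]
print(mse(y_test, constant_pred), r2(y_test, constant_pred))  # 0.25 0.0

# Both rows predicted as the opposite class
wrong_pred = [1, 0]
print(mse(y_test, wrong_pred), r2(y_test, wrong_pred))  # 1.0 -3.0
```

An R² of 0 therefore means "no better than predicting the mean", and negative values mean worse than that baseline, so on this split none of the models beats a constant prediction.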
