### Feature Scaling

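Feature scaling is a **data preprocessing technique** that transforms the features (variables) of a dataset onto a similar scale.  
The purpose is to keep **features with large numeric ranges from dominating** features with small ranges. This matters most for models that rely on distances or gradient descent (for example k-nearest neighbors, SVMs, and neural networks); tree-based models are largely insensitive to feature scale.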
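
To make the "domination" concrete, here is a minimal sketch that compares two cars on the unscaled `year` and `km_driven` columns. The two rows are taken from the first rows of the dataset loaded below; the variable names are illustrative only. It computes how much each feature contributes to the squared Euclidean distance between the two cars.

```python
import numpy as np

# Two cars from the first rows of the dataset: (year, km_driven)
car_a = np.array([2007.0, 70000.0])
car_b = np.array([2017.0, 46000.0])

# Squared difference per feature, the building block of Euclidean distance
squared_diff = (car_a - car_b) ** 2

# Fraction of the total squared distance that comes from each feature
share = squared_diff / squared_diff.sum()

print(f"year share:      {share[0]:.2e}")
print(f"km_driven share: {share[1]:.6f}")
```

Almost the entire distance comes from `km_driven`, even though a ten-year age gap is arguably a large difference between two cars. Scaling both columns to a common range removes this effect of units.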


In [25]:
# Importing the pandas library for data handling and analysis
import pandas as pd

# Reading the dataset from the CSV file
# Make sure the file 'MinMaxScaler.csv' is in the same directory as your notebook
df = pd.read_csv("MinMaxScaler.csv")

# Displaying the first 5 rows of the dataset to understand its structure
df.head()


|   | year | selling_price | km_driven |
|---|------|---------------|-----------|
| 0 | 2007 | 60000 | 70000 |
| 1 | 2007 | 135000 | 50000 |
| 2 | 2012 | 600000 | 100000 |
| 3 | 2017 | 250000 | 46000 |
| 4 | 2014 | 450000 | 141000 |


### Normalization (Min-Max Scaling)

Normalization is a **scaling technique** in which values are shifted and rescaled  
so that they end up ranging between **0 and 1**.  
It is also known as **Min-Max Scaling**.

The formula for Min-Max scaling is given by:

x_scaled = (x - x_min) / (x_max - x_min)

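After this transformation, each feature's minimum maps to 0 and its maximum maps to 1,  
so all features share the same range and none dominates simply because of its units.  
Because the result depends only on the minimum and maximum, a single extreme value can compress the remaining values into a narrow band.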
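
As a quick check of the formula, the sketch below applies it by hand to the five rows displayed by `df.head()` above and compares the result with `MinMaxScaler`. It uses only those five rows, so the minimum and maximum (and therefore the scaled values) differ from what the full dataset produces later in this notebook. The names `sample`, `manual`, and `sklearn_scaled` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The five rows shown by df.head() (not the full dataset)
sample = pd.DataFrame({
    "year": [2007, 2007, 2012, 2017, 2014],
    "selling_price": [60000, 135000, 600000, 250000, 450000],
    "km_driven": [70000, 50000, 100000, 46000, 141000],
})

# Apply the formula column by column: (x - x_min) / (x_max - x_min)
manual = (sample - sample.min()) / (sample.max() - sample.min())

# MinMaxScaler with its default feature_range=(0, 1) applies the same formula
sklearn_scaled = MinMaxScaler().fit_transform(sample)

print(manual.round(3))
print("Matches MinMaxScaler:", bool((abs(manual.to_numpy() - sklearn_scaled) < 1e-9).all()))
```

In each column, the smallest value becomes 0 and the largest becomes 1; everything else lands proportionally in between.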


In [27]:
from sklearn.preprocessing import MinMaxScaler   # Import MinMaxScaler class from sklearn for normalization
scale_obj = MinMaxScaler()                       # Create a MinMaxScaler object to scale features between 0 and 1

In [29]:
# Fit the MinMaxScaler on the dataset and transform it
# Each column is scaled independently to [0, 1] using its own min and max:
# (x - x_min) / (x_max - x_min)
# Note: this fits on every row and every column, including the target 'selling_price'
scaled_arr = scale_obj.fit_transform(df)

# Convert the scaled NumPy array back into a DataFrame
# This helps retain column names and makes the data easier to read
scaled_df = pd.DataFrame(data=scaled_arr, columns=df.columns)

# Display the first 5 rows of the scaled DataFrame to verify the transformation
scaled_df.head()

|   | year | selling_price | km_driven |
|---|------|---------------|-----------|
| 0 | 0.535714 | 0.004505 | 0.086783 |
| 1 | 0.535714 | 0.012950 | 0.061988 |
| 2 | 0.714286 | 0.065315 | 0.123976 |
| 3 | 0.892857 | 0.025901 | 0.057028 |
| 4 | 0.785714 | 0.048423 | 0.174807 |
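
A fitted `MinMaxScaler` stores the per-column minimum and maximum it learned, and it can map scaled values back to the original units. This is useful here because the target `selling_price` was scaled too. The sketch below illustrates this on the five rows shown earlier, standing in for the full CSV; `sample`, `scaler`, and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: the five rows shown by df.head(), not the full dataset
sample = pd.DataFrame({
    "year": [2007, 2007, 2012, 2017, 2014],
    "selling_price": [60000, 135000, 600000, 250000, 450000],
    "km_driven": [70000, 50000, 100000, 46000, 141000],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(sample)

# Per-column minimum and maximum learned during fit
print("min:", dict(zip(sample.columns, scaler.data_min_)))
print("max:", dict(zip(sample.columns, scaler.data_max_)))

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Round trip recovers the data:", np.allclose(restored, sample.to_numpy()))
```

Keeping the fitted scaler object around is what makes it possible to report predictions in the original price units later.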


In [16]:
from sklearn.model_selection import train_test_split

In [30]:
# Separate the independent variables (features) from the dependent variable (target)
# Here, 'selling_price' is the target column we want to predict
X = scaled_df.drop('selling_price', axis=1)

# Store the target column 'selling_price' in a separate variable Y
Y = scaled_df['selling_price']


In [31]:
# Split the dataset into training and testing sets
# test_size=0.2 → 20% of the data is used for testing, 80% for training
# random_state=42 → ensures the split is reproducible (same results every time)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
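
The effect of `test_size` and `random_state` can be seen on a small toy array. This is a sketch on made-up data (`X_demo` and `y_demo` are hypothetical and not part of the car dataset): splitting ten rows with `test_size=0.2` leaves eight for training and two for testing, and repeating the call with the same `random_state` selects the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (made up for illustration)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with identical settings
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print("train shape:", X_train_a.shape, "test shape:", X_test_a.shape)

# The same random_state yields the same test rows both times
print("same test rows:", np.array_equal(split_a[3], split_b[3]))
```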


In [23]:
X_test

|   | year | km_driven |
|---|------|-----------|
| 3978 | 0.642857 | 0.099181 |
| 1448 | 0.964286 | 0.033473 |
| 2664 | 0.571429 | 0.051726 |
| 17 | 0.785714 | 0.174807 |
| 1634 | 0.857143 | 0.074385 |
| ... | ... | ... |
| 3468 | 0.535714 | 0.074385 |
| 3164 | 0.821429 | 0.148772 |
| 416 | 0.678571 | 0.049590 |
| 1616 | 0.892857 | 0.008033 |


In [24]:
X_train

|   | year | km_driven |
|---|------|-----------|
| 227 | 0.892857 | 0.024794 |
| 964 | 0.928571 | 0.061988 |
| 2045 | 0.750000 | 0.030993 |
| 1025 | 0.678571 | 0.086783 |
| 4242 | 0.892857 | 0.089263 |
| ... | ... | ... |
| 3444 | 0.500000 | 0.061988 |
| 466 | 0.678571 | 0.099181 |
| 3092 | 0.857143 | 0.063227 |
| 3772 | 0.750000 | 0.099181 |
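
The cells above fit the scaler on the whole dataset and split afterwards, so the minimum and maximum used for scaling also reflect the test rows. A common alternative is to split first, fit the scaler on the training features only, and reuse it on the test features. The sketch below shows that order on synthetic data; the `raw` DataFrame is randomly generated with arbitrary ranges and only mimics the column names of the car dataset, and the target is left unscaled here as one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the car dataset (values are random, not real)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=50),
    "selling_price": rng.integers(20_000, 900_000, size=50),
    "km_driven": rng.integers(1_000, 300_000, size=50),
})

X_raw = raw.drop("selling_price", axis=1)
y_raw = raw["selling_price"]

# 1. Split first, on the unscaled data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training features only
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train_raw),
    columns=X_raw.columns, index=X_train_raw.index,
)

# 3. Reuse the training min/max to transform the test features
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_raw),
    columns=X_raw.columns, index=X_test_raw.index,
)

print(X_train_scaled.describe().loc[["min", "max"]])
print(X_test_scaled.describe().loc[["min", "max"]])
```

With this order, the training features span exactly 0 to 1, while test values may fall slightly below 0 or above 1 when a test row lies outside the range seen in training. That is expected and reflects what the model would encounter on new data.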


### Conclusion

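We performed **data preprocessing** using the **Min-Max Scaling** technique.

Key points:
- Every column (`year`, `km_driven`, and the target `selling_price`) was scaled to a **range between 0 and 1**.
- The dataset was split into **training (80%)** and **testing (20%)** sets for model development.
- Scaling puts the features on a **comparable** scale, so large-valued features such as `km_driven` do not dominate models that are sensitive to feature magnitude.
- The scaler was fit on the full dataset before the split, so the test rows influenced the minimum and maximum. For an unbiased evaluation, fit the scaler on the training set only and reuse it to transform the test set.

The data is now **normalized and split** for building a machine learning model.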
