# **Day 48: Preprocess Data with Sklearn**

*1.*

*When training a model on a dataset, it is advisable to split the data into training and test sets. Why do we split data into training and test sets?* 

>We split data into training and test sets to check how well the model performs on unseen data.

>>Training set: Used to teach the model.

>>Test set: Used to evaluate how well the model generalizes to new, real-world data.

>If you train and test on the same data, the model can simply memorize it: it scores perfectly on data it has already seen, yet may fail badly on new data (overfitting). That’s like memorizing the answers to past exam papers and then failing the actual exam.
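The memorization problem can be made concrete with a deliberately naive sketch. The `Memorizer` class, the `accuracy` helper, and the synthetic "even or odd" data below are all hypothetical and exist only for illustration; they are not part of scikit-learn or of this exercise's dataset.

```python
# Toy "model" that only memorizes its training examples (hypothetical, for illustration).
class Memorizer:
    def fit(self, xs, ys):
        # Store every training example in a lookup table.
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, xs, default=0):
        # Return the stored label if the input was seen, otherwise a fixed guess.
        return [self.table.get(x, default) for x in xs]


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data: the true rule is "label is 1 when x is even".
xs = list(range(100))
ys = [1 if x % 2 == 0 else 0 for x in xs]

# Hold out the last 20% as a test set.
x_train, x_test = xs[:80], xs[80:]
y_train, y_test = ys[:80], ys[80:]

model = Memorizer().fit(x_train, y_train)
train_acc = accuracy(y_train, model.predict(x_train))
test_acc = accuracy(y_test, model.predict(x_test))

print(f"train accuracy: {train_acc:.2f}")  # 1.00
print(f"test accuracy:  {test_acc:.2f}")   # 0.50
```

Evaluated on its own training data, the lookup table looks perfect. Only the held-out test set reveals that it never learned the underlying rule.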

In [1]:
import pandas as pd

df = pd.read_csv("Data/flower_dataset.csv")
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | flower_type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | versicolor |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | versicolor |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


*2.*

*Create a copy of the DataFrame. In supervised machine learning, the target variable is the variable that is predicted from the other variables in the dataset. It is typically converted to a numeric data type before fitting a model. Convert the target variable (flower_type) to a numeric data type and separate the target column from the other variables by creating two variables, X and y: X holds the features used to predict the target, y. Use scikit-learn's train_test_split() function to split the data into training and test sets, with the test set making up 20% of the dataset and the random_state parameter set to 42. Check the shapes of the training and test sets. What is the purpose of the random_state parameter in the train_test_split() function?*

In [2]:
df_copy = df.copy()

In [3]:
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted alphabetically)
le = LabelEncoder()
df_copy["flower_type"] = le.fit_transform(df_copy["flower_type"])
df_copy.head()

| | sepal_length | sepal_width | petal_length | petal_width | flower_type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
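To see which integer stands for which flower type, the fitted encoder can be inspected. The sketch below is standalone and uses a short hand-typed list as a stand-in for the `flower_type` column (the five values shown in the table above), so the variable names `labels` and `mapping` are assumptions rather than part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the flower_type column (first five values of the dataset).
labels = ["setosa", "setosa", "versicolor", "versicolor", "setosa"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the original labels; position i is the label encoded as integer i.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'setosa': 0, 'versicolor': 1}

# inverse_transform converts integer codes back to the original strings.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted `le` object around is useful later, since model predictions come back as integers and usually need to be translated back into readable labels.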


In [4]:
from sklearn.model_selection import train_test_split

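# Features: every column except the target
X = df_copy.drop(columns=["flower_type"])

# Target: the encoded flower type
y = df_copy["flower_type"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape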

((16, 4), (16,), (4, 4), (4,))
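On the `random_state` question: `train_test_split()` shuffles the rows before splitting, and `random_state` seeds that shuffle. Fixing it to an integer such as 42 means the same rows land in the training and test sets on every run, which keeps results reproducible and comparable. A minimal sketch, using synthetic stand-in arrays of the same shape as the dataset rather than the real flower data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 4 features (an assumption for illustration).
X = np.arange(80).reshape(20, 4)
y = np.arange(20) % 2

# Two calls with the same random_state produce the same split.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)                       # True
print(X_train_a.shape, X_test_a.shape)  # (16, 4) (4, 4)
```

Leaving `random_state` unset gives a different shuffle on each run, so evaluation scores can vary from run to run even when nothing else has changed.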

*3.*

*Why is it important to standardize the data before fitting? Use Scikit-Learn's StandardScaler to standardize the features in the dataset (training and test sets).* 

>Standardizing rescales each feature to mean 0 and standard deviation 1, so no single feature dominates just because it has bigger values. Models like SVM, KNN, and logistic regression (especially when regularized) are sensitive to feature scales: without standardizing, features with large ranges dominate the distance or penalty calculations and the model can perform poorly. It’s basically leveling the playing field.
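How strongly an unscaled feature can dominate is easy to check by hand. The sketch below uses a hypothetical two-feature dataset (the values and the helper names `standardize` and `small_feature_share` are made up for illustration) and measures how much of the squared Euclidean distance between two rows comes from the small-valued feature, before and after standardizing.

```python
import statistics

# Hypothetical two-feature dataset: one feature in single digits, one in the thousands.
small = [1.0, 2.0, 3.0, 4.0]
large = [1000.0, 3000.0, 2000.0, 4000.0]


def standardize(values):
    # z = (x - mean) / std, using the population standard deviation.
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def small_feature_share(small_col, large_col, i=0, j=1):
    # Fraction of the squared Euclidean distance between rows i and j
    # that comes from the small-valued feature.
    d_small = (small_col[i] - small_col[j]) ** 2
    d_large = (large_col[i] - large_col[j]) ** 2
    return d_small / (d_small + d_large)


raw_share = small_feature_share(small, large)
scaled_share = small_feature_share(standardize(small), standardize(large))

print(f"raw:    {raw_share:.1e}")     # 2.5e-07
print(f"scaled: {scaled_share:.2f}")  # 0.20
```

Before scaling, the small feature contributes almost nothing to the distance, so a distance-based model such as KNN would effectively ignore it. After standardizing, both features contribute on comparable terms.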

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training set only, then reuse the same mean and std on the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train

array([[-1.3782598 , -1.53281263, -0.23791548, -0.2773501 ],
       [ 0.94301987,  1.25411943,  1.66540833,  1.94145069],
       [-0.44974794, -0.1393466 ,  1.03096706, -0.2773501 ],
       [-0.91400387, -0.97542622,  0.39652579, -0.2773501 ],
       [ 1.63940377,  0.97542622,  1.66540833,  0.83205029],
       [ 0.94301987,  1.25411943, -0.87235674,  1.94145069],
       [-1.61038777, -1.25411943, -2.14123928, -1.38675049],
       [-0.6818759 , -0.69673301, -0.87235674, -0.2773501 ],
       [-0.21761997, -0.97542622,  0.39652579, -1.38675049],
       [ 0.24663596,  0.97542622,  0.39652579,  0.83205029],
       [ 0.014508  ,  0.41803981, -0.23791548, -0.2773501 ],
       [-0.44974794, -1.25411943, -0.23791548, -1.38675049],
       [ 0.014508  , -0.1393466 ,  0.39652579, -0.2773501 ],
       [ 0.94301987,  0.69673301,  0.39652579, -0.2773501 ],
       [ 1.87153173,  1.53281263, -1.50679801, -0.2773501 ],
       [-0.91400387, -0.1393466 , -0.23791548,  0.83205029]])
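What `fit_transform()` and `transform()` compute can be reproduced by hand as a sanity check. The sketch below uses randomly generated stand-in arrays (an assumption, not the flower data) and compares `StandardScaler` with a manual z-score that uses the training set's mean and standard deviation for both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and test features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(16, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(4, 4))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same computation by hand, using training statistics only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std

print(np.allclose(X_train_scaled, manual_train))       # True
print(np.allclose(X_test_scaled, manual_test))         # True
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_train_scaled.std(axis=0), 1.0))    # True
```

The test set is transformed with the training mean and standard deviation, so its columns will generally not have exactly mean 0 and standard deviation 1. That is expected: fitting the scaler on the test set would leak information about unseen data into preprocessing.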