## Introduction to Data Preprocessing

Data preprocessing is a critical step in the machine learning workflow that involves preparing and cleaning raw data to make it suitable for modeling. Effective preprocessing ensures that your data is in the best possible shape to build accurate and reliable machine learning models.

In this guide, we'll cover fundamental preprocessing techniques that are essential for beginners. These steps include:

1. Load Your Data: Import your dataset into a format suitable for analysis.

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
# Load the dataset
iris = load_iris()
# Convert to a pandas DataFrame
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['target'] = iris.target
# Display the first few rows of the DataFrame
print(data.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
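
The cell above loads a dataset that ships with scikit-learn. In practice, data often arrives as a file instead. The following is a minimal sketch of the same loading step using `pd.read_csv`. The CSV text, the column names, and the variable `df_csv` are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
6.3,3.3,virginica
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Check the size and the inferred column types right after loading.
print(df_csv.shape)
print(df_csv.dtypes)
```

With a real file, the call would take the path directly, for example `pd.read_csv("my_data.csv")`, where the file name is again only a placeholder.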


2. Inspect Your Data: Understand the structure of your data by checking the first few rows, columns, and summary statistics.

In [4]:
# View the first few rows
print(data.head())
# Get a summary of the dataset (info() prints directly and returns None)
data.info()
# Summary statistics
print(data.describe())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float
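
Beyond `head()`, `info()`, and `describe()`, two further checks are often useful at this stage: counting missing values per column and looking at how the target classes are distributed. The sketch below runs both on a small hand-made frame. The frame `sample` and its values are assumptions for illustration, not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value, for illustration only.
sample = pd.DataFrame({
    "petal_length": [1.4, np.nan, 4.7, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.9],
    "target": [0, 0, 1, 2],
})

# Number of rows and columns.
print(sample.shape)

# Count of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# How many rows belong to each target class.
class_counts = sample["target"].value_counts()
print(class_counts)
```

The same calls can be applied to `data` from the cells above.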

3. Handle Missing Values: Decide how to deal with missing values, either by filling them in or by dropping the affected rows.

In [5]:
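# Option 1: fill missing values with the column mean (numerical columns only)
data.fillna(data.mean(numeric_only=True), inplace=True)
# Option 2: drop any rows that still contain missing values
# (Iris has no missing values, so neither call changes the data here)
data.dropna(inplace=True)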
print(data.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
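
Because the Iris data has no missing values, the cell above leaves it unchanged. To see what the two strategies actually do, here is a minimal sketch on a tiny frame that does contain gaps. The frame `raw` and its column names are hypothetical. Each strategy is applied to the original frame separately, so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
raw = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "width": [2.0, 4.0, np.nan],
})

# Option 1: replace each NaN with the mean of its own column.
filled = raw.fillna(raw.mean(numeric_only=True))

# Option 2: drop every row that contains at least one NaN.
dropped = raw.dropna()

print(filled)
print(dropped)
```

Filling keeps all three rows but introduces estimated values, while dropping keeps only fully observed rows, here a single one. Which trade-off is acceptable depends on how much data is missing.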


4. Remove Duplicates: Ensure there are no duplicate rows in your dataset.

In [6]:
data.drop_duplicates(inplace=True)
print(data.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
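
Printing `head()` does not show whether any rows were removed. A hedged sketch of how to check: count duplicates with `duplicated()` before dropping them, then compare row counts. The frame `records` below is made up for illustration.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first.
records = pd.DataFrame({
    "length": [5.1, 4.9, 5.1, 6.3],
    "width": [3.5, 3.0, 3.5, 3.3],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = records.duplicated().sum()

# Drop the repeats and renumber the index so it has no gaps.
deduped = records.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(records), "->", len(deduped))
```

Resetting the index is optional. It is included here because `drop_duplicates` keeps the original row labels, which leaves gaps after rows are removed.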
