## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset into a pandas DataFrame.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [None]:
# Write your code from here


In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes  # Sample dataset for demo

# Step 1: Load a sample dataset
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)

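# For demonstration, scale only two numerical columns.
# Note: load_diabetes() returns features that are already mean-centered,
# so StandardScaler mainly rescales them to unit variance here.
features_to_scale = ['bmi', 'bp']  # Change this to match your dataset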

# Step 2: Define a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Apply the pipeline to scale features
df[features_to_scale] = pipeline.fit_transform(df[features_to_scale])

# Verify the scaled values
print(df[features_to_scale].head())


        bmi        bp
0  1.297088  0.459841
1 -1.082180 -0.553505
2  0.934533 -0.119214
3 -0.243771 -0.770650
4 -0.764944  0.459841
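
To see what the scaler did to these two columns, the sketch below recomputes the z-score by hand, z = (x - mean) / std, and compares it with the `StandardScaler` output. It assumes the same `load_diabetes` columns as above; the names `raw`, `cols`, `scaled`, and `manual` are illustrative and not part of the original exercise. Note that `StandardScaler` uses the population standard deviation, so the pandas equivalent needs `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

# Reload the unscaled data so the check starts from the raw values
data = load_diabetes()
raw = pd.DataFrame(data.data, columns=data.feature_names)
cols = ['bmi', 'bp']  # same columns as in the cell above

# Scale with StandardScaler (returns a NumPy array)
scaled = StandardScaler().fit_transform(raw[cols])

# Manual z-score; ddof=0 selects the population standard deviation
manual = (raw[cols] - raw[cols].mean()) / raw[cols].std(ddof=0)

# The two results should agree up to floating-point error
print(np.allclose(scaled, manual.to_numpy()))

# Each scaled column should have mean close to 0 and standard deviation close to 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

If the first line printed is `True`, the pipeline step is equivalent to the manual formula for these columns.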


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill in missing values.

In [None]:
# Write your code from here

In [2]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Step 1: Create a dummy dataset with missing values
data = {
    'age': [25, 30, np.nan, 40, 22],
    'salary': [50000, np.nan, 62000, 58000, np.nan]
}
df_missing = pd.DataFrame(data)

# Step 2: Define a pipeline for imputation
impute_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Apply the pipeline
df_missing[['age', 'salary']] = impute_pipeline.fit_transform(df_missing[['age', 'salary']])

# Verify imputed values
print(df_missing)


     age        salary
0  25.00  50000.000000
1  30.00  56666.666667
2  29.25  62000.000000
3  40.00  58000.000000
4  22.00  56666.666667
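
The two tasks can also be chained into one pipeline, so that imputation and scaling run in a single `fit_transform` call. The following is a minimal sketch under that assumption, reusing the dummy `age`/`salary` data from above; the names `df_raw`, `clean_pipeline`, and `cleaned` are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same dummy data as in the imputation task
df_raw = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 22],
    'salary': [50000, np.nan, 62000, 58000, np.nan],
})

# Steps run in order: fill missing values first, then scale
clean_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
cleaned = pd.DataFrame(
    clean_pipeline.fit_transform(df_raw),
    columns=df_raw.columns,
    index=df_raw.index,
)

print(cleaned)
```

Because mean imputation places each filled value exactly at the column mean, those entries end up at (approximately) 0 after standardization. The order of the steps matters: the imputer runs first so the scaler receives a complete column.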
