## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [1]:
# Write your code from here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load a sample dataset with Pandas
data = {'feature1': [10, 20, 30, 40, 50],
        'feature2': [5, 15, 25, 35, 45],
        'category': ['A', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)

# Identify numerical features
numerical_features = ['feature1', 'feature2']

# 2. Define a pipeline using Pipeline from sklearn.pipeline
# 3. Use StandardScaler to scale features
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Display the original data before the numerical columns are overwritten
print("Original DataFrame:")
print(df)

# Apply the pipeline to the numerical features only; 'category' is left untouched
df[numerical_features] = pipeline.fit_transform(df[numerical_features])

# Display the scaled data
print("\nDataFrame with scaled numerical features:")
print(df)

Original DataFrame:
   feature1  feature2 category
0        10         5        A
1        20        15        B
2        30        25        A
3        40        35        C
4        50        45        B

DataFrame with scaled numerical features:
   feature1  feature2 category
0 -1.414214 -1.414214        A
1 -0.707107 -0.707107        B
2  0.000000  0.000000        A
3  0.707107  0.707107        C
4  1.414214  1.414214        B
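
The scaled values above can be reproduced by hand as a sanity check. The following is a minimal sketch, assuming the same two numerical columns as in the task; the variable names `manual` and `scaled` are illustrative only. `StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`) rather than the pandas default sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same numerical columns as in the task above
df = pd.DataFrame({'feature1': [10, 20, 30, 40, 50],
                   'feature2': [5, 15, 25, 35, 45]})

# Manual z-score: subtract the column mean, then divide by the
# population standard deviation (ddof=0), matching StandardScaler
manual = (df - df.mean()) / df.std(ddof=0)

# Scaling with StandardScaler for comparison
scaled = StandardScaler().fit_transform(df)

print(manual.round(6))
print("Matches StandardScaler:", np.allclose(manual.to_numpy(), scaled))
```

With the pandas default `df.std()` (which uses `ddof=1`) the manual values would come out slightly smaller in magnitude, which is a common source of confusion when comparing results.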


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill missing values.

In [2]:
# Write your code from here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np

# 1. Load a dataset with missing values
data = {'feature1': [10, np.nan, 30, 40, 50],
        'feature2': [5, 15, np.nan, 35, 45],
        'category': ['A', 'B', 'A', np.nan, 'B']}
df = pd.DataFrame(data)

print("Original DataFrame with missing values:")
print(df)

# Identify numerical and categorical features
numerical_features = ['feature1', 'feature2']
categorical_features = ['category']

# 2. Define a pipeline to use SimpleImputer for filling missing values

# For numerical features, we'll use the mean to impute missing values
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# For categorical features, we'll use the most frequent value to impute missing values.
# 'A' and 'B' are tied here; SimpleImputer breaks ties by returning the smallest value ('A').
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

# Apply the numerical pipeline to the numerical features
df[numerical_features] = numerical_pipeline.fit_transform(df[numerical_features])

# Apply the categorical pipeline to the categorical features
df[categorical_features] = categorical_pipeline.fit_transform(df[categorical_features])

# Display the DataFrame after imputation
print("\nDataFrame after imputation:")
print(df)

Original DataFrame with missing values:
   feature1  feature2 category
0      10.0       5.0        A
1       NaN      15.0        B
2      30.0       NaN        A
3      40.0      35.0      NaN
4      50.0      45.0        B

DataFrame after imputation:
   feature1  feature2 category
0      10.0       5.0        A
1      32.5      15.0        B
2      30.0      25.0        A
3      40.0      35.0        A
4      50.0      45.0        B
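
The two tasks can also be combined so that imputation and scaling run in a single step. The following is a sketch of one possible approach, not part of the original exercise: it assumes scikit-learn's `ColumnTransformer` to route the numerical and categorical columns to their own pipelines, and the names `preprocessor` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset with missing values as in the task above
df = pd.DataFrame({'feature1': [10, np.nan, 30, 40, 50],
                   'feature2': [5, 15, np.nan, 35, 45],
                   'category': ['A', 'B', 'A', np.nan, 'B']})

numerical_features = ['feature1', 'feature2']
categorical_features = ['category']

# Numerical columns: fill missing values with the mean, then standardize
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill missing values with the most frequent value
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

# Route each group of columns to its own pipeline
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features),
])

# The output columns follow the transformer order: numerical first, then categorical.
# Mixing numbers and strings yields an object-dtype array, so the resulting
# columns are object dtype unless they are cast back afterwards.
cleaned = pd.DataFrame(
    preprocessor.fit_transform(df),
    columns=numerical_features + categorical_features,
)
print(cleaned)
```

Keeping every cleaning step inside one fitted object means the same `preprocessor` can later be applied to new data with `preprocessor.transform(...)`, reusing the means and most frequent values learned here.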
