End-to-End Data Preprocessing Pipeline using Python

Develop a Python pipeline that automatically cleans and standardizes raw data. Integrate missing value treatment, data-type correction, and outlier detection into a unified workflow.
Tasks:
1.	Build a function to automate key preprocessing tasks.
2.	Apply normalization and handle outliers programmatically.
3.	Generate before-and-after data summaries.


In [6]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# 1Ô∏è‚É£ Load dataset
df = pd.read_csv("datasets/data_5.csv")
print("‚úÖ Original Data Loaded\n")

# 2Ô∏è‚É£ Before summary
print("=== BEFORE CLEANING ===")
print(df.info())
print(df.describe(include='all'))
print("\nMissing values before cleaning:\n", df.isna().sum())



‚úÖ Original Data Loaded

=== BEFORE CLEANING ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer_ID      20 non-null     object 
 1   Age              19 non-null     float64
 2   Income           19 non-null     object 
 3   Purchase_Amount  20 non-null     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 772.0+ bytes
None
       Customer_ID        Age Income  Purchase_Amount
count           20  19.000000     19        20.000000
unique          20        NaN     18              NaN
top           C001        NaN  68000              NaN
freq             1        NaN      2              NaN
mean           NaN  33.368421    NaN       401.250000
std            NaN   7.220423    NaN       231.958906
min            NaN  22.000000    NaN       190.000000
25%            NaN  28.500000    NaN       277.500000
50%           

In [7]:
def preprocess_pipeline(df):
    df = df.copy()

    # üîπ Step 1: Replace text-based missing or invalid data
    df.replace("not_available", np.nan, inplace=True)

    # üîπ Step 2: Convert numeric columns to correct type
    df['Income'] = pd.to_numeric(df['Income'], errors='coerce')

    # üîπ Step 3: Handle missing values (mean imputation)
    df['Age'].fillna(df['Age'].mean(), inplace=True)
    df['Income'].fillna(df['Income'].mean(), inplace=True)

    # üîπ Step 4: Detect and handle outliers using Z-score
    numeric_cols = ['Age', 'Income', 'Purchase_Amount']
    z_scores = np.abs(stats.zscore(df[numeric_cols]))
    df = df[(z_scores < 3).all(axis=1)]

    # üîπ Step 5: Normalize numerical columns (0‚Äì1 range)
    scaler = MinMaxScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    return df


In [8]:
# 4Ô∏è‚É£ Apply the pipeline
cleaned_df = preprocess_pipeline(df)

# 5Ô∏è‚É£ After summary
print("\n=== AFTER CLEANING ===")
print(cleaned_df.info())
print(cleaned_df.describe())



=== AFTER CLEANING ===
<class 'pandas.core.frame.DataFrame'>
Index: 19 entries, 0 to 18
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer_ID      19 non-null     object 
 1   Age              19 non-null     float64
 2   Income           19 non-null     float64
 3   Purchase_Amount  19 non-null     float64
dtypes: float64(3), object(1)
memory usage: 760.0+ bytes
None
             Age     Income  Purchase_Amount
count  19.000000  19.000000        19.000000
mean    0.456221   0.381885         0.331785
std     0.260719   0.282847         0.273697
min     0.000000   0.000000         0.000000
25%     0.282609   0.151163         0.166667
50%     0.434783   0.255814         0.254902
75%     0.630435   0.604651         0.421569
max     1.000000   1.000000         1.000000


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df['Age'].mean(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Income'].fillna(df['Income'].mean(), inplace=True)
