Data Preprocessing
Data preprocessing is the process of transforming raw, often messy data into a clean and understandable format that is suitable for analysis or machine learning models.

Real-world data is often incomplete, inconsistent, or missing expected behaviors and trends, and it is likely to contain errors. Cleaning it typically involves the steps below.

Handling Missing Values: filling in gaps (imputation) or removing incomplete rows.

Handling Noisy Data: smoothing out errors and outliers (binning, regression, clustering).

Outlier Removal: identifying and removing data points that are implausible or statistically improbable (e.g., Age = 200).
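
As an illustration of the binning technique named under noisy data, here is a minimal sketch of smoothing by bin means. The `values` series is hypothetical and is not part of the dataset used later in this notebook; the choice of three equal-frequency bins is also an assumption made for the example.

```python
import pandas as pd

# Hypothetical noisy readings (already sorted), used only for illustration.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency binning: split the values into 3 bins of 4 values each.
bins = pd.qcut(values, q=3, labels=False)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = values.groupby(bins).transform("mean")

print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```

Each value is replaced by the average of its bin, so small fluctuations inside a bin disappear while the overall trend across bins is kept.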

Identifying redundancy involves finding duplicate records or attributes that convey the same information (e.g., storing both "Age" and "Date of Birth").

Elimination removes these repeated instances to reduce dataset size, keep the data consistent, and prevent the model from becoming biased toward records that appear more than once.
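
A minimal sketch of finding and removing duplicate records with pandas follows. The `people` frame and its values are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 are exact duplicates.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 31, 25],
    "City": ["NY", "LA", "NY"],
})

# duplicated() marks every repeat of an earlier row as True.
print(people.duplicated())

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = people.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Redundant attributes (such as storing both "Age" and "Date of Birth") are not caught by this row-level check and would need to be dropped as columns instead.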



In [1]:
import pandas as pd
import numpy as np

Create a Dataset

In [2]:
data={
    'Name':['Alice','Bob','Alice','David','Eve','Frank','Grace','Heidi'],
    'Age':[25,np.nan,25,45,120,30,np.nan,35],
    'Salary':[50000,60000,50000,80000,55000,58000,62000,2000000],
    'City':['NY','LA','NY','Chicago','Houston','Phoenix','NY','Seattle']
}
df=pd.DataFrame(data)
print("--- ORIGINAL DATAFRAME ---")
print(df)
print("\n")

--- ORIGINAL DATAFRAME ---
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
1    Bob    NaN    60000       LA
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
6  Grace    NaN    62000       NY
7  Heidi   35.0  2000000  Seattle




Handling missing values

In [3]:
print(f"Missing values per column:\n{df.isnull().sum()}\n")


Missing values per column:
Name      0
Age       2
Salary    0
City      0
dtype: int64



In [6]:
# dropna() returns a new DataFrame without any row that contains a NaN
df_dropped=df.dropna()
print("1. Shape after dropping rows with NaNs:",df_dropped.shape)
print(df_dropped)


1. Shape after dropping rows with NaNs: (6, 4)
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
7  Heidi   35.0  2000000  Seattle


In [8]:
df_imputed=df.copy()  # work on a copy so the original df stays unchanged
median_age=df_imputed['Age'].median()  # median() skips NaN by default
df_imputed['Age']=df_imputed['Age'].fillna(median_age)
print(f"2. Filled missing Age with median ({median_age})")
print(df_imputed)
print("\n")

2. Filled missing Age with median (32.5)
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
1    Bob   32.5    60000       LA
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
6  Grace   32.5    62000       NY
7  Heidi   35.0  2000000  Seattle
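
The notebook does not say why the median was chosen for imputation. One plausible reason, offered here as an assumption, is that the median is less affected by extreme values such as Age = 120 than the mean is. A short sketch comparing the two on the same Age values:

```python
import numpy as np
import pandas as pd

# Same Age values as in the dataset above.
age = pd.Series([25, np.nan, 25, 45, 120, 30, np.nan, 35])

# Both statistics skip NaN by default.
mean_age = age.mean()
median_age = age.median()

print("mean:  ", round(mean_age, 2))
print("median:", median_age)
```

The mean comes out near 46.67 because the single value 120 pulls it upward, while the median stays at 32.5, which is closer to the typical ages in the column.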




Noise Detection & Removal (Outliers)