<a href="https://colab.research.google.com/github/shruti63-code/Data-cleaning_/blob/main/Data_Cleaning_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧹 Data Cleaning & Preprocessing Example
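This notebook demonstrates **data cleaning** (imputing missing values, removing duplicates, standardizing labels) and **data preprocessing** (feature scaling, one-hot encoding, train-test splitting) in Python using pandas and scikit-learn.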


In [1]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

## Step 2: Create a messy dataset


In [2]:
data = {
    "Name": ["John", "Alice", "Bob", "Alice"],
    "Age": [25, np.nan, 30, np.nan],
    "Salary": [50000, 60000, np.nan, 60000],
    "Country": ["USA", "U.S.A.", "United States", "U.S.A."]
}

df = pd.DataFrame(data)
print("Raw Data:")
print(df)

Raw Data:
    Name   Age   Salary        Country
0   John  25.0  50000.0            USA
1  Alice   NaN  60000.0         U.S.A.
2    Bob  30.0      NaN  United States
3  Alice   NaN  60000.0         U.S.A.
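
Before cleaning, it can help to quantify the problems visible in the raw output. The following is a minimal sketch, not part of the original notebook: it rebuilds the same frame and counts missing values per column, fully duplicated rows, and distinct country spellings. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same messy data as in Step 2
df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Alice"],
    "Age": [25, np.nan, 30, np.nan],
    "Salary": [50000, 60000, np.nan, 60000],
    "Country": ["USA", "U.S.A.", "United States", "U.S.A."],
})

# Count NaN entries in each column
missing_per_column = df.isna().sum()

# Count rows that repeat an earlier row exactly (NaN values compare as equal here)
duplicate_rows = int(df.duplicated().sum())

# Count the different spellings used in the Country column
country_spellings = df["Country"].nunique()

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Distinct country spellings:", country_spellings)
```

These counts give a baseline to compare against after the cleaning step.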


## Step 3: Data Cleaning


In [3]:
# Fill missing values
# Assign the result back to the column; chained `inplace=True` on a column
# selection is deprecated and will not modify df in pandas 3.0
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Remove duplicates
df = df.drop_duplicates()

# Standardize country names
df["Country"] = df["Country"].replace({
    "U.S.A.": "USA",
    "United States": "USA"
})

print("Cleaned Data:")
print(df)

Cleaned Data:
    Name   Age        Salary Country
0   John  25.0  50000.000000     USA
1  Alice  27.5  60000.000000     USA
2    Bob  30.0  56666.666667     USA

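The order of the cleaning operations affects the imputed values. Above, missing salaries are filled before duplicates are removed, so the duplicated Alice row counts twice in the mean. The sketch below compares that order with the alternative of dropping duplicates first. It is an illustration rather than a recommendation from the original notebook, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Alice"],
    "Age": [25, np.nan, 30, np.nan],
    "Salary": [50000, 60000, np.nan, 60000],
    "Country": ["USA", "U.S.A.", "United States", "U.S.A."],
})

# Order used above: impute first, then drop duplicates.
# The mean is taken over 50000, 60000, 60000.
fill_first = raw.copy()
fill_first["Salary"] = fill_first["Salary"].fillna(fill_first["Salary"].mean())
fill_first = fill_first.drop_duplicates()

# Alternative order: drop duplicates first, then impute.
# The mean is taken over 50000, 60000 only.
dedupe_first = raw.drop_duplicates().copy()
dedupe_first["Salary"] = dedupe_first["Salary"].fillna(dedupe_first["Salary"].mean())

# Bob (index label 2) is the row whose salary was missing
bob_fill_first = fill_first.loc[2, "Salary"]
bob_dedupe_first = dedupe_first.loc[2, "Salary"]

print("Impute, then dedupe:", round(bob_fill_first, 2))
print("Dedupe, then impute:", round(bob_dedupe_first, 2))
```

Which order is appropriate depends on whether repeated rows are genuine observations or data-entry artifacts.
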

## Step 4: Data Preprocessing


In [4]:
# Scale numerical features
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

# Encode categorical feature
encoder = OneHotEncoder(sparse_output=False)
encoded_country = encoder.fit_transform(df[["Country"]])
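# Reuse df's index so pd.concat aligns rows by label, even if
# drop_duplicates has left gaps in the index
encoded_df = pd.DataFrame(
    encoded_country,
    columns=encoder.get_feature_names_out(["Country"]),
    index=df.index,
)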

# Merge back into dataframe
df = pd.concat([df.drop("Country", axis=1), encoded_df], axis=1)

print("Preprocessed Data:")
print(df)

Preprocessed Data:
    Name       Age    Salary  Country_USA
0   John -1.224745 -1.336306          1.0
1  Alice  0.000000  1.069045          1.0
2    Bob  1.224745  0.267261          1.0
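
The scaled values above can be reproduced by hand. `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`), i.e. z = (x - mean) / std. The sketch below checks this for the `Age` column; the array and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column after cleaning, shaped as a single-feature 2D array
ages = np.array([[25.0], [27.5], [30.0]])

# Manual z-score; np.std defaults to ddof=0 (population standard deviation)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(ages)

print(manual.ravel())
print("Matches StandardScaler:", np.allclose(manual, scaled))
```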


## Step 5: Train-Test Split

In [5]:
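# "Name" serves only as a stand-in target to demonstrate the split
X = df.drop("Name", axis=1)
y = df["Name"]

# With 3 rows, test_size=0.2 rounds up to a single test row
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)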

print("Training Features:")
print(X_train)
print("\nTesting Features:")
print(X_test)

Training Features:
        Age    Salary  Country_USA
1  0.000000  1.069045          1.0
2  1.224745  0.267261          1.0

Testing Features:
        Age    Salary  Country_USA
0 -1.224745 -1.336306          1.0
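
In the steps above, the scaler is fitted on all rows before the split, so statistics from the test row influence the training features. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that variant under the same toy data, not the notebook's own method; the `cleaned` frame and the `*_scaled` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Cleaned, unscaled data as produced by Step 3 (Country omitted: it is constant)
cleaned = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Age": [25.0, 27.5, 30.0],
    "Salary": [50000.0, 60000.0, 56666.666667],
})

X = cleaned[["Age", "Salary"]]
y = cleaned["Name"]

# Split before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same parameters to the test rows
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled)
print(X_test_scaled)
```

With only three rows the numbers themselves are not meaningful; the point is the order of `fit_transform` on the training set and `transform` on the test set.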
