# 🗂️ Raw to Clean Data Transformation 🧹


In [1]:

import pandas as pd
import numpy as np

# 1️⃣ Raw Data (Messy / Unstructured)
raw_data = {
    "Name": ["Alice", "Bob", None, "Emma", "John", "Alice"],
    "Age": [25, 30, 22, None, 28, 25],
    "City": ["Paris", "London", "New York", "paris", None, "Paris"],
    "Salary": ["50000", "60000", "55000", "not available", "58000", "50000"]
}


In [2]:

df = pd.DataFrame(raw_data)
print("Raw Data:")
print(df)

# 2️⃣ Step 1 - Handle Missing Values 🚧
# Assign each result back to its column. The chained form
# df[col].fillna(..., inplace=True) raises a FutureWarning and stops working in pandas 3.0.
df["Name"] = df["Name"].fillna("Unknown")
df["Age"] = df["Age"].fillna(df["Age"].mean())           # Replace missing age with the column mean
df["City"] = df["City"].fillna("Unknown")
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")  # Non-numeric entries ("not available") become NaN


Raw Data:
    Name   Age      City         Salary
0  Alice  25.0     Paris          50000
1    Bob  30.0    London          60000
2   None  22.0  New York          55000
3   Emma   NaN     paris  not available
4   John  28.0      None          58000
5  Alice  25.0     Paris          50000
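The per-column fills in this step can also be written as a single call by passing a dict to `DataFrame.fillna`. The snippet below is a minimal sketch on a small made-up frame (`sample` and `fill_values` are illustrative names, not part of the notebook above); it returns a new frame rather than modifying a column selection in place.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, same column names as above)
sample = pd.DataFrame({
    "Name": ["Alice", None, "Emma"],
    "Age": [25.0, 22.0, np.nan],
    "City": ["Paris", "New York", None],
})

# One fill value per column; columns not listed are left untouched
fill_values = {
    "Name": "Unknown",
    "Age": sample["Age"].mean(),   # mean of the non-missing ages
    "City": "Unknown",
}
filled = sample.fillna(fill_values)
print(filled)
```

Whether the mean is a sensible fill for `Age` depends on the data; the median or dropping the row are common alternatives.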

In [3]:

# 3️⃣ Step 2 - Remove Duplicates 🗑️
df.drop_duplicates(inplace=True)
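Before dropping rows it can help to look at what would be removed. The following is a rough sketch on made-up data (the `sample` frame is an assumption for illustration) that uses `duplicated(keep=False)` to list every row in a duplicate group, then drops the repeats and renumbers the index.

```python
import pandas as pd

# Illustrative frame with one exact repeat (hypothetical data)
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 25],
    "City": ["Paris", "London", "Paris"],
})

# keep=False flags every member of a duplicate group, not just the later repeats
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# Drop repeats (the first occurrence is kept) and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Note that `drop_duplicates` only matches rows that are identical in every column, so rows differing only in case or whitespace are kept unless text is standardized first.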


In [4]:

# 4️⃣ Step 3 - Standardize / Clean Text ✨
df["City"] = df["City"].str.title()                     # Capitalize city names
df["Name"] = df["Name"].str.strip()                     # Remove extra spaces
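To see why this step matters, the sketch below counts distinct values before and after normalization. The `cities` Series is made-up data for illustration; it combines the `str.strip()` and `str.title()` calls used above.

```python
import pandas as pd

# Hypothetical city values with inconsistent case and stray spaces
cities = pd.Series(["Paris", " paris", "PARIS ", "London", None])

# Strip surrounding whitespace first, then normalize case;
# missing values pass through the .str accessor unchanged
clean = cities.str.strip().str.title()

print(cities.nunique(), "distinct raw values")
print(clean.nunique(), "distinct cleaned values")
```

Title-casing suits simple place names; values with internal capitals or abbreviations may need a different rule.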


In [5]:

# 5️⃣ Step 4 - Handle Outliers / Check Data 🔎
# Example: Age cannot be negative
df = df[df["Age"] >= 0]
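The non-negative check is a domain rule; a statistical rule can be layered on top of it. The sketch below applies the common 1.5 × IQR rule to a made-up `ages` Series. This rule is one possible choice for illustration, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical ages, including one negative and one implausibly large value
ages = pd.Series([25, 30, 22, 26, 28, 120, -3], name="Age")

# Domain rule: ages must be non-negative (same check as above)
valid = ages[ages >= 0]

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = valid[(valid < lower) | (valid > upper)]
print(outliers)
```

Flagged values are candidates for review rather than automatic removal; whether to drop, cap, or keep them depends on the dataset.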


In [6]:

# 6️⃣ Step 5 - Final Clean Data ✅
print("\nCleaned Data:")
print(df)



Cleaned Data:
      Name   Age      City   Salary
0    Alice  25.0     Paris  50000.0
1      Bob  30.0    London  60000.0
2  Unknown  22.0  New York  55000.0
3     Emma  26.0     Paris      NaN
4     John  28.0   Unknown  58000.0
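The cleaned output still contains one `NaN` in `Salary`, which is easy to miss when reading a printed table. A short check like the following sketch surfaces it; the `cleaned` frame is rebuilt by hand from the output above so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the cleaned frame shown above so the snippet is self-contained
cleaned = pd.DataFrame({
    "Name": ["Alice", "Bob", "Unknown", "Emma", "John"],
    "Age": [25.0, 30.0, 22.0, 26.0, 28.0],
    "City": ["Paris", "London", "New York", "Paris", "Unknown"],
    "Salary": [50000.0, 60000.0, 55000.0, np.nan, 58000.0],
})

cleaned.info()                    # dtypes and non-null counts per column
print(cleaned.describe())         # summary statistics for numeric columns
missing = cleaned.isna().sum()    # remaining missing values per column
print(missing[missing > 0])
```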


In [7]:

# 🔹 Quick Tips:
# - Missing values → fill or drop
# - Duplicates → remove
# - Text → standardize (case, strip spaces)
# - Type conversion → numeric, date, categorical
# - Outliers → detect & handle
# - Always check dataframe.info() & describe() after cleaning
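As a sketch of the type-conversion tip, the snippet below converts one column each to numeric, datetime, and categorical. The `raw` frame and its `Joined` column are hypothetical and do not appear in the dataset above; `errors="coerce"` turns unparseable entries into `NaN` / `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical raw frame; "Joined" is an invented column for illustration
raw = pd.DataFrame({
    "Salary": ["50000", "not available", "58000"],
    "Joined": ["2021-03-01", "2022-07-15", "not a date"],
    "City": ["Paris", "London", "Paris"],
})

typed = raw.assign(
    # Unparseable numbers become NaN
    Salary=pd.to_numeric(raw["Salary"], errors="coerce"),
    # Unparseable dates become NaT; the format is stated explicitly
    Joined=pd.to_datetime(raw["Joined"], format="%Y-%m-%d", errors="coerce"),
    # Repeated labels are stored once as categories
    City=raw["City"].astype("category"),
)
print(typed.dtypes)
```

Coercion hides bad input by design, so counting the resulting missing values afterwards is a reasonable follow-up.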
