**Data Cleaning and Preprocessing**

In [23]:
import pandas as pd

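# NOTE: the output below shows \t inside one fused column name, so the file
# looks tab-delimited; passing sep='\t' here would split it into real columns.
df = pd.read_csv('/content/marketing_campaign.csv')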

In [24]:
# Inspect Missing Values (count of nulls per column)
print("Columns with missing values:\n", df.isnull().sum())

Columns with missing values:
 ID\tYear_Birth\tEducation\tMarital_Status\tIncome\tKidhome\tTeenhome\tDt_Customer\tRecency\tMntWines\tMntFruits\tMntMeatProducts\tMntFishProducts\tMntSweetProducts\tMntGoldProds\tNumDealsPurchases\tNumWebPurchases\tNumCatalogPurchases\tNumStorePurchases\tNumWebVisitsMonth\tAcceptedCmp3\tAcceptedCmp4\tAcceptedCmp5\tAcceptedCmp1\tAcceptedCmp2\tComplain\tZ_CostContact\tZ_Revenue\tResponse    0
dtype: int64
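
The output above lists a single entry whose name contains every field separated by `\t`, which is what a tab-delimited file looks like when it is parsed with the default comma separator. The sketch below compares both parses on a small in-memory sample. The sample is an assumption for illustration: it keeps only the first seven fields of the two rows visible in this notebook's final output, not the real file.

```python
import io

import pandas as pd

# Illustrative sample only: first seven fields of the two rows shown in this notebook.
sample = (
    "ID\tYear_Birth\tEducation\tMarital_Status\tIncome\tKidhome\tTeenhome\n"
    "5524\t1957\tGraduation\tSingle\t58138\t0\t0\n"
    "2174\t1954\tGraduation\tSingle\t46344\t1\t1\n"
)

# Default separator (comma): every line collapses into one fused column.
fused = pd.read_csv(io.StringIO(sample))

# Explicit tab separator: each field becomes its own column.
split = pd.read_csv(io.StringIO(sample), sep="\t")

print("default sep:", fused.shape)
print("sep='\\t':   ", split.shape)
print(split.isnull().sum())
```

With separate columns, the per-column null counts become meaningful, and numeric fields such as `Income` are parsed as numbers rather than as fragments of one long string.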


In [25]:
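# Fill missing 'age' with the mean age
if 'age' in df.columns:
    # Assign the result back: fillna(inplace=True) on a selected column is
    # chained assignment and raises a FutureWarning in recent pandas versions
    df['age'] = df['age'].fillna(df['age'].mean())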

In [26]:
# Drop rows with missing 'gender'
if 'gender' in df.columns:
    df.dropna(subset=['gender'], inplace=True)

In [27]:
# Remove Duplicate Rows
df.drop_duplicates(inplace=True)
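
`drop_duplicates` runs silently, so it can help to count how many rows it will remove before relying on it. A minimal sketch on a toy frame (the `toy` data and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact duplicates.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "income": [100, 200, 200, 300]})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_dupes} duplicate row(s); {len(deduped)} remain.")
```

When a dataset carries a unique identifier, whole-row comparison may never find a match; in that case `subset=[...]` can restrict the comparison to the columns that define a duplicate.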

In [28]:
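# Standardize Text Values 'gender'
if 'gender' in df.columns:
    # Trim whitespace and lowercase first so variants such as ' M' or 'FEMALE' match the mapping
    df['gender'] = df['gender'].str.strip().str.lower()
    df['gender'] = df['gender'].replace({'male': 'Male', 'female': 'Female', 'm': 'Male', 'f': 'Female'})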
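
One design choice in the cell above is that `replace` leaves unrecognized labels untouched, so stray values survive in lowercase. The sketch below shows an alternative using `Series.map`, which turns anything outside the mapping into `NaN` so it can be counted or reviewed. The helper name `standardize_gender`, the constant `GENDER_MAP`, and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mapping from cleaned (stripped, lowercased) labels to canonical ones.
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}


def standardize_gender(values: pd.Series) -> pd.Series:
    """Map known spellings to canonical labels; unknown labels become NaN."""
    cleaned = values.str.strip().str.lower()
    return cleaned.map(GENDER_MAP)


toy = pd.Series([" M", "female", "Male ", "F", "unknown"])
result = standardize_gender(toy)

print(result.tolist())
print("unmapped values:", int(result.isna().sum()))
```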


In [29]:
# Standardize Date Format
if 'date' in df.columns:
    try:
        # Parse to datetime, then write the values back as 'dd-mm-yyyy' strings
        df['date'] = pd.to_datetime(df['date'])
        df['date'] = df['date'].dt.strftime('%d-%m-%Y')
    except ValueError:
        print("Error converting 'date' column. Check the format of the dates in the input file.")
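
The `try`/`except` above abandons the whole column as soon as one value fails to parse. A possible alternative, sketched here under the assumption that the dates are expected in `dd-mm-yyyy` form, is to pass an explicit `format` together with `errors="coerce"`, so that bad entries become `NaT` individually and can be counted. The toy values are made up.

```python
import pandas as pd

# Hypothetical values: one valid date, one in a different layout, one that is not a date.
toy = pd.Series(["04-09-2012", "2014-03-08", "not a date"])

# Entries that do not match the expected layout become NaT instead of raising.
parsed = pd.to_datetime(toy, format="%d-%m-%Y", errors="coerce")
n_bad = int(parsed.isna().sum())

# Format back to text only at the end; NaT stays missing.
formatted = parsed.dt.strftime("%d-%m-%Y")

print(f"{n_bad} value(s) could not be parsed")
print(formatted.tolist())
```

Keeping the column as `datetime64` until the final export also preserves date arithmetic and sorting, which are lost once the values are strings.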


In [30]:
# Rename Columns: lowercase the headers and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [31]:
# Check and Fix Data Types
# Convert 'age' to integer (round first, since a mean used to fill gaps is usually fractional)
if 'age' in df.columns:
    df['age'] = df['age'].round().astype(int)
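
Casting to plain `int` raises an error if any missing value is still present. Where gaps may remain, pandas' nullable `Int64` dtype is one option; the sketch below uses made-up values.

```python
import pandas as pd

# Hypothetical ages with one missing entry and one fractional value.
toy = pd.Series([34.0, None, 51.6])

# Round to whole numbers, then cast to the nullable integer dtype;
# the missing entry is kept as <NA> instead of raising.
as_int = toy.round().astype("Int64")

print(as_int)
```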

In [32]:
# Summary of changes
# Note: these messages print unconditionally, even when a guarded step above
# was skipped because its column is absent from the DataFrame.
print("\nSummary of Data Cleaning:")
print("- Missing values handled in 'age' and 'gender' columns.")
print("- Duplicate rows removed.")
print("- Text values in 'gender' standardized.")
print("- Date format standardized to 'dd-mm-yyyy'.")
print("- Column headers cleaned.")
print("- Data type of 'age' fixed.")


Summary of Data Cleaning:
- Missing values handled in 'age' and 'gender' columns.
- Duplicate rows removed.
- Text values in 'gender' standardized.
- Date format standardized to 'dd-mm-yyyy'.
- Column headers cleaned.
- Data type of 'age' fixed.
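
The summary above is printed regardless of which steps ran. One way to keep it truthful, sketched here with a hypothetical `clean` helper and toy data, is to record a message only when the corresponding step is applied:

```python
import pandas as pd


def clean(df: pd.DataFrame):
    """Apply guarded cleaning steps and return the result plus a log of what ran."""
    steps = []
    df = df.copy()

    # Guarded step: only runs, and is only logged, when the column exists.
    if "age" in df.columns:
        df["age"] = df["age"].fillna(df["age"].mean())
        steps.append("Missing 'age' values filled with the mean.")

    # Unguarded step: always runs, and logs how many rows it removed.
    before = len(df)
    df = df.drop_duplicates()
    steps.append(f"{before - len(df)} duplicate row(s) removed.")

    return df, steps


# Hypothetical frame without an 'age' column, so the fill step is skipped.
toy = pd.DataFrame({"income": [10, 10, 20]})
cleaned, steps = clean(toy)

print("Summary of Data Cleaning:")
for step in steps:
    print("-", step)
```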


In [33]:
# head is a method: call it with parentheses to print the first rows
print(df.head())

     id\tyear_birth\teducation\tmarital_status\tincome\tkidhome\tteenhome\tdt_customer\trecency\tmntwines\tmntfruits\tmntmeatproducts\tmntfishproducts\tmntsweetproducts\tmntgoldprods\tnumdealspurchases\tnumwebpurchases\tnumcatalogpurchases\tnumstorepurchases\tnumwebvisitsmonth\tacceptedcmp3\tacceptedcmp4\tacceptedcmp5\tacceptedcmp1\tacceptedcmp2\tcomplain\tz_costcontact\tz_revenue\tresponse
0     5524\t1957\tGraduation\tSingle\t58138\t0\t0\t0...
1     2174\t1954\tGraduation\tSingle\t46344\t1\t1\t0...