# 📊 Data Cleaning and Transformation Project

In this project, I cleaned and transformed a raw customer dataset using Python and Pandas. The dataset contained missing values, duplicate entries, and inconsistent formats.

## Key steps included:

- Importing data using Pandas
- Handling missing values with imputation (for example, filling a numeric column with its median)
- Removing duplicates
- Converting data types for accurate analysis
- Creating new calculated columns for business insights

The cleaned dataset was then ready for analysis and dashboard development. The cells below show each step, from loading the raw file to saving the cleaned output, with the goal of producing consistent, reliable data for decision-making.


In [5]:
# Data Cleaning and Transformation Example
import pandas as pd

# Load dataset
df = pd.read_csv('D:/customer_data.csv')

# View first rows
print("First 5 rows:")
print(df.head())

First 5 rows:
   customer_id name  age  signup_date subscription_type
0         1000    q   12          345                 a
1         1001    m   34          234                 g
2         1002    q   34          123               NaN
3         1003    m   34           12               NaN
4         1004    q   34          -99               NaN


In [6]:
# Check data types
print("Data types:")
print(df.dtypes)

Data types:
customer_id           int64
name                 object
age                   int64
signup_date           int64
subscription_type    object
dtype: object
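
Before imputing anything, it helps to count the missing values in each column. The following is a minimal sketch on a hypothetical in-memory frame named `sample` that mirrors the preview above (it is not the real file); `isna().sum()` gives the number of missing entries per column.

```python
import pandas as pd

# Hypothetical sample that mirrors the five-row preview; not loaded from disk.
sample = pd.DataFrame({
    "customer_id": [1000, 1001, 1002, 1003, 1004],
    "name": ["q", "m", "q", "m", "q"],
    "age": [12, 34, 34, 34, 34],
    "signup_date": [345, 234, 123, 12, -99],
    "subscription_type": ["a", "g", None, None, None],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

In this sample only `subscription_type` has gaps, which suggests where imputation effort should go.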


In [7]:
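# Handle missing values
# Example: fill missing 'age' values with the column median.
# In this sample 'age' loads as int64, so it has no missing values and the fill leaves it unchanged.
df['age'] = df['age'].fillna(df['age'].median())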
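
A median only works for numeric columns. For a text column such as `subscription_type`, one possible approach is to fill gaps with an explicit placeholder label. The sketch below assumes the label `"unknown"`, which is my choice for illustration and not something the dataset defines.

```python
import pandas as pd

# Hypothetical values copied from the preview of 'subscription_type'.
subscription = pd.Series(["a", "g", None, None, None], name="subscription_type")

# Assumption: "unknown" is an acceptable placeholder category.
filled = subscription.fillna("unknown")
print(filled.value_counts())
```

An explicit label keeps the rows in the dataset while making it visible downstream that the value was never recorded.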

In [8]:
# Remove duplicates
df = df.drop_duplicates()
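
By default `drop_duplicates()` only removes rows that match in every column. If the same customer can appear twice with small differences, deduplicating on a key column may be closer to what is wanted. This sketch assumes `customer_id` is the unique key, which the source data does not confirm; the `records` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical records: customer 1001 appears twice with different name casing.
records = pd.DataFrame({
    "customer_id": [1000, 1001, 1001],
    "name": ["q", "m", "M"],
})

# Exact-row comparison keeps all three rows because the names differ.
exact = records.drop_duplicates()

# Key-based comparison keeps the first row seen for each customer_id.
by_key = records.drop_duplicates(subset="customer_id", keep="first")
print(len(exact), len(by_key))
```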

In [9]:
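# Convert date column to datetime; values that cannot be parsed become NaT
# Note: 'signup_date' loads as int64, and without a `unit` argument
# to_datetime reads integers as nanoseconds since 1970-01-01.
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')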
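
Because `signup_date` is stored as integers, the result of `pd.to_datetime` depends on how those integers are interpreted. The sketch below contrasts the default (nanoseconds since 1970-01-01) with an explicit `unit="D"`. Both the day-offset reading and the idea that negative values such as `-99` mark missing data are assumptions for illustration only; the dataset does not say what these numbers encode.

```python
import pandas as pd

# Values copied from the 'signup_date' preview.
raw = pd.Series([345, 234, 123, 12, -99])

# Default behaviour: integers are treated as nanoseconds since the Unix epoch.
as_ns = pd.to_datetime(raw, errors="coerce")

# Assumption: negative values are sentinels for "missing",
# and the rest are day offsets from 1970-01-01.
cleaned = raw.mask(raw < 0)
as_days = pd.to_datetime(cleaned, unit="D", errors="coerce")

print(as_ns.iloc[0])
print(as_days.iloc[0])
```

With the default, every value lands within a fraction of a second of the epoch, so any tenure computed from it would be nearly identical for all customers.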

In [10]:
# Create a new calculated column
# Example: Customer tenure in days
df['tenure_days'] = (pd.Timestamp.today() - df['signup_date']).dt.days
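
`pd.Timestamp.today()` makes the tenure change every time the notebook runs. When the numbers need to be reproducible, one option is to compute tenure against a fixed reference date. The dates and the `as_of` value below are hypothetical.

```python
import pandas as pd

# Hypothetical signup dates, including one missing value.
signup = pd.Series(pd.to_datetime(["2023-01-01", "2023-06-15", None]))

# Assumption: a fixed reporting date instead of the current date.
as_of = pd.Timestamp("2024-01-01")

# Missing signup dates propagate as NaN rather than raising an error.
tenure_days = (as_of - signup).dt.days
print(tenure_days)
```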

In [11]:
# Final overview
print("Cleaned dataset info:")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

Cleaned dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   customer_id        5 non-null      int64         
 1   name               5 non-null      object        
 2   age                5 non-null      int64         
 3   signup_date        5 non-null      datetime64[ns]
 4   subscription_type  2 non-null      object        
 5   tenure_days        5 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 372.0+ bytes


In [12]:
# Save cleaned data
df.to_csv('D:/customer_data_cleaned.csv', index=False)

print("Data cleaning and transformation completed successfully.")

Data cleaning and transformation completed successfully.
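
CSV files do not store column types, so a datetime column comes back as plain text unless it is parsed again on load. The sketch below shows a round trip through an in-memory buffer (used here instead of a file path so it runs anywhere); the `cleaned` frame is a hypothetical stand-in for the saved data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
cleaned = pd.DataFrame({
    "customer_id": [1000, 1001],
    "signup_date": pd.to_datetime(["2023-01-01", "2023-06-15"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the CSV back.
reloaded = pd.read_csv(buffer, parse_dates=["signup_date"])
print(reloaded.dtypes)
```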
