# Data Cleaning

Cleaning, Standardizing, Formatting, and Preparing Data for Reporting with Pandas (Using Basic_data.csv)

## 1. Importing Libraries and Reading Data

In [1]:
import pandas as pd

# Read the CSV data into a DataFrame
data = pd.read_csv("Basic_data.csv")
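Before cleaning, it helps to look at what was actually loaded. The sketch below is only an illustration: it reads a small in-memory sample instead of Basic_data.csv, and the two rows are invented. Only the column names are taken from this dataset.

```python
import io

import pandas as pd

# Hypothetical sample standing in for Basic_data.csv (row values are invented)
sample_csv = io.StringIO(
    "AccID,Name,Gender,Age,AccOpen,Balance,AccStatus\n"
    "1,Alice Smith,F,34,2020-01-15,1500.50,Active\n"
    "2,Bob Jones,M,45,2019-06-30,320.00,Closed\n"
)
sample = pd.read_csv(sample_csv)

# Quick checks: dimensions, first rows, and the inferred column types
print(sample.shape)
print(sample.head())
print(sample.dtypes)
```

The same three calls can be run on `data` right after `pd.read_csv("Basic_data.csv")`.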


## 2. Data Cleaning

### Handling Missing Values

Check for missing values using isnull().sum(). You can choose to drop rows with missing data (dropna()) or fill them with a suitable value (e.g., mean for numerical columns).

In [2]:
# Check for missing values
print(data.isnull().sum())

# Option 1: Drop rows with missing values (if acceptable)
# data.dropna(inplace=True)

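# Option 2: Fill missing values with a specific value (e.g., mean for numerical columns)
# Assign the result back: fillna(inplace=True) on a selected column may not
# modify the DataFrame in newer pandas versions
data["Balance"] = data["Balance"].fillna(data["Balance"].mean())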


AccID        0
Name         0
Gender       0
Age          0
AccOpen      0
Balance      0
AccStatus    0
dtype: int64
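The output above shows no missing values in this dataset, so the fill step changes nothing here. For data that does have gaps, the following is a minimal sketch on an invented frame. Using the median for the numeric column and the placeholder "Unknown" for the text column are assumptions chosen for illustration, not requirements.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps (values are invented for illustration)
df = pd.DataFrame({
    "Balance": [100.0, np.nan, 300.0],
    "Gender": ["F", None, "M"],
})

# Count missing values per column before filling
missing_before = df.isnull().sum()

# Numeric column: the median is less sensitive to outliers than the mean
df["Balance"] = df["Balance"].fillna(df["Balance"].median())

# Text column: fill with an explicit placeholder value
df["Gender"] = df["Gender"].fillna("Unknown")

print(missing_before)
print(df)
```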


### Fixing Data Types

Convert columns with dates (AccOpen in this case) to datetime format using pd.to_datetime().

In [3]:
# Convert 'AccOpen' to datetime format
data["AccOpen"] = pd.to_datetime(data["AccOpen"])
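If some date strings might be malformed, `pd.to_datetime()` raises an error by default. The sketch below shows one way to handle that, assuming the dates are written as YYYY-MM-DD; the sample strings are invented.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately invalid
raw = pd.Series(["2020-01-15", "2019-06-30", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that failed to parse
```

Counting the NaT values afterwards shows how many rows need a closer look.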


### Removing Duplicates

Use drop_duplicates() to remove duplicate rows if present.

In [10]:
# Remove duplicate rows (based on all columns)
data.drop_duplicates(inplace=True)
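It can be useful to count duplicates before dropping them, and to decide which columns define a duplicate. The sketch below assumes AccID should be unique per account, which may not hold for every dataset; the rows are invented.

```python
import pandas as pd

# Hypothetical rows: AccID 2 appears twice
df = pd.DataFrame({
    "AccID": [1, 2, 2, 3],
    "Name": ["alice", "bob", "bob", "carol"],
})

# Count fully duplicated rows before removing anything
n_dupes = df.duplicated().sum()

# Keep the first row for each AccID and rebuild a clean 0..n-1 index
deduped = df.drop_duplicates(subset=["AccID"], keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped)
```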


## 3. Data Standardization

### Lowercase or Uppercase Text Columns

Standardize text columns (e.g., converting names to lowercase) using string methods like str.lower().

Ensure consistent date formatting using datetime methods like dt.strftime().

In [5]:
# Make 'Name' column lowercase
data["Name"] = data["Name"].str.lower()
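Lowercasing alone does not remove stray whitespace, which is another common source of mismatched names. A minimal sketch on invented names, chaining a few string methods:

```python
import pandas as pd

# Hypothetical names with inconsistent case and spacing
names = pd.Series(["  Alice SMITH", "bob  jones ", "CAROL lee"])

# Trim outer whitespace, collapse runs of inner whitespace, then lowercase
cleaned = (
    names.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)

print(cleaned.tolist())
```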


In [6]:
# Format 'AccOpen' as YYYY-MM-DD text
# Note: strftime returns strings, so the column no longer holds datetimes
data["AccOpen"] = data["AccOpen"].dt.strftime("%Y-%m-%d")
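Because `dt.strftime()` returns text, date operations such as extracting the year stop working on the formatted column. One option, sketched below, is to keep the datetime column and write the formatted text to a separate column. The column name `AccOpen_text` is hypothetical and the dates are invented.

```python
import pandas as pd

# Hypothetical dates for illustration
df = pd.DataFrame({"AccOpen": pd.to_datetime(["2020-01-15", "2019-06-30"])})

# Keep the datetime column for sorting and date arithmetic;
# store the formatted text in a separate (hypothetical) column
df["AccOpen_text"] = df["AccOpen"].dt.strftime("%Y-%m-%d")

print(df["AccOpen"].dt.year.tolist())  # still works on the datetime column
print(df["AccOpen_text"].tolist())
```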


## 4. Data Formatting

### Currency Formatting for Balance

Apply a function (using apply()) to format numerical columns like Balance with currency symbol and desired number of decimal places.

In [7]:
# Add currency symbol and format decimals for 'Balance'
# Note: the column now holds strings, so do any numeric calculations before this step
data["Balance"] = data["Balance"].apply(lambda x: f"${x:.2f}")
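Formatting Balance in place replaces the numbers with text, so later sums or averages on that column will fail. A possible alternative, sketched below on invented balances, keeps the numeric column and adds a display column. The name `Balance_display` and the thousands separator are assumptions for illustration.

```python
import pandas as pd

# Hypothetical balances for illustration
df = pd.DataFrame({"Balance": [1500.5, 320.0, 1234567.891]})

# Keep 'Balance' numeric; put the formatted text in a separate column.
# The "," in the format spec adds thousands separators.
df["Balance_display"] = df["Balance"].map(lambda x: f"${x:,.2f}")

# Numeric operations still work on the original column
total = df["Balance"].sum()

print(df)
print(total)
```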


### Renaming Columns

Rename columns for better readability using rename() with a dictionary that maps old names to new ones.

In [8]:
# Rename columns for clarity
data = data.rename(columns={"AccStatus": "Account Status"})
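The same dictionary can rename several columns in one call. In the sketch below, "Account ID" and "Account Opened" are hypothetical display names; only the "AccStatus" mapping comes from the cell above.

```python
import pandas as pd

# Empty frame with three of the dataset's column names, enough to show renaming
df = pd.DataFrame(columns=["AccID", "AccOpen", "AccStatus"])

# Old name -> new name; keys not present in the frame are ignored by default
new_names = {
    "AccID": "Account ID",          # hypothetical display name
    "AccOpen": "Account Opened",    # hypothetical display name
    "AccStatus": "Account Status",
}
renamed = df.rename(columns=new_names)

print(list(renamed.columns))
```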


## 5. Report Prework

### Selecting Specific Columns

Select only the columns needed for your report by indexing the DataFrame with a list of column names. (Boolean selection is used to filter rows.)

In [9]:
# Choose relevant columns for reporting
report_data = data[["Name", "Age", "Account Status", "Balance"]]
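Column selection can be combined with a boolean mask that filters rows. The sketch below uses invented rows and assumes that "Active" is one of the status values, which the source data may or may not contain.

```python
import pandas as pd

# Hypothetical cleaned data for illustration
df = pd.DataFrame({
    "Name": ["alice", "bob", "carol"],
    "Age": [34, 45, 29],
    "Account Status": ["Active", "Closed", "Active"],
    "Balance": [1500.5, 320.0, 980.0],
})

# Boolean mask picks the rows, the list picks the columns;
# .copy() makes the result independent of the original frame
active = df.loc[df["Account Status"] == "Active", ["Name", "Age", "Balance"]].copy()

print(active)
```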


After cleaning and preparing the data, you can use it to create tabular, chart, timeline, or scatter plot reports.
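As one example of a tabular report, the sketch below summarizes balances per account status. It uses invented rows and assumes Balance is still numeric, that is, the summary is computed before any currency formatting is applied.

```python
import pandas as pd

# Hypothetical prepared data; 'Balance' is kept numeric here
df = pd.DataFrame({
    "Account Status": ["Active", "Closed", "Active"],
    "Balance": [1500.5, 320.0, 980.0],
})

# One row per status with the number of accounts and the total balance
summary = (
    df.groupby("Account Status")["Balance"]
    .agg(accounts="count", total_balance="sum")
    .reset_index()
)

print(summary)
```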