<h3>0. Importing Libraries</h3>
<p>We start by importing pandas, the Python library used for data manipulation throughout this notebook:</p>

In [16]:
import pandas as pd

<h3>1. Loading the Dataset</h3>
<p>The dataset was loaded from a CSV file using the <code>pd.read_csv()</code> function:</p>

<pre><code>df = pd.read_csv("Mall_Customers.csv")</code></pre>

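<p>This creates a DataFrame named <code>df</code> containing the data from the file <code>Mall_Customers.csv</code>. It is the foundational step before any cleaning or analysis. The notebook cell reads the same file from an absolute local path, written as a raw string (<code>r"..."</code>) so the backslashes in the Windows path are not treated as escape sequences.</p>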

In [17]:
# Load the dataset
df = pd.read_csv(r"D:\Internship\DataSet\Mall_Customers.csv")

<h2>2. Data Cleaning Steps</h2>

In [18]:
# Count missing values in each column
print("Missing values before handling:\n", df.isnull().sum())

Missing values before handling:
 CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64


<h3>3. Handling Missing Values</h3>
<p>We counted missing values per column using <code>.isnull().sum()</code>; the output above shows none in any of the five columns. Rows containing missing data would be removed using <code>.dropna()</code>, so on this dataset the call leaves the data unchanged. Alternatively, missing values could be imputed using <code>.fillna()</code> with a default or calculated value.</p>


In [19]:
# Fill or drop missing values
df = df.dropna()  # Or use df.fillna(value) if you prefer filling
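
<p>As a rough illustration of the two options (dropping versus imputing), the sketch below uses a small made-up DataFrame rather than the Mall Customers file. The column values, the median for the numeric column, and the most frequent value for the categorical column are assumptions chosen for the example, not choices made in this notebook.</p>

```python
import pandas as pd

# Hypothetical toy data, not taken from Mall_Customers.csv
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Female"],
    "Age": [19.0, 21.0, None, 35.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute instead of dropping
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())  # numeric: median
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])  # categorical: most frequent value

print(missing_counts)
print(f"rows after dropna: {len(dropped)}, rows after fillna: {len(filled)}")
```

<p>Dropping is simplest when few rows are affected; imputing keeps every row at the cost of introducing estimated values.</p>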


<h3>4. Removing Duplicate Rows</h3>
<p>Duplicate records were eliminated from the dataset using <code>.drop_duplicates()</code> to ensure data integrity and prevent biased results during analysis.</p>



In [20]:
# Remove duplicate rows
df = df.drop_duplicates()
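
<p>To see how many rows a de-duplication step would actually remove, one can count duplicates first. The sketch below does this on a made-up DataFrame (the values are hypothetical); resetting the index afterwards is an optional extra that is not part of the notebook's cell.</p>

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Age": [19, 21, 21, 35],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Drop them and renumber the index so it has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduped)} remain")
```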

<h3>4. Standardizing Text Values</h3>
<p>To ensure consistency in categorical text data, the <code>Gender</code> column was stripped of leading/trailing whitespace and converted to lowercase using <code>.str.strip().str.lower()</code>.</p>

In [21]:
# Standardize text values (example: Gender column)
if 'Gender' in df.columns:
    df['Gender'] = df['Gender'].str.strip().str.lower()
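
<p>The effect of this step is easiest to see on messy input. The following sketch uses a made-up Series (the values are hypothetical) to show how inconsistent spellings collapse into a small set of categories.</p>

```python
import pandas as pd

# Hypothetical messy values for a gender column
gender = pd.Series([" Male", "FEMALE ", "female", "male  "])

categories_before = gender.nunique()

# Same chain as in the cell above: trim whitespace, then lowercase
cleaned = gender.str.strip().str.lower()

categories_after = cleaned.nunique()

print(f"{categories_before} distinct values before, {categories_after} after")
print(sorted(cleaned.unique()))
```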



<h3>5. Converting Date Formats</h3>
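<p>The code checks for a column named 'Date' and, if it exists, converts it to a consistent datetime type using <code>pd.to_datetime()</code> with <code>dayfirst=True</code>, so ambiguous values are read as day/month/year. The Mall Customers data has no such column, so this step makes no change here; it serves as a template for datasets that do include dates. A consistent datetime type ensures reliable time-based analysis and filtering.</p>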


In [22]:
# Convert date formats to a consistent type (assuming a date column exists)
# Replace 'Date' with your actual column name if it exists
if 'Date' in df.columns:
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
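
<p>Because the dataset has no date column, the behaviour of <code>dayfirst=True</code> is illustrated here on made-up strings. Adding <code>errors="coerce"</code> is an assumption of this sketch (the notebook's cell does not use it); it turns unparseable entries into <code>NaT</code> instead of raising an error.</p>

```python
import pandas as pd

# The same string is ambiguous: 3 April or 4 March?
month_first = pd.to_datetime("03/04/2023")               # default: month first
day_first = pd.to_datetime("03/04/2023", dayfirst=True)  # day first

# Hypothetical column with one invalid entry
raw = pd.Series(["03/04/2023", "25/12/2023", "not a date"])
parsed = pd.to_datetime(raw, dayfirst=True, errors="coerce")

print(month_first.date(), day_first.date())
print(parsed)
```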

<h3>6. Renaming Column Headers</h3>
<p>To standardize and clean up the column names, we:</p>
<ul>
  <li>Removed leading/trailing spaces</li>
  <li>Converted all column names to lowercase</li>
  <li>Replaced spaces with underscores</li>
</ul>
<p>This was done using <code>df.columns.str.strip().str.lower().str.replace(' ', '_')</code>. Other special characters are left untouched, so the result includes names such as <code>annual_income_(k$)</code>.</p>

In [23]:
# Rename column headers to be clean and uniform
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
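
<p>The renaming above keeps parentheses, <code>$</code>, and hyphens, which can be awkward in attribute access or formulas. As a possible variation (not what this notebook does), the sketch below replaces every run of non-alphanumeric characters with a single underscore. The column names are copied from the dataset; the regular expression is an assumption of this example.</p>

```python
import pandas as pd

cols = pd.Index([
    "CustomerID", "Gender", "Age",
    "Annual Income (k$)", "Spending Score (1-100)",
])

# Approach used in the notebook: only spaces become underscores
simple = cols.str.strip().str.lower().str.replace(" ", "_")

# Stricter variant: collapse any run of characters other than a-z or 0-9
# into one underscore, then trim underscores left at either end
stricter = (
    cols.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
)

print(list(simple))
print(list(stricter))
```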



<h3>7. Checking and Fixing Data Types</h3>
<p>We made sure numerical fields such as <strong>age</strong> have a numeric type: <code>pd.to_numeric()</code> with <code>errors='coerce'</code> turns unparseable entries into <code>NaN</code>, which are then replaced with 0 before the column is cast to <code>int</code>. A correct numeric type is essential for mathematical operations and modeling.</p>

In [24]:
# 7. Check and fix data types
# Example: converting 'age' to integer
if 'age' in df.columns:
    df['age'] = pd.to_numeric(df['age'], errors='coerce').fillna(0).astype(int)
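
<p>Filling failed conversions with 0 yields a valid integer column but also creates ages of 0 that were never observed. The sketch below, on made-up values, counts how many entries fail to convert and shows pandas' nullable <code>Int64</code> dtype as one possible alternative that keeps them marked as missing. Whether that is preferable depends on the later analysis; it is offered as an option, not as the notebook's method.</p>

```python
import pandas as pd

# Hypothetical raw column containing one non-numeric entry
age_raw = pd.Series(["19", "21", "unknown", "35"])

# Unparseable values become NaN
age_num = pd.to_numeric(age_raw, errors="coerce")
n_invalid = int(age_num.isna().sum())

# Approach used in the notebook: replace NaN with 0, then cast to int
age_zero_filled = age_num.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps the missing entry as <NA>
age_nullable = age_num.astype("Int64")

print(f"{n_invalid} value(s) could not be converted")
print(age_zero_filled.tolist())
print(age_nullable.tolist())
```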



In [25]:
# Show cleaned dataset info
print("\nCleaned Data Info:")
df.info()  # info() prints its report itself and returns None, so no print() is needed


Cleaned Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   customerid              200 non-null    int64 
 1   gender                  200 non-null    object
 2   age                     200 non-null    int32 
 3   annual_income_(k$)      200 non-null    int64 
 4   spending_score_(1-100)  200 non-null    int64 
dtypes: int32(1), int64(3), object(1)
memory usage: 7.2+ KB


<h3>8. Saving Cleaned Data</h3>
<p>The cleaned dataset was saved as a new CSV file using <code>df.to_csv()</code> for future analysis or model training.</p>


In [26]:
# Save cleaned data to current directory
df.to_csv("Mall_Customers_Cleaned.csv", index=False)
print("✅ Cleaned data saved as 'Mall_Customers_Cleaned.csv'")

✅ Cleaned data saved as 'Mall_Customers_Cleaned.csv'
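
<p>A quick way to confirm that a saved file is usable is to read it back and compare its shape and columns. The sketch below does this with an in-memory buffer and a made-up DataFrame so that it runs without touching the disk; with a real file, the buffer would be replaced by the file path.</p>

```python
import io

import pandas as pd

# Hypothetical cleaned data standing in for the real DataFrame
toy = pd.DataFrame({
    "customerid": [1, 2],
    "gender": ["male", "female"],
    "age": [19, 21],
})

# Write without the index, as in the cell above, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# For contrast: keeping the index adds an unnamed first column on reload
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)  # index=True is the default
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```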
