<a href="https://colab.research.google.com/github/vpshilfiger37/Data-Science-Internship-Basics/blob/main/dcmvh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Cleaning & Missing Value Handling

- Use Pandas to load a dataset.
- Identify and handle missing values (fillna(), dropna()).
- Remove duplicate entries.
- Standardize column names (lowercase, no spaces).
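
Before choosing between `fillna()` and `dropna()`, it helps to see where the gaps are. The following is a minimal sketch of that inspection step, assuming a small hypothetical frame (`sample` and its values are illustrative only, not the notebook's data). `isna().sum()` counts missing values per column, `duplicated().sum()` counts rows that repeat an earlier row, and `dropna()` is shown as the alternative to filling.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, used only for illustration
sample = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben"],
    "age": [31.0, np.nan, np.nan],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)

# dropna() returns a copy without the rows that contain any missing value
complete_rows = sample.dropna()
print(complete_rows)
```

Dropping rows discards the non-missing fields in those rows as well, which is one reason the cell below fills values instead.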

In [4]:
import pandas as pd
import numpy as np

# Build a small hand-written dataset with missing values
data = {
    'id': np.arange(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah', 'Ivy', 'Jack'],
    'age': [25, 30, np.nan, 35, 28, np.nan, 22, 29, 40, np.nan],  # Missing values in 'age'
    'salary': [50000, 60000, 55000, np.nan, 65000, 62000, 48000, np.nan, 70000, 67000],  # Missing values in 'salary'
    'city': ['NY', 'LA', 'SF', 'NY', 'SF', np.nan, 'LA', 'NY', 'LA', 'SF']  # Missing value in 'city'
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Introduce some duplicate rows
df = pd.concat([df, df.iloc[2:4]], ignore_index=True)

print("Original Dataset with Missing Values & Duplicates:")
print(df)

# -------------------------------------------
# DATA CLEANING PROCESS
# -------------------------------------------

# 1. Handle Missing Values
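# Assign the result back: fillna(..., inplace=True) on a column selection
# warns in pandas 2.x and will not modify df in pandas 3.0
df['age'] = df['age'].fillna(df['age'].median())  # Fill missing 'age' with median
df['salary'] = df['salary'].fillna(df['salary'].mean())  # Fill missing 'salary' with mean
df['city'] = df['city'].fillna('Unknown')  # Fill missing 'city' with 'Unknown'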

# 2. Remove Duplicate Entries
df.drop_duplicates(inplace=True)

# 3. Standardize Column Names (lowercase, no spaces)
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Display cleaned dataset
print("\nCleaned Dataset:")
print(df)

# Save cleaned dataset as CSV (optional)
df.to_csv("cleaned_dataset.csv", index=False)


Original Dataset with Missing Values & Duplicates:
    id     name   age   salary city
0    1    Alice  25.0  50000.0   NY
1    2      Bob  30.0  60000.0   LA
2    3  Charlie   NaN  55000.0   SF
3    4    David  35.0      NaN   NY
4    5      Eve  28.0  65000.0   SF
5    6    Frank   NaN  62000.0  NaN
6    7    Grace  22.0  48000.0   LA
7    8   Hannah  29.0      NaN   NY
8    9      Ivy  40.0  70000.0   LA
9   10     Jack   NaN  67000.0   SF
10   3  Charlie   NaN  55000.0   SF
11   4    David  35.0      NaN   NY

Cleaned Dataset:
   id     name   age        salary     city
0   1    Alice  25.0  50000.000000       NY
1   2      Bob  30.0  60000.000000       LA
2   3  Charlie  29.5  55000.000000       SF
3   4    David  35.0  59111.111111       NY
4   5      Eve  28.0  65000.000000       SF
5   6    Frank  29.5  62000.000000  Unknown
6   7    Grace  22.0  48000.000000       LA
7   8   Hannah  29.0  59111.111111       NY
8   9      Ivy  40.0  70000.000000       LA
9  10     Jack  29.5  6

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['age'].fillna(df['age'].median(), inplace=True)  # Fill missing 'age' with median
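
The warning above names two replacements for `df[col].fillna(value, inplace=True)`. The following is a minimal sketch of the dictionary form, assuming a small hypothetical frame `people` (illustrative values, not the notebook's data). Passing a column-to-value mapping to `DataFrame.fillna` fills each column with its own value in one call, so nothing is done on an intermediate column selection.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
people = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "salary": [100.0, 200.0, np.nan],
    "city": ["NY", None, "LA"],
})

# One fill value per column: median age, mean salary, placeholder city
fill_values = {
    "age": people["age"].median(),
    "salary": people["salary"].mean(),
    "city": "Unknown",
}

# fillna returns a new DataFrame; assign it back rather than using inplace=True
people = people.fillna(fill_values)
print(people)
```

Columns not listed in the mapping are left untouched, so the same call is safe on wider frames.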

In [5]:
import pandas as pd
import seaborn as sns

# Load the 'tips' dataset
df = sns.load_dataset('tips')

print(df.head())  # Show first 5 rows

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
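
The cell above loads `tips` through seaborn, which fetches the data from seaborn's online example repository. For a file on disk, `pd.read_csv` is the usual pandas entry point. The following is a minimal sketch, assuming CSV text held in memory: `csv_text` is a hypothetical stand-in for a file path and contains only the five rows printed above. It is followed by the same missing-value and duplicate checks used in the cleaning cell.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: the five rows shown by df.head()
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""

# read_csv accepts a file path or any file-like object
tips_head = pd.read_csv(io.StringIO(csv_text))

# Same checks as in the cleaning cell: missing values and duplicate rows
print(tips_head.isna().sum())
print(tips_head.duplicated().sum())
```

With a real file, the `io.StringIO(csv_text)` argument would be replaced by the file's path.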
