### **Summary of Cleaning Steps**:
1. **Load the data** into Pandas.
2. **Inspect the data**: structure, unique values in each column, missing values, and summary statistics.
3. **Handle missing values** by filling or dropping them.
4. **Fix inconsistent text data** by standardizing case and correcting typos.
5. **Remove outliers** from numeric columns such as `Salary`.
6. **Standardize date formats** by parsing `Join_Date` into a single datetime type.
7. **Remove duplicate rows** from the dataset.
8. **Inspect the cleaned data** to verify that the cleaning worked.

**Step 1: Load the Dataset**
- First, load the dataset into Pandas so you can inspect it.

In [3]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/siddhantbhattarai/Machine_Learning_Bootcamp_2024/refs/heads/main/Pandas/Data_Cleaning/dirty_data.csv'
df = pd.read_csv(url)

# Look at the first few rows of the dataset
print(df.head())

               Name   Age           City   Join_Date     Salary  Gender
0  Savannah Patrick  21.0  san francisco  2024-02-04  9999999.0    Male
1    Jessica Ramsey  32.0             SF  15/03/2022  9999999.0  Female
2       Jacob White  70.0             LA  2019-11-03        NaN  female
3        Erik Ortiz  62.0  San Francisco  2020-06-14       10.0    male
4      Tonya Dudley  42.0             LA  2023-07-21  2000000.0  Female


**Step 2: Inspect the Data**
- Before cleaning the data, you need to understand its structure.

In [4]:
# Check the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10100 entries, 0 to 10099
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       9077 non-null   object 
 1   Age        9067 non-null   float64
 2   City       10100 non-null  object 
 3   Join_Date  9102 non-null   object 
 4   Salary     7814 non-null   float64
 5   Gender     8853 non-null   object 
dtypes: float64(2), object(4)
memory usage: 473.6+ KB

- Also check the unique values in each column. The outputs shown for the next four cells were produced after the cleaning steps (their execution counts are 33 to 37), so they list cleaned values rather than raw variants such as `Male`, `SF`, or `LA`.


In [33]:
# Check unique values in the 'Gender' column
print("Unique values in 'Gender':", df['Gender'].unique())

Unique values in 'Gender': ['female' 'male' 'unknown']


In [34]:
# Check unique values in the 'City' column
print("Unique values in 'City':", df['City'].unique())

Unique values in 'City': ['New York' 'Los Angeles' 'San Francisco']
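
`unique()` lists the distinct values but not how often each one occurs. The sketch below uses a made-up toy Series (not the tutorial's dataset) to show how `value_counts(dropna=False)` reports the frequency of every variant, including missing entries, which can help decide which variants are typos.

```python
import pandas as pd

# Toy column with the kinds of variants seen in raw data (illustrative values only)
gender = pd.Series(['Male', 'female', 'male', None, 'Female', 'male'])

# dropna=False keeps missing entries in the tally instead of silently skipping them
counts = gender.value_counts(dropna=False)
print(counts)
```

Rare variants near the bottom of the tally are usually the first candidates for correction.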


In [36]:
# Check unique values in the 'Age' column
print("Unique values in 'Age':", df['Age'].unique())

Unique values in 'Age': [56. 18. 20. 29. 51. 55. 46. 25. 49. 48. 23. 26. 44. 53. 35. 52. 69. 61.
 36. 40. 41. 58. 59. 30. 28. 21. 62. 68. 33. 63. 27. 38. 45. 47. 31. 65.
 50. 37. 54. 19. 64. 32. 67. 22. 34. 24. 66. 60. 39. 70. 43. 42. 57.]


In [37]:
# Check unique values in the 'Salary' column
print("Unique values in 'Salary':", df['Salary'].unique())

Unique values in 'Salary': [173314.07178227  81495.40654824  55726.33882573 199014.30198763
  76552.65584576  89896.50834987  71082.28300614  83976.06313335
 137222.64128248 169592.74583585 173842.26571579 173960.51414453
 113009.42752972 148776.32806803 146442.92794895  57893.90785747
 162618.74033283 154485.98607858  90536.81895635 129803.82604539
 192769.63174184 181452.22197981 164865.58280781 134759.15710204
 112019.17779216 154132.10837113  57770.52333969 120943.03549679
 105320.21159778  94300.60962554  76290.15147695 198645.67769726
  76961.19739935 121726.76692404 122862.03789501  78988.40628727
  82427.81856035 186276.92769749 138316.30618283  78741.75730467
 174667.68164102 142066.82417717  76705.95213665  79611.19280337
 176256.07248117  64639.05277132 174721.56005582 180312.76356623
 125754.02399468 161147.0826596   64902.38535368  79650.09367284
 171174.0363687  164152.06598728 145042.35474669  77383.86499116
 183277.72098242 174612.25452609 190014.84813989  92329.8166124

In [5]:
# Check for any missing values in each column
print(df.isnull().sum())

Name         1023
Age          1033
City            0
Join_Date     998
Salary       2286
Gender       1247
dtype: int64
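
Raw counts are easier to judge as a share of the column. This is a minimal sketch on a small hypothetical DataFrame (the values are invented for illustration) showing how `isnull().mean()` gives the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial's dataset
toy = pd.DataFrame({
    'Name': ['Ann', None, 'Cy', 'Di'],
    'Salary': [50000.0, np.nan, np.nan, 72000.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

A column with a high missing share may call for a different strategy (dropping the column, or keeping the gaps) than one with only a few missing entries.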


In [6]:
# Get summary statistics for numerical columns
print(df.describe())

               Age        Salary
count  9067.000000  7.814000e+03
mean     43.918275  3.471742e+06
std      15.406788  4.201636e+06
min      18.000000  1.000000e+01
25%      30.000000  1.000000e+01
50%      44.000000  2.000000e+06
75%      57.000000  9.999999e+06
max      70.000000  9.999999e+06
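
In the summary above, several quartiles of `Salary` land on the same round values (10, 2,000,000, and 9,999,999), which suggests that placeholder values are repeated many times. One way to check this, sketched here on a toy Series with invented values, is to look at the most frequent entries.

```python
import pandas as pd

# Toy salaries: a few genuine-looking values mixed with repeated placeholders
salary = pd.Series([9999999.0, 9999999.0, 10.0, 2000000.0, 81495.41, 10.0, 9999999.0])

# Genuine salaries rarely repeat exactly; placeholder values usually do
top = salary.value_counts().head(3)
print(top)
```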


**Step 3: Handle Missing Values**
- You can either remove rows with missing values or fill them with a default value. Let's handle missing values for each column:

A. Name Column (Missing Values)
- Since `Name` identifies each record, remove rows where it is missing.

In [7]:
# Remove rows where 'Name' is missing
df = df.dropna(subset=['Name'])

B. Age Column (Missing Values)
- For `Age`, fill missing values with the mean or the median. The code below uses the median, which is less sensitive to extreme values.

In [9]:
# Fill missing 'Age' values with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

C. Salary Column (Missing Values)
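- For `Salary`, let's fill missing values with the median salary. Note that the median of the raw column is 2,000,000 (see the `describe()` output in Step 2), so the filled rows are later removed by the salary filter in Step 5.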

In [10]:
# Fill missing 'Salary' values with the median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
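
If a column contains placeholder values, its median can itself be a placeholder. The following is a minimal sketch of one alternative, not what this notebook does: mask implausible values first, then compute the median from what remains. The toy values are invented, and the bounds of 20,000 and 200,000 are assumed from the filter used in Step 5.

```python
import numpy as np
import pandas as pd

# Toy salaries with placeholders (10, 2000000, 9999999) and one missing value
salary = pd.Series([10.0, 9999999.0, np.nan, 60000.0, 80000.0, 2000000.0, 100000.0])

# Treat values outside the assumed plausible range as missing
plausible = salary.where(salary.between(20000, 200000, inclusive='neither'))

# The median is now computed from plausible salaries only
filled = plausible.fillna(plausible.median())
print(filled.tolist())
```

Whether to impute these rows or drop them is a judgment call; imputing keeps more rows but makes many salaries identical.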

D. Join_Date Column (Missing Values)
- If Join_Date is missing, we can either drop those rows or fill them with a default date.

In [11]:
# Fill missing 'Join_Date' with a default value (e.g., today's date)
df['Join_Date'] = df['Join_Date'].fillna(pd.to_datetime('today'))

In [12]:
df.head()

               Name   Age           City   Join_Date     Salary  Gender
0  Savannah Patrick  21.0  san francisco  2024-02-04  9999999.0    Male
1    Jessica Ramsey  32.0             SF  15/03/2022  9999999.0  Female
2       Jacob White  70.0             LA  2019-11-03  2000000.0  female
3        Erik Ortiz  62.0  San Francisco  2020-06-14       10.0    male
4      Tonya Dudley  42.0             LA  2023-07-21  2000000.0  Female


**Step 4: Fix Inconsistent Data**
- Inconsistent data happens when the same information is represented in different ways (e.g., "male", "Male", "Mmale").

A. Clean Gender Column
- You can fix the gender column by standardizing it to lowercase `"male"` or `"female"`, with `"unknown"` for missing values.

In [13]:
# Convert everything to lowercase
df['Gender'] = df['Gender'].str.lower()

In [14]:
# Replace typos and fix the values
df['Gender'] = df['Gender'].replace({
    'mmale': 'male',
    'femle': 'female',
    'femlae': 'female'
})

# Fill any remaining missing values with 'unknown'
df['Gender'] = df['Gender'].fillna('unknown')
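
The same steps can be chained, with whitespace stripping added as an extra precaution and a final check for values that slipped through. This is a sketch on a toy Series with invented values; the typo map repeats the one above.

```python
import pandas as pd

# Toy column with stray whitespace, mixed case, typos, and a missing value
gender = pd.Series([' Male', 'FEMALE ', 'Mmale', 'femle', None])

typo_fixes = {'mmale': 'male', 'femle': 'female', 'femlae': 'female'}

cleaned = (
    gender.str.strip()         # drop leading/trailing spaces
          .str.lower()         # unify capitalization
          .replace(typo_fixes) # correct known typos
          .fillna('unknown')   # label missing values explicitly
)

# Anything outside the expected set is a variant the typo map does not cover yet
unexpected = cleaned[~cleaned.isin(['male', 'female', 'unknown'])]
print(cleaned.tolist())
```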

B. Clean City Column
- Inconsistent city names (e.g., "New York", "new york", "LA") should be standardized.

In [15]:
# Standardize city names
city_replacements = {
    'new york': 'New York',
    'LA': 'Los Angeles',
    'SF': 'San Francisco',
    'Los Angele': 'Los Angeles',
    'san francisco': 'San Francisco'
}

df['City'] = df['City'].replace(city_replacements)
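
`replace` matches keys exactly, so a variant like `'sf'` or `'NEW YORK'` would not be caught by the dictionary above. A possible variation, sketched on invented toy values, is to normalize case and whitespace first and look up a lowercase key. The `canonical` dictionary here is an assumption for illustration.

```python
import pandas as pd

city = pd.Series(['new york', 'LA', 'sf', 'Los Angele', ' San Francisco', 'NEW YORK'])

# Keys are lowercase so one entry covers every capitalization of a variant
canonical = {
    'new york': 'New York',
    'la': 'Los Angeles',
    'los angele': 'Los Angeles',
    'los angeles': 'Los Angeles',
    'sf': 'San Francisco',
    'san francisco': 'San Francisco',
}

key = city.str.strip().str.lower()
# map() returns NaN for unknown keys, so fall back to the original value there
cleaned = key.map(canonical).fillna(city)
print(cleaned.tolist())
```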

**Step 5: Handle Outliers**
- Outliers are extreme values that are implausible for the column, such as a salary of 10 or 9,999,999. Let's handle the extreme salary values.

A. Salary Column (Outliers)
- Look for very high or very low salaries, and remove or cap them.

In [16]:
# Keep only salaries inside a plausible range (the bounds are a judgment call)
df = df[df['Salary'] > 20000]   # Remove unrealistically low salaries
df = df[df['Salary'] < 200000]  # Remove unrealistically high salaries
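
The text mentions capping as an alternative to removal. The sketch below contrasts the two on a toy Series with invented values, reusing the bounds from the filter above as an assumption.

```python
import pandas as pd

salary = pd.Series([10.0, 55726.34, 81495.41, 173314.07, 9999999.0])

lower, upper = 20000, 200000  # assumed bounds, taken from the filter above

# Capping keeps every row but pulls extreme values to the nearest bound
capped = salary.clip(lower=lower, upper=upper)

# Removing keeps only rows strictly inside the range, like the filter above
kept = salary[salary.between(lower, upper, inclusive='neither')]

print(capped.tolist())
print(kept.tolist())
```

Capping suits genuine but extreme measurements. For placeholder values like the ones in this dataset, capping would disguise them as real salaries, so removal (or treating them as missing) is usually the safer choice.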

**Step 6: Parse Dates**
- Dates in the Join_Date column may have different formats, so let's convert them to a consistent datetime format.

In [17]:
# Convert the 'Join_Date' column to datetime format
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')

In [18]:
# Check if any invalid dates were converted to NaT (Not a Time)
print(df['Join_Date'].isnull().sum())

88
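
The sample rows show at least two date formats (`2024-02-04` and `15/03/2022`). In recent pandas versions a single `pd.to_datetime` call may infer one format from the first values and coerce the rest to `NaT`, which could explain some of the invalid dates counted above. A minimal sketch of one workaround, assuming only these two formats occur, is to parse each format separately and combine the results.

```python
import pandas as pd

# Toy values: two known formats plus one entry that is not a date at all
dates = pd.Series(['2024-02-04', '15/03/2022', '2019-11-03', 'not a date'])

# Parse each assumed format separately; non-matching entries become NaT
iso = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
dmy = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Prefer the ISO result and fall back to day/month/year where it failed
parsed = iso.fillna(dmy)
print(parsed)
```

Only entries that match neither format remain `NaT`, so fewer rows need a default date.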


In [19]:
# Fill any remaining NaT values with a default date
df['Join_Date'] = df['Join_Date'].fillna(pd.to_datetime('2000-01-01'))

**Step 7: Remove Duplicates**
- You might have some duplicated rows in the dataset. Let's remove them.

In [25]:
# Check for duplicated rows and count them
duplicate_count = df.duplicated().sum()

print(f'Total number of duplicated rows: {duplicate_count}')

Total number of duplicated rows: 5


In [26]:
# Remove duplicate rows
df = df.drop_duplicates()
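
By default `drop_duplicates()` removes only rows that match in every column. If two rows should count as the same record whenever certain key columns match, the `subset` and `keep` parameters can express that. The sketch below uses invented toy data, and the choice of `Name` and `City` as the key is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Ann Lee', 'Ann Lee', 'Bo Ray', 'Bo Ray'],
    'City': ['New York', 'New York', 'Los Angeles', 'Los Angeles'],
    'Salary': [90000.0, 90000.0, 70000.0, 75000.0],
})

# Full-row duplicates only: the second 'Ann Lee' row is dropped
exact = toy.drop_duplicates()

# Duplicates by key columns: the first row for each (Name, City) pair is kept
by_key = toy.drop_duplicates(subset=['Name', 'City'], keep='first')

print(len(exact), len(by_key))
```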

**Step 8: Inspect the Cleaned Data**
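- Finally, inspect the cleaned dataset to confirm that no missing values or duplicates remain. Note that only 532 of the original 10,100 rows survive, mostly because of the salary filter in Step 5.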

In [28]:
# Check the structure and summary of the cleaned dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 532 entries, 5 to 10073
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Name       532 non-null    object        
 1   Age        532 non-null    float64       
 2   City       532 non-null    object        
 3   Join_Date  532 non-null    datetime64[ns]
 4   Salary     532 non-null    float64       
 5   Gender     532 non-null    object        
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 29.1+ KB


In [29]:
# Display the first few rows of the cleaned data
print(df.head())

                Name   Age           City  Join_Date         Salary   Gender
5    Danielle Ingram  56.0       New York 2020-08-21  173314.071782   female
29    Amanda Hendrix  18.0    Los Angeles 2021-06-07   81495.406548     male
73     Jacob Gardner  20.0       New York 2022-02-13   55726.338826  unknown
108       Mark Ochoa  29.0  San Francisco 2024-01-11  199014.301988  unknown
117     Carlos Bauer  51.0  San Francisco 2022-09-19   76552.655846   female


In [30]:
# Check for any missing values in each column
print(df.isnull().sum())

Name         0
Age          0
City         0
Join_Date    0
Salary       0
Gender       0
dtype: int64


In [31]:
# Check for duplicated rows and count them
duplicate_count = df.duplicated().sum()

print(f'Total number of duplicated rows: {duplicate_count}')

Total number of duplicated rows: 0
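
The individual checks can be gathered into one reusable function. This is a sketch only: `check_cleaned` is a hypothetical helper, and the expected salary range and gender labels are assumed from the earlier steps.

```python
import pandas as pd

def check_cleaned(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isnull().any().any():
        problems.append('missing values remain')
    if df.duplicated().any():
        problems.append('duplicate rows remain')
    if not df['Salary'].between(20000, 200000, inclusive='neither').all():
        problems.append('salary outside expected range')
    if not set(df['Gender']) <= {'male', 'female', 'unknown'}:
        problems.append('unexpected gender value')
    return problems

# Toy frame standing in for the cleaned dataset (values are illustrative)
good = pd.DataFrame({
    'Name': ['A', 'B'],
    'Salary': [50000.0, 60000.0],
    'Gender': ['male', 'unknown'],
})
print(check_cleaned(good))
```

Calling `check_cleaned(df)` on the real DataFrame after each change to the cleaning code gives a quick regression check.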


In [32]:
df.head()

                Name   Age           City  Join_Date         Salary   Gender
5    Danielle Ingram  56.0       New York 2020-08-21  173314.071782   female
29    Amanda Hendrix  18.0    Los Angeles 2021-06-07   81495.406548     male
73     Jacob Gardner  20.0       New York 2022-02-13   55726.338826  unknown
108       Mark Ochoa  29.0  San Francisco 2024-01-11  199014.301988  unknown
117     Carlos Bauer  51.0  San Francisco 2022-09-19   76552.655846   female


**Bonus: Save the Cleaned Data**
- You can save the cleaned data to a new CSV file.

In [38]:
# Save the cleaned data
df.to_csv('cleaned_data.csv', index=False)
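
CSV files store no type information, so a datetime column comes back as plain text unless it is parsed again on load. The sketch below demonstrates the round trip with an in-memory buffer and a one-row toy frame; with a real file, the same `parse_dates` argument would be passed alongside the file path.

```python
import io
import pandas as pd

# One-row stand-in for the cleaned data
cleaned = pd.DataFrame({
    'Name': ['Danielle Ingram'],
    'Join_Date': pd.to_datetime(['2020-08-21']),
})

# Write to an in-memory buffer instead of a file so the example leaves nothing on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV text format does not record
reloaded = pd.read_csv(buffer, parse_dates=['Join_Date'])
print(reloaded.dtypes)
```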