# Day 34

* Handling Missing Values
* Removing Duplicate Records
* Removing Outliers

In [11]:
import pandas as pd

# A forward slash avoids the invalid "\s" escape sequence and works on every OS
df = pd.read_csv("sample_data/sample_dirty_dataset.csv")
print(df)

       Name    Age  Gender   Salary Department
0     Shyam   12.0       M   3000.0         HR
1     Alice   25.0       F  50000.0         HR
2       Bob    NaN       M  60000.0         IT
3   Charlie   30.0       M      NaN    Finance
4     David   22.0       M  45000.0         IT
5       Eve   29.0       F  70000.0         HR
6     Frank   40.0    Male  80000.0        NaN
7     Grace    NaN  Female  75000.0    Finance
8     Alice   25.0       F  50000.0         HR
9       Ram   98.0       M  20000.0         HR
10     Hari  100.0       M  20000.0         HR
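
The printed frame already shows a few gaps and one repeated row. A minimal sketch for counting them is below; it rebuilds the same rows inline so it runs without the CSV file, and the variable names `missing_per_column` and `duplicate_rows` are my own, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same rows as the printed dataset, rebuilt inline so the sketch is self-contained
df = pd.DataFrame({
    "Name": ["Shyam", "Alice", "Bob", "Charlie", "David", "Eve",
             "Frank", "Grace", "Alice", "Ram", "Hari"],
    "Age": [12, 25, np.nan, 30, 22, 29, 40, np.nan, 25, 98, 100],
    "Gender": ["M", "F", "M", "M", "M", "F", "Male", "Female", "F", "M", "M"],
    "Salary": [3000, 50000, 60000, np.nan, 45000, 70000,
               80000, 75000, 50000, 20000, 20000],
    "Department": ["HR", "HR", "IT", "Finance", "IT", "HR",
                   np.nan, "Finance", "HR", "HR", "HR"],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that exactly repeat an earlier row (the first occurrence is not flagged)
duplicate_rows = df[df.duplicated(keep="first")]

print(missing_per_column)
print(duplicate_rows)
```

Running this kind of check first makes it easier to confirm later that each cleaning step did what was intended.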




**I have a dataset**
1. The dataset has some missing values.
2. I have to fill them.
3. Before filling them, I check for outliers, because extreme values would skew a fill value such as the mean.
    * If there are outliers, I remove them first.
4. Then I fill the missing values.
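
To see why the order in the list above matters, here is a small sketch using only the `Age` values from the dataset. It assumes the mean would be the fill value, which is my assumption for illustration; the names `age_no_outliers`, `mean_with_outliers`, and `mean_without_outliers` are hypothetical.

```python
import pandas as pd

# Age values copied from the dataset printed earlier (None marks a missing value)
age = pd.Series([12, 25, None, 30, 22, 29, 40, None, 25, 98, 100], dtype="float64")

# IQR bounds, computed on the non-missing values (quantile skips NaN)
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep in-range values, and keep NaN so it can still be filled afterwards
age_no_outliers = age[age.between(lower, upper) | age.isna()]

mean_with_outliers = age.mean()                 # pulled upward by 98 and 100
mean_without_outliers = age_no_outliers.mean()  # closer to the typical age

print(round(mean_with_outliers, 2), round(mean_without_outliers, 2))
```

With the outliers still present the mean is about 42, and without them it is about 26, so filling before removing outliers would put a noticeably inflated age into the missing cells.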

In [10]:
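def remove_outliers_IQR(data, column):
    """Keep only the rows whose `column` value lies within 1.5 * IQR of the quartiles."""
    Q1 = data[column].quantile(0.25)  # first quartile
    Q3 = data[column].quantile(0.75)  # third quartile
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # NaN fails both comparisons, so rows with a missing value in `column` are dropped too
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]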


# Chain the filter over both numeric columns; .copy() gives an independent frame
# so the column assignments below do not write into a slice of df
df_outliers_cleaned = (
    df.pipe(remove_outliers_IQR, column="Age")
      .pipe(remove_outliers_IQR, column="Salary")
      .copy()
)

# Fill the unknown values -> strings
# Assigning the result back replaces the chained df[col].fillna(value, inplace=True)
# form, which raises a FutureWarning and will not work in pandas 3.0 because the
# intermediate df[col] object always behaves as a copy there
df_outliers_cleaned["Department"] = df_outliers_cleaned["Department"].fillna("Unknown")

# Remove duplicates
df_outliers_cleaned.drop_duplicates(inplace=True)

# Make the data consistent: map the long gender labels to the short ones
# (read from the cleaned frame itself rather than from the original df)
df_outliers_cleaned["Gender"] = df_outliers_cleaned["Gender"].replace(
    {"Male": "M", "Female": "F"}
)

df_outliers_cleaned.to_csv("Cleaned_data.csv", index=False)
df_outliers_cleaned

    Name   Age Gender   Salary Department
1  Alice  25.0      F  50000.0         HR
4  David  22.0      M  45000.0         IT
5    Eve  29.0      F  70000.0         HR
6  Frank  40.0      M  80000.0    Unknown
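
The result above no longer contains Bob, Grace, or Charlie. Each of them has a missing `Age` or `Salary`, and a NaN never satisfies the `>=` / `<=` comparisons in the filter, so those rows are dropped together with the real outliers. If the goal is to fill the gaps instead, one possible variant is sketched below. The function name `remove_outliers_iqr_keep_nan` is hypothetical, and using the median as the fill value is my assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd


def remove_outliers_iqr_keep_nan(data, column):
    """Drop rows outside 1.5 * IQR of the quartiles, but keep rows where `column` is NaN."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    in_range = data[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return data[in_range | data[column].isna()]


# Same rows as the printed dataset, rebuilt inline so the sketch is self-contained
df = pd.DataFrame({
    "Name": ["Shyam", "Alice", "Bob", "Charlie", "David", "Eve",
             "Frank", "Grace", "Alice", "Ram", "Hari"],
    "Age": [12, 25, np.nan, 30, 22, 29, 40, np.nan, 25, 98, 100],
    "Gender": ["M", "F", "M", "M", "M", "F", "Male", "Female", "F", "M", "M"],
    "Salary": [3000, 50000, 60000, np.nan, 45000, 70000,
               80000, 75000, 50000, 20000, 20000],
    "Department": ["HR", "HR", "IT", "Finance", "IT", "HR",
                   np.nan, "Finance", "HR", "HR", "HR"],
})

# Remove outliers first, keeping the rows that still need filling
kept = (
    df.pipe(remove_outliers_iqr_keep_nan, column="Age")
      .pipe(remove_outliers_iqr_keep_nan, column="Salary")
)

# Then fill the numeric gaps with the median of the remaining values
filled = kept.assign(
    Age=kept["Age"].fillna(kept["Age"].median()),
    Salary=kept["Salary"].fillna(kept["Salary"].median()),
)

print(filled)
```

Here the rows for Bob, Grace, and Charlie survive the outlier step and get filled afterwards, which matches the order described in the list near the top more closely than dropping them does.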
