# HR DATA ANALYSIS

**GOALS:**

1. Clean the data by removing unnecessary columns.
2. Give the columns new names.
3. Eliminate redundant (duplicate) entries.
4. Sanitize specific columns.
5. Eliminate the dataset's NaN values.
6. Apply a few more changes where necessary.

In [14]:
import pandas as pd
import numpy as np
df = pd.read_csv("HR Data.csv")

# Remove unnecessary columns:

In [15]:
columns_to_drop = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber','MaritalStatus']
df.drop(columns=columns_to_drop, inplace=True)
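
Before dropping, it can help to confirm that a column really carries no information. The sketch below is one possible check, not the notebook's own method: it uses a small made-up frame (`toy`, with illustrative values) and flags columns that hold a single distinct value.

```python
import pandas as pd

# Toy frame standing in for the HR data; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with exactly one distinct value cannot distinguish employees.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)

# errors="ignore" skips names that are absent instead of raising KeyError.
trimmed = toy.drop(columns=constant_cols, errors="ignore")
print(list(trimmed.columns))
```

Columns such as an employee ID or marital status are not constant, so dropping them is an analyst's decision rather than something this check can justify.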

# Rename columns:

In [16]:
df.rename(columns={
    'MothlyIncome': 'MonthlyIncome',
    'NumCompaniesWork': 'NumCompaniesWorked',
    'RelationSatisfaction': 'RelationshipSatisfaction',
    'TrainingTimes': 'TrainingTimesLastYear'
}, inplace=True)
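
Note that `rename` silently skips any key that is not an existing column, so a misspelled or already-correct name produces no error and no change. A minimal sketch of how to surface that, using a hypothetical two-column frame (`toy`) and a shortened `rename_map`:

```python
import pandas as pd

# Toy frame: only one of the two "old" names below actually exists.
toy = pd.DataFrame({"MothlyIncome": [5993, 5130], "Age": [41, 49]})
rename_map = {
    "MothlyIncome": "MonthlyIncome",
    "NumCompaniesWork": "NumCompaniesWorked",
}

# List the keys that rename() would skip without complaint.
missing_keys = [old for old in rename_map if old not in toy.columns]
print(missing_keys)

renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is an alternative when a missing key should stop the run.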

# Eliminate redundant entries:

In [17]:
df.drop_duplicates(inplace=True)
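
Because the identifier column `EmployeeNumber` was dropped earlier, two different employees with identical attributes would now be treated as duplicates. Counting the duplicates before removing them makes the effect visible. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Toy frame: the third row repeats the first.
toy = pd.DataFrame({
    "Age": [41, 49, 41],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# duplicated() marks every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# reset_index(drop=True) renumbers the rows after the removal.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(toy), len(deduped))
```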

# Sanitize specific columns:

In [18]:
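# Note: astype(int) raises an error if 'Age' still contains NaN,
# so this cast assumes the column has no missing values at this point.
df['Age'] = df['Age'].astype(int)
df['MonthlyIncome'] = df['MonthlyIncome'].astype(float)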
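
If the columns might contain missing or malformed entries at this stage, a more tolerant conversion is possible. The following is a sketch under that assumption, on made-up data (`toy` and its string values are hypothetical): `pd.to_numeric` with `errors="coerce"` converts what it can, and the nullable `Int64` dtype keeps integers alongside missing values.

```python
import pandas as pd

# Toy frame with a missing age and an unparseable income.
toy = pd.DataFrame({
    "Age": ["41", "49", None],
    "MonthlyIncome": ["5993", "n/a", "2090"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["MonthlyIncome"] = pd.to_numeric(toy["MonthlyIncome"], errors="coerce")

# The nullable "Int64" dtype stores integers and marks gaps as <NA>.
toy["Age"] = pd.to_numeric(toy["Age"], errors="coerce").astype("Int64")
print(toy.dtypes)
```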

# Eliminate NaN values:

**Option 1:** Drop rows with any NaN values:

In [19]:
df.dropna(inplace=True)

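**Option 2:** Alternatively, fill NaN values with a per-column statistic. Run this instead of Option 1; once `dropna` has been applied, there are no NaN values left to fill: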

In [20]:
df.fillna({
    'Age': df['Age'].mean(),
    'MonthlyIncome': df['MonthlyIncome'].median(),
}, inplace=True)
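
To choose between the two options, it helps to count the missing values first and compare the results on copies rather than modifying `df` in place. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0],
    "MonthlyIncome": [5993.0, 5130.0, np.nan],
})

# Number of missing values per column.
print(toy.isna().sum())

# Option 1: drop incomplete rows (keeps only fully populated rows).
dropped = toy.dropna()

# Option 2: keep every row and fill the gaps with a column statistic.
filled = toy.fillna({
    "Age": toy["Age"].mean(),
    "MonthlyIncome": toy["MonthlyIncome"].median(),
})
print(len(dropped), len(filled))
```

Here dropping keeps one row out of three, while filling keeps all three, which illustrates how much data Option 1 can cost when gaps are spread across rows.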

# Additional changes:

In [21]:
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
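
One caveat with `map`: any value that is not a key in the dictionary becomes NaN, which would reintroduce missing values after the NaN step. The sketch below, on made-up data (`toy` and the stray `"yes "` entry are hypothetical), shows one way to detect such values and normalize them before encoding.

```python
import pandas as pd

# Toy column with one entry that differs in case and trailing whitespace.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "yes "]})
mapping = {"Yes": 1, "No": 0}

# Values absent from the mapping come back as NaN.
encoded = toy["Attrition"].map(mapping)
unmapped = toy.loc[encoded.isna() & toy["Attrition"].notna(), "Attrition"].unique().tolist()
print(unmapped)

# Strip whitespace and normalize case before mapping.
normalized = toy["Attrition"].str.strip().str.capitalize().map(mapping)
print(normalized.tolist())
```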



In [22]:
# Flag employees with more than 10 years at the company (vectorized comparison).
df['YearsAtCompanyOver10'] = (df['YearsAtCompany'] > 10).astype(int)


**Check final dataset:**

In [23]:
print(df.head())
df.info()  # info() prints its summary directly and returns None


   Age  Attrition     BusinessTravel  DailyRate              Department  \
0   41          1      Travel_Rarely       1102                   Sales   
1   49          0  Travel_Frequently        279  Research & Development   
2   37          1      Travel_Rarely       1373  Research & Development   
3   33          0  Travel_Frequently       1392  Research & Development   
4   27          0      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EnvironmentSatisfaction  \
0                 1          2  Life Sciences                        2   
1                 8          1  Life Sciences                        3   
2                 2          2          Other                        4   
3                 3          4  Life Sciences                        4   
4                 2          1        Medical                        1   

   Gender  ...  RelationshipSatisfaction  StockOptionLevel  TotalWorkingYears  \
0       0  ...         

# Conclusions

1. Improved Data Quality:

By removing unnecessary columns such as EmployeeCount, Over18, StandardHours, and EmployeeNumber, we streamlined the dataset, making it more manageable and focused on relevant attributes. This step improves the usability of the data for the analysis that follows.

2. Consistent Data Formatting:

Renaming columns with inconsistent or incorrect names, such as MothlyIncome to MonthlyIncome and NumCompaniesWork to NumCompaniesWorked, improved the clarity and readability of the dataset. Consistent naming conventions are essential for effective data analysis and visualization.

3. Handling Missing and Duplicate Data:

By eliminating duplicate entries and handling NaN values, either by dropping incomplete rows or by imputing the mean or median, we made the dataset more accurate and reliable. This helps avoid biases and inaccuracies in the subsequent analysis and modeling stages.

# Recommendations

1. Regular Data Audits:

Implement regular data audits to ensure ongoing data quality and consistency. Periodic reviews can help identify and rectify any discrepancies, missing values, or outdated information, maintaining the dataset's integrity over time.

2. Standardize Data Entry Processes:

Develop and enforce standardized data entry protocols to minimize errors and inconsistencies. Training employees on these standards and using automated data validation tools can significantly reduce the risk of incorrect data entries.

3. Explore Advanced Imputation Techniques:

For handling missing values, consider exploring advanced imputation techniques such as K-Nearest Neighbors (KNN) or machine learning-based methods. These techniques can provide more accurate estimates for missing data compared to simple mean or median imputation, especially in complex datasets.
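
As an illustration of the KNN approach, the sketch below uses scikit-learn's `KNNImputer`, which is an assumption here since the notebook itself only uses pandas and NumPy. The frame `toy` and its values are made up. Each missing entry is replaced by the average of that feature over the `n_neighbors` most similar rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer  # assumes scikit-learn is installed

# Toy frame: the third employee's age is missing.
toy = pd.DataFrame({
    "Age": [25.0, 27.0, np.nan, 50.0, 52.0],
    "MonthlyIncome": [2000.0, 2200.0, 2100.0, 9000.0, 9500.0],
})

# Neighbors are found using the features that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

In this toy case the missing age is filled from the two rows with the closest income. On real data the features would usually be scaled first, since KNN distances are dominated by columns with large numeric ranges.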