In [1]:
# All important libraries go here!
import pandas as pd

In [2]:
# Load the data frame and remove the ID column
dataframe = pd.read_csv('./Data/washedData.csv')
dataframe = dataframe.drop(columns='ID')

#### Problem Statement
We are trying to understand the factors that influence whether a company has affected employees. This is important because companies with affected employees may require additional support or interventions. By identifying the key factors, we can target our interventions more effectively and potentially prevent employees from being affected in the future.

To solve this problem, we will use this dataset to build a predictive model. This model will take as input the various financial and operational characteristics of a company and output a prediction of whether the company has affected employees. We can then use this model to predict the status of new companies and guide our interventions.


<br>
<br>

#### (a) Data cleaning

* Handling duplicate values.
* Dealing with missing/null values.
* Addressing inconsistent data.

We will go through every column.

<br>

#### (a) (ii) Checking for missing values in each column of the dataframe

In [3]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4137 entries, 0 to 4136
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   No of employee     4137 non-null   int64 
 1   Annual turnover    4129 non-null   object
 2   TCTC               4123 non-null   object
 3   Basic Salary       4135 non-null   object
 4   Cash Injection     4137 non-null   int64 
 5   Contrib Waiver     4137 non-null   int64 
 6   Affected Employee  4137 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 226.4+ KB


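The results show that three columns contain missing values. Out of 4137 entries, `Annual turnover` has 4129 non-null values (8 missing), `TCTC` has 4123 (14 missing), and `Basic Salary` has 4135 (2 missing). The remaining four columns are complete, since their non-null count equals the total of 4137.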

<br>

#### (a) (iii) Checking for duplicate rows

In [4]:
dataframe.duplicated().sum()

1075

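The dataset contains 1075 duplicated rows, since the sum is 1075 rather than 0. Because the `ID` column was dropped before this check, some of these rows may belong to different companies that happen to share identical values, so they should be reviewed before being removed.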
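If duplicated rows need to be removed, a minimal sketch with pandas could look like the following. The `toy` frame and its values are made up for illustration and only stand in for the real `dataframe`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "No of employee": [1, 1, 2, 1],
    "TCTC": [0.0, 0.0, 45916.0, 0.0],
    "Affected Employee": [1, 1, 1, 1],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Running the duplicate check before dropping an identifier column is a safer design choice when identical values can legitimately occur for different entities.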

<br>

#### (a) (iv) Ensuring data consistency

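These attributes may contain inconsistencies: they hold monetary amounts and should be numeric, but pandas read them as `object`, which suggests that some entries are not valid numbers.
We will convert them to a numeric type and turn every entry that cannot be parsed into a null value (NaN).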

 1   Annual turnover    4129 non-null   object
 2   TCTC               4123 non-null   object
 3   Basic Salary       4135 non-null   object

In [5]:
dataframe['Annual turnover'] = pd.to_numeric(dataframe['Annual turnover'], errors='coerce')
dataframe['TCTC'] = pd.to_numeric(dataframe['TCTC'], errors='coerce')
dataframe['Basic Salary'] = pd.to_numeric(dataframe['Basic Salary'], errors='coerce')

dataframe[['Annual turnover', 'TCTC', 'Basic Salary']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4137 entries, 0 to 4136
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Annual turnover  3876 non-null   float64
 1   TCTC             3978 non-null   float64
 2   Basic Salary     4099 non-null   float64
dtypes: float64(3)
memory usage: 97.1 KB


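Some minor cleaning has been performed: every entry that could not be parsed as a number has been coerced to NaN. The non-null counts dropped from 4129 to 3876 for `Annual turnover`, from 4123 to 3978 for `TCTC`, and from 4135 to 4099 for `Basic Salary`. These values are still in the dataset as nulls. For further cleaning, the rows containing them are to be removed.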
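One possible way to drop the rows that hold null values in the converted columns is sketched below. The `toy` frame, its values, and the name `money_columns` are assumptions made for illustration; the real notebook would apply the same calls to `dataframe`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Annual turnover": ["15000000", "unknown", "52000000"],
    "TCTC": ["326574.61", "543629.21", None],
    "Basic Salary": ["992400.00", "397789.63", "496910.00"],
})

# Hypothetical helper name for the three monetary columns.
money_columns = ["Annual turnover", "TCTC", "Basic Salary"]

# Same conversion as in the cell above: unparseable entries become NaN.
for column in money_columns:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

# Count the nulls per column before removing anything.
nulls_per_column = toy[money_columns].isna().sum()

# Drop every row that has a null in at least one of the monetary columns.
cleaned = toy.dropna(subset=money_columns).reset_index(drop=True)

print(nulls_per_column)
print(cleaned)
```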

In [6]:
dataframe

| | No of employee | Annual turnover | TCTC | Basic Salary | Cash Injection | Contrib Waiver | Affected Employee |
|---|---|---|---|---|---|---|---|
| 0 | 63 | 3.098000e+09 | 13782989.04 | 9500520.21 | 0 | 1 | 0 |
| 1 | 73 | 1.500000e+07 | 326574.61 | 992400.00 | 1 | 1 | 1 |
| 2 | 18 | 0.000000e+00 | 543629.21 | 397789.63 | 0 | 1 | 1 |
| 3 | 25 | 5.200000e+07 | 725607.67 | 496910.00 | 0 | 1 | 1 |
| 4 | 1 | 7.652706e+05 | 205385.34 | 31530.00 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4132 | 1 | 0.000000e+00 | 0.00 | 0.00 | 0 | 0 | 1 |
| 4133 | 2 | 5.509920e+05 | 45916.00 | 45916.00 | 1 | 1 | 1 |
| 4134 | 1 | 0.000000e+00 | 0.00 | 0.00 | 0 | 0 | 1 |
| 4135 | 1 | 0.000000e+00 | 0.00 | 0.00 | 0 | 0 | 1 |


<br>
<br>

#### Statistical Analysis

* Mean TCTC: Explain the average compensation.
* Median Basic Salary: Discuss the central tendency.
* Standard Deviation of Annual Turnover: Highlight variability.

In [7]:
# Code goes here!
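As a starting point for this cell, the three statistics listed above could be computed roughly as follows. The `toy` frame and its values are made up for illustration; the real notebook would use the cleaned `dataframe`.

```python
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up for illustration).
toy = pd.DataFrame({
    "Annual turnover": [0.0, 15_000_000.0, 52_000_000.0, 550_992.0],
    "TCTC": [543629.21, 326574.61, 725607.67, 45916.00],
    "Basic Salary": [397789.63, 992400.00, 496910.00, 45916.00],
})

# Mean TCTC: the average compensation cost per company.
mean_tctc = toy["TCTC"].mean()

# Median Basic Salary: a measure of central tendency that is robust to extreme values.
median_basic_salary = toy["Basic Salary"].median()

# Standard deviation of Annual turnover: pandas uses the sample formula (ddof=1) by default.
std_turnover = toy["Annual turnover"].std()

print(f"Mean TCTC: {mean_tctc:,.2f}")
print(f"Median Basic Salary: {median_basic_salary:,.2f}")
print(f"Std of Annual turnover: {std_turnover:,.2f}")
```

`toy.describe()` would report the mean, the standard deviation, and the quartiles (including the median) for all numeric columns in a single call.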

<br>
<br>

#### Exploratory Data Analysis (EDA)

* Histograms of "Annual turnover" and "Basic Salary."
* Box plots of "TCTC" to identify outliers.
* Scatter plots to explore relationships between variables.

Interpret the insights gained from each visualization.

In [8]:
# Code goes here!
# Every exploration is to be done in an individual block.
# With a markdown to explain the Exploration
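The plots listed above could be sketched with matplotlib as shown below. The data is randomly generated and only stands in for the cleaned `dataframe`; the three plots share one figure here to keep the sketch short, whereas in the notebook each would go in its own cell. The outlier count uses the 1.5 × IQR rule, which is also matplotlib's default for box plot whiskers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Annual turnover": rng.lognormal(mean=13, sigma=2, size=200),
    "Basic Salary": rng.lognormal(mean=12, sigma=1, size=200),
})
toy["TCTC"] = toy["Basic Salary"] * rng.uniform(1.0, 1.5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of Annual turnover.
axes[0].hist(toy["Annual turnover"], bins=30)
axes[0].set_title("Annual turnover")

# Box plot: points beyond the whiskers are candidate outliers.
axes[1].boxplot(toy["TCTC"].to_numpy())
axes[1].set_title("TCTC")

# Scatter plot: relationship between Basic Salary and TCTC.
axes[2].scatter(toy["Basic Salary"], toy["TCTC"], s=10)
axes[2].set_xlabel("Basic Salary")
axes[2].set_ylabel("TCTC")
axes[2].set_title("Basic Salary vs TCTC")

fig.tight_layout()

# Count the outliers numerically with the 1.5 * IQR rule.
q1, q3 = toy["TCTC"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["TCTC"] < q1 - 1.5 * iqr) | (toy["TCTC"] > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())
print(f"TCTC outliers by the IQR rule: {n_outliers}")
```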

<br>
<br>

#### Machine Learning

##### Predictive Modeling for Annual Turnover

We use **Annual Turnover** as the target variable (the value we want to predict) and the other features as predictors:

* No of Employees
* TCTC
* Basic Salary
* Cash Injection
* Contrib Waiver

We can use regression algorithms such as linear regression, decision trees, or random forests to build the model.
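A minimal sketch of such a regression model, assuming scikit-learn is available, is shown below. The data is randomly generated with a made-up relationship between the columns, so the fitted coefficients say nothing about the real dataset; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data with a made-up relationship (not the real dataset).
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "No of employee": rng.integers(1, 100, size=n),
    "Cash Injection": rng.integers(0, 2, size=n),
    "Contrib Waiver": rng.integers(0, 2, size=n),
})
toy["Basic Salary"] = toy["No of employee"] * rng.uniform(15_000, 25_000, size=n)
toy["TCTC"] = toy["Basic Salary"] * rng.uniform(1.1, 1.4, size=n)
toy["Annual turnover"] = toy["TCTC"] * 5 + rng.normal(0, 50_000, size=n)

features = ["No of employee", "TCTC", "Basic Salary", "Cash Injection", "Contrib Waiver"]
X = toy[features]
y = toy["Annual turnover"]

# Hold out 20% of the rows so the model can be evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a linear regression and predict turnover for the held-out rows.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print(dict(zip(features, model.coef_.round(3))))
```

Swapping `LinearRegression` for `DecisionTreeRegressor` or `RandomForestRegressor` keeps the rest of the code unchanged, because all three follow the same `fit` and `predict` interface.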

<br>

##### Employee Classification

We can use machine learning to classify employees into different categories based on:

* Cash Injection
* Contrib Waiver
* Affected Employee

We might want to classify employees into **Highly Affected** and **Less Affected** categories.
We can use classification algorithms like logistic regression, decision trees, or support vector machines.
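A minimal classification sketch with logistic regression is shown below. It assumes that the binary `Affected Employee` column serves as the label and that `Cash Injection` and `Contrib Waiver` are the predictors, which is one possible reading of the list above. The data and the rule that generates the labels are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Cash Injection": rng.integers(0, 2, size=n),
    "Contrib Waiver": rng.integers(0, 2, size=n),
})

# Made-up labelling rule plus noise, used only to get a learnable target.
signal = toy["Cash Injection"] + toy["Contrib Waiver"] + rng.normal(0, 0.5, size=n)
toy["Affected Employee"] = (signal > 1).astype(int)

X = toy[["Cash Injection", "Contrib Waiver"]]
y = toy["Affected Employee"]

# Stratify so both classes appear in the training and test sets in similar proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifier = LogisticRegression().fit(X_train, y_train)
predicted_labels = classifier.predict(X_test)
predicted_probabilities = classifier.predict_proba(X_test)

accuracy = accuracy_score(y_test, predicted_labels)
print(f"Accuracy on the held-out rows: {accuracy:.2f}")
```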

<br>

##### Employee Segmentation

Clustering techniques like K-means clustering can be used to segment employees based on their characteristics.

We can use features such as the following to create meaningful clusters:

* No of Employees
* TCTC
* Basic Salary

In [9]:
# Code goes here!
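For the segmentation part, a K-means sketch with scikit-learn could look like this. The data is randomly generated, and the choice of three clusters is an assumption made for illustration, not a result from the dataset. The features are standardised first because K-means relies on Euclidean distances, and the monetary columns would otherwise dominate the employee count.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data (not the real dataset).
rng = np.random.default_rng(1)
n = 150
toy = pd.DataFrame({"No of employee": rng.integers(1, 200, size=n)})
toy["Basic Salary"] = toy["No of employee"] * rng.uniform(15_000, 25_000, size=n)
toy["TCTC"] = toy["Basic Salary"] * rng.uniform(1.1, 1.4, size=n)

features = ["No of employee", "TCTC", "Basic Salary"]

# Put all features on a comparable scale before computing distances.
scaled = StandardScaler().fit_transform(toy[features])

# n_clusters=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(scaled)

# Average feature values per cluster help to interpret the segments.
print(toy.groupby("cluster")[features].mean())
```

In practice the number of clusters is usually chosen by comparing several values, for example with the elbow method on `kmeans.inertia_` or with silhouette scores.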

<br>
<br>

#### Evaluation of Machine Learning

Present performance metrics (e.g., Mean Absolute Error, R-squared) for each algorithm used.

Explain what the results mean:

* Which algorithm performed better?
* How accurate is the prediction of turnover?
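The comparison described above could be sketched as follows, using Mean Absolute Error and R-squared from scikit-learn for two regression algorithms. The data is randomly generated with a made-up relationship, so the printed numbers are not results for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data with a made-up relationship (not the real dataset).
rng = np.random.default_rng(7)
n = 300
toy = pd.DataFrame({"No of employee": rng.integers(1, 100, size=n)})
toy["Basic Salary"] = toy["No of employee"] * rng.uniform(15_000, 25_000, size=n)
toy["TCTC"] = toy["Basic Salary"] * rng.uniform(1.1, 1.4, size=n)
toy["Annual turnover"] = toy["TCTC"] * 5 + rng.normal(0, 50_000, size=n)

X = toy[["No of employee", "TCTC", "Basic Salary"]]
y = toy["Annual turnover"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=7),
}

# Fit each model on the training rows and score it on the same held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "R2": r2_score(y_test, predictions),
    }

print(pd.DataFrame(results).T)
```

A lower MAE means the predictions are closer to the true turnover on average, in the same units as the target, while an R-squared closer to 1 means the model explains more of the variance in turnover.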

<br>
<br>

#### Presentation of Results

Summarize key findings:

* Trends in annual turnover.
* Compensation fairness insights.

Mention any actionable recommendations based on the analysis.