# Task_1

Made by: Anton Beny M S

In this notebook, we are going to understand Data Visualization using mathematical models.

First, we import the necessary libraries and configure them.

In [None]:
import statistics

#!pip install lets-plot -U
import lets_plot as lplt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('dark_background')

## Data

Now we are going to import pre-existing data in the form of a .csv file.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/r3kste/AIC/main/Task_1/data/data.csv')
data

Now that we have our data, let's look at some of its summary statistics.

In [None]:
data.describe()

### Inferences
* The 75th percentile of **Age** is 28.5. This means that 75% of the employees who left were 28 or younger, that is, within the age group 24-28.
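
To make the percentile reading concrete, here is a minimal sketch on a small made-up list of ages (an assumption for illustration, not the values in `data.csv`). It shows that roughly 75% of the observations sit at or below the 75th percentile.

```python
import numpy as np

# Hypothetical ages, for illustration only (not taken from data.csv)
ages = np.array([24, 25, 25, 26, 27, 28, 30, 41])

# np.percentile interpolates linearly between neighbouring values by default
q75 = np.percentile(ages, 75)

# Fraction of observations at or below the 75th percentile
share_at_or_below = np.mean(ages <= q75)

print("75th percentile:", q75)
print("Share of ages at or below it:", share_at_or_below)
```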

## Distribution of Employees

Let us first look at how the employees are distributed.

In [None]:
lplt.ggplot(data) + lplt.geom_point(lplt.aes(x="Education", y="Age", color="Gender")) + lplt.flavor_high_contrast_dark()

In [None]:
#Frequency Table of Some Fields

n_male = len(data.loc[data['Gender'] == 'Male'])
n_female = len(data.loc[data['Gender'] == 'Female'])

n_bach = len(data.loc[data['Education'] == 'Bachelors'])
n_mba = len(data.loc[data['Education'] == 'MBA'])

print('Number of:\n\nMale:', n_male, '\nFemale:', n_female, '\n\nBachelors:', n_bach, '\nMBA:', n_mba)

### Inferences
* The Male to Female ratio is around 17:1, which (coincidentally) equals the ratio of people having a Bachelor's Degree to those having an MBA Degree.
* The younger employees, from the ages of 24 to 35, only had a Bachelor's Degree.
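
The same kind of frequency table can also be built with pandas helpers. The sketch below uses a tiny made-up DataFrame (the rows are assumptions, only the column names match the notebook) to show `value_counts` and `pd.crosstab`.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from data.csv)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female"],
    "Education": ["Bachelors", "Bachelors", "MBA", "Bachelors"],
})

# One-way frequency table
gender_counts = toy["Gender"].value_counts()

# Two-way frequency table: Gender against Education
education_by_gender = pd.crosstab(toy["Gender"], toy["Education"])

print(gender_counts)
print(education_by_gender)
print("Male : Female =", gender_counts["Male"] / gender_counts["Female"], ": 1")
```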

## Age of Employees

Now, let's visualize our data in terms of the employees' age.

In [None]:
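# Ages: sorted unique values, so they line up with the grouped counts below
age = sorted(set(data['Age']))
print('Age:\t', age)

# Frequency table of ages (groupby sorts by Age, matching the order of `age`)
c_age = np.array(data.groupby('Age').size())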
print('c_Age:\t', c_age)

Now, let's plot a bar graph of Frequency against Age to further understand the distribution of the employees.

In [None]:
fig, ax = plt.subplots()
ax.set_title('Age vs Frequency', fontsize=20)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

plt.bar(age, c_age, width=0.5)
plt.show()

### Inferences

* As mentioned earlier, most of the employees who left are in the age group from 24 to 28.
* Beyond age 28, the number of employees leaving fluctuates from one age to the next, but never comes close to the counts for ages 24-28.
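
A claim like "most leavers are aged 24 to 28" can be checked numerically as well as visually. The following is a small sketch on made-up ages (an assumption, not the notebook's data) that builds an aligned age/count table and computes the share inside the band.

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([24, 25, 25, 26, 27, 28, 28, 31, 35, 42])

# Sorted unique ages and their counts, guaranteed to be aligned with each other
unique_ages, counts = np.unique(ages, return_counts=True)

# Share of employees whose age falls in the 24-28 band
in_band = (ages >= 24) & (ages <= 28)
share = in_band.mean()

print(dict(zip(unique_ages.tolist(), counts.tolist())))
print("Share aged 24-28:", share)
```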

## Reasons for Leaving

Now, let's try to find the major reasons why these people left their jobs.<br><br>
We will split the population into two groups: <br>
*Group One:* employees aged 24 to 28. <br>
*Group Two:* employees aged 29 and above.
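
As an alternative way to express this split, the sketch below labels each row with its group using `pd.cut`. The ages are made up for illustration; only the 28/29 boundary comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
toy = pd.DataFrame({"Age": [24, 28, 29, 35, 26, 54]})

# Bins are right-inclusive: (0, 28] is Group One, (28, inf) is Group Two
toy["Group"] = pd.cut(
    toy["Age"],
    bins=[0, 28, np.inf],
    labels=["Group One", "Group Two"],
)

print(toy)
print(toy["Group"].value_counts())
```

Keeping the group as a column means every row belongs to exactly one group, so the two groups always add up to the full dataset.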

In [None]:
#Group1
data1 = data.loc[data['Age'] < 29]
data1 = data1.sort_values(['Remarks'], ascending=True)
data1.sort_values(['S.No'], ascending=True)

In [None]:
#Group2
data2 = data.loc[data['Age'] > 28]
data2 = data2.sort_values(['Remarks'], ascending=True)
data2.sort_values(['S.No'], ascending=True)

In [None]:
#List of possible reasons for leaving the company
remarks = sorted(set(data['Remarks']))
print(remarks)

In [None]:
# Finding the frequency of each Remark, and storing it

c_remarks = np.array(data.groupby('Remarks').size())  # Overall
print("c_remarks:\t", c_remarks)

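# Reindex so that every remark is present (count 0 if missing) and aligned with the overall counts
c_remarks1 = np.array(data1.groupby('Remarks').size().reindex(remarks, fill_value=0))  # Group One
print("c_remarks1:\t", c_remarks1)

c_remarks2 = np.subtract(c_remarks, c_remarks1)  # Group Two = Overall minus Group One
print("c_remarks2:\t", c_remarks2)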

Now, let's plot the above data as a bar graph.

In [None]:
fig, ax = plt.subplots()
ax.set_title("Reasons for Leaving among Age Groups", fontsize=20)
ax.set_xlabel("Number of Employees", fontsize=14)  # barh puts the counts on the x-axis
ax.set_ylabel("Remarks", fontsize=14)

plt.barh(remarks, c_remarks, height=0.8, label='Overall')  # Overall
plt.barh(remarks, c_remarks1, height=0.5, label='Age <= 28', color='#446399')  # Group One
plt.barh(remarks, c_remarks2, height=0.2, label='Age >= 29', color='#1a1a44')  # Group Two

plt.legend()
plt.show()

### Inferences

#### Overall
* The major reason for leaving the job seems to be **"Issues with the Manager"**, closely followed by **"Lack of Growth"**.
* The next major reason is **"More Challenging Job Roles/Higher designation"**, followed by "Better Salary".
* The other reasons are seemingly less significant.
* Most of these can be explained in terms of work experience and environment.


#### Group One
* For Group One, the major reason is **"Issues with the Manager"**, which is closely followed by "Lack of Growth". This is similar to the overall demographic as well.
* For Group One, the other reasons are not as significant as the first two.
* It is evident that Group One is the major faction among the people who left due to **"Issues with the Manager"**.
* This is understandable given that these are the younger employees, with little to no previous experience, who want to reach their full potential and grow.

#### Group Two
* For Group Two, the major reason is **"Lack of Growth"** which is closely followed by **"More Challenging Job Roles/Higher designation"**.
* Unlike Group One, much fewer individuals from Group Two reported **"Issues with the Manager"**.
* It is clear that Group Two individuals are the major faction among those who left for **"More Challenging Job Roles/Higher designation"**.
* Once again, this is understandable, considering that Group Two is the more experienced set of employees, who understand the work environment and are looking for better roles.
* Few to no employees in Group Two left for the following reasons: **"Termination - Theft"**, **"Termination - Poor Performance"**, **"Higher Education"** and **"Absconding"**.
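
The per-group comparison above can also be tabulated in one step. This is a sketch on made-up rows (the ages and remarks are assumptions for illustration), using `pd.crosstab` to count each reason per group and `idxmax` to pick the most common one.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from data.csv)
toy = pd.DataFrame({
    "Age": [24, 25, 27, 28, 31, 36, 40],
    "Remarks": [
        "Issues with the Manager", "Issues with the Manager", "Lack of Growth",
        "Issues with the Manager", "Lack of Growth", "Lack of Growth", "Better Salary",
    ],
})
toy["Group"] = toy["Age"].map(lambda a: "Group One" if a <= 28 else "Group Two")

# Rows are reasons, columns are groups; missing combinations are filled with 0
table = pd.crosstab(toy["Remarks"], toy["Group"])

# Most frequent reason in each group
top_reason = table.idxmax()

print(table)
print(top_reason)
```

Because `crosstab` fills absent combinations with 0, the counts for both groups stay aligned even when one group never reports a given reason.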

## Monthly Income

Now, let's look at one of the more important fields, the **Monthly Income**.

In [None]:
# Average monthly income for each age, in the same order as `age`
minco0 = []
for i in age:
    temp = data.loc[data['Age'] == i, 'Monthly Income']
    minco0.append(np.mean(temp))

pd.DataFrame(list(zip(age, minco0)), columns=['Age', 'Average Monthly Income'])

Now, let us plot Monthly Income against Age, first as a scatter plot of every employee and then as a bar graph of the averages.

In [None]:
lplt.ggplot(data) + lplt.geom_point(
    lplt.aes(color="Education", y="Monthly Income", x="Age")) + lplt.flavor_high_contrast_dark()

In [None]:
fig, ax = plt.subplots()
ax.set_title("Age vs Average Monthly Income", fontsize=20)
ax.set_xlabel("AGE", fontsize=12)
ax.set_ylabel("Average Monthly Income", fontsize=12)

plt.bar(age, minco0, width=0.5)
plt.show()

Now it is clear that **ABC197**, the **"CXO"**, is an outlier with an Average Monthly Income of 2.3 L, and is skewing our data. So let us remove this entry for now.
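
Here the outlier was spotted by eye. One common rule of thumb for flagging such points (not the method used in this notebook) is the 1.5 x IQR fence. A minimal sketch, with made-up incomes where the last value stands in for the 2.3 L entry:

```python
import numpy as np

# Hypothetical monthly incomes, for illustration only; the last one mimics the CXO
incomes = np.array([18000, 20000, 21000, 22000, 25000, 27000, 30000, 230000])

q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Anything above Q3 + 1.5 * IQR is flagged as a high outlier
upper_fence = q3 + 1.5 * iqr
outliers = incomes[incomes > upper_fence]

print("Upper fence:", upper_fence)
print("Flagged outliers:", outliers)
```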

In [None]:
fig, ax = plt.subplots()
ax.set_title("Age vs Average Monthly Income", fontsize=20)
ax.set_xlabel("AGE", fontsize=12)
ax.set_ylabel("Average Monthly Income", fontsize=12)

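# Drop the last (highest) age and its average, which belong to the CXO outlier.
# Slicing removes by position, so equal values elsewhere are not dropped by accident.
cp_age = age[:-1]
cp_minco0 = minco0[:-1]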

plt.bar(cp_age, cp_minco0, width=0.5)
plt.show()

This is much better...

### Inferences

* Correlation between Age and Salary:<br>In Group One, that is from ages 24 to 28, there is little to no correlation between age and salary: the average income barely changes with age.<br>In Group Two, that is from ages 29 and above, there is a strong correlation between the two: the average income increases with age.
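
The word "correlation" can be backed by a number. The sketch below computes the Pearson correlation coefficient separately for two made-up groups (all values are assumptions chosen to mimic the pattern described above, not the notebook's data).

```python
import numpy as np

# Hypothetical (age, average income) pairs, for illustration only
g1_age = np.array([24, 25, 26, 27, 28])
g1_income = np.array([20000, 21000, 19500, 20500, 20000])

g2_age = np.array([29, 33, 38, 45, 54])
g2_income = np.array([30000, 42000, 60000, 85000, 120000])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_group_one = np.corrcoef(g1_age, g1_income)[0, 1]
r_group_two = np.corrcoef(g2_age, g2_income)[0, 1]

print("Group One r:", round(r_group_one, 3))
print("Group Two r:", round(r_group_two, 3))
```

A value near 0 means the income barely moves with age, while a value near 1 means it rises almost linearly with age.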

## Grades and Designation

Now, let us explore **Grade** and **Designation**.

In [None]:
cp_data = data.sort_values(['Grade'], ascending=True)  # Sorted copy; alphabetical order puts "CXO" first
cxo_row = cp_data.iloc[[0]]  # The first row is the "CXO" entry
cp_data = pd.concat([cp_data.drop(cp_data.index[0]), cxo_row])  # Move it to the end
del cxo_row  # Delete temporary dataframe
cp_data  # Dataframe with "Grade" sorted according to "E1<E2<M1<M2<M3<M4<CXO"

Now, let us see the relation between Grade and Designation.

In [None]:
grade = sorted(set(data.Grade))
grade.append(grade.pop(0))  # Move "CXO" (alphabetically first) to the end
print(grade)

In [None]:
designation = []
for g in grade:
    temp = list(data.loc[data['Grade'] == g, 'Designation'])
    designation.append(temp[0])  # First designation listed for this grade
print(designation)

In [None]:
pd.DataFrame(list(zip(grade, designation)), columns=['Grade', 'Designation'])

Now, let us compare **Age**, **Grade** and **Designation** together.

In [None]:
gr = []
design = []

# Most common Grade and Designation at each age
for i in age:
    gr.append(statistics.mode(list(data.loc[data['Age'] == i, 'Grade'])))
    design.append(statistics.mode(list(data.loc[data['Age'] == i, 'Designation'])))

pd.DataFrame(list(zip(age, gr, design)), columns=['Age', 'Grade', 'Designation'])

In [None]:
fig, ax = plt.subplots()
ax.set_title("Age vs Grade", fontsize=20)
ax.set_xlabel("AGE", fontsize=12)
ax.set_ylabel("GRADE", fontsize=12)

plt.plot(age, gr, '8')
plt.show()

### Inference
* It doesn't take a genius to observe that the **Grade** has a strong correlation with **Age** and therefore with work experience, which in turn results in promotions to better designations and a better salary.
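
Since **Grade** is an ordered label rather than a number, one way to quantify this relationship is to declare the order explicitly and compare ranks. The sketch below uses made-up rows (an assumption) with the grade order "E1<E2<M1<M2<M3<M4<CXO" from the earlier cell, and computes a Spearman rank correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Grade order as stated earlier in the notebook
grade_order = ["E1", "E2", "M1", "M2", "M3", "M4", "CXO"]

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Age": [24, 27, 31, 36, 42, 54],
    "Grade": ["E1", "E2", "M1", "M2", "M4", "CXO"],
})

# An ordered categorical sorts by seniority instead of alphabetically
toy["Grade"] = pd.Categorical(toy["Grade"], categories=grade_order, ordered=True)
sorted_toy = toy.sort_values("Grade")

# Integer position of each grade in grade_order (E1 is 0, CXO is 6)
grade_rank = toy["Grade"].cat.codes

# Spearman rank correlation = Pearson correlation of the ranks
rho = toy["Age"].rank().corr(grade_rank.rank())

print(sorted_toy)
print("Spearman rho:", rho)
```

An ordered categorical also removes the need to move the "CXO" row to the end by hand after sorting.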

## Last Rating

Now, let us look into some of the other data provided to us, starting with **Last Rating**.

In [None]:
last_rating = [1, 2, 3, 4, 5]
minco = []

for i in last_rating:
    temp = data.loc[data['Last Rating'] == i, 'Monthly Income']
    minco.append(np.mean(temp))

pd.DataFrame(list(zip(last_rating, minco)), columns=['Last Rating', 'Average Monthly Income'])

In [None]:
fig, ax = plt.subplots()
ax.set_title("Last Rating vs Monthly Income", fontsize=20)
ax.set_xlabel("LAST RATING", fontsize=12)
ax.set_ylabel("AVERAGE MONTHLY INCOME", fontsize=12)

plt.plot(last_rating, minco, '8-.')
plt.show()

### Inference
* We know that a rating of 1 means Poor and 5 means Excellent.
* This plot seems to imply that those with a last rating of 3 have a higher average income than those with 5.
* At first glance, this doesn't make any sense.

### Explanation
* One reason the average income of the people with a last rating of 3 is high is our **"CXO"** outlier, **Mr. ABC197**.
* Another reason could be that the employees with a higher rating are just less likely to leave (especially the ones with a better salary), because they are most probably happy with their job. So this is not a complete representation of all the people working at the company.
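
To see how strongly a single large value can pull an average, here is a small sketch with made-up incomes (assumptions for illustration). It compares the mean with and without the outlier, and with the median, which is much less sensitive to it.

```python
import statistics

# Hypothetical incomes for one rating group; the last value mimics the CXO
incomes = [22000, 24000, 25000, 26000, 230000]

mean_with_outlier = statistics.mean(incomes)
median_with_outlier = statistics.median(incomes)
mean_without_outlier = statistics.mean(incomes[:-1])

print("Mean with outlier:   ", mean_with_outlier)
print("Median with outlier: ", median_with_outlier)
print("Mean without outlier:", mean_without_outlier)
```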

Let's try to plot the same graph, but this time excluding **Mr. ABC197**.

In [None]:
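minco = []
b_data = data.drop(196)  # Drop ABC197 (row label 196, assuming the default 0-based index)

for i in last_rating:
    # Build the mask from b_data itself so that it matches the rows being selected
    temp = b_data.loc[b_data['Last Rating'] == i, 'Monthly Income']
    minco.append(np.mean(temp))

pd.DataFrame(list(zip(last_rating, minco)), columns=['Last Rating', 'Average Monthly Income'])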

In [None]:
fig, ax = plt.subplots()
ax.set_title("Last Rating vs Monthly Income", fontsize=20)
ax.set_xlabel("LAST RATING", fontsize=12)
ax.set_ylabel("AVERAGE MONTHLY INCOME", fontsize=12)

plt.plot(last_rating, minco, '8-.')
plt.show()

As we can see, it is back to normalcy.

## Regression

Now, let us use regression to further analyse our data.

In [None]:
from sklearn.linear_model import LinearRegression as LinReg
from sklearn.preprocessing import PolynomialFeatures as PolyReg

Let us define a function that performs the necessary linear regression using sklearn and plots the result.

In [None]:
def linreg(x, y, deco1):
    reg0 = LinReg()
    reg0.fit(np.array(x).reshape(-1, 1), y)

    plt.plot(x, y, deco1)  # Original data points

    # Evaluate the fitted line over the actual range of x
    x_plot = np.linspace(min(x), max(x), 10)
    y_plot = reg0.predict(x_plot.reshape(-1, 1))
    plt.plot(x_plot, y_plot)

In [None]:
linreg(last_rating, minco, '8-.')

Greatly underfitted... but that was not unexpected.
So let us try polynomial regression.
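
"Underfitted" can be put into a number with the coefficient of determination, R². The sketch below fits a straight line to five made-up U-shaped averages (an assumption, not the notebook's values) with `np.polyfit` and shows that the line explains very little of the variation.

```python
import numpy as np

# Hypothetical U-shaped averages, for illustration only
x = np.array([1, 2, 3, 4, 5])
y = np.array([30000, 22000, 21000, 24000, 33000])

# Least-squares straight line
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("R^2 of the linear fit:", round(r2, 3))
```

An R² close to 0 means the line does hardly better than predicting the mean everywhere.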

Let us define a function that performs the polynomial regression using sklearn and plots the result.

In [None]:
def pol_reg(deg, g_title, x, x0, x1, x_title, y, y_title, deco1, deco2):
    fig, ax = plt.subplots()
    ax.set_title(g_title, fontsize=20)
    ax.set_xlabel(x_title, fontsize=12)
    ax.set_ylabel(y_title, fontsize=12)

    # Expand x into the columns [x, x^2, ..., x^deg]
    poly_feat = PolyReg(degree=deg, include_bias=False)
    x_poly = poly_feat.fit_transform(np.array(x).reshape(-1, 1))

    # Ordinary least squares on the polynomial features
    reg = LinReg()
    reg.fit(x_poly, y)

    plt.plot(x, y, deco1)  # Original data points

    # Evaluate the fitted polynomial on a dense grid between x0 and x1
    x_plot = np.linspace(x0, x1, 250)
    y_plot = reg.predict(poly_feat.transform(x_plot.reshape(-1, 1)))

    plt.plot(x_plot, y_plot, deco2)
    plt.show()

Now, let us plot the graphs between **Last Rating** and **Average Monthly Income** for various degrees and find the optimal degree.

In [None]:
for i in range(2, 7):
    pol_reg(i, f'Degree {i}', last_rating, last_rating[0], last_rating[-1], 'Last Rating', minco,
            'Average Monthly Income', '8', '')

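As we can see, Degree 4 seems to be a good choice. Keep in mind, though, that with only five data points a degree-4 polynomial passes through every point exactly, so a perfect-looking fit is expected here.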
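
To see how the number of points limits the usable degree, here is a sketch on five made-up points (an assumption, not the notebook's averages). It fits polynomials of increasing degree with `np.polyfit` and records the largest gap between the curve and the data.

```python
import numpy as np

# Hypothetical averages (in thousands), for illustration only
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30.0, 22.0, 21.0, 24.0, 33.0])

max_residual = {}
for deg in (1, 2, 3, 4):
    coeffs = np.polyfit(x, y, deg)
    fitted = np.polyval(coeffs, x)
    # Largest absolute gap between the fitted curve and the data points
    max_residual[deg] = float(np.max(np.abs(fitted - y)))

for deg, res in max_residual.items():
    print(f"Degree {deg}: max residual = {res:.6f}")
```

With n points, a polynomial of degree n - 1 can always be made to hit every point, so a zero residual at that degree says little about how well the curve would describe new data.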

Now, let us do the same, but with **Age** instead of **Last Rating**.<br>
Note: We will be excluding **Mr. ABC197**.

In [None]:
for i in range(2, 7):
    pol_reg(i, f'Degree {i}', cp_age, sorted(age)[0] - 2, sorted(age)[-1], 'Age', cp_minco0, 'Average Monthly Income',
            '*', '')

In this case, Degree 5 seems to be the best fit.<br>
Interestingly, according to this particular plot, the **Average Monthly Income** at Age 54 is close to the **Monthly Income** of Mr. ABC197, who is also 54 years old.
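
In this notebook the degree is chosen by looking at the plots. A common alternative (not used above) is leave-one-out cross-validation: fit on all points but one, predict the held-out point, and average the squared errors. The sketch below assumes made-up data with a quadratic trend; `loo_cv_error` is a hypothetical helper written for this example.

```python
import numpy as np

def loo_cv_error(x, y, deg):
    """Mean squared leave-one-out prediction error for a degree-`deg` polynomial."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i  # Hold out point i
        coeffs = np.polyfit(x[keep], y[keep], deg)
        errors.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)
    return float(np.mean(errors))

# Hypothetical data: a quadratic trend plus a small alternating wiggle
ages = np.linspace(24, 54, 12)
x = ages - ages.mean()  # Centre x for numerical stability
y = 20 + 0.05 * (ages - 24) ** 2 + 0.5 * np.array([1, -1] * 6)

cv_errors = {deg: loo_cv_error(x, y, deg) for deg in range(1, 5)}
best_degree = min(cv_errors, key=cv_errors.get)

print(cv_errors)
print("Degree with the lowest leave-one-out error:", best_degree)
```

Unlike the error on the fitted points, this error usually stops improving (or gets worse) once the degree is higher than the data can support.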

Welp, I guess I cannot think of any more of these graphs... So
## Thank You.