***
This notebook is part of the solution for the DSG: City of LA competition. The solution is split into 5 parts. Here is the list of notebooks in the correct order. The part you are currently reading is highlighted in bold.

[1. Introduction to the solution of DSG: City of LA](https://www.kaggle.com/niyamatalmass/1-introduction-to-the-solution-of-dsg-city-of-la)

[2. Raw Job Postings to structured CSV](https://www.kaggle.com/niyamatalmass/2-raw-job-bulletins-to-structured-csv)

[3. Identify biased language](https://www.kaggle.com/niyamatalmass/3-identify-biased-language)

[**4. Improve the diversity and quality**](https://www.kaggle.com/niyamatalmass/4-improve-the-diversity-and-quality)

[5. Jobs Promotional Pathway](https://www.kaggle.com/niyamatalmass/5-jobs-promotional-pathway)
***

<h1 align="center"><font color="#5831bc" face="Comic Sans MS">Improve the diversity and quality of applicants pool</font></h1> 

# <font color="#5831bc" face="Comic Sans MS">Notebook Overview</font>
We have already made a lot of progress in our solution. In this notebook, we are going to see how we can improve the diversity and quality of applicants for City of LA jobs. We will use our structured CSV dataset and the biased language that we identified in part 3 of our solution. We will import those datasets from the previous kernel using the ```Kaggle dataset from kernel output``` feature.

We will use our structured version of the job bulletins and do Exploratory Data Analysis. This kernel consists of two parts.

* **Language affecting diversity and quality**
* **Does salary affect diversity and quality?**

Without further ado, let's get started and see what we can find!
***

# <font color="#5831bc" face="Comic Sans MS">Language affecting diversity and quality</font>
In part 3 of our solution, we identified language that is biased toward a certain gender, but we didn't analyze it there. Now we are going to analyze that data and look for insights that will be useful for improving the diversity and quality of the applicant pool.

> ### <font color="#5831bc" face="Comic Sans MS">Ready our datasets</font>
As always, we first have to get our datasets ready for analysis. We will import them here and take a quick peek at the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')


# # set font
# plt.rcParams['font.family'] = 'sans-serif'
# plt.rcParams['font.sans-serif'] = 'Helvetica'

# set the style of the axes and the text color
plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'
plt.rcParams['text.color']='#333F4B'

plt.rcParams['font.family'] = "serif"

%matplotlib inline

In [None]:
########################
# read our datasets
########################
df_jobs_struct = pd.read_csv("../input/3-identify-biased-language/df_jobs_with_gender_coded.csv")
df_la_ethn = pd.read_csv('../input/la-applicants-gender-and-ethnicity/rows.csv')
df_jobs_struct.head()

Here is our dataset! If you scroll to the far right of the table, you will see the information about job description language that we extracted in the previous notebook.

In [None]:
#############################
# Doing some preprocessing 
# for later use
###########################

# convert job class to numeric
df_jobs_struct['JOB_CLASS_NO'] = pd.to_numeric(
    df_jobs_struct['JOB_CLASS_NO'])
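# keep only the digits of the job number (raw string avoids an invalid-escape warning)
df_la_ethn['Job Number'] = df_la_ethn['Job Number'].str.extract(r'(\d+)', expand=False)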
df_la_ethn['Job Number'] = pd.to_numeric(df_la_ethn['Job Number'])


# strip the dollar sign, the "(flat-rated)" marker and literal 'nan' text
df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({r'\$':''}, regex = True)

df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({r'\(flat-rated\)':''}, regex = True)

df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({'nan':''}, regex = True)

def _process_salary(salary):
    # missing or empty salaries are treated as 0
    # (note: these zeros pull group means and medians down)
    if pd.isna(salary) or not str(salary).strip():
        return 0
    if "-" not in salary:
        return int(salary.replace(',', ''))
    # for a salary range, use the midpoint of the lower and upper bound
    sal_list = salary.split('-')
    return (int(sal_list[0].replace(',', '')) + int(sal_list[1].replace(',', ''))) / 2

df_jobs_struct['ENTRY_SALARY_GEN_MED'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].apply(_process_salary)
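
To make the midpoint logic above concrete, here is a minimal, self-contained sketch of how a raw salary string could be turned into a number. The helper name `salary_midpoint` and the sample strings are hypothetical and only mirror the formats handled above; the real column may contain other formats. Unlike the notebook's function, this sketch returns NaN for missing values, which is an assumption made for illustration.

```python
import math


def salary_midpoint(raw):
    """Return the midpoint of a 'low-high' salary string, or the single value.

    Hypothetical helper for illustration only. Missing or empty input
    gives NaN here, whereas the notebook's own function uses 0.
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return math.nan
    # Remove the dollar sign, the flat-rate marker and thousands separators.
    cleaned = (
        raw.replace('$', '')
           .replace('(flat-rated)', '')
           .replace(',', '')
           .strip()
    )
    if not cleaned:
        return math.nan
    # A single value has one part; a range has two. Average whatever is there.
    parts = [float(p) for p in cleaned.split('-')]
    return sum(parts) / len(parts)


print(salary_midpoint('$49,903-$72,996'))
print(salary_midpoint('$125,175 (flat-rated)'))
```

Returning NaN instead of 0 would keep missing salaries out of later means and medians, at the cost of changing the numbers shown in the plots below.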


> ### <font color="#5831bc" face="Comic Sans MS">Specific gender domination in job description</font>
In this section, we will count how many jobs fall under each gender coding. Let's see it in practice.

In [None]:
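def plot_bar(dataframe, x, y):
    # horizontal bar chart: categories (column x) on the vertical axis,
    # values (column y) on the horizontal axis
    ax = sns.barplot(y=dataframe[x], x=dataframe[y], palette='Blues_d')

    # write each bar's value just past the end of the bar
    for i, v in enumerate(dataframe[y]):
        ax.text(v + 3, i + .25, str(v), color='Blue', fontweight='bold')

    sns.set_context('notebook')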

In [None]:
%config InlineBackend.figure_format = 'retina'

df_dupli = df_jobs_struct.drop_duplicates(subset='JOB_CLASS_NO', keep="first")
temp_coding = df_dupli['coding'].value_counts().sort_values(
    ascending = False).rename_axis(
    'Gender Coding').reset_index(
    name='Counts')

plot_bar(temp_coding, 'Gender Coding', 'Counts')

Ouch! That looks bad. Masculine-coded jobs are the most common among City of LA jobs, which is a serious problem. There are also a good number of jobs that don't use any biased language.

These masculine-biased job descriptions can discourage women from applying. That reduces the diversity and quality of the applicant pool.
<div class="alert alert-block alert-info">
<p/>
    Attracting more women can also have a huge impact on the business. Research suggests that women can have a decisive sway on the collective intelligence of groups and that, at a larger level, companies with a higher percentage of female leadership are consistently shown to be more profitable.
<p/>
</div>

<br/>
So, in order to increase the diversity and quality of its applicant pool, the City of LA needs to address the masculine bias present in its job descriptions. Let's see which masculine and feminine words the City of LA uses most.

> ### <font color="#5831bc" face="Comic Sans MS">Popular masculine words in City of LA jobs</font>
In this section, we will look at the most popular masculine-coded words in City of LA job descriptions. This will help the City of LA identify these words and address the problem.

In [None]:
%config InlineBackend.figure_format = 'retina'

plt.figure(figsize=(10, 8))

temp_masc_words = df_dupli['masculine_words'].str.split(',', expand=True).stack().value_counts().sort_values(
    ascending = False).rename_axis(
    'Masculine words').reset_index(
    name='Counts')

plot_bar(temp_masc_words, 'Masculine words', 'Counts')

We can see that ```principles, force, analysis, lead``` are by far the most common masculine-coded words in the job descriptions, and other words appear as well. Now that we have identified these words, the City of LA can take action to replace or rebalance them.
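
The counting step above can also be sketched on a toy example. This is a minimal illustration that assumes the word column holds comma-separated strings, as `masculine_words` appears to; the sample values are hypothetical. It uses `Series.explode` (available in pandas 0.25 and later) and strips whitespace so that `' analysis'` and `'analysis'` are counted as the same word.

```python
import pandas as pd

# Toy stand-in for the 'masculine_words' column (hypothetical values).
toy = pd.Series(['lead,force', 'lead, analysis', None, 'force,lead'])

word_counts = (
    toy.dropna()          # skip jobs with no coded words
       .str.split(',')    # 'lead,force' -> ['lead', 'force']
       .explode()         # one row per word
       .str.strip()       # remove stray spaces around words
       .value_counts()    # frequency of each word, most common first
)
print(word_counts)
```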

> ### <font color="#5831bc" face="Comic Sans MS">Popular feminine words in City of LA jobs</font>
In this section, we will look at the most popular feminine-coded words in City of LA job descriptions. This will help the City of LA recognize these words and choose more neutral wording in future job descriptions.

In [None]:
%config InlineBackend.figure_format = 'retina'

plt.figure(figsize=(10, 8))

temp_fem_words = df_dupli['feminine_words'].str.split(',', expand=True).stack().value_counts().sort_values(
    ascending = False).rename_axis(
    'Feminine words').reset_index(
    name='Counts')

plot_bar(temp_fem_words, 'Feminine words', 'Counts')

We see that ```responsible``` is by far the most popular feminine-coded word in LA jobs, followed by ```support```, ```connection```, and others. The City of LA can look for similar neutral words to use instead of these biased words.



Note: These words are found in the ```DUTIES``` and ```REQUIREMENTS``` sections of jobs, so we highly recommend finding alternatives to these biased words. This should help increase the diversity and quality of the applicant pool for City of LA jobs.

<div class="alert alert-block alert-info">
<p/>
    <b>Recommendations: </b> <br/>
    • First, we saw that a very high number of City of LA jobs are masculine-coded. Then we found the most popular masculine and feminine words in the job descriptions. We highly recommend that the City of LA find alternative words for these biased words. If that is not possible, it should at least consider other options to minimize the effect of these words in job postings.
<p/>
</div>

<br/>
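
As a rough illustration of how this recommendation could be put into practice, the sketch below flags coded words in a draft job description so that an editor can review them. The word lists are hypothetical and contain only the words reported in this notebook; the function name `flag_coded_words` and the exact-match rule are assumptions, and the method used in part 3 may match words differently.

```python
import re

# Hypothetical mini word lists, seeded with words reported in this notebook.
MASCULINE = {'principles', 'force', 'analysis', 'lead'}
FEMININE = {'responsible', 'support', 'connection'}


def flag_coded_words(text):
    """Return the masculine- and feminine-coded words found in `text`.

    Assumption: exact, case-insensitive word matches only.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {
        'masculine': sorted(tokens & MASCULINE),
        'feminine': sorted(tokens & FEMININE),
    }


sample = "Lead a team and support analysis of budgets."
print(flag_coded_words(sample))
```

A check like this could run on the ```DUTIES``` and ```REQUIREMENTS``` text of a new bulletin before it is published.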

***

# <font color="#5831bc" face="Comic Sans MS">Does Salary affect diversity and quality?</font>
In this section, we are going to look at how salary affects the diversity and quality of applicants for City of LA jobs.

In [None]:
temp_sal_coding = df_dupli.groupby('coding')['ENTRY_SALARY_GEN_MED'].mean().round().sort_values(
    ascending = False).rename_axis(
    'Gender Coding').reset_index(
    name='Salary')

plot_bar(temp_sal_coding, 'Gender Coding', 'Salary')

- Wow! We found an interesting insight! Jobs that are strongly feminine-coded, that is, jobs that use more feminine-coded words, pay a lot more on average than jobs that don't use any biased language.
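
A mean per group can be driven by a handful of jobs, so it can help to look at the group size and median next to it. The sketch below shows one way to do that on toy data; the coding labels and salaries are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data with hypothetical coding labels and salaries.
toy = pd.DataFrame({
    'coding': ['masculine', 'masculine', 'masculine',
               'neutral', 'neutral', 'strongly feminine'],
    'ENTRY_SALARY_GEN_MED': [60000, 70000, 80000, 50000, 54000, 120000],
})

# Count, mean and median salary per coding, highest mean first.
summary = (
    toy.groupby('coding')['ENTRY_SALARY_GEN_MED']
       .agg(['count', 'mean', 'median'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

In this toy example the top group contains a single job, which is exactly the kind of case where a high mean should be read with caution.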

> ### <font color="#5831bc" face="Comic Sans MS">Salary by EXAM_TYPE</font>
Jobs can be promotional. Sometimes a job is only open as a promotion to candidates already within the City of LA or within a department, and sometimes it is open to anyone. Let's see the mean salary of each exam type.

In [None]:
temp_sal_exam_type = df_dupli.groupby('EXAM_TYPE')['ENTRY_SALARY_GEN_MED'].mean().round().sort_values(
    ascending = False).rename_axis(
    'Exam Type').reset_index(
    name='Salary')

plot_bar(temp_sal_exam_type, 'Exam Type', 'Salary')

We see that jobs that are open to everyone pay a lot less than INTER DEPARTMENTAL and OPEN INTER PROMOTION jobs. That's really bad news, because if we want to increase the quality of our applicant pool, we have to offer an attractive salary. Otherwise, we will not get quality candidates, and diversity will decrease as well.

> ### <font color="#5831bc" face="Comic Sans MS">Salary by Education Major</font>
Now we are going to look at the salary for different education majors. Let's see what useful information we can find.

In [None]:
plt.figure(figsize=(10, 20))

# EDUCATION_MAJOR can list several majors separated by commas,
# so split it into one row per (salary, major) pair first
s = df_jobs_struct.set_index('ENTRY_SALARY_GEN_MED').EDUCATION_MAJOR.str.split(',',expand=True).stack().to_frame('Education Major').reset_index()

# median salary per major, highest first
s = s.groupby('Education Major')['ENTRY_SALARY_GEN_MED'].median().round().sort_values(
    ascending = False).rename_axis(
    'Education Major').reset_index(
    name='Salary')

# plot the 50 highest-paying majors
plot_bar(s[:50], 'Education Major', 'Salary')

Wow! That's awesome! Public administration, business, information, and engineering are among the highest-paying degrees in City of LA jobs. Now we can use this data to improve the quality of our applicant pool. How? The City of LA can focus on certain majors where it wants to increase quality and diversity and raise the salary for them. That could be very helpful.

> ### <font color="#5831bc" face="Comic Sans MS">School type by salary</font>
Now we are going to look at how salary changes by school type and use that information to find ways to increase quality and diversity.

In [None]:
plt.figure(figsize=(10, 15))
temp_sal_sc_type = df_jobs_struct.groupby('SCHOOL_TYPE')['ENTRY_SALARY_GEN_MED'].median().round().sort_values(
    ascending = False).rename_axis(
    'School Type').reset_index(
    name='Salary')

plot_bar(temp_sal_sc_type, 'School Type', 'Salary')

First, we see that school types such as accounting college, law school, and engineering school have very high salaries. The lowest is the high school diploma. To increase the diversity and quality of our applicants, we have to pay special attention to school type.

We should try to increase salaries across a diverse set of school types. If we can't, we should at least add some extra perks to attract quality and diverse candidates across different school types.

*We see some noise in our data, but most of the values are right.*

> ### <font color="#5831bc" face="Comic Sans MS">Who gets the highest and lowest salary</font>
Now we are going to look at who gets the highest and lowest salaries. This will help us understand which jobs need more attention.

In [None]:
temp_job_tit_sal = df_jobs_struct.groupby('JOB_CLASS_TITLE')['ENTRY_SALARY_GEN_MED'].median().round().sort_values(
    ascending = False).rename_axis(
    'Job Title').reset_index(
    name='Salary')

plot_bar(temp_job_tit_sal[:20], 'Job Title', 'Salary')

In [None]:
plot_bar(temp_job_tit_sal.tail(20)[:19], 'Job Title', 'Salary')

Unsurprisingly, the highest salaries go to CHIEF, DIRECTOR, and similar titles, while the lowest salary goes to PARKING ATTENDANT. The City should pay more attention to these lower-paying jobs and try to increase their diversity and quality. They are paid 70% to 80% less than the top tier.
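
For reference, a pay gap like the one quoted above can be expressed as a percentage of the higher salary. The sketch below shows the arithmetic with hypothetical salaries that are not taken from the dataset; the helper name `percent_lower` is also made up for illustration.

```python
def percent_lower(low, high):
    """Return how much lower `low` is than `high`, as a percentage of `high`."""
    if high <= 0:
        raise ValueError("high must be positive")
    return (high - low) / high * 100


# Hypothetical salaries, for illustration only (not taken from the dataset).
top_tier = 200_000
bottom_tier = 50_000
print(f"{percent_lower(bottom_tier, top_tier):.0f}% lower")
```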

# <font color="#5831bc" face="Comic Sans MS">Conclusion</font>
After doing some EDA, we have some interesting insights into our data. These recommendations should be helpful to the City of LA, which can use this information to think more deeply about how it creates job postings. It should now have a clearer understanding of where to focus in order to increase the diversity and quality of the applicant pool.


Thanks for reading the notebook. See you in the next part of the solution.