***
This notebook is part of the solution for the DSG: City of LA competition. The solution is split into 5 parts. Here is the list of notebooks in the correct order. The part you are currently reading is highlighted in bold.

[1. Introduction to the solution of DSG: City of LA](https://www.kaggle.com/niyamatalmass/1-introduction-to-the-solution-of-dsg-city-of-la)

[2. Raw Job Postings to structured CSV](https://www.kaggle.com/niyamatalmass/2-raw-job-bulletins-to-structured-csv)

[**3. Identify biased language**](https://www.kaggle.com/niyamatalmass/3-identify-biased-language)

[4. Improve the diversity and quality](https://www.kaggle.com/niyamatalmass/4-improve-the-diversity-and-quality)

[5. Jobs Promotional Pathway](https://www.kaggle.com/niyamatalmass/5-jobs-promotional-pathway)
***

<h1 align="center"><font color="#5831bc" face="Comic Sans MS">Identify Biased Language in City of LA job descriptions</font></h1> 

# <font color="#5831bc" face="Comic Sans MS">Notebook Overview</font>
This notebook identifies language in City of LA job descriptions that can negatively bias groups of applicants. We do this in two ways: by searching the job descriptions for predefined gender-coded words, and by training a machine learning model on the job descriptions and inspecting what it learned.


We used multiple techniques for identifying biased language. Each technique gave us interesting and different kinds of results. This notebook has a main section for each technique and its results.

Here are the short contents:
* **Identify biases by pre-defined word**
* **Identify biases by machine learning model**


We will use the structured version of the job bulletins created in part 2 of our solution. We import that data directly into this notebook, because Kaggle lets you add a kernel's output as a dataset. So without further ado, let's get started!

# <font color="#5831bc" face="Comic Sans MS">1. Identify biases by pre-defined word</font> 
Employers often use words that are coded towards a certain group, mostly male or female, and that can negatively bias applicants. http://gender-decoder.katmatfield.com/ has a very good list of masculine- and feminine-coded words. For example, ```defend``` is masculine-coded and ```commit``` is feminine-coded. The words come from a **research paper by Danielle Gaucher, Justin Friesen, and Aaron C. Kay, called ```Evidence That Gendered Wording in Job Advertisements Exists and Sustains Gender Inequality```**.

In this section, we will use those words to identify biased words in job descriptions. Without further ado, let's begin.

> ### <font color="#5831bc" face="Comic Sans MS">Import dataset and library</font>
In the second part of the solution, we pre-processed all job bulletins into a structured CSV file. In this notebook, we use that CSV file to analyze language biases. The CSV also keeps a column with a cleaned version of the full raw job description for each bulletin, so that we can use it in further analysis.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re

In [None]:
df_jobs_struct = pd.read_csv('../input/2-raw-job-bulletins-to-structured-csv/jobs.csv')
df_jobs_struct.head(10)

Looks very good! Now that our dataframe is ready, let's move on to the next steps.

> ### <font color="#5831bc" face="Comic Sans MS">Import pre-defined bias word</font>
In this section, we will import masculine- and feminine-coded words from http://gender-decoder.katmatfield.com. The words are stored as Python lists. They are written as stems to make it easier to match all variants; in other words, the suffix is intentionally left out.

In [None]:
feminine_coded_words = [
    "agree","affectionate","child","cheer","collab","commit","communal",
    "compassion","connect","considerate","cooperat","co-operat",
    "depend","emotiona","empath","feel","flatterable","gentle",
    "honest","interpersonal","interdependen","interpersona","inter-personal",
    "inter-dependen","inter-persona","kind","kinship","loyal","modesty",
    "nag","nurtur","pleasant","polite","quiet","respon","sensitiv",
    "submissive","support","sympath","tender","together","trust","understand",
    "warm","whin","enthusias","inclusive","yield","share","sharin"
]

masculine_coded_words = [
    "active","adventurous","aggress","ambitio",
    "analy","assert","athlet","autonom","battle","boast","challeng",
    "champion","compet","confident","courag","decid","decision","decisive",
    "defend","determin","domina","dominant","driven","fearless","fight",
    "force","greedy","head-strong","headstrong","hierarch","hostil",
    "impulsive","independen","individual","intellect","lead","logic",
    "objective","opinion","outspoken","persist","principle","reckless",
    "self-confiden","self-relian","self-sufficien","selfconfiden",
    "selfrelian","selfsufficien","stubborn","superior","unreasonab"
]

hyphenated_coded_words = [
    "co-operat","inter-personal","inter-dependen","inter-persona",
    "self-confiden","self-relian","self-sufficien"
]

possible_codings = (
    "strongly feminine-coded","feminine-coded","neutral",
    "masculine-coded","strongly masculine-coded"
)

Nice! We now have lists of feminine- and masculine-coded words, plus a separate list of the hyphenated stems, which need special handling when we split words. Finally, there is a tuple named ```possible_codings```. We will use these labels to describe our job descriptions. Let's see what each coding means:

    - "feminine-coded": This job ad uses more words that are subtly coded as
      feminine than words that are subtly coded as masculine (according to the
      research). Fortunately, the research suggests this will have only a slight
      effect on how appealing the job is to men, and will encourage women applicants.

    - "masculine-coded": This job ad uses more words that are subtly coded as
      masculine than words that are subtly coded as feminine (according to the
      research). It risks putting women off applying, but will probably
      encourage men to apply.

    - "strongly feminine-coded": This job ad uses more words that are subtly coded
      as feminine than words that are subtly coded as masculine (according to the
      research). Fortunately, the research suggests this will have only a slight
      effect on how appealing the job is to men, and will encourage women applicants.

    - "strongly masculine-coded": This job ad uses more words that are subtly coded
      as masculine than words that are subtly coded as feminine (according to the
      research). It risks putting women off applying, but will probably
      encourage men to apply.

    - "empty": This job ad doesn't use any words that are subtly coded as
      masculine or feminine (according to the research). It probably won't be
      off-putting to men or women applicants.

    - "neutral": This job ad uses an equal number of words that are subtly coded
      as masculine and feminine (according to the research). It probably won't
      be off-putting to men or women applicants.
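
To make the labels concrete, here is a minimal standalone sketch of how a label is chosen from the two word counts. It mirrors the thresholds used by the `assess_coding` method of the `JobAd` class in this notebook; the free function and its argument names are only for illustration.

```python
def assess_coding(feminine_count, masculine_count):
    """Map the two word counts to a coding label (illustrative free function)."""
    score = feminine_count - masculine_count
    if score == 0:
        # equal counts: "neutral" if any coded word was found, otherwise "empty"
        return "neutral" if feminine_count else "empty"
    if score > 3:
        return "strongly feminine-coded"
    if score > 0:
        return "feminine-coded"
    if score < -3:
        return "strongly masculine-coded"
    return "masculine-coded"


for counts in [(5, 1), (2, 1), (2, 2), (0, 0), (1, 3), (0, 4)]:
    print(counts, "->", assess_coding(*counts))
```

Note that the label depends only on the difference between the counts, so a long bulletin with 20 feminine and 18 masculine words gets the same label as a short one with 2 and 0.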

> ### <font color="#5831bc" face="Comic Sans MS">Create a class for identify above words in job description</font>
The words defined above are stems, so that we can match different forms of each word. Job descriptions are also messy: they contain a lot of punctuation, irrelevant text, and stopwords. We have to process all of that, so we created a class that handles it in one place.

In [None]:
class JobAd():

    def __init__(self, ad_text):
        self.ad_text = ad_text
        self.analyse()
        
    def gender_decode(self):
        return self.coding, self.masculine_coded_words, self.feminine_coded_words
        


    def analyse(self):
        word_list = self.clean_up_word_list()
        self.extract_coded_words(word_list)
        self.assess_coding()
        

    def clean_up_word_list(self):
        # replace non-ASCII characters with spaces
        cleaner_text = ''.join([i if ord(i) < 128 else ' '
            for i in self.ad_text])
        # turn every whitespace character into a plain space
        cleaner_text = re.sub(r"\s", " ", cleaner_text)
        # replace punctuation with spaces, then split into words
        cleaned_word_list = re.sub(r"[.\t,“”‘’<>*?!\"\[\]@':;()/&]",
            " ", cleaner_text).split(" ")
        word_list = [word.lower() for word in cleaned_word_list if word != ""]
        return self.de_hyphen_non_coded_words(word_list)

    def de_hyphen_non_coded_words(self, word_list):
        # Split hyphenated words into their parts, unless they start with one of
        # the hyphenated coded stems (e.g. "co-operat"), which must stay whole.
        # A new list is built instead of editing word_list while iterating over it.
        result = []
        for word in word_list:
            if "-" in word and not word.startswith(tuple(hyphenated_coded_words)):
                result.extend(part for part in word.split("-") if part)
            else:
                result.append(word)
        return result

    def extract_coded_words(self, advert_word_list):
        words, count = self.find_and_count_coded_words(advert_word_list,
            masculine_coded_words)
        self.masculine_coded_words, self.masculine_word_count = words, count
        words, count = self.find_and_count_coded_words(advert_word_list,
            feminine_coded_words)
        self.feminine_coded_words, self.feminine_word_count = words, count

    def find_and_count_coded_words(self, advert_word_list, gendered_word_list):
        gender_coded_words = list(filter(lambda x: x.lower().startswith(tuple(gendered_word_list)), advert_word_list))
        return (",").join(gender_coded_words), len(gender_coded_words)

    def assess_coding(self):
        coding_score = self.feminine_word_count - self.masculine_word_count
        if coding_score == 0:
            if self.feminine_word_count:
                self.coding = "neutral"
            else:
                self.coding = "empty"
        elif coding_score > 3:
            self.coding = "strongly feminine-coded"
        elif coding_score > 0:
            self.coding = "feminine-coded"
        elif coding_score < -3:
            self.coding = "strongly masculine-coded"
        else:
            self.coding = "masculine-coded"

    def list_words(self):
        if self.masculine_coded_words == "":
            masculine_coded_words = []
        else:
            masculine_coded_words = self.masculine_coded_words.split(",")
        if self.feminine_coded_words == "":
            feminine_coded_words = []
        else:
            feminine_coded_words = self.feminine_coded_words.split(",")
        masculine_coded_words = self.handle_duplicates(masculine_coded_words)
        feminine_coded_words = self.handle_duplicates(feminine_coded_words)
        return masculine_coded_words, feminine_coded_words

    def handle_duplicates(self, word_list):
        d = {}
        l = []
        for item in word_list:
            if item not in d.keys():
                d[item] = 1
            else:
                d[item] += 1
        for key, value in d.items():
            if value == 1:
                l.append(key)
            else:
                l.append("{0} ({1} times)".format(key, value))
        return l
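
As a quick standalone check of the stem matching that `find_and_count_coded_words` relies on, here is a small sketch. The stems are taken from the lists above; the example words are made up for illustration.

```python
# a small subset of the stems defined earlier in this notebook
stems = ("lead", "analy", "support")

# made-up example words
words = ["leadership", "analysis", "supports", "budget", "leader"]

# str.startswith accepts a tuple, so one call tests every stem at once
matched = [w for w in words if w.startswith(stems)]
print(matched)

# prefix matching can also catch unrelated words that share a prefix:
print("kinds".startswith("kind"))
```

Because matching is by prefix only, a word like "kinds" (as in "all kinds of equipment") would be counted under the feminine stem `kind`. It may be worth reading the extracted words with that in mind.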


> ### <font color="#5831bc" face="Comic Sans MS">Finally, extract gender bias from job description</font>
We have built our class. Now let's apply it to each row of the dataframe and extract the gender coding for each job description. We store the results as separate columns.

**We don't want to search the entire job description of each bulletin, because some sections are redundant or unnecessary. We are especially interested in the ```DUTIES``` and ```REQUIREMENTS``` sections, because these are the two fields candidates look at most often.**

In [None]:
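# fill missing sections before concatenating, and keep a space between the two texts
df_jobs_struct['DUTY_REQ'] = (df_jobs_struct['JOB_DUTIES'].fillna('') + ' '
                              + df_jobs_struct['REQUIREMENTS'].fillna('')).str.strip()
df_jobs_struct['DUTY_REQ'] = df_jobs_struct['DUTY_REQ'].replace('', 'No text found')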

In [None]:
# apply the class to every row of our pandas dataframe
df_jobs_struct['coding'], df_jobs_struct['masculine_words'], df_jobs_struct['feminine_words'] = \
zip(*df_jobs_struct['DUTY_REQ'].apply(lambda x: JobAd(x).gender_decode()))

# finally print extracted coding for each job rows
df_jobs_struct[['JOB_CLASS_TITLE', 'coding', 'masculine_words', 'feminine_words']].head(25)

Awesome! We identified words that are coded towards a certain gender. We can see that some jobs are **masculine**-coded and some are **feminine**-coded. Some jobs don't have any coded language in their ```DUTIES and REQUIREMENTS``` sections at all. We will not analyze these results in this kernel; we will do that in the next kernel, because the goal of this kernel is only to identify language that can negatively bias certain groups of applicants.

In [None]:
# save the dataframe for future use
df_jobs_struct.to_csv('./df_jobs_with_gender_coded.csv', index=False)

# <font color="#5831bc" face="Comic Sans MS">2. Identify biases by machine learning model</font> 
Previously, we found biased words by searching for hardcoded words in the job description. Now we want to dig deeper into the job descriptions with a machine learning model.

We found a dataset of City of LA job applicant demographics for 187 jobs, and we will use this data for model building. We will build a machine learning model that predicts the dominant gender of the applicants for each job. Then we will dig into the model and try to find hidden language that may be biased towards a certain group. Without further ado, let's get started!

> ### <font color="#5831bc" face="Comic Sans MS">First, ready our dataset</font>
In this section, we import the new dataset and preprocess both it and our structured CSV so that they work with our machine learning model. The code is commented to explain each step. The dataset is from [here](https://data.lacity.org/A-Prosperous-City/Job-Applicants-by-Gender-and-Ethnicity/mkf9-fagf).

In [None]:
def _process_salary(salary):
    # treat missing or empty salaries as 0
    if pd.isna(salary) or str(salary).strip() == '':
        return 0
    salary = str(salary)
    if "-" not in salary:
        return int(salary.replace(',', ''))
    # for a range such as "50,000-70,000", return the midpoint of the two ends
    low, high = salary.split('-')[:2]
    return (int(low.replace(',', '')) + int(high.replace(',', ''))) / 2
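
As a worked example of the salary arithmetic, here is a small standalone sketch. The helper name `salary_midpoint` and the sample strings are hypothetical; it assumes the dollar sign has already been stripped, as the preprocessing cell below does.

```python
def salary_midpoint(text):
    """Hypothetical helper: parse '62,118' or '50,000-70,000' into one number."""
    # remove thousands separators and average the one or two parts
    parts = [int(p.replace(",", "")) for p in text.split("-")]
    return sum(parts) / len(parts)


print(salary_midpoint("50,000-70,000"))  # midpoint of the range
print(salary_midpoint("62,118"))         # single (flat-rated) salary
```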

In [None]:
####### Import demographics of applicants ##############
df_la_ethn = \
pd.read_csv(
    '../input/la-applicants-gender-and-ethnicity/rows.csv')

# see what we have got # 
df_la_ethn.head()

Now we can see our applicants' demographic data, which will be very useful. We will merge this dataset with our structured CSV on the job class number and do some preprocessing. Please read the comments in the code to understand what each part does.

In [None]:

###### Do Some processing and cleaning ##################
df_jobs_struct['JOB_CLASS_NO'] = pd.to_numeric(
    df_jobs_struct['JOB_CLASS_NO'])
# keep only the numeric part of the job number (raw strings avoid invalid-escape warnings)
df_la_ethn['Job Number'] = df_la_ethn['Job Number'].str.extract(r'(\d+)', expand=False)
df_la_ethn['Job Number'] = pd.to_numeric(df_la_ethn['Job Number'])


# strip the dollar sign, the "(flat-rated)" marker, and literal "nan" strings
df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({r'\$':''}, regex = True)

df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({r'\(flat-rated\)':''}, regex = True)

df_jobs_struct['ENTRY_SALARY_GEN'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].replace({'nan':''}, regex = True)


df_jobs_struct['ENTRY_SALARY_GEN_MED'] = \
df_jobs_struct['ENTRY_SALARY_GEN'].apply(lambda x: _process_salary(x))


###### Bins the salary for our machine learning model ###########
bins = [0, 50000, 75000, 90000, 100000, 120000, 150000, 200000]
df_jobs_struct['ENTRY_SALARY_GEN_MED_BIN'] = pd.cut(
    df_jobs_struct['ENTRY_SALARY_GEN_MED'], bins)

###############################
# Finally merge our two dataset on job class number
##############################
df_jobs_struct_merge = df_jobs_struct.merge(
    df_la_ethn, left_on='JOB_CLASS_NO', right_on='Job Number')
df_jobs_struct_merge.head()

We successfully merged our structured CSV dataset with the applicant demographics dataset. Feel free to scroll to the right to see the demographics columns. In the next section, we will build our model.
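
A merge like this can silently drop or duplicate rows, so it can be worth checking. The sketch below uses two tiny made-up dataframes (the job numbers and counts are invented) and the `indicator` option of `DataFrame.merge` to show which bulletins found a match. It uses `how="left"` only for the check; the notebook's merge is the default inner join.

```python
import pandas as pd

# made-up stand-ins for the structured bulletins and the applicant data
jobs = pd.DataFrame({"JOB_CLASS_NO": [1223, 7260, 9999],
                     "title": ["A", "B", "C"]})
applicants = pd.DataFrame({"Job Number": [1223, 7260, 7260],
                           "Female": [10, 5, 7]})

# a left join with indicator=True adds a "_merge" column: "both" or "left_only"
merged = jobs.merge(applicants, left_on="JOB_CLASS_NO", right_on="Job Number",
                    how="left", indicator=True)
print(merged["_merge"].value_counts())
```

In this toy data, job 9999 has no applicant record and would vanish from an inner join, while job 7260 appears twice in the applicant data and therefore produces two merged rows.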

> ### <font color="#5831bc" face="Comic Sans MS">Build our ML model</font>
Now we will build our ML model. We are going to use a binary classification algorithm from scikit-learn. To train the model, we have to feed it the right data, so we first create a suitable training dataset.

Since we want to identify biased language, we use only the free-text columns: ```JOB_DUTIES```, ```REQUIREMENTS```, and ```EXP_JOB_CLASS_FUNCTION```. **We use these columns to predict whether the majority of a job's applicants is female or male.** Then we explore the model to understand which language is biased.

In [None]:
# numpy and pandas are already imported at the top of the notebook
import eli5
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

> ### <font color="#5831bc" face="Comic Sans MS">Create training dataset</font>
First, we create a new column that stores whether a job's applicant pool is female or male dominated, using the external data we merged with our CSV file. If at least 50% of the applicants are of one gender, we label the job as dominated by that gender; jobs where neither gender reaches 50% default to male. ```Female-dominated jobs are labelled 0 and male-dominated jobs are labelled 1.```

In [None]:
df_jobs_struct_merge['female_perc'] = (df_jobs_struct_merge['Female'] / df_jobs_struct_merge['Apps Received']) *100
df_jobs_struct_merge['male_perc'] = (df_jobs_struct_merge['Male'] / df_jobs_struct_merge['Apps Received']) *100

conditions = [
    (df_jobs_struct_merge['female_perc'] >= 50),
    (df_jobs_struct_merge['male_perc'] >= 50)]
choices = [0, 1]
df_jobs_struct_merge['gender_code'] = np.select(conditions, choices, default=1)
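
The labelling rule can be checked on a few made-up applicant counts. This is only an illustrative sketch with invented numbers; it uses the same `np.select` call as the cell above.

```python
import numpy as np

# made-up applicant counts for four jobs
female = np.array([60, 30, 40, 50])
male = np.array([30, 60, 40, 50])
total = np.array([100, 100, 100, 100])

female_perc = female / total * 100
male_perc = male / total * 100

# np.select takes the first condition that is true; default applies when none is
labels = np.select([female_perc >= 50, male_perc >= 50], [0, 1], default=1)
print(labels)  # [0 1 1 0]
```

The third job (40% female, 40% male, the rest unknown) falls through to the default and is labelled male dominated, and the exact 50/50 tie in the fourth job is labelled female dominated because the female condition is tested first.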

In [None]:
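# columns that are not needed as model inputs
# (note: this list is only defined here; nothing below actually drops them,
#  because the vectorizer selects its text columns by name)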

DROP_COLUMNS = ['raw_job_text', 'JOB_CLASS_TITLE', 'JOB_CLASS_NO', 'OPEN_DATE',
                'TEMP_EXAM_TYPE', 'TEMP_SALARY', 'TEMP_REQUIREMENTS', 'WHERE_TO_APPLY',
                'DEADLINE', 'SELECTION_PROCESS', 'raw_clean_job_text', 'REQUIREMENTS_PROCESS',
                'req_list', 'temp_entity', 'Fiscal Year', 'Job Number', 'Job Description',
                'Apps Received', 'Female', 'Male', 'Unknown_Gender', 'ENTRY_SALARY_GEN',
                'ENTRY_SALARY_DWP', 'REQUIREMENT_SET_ID',
                'Black', 'Hispanic', 'Asian', 'Caucasian', 'American Indian/ Alaskan Native',
                'Filipino', 'Unknown_Ethnicity', 'ENTRY_SALARY_GEN_MED']

# y_only = df_jobs_struct_merge[['Female', 'Male']]

X_only = df_jobs_struct_merge


# replace NaN values
X_only['JOB_DUTIES'] = X_only['JOB_DUTIES'].fillna('Not Found')
X_only['DRIVERS_LICENSE_REQ'] = X_only['DRIVERS_LICENSE_REQ'].fillna('Not Found')
X_only['REQUIREMENTS'] = X_only['REQUIREMENTS'].fillna('Not Found')
X_only['EDUCATION_MAJOR'] = X_only['EDUCATION_MAJOR'].fillna('Not Found')
X_only['EXP_JOB_CLASS_FUNCTION'] = X_only['EXP_JOB_CLASS_FUNCTION'].fillna('Not Found')
X_only['EXP_JOB_CLASS_TITLE'] = X_only['EXP_JOB_CLASS_TITLE'].fillna('Not Found')

In [None]:
# we need a custom pre-processor to extract correct field,
# but want to also use default scikit-learn preprocessing (e.g. lowercasing)
default_preprocessor = CountVectorizer().build_preprocessor()
def build_preprocessor(field):
    field_idx = list(X_only.columns).index(field)
    return lambda x: default_preprocessor(x[field_idx])

vectorizer = FeatureUnion([
    ('JOB_DUTIES', TfidfVectorizer(
        stop_words='english',
        preprocessor=build_preprocessor('JOB_DUTIES'))),
    ('REQUIREMENTS', TfidfVectorizer(
        stop_words='english',
        preprocessor=build_preprocessor('REQUIREMENTS'))),
    ('EXP_JOB_CLASS_FUNCTION', TfidfVectorizer(
        preprocessor=build_preprocessor('EXP_JOB_CLASS_FUNCTION')))
])
X_train = vectorizer.fit_transform(X_only.values)

df_jobs_struct_merge['gender_code'] = df_jobs_struct_merge['gender_code'].fillna(1)

y_train = df_jobs_struct_merge['gender_code']
X_train

> ### <font color="#5831bc" face="Comic Sans MS">Model building and training</font>
We are going to build a binary classification model. We use a ```GradientBoostingClassifier```; there is no specific reason for choosing this classifier. We also split our dataset into a training set and a test set for evaluation.

In [None]:
# split our training dataset into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    X_train, y_train,test_size=0.1, random_state = 2019)

# declare our model
model = GradientBoostingClassifier(random_state = 20199, n_estimators = 200)

# train our model
model.fit(train_x, train_y)

# evaluate our model performance
y_pred_valid = model.predict(test_x)
acc = accuracy_score(test_y, y_pred_valid)
print(f'valid acc: {acc:.5f}')

Okay! We have built a model that predicts gender dominance from a job description. Keep in mind that this model is not state of the art: it is only a demo model and can definitely be improved. Also, with ```test_size=0.1``` the test set holds only about 10% of the merged rows, so the accuracy above is a rough estimate. Now let's dive into the model and see what it learned!
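
With so few rows, a single 10% hold-out can give a noisy accuracy. One common alternative is stratified k-fold cross-validation. The sketch below is not part of our solution: it runs on synthetic data from `make_classification`, and the fold count and estimator settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the vectorized job descriptions and gender labels
X_demo, y_demo = make_classification(n_samples=150, n_features=20, random_state=0)

# five folds, each keeping the class balance of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

The spread of the five scores gives a rough idea of how much the accuracy depends on which rows happen to land in the test set.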

> ### <font color="#5831bc" face="Comic Sans MS">Find biased languages by analyzing our trained model</font>
Now we are going to use a ```model explainability``` technique to extract gender biases from job descriptions. Before we do that, let's spend some time on what model explainability means.

<div class="alert alert-block alert-info">
<p/>
<b>Model explainability:</b><br/>
    We have trained our model. If we can find out what the model learned in order to predict female or male dominance from a job description, we can identify biased language. Likewise, if we discover why the model predicts that a certain job is female or male dominated, we can find the text features that cause that result. That is how model explainability works, and we are going to use it now.
<p/>
</div>

Now we are going to show which text features our model considers important, using a library called eli5 that makes it easy to describe model insights.

In [None]:
eli5.show_weights(model, vec=vectorizer)

The part before the double underscore is the vectorizer name (the column name), and the feature name comes after it. The darker the green, the more important the feature.

    - We can see that ```preparing, personnel``` in the REQUIREMENTS and JOB_DUTIES columns are very important features.

Another handy feature is analyzing individual predictions. Let's check the predictions for a few individual postings. At the top you see a summary of the contributions of various words, and below it you can see the features highlighted in the text.

In [None]:
eli5.show_prediction(model, doc=X_only.values[67], vec=vectorizer)

In [None]:
y_train.values[67]

Awesome! Let's take some time to look at what we have got here.

    - We gave our model the job description at position 67. This job is dominated by male candidates (y_train.values[67] is 1, which means male dominated). We then used the model to predict the dominance, and a model explainability technique to see why the model thinks the given JOB_DUTIES and REQUIREMENTS indicate male dominance.

    - After visualizing the results, we can see that the job duties contain the word ```administration```, highlighted in a strong green colour. This indicates that the word contributes strongly to the model's prediction, which suggests that it is biased towards men.
    
    
Now let's see which language the model associates with women. Let's check the posting at position 1 and compare the prediction with its label. Let's see what we get!

In [None]:
eli5.show_prediction(model, doc=X_only.values[1], vec=vectorizer)

In [None]:
# label of the same row that was explained in the previous cell
y_train.values[1]

Wow! We can see that ```collection/media``` has a highly positive contribution, which suggests that ```collection/media``` is biased towards female applicants. Also, from the first plot, we can see that ```personnel,responsibile, fine, arts``` positively affect the prediction, which suggests that these words are associated with female-dominated jobs.

# <font color="#5831bc" face="Comic Sans MS">Conclusion</font>
We identified language that can negatively bias certain groups of job applicants. First, we used a hard-coded word list to find such language. Second, we saw how a machine learning model can be used to identify biases in job descriptions.
<br/>

<div class="alert alert-block alert-info">
<b>Recommendations:</b><br/>
<br/>
    • <b>Improve the machine learning model:</b> Currently, we use an off-the-shelf classifier with almost no tuning of its parameters. With tuning, it could become a good machine learning model that identifies biases more accurately.<br/>
    <br/>
    • <b>Gather more gender and ethnicity data:</b> The data we use in our machine learning model is very small. It contains ethnicity and gender data for only 150+ jobs. The City of LA can easily gather new data on this topic and improve model performance. With more training data, the model could identify deeper language biases, and these would be specific to City of LA job descriptions, because the model is trained on City of LA jobs data. <br/>
<p/>
</div>


In the next part of the solution, we will do EDA on the structured CSV and try to find some interesting insights. Thanks for reading. See you in the next part of the solution!