<a href="https://colab.research.google.com/github/yannmean/Artificial-Intelligence_CS221-2023-FALL/blob/main/OngigModels_YM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Housekeeping

This section sets up the environment: installing and loading packages, and setting the working directory on Google Drive.

Please note that Colab runs Python by default, so I enabled R inside it through the rpy2 extension (some people call this "the R magic"). The analysis code is therefore still written in R.

In [1]:
# Mount the Google Drive directory that contains the original dataset and the ridge coefficients
from google.colab import drive
drive.mount('/content/drive')
# Here I created this directory in my google drive to host all the objects needed for this computation
%cd /content/drive/MyDrive/Ongig/ 
!ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Ongig
adverbsCounts.csv			jobPostWordVector.pdf
adverbs.csv				Ongig_Colab
adverb-weight-dataset.csv		ridgeAdvCoefs.csv
adverb-weight-dataset_newAdvScores.csv


In [2]:
%%capture
# install rpy2, which lets R run inside the Python runtime
!pip install rpy2==3.5.1

In [3]:
# Load R environment
%load_ext rpy2.ipython

In [None]:
%%R
# install the required packages, then load them from the library

install.packages("dplyr")
install.packages("reshape")
install.packages("ggplot2")
install.packages("tidyr")

pkgs <- c("dplyr", "reshape", "ggplot2", "tidyr")
sapply(pkgs, require, character.only = TRUE)

# 1. Compute the Adverb Score

## 1.1 Introduction 
To obtain the adverb score for one job posting, we need the following three components: 

1) the weight of each adverb that exists in this job posting;\
2) the number of times each adverb appears in the same job posting;\
3) the total number of adverbs in the same job posting.

The first component - weight of each adverb - has been obtained via a statistical learning method, ridge regression, and stored as a dataframe of two variables: 

*   advScore - the weight (stored as float)
*   adv - the adverbs (stored as string)

The second component - adverb frequency in each job posting - is obtained by counting how many times each adverb appears in a given job posting.

Finally, the third component - total number of adverbs - is obtained by counting all adverb occurrences in the posting (if one adverb appears twice, it counts as 2, not 1).

## 1.2 Adverb Score Construction

First, let's clarify the notation:


*   Let $s_i$ be the adverb score for job posting $i$
*   Let $w_j$ be the ridge coefficient of adverb $j$ (a.k.a. the "weight")
*   Let $n_{ij}$ be the frequency count of adverb $j$ in job posting $i$
*   Let $N_i$ be the total number of adverbs in job posting $i$

Therefore, the adverb score of job posting $i$ can be calculated as follows:

\begin{align}
 s_i = \sum_{j} w_j \frac{n_{ij}}{N_i} 
\end{align}

In other words, the adverb score is the weighted sum of the relative frequencies of the adverbs in the job posting, where each adverb's relative frequency is weighted by its ridge coefficient.
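As a small worked illustration of the formula, the sketch below computes $s_i$ for a single posting. The adverbs, weights, and counts are made up for illustration and are not taken from the actual ridge coefficients.

```r
# Hypothetical weights w_j and counts n_ij for one posting i (illustration only)
w   <- c(quickly = 0.02, highly = -0.01, strongly = 0.03)
n_i <- c(quickly = 2,    highly = 1,     strongly = 0)

# N_i: total number of adverb occurrences in the posting
N_i <- sum(n_i)

# s_i = sum_j w_j * n_ij / N_i
s_i <- sum(w * n_i / N_i)
s_i
```

Here the weighted counts add up to 0.03 and there are 3 adverb occurrences, so the score is 0.01.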

## 1.3 Data Preparation for Computation
In the actual computation, we do not compute the adverb score for each job posting one by one, as this would be inefficient and take too long when there are millions of scores to compute. Instead, we formulate the computation as a Hadamard product (element-wise multiplication) of two matrices, the weight matrix and the frequency matrix, followed by row sums, which is much faster.
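The following is a minimal sketch of that matrix formulation on toy data, checked against a plain loop over postings. The adverb names `a`, `b`, `c`, the weights, and the counts are all hypothetical.

```r
# Hypothetical weights for three adverbs and counts for three postings
w <- c(a = 0.02, b = -0.01, c = 0.03)
counts <- matrix(c(2, 1, 0,
                   0, 0, 0,
                   1, 1, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, names(w)))

# Weight matrix: the weight vector repeated once per posting (row)
W <- matrix(w, nrow = nrow(counts), ncol = length(w), byrow = TRUE)

# Hadamard product, then row sums, then division by the adverb total
scores <- rowSums(W * counts) / rowSums(counts)
scores[is.nan(scores)] <- 0   # postings without adverbs give 0/0

# Reference: the same score computed posting by posting
loop_scores <- numeric(nrow(counts))
for (i in seq_len(nrow(counts))) {
  N_i <- sum(counts[i, ])
  loop_scores[i] <- if (N_i > 0) sum(w * counts[i, ] / N_i) else 0
}

all.equal(scores, loop_scores)
```

Both versions give the same scores; the matrix version simply avoids the explicit loop.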

### 1.3.1 First, we prepare the dataset by parsing the adverbs of each posting and counting them to build the adverb frequency table.

There are three objects which need to be pre-loaded:
*   The original dataset (adverb-weight-dataset.csv)
*   The master adverb list from Ongig (adverbs.csv)
*   The adverb weights (ridgeAdvCoefs.csv)

In [None]:
%%R
# load the original dataset
df <- read.csv("adverb-weight-dataset.csv")

# load the master list of adverbs
advList <- read.csv("adverbs.csv", header = FALSE)

# load the adverb weights
ridge_weights <- read.csv("ridgeAdvCoefs.csv")


In [None]:
%%R
# subset the dataset to only include "job_id" and "adverbs"
df_adv <- df[, c("job_id", "adverbs")] 
df_adv$adverbs <- gsub(" ", "", df_adv$adverbs)
# note: the largest number of adverbs in any one posting is 64.
df_adv <- df_adv %>% separate(adverbs, paste0("adv", c(1:64)), sep = ",", remove = T)

# collect the unique adverbs that appear in the current job postings
adverbsFromPosting <- unique(as.vector(as.matrix(df_adv[,2:ncol(df_adv)])))
adverbsFromPosting <- adverbsFromPosting[adverbsFromPosting!= "" & !is.na(adverbsFromPosting)] 

# create a data frame to hold adverb frequencies for each job posting
df_advCounts <- data.frame(matrix(0, ncol = nrow(advList) + 1, nrow = nrow(df)))
colnames(df_advCounts) <- c("job_id", advList$V1)

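# this step might take a minute or two
for (i in seq_len(nrow(df_adv))) {
  df_advCounts[i, "job_id"] = df_adv$job_id[i]
  # keep only the adverb columns (drop job_id), then drop NAs and empty strings
  df_adv_temp <- unlist(df_adv[i, -1])
  df_adv_temp <- df_adv_temp[!is.na(df_adv_temp) & df_adv_temp != ""]

  # seq_along() skips the inner loop entirely when a posting has no adverbs
  for (j in seq_along(df_adv_temp)) {
    col_index <- which(colnames(df_advCounts) == df_adv_temp[j])
    # adverbs that are missing from the master list have no column and are skipped
    if (length(col_index) > 0) {
      df_advCounts[i, col_index] <- df_advCounts[i, col_index] + 1
    }
  }
}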
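For larger datasets, the same counts can probably be built without the nested loop. The sketch below is one possible alternative under stated assumptions, not the method used in this notebook: `toy` stands in for the `job_id` and `adverbs` columns, and `master` stands in for `advList$V1`; both are hypothetical.

```r
# Hypothetical toy input mirroring the "job_id" and "adverbs" columns
toy <- data.frame(job_id  = c(1, 2, 3),
                  adverbs = c("highly, quickly, highly", "", "quickly"),
                  stringsAsFactors = FALSE)
master <- c("highly", "quickly", "strongly")   # stand-in for advList$V1

# Remove spaces, then split each comma-separated string into its adverbs
tokens <- strsplit(gsub(" ", "", toy$adverbs), ",", fixed = TRUE)

# Count each master-list adverb per posting; the factor levels fix the
# column order and silently drop anything that is not in the master list
counts <- t(sapply(tokens, function(x) table(factor(x, levels = master))))
counts <- data.frame(job_id = toy$job_id, counts, check.names = FALSE)
counts
```

The first posting gets a count of 2 for "highly" and 1 for "quickly", and the posting with an empty adverb string gets a row of zeros.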

In [None]:
%%R
# df_advCounts contains every adverb from the master list, but many of them
# do not appear in the current job postings; here we keep only
# the adverbs that have appeared in the current job postings.
df_advCounts <- df_advCounts[, colnames(df_advCounts) %in% adverbsFromPosting]
df_advCounts$job_id <- df_adv$job_id

# compute the total number of adverbs in each job posting, and store that value in variable "advTotal"
df_advCounts$advTotal <- rowSums(df_advCounts[, !names(df_advCounts) %in% c("job_id", "advTotal")])

# sanity check the dimensions of this dataframe: the number of columns should be 534 adverbs + job_id + advTotal
dim(df_advCounts)

### 1.3.2 We store a version of the adverb frequency table for future use. (If this session times out, we do not have to reconstruct it.)

In [None]:
%%R
# Save this dataset to the Google Drive folder
write.csv(df_advCounts, file = "/content/drive/MyDrive/Ongig/adverbsCounts.csv") 

### 1.3.3 Now we need to make sure the ridge_weights matrix (the weight matrix) and the df_advCounts matrix (the frequency matrix) have the same dimensions, and that the adverbs follow the same order along the columns.


In [None]:
%%R
# obtain the number of job postings we have
n = nrow(df_advCounts)

# we broadcast the ridge_weights following the dimension of the adverb frequency matrix, df_advCounts
ridge_weights_expand <- cbind(ridge_weights, rep(ridge_weights[2], each = n)) %>% t() %>% as.data.frame()
ridge_weights_expand <- ridge_weights_expand[3:nrow(ridge_weights_expand), 3:ncol(ridge_weights_expand)] 
colnames(ridge_weights_expand) <- ridge_weights_expand[1,]
ridge_weights_expand <- ridge_weights_expand[-1,]
ridge_weights_expand <- ridge_weights_expand[,!colnames(ridge_weights_expand) %in% c("job_id", "advTotal")]
# sanity check the dimension of ridge_weights_expand
dim(ridge_weights_expand)

# re-order the columns to make sure adverbs in the weight matrix follow the same order
ridge_weights_expand <- ridge_weights_expand[, order(colnames(ridge_weights_expand))] 

ridge_weights_expand <- as.matrix(sapply(ridge_weights_expand, as.numeric))

### 1.3.4 Now make sure the frequency matrix has the same dimensions and the same adverb ordering as the weight matrix.



In [None]:
%%R
# First keep only the adverbs from df_advCounts that are included in the weight matrix, dropping job_id and advTotal
df_advCounts <- df_advCounts[,colnames(df_advCounts) %in% colnames(ridge_weights_expand) & ! colnames(df_advCounts) %in% c("job_id", "advTotal")]

# make sure the adverbs follow the same order
df_advCounts <- df_advCounts[, order(colnames(df_advCounts))]

df_advCounts <- as.matrix(sapply(df_advCounts, as.numeric))
# sanity check of the dimension 
dim(df_advCounts)
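Because the element-wise product only makes sense when both matrices have the same columns in the same order, it can help to align them explicitly by name and assert the result. The sketch below shows one way to do that on toy data; `weights`, `counts`, and the adverbs in them are hypothetical and do not come from the notebook's files.

```r
# Hypothetical weights (one adverb, "rarely", never appears in the postings)
weights <- c(quickly = 0.02, highly = -0.01, rarely = 0.05)

# Hypothetical frequency matrix for two postings
counts <- matrix(c(1, 2,
                   0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("highly", "quickly")))

# Keep only the adverbs present on both sides, in one shared (sorted) order
shared <- sort(intersect(colnames(counts), names(weights)))
counts_aligned  <- counts[, shared, drop = FALSE]
weights_aligned <- weights[shared]

# Fail loudly if the two objects are not aligned column for column
stopifnot(identical(colnames(counts_aligned), names(weights_aligned)))

# With aligned names, the weighted sum per posting is a matrix-vector product
raw_scores <- as.vector(counts_aligned %*% weights_aligned)
raw_scores
```

Indexing by name rather than by position means a reordered or extended weight file would trigger the assertion instead of silently misaligning coefficients.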

## 1.4 Adverb Score Computation







In [None]:
%%R
# Step 1. compute the Hadamard product of the weight matrix and the frequency matrix (element-wise products) 
# call this the adverb raw score
advRawScores <- ridge_weights_expand * df_advCounts 

# sanity check for dimensions
dim(advRawScores)

In [None]:
%%R
# Step 2. compute the sum of the raw scores
advRawScoresSum <- rowSums(advRawScores)

# Step 3. normalize the raw score by the total number of adverbs
advScoresNew <- advRawScoresSum/rowSums(df_advCounts)

# Step 4. set advScoresNew to 0 for postings without any adverbs (0/0 gives NaN above)
advScoresNew[is.na(advScoresNew)] = 0

# Step 5. merge the new score back into the original dataset for future use
df$advScoresNew <- advScoresNew

In [None]:
%%R
# save a version of the updated dataset
write.csv(df, file = "/content/drive/MyDrive/Ongig/adverb-weight-dataset_newAdvScores.csv")

In [None]:
%%R
# Visualize the distribution of the new adverb score
ggplot(df, aes(x = advScoresNew)) + geom_histogram(bins = 100) + theme_classic() + ggtitle("Distribution of New Adverb Score From Ridge Regression")

In [None]:
%%R
# Visualize the distribution again, but with all the zeros removed
ggplot(df[df$advScoresNew!=0,], aes(x = advScoresNew)) + geom_histogram(bins = 100) + theme_classic() + ggtitle("Distribution of New Adverb Score (Non-zero) From Ridge Regression")

## 1.5 Rescale the new adverb score to 0 to 100

In [None]:
%%R
# reload the previously saved dataset if the session has logged out.
# df <- read.csv("/content/drive/MyDrive/Ongig/adverb-weight-dataset_newAdvScores.csv")

as <- df$advScoresNew

### 1.5.1 arctan transformation 


In [None]:
%%R
# scale by 100 so that arctan(x) is spread more widely across the whole range (-pi/2, pi/2)
as_arctan <- atan(as*100) 
summary(as_arctan)

In [None]:
%%R
hist(as_arctan, breaks = 100)

### 1.5.2 compute z score

In [None]:
%%R
mean <- mean(as_arctan)
sd <- sd(as_arctan)
as_arc_z <- (as_arctan - mean) / sd
summary(as_arc_z)

In [None]:
%%R
hist(as_arc_z, breaks = 100)

### 1.5.3 shift the distribution from the range ($-\pi/2$, $\pi/2$) to (0, 100)

In [None]:
%%R
# by symmetry, the mean of a symmetric distribution on (0, 100) should be 50
as_rescaled <- 50 + as_arc_z*(sd*30) # multiply the sd by 30 to spread the scores over the new domain
summary(as_rescaled)

In [None]:
%%R
hist(as_rescaled, breaks = 100)

### 1.5.4 values smaller than 0 are set to 0 and values larger than 100 are set to 100 (this did not happen in our current data, but it might happen in the future with very small probability, about 4 standard deviations from the mean)

In [None]:
%%R
as_rescaled[as_rescaled < 0] = 0
as_rescaled[as_rescaled > 100] = 100
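Since the z-score divides by the standard deviation and the rescaling multiplies by it again, the steps in 1.5.1 to 1.5.4 reduce algebraically to `50 + 30 * (atan(100 * x) - mean)`, followed by clipping. The sketch below wraps that in a hypothetical helper, `rescale_adv_score` (not part of the notebook), and checks it against the step-by-step version on made-up raw scores.

```r
# Hypothetical helper: arctan transform, centre at 50, spread by 30, clip to [0, 100].
# `center` defaults to the mean of the arctan-transformed input scores.
rescale_adv_score <- function(raw, center = mean(atan(raw * 100))) {
  a <- atan(raw * 100)
  scaled <- 50 + 30 * (a - center)
  pmin(pmax(scaled, 0), 100)
}

raw <- c(-0.05, 0, 0.001, 0.02)   # made-up raw adverb scores

# Step-by-step version, following sections 1.5.1 to 1.5.4
a <- atan(raw * 100)
z <- (a - mean(a)) / sd(a)
two_step <- pmin(pmax(50 + z * (sd(a) * 30), 0), 100)

one_step <- rescale_adv_score(raw)
all.equal(one_step, two_step)
```

Because arctan is bounded by $\pm\pi/2$, the unclipped score lies between roughly $2.9 - 30m$ and $97.1 - 30m$, where $m$ is the mean of the arctan values, so clipping can only occur when $m$ is farther than about 0.1 from zero. Passing a fixed `center` estimated on the training postings would be one way to score new postings consistently.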

### 1.5.5 testing and evaluating the new rescaled adverb score:

#### create variables needed for the evaluations

In [None]:
%%R
df$as_rescaled <- as_rescaled
df$unique_apply_rate <- df$unique_applystarts / df$unique_views
df$word_count_sqr <- df$word_count * df$word_count

#### examine the correlation between the rescaled and the unscaled new adverb score.

In [None]:
%%R
cor(df$advScoresNew, df$as_rescaled)

In [None]:
%%R
plot(df$advScoresNew, df$as_rescaled) # monotonic transformation, sigmoid-shaped curve

#### predicting the outcome - unique application rate (uptake) - with the rescaled score, compared with the new adverb score that hasn't been rescaled.

In [None]:
%%R
lm_new_adv_score <- lm(unique_apply_rate ~ advScoresNew + word_count + word_count_sqr + title + title*advScoresNew, data = df)
lm_new_adv_score_rescale <- lm(unique_apply_rate ~ as_rescaled + word_count + word_count_sqr + title + title*as_rescaled, data = df)

In [None]:
%%R
# new adverb score (unscaled)
summary(lm_new_adv_score)

In [None]:
%%R
# new adverb score (rescaled)
summary(lm_new_adv_score_rescale)

In [None]:
%%R
# anova on new adverb score (unscaled)
anova(lm_new_adv_score)

In [None]:
%%R
# anova on new adverb score (rescaled)
anova(lm_new_adv_score_rescale)

## 1.6 Notes on Adverb Score


*   The rescaled new adverb score slightly outperforms the unscaled new adverb score in the models above.
*   This version computes adverb scores only based on the adverbs that the training algorithm has "seen." Adverbs that the algorithm hasn't seen before are not included in the computation. (Note: we can build algorithms which can address "unseen" adverbs if you want.)
*   Some job postings ( $\sim$ 43.5 %) do not contain any adverb, therefore their adverb score is defined as 0.  



# 2. Gender Score

The gender score combines the coefficients of 10 models. Nine come from the primary step, where female words, male words, and female and male words combined are each used to fit three different outcomes: overall uptake, male uptake, and female uptake. One comes from the secondary step, where we fit a super-learner to combine the results of those 9 models. The details of the model fitting are provided in the Overleaf documentation. This Colab walks through the computation.

After several precomputation steps, the gender score reduces to the sum of the following four components: 

*   Female word component: based on a multiplier vector of length 432 (3 times the number of female words)
*   Male word component: based on a multiplier vector of length 474 (3 times the number of male words)
*   Intensity component: based on a multiplier vector of length 906 (3 times the number of all gender words)
*   Constant = 0.21 (intercept from the super-learner)



## 2.1 To compute the female word component, we need the following information: 

* female word multiplier: provided as f_multiplier in the Ongig Google Drive, Ongig_Colab folder
* female word counts: the words must be ordered alphabetically; each word's count is a non-negative integer
* total number of female words: the sum of all female word counts

Example used here is from job_id = 1094715

In [6]:
%cd /content/drive/MyDrive/Ongig/Ongig_Colab/

/content/drive/MyDrive/Ongig/Ongig_Colab


In [22]:
%%R
# Let's load everything upfront

# These following vectors are used to compute each component of the gender score
f_multiplier <- read.csv("f_multiplier.csv")[,"x"] # female component
m_multiplier <- read.csv("m_multiplier.csv")[,"x"] # male component
m_f_multiplier <- read.csv("m_f_multiplier.csv")[,"x"] # intensity component

example <- read.csv("job_id_1094715_data.csv")
example_femaleWordCounts <- read.csv("job_id_1094715_fwCounts.csv", row.names = NULL)
example_maleWordCounts <- read.csv("job_id_1094715_mwCounts.csv", row.names = NULL)

# remove job_id 
example_femaleWordCounts <- example_femaleWordCounts[,-1]
example_maleWordCounts <- example_maleWordCounts[,-1]

# combine the female and male words
example_genderWordCounts <- cbind(example_maleWordCounts, example_femaleWordCounts)

femaleWordCount <- rowSums(example_femaleWordCounts[1,])
maleWordCount <- rowSums(example_maleWordCounts[1,])
genderWordCount <- rowSums(example_genderWordCounts[1,])

In [26]:
%%R
# The most important step is to order the words alphabetically,
# so that each coefficient lines up correctly with its corresponding word!

example_femaleWordCounts <- example_femaleWordCounts[,order(names(example_femaleWordCounts))]
example_femaleWordCounts # now the words should be alphabetically ordered
ncol(example_femaleWordCounts)

# replicate the wordCounts 3 times
# then take the dot product with the f_multiplier
# as.numeric function makes sure the values are numeric
female_component <- as.numeric(f_multiplier) %*% as.numeric(rep(example_femaleWordCounts[1,],3))
female_component

             [,1]
[1,] -0.003023532


## 2.2 To compute the male word component, we need the following information: 

* male word multiplier: provided as m_multiplier in the Ongig Google Drive, Ongig_Colab folder
* male word counts: the words must be ordered alphabetically; each word's count is a non-negative integer
* total number of male words: the sum of all male word counts

The example used here is still job_id = 1094715. The calculation below is very similar to the one above.

In [27]:
%%R
# The most important step is to order the words alphabetically,
# so that each coefficient lines up correctly with its corresponding word!
example_maleWordCounts <- example_maleWordCounts[,order(names(example_maleWordCounts))]

# replicate the wordCounts 3 times
# then take the dot product with the m_multiplier
# as.numeric function makes sure the values are numeric
male_component <- as.numeric(m_multiplier) %*% as.numeric(rep(example_maleWordCounts[1,],3))
male_component <- male_component / maleWordCount
male_component

              [,1]
[1,] -0.0001136762


## 2.3 To compute the intensity component, we need the following information: 

* gender word multiplier: provided as m_f_multiplier in the Ongig Google Drive, Ongig_Colab folder
* male and female word counts: the words must be ordered alphabetically; each word's count is a non-negative integer
* total number of gender words: the sum of all male and female word counts

The example used here is still job_id = 1094715. The calculation below is very similar to the ones above.

In [28]:
%%R
# The most important step is to order the words alphabetically,
# so that each coefficient lines up correctly with its corresponding word!
example_genderWordCounts <- example_genderWordCounts[,order(names(example_genderWordCounts))]

# replicate the wordCounts 3 times
# then take the dot product with the m_f_multiplier
# as.numeric function makes sure the values are numeric
intensity_component <- as.numeric(m_f_multiplier) %*% as.numeric(rep(example_genderWordCounts[1,],3))
intensity_component <- intensity_component / genderWordCount
intensity_component

          [,1]
[1,] 0.0127578


## 2.4 The final step in computing the gender score for job_id = 1094715 is to sum all the components above and add the constant (the constant is the same for all job postings)

This result agrees with what we have computed following our algorithm. In the folder "/content/drive/MyDrive/Ongig/Ongig_Colab/", you will find a file, "dfwithFwMw_IdGenderScore.csv", which contains every job_id and its precomputed gender score for your reference.

In [29]:
%%R
CONST = 0.2146933

gender_score <- female_component + male_component + intensity_component + CONST
gender_score

          [,1]
[1,] 0.2243139
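The cells in sections 2.1 to 2.4 can be collected into a single helper. The sketch below is a hypothetical wrapper, `gender_score_sketch`, that mirrors those cells as written: the female component is used undivided, while the male and intensity components are divided by their word totals. The words, counts, and multipliers are made up, and the real multiplier vectors are much longer.

```r
# Hypothetical helper mirroring the cells in sections 2.1 to 2.4.
# Counts are named integer vectors; multipliers are 3x as long as the counts.
gender_score_sketch <- function(f_counts, m_counts, f_mult, m_mult, mf_mult, const) {
  # order every count vector alphabetically so coefficients line up with words
  f_counts <- f_counts[order(names(f_counts))]
  m_counts <- m_counts[order(names(m_counts))]
  g_counts <- c(m_counts, f_counts)
  g_counts <- g_counts[order(names(g_counts))]

  stopifnot(length(f_mult)  == 3 * length(f_counts),
            length(m_mult)  == 3 * length(m_counts),
            length(mf_mult) == 3 * length(g_counts))

  # note: the divisions yield NaN if a posting has no male or no gender words
  female    <- sum(f_mult  * rep(f_counts, 3))
  male      <- sum(m_mult  * rep(m_counts, 3)) / sum(m_counts)
  intensity <- sum(mf_mult * rep(g_counts, 3)) / sum(g_counts)

  female + male + intensity + const
}

# Made-up example: 2 female words, 3 male words, constant multipliers
score <- gender_score_sketch(
  f_counts = c(support = 0, collaborate = 1),
  m_counts = c(lead = 1, compete = 2, driven = 0),
  f_mult   = rep(0.01, 6),
  m_mult   = rep(-0.02, 9),
  mf_mult  = rep(0.005, 15),
  const    = 0.2146933
)
score
```

In this toy case the three components are 0.03, -0.06, and 0.015, which together with the constant give 0.1996933. Returning the three components separately instead of their sum would be a small change if the more granular view is wanted.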


## 2.5 Discussion

* Again, it is very important to order the words alphabetically before taking the dot product. If the words are not ordered, the coefficients will be misaligned with their corresponding words. 

* This version of the gender score is not scaled, so its range is not yet 0 to 100. Since we are going to use the "First impression letter grade", we may convert the score directly to the letter grade without rescaling to [0, 100]. 

* The reason I suggest computing the three components (female component, male component, and intensity component) separately and then adding them up is that this gives you more granular information. Since we think of the gender score as a high-dimensional scale, we may give recommendations that depend on specific dimensions.