# Recruitment Analysis

A simple script to test for anomalies within recruitment.

### Importing the necessary libraries for this:

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import math
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.graphics.regressionplots import abline_plot

### Import the recruitment file:

In [None]:
recruitment = pd.read_csv("recruitment.csv")

### Testing for those who are unable to commit 1 year to PV


The idea is simple: check who did not tick yes to the commitment question in the Google Form.

#### Now we aim to clean the table to remove unnecessary columns for easier viewing:

In [None]:
# dropping unnecessary columns such as tag number, year of study, graduating
del recruitment['Tag Number']
del recruitment['1h) As of August 2021, what year of study would you be in?']
del recruitment['1i) Will you be graduating before Jun 2022?']

#### We check for those who have not ticked the commitment question:

In [None]:
# first let me check what is the data type for the column
recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.'].dtype

# so it is a string basically, okay

In [None]:
unavailable = recruitment.loc[recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.'] != 'checked']
len(unavailable)

#### Since 0 people left the box unchecked, everyone can commit to PV. This column is therefore unnecessary and can be removed, together with the next column:

In [None]:
del recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.']
del recruitment['1k) What other commitments might you have for the next 12 months? (ie CCA Clubs, Freelance work, etc)']

### Test: Testing for lenient assessors

The idea is to compute the average word count for each score, for each question.

For example, for question 1 the average word count for scores 1, 2, 3, 4 and 5 might be 50, 100, 150, 200 and 250. We get these by counting the words in each answer, and then fit a regression of score on word count. We repeat this for each question.

We can then see who tends to grade above or below the score the regression predicts across several questions, and flag them as lenient or strict.

We also need to run a hypothesis test to check that word count can in fact be used to predict the score an answer should receive.
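As a rough illustration of the first step, the sketch below counts words per answer and averages them per score. The toy table and its column names (`Answer`, `Score`) are assumptions made for this example and are not the real form headers.

```python
import pandas as pd

# Hypothetical toy data: "Answer" and "Score" are illustrative column names only.
toy = pd.DataFrame({
    "Answer": [
        "short reply",
        "a somewhat longer reply here",
        "two words",
        "this answer has five words",
    ],
    "Score": [1, 3, 1, 3],
})

# Count whitespace-separated words in each answer.
toy["WordCount"] = toy["Answer"].str.split().str.len()

# Average word count for each score level.
mean_words_per_score = toy.groupby("Score")["WordCount"].mean()
print(mean_words_per_score)
```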

#### We need to clean up the table, so let's check what the type of the scores is, and change it to integers:

In [None]:
# inspect a single score cell to see its data type
variable = recruitment.iloc[0, [3]]
variable.dtype

In [None]:
# now we clean for all the columns that have scores present in them

# firstly we start by creating a column list with the relevant numbers needed
column_list = [3, 4, 6, 7, 9, 10, 12, 13, 21, 22, 24, 25]

# now loop through the score columns:
for index in column_list:
    column = recruitment.columns[index]

    # change the values to strings so that extra comments can be removed,
    # and strip any newlines from the cells
    cleaned = recruitment[column].astype(str).str.replace('\n', '', regex=False)

    # because some scores have remarks attached to them, i.e. applicant 67: "2: I don't see much ...",
    # we keep only the first character (the score itself) and filter out the rest
    recruitment[column] = cleaned.str[:1]

In [None]:
# convert the strings into numbers column by column (doing it cell by cell with iloc did not work);
# anything that cannot be parsed becomes NaN
recruitment[recruitment.columns[column_list]] = recruitment[recruitment.columns[column_list]].apply(pd.to_numeric, errors = 'coerce')

#### Now we split the data by question, and create one table per question to store the word count of each applicant's answer:

In [None]:
column_list = [3, 4, 6, 7, 9, 10, 12, 13, 21, 22, 24, 25]

# creating the various tables for the various questions
question_one = recruitment.iloc[:, [0, 1, 2, 3, 4]].copy()
question_two = recruitment.iloc[:, [0, 1, 5, 6, 7]].copy()
question_three = recruitment.iloc[:, [0, 1, 8, 9, 10]].copy()
question_four = recruitment.iloc[:, [0, 1, 11, 12, 13]].copy()
question_five = recruitment.iloc[:, [0, 1, 20, 21, 22]].copy()
question_six = recruitment.iloc[:, [0, 1, 23, 24, 25]].copy()

In [None]:
from leniency import wordcount

# create a column for the word count
question_one['WordCount'] = np.nan

wordcount(question_one)

#### More Data Cleaning! Change all assessor scores that are 0 to 1:

Apparently some assessors didn't read the question: it asks for a score from 1 to 5, not 0 to 5. A 0 therefore carries the same weight as a 1, the lowest possible score, so we can change every 0 to 1 to put all assessors on the same scale.

If we don't, the 0 scores will mess up the standard deviations later.
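The `cleanscores` helper lives in the separate `leniency` module and is not shown here. The sketch below only illustrates the idea on a toy table, with hypothetical column names (`ScoreA`, `ScoreB`), and should not be read as that helper's actual implementation.

```python
import pandas as pd

# Hypothetical score columns; the real table uses the long Google Form headers.
scores = pd.DataFrame({"ScoreA": [0, 3, 5], "ScoreB": [2, 0, 4]})

# Replace every 0 with 1 so that all scores sit on the intended 1-5 scale.
cleaned = scores.replace(0, 1)
print(cleaned)
```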

In [None]:
from leniency import cleanscores

# look through all the assessor scores and change any 0 to 1
weighted_score = []

cleanscores(question_one)

#### Next, let's get the average scores of each applicant, by mean of both assessors:

This is important because, as tested later, we ideally want a variable for writing ability. Some people may write a bit, but the content lacks actual material or quality.

Writing ability can be quantified by experience, or by the ability to convey experience. We can use experience in question 1I as a proxy variable to reduce this omitted variable bias, and so reduce wild standard deviations in the tests later.

In [None]:
# let's get the average score of each applicant
question_one['AverageScore'] = (question_one["1l Assesor A's Score (1-5)"] + question_one["1l Assessor B's Score (1-5)"]) / 2

#### Now we need to clean up the table even more, by merging assessor A and B into the same column:

The idea is to do so through copying the table into another table, removing assessor A and their scores for the first table, and then removing assessor B and their scores for the second table, and then placing assessor B table under assessor A table.

In [None]:
# get assessor A scores for the first table, aka drop assessor b scores
question_one_a = question_one.copy()

# get assessor B scores for the second table, aka drop assessor a scores
question_one_b = question_one.copy()

In [None]:
# now we need to drop columns of assessor B and their scores in the first table
del question_one_a['Assesor B']
del question_one_a["1l Assessor B's Score (1-5)"]

# and we also need to drop columns of assessor A and their scores in the second table
del question_one_b['Assesor A']
del question_one_b["1l Assesor A's Score (1-5)"]

In [None]:
# let's now change the column names of table one
question_one_a = question_one_a.rename(columns = {"Assesor A": "Assessor", "1l Assesor A's Score (1-5)": "AssessorScore"})

# and let's also change the column names of table two so that we can merge the tables together
question_one_b = question_one_b.rename(columns = {"Assesor B": "Assessor", "1l Assessor B's Score (1-5)": "AssessorScore"})

# now concatenate the two values together into a table containing everybody's scores for question one
question_one = pd.concat([question_one_a, question_one_b], axis = 0, ignore_index = True)

In [None]:
from leniency import logcount

# get the question score and word count table
q_one_scores = question_one.iloc[:, [2, 3, 4]].copy()

# add logarithmic count to the word count
logcount(q_one_scores)

q_one_scores

### Let's do a linear regression of the ordinal scores against log(word count):

Initially, I wanted to run an ordinal regression. On second thought, it is not necessary: the scores are numeric, so the differences between them can be quantified too. A linear / logistic regression works fine in this case.

The model we assume is:

$ score = \beta_{0} + \beta_{j} \, \log(wordcount_{j}) + \varepsilon $

The fitted linear regression is then:

$\hat{score} = \hat{\beta}_{0} + \hat{\beta}_{j} \, \log(wordcount_{j}) $

Of course, this model fails to account for variables such as experience and language ability (omitted variable bias).

For the later questions, a proxy for experience would be question one itself, since it asks about experience. As established earlier, combating this omitted variable bias properly needs further proxy variables. Because no such variable is available, we can't adopt this solution. With a proxy, the fitted model would have been:

$\hat{score} = \hat{\beta}_{0} + \hat{\beta}_{j} \, \log(wordcount_{j}) + \hat{\beta}_{k} \, proxy_{k} $
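To make the single-variable fit concrete, here is a minimal sketch on made-up numbers. The word counts are hypothetical, and the scores are constructed to follow $score = 1 + 0.5 \log(wordcount)$ exactly, so the fit should recover an intercept of about 1 and a slope of about 0.5.

```python
import numpy as np
from scipy import stats

# Hypothetical word counts; scores are built from them without noise,
# so the regression should recover intercept 1.0 and slope 0.5.
word_counts = np.array([10.0, 30.0, 90.0, 270.0, 810.0])
scores = 1.0 + 0.5 * np.log(word_counts)

# Regress score on log(word count).
log_counts = np.log(word_counts)
fit = stats.linregress(log_counts, scores)

# Predicted score for each answer from the fitted line.
predicted = fit.intercept + fit.slope * log_counts
print(fit.slope, fit.intercept)
```

With real scores the points will not sit on the line, and the gap between an assessor's score and `predicted` is what the later steps work with.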

In [None]:
# import the linear regression library from sklearn
from sklearn.linear_model import LinearRegression
from leniency import regressquestion

# creating variables here for easier use later
X = q_one_scores['LogCount'].values.reshape(-1, 1)
Y = q_one_scores['AssessorScore'].values.reshape(-1, 1)

regressquestion(q_one_scores)

#### We want to get the standard error (`std_err`, which `stats.linregress` reports for the slope estimate) for our calculations later on:

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

#### Now getting a summary of the entire statistics of the model:

In [None]:
from leniency import regressionstats

# find the regression statistics
regressionstats(X, Y)

#### Here we have two hypotheses, and need to test which one the data supports:

$H_{0} : \beta_{j} = 0 $ , word count does not affect scores  
$H_{1} : \beta_{j} > 0 $ , a higher word count increases scores

Let's use a significance level of $ {\alpha} = 0.01 $

The p-value for the coefficient of log(word count), shown in the P>|t| column of the summary, is smaller than the significance level, so we reject $H_{0}$: word count does affect the scores.

Thus the fitted equation is:

$\hat{score} = 1.079 + 0.7025 \, \log(wordcount_{j}) $
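Since $H_{1}$ is one-sided, the two-sided p-value that `stats.linregress` reports by default can be halved when the estimated slope is positive. The sketch below shows that decision rule on hypothetical data; the numbers are invented for illustration and are not the recruitment results.

```python
import numpy as np
from scipy import stats

# Hypothetical log word counts and assessor scores.
log_counts = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5])
scores = np.array([1, 2, 2, 3, 3, 4, 4, 5])

# fit.pvalue is the two-sided p-value for H0: slope = 0.
fit = stats.linregress(log_counts, scores)

# Convert to a one-sided p-value for H1: slope > 0.
if fit.slope > 0:
    one_sided_p = fit.pvalue / 2
else:
    one_sided_p = 1 - fit.pvalue / 2

alpha = 0.01
reject_null = one_sided_p < alpha
print(one_sided_p, reject_null)
```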

#### Now find the predicted scores:

In [None]:
from leniency import predictedscore

# find the predicted score now
predictedscore(question_one, slope, intercept)

#### We can try using a weighted predicted score using the average scores as a baseline for accuracy:

This reduces the chance of poor answers being over-predicted.

In [None]:
from leniency import weightedscore

# now to get the weighted score
weightedscore(question_one)

#### Second algorithm for coming up with a weighted predicted score that punishes bad scores less:

In [None]:
from leniency import adjustedscore

# now to get the adjusted weighted score
adjustedscore(question_one)

#### We need to add how many standard errors each assessor's score is away from the predicted score, and also include the average score:

The point of including the average score is that both assessors may think an applicant's answer is poor, yet the model still gives it a good score because of the false assumption that word count alone determines the score.
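The `assessordeviation` helper is not shown in this notebook, so the following is only a guess at the kind of calculation involved: the gap between an assessor's score and the predicted score, expressed in units of the standard error. The table, the column names `PredictedScore` and `Deviation`, and the value of `std_err` are all hypothetical.

```python
import pandas as pd

# Hypothetical table with one row per (applicant, assessor) pair.
table = pd.DataFrame({
    "Assessor": ["A1", "A2"],
    "AssessorScore": [4.0, 2.0],
    "PredictedScore": [3.0, 3.0],
})

# Hypothetical standard error taken from the regression step.
std_err = 0.5

# Positive values mean the assessor scored above the prediction (more lenient),
# negative values mean below it (stricter).
table["Deviation"] = (table["AssessorScore"] - table["PredictedScore"]) / std_err
print(table)
```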

In [None]:
from leniency import assessordeviation

# now to get the standard deviations of the individual assessor
assessordeviation(question_one, std_err)

### Data Classification

We can now collect the standard deviations and find averages for each person, to test their leniency / strictness.
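One way to picture this step is a group-by over assessors, sketched below on invented numbers. The `Deviation` column name and the assessor labels are assumptions for the example, and the real `namedict` helper may compute its dictionary differently.

```python
import pandas as pd

# Hypothetical per-answer deviations for two assessors.
deviations = pd.DataFrame({
    "Assessor": ["A1", "A2", "A1", "A2"],
    "Deviation": [2.0, -1.0, 1.0, -3.0],
})

# Average deviation per assessor, as a dictionary keyed by assessor name.
mean_deviation = deviations.groupby("Assessor")["Deviation"].mean().to_dict()
print(mean_deviation)
```

A dictionary of this shape is what `Series.map` needs in order to add one column per question to a table keyed by assessor.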

In [None]:
from leniency import namedict

# now create a dictionary to store all the names and their relevant score for the particular qn

# after storing all the names, immediately add it to the deviation table
deviation_table = pd.DataFrame([namedict(question_one)])
deviation_table = deviation_table.transpose().reset_index()
deviation_table.columns = ['Assessor', 'q1i']

### Now let's fill up the rest for the other questions that we have:

We are just copying the same exact script for the other 5 questions so let's just simplify the entire process and remove line breaks, comments, etc.
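The reshaping repeated for each question (copy the table, drop one assessor, rename, concatenate) could also be wrapped in a small helper. The function `stack_assessors` and the toy column names `ScoreA` and `ScoreB` below are hypothetical and are not part of the `leniency` module; this is a sketch of the pattern, not a drop-in replacement.

```python
import pandas as pd

def stack_assessors(table, name_a, score_a, name_b, score_b):
    """Return one row per (applicant, assessor) with shared column names."""
    # Keep only assessor A's columns and give them generic names.
    part_a = table.drop(columns=[name_b, score_b]).rename(
        columns={name_a: "Assessor", score_a: "AssessorScore"})
    # Keep only assessor B's columns and give them the same generic names.
    part_b = table.drop(columns=[name_a, score_a]).rename(
        columns={name_b: "Assessor", score_b: "AssessorScore"})
    # Place assessor B's rows under assessor A's rows.
    return pd.concat([part_a, part_b], axis=0, ignore_index=True)

# Hypothetical toy table with two applicants.
toy = pd.DataFrame({
    "Applicant": [1, 2],
    "Assesor A": ["A1", "A2"],
    "ScoreA": [3, 4],
    "Assesor B": ["A3", "A4"],
    "ScoreB": [2, 5],
})

stacked = stack_assessors(toy, "Assesor A", "ScoreA", "Assesor B", "ScoreB")
print(stacked)
```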

In [None]:
# creating for all the questions' table
question_two['WordCount'] = np.nan
wordcount(question_two)
    
question_three['WordCount'] = np.nan
wordcount(question_three)

question_four['WordCount'] = np.nan
wordcount(question_four)
    
question_five['WordCount'] = np.nan
wordcount(question_five)

question_six['WordCount'] = np.nan
wordcount(question_six)

In [None]:
# cleaning data for all the questions' table
weighted_score = []
cleanscores(question_two)

weighted_score = []
cleanscores(question_three)

weighted_score = []
cleanscores(question_four)

weighted_score = []
cleanscores(question_five)

weighted_score = []
cleanscores(question_six)

In [None]:
# getting the average score for each table
question_two['AverageScore'] = (question_two["2a Assesor A's Score (1-5)"] + question_two["2a Assesor B's Score (1-5)"]) / 2

question_three['AverageScore'] = (question_three["2b Assesor A's Score (1-5)"] + question_three["2b Assessor B's Score (1-5)"]) / 2

question_four['AverageScore'] = (question_four["2c Assesor A's Score (1-5)"] + question_four["2c Assessor B's Score (1-5)"]) / 2

question_five['AverageScore'] = (question_five["2f Assesor A's Score (1-5)"] + question_five["2f Assessor B's Score (1-5)"]) / 2

question_six['AverageScore'] = (question_six["2g Assesor A's Score (1-5)"] + question_six["2g Assessor B's Score (1-5)"]) / 2

In [None]:
# necessary transformations for table 2
question_two_a = question_two.copy()
question_two_b = question_two.copy()

del question_two_a['Assesor B']
del question_two_a["2a Assesor B's Score (1-5)"]
del question_two_b['Assesor A']
del question_two_b["2a Assesor A's Score (1-5)"]

question_two_a = question_two_a.rename(columns = {"Assesor A": "Assessor", "2a Assesor A's Score (1-5)": "AssessorScore"})
question_two_b = question_two_b.rename(columns = {"Assesor B": "Assessor", "2a Assesor B's Score (1-5)": "AssessorScore"})
question_two = pd.concat([question_two_a, question_two_b], axis = 0, ignore_index = True)

In [None]:
q_two_scores = question_two.iloc[:, [2, 3, 4]].copy()
logcount(q_two_scores)

In [None]:
X = q_two_scores['LogCount'].values.reshape(-1, 1)
Y = q_two_scores['AssessorScore'].values.reshape(-1, 1)
regressquestion(q_two_scores)

In [None]:
regressionstats(X, Y)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

predictedscore(question_two, slope, intercept)
weightedscore(question_two)
adjustedscore(question_two)
assessordeviation(question_two, std_err)

deviation_table['q2a'] = deviation_table['Assessor'].map(namedict(question_two))

#### Continue for Question 3 Table:

In [None]:
# necessary transformations for table 3
question_three_a = question_three.copy()
question_three_b = question_three.copy()

del question_three_a['Assesor B']
del question_three_a["2b Assessor B's Score (1-5)"]
del question_three_b['Assesor A']
del question_three_b["2b Assesor A's Score (1-5)"]

question_three_a = question_three_a.rename(columns = {"Assesor A": "Assessor", "2b Assesor A's Score (1-5)": "AssessorScore"})
question_three_b = question_three_b.rename(columns = {"Assesor B": "Assessor", "2b Assessor B's Score (1-5)": "AssessorScore"})
question_three = pd.concat([question_three_a, question_three_b], axis = 0, ignore_index = True)

In [None]:
q_three_scores = question_three.iloc[:, [2, 3, 4]].copy()
logcount(q_three_scores)

In [None]:
np.log(132)

In [None]:
q_three_scores

In [None]:
X = q_three_scores['LogCount'].values.reshape(-1, 1)
Y = q_three_scores['AssessorScore'].values.reshape(-1, 1)
regressquestion(q_three_scores)

In [None]:
regressionstats(X, Y)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

predictedscore(question_three, slope, intercept)
weightedscore(question_three)
adjustedscore(question_three)
assessordeviation(question_three, std_err)

deviation_table['q2b'] = deviation_table['Assessor'].map(namedict(question_three))

#### Continue for Question 4 Table:

In [None]:
# necessary transformations for table 4
question_four_a = question_four.copy()
question_four_b = question_four.copy()

del question_four_a['Assesor B']
del question_four_a["2c Assessor B's Score (1-5)"]
del question_four_b['Assesor A']
del question_four_b["2c Assesor A's Score (1-5)"]

question_four_a = question_four_a.rename(columns = {"Assesor A": "Assessor", "2c Assesor A's Score (1-5)": "AssessorScore"})
question_four_b = question_four_b.rename(columns = {"Assesor B": "Assessor", "2c Assessor B's Score (1-5)": "AssessorScore"})
question_four = pd.concat([question_four_a, question_four_b], axis = 0, ignore_index = True)

In [None]:
q_four_scores = question_four.iloc[:, [2, 3, 4]].copy()
logcount(q_four_scores)

In [None]:
X = q_four_scores['LogCount'].values.reshape(-1, 1)
Y = q_four_scores['AssessorScore'].values.reshape(-1, 1)
regressquestion(q_four_scores)

In [None]:
regressionstats(X, Y)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

predictedscore(question_four, slope, intercept)
weightedscore(question_four)
adjustedscore(question_four)
assessordeviation(question_four, std_err)

deviation_table['q2c'] = deviation_table['Assessor'].map(namedict(question_four))

#### Continue for Question 5 Table:

In [None]:
# necessary transformations for table 5
question_five_a = question_five.copy()
question_five_b = question_five.copy()

del question_five_a['Assesor B']
del question_five_a["2f Assessor B's Score (1-5)"]
del question_five_b['Assesor A']
del question_five_b["2f Assesor A's Score (1-5)"]

question_five_a = question_five_a.rename(columns = {"Assesor A": "Assessor", "2f Assesor A's Score (1-5)": "AssessorScore"})
question_five_b = question_five_b.rename(columns = {"Assesor B": "Assessor", "2f Assessor B's Score (1-5)": "AssessorScore"})
question_five = pd.concat([question_five_a, question_five_b], axis = 0, ignore_index = True)

In [None]:
q_five_scores = question_five.iloc[:, [2, 3, 4]].copy()
logcount(q_five_scores)

In [None]:
X = q_five_scores['LogCount'].values.reshape(-1, 1)
Y = q_five_scores['AssessorScore'].values.reshape(-1, 1)
regressquestion(q_five_scores)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

predictedscore(question_five, slope, intercept)
weightedscore(question_five)
adjustedscore(question_five)
assessordeviation(question_five, std_err)

deviation_table['q2f'] = deviation_table['Assessor'].map(namedict(question_five))

#### Continue for Question 6 Table:

In [None]:
# necessary transformations for table 6
question_six_a = question_six.copy()
question_six_b = question_six.copy()

del question_six_a['Assesor B']
del question_six_a["2g Assessor B's Score (1-5)"]
del question_six_b['Assesor A']
del question_six_b["2g Assesor A's Score (1-5)"]

question_six_a = question_six_a.rename(columns = {"Assesor A": "Assessor", "2g Assesor A's Score (1-5)": "AssessorScore"})
question_six_b = question_six_b.rename(columns = {"Assesor B": "Assessor", "2g Assessor B's Score (1-5)": "AssessorScore"})
question_six = pd.concat([question_six_a, question_six_b], axis = 0, ignore_index = True)

In [None]:
q_six_scores = question_six.iloc[:, [2, 3, 4]].copy()
logcount(q_six_scores)

In [None]:
X = q_six_scores['LogCount'].values.reshape(-1, 1)
Y = q_six_scores['AssessorScore'].values.reshape(-1, 1)
regressquestion(q_six_scores)

In [None]:
regressionstats(X, Y)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:,0], Y[:,0])

predictedscore(question_six, slope, intercept)
weightedscore(question_six)
adjustedscore(question_six)
assessordeviation(question_six, std_err)

deviation_table['q2g'] = deviation_table['Assessor'].map(namedict(question_six))
deviation_table

In [None]:
deviation_table.to_csv(r'C:\Users\User\OneDrive\Documents\Student Life\Protege Ventures\Data Science\Recruitment\leniency.csv', index = False)