# Recruitment Analysis

A simple script to test for anomalies within recruitment.

### Importing the necessary libraries for this:

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import math
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.graphics.regressionplots import abline_plot

### Import the recruitment file:

In [2]:
recruitment = pd.read_csv("recruitment.csv")

### Testing for those who are unable to commit 1 year to PV


The idea is simple: check who did not tick "yes" to the commitment question in the Google Form.

Now we aim to clean the table to remove unnecessary columns for easier viewing:

In [3]:
# dropping unnecessary columns such as tag number, year of study, graduating
del recruitment['Tag Number']
del recruitment['1h) As of August 2021, what year of study would you be in?']
del recruitment['1i) Will you be graduating before Jun 2022?']

We check for applicants who did not tick the commitment question:

In [4]:
# first let me check what is the data type for the column
recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.'].dtype

# dtype 'O' is pandas' object dtype, so the values are stored as strings here

dtype('O')

In [5]:
unavailable = recruitment.loc[recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.'] != 'checked']
len(unavailable)

0
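
As a cross-check on the filter above, it can help to list every distinct answer in the column, since a comparison against `'checked'` also counts missing answers as "not checked". The following is a minimal sketch on toy data; the short column name `commit` is a hypothetical stand-in for the long form question, not a column in `recruitment.csv`.

```python
import pandas as pd

# Toy data only; "commit" is a hypothetical short name for the commitment question.
responses = pd.DataFrame({"commit": ["checked", "checked", None, "checked"]})

# Count every distinct answer, keeping missing answers in the tally.
counts = responses["commit"].value_counts(dropna=False)
print(counts)

# Same filter as in the cell above: anything other than 'checked' is kept,
# which includes rows where the answer is missing.
not_committed = responses.loc[responses["commit"] != "checked"]
print(len(not_committed))
```

If the tally shows only `checked`, the filter returning 0 rows means what it appears to mean.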

Nobody left the commitment question unchecked, so every applicant can commit to PV. This column therefore carries no information and can be removed, together with the next column:

In [6]:
del recruitment['1j) Should you be accepted into Protege Ventures, would you be able to commit to sessions on a weekly basis for the next 12 months? Sessions will be suspended during exam season.']
del recruitment['1k) What other commitments might you have for the next 12 months? (ie CCA Clubs, Freelance work, etc)']

### Test: Testing for lenient assessors

This idea revolves around computing the average word count of the answers at each score, for each question.

For example, for question 1 the average word counts for scores 1, 2, 3, 4 and 5 might be 50, 100, 150, 200 and 250. We count the words in each answer and fit a regression between word count and score, then repeat this for each question.

We can then see which assessors tend to grade above or below the mean score across several questions, and flag those who consistently grade above it as potentially lenient.

We would also need to run a hypothesis test to check that word count can in fact be used as an indicator of the score an answer should receive.
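
The idea above could be sketched as follows. This is only an illustration on synthetic data, not the analysis run on the recruitment file: the word counts, scores and assessor labels `A`, `B`, `C` are all made up. It uses `scipy.stats.linregress`, whose reported p-value tests the null hypothesis that the slope is zero, and then averages the residuals per assessor.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, illustrative data only: 60 answers to one question.
rng = np.random.default_rng(0)
word_count = rng.integers(20, 300, size=60)
noise = rng.normal(0, 0.7, size=60)
score = np.clip(np.round(1 + word_count / 75 + noise), 1, 5)
assessor = rng.choice(["A", "B", "C"], size=60)  # hypothetical assessor labels

# Least-squares fit of score on word count.
# H0: slope == 0, i.e. word count has no linear relationship with score.
fit = stats.linregress(word_count, score)
print(f"slope={fit.slope:.4f}, r={fit.rvalue:.2f}, p={fit.pvalue:.3g}")

# Residual = actual score minus the score predicted from word count.
residual = score - (fit.intercept + fit.slope * word_count)

# Mean residual per assessor: a clearly positive value suggests that the
# assessor scores higher than the word count alone would predict.
per_assessor = pd.Series(residual).groupby(assessor).mean().sort_values(ascending=False)
print(per_assessor)
```

A small p-value only supports a linear association between word count and score. It does not show that longer answers deserve higher scores, so a flagged assessor is a prompt for a closer look rather than a verdict.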

We need to clean up the table, so let's check what the type of the scores is, and change it to integers:

In [8]:
# find a random variable
variable = recruitment.iloc[0, [3]]
variable.dtype

dtype('O')

In [9]:
# now we clean for all the columns that have scores present in them

# firstly we start by creating a column list with the relevant numbers needed
column_list = [3, 4, 6, 7, 9, 10, 12, 13, 21, 22, 24, 25]

# now loop through the column:
for index in column_list:
    
    # loop through each cell in each column
    for i, row_value in recruitment.iloc[:, [index]].iterrows():
        
        # change the values to strings so that you can remove extra comments
        recruitment.iloc[i, [index]] = recruitment.iloc[i, [index]].astype(str)
        
        # now remove any newlines from the cells
        recruitment.iloc[i, [index]] = recruitment.iloc[i, [index]].replace('\n','', regex=True)
        
        # because some scores have rankings or comments attached to them, i.e. applicant 67: 2: I don't see much .... score
        # so we have to remove this by taking the first score attached, i.e. take 2 and filter out the rest
        recruitment.iloc[i, [index]] = recruitment.iloc[i, [index]].str[:1]
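
The same three steps (cast to string, strip newlines, keep the first character) can also be applied to a whole column at once instead of cell by cell. The following is a minimal sketch on a toy frame; the column name `score_a` is a hypothetical stand-in for one of the score columns.

```python
import pandas as pd

# Toy frame; "score_a" is a hypothetical stand-in for one score column.
toy = pd.DataFrame({"score_a": ["3", "2: I don't see much\nevidence", 4, None]})

cleaned = (
    toy["score_a"]
    .astype(str)                         # cast every cell to text
    .str.replace("\n", "", regex=False)  # drop embedded newlines
    .str[:1]                             # keep only the first character
)

# Anything that is not a digit after cleaning becomes NaN.
toy["score_a"] = pd.to_numeric(cleaned, errors="coerce")
print(toy)
```

Keeping only the first character assumes every score is a single digit, which holds for a 1 to 5 scale but would truncate a score such as 10.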

In [10]:
# convert the cleaned strings to numbers (doing this cell by cell through iloc did not work);
# values that cannot be parsed become NaN
recruitment[recruitment.columns[column_list]] = recruitment[recruitment.columns[column_list]].apply(pd.to_numeric, errors='coerce')

Now we split the data into one table per question, so that each table can store the word count of every applicant's answer to that question:

In [11]:
column_list = [3, 4, 6, 7, 9, 10, 12, 13, 21, 22, 24, 25]

# creating the various tables for the various questions
question_one = recruitment.iloc[:, [0, 1, 2, 3, 4]].copy()
question_two = recruitment.iloc[:, [0, 1, 5, 6, 7]].copy()
question_three = recruitment.iloc[:, [0, 1, 8, 9, 10]].copy()
question_four = recruitment.iloc[:, [0, 1, 11, 12, 13]].copy()
question_five = recruitment.iloc[:, [0, 1, 20, 21, 22]].copy()
question_six = recruitment.iloc[:, [0, 1, 23, 24, 25]].copy()

In [12]:
from leniency import wordcount


wordcount(question_one)
question_one

ImportError: cannot import name 'wordcount' from 'leniency' (unknown location)
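
The `leniency` module is not shown in this notebook, so the import above cannot be resolved here. As a hypothetical stand-in, a function with the same name could look like the sketch below. It assumes that the answer text sits in the third column of each question table (position 2), that words are separated by whitespace, and that the table is modified in place, which matches how `wordcount(question_one)` is called above. The real `wordcount` may behave differently.

```python
import pandas as pd


def wordcount(table: pd.DataFrame, answer_col: int = 2) -> pd.DataFrame:
    """Hypothetical stand-in: add a 'word_count' column to `table` in place.

    `answer_col` is the position of the free-text answer column (assumed to be 2).
    """
    # Treat missing answers as empty text so they count as 0 words.
    answers = table.iloc[:, answer_col].fillna("").astype(str)
    # Split on whitespace and count the resulting pieces.
    table["word_count"] = answers.str.split().str.len()
    return table


# Toy table shaped like a question table: two id columns, then the answer.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "assessor": ["X", "Y", "X"],
    "answer": ["I like venture capital", None, "one"],
})
wordcount(toy)
print(toy)
```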