# Application Processing Automation

This code automates the first, trivial filtering steps in processing applications for the TReND in Africa Computational Neuroscience and Machine Learning Basics course.

The code is organized as a set of functions that are applied as a processing pipeline to the application response data (see [documentation](https://docs.google.com/document/d/1n4pMEOgMuenuFpN6zXQtZlpYFXwPat2P4-SzZaN8mFg/edit?usp=drivesdk)).

### **How to use (as a developer):**
Just clone the GitHub repository and get down to business!\
If you have Anaconda and Jupyter installed locally, you can clone the repository directly onto your machine. Otherwise, you can clone it into Google Colab.
(In either case, don't forget to pull and push changes regularly.)

### **How to use (as a reviewer):**
If you are viewing this on GitHub, open this notebook in Google Colab, or clone the whole repo locally, so you can run the cells. If you run it in Colab, don't forget to save and download the resulting Excel sheet of processed responses to a local folder.

In [1166]:
import numpy as np
import pandas as pd

In [1167]:
# Note that you have to download the responses data Excel sheet from Google Drive and put it in the same folder as the code.
# You don't have to do this if you cloned the GitHub repo (everything is already organized in the repo).
 
# TODO: Load data directly from Google Drive.

# Loading students responses data
#DATA_DIR = './Copy of Answers_Application_form_TReND_Comp_Neuro_FIRSTPASS.xlsx'
STD_DATA_DIR = './dummy_students_responses_data.xlsx'
std_raw_responses_df = pd.read_excel(STD_DATA_DIR)

# Loading references responses data
REF_DATA_DIR = './dummy_references_responses_data.xlsx'
ref_raw_responses_df = pd.read_excel(REF_DATA_DIR)

# Adding two columns to the responses DataFrame (initialized with None for all cells).
std_raw_responses_df['Flag'] = None   # String ("flagged" or None)
std_raw_responses_df['Notes'] = None  # String (text of notes == reasons for flagging)

In [1168]:
std_raw_responses_df.columns

Index(['Timestamp', 'Email address', 'First Name', 'Last Name', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Career Stage',
       'Name of current University or Research Institution',
       'Undergraduate degree (completed or ongoing, eg. Neuroscience, Mathematics)',
       'Master's degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
       'PhD degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
       'Current research focus or research focus of the last research project you were engaged in (if applicable)',
       'Why would you like to attend the course? (2000 characters max)',
       'How do you think you could contribute to the course?  (2000 characters max)',
       'At the end of the first week the students will start a short individual research project. What would be your dream project?  (2000 characters max)',
       'Please attach a 1-page CV in pdf format (documents longer than one page will be discarded). If you h

In [1169]:
ref_raw_responses_df.columns

Index(['Email Address', 'Student Code', 'Student First Name',
       'Student Last Name', 'Letter'],
      dtype='object')

In [1170]:
# Use this dictionary as a reference for column names.

questions_dic = {i: column for i, column in enumerate(std_raw_responses_df.columns)}
questions_dic

{0: 'Timestamp',
 1: 'Email address',
 2: 'First Name',
 3: 'Last Name',
 4: 'Unnamed: 4',
 5: 'Unnamed: 5',
 6: 'Unnamed: 6',
 7: 'Career Stage',
 8: 'Name of current University or Research Institution',
 9: 'Undergraduate degree (completed or ongoing, eg. Neuroscience, Mathematics)',
 10: "Master's degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)",
 11: 'PhD degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
 12: 'Current research focus or research focus of the last research project you were engaged in (if applicable)',
 13: 'Why would you like to attend the course? (2000 characters max)',
 14: 'How do you think you could contribute to the course?  (2000 characters max)',
 15: 'At the end of the first week the students will start a short individual research project. What would be your dream project?  (2000 characters max)',
 16: 'Please attach a 1-page CV in pdf format (documents longer than one page will be discarded). If you hav

In [1171]:
# Replace column names with indices.

std_responses_df = std_raw_responses_df.rename(columns={column: i for i, column in enumerate(questions_dic.values())})
std_responses_df.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype='int64')
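
Since columns are addressed by position from here on, a reverse lookup from question text to index can help avoid picking the wrong column. The following is only a sketch on a toy DataFrame; `toy_df`, `toy_questions`, and `column_index` are hypothetical names that do not exist in the notebook.

```python
import pandas as pd

# Toy stand-in for the responses sheet (made-up data).
toy_df = pd.DataFrame({
    "Timestamp": ["2021-01-01"],
    "Email address": ["a@example.org"],
    "Why would you like to attend the course? (2000 characters max)": ["Because ..."],
})

# Forward mapping: index -> question text (same construction as questions_dic).
toy_questions = {i: column for i, column in enumerate(toy_df.columns)}


def column_index(questions, fragment):
    """Return the index of the single column whose name contains `fragment`."""
    matches = [i for i, name in questions.items() if fragment.lower() in name.lower()]
    if len(matches) != 1:
        raise KeyError(f"Expected exactly one column matching {fragment!r}, found {len(matches)}")
    return matches[0]


# Rename columns to their positions, as the notebook does.
toy_indexed = toy_df.rename(columns={name: i for i, name in toy_questions.items()})

email_col = column_index(toy_questions, "email address")
print(email_col, list(toy_indexed.columns))
```

Raising an error on zero or multiple matches is a deliberate choice here: an ambiguous fragment is reported instead of silently selecting a column.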

In [1172]:
# Carefully specify the columns to be processed (mostly responses to essay questions).

# Columns are referenced by index (see questions_dic) to avoid mistakes when typing the long question names.

essay_qs = [13, 14, 15]

In [1173]:
# Specify the minimum and maximum number of words for answers to essay questions.
# Note: These parameters apply to all essay questions.

MIN_WORDS_NUM = 50
MAX_WORDS_NUM = 2000
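
To illustrate how these two thresholds are meant to be applied, here is a minimal sketch that classifies a single answer by its whitespace-separated word count. The function name `length_status` and its return labels are hypothetical and not part of the notebook. Note that the form states its limit in characters ("2000 characters max"), while these thresholds count words.

```python
MIN_WORDS = 50    # mirrors MIN_WORDS_NUM above
MAX_WORDS = 2000  # mirrors MAX_WORDS_NUM above


def length_status(answer, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Classify an answer as 'short', 'long', or 'ok' by its word count."""
    # str() guards against non-string cells such as NaN.
    n_words = len(str(answer).split())
    if n_words < min_words:
        return "short"
    if n_words > max_words:
        return "long"
    return "ok"


print(length_status("too brief"))      # well under the lower limit
print(length_status("word " * 100))    # 100 words, within both limits
```

Both limits are inclusive in this sketch: an answer of exactly 50 or exactly 2000 words counts as acceptable, which matches the strict `<` and `>` comparisons used in the pipeline.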

## Utility functions

In [1174]:
def word_count(answer):
    """
    Takes a specific answer (cell) of a specific essay question and returns the answer's number of words.
    """
    
    return len(answer.split())


def set_flag(responses_df, email):
    """
    Sets the 'flag' column value to "flagged" for a response chosen by its 'Email address'.
    """
    
    # This modifies the DataFrame itself (i.e. changes it in place).
    # 1 = 'Email address' column, 18 = 'flag' column
    # .loc is used because .iloc does not accept a boolean Series as a row mask.
    responses_df.loc[responses_df[1] == email, 18] = "flagged"

    
def leave_note(responses_df, response_index, note_text):
    """
    Appends a note to the 'Notes' column.
    """
    
    edited_responses_df = responses_df.copy()
    
    # 19 = 'notes' column
    current_note = edited_responses_df.iloc[response_index, 19]
    
    if pd.isna(current_note):
        # No note yet (None or NaN): start with this one.
        edited_responses_df.iloc[response_index, 19] = note_text
    elif note_text not in str(current_note):
        # Append only notes that are not already recorded for this response.
        edited_responses_df.iloc[response_index, 19] = str(current_note) + " - " + note_text
            
    return edited_responses_df
    
    
def index_column_names(df):
    """
    Replaces column names with indices.
    """

    # Map each column name to its position, e.g. 'Timestamp' -> 0.
    df = df.rename(columns={column: i for i, column in enumerate(df.columns)})

    return df
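
The flag and note helpers above can be exercised on a small frame to see how they interact. This is a minimal sketch, assuming the notebook's integer column labels (1 = 'Email address', 18 = 'Flag', 19 = 'Notes'); the combined helper `flag_with_note` and the toy data are hypothetical.

```python
import pandas as pd

# Toy frame that only contains the three columns the helpers touch.
toy = pd.DataFrame({
    1: ["a@example.org", "b@example.org", "a@example.org"],
    18: [None, None, None],
    19: [None, None, None],
})


def flag_with_note(df, email, note):
    """Return a copy with every row for `email` flagged and `note` recorded once."""
    out = df.copy()
    mask = out[1] == email
    # Label-based selection: .loc accepts a boolean Series as a row mask.
    out.loc[mask, 18] = "flagged"
    for idx in out.index[mask]:
        current = out.at[idx, 19]
        if pd.isna(current):
            out.at[idx, 19] = note
        elif note not in str(current):
            out.at[idx, 19] = f"{current} - {note}"
    return out


flagged = flag_with_note(toy, "a@example.org", "Example note")
# Applying the same note again leaves the notes unchanged.
flagged = flag_with_note(flagged, "a@example.org", "Example note")
print(flagged)
```

Returning a copy keeps the input untouched, which is the same convention `leave_note` follows; `set_flag`, by contrast, modifies its argument in place.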

## Main Pipeline

In [1181]:
def remove_duplicates(responses_df):
    """
    Removes duplicated rows (responses) based on 'Email address' and keeps the last response submitted.
    Note: Some students may make changes to their responses and submit a new one,
    which is why this function keeps the last response submitted and removes the preceding ones.
    
    TODO: Check with the organizers what else is an adequate action.
    
    params :
        responses_df: the responses data (DataFrame)
    returns:
        edited_responses_df: An edited responses_df with duplicates removed
    """
    
    # 1 = 'Email address' column
    # drop_duplicates returns a new DataFrame, so the result has to be assigned.
    edited_responses_df = responses_df.drop_duplicates(subset=[1], keep='last')
    
    return edited_responses_df


def flag_duplicates(responses_df):
    """
    Flags duplicated rows (responses) based on 'Email address', leaving the last response submitted unflagged.
    Note: Some students may make changes to their responses and submit a new one,
    which is why this function flags the earlier submissions and leaves a note on them.
    
    TODO: Check with the organizers what else is an adequate action.
    
    params :
        responses_df: the responses data (DataFrame)
    returns:
        edited_responses_df: An edited responses_df with the 'flag' and 'notes' columns updated
    """
    
    edited_responses_df = responses_df.copy()
    
    # True for every response that is followed by a later one from the same email.
    # 1 = 'Email address' column
    is_duplicate = edited_responses_df.duplicated(subset=[1], keep='last')
    
    # Format: df['col'] = df['col'].where(condition, value_if_false)
    # Values are kept where the condition is True and replaced where it is False.
    
    # 18 = 'flag' column, 19 = 'notes' column
    edited_responses_df[18] = edited_responses_df[18].where(~is_duplicate, "flagged")
    edited_responses_df[19] = edited_responses_df[19].where(~is_duplicate, "A duplicated response")
    
    return edited_responses_df


def flag_short(responses_df, essay_qs):
    """
    Flags answers that are too short (fewer than MIN_WORDS_NUM words) for a specified
    set of essay questions, and leaves a note.
    
    params :
        responses_df: the responses data (DataFrame)
        essay_qs    : essay questions (list of column indices)
    returns:
        edited_responses_df: An edited responses_df with short answers flagged
    """
    
    edited_responses_df = responses_df.copy()
    
    # Go through all the responses and for each response go through the answers for the essay questions
    for row_index in range(len(edited_responses_df)):
        for question in essay_qs:
            
            if word_count(str(edited_responses_df.iloc[row_index, question])) < MIN_WORDS_NUM:
                edited_responses_df.iloc[row_index, 18] = "flagged"
                
                edited_responses_df = leave_note(edited_responses_df, row_index, "Insufficient short answer/s")        
                    
    return edited_responses_df
                    

# Should we flag long answers ??
def flag_long(responses_df, essay_qs):
    """
    Flags extremely long answers (more than MAX_WORDS_NUM words) for a specified
    set of essay questions, and leaves a note.
    
    params :
        responses_df: the responses data (DataFrame)
        essay_qs    : essay questions (list of column indices)
    returns:
        edited_responses_df: An edited responses_df with long answers flagged
    """
     
    edited_responses_df = responses_df.copy()
    
    # Go through all the responses and for each response go through the answers for the essay questions
    for row_index in range(len(edited_responses_df)):
        for question in essay_qs:
            
            if word_count(str(edited_responses_df.iloc[row_index, question])) > MAX_WORDS_NUM:
                edited_responses_df.iloc[row_index, 18] = "flagged"
                
                edited_responses_df = leave_note(edited_responses_df, row_index, "Extremely long answer/s")
                        
    return edited_responses_df


def match_references(std_responses_df, ref_responses_df):
    """
    Flags a student response if it did not receive exactly two reference letters, and leaves a note.
    
    params :
        std_responses_df : students responses data (DataFrame)
        ref_responses_df : references responses data (DataFrame)
    returns:
        edited_std_responses_df: An edited std_responses_df in which responses that do not satisfy
                                 the reference letter conditions are flagged
    """
     
    edited_std_responses_df = std_responses_df.copy()
    
    # Number of reference letters received per student code (computed once).
    letters_count = ref_responses_df['Student Code'].value_counts()
    
    for row_index in range(len(edited_std_responses_df)):
        
        # 1 = 'Email address' column, matched against 'Student Code' in the references data
        n_letters = letters_count.get(edited_std_responses_df.iloc[row_index, 1], 0)
        
        if n_letters == 0:
            note = "Got no reference letters"
        elif n_letters == 1:
            note = "Got only one reference letter"
        elif n_letters > 2:
            note = "Got more than two reference letters"
        else:
            # Exactly two letters: nothing to flag.
            continue
        
        # 18 = 'flag' column
        edited_std_responses_df.iloc[row_index, 18] = "flagged"
        edited_std_responses_df = leave_note(edited_std_responses_df, row_index, note)
                        
    return edited_std_responses_df
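
The duplicate handling above relies on `keep='last'`, which marks every earlier submission from the same email and leaves the most recent one unmarked. A minimal sketch on made-up data (the column names `email`, `answer`, and `flag` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "email": ["a@example.org", "b@example.org", "a@example.org"],
    "answer": ["first draft", "only answer", "revised draft"],
})

# True for every row that has a later row with the same email.
is_older_duplicate = toy.duplicated(subset=["email"], keep="last")

# Flagging: mark the older submissions but keep all rows.
toy["flag"] = None
toy.loc[is_older_duplicate, "flag"] = "flagged"

# Dropping instead of flagging: drop_duplicates returns a new DataFrame
# by default, so the result has to be assigned.
deduplicated = toy.drop_duplicates(subset=["email"], keep="last")

print(toy)
print(deduplicated)
```

This assumes rows are stored in submission order, so that "last" means "most recent"; if the sheet could be reordered, sorting by the timestamp column first would be a sensible precaution.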



In [1176]:
def main(responses_df):
    
    responses_df_flagged_duplicates = flag_duplicates(responses_df)
    responses_df_flagged_short = flag_short(responses_df_flagged_duplicates, essay_qs)
    responses_df_final = flag_long(responses_df_flagged_short, essay_qs)
    
    return responses_df_final
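
Because each step takes a DataFrame and returns an edited copy, the same chaining could also be written with `DataFrame.pipe`. The sketch below uses two hypothetical stand-in steps (`add_flag_column`, `flag_empty_answers`) on toy data; it illustrates the pattern and does not replace `main`.

```python
import pandas as pd


def add_flag_column(df):
    """Stand-in step: return a copy with an empty 'flag' column."""
    out = df.copy()
    out["flag"] = None
    return out


def flag_empty_answers(df, column):
    """Stand-in step: flag rows whose answer in `column` is blank."""
    out = df.copy()
    out.loc[out[column].str.strip() == "", "flag"] = "flagged"
    return out


toy = pd.DataFrame({"answer": ["some text", "   "]})

# Each .pipe call passes the previous result as the first argument of the next step.
result = toy.pipe(add_flag_column).pipe(flag_empty_answers, column="answer")
print(result)
```

Passing parameters such as `column` explicitly, as `main` does with `essay_qs`, keeps each step independent of notebook-level state.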

In [1177]:
responses_df_final = main(std_responses_df)

In [1182]:
df = match_references(responses_df_final, ref_raw_responses_df)
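
The reference check amounts to counting letters per student and comparing the count with the expected number. A vectorized sketch of that idea on made-up data follows; the required count of two is an assumption inferred from the notes used in `match_references`, and the names `students`, `references`, `n_letters`, and `needs_review` are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({"code": ["s1", "s2", "s3", "s4"]})
references = pd.DataFrame({"Student Code": ["s2", "s3", "s3", "s4", "s4", "s4"]})

REQUIRED_LETTERS = 2  # assumption: exactly two letters are expected per student

# Letters received per student code.
letters_per_student = references["Student Code"].value_counts()

# Students without any letter are absent from the counts, hence fillna(0).
students["n_letters"] = students["code"].map(letters_per_student).fillna(0).astype(int)
students["needs_review"] = students["n_letters"] != REQUIRED_LETTERS

print(students)
```

Keeping the raw count in its own column lets a reviewer distinguish "no letters" from "too many letters" without parsing the note text.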

In [1183]:
df.iloc[:, 18:20]

|   | 18 | 19 |
|---|---|---|
| 0 | flagged | A duplicated response - Got only one reference... |
| 1 | flagged | A duplicated response - Got only one reference... |
| 2 | flagged | Insufficient short answer/s - Got only one ref... |
| 3 | flagged | Insufficient short answer/s |
| 4 | flagged | Insufficient short answer/s |
| 5 | flagged | Insufficient short answer/s |
| 6 | flagged | Insufficient short answer/s |
| 7 | flagged | Insufficient short answer/s |
| 8 | flagged | Insufficient short answer/s |
| 9 | flagged | Insufficient short answer/s |
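
Once the flag and notes columns are filled, reviewers will typically want to separate flagged responses from clean ones and see which reasons occur most often. A minimal sketch on made-up data, assuming the same column labels (18 = 'Flag', 19 = 'Notes'); the output file name in the comment is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    18: ["flagged", None, "flagged"],
    19: ["A duplicated response", None, "Insufficient short answer/s"],
})

# Split the responses by flag status.
flagged_rows = toy[toy[18] == "flagged"]
clean_rows = toy[toy[18].isna()]

# How often each note text occurs among the flagged responses.
reasons = flagged_rows[19].value_counts()

print(len(flagged_rows), len(clean_rows))
print(reasons)

# To save the result as an Excel sheet (requires an Excel writer such as openpyxl):
# flagged_rows.to_excel("flagged_responses.xlsx", index=False)
```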
