# Applications processing Automation

The purpose of this code is to automate the first trivial filtering steps in the processing of the applications for the TReND in Africa Computational Neuroscience and Machine Learning Basics course.

This code is organized as a set of functions to be applied as a processing pipeline on the application responses data (See [documentation](https://docs.google.com/document/d/1n4pMEOgMuenuFpN6zXQtZlpYFXwPat2P4-SzZaN8mFg/edit?usp=drivesdk)).

### **How to use (as a developer):**
Just clone the Github repository and get into the business!\
If you have anaconda and yupyter installed locally you can just clone the repory directly on your machine. Elsewise, you can clone it into Google Colab.
(In either case, don't forget to regularly pull and push changes).

### **How to use (as a reviewer):**
If you are on Github now, open this notebook in Google Colab, or clone the whole repo locally, so you can run the cells. In case of running it in Colab, don't forget to save and download the resulting Excel sheet of the processed responses into a local folder.

In [93]:
import numpy as np
import pandas as pd

In [94]:
# Note that you have to download the responses data Excel sheet from Google Drive and put it in the same folder as the code.
# You don't have to do this if you cloned the Github repo (all will be organized in the repo).

# TODO: Load data directly from Google Drive.

#DATA_DIR = './Copy of Answers_Application_form_TReND_Comp_Neuro_FIRSTPASS.xlsx'
DATA_DIR = './dummy_responses_data.xlsx'

raw_responses_df = pd.read_excel(DATA_DIR)

# Adding two columns to the responses DataFrame (initialized with None for all cells).
raw_responses_df['flag'] = None  #String ("flagged" or None
raw_responses_df['notes'] = None #String (Text of notes == reasaons for flagging)

raw_responses_df.columns

Index(['Timestamp', 'Email address', 'First Name', 'Last Name', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Career Stage',
       'Name of current University or Research Institution',
       'Undergraduate degree (completed or ongoing, eg. Neuroscience, Mathematics)',
       'Master's degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
       'PhD degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
       'Current research focus or research focus of the last research project you were engaged in (if applicable)',
       'Why would you like to attend the course? (2000 characters max)',
       'How do you think you could contribute to the course?  (2000 characters max)',
       'At the end of the first week the students will start a short individual research project. What would be your dream project?  (2000 characters max)',
       'Please attach a 1-page CV in pdf format (documents longer than one page will be discarded). If you h

In [95]:
# Use this dictionary as a reference for column names.

questions_dic = {i: column for i, column in enumerate(raw_responses_df.columns)}
questions_dic

{0: 'Timestamp',
 1: 'Email address',
 2: 'First Name',
 3: 'Last Name',
 4: 'Unnamed: 4',
 5: 'Unnamed: 5',
 6: 'Unnamed: 6',
 7: 'Career Stage',
 8: 'Name of current University or Research Institution',
 9: 'Undergraduate degree (completed or ongoing, eg. Neuroscience, Mathematics)',
 10: "Master's degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)",
 11: 'PhD degree (completed or ongoing, if applicable, eg. Neuroscience, Mathematics)',
 12: 'Current research focus or research focus of the last research project you were engaged in (if applicable)',
 13: 'Why would you like to attend the course? (2000 characters max)',
 14: 'How do you think you could contribute to the course?  (2000 characters max)',
 15: 'At the end of the first week the students will start a short individual research project. What would be your dream project?  (2000 characters max)',
 16: 'Please attach a 1-page CV in pdf format (documents longer than one page will be discarded). If you hav

In [96]:
# Replpace column names with indices.

responses_df = raw_responses_df.rename(columns={column: i for i, column in enumerate(questions_dic.values())})
responses_df.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype='int64')

In [97]:
# Carefully specify names of the columns to be processed (mostly responses for essay questions).

# TODO: To avoid any naming mistakes, replace questions, as columns names, with a key (or simply, indices).

essay_qs = [13, 14, 15]

In [98]:
# Specify the minimum and maximun number of words for ansewrs for essay questions.
# Note: These parameters apply for for all essay questions.

MIN_WORDS_NUM = 50
MAX_WORDS_NUM = 2000

## Utility functions

In [99]:
def word_count(answer):
    """
    Takes a specific answer (cell) of a specific essay question and returns the answer's number of words.
    """
    
    return len(answer.split())


def set_flag(responses_df, email):
    """
    Set the 'flag' column value to "flagged" for a response chosen by it's 'Email address'
    """
    
    # This modifies the DataFrame itself (i.e change in place)
    # # 1 - 'Email address' column, 18 = 'flag' column
    responses_df.loc[responses_df[1] == email, 18] = "flagged"

## Main Pipeline

In [100]:
def remove_duplicates(responses_df):
    """
    removes duplicated rows (responses) based on 'Email address' and keeps the last response submitted.
    Note: Some students may make changes to their responses and submit a new one,
    this's why this function keeps the last response submitted and removes preceding ones.
    
    TODO: Check with the organizers what else is an adequate action.
    
    params :
        response_df: the responses data (DataFrame)
    returns:
        edited_responses_df: An edited response_df with duplicates removed
    """
    edited_responses_df = responses_df.copy()
    
    # 1 - 'Email address' column
    edited_responses_df.drop_duplicates(subset=[1], keep='last')
    
    return edited_responses_df


def flag_duplicates(responses_df):
    """
    flags duplicated rows (responses) based on 'Email address' and keeps the last response submitted.
    Note: Some students may make changes to their responses and submit a new one,
    this's why this function keeps the last response submitted and removes preceding ones.
    
    TODO: Check with the organizers what else is an adequate action.
    
    params :
        response_df: the responses data (DataFrame)
    returns:
        edited_responses_df: An edited responses_df with 'flag' column updated
    """
    
    edited_responses_df = responses_df.copy()
    
    # 18 = 'flag' column, 19 = 'notes' column
    edited_responses_df[18], edited_responses_df[19] = np.where(
        
        # True/False nupmy array (True: duplicated, False: unique)
        [edited_responses_df.duplicated(subset=[1], keep='last')].array,
        
        # Update row values if the corresponding True/False array element is True
        ['flagged', edited_responses_df[19] + " - " + "A duplicated response" + " - "],
        
        # Leave row values as they were if the corresponding True/False array element is False
        [edited_responses_df[18], edited_responses_df[19]]
    )
    
    return edited_responses_df


def flag_short(responses_df, essay_qs):
    """
    flags insufficently short answers (less than a specific lower limit) for a specified
    set of essay questions.
    
    params :
        response_df: the responses data (DataFrame)
        essay_qs   : essay questions (list)
    returns:
        edited_responses_df: An edited responses_df with short answers removed
    """
    
    edited_responses_df = responses_df.copy()
     
    # Go through all the responses and for each response go through the answers for the essay questions
    for index, response in edited_responses_df.iterrows():
        for question in essay_qs:
            
            if response[question] < MIN_WORDS_NUM:
                responses[18] = "flagged"
                
                if "Insufficient short answer/s" not in responses[19]:
                    responses[19] = responses_df['notes'] + " - " + "Insufficient short answer/s" + " - "
                    

# Should we flag long answers ??
def flag_long(responses_df, essay_qs):
    """
    flags extremely long answers (more than a specific upprt limit) for a specified
    set of essay questions.
    
    params :
        response_df: the responses data (DataFrame)
        essay_qs   : essay questions (list)
    returns:
        edited_response_df: An edited responses_df with long answers removed
    """
     
    edited_responses_df = responses_df.copy()
    
    # Go through all the responses and for each response go through the answers for the essay questions
    for index, response in edited_responses_df.iterrows():
        for question in essay_qs:
            
            if response[question] < MAX_WORDS_NUM:
                responses[18] = "flagged"
                
                if "Extremely long answer/s" not in responses[19]:
                    responses[19] = responses_df[19] + " - " + "Extremely lomg answer/s" + " - "
                    
    return edited_responses_df

In [101]:
responses_df_flagged_duplicates = flag_duplicates(responses_df)

AttributeError: 'list' object has no attribute 'array'

In [102]:
test_df = responses_df.duplicated(subset=[1], keep='last')

In [103]:
test_df.array

<NumpyExtensionArray>
[ True,  True, False, False, False, False, False, False, False, False, False,
 False,  True,  True, False, False, False, False, False]
Length: 19, dtype: bool

In [104]:
np.array([[test_df], [test_df]]).shape

(2, 1, 19)

In [105]:
np.array([[test_df], [test_df]])

array([[[ True,  True, False, False, False, False, False, False, False,
         False, False, False,  True,  True, False, False, False, False,
         False]],

       [[ True,  True, False, False, False, False, False, False, False,
         False, False, False,  True,  True, False, False, False, False,
         False]]])

In [None]:
responses_df

In [None]:
def main(responses_df):
    
    responses_df_flagged_duplicates = flag_duplicates(responses_df)
    responses_df_flagged_short = flag_short(responses_df_flagged_duplicates, essay_qs)
    
    final_responses_df = responses_df_flagged_short
    
    return final_responses_df