# University Test Rakings in Italy

_A python-based solution to provide an early response to test takers_

**Author** <br>
Simone Maria Giancola <br>
simonegiancola09@gmail.com <br>
+0039 3314788683 <br>
Linkedin Profile: [Simone Giancola](https://www.linkedin.com/in/simone-maria-giancola-011465173/) <br>
I am absolutely not a professional, so any suggestion, either typo, code, writing, or analysis related is highly appreciated, and will be welcomed with enthusiasm.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#The-test" data-toc-modified-id="The-test-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The test</a></span><ul class="toc-item"><li><span><a href="#Structure" data-toc-modified-id="Structure-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Structure</a></span></li><li><span><a href="#Points" data-toc-modified-id="Points-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Points</a></span></li><li><span><a href="#Concerns" data-toc-modified-id="Concerns-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Concerns</a></span></li></ul></li><li><span><a href="#A-simple,-Python-made-(partial)-solution" data-toc-modified-id="A-simple,-Python-made-(partial)-solution-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>A simple, Python-made (partial) solution</a></span><ul class="toc-item"><li><span><a href="#Instances-of-the-problem" data-toc-modified-id="Instances-of-the-problem-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Instances of the problem</a></span></li><li><span><a href="#Code" data-toc-modified-id="Code-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Code</a></span></li><li><span><a href="#Results'-imprecision" data-toc-modified-id="Results'-imprecision-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Results' imprecision</a></span></li></ul></li><li><span><a href="#Running-the-main-function" data-toc-modified-id="Running-the-main-function-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Running the main function</a></span></li></ul></div>

The following is a simple code, which however has great potential. I chose to include it in my personal projects since it is something which could be very useful for students who did one of the entry tests for Italian national universities. It does not prove fine Python coding skills, also because it was done on September 2019, a couple of months after the end of my first Python exam. The explanation and motivation are definitely longer than the code itself, because this document is intended for non-Italian readers which should not know anything about how enrolment in some Italian Universities is carried out.<br>
**Note** <br>
Being an early project, I did not know how to generalize it to get information from some online source, but I did something similar in my Covid19 Analysis project, which is on my Github profile as well. Thus, the code is genuine from early experience, but takes files from my personal disk. Nevertheless what should matter is the implementation of the idea. 

## Introduction

The courses which require this evaluation include but are not limited to: <br>
    - Medicine and Surgery (6 years course), together with  Dentistry
    - Architecture (5 years course)
    - Veterinary (5 years course)
    - Health professions (Nursery, physiotherapy,...)
Each year, thousands of aspiring university students take those tests, hoping to obtain a place among the few available. For some degrees, there is a national selection, for others, the selection is local. The former ones follow this kind of flow: <br>
- Applications open, students can pay the fee and access the system
- a list of available Universities with relevant available places is proposed, each year subject to slight changes
-each student chooses a raking of a subset of the options or of the whole, thus stating which is the first preference, the second and so on
-on test day (around the beginning of September), each university organizes a way to host all students who chose it as a first choice in order to allow them to take the test (**note:** this is crucial for the understanding of the code, a student takes the national test on the location of his/her first choice). <br>
    - **note:** please note also that due to the current pandemic the 2020 session was arranged differently, and so this kind of information is lost and the code is useless for 2020 (more on it later).
- one month later (more or less), anonymous results are made available to the public, everyone can see his/her score, but not his/her relative position. Moreover, to know this it is crucial that the test taker remembers the unique identification code attached to the test papers (more on it in the next chapter)
-some time later, rankings are shared, and each student is aware of the actual position in the ranking. Given that slots are limited, some are in and some are out, depending on the choices that all the students who had a higher score took. <br>
    - it may be that one particular location's places are exhausted way before another (due to fame of the University, or quality of life in the city where it is located), and so the choice of a not admitted student falls on his/her second option, or third, or fourth and so on. 
-periodically, admitted test takers are prompted to choose whether or not to accept the first choice they are allowed to take among their preferences, allowing for others to choose in the event that some are not satisfied with the option the university they can enroll in. <br>
    - the reasons for not accepting are numerous. I could mention distance from home and fame of the university as an example, but in the end everyone has their reasons. 
    - an interesting fact is that many test takers do not eventually enroll in the degree they might get into because they just take the test "for fun" (very common in Italy) or for business reasons (there are some companies which prepare students to take tests and so some collaborators do it in order to be always updated on new modalities, questions, and claims on those as well, to protect their customers)
- rankings keep on moving forward for a lot of time but tend to lose speed, meaning that already after some months no one enrolls anymore, because lectures start around October and enrolling late means having lost part of them. 


## The test

### Structure

Subject to change each year, with rumours about dates and questions, the Ministry of Education chooses how it is made. All questions are the same, whether one takes it in Milan, Rome or Naples. Subjects are generally related to: <br>
- mathematics
- physics
- logical reasoning
- verbal reasoning
- biology and science
- anatomy
- general culture <br>
    - defined as anything related to history but of common knowledge, from politics to science and philosophy. The italian Constitution and world recent's happenings are an example, but anything really could be considered part of general culture. <br>

Sample questions of the test for Universities teaching medicine in English can be found at the following link: <br>
[Sample Questions (English)](https://www.admissionstesting.org/Images/539381-imat-past-paper-2018.pdf) <br>
**Note:** If the link does not work it is sufficient to search for keywords such as:" _IMAT test Italy sample questions_ "

### Points

The current method, as of 2020, awards $+1.5$ points for each correct question, and $-0.4$ points for each incorrect question. The minimum score to be admitted to the ranking is 20. This may seem a low score, but the test is often complicated: even though 90 points are available it is common that the average of the around 70000 test takers for medicine and surgery is below 45.

### Concerns

Often causing anxiety in young 19 years old students, and being the only way to graduate and work in some specific fields (example: even Italian Private universities have a medicine test, organized in similar ways), each year many claims are raised. These could be related to specific unclear questions, scandals, complaints concerning the modalities and the number of available slots (considered to be always less than optimal). <br>
It is also common that students decide to attend a different degree and try the subsequent year to get into the one they actually want. <br>
On top of all this mess, and if you do not believe it, just ask an Italian friend, result dates do not help. 
To be clearer, one could observe that: <br>
- Most of the Universities accept students before the results of those tests are posted. So, if one does not pass, the risk is that of not being able to enroll in any degree. 
- The test is done on an answer sheet which is corrected by a computer
- Each test taker is directly linked to his/her own ID, which anonymizes them to avoid help from a possible friend/relative who has in hand the tests at some point and could change the answers.
    - for this reason it is also interesting to notice that the answer sheet cannot be written on, except from ticks on the answers. This ensures that no paper is recognizable. Say for example that one knows that the test of his friend has a strnge circle at an angle of the paper, then he/she could recognize it and change the answers, if he/she is the one who puts asnwer sheets on the optic reader. Any "dirty" answer sheet should be invalidated. It is also forbidden for a student to memorize his/her ID, as to avoid the sharing of this information. <br>
- being corrected by a computer, results are probably uploaded on a system which directly links the ID, the score, the name of the student, and preferences to a single location. <br>
- **BUT** Results are posted one month later, anonymously, and then some time later with names. Basically, most of the test takers, anxious about the result, either remember the answers they gave, or remember their code (somehow illegally, but who could blame them), or even both, in order to be aware of the result in the fastest way possible, while already having enrolled in a different course in the event that they do not pass. 

It is evident that this process could be simplified, by either setting test dates before or sharing results earlier for example. Probably, there is a reason for doing so that only those dealing with the test arrangements are aware of, I do not want to judge. <br>
By just observing the current and past situation, I decided to give some relax to my friends, by attempting to give them an earlier asnwer. The next chapters will explain how I did so. 

## A simple, Python-made (partial) solution

As stated at the beginning of this notebook, the idea behind and the implementation of it are quite simple, but showed to be socially useful for many people. This is not a document showing particular skills in Python coding, but nevertheless it proves how simple scripts, which let a computer do the hard work, can truly help. To support this, evidence I can only rely on the confidence of the reader when I say that lots of people asked me about their result the year I coded this. The solution is however partial, as I will try to explain in one of the following subchapters. 

### Instances of the problem

Once anonymous results are public, for some privacy reasons I did not completely get, the Ministry of Education does not share a complete pdf ranking. Fortunately, in a few hours usually, a pdf containing all results starts to appear online. In the event that this does not happen, one can build it on his/her own by using online free tools that merge documents. 
This is the crucial piece of information needed: a pdf with a table containing scores and IDs of all test takers, not necessarily ordered. <br>
Next, what is known about IDs is that each one is made of a list of numbers and characters, in which the first symbols denote the place where the test was taken by the individual, and thus the individual's first preference. Said so, we could say that having a pdf with all scores means knowing the first preference of everyone, along with their relative overall ranking. <br>
What is missing is a list of the universities, with relative keys put at the beginning of the IDs, and places available. <br>
Eventually, with just a couple of documents, one with rakings and one translating information, we can build a simple set of functions which does the following _pseudocode_ : <br>
- accept a target score
- set leftout counter to zero
- loop
- find highest overall score x
- while x is more than the target score
- recognize which is x's first choice
- check that available seats for the choice are not zero
    - if not zero remove one slot there and remove x from the ranking
    - if zero increase leftout counter by 1, and remove x from the ranking
- end while
- end loop
- return the leftout counter

The above process will tell the user for a given score if any person who did better did not get into his/her choice, thus moving to his/her second choice. Being that only the first choice is known, this is the only feasible operation when anonymous rakings are shared, but it is still something. If someone knows that there is basically no waiting list, then he/she is definitely in his/her first choice, unless places were exhausted at the moment that precise score was reached. The test regulation states that when there is a tie, younger students, even born in the same year but in a later month, are preferred, but this cannot be evaluated since the only information in hand is the first preference. No birth date is retrievable from the ranking. 

### Code

In [138]:
import tabula
import pandas as pd

I decided to divide the process in a few functinos which help accomplishing different tasks, calling one another across the computation. The libraries used are quite easy: <br>
- Tabula allows the user to transform a pdf with tables into a dataframe
    - I have trusted that it does so with not so much error, but this is not controlled by my code and it could be argued. I would however say that it surely does a faster and more precise work than a normal person. 
- Pandas is the one of the most common tools for dataframe manipulation

The first function is a simple dataframe wrapper which accepts a path where a pdf is stored, and returns a dataframe. Since rankings are very long documents (medicine is around 600 pages of scores), when called it requires some time. Being the only library I found some months ago, I am not aware of faster methods, but anyway this is not something related to the actual implementation, and it is a one time operation, since then the dataframe can be stored and the pdf deleted. 

In [139]:
def pdf_to_df(path,pages='all'):
    l=tabula.read_pdf(path, stream=True, pages='all')
    #creates a list of dfs to join
    df=l[0]
    for i in range(1,len(l)):
        df=df.append(l[i])
    
    return df

The second wrapper creates a dictionary with as keys university names and as values the number of places. I did this since there are available excel files ith this kind of information, and I had managed to download one of them. For what concerns the IDcode-university pairing this is not the case, I had to find them by reading documents, I did not find any table online, and I had to write it by hand. Those two dictionaries will be used for the prestated pseudocode.  

In [140]:
def uni_places(path):
    df=pd.read_excel(path)
    d={}
    for i in range(len(df)-1):
        a=df.iloc[i][0].lower()[9:]
        d[a]=df.iloc[i][1]
    
    return d
code_uni={'02':'bari','03':'bologna','C7':'varese insubria','C6':'milano "bicocca"','01':'politecnica delle marche','C8':'vercelli "avogadro"','04':'cagliari','46':'brescia','08':'catania','C5':'catanzaro','53':'chieti','09':'ferrara','10':'firenze','C9':'foggia','11':'genova','55':"l'aquila",'14':'messina','15':'milano','17':'modena','39':'molise','18':'napoli','49':'napoli "luigi vanvitelli" (sede di napoli e sede di caserta)','19':'padova','20':'palermo','21':'parma','22':'pavia','23':'perugia (sede di perugia e sede di terni)','24':'pisa','26':'la sapienza','27':'roma "tor vergata"','28':'salerno','29':'sassari','30':'siena','31':'torino','33':'trieste','34':'udine','40':'verona'}


One of the first things to do is simplify the dataframe. Those usually store (row wise): the ID, each partial score in the different sections, and the total score. For what concerns our objective only the ID and the total scores are needed. Since scorenames are not homogeneous, meaning that some pdfs call it _"score"_ while others _"totscore"_ , and so on, the function does also ask which kind of string denotes our target value. Being of many pages, the keywords which name each column repeat, and so they must be skipped. I did so by looping across all results and ignoring those who are basically column titles, while storing only ID and overall score. The overall score was at the sixth column (number 5 with indexing since it starts from zero), because tests were made of 4 parts, thus making for each test taker an ID, 4 partial scores and one final score, in order. 

In [141]:
def get_list(df,score_name='Punti TOT'):
    l=[]
    for i in range(len(df)):
        if df.iloc[i,5]!=score_name:
            t=(df.iloc[i,0],float(df.iloc[i,5]))#.replace(',','.')) )
            #the score is reported with the italian system (decimals with comma)
            #it is converted to the python system with this line
            #and stored in the tuple
            l.append(t)
    return l

**Update**: a more efficient way is obviously possible, but since I do not want to change the code I wrote months ago, because it would basically mean cheating (declaring I did some work in the past and changing it with updated knowlesge acquired with time), I will post it here below. The method would have changed since the old function returns a list of tuples, while this new one is a dataframe. 

In [142]:
#say I have a df and I want to keep only the ID column (first one) and 
# the score column (last one) 
#the command would be
#l=df[['ID_column_name','score_column_name']] 
#to get all rows and 2 columns
#l=l[l['score_column_name']!=score_column_name] 
#to delete the rows that repeat the column name

Having in hand a list of tuples I then wrote a function which sorts it using as criterion the second element of each pair, that of the score. For some datasets it is needed since scores are not sorted. 

In [143]:
def rank(l):
    l.sort(key=lambda pair: pair[1], reverse=True)
    return l

Next I coded the main function, which takes a sorted list of tuples, two dictionaries and a score and does the job outlined before. At each iteration, the dictionary linking name and slots available of a given university is updated, taking out one seat. Once seats in one location are exhausted, the function will return false and count the number of leftout students. 

In [144]:
def until(score,ranking,code_uni,uni_places):
    left_out=0
    for i in range(len(ranking)):
        present_score=ranking[i][1]
        if present_score>=score:
            code=ranking[i][0][0:2]
            #code extraction of the individual at the given position
            if decrease_and_check(code,code_uni,uni_places)==False:
                ##decrease_and_check is explained in the next cell
                left_out+=1
        else:
            print('number of left_outs is:',left_out)
            return left_out
        
    return left_out

The function named _decrease and check_ does the job of recognizing the university given the code, and either giving him/her one slot among those available in his/her preference, or returning False, in order to increase the overall left_out counter, which keeps track of how many test taker did not get into their first preference. 

In [145]:
def decrease_and_check(code,code_uni,uni_places):    
    key=code_uni.get(code)
    if uni_places[key]<0:
        #checks if all slots have been occupied with a higher score 
        return False
    else:
        uni_places[key]-=1
        return True

I also added some functions which helped gain a greater understanding of the situation when I was building the code. In order: <br>
- _get position_ : once the dictionary is modified with updated available places, it returns the first index at which the first university runs out of slots. 
- _find same score_ :  returns how many students have obtained the same score. This is useful when evaluating it with relevant places still available at someone's position. For example, since birth date is not known, by combining the uni_places modified dictionary after having run the until function, and this function, one could see places left and people with the same score who may or may not choose before due to age advantage. 
- _slots requested_ : counts for each university code how many people took preferences above one's score. Useful to see concentration of requests above one's performance. 

In [146]:
def get_position(ranking,code_uni,uni_places):
    for x in ranking:
        code=x[0][0:2]
        b=check(a,code_uni,uni_places)
        if b==False:
            return ranking.index(x)
        
        
def find_same_score(score,ranking):
    same=[]
    for i in range(len(ranking)):
        if ranking[i][1]==score:
            t=(i+1,ranking[i])
            same.append(t)
    return same


def slots_requested(score,ranking):
    code_requested={}
    for i in range(len(ranking)):
        if ranking[i][1]>=score:
            code_requested.setdefault(ranking[i][0][0:2],0)
            code_requested[ranking[i][0][0:2]]+=1
        else:
            return code_requested

### Results' imprecision

It is worth noting that this script does not in any way give 100% certainty of having or not having passed. The main issues mentioned and not mentioned before include but are not limited to the following: <br>
- tabula's accuracy in converting a pdf of 500 pages or more was not tested and taken with confidence as 100%. A failure in this process of data collection could lead to not reliable results. 
- having in hand only one's first preference, information is not complete. Second preferences and a higher score mean still the individual chooses first, so given a score, certainty of being accepted happens only if $left_{out}=0$, or if $left_{out}<\text{available slots in one's preference}$
- age is not known, and as stated by the rules of the exam, in case of equal scores, the younger test taker chooses first 
- many tests combine courses, so for example medicine tests are done together with dentistry tests, and both are given the same code. This causes even more uncertainty. As a rule of thumb for example, medicine is more competitive than dentistry, but who knows year by year what will be the given level of choices upon a given score for one or the other. However, dividing these preferences is impossible given that the code is the same, and the only way to consider this without compromising the whole system is probably to ignore this factor. Coming back to the example made before, on one side, given that medicine is more competitive than dentistry, a dentistry test taker could have confidence in having passed if the program gives a "good result", while for medicine, given that the proportion of test takers is a lot higher (less students try dentistry), the noisy contribution can be treated as negligible. 
- many test takers try the test without the intention to enroll (for fun or other reasons). 

Nevertheless, I would again state that these are not results that should be taken as 100% correct, but more of an indicator of performance with respect to other test takers. 

## Running the main function

Given the dimensions of the original rankings, I will run the functions on a subset of the pdf I left inside the Github repository of this project. Doing a complex task, tabula is fairly slow. I hope that some slots will get to zero as to show a possible "not accepted in first choice" solution. All documents are related to 2019 admission tests. I will use the two dictionaries below for the sake of simplicity, but an excel file with medicine places is available as well if someone wants to play with it. Names are always simplified to keep them short. 

In [147]:
#### these two are the codes, names and places dictionaries of 2019 architecture courses
### I kept them in the event someone wanted to use data without extracting it
#from a source
code_uni_arch={'48':'bari','38':'basilicata','03':'bologna','46':'brescia','04':'cagliari','05':'calabria','06':'callerino','08':'catania','53':'chieti','09':'ferrara','10':'firenze','11':'genova','55':"l'aquila",'16':'milano','18':'napoli','49':'campania','19':'padova','20':'palermo','21':'parma','22':'pavia','23':'perugia','24':'pisa','01':'marche','47':'reggio calabria','26':'roma 1','27':'roma 2','A7':'roma 3','28':'salerno','29':'sassari','32':'torino','62':'trento','33':'trieste','34':'udine','37':'venezia'}
uni_places_arch={'bari':147,'basilicata':85,'bologna':180,'brescia':60,'cagliari':110,'calabria':90,'callerino':92,'catania':195,'chieti':200,'ferrara':146,'firenze':450,'genova':162,"l'aquila":97,'milano':1073,'napoli':571,'campania':150,'padova':97,'palermo':240,'parma':120,'pavia':60,'perugia':78,'pisa':66,'marche':70,'reggio calabria':230,'roma 1':603,'roma 2':60,'roma 3':180,'salerno':70,'sassari':60,'torino':430,'trento':75,'trieste':45,'udine':97,'venezia':350}
#########
##copying the original places dictionary to have the updated version after one iteration
##and the original one
uni_places_arch_50=uni_places_arch.copy()
uni_places_arch_30=uni_places_arch.copy()

##


df=pdf_to_df(path='/Users/admin/Desktop/architecture_ranking.pdf',pages='1-50')
results=get_list(df)
sorted_ranking=rank(results)

##we check for two random scores, say 50 and 30



In [148]:
left_50=until(50,sorted_ranking,code_uni_arch,uni_places_arch_50)

number of left_outs is: 0


In [149]:
left_30=until(30,sorted_ranking,code_uni_arch,uni_places_arch_30)

number of left_outs is: 23


In [150]:
same_50=find_same_score(50,sorted_ranking)
print('IDs with same score of 50 points: \n',same_50)

IDs with same score of 50 points: 
 [(367, ('37AR9UACAVEIVUE', 50.0)), (368, ('37AR9YGRCWD84P7', 50.0)), (369, ('A7AR9Z7A1CYA4T7', 50.0)), (370, ('16AR97RGJT6CMX3', 50.0)), (371, ('16AR9HEVVNQQ7VB', 50.0)), (372, ('16AR9ULUN5C2R2J', 50.0)), (373, ('09AR9JDPGFAD1KA', 50.0))]


In [151]:
req_30=slots_requested(30,sorted_ranking)
print('Slots requested for each Uni once 30 points are reached: \n',req_30)

Slots requested for each Uni once 30 points are reached: 
 {'16': 1097, '62': 56, '26': 272, '47': 115, '37': 267, '19': 56, '49': 114, '09': 120, '32': 271, '11': 45, '48': 86, '18': 126, '03': 120, '27': 16, '38': 41, '28': 20, '10': 175, '24': 30, 'A7': 74, '34': 12, '53': 34, '01': 13, '46': 29, '04': 41, '29': 31, '21': 23, '20': 36, '22': 24, '08': 49, '23': 20, '55': 8, '33': 15, '06': 11, '05': 6}


We could, for example, check why the university identified by the 16 code has so many requests. 

In [152]:
print(code_uni_arch['16'],'number of available places: ',uni_places_arch[code_uni_arch['16']])

milano number of available places:  1073


Code 16 indeed is linked to Milan's University, located in one of the most wanted cities of Italy for students, and also being one of the best architecture courses across the countries, and one of the biggest, students-wise. 