# **Context**

![](https://bbs.binus.ac.id/international-business/wp-content/uploads/sites/3/2019/01/QS-ranking.jpg)

University of Indonesia, Bandung Institute of Technology, and Gadjah Mada University were respectively named the top 3 universities in Indonesia in 2019 by Quacquarelli Symonds (QS), one of the most popular World University Ranking publishers, so it goes without saying that these universities were among the most popular choices by the brightest students in the country. However, just how tough was it to get into these universities in 2019? I will attempt to answer that question by performing exploratory data analysis on the UTBK 2019 datasets.

First, we import all the relevant libraries and load the datasets themselves.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
majors_list = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/majors.csv")
universities_list = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/universities.csv")
science_scores = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/score_science.csv")
humanities_scores = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/score_humanities.csv")

Then, we check to see if there is any missing data.

In [None]:
majors_list.info()

In [None]:
universities_list.info()

In [None]:
science_scores.info()

In [None]:
humanities_scores.info()

So far we don't seem to have any missing data, which is good. We will check for invalid data and clean it up later.
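
As a complement to reading the `info()` output by eye, the missing values can also be counted directly. The snippet below is a minimal sketch on a made-up frame (`toy_scores` and its values are assumptions for illustration, not rows from the UTBK files):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the score tables (values are made up).
toy_scores = pd.DataFrame({
    "id_user": [1, 2, 3, 4],
    "score_mat": [550.0, np.nan, 610.0, 480.0],
    "score_kpu": [500.0, 520.0, np.nan, np.nan],
})

# Number of missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy_scores.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy_scores.isna().any().any())
print(has_missing)
```

The same two calls should work unchanged on `majors_list`, `universities_list`, `science_scores`, and `humanities_scores`.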

## **Data Wrangling**

In [None]:
majors_list.head()

In [None]:
universities_list.head()

We will first wrangle the data on university majors by merging the majors list with the universities list.

In [None]:
majors_list = majors_list.merge(universities_list, on='id_university').drop(['Unnamed: 0_x', 'Unnamed: 0_y'], axis=1)
majors_list = majors_list[['id_major', 'id_university', 'major_name', 'university_name', 'type', 'capacity']].set_index('id_major')
majors_list.head()

Generally, there are 3 types of admission for Indonesian public universities: non-test admission through high school report cards (SNMPTN Undangan), the national entrance exam for university admission (UTBK/SBMPTN), and independent admission tests; they are held in that order. For UTBK, the allocated quota is at least 40% of a major's capacity. Here, we assume that all universities allocated exactly 40% to UTBK, even though in reality some universities allocated more because the quota for non-test admission hadn't been filled.

In [None]:
majors_list['utbk_capacity'] = (0.4 * majors_list['capacity']).apply(int)
majors_list.head()

We will come back to the majors data later. Now, it's time to wrangle the applicants/test scores data.

In [None]:
science_scores.head()

In [None]:
humanities_scores.head()

We will add some columns for academic aptitude score, specialized basic ability score, and average score.

In [None]:
science_scores = science_scores[['id_first_major', 'id_first_university','id_second_major', 'id_second_university', 'id_user', 
                   'score_bio', 'score_fis', 'score_kim', 'score_mat',
                   'score_kmb', 'score_kpu', 'score_kua', 'score_ppu']].set_index('id_user')
humanities_scores = humanities_scores[['id_first_major', 'id_first_university','id_second_major', 'id_second_university', 'id_user', 'score_eko',
                   'score_geo', 'score_sej', 'score_sos', 'score_mat',
                   'score_kmb', 'score_kpu', 'score_kua', 'score_ppu']].set_index('id_user')

In [None]:
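# Select columns by name rather than by position: the humanities table has five
# specialized subjects (including score_mat), so positional slices would misalign.
aptitude_cols = ['score_kmb', 'score_kpu', 'score_kua', 'score_ppu']
science_cols = ['score_bio', 'score_fis', 'score_kim', 'score_mat']
humanities_cols = ['score_eko', 'score_geo', 'score_sej', 'score_sos', 'score_mat']

science_scores['specialized_score'] = science_scores[science_cols].mean(axis=1)
humanities_scores['specialized_score'] = humanities_scores[humanities_cols].mean(axis=1)

science_scores['aptitude_score'] = science_scores[aptitude_cols].mean(axis=1)
humanities_scores['aptitude_score'] = humanities_scores[aptitude_cols].mean(axis=1)

science_scores['average_score'] = science_scores[science_cols + aptitude_cols].mean(axis=1)
humanities_scores['average_score'] = humanities_scores[humanities_cols + aptitude_cols].mean(axis=1)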

In [None]:
science_scores.head()

In [None]:
humanities_scores.head()

As you can see, there are two exam types in Indonesia, divided by specialization. The first specialization is (natural) sciences and technology, and the second is social sciences and humanities. In the old days, applicants could take both specializations with a single ID. However, the mechanism in 2019 might have been different, so we will first check whether any IDs are shared between the two specialized tests before deciding how to wrangle the data further.

In [None]:
sum(science_scores.index.isin(humanities_scores.index))

It seems that there were no shared IDs, so we can simply concatenate the two datasets to make wrangling and analysis easier.

In [None]:
science_scores['type'] = 'science'
humanities_scores['type'] = 'humanities'

test_scores = pd.concat([science_scores, humanities_scores])
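
The overlap check and the concatenation can also be written so that pandas itself guards against duplicate IDs. This is only a sketch on toy frames (`toy_science`, `toy_humanities`, and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two score tables, indexed by a made-up id_user.
toy_science = pd.DataFrame({"average_score": [610.0, 540.0]},
                           index=pd.Index([101, 102], name="id_user"))
toy_humanities = pd.DataFrame({"average_score": [580.0, 500.0]},
                              index=pd.Index([201, 202], name="id_user"))

# IDs present in both tables; an empty result means no applicant sat both exams.
shared_ids = toy_science.index.intersection(toy_humanities.index)
print(len(shared_ids))

# verify_integrity=True makes concat raise a ValueError if any index value repeats.
combined = pd.concat(
    [toy_science.assign(type="science"), toy_humanities.assign(type="humanities")],
    verify_integrity=True,
)
print(combined)
```

With `verify_integrity=True`, a shared ID would surface as an error at concatenation time instead of silently producing a duplicated index.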

In [None]:
test_scores = test_scores[['id_first_major', 'id_first_university', 'id_second_major','id_second_university', 
                           'score_bio', 'score_fis', 'score_kim',
                           'score_eko','score_geo', 'score_sej', 'score_sos',
                           'score_mat', 'score_kmb', 'score_kpu', 'score_kua', 'score_ppu',
                           'specialized_score', 'aptitude_score', 'average_score', 'type']]
test_scores.head()

Great! Now we only have to assign the major and university names for the choices made by applicants. 

In [None]:
test_scores = pd.merge(test_scores, majors_list[['major_name', 'university_name']], left_on='id_first_major', 
                       right_on=majors_list.index, how='left')
test_scores = pd.merge(test_scores, majors_list[['major_name', 'university_name']], left_on='id_second_major', 
                       right_on=majors_list.index, how='left', suffixes=('_1', '_2'))

In [None]:
test_scores.head()

There's a chance that some rows are invalid, with a major ID that doesn't match any major in our majors dataset, so we will inspect that first.

In [None]:
sum(test_scores['major_name_1'].isna())

In [None]:
sum(test_scores['major_name_2'].isna())
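
To see which rows failed to match rather than just how many, one option is the `indicator` flag of `merge`. The sketch below uses made-up toy tables (`toy_applicants`, `toy_majors`, and the ID 999 are assumptions for illustration):

```python
import pandas as pd

# Toy data: major 999 does not exist in the majors table (all values made up).
toy_applicants = pd.DataFrame({"id_user": [1, 2, 3],
                               "id_first_major": [10, 20, 999]})
toy_majors = pd.DataFrame({"id_major": [10, 20],
                           "major_name": ["MATEMATIKA", "FISIKA"]})

# indicator=True adds a "_merge" column telling where each row's key was found.
checked = toy_applicants.merge(toy_majors, left_on="id_first_major",
                               right_on="id_major", how="left", indicator=True)

# "left_only" rows are applicants whose major ID has no match in toy_majors.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["id_user", "id_first_major"]])
```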

It turns out our suspicion was right. To keep the analysis simple, we will just drop the rows without a valid major for either choice 1 or choice 2, since there are only a few of them.

In [None]:
test_scores = test_scores.dropna(subset=['major_name_1', 'major_name_2'], axis=0)

In [None]:
sum(test_scores['major_name_1'].isna())

In [None]:
sum(test_scores['major_name_2'].isna())

We will now perform data wrangling on both datasets regarding various admission variables: admission status (for applicants), passing grade, accepted applicants, total competing applicants, and acceptance rate (for majors).

*Just a quick note: this acceptance rate here will only account for total competing applicants. By "total competing applicants", I mean applicants who were "actively" competing for the quota allocated, i.e. the applicants who already got accepted in their first choice aren't going to be counted in the total competing applicants for their second choice. I personally believe that by calculating the acceptance rate this way, we will get a number that better reflects the acceptance rate or selectivity of a certain choice.*

In [None]:
test_scores = test_scores.sort_values('average_score', ascending = False)

test_scores['status'] = None  # object column; it will hold status strings, not floats
majors_list['accepted_applicants'] = 0
majors_list['total_competing_applicants'] = 0
majors_list['passing_grade'] = np.nan

In [None]:
for applicant in test_scores.index:
    first_major = test_scores.loc[applicant, 'id_first_major']
    second_major = test_scores.loc[applicant, 'id_second_major']
    first_major_capacity = majors_list.loc[first_major, 'utbk_capacity']
    second_major_capacity = majors_list.loc[second_major, 'utbk_capacity']
    first_major_accepted = majors_list.loc[first_major, 'accepted_applicants']
    second_major_accepted = majors_list.loc[second_major, 'accepted_applicants']
    if first_major_accepted < first_major_capacity:
        majors_list.loc[first_major, 'accepted_applicants'] += 1
        majors_list.loc[first_major, 'total_competing_applicants'] += 1
        majors_list.loc[first_major, 'passing_grade'] = test_scores.loc[applicant, 'average_score']
        test_scores.loc[applicant, 'status'] = "Accepted First Choice"
    elif second_major_accepted < second_major_capacity:
        majors_list.loc[second_major, 'accepted_applicants'] += 1
        majors_list.loc[first_major, 'total_competing_applicants'] += 1
        majors_list.loc[second_major, 'total_competing_applicants'] += 1
        majors_list.loc[second_major, 'passing_grade'] = test_scores.loc[applicant, 'average_score']
        test_scores.loc[applicant, 'status'] = "Accepted Second Choice"
    else:
        majors_list.loc[first_major, 'total_competing_applicants'] += 1
        majors_list.loc[second_major, 'total_competing_applicants'] += 1
        test_scores.loc[applicant, 'status'] = "Failed"

That took a very long time to run, but that was to be expected given the size of the data. We will now check whether we have what we want.
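
As an aside on the run time: much of the cost likely comes from the repeated `.loc` lookups and writes inside the loop. A rough sketch of the same greedy allocation on plain dictionaries is shown below. The toy majors `"A"` and `"B"`, their capacities, and the four applicants are made up for illustration, and the speed-up on the real data is an expectation rather than something measured here:

```python
import pandas as pd

# Toy inputs (made up): two majors with one seat each, four applicants.
toy_capacity = {"A": 1, "B": 1}
toy_applicants = pd.DataFrame({
    "id_first_major": ["A", "A", "A", "B"],
    "id_second_major": ["B", "B", "B", "A"],
    "average_score": [700.0, 650.0, 600.0, 550.0],
}, index=[1, 2, 3, 4])

accepted = {m: 0 for m in toy_capacity}
competing = {m: 0 for m in toy_capacity}
passing_grade = {}
status = {}

# Highest scorers pick first, exactly as in the loop above.
ranked = toy_applicants.sort_values("average_score", ascending=False)
for row in ranked.itertuples():
    first, second = row.id_first_major, row.id_second_major
    competing[first] += 1  # everyone competes for their first choice
    if accepted[first] < toy_capacity[first]:
        accepted[first] += 1
        passing_grade[first] = row.average_score
        status[row.Index] = "Accepted First Choice"
        continue
    competing[second] += 1  # only reached when the first choice is already full
    if accepted[second] < toy_capacity[second]:
        accepted[second] += 1
        passing_grade[second] = row.average_score
        status[row.Index] = "Accepted Second Choice"
    else:
        status[row.Index] = "Failed"

print(status)
print(competing, passing_grade)
```

The dictionaries can then be written back in one step each, for example with `pd.Series(status)` aligned on the applicant index.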

In [None]:
test_scores.head()

In [None]:
majors_list.head()

Whoops, we haven't added the acceptance rate column yet.

In [None]:
majors_list['acceptance_rate'] = majors_list['accepted_applicants'] / majors_list['total_competing_applicants']

In [None]:
majors_list = majors_list[['id_university', 'major_name', 'university_name', 'type', 'capacity',
       'utbk_capacity', 'accepted_applicants', 'total_competing_applicants',
       'acceptance_rate', 'passing_grade']]
majors_list.head()

We are done with the data wrangling! Now it's time for the analysis.

## **Analysis - General**

Before we proceed to analyzing by majors, let us first analyze the applicants' scores. First, we will do so with the average scores.

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=test_scores, x='average_score', hue='type', fill=True);  # fill replaces the deprecated shade

From what we can see, the science cluster applicants overall tend to perform *only very slightly* better than the humanities cluster applicants judging from the average scores, but is this statistically significant? Let's find out.

In [None]:
# Computing the t and p values using scipy 
from scipy import stats

t, p = stats.ttest_ind(test_scores.loc[test_scores['type'] == 'science', 'average_score'].values,
                       test_scores.loc[test_scores['type'] == 'humanities', 'average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

Conventionally, we use a significance level of 0.05 (corresponding to 95% confidence), which means that we treat the difference between the two groups as statistically significant if the p-value is less than 0.05. Here, we see that the **p-value is very close to 0**, which suggests that there is good evidence to **REJECT the Null Hypothesis** that the two groups have the same mean. In other words, there is a statistically significant difference between the two groups. The **t-test** shows that the scores for these two groups are significantly different and that the science cluster applicants outperformed the humanities cluster applicants, even if only slightly.
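
With samples this large, even a tiny difference in means can give a p-value near 0, so it can help to report an effect size alongside the test. The snippet below is a minimal sketch of Cohen's d with a pooled standard deviation; the helper name `cohens_d` and the toy scores are assumptions for illustration and are not part of the original analysis:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up scores, only to show the call; not taken from the UTBK data.
toy_science = [520.0, 540.0, 560.0, 580.0]
toy_humanities = [510.0, 530.0, 550.0, 570.0]
d = cohens_d(toy_science, toy_humanities)
print(round(d, 3))
```

On the real data, the two arrays passed to `stats.ttest_ind` could be passed to such a helper as well.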

Let's perform the same analysis for both academic aptitude scores and specialized scores.

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=test_scores, x='aptitude_score', hue='type', fill=True);  # fill replaces the deprecated shade

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(test_scores.loc[test_scores['type'] == 'science', 'aptitude_score'].values,
                       test_scores.loc[test_scores['type'] == 'humanities', 'aptitude_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

Here we see again that the **p-value is virtually 0**, which suggests that there is good evidence to **REJECT the Null Hypothesis**. The **t-test** shows that the scores for these two groups are significantly different and that the science cluster applicants performed better than the humanities cluster applicants in the academic aptitude test.

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=test_scores, x='specialized_score', hue='type', fill=True);  # fill replaces the deprecated shade

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(test_scores.loc[test_scores['type'] == 'science', 'specialized_score'].values,
                       test_scores.loc[test_scores['type'] == 'humanities', 'specialized_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

Yet another case of the **p-value being very close to 0**, which suggests that there is good evidence to **REJECT the Null Hypothesis**. The **t-test** shows that the scores for these two groups are significantly different. This time, however, we find that the science cluster applicants typically performed worse than the humanities cluster applicants on the specialized test.
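
All three tests above pass `equal_var=False`, which makes `ttest_ind` run Welch's t-test instead of the pooled-variance Student's t-test. For reference, the statistic and its approximate degrees of freedom can be sketched by hand as follows (the helper name `welch_t` and the toy inputs are assumptions for illustration):

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of the two sample means.
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up numbers chosen so the arithmetic is easy to follow.
t_toy, df_toy = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_toy, df_toy)
```

Unlike the pooled test, this version does not assume that the two clusters have equal score variances.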

Now, let's see how many applicants were admitted into one of their choices and how many were not.

In [None]:
plt.style.use('ggplot')
ax = test_scores['status'].value_counts().plot.pie(figsize=(8,8),
                                                   autopct='%.2f%%',
                                                   shadow=True,
                                                   explode = (0, 0.1, 0))
ax.set_ylabel('')
ax.set_title('Admission Result - Overall');

Obviously, more applicants were rejected than accepted, but how is this distributed across the specialization clusters?

In [None]:
ax = test_scores[test_scores['type'] == 'science']['status'].value_counts().plot.pie(figsize=(8,8),
                                                                                   autopct='%.2f%%',
                                                                                   shadow=True,
                                                                                   explode = (0, 0.1, 0))
ax.set_ylabel('')
ax.set_title('Admission Result - Science Cluster');

In [None]:
ax = test_scores[test_scores['type'] == 'humanities']['status'].value_counts().plot.pie(figsize=(8,8),
                                                                                   autopct='%.2f%%',
                                                                                   shadow=True,
                                                                                   explode = (0, 0.1, 0))
ax.set_ylabel('')
ax.set_title('Admission Result - Humanities Cluster');

We can see that a larger share of humanities cluster applicants got accepted into one of their choices (whether first or second) than of science cluster applicants, even though science cluster applicants overall tend to perform better in the test than their humanities cluster counterparts.
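
The per-cluster shares behind such pie charts can also be put side by side in a single table. Here is a small sketch with `pd.crosstab` on made-up rows (the frame `toy` and its values are assumptions for illustration, using the same `type` and `status` labels as above):

```python
import pandas as pd

# Made-up applicants: four in science, two in humanities.
toy = pd.DataFrame({
    "type": ["science", "science", "science", "science", "humanities", "humanities"],
    "status": ["Failed", "Failed", "Failed", "Accepted First Choice",
               "Failed", "Accepted Second Choice"],
})

# normalize="index" turns each row (cluster) into shares that sum to 1.
status_share = pd.crosstab(toy["type"], toy["status"], normalize="index")
print(status_share)

# Share admitted through either choice, per cluster.
accepted_share = 1 - status_share["Failed"]
print(accepted_share)
```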

Alright, now we will move on to analyze the majors/universities. First, we will find out which majors have the highest number of total competing applicants.

In [None]:
majors_list['major_univ_name'] = majors_list['major_name'] + " - " + majors_list['university_name']

plt.figure(figsize=(12, 5))

sns.barplot(x='total_competing_applicants', y='major_univ_name', data=majors_list.nlargest(20, 'total_competing_applicants'))

plt.title('Top 20 Choices Among Applicants in UTBK 2019')
plt.ylabel('Major and University Name')
plt.xlabel('Total Competing Applicants');

Curiously, among the five majors with the highest number of applicants, UI is the only one of the top 3 universities that made it into the list. Let's also check the rankings when we split by cluster/type.

In [None]:
plt.figure(figsize=(12, 5))

sns.barplot(x='total_competing_applicants', y='major_univ_name', 
            data=majors_list[majors_list['type'] == 'science'].nlargest(20, 'total_competing_applicants'))

plt.title('Top 20 Choices Among Science Cluster Applicants in UTBK 2019')
plt.ylabel('Major and University Name')
plt.xlabel('Total Competing Applicants');

In [None]:
plt.figure(figsize=(12, 5))

sns.barplot(x='total_competing_applicants', y='major_univ_name', 
            data=majors_list[majors_list['type'] == 'humanities'].nlargest(20, 'total_competing_applicants'))

plt.title('Top 20 Choices Among Humanities Cluster Applicants in UTBK 2019')
plt.ylabel('Major and University Name')
plt.xlabel('Total Competing Applicants');

Again, while the top 3 universities are among the top 20 choices, they aren't exactly the most popular, most likely because average (and below-average) applicants would gravitate toward universities outside the top 3 to improve their chances of admission.

Now, let's explore the majors offered in UTBK with the lowest (most selective) acceptance rates.

In [None]:
majors_list.nsmallest(20, 'acceptance_rate')[['major_name', 'university_name', 'acceptance_rate']].reset_index()

We see that UI tops and dominates the list. ITB, despite [being the most selective university in 2009](https://en.wikipedia.org/wiki/Bandung_Institute_of_Technology#Academics), doesn't actually make the cut. Let's see what happens when we split the list by cluster.

In [None]:
majors_list[majors_list['type'] == 'science'].nsmallest(20, 'acceptance_rate')[['major_name', 'university_name', 'acceptance_rate']].reset_index()

In [None]:
majors_list[majors_list['type'] == 'humanities'].nsmallest(20, 'acceptance_rate')[['major_name', 'university_name', 'acceptance_rate']].reset_index()

ITB is still nowhere to be found in the rankings! Alright then, how about we perform an aggregate analysis per university?

First, we see which universities had the highest number of total competing applicants.

In [None]:
total_applicants = majors_list[['university_name', 'total_competing_applicants']].groupby('university_name').sum().reset_index()

plt.figure(figsize=(12, 5))

sns.barplot(x='total_competing_applicants', y='university_name', 
            data=total_applicants.nlargest(20, 'total_competing_applicants'))

plt.title('Top 20 Universities with the Most Applicants')
plt.ylabel('University Name')
plt.xlabel('Total Competing Applicants');

Next, we find out the acceptance rate for each university and rank them.

In [None]:
total_capacity = majors_list[['university_name', 'utbk_capacity']].groupby('university_name').sum().reset_index()
universities_data = pd.merge(total_capacity, total_applicants, on='university_name')
universities_data['acceptance_rate'] = universities_data['utbk_capacity'] / universities_data['total_competing_applicants']

In [None]:
universities_data.nsmallest(20, 'acceptance_rate').reset_index()

ITB actually ranked 13th in terms of (aggregate) selectivity in 2019 with an acceptance rate of around 11.7%, 12 places below its position ten years earlier. UI, on the other hand, was the most selective university overall in 2019 when judged by aggregate acceptance rate.
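
Note that the aggregate rate used here is the summed UTBK capacity divided by the summed competing applicants, which in general is not the same as averaging the per-major acceptance rates. The toy sketch below illustrates the difference (universities `"X"` and `"Y"` and all numbers are made up):

```python
import pandas as pd

# Made-up majors: university X has one very crowded and one quiet major.
toy_majors = pd.DataFrame({
    "university_name": ["X", "X", "Y"],
    "utbk_capacity": [10, 40, 20],
    "total_competing_applicants": [1000, 100, 200],
})

# Aggregate rate: sum both columns per university first, then divide.
per_univ = toy_majors.groupby("university_name")[["utbk_capacity", "total_competing_applicants"]].sum()
per_univ["acceptance_rate"] = per_univ["utbk_capacity"] / per_univ["total_competing_applicants"]
print(per_univ)

# Unweighted mean of the per-major rates, for comparison.
per_major_rate = toy_majors["utbk_capacity"] / toy_majors["total_competing_applicants"]
mean_of_rates = per_major_rate.groupby(toy_majors["university_name"]).mean()
print(mean_of_rates)
```

In this toy case, university X looks far more selective in aggregate than the mean of its two per-major rates suggests, because the crowded major dominates the sums.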

Selectivity, however, isn't only measured by acceptance rate, but also by the passing grade. So, how do passing grades compare between majors?

We will split by cluster to see which majors were the hardest to get into in terms of passing grade. First, we start with the science cluster.

In [None]:
majors_list[majors_list['type'] == 'science'].sort_values(['passing_grade', 'acceptance_rate'], ascending=False).head(20)[['major_name', 'university_name', 'passing_grade', 'acceptance_rate']].reset_index()

The 3 highest passing grades all came from UI, UGM, and ITB respectively. UI and ITB dominated the rankings of the highest passing grades in science.

In [None]:
majors_list[majors_list['type'] == 'humanities'].sort_values(['passing_grade', 'acceptance_rate'], ascending=False).head(20)[['major_name', 'university_name', 'passing_grade', 'acceptance_rate']].reset_index()

All of the 20 highest passing grades in the humanities cluster were from UI, UGM, and ITB, with UI (followed by UGM) overwhelmingly dominating the list.

When considering only the aggregate, how does the minimum passing grade compare between universities? For instance, how high a grade do you need if you only care about getting into one of these universities, regardless of the major?

In [None]:
majors_list[['university_name', 'passing_grade']].groupby('university_name').min().nlargest(20, 'passing_grade').reset_index()

ITB, UI, and UGM ranked 1st, 2nd, and 4th respectively. They did live up to their reputation as the top 3 universities in 2019.

Now, we will analyze each university in turn, starting with UI, then ITB, and UGM.
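
Since the same filtering is repeated for each of the three universities, it could be wrapped in a small helper. The sketch below is one possible shape; the function name `accepted_into`, the toy frame, and the placeholder university names `"U1"` and `"U2"` are hypothetical and not part of the original notebook:

```python
import pandas as pd

def accepted_into(scores, university_name):
    """Rows of `scores` admitted to `university_name` via either choice."""
    first = (scores["university_name_1"] == university_name) & (scores["status"] == "Accepted First Choice")
    second = (scores["university_name_2"] == university_name) & (scores["status"] == "Accepted Second Choice")
    return scores[first | second]

# Made-up applicants using the same column names as test_scores.
toy_scores = pd.DataFrame({
    "university_name_1": ["U1", "U2", "U1", "U2"],
    "university_name_2": ["U2", "U1", "U2", "U1"],
    "status": ["Accepted First Choice", "Accepted Second Choice",
               "Failed", "Accepted First Choice"],
}, index=[1, 2, 3, 4])

u1_accepted = accepted_into(toy_scores, "U1")
non_u1_accepted = toy_scores[~toy_scores.index.isin(u1_accepted.index)]
print(list(u1_accepted.index), list(non_u1_accepted.index))
```

One difference from concatenating two filtered frames is that the helper keeps the rows in their original order instead of listing first-choice admits before second-choice admits.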

## **Analysis - University of Indonesia**

Let's see the distribution of the applicants who got accepted into UI.

In [None]:
ui_first_choice = test_scores[(test_scores['university_name_1'] == 'UNIVERSITAS INDONESIA') & (test_scores['status'] == 'Accepted First Choice')]
ui_second_choice = test_scores[(test_scores['university_name_2'] == 'UNIVERSITAS INDONESIA') & (test_scores['status'] == 'Accepted Second Choice')]
ui_accepted = pd.concat([ui_first_choice, ui_second_choice])

In [None]:
plt.style.use('ggplot')
ax = ui_accepted['status'].value_counts().plot.pie(figsize=(8,8),
                                                   autopct='%.2f%%',
                                                   shadow=True,
                                                   explode=(0.1, 0))
ax.set_ylabel('')
ax.set_title('UTBK 2019 Applicants Accepted into UI');

The majority of applicants who got accepted into UI had listed a UI major as their first choice, but around 25% still got into UI through their second choice. Note that their first choice might have been another UI major or a major at a completely different university.

We will see the distribution of the test scores (average scores) of applicants that got accepted into UI.

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=ui_accepted, x='type', y='average_score')

plt.title('Test Scores of Applicants Accepted into UI through UTBK 2019')
plt.ylabel('Score')
plt.xlabel('Cluster/Type');

Generally, the science and humanities applicants who got accepted into UI performed roughly the same, with science applicants performing slightly better. However, there are some extreme outliers among the science applicants who scored above 850 in UTBK 2019.

We will now compare applicants who got accepted into UI with applicants who didn't get into UI--either people who failed to get into UI, chose a UI major as a second choice but got into their first choice, or simply didn't choose UI in the first place.

In [None]:
non_ui_accepted = test_scores[~test_scores.index.isin(ui_accepted.index)]

In [None]:
non_ui_accepted.head()

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=ui_accepted, x='average_score', fill=True, label="UI")  # fill replaces the deprecated shade
sns.kdeplot(data=non_ui_accepted, x='average_score', fill=True, label="Non-UI")

plt.legend();

Just from the graph above, we can tell that applicants that got accepted into UI generally performed better than applicants that didn't get into UI. However, we will make sure first by performing hypothesis testing.

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ui_accepted['average_score'].values,
                       non_ui_accepted['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

We can safely reject the Null Hypothesis and say that, on average, applicants who got admitted into UI outperformed those who didn't. How does that look across the different clusters/types, anyway?

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=ui_accepted[ui_accepted['type'] == 'science'], x='average_score', fill=True, label="UI")  # fill replaces the deprecated shade
sns.kdeplot(data=non_ui_accepted[non_ui_accepted['type'] == 'science'], x='average_score', fill=True, label="Non-UI")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'science']['average_score'].values,
                       non_ui_accepted[non_ui_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=ui_accepted[ui_accepted['type'] == 'humanities'], x='average_score', fill=True, label="UI")  # fill replaces the deprecated shade
sns.kdeplot(data=non_ui_accepted[non_ui_accepted['type'] == 'humanities'], x='average_score', fill=True, label="Non-UI")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'humanities']['average_score'].values,
                       non_ui_accepted[non_ui_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

We can see that the result is generally the same as the overall one, with humanities applicants who got accepted into UI showing a larger score gap over the humanities applicants who didn't get into UI.
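
The size of that gap can be read off numerically instead of from the plots by comparing group means per cluster. The snippet below is a sketch on made-up numbers (the frame `toy`, its `admitted` column, and all values are assumptions for illustration):

```python
import pandas as pd

# Made-up applicants: one admitted and two not admitted per cluster.
toy = pd.DataFrame({
    "type": ["science", "science", "science", "humanities", "humanities", "humanities"],
    "admitted": ["yes", "no", "no", "yes", "no", "no"],
    "average_score": [700.0, 600.0, 560.0, 690.0, 540.0, 500.0],
})

# Mean score per (cluster, admitted) pair, with the admitted flag as columns.
means = toy.groupby(["type", "admitted"])["average_score"].mean().unstack("admitted")

# Difference in mean score between admitted and non-admitted applicants.
gap = means["yes"] - means["no"]
print(gap)
```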

Now that we have looked into the general overview of applicants, shall we see the passing grades for each major in UI as well? We'll divide it by science and humanities majors.

In [None]:
majors_list[(majors_list['university_name'] == 'UNIVERSITAS INDONESIA') & (majors_list['type'] == 'science')].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()

From all the valid data, *Pendidikan Dokter* (Medical School) has the highest passing grade in UI for the science cluster, while *Ilmu Keperawatan* (Nursing Science) has the lowest passing grade--though still pretty high at 630.5, considering that the mean of the science cluster test scores is below 600 according to the earlier KDE plot.

In [None]:
majors_list[(majors_list['university_name'] == 'UNIVERSITAS INDONESIA') & (majors_list['type'] == 'humanities')].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()

From all the valid data, *Hubungan Internasional* (International Relations) has the highest passing grade in UI for the humanities cluster, while *Sastra Daerah untuk Sastra Jawa* (Javanese Literature) has the lowest passing grade, albeit, again, still higher than the average test score judging from the KDE plot earlier.

## **Analysis - Bandung Institute of Technology**

Let's see the distribution of the applicants who got accepted into ITB.

In [None]:
itb_first_choice = test_scores[(test_scores['university_name_1'] == 'INSTITUT TEKNOLOGI BANDUNG') & (test_scores['status'] == 'Accepted First Choice')]
itb_second_choice = test_scores[(test_scores['university_name_2'] == 'INSTITUT TEKNOLOGI BANDUNG') & (test_scores['status'] == 'Accepted Second Choice')]
itb_accepted = pd.concat([itb_first_choice, itb_second_choice])

In [None]:
plt.style.use('ggplot')
ax = itb_accepted['status'].value_counts().plot.pie(figsize=(8,8),
                                                   autopct='%.2f%%',
                                                   shadow=True,
                                                   explode=(0.1, 0))
ax.set_ylabel('')
ax.set_title('UTBK 2019 Applicants Accepted into ITB');

From what we can see here, a smaller percentage of applicants got accepted into ITB through their second choice than was the case for UI, but there was still a chance of getting into ITB even if you made ITB your second choice (whether or not your first choice was also ITB).

We will see the distribution of the test scores (average scores) of applicants that got accepted into ITB.

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=itb_accepted, x='type', y='average_score')

plt.title('Test Scores of Applicants Accepted into ITB through UTBK 2019')
plt.ylabel('Score')
plt.xlabel('Cluster/Type');

From a simple visual inspection, we can see that the science applicants who got into ITB performed better in the test than their humanities counterparts. On the other hand, humanities applicants who got accepted into ITB had a smaller variance in test scores, which is understandable because ITB only offered three humanities majors/study programs compared to its many science majors/study programs.

We will now compare applicants who got accepted into ITB with applicants who didn't get into ITB--either people who failed to get into ITB, chose an ITB major as a second choice but got into their first choice, or simply didn't choose ITB in the first place.

In [None]:
non_itb_accepted = test_scores[~test_scores.index.isin(itb_accepted.index)]

In [None]:
non_itb_accepted.head()

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=itb_accepted, x='average_score', fill=True, label="ITB")  # fill replaces the deprecated shade
sns.kdeplot(data=non_itb_accepted, x='average_score', fill=True, label="Non-ITB")

plt.legend();

From a quick glance, we can see that applicants who got into ITB scored better on average than the rest of the test takers. But as usual, we can't be so sure, so now we will perform hypothesis testing.

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(itb_accepted['average_score'].values,
                       non_itb_accepted['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

We can now confirm that applicants who got into ITB generally outperformed the rest of the test takers. Let's inspect that by cluster/type.

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=itb_accepted[itb_accepted['type'] == 'science'], x='average_score', fill=True, label="ITB")  # fill replaces the deprecated shade
sns.kdeplot(data=non_itb_accepted[non_itb_accepted['type'] == 'science'], x='average_score', fill=True, label="Non-ITB")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(itb_accepted[itb_accepted['type'] == 'science']['average_score'].values,
                       non_itb_accepted[non_itb_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
sns.kdeplot(data=itb_accepted[itb_accepted['type'] == 'humanities'], x='average_score', fill=True, label="ITB")  # fill replaces the deprecated shade
sns.kdeplot(data=non_itb_accepted[non_itb_accepted['type'] == 'humanities'], x='average_score', fill=True, label="Non-ITB")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(itb_accepted[itb_accepted['type'] == 'humanities']['average_score'].values,
                       non_itb_accepted[non_itb_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

We can see that the result is generally the same as the overall one, with science applicants who got accepted into ITB showing a larger score gap over the science applicants who didn't get into ITB.

Now, we will look at the passing grades for each major in ITB, split into science and humanities "majors". Note that ITB used a different kind of admission: successful applicants were first admitted into a faculty/school and only entered a major after one year at ITB.

In [None]:
# Combine both conditions into one mask to avoid pandas' boolean-reindexing warning
itb_science = (majors_list['university_name'] == 'INSTITUT TEKNOLOGI BANDUNG') & (majors_list['type'] == 'science')
majors_list[itb_science].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()

From all the valid data, *Fakultas Teknologi Industri - Kampus Ganesa* (Faculty of Industrial Technology - Ganesa Campus) has the highest passing grade in ITB for the science cluster, while *Sekolah Arsitektur, Perencanaan, dan Pengembangan Kebijakan - Kampus Cirebon* (School of Architecture, Planning, and Policy Development - Cirebon Campus) has the lowest. Even that lowest passing grade is still pretty high at 638, considering that the mean of science cluster test scores is below 600 in the earlier KDE plot.

In [None]:
# Combine both conditions into one mask to avoid pandas' boolean-reindexing warning
itb_humanities = (majors_list['university_name'] == 'INSTITUT TEKNOLOGI BANDUNG') & (majors_list['type'] == 'humanities')
majors_list[itb_humanities].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()

From all the valid data, *Sekolah Bisnis dan Manajemen* (School of Business and Management) has the highest passing grade in ITB for the humanities cluster, while *Fakultas Seni Rupa dan Desain - Kampus Cirebon* (Faculty of Art and Design - Cirebon Campus) has the lowest. Even that lowest passing grade is still pretty high at 641.25, considering that the mean of humanities cluster test scores is below 600 in the earlier KDE plot.
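
The two passing-grade lookups above differ only in the cluster, so they could be wrapped in a small function. The sketch below is one possible way to do it: the function name `passing_grades` and the toy rows are hypothetical, while the column names are the ones used in `majors_list`. It also uses `reset_index(drop=True)`, which discards the old index instead of keeping it as a column.

```python
import pandas as pd


def passing_grades(majors, university, cluster):
    """Hypothetical helper: majors of one university and cluster, highest passing grade first."""
    # A single combined mask avoids chaining two boolean filters
    mask = (majors["university_name"] == university) & (majors["type"] == cluster)
    cols = ["major_name", "passing_grade", "acceptance_rate"]
    return (
        majors.loc[mask, cols]
        .sort_values(["passing_grade", "acceptance_rate"], ascending=False)
        .reset_index(drop=True)
    )


# Toy data (made-up values, for illustration only)
toy_majors = pd.DataFrame({
    "university_name": ["UNIV A", "UNIV A", "UNIV A", "UNIV B"],
    "type": ["science", "science", "humanities", "science"],
    "major_name": ["Major Y", "Major X", "Major Z", "Major W"],
    "passing_grade": [610.0, 650.0, 640.0, 600.0],
    "acceptance_rate": [0.10, 0.05, 0.08, 0.20],
})

table = passing_grades(toy_majors, "UNIV A", "science")
print(table)
```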

## **Analysis - Gadjah Mada University**

Let's see the distribution of the applicants who got accepted into UGM.

In [None]:
# Combine both conditions into one mask instead of chaining two boolean filters
ugm_first_choice = test_scores[(test_scores['university_name_1'] == 'UNIVERSITAS GADJAH MADA') & (test_scores['status'] == 'Accepted First Choice')]
ugm_second_choice = test_scores[(test_scores['university_name_2'] == 'UNIVERSITAS GADJAH MADA') & (test_scores['status'] == 'Accepted Second Choice')]
ugm_accepted = pd.concat([ugm_first_choice, ugm_second_choice])

In [None]:
plt.style.use('ggplot')
ax = ugm_accepted['status'].value_counts().plot.pie(figsize=(8,8),
                                                   autopct='%.2f%%',
                                                   shadow=True,
                                                   explode=(0.1, 0))
ax.set_ylabel('')
ax.set_title('UTBK 2019 Applicants Accepted into UGM');

Again, most applicants accepted into UGM had listed their UGM major as their first choice. The share accepted through their second choice (26.17%) was nevertheless higher than at UI (25.46%) and ITB (14.75%).
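
The percentages shown in the pie chart can also be computed directly. Here is a minimal sketch on a toy frame, assuming the same `university_name_1`, `university_name_2`, and `status` columns as `test_scores`. The university codes and the `"Not Accepted"` label are invented for the example.

```python
import pandas as pd

# Toy applicants (made-up rows; "Not Accepted" is a hypothetical status label)
toy_scores = pd.DataFrame({
    "university_name_1": ["U", "U", "U", "V", "V"],
    "university_name_2": ["V", "V", "V", "U", "U"],
    "status": [
        "Accepted First Choice",
        "Accepted First Choice",
        "Accepted First Choice",
        "Accepted Second Choice",
        "Not Accepted",
    ],
})

target = "U"
# Accepted into the target either as first choice or as second choice
first = (toy_scores["university_name_1"] == target) & (toy_scores["status"] == "Accepted First Choice")
second = (toy_scores["university_name_2"] == target) & (toy_scores["status"] == "Accepted Second Choice")
accepted = toy_scores[first | second]

# Share of each acceptance route, in percent
shares = accepted["status"].value_counts(normalize=True).mul(100).round(2)
print(shares)
```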

We will see the distribution of the test scores (average scores) of applicants that got accepted into UGM.

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=ugm_accepted, x='type', y='average_score')

plt.title('Test Scores of Applicants Accepted into UGM through UTBK 2019')
plt.ylabel('Score')
plt.xlabel('Cluster/Type');

In contrast to ITB, humanities applicants accepted into UGM tended to perform *slightly* better in the test than their science counterparts, though the accepted science applicants had more (and rather extreme) outliers.

We will now compare applicants who got accepted into UGM with applicants who did not: people who failed to get into UGM, people who chose a UGM major as their second choice but got into their first choice, and people who did not choose UGM at all.
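
One way to take such a complement is an index-based mask. The sketch below uses a toy frame with made-up values and assumes the index is unique. With duplicated index labels, `isin` would drop every row sharing a label with an accepted row.

```python
import pandas as pd

# Toy frame with an explicit, unique index (made-up values)
toy_scores = pd.DataFrame({"average_score": [700, 650, 600, 550]}, index=[10, 11, 12, 13])
accepted = toy_scores.loc[[10, 12]]

# The complement is only reliable when index labels are unique
assert toy_scores.index.is_unique
rest = toy_scores[~toy_scores.index.isin(accepted.index)]
print(rest)
```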

In [None]:
non_ugm_accepted = test_scores[~test_scores.index.isin(ugm_accepted.index)]

In [None]:
non_ugm_accepted.head()

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data=ugm_accepted, x='average_score', fill=True, label="UGM")
sns.kdeplot(data=non_ugm_accepted, x='average_score', fill=True, label="Non-UGM")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ugm_accepted['average_score'].values,
                       non_ugm_accepted['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

Again, we can say that the applicants that got accepted into UGM generally performed better in the test than the other test takers.

Like before, we will also perform this analysis per cluster.

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data=ugm_accepted[ugm_accepted['type'] == 'science'], x='average_score', fill=True, label="UGM")
sns.kdeplot(data=non_ugm_accepted[non_ugm_accepted['type'] == 'science'], x='average_score', fill=True, label="Non-UGM")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ugm_accepted[ugm_accepted['type'] == 'science']['average_score'].values,
                       non_ugm_accepted[non_ugm_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data=ugm_accepted[ugm_accepted['type'] == 'humanities'], x='average_score', fill=True, label="UGM")
sns.kdeplot(data=non_ugm_accepted[non_ugm_accepted['type'] == 'humanities'], x='average_score', fill=True, label="Non-UGM")

plt.legend();

In [None]:
# Computing the t and p values using scipy 
t, p = stats.ttest_ind(ugm_accepted[ugm_accepted['type'] == 'humanities']['average_score'].values,
                       non_ugm_accepted[non_ugm_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("t-value = " + str(t))
print("p-value = " + str(p))

We can see that for UGM the per-cluster results generally match the overall result, with the score gap between accepted and non-accepted applicants being larger in the science cluster than in the humanities cluster.
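
The per-cluster cells above repeat the same test with a different filter. As a rough sketch, the repetition could be folded into one loop. The helper name `welch_by_cluster` and the synthetic data are hypothetical, and only the `type` and `average_score` column names come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def welch_by_cluster(accepted, others, column="average_score", by="type"):
    """Hypothetical helper: one Welch's t-test per cluster present in both frames."""
    rows = []
    for cluster in sorted(set(accepted[by]) & set(others[by])):
        a = accepted.loc[accepted[by] == cluster, column]
        b = others.loc[others[by] == cluster, column]
        t, p = stats.ttest_ind(a, b, equal_var=False)
        rows.append({"cluster": cluster, "t": t, "p": p, "mean_diff": a.mean() - b.mean()})
    return pd.DataFrame(rows).set_index("cluster")


rng = np.random.default_rng(1)


def make_toy(mean_science, mean_humanities, n):
    # Synthetic scores (assumed means and spread, for illustration only)
    return pd.DataFrame({
        "type": ["science"] * n + ["humanities"] * n,
        "average_score": np.concatenate([
            rng.normal(mean_science, 30, n),
            rng.normal(mean_humanities, 30, n),
        ]),
    })


toy_accepted = make_toy(680, 670, 200)
toy_others = make_toy(550, 570, 2000)

result = welch_by_cluster(toy_accepted, toy_others)
print(result)
```

Comparing the `mean_diff` column across clusters gives a direct number for the "larger gap" that the KDE plots only suggest visually.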

Now, we will look at the passing grades for each major in UGM, split into science and humanities majors.

In [None]:
# Combine both conditions into one mask to avoid pandas' boolean-reindexing warning
ugm_science = (majors_list['university_name'] == 'UNIVERSITAS GADJAH MADA') & (majors_list['type'] == 'science')
majors_list[ugm_science].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()

From all the valid data, *Kedokteran* (Medical School) has the highest passing grade in UGM for the science cluster, while *Akuakultur (Budidaya Perikanan)* (Aquaculture) has the lowest. Even that lowest passing grade is still pretty high at 604.125, considering that the mean of science cluster test scores is below 600 in the earlier KDE plot.

In [None]:
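# Combine both conditions into one mask to avoid pandas' boolean-reindexing warning
ugm_humanities = (majors_list['university_name'] == 'UNIVERSITAS GADJAH MADA') & (majors_list['type'] == 'humanities')
majors_list[ugm_humanities].sort_values(['passing_grade', 'acceptance_rate'], ascending=False)[['major_name', 'passing_grade', 'acceptance_rate']].reset_index()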

From all the valid data, *Ilmu Hubungan Internasional* (International Relations) has the highest passing grade in UGM for the humanities cluster, while *Sastra Jawa* (Javanese Literature) has the lowest. Even that lowest passing grade is still pretty high at 609.125, considering that the mean of humanities cluster test scores is below 600 in the earlier KDE plot. Interestingly, UGM's majors with the highest and lowest passing grades are the same as UI's.

## **Analysis - Top 3, and How They Compare to One Another**

To compare the top 3 against one another, we will plot the KDE of each of them along with the rest of the test takers. We will also split the comparison by cluster/type right away.

First, we will look at the science cluster.

In [None]:
top3_accepted = pd.concat([ui_accepted, itb_accepted, ugm_accepted])
non_top3_accepted = test_scores[~test_scores.index.isin(top3_accepted.index)]

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data=ui_accepted[ui_accepted['type'] == 'science'], x='average_score', fill=True, label="UI")
sns.kdeplot(data=itb_accepted[itb_accepted['type'] == 'science'], x='average_score', fill=True, label="ITB")
sns.kdeplot(data=ugm_accepted[ugm_accepted['type'] == 'science'], x='average_score', fill=True, label="UGM")
sns.kdeplot(data=non_top3_accepted[non_top3_accepted['type'] == 'science'], x='average_score', fill=True, label="Non-Top 3")

plt.legend();

Visually, the applicants who got into the top 3 performed better on average than the rest. But how did they compare to one another? We will perform hypothesis testing to find out.

In [None]:
# UI vs ITB
t1, p1 = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'science']['average_score'].values,
                       itb_accepted[itb_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("UI vs ITB")
print("t-value = " + str(t1))
print("p-value = " + str(p1))
print("")

# UI vs UGM
t2, p2 = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'science']['average_score'].values,
                       ugm_accepted[ugm_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("UI vs UGM")
print("t-value = " + str(t2))
print("p-value = " + str(p2))
print("")

# ITB vs UGM
t3, p3 = stats.ttest_ind(itb_accepted[itb_accepted['type'] == 'science']['average_score'].values,
                       ugm_accepted[ugm_accepted['type'] == 'science']['average_score'].values, 
                       equal_var=False)
print("ITB vs UGM")
print("t-value = " + str(t3))
print("p-value = " + str(p3))
print("")

When it comes to the science cluster, ITB's accepted applicants generally outperformed both UI's and UGM's, while UI's outperformed UGM's. Let's now see the humanities cluster.
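
The three pairwise tests above can also be written as a loop over all pairs. The sketch below uses synthetic groups labeled A, B, and C (not the real UI, ITB, and UGM scores), and the helper name `pairwise_welch` is hypothetical. It additionally applies a Bonferroni correction, which the cells above do not use. It is shown here only as one common way to account for running several tests at once.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_welch(groups, alpha=0.05):
    """Hypothetical helper: Welch's t-test for every pair of groups."""
    pairs = list(combinations(groups, 2))
    # Bonferroni correction: divide alpha by the number of comparisons
    threshold = alpha / len(pairs)
    results = {}
    for name_a, name_b in pairs:
        t, p = stats.ttest_ind(groups[name_a], groups[name_b], equal_var=False)
        results[(name_a, name_b)] = {
            "t": float(t),
            "p": float(p),
            "significant": bool(p < threshold),
        }
    return results


# Synthetic score samples (assumed means and spread, for illustration only)
rng = np.random.default_rng(2)
toy_groups = {
    "A": rng.normal(700, 30, 300),
    "B": rng.normal(680, 30, 300),
    "C": rng.normal(650, 30, 300),
}

results = pairwise_welch(toy_groups)
for (name_a, name_b), res in results.items():
    print(f"{name_a} vs {name_b}: t = {res['t']:.2f}, p = {res['p']:.3g}, significant = {res['significant']}")
```

A positive t-value for a pair means the first group named in the pair has the higher mean.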

In [None]:
plt.figure(figsize=(10, 10))
sns.set_style("darkgrid")
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data=ui_accepted[ui_accepted['type'] == 'humanities'], x='average_score', fill=True, label="UI")
sns.kdeplot(data=itb_accepted[itb_accepted['type'] == 'humanities'], x='average_score', fill=True, label="ITB")
sns.kdeplot(data=ugm_accepted[ugm_accepted['type'] == 'humanities'], x='average_score', fill=True, label="UGM")
# The baseline must also be filtered to the humanities cluster (it was previously filtered to 'science')
sns.kdeplot(data=non_top3_accepted[non_top3_accepted['type'] == 'humanities'], x='average_score', fill=True, label="Non-Top 3")

plt.legend();

In [None]:
# UI vs ITB
t1, p1 = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'humanities']['average_score'].values,
                       itb_accepted[itb_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("UI vs ITB")
print("t-value = " + str(t1))
print("p-value = " + str(p1))
print("")

# UI vs UGM
t2, p2 = stats.ttest_ind(ui_accepted[ui_accepted['type'] == 'humanities']['average_score'].values,
                       ugm_accepted[ugm_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("UI vs UGM")
print("t-value = " + str(t2))
print("p-value = " + str(p2))
print("")

# ITB vs UGM
t3, p3 = stats.ttest_ind(itb_accepted[itb_accepted['type'] == 'humanities']['average_score'].values,
                       ugm_accepted[ugm_accepted['type'] == 'humanities']['average_score'].values, 
                       equal_var=False)
print("ITB vs UGM")
print("t-value = " + str(t3))
print("p-value = " + str(p3))
print("")

When it comes to the humanities cluster, there is virtually no difference in performance between UI's accepted applicants and ITB's accepted applicants. However, both universities' accepted applicants outperformed UGM's.

## **Conclusion**

From the analyses that we have done, it can be safely concluded that getting into these top 3 universities is tough, judging by a variety of variables such as their passing grades, acceptance rates, and how the accepted applicants performed against the other test takers. They did live up to their names as the best Indonesian universities in 2019!