Import the libraries we need to get started.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Then load all the data into pandas DataFrames.

In [None]:
major = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/majors.csv", index_col = 0)
major.head(3)

In [None]:
univ = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/universities.csv", index_col = 0)
univ.head(3)

In [None]:
sci_score = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/score_science.csv", index_col = 0)
sci_score.head(3)

In [None]:
hum_score = pd.read_csv("../input/indonesia-college-entrance-examination-utbk-2019/score_humanities.csv", index_col = 0)
hum_score.head(3)

Indonesia has several types of college entrance examination, and UTBK is one of them. The quota for new students admitted through UTBK is at least 40% of the capacity of each major. I assume the quota is exactly 40% for all majors. I then generate a 'utbk_capacity' column holding the UTBK quota of each major, and a 'passed_count' column to track the number of students who pass the UTBK for that major.

In [None]:
major['utbk_capacity'] = (0.4 * major['capacity']).apply(int)
major['passed_count'] = 0
major.head(3)
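
The same quota can also be computed with integer arithmetic only, which avoids depending on how a floating-point product gets truncated. This is a minimal sketch on made-up capacities; `toy_major` and its values are illustrative and not taken from majors.csv.

```python
import pandas as pd

# Hypothetical capacities, not taken from majors.csv
toy_major = pd.DataFrame({'capacity': [10, 25, 3, 0]})

# 40% of the capacity, rounded down, using integers only (4/10 == 40%)
toy_major['utbk_capacity'] = toy_major['capacity'] * 4 // 10
# Running counter of admitted students, starting at zero for every major
toy_major['passed_count'] = 0

print(toy_major)
```

Note that a very small capacity (3 in this toy table) still gets a quota of 1, while a capacity of 2 or less would round down to 0.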

The major data does not contain university names, so I merge the major data with the univ data. I then set the index of the merged data to the major ID (id_major) for the later processing steps.

In [None]:
major_univ = pd.merge(major, univ, on = 'id_university', how = 'left')
major_univ.set_index('id_major', inplace = True)
major_univ.head(3)
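
A left merge silently keeps majors whose university ID has no match. The sketch below shows one way to surface such rows, using the `validate` and `indicator` options of `pd.merge`. The tables and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for majors.csv and universities.csv
toy_major = pd.DataFrame({
    'id_major': [1111014, 1111022, 9991001],
    'id_university': [111, 111, 999],
    'major_name': ['Medicine', 'Law', 'Unknown programme'],
})
toy_univ = pd.DataFrame({
    'id_university': [111, 112],
    'university_name': ['University A', 'University B'],
})

# validate raises a MergeError if a university id is duplicated in toy_univ;
# indicator adds a '_merge' column telling where each row came from
merged = pd.merge(toy_major, toy_univ, on='id_university', how='left',
                  validate='many_to_one', indicator=True)

# Majors without a matching university keep NaN in university_name
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['id_major', 'id_university']])
```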

The Indonesia College Entrance Examination has two types of exam: one for science-related majors and one for humanities-related majors. In my year, a student could take both exams with one ID. The mechanism in 2019 seems to differ from my year, so let's check whether a student could take both exams with one ID in 2019.

In [None]:
sum(sci_score['id_user'].isin(hum_score['id_user']))

So in 2019 it seems a student could not take both exams with one ID. This means there will be no problem when we merge the science score data and the humanities score data later.
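
The same overlap check can be written with sets, together with a uniqueness check inside each table. This is only a sketch on made-up IDs; `toy_sci` and `toy_hum` are illustrative stand-ins for the two score tables.

```python
import pandas as pd

# Hypothetical user ids, not taken from the score files
toy_sci = pd.DataFrame({'id_user': [1, 2, 3]})
toy_hum = pd.DataFrame({'id_user': [4, 5]})

# Ids that appear in both tables; an empty set means nobody sat both exams
overlap = set(toy_sci['id_user']) & set(toy_hum['id_user'])
print(len(overlap))

# Each id should also appear only once within its own table
sci_unique = toy_sci['id_user'].is_unique
hum_unique = toy_hum['id_user'].is_unique
print(sci_unique, hum_unique)
```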

The Indonesia College Entrance Examination has two parts: a general exam and a specialized exam. Let's generate a score summary for the science and humanities score data. I use the average score of each part rather than the total, because the number of subjects in the specialized exam differs between the two exam types. After generating the summary, we can drop the columns used to compute it.

In [None]:
general_cols = ['score_kpu', 'score_kua', 'score_ppu', 'score_kmb']
sci_special_cols = ['score_mat', 'score_fis', 'score_kim', 'score_bio']

# Row-wise means: general part, specialized part, and all subjects together
sci_score['general_score'] = sci_score[general_cols].mean(axis = 1)
sci_score['specialize_score'] = sci_score[sci_special_cols].mean(axis = 1)
sci_score['score_mean'] = sci_score[sci_special_cols + general_cols].mean(axis = 1)
# The per-subject columns are no longer needed
sci_score.drop(sci_special_cols + general_cols, axis = 1, inplace = True)
sci_score.head(3)

In [None]:
general_cols = ['score_kpu', 'score_kua', 'score_ppu', 'score_kmb']
hum_special_cols = ['score_mat', 'score_geo', 'score_sej', 'score_sos', 'score_eko']

# Row-wise means: general part, specialized part, and all subjects together
hum_score['general_score'] = hum_score[general_cols].mean(axis = 1)
hum_score['specialize_score'] = hum_score[hum_special_cols].mean(axis = 1)
hum_score['score_mean'] = hum_score[hum_special_cols + general_cols].mean(axis = 1)
# The per-subject columns are no longer needed
hum_score.drop(hum_special_cols + general_cols, axis = 1, inplace = True)
hum_score.head(3)
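
To see why the average is used instead of the total, here is a small sketch with two made-up students who score 500 in every subject. The values are hypothetical, only the column names follow the score files.

```python
import pandas as pd

# Hypothetical students who score 500 in every specialized subject
toy_sci = pd.DataFrame({'score_mat': [500], 'score_fis': [500],
                        'score_kim': [500], 'score_bio': [500]})
toy_hum = pd.DataFrame({'score_mat': [500], 'score_geo': [500], 'score_sej': [500],
                        'score_sos': [500], 'score_eko': [500]})

# Totals differ only because the humanities exam has one more subject
sci_total = toy_sci.sum(axis=1).iloc[0]
hum_total = toy_hum.sum(axis=1).iloc[0]
print(sci_total, hum_total)

# Row-wise means put both exam types on the same scale
sci_avg = toy_sci.mean(axis=1).iloc[0]
hum_avg = toy_hum.mean(axis=1).iloc[0]
print(sci_avg, hum_avg)
```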

Let's check the score distributions of the science and humanities data!

In [None]:
sns.set_style('darkgrid')
fig = plt.figure(figsize = (12,8))
sci_score['general_score'].hist(bins = 50, alpha = 0.5, label = 'science')
hum_score['general_score'].hist(bins = 50, alpha = 0.5, label = 'humanities')
plt.legend()
plt.title('General Score')

In [None]:
sns.set_style('darkgrid')
fig = plt.figure(figsize = (12,8))
sci_score['specialize_score'].hist(bins = 50, alpha = 0.5, label = 'science')
hum_score['specialize_score'].hist(bins = 50, alpha = 0.5, label = 'humanities')
plt.legend()
plt.title('Specialize Score')

In [None]:
sns.set_style('darkgrid')
fig = plt.figure(figsize = (12,8))
sci_score['score_mean'].hist(bins = 50, alpha = 0.5, label = 'science')
hum_score['score_mean'].hist(bins = 50, alpha = 0.5, label = 'humanities')
plt.legend()
plt.title('Average Score')

Scores for the science-type exam seem consistently higher than those for the humanities-type exam, in both the average and the maximum. Will the passing grades follow the same pattern?
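
Reading averages and maxima off a histogram is approximate, so a numeric summary can back it up. The sketch below puts `describe()` outputs side by side. The two Series are made-up stand-ins for the 'score_mean' columns.

```python
import pandas as pd

# Hypothetical average scores, not taken from the dataset
toy_sci_mean = pd.Series([520.0, 610.0, 480.0, 700.0])
toy_hum_mean = pd.Series([500.0, 590.0, 470.0, 650.0])

# Side-by-side summary statistics (count, mean, std, min, quartiles, max)
summary = pd.DataFrame({'science': toy_sci_mean.describe(),
                        'humanities': toy_hum_mean.describe()})
print(summary.loc[['mean', '50%', 'max']])
```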

Before we get to the objective of this notebook, I merge the science score data and the humanities score data so that both can be processed in one go. As established above, merging them is safe because a student apparently cannot take both exams with one ID. I also add a 'test_type' column to both score tables to make the exam type easier to check.

In [None]:
sci_score['test_type'] = 'science'
hum_score['test_type'] = 'humanities'

test_score = pd.merge(sci_score, hum_score, how = 'outer')
test_score.head(3)
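
Since the two tables share the same columns and no user appears in both, stacking the rows with `pd.concat` is an alternative to the outer merge. This is a sketch on made-up rows, not the method used above.

```python
import pandas as pd

# Hypothetical rows with the same columns in both tables
toy_sci = pd.DataFrame({'id_user': [1, 2], 'score_mean': [600.0, 550.0],
                        'test_type': 'science'})
toy_hum = pd.DataFrame({'id_user': [3], 'score_mean': [580.0],
                        'test_type': 'humanities'})

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index
stacked = pd.concat([toy_sci, toy_hum], ignore_index=True)
print(stacked)
```

The row order can differ from what the outer merge returns, which should not matter here because the table is sorted by score before it is processed.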

With the test score data ready, I merge it with the major_univ data to get the major name, university name, and major type. We need to do this twice because each student can choose two majors through UTBK. I then set the index of the merged data to the user ID (id_user) for the later processing steps, and add some columns to record the major each student passes.

In [None]:
test_score = pd.merge(test_score, major_univ[['major_name', 'university_name', 'type']], left_on = 'id_first_major',
                      right_on = major_univ.index, how = 'left')
test_score = pd.merge(test_score, major_univ[['major_name', 'university_name', 'type']], left_on = 'id_second_major',
                      right_on = major_univ.index, how = 'left', suffixes = ('_1', '_2'))
test_score.set_index('id_user', inplace = True)
test_score['pass_id_major'] = np.nan
test_score['pass_major'] = ''
test_score['pass_universities'] = ''
test_score['note'] = ''
test_score.head(3)
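
The double lookup can also be sketched with `DataFrame.join`, which matches a column of the left table against the index of the right one. The tables below are made up; `toy_lookup` stands in for major_univ and `toy_students` for the score data.

```python
import pandas as pd

# Hypothetical lookup table indexed by major id, standing in for major_univ
toy_lookup = pd.DataFrame(
    {'major_name': ['Medicine', 'Law'], 'type': ['science', 'humanities']},
    index=pd.Index([1111014, 1111022], name='id_major'))

# Hypothetical students; 9999999 is deliberately missing from the lookup
toy_students = pd.DataFrame({'id_user': [1, 2],
                             'id_first_major': [1111014, 1111022],
                             'id_second_major': [1111022, 9999999]})

# Suffix the lookup columns first so both choices can sit side by side
toy_students = toy_students.join(toy_lookup.add_suffix('_1'), on='id_first_major')
toy_students = toy_students.join(toy_lookup.add_suffix('_2'), on='id_second_major')
print(toy_students)
```

A major ID that is missing from the lookup simply produces NaN in the joined columns, which is what the validity check relies on afterwards.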

We now have the full data to process. Before processing it, let's check whether it contains any invalid data. I assume a row is invalid if:
1. None of its major IDs exist in the major_univ data.
2. Its exam type differs from the type of the major the student chose.
3. The first three digits of the major ID do not match its university ID.

I check the second and third conditions with a single function I wrote, major_univ_check. I also wrote a function, pass_indicator, that fills the columns recording which major a student passes.

In [None]:
def major_univ_check(cols):
    # cols holds, in order: major id, university id, major type, test type
    major_id = str(cols.iloc[0])
    univ_id = str(cols.iloc[1])
    major_type = cols.iloc[2]
    test_type = cols.iloc[3]

    # Valid when the exam type matches the major type and
    # the major id starts with the university id
    return (major_type == test_type) and major_id.startswith(univ_id)

def pass_indicator(note, major = '-', univ = '-'):
    # Values for the 'note', 'pass_major' and 'pass_universities' columns
    return note, major, univ

# Condition 1: neither chosen major exists in major_univ
drop_index = list(test_score[(~test_score['id_first_major'].isin(major_univ.index)) &
                        (~test_score['id_second_major'].isin(major_univ.index))].index)

# Conditions 2 and 3, checked separately for each choice
test_score['major_1_check'] = test_score[['id_first_major', 'id_first_university', 'type_1', 'test_type']].apply(major_univ_check, axis = 1)
test_score['major_2_check'] = test_score[['id_second_major', 'id_second_university', 'type_2', 'test_type']].apply(major_univ_check, axis = 1)
false_major_index = list(test_score[(test_score['major_1_check'] == False) & (test_score['major_2_check'] == False)].index)

# Flag rows where both choices are invalid; skip the assignment when there are none
invalid_index = drop_index + false_major_index
if invalid_index:
    test_score.loc[invalid_index, ['note', 'pass_major', 'pass_universities']] = pass_indicator('Error: invalid major/university')

test_score.head(3)
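
For reference, conditions 2 and 3 can also be checked without a row-wise `apply`. The sketch below does this for the first choice on made-up rows. The column names mirror the ones used in this notebook, the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    'id_first_major': [1111014, 1111022, 3331005],
    'id_first_university': [111, 111, 222],
    'type_1': ['science', 'humanities', 'science'],
    'test_type': ['science', 'science', 'science'],
})

# Condition 2: the exam type must match the type of the chosen major
type_ok = toy['type_1'] == toy['test_type']

# Condition 3: the major id must start with the university id
major_str = toy['id_first_major'].astype(str)
univ_str = toy['id_first_university'].astype(str)
prefix_ok = pd.Series([m.startswith(u) for m, u in zip(major_str, univ_str)],
                      index=toy.index)

toy['major_1_check'] = type_ok & prefix_ok
print(toy)
```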

We no longer need the major types taken from major_univ (type_1 and type_2), since they have done their job as one of the indicators for detecting invalid data. We can drop them.

In [None]:
test_score.drop(['type_1', 'type_2'], axis = 1, inplace = True)
test_score.head(3)

As we know, the higher the score, the better the chance of passing the exam, so the data needs to be sorted by score first.

In [None]:
test_score.sort_values('score_mean', ascending = False, inplace = True)
test_score.head(3)
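
When two students have exactly the same average score, the order between them decides who gets the last seat. The sketch below adds an explicit tie-breaker so the ranking is reproducible. Breaking ties by the smaller id_user is my own assumption, not a rule of the real exam.

```python
import pandas as pd

# Hypothetical scores with a tie between users 7 and 3
toy = pd.DataFrame({'score_mean': [600.0, 650.0, 600.0]},
                   index=pd.Index([7, 9, 3], name='id_user'))

# Highest score first; ties are broken by the smaller id_user (an assumed rule)
ranked = (toy.reset_index()
             .sort_values(['score_mean', 'id_user'], ascending=[False, True])
             .set_index('id_user'))
print(ranked)
```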

Now we can process the data. First, we check whether the quota of the student's first choice has already been filled. If it has, we check whether the quota of the second choice has been filled. If that one is full too, the student fails the exam.

In [None]:
for uid in test_score[test_score['note']==''].index:
    major_1 = test_score.loc[uid, 'id_first_major']
    major_2 = test_score.loc[uid, 'id_second_major']
    # `and` short-circuits, so major_univ is only looked up for a valid choice
    if (test_score.loc[uid, 'major_1_check'] and
        (major_univ.loc[major_1, 'passed_count'] < major_univ.loc[major_1, 'utbk_capacity'])):
        test_score.loc[uid, ['note', 'pass_major', 'pass_universities']] = pass_indicator('Pass: First Choice',
                                                                                          test_score.loc[uid, 'major_name_1'],
                                                                                          test_score.loc[uid,'university_name_1'])
        major_univ.loc[major_1, 'passed_count'] += 1
        test_score.loc[uid, 'pass_id_major'] = major_1
    elif (test_score.loc[uid, 'major_2_check'] and
        (major_univ.loc[major_2, 'passed_count'] < major_univ.loc[major_2, 'utbk_capacity'])):
        test_score.loc[uid, ['note', 'pass_major', 'pass_universities']] = pass_indicator('Pass: Second Choice',
                                                                                          test_score.loc[uid, 'major_name_2'],
                                                                                          test_score.loc[uid,'university_name_2'])
        major_univ.loc[major_2, 'passed_count'] += 1
        test_score.loc[uid, 'pass_id_major'] = major_2
    else:
        test_score.loc[uid, ['note', 'pass_major', 'pass_universities']] = pass_indicator('Failed: not passing any major choices')

test_score.head(3)
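
The allocation rule is easier to follow on a tiny example without pandas. The sketch below applies the same idea (best score first, first choice before second choice) to three made-up applicants competing for two majors with one seat each. All names and numbers are hypothetical.

```python
# Hypothetical quotas per major id
quota = {'A': 1, 'B': 1}
admitted = {major_id: 0 for major_id in quota}

# Hypothetical applicants: (user id, average score, first choice, second choice)
applicants = [
    ('u1', 700.0, 'A', 'B'),
    ('u2', 650.0, 'A', 'B'),
    ('u3', 600.0, 'A', 'B'),
]

result = {}
# Process applicants from the highest score to the lowest
for user, score, first, second in sorted(applicants, key=lambda a: a[1], reverse=True):
    for rank, choice in (('First', first), ('Second', second)):
        # Take the first listed major that still has a free seat
        if choice in quota and admitted[choice] < quota[choice]:
            admitted[choice] += 1
            result[user] = f'Pass: {rank} Choice'
            break
    else:
        # Neither choice had a seat left
        result[user] = 'Failed: not passing any major choices'

print(result)
```

The top scorer takes the only seat in 'A', the next one falls back to 'B', and the third finds both majors full.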

We have got what we want. We can drop some columns now.

In [None]:
test_score.drop(['major_name_1', 'university_name_1', 'major_1_check', 'major_name_2', 'university_name_2', 'major_2_check'],
                axis = 1, inplace = True)
test_score.head(3)

Let's check the distribution of the exam results.

In [None]:
fig = plt.figure (figsize = (12, 6))
plt.title('Exam result')
test_score['note'].value_counts().plot.pie(autopct="%.2f%%");

About 30% of the students in this dataset pass UTBK. Because this dataset is only about 10% of the original size while the major capacities are not scaled down, the actual percentage of students who pass UTBK is roughly 3%.
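
The scaling argument can be written out as a short calculation. It uses the two approximate figures quoted above and assumes that the number of seats stays the same while the applicant pool is about ten times larger than this sample.

```python
# Figures quoted in the text above, treated as approximate
pass_rate_in_sample = 0.30   # share of sampled students who pass
sample_fraction = 0.10       # sample size relative to the full applicant pool

# Seats are not scaled down with the sample, so the same number of seats
# is spread over roughly ten times as many real applicants
estimated_actual_pass_rate = pass_rate_in_sample * sample_fraction
print(f'{estimated_actual_pass_rate:.0%}')
```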

What about the student capacity of all the majors? Have all the quotas been filled? Let's check.

In [None]:
print('The number of majors: ', len(major_univ))
print('The number of majors whose quota has been fully filled: ', len(major_univ[major_univ['utbk_capacity'] == major_univ['passed_count']]))
# A major counts as not filled when no student has been admitted to it
print('The number of majors whose quota has not been filled at all: ', len(major_univ[major_univ['passed_count'] == 0]))

About 60% of the majors have had their quota fully filled, and none of the majors has been left entirely unfilled.

Now, what percentage of the total UTBK quota across all majors has been filled? Let's check.

In [None]:
print(100 * major_univ['passed_count'].sum() / major_univ['utbk_capacity'].sum())

This result is higher than the previous one, which suggests that some of the remaining 40% of majors need only a few more students to fill their quota.
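
A toy example shows how the two percentages can differ. The quotas and admitted counts below are made up so that 60% of the majors are full.

```python
import pandas as pd

# Hypothetical quotas and admitted counts for five majors
toy = pd.DataFrame({'utbk_capacity': [10, 10, 10, 10, 10],
                    'passed_count': [10, 10, 10, 9, 8]})

# Share of majors whose quota is completely filled
share_full = 100 * (toy['passed_count'] == toy['utbk_capacity']).mean()

# Share of all seats that are filled, across every major
share_seats = 100 * toy['passed_count'].sum() / toy['utbk_capacity'].sum()

print(share_full, share_seats)
```

The two majors that are not full are only one and two students short, so the share of filled seats ends up well above the share of full majors.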

Now we can check the passing grade for every major in UTBK 2019. I also add columns for the number of students who chose each major and the percentage of those students who can pass it.

In [None]:
major_univ['students_choose_it'] = np.nan
major_univ['pass_percentage'] = np.nan
major_univ['passing_grade'] = np.nan
# Use major_id as the loop variable so the `major` DataFrame is not overwritten
for major_id in major_univ.index:
    # Students who listed this major as their first or second choice
    major_univ.loc[major_id, 'students_choose_it'] = len(test_score[(test_score['id_first_major'] == major_id) | (test_score['id_second_major'] == major_id)])

    if major_univ.loc[major_id, 'students_choose_it'] != 0:
        major_univ.loc[major_id, 'pass_percentage'] = 100 * major_univ.loc[major_id, 'utbk_capacity'] / major_univ.loc[major_id, 'students_choose_it']

    # Passing grade: the lowest average score among the students admitted to this major
    major_univ.loc[major_id, 'passing_grade'] = test_score.loc[test_score['pass_id_major'] == major_id, 'score_mean'].min()
major_univ.sort_values(by = 'passing_grade', ascending = False, inplace = True)
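
The per-major loop can be slow on a large table. The sketch below computes the same two quantities with `value_counts` and `groupby` on made-up data. `toy_score` and `toy_major` are hypothetical stand-ins for test_score and major_univ.

```python
import pandas as pd

# Hypothetical results table
toy_score = pd.DataFrame({
    'id_first_major': [101, 101, 102, 101],
    'id_second_major': [102, 102, 101, 101],
    'pass_id_major': [101.0, 102.0, None, None],
    'score_mean': [700.0, 650.0, 600.0, 550.0],
})
toy_major = pd.DataFrame({'utbk_capacity': [1, 1, 1]},
                         index=pd.Index([101, 102, 103], name='id_major'))

# Count each student once per distinct major they listed
second = toy_score['id_second_major'].where(
    toy_score['id_second_major'] != toy_score['id_first_major'])
choices = pd.concat([toy_score['id_first_major'], second]).dropna().astype(int)
toy_major['students_choose_it'] = (choices.value_counts()
                                   .reindex(toy_major.index, fill_value=0))

# Passing grade: the lowest score among the students admitted to each major
passed = toy_score.dropna(subset=['pass_id_major'])
toy_major['passing_grade'] = (passed.groupby(passed['pass_id_major'].astype(int))['score_mean']
                              .min()
                              .reindex(toy_major.index))
print(toy_major)
```

A major that nobody was admitted to (103 in this toy table) keeps NaN as its passing grade, just as in the loop version.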

Let's check the 20 highest passing grades for each exam type: science and humanities.

In [None]:
major_univ.loc[major_univ['type'] == 'science', ['major_name', 'university_name', 'passing_grade']].head(20)

In [None]:
major_univ.loc[major_univ['type'] == 'humanities', ['major_name', 'university_name', 'passing_grade']].head(20)

We know that scores for the science-type exam seem consistently higher than those for the humanities-type exam, in both the average and the maximum. What about the passing grades? We already know that the science-type exam has the higher maximum passing grade. What about the average? Let's check.

In [None]:
print("The average of science-type exam's passing grade: ",
      test_score[test_score['test_type'] == 'science'].groupby('pass_id_major')['score_mean'].min().mean())
print("The average of humanities-type exam's passing grade: ",
      test_score[test_score['test_type'] == 'humanities'].groupby('pass_id_major')['score_mean'].min().mean())

We can now see that the average passing grade of the science-type exam is higher than that of the humanities-type exam.

We already have the passing grades. Do they correlate with the percentage of students who can pass each major? Let's look at the correlation matrix of the major data.

In [None]:
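# Correlate the numeric columns only; text columns such as major_name are skipped
major_univ.corr(numeric_only = True)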

It seems that the passing grade of a major is correlated with the number of students who choose that major in UTBK.
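
To look at one pair of columns more closely, the correlation can be computed on just those two columns. The sketch below also computes a rank-based (Spearman) correlation as the Pearson correlation of the ranks. The figures are made up and only illustrate the calls.

```python
import pandas as pd

# Hypothetical per-major figures, not taken from the dataset
toy = pd.DataFrame({
    'major_name': ['A', 'B', 'C', 'D'],
    'students_choose_it': [50, 120, 300, 800],
    'passing_grade': [480.0, 530.0, 610.0, 690.0],
})

# Restrict to the two numeric columns of interest before correlating
pair = toy[['students_choose_it', 'passing_grade']]
pearson = pair.corr().loc['students_choose_it', 'passing_grade']

# Spearman correlation: Pearson correlation computed on the ranks,
# which is less sensitive to a few very popular majors
spearman = pair.rank().corr().loc['students_choose_it', 'passing_grade']
print(pearson, spearman)
```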