# Starting in 2019, students who want to register for the Higher Education Entrance Joint Selection (called SBMPTN in Indonesia) must take the Computer-Based Written Examination (called UTBK in Indonesia). UTBK is administered by the Higher Education Entrance Test Institution (called LTMPT in Indonesia). LTMPT is an institution under the Ministry of Research, Technology, and Higher Education and is now the only institution that administers standardized higher education tests in Indonesia.

Information about UTBK registration is available at https://ltmpt.ac.id. UTBK is open to students who graduated from secondary education (SMA / MA / SMK), or the equivalent Package C (Paket C) program, in 2017, 2018, or 2019. UTBK uses exam questions designed according to academic principles to predict how prospective students will perform in any study program.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# The first file contains the major ID, major name, and capacity of every major.

In [None]:
df1 = pd.read_csv('../input/indonesia-college-entrance-examination-utbk-2019/majors.csv')
df1.head()

# The second file contains the university ID and university name.

In [None]:
df2 = pd.read_csv('../input/indonesia-college-entrance-examination-utbk-2019/universities.csv')
df2.head()

# The third file contains each student's user ID, major choices, and scores in every humanities subject.

In [None]:
df3 = pd.read_csv('../input/indonesia-college-entrance-examination-utbk-2019/score_humanities.csv')
df3.head()

# The fourth file contains each student's user ID, major choices, and scores in every science subject.

In [None]:
df4 = pd.read_csv('../input/indonesia-college-entrance-examination-utbk-2019/score_science.csv')
df4.head()

# Now, I check for missing values and modify the data.

In [None]:
df1.info()

In [None]:
df2.info()

In [None]:
df3.info()

In [None]:
df4.info()
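`info()` reports the non-null count per column, which then has to be compared with the row count by eye. As a minimal sketch on a small made-up frame (the column names mirror the majors file, but the values are invented), `isna().sum()` counts the missing values per column directly:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the UTBK files.
toy = pd.DataFrame({
    "id_major": [1, 2, 3],
    "major_name": ["Law", None, "Physics"],
    "capacity": [40, 25, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if the whole frame has no missing value at all.
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Applied to `df1` through `df4`, a result of all zeros would confirm that nothing is missing.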

# There are no missing values, what great data! In the next step, I rearrange the data to make it easier to read.

In [None]:
df1.rename({'id_major' : 'Major ID', 'id_university' : 'University ID', 'type' : 'Departement', 'major_name' : 'Major Name', 'capacity': 'Capacity'} , inplace = True , axis = 1)

In [None]:
df1.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
df2.rename({'id_university' : 'University ID', 'university_name' : 'University'} , inplace = True , axis = 1)

In [None]:
df2.drop(columns=['Unnamed: 0'], inplace=True)
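An `Unnamed: 0` column typically appears when a CSV was saved together with its index. As a sketch on an in-memory CSV (the contents are invented, and I am assuming the real files share this layout), passing `index_col=0` to `pd.read_csv` reads that column as the index, so there is nothing to drop afterwards:

```python
import io
import pandas as pd

# Invented CSV text whose first, unnamed column is a saved index.
csv_text = ",id_university,university_name\n0,111,Univ A\n1,112,Univ B\n"

# Default read: the unnamed first column becomes 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Either approach leaves the same data columns; dropping the column after loading, as done above, keeps the default integer index.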

The humanities (df3) and science (df4) data include two choices of major and university. I decided to use the first choice and drop the second. This makes it easier to see who passes the UTBK test based on their score.

In [None]:
df3.rename({'id_first_major' : 'Major ID', 'id_first_university' : 'University ID', 'id_user' : 'User ID', 'score_eko' : 'Economy', 'score_geo': 'Geography', 'score_kmb' : 'Reading Comprehension & Writing for Humanities', 'score_kpu': 'General Reasoning for Humanities', 'score_kua' : 'Quantitative Skills for Humanities', 'score_mat': 'Mathematics', 'score_ppu' : 'General Knowledge & Understanding for Humanities', 'score_sej': 'History', 'score_sos' : 'Sociology'} , inplace = True , axis = 1)

In [None]:
df3.drop(columns=['Unnamed: 0', 'id_second_major', 'id_second_university'], inplace=True)

In [None]:
df4.rename({'id_first_major' : 'Major ID', 'id_first_university' : 'University ID', 'id_user' : 'User ID', 'score_bio' : 'Biology', 'score_fis': 'Physics', 'score_kim' : 'Chemistry','score_kmb' : 'Reading Comprehension & Writing for Science', 'score_kpu': 'General Reasoning for Science', 'score_kua' : 'Quantitative Skills for Science', 'score_mat': 'Mathematics', 'score_ppu' : 'General Knowledge & Understanding for Science'} , inplace = True , axis = 1)

In [None]:
df4.drop(columns=['Unnamed: 0', 'id_second_major', 'id_second_university'], inplace=True)

# Voila! It's done. Moving on to the next step: merging the data that I need.

I merge the majors (df1) and universities (df2) tables so that each major carries its university name.

In [None]:
joinUniv = pd.merge(df1,df2, how= 'inner', on = 'University ID')

In [None]:
joinUniv

In [None]:
sns.distplot(joinUniv['Capacity'], label = "Skewness : %.2f"%(joinUniv['Capacity'].skew()))
plt.title('Capacity in University')
plt.legend(loc = 0)
plt.show()

In the plot above, the skewness is 3.09. A skewness greater than 1 indicates a highly skewed distribution, so I call this one extremely right-skewed (positively skewed). This means most majors have a capacity below the average capacity, while a few have a much larger one. Some universities are student favorites, so they offer a bigger capacity.
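To make the reading of a positive skewness concrete, here is a small sketch with invented capacities (not the real data). It uses the same `Series.skew()` call as the plot label and compares the mean with the median:

```python
import pandas as pd

# Invented capacities: many small majors and one very large one.
capacity = pd.Series([20, 25, 25, 30, 30, 30, 35, 40, 45, 300])

skewness = capacity.skew()          # sample skewness, as used in the plot label
mean_capacity = capacity.mean()
median_capacity = capacity.median()

# In a right-skewed sample the long tail pulls the mean above the median.
print(round(skewness, 2), mean_capacity, median_capacity)

# Share of majors whose capacity is below the mean.
share_below_mean = (capacity < mean_capacity).mean()
print(share_below_mean)
```

In this toy sample nine of the ten values sit below the mean, which is the pattern the capacity plot suggests.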

# In this section, I'll tell you what I found in the humanities data.

I add two new columns: the average score of the four potency test subjects and the average score across all subjects. Then I merge the result with the universities data.

In [None]:
df3['Avg Potency Test Humanities'] = ((df3['Reading Comprehension & Writing for Humanities'] + df3['General Reasoning for Humanities'] + df3['Quantitative Skills for Humanities'] + df3['General Knowledge & Understanding for Humanities'])/4).round(2)

In [None]:
df3['Avg Score Student Humanities'] = ((df3['Economy'] + df3['Geography'] + df3['Reading Comprehension & Writing for Humanities'] + df3['General Reasoning for Humanities'] + df3['Quantitative Skills for Humanities'] + df3['Mathematics'] + df3['General Knowledge & Understanding for Humanities'] + df3['History'] + df3['Sociology'])/9).round(2)

In [None]:
Humanities = pd.merge(df3, joinUniv, on=['Major ID'])

In [None]:
Humanities.drop(columns=['University ID_x', 'University ID_y' ], inplace=True)
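`pd.merge` defaults to an inner join, so score rows whose `Major ID` has no match in `joinUniv` are dropped without any message. A minimal sketch of how such rows could be spotted, using invented stand-in tables and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Invented stand-ins for the score table and the merged major/university table.
scores = pd.DataFrame({"User ID": [1, 2, 3], "Major ID": [10, 20, 99]})
majors = pd.DataFrame({"Major ID": [10, 20], "Major Name": ["Law", "History"]})

# A left merge keeps every score row; indicator=True adds a '_merge' column.
checked = pd.merge(scores, majors, on="Major ID", how="left", indicator=True)

# Rows marked 'left_only' have no matching major and would vanish in an inner merge.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["User ID", "Major ID"]])
```

Comparing `len(df3)` with `len(Humanities)` is a quicker check when only the number of dropped rows matters.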

**I want to see how many humanities students pass and fail. I add another two columns holding the result and the tally of acceptance, based on the scores that I found at https://tirto.id/pengumuman-hasil-utbk-sbmptn-2019-skor-secara-nasional-dari-ltmpt-dmU2. I use the median score of humanities (Soshum) as the passing grade.**

Humanities Median Score:
* Reading Comprehension & Writing for Humanities = 488
* Quantitative Skills for Humanities = 475
* General Reasoning for Humanities = 483
* General Knowledge & Understanding for Humanities = 488
* Economy = 499
* Geography = 497
* Mathematics = 490
* History = 493
* Sociology = 497

The average of the humanities medians is 490.
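The arithmetic behind this threshold can be checked with a short sketch. The dictionary simply restates the medians listed above; the variable names are mine:

```python
from statistics import mean

# National median scores per humanities subject, as listed above.
humanities_medians = {
    "Reading Comprehension & Writing": 488,
    "Quantitative Skills": 475,
    "General Reasoning": 483,
    "General Knowledge & Understanding": 488,
    "Economy": 499,
    "Geography": 497,
    "Mathematics": 490,
    "History": 493,
    "Sociology": 497,
}

# Unweighted average of the nine medians, used as the passing grade.
passing_grade = mean(humanities_medians.values())
print(passing_grade)
```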

**Why did I choose the median score? Because it is the middle score of the actual data and a reasonable stand-in for the boundary between pass and fail. In practice, the minimum score for a major usually rises with the number of students who apply to it, relative to the capacity the university offers.**

In [None]:
Humanities['Result of Acceptance'] = np.where(Humanities['Avg Score Student Humanities'] <= 490, 'Failed', 'Pass')

I encode a Failed result as 1 and a Pass result as 2.

In [None]:
Humanities['Tally of Acceptance'] = np.where(Humanities['Result of Acceptance'] == 'Failed', 1, 2)

In [None]:
Humanities

*Does it look nice?*

In [None]:
sns.distplot(Humanities['Avg Potency Test Humanities'], label = "Skewness : %.2f"%(Humanities['Avg Potency Test Humanities'].skew()))
plt.title('Avg Potency Test Humanities')
plt.legend(loc = 0)
plt.show()

In this distribution plot for Avg Potency Test Humanities, the skewness is 0.09, which is very close to 0, so the distribution is fairly symmetrical.

In [None]:
sns.distplot(Humanities['Avg Score Student Humanities'], label = "Skewness : %.2f"%(Humanities['Avg Score Student Humanities'].skew()))
plt.title('Avg Score Student Humanities')
plt.legend(loc = 0)
plt.show()

In this distribution plot for Avg Score Student Humanities, the skewness is 0.16, which means the distribution is fairly symmetrical.

**Let's count the students who pass or fail.**

In [None]:
# Each 'Pass' row carries a tally of 2, so halving the sum gives the number of passing students.
passofhum = Humanities['Tally of Acceptance'][Humanities['Result of Acceptance'] == 'Pass'].sum() / 2
percpassofhum = (passofhum / len(Humanities['Tally of Acceptance']) * 100).round(1)
print('Percentage of students who pass the UTBK Humanities test:', percpassofhum, '%')

In [None]:
# Each 'Failed' row carries a tally of 1, so the sum is the number of failing students.
failedofhum = Humanities['Tally of Acceptance'][Humanities['Result of Acceptance'] == 'Failed'].sum()
percfailedofhum = (failedofhum / len(Humanities['Tally of Acceptance']) * 100).round(1)
print('Percentage of students who fail the UTBK Humanities test:', percfailedofhum, '%')

In [None]:
crosstabhum = pd.crosstab(Humanities['Tally of Acceptance'], Humanities['Result of Acceptance'])
print(crosstabhum)
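The same counts and percentages can also be read straight from the result column, without the numeric tally. A sketch on invented results for ten students, using `value_counts` and its `normalize` option:

```python
import pandas as pd

# Invented results for ten students.
result = pd.Series(["Pass"] * 8 + ["Failed"] * 2, name="Result of Acceptance")

# Absolute number of students per result.
counts = result.value_counts()

# normalize=True returns shares that sum to 1; scale them to percentages.
percentages = (result.value_counts(normalize=True) * 100).round(1)

print(counts)
print(percentages)
```

This avoids having to remember that a Pass tally counts double when summing.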

In [None]:
plt.figure(figsize=(7,7))
# Take the legend labels from the counts so they always match the wedge order.
counts = Humanities['Result of Acceptance'].value_counts()
ax = counts.plot(kind='pie')
plt.legend(counts.index.tolist(), loc=1, fontsize=13)
plt.title('Humanities')
plt.axis('equal')
plt.show()

*Can you see how many students failed?*

**Is there any correlation between the average potency test score and the average student score? Check it out!**

Correlation is a statistical summary of the relationship between two variables. I use Pearson’s correlation coefficient to summarize the linear relationship between them. In this data, one variable could influence the values of the other, since the potency test subjects also count toward the overall average score. Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations.
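The definition can be written out in a few lines. This is a sketch on invented paired scores, computing the coefficient from the covariance and standard deviations and comparing it with NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# Invented paired scores for illustration.
x = np.array([450.0, 480.0, 500.0, 520.0, 560.0])
y = np.array([470.0, 475.0, 505.0, 515.0, 540.0])

# Sample covariance of x and y (np.cov uses ddof=1 by default).
cov_xy = np.cov(x, y)[0, 1]

# Pearson's r: covariance divided by the product of the sample standard deviations.
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's correlation matrix gives the same value off the diagonal.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The coefficient always lies between -1 and 1, and the same `ddof` must be used for the covariance and the standard deviations so that the scaling cancels.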

In [None]:
plt.scatter(Humanities['Avg Potency Test Humanities'],Humanities['Avg Score Student Humanities'], marker="*")

The relationship is clear in the scatter plot, which shows an increasing trend.

# The hypothesis is:

* H0: ρ = 0
* H1: ρ ≠ 0

Here ρ is the correlation between Avg Potency Test Humanities and Avg Score Student Humanities. H0 means there is no correlation between the two, and H1 means there is a correlation.

In [None]:
from scipy.stats import pearsonr

In [None]:
corr, pval = pearsonr(Humanities['Avg Potency Test Humanities'], Humanities['Avg Score Student Humanities'])

In [None]:
corr

In [None]:
pval < 0.05

In this test, the p-value is less than 0.05, so H0 is rejected, which means there is a correlation. The two variables are positively correlated with a coefficient of 0.83, which suggests a high level of correlation.

# In this section, I'll tell you what I found in the science data.

I add two new columns: the average score of the four potency test subjects and the average score across all subjects. Then I merge the result with the universities data.

In [None]:
df4['Avg Potency Test Science'] = ((df4['Reading Comprehension & Writing for Science'] + df4['General Reasoning for Science'] + df4['Quantitative Skills for Science'] + df4['General Knowledge & Understanding for Science'])/4).round(2)

In [None]:
df4['Avg Score Student Science'] = ((df4['Biology'] + df4['Physics'] + df4['Chemistry'] + df4['Reading Comprehension & Writing for Science'] + df4['General Reasoning for Science'] + df4['Quantitative Skills for Science'] + df4['Mathematics'] + df4['General Knowledge & Understanding for Science'])/8).round(2)
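Adding the columns by hand and dividing by their count works, but the divisor has to be kept in sync with the number of subjects. As an alternative sketch (toy scores, shortened column names that are not the notebook's), a row-wise `mean(axis=1)` over a list of columns does the same job:

```python
import pandas as pd

# Invented scores; the column names are shortened for the example.
toy = pd.DataFrame({
    "Reading": [500, 520],
    "Reasoning": [510, 480],
    "Quantitative": [490, 530],
    "Knowledge": [505, 470],
})

potency_cols = ["Reading", "Reasoning", "Quantitative", "Knowledge"]

# Row-wise mean over the selected columns, rounded like the cells above.
toy["Avg Potency Test"] = toy[potency_cols].mean(axis=1).round(2)
print(toy["Avg Potency Test"].tolist())
```

One difference to keep in mind: `mean` skips missing values by default, while the explicit sum would propagate them. With no missing values in the data, both give the same result.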

In [None]:
Science = pd.merge(df4, joinUniv, on=['Major ID'])

In [None]:
Science.drop(columns=['University ID_x', 'University ID_y' ], inplace=True)

**I want to see how many science students pass and fail. I add another two columns holding the result and the tally of acceptance, based on the scores that I found at https://tirto.id/pengumuman-hasil-utbk-sbmptn-2019-skor-secara-nasional-dari-ltmpt-dmU2. I use the median score of science (Saintek) as the passing grade.**

Science Median Score:

* Reading Comprehension & Writing for Science = 506
* Quantitative Skills for Science = 507
* General Reasoning for Science = 507
* General Knowledge & Understanding for Science = 506
* Biology = 492
* Physics = 494
* Mathematics = 495
* Chemistry = 494

The average of the science medians is 500.13.

**Why did I choose the median score? Because it is the middle score of the actual data and a reasonable stand-in for the boundary between pass and fail. In practice, the minimum score for a major usually rises with the number of students who apply to it, relative to the capacity the university offers.**

In [None]:
Science['Result of Acceptance'] = np.where(Science['Avg Score Student Science'] <= 500.13, 'Failed', 'Pass')

I encode a Failed result as 1 and a Pass result as 2.

In [None]:
Science['Tally of Acceptance'] = np.where(Science['Result of Acceptance'] == 'Failed', 1, 2)

In [None]:
Science

*Does it look nice?*

In [None]:
sns.distplot(Science['Avg Potency Test Science'], label = "Skewness : %.2f"%(Science['Avg Potency Test Science'].skew()))
plt.title('Avg Potency Test Science')
plt.legend(loc = 0)
plt.show()

In this distribution plot for Avg Potency Test Science, the skewness is 0.01, which is very close to 0, so the distribution is fairly symmetrical.

In [None]:
sns.distplot(Science['Avg Score Student Science'], label = "Skewness : %.2f"%(Science['Avg Score Student Science'].skew()))
plt.title('Avg Score Student Science')
plt.legend(loc = 0)
plt.show()

In this distribution plot for Avg Score Student Science, the skewness is 0.34, which means the distribution is fairly symmetrical.

**Let's count the students who pass or fail.**

In [None]:
# Each 'Pass' row carries a tally of 2, so halving the sum gives the number of passing students.
passofsci = Science['Tally of Acceptance'][Science['Result of Acceptance'] == 'Pass'].sum() / 2
percpassofsci = (passofsci / len(Science['Tally of Acceptance']) * 100).round(1)
print('Percentage of students who pass the UTBK Science test:', percpassofsci, '%')

In [None]:
# Each 'Failed' row carries a tally of 1, so the sum is the number of failing students.
failedofsci = Science['Tally of Acceptance'][Science['Result of Acceptance'] == 'Failed'].sum()
percfailedofsci = (failedofsci / len(Science['Tally of Acceptance']) * 100).round(1)
print('Percentage of students who fail the UTBK Science test:', percfailedofsci, '%')

In [None]:
crosstabsci = pd.crosstab(Science['Tally of Acceptance'], Science['Result of Acceptance'])
print(crosstabsci)

In [None]:
plt.figure(figsize=(7,7))
# Take the legend labels from the counts so they always match the wedge order.
counts = Science['Result of Acceptance'].value_counts()
ax = counts.plot(kind='pie')
plt.legend(counts.index.tolist(), loc=1, fontsize=13)
plt.title('Science')
plt.axis('equal')
plt.show()

*Can you see how many students passed?*

**Is there any correlation between the average potency test score and the average student score? Check it out!**

Correlation is a statistical summary of the relationship between two variables. I use Pearson’s correlation coefficient to summarize the linear relationship between them. In this data, one variable could influence the values of the other, since the potency test subjects also count toward the overall average score. Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations.

In [None]:
plt.scatter('Avg Potency Test Science', 'Avg Score Student Science', data = Science, marker="*")

The relationship is clear in the scatter plot, which shows an increasing trend.

# The hypothesis is:

* H0: ρ = 0
* H1: ρ ≠ 0

Here ρ is the correlation between Avg Potency Test Science and Avg Score Student Science. H0 means there is no correlation between the two, and H1 means there is a correlation.

In [None]:
corr, pval = pearsonr(Science['Avg Potency Test Science'], Science['Avg Score Student Science'])

In [None]:
corr

In [None]:
pval < 0.05

In this test, the p-value is less than 0.05, so H0 is rejected, which means there is a correlation. The two variables are positively correlated with a coefficient of 0.88, which suggests a high level of correlation.

# Summary:
* The humanities file contains **61202 user IDs**. Along the way, **4 rows are removed** by the inner join during the merge.
* **The skewness of Avg Potency Test Humanities is 0.09**, which means the distribution is **approximately symmetric, as a normal distribution would be**. The same holds for **Avg Score Student Humanities, with a skewness of 0.16**.
* **81.7 % of students pass the UTBK Humanities test** (49971 students) and **18.3 % fail** (11227 students).
* I use the **Pearson's Correlation Coefficient test** between the Avg Potency Test Humanities and the Avg Score Student Humanities. This is a parametric test, chosen because the data look approximately normally distributed. The result is **positively correlated, with a correlation of 0.83**, which means the two are **very closely related**.
* The science file contains **86570 user IDs**. Along the way, **1 row is removed** by the inner join during the merge.
* **The skewness of Avg Potency Test Science is 0.01**, which means the distribution is **approximately symmetric, as a normal distribution would be**. The same holds for **Avg Score Student Science, with a skewness of 0.34**.
* **79.4 % of students pass the UTBK Science test** (68725 students) and **20.6 % fail** (17844 students).
* I use the **Pearson's Correlation Coefficient test** between the Avg Potency Test Science and the Avg Score Student Science. This is a parametric test, chosen because the data look approximately normally distributed. The result is **positively correlated, with a correlation of 0.88**, which means the two are **very closely related**.

Thank you for reading this notebook. If you found it useful, please leave some feedback and an upvote!