Hannah Smith


This data set contains the academic information of 3,046 STEM students and has 10 features: ID No, Program of Study (Prog Code), Gender, Year of Graduation (YoG), CGPA (Cumulative Grade Point Average), CGPA100 (CGPA at the end of the first year), CGPA200 (CGPA at the end of the second year), CGPA300 (CGPA at the end of the third year), CGPA400 (CGPA at the end of the fourth year), and SGPA (Secondary School Cumulative Grade Point Average).

Under the program of study, majors are given acronyms for brevity. Here is the key as given by the author of the data set:

PROGRAM OF STUDY

BCH - Biochemistry

BLD - Building technology

CEN - Computer Engineering

CHE - Chemical Engineering

CHM - Industrial Chemistry

CIS - Computer Science

CVE - Civil Engineering

EEE - Electrical and Electronics Engineering

ICE - Information and Communication Engineering

MAT - Mathematics

MCB - Microbiology

MCE - Mechanical Engineering

MIS - Management and Information System

PET - Petroleum Engineering

PHYE - Industrial Physics-Electronics and IT Applications

PHYG - Industrial Physics-Applied Geophysics

PHYR - Industrial Physics-Renewable Energy


The author is Krishnansh Verma and his data set can be found here: https://www.kaggle.com/datasets/krishnanshverma/academic-performance-of-university-student-dataset/data

In [2]:
import pandas as pd
import plotly.express as px

First, I will check the data set for any issues that need resolving before trying to find any insights.

In [3]:
df = pd.read_csv('academic_performance_dataset_V2.csv')
#Check the number of features and students
print(df.shape)

#Check every id is unique
print(len(df['ID No'].unique()))

(3046, 10)
2974


While there are 3,046 records, there are only 2,974 unique IDs. Clearly, something is off about the ID column. I'll check whether there are duplicate records or whether the ID column is simply unreliable.

In [4]:
duplicate_IDs = df['ID No'].duplicated()
print(df[duplicate_IDs].head(1)) #Here we can see one of the duplicated ID numbers is 76075
print(df.loc[df['ID No'] == 76075])

     ID No Prog Code Gender   YoG  CGPA  CGPA100  CGPA200  CGPA300  CGPA400  \
173  76075       MCB   Male  2014  2.32     2.61     1.98     1.77     2.67   

     SGPA  
173  2.68  
     ID No Prog Code Gender   YoG  CGPA  CGPA100  CGPA200  CGPA300  CGPA400  \
6    76075       BCH   Male  2010  3.34     3.68     3.00     3.44     3.28   
173  76075       MCB   Male  2014  2.32     2.61     1.98     1.77     2.67   

     SGPA  
6    3.02  
173  2.68  


Here we see that the two records with the same ID number are not duplicates; they were just mistakenly assigned the same ID. Instead, I can give each record a unique ID by using the row index.

In [5]:
df['ID No'] = df.index
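One way to confirm that repeated IDs are not simply repeated rows is to compare ID-level duplicates with whole-row duplicates. The sketch below runs on a tiny made-up frame (`toy` and its values are invented for illustration and are not taken from the data set):

```python
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "ID No": [101, 102, 101, 103],
    "Prog Code": ["BCH", "MCB", "MCB", "CHE"],
    "CGPA": [3.34, 2.90, 2.32, 4.10],
})

# keep=False marks every row that shares an ID, not just the later repeats
shared_id_rows = toy[toy["ID No"].duplicated(keep=False)]

# Count rows that are identical in every column (true duplicate records)
full_duplicates = int(toy.duplicated().sum())

print(shared_id_rows)
print(f"Fully duplicated rows: {full_duplicates}")
```

If the second count is zero while the first frame is not empty, the IDs were reused across different students, which is the situation that justifies replacing the ID column.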

This data set does not appear to have null values or placeholder values that stand in for nulls (like 0 or -1). Only one student has a 0 value.

In [6]:
from itertools import product
#Find rows that have at least one column <= 0
cols = ['YoG', 'CGPA', 'CGPA100', 'CGPA200', 'CGPA300', 'CGPA400', 'SGPA']
for col in cols:
    null_values = df[df[col] <= 0]  # Filter rows where col is zero or negative
    print(null_values)
print(f"Are there null values? : {df.isnull().values.any()}")
    

Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
      ID No Prog Code Gender   YoG  CGPA  CGPA100  CGPA200  CGPA300  CGPA400  \
2713   2713       PET   Male  2011  2.96     2.77     3.44     2.25      0.0   

      SGPA  
2713  2.81  
Empty DataFrame
Columns: [ID No, Prog Code, Gender, YoG, CGPA, CGPA100, CGPA200, CGPA300, CGPA400, SGPA]
Index: []
Are there null values? : False
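The column-by-column loop above could probably be condensed into a single boolean mask, which prints only the offending rows instead of one empty frame per column. This is a sketch on invented values; `toy` and `numeric_cols` are hypothetical names:

```python
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "CGPA300": [2.25, 3.10, 2.80],
    "CGPA400": [0.0, 3.40, 2.95],
    "SGPA": [2.81, 3.00, 3.20],
})
numeric_cols = ["CGPA300", "CGPA400", "SGPA"]

# True for each row where any listed column is zero or negative
has_nonpositive = (toy[numeric_cols] <= 0).any(axis=1)
flagged = toy[has_nonpositive]
print(flagged)
```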


Possibly this student graduated in three years, but it would be strange for him to be the only one. Another explanation is that his fourth-year GPA was simply never entered. Since his record was the only one with a zero, I am not certain what caused it. I decided it would be easiest to impute this student's fourth-year GPA, since I already have the final GPA and the GPAs for all the other years.

In [7]:
#Solve for fourth year GPA, assuming 30 credits in each of the four years
def replace_fourth_gpa(df):
    replacement = ((df['CGPA']*30*4)-((df['CGPA100']*30)+(df['CGPA200']*30)+(df['CGPA300']*30)))/30

    #Make sure the GPA isn't invalid
    if (replacement > 5.0) or (replacement < 1):
        print("Error, GPA can't be calculated") 
        return 0

    replacement = round(replacement,2)
    print(replacement) #Out of curiosity

    return replacement

df.loc[df['CGPA400'] == 0, 'CGPA400'] = df[df['CGPA400'] == 0].apply(replace_fourth_gpa, axis=1) #apply to each row where 'CGPA400' == 0

3.38
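Because the function assumes the same credit load (30) in every year, the credit weights cancel and the final CGPA is treated as the plain mean of the four yearly GPAs. Under that assumption the missing year is four times the CGPA minus the sum of the other three years. A quick check by hand for the flagged record, using the values printed earlier:

```python
# Values for the flagged record, copied from the printed row above
cgpa, year1, year2, year3 = 2.96, 2.77, 3.44, 2.25

# With equal credit loads: cgpa = (year1 + year2 + year3 + year4) / 4
year4 = round(4 * cgpa - (year1 + year2 + year3), 2)
print(year4)  # 3.38
```

If the real credit loads differed from year to year, the imputed value would shift, so this number is only as good as the equal-credits assumption.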


Now that the invalid value is fixed, I wanted to see if there were any suspicious values or outliers I missed. I did this by looking at the overall distribution of GPAs.

In [8]:
px.histogram(df, x='CGPA').show()
print(f" {round(len(df.loc[df['CGPA'] < 3]) / len(df),4)*100}% of students are graduating with GPAs below 3.0")
print(f" {round(len(df.loc[df['CGPA'] < 2]) / len(df),4)*100}% of students are graduating with GPAs below 2.0")

 24.72% of students are graduating with GPAs below 3.0
 1.81% of students are graduating with GPAs below 2.0


The GPAs are roughly normally distributed, which is not the distribution I expected. The fact that a quarter of students had a GPA below 3.0 was suspicious to me, since most internships require at least a 3.0. I also thought students with a GPA below 2.0 fail out of college, so nearly 2% of students graduating with a GPA below 2.0 is also strange. I decided to look for suspicious GPAs. Assuming a student took between 24 and 36 credits each year, I checked whether the overall GPA could be calculated from the yearly GPAs.

In [9]:
def gpa_checker(df):
    actual_CGPA = df['CGPA']
    gpas = [df['CGPA100'], df['CGPA200'], df['CGPA300'], df['CGPA400']]

    #If this college is like Grove City College (GCC), students generally take between 12 and 18 credits per semester, i.e. 24 to 36 per year
    credit_hours = [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
    combinations = list(product(credit_hours, repeat=4))

    #Check all possible combinations of credit hours to see if their final GPA is possible considering their yearly GPAs
    for combo in combinations:
        total_credits = combo[0] + combo[1] + combo[2] + combo[3]  #avoid shadowing the built-in sum()
        possible_CGPA = round(((combo[0]*gpas[0]) + (combo[1]*gpas[1]) + (combo[2]*gpas[2]) + (combo[3]*gpas[3])) / total_credits, 2)
        #Allow a tolerance of 0.04 around the recorded CGPA
        if (actual_CGPA - 0.04) <= possible_CGPA <= (actual_CGPA + 0.04):
            return True
    return False
        
        
sus_values = df[~df.apply(gpa_checker, axis=1)]
print(sus_values)
print(f"{(round(len(sus_values)/len(df),4))*100}% of records have suspicious CGPAs")

      ID No Prog Code  Gender   YoG  CGPA  CGPA100  CGPA200  CGPA300  CGPA400  \
0         0       ICE  Female  2010  3.23     2.88     3.48     2.62     2.90   
2         2       BCH    Male  2010  2.21     1.78     1.98     1.49     2.51   
3         3       BCH    Male  2010  2.70     2.67     2.44     2.00     2.35   
7         7       BCH  Female  2010  2.56     2.30     2.50     2.29     2.77   
69       69       BCH  Female  2012  2.72     2.78     2.56     2.68     2.57   
...     ...       ...     ...   ...   ...      ...      ...      ...      ...   
2924   2924      PHYE    Male  2014  2.20     3.31     1.84     2.79     1.75   
2928   2928      PHYE    Male  2012  4.36     4.20     4.51     4.55     3.75   
2972   2972      PHYE    Male  2014  3.04     3.00     2.32     2.74     3.61   
3019   3019      PHYG    Male  2012  3.51     3.76     3.11     3.39     3.42   
3027   3027      PHYG    Male  2012  3.52     3.88     3.45     2.81     3.12   

      SGPA  
0     3.13  
2

Around a fifth of the records were flagged as having invalid cumulative GPAs. This suggests that quite a few students either came in with credits and took fewer than 12 credits in at least one semester, or took more than 18 credits in at least one semester. Of course, it is also possible that this college calculates GPA differently than Grove City College does, considering the presence of 5.00 GPAs.
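To see why a particular record gets flagged, it can help to look at the whole range of cumulative GPAs that any credit split could produce. The following is a sketch rather than the checker used above: `cgpa_range` is a hypothetical helper that keeps the same 24 to 36 credits-per-year assumption, applied here to the first flagged record (ID 0, recorded CGPA 3.23).

```python
from itertools import product

def cgpa_range(yearly_gpas, min_credits=24, max_credits=36):
    """Return (lowest, highest) credit-weighted mean over all integer credit loads."""
    loads = range(min_credits, max_credits + 1)
    means = [
        sum(c * g for c, g in zip(combo, yearly_gpas)) / sum(combo)
        for combo in product(loads, repeat=len(yearly_gpas))
    ]
    return min(means), max(means)

# Yearly GPAs of record 0, copied from the printed output above
low, high = cgpa_range([2.88, 3.48, 2.62, 2.90])
print(round(low, 2), round(high, 2))
```

For this record the highest reachable value is about 3.03, which is well below the recorded 3.23 even with the 0.04 tolerance, so no credit split in the assumed range can explain it.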

Next, I checked to see if there was anything off about the distribution of these suspicious records.

In [10]:
valid_records = df[~df['ID No'].isin(sus_values['ID No'])]
cols = ['CGPA', 'CGPA100', 'CGPA200', 'CGPA300', 'CGPA400', 'YoG']

# *** the following code comes from https://stackoverflow.com/questions/56727843/how-can-i-create-subplots-with-plotly-express , I just made it a method
# and edited it so it would work
import plotly.subplots as sp
def subplot_function(x_col):
    #Create figures in Express
    figure1 = px.histogram(sus_values, x=x_col, title='Invalid Records')
    figure2 = px.histogram(valid_records, x=x_col, title='Valid Records')

    # Extract traces from figures
    figure1_traces = [trace for trace in figure1["data"]]
    figure2_traces = [trace for trace in figure2["data"]]

    # Create a 1x2 subplot
    this_figure = sp.make_subplots(rows=1, cols=2, subplot_titles=['Invalid Records', 'Valid Records'])

    # Add traces to subplot
    for traces in figure1_traces:
        this_figure.add_trace(traces, row=1, col=1)
    for traces in figure2_traces:
        this_figure.add_trace(traces, row=1, col=2)

    #*** I added the code that adds the x-axis titles
    # Update xaxis properties (use the x_col argument rather than the global loop variable)
    this_figure.update_xaxes(title_text=x_col, row=1, col=1)
    this_figure.update_xaxes(title_text=x_col, row=1, col=2)


    # Show the plot
    this_figure.show()

for col in cols:
    subplot_function(col)

#*** Everything after this is me again

The invalid records do seem to have slightly worse GPAs than the valid records, but overall they look like a random sample. In the end, nothing jumps out as the reason for all the strange GPAs. A fifth of the data set is too much to drop and would affect later insights, so I am choosing to ignore the strange values.

Next, I was curious to see how correlated a student's high school GPA is with their final GPA, especially considering the emphasis colleges put on high school GPA in admissions.

In [11]:
px.scatter(df, x= 'CGPA', y= 'SGPA', title = "Little Correlation Between High School GPA and College GPA", labels={'SGPA':'High School GPA', 'CGPA':'College GPA'})

Surprisingly, there is only a weak correlation between high school GPA and college GPA. Even stranger, the weak correlation is not due to students who did well in high school becoming burnt out in college. Many students who graduated high school with a GPA below 3 graduated college with a GPA above 3.5. Of course, this data set does not include everything that might affect a student's GPA. Perhaps these low-GPA students were founding a charity or had a dozen extracurriculars and did not have enough time for their studies. That explanation would also explain why the low-GPA students were still accepted into the college.
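A scatter plot shows the shape of the relationship, and a correlation coefficient would put a number on "weak". The sketch below uses pandas' `Series.corr` (Pearson by default); `gpa_correlation` is a hypothetical helper, and the toy values are invented and deliberately perfectly linear so the result is easy to check. They say nothing about the real data.

```python
import pandas as pd

def gpa_correlation(frame, college_col="CGPA", school_col="SGPA"):
    """Pearson correlation between two GPA columns of a DataFrame."""
    return frame[college_col].corr(frame[school_col])

# Toy frame for illustration only; SGPA rises by 0.5 for every 1.0 of CGPA.
toy = pd.DataFrame({
    "CGPA": [2.0, 3.0, 4.0, 5.0],
    "SGPA": [2.5, 3.0, 3.5, 4.0],
})
r = gpa_correlation(toy)
print(round(r, 2))
```

Calling `gpa_correlation(df)` on the real frame would give the coefficient for the plot above.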

Since high school GPA did not have much effect on the college GPA, I was curious if the major was the real determiner of GPA.

In [12]:
px.box(df, x = 'Prog Code', y='CGPA', title = "GPA Distributions by Major", labels={'CGPA':'GPA', 'Prog Code':'Major'})

A student's major is a better indicator of GPA than their high school GPA. While the minimum and maximum GPAs were fairly consistent across programs of study, the median GPAs varied. Building Technology (BLD), Management and Information System (MIS), and Industrial Physics-Renewable Energy (PHYR) all had lower median GPAs compared to the other majors.
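The medians read off the box plot can also be tabulated directly, which makes it easier to rank the majors. A minimal sketch on invented values (the toy frame below is not taken from the data set):

```python
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "Prog Code": ["BLD", "BLD", "BLD", "CHE", "CHE", "CHE"],
    "CGPA": [2.5, 3.0, 3.5, 3.6, 4.0, 4.4],
})

# Median CGPA per major, lowest first
median_by_major = toy.groupby("Prog Code")["CGPA"].median().sort_values()
print(median_by_major)
```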


I was curious if, along with the major, the gender of the student influenced GPA. I assumed it would not, but to my surprise, it did. Furthermore, when plotted across all years, women consistently had a GPA at least 0.2 higher than their male counterparts. Considering that a student's major affects their GPA, I also made a dot plot to add in that variable. In chemistry-related majors like Microbiology, Industrial Chemistry, Chemical Engineering, and Petroleum Engineering, women tended to do about the same as their male counterparts, but for the other majors, there was a significant gap.

In [13]:
grouped_data = df.groupby(['Gender', 'YoG'])['CGPA'].mean().reset_index()
grouped_data["YoG"] = grouped_data["YoG"].astype(str)
px.line(grouped_data, x = 'YoG', y = 'CGPA', title = 'Women Consistently Graduate with a Higher GPA than Men', color = 'Gender', 
        labels={'CGPA':'GPA', 'YoG':'Year Graduating'}).show()

In [14]:
grouped_data = df.groupby(['Gender', 'Prog Code'])['CGPA'].mean().reset_index()
px.scatter(grouped_data, x = 'Prog Code', y = 'CGPA', title = 'Women Outperform Men in Their Major', color = 'Gender', labels={'CGPA':'GPA', 'Prog Code':'Major'}).show()
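The gap between the two dots for each major can also be computed explicitly by pivoting the grouped means so that each gender becomes a column. This is a sketch on invented values; the `Gap` column is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "Prog Code": ["CEN", "CEN", "CEN", "CEN", "MCB", "MCB", "MCB", "MCB"],
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male"],
    "CGPA": [4.0, 3.6, 3.2, 3.0, 3.5, 3.3, 3.4, 3.4],
})

# Mean CGPA per (major, gender), with one column per gender
means = toy.groupby(["Prog Code", "Gender"])["CGPA"].mean().unstack("Gender")

# Positive values mean women average higher than men in that major
means["Gap"] = means["Female"] - means["Male"]
print(means)
```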

Considering the fact that women have such good GPAs at this college compared to the men, I thought it would be good to check whether the majors with a higher average GPA simply had a higher percentage of women.

In [39]:
#Find total student counts and how many of those are women
women_by_major = df.groupby(['Prog Code', 'Gender'])['Gender'].count().reset_index(name='Count')
counts_major = df.groupby('Prog Code')['Gender'].count().reset_index(name= 'Total')

#Group into one data frame
total_counts = women_by_major.merge(counts_major, on = 'Prog Code')
women_counts = total_counts.loc[total_counts['Gender'] =='Female'].copy() 
women_counts['Percent Women'] = round((women_counts['Count'] / women_counts['Total']),4) * 100

px.bar(women_counts, x='Prog Code', y= "Percent Women").show()

Turns out, not every woman at this college has an excellent GPA. The majors where the women's GPA is similar to the men's are the same majors that have a higher percentage of women: the chemistry majors. With fewer women in the other majors, there were also fewer underachieving women, which made the men look worse by comparison.
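The share of women per major can also be computed in one step by averaging a boolean column within each group, since the mean of True/False values is the fraction that are True. A sketch on invented values:

```python
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "Prog Code": ["MCB", "MCB", "MCB", "MCB", "MCE", "MCE"],
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
})

# Fraction of rows per major where Gender is Female, expressed as a percentage
percent_women = (
    toy["Gender"].eq("Female")
    .groupby(toy["Prog Code"])
    .mean()
    .mul(100)
    .round(2)
)
print(percent_women)
```

One difference from the count-and-merge approach: a major with no women still appears here with 0%, whereas filtering the grouped counts to `Gender == 'Female'` would drop that major from the chart.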

Exploring this data led to many surprising insights. Just from looking at the raw data, it is not easy to see that the GPA scores are normally distributed, or that women outperform men with regard to graduating GPA. There were multiple trends that I would not have been able to guess without the plots, such as high school GPA being a poor indicator of college GPA.

This project gave me a better understanding of the "Academic Performance" data set. Because of my familiarity with the data, I have several ideas for using this data set to build a machine learning model; for example, using CGPA as the label, with a student's major and gender as features instead of high school GPA.