
The dataset used for this analysis comes from Kaggle and covers the QS top 100 universities worldwide as of 2024. It contains the metrics used to evaluate the performance and reputation of these universities: rank, university name, overall score, academic reputation, employer reputation, faculty-student ratio, citations per faculty, international faculty ratio, international students ratio, international research network, employment outcomes, sustainability score, and the endowment funds available to each university (Fundos (US$), where "Fundos" is Portuguese for "funds"). Together these metrics capture not only the academic and research capabilities of the institutions but also their global reach, resources, and sustainability efforts.

Source and Data Download: https://www.kaggle.com/datasets/willianoliveiragibin/qs-top-100-universities


Central Question: What factors contribute most significantly to a university's overall ranking in the QS top 100, and how do these factors influence the university's resources?

To work toward an answer, I will analyze the relationships and correlations among the factors in the dataset.


In [14]:
import pandas as pd 

df_universities = pd.read_csv('top 100 world university 2024 new.csv')
df_universities

Unnamed: 0,sequence,rank,university,overall_score,academic_reputation,employer_reputation,faculty_student_ratio,citations_per_faculty,international_faculty_ratio,international_students_ratio,international_research_network,employment_outcomes,sustainability,Fundos (US$)
0,0,1,Massachusetts Institute of Technology (MIT),100.0,100.0,100.0,100.0,100.0,100.0,88.2,94.3,100.0,95.2,9.2
1,1,2,University of Cambridge,99.2,100.0,100.0,100.0,92.3,100.0,95.8,99.9,100.0,97.3,7.8
2,2,3,University of Oxford,98.9,100.0,100.0,100.0,90.6,98.2,98.2,100.0,100.0,97.8,6.7
3,3,4,Harvard University,98.3,100.0,100.0,98.3,100.0,84.6,66.8,100.0,100.0,96.7,6.3
4,4,5,Stanford University,98.1,100.0,100.0,100.0,99.9,99.9,51.2,95.8,100.0,94.4,6.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,100,100,University of Nottingham,60.4,60.7,72.1,32.2,46.5,90.0,75.2,98.4,24.4,80.0,0
101,101,102,University of Wisconsin-Madison,60.0,80.2,47.8,61.3,37.4,30.9,22.8,83.6,73.1,83.7,0
102,102,103,Pontificia Universidad Católica de Chile (UC),59.9,92.9,99.5,20.6,11.6,16.4,3.5,56.8,76.3,91.3,0
103,103,104,The University of Sheffield,59.7,58.7,52.3,54.7,46.9,84.0,97.5,96.1,24.9,76.3,0
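The Unnamed: 0 column in the output above looks like a row index that was written into the CSV when it was saved. That is an assumption, since the file itself is not shown here. Under that assumption, a minimal sketch of loading such a file so the column becomes the index, using a two-row inline sample (a subset of columns, values copied from the output above) in place of the real file:

```python
import io
import pandas as pd

# Two-row stand-in for the Kaggle CSV, not the real file.
# The empty first header field is what pandas displays as 'Unnamed: 0'.
sample_csv = io.StringIO(
    ",sequence,rank,university,overall_score\n"
    "0,0,1,Massachusetts Institute of Technology (MIT),100.0\n"
    "1,1,2,University of Cambridge,99.2\n"
)

# index_col=0 tells pandas to use the first column as the row index
df_sample = pd.read_csv(sample_csv, index_col=0)
print(df_sample.columns.tolist())
```

With the real file, the same index_col=0 argument could be passed to the pd.read_csv call in the cell above, provided the first column really is a saved index.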


In [12]:
import altair as alt

# Scatterplot: one circle per university
scatter_plot = alt.Chart(df_universities).mark_circle(size=60, opacity=0.8).encode(
    x=alt.X('academic_reputation:Q', scale=alt.Scale(domain=(40, 100))),
    y=alt.Y('overall_score:Q', scale=alt.Scale(domain=(50, 100))),
    tooltip=['university', 'academic_reputation', 'overall_score']
).properties(
    title='Academic Reputation vs. Overall Score',
    width=600,
    height=400
)

# Overlay a linear least-squares fit of overall_score on academic_reputation
regression_line = scatter_plot.transform_regression(
    "academic_reputation", "overall_score", method="linear"
).mark_line(color='red')

# Layer the fitted line on top of the points
scatter_plot + regression_line

My first visualization is a scatterplot of academic reputation against overall score. Each dot represents a university. The regression line has a positive slope, indicating that universities with higher academic reputations tend to have higher overall scores. This simple analysis is a first step toward understanding the key factors behind a university's score, and it points to the impact of academic reputation. Although academic reputation appears to be a major factor, some schools with a high reputation still have low overall scores: Pontificia Universidad Católica de Chile (UC), for example, has an academic reputation of 92.9 but an overall score of 59.9. I will therefore continue to explore the importance of the other factors in these rankings.
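To put numbers on the trend in the scatterplot, the Pearson correlation and the least-squares slope can be computed directly. The sketch below uses only the nine rows visible in the truncated table output, not the full dataset, so its values will differ from what the full data gives; the variable names are illustrative.

```python
import pandas as pd

# The nine rows visible in the truncated table output (not the full dataset)
sample = pd.DataFrame({
    "academic_reputation": [100.0, 100.0, 100.0, 100.0, 100.0, 60.7, 80.2, 92.9, 58.7],
    "overall_score": [100.0, 99.2, 98.9, 98.3, 98.1, 60.4, 60.0, 59.9, 59.7],
})

x = sample["academic_reputation"]
y = sample["overall_score"]

# Pearson correlation coefficient between the two columns
r = x.corr(y)

# Ordinary least-squares fit: slope = cov(x, y) / var(x)
slope = x.cov(y) / x.var()
intercept = y.mean() - slope * x.mean()

print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

On the full data, the same calculation could be run on df_universities by replacing sample with it.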

In [21]:

numeric_cols = ['overall_score', 'academic_reputation', 'employer_reputation', 'faculty_student_ratio', 'citations_per_faculty', 'international_faculty_ratio', 'international_students_ratio', 'international_research_network', 'employment_outcomes', 'sustainability', 'Fundos (US$)']
df_numeric = df_universities[numeric_cols]

correlation_matrix = df_numeric.corr().reset_index().melt('index')

correlation_matrix.columns = ['Variable 1', 'Variable 2', 'Correlation']

heatmap = alt.Chart(correlation_matrix).mark_rect().encode(
    x=alt.X('Variable 1:N', sort=numeric_cols, title=None),
    y=alt.Y('Variable 2:N', sort=numeric_cols, title=None),
    color=alt.Color('Correlation:Q', scale=alt.Scale(domain=[-1, 1], scheme='redblue')),
    tooltip=['Variable 1', 'Variable 2', 'Correlation:Q']
).properties(
    title='Heatmap of Correlation Between University Ranking Factors',
    width=600,
    height=600
)

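# White text on dark cells (strong positive or negative correlation), black elsewhere
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('Correlation:Q', format='.2f'),
    color=alt.condition(
        (alt.datum.Correlation > 0.5) | (alt.datum.Correlation < -0.5),
        alt.value('white'),
        alt.value('black')
    )
)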

heatmap + text


Since the goal is to understand which factors matter most for the universities' overall scores and rankings, a heat map is more practical than a separate plot for each factor. It shows each variable's correlation with the overall score and exactly how strong that correlation is. Looking at the overall score column, the variables with the strongest correlations are academic reputation (0.7), employment outcomes (0.59), employer reputation (0.52), and citations per faculty (0.51). With this information, we can single out these four factors as the ones most strongly associated with a university's rank.
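The overall score column of the correlation matrix can also be extracted and sorted in code instead of being read off the heat map. The sketch below stands in for df_numeric with the nine rows visible in the table output and three of the factors, so the printed values will not match the full-data correlations quoted above; the name corr_with_score is illustrative.

```python
import pandas as pd

# Stand-in for df_numeric: the nine visible rows, three factors only
sample = pd.DataFrame({
    "overall_score": [100.0, 99.2, 98.9, 98.3, 98.1, 60.4, 60.0, 59.9, 59.7],
    "academic_reputation": [100.0, 100.0, 100.0, 100.0, 100.0, 60.7, 80.2, 92.9, 58.7],
    "employer_reputation": [100.0, 100.0, 100.0, 100.0, 100.0, 72.1, 47.8, 99.5, 52.3],
    "employment_outcomes": [100.0, 100.0, 100.0, 100.0, 100.0, 24.4, 73.1, 76.3, 24.9],
})

# Take the overall_score column of the correlation matrix, drop the
# self-correlation (always 1.0), and sort from strongest to weakest
corr_with_score = (
    sample.corr()["overall_score"]
    .drop("overall_score")
    .sort_values(ascending=False)
)

print(corr_with_score.round(2))
```

Computing the ordering this way avoids retyping correlation values by hand when they are needed for a follow-up chart.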

In [28]:
# Correlations with overall_score, as shown in the overall_score column of the heatmap
cor_data = {
    'Variable': ['Academic Reputation', 'Employer Reputation', 'Faculty Student Ratio', 'Citations per Faculty', 'International Faculty Ratio', 'International Students Ratio', 'International Research Network', 'Employment Outcomes', 'Sustainability'],
    'Correlation': [0.7, 0.52, 0.42, 0.51, 0.31, 0.27, 0.23, 0.59, 0.32]
}
df_correlation = pd.DataFrame(cor_data)

# Sort so the strongest correlations come first
df_corr_sorted = df_correlation.sort_values('Correlation', ascending=False)
bar_chart = alt.Chart(df_corr_sorted).mark_bar().encode(
    x=alt.X('Correlation:Q', title='Correlation with Overall Score'),
    y=alt.Y('Variable:N', sort='-x', title=None),  # order bars by correlation, descending
    color=alt.value('green'),
    tooltip=['Variable', 'Correlation']
).properties(
    title='Correlation of Variables with Overall QS Score',
    width=600,
    height=300
)

bar_chart

Now that the heat map has identified the most highly correlated factors, it helps to visualize the correlations in sorted order so that we can see which factors are most strongly associated with a good overall score. This ordering matters because it could guide changes within universities that aim to improve their ranking and score. They could prioritize the most strongly correlated areas, such as academic reputation and employment outcomes, and then work through the remaining factors in decreasing order of correlation.

Live Link: https://github.com/vivekl2003/M2-Live