# Academic Ranking of World Universities (Focus on Monash University and Seoul National University)
  
This practice was inspired by **MSU vs Top - 7** by **Ayat Ospanov**   
https://www.kaggle.com/ospanoff/msu-vs-top-7  
  
This practice mainly compares two universities (Monash University and Seoul National University) with the top 10 universities in the world.  
  
It contains both Korean and English and was done for personal practice.

## Introduction

  The data contains the world ranks of universities, which can be used for comparison.
  I am interested in two universities (Monash University and Seoul National University):
  * Monash University: I am currently studying at Monash University.
  * Seoul National University: Since I am from Korea, it seems interesting to compare.

## Load data and import libraries

In [None]:
import numpy as np
import pandas as pd
import requests
import sys
import json
import time
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode()

In [None]:
#read_csv ARWU (Shanghai data)
data = pd.read_csv('../input/shanghaiData.csv', skiprows=[3897])

## Academic Ranking of World Universities
Here is the link for Academic Ranking of World Universities (ARWU)  

**[CLICK HERE TO SEE THE WEB PAGE](http://www.shanghairanking.com/ARWU-Methodology-2017.html)**  
* **Quality of Education**
    * Alumni (10%): Alumni of an institution winning Nobel Prizes and Fields Medals
* **Quality of Faculty**
    * Award (20%): Staff of an institution winning Nobel Prizes and Fields Medals
    * HiCi (20%): Highly cited researchers in 21 broad subject categories
* **Research Output**
    * N&S (20%): Papers published in Nature and Science
    * PUB (20%): Papers indexed in Science Citation Index-expanded and Social Science Citation Index
* **Per Capita Performance**
    * PCP (10%): Per capita academic performance of an institution
    
*For institutions specialized in humanities and social sciences such as London School of Economics, N&S is not considered, and the weight of N&S is relocated to other indicators.*
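
The weighting above can be sketched as a plain weighted sum. This is only an illustration: the `ARWU_WEIGHTS` and `weighted_total` names and the example scores are made up here, and the published ARWU total may additionally be rescaled relative to the top institution, which this sketch does not attempt.

```python
# Indicator weights as listed above; they sum to 1.0.
ARWU_WEIGHTS = {
    'alumni': 0.10, 'award': 0.20, 'hici': 0.20,
    'ns': 0.20, 'pub': 0.20, 'pcp': 0.10,
}

def weighted_total(scores, weights=ARWU_WEIGHTS):
    """Weighted sum of indicator scores (each assumed to be on a 0-100 scale)."""
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical indicator scores, for illustration only (not taken from the data).
example_scores = {'alumni': 50.0, 'award': 40.0, 'hici': 60.0,
                  'ns': 30.0, 'pub': 70.0, 'pcp': 20.0}
example_total = weighted_total(example_scores)
print(round(example_total, 2))
```

Keeping the weights in one dictionary makes it easy to check that they sum to 1 and to reuse them for every row.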
    

### Seoul University

In [None]:
data[data['university_name'] == "Seoul National University"]

The output shows that there is no total_score for Seoul National University.  
Therefore, we need to compute it using the ARWU method.

### Monash University

In [None]:
data[data['university_name'] == "Monash University"]

Since the Monash data also does not contain total_score, we need to compute it using the ARWU method.

### Harvard University and Oxford University 
*Just to check a couple of top-ranked universities for comparison*

In [None]:
data[data['university_name'] == "Harvard University"]

In [None]:
data[data['university_name'] == "University of Oxford"]

The examples above show that total_score is not calculated for universities ranked below the top 100, so it needs to be filled in.
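
To confirm this across the whole table rather than from a few examples, one option is to look at which rows lack total_score. A minimal sketch on a toy frame (the `toy` rows are made up; only the column names follow the data used here):

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the layout of the data (values are made up).
toy = pd.DataFrame({
    'world_rank': ['1', '2', '101-152', '101-152'],
    'university_name': ['A', 'B', 'C', 'D'],
    'total_score': [100.0, 72.5, np.nan, np.nan],
})

# Rows where total_score is missing.
missing = toy[toy['total_score'].isna()]

# Share of rows with a missing total_score for each world_rank value.
missing_share = toy['total_score'].isna().groupby(toy['world_rank']).mean()
print(missing_share)
```

On the real data the same two lines could be applied to `data` instead of `toy`.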

### Compute total_score

In [None]:
#lambda (https://wikidocs.net/64): weighted sum of the six indicators using the ARWU weights
data['new_total_score'] = data.apply(lambda x: 0.1 * x['alumni'] + 0.2 * x['award'] + 0.2 * x['hici'] + 0.2 * x['ns'] + 0.2 * x['pub'] + 0.1 * x['pcp'], axis=1)

In [None]:
#to check the correlation between old total score and NEW total score
data[['total_score', 'new_total_score']].corr()

Since the two scores are highly correlated, we can drop the old total_score and use new_total_score instead.
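
A correlation close to 1 only says the two scores move together, not that they are equal, because correlation ignores scale. A rough sketch of an extra check on made-up numbers (the `toy` values are invented; on the real data the same lines could be run on `data`):

```python
import pandas as pd

# Made-up scores: the new score is the old one scaled by 0.9, with one old value missing.
toy = pd.DataFrame({
    'total_score':     [100.0, 80.0, 60.0, None],
    'new_total_score': [90.0, 72.0, 54.0, 30.0],
})

# Compare only the rows where the old score exists.
both = toy.dropna(subset=['total_score'])

correlation = both['total_score'].corr(both['new_total_score'])
ratio = both['total_score'] / both['new_total_score']
max_abs_diff = (both['total_score'] - both['new_total_score']).abs().max()

print(correlation, ratio.min(), ratio.max(), max_abs_diff)
```

If the ratio is roughly constant, the two scores differ only by a scale factor, which does not change the ordering of universities.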

In [None]:
data.drop('total_score', axis=1, inplace=True)
data.rename(columns={'new_total_score': 'total_score'}, inplace=True)

In [None]:
#Show the first 10 rows of the data, now with the new total_score
data[:10]

In [None]:
#total_score for Monash and Seoul University
data[data.university_name == 'Seoul National University']

In [None]:
data[data.university_name == 'Monash University']

Now both universities have a total_score and are ready to be compared.

## Data exploration

In [None]:
#explore the last year in the data, which is 2015
year = 2015
data_byy = data.groupby('year').get_group(year)
corr = data_byy[['alumni', 'award', 'hici', 'ns', 'pub', 'pcp']].corr()

In [None]:
iplot([
    go.Heatmap(
        z=corr.values[::-1],
        x=['alumni', 'award', 'hici', 'ns', 'pub', 'pcp'],
        y=['alumni', 'award', 'hici', 'ns', 'pub', 'pcp'][::-1]
    )
])

The correlation matrix shows that most of the indicators are strongly related, with most coefficients above 0.5.  
We will look at some of these correlations in more detail below.
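
Besides reading the heatmap, the strongest pairs can be listed directly from a correlation matrix. A minimal sketch on made-up indicator values (the `toy` frame is invented; on the real data `corr` from the cell above could be used in place of `corr_toy`):

```python
import numpy as np
import pandas as pd

# Made-up indicator values, for illustration only.
toy = pd.DataFrame({
    'hici': [10, 20, 30, 40, 50],
    'ns':   [12, 18, 33, 41, 52],
    'pub':  [50, 10, 40, 20, 30],
})
corr_toy = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr_toy.shape, dtype=bool), k=1)
pairs = corr_toy.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```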

In [None]:
#correlation plot setting
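def plot_corr(x_name, y_name, year):
    #select the rows for the requested year so the data matches the plot title
    data_year = data[data['year'] == year]
    data_sc = [go.Scatter(
        x = data_year[x_name],
        y = data_year[y_name],
        text = data_year['university_name'],
        mode = 'markers',
        marker = dict(
            size = 10,
            color = data_year['total_score'],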
            colorscale = 'Rainbow',
            showscale = True,
            colorbar = dict(
                title = 'total score',
            ),
        ),
    )]

    layout = go.Layout(
        title = '%s World University Rankings' % year,
        hovermode = 'closest',
        xaxis = dict(
            title = x_name,
        ),
        yaxis = dict(
            title = y_name,
        ),
        showlegend = False
    )

    iplot(go.Figure(data=data_sc, layout=layout))

### Correlation between HiCi and N&S (Correlation : 0.866)

In [None]:
#HiCi and N&S
plot_corr('hici', 'ns', year)

This is the highest correlation in the data.  
One possible explanation is that publishing and citation go together: institutions that publish many papers in Nature and Science produce work that is widely cited,  
so their number of highly cited researchers also rises. This may produce the high correlation.

### Correlation between Alumni and Award (Correlation: 0.762)

In [None]:
#alumni and award
plot_corr('alumni', 'award', year)

This is the second highest correlation in the data.  
One thing to notice is that some universities have a value of 0 on the x-axis or the y-axis, meaning they score in only one of the two categories.  
If we excluded them, the correlation might be stronger than the current figure.

A further study could clean the data by removing those 0 values and check how the figure changes.
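
As a starting point for that further study, the effect of the 0 values can be tried on a small example first. The numbers below are invented, so the result only shows the mechanics, not what happens on the real data:

```python
import pandas as pd

# Made-up alumni/award scores; three rows have a 0 in one of the two columns.
toy = pd.DataFrame({
    'alumni': [0.0, 0.0, 10.0, 20.0, 30.0, 40.0, 25.0],
    'award':  [15.0, 30.0, 12.0, 18.0, 33.0, 41.0, 0.0],
})

# Correlation over all rows.
corr_all = toy['alumni'].corr(toy['award'])

# Correlation over rows where both indicators are above 0.
nonzero = toy[(toy['alumni'] > 0) & (toy['award'] > 0)]
corr_nonzero = nonzero['alumni'].corr(nonzero['award'])

print(round(corr_all, 3), round(corr_nonzero, 3))
```

In this toy data the correlation is higher once the 0 rows are removed; whether the same holds for the 2015 data would need to be checked on `data_byy`.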

### Correlation between HiCi and PUB (Correlation: 0.621)

In [None]:
#pub and hici
plot_corr('pub', 'hici', year)

The correlation is not as strong as in the previous figures, but it can be explained: the more publications an institution produces, the more its researchers tend to be cited.

### Correlation between Award and PUB (Correlation: 0.406)

In [None]:
#award and pub
plot_corr('award', 'pub', year)

This was actually an interesting figure in the correlation data.  
I expected that the amount of publishing would affect awards, but the figure shows that this is not necessarily the case.  
Therefore, to win prizes it may be more important for institutions to focus on the quality of their papers than to spend more time publishing more.  

If a score for publication quality were available, it could be used to discuss paper quality, and it could lead to further practice on how much the quality of papers matters compared with the amount of publishing.

## Top 10 Universities Compared to Monash University and Seoul National University

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import colors as mcol

class Radar(object):

    def __init__(self, fig, titles, labels, rect=None):
        if rect is None:
            rect = [0.05, 0.05, 0.95, 0.95]

        self.n = len(titles)
        self.angles = np.arange(0, 360, 360.0/self.n)
        self.axes = [fig.add_axes(rect, projection="polar", label="axes%d" % i) 
                         for i in range(self.n)]

        self.ax = self.axes[0]
        self.ax.set_thetagrids(self.angles, labels=titles, fontsize=14)

        for ax in self.axes[1:]:
            ax.patch.set_visible(False)
            ax.grid(False)
            ax.xaxis.set_visible(False)

        for ax, angle, label in zip(self.axes, self.angles, labels):
            ax.set_rgrids(range(1, 102, 10), angle=angle, labels=label)
            ax.spines["polar"].set_visible(False)
            ax.set_ylim(0, 101)

    def plot(self, values, *args, **kw):
        angle = np.deg2rad(np.r_[self.angles, self.angles[0]])
        values = np.r_[values, values[0]]
        self.ax.plot(angle, values, *args, **kw)

In [None]:
top_univers = ['Harvard University',
               'Stanford University',
               'Massachusetts Institute of Technology (MIT)',
               'University of California, Berkeley', 'University of California-Berkeley',
               'University of Cambridge',
               'Princeton University',
               'California Institute of Technology',
               'Columbia University',
               'University of Chicago',
               'University of Oxford',
               'Monash University',
               'Seoul National University']

In [None]:
Comparison = []

#sort the years so the radar charts are drawn in chronological order
years = sorted(set(data['year']))
for year in years:
    tmp = data.groupby('year').get_group(year)
    
    ind = np.where(tmp['university_name'] == top_univers[0])[0]
    univers = tmp.iloc[ind].values
    for un in top_univers[1:]:
        ind = np.where(tmp['university_name'] == un)[0]
        univers = np.append(univers, tmp.iloc[ind].values, axis=0)
    
    Comparison += [univers]
    
Comparison = np.array(Comparison)

In [None]:
#2005 - 2015 data comparison in the radar
titles = ['alumni', 'award', 'hici', 'ns', 'pub', 'pcp']
labels = ['' if i != 1 else range(0, 101, 10) for i in range(len(titles) - 1)]
colors = np.asarray(list(mcol.cnames))
colors = colors[np.random.randint(0, len(mcol.cnames), Comparison[0].shape[0])]

for d in Comparison:
    fig = plt.figure(figsize=(5, 5))
    radar = Radar(fig, titles, labels)
    for i, univ in enumerate(d[d[:, 0].argsort()]):
        radar.plot(univ[3:9], lw=2, c=colors[i], alpha=1, label=univ[1] + ' (' + univ[0] + ')')

    radar.ax.legend(bbox_to_anchor=(1, 1), loc=2);
    plt.show()

### *Notes*
I wanted to sort the universities by rank, but since universities ranked below 100 are grouped into ranges of about 50, they cannot be sorted exactly. To do so, the rank would need to be adjusted.  
However, given the nature of the data, ARWU must have a reason for grouping them this way.  
A further study could assign a specific rank for each year so that the universities can be compared completely.

There is not much difference among the top 10 universities, but Seoul National University and Monash University are growing steadily every year in every variable used for comparison.  
As a result, their world ranks have risen.
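
One simple way to make the ranges sortable is to map each one to a single number. The sketch below uses the midpoint of the range; this is my own assumption for illustration, not how ARWU assigns ranks, and the `rank_to_number` helper and the `'101-152'` style of range string are assumed as well.

```python
def rank_to_number(world_rank):
    """Map '7' to 7.0 and a range such as '101-152' to its midpoint 126.5."""
    bounds = [float(part) for part in str(world_rank).split('-')]
    return sum(bounds) / len(bounds)

# Example rank strings in the assumed format.
ranks = ['101-152', '3', '201-300', '10']
sorted_ranks = sorted(ranks, key=rank_to_number)
print(sorted_ranks)
```

Universities inside the same range still tie, so this only fixes the ordering between ranges and exact ranks.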
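
The growth can also be read as numbers rather than from the radar charts, for example as the change in total_score between the first and last year. A rough sketch with invented scores (the names `U1` and `U2` and all values are hypothetical):

```python
import pandas as pd

# Invented scores for two hypothetical universities over three years.
toy = pd.DataFrame({
    'university_name': ['U1'] * 3 + ['U2'] * 3,
    'year': [2013, 2014, 2015] * 2,
    'total_score': [20.0, 21.5, 23.0, 30.0, 29.0, 31.0],
})

# One row per year, one column per university.
trend = toy.pivot(index='year', columns='university_name', values='total_score')

# Change between the first and the last year for each university.
change = trend.iloc[-1] - trend.iloc[0]
print(change)
```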

## Conclusion
Although the Monash University figures were somewhat disappointing compared with what I expected, they show that  
the university is improving, and the Seoul National University values were really impressive, which I did not expect at all.

This could lead to further study such as:
* Comparison within the country
    * Comparison in the national ranking
* Comparison of each variable (e.g. HiCi, Awards ...)
* Adjusted world rank, replacing each rank range with a specific rank number