## EDA on world university ranking

### Let's start with the [Times ranking](https://www.timeshighereducation.com/world-university-rankings), which is widely regarded as one of the most reliable rankings.

The dataset has 2,603 rows, one per university per year, and the following columns:
* world_rank                - ranking of the university globally
* university_name           - name of the university
* country                   -  country the university belongs to
* teaching                  - university score for teaching/learning environment
* international             - university score international outlook (staff, students, research)
* research                  - university score for research (volume, income and reputation)
* citations                 - university score for citations (research influence)
* income                    - university score for industry income (knowledge transfer)
* total_score               - total score for university, used to determine rank
* num_students              - number of students at the university
* student_staff_ratio       - student to staff ratio
* international_students    - percentage of international students (e.g. `27%`)
* female_male_ratio         - female to male ratio (e.g. `33 : 67`)
* year                      - Year of collected data

In [None]:
import pandas as pd
timesData = pd.read_csv('../input/world-university-rankings/timesData.csv')
print(timesData.shape)
timesData.head()

In [None]:
import matplotlib.pyplot as plt
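## Utility
def plot_counts(X, y, xlabel, ylabel, title, figsize=(10,7), xrotate=None, yrotate=None, horizontal=False):
    """Draw a bar chart of y against X, with optional tick rotation or horizontal bars."""
    plt.figure(figsize=figsize)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if xrotate:
        plt.xticks(rotation=xrotate)
    if yrotate:
        plt.yticks(rotation=yrotate)
    if horizontal:
        plt.barh(X, y)
    else:
        plt.bar(X, y)
    plt.show()

def convert_int(df, col):
    """Return column `col` converted to a numeric dtype."""
    return pd.to_numeric(df[col])

def remove_dash(df, col):
    """Replace '-' placeholders in `col` with 0, then make the column numeric."""
    # astype(str) keeps the mask boolean even if the column holds non-string values
    is_dash = df[col].astype(str).str.contains("-", regex=False)
    df.loc[is_dash, col] = 0
    df[col] = convert_int(df, col)
    return df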

Clean the data: replace NaN and '-' values with 0, and convert numeric strings to numbers.

In [None]:
## Data cleaning
null_columns = timesData.columns[timesData.isnull().any()].tolist()
print(f"Columns with NaN values = {null_columns}")
## for the sake of EDA, let's replace NaN values with 0
timesData = timesData.fillna(0)
## check if any NaN data left
print(f"Any NaN column left = {timesData.isnull().any().any()}")
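Filling every gap with 0 keeps the later parsing simple, but the zeros also enter summary statistics such as `describe()`. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical, not taken from `timesData`) shows how the mean shifts, and how `pd.to_numeric(..., errors='coerce')` could keep the gaps as NaN instead:

```python
import pandas as pd

# Hypothetical toy scores; "-" and None stand in for missing values.
toy = pd.DataFrame({"total_score": ["90.0", "-", "70.0", None]})

# Anything that cannot be parsed as a number becomes NaN.
as_nan = pd.to_numeric(toy["total_score"], errors="coerce")

# Same effect as the zero-filling used in this notebook.
as_zero = as_nan.fillna(0)

print(as_zero.mean())  # zeros pull the mean down
print(as_nan.mean())   # NaN entries are skipped by mean()
```

Which variant fits depends on the question: zeros are convenient for counting and sorting, while NaN is usually the safer choice for averages.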

In [None]:
## clean all columns that should hold numeric values
dash_cols = ['international', 'income', 'total_score']
for col in dash_cols:
    timesData = remove_dash(timesData, col)
## strip thousands separators; values without a comma (under 1,000 students) are kept rather than zeroed
timesData['num_students'] = pd.to_numeric(timesData['num_students'].astype(str).str.replace(",", "", regex=False), errors='coerce').fillna(0)
timesData['international_students_percent'] = timesData.international_students.map(lambda x: x.replace("%", "") if type(x) != int and "%" in x else 0).apply(pd.to_numeric)
timesData['num_international_students'] = (timesData.num_students * timesData.international_students_percent) / 100
timesData['female_percent'] = timesData.female_male_ratio.apply(lambda x: x.split(":")[0] if type(x) != int and ":" in x else 0).apply(pd.to_numeric)
timesData['male_percent'] = timesData.female_male_ratio.apply(lambda x: x.split(":")[1] if type(x) != int and ":" in x else 0).apply(pd.to_numeric)
timesData.drop(['international_students', 'female_male_ratio'], axis=1, inplace=True)
timesData.head()
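The string parsing in this cell can also be sketched with vectorized string methods. The example below runs on a hypothetical `toy` frame that only mimics the raw formats (`"2,243"`, `"27%"`, `"33 : 67"`); the helper name `to_number` is an assumption made for illustration, and unparsable entries become NaN here rather than 0.

```python
import pandas as pd

# Hypothetical rows in the same string formats as the raw columns.
toy = pd.DataFrame({
    "num_students": ["2,243", "462", None],
    "international_students": ["27%", "5%", None],
    "female_male_ratio": ["33 : 67", "-", None],
})

def to_number(series):
    """Strip whitespace and convert to floats; unparsable entries become NaN."""
    return pd.to_numeric(series.str.strip(), errors="coerce")

# Remove thousands separators before converting.
toy["num_students"] = to_number(toy["num_students"].str.replace(",", "", regex=False))

# Drop the trailing percent sign.
toy["international_students_percent"] = to_number(
    toy["international_students"].str.rstrip("%")
)

# Split "33 : 67" into two columns; rows without ":" yield NaN.
ratio = toy["female_male_ratio"].str.split(":", expand=True)
toy["female_percent"] = to_number(ratio[0])
toy["male_percent"] = to_number(ratio[1])

print(toy[["num_students", "international_students_percent",
           "female_percent", "male_percent"]])
```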

In [None]:
timesData.dtypes

In [None]:
timesData.describe()

### How many years of data do we have?

* We have data from 2011 to 2016.
* 2016 has more data points than other years.

In [None]:
year_count = timesData.year.value_counts()
plot_counts(year_count.index, year_count.values, 'year', 'data count', 'Data for each year')

### Let's focus on the latest year available, 2016

All the questions below use 2016 data only; later we can compare with previous years. For 2016 we should have close to 800 data points to analyze.

In [None]:
year = 2016
timesDataYear = timesData[timesData.year == year]
# timesDataYear.head()

### Which are the top 10 universities according to the Times ranking?

* CalTech is at the top, followed by Oxford and Stanford.
* Of the top 10 universities, 6 are in the USA.
* That could be one reason international students prefer to study in the States.
* CalTech also has some of the highest scores for teaching, citations and income.

In [None]:
timesDataYear[['world_rank', 'university_name', 'country', 'total_score', 'year']][:10]
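The per-country tally of the top universities can be computed rather than counted by eye. A small sketch on a made-up table (the names, countries and the `top_n` cut-off are hypothetical, and the rows are assumed to be in rank order, as in the CSV):

```python
import pandas as pd

# Made-up ranking table in file order (best rank first); not real data.
toy = pd.DataFrame({
    "university_name": ["A", "B", "C", "D", "E"],
    "country": ["United States of America", "United Kingdom",
                "United States of America", "Switzerland",
                "United States of America"],
})

top_n = 5  # hypothetical cut-off; the text above uses the top 10
top_country_counts = toy.head(top_n)["country"].value_counts()
print(top_country_counts)
```

The same two lines should work on `timesDataYear` with `top_n = 10`.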

### How many countries are covered? And how many universities does each country have?

* We have data for 72 countries out of the 195 countries in the world.
* The US has more universities (close to 150) than any other country, another reason why the US is popular with international students.
* Some countries, such as Lebanon, Ghana and Uganda, have only one university in the ranking.

In [None]:
country_count = timesDataYear.country.value_counts()
print(f"Total Countries {len(country_count.values)}")
plot_counts(
    country_count.index[:20],
    country_count.values[:20],
    xlabel='Number of universities',
    ylabel='Country',
    title='Top 20 countries by number of ranked universities',
    horizontal=True
)
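The countries with a single ranked university can be read straight off the `value_counts()` result. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up rows: one row per ranked university.
toy = pd.DataFrame({"country": ["Canada", "Canada", "Ghana", "Lebanon", "Canada"]})

country_count = toy["country"].value_counts()

# Keep only countries that appear exactly once.
single = sorted(country_count[country_count == 1].index)
print(single)
```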

### Which university has the best teaching faculty in the world?

* CalTech has the highest teaching score.
* Not surprisingly, the list overlaps heavily with the overall ranking above, although sorting by `teaching` rather than `total_score` can change the order.
* Universities in the top ranks do have high teaching scores, as expected.

In [None]:
teaching_count = timesDataYear.sort_values(by='teaching', ascending=False)
teaching_count[['world_rank', 'university_name', 'country', 'teaching', 'year']][:10]

### Which country hosts the most international students?

* As expected, the US leads, followed by the UK, Australia, Germany and Canada.

In [None]:
(timesDataYear[['country', 'num_international_students']]
    .groupby('country')
    .sum()
    .sort_values('num_international_students', ascending=False)
    .head(10))

### Let's find the universities with the highest share of international students

* Whoa, a university in the UAE has 82% international students, although its small enrollment could inflate the percentage.

In [None]:
timesDataYear.sort_values('international_students_percent', ascending=False)[['world_rank', 'university_name', 'country', 'year', 'international_students_percent']].head(10)
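One way to reduce the small-enrollment effect is to apply a minimum size before ranking by percentage. The sketch below uses made-up universities, and the `min_students` threshold is a hypothetical value chosen only for illustration:

```python
import pandas as pd

# Made-up universities; not real data.
toy = pd.DataFrame({
    "university_name": ["Small U", "Mid U", "Large U"],
    "num_students": [900, 12000, 40000],
    "international_students_percent": [80, 35, 20],
})

min_students = 5000  # hypothetical threshold

# Drop small universities, then rank the rest by international share.
large_enough = toy[toy["num_students"] >= min_students]
ranked = large_enough.sort_values("international_students_percent", ascending=False)
print(ranked)
```

Rows whose `num_students` was filled with 0 during cleaning would also drop out under such a filter.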

### Now in the US, which university has the highest share of international students?

* Interestingly, the universities with the highest share of international students are not the very top-ranked ones, for example CMU, Columbia and Princeton.

In [None]:
timesDataYear[timesDataYear.country == 'United States of America'].sort_values('international_students_percent', ascending=False)[['world_rank', 'university_name', 'country', 'year', 'international_students_percent']].head(10)

### Let's find the largest universities in the world by number of students

* Surprisingly, some lesser-known universities have close to 400,000 students.

In [None]:
timesDataYear.sort_values(by='num_students', ascending=False)[['world_rank', 'university_name', 'country', 'year', 'num_students']].head(10)

### Universities with more female than male students

* Ewha Womans University in South Korea has 100 percent female students.

In [None]:
timesDataYear.sort_values(by='female_percent', ascending=False)[['world_rank', 'university_name', 'country', 'year', 'female_percent']].head()

### As the US is the most popular country for higher education, what are the top universities in the United States?

* The top 50 US universities all sit near the top of the world table, most of them within the top 100.

In [None]:
country = 'United States of America'
country_data = timesDataYear[timesDataYear['country'] == country]
country_data[['world_rank', 'university_name', 'country', 'total_score', 'year']][:50]
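Checking how many of these universities fall inside the world top 100 requires a numeric rank, while `world_rank` is stored as text. The sketch below assumes the column uses formats such as `"=44"` for ties and `"201-250"` for bands; the helper `rank_upper_bound` and the `toy` values are hypothetical.

```python
import pandas as pd

def rank_upper_bound(rank):
    """Return the worst (largest) rank implied by a world_rank string."""
    text = str(rank).strip().lstrip("=")   # "=44" -> "44"
    return int(text.split("-")[-1])        # "201-250" -> 250, "44" -> 44

# Made-up ranks in the assumed formats; not real data.
toy = pd.DataFrame({"world_rank": ["1", "=44", "99", "201-250"]})
toy["rank_max"] = toy["world_rank"].map(rank_upper_bound)

# Fraction of rows whose rank is certainly within the top 100.
share_in_top_100 = (toy["rank_max"] <= 100).mean()
print(toy)
print(share_in_top_100)
```

Applied to `country_data[:50]`, the same expression would give the share of the top 50 US universities ranked within the world top 100.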

### Canada is another destination students consider for higher education. What are the top schools in Canada?

In [None]:
country = 'Canada'
country_data = timesDataYear[timesDataYear['country'] == country]
country_data[['world_rank', 'university_name', 'country', 'total_score', 'year']][:10]

In [None]:
timesData.loc[timesData.university_name == 'California Institute of Technology']