# Times University ranking dataset analysis

In this codealong we are going to analyze a ranking of universities using regression. Specifically, we are going to **predict the university ranking** with the provided predictors.

---

The information provided in the csv contains:

- **world_rank** - world rank for the university. Contains rank ranges and equal ranks (e.g., =94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score for international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - number of students divided by number of staff.
- **international_students** - percentage of students who are international.
- **female_male_ratio** - ratio of female to male students.
- **year** - year of the ranking (2011 to 2016 included).
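
Because `world_rank` mixes plain ranks, tied ranks and ranges, it cannot be cast to a number directly. The following is a minimal sketch of one way to parse it, assuming a range is represented by its midpoint; the helper name `parse_world_rank` and the sample values are hypothetical and not part of the dataset's documentation.

```python
import pandas as pd

def parse_world_rank(rank):
    """Convert a rank string such as '=94' or '201-250' to a float.

    Assumption: a range is represented by its midpoint.
    """
    text = str(rank).strip().lstrip('=')
    if '-' in text:
        low, high = text.split('-')
        return (float(low) + float(high)) / 2
    return float(text)

# Made-up sample values in the formats described in the list above
sample_ranks = pd.Series(['1', '=94', '201-250'])
parsed_ranks = sample_ranks.apply(parse_world_rank)
```

Using the midpoint is only one option; keeping the lower bound, or dropping ranged rows entirely, may suit a given analysis better.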

We are going to predict the **total score**, which directly corresponds to the ranking.

---

### ONLY THE DATA PATH IS PROVIDED!

The analysis is up to you. This is an open-ended practice. You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis or hypotheses on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create new variables from columns if necessary
- Perform statistical analysis with regression and describe the results

---

If you do not know how to do something, **check the documentation first.** I look things up in the documentation all the time.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

In [1]:
uni_data_path = './dataset/timesData.csv'

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
unidata = pd.read_csv(uni_data_path)

In [4]:
unidata.head(2)

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
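
The two rows above show that `international_students` is stored as a string such as `25%` and `female_male_ratio` as a string such as `33 : 67`. A possible way to turn both into numeric features is sketched below on made-up rows; the new column names `intl_students_pct` and `female_pct` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same string formats as the head() output
sample = pd.DataFrame({
    'international_students': ['25%', '27%', np.nan],
    'female_male_ratio': [np.nan, '33 : 67', '50 : 50'],
})

# '25%' -> 25.0; missing values stay NaN
sample['intl_students_pct'] = pd.to_numeric(
    sample['international_students'].str.rstrip('%'), errors='coerce'
)

# '33 : 67' -> 33.0, the female share; missing values stay NaN
sample['female_pct'] = pd.to_numeric(
    sample['female_male_ratio'].str.split(':').str[0].str.strip(),
    errors='coerce'
)
```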


In [5]:
unidata.dtypes

world_rank                 object
university_name            object
country                    object
teaching                  float64
international              object
research                  float64
citations                 float64
income                     object
total_score                object
num_students               object
student_staff_ratio       float64
international_students     object
female_male_ratio          object
year                        int64
dtype: object
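
Several score columns are reported as `object` even though they should be numeric. One way to find out why is to list the values that are present but fail numeric parsing. The sketch below uses made-up values and assumes the placeholder is `-`, as the `replace` calls in the following cells suggest.

```python
import pandas as pd

# Made-up values; '-' stands in for a missing-score placeholder
scores = pd.Series(['72.4', '-', '54.6', '-'])

# Values that are present but cannot be parsed as numbers
as_numbers = pd.to_numeric(scores, errors='coerce')
bad_tokens = scores[as_numbers.isna() & scores.notna()].unique()
```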

In [6]:
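def fill_in(x):
    """Convert each value to float, mapping empty strings to NaN."""
    new_x = []
    for i in x:
        if i == '':
            # use a real missing value rather than the string 'NaN'
            new_x.append(np.nan)
        else:
            new_x.append(float(i))
    return new_x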

In [7]:
unidata['international'] = unidata['international'].apply(lambda x: x.replace('-', ''))

In [8]:
unidata['international'] = fill_in(unidata['international'])

In [9]:
unidata['international'] = unidata['international'].apply(lambda x: float(x))

In [10]:
unidata['income'] = unidata['income'].apply(lambda x: x.replace('-', ''))

In [11]:
unidata['income'] = fill_in(unidata['income'])

In [12]:
unidata['income'] = unidata['income'].apply(lambda x: float(x))

In [13]:
unidata['total_score'] = unidata['total_score'].apply(lambda x: x.replace('-', ''))

In [14]:
unidata['total_score'] = fill_in(unidata['total_score'])

In [18]:
unidata['total_score'] = unidata['total_score'].apply(lambda x: float(x))
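
The three-step pattern above (strip `-`, fill, cast) is repeated for each column. A more compact alternative, shown here on a made-up frame rather than the real data, is `pd.to_numeric` with `errors='coerce'`. Note that this swaps in a different technique from the cells above: it turns any unparseable value into NaN, not only `-`.

```python
import pandas as pd

# Made-up frame with '-' marking missing scores
toy = pd.DataFrame({
    'international': ['72.4', '-', '54.6'],
    'income': ['34.5', '83.7', '-'],
    'total_score': ['96.1', '96.0', '-'],
})

score_cols = ['international', 'income', 'total_score']

# errors='coerce' turns anything unparseable (here '-') into NaN
toy[score_cols] = toy[score_cols].apply(pd.to_numeric, errors='coerce')
```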

In [19]:
def num_cat(x):
    cat_list = []
    for i in x:
        cat_list.append(type(i))
    return set(cat_list)

In [20]:
num_cat(unidata['num_students'])

{float, str}

In [21]:
unidata['num_students'] = unidata['num_students'].apply(lambda x: float(x.replace(',', '')) if isinstance(x, str) else x)

In [22]:
num_cat(unidata['num_students'])

{numpy.float64}

In [23]:
unidata['num_students'] = fill_in(unidata['num_students'])

In [24]:
unidata['num_students'] = unidata['num_students'].apply(lambda x: float(x)) ## may not need this
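
The comma removal and float conversion for `num_students` can also be written with vectorized string methods. This is a sketch on made-up values, assuming commas are the only non-numeric characters in the column.

```python
import numpy as np
import pandas as pd

# Made-up values with thousands separators and one missing entry
students = pd.Series(['20,152', '2,243', np.nan])

# Remove the commas, then parse; NaN entries pass through unchanged
students_num = pd.to_numeric(
    students.str.replace(',', '', regex=False), errors='coerce'
)
```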

In [25]:
pd.concat([unidata.drop('country',axis=1),pd.get_dummies(unidata['country'])], axis = 1)

Unnamed: 0,world_rank,university_name,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,...,Taiwan,Thailand,Turkey,Uganda,Ukraine,Unisted States of America,United Arab Emirates,United Kingdom,United States of America,Unted Kingdom
0,1,Harvard University,99.7,72.4,98.7,98.8,34.5,96.1,20152.0,8.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,California Institute of Technology,97.7,54.6,98.0,99.9,83.7,96.0,2243.0,6.9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3,Massachusetts Institute of Technology,97.8,82.3,91.4,99.9,87.5,95.6,11074.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,4,Stanford University,98.3,29.5,98.1,99.2,64.3,94.3,15596.0,7.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,Princeton University,90.9,70.3,95.4,99.9,,94.2,7929.0,8.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,6,University of Cambridge,90.5,77.7,94.1,94.0,57.0,91.2,18812.0,11.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,6,University of Oxford,88.2,77.2,93.9,95.1,73.5,91.2,19919.0,11.6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7,8,"University of California, Berkeley",84.2,39.6,99.3,97.8,,91.1,36186.0,16.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8,9,Imperial College London,89.2,90.0,94.5,88.3,92.9,90.6,15060.0,11.7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9,10,Yale University,92.1,59.2,89.7,91.5,,89.5,11751.0,4.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
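
The dummy columns above include `Unisted States of America` and `Unted Kingdom`, which look like misspellings of existing countries and would otherwise become separate categories. A minimal sketch of correcting them before encoding follows, using a made-up frame with placeholder university names. It also assigns the result of `pd.concat`, which returns a new frame rather than modifying `unidata` in place.

```python
import pandas as pd

toy = pd.DataFrame({
    'university_name': ['A', 'B', 'C', 'D'],  # placeholder names
    'country': ['United States of America', 'Unisted States of America',
                'United Kingdom', 'Unted Kingdom'],
})

# Map the two misspellings seen in the dummy columns to the correct names
country_fixes = {
    'Unisted States of America': 'United States of America',
    'Unted Kingdom': 'United Kingdom',
}
toy['country'] = toy['country'].replace(country_fixes)

# Keep the encoded frame by assigning it to a variable
encoded = pd.concat(
    [toy.drop('country', axis=1), pd.get_dummies(toy['country'])], axis=1
)
```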


In [13]:
y = unidata['total_score']
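
With a target in hand, the regression step could look like the sketch below. It fits ordinary least squares with `np.linalg.lstsq` on made-up data in which `total_score` is constructed as an exact linear combination of two predictors, so the recovered coefficients are known in advance. The choice of predictors and the numbers are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: total_score is built as an exact linear combination
toy = pd.DataFrame({
    'teaching': [90.0, 80.0, 70.0, 60.0, 50.0],
    'research': [95.0, 70.0, 75.0, 50.0, 55.0],
})
toy['total_score'] = 0.5 * toy['teaching'] + 0.3 * toy['research'] + 10.0

# Rows with missing values must be dropped (or imputed) before fitting
toy = toy.dropna()
X = toy[['teaching', 'research']].to_numpy()
y = toy['total_score'].to_numpy()

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, b_teaching, b_research = coefs
```

On the real data the fit will not be exact, and a library such as statsmodels or scikit-learn would additionally report standard errors or cross-validated scores.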

In [None]:
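# Same one-hot encoding pattern as In [25] with placeholder names;
# `df` and 'status' are not defined in this notebook
pd.concat([df.drop('status', axis=1), pd.get_dummies(df['status'])], axis=1)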