# Times University ranking dataset analysis

In this codealong we are going to analyze a ranking of universities using regression. Specifically, we are going to **predict the university ranking** with the provided predictors.

---

The information provided in the csv contains:

- **world_rank** - world rank for the university. Contains rank ranges and equal ranks (e.g. =94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score for international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - number of students divided by number of staff.
- **international_students** - percentage of students who are international.
- **female_male_ratio** - ratio of female to male students.
- **year** - year of the ranking (2011 to 2016 included).

We are going to predict the **total score**, which directly corresponds to the ranking.
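To make the link between score and rank concrete, here is a minimal sketch on a made-up four-row frame. It assumes that ties share the lowest available rank, which is how the `=94` notation reads; the `toy` frame and the `derived_rank` column are illustrative names, not part of the dataset.

```python
import pandas as pd

# Made-up scores; universities B and C are tied
toy = pd.DataFrame({
    'university_name': ['A', 'B', 'C', 'D'],
    'total_score': [96.1, 94.3, 94.3, 90.0],
})

# Higher score means better (lower) rank; method='min' gives tied rows
# the same rank and skips the next one, like "=2, =2, 4"
toy['derived_rank'] = (
    toy['total_score'].rank(ascending=False, method='min').astype(int)
)
print(toy)
```

Because rank is a monotonic function of the score, a model that predicts the score well also orders the universities well.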

---

### ONLY THE DATA PATH IS PROVIDED!

The analysis is up to you. This is an open-ended practice. You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis or hypotheses on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create new variables from columns if necessary
- Perform statistical analysis with regression and describe the results
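As one possible starting point for the cleaning and feature-creation steps, the sketch below parses the text-encoded columns described earlier. The helper names (`parse_rank`, `parse_percent`, `parse_female_share`) and the `sample` frame are hypothetical, and replacing a rank range with its midpoint is only one of several reasonable choices. It also assumes the real file uses exactly these formats, which you should verify with your own EDA.

```python
import numpy as np
import pandas as pd

def parse_rank(rank):
    """Convert '=94' to 94.0 and '201-250' to the midpoint 225.5."""
    text = str(rank).strip().lstrip('=')
    if '-' in text:
        low, high = text.split('-')
        return (float(low) + float(high)) / 2
    return float(text)

def parse_percent(value):
    """Convert '25%' to 25.0; missing values stay NaN."""
    if pd.isnull(value):
        return np.nan
    return float(str(value).strip().rstrip('%'))

def parse_female_share(ratio):
    """Convert '33 : 67' to the female percentage 33.0; missing stays NaN."""
    if pd.isnull(ratio):
        return np.nan
    female, _male = str(ratio).split(':')
    return float(female)

# Tiny made-up frame that mimics the raw text formats
sample = pd.DataFrame({
    'world_rank': ['1', '=94', '201-250'],
    'international_students': ['25%', '27%', None],
    'female_male_ratio': [None, '33 : 67', '45 : 55'],
})

sample['rank_num'] = sample['world_rank'].apply(parse_rank)
sample['intl_pct'] = sample['international_students'].apply(parse_percent)
sample['female_pct'] = sample['female_male_ratio'].apply(parse_female_share)
print(sample[['rank_num', 'intl_pct', 'female_pct']])
```

On the full dataset the same helpers could be applied column by column with `.apply`, after checking for any formats the sample does not cover.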

---

I will be there in class to help, but if you do not know how to do something **I expect you to check the documentation first.** I look things up in the documentation all the time.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

In [112]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [113]:
uni_data_path = './dataset/timesData.csv'

In [114]:
unis = pd.read_csv(uni_data_path)

In [115]:
unis.head(2)

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011


In [116]:
unis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603 entries, 0 to 2602
Data columns (total 14 columns):
world_rank                2603 non-null object
university_name           2603 non-null object
country                   2603 non-null object
teaching                  2603 non-null float64
international             2603 non-null object
research                  2603 non-null float64
citations                 2603 non-null float64
income                    2603 non-null object
total_score               2603 non-null object
num_students              2544 non-null object
student_staff_ratio       2544 non-null float64
international_students    2536 non-null object
female_male_ratio         2370 non-null object
year                      2603 non-null int64
dtypes: float64(4), int64(1), object(9)
memory usage: 284.8+ KB
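Several score columns show up as `object` rather than `float64`, which usually means some entries are not valid numbers. A quick way to find the offending entries is sketched below on a made-up miniature of the score columns; `non_numeric_values` is a hypothetical helper, and the `'-'` placeholder is an assumption based on the rows displayed in this notebook.

```python
import pandas as pd

# Made-up miniature of the score columns; '-' marks a missing score
scores = pd.DataFrame({
    'international': ['72.4', '54.6', '-'],
    'income': ['34.5', '-', '87.5'],
    'total_score': ['96.1', '96.0', '-'],
})

def non_numeric_values(column):
    """Return the distinct entries that pd.to_numeric cannot parse."""
    converted = pd.to_numeric(column, errors='coerce')
    # Entries that were present but became NaN failed to convert
    failed = converted.isnull() & column.notnull()
    return sorted(column[failed].unique())

for name in scores.columns:
    print(name, non_numeric_values(scores[name]))
```

Knowing which strings block the conversion makes it easier to decide whether to coerce them to `NaN` or drop the affected rows.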


In [117]:
# Non-numeric entries such as '-' become NaN
unis['international'] = pd.to_numeric(unis['international'], errors='coerce')

In [118]:
unis.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929,8.4,27%,45 : 55,2011


In [119]:
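# Drop rows without a total score; .copy() avoids SettingWithCopyWarning
# when columns are reassigned later
unis = unis[unis['total_score'] != '-'].copy()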

In [120]:
# unis.head()
unis.tail()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
1998,=196,Newcastle University,United Kingdom,30.9,84.3,27.5,81.5,34.7,49.2,20174,15.2,29%,50 : 50,2016
1999,=196,"St George’s, University of London",United Kingdom,25.6,69.5,18.1,100.0,37.7,49.2,2958,13.4,17%,61 : 39,2016
2000,198,University of Trento,Italy,30.8,55.9,27.4,87.7,47.1,49.1,16841,43.2,8%,51 : 49,2016
2001,199,Paris Diderot University – Paris 7,France,30.5,64.9,22.9,91.0,29.0,48.9,27756,14.8,17%,63 : 37,2016
2002,200,Queen’s University Belfast,United Kingdom,34.1,93.4,33.3,68.9,35.7,48.8,17940,17.9,30%,54 : 46,2016


In [121]:
unis['num_students'] = unis['num_students'].str.replace(",","")

In [122]:
unis['num_students'] = pd.to_numeric(unis['num_students'])

In [123]:
unis.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152.0,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243.0,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074.0,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596.0,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929.0,8.4,27%,45 : 55,2011


In [128]:
unis['total_score'] = pd.to_numeric(unis.total_score)

In [129]:
unis.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152.0,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243.0,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074.0,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596.0,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929.0,8.4,27%,45 : 55,2011


In [142]:
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

# Work on an explicit copy so the in-place replacements below do not
# trigger pandas' SettingWithCopyWarning
df = unis[['citations', 'research', 'total_score']].copy()

# Treat the '-' placeholder and infinite values as missing, then drop them
df.replace('-', np.nan, inplace=True)
df.replace(np.inf, np.nan, inplace=True)
df = df.dropna()

lr = linear_model.LinearRegression()

cols = ['citations', 'research']

x = df[cols]
y = df['total_score']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)


In [143]:
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [145]:
lr.predict(X_test)

array([ 53.26036486,  52.38090243,  82.53581412,  55.77310107,
        77.24159805,  68.59379454,  61.58592641,  53.92079451,
        50.4107813 ,  54.51135306,  73.52653589,  55.45296636,
        63.18657088,  57.38369934,  53.57423596,  57.12288946,
        54.67183795,  64.96993866,  51.08612712,  87.21013454,
        75.85992822,  61.29705159,  52.80923775,  46.13189522,
        63.12870268,  51.95179017,  69.08288224,  52.45514342,
        60.10808794,  53.81147712,  53.31742705,  80.28016852,
        76.26004322,  51.85143611,  49.99686691,  50.64513827,
        47.00140385,  45.52452678,  80.50764034,  47.01036718,
        45.43908889,  56.09999462,  55.58764896,  60.62121054,
        50.58215232,  43.87724009,  46.16237817,  59.2856877 ,
        76.93941908,  81.04064146,  52.58657291,  55.56104066,
        47.58349429,  84.77365936,  55.37379208,  55.07707071,
        58.64690398,  49.82788484,  51.87416976,  52.21291083,
        60.4149185 ,  52.73598722,  78.66733669,  51.97

In [146]:
lr.score(X_test, y_test)

0.96440779921418085

In [147]:
lr.score(X_train,y_train)

0.96182495607822049
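For a regressor, `score` returns the coefficient of determination R², so the two numbers above say the model explains roughly 96% of the variance in `total_score` on both splits. The sketch below shows what that number means and adds two further checks, RMSE and 5-fold cross-validation. It runs on synthetic data so that it is self-contained: the coefficients 0.3 and 0.5 and the noise level are made up and are not the real Times weighting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-ins for the 'citations' and 'research' scores (0-100)
X = rng.uniform(0, 100, size=(200, 2))
# Hypothetical relationship plus noise; not the real Times weighting
y = 0.3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# score() returns R^2 = 1 - SS_res / SS_tot on the data passed in
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same units as the target
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# Cross-validation gives a less split-dependent estimate of R^2
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print('coefficients:', model.coef_, 'intercept:', model.intercept_)
print('R^2 (manual):', r2_manual, 'R^2 (score):', model.score(X_test, y_test))
print('RMSE:', rmse)
print('5-fold R^2:', cv_scores.mean(), '+/-', cv_scores.std())
```

Since `train_test_split` in the notebook is called without `random_state`, the scores above will change slightly on every run; passing a fixed `random_state` makes them reproducible.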