# Times University ranking dataset analysis

In this codealong we are going to analyze a ranking of universities using regression. Specifically, we are going to **predict the university ranking** with the provided predictors.

---

The information provided in the csv contains:

- **world_rank** - world rank for the university. Contains rank ranges and equal ranks (e.g. =94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score for international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - number of students divided by number of staff.
- **international_students** - percentage of students who are international.
- **female_male_ratio** - female student to male student ratio.
- **year** - year of the ranking (2011 to 2016 included).

We are going to predict the **total score**, which directly corresponds to the ranking.
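
The `world_rank` column described above mixes plain ranks, tied ranks such as `=94`, and ranges such as `201-250`, so it cannot be cast to a number directly. Below is a minimal sketch of one way to make it numeric. The helper name `parse_world_rank` is hypothetical, and using the midpoint of a range is an assumption made for illustration, not something the dataset prescribes.

```python
import pandas as pd


def parse_world_rank(rank):
    """Convert one world_rank entry to a float (hypothetical helper).

    Tied ranks ('=94') lose their '=' prefix; ranges ('201-250') are
    replaced by their midpoint, which is an assumption of this sketch.
    """
    text = str(rank).strip().lstrip('=')
    if '-' in text:
        low, high = text.split('-', 1)
        return (float(low) + float(high)) / 2
    return float(text)


# Small hand-made sample in the formats described in the column list
ranks = pd.Series(['1', '=94', '201-250'])
parsed = ranks.map(parse_world_rank)
print(parsed.tolist())
```

Since the target in this exercise is `total_score`, this conversion is only needed if you want to use or inspect the rank itself.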

---

### ONLY THE DATA PATH IS PROVIDED!

The analysis is up to you. This is an open-ended practice. You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis or hypotheses on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create new variables from columns if necessary
- Perform statistical analysis with regression and describe the results
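
As a starting point for the "check your data for problems" step, the sketch below lists, for each non-numeric column, the entries that fail to parse as numbers. The function name `non_numeric_values` and the toy frame are hypothetical and made up for illustration; on the real data, text columns such as `country` will naturally show up in the report and can be ignored.

```python
import pandas as pd


def non_numeric_values(frame):
    """Map each non-numeric column to its distinct entries that do not parse as numbers."""
    report = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # already numeric, nothing to check
        values = frame[col].dropna()
        # Entries that become NaN under coercion are the problematic ones
        bad = values[pd.to_numeric(values, errors='coerce').isnull()]
        if len(bad) > 0:
            report[col] = sorted(bad.astype(str).unique())
    return report


# Toy frame imitating the kinds of entries found in the csv (made-up rows)
toy = pd.DataFrame({
    'income': ['34.5', '-', '83.7'],
    'num_students': ['20,152', '2,243', None],
    'teaching': [99.7, 97.7, 97.8],
})
problems = non_numeric_values(toy)
print(problems)
```

A report like this makes it easier to decide, column by column, whether to strip characters (commas, percent signs) or to coerce placeholders such as `-` to missing values.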

---

If you do not know how to do something, **check the documentation first.** I look things up in the documentation all the time.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

In [63]:
uni_data_path = './dataset/timesData.csv'

In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [65]:
unidata = pd.read_csv(uni_data_path)

In [66]:
unidata.head(1)

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011


In [67]:
unidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603 entries, 0 to 2602
Data columns (total 14 columns):
world_rank                2603 non-null object
university_name           2603 non-null object
country                   2603 non-null object
teaching                  2603 non-null float64
international             2603 non-null object
research                  2603 non-null float64
citations                 2603 non-null float64
income                    2603 non-null object
total_score               2603 non-null object
num_students              2544 non-null object
student_staff_ratio       2544 non-null float64
international_students    2536 non-null object
female_male_ratio         2370 non-null object
year                      2603 non-null int64
dtypes: float64(4), int64(1), object(9)
memory usage: 284.8+ KB


In [68]:
unidata_clean = unidata.copy()

In [69]:
unidata_clean.head(1)

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011


In [70]:
unidata_clean['num_students'] = unidata_clean['num_students'].str.replace(',','')

In [71]:
unidata_clean[unidata_clean['num_students'].isnull()==False]['num_students'].astype(float).mean()

23873.758647798742

In [78]:
# .ix has been removed from pandas: convert to float first, then fill missing values with the column mean
num_students = unidata_clean['num_students'].astype(float)
unidata_clean['num_students'] = num_students.fillna(num_students.mean())

In [102]:
unidata_clean['num_students'] = unidata_clean['num_students'].astype(float)

In [104]:
unidata_clean['num_students'].describe()

count      2603.000000
mean      23873.758648
std       17474.397944
min         462.000000
25%       12800.000000
50%       21394.000000
75%       29787.000000
max      379231.000000
Name: num_students, dtype: float64

In [106]:
unidata_clean['total_score'] = pd.to_numeric(unidata_clean['total_score'].str.replace('-',''), errors='coerce')

In [107]:
unidata_clean['total_score'].describe()



count    1201.000000
mean       59.846128
std        12.803446
min        41.400000
25%              NaN
50%              NaN
75%              NaN
max        96.100000
Name: total_score, dtype: float64

In [108]:
unidata_clean['total_score'].isnull().sum()

1402

In [73]:
unidata_clean['international'] = pd.to_numeric(unidata_clean['international'], errors='coerce')

In [74]:
unidata_clean['international'].dtype

dtype('float64')

In [75]:
# Fill missing values with the column mean (.ix has been removed from pandas)
unidata_clean['international'] = unidata_clean['international'].fillna(unidata_clean['international'].mean())

In [80]:
unidata_clean['student_staff_ratio'] = pd.to_numeric(unidata_clean['student_staff_ratio'], errors='coerce')

In [81]:
# Fill missing values with the column mean (.ix has been removed from pandas)
unidata_clean['student_staff_ratio'] = unidata_clean['student_staff_ratio'].fillna(unidata_clean['student_staff_ratio'].mean())

In [84]:
unidata_clean['international_students'] = unidata_clean['international_students'].str.replace('%','')

In [85]:
unidata_clean['international_students'] = pd.to_numeric(unidata_clean['international_students'], errors='coerce')

In [86]:
# Fill missing values with the column mean (.ix has been removed from pandas)
unidata_clean['international_students'] = unidata_clean['international_students'].fillna(unidata_clean['international_students'].mean())

In [142]:
unidata_clean['income'] = pd.to_numeric(unidata_clean['income'], errors='coerce')

In [143]:
# Fill missing values with the column mean (.ix has been removed from pandas)
unidata_clean['income'] = unidata_clean['income'].fillna(unidata_clean['income'].mean())

In [92]:
# Take the female share (the part before the colon) from strings such as '33 : 67'
unidata_clean['female_ratio'] = unidata_clean['female_male_ratio'].str.split(':', expand=True)[0]

In [95]:
unidata_clean['female_ratio'] = pd.to_numeric(unidata_clean['female_ratio'], errors='coerce')

In [96]:
# Fill missing values with the column mean (.ix has been removed from pandas)
unidata_clean['female_ratio'] = unidata_clean['female_ratio'].fillna(unidata_clean['female_ratio'].mean())

In [97]:
unidata_clean

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female_ratio
0,1,Harvard University,United States of America,99.7,72.40000,98.7,98.8,34.5,96.1,20152,8.9,25.0,,2011,49.683988
1,2,California Institute of Technology,United States of America,97.7,54.60000,98.0,99.9,83.7,96.0,2243,6.9,27.0,33 : 67,2011,33.000000
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.30000,91.4,99.9,87.5,95.6,11074,9.0,33.0,37 : 63,2011,37.000000
3,4,Stanford University,United States of America,98.3,29.50000,98.1,99.2,64.3,94.3,15596,7.8,22.0,42 : 58,2011,42.000000
4,5,Princeton University,United States of America,90.9,70.30000,95.4,99.9,-,94.2,7929,8.4,27.0,45 : 55,2011,45.000000
5,6,University of Cambridge,United Kingdom,90.5,77.70000,94.1,94.0,57.0,91.2,18812,11.8,34.0,46 : 54,2011,46.000000
6,6,University of Oxford,United Kingdom,88.2,77.20000,93.9,95.1,73.5,91.2,19919,11.6,34.0,46 : 54,2011,46.000000
7,8,"University of California, Berkeley",United States of America,84.2,39.60000,99.3,97.8,-,91.1,36186,16.4,15.0,50 : 50,2011,50.000000
8,9,Imperial College London,United Kingdom,89.2,90.00000,94.5,88.3,92.9,90.6,15060,11.7,51.0,37 : 63,2011,37.000000
9,10,Yale University,United States of America,92.1,59.20000,89.7,91.5,-,89.5,11751,4.4,20.0,50 : 50,2011,50.000000


In [6]:
unidata.columns

Index([u'world_rank', u'university_name', u'country', u'teaching',
       u'international', u'research', u'citations', u'income', u'total_score',
       u'num_students', u'student_staff_ratio', u'international_students',
       u'female_male_ratio', u'year'],
      dtype='object')

In [144]:
df = unidata_clean[[u'country', u'teaching',
       u'international', u'research', u'citations', u'income',
       u'num_students', u'student_staff_ratio', u'international_students',
       u'female_ratio', u'year','total_score']]

In [145]:
# Keep only the rows where the target is known
df = df[df['total_score'].notnull()]

In [146]:
y = df.pop('total_score')

In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1201 entries, 0 to 2002
Data columns (total 11 columns):
country                   1201 non-null object
teaching                  1201 non-null float64
international             1201 non-null float64
research                  1201 non-null float64
citations                 1201 non-null float64
income                    1201 non-null float64
num_students              1201 non-null float64
student_staff_ratio       1201 non-null float64
international_students    1201 non-null float64
female_ratio              1201 non-null float64
year                      1201 non-null int64
dtypes: float64(9), int64(1), object(1)
memory usage: 112.6+ KB


In [132]:
unidata_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603 entries, 0 to 2602
Data columns (total 15 columns):
world_rank                2603 non-null object
university_name           2603 non-null object
country                   2603 non-null object
teaching                  2603 non-null float64
international             2603 non-null float64
research                  2603 non-null float64
citations                 2603 non-null float64
income                    2603 non-null object
total_score               1201 non-null float64
num_students              2603 non-null float64
student_staff_ratio       2603 non-null float64
international_students    2603 non-null float64
female_male_ratio         2370 non-null object
year                      2603 non-null int64
female_ratio              2603 non-null float64
dtypes: float64(9), int64(1), object(5)
memory usage: 305.1+ KB


In [133]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1201 entries, 0 to 2002
Data columns (total 11 columns):
country                   1201 non-null object
teaching                  1201 non-null float64
international             1201 non-null float64
research                  1201 non-null float64
citations                 1201 non-null float64
income                    1201 non-null object
num_students              1201 non-null float64
student_staff_ratio       1201 non-null float64
international_students    1201 non-null float64
female_ratio              1201 non-null float64
year                      1201 non-null int64
dtypes: float64(8), int64(1), object(2)
memory usage: 112.6+ KB


In [148]:
df = pd.get_dummies(df)

In [149]:
# sklearn.cross_validation was removed from scikit-learn; these helpers now live in sklearn.model_selection
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold, train_test_split
from sklearn import metrics
from scipy import stats
from sklearn import datasets, linear_model
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn import pipeline

In [150]:
lm = linear_model.RidgeCV()
predictions = cross_val_predict(lm, df, y, cv=10)

In [151]:
predictions

array([ 95.44091064,  94.9201837 ,  95.02993748, ...,  48.93865854,
        48.75010224,  48.28356098])

In [152]:
np.mean(cross_val_score(lm, df, y, cv=10))

0.99765291857862426

In [153]:
lm = linear_model.RidgeCV().fit(df,y)
# In Python 3, zip returns an iterator, so wrap it in list() to display the pairs
list(zip(df.columns, lm.coef_))

[('teaching', 0.29690793399211302),
 ('international', 0.068747555292635099),
 ('research', 0.3053766792044561),
 ('citations', 0.30664403728056716),
 ('income', 0.02457936206248057),
 ('num_students', 7.0931491791270673e-08),
 ('student_staff_ratio', -0.0020757597961837959),
 ('international_students', 0.00086847818844582036),
 ('female_ratio', -0.0010976853179011925),
 ('year', -0.079129648515955844),
 ('country_Australia', 0.015189370481969409),
 ('country_Austria', -0.0061022186275555314),
 ('country_Belgium', 0.0079700372638402822),
 ('country_Brazil', -0.018917869639111115),
 ('country_Canada', 0.03208915028178333),
 ('country_China', 0.041921432610408919),
 ('country_Denmark', 0.012634309448465275),
 ('country_Egypt', 0.12741296115610881),
 ('country_Finland', -0.013021711092216712),
 ('country_France', 0.013014029557070678),
 ('country_Germany', -0.022775726907027158),
 ('country_Hong Kong', -0.0036283159395645928),
 ('country_Israel', -0.0026096279374199782),
 ('country_Italy'