#### Introduction to Statistical Learning, Exercise 2.1

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Explore the College Data Set

### A. Reading Data from a CSV file.

Read the college data set from the `College.csv` file into a `pandas` data frame named `college`. Make sure you look for it in the correct directory. Then display five rows of the data table, starting at the sixth row.

In [None]:
import os
import pandas as pd

In [None]:
datasets_dir = '../../datasets'
college_path = os.path.join(datasets_dir, 'College.csv')
college = pd.read_csv(college_path)

In [None]:
college[5:].head()

### B. Accessing Data through the ISLPy Module

We can read a wide range of data formats into `pandas` data frames.  However, to minimise overhead, we will access most of the data via the `islpwf.datasets` module from now on. This has the additional advantage that the module provides documentation for the data sets.

Import `datasets` from `islpwf` and read the documentation of the college data set. Remember you can use tab-completion in code cells.

Then retrieve the data set, assigning the data frame to a variable named `college` and look at the first 20 rows.

You will notice that the first column contains the university names and is not properly named. The university names are not really data points, but would serve well as row names instead of numerical row indices. Rename the first column to 'University' using the `rename()` method and assign this column as the row names. 

In [None]:
from islpwf import datasets

In [None]:
help(datasets.College)

In [None]:
college = datasets.College()

In [None]:
college.head(20)

In [None]:
college.rename({college.columns[0]: 'University'},
               axis='columns', inplace=True)
college.set_index(['University'], inplace=True)

In [None]:
college.head()
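
The rename-and-index step can be tried out in isolation on a tiny in-memory CSV. The sketch below is only an illustration: the two universities and their values are made up, and it assumes the file has an empty first header field, as described in the exercise. It also shows that `read_csv()` can use the first column as the index directly via `index_col=0`.

```python
import io
import pandas as pd

# Made-up miniature of the CSV layout: the first header field is empty,
# like the university-name column described in the exercise.
csv_text = """,Private,Apps
Alpha College,Yes,1200
Beta University,No,5400
"""

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns[0])  # pandas generates a placeholder name for the empty header

# The same two steps as in the solution, written without inplace=True.
toy = toy.rename(columns={toy.columns[0]: 'University'}).set_index('University')

# Alternative: let read_csv() use the first column as the row index right away.
toy_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
toy_indexed.index.name = 'University'

print(toy.equals(toy_indexed))
```

Both routes end in the same table; which one is more convenient depends on whether the data comes from a file you read yourself or from a ready-made data frame.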

### C. Data Exploration & Visualisation

  1. Use the `describe()` method to produce a numerical summary of the variables in the data set.  The `Private` column does not appear in the summary. Explain why this is the case.
  
  2. Use the `pairplot()` function from the `seaborn` library to produce a scatter plot matrix of the *first ten variables* in the data set.  Recall that you can use the `iloc` property for `numpy` style indexing.
  
  3. Use the `boxplot()` function from `seaborn` to produce a side-by-side boxplot of `Outstate` versus `Private`.
  
  4. Create a new qualitative variable `Elite` by *binning* (cutting on) the `Top10perc` variable. We are going to divide the universities into two groups based on whether or not more than 50% of the students come from the top 10% of their high school classes. 
  
  The `Elite` variable should be `'Yes'` if `Top10perc > 50` and `'No'` otherwise.  It should be added as a new column to the `college` data frame.
  
  In general, new columns can be added to a data frame like entries to a dictionary. There are many ways to achieve the desired result. This is the most concise:
  
  ```python
    college['Elite'] = ['Yes' if e else 'No' for e in college.Top10perc > 50]
  ```
  
  This is another approach:
  
  ```python
    college['Elite'] = college.Top10perc > 50
    college['Elite'] = college['Elite'].replace({True: 'Yes', False: 'No'})
  ```
  
  Feel free to think about different approaches and try them out. You'll learn that assignments into data frames can be tricky.
  
  Check that the `Elite` column was correctly added and see how many elite universities there are (there are several ways to do this). Then make a box plot of `Outstate` versus `Elite`.
  
  5. Use the `distplot()` function from `seaborn` with `kde=False` to plot several distributions from the `college` data set with a different number of bins. Use the `subplots()` function from `matplotlib.pyplot` to show at least four different distributions in one figure. Use the `figsize` keyword argument when creating the subplots to make the figure larger and more readable.
  
  6. Continue to explore the data and provide a brief summary of what you discover.
  
  Hint: write down some expectations you have and then check them.

#### C.1

In [None]:
college.describe()

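By default, `describe()` summarises only the numeric columns. The `Private` variable is categorical (`'Yes'`/`'No'`), so statistics such as the mean or the quartiles are not defined for it and it is left out of the summary.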
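
This behaviour is easy to reproduce on a small frame. The following sketch uses made-up values (only the column names are taken from the data set) and shows the `include='all'` option of `describe()`, which adds non-numeric columns to the summary.

```python
import pandas as pd

# Made-up values; only the column names come from the college data set.
toy = pd.DataFrame({
    'Private': ['Yes', 'No', 'Yes', 'Yes'],
    'Outstate': [7440, 12280, 11250, 12960],
})

numeric_summary = toy.describe()            # default: numeric columns only
full_summary = toy.describe(include='all')  # adds count/unique/top/freq rows

print(numeric_summary.columns.tolist())
print(full_summary['Private'])
```

For the non-numeric column the summary reports the number of entries, the number of distinct values, the most common value and how often it occurs.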

#### C.2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
sns.pairplot(data=college.iloc[:, :10])
plt.show()

#### C.3

In [None]:
sns.boxplot(x='Private', y='Outstate', data=college)
plt.show()

#### C.4

In [None]:
college['Elite'] = ['Yes' if e else 'No' for e in college.Top10perc > 50]
college[['Top10perc', 'Elite']].head()
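
Two further ways of building the column are sketched below on a made-up `Top10perc` column: `numpy.where()` as a vectorised if/else, and `pandas.cut()`, which is binning in the literal sense. The names `toy` and `EliteCut` are placeholders for this illustration only.

```python
import numpy as np
import pandas as pd

# Made-up percentages, chosen to include the boundary value 50.
toy = pd.DataFrame({'Top10perc': [23, 50, 51, 89]})

# Vectorised if/else: 'Yes' where the condition holds, 'No' elsewhere.
toy['Elite'] = np.where(toy['Top10perc'] > 50, 'Yes', 'No')

# Binning with pd.cut(): intervals are right-closed by default, so the
# bins are (-inf, 50] -> 'No' and (50, inf] -> 'Yes'.
toy['EliteCut'] = pd.cut(toy['Top10perc'],
                         bins=[-np.inf, 50, np.inf],
                         labels=['No', 'Yes'])

print(toy)
```

Note that `pd.cut()` returns a categorical column, whereas the `np.where()` variant produces plain strings.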

In [None]:
college['Elite'].describe()

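In this summary, `count` is the total number of rows and `freq` is the number of occurrences of the most common value (`'No'`). So there are 777 - 699 = 78 `'Yes'` entries (elite universities). Directly determining the count is more useful in programs: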

In [None]:
(college['Elite'] == 'Yes').sum()  # exploit that True/False maps to 1/0
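
Another option is `value_counts()`, which returns the counts for every category at once. A minimal sketch on a made-up series:

```python
import pandas as pd

# Made-up labels standing in for the Elite column.
elite = pd.Series(['No', 'Yes', 'No', 'No', 'Yes'], name='Elite')

counts = elite.value_counts()                # absolute counts per category
shares = elite.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```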

In [None]:
sns.boxplot(x='Elite', y='Outstate', data=college)
plt.show()

#### C.5

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
sns.distplot(college['Accept'], ax=ax[0][0], kde=False)
sns.distplot(college['Apps'], ax=ax[0][1], kde=False, bins=10)
sns.distplot(college['Books'], ax=ax[1][0], kde=False, bins=20)
sns.distplot(college['Personal'], ax=ax[1][1], kde=False, bins=100)
plt.show()

#### C.6

This exercise has no unique solution; it depends on what aspects of the data set you decided to explore.

We set out to check some preconceptions we have about elite universities. For example, we expect that the fraction of accepted applications is lower for the elite institutions. In order to investigate this, we add a column to the data set with the ratio of `Accept` to `Apps` and make a box plot (you can also do this without adding the column by providing the ratio to the `boxplot()` function).

Next we look at some correlations using the `relplot()` function.  Again, we have some expectations we want to check.  For example, we expect `Grad.Rate` to be correlated with `S.F.Ratio` and also with `Elite`.

There are many more relations to explore in this data set, but we will leave it at that in this example solution. You can always look at the scatter plot matrix for some inspiration.



In [None]:
college['AccRatio'] = college['Accept'] / college['Apps']
sns.boxplot(x='Elite', y='AccRatio', data=college)
plt.show()

There are several interesting observations here.  As expected, the acceptance ratio is lower for elite universities. On the other hand the spread is much larger than for the non-elite institutions (this could simply be caused by the relatively low number of elite universities). Furthermore, the distribution for the non-elite universities has some outliers towards low acceptance ratios.
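
Impressions from a box plot can be backed up with numbers by grouping on `Elite`. The sketch below uses made-up application figures and assumes nothing beyond the column names used above; the group sizes in the `count` column help to judge how much weight the spread of the smaller group deserves.

```python
import pandas as pd

# Made-up application numbers; only the column names follow the data set.
toy = pd.DataFrame({
    'Elite': ['No', 'No', 'No', 'Yes', 'Yes'],
    'Accept': [800, 450, 900, 300, 1200],
    'Apps': [1000, 500, 1500, 1000, 2000],
})
toy['AccRatio'] = toy['Accept'] / toy['Apps']

# Per-group size, median and range of the acceptance ratio.
summary = toy.groupby('Elite')['AccRatio'].agg(['count', 'median', 'min', 'max'])
print(summary)
```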

In [None]:
sns.set_style('whitegrid')
sns.relplot(x='S.F.Ratio', y='Grad.Rate', data=college, hue='Elite')
plt.show()

We expected a stronger correlation, assuming that a higher faculty-to-student ratio (that is, a lower `S.F.Ratio`) benefits the quality of teaching and therefore leads to a higher graduation rate. That said, there clearly is a trend, albeit weaker than we expected. There are some curious outliers that warrant further investigation.

As expected, the elite universities cluster in the upper left corner of the plot. A possible reason for their high graduation rates might be that the elite universities simply select the better students. After all, this is how we *defined* them!
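
The strength of such a trend can also be quantified instead of judged by eye. The sketch below computes the Pearson correlation coefficient with `Series.corr()` and the group means per `Elite` category. The values are made up for illustration, so the resulting numbers say nothing about the real data set.

```python
import pandas as pd

# Made-up values; only the column names follow the data set.
toy = pd.DataFrame({
    'S.F.Ratio': [8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
    'Grad.Rate': [90, 85, 70, 72, 60, 55],
    'Elite': ['Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

# Pearson correlation (the default method of Series.corr()).
r = toy['S.F.Ratio'].corr(toy['Grad.Rate'])

# Mean student/faculty ratio and graduation rate per group.
by_group = toy.groupby('Elite')[['S.F.Ratio', 'Grad.Rate']].mean()

print(r)
print(by_group)
```

A negative coefficient corresponds to the pattern described above: graduation rates tend to fall as the number of students per faculty member rises.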