# College Data Exploration and Regression

Coded by Luna McBride

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt #Plotting
%matplotlib inline

from sklearn.model_selection import train_test_split #Split the data into train and test
from sklearn.ensemble import RandomForestRegressor #Forest for regression
from sklearn.metrics import mean_squared_error #Error testing

plt.rcParams['figure.figsize'] = (15,10) #Set the default figure size
plt.style.use('ggplot') #Set the plotting method

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
college = pd.read_csv("../input/us-college-data/College_Data.csv") #Load the college data into a dataframe
college.head() #Take a peek at the data

---

# Fix the Column Names

In [None]:
#Names to rename the columns to
newNames = ["collegeName", "isPrivate", "numApps", "numAccept", "numEnroll", "top10Stud", "top25Stud", "fulltime",
            "parttime", "outstateCost", "roomCost", "bookCost", "personalCost", "teachWithPHD", "teachWithTerminal",
           "studFacRatio", "alumWhoDonate%", "expendPerStud", "gradRate"]

oldNames = college.columns #Get the old columns
nameCombos = dict(zip(oldNames, newNames)) #Combine the old and new into a dictionary
college = college.rename(columns = nameCombos) #Rename the columns
college.head() #Take a peek at the dataset

---

# Check for Null Values

In [None]:
print(college.isnull().any()) #Check for null values

There are no null values.

---

# Create New Fields

## Overall Cost

In [None]:
#GetOverallCost: Calculates the overall costs based on all cost fields in the dataset (out of state only, there is no in state)
#Input: the college dataframe
#Output: a Series of overall cost values, one per college
def getOverallCost(df):
    costColumns = ["outstateCost", "roomCost", "bookCost", "personalCost"] #The cost fields, referenced by name rather than by position
    
    return df[costColumns].sum(axis = 1) #Sum the cost fields across each row (one total per college)
        
college["overallCost"] = getOverallCost(college) #Fill overall costs
college.head() #Take a peek at the dataset

## Acceptance Rate

In [None]:
college["acceptRate"] =  round(college["numAccept"] / college["numApps"], 2) #Get the acceptance rate
college.head() #Take a peek at the dataset

## Set Index as Institutions

In [None]:
college = college.set_index("collegeName") #Set the names to the index
college.head() #Take a peek at the data

---

# Data Exploration

I am limiting most of the plots to the first few entries, since the goal here is to show how the data is structured. That is hard to see when every college is in the image, just by the sheer number of them.

In [None]:
plt.rcParams['figure.figsize'] = (15,4) #Set the figure size

#For every column in the college dataset, plot a graph with a few entries
for column in college:
    
    #If the column is one of the numeric columns
    if column != "collegeName" and column != "isPrivate":
        col = college[column][:5] #Get the information for this column for the first few colleges
        plt.figure() #Pull the figure so they plot on separate graphs
        col.plot.bar(title = column, rot = 0) #Plot the column
        
    #If the column is "isPrivate", the boolean column. This one needs to be aggregated with the value_counts function.
    elif column == "isPrivate":
        col = college[column] #Get the information for this column
        plt.figure() #Pull the figure so it plots on separate graphs
        col.value_counts().plot.bar(title = column, rot = 0) #Plot the column, aggregating for the no/yes values

In [None]:
#For each column, print the range of values it covers
for column in college:
    
    #If the column is not "isPrivate", since we are looking specifically at the numeric columns
    if column != "isPrivate":
        col = college[column] #Pull the data for the column
        maxx = max(col) #Take the max value of the column
        minn = min(col) #Take the min value of the column
        print("The values for the {} column range from {} to {}".format(column, minn, maxx)) #Print the ranges
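
The per-column loop above can also be written as a single aggregation. This is a minimal sketch, assuming a small made-up dataframe named `toy` standing in for `college`. The college names and values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the college dataframe (all values are made up)
toy = pd.DataFrame(
    {
        "isPrivate": ["Yes", "No", "Yes"],
        "numApps": [1660, 2186, 1428],
        "gradRate": [60, 56, 54],
    },
    index=["College A", "College B", "College C"],
)

# Keep only the numeric columns, then compute min and max for each in one call.
# Transposing gives one row per column with "min" and "max" as the headers.
ranges = toy.select_dtypes(include="number").agg(["min", "max"]).T
print(ranges)
```

Selecting by dtype avoids having to name `isPrivate` explicitly, which may be handy if more non-numeric columns are added later.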

In [None]:
print(college.loc[college["gradRate"] > 100]) #Show where the graduation rate is greater than 100

In [None]:
print(college.loc[college["acceptRate"] == 1]) #Print the colleges with 100% acceptance rates

The number of applicants ranges from 81 to 48094 for the period presented, meaning there are both large and small colleges within this dataset. The min and max values for the Acceptance and Enrollment columns are both less than those for the number of applications, which makes sense given that the colleges will not accept everyone.

Top10Stud and Top25Stud, representing students from top high schools, make up a small number of applicants to each college. This means that they are usually applying to colleges besides those in this dataset.  

Fulltime and Parttime minimums add up to lower than the lowest total number of applicants, which tells me this is for the college as a whole rather than just incoming students. Both of these values range from small numbers to huge numbers, which tells me that this dataset holds various colleges that can fit different lifestyles.

The cost metrics (roomCost, outstateCost, bookCost, and personalCost) here are added together into the overallCost field. There is no in-state tuition listed, so the overallCost works off the assumption that the out-of-state figure is the base cost. The dataset does not say whether this cost is yearly or by semester, so I will assume this is a semester tuition by using knowledge of US college prices.
These cost columns come together to an overall range of 6604-29095, which is not really all too crazy in the context of US colleges, sadly. These are actually closer to what is considered affordable in this context, which is pretty impressive.

Teachers with terminal degrees/PhDs have some pretty wide ranges here, especially when student to faculty ratios are considered. There is no field here for total teacher numbers, so it is hard to compute the percentage of well educated teachers. Terminal degree teachers do not necessarily mean good teachers, but it can be said that these colleges have some knowledgeable teachers with some pretty direct teaching methods.

The percent of alumni who donate ranges from 0% to 64%. I should stress that donating is not inherently necessary, as donations are entirely voluntary. Yet, at most 64% of people decided their education was good enough to donate to the college so it can continue to educate. Of course this could be a school culture issue like with places like Harvard, but this is an interesting thing to note nonetheless.

Expenditure per student has a pretty big range here, but this does not necessarily point to good or bad schools. The interesting thing is that this range has both a minimum lower and a maximum higher than the overall cost metric. This means that some schools spend more on their students than what the students are paying, while others are putting less into the students than what they are paying (likely keeping the rest for other projects or profit).
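
The comparison of the two ranges can be checked row by row instead of only through the minimum and maximum. This is a minimal sketch, assuming a made-up dataframe `toy` with the same `overallCost` and `expendPerStud` column names. The values and the helper column `spendMinusCost` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the college dataframe (all values are made up)
toy = pd.DataFrame(
    {
        "overallCost": [12000, 20000, 9000],
        "expendPerStud": [15000, 11000, 9000],
    },
    index=["College A", "College B", "College C"],
)

# Positive values: the school spends more per student than the student pays
# Negative values: the school spends less per student than the student pays
toy["spendMinusCost"] = toy["expendPerStud"] - toy["overallCost"]

spends_more = toy[toy["spendMinusCost"] > 0]
spends_less = toy[toy["spendMinusCost"] < 0]

print("Schools spending more than they charge:", len(spends_more))
print("Schools spending less than they charge:", len(spends_less))
```

Counting the rows on each side shows how common each case is, which the overall range alone cannot tell us.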

The graduation rate here is strange, since it ranges from 10 to 118 percent. This is quite odd, since what does a 118% graduation rate mean? Are there more people graduating than are enrolled?
Some further digging showed that only Cazenovia College had a rate above 100%, meaning this one specifically either had an error or might be incorrectly reporting their rates for some unknown reason.
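
If the out-of-range value needed handling before modeling, one option is to flag it and cap it at 100. This is only a sketch of that option and is not something the notebook itself does. The dataframe `toy`, its values, and the `gradRateCapped` column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the college dataframe (all values are made up)
toy = pd.DataFrame(
    {"gradRate": [60, 118, 100]},
    index=["College A", "College B", "College C"],
)

# Flag rows where the graduation rate is above the logical maximum of 100
invalid = toy["gradRate"] > 100
print(toy[invalid])

# Cap the rate at 100 in a new column, leaving the original values untouched
toy["gradRateCapped"] = toy["gradRate"].clip(upper=100)
print(toy)
```

Keeping the original column alongside the capped one makes it easy to see which rows were changed.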

Acceptance rates range from 15% to 100%. A smaller rate is not necessarily bad, given that schools like Harvard have acceptance rates in the single digits. The 100% acceptance rate is a bit weird, though, meaning they have no true filter on who gets in.

---

# Regression for Acceptance Rate

In [None]:
#GetChara: Get the most important characteristics to the regression
#Input: the characteristics, the model
#Output: None
def getChara(characteristics, forest):
    attributes = characteristics.columns #Get the tested attributes
    attributes = list(zip(attributes, forest.feature_importances_)) #Zip the attributes together with their feature importances
    sortAtt = sorted(attributes, key = lambda x: x[1], reverse = True) #Sort the zipped attributes by importance, highest first

    print("According to the Random Forest, the most important factors for acceptance are: ") #Start printing the most important labels

    #For each of the top five attributes in the sorted attributes
    for label, importance in sortAtt[:5]:
        print(label) #Print the label as an important factor

#AcceptanceRegression: Get an idea of what goes into the acceptance rate
#Input: The dataframe for the colleges
#Output: None
def acceptanceRegression(college):
    accept = college["acceptRate"].copy() #Get the acceptance rate for our regression

    characteristics = college.drop(columns = ["acceptRate"]).copy() #Get the characteristics used for the regression
    characteristics = pd.get_dummies(characteristics) #One-hot encode the non-numeric characteristics
    
    charaTrain, charaTest, accTrain, accTest = train_test_split(characteristics, accept, test_size = 0.1) #Split the dataset
    
    forest = RandomForestRegressor(n_estimators = 100) #Build a forest
    forest.fit(charaTrain, accTrain) #Fit the forest model
    
    predict = forest.predict(charaTest) #Get a list of predictions
    
    print("Forest Accuracy (R^2): ", forest.score(charaTest, accTest)) #Print the score; for a regressor, score() returns R^2 rather than a classification accuracy
    print("Root Mean Square Error: ", np.sqrt(mean_squared_error(accTest, predict))) #Print the root mean square error
    
    getChara(characteristics, forest) #Get the important Characteristics

In [None]:
col = college.drop(columns = ["numAccept", "numEnroll"]) #Remove the number accepted and the number enrolled, since the number accepted of course correlates to the rate

#Run the regression a few times to show variance
for i in range(0,5):
    acceptanceRegression(col) #Perform the regression

I had this run a few times to show the variety in the models. The best I have been able to get is a score of around 0.65 (this "accuracy" is the R² value that forest.score returns for a regressor), which I am going to assume is due to the small number of colleges (relatively speaking) and the number of factors not in the dataset that could define the acceptance rate (such as staff numbers, budget, and a variety of other factors). The root mean square error is very small despite this, so I feel this is about the highest possible with the given characteristics. The root mean square error represents the typical distance between the predictions and the actual values, in the same units as the acceptance rate, so the fact that it is low means the predictions, while not exact, are pretty darn close.
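
To make the two printed metrics concrete, here is a small worked sketch of how the root mean square error and the R² score are computed. The `actual` and `predicted` arrays are made-up acceptance rates, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical acceptance rates and predictions (made up for illustration)
actual = np.array([0.50, 0.70, 0.90, 0.80])
predicted = np.array([0.60, 0.65, 0.85, 0.70])

# Root mean square error: square root of the average squared difference
rmse = np.sqrt(mean_squared_error(actual, predicted))
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))

# R^2: one minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of the actual values
r2 = r2_score(actual, predicted)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("RMSE:", rmse, "manual:", rmse_manual)
print("R^2:", r2, "manual:", r2_manual)
```

Since the acceptance rate only spans 0 to 1, the error will look small in absolute terms, so it helps to read it next to the R² score rather than on its own.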

Despite all of this, a few characteristics keep showing up on top. One is the number of applications, which in itself does not show acceptance, but rather the scale at which people are applying. There are also top10Stud and top25Stud, which suggests that where higher end students go has a potential correlation with acceptance rates (likely with lower rates meaning higher end schools). There are others that pop in and out depending on the run, but these seem to be the ones the models find consistently.
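
One way to check which characteristics are consistently on top is to average the feature importances over several seeded runs rather than reading them off run by run. This is a minimal sketch on synthetic data, not the college dataset. The column names `signal`, `noiseA`, and `noiseB`, the seeds, and the forest size are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on "signal" and not on the noise columns
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "signal": rng.normal(size=200),
        "noiseA": rng.normal(size=200),
        "noiseB": rng.normal(size=200),
    }
)
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)

# Fit one forest per seed and collect its feature importances
importances = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(X_train, y_train)
    importances.append(forest.feature_importances_)

# Average the importances across runs and sort from most to least important
mean_importance = pd.Series(
    np.mean(importances, axis=0), index=X.columns
).sort_values(ascending=False)
print(mean_importance)
```

Fixing `random_state` for both the split and the forest also makes each run reproducible, which the unseeded runs above are not.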