This dataset contains 1295 records of American colleges and their properties, collected by the [US Department of Education](https://collegescorecard.ed.gov/data/documentation/).

In [3]:
import pandas as pd
import lux

In [4]:
df = pd.read_csv(r'C:\Users\11488\Desktop\lux-master\lux\data\college.csv')
df

Unnamed: 0,Name,PredominantDegree,HighestDegree,FundingModel,Region,Geography,AdmissionRate,ACTMedian,SATAverage,AverageCost,Expenditure,AverageFacultySalary,MedianDebt,AverageAgeofEntry,MedianFamilyIncome,MedianEarnings
0,Alabama A & M University,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8989,17,823,18888,7459,7079,19500.0,20.629999,29039.0,27000
1,University of Alabama at Birmingham,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8673,25,1146,19990,17208,10170,16250.0,22.670000,34909.0,37200
2,University of Alabama in Huntsville,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8062,26,1180,20306,9352,9341,16500.0,23.190001,39766.0,41500
3,Alabama State University,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.5125,17,830,17400,7393,6557,15854.5,20.889999,24029.5,22400
4,The University of Alabama,Bachelor's,Graduate,Public,Southeast,Small City,0.5655,26,1171,26717,9817,9605,17750.0,20.770001,58976.0,39200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1289,University of Connecticut-Avery Point,Bachelor's,Graduate,Public,New England,Mid-size Suburb,0.5940,24,1020,12946,11730,14803,18983.0,20.120001,86510.0,49700
1290,University of Connecticut-Stamford,Bachelor's,Graduate,Public,New England,Mid-size City,0.4107,21,1017,13028,4958,14803,18983.0,20.120001,86510.0,49700
1291,California State University-Channel Islands,Bachelor's,Graduate,Public,Far West,Mid-size Suburb,0.6443,20,954,22570,12026,8434,12500.0,24.850000,32103.0,35800
1292,DigiPen Institute of Technology,Bachelor's,Graduate,Private For-Profit,Far West,Small City,0.6635,28,1225,37848,5998,7659,19000.0,21.209999,68233.0,72800


We see that `ACTMedian` and `SATAverage` are very strongly correlated. This means that we could probably keep just one of the two columns and still retain about the same information, so let's drop the `ACTMedian` column.

In [None]:
df = df.drop(columns=["ACTMedian"])
df
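The correlation noted above can also be checked numerically before the column is dropped. The following is a minimal sketch that uses only the nine rows visible in the preview, not the full dataset, so the resulting value is illustrative rather than the dataset-wide figure. The name `preview` is introduced here for illustration.

```python
import pandas as pd

# The nine rows visible in the preview above (NOT the full dataset).
preview = pd.DataFrame({
    "ACTMedian":  [17, 25, 26, 17, 26, 24, 21, 20, 28],
    "SATAverage": [823, 1146, 1180, 830, 1171, 1020, 1017, 954, 1225],
})

# Pearson correlation coefficient between the two test-score columns.
r = preview["ACTMedian"].corr(preview["SATAverage"])
print(f"Pearson r = {r:.3f}")
```

On the full frame, the equivalent call would be `df["ACTMedian"].corr(df["SATAverage"])`, run while `ACTMedian` is still present.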

From the Category tab, we see that there are few records where `PredominantDegree` is "Certificate". In addition, there are not a lot of colleges with "Private For-Profit" as `FundingModel`.

In [None]:
df[df["PredominantDegree"]=="Certificate"].to_pandas()

Upon inspection, there is only a single record labelled "Certificate". Looking at the [webpage for programs offered at Cleveland State Community College](http://catalog.clevelandstatecc.edu/content.php?catoid=2&navoid=90), we see that the college offers a large number of associate degrees as well as certificates. We therefore decide that "Associate" is the more appropriate label for this record's `PredominantDegree` field.

In [None]:
df.loc[df["PredominantDegree"]=="Certificate","PredominantDegree"] = "Associate"
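One simple way to confirm that a relabel like this took effect is to compare `value_counts()` before and after the assignment. The sketch below runs on a made-up four-row frame: the column name and category labels come from the text above, but the college names and the name `toy` are placeholders, not data from `college.csv`.

```python
import pandas as pd

# Toy frame for illustration only; the college names are placeholders.
toy = pd.DataFrame({
    "Name": ["College A", "College B", "College C", "College D"],
    "PredominantDegree": ["Bachelor's", "Associate", "Certificate", "Bachelor's"],
})

# Category counts before the relabel.
before = toy["PredominantDegree"].value_counts()

# Boolean mask selects the rows to change; .loc assigns in place.
mask = toy["PredominantDegree"] == "Certificate"
toy.loc[mask, "PredominantDegree"] = "Associate"

# Category counts after the relabel: "Certificate" should be gone.
after = toy["PredominantDegree"].value_counts()
print(before, after, sep="\n\n")
```

Assigning through `.loc` with a boolean mask modifies the frame directly, which avoids the chained-indexing pitfalls of writing to `df[mask]["PredominantDegree"]`.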

By inspecting the subset of 9 colleges that are "Private For-Profit", we do not find any commonalities across them, so we can just leave the data as-is for now.

In [None]:
df[df["FundingModel"]=="Private For-Profit"]

Back to looking at the entire dataset:

In [None]:
df

We are interested in picking a college to attend and want to understand the `AverageCost` of attending different colleges and how that relates to other information in the dataset.

In [None]:
df.set_intent(["AverageCost"])
df

We see that a large number of colleges cost around $20,000 per year. We also see that Bachelor's-degree colleges, colleges in New England, and colleges in large cities tend to have a higher `AverageCost` than their counterparts.

We are interested in the trend of `AverageCost` vs. `SATAverage`, since there is a roughly upward relationship for `AverageCost` above $30,000, while below that the trend is less clear.

In [None]:
df.set_intent(["AverageCost","SATAverage"])
df

By adding `FundingModel`, we see that the cluster of points on the left can clearly be attributed to public colleges, whereas private colleges more or less follow a trend in which colleges with higher `SATAverage` tend to have higher `AverageCost`.
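A visual pattern like this can be cross-checked by computing the correlation separately within each funding model. The following is a rough sketch in plain pandas, not a Lux feature. The numbers are invented for illustration and are not rows from `college.csv`, and the label "Private" is a simplified placeholder for the dataset's private categories.

```python
import pandas as pd

# Toy values invented for illustration only; NOT rows from college.csv.
toy = pd.DataFrame({
    "FundingModel": ["Public"] * 4 + ["Private"] * 4,
    "SATAverage":   [900, 1000, 1100, 1200, 900, 1000, 1100, 1200],
    "AverageCost":  [20000, 19000, 21000, 20000, 25000, 30000, 36000, 42000],
})

# Pearson correlation of SATAverage vs. AverageCost within each funding model.
per_group = {
    name: group["SATAverage"].corr(group["AverageCost"])
    for name, group in toy.groupby("FundingModel")
}

for name, r in per_group.items():
    print(f"{name}: r = {r:.2f}")
```

In this toy data the private group shows a strong positive correlation while the public group shows a weak one, which is the kind of split the scatterplot suggests. Running the same grouping on the real `df` would show whether the actual data supports it.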