## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

### Data quality check
*By Elton John*

The code below visualizes the distribution of all the variables in the dataset, and their association with the response.

In [7]:
#...Distribution of continuous variables...#

In [8]:
#...Distribution of categorical variables...#

In [9]:
#...Association of the response with the predictors...#

### Data cleaning
*By Peggy Han*

For data cleaning, we performed the following:

1. Some columns contain only NaN values, so we removed them to simplify the data set. We also dropped irrelevant variables that would not help in developing the model.

2. The second data set, obtained from College Scorecard, has about 3,000 variables. Based on the variable descriptions, we manually selected those that might be relevant to the response.

3. When merging the two data sets, we found duplicated rows in the resulting data frame because more than one institution in the second data set can have the same school code. In addition, some schools have different codes in the two data sets, so their columns from the second data set are missing. We manually deleted the duplicates and filled in the missing values in the merged data.

4. Some institutions are unranked in the Kaggle data set, so we dropped those rows. We also dropped columns with only one unique value, as they provide no information for building the model.

5. Some College Scorecard variables hold the same information but are stored in separate public and private columns. For private schools, the value is stored in the private column and the public column is NaN, and vice versa. We combined each pair into a single column so that the variable can be used as a predictor without these NaN values.

6. Both data sets contain average SAT score, enrollment, and admission rate. We believe the College Scorecard data is more accurate, so we prioritize it and use the Kaggle data only to fill in missing values in the College Scorecard columns, which minimizes the number of missing values.

The code below implements the above cleaning.

In [2]:
df1 = pd.read_json('project_data/schoolInfo.json')

In [3]:
# Dropped columns which only have NaN values
df1.dropna(axis=1, how='all', inplace=True)

# Dropped irrelevant variables
df1 = df1.drop(['nonResponderText', 'nonResponder', 'primaryPhoto', 'primaryPhotoThumb', 'aliasNames', 'urlName'], 
         axis = 1)

In [4]:
df2 = pd.read_csv('project_data/MERGED2017_18_PP.csv')


In [6]:
# Manually selected some variables that seem relevant based on description
df2_slice = df2[['OPEID6','INSTNM','SCH_DEG','NUMBRANCH','PREDDEG','HIGHDEG','REGION','ADM_RATE','SATVR25','SATVR75','SATMT25',
                'SATMT75','SATWR25','SATWR75','SATVRMID','SATMTMID','SATWRMID','ACTCM25','ACTCM75','ACTEN25',
                'ACTEN75','ACTMT25','ACTMT75','ACTWR25','ACTWR75','ACTCMMID','ACTENMID','ACTMTMID','ACTWRMID',
                'SAT_AVG','UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_ASIAN','UGDS_AIAN','UGDS_NHPI',
                'UGDS_2MOR','UGDS_NRA','UGDS_UNKN','PPTUG_EF','NPT4_PUB','NPT4_PRIV','NUM4_PUB','NUM4_PRIV',
                'NUM4_PROG','NUM4_OTHER','COSTT4_A','COSTT4_P','TUITIONFEE_IN','TUITIONFEE_OUT','TUITIONFEE_PROG',
                'TUITFTE','INEXPFTE','AVGFACSAL','PFTFAC']]

# Dropped columns with only NA (reassign and copy so we do not modify a slice of df2)
df2_slice = df2_slice.dropna(axis=1, how='all').copy()

# Use the school code to create a matching column to merge the two datasets
df2_slice['primaryKey'] = df2_slice['OPEID6']

result = pd.merge(df1, df2_slice, on='primaryKey', how="left", indicator = True)
result.shape

(420, 78)

In [7]:
# Identified duplicated rows
duplicated = result[result['primaryKey'].duplicated(keep=False)]
duplicated.shape

(148, 78)
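
To illustrate why a left merge can produce both duplicated and unmatched rows, here is a minimal sketch on made-up data. The frames `left` and `right` and all their values are hypothetical; only the `primaryKey` and `INSTNM` column names and the `indicator=True` pattern come from the cells above.

```python
import pandas as pd

# Toy stand-ins for the two sources; all values here are made up
left = pd.DataFrame({"primaryKey": [1001, 1002, 1003],
                     "school": ["A", "B", "C"]})
right = pd.DataFrame({"primaryKey": [1001, 1001, 1002],
                      "INSTNM": ["A main campus", "A branch campus", "B"]})

# Key 1001 appears twice on the right, so school A is repeated after the merge
merged = pd.merge(left, right, on="primaryKey", how="left", indicator=True)

# Rows whose key appears more than once (candidates for manual review)
dup_rows = merged[merged["primaryKey"].duplicated(keep=False)]

# Rows with no match in the second source (their right-hand columns are NaN)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged.shape)
print(dup_rows["INSTNM"].tolist())
print(unmatched["school"].tolist())
```

In this toy case school A appears twice and school C has no match, which mirrors the two situations we resolved manually.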

In [17]:
# Downloaded data to perform manual selection
#result.to_csv('processed_data_1.csv', index=False)
#duplicated.to_csv('duplicated_1.csv', index=False)

# Read the manually processed data
data = pd.read_csv('project_data/processed_data1.csv')

In [18]:
# Identified unranked institutions
data.loc[data.rankingDisplayRank == "Unranked"]

Unnamed: 0,act-avg,sat-avg,enrollment,city,sortName,zip,acceptance-rate,rankingDisplayScore,percent-receiving-aid,cost-after-aid,...,COSTT4_A,COSTT4_P,TUITIONFEE_IN,TUITIONFEE_OUT,TUITIONFEE_PROG,TUITFTE,INEXPFTE,AVGFACSAL,PFTFAC,_merge
300,,,1264.0,San Diego,alliantinternationaluniversity,92131,,,,,...,,,,,,,,,,left_only
301,,,,Orange,argosyuniversity,92868,,,,,...,29396.0,,13438.0,13438.0,,16793.0,6133.0,5460.0,0.1774,both
302,,,,San Francisco,californiainstituteofintegralstudies,94103,,,,,...,,,,,,22916.0,13260.0,8190.0,0.3247,both
303,,,,Minneapolis,capellauniversity,55403,,,,,...,19836.0,,14250.0,14250.0,,16533.0,1714.0,6841.0,0.1387,both
304,,,,Pocatello,idahostateuniversity,83209,,,,,...,19592.0,,7166.0,21942.0,,7822.0,11741.0,7156.0,0.938,both
305,,,133.0,San Diego,northcentraluniversity,86314,,,,,...,,,,,,16529.0,2909.0,6347.0,0.2204,left_only
306,,,,Cypress,tridentuniversityinternational,90630,96.0,,,,...,17544.0,,9240.0,9240.0,,9361.0,1828.0,5705.0,0.0614,both
307,,,,Cincinnati,unioninstituteanduniversity,45206,,,,,...,24696.0,,12896.0,12896.0,,16910.0,6638.0,5540.0,0.169,both
308,,,,Phoenix,universityofphoenix,85034,,,,,...,20083.0,,9608.0,9608.0,,13180.0,2042.0,4485.0,0.0462,left_only
309,,,,Minneapolis,waldenuniversity,55401,,,,,...,,,12465.0,12465.0,,10183.0,2854.0,6769.0,0.0674,both


In [19]:
# Removed unranked institutions
data.drop(index = range(300,311), inplace = True)

# Dropped columns with only NA
data.dropna(axis=1, how='all', inplace=True)

# Drop columns with only 1 unique value
cols_to_drop = []
for col in data.columns:
    if data[col].nunique() == 1:
        cols_to_drop.append(col)
data.drop(cols_to_drop, axis=1, inplace = True)
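
As an alternative to dropping a fixed index range, the same two steps can be sketched with a boolean mask and a vectorized `nunique` check. This is only an illustration on a hypothetical miniature frame (`toy` and its values are made up), not the code we ran.

```python
import pandas as pd

# Hypothetical miniature of the merged data
toy = pd.DataFrame({
    "rankingDisplayRank": ["#1", "#2", "Unranked", "Unranked"],
    "state": ["NJ", "MA", "CA", "MN"],
    "isPublic": [False, False, False, False],   # constant column
})

# Keep only ranked rows, without relying on where the unranked rows sit
ranked = toy[toy["rankingDisplayRank"] != "Unranked"]

# Keep only columns with more than one distinct value
ranked = ranked.loc[:, ranked.nunique() > 1]

print(ranked.columns.tolist())
```

One difference to keep in mind: `nunique` ignores NaN, so this filter would also remove a column that is entirely NaN, whereas the loop above only removes columns with exactly one distinct value.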

In [20]:
# Combined columns that have the same information but stored separately for public and private institutions
data['NPT4'] = data['NPT4_PUB'].fillna(data['NPT4_PRIV'])
data['NUM4'] = data['NUM4_PUB'].fillna(data['NUM4_PRIV'])

# Dropped already combined columns and _merge
data.drop(['_merge','NPT4_PUB','NPT4_PRIV','NUM4_PUB','NUM4_PRIV'], axis=1, inplace=True)
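
Before combining a public/private pair, it can help to check that no row has both columns populated and to count rows where both are missing. The sketch below does this on made-up values; `toy`, `both_filled` and `still_missing` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy values, not taken from the project data
toy = pd.DataFrame({
    "NPT4_PUB":  [11000.0, np.nan, np.nan],
    "NPT4_PRIV": [np.nan, 25000.0, np.nan],
})

# Rows where both columns are populated would make the result depend on fill order
both_filled = (toy["NPT4_PUB"].notna() & toy["NPT4_PRIV"].notna()).sum()

# Same coalescing pattern as in the cell above
toy["NPT4"] = toy["NPT4_PUB"].fillna(toy["NPT4_PRIV"])

# Rows where neither column had a value stay NaN after combining
still_missing = toy["NPT4"].isna().sum()

print(both_filled, still_missing)
```

If `still_missing` is not zero on the real data, the combined column still needs imputation or row removal before modeling.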

In [21]:
# Create a new column that contains the SAT average
# Filled missing values in College Scorecard SAT average with values of SAT average from the Kaggle data
# to minimize number of missing values
data['sat_avg'] = data['SAT_AVG'].fillna(data['sat-avg'])

# Dropped the two original columns
data.drop(['SAT_AVG','sat-avg'], axis=1, inplace=True)

In [22]:
# With the same principle, we will use enrollment data from college scorecard instead of Kaggle
# Dropped Kaggle enrollment data
data.drop(['enrollment'], axis=1, inplace=True)

In [23]:
# Rename variables to more interpretable names
data = data.rename(columns = {
    'act-avg': 'act_avg',
    'acceptance-rate': 'acceptance_rate',
    'percent-receiving-aid': 'percent_receiving_aid',
    'cost-after-aid': 'cost_after_aid',
    'hs-gpa-avg': 'hs_gpa_avg', 
    'INSTNM': 'institution_name',
    'NUMBRANCH': 'branches', 
    'REGION': 'region',
    'ADM_RATE': 'admission_rate',
    'SATVR25': 'satCR25', 
    'SATVR75': 'satCR75',
    'SATMT25': 'satmt25',
    'SATMT75': 'satmt75',
    'SATVRMID': 'satcrmid', 
    'SATMTMID': 'satmtmid',
    'ACTCM25': 'actcm25',
    'ACTCM75': 'actcm75',
    'ACTEN25': 'acten25', 
    'ACTEN75': 'acten75',
    'ACTMT25': 'actmt25',
    'ACTMT75': 'actmt75',
    'ACTCMMID': 'actcmmid', 
    'ACTENMID': 'actenmid',
    'ACTMTMID': 'actmtmid',
    'UGDS': 'enrollment', 
    'UGDS_WHITE': 'percent_white',
    'UGDS_BLACK': 'percent_black',
    'UGDS_HISP': 'percent_hispanic',
    'UGDS_ASIAN': 'percent_asian', 
    'UGDS_AIAN': 'percent_aian',
    'UGDS_NHPI': 'percent_nhpi', 
    'UGDS_2MOR': 'percent_twoormore',
    'UGDS_NRA': 'percent_nra',
    'UGDS_UNKN': 'percent_unknown', 
    'PPTUG_EF': 'percent_parttime',
    'COSTT4_A': 'avg_cost',
    'TUITIONFEE_IN': 'instante_tuition', 
    'TUITIONFEE_OUT': 'outstate_tuition',
    'TUITFTE': 'tuition_revenue_per', 
    'INEXPFTE': 'instructional_expenditure_per', 
    'AVGFACSAL': 'avg_faculty_salary', 
    'PFTFAC': 'ft_faculty_rate', 
    'NPT4': 'avg_net_price', 
    'NUM4': 'number_titleIV'
}
                  )

In [24]:
# Identified missing value in admission_rate
missing = data['admission_rate'].isna()
na_rows = data[missing]
na_rows

Unnamed: 0,act_avg,city,sortName,zip,acceptance_rate,rankingDisplayScore,percent_receiving_aid,cost_after_aid,state,rankingSortRank,...,avg_cost,instante_tuition,outstate_tuition,tuition_revenue_per,instructional_expenditure_per,avg_faculty_salary,ft_faculty_rate,avg_net_price,number_titleIV,sat_avg
263,16.0,Nashville,tennesseestateuniversity,37209,53.0,,,,TN,-1,...,19058.0,7776.0,21132.0,6877.0,8732.0,7310.0,0.9707,11083.0,609.0,788.0


In [25]:
# Filled in missing value in College Scorecard admission rate data with Kaggle acceptance rate data
ar = data.loc[263, 'acceptance_rate']/100
data.at[263, 'admission_rate'] = ar
data.drop(['acceptance_rate'], axis=1, inplace=True)
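
The single-row fix above can be generalized so that it does not depend on a hard-coded row label. The following sketch assumes, as in our data, that the Kaggle column is a percentage and the College Scorecard column is a proportion; the frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "admission_rate": [0.31, np.nan, 0.72],     # College Scorecard, proportion
    "acceptance_rate": [30.0, 53.0, np.nan],    # Kaggle, percent
})

# Prefer the Scorecard value; fall back to the Kaggle percentage rescaled to a proportion
toy["admission_rate"] = toy["admission_rate"].fillna(toy["acceptance_rate"] / 100)
toy = toy.drop(columns=["acceptance_rate"])

print(toy["admission_rate"].tolist())
```

With only one missing value, as in our data, both versions give the same result.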

In [None]:
# Downloaded cleaned data
data.to_csv('cleaned_data.csv', index=False)

### Data preparation
*By Sankaranarayanan Balasubramanian and Chun-Li*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem to be helpful to predict house price.

2. We have shuffled the dataset to prepare it for K-fold cross validation.

3. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.

In [3]:
######---------------Creating new predictors----------------#########

#Creating number of bedrooms per unit floor area

#Creating ratio of bathrooms to bedrooms

#Creating ratio of carpet area to floor area

In [None]:
######-----------Shuffling the dataset for K-fold------------#########

In [None]:
######-----Standardizing the dataset for Lasso / Ridge-------#########

## Exploratory data analysis

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

## Developing the model

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

### Developing the main model with LOOCV

*By Mingyi Gong*

We first use leave-one-out cross-validation (LOOCV) to classify observations into ranking categories. Many observations have invalid or missing rankings, and we handle them in two ways: (1) we fill those rankings with the mean, and (2) we drop the observations with invalid rankings. We run the code on the data set from approach (1) first, and then on the data set from approach (2).
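
The two approaches can be sketched on toy data as follows. This assumes that invalid rankings are coded as -1 in `rankingSortRank`, as the outputs in this notebook suggest. The frame `toy` and its values are made up, and the explicit `replace` step is an illustration rather than the exact code in the cells below.

```python
import numpy as np
import pandas as pd

# Toy data; -1 marks an invalid ranking
toy = pd.DataFrame({"rankingSortRank": [1, 2, 3, -1, -1],
                    "sat_avg": [1500.0, 1480.0, np.nan, 1100.0, 1050.0]})

# Approach (1): treat -1 as missing, then fill every numeric gap with the column mean
filled = toy.copy()
filled["rankingSortRank"] = filled["rankingSortRank"].replace(-1, np.nan)
filled = filled.fillna(filled.mean(numeric_only=True))

# Approach (2): drop the observations with invalid rankings
dropped = toy[toy["rankingSortRank"] > 0]

print(filled["rankingSortRank"].tolist())
print(len(dropped))
```

Note that `fillna` alone only touches NaN entries, so a placeholder such as -1 has to be converted to NaN first if it is meant to be replaced by the mean.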

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

In [2]:
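# based on dataset from data cleaning
df = pd.read_csv('~/Desktop/303/newdata.csv')
# approach (1): fill missing values with the column means (numeric columns only)
df = df.fillna(df.mean(numeric_only=True))
# approach (2): keep only observations with a valid (positive) ranking
df2 = df.loc[df.rankingSortRank > 0,:]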

In [3]:
# selecting columns in int and float type for loocv
X = df[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
Y = df[['rankingSortRank']]

In [4]:
# use loocv to predict and get accuracy
X = df[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
y = df[['rankingSortRank']]
loocv = LeaveOneOut()
loocv.get_n_splits(X)

model = LogisticRegression()

true = []
predicted = []

for train_index, test_index in loocv.split(X):

    # LeaveOneOut yields positional indices, so select rows with .iloc
    X_train=X.iloc[train_index]
    X_test=X.iloc[test_index]
    y_train=y.iloc[train_index]
    y_test=y.iloc[test_index]
    
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    true.append(y_test['rankingSortRank'].values[0])
    predicted.append(y_pred[0])
    
print(true)
print("-----")
print(predicted)
accuracy = accuracy_score(true, predicted)

print("Accuracy:", accuracy) # accuracy score for specific rank predicting, will do classification accuracy later

[1, 2, 3, 3, 5, 5, 5, 8, 9, 10, 11, 11, 11, 14, 14, 14, 14, 18, 18, 20, 21, 21, 21, 21, 25, 25, 27, 28, 29, 30, 30, 32, 32, 34, 34, 34, 37, 37, 37, 40, 40, 42, 42, 42, 42, 46, 46, 46, 46, 46, 46, 52, 52, 54, 54, 56, 56, 56, 56, 56, 61, 61, 61, 61, 61, 61, 67, 68, 69, 69, 69, 69, 69, 69, 75, 75, 75, 78, 78, 78, 81, 81, 81, 81, 81, 81, 87, 87, 87, 90, 90, 90, 90, 94, 94, 94, 97, 97, 97, 97, 97, 97, 103, 103, 103, 103, 103, 103, 103, 110, 110, 110, 110, 110, 115, 115, 115, 115, 115, 120, 120, 120, 120, 124, 124, 124, 124, 124, 124, 124, 124, 132, 133, 133, 133, 133, 133, 133, 133, 140, 140, 140, 140, 140, 145, 145, 145, 145, 145, 145, 151, 151, 151, 151, 151, 156, 156, 156, 159, 159, 159, 159, 159, 159, 165, 165, 165, 165, 165, 165, 171, 171, 171, 171, 171, 176, 176, 176, 176, 176, 181, 181, 181, 181, 181, 181, 187, 187, 187, 187, 187, 192, 192, 192, 192, 192, 192, 198, 198, 198, 198, 202, 202, 202, 202, 202, 207, 207, 207, 207, 207, 207, 207, 207, 207, 216, 216, 216, 216, 216, 216, 216, 
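
For reference, the same leave-one-out procedure can be written more compactly with scikit-learn's `cross_val_predict`. The sketch below runs on synthetic two-class data, not on our data set. The names `X_toy`, `y_toy` and `clf`, the added scaling step and `max_iter=1000` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real predictors and labels
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(15, 3)), rng.normal(3, 1, size=(15, 3))])
y_toy = np.array([0] * 15 + [1] * 15)

# Scaling inside the pipeline is refit on each training fold,
# so the held-out row never influences the scaler
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One prediction per observation, each from a model that never saw that observation
pred = cross_val_predict(clf, X_toy, y_toy, cv=LeaveOneOut())
accuracy = (pred == y_toy).mean()

print("LOOCV accuracy:", accuracy)
```

Because every distinct rank is its own class in the cell above, exact-rank accuracy is a demanding target, which is why we move to coarser categories next.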

We use rank 50 as the cutoff: institutions with a rank below 50 are classified as high-ranking, and all others as low-ranking.
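
The cutoff rule can be written as a small helper. The function name `rank_category` is hypothetical and is not used elsewhere in this notebook.

```python
def rank_category(rank, cutoff=50):
    """Label a rank as 'high' if it is strictly below the cutoff, else 'low'."""
    return "high" if rank < cutoff else "low"


labels = [rank_category(r) for r in (1, 49, 50, 216)]
print(labels)
```

Under this rule a placeholder rank such as -1 would also be labelled high, so invalid rankings need to be handled before the rule is applied.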

In [5]:
# use loocv to predict and get accuracy
X = df[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
y = df[['rankingSortRank']]
loocv = LeaveOneOut()
loocv.get_n_splits(X)

model = LogisticRegression()

true = []
predicted = []

high_df = []
low_df = []
high_predicted = []
low_predicted = []

for train_index, test_index in loocv.split(X):

    # LeaveOneOut yields positional indices, so select rows with .iloc
    X_train=X.iloc[train_index]
    X_test=X.iloc[test_index]
    y_train=y.iloc[train_index]
    y_test=y.iloc[test_index]
    
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    true.append(y_test['rankingSortRank'].values[0])
    predicted.append(y_pred[0])
    
    #classifying into high and low
    
    if y_test['rankingSortRank'].values[0] < 50:
        high_df.append(test_index[0])
    else:
        low_df.append(test_index[0])
    if y_pred[0] < 50:
        high_predicted.append(test_index[0])
    else:
        low_predicted.append(test_index[0])

print(high_df)
print("-----")
print(low_df)
print("prediction below-----------------")
print(high_predicted)
print("-----")
print(low_predicted)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299]
-----
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 15

After classifying the observations into high and low categories, we compute the accuracy of this two-category classification.

In [6]:
correct_h = 0
for i in high_df:
    for j in high_predicted:
        if i ==j:
            correct_h = correct_h + 1 
correct_h  

101

In [7]:
correct_l = 0
for i in low_df:
    for j in low_predicted:
        if i ==j:
            correct_l = correct_l + 1 
correct_l   

134

In [8]:
# accuracy rate for classifying into 2 categories
(correct_h +correct_l)/df.shape[0]

0.7833333333333333
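
The nested loops above count the indices that appear in both the true and the predicted list of a category. Since each index appears at most once per list, the same count can be obtained with a set intersection. The sketch below uses a hypothetical helper `two_class_accuracy` and made-up indices.

```python
def two_class_accuracy(true_high, pred_high, true_low, pred_low):
    """Share of observations whose predicted group matches their true group."""
    correct_high = len(set(true_high) & set(pred_high))
    correct_low = len(set(true_low) & set(pred_low))
    total = len(true_high) + len(true_low)
    return (correct_high + correct_low) / total


# Toy indices: observations 0-2 are truly high, 3-5 are truly low
acc = two_class_accuracy([0, 1, 2], [0, 1, 3], [3, 4, 5], [2, 4, 5])
print(acc)
```

Here four of the six toy observations land in the correct group, so the accuracy is 4/6.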

In [None]:
# dataset for submodel
predicted_l = df.iloc[[14, 21, 29, 32, 34, 38, 39, 43, 44, 46, 49, 50, 51, 52, 53, 54, 57, 58, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 89, 90, 91, 92, 93, 95, 96, 97, 98, 99, 100, 102, 104, 105, 106, 108, 109, 110, 111, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 125, 126, 127, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 153, 154, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 169, 171, 172, 173, 174, 176, 178, 179, 181, 182, 183, 184, 185, 186, 189, 190, 192, 193, 195, 196, 198, 199, 205, 207, 209, 210, 212, 214, 218, 220, 227, 229, 231, 232, 235, 237, 256, 272, 281, 285],:]
predicted_l.to_csv('predicted_l.csv')

In [None]:
# dataset for submodel
predicted_h = df.iloc[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 30, 31, 33, 35, 36, 37, 40, 41, 42, 45, 47, 48, 55, 56, 59, 60, 80, 87, 88, 94, 101, 103, 107, 112, 124, 128, 132, 140, 152, 155, 161, 168, 170, 175, 177, 180, 187, 188, 191, 194, 197, 200, 201, 202, 203, 204, 206, 208, 211, 213, 215, 216, 217, 219, 221, 222, 223, 224, 225, 226, 228, 230, 233, 234, 236, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 273, 274, 275, 276, 277, 278, 279, 280, 282, 283, 284, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299],:]
predicted_h.to_csv('predicted_h.csv')

Next, for the data set from approach (1), we set three cutoffs and classify the observations into four categories directly in one model. We will compare this with the submodels later.

In [10]:
# use loocv to predict and get accuracy through 3 cutoff in one model
X = df[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
y = df[['rankingSortRank']]
loocv = LeaveOneOut()
loocv.get_n_splits(X)

model = LogisticRegression()

true = []
predicted = []

high_df = []
medium_high_df = []
medium_low_df = []
low_df = []
high_predicted = []
medium_high_predicted = []
medium_low_predicted = []
low_predicted = []


for train_index, test_index in loocv.split(X):

    # LeaveOneOut yields positional indices, so select rows with .iloc
    X_train=X.iloc[train_index]
    X_test=X.iloc[test_index]
    y_train=y.iloc[train_index]
    y_test=y.iloc[test_index]
    
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)

    true.append(y_test['rankingSortRank'].values[0])
    predicted.append(y_pred[0])
    
    # classifying into four categories: high, medium-high, medium-low, low
    
    if y_test['rankingSortRank'].values[0] < 50:
        if y_test['rankingSortRank'].values[0] < 25:
            high_df.append(test_index[0])
        else:
            medium_high_df.append(test_index[0])
    else:
        if y_test['rankingSortRank'].values[0] > 200:
            low_df.append(test_index[0])
        else:
            medium_low_df.append(test_index[0])
    if y_pred[0] < 50:
        if y_pred[0] < 25:
            high_predicted.append(test_index[0])
        else:
            medium_high_predicted.append(test_index[0])
    else:
        # note: this cutoff (150) differs from the one applied to the true ranks above (200)
        if y_pred[0] > 150:
            low_predicted.append(test_index[0])
        else:
            medium_low_predicted.append(test_index[0])
modelll=model.fit(X,y)
print(high_df)
print("-----")
print(medium_high_df)
print("-----")
print(medium_low_df)
print("-----")
print(low_df)
print("prediction below-----------------")
print(high_predicted)
print("-----")
print(medium_high_predicted)
print("-----")
print(medium_low_predicted)
print("-----")
print(low_predicted)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299]
-----
[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
-----
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 

We now calculate the accuracy of the four-category classification.

In [11]:
correct_h = 0
for i in high_df:
    for j in high_predicted:
        if i ==j:
            correct_h = correct_h + 1 
correct_h  

82

In [12]:
correct_medium_h = 0
for i in medium_high_df:
    for j in medium_high_predicted:
        if i ==j:
            correct_medium_h = correct_medium_h + 1 
correct_medium_h 

13

In [13]:
correct_medium_l = 0
for i in medium_low_df:
    for j in medium_low_predicted:
        if i ==j:
            correct_medium_l = correct_medium_l + 1 
correct_medium_l 

80

In [14]:
correct_l = 0
for i in low_df:
    for j in low_predicted:
        if i ==j:
            correct_l = correct_l + 1 
correct_l  

11

In [15]:
# accuracy rate for classifying through 3 cutoff in 1 model
(correct_h +correct_medium_h+correct_medium_l +correct_l)/df.shape[0]

0.62
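
A four-category split can also be expressed with `pd.cut`, which applies one set of bin edges to both the true and the predicted ranks. The sketch below is an illustration on made-up ranks. The bin edges follow the cutoffs used for the true ranks above (25, 50, and above 200), and the names `true_rank`, `pred_rank`, `bins` and `labels` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted ranks
true_rank = pd.Series([3, 30, 120, 210, 49])
pred_rank = pd.Series([10, 60, 150, 250, 25])

# Left-closed bins: [1, 25), [25, 50), [50, 201), [201, inf)
bins = [1, 25, 50, 201, np.inf]
labels = ["high", "medium_high", "medium_low", "low"]

true_cat = pd.cut(true_rank, bins=bins, labels=labels, right=False)
pred_cat = pd.cut(pred_rank, bins=bins, labels=labels, right=False)

# Compare as strings so the check does not depend on categorical dtype details
accuracy = (true_cat.astype(str) == pred_cat.astype(str)).mean()
print(accuracy)
```

Ranks that fall outside the bins, such as a -1 placeholder, become NaN under `pd.cut` rather than being assigned to the top category.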

Now we run the same models on the data set from approach (2).

In [16]:
# use loocv to predict and get accuracy
X = df2[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
y = df2[['rankingSortRank']]
loocv = LeaveOneOut()
loocv.get_n_splits(X)

model = LogisticRegression()

true = []
predicted = []

high_df = []
low_df = []
high_predicted = []
low_predicted = []

for train_index, test_index in loocv.split(X):

    # LeaveOneOut yields positional indices, so select rows with .iloc
    X_train=X.iloc[train_index]
    X_test=X.iloc[test_index]
    y_train=y.iloc[train_index]
    y_test=y.iloc[test_index]
    
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    true.append(y_test['rankingSortRank'].values[0])
    predicted.append(y_pred[0])
    
    #classifying into high and low
    
    if y_test['rankingSortRank'].values[0] < 50:
        high_df.append(test_index[0])
    else:
        low_df.append(test_index[0])
    if y_pred[0] < 50:
        high_predicted.append(test_index[0])
    else:
        low_predicted.append(test_index[0])

print(high_df)
print("-----")
print(low_df)
print("prediction below-----------------")
print(high_predicted)
print("-----")
print(low_predicted)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
-----
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 22

In [None]:
# dataset for submodel-drop -1 ranking 
predicted_h = df.iloc[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 30, 31, 33, 35, 36, 37, 40, 41, 42, 45, 47, 48, 55, 56, 59, 60, 80, 85, 87, 94, 101, 103, 112, 128, 140, 147, 161, 168],:]
predicted_h.to_csv('predicted_h_230rows.csv')
predicted_l = df.iloc[[14, 21, 29, 32, 34, 38, 39, 43, 44, 46, 49, 50, 51, 52, 53, 54, 57, 58, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 86, 88, 89, 90, 91, 92, 93, 95, 96, 97, 98, 99, 100, 102, 104, 105, 106, 107, 108, 109, 110, 111, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229],:]
predicted_l.to_csv('predicted_l_230rows.csv')

In [17]:
correct_h = 0
for i in high_df:
    for j in high_predicted:
        if i ==j:
            correct_h = correct_h + 1 
correct_h   

39

In [18]:
correct_l = 0
for i in low_df:
    for j in low_predicted:
        if i ==j:
            correct_l = correct_l + 1 
correct_l 

164

In [19]:
# accuracy rate for classifying into 2 categories - drop invalid ranking
(correct_h +correct_l)/df2.shape[0]

0.8826086956521739

In [20]:
# use loocv to predict and get accuracy through 3 cutoff in one model
X = df2[['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']]
y = df2[['rankingSortRank']]
loocv = LeaveOneOut()
loocv.get_n_splits(X)

model = LogisticRegression()

true = []
predicted = []

high_df = []
medium_high_df = []
medium_low_df = []
low_df = []
high_predicted = []
medium_high_predicted = []
medium_low_predicted = []
low_predicted = []


for train_index, test_index in loocv.split(X):

    # LeaveOneOut yields positional indices, so select rows with .iloc
    X_train=X.iloc[train_index]
    X_test=X.iloc[test_index]
    y_train=y.iloc[train_index]
    y_test=y.iloc[test_index]
    
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    true.append(y_test['rankingSortRank'].values[0])
    predicted.append(y_pred[0])
    
    # classifying into four categories: high, medium-high, medium-low, low
    
    if y_test['rankingSortRank'].values[0] < 50:
        if y_test['rankingSortRank'].values[0] < 25:
            high_df.append(test_index[0])
        else:
            medium_high_df.append(test_index[0])
    else:
        if y_test['rankingSortRank'].values[0] > 200:
            low_df.append(test_index[0])
        else:
            medium_low_df.append(test_index[0])
    if y_pred[0] < 50:
        if y_pred[0] < 25:
            high_predicted.append(test_index[0])
        else:
            medium_high_predicted.append(test_index[0])
    else:
        # note: this cutoff (150) differs from the one applied to the true ranks above (200)
        if y_pred[0] > 150:
            low_predicted.append(test_index[0])
        else:
            medium_low_predicted.append(test_index[0])

            
print(high_df)
print("-----")
print(medium_high_df)
print("-----")
print(medium_low_df)
print("-----")
print(low_df)
print("prediction below-----------------")
print(high_predicted)
print("-----")
print(medium_high_predicted)
print("-----")
print(medium_low_predicted)
print("-----")
print(low_predicted)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
-----
[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
-----
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200]
-----
[201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217

In [21]:
# Count the observations whose true and predicted tiers are both "high"
correct_h = len(set(high_df) & set(high_predicted))
correct_h

21

In [22]:
# Count the observations whose true and predicted tiers are both "medium-high"
correct_medium_h = len(set(medium_high_df) & set(medium_high_predicted))
correct_medium_h

13

In [23]:
# Count the observations whose true and predicted tiers are both "medium-low"
correct_medium_l = len(set(medium_low_df) & set(medium_low_predicted))
correct_medium_l

79

In [24]:
# Count the observations whose true and predicted tiers are both "low"
correct_l = len(set(low_df) & set(low_predicted))
correct_l

26

In [25]:
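# Overall accuracy of the four-tier classification (three cutoffs, one model), with invalid rankings dropped
# Note: df.shape[0] is the row count of df; len(true) is the number of observations scored in the LOOCV loop above
(correct_h + correct_medium_h + correct_medium_l + correct_l) / df.shape[0]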

0.4633333333333333
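
The tier comparison above can be condensed into a single helper. The following is a minimal sketch, assuming the same cutoffs (25, 50, 200) are applied to both the true and the predicted ranks and that accuracy is taken over the observations actually predicted. The names `rank_tier` and `tier_accuracy` and the example ranks are hypothetical, chosen only for illustration.

```python
def rank_tier(rank, cutoffs=(25, 50, 200)):
    """Map a rank to one of four tiers (assumed cutoffs: 25, 50, 200)."""
    top, upper, lower = cutoffs
    if rank < top:
        return "high"
    if rank < upper:
        return "medium_high"
    if rank > lower:
        return "low"
    return "medium_low"


def tier_accuracy(true_ranks, predicted_ranks):
    """Fraction of observations whose predicted tier equals the true tier."""
    matches = sum(
        rank_tier(t) == rank_tier(p)
        for t, p in zip(true_ranks, predicted_ranks)
    )
    return matches / len(true_ranks)


# Hypothetical ranks for illustration only (not taken from the project data)
example_true = [3, 30, 120, 250]
example_pred = [10, 60, 180, 210]
print(tier_accuracy(example_true, example_pred))
```

Using one function for both the true and the predicted ranks keeps the cutoffs in a single place, so the two classifications cannot drift apart.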

### Developing the Plain Model
*By Peggy Han*

As a baseline, we developed a plain model that uses all variables as predictors, with no transformations or interaction terms.

In [34]:
# Plain Model
df = pd.read_csv('newdata.csv')
# Fill missing values in the numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

# A ranking of -1 indicates the school is ranked in the range #231 - #300,
# so we replace all of them with a ranking of 300
df['rankingSortRank'] = df['rankingSortRank'].replace(-1, 300)


In [35]:
# Create categories for universities based on the original rankings
bins = [0, 75, 150, 225, float('inf')]
labels = ['high', 'med_high', 'med_low', 'low']
df['categories'] = pd.cut(df['rankingSortRank'], bins=bins, labels=labels)

In [28]:
X = df[['act_avg', 'percent_receiving_aid', 'cost_after_aid',
        'hs_gpa_avg', 'businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'region', 'admission_rate',
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV', 'sat_avg']]

#### Training on all data and predicting on the training data

In [29]:
plain_model = sm.ols(formula = 'rankingSortRank~' + '+'.join(X.columns),data = df).fit()
plain_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | rankingSortRank | R-squared: | 0.894 |
| Model: | OLS | Adj. R-squared: | 0.881 |
| Method: | Least Squares | F-statistic: | 72.63 |
| Date: | Tue, 14 Mar 2023 | Prob (F-statistic): | 3.63e-112 |
| Time: | 15:31:17 | Log-Likelihood: | -1464.0 |
| No. Observations: | 300 | AIC: | 2992.0 |
| Df Residuals: | 268 | BIC: | 3110.0 |
| Df Model: | 31 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -9885.2509 | 2.52e+04 | -0.393 | 0.695 | -5.95e+04 | 3.97e+04 |
| act_avg | -3.7605 | 1.803 | -2.086 | 0.038 | -7.309 | -0.212 |
| percent_receiving_aid | 0.2265 | 0.249 | 0.911 | 0.363 | -0.263 | 0.716 |
| cost_after_aid | 0.0008 | 0.001 | 1.580 | 0.115 | -0.000 | 0.002 |
| hs_gpa_avg | -38.3910 | 13.057 | -2.940 | 0.004 | -64.099 | -12.683 |
| businessRepScore | -21.4875 | 5.852 | -3.672 | 0.000 | -33.010 | -9.965 |
| tuition | 0.0009 | 0.001 | 0.857 | 0.392 | -0.001 | 0.003 |
| engineeringRepScore | 4.3522 | 5.223 | 0.833 | 0.405 | -5.932 | 14.636 |
| branches | -1.7699 | 1.304 | -1.358 | 0.176 | -4.336 | 0.797 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.964 | Durbin-Watson: | 1.495 |
| Prob(Omnibus): | 0.011 | Jarque-Bera (JB): | 8.93 |
| Skew: | 0.384 | Prob(JB): | 0.0115 |
| Kurtosis: | 3.354 | Cond. No. | 3.40e+09 |


In [31]:
prediction = plain_model.predict(df)

In [36]:
# Create bins for universities based on the predicted rankings
bins = [-float('inf'), 75, 150, 225, float('inf')]
labels = ['high', 'med_high', 'med_low', 'low']
categories = pd.cut(prediction, bins=bins, labels=labels)
df['pred_category'] = categories

In [37]:
# Compute the accuracy as the share of universities whose predicted category matches the true one
accuracy = (df['categories'] == df['pred_category']).mean()
accuracy

0.82

#### Splitting into Train and Test

In [39]:
test = df.sample(n=50, random_state=1)
train = df.drop(test.index)

In [40]:
plain_model_1 = sm.ols(formula = 'rankingSortRank~' + '+'.join(X.columns),data = train).fit()
plain_model_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | rankingSortRank | R-squared: | 0.896 |
| Model: | OLS | Adj. R-squared: | 0.881 |
| Method: | Least Squares | F-statistic: | 60.41 |
| Date: | Tue, 14 Mar 2023 | Prob (F-statistic): | 5.47e-90 |
| Time: | 15:40:58 | Log-Likelihood: | -1219.5 |
| No. Observations: | 250 | AIC: | 2503.0 |
| Df Residuals: | 218 | BIC: | 2616.0 |
| Df Model: | 31 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.129e+04 | 2.9e+04 | 0.735 | 0.463 | -3.58e+04 | 7.84e+04 |
| act_avg | -2.9711 | 2.026 | -1.466 | 0.144 | -6.964 | 1.022 |
| percent_receiving_aid | 0.1447 | 0.275 | 0.526 | 0.600 | -0.398 | 0.687 |
| cost_after_aid | 0.0006 | 0.001 | 1.018 | 0.310 | -0.001 | 0.002 |
| hs_gpa_avg | -33.4451 | 14.271 | -2.344 | 0.020 | -61.571 | -5.319 |
| businessRepScore | -18.0375 | 6.467 | -2.789 | 0.006 | -30.783 | -5.292 |
| tuition | 0.0008 | 0.001 | 0.732 | 0.465 | -0.001 | 0.003 |
| engineeringRepScore | 5.9757 | 5.825 | 1.026 | 0.306 | -5.504 | 17.455 |
| branches | -4.9690 | 2.011 | -2.471 | 0.014 | -8.933 | -1.005 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.094 | Durbin-Watson: | 1.49 |
| Prob(Omnibus): | 0.029 | Jarque-Bera (JB): | 6.851 |
| Skew: | 0.384 | Prob(JB): | 0.0325 |
| Kurtosis: | 3.263 | Cond. No. | 3.51e+09 |


In [42]:
y_pred = plain_model_1.predict(test)

In [44]:
# Create bins for universities based on the predicted rankings
bins = [-float('inf'), 75, 150, 225, float('inf')]
labels = ['high', 'med_high', 'med_low', 'low']
categories = pd.cut(y_pred, bins=bins, labels=labels)
test['pred_category'] = categories

In [45]:
# Compute the test accuracy as the share of test universities whose predicted category matches the true one
test_accuracy = (test['categories'] == test['pred_category']).mean()
test_accuracy

0.78
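
The category comparison can also be broken down by class. The sketch below uses made-up rankings rather than the project data: it bins true and predicted rankings with the same `bins` and `labels` as above and tabulates them with `pd.crosstab`. The column names `true_rank`, `pred_rank`, `true_cat`, and `pred_cat` are assumptions for illustration.

```python
import pandas as pd

bins = [-float('inf'), 75, 150, 225, float('inf')]
labels = ['high', 'med_high', 'med_low', 'low']

# Hypothetical true and predicted rankings, for illustration only
example = pd.DataFrame({
    'true_rank': [10, 80, 160, 240, 70, 140],
    'pred_rank': [20, 70, 170, 230, 90, 145],
})

# Bin both columns with identical cutoffs so the categories are comparable
example['true_cat'] = pd.cut(example['true_rank'], bins=bins, labels=labels)
example['pred_cat'] = pd.cut(example['pred_rank'], bins=bins, labels=labels)

# Rows are true categories, columns are predicted categories
confusion = pd.crosstab(example['true_cat'], example['pred_cat'])
print(confusion)

# The overall accuracy is the share of rows where the two categories agree
accuracy = (example['true_cat'] == example['pred_cat']).mean()
print(accuracy)
```

The off-diagonal cells show which neighbouring categories are confused with each other, which a single accuracy figure does not reveal.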

### Code fitting the final model

Put the code(s) that fit the final model(s) in separate cell(s), i.e., the code with the `.ols()` or `.logit()` functions.

The code below fits the main model, using `df2`.

In [27]:
predictors = ['act_avg', 'sat_avg','percent_receiving_aid',
       'cost_after_aid', 'hs_gpa_avg','businessRepScore', 'tuition',
       'engineeringRepScore','branches', 'admission_rate', 
       'ug_enrollment', 'percent_white', 'percent_black', 'percent_hispanic',
       'percent_asian', 'percent_aian', 'percent_nhpi', 'percent_twoormore',
       'percent_nra', 'percent_unknown', 'percent_parttime', 'avg_cost',
       'instante_tuition', 'outstate_tuition', 'tuition_revenue_per',
       'instructional_expenditure_per', 'avg_faculty_salary',
       'ft_faculty_rate', 'avg_net_price', 'number_titleIV']

In [34]:
model = sm.ols('rankingSortRank~act_avg*admission_rate+businessRepScore*admission_rate+percent_parttime*avg_cost+instante_tuition*businessRepScore+' + '+'.join(predictors),data = df2).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | rankingSortRank | R-squared: | 0.931 |
| Model: | OLS | Adj. R-squared: | 0.919 |
| Method: | Least Squares | F-statistic: | 76.97 |
| Date: | Tue, 14 Mar 2023 | Prob (F-statistic): | 3.83e-95 |
| Time: | 16:24:34 | Log-Likelihood: | -982.14 |
| No. Observations: | 230 | AIC: | 2034.0 |
| Df Residuals: | 195 | BIC: | 2155.0 |
| Df Model: | 34 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.713e+04 | 1.62e+04 | -1.056 | 0.292 | -4.91e+04 | 1.49e+04 |
| act_avg | -5.3288 | 1.760 | -3.028 | 0.003 | -8.799 | -1.858 |
| admission_rate | 217.1199 | 68.479 | 3.171 | 0.002 | 82.064 | 352.175 |
| act_avg:admission_rate | -2.7438 | 2.669 | -1.028 | 0.305 | -8.008 | 2.520 |
| businessRepScore | 4.0317 | 10.578 | 0.381 | 0.704 | -16.830 | 24.894 |
| businessRepScore:admission_rate | -42.8449 | 13.045 | -3.285 | 0.001 | -68.571 | -17.118 |
| percent_parttime | 275.8040 | 63.612 | 4.336 | 0.000 | 150.349 | 401.259 |
| avg_cost | -0.0019 | 0.001 | -2.339 | 0.020 | -0.004 | -0.000 |
| percent_parttime:avg_cost | -0.0035 | 0.002 | -2.140 | 0.034 | -0.007 | -0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.911 | Durbin-Watson: | 1.8 |
| Prob(Omnibus): | 0.032 | Jarque-Bera (JB): | 10.426 |
| Skew: | -0.13 | Prob(JB): | 0.00545 |
| Kurtosis: | 4.01 | Cond. No. | 5.29e+09 |
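
In the formula above, a term such as `act_avg*admission_rate` expands to both main effects plus their interaction `act_avg:admission_rate`, which is why all three appear in the coefficient table. The sketch below illustrates that expansion by hand with NumPy on synthetic data. The variables `a` and `b`, the coefficients, and the noiseless response are assumptions for illustration, not the project's data or fitting code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two predictors (hypothetical values, not project data)
a = rng.normal(size=n)
b = rng.normal(size=n)

# 'y ~ a*b' corresponds to the columns: intercept, a, b, and a:b,
# where the interaction column a:b is the elementwise product of a and b
design = np.column_stack([np.ones(n), a, b, a * b])

# Assumed coefficients; the response is noiseless so least squares recovers them
true_coefs = np.array([1.0, 2.0, -3.0, 0.5])
y = design @ true_coefs

estimated, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(estimated, 3))
```

With an interaction in the model, the effect of one predictor depends on the value of the other, so a main-effect coefficient should be read together with its interaction coefficient.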


## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.