## College Finding Algorithm

Below are the functions we pass user input through to get our finalized list of colleges. First we import the pandas library and the filtered dataset; new_data_csv_book.ipyn is the notebook that creates the filtered dataframe file. We also import the dictionary file that maps our majors to their respective column names in our dataframe, for future use.

In [1]:
import pandas as pd

# Import filtered college dataframe
college = pd.read_csv('new_college.csv')
# Import key holding mapping between major column name and column description
dictionary = pd.read_csv('CollegeScorecard_Raw_Data/CollegeScorecardDataDictionary-09-12-2015.csv')

### General Reference List

Below, we create a list that holds the index value of each college. The imported csv contains 7804 colleges, each mapped to an index from 0 to 7803. The list spans that range and lets us step through the pandas dataframe effectively later on.

In [2]:
# Create the list allcollegelist (one index per college) to continuously refer back to

allcollegelist = list(range(len(college)))

Below are all the columns of the imported dataframe csv. They are divided into sub-lists for convenience and are listed here for reference. They will not be used extensively below.

In [3]:
# Columns from the dataframe that we will use, divided into useful sub-lists

# general columns used to filter out final college list 
alg_categories = ['CONTROL', 'ADM_RATE',
                  'SATVR25', 'SATVR75', 'SATMT25', 'SATMT75', 'SATWR25', 'SATWR75',
                  'NPT4_PUB', 'NPT4_PRIV']

# racial proportions and total population columns used to filter out final college list
alg_race = ['UGDS','UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN',
            'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN',
            'UGDS_WHITENH', 'UGDS_API']

# percentage for each major at each school, one column per major
# each column is a specific major, mapped in the original dataset dictionary
alg_majors = ['PCIP01', 'PCIP03', 'PCIP04', 'PCIP05', 'PCIP09', 'PCIP10', 'PCIP11',
              'PCIP12', 'PCIP13', 'PCIP14', 'PCIP15', 
              'PCIP16', 'PCIP19', 'PCIP22', 'PCIP23', 'PCIP24', 'PCIP25', 'PCIP26', 'PCIP27',
              'PCIP29', 'PCIP30', 'PCIP31', 'PCIP38', 'PCIP39', 'PCIP40', 'PCIP41', 'PCIP42',
              'PCIP43', 'PCIP44', 'PCIP45', 'PCIP46',
              'PCIP47', 'PCIP48', 'PCIP49', 'PCIP50', 'PCIP51', 'PCIP52', 'PCIP54']

# columns kept in final csv, but used for just displaying info after college filter runs
disp_categories = ['OPEID','INSTNM','CITY','STABBR', 'INSTURL', 'HIGHDEG', 'LATITUDE', 'LONGITUDE']
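
Since the column lists above are typed out by hand, it can help to confirm that every name really exists in the dataframe before filtering. The following is a minimal sketch of such a check. The dataframe `toy_college` and the list `expected_columns` are hypothetical stand-ins invented for illustration, not the real `new_college.csv` contents.

```python
import pandas as pd

# Hypothetical stand-in for the filtered dataframe (not the real new_college.csv)
toy_college = pd.DataFrame({
    'INSTNM': ['College A', 'College B'],
    'CONTROL': [1, 2],
    'ADM_RATE': [0.45, 0.80],
    'UGDS': [5200, 14000],
})

# Hypothetical subset of the column lists defined above
expected_columns = ['CONTROL', 'ADM_RATE', 'UGDS', 'PCIP14']

# Collect every expected column that the dataframe does not contain
missing_columns = [c for c in expected_columns if c not in toy_college.columns]
print(missing_columns)
```

An empty list would mean that every listed column is available; here the toy frame lacks `PCIP14`, so that name is reported.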

### Mapping Majors to Description
Below, we create a Python dictionary from the csv file that maps each column name in the dataframe to the description of that column. Users type in their desired majors, and we search through the descriptions to find hits.

In [4]:
# Create a dictionary where the key is the major variable (column name) and the value is the label (name of the major)
# Built from the major column name to description map file
dmajor = {}
for i in range(len(dictionary)):
    if str(dictionary.label[i]) != 'nan':
        dmajor[dictionary['VARIABLE NAME'][i]] = dictionary.label[i]
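
To illustrate how a name-to-description mapping like this supports a text search, here is a minimal sketch on made-up data. The frame `toy_dictionary`, its labels, and the helper `find_columns` are hypothetical and exist only for this example; the real data dictionary file has many more rows and its exact label wording may differ.

```python
import pandas as pd

# Hypothetical miniature of the data dictionary; labels are invented for illustration
toy_dictionary = pd.DataFrame({
    'VARIABLE NAME': ['PCIP14', 'PCIP23', 'UGDS'],
    'label': ['Engineering', 'English Language And Literature/Letters', None],
})

# Drop rows without a label, then pair each column name with its description
labelled = toy_dictionary.dropna(subset=['label'])
toy_dmajor = dict(zip(labelled['VARIABLE NAME'], labelled['label']))

def find_columns(searchstr, mapping):
    """Return the column names whose description contains the search text."""
    needle = searchstr.strip().lower()
    return [name for name, label in mapping.items() if needle in label.lower()]

print(find_columns('engineering', toy_dmajor))
print(find_columns(' english', toy_dmajor))
```

The search is a case-insensitive substring match, so a short search term can hit several descriptions at once.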

### Filtering Functions

Below are functions that filter the list of colleges made above (allcollegelist), narrowing it down step by step. We initially pass allcollegelist into one function, feed its output into the next function, and so on, until we get the final recommended college list.
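
The chaining idea can be sketched on a tiny made-up table before looking at the real functions. Everything in this example (the frame `toy` and the two filters `smaller_than` and `admits_at_least`) is hypothetical and only demonstrates the pattern of feeding one filter's output list into the next.

```python
import pandas as pd

# Hypothetical four-college table used only to show the chaining pattern
toy = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C', 'D'],
    'UGDS': [3000, 12000, 800, 25000],
    'ADM_RATE': [0.30, 0.65, 0.90, 0.50],
})

def smaller_than(limit, indices):
    """Keep the indices of colleges with fewer than `limit` undergraduates."""
    return [i for i in indices if toy['UGDS'][i] < limit]

def admits_at_least(rate, indices):
    """Keep the indices of colleges whose admission rate is at least `rate`."""
    return [i for i in indices if toy['ADM_RATE'][i] >= rate]

# Start from every index, then narrow the list one filter at a time
candidates = list(range(len(toy)))
candidates = smaller_than(15000, candidates)
candidates = admits_at_least(0.5, candidates)
print(candidates)
```

Because every filter takes a list of indices and returns a list of indices, the filters can be applied in any order or skipped entirely.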

In [5]:
def each_sat(score, column, collegelist, percentile):
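    """
    Sub function for section_sat
    Keeps colleges whose SAT section percentile is within 50 points of the score
    percentile = 25: keeps colleges where score >= 25th percentile - 50
    percentile = 75: keeps colleges where score <= 75th percentile + 50
    """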
    filteredlist = []
    
    if percentile == 25:
        for i in collegelist:
            if score >= college[column][i]-50:
                filteredlist.append(i)
    
    if percentile == 75:
        for i in collegelist:
            if score <= college[column][i]+50:
                filteredlist.append(i)
        
    return filteredlist

def section_sat(score, section, collegelist):
    """
    Runs one SAT section against both percentiles
    section = 'VR' or 'WR' or 'MT'
    """
    column25 = 'SAT'+section+'25'
    column75 = 'SAT'+section+'75'
    
    list25 = each_sat(score, column25, collegelist, 25)
    list75 = each_sat(score, column75, list25, 75)
    
    return list75

def sat(vrscore, wrscore, mtscore, collegelist):
    """
    Gets SAT Score range of suited college
    vrscore: reading score (int)
    wrscore: writing score (int)
    mtscore: math score (int)
    collegelist (list)
    """
    listvr = section_sat(vrscore, 'VR', collegelist)
    listwr = section_sat(wrscore, 'WR', listvr)
    listmt = section_sat(mtscore, 'MT', listwr)
    
    return listmt

def diversity(pref, collegelist):
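    """
    Takes user input on how much racial diversity they want at a school
    pref = 'low', 'medium', or 'high'
    low: one group makes up at least 60% of undergraduates
    high: no group makes up 35% or more
    medium: everything in between
    """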
    r = alg_race[1:]
    high = []
    medium = collegelist[:]
    low = []
    for i in collegelist:
        highmarker = 1
        for race in r:
            if college[race][i] >= 0.6:
                if i not in low:
                    low.append(i)
                    medium.remove(i)
            if college[race][i] >= 0.35:
                highmarker = 0
        if highmarker == 1:
            high.append(i)
            medium.remove(i)
    if pref == 'high':
        return high
    if pref == 'medium':
        return medium
    if pref == 'low':
        return low

def public_private(collegelist, preference):
    """
    Outputs schools that users want either public, private, or both
    Preference:
        both = 0
        public = 1
        private = 2
    """
    output_list = []
    if preference == 0: # Both Types
        output_list = collegelist
    if preference == 1: # Public College
        for i in collegelist:
            if str(college['NPT4_PUB'][i]) != 'nan':
                output_list.append(i)
    if preference == 2: # Private College
        for i in collegelist:
            if str(college['NPT4_PRIV'][i]) != 'nan':
                output_list.append(i)
    return output_list

def cost(collegelist, max_cost):
    """
    Filters out colleges based on how much people want to pay
    Keeps colleges whose listed cost is below max_cost plus a 5000 margin
    """
    output_list = []
    for i in collegelist:
        # 'or' ensures a college with both cost columns filled in is added only once
        if (college['NPT4_PUB'][i] < max_cost + 5000
                or college['NPT4_PRIV'][i] < max_cost + 5000):
            output_list.append(i)
    return output_list

def each_major(searchstr, collegelist):
    """
    Analyses one major at a time to get college list
    """
    filteredlist = []
    for key in dmajor:
        if searchstr.lower() in dmajor[key].lower():
            for i in collegelist:
                if college[key][i] > 0.05:
                    if i not in filteredlist:
                        filteredlist.append(i)
                    
    return filteredlist

def major(search, collegelist):
    """
    Analyses many different majors (separated by ';') and combines the results
    Uses each_major
    """
    searchlst = search.split(';')
    filteredlist = []
    for searchstr in searchlst:
        # strip() drops the spaces around each major, e.g. in "engineering; english"
        searchlist = each_major(searchstr.strip(), collegelist)
        for i in searchlist:
            if i not in filteredlist:
                filteredlist.append(i)
    return filteredlist

def pref_popul(popul, collegelist, option):
    """
    Filters by undergraduate population (UGDS)
    option = 'more' (above popul) or 'less' (below popul)
    """
    filteredlist = []
    if option == 'more':
        for i in collegelist:
            if college['UGDS'][i] > popul:
                if i not in filteredlist:
                    filteredlist.append(i)
    if option == 'less':
        for i in collegelist:
            if college['UGDS'][i] < popul:
                if i not in filteredlist:
                    filteredlist.append(i)
    
    return filteredlist
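
As a point of comparison, the SAT window test in each_sat and section_sat can also be written with pandas boolean masks instead of explicit loops. This is a sketch of that alternative technique, not the method used above. The frame `toy`, the function `section_window`, and its `margin` parameter are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical math percentiles for four colleges; the last one has no SAT data
toy = pd.DataFrame({
    'SATMT25': [500, 620, 720, None],
    'SATMT75': [580, 720, 790, None],
})

def section_window(score, section, frame, margin=50):
    """Return the row indices whose 25th-75th percentile window, widened by margin, contains score."""
    low = frame['SAT' + section + '25'] - margin
    high = frame['SAT' + section + '75'] + margin
    # Comparisons against missing values evaluate to False, so colleges without data drop out
    mask = (score >= low) & (score <= high)
    return frame.index[mask.values].tolist()

print(section_window(650, 'MT', toy))
```

In this toy data only the second college is kept: the first has a 75th percentile that is too low, the third has a 25th percentile that is too high, and the fourth has no scores.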

### Using the Algorithm

Below we have an example run of the algorithm to get college recommendations. We run through each function, typing in our preferences. 

In [6]:
"""Test run of functions"""

# Filter by SAT score
college_list = sat(650,650,650,allcollegelist)

# Filter by racial diversity
college_list = diversity("low", college_list)

# Filter by public (1) or private (2) college or both (0)
college_list = public_private(college_list, 2)

# Filter by average tuition (including scholarships and financial aid)
college_list = cost(college_list, 20000)

# Filter by all desired majors separated by ';'
college_list = major("engineering; english",college_list)

# Filter by maximum or minimum population
college_list = pref_popul(10000, college_list, 'less')

# Print out college names recommended
for i in college_list:
    print(college['INSTNM'][i])

Bradley University
University of Evansville
Cedarville University
Ohio Northern University
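
The display columns kept in the csv (such as CITY and STABBR) can be shown alongside the names by selecting rows with the final index list. Below is a minimal sketch on made-up data. The frame `toy_college`, its values, and `toy_college_list` are hypothetical stand-ins for `college` and `college_list`.

```python
import pandas as pd

# Hypothetical stand-in for the college dataframe with a few display columns
toy_college = pd.DataFrame({
    'INSTNM': ['College A', 'College B', 'College C'],
    'CITY': ['Springfield', 'Riverton', 'Lakeview'],
    'STABBR': ['IL', 'OH', 'IN'],
})

# Hypothetical result of the filter chain: the row indices that survived
toy_college_list = [0, 2]

# Select the surviving rows and only the columns worth displaying
summary = toy_college.loc[toy_college_list, ['INSTNM', 'CITY', 'STABBR']]
print(summary.to_string(index=False))
```

Selecting with `.loc` keeps the rows in the order of the index list, so the table follows the same order as the printed names.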
