## Table of Contents

### 1. Importing libraries and data
### 2. Geospatial Analysis
##### 1. Number of higher cost institutions
##### 2. Average cost of attendance at public institutions
##### 3. Average cost of attendance at private institutions
##### 4. Average family income

# 01. Importing libraries and data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import os
import folium

In [2]:
# Ensure charts appear in notebook
%matplotlib inline

In [3]:
# Define path variable
path = r'/Users/taraperrigeold/Documents/Documents - Tara Perrige’s MacBook Pro/CareerFoundry/College Cost Analysis'

In [4]:
# Check output
path

'/Users/taraperrigeold/Documents/Documents - Tara Perrige’s MacBook Pro/CareerFoundry/College Cost Analysis'

In [5]:
# Import json file
country_geo = os.path.join(path, '02 Data', 'Original Data', 'us-states.json')

In [6]:
# Check output
country_geo

'/Users/taraperrigeold/Documents/Documents - Tara Perrige’s MacBook Pro/CareerFoundry/College Cost Analysis/02 Data/Original Data/us-states.json'

In [7]:
# Import public institutions data set
public = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'public_data.pkl'))

In [8]:
# Check head
public.head()

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost


In [9]:
# Check shape
public.shape

(1449, 16)

In [10]:
# Import private, non-profit institutions data set - excluding outliers
private = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'private_data.pkl'))

In [11]:
# Check head
private.head()

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,NPT4_PRIV,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY
0,100937,Birmingham-Southern College,Birmingham,AL,35254,5,12.0,52176.0,25494.0,18500.0,86672.871041,0.5666,1232.0,1265.0,0.7769,0.7858,Middle cost
1,101189,Faulkner University,Montgomery,AL,36109-3390,5,12.0,33944.0,25557.0,14925.0,36952.206116,0.5227,1069.0,2079.0,0.5611,0.4238,Middle cost
2,101365,Herzing University-Birmingham,Birmingham,AL,35209,5,21.0,26128.0,17906.0,12233.0,26184.228503,0.95,,544.0,0.5,0.2813,Middle cost
3,101435,Huntingdon College,Montgomery,AL,36106-2148,5,12.0,35685.0,20136.0,16250.0,53792.633136,0.5841,1100.0,1078.0,0.6602,0.6503,Middle cost
4,101541,Judson College,Marion,AL,36756,5,43.0,31735.0,16619.0,14112.0,28123.817955,0.482,1054.0,259.0,0.5973,0.6988,Middle cost


In [12]:
# Check shape
private.shape

(1402, 17)

# 02. Geospatial Analysis

## 01. Number of higher cost institutions

In [13]:
# Concatenate private onto public to create one large dataframe on which to look at COST_CATEGORY column when doing geospatial analysis
pub_priv = pd.concat([public, private])

In [14]:
# Check shape
pub_priv.shape

(2851, 17)

In [15]:
# Check output
pub_priv

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY,NPT4_PRIV
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost,
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1397,491710,Yeshiva Gedolah of Cliffwood,Keyport,NJ,07735-5105,2,21.0,14150.0,16500.0,58552.450000,0.5854,,84.0,0.9500,1.0000,Lower cost,7150.0
1398,491765,Yeshivas Emek Hatorah,Howell,NJ,07731-2444,2,21.0,27190.0,16500.0,58552.450000,,,46.0,0.7527,1.0000,Middle cost,20141.0
1399,491817,Seminary Bnos Chaim,Lakewood,NJ,08701-2336,2,13.0,19886.0,16500.0,58552.450000,0.8056,,159.0,0.7527,1.0000,Lower cost,14145.0
1400,492801,Drury University-College of Continuing Profess...,Springfield,MO,65802,4,12.0,19707.0,13275.0,58552.450000,,,1419.0,0.7568,0.8168,Lower cost,14368.0


In [16]:
# Reset index for pub_priv
pub_priv = pub_priv.reset_index(drop=True)

In [17]:
# Check output
pub_priv

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY,NPT4_PRIV
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost,
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846,491710,Yeshiva Gedolah of Cliffwood,Keyport,NJ,07735-5105,2,21.0,14150.0,16500.0,58552.450000,0.5854,,84.0,0.9500,1.0000,Lower cost,7150.0
2847,491765,Yeshivas Emek Hatorah,Howell,NJ,07731-2444,2,21.0,27190.0,16500.0,58552.450000,,,46.0,0.7527,1.0000,Middle cost,20141.0
2848,491817,Seminary Bnos Chaim,Lakewood,NJ,08701-2336,2,13.0,19886.0,16500.0,58552.450000,0.8056,,159.0,0.7527,1.0000,Lower cost,14145.0
2849,492801,Drury University-College of Continuing Profess...,Springfield,MO,65802,4,12.0,19707.0,13275.0,58552.450000,,,1419.0,0.7568,0.8168,Lower cost,14368.0


In [18]:
# Drop NPT4_PRIV column
pub_priv = pub_priv.drop(columns = ['NPT4_PRIV'])

In [19]:
# Check output
pub_priv.head()

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost


In [20]:
# Check value counts for COST_CATEGORY
pub_priv['COST_CATEGORY'].value_counts(dropna = False)

Middle cost    1832
Higher cost     630
Lower cost      389
Name: COST_CATEGORY, dtype: int64

In [21]:
# Drop missing values
pub_priv = pub_priv[pub_priv['COST_CATEGORY'].notnull()]

In [22]:
# Check shape
pub_priv.shape

(2851, 16)

In [23]:
# Check output
pub_priv

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846,491710,Yeshiva Gedolah of Cliffwood,Keyport,NJ,07735-5105,2,21.0,14150.0,16500.0,58552.450000,0.5854,,84.0,0.9500,1.0000,Lower cost
2847,491765,Yeshivas Emek Hatorah,Howell,NJ,07731-2444,2,21.0,27190.0,16500.0,58552.450000,,,46.0,0.7527,1.0000,Middle cost
2848,491817,Seminary Bnos Chaim,Lakewood,NJ,08701-2336,2,13.0,19886.0,16500.0,58552.450000,0.8056,,159.0,0.7527,1.0000,Lower cost
2849,492801,Drury University-College of Continuing Profess...,Springfield,MO,65802,4,12.0,19707.0,13275.0,58552.450000,,,1419.0,0.7568,0.8168,Lower cost


In [24]:
# Get counts of cost category by state
count_cost_cat = pub_priv.groupby(['STATE', 'COST_CATEGORY']).size()

In [25]:
# Check output
count_cost_cat

STATE  COST_CATEGORY
AK     Lower cost        2
       Middle cost       4
AL     Higher cost       9
       Lower cost        9
       Middle cost      22
                        ..
WI     Middle cost      51
WV     Higher cost       3
       Lower cost        1
       Middle cost      25
WY     Middle cost       8
Length: 151, dtype: int64

In [26]:
# Turn into data frame
count_cost_cat = count_cost_cat.reset_index()

In [27]:
# Check output
count_cost_cat

Unnamed: 0,STATE,COST_CATEGORY,0
0,AK,Lower cost,2
1,AK,Middle cost,4
2,AL,Higher cost,9
3,AL,Lower cost,9
4,AL,Middle cost,22
...,...,...,...
146,WI,Middle cost,51
147,WV,Higher cost,3
148,WV,Lower cost,1
149,WV,Middle cost,25


In [28]:
# Change name of column 0
count_cost_cat.rename(columns = {0 : 'COUNT'}, inplace = True)

In [29]:
# Check output
count_cost_cat

Unnamed: 0,STATE,COST_CATEGORY,COUNT
0,AK,Lower cost,2
1,AK,Middle cost,4
2,AL,Higher cost,9
3,AL,Lower cost,9
4,AL,Middle cost,22
...,...,...,...
146,WI,Middle cost,51
147,WV,Higher cost,3
148,WV,Lower cost,1
149,WV,Middle cost,25


In [30]:
# Select higher cost as layer
hist_indicator = 'Higher cost'

In [31]:
# Create mask for map
mask1 = count_cost_cat['COST_CATEGORY'].str.contains(hist_indicator)

In [32]:
# Apply mask
stage = count_cost_cat[mask1]

In [33]:
# Check head
stage.head()

Unnamed: 0,STATE,COST_CATEGORY,COUNT
2,AL,Higher cost,9
5,AR,Higher cost,7
8,AZ,Higher cost,6
11,CA,Higher cost,48
14,CO,Higher cost,15


In [34]:
# Create data frame with just states and counts (of higher cost institutions)
data_to_plot = stage[['STATE', 'COUNT']]

In [35]:
# Check head
data_to_plot.head()

Unnamed: 0,STATE,COUNT
2,AL,9
5,AR,7
8,AZ,6
11,CA,48
14,CO,15


In [36]:
# Look at values for STATE
data_to_plot['STATE'].value_counts(dropna = False)

KY    1
MD    1
NC    1
MS    1
VA    1
NY    1
MN    1
IA    1
ME    1
ID    1
PA    1
IN    1
NH    1
DE    1
TX    1
KS    1
CA    1
FL    1
NJ    1
WV    1
NE    1
WA    1
UT    1
MT    1
ND    1
CO    1
TN    1
OH    1
MI    1
OR    1
LA    1
WI    1
NM    1
HI    1
AL    1
IL    1
MO    1
AR    1
AZ    1
DC    1
MA    1
CT    1
RI    1
VT    1
NV    1
OK    1
SC    1
GA    1
SD    1
Name: STATE, dtype: int64

In [37]:
# Set up a folium map at a high-level zoom
map = folium.Map(location = [100,0], zoom_start = 1.5)

# Create the choropleth layer with folium.Choropleth and add it to the map
folium.Choropleth(geo_data = country_geo, data = data_to_plot,
              columns = ['STATE', 'COUNT'],
              key_on = 'feature.id',
              fill_color = 'YlOrBr', fill_opacity = 0.6, line_opacity = 0.1,
              legend_name = 'Number of Higher-cost institutions').add_to(map)



In [38]:
# Save map
map.save('plot_data_highcost.html')

# Import the folium interactive HTML file
from IPython.display import HTML
HTML('<iframe src=plot_data_highcost.html width=700 height=450></iframe>')



In [39]:
# Check what's up with Alaska
data_to_plot[data_to_plot['STATE'] == 'AK']

Unnamed: 0,STATE,COUNT


In [40]:
# Check what's up with Wyoming
data_to_plot[data_to_plot['STATE'] == 'WY']

Unnamed: 0,STATE,COUNT
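
Rather than checking blank states one at a time, the state ids in the GeoJSON file could be compared against the states in the data. This is only a rough sketch: `toy_geo` and `toy_counts` are made-up stand-ins (not the real us-states.json or data_to_plot), and it assumes each GeoJSON feature carries the state abbreviation in its `id` field, as the `key_on = 'feature.id'` setting implies.

```python
import pandas as pd

# Hypothetical stand-in for the parsed GeoJSON; only each feature's 'id' matters here
toy_geo = {"features": [{"id": "AK"}, {"id": "AL"}, {"id": "CA"}, {"id": "WY"}]}

# Hypothetical stand-in for data_to_plot (state abbreviation plus a count)
toy_counts = pd.DataFrame({"STATE": ["AL", "CA", "DC"], "COUNT": [9, 48, 1]})

geo_ids = {feature["id"] for feature in toy_geo["features"]}
data_ids = set(toy_counts["STATE"])

# States drawn on the map that have no row in the data (they will appear blank)
missing_from_data = sorted(geo_ids - data_ids)

# States in the data that have no matching shape on the map (they cannot be shaded)
missing_from_map = sorted(data_ids - geo_ids)

print(missing_from_data)
print(missing_from_map)
```

With the real files, the GeoJSON could be loaded with `json.load` first and data_to_plot used in place of the toy frame.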


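This map shows that Pennsylvania, followed by New York, then California and Massachusetts, have the highest numbers of 'higher-cost' institutions. There are very few in the other states, and none in Alaska or Wyoming. Washington DC does have a row in data_to_plot (see the STATE value counts above), so a blank DC on the map would reflect how the map is drawn rather than a true zero.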

This map partly answers one of my research questions, which asked whether institutions with higher costs have common characteristics. A handful of states do contain the highest concentrations of higher cost institutions. However, location alone does not look like a strong enough indicator of whether an institution is higher cost. It would probably have been better to look at the percentage of higher cost institutions in each state; Pennsylvania, California, New York, Massachusetts, and Texas are likely to have the largest numbers of higher education institutions overall, and therefore more higher cost institutions.
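
The percentage idea could be sketched roughly as follows. The `toy` frame is invented for illustration (it is not my data), and `PCT_HIGHER_COST` is a hypothetical column name; only the STATE and COST_CATEGORY columns mirror the real data set.

```python
import pandas as pd

# Made-up rows that only mimic the STATE / COST_CATEGORY columns
toy = pd.DataFrame({
    "STATE": ["PA", "PA", "PA", "PA", "VT", "VT", "WY"],
    "COST_CATEGORY": ["Higher cost", "Middle cost", "Middle cost", "Lower cost",
                      "Higher cost", "Middle cost", "Middle cost"],
})

# The mean of a True/False flag is the share of True values, so this gives
# the percentage of institutions in each state that are 'Higher cost'
higher_share = (
    toy["COST_CATEGORY"].eq("Higher cost")
    .groupby(toy["STATE"])
    .mean()
    .mul(100)
    .rename("PCT_HIGHER_COST")
    .reset_index()
)

print(higher_share)
```

Unlike the filtered counts, this keeps states with no higher cost institutions (they get 0 instead of dropping out), and the resulting frame could be passed to the choropleth in place of the raw counts.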

## 02. Average cost of attendance at public institutions

In [41]:
# Check for missing values of COSTT4_A for public institutions
public['COSTT4_A'].isnull().sum()

0

In [42]:
# Create dataframe of just STATE and COSTT4_A
pub_cost = public[['STATE', 'COSTT4_A']]

In [43]:
# Check output
pub_cost.head()

Unnamed: 0,STATE,COSTT4_A
0,AL,22489.0
1,AL,24347.0
2,AL,23441.0
3,AL,21476.0
4,AL,29424.0


In [44]:
# Get average of COSTT4_A for each state
pub_cost_avg = pub_cost.groupby('STATE').agg({'COSTT4_A':'mean'})

In [45]:
# Check output
pub_cost_avg

STATE,COSTT4_A
AK,18435.0
AL,17251.28
AR,15299.62069
AZ,15590.565217
CA,15440.371212
CO,19799.703704
CT,16488.8
DC,22928.0
DE,17808.5
FL,14148.891892


In [46]:
# Reset index
pub_cost_avg = pub_cost_avg.reset_index()

In [47]:
# Check output
pub_cost_avg.head()

Unnamed: 0,STATE,COSTT4_A
0,AK,18435.0
1,AL,17251.28
2,AR,15299.62069
3,AZ,15590.565217
4,CA,15440.371212


In [48]:
# Check values in STATE column
pub_cost_avg['STATE'].value_counts(dropna=False)

SC    1
PR    1
TX    1
HI    1
AL    1
NE    1
ID    1
AR    1
LA    1
RI    1
OK    1
GA    1
ND    1
MT    1
DC    1
WI    1
VA    1
NM    1
AZ    1
ME    1
NC    1
CT    1
NH    1
VT    1
CA    1
MN    1
MI    1
SD    1
CO    1
KY    1
OR    1
PA    1
IL    1
MO    1
TN    1
VI    1
OH    1
WV    1
MA    1
AK    1
MD    1
MS    1
NY    1
IN    1
IA    1
DE    1
KS    1
NV    1
GU    1
FL    1
NJ    1
WY    1
UT    1
WA    1
Name: STATE, dtype: int64

In [49]:
# Remove rows for US territories: GU
pub_cost_avg = pub_cost_avg[pub_cost_avg['STATE'] != 'GU']

In [50]:
# Remove rows for US territories: PR
pub_cost_avg = pub_cost_avg[pub_cost_avg['STATE'] != 'PR']

In [51]:
# Remove rows for US territories: VI
pub_cost_avg = pub_cost_avg[pub_cost_avg['STATE'] != 'VI']

In [52]:
# Check shape
pub_cost_avg.shape

(51, 2)
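
The three territory filters above could also be written as a single step with `isin`. A small sketch, assuming a made-up frame (`toy_avg`, with placeholder cost values) in place of pub_cost_avg:

```python
import pandas as pd

# Placeholder values; only the STATE codes matter for this illustration
toy_avg = pd.DataFrame({
    "STATE": ["AK", "GU", "PR", "VI", "WY"],
    "COSTT4_A": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# US territories to exclude from the state-level map
territories = ["GU", "PR", "VI"]

# Keep only rows whose STATE is not one of the territories
states_only = toy_avg[~toy_avg["STATE"].isin(territories)].reset_index(drop=True)

print(states_only)
```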

In [53]:
# Set up folium map at high-level zoom
map2 = folium.Map(location = [100,0], zoom_start = 1.5)

# Create the choropleth layer with folium.Choropleth and add it to the map
folium.Choropleth(geo_data = country_geo, data = pub_cost_avg,
              columns = ['STATE', 'COSTT4_A'],
              key_on = 'feature.id',
              fill_color = 'YlOrBr', fill_opacity=0.6, line_opacity=0.1,
              legend_name = 'Average cost of public institutions').add_to(map2)



In [54]:
# Save map
map2.save('plot_data_pub_avgcost.html')

# Import the Folium interactive html file
from IPython.display import HTML
HTML('<iframe src=plot_data_pub_avgcost.html width=700 height=450></iframe>')



This map shows that states with the highest average costs for public institutions are Vermont and Pennsylvania. This leads me to wonder why the public colleges in those specific states are a lot more expensive. One theory is that they enroll more out-of-state students, who have to pay much higher tuition than in-state students, and therefore push the average cost of attendance higher.

This is also related to my research question of whether institutions with higher costs have common characteristics. I could look further into whether the institutions from the states listed above have other common characteristics.

## 03. Average cost of attendance at private institutions

In [55]:
# Check for missing values of COSTT4_A for private institutions
private['COSTT4_A'].isnull().sum()

0

In [56]:
# Create dataframe of just STATE and COSTT4_A
priv_cost = private[['STATE', 'COSTT4_A']]

In [57]:
# Check output
priv_cost

Unnamed: 0,STATE,COSTT4_A
0,AL,52176.0
1,AL,33944.0
2,AL,26128.0
3,AL,35685.0
4,AL,31735.0
...,...,...
1397,NJ,14150.0
1398,NJ,27190.0
1399,NJ,19886.0
1400,MO,19707.0


In [58]:
# Get average of COSTT4_A for each state
priv_cost_avg = priv_cost.groupby('STATE').agg({'COSTT4_A':'mean'})

In [59]:
# Check output
priv_cost_avg

STATE,COSTT4_A
AK,26788.666667
AL,33139.866667
AR,30917.066667
AZ,37335.5
CA,45712.595745
CO,45615.4
CT,54050.266667
DC,49278.5
DE,35213.75
FL,35802.877193


In [60]:
# Reset index
priv_cost_avg = priv_cost_avg.reset_index()

In [61]:
# Check output
priv_cost_avg.head()

Unnamed: 0,STATE,COSTT4_A
0,AK,26788.666667
1,AL,33139.866667
2,AR,30917.066667
3,AZ,37335.5
4,CA,45712.595745


In [62]:
# Check values for STATE column
priv_cost_avg['STATE'].value_counts(dropna = False)

SC    1
DC    1
HI    1
AL    1
NE    1
ID    1
AR    1
LA    1
RI    1
OK    1
GA    1
ND    1
PR    1
MT    1
WI    1
VA    1
NM    1
AZ    1
ME    1
NC    1
CT    1
NH    1
VT    1
CA    1
MN    1
TX    1
MI    1
SD    1
MS    1
OR    1
PA    1
IL    1
MO    1
TN    1
OH    1
WV    1
MA    1
AK    1
MD    1
NY    1
CO    1
IN    1
IA    1
DE    1
KS    1
NV    1
GU    1
FL    1
NJ    1
UT    1
KY    1
WA    1
Name: STATE, dtype: int64

In [63]:
# Remove rows for US territories: PR
priv_cost_avg = priv_cost_avg[priv_cost_avg['STATE'] != 'PR']

In [64]:
# Remove rows for US territories: GU
priv_cost_avg = priv_cost_avg[priv_cost_avg['STATE'] != 'GU']

In [65]:
# Check shape
priv_cost_avg.shape

(50, 2)

In [66]:
# Set up folium map at high-level zoom
map3 = folium.Map(location = [100,0], zoom_start = 1.5)

# Create the choropleth layer with folium.Choropleth and add it to the map
folium.Choropleth(geo_data = country_geo, data = priv_cost_avg,
              columns = ['STATE', 'COSTT4_A'],
              key_on = 'feature.id',
              fill_color = 'YlOrBr', fill_opacity=0.6, line_opacity=0.1,
              legend_name = 'Average cost of private institutions').add_to(map3)



In [67]:
# Save map
map3.save('plot_data_priv_avgcost.html')

# Import the Folium interactive html file
from IPython.display import HTML
HTML('<iframe src=plot_data_priv_avgcost.html width=700 height=450></iframe>')



In [68]:
# Check what's going on with Wyoming
priv_cost[priv_cost['STATE'] == 'WY']

Unnamed: 0,STATE,COSTT4_A


This map shows that the states with the highest average costs for private institutions are Vermont, Massachusetts, and Connecticut (with no data for Wyoming). Curiously enough, Vermont also had high average costs for its public institutions. This makes me wonder why the institutions in these states have higher average costs. Since these are private colleges, are the ones in these states much more selective than those in other states? Or maybe it's related to the cost of living in these states: costs could be higher because salaries and facility-related expenses are much higher (including taxes on land, although that is definitely outside the scope of my data).

This also has to do with my research question concerning common characteristics of higher cost institutions. I could look further into whether the institutions from the states listed above have other common characteristics.
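
One way to start on the selectivity question would be to compare average admission rates between the highest-cost states and everyone else. This is only a sketch on invented numbers: `toy_private` is not my data, and the `GROUP` column and its labels are hypothetical; ADM_RATE is the admission rate column from the private data set.

```python
import pandas as pd

# Invented admission rates, purely to show the mechanics
toy_private = pd.DataFrame({
    "STATE": ["VT", "MA", "CT", "AL", "MO"],
    "ADM_RATE": [0.40, 0.30, 0.50, 0.60, 0.80],
})

# States singled out in the map discussion above
high_cost_states = ["VT", "MA", "CT"]

# Label each institution by whether it sits in one of those states
toy_private["GROUP"] = toy_private["STATE"].isin(high_cost_states).map(
    {True: "high-cost states", False: "other states"}
)

# Average admission rate per group (a lower rate means more selective)
adm_by_group = toy_private.groupby("GROUP")["ADM_RATE"].mean()

print(adm_by_group)
```

The same grouping could be reused for other columns, such as SAT_AVG or RET_FT4_POOLED, to look for further shared characteristics.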

## 04. Average family income

In [69]:
# Concatenate private onto public to create one large dataframe on which to look at family income when doing geospatial analysis
pub_priv2 = pd.concat([public, private])

In [70]:
# Check output
pub_priv2

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY,NPT4_PRIV
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost,
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1397,491710,Yeshiva Gedolah of Cliffwood,Keyport,NJ,07735-5105,2,21.0,14150.0,16500.0,58552.450000,0.5854,,84.0,0.9500,1.0000,Lower cost,7150.0
1398,491765,Yeshivas Emek Hatorah,Howell,NJ,07731-2444,2,21.0,27190.0,16500.0,58552.450000,,,46.0,0.7527,1.0000,Middle cost,20141.0
1399,491817,Seminary Bnos Chaim,Lakewood,NJ,08701-2336,2,13.0,19886.0,16500.0,58552.450000,0.8056,,159.0,0.7527,1.0000,Lower cost,14145.0
1400,492801,Drury University-College of Continuing Profess...,Springfield,MO,65802,4,12.0,19707.0,13275.0,58552.450000,,,1419.0,0.7568,0.8168,Lower cost,14368.0


In [71]:
# Reset index
pub_priv2 = pub_priv2.reset_index(drop=True)

In [72]:
# Check output
pub_priv2

Unnamed: 0,UNITID,NAME,CITY,STATE,ZIP,REGION,LOCALE,COSTT4_A,DEBT_MDN,FAMINC,ADM_RATE,SAT_AVG,UGDS,RET_FT4_POOLED,UGDS_WHITE,COST_CATEGORY,NPT4_PRIV
0,100654,Alabama A & M University,Normal,AL,35762,5,12.0,22489.0,15500.0,32362.826114,0.8986,957.0,4990.0,0.5978,0.0186,Higher cost,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,5,12.0,24347.0,15000.0,51306.674306,0.9211,1220.0,13186.0,0.8303,0.5717,Higher cost,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,5,12.0,23441.0,14476.0,61096.588949,0.8087,1314.0,7458.0,0.8269,0.7167,Higher cost,
3,100724,Alabama State University,Montgomery,AL,36104-0271,5,12.0,21476.0,18679.0,31684.382188,0.9774,972.0,3903.0,0.5898,0.0167,Higher cost,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,5,12.0,29424.0,17500.0,91846.749624,0.5906,1252.0,32177.0,0.8748,0.7774,Higher cost,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846,491710,Yeshiva Gedolah of Cliffwood,Keyport,NJ,07735-5105,2,21.0,14150.0,16500.0,58552.450000,0.5854,,84.0,0.9500,1.0000,Lower cost,7150.0
2847,491765,Yeshivas Emek Hatorah,Howell,NJ,07731-2444,2,21.0,27190.0,16500.0,58552.450000,,,46.0,0.7527,1.0000,Middle cost,20141.0
2848,491817,Seminary Bnos Chaim,Lakewood,NJ,08701-2336,2,13.0,19886.0,16500.0,58552.450000,0.8056,,159.0,0.7527,1.0000,Lower cost,14145.0
2849,492801,Drury University-College of Continuing Profess...,Springfield,MO,65802,4,12.0,19707.0,13275.0,58552.450000,,,1419.0,0.7568,0.8168,Lower cost,14368.0


In [73]:
# Check shape
pub_priv2.shape

(2851, 17)

In [74]:
# Drop NPT4_PRIV column
pub_priv2 = pub_priv2.drop(columns = ['NPT4_PRIV'])

In [75]:
# Check for missing values for family income
pub_priv2['FAMINC'].isnull().sum()

0

In [76]:
# Create data frame with just states and average family income
avg_faminc = pub_priv2[['STATE', 'FAMINC']]

In [77]:
# Check output
avg_faminc

Unnamed: 0,STATE,FAMINC
0,AL,32362.826114
1,AL,51306.674306
2,AL,61096.588949
3,AL,31684.382188
4,AL,91846.749624
...,...,...
2846,NJ,58552.450000
2847,NJ,58552.450000
2848,NJ,58552.450000
2849,MO,58552.450000


In [78]:
# Get average of FAMINC for each state
avg_faminc_agg = avg_faminc.groupby('STATE').agg({'FAMINC':'mean'})

In [79]:
# Check output
avg_faminc_agg

STATE,FAMINC
AK,43016.561668
AL,45094.289108
AR,38740.143545
AZ,37321.404426
CA,39442.840757
CO,52171.221491
CT,64443.354356
DC,71693.499163
DE,50260.252741
FL,41220.37387


In [80]:
# Reset index
avg_faminc_agg = avg_faminc_agg.reset_index()

In [81]:
# Check output
avg_faminc_agg.head()

Unnamed: 0,STATE,FAMINC
0,AK,43016.561668
1,AL,45094.289108
2,AR,38740.143545
3,AZ,37321.404426
4,CA,39442.840757


In [82]:
# Check values for STATE column
avg_faminc_agg['STATE'].value_counts(dropna = False)

SC    1
PR    1
TX    1
HI    1
AL    1
NE    1
ID    1
AR    1
LA    1
RI    1
OK    1
GA    1
ND    1
MT    1
DC    1
WI    1
VA    1
NM    1
AZ    1
ME    1
NC    1
CT    1
NH    1
VT    1
CA    1
MN    1
MI    1
SD    1
CO    1
KY    1
OR    1
PA    1
IL    1
MO    1
TN    1
VI    1
OH    1
WV    1
MA    1
AK    1
MD    1
MS    1
NY    1
IN    1
IA    1
DE    1
KS    1
NV    1
GU    1
FL    1
NJ    1
WY    1
UT    1
WA    1
Name: STATE, dtype: int64

In [83]:
# Remove rows for US territories: PR
avg_faminc_agg = avg_faminc_agg[avg_faminc_agg['STATE'] != 'PR']

In [84]:
# Remove rows for US territories: VI
avg_faminc_agg = avg_faminc_agg[avg_faminc_agg['STATE'] != 'VI']

In [85]:
# Remove rows for US territories: GU
avg_faminc_agg = avg_faminc_agg[avg_faminc_agg['STATE'] != 'GU']

In [86]:
# Check shape
avg_faminc_agg.shape

(51, 2)

In [87]:
# Set up folium map at high-level zoom
map4 = folium.Map(location = [100,0], zoom_start = 1.5)

# Create the choropleth layer with folium.Choropleth and add it to the map
folium.Choropleth(geo_data = country_geo, data = avg_faminc_agg,
              columns = ['STATE', 'FAMINC'],
              key_on = 'feature.id',
              fill_color = 'YlOrBr', fill_opacity=0.6, line_opacity=0.1,
              legend_name = 'Average family income of students').add_to(map4)



In [88]:
# Save map
map4.save('plot_data_avg_faminc.html')

# Import the Folium interactive html file
from IPython.display import HTML
HTML('<iframe src=plot_data_avg_faminc.html width=700 height=450></iframe>')



This map shows that the following states have the highest average family incomes for students enrolled in institutions there: Vermont, New Hampshire, Massachusetts, Rhode Island, and Pennsylvania. There is some overlap here with states with higher average costs: Vermont (public & private), Massachusetts (private), New Mexico (private), and Pennsylvania (public).

First of all, South Dakota seems odd to me; however, it's possible that young people tend not to go to college in South Dakota unless their family is more well-off. One way to tell would be to look at the percentage of young people who go to college, which is outside the scope of my data.

As for Vermont and Massachusetts, this probably lends credence to my earlier thought that these states have higher costs of living, considering incomes there tend to be higher. However, I know that California and Washington have extremely high costs of living, yet the average family incomes of college students there don't seem to be at the highest end of the spectrum. So while cost of living might be one aspect of it, it's certainly not the only reason.

It could also have to do with inequality; for example, California and Massachusetts might both have very high costs of living, but California might have a lot more inequality (spread between lowest and highest incomes) than Massachusetts (where the lowest incomes might be closer to the highest incomes). However, my data set does not include any measure of income inequality within states.
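
The overlap between high-cost and high-income states could be checked more directly by merging the state-level averages and looking at their correlation. A minimal sketch on invented numbers: the state codes AA to DD and all values are placeholders, and `toy_cost` and `toy_income` stand in for frames like pub_cost_avg and avg_faminc_agg.

```python
import pandas as pd

# Placeholder state-level averages; codes and values are invented
toy_cost = pd.DataFrame({
    "STATE": ["AA", "BB", "CC", "DD"],
    "COSTT4_A": [10000.0, 20000.0, 30000.0, 40000.0],
})
toy_income = pd.DataFrame({
    "STATE": ["AA", "BB", "CC", "DD"],
    "FAMINC": [30000.0, 50000.0, 40000.0, 60000.0],
})

# Line up average cost and average family income by state
merged = toy_cost.merge(toy_income, on="STATE", how="inner")

# Pearson correlation between the two state-level averages
corr = merged["COSTT4_A"].corr(merged["FAMINC"])

print(round(corr, 2))
```

A correlation near 1 would support the idea that higher-cost states also enroll students from higher-income families, though it would say nothing about cost of living or inequality.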