Data Cleaning for our data set, derived from data found at:
https://nces.ed.gov/ipeds/use-the-data

Authors:
- Tina Jin
- Virginia Weston
- Jeffrey Bradley
- Taylor Tucker

Import Statements

In [46]:
import numpy as np
import pandas as pd

Importing the dataset as a Pandas DataFrame.

In [47]:
df = pd.read_csv("./school_data.csv")
print(df.head())
columns = df.columns

   UnitID     Institution Name  Carnegie Classification 2018: Basic (HD2019)  \
0  138600  Agnes Scott College                                            21   
1  168546       Albion College                                            21   
2  210571     Albright College                                            21   
3  210669    Allegheny College                                            21   
4  217624     Allen University                                            21   

   Number of students receiving a Bachelor's degree (DRVC2019)  \
0                                              208.0             
1                                              310.0             
2                                              398.0             
3                                              382.0             
4                                               61.0             

   Percent of full-time first-time undergraduates awarded any financial aid (SFA1819)  \
0                                

Printing each column name with its number of null values to get a good idea of what we are working with.

In [48]:
for c in columns:
    print(c, df[c].isna().sum())

UnitID 0
Institution Name 0
Carnegie Classification 2018: Basic (HD2019) 0
Number of students receiving a Bachelor's degree (DRVC2019) 1
Percent of full-time first-time undergraduates awarded any financial aid (SFA1819) 9
Average amount of federal  state  local or institutional grant aid awarded (SFA1819) 9
Total price for out-of-state students living on campus 2018-19 (DRVIC2018) 13
Percent admitted - men (DRVADM2018_RV) 32
Percent admitted - women (DRVADM2018_RV) 19
Full-time retention rate  2018 (EF2018D) 4
Undergraduate enrollment (DRVEF2018) 0
Percent of undergraduate enrollment that are women (DRVEF2018) 1
Percent of first-time undergraduates - in-state (DRVEF2018) 2
Percent of first-time undergraduates - out-of-state (DRVEF2018) 2
Percent of first-time undergraduates - foreign countries (DRVEF2018) 2
Percent of first-time undergraduates - residence unknown (DRVEF2018) 2
Graduation rate  total cohort (DRVGR2018_RV) 4
Percent full-time first-time receiving an award - 4 years (DRVO
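
The same per-column tally can be produced without an explicit loop, and sorting it makes the worst columns easy to spot. A minimal sketch on a small made-up DataFrame (the column names and values below are illustrative only, not taken from school_data.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "enrollment": [1000, 2500, 1800, 900],
    "pct_admitted_men": [np.nan, 55.0, np.nan, 61.0],
    "retention_rate": [88.0, np.nan, 79.0, 83.0],
})

# Null count per column, sorted so the columns with the most gaps come first.
null_counts = toy.isna().sum().sort_values(ascending=False)

# Share of rows that are missing in each column.
null_share = toy.isna().mean().sort_values(ascending=False)

print(null_counts)
print(null_share)
```

On the real data, the same two lines would be applied to df instead of toy.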

The admission-rate columns for men and women have by far the most null values (32 and 19, respectively).
This is likely due to either a lack of reporting or the prevalence of single-sex institutions.
We therefore drop those two columns and add the total admissions rate instead.

In [49]:
df.drop(["Percent admitted - men (DRVADM2018_RV)", "Percent admitted - women (DRVADM2018_RV)"], axis=1, inplace=True)
print(df)

     UnitID             Institution Name  \
0    138600          Agnes Scott College   
1    168546               Albion College   
2    210571             Albright College   
3    210669            Allegheny College   
4    217624             Allen University   
..      ...                          ...   
233  107877  Williams Baptist University   
234  168342             Williams College   
235  206525        Wittenberg University   
236  218973              Wofford College   
237  141361         Young Harris College   

     Carnegie Classification 2018: Basic (HD2019)  \
0                                              21   
1                                              21   
2                                              21   
3                                              21   
4                                              21   
..                                            ...   
233                                            21   
234                                            
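
The single-sex explanation could be checked before the drop by comparing missing admission rates against the share of women enrolled. The sketch below is only an illustration: the column names come from the data set, but the rows are invented, and treating 0% or 100% women as "single-sex" is an assumption.

```python
import numpy as np
import pandas as pd

MEN = "Percent admitted - men (DRVADM2018_RV)"
WOMEN = "Percent admitted - women (DRVADM2018_RV)"
PCT_WOMEN = "Percent of undergraduate enrollment that are women (DRVEF2018)"

# Toy rows with invented values; only the column names come from the data set.
toy = pd.DataFrame({
    MEN: [np.nan, 60.0, np.nan, 45.0],
    WOMEN: [70.0, 65.0, np.nan, np.nan],
    PCT_WOMEN: [100.0, 55.0, 48.0, 0.0],
})

# Rows where at least one of the two admission rates is missing.
missing_rate = toy[MEN].isna() | toy[WOMEN].isna()

# Rows whose enrollment is entirely one sex (assumed proxy for single-sex).
single_sex = toy[PCT_WOMEN].isin([0.0, 100.0])

# Cross-tabulate missing rates against single-sex enrollment.
print(pd.crosstab(missing_rate, single_sex,
                  rownames=["rate missing"], colnames=["single-sex"]))
```

Missing rates that fall mostly in the single-sex column would support that explanation; missing rates at mixed-enrollment schools would point to non-reporting.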

New DataFrame containing the total admissions rate, also found at
https://nces.ed.gov/ipeds/use-the-data.

In [50]:
admissions_rate = pd.read_csv("./total_admissions_rate.csv")

Concatenating our two DataFrames column-wise to make a final one.

In [51]:
df = pd.concat([df, admissions_rate], axis=1)
print(df.columns)

Index(['UnitID', 'Institution Name',
       'Carnegie Classification 2018: Basic (HD2019)',
       'Number of students receiving a Bachelor's degree (DRVC2019)',
       'Percent of full-time first-time undergraduates awarded any financial aid (SFA1819)',
       'Average amount of federal  state  local or institutional grant aid awarded (SFA1819)',
       'Total price for out-of-state students living on campus 2018-19 (DRVIC2018)',
       'Full-time retention rate  2018 (EF2018D)',
       'Undergraduate enrollment (DRVEF2018)',
       'Percent of undergraduate enrollment that are women (DRVEF2018)',
       'Percent of first-time undergraduates - in-state (DRVEF2018)',
       'Percent of first-time undergraduates - out-of-state (DRVEF2018)',
       'Percent of first-time undergraduates - foreign countries (DRVEF2018)',
       'Percent of first-time undergraduates - residence unknown (DRVEF2018)',
       'Graduation rate  total cohort (DRVGR2018_RV)',
       'Percent full-time first-time 
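
pd.concat with axis=1 pairs rows by index label, so the result is only correct if both files list the institutions in the same order. If total_admissions_rate.csv also carries a UnitID column (an assumption; the file's layout is not shown here), a key-based merge is a safer alternative. A minimal sketch with toy frames, where the IDs, the values, and the "Percent admitted - total" column name are all hypothetical:

```python
import pandas as pd

# Toy frames; IDs and values are invented for illustration.
schools = pd.DataFrame({
    "UnitID": [101, 102, 103],
    "Undergraduate enrollment": [1000, 1500, 1800],
})
rates = pd.DataFrame({
    "UnitID": [103, 101, 102],  # deliberately in a different row order
    "Percent admitted - total": [50.0, 70.0, 69.0],  # hypothetical column name
})

# Join on the shared key; validate raises if either side repeats a UnitID.
merged = schools.merge(rates, on="UnitID", how="left", validate="one_to_one")
print(merged)
```

A merge also avoids duplicated identifier columns, since the key appears only once in the result.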

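Dropping the stray "Unnamed:" columns (pandas assigns these names to columns whose header cell is empty in the CSV), along with the "Institution Name" and "UnitID" columns, which identify a school rather than describe it.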

In [52]:
df.drop(["Unnamed: 25", "Unnamed: 3", "Institution Name", "UnitID"], axis=1, inplace=True)
print(df.columns)

Index(['Carnegie Classification 2018: Basic (HD2019)',
       'Number of students receiving a Bachelor's degree (DRVC2019)',
       'Percent of full-time first-time undergraduates awarded any financial aid (SFA1819)',
       'Average amount of federal  state  local or institutional grant aid awarded (SFA1819)',
       'Total price for out-of-state students living on campus 2018-19 (DRVIC2018)',
       'Full-time retention rate  2018 (EF2018D)',
       'Undergraduate enrollment (DRVEF2018)',
       'Percent of undergraduate enrollment that are women (DRVEF2018)',
       'Percent of first-time undergraduates - in-state (DRVEF2018)',
       'Percent of first-time undergraduates - out-of-state (DRVEF2018)',
       'Percent of first-time undergraduates - foreign countries (DRVEF2018)',
       'Percent of first-time undergraduates - residence unknown (DRVEF2018)',
       'Graduation rate  total cohort (DRVGR2018_RV)',
       'Percent full-time first-time receiving an award - 4 years (DRVOM20
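
Hard-coding "Unnamed: 25" and "Unnamed: 3" works for these two files but fails if the column positions change. A pattern-based drop is one alternative. A minimal sketch on a toy frame (the data and the shortened column names are invented):

```python
import pandas as pd

# Toy frame with invented values; "Unnamed: N" mimics the names read_csv
# assigns to columns whose header cell is empty.
toy = pd.DataFrame({
    "Undergraduate enrollment": [1000, 1500],
    "Unnamed: 3": [None, None],
    "Graduation rate": [70.0, 65.0],
    "Unnamed: 25": [None, None],
})

# Keep every column whose name does not start with "Unnamed:".
keep = ~toy.columns.str.startswith("Unnamed:")
cleaned = toy.loc[:, keep]
print(cleaned.columns.tolist())
```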

Checking the prevalence of missing data once more and printing the number of features and examples.

In [55]:
for c in df.columns:
    print(c, df[c].isna().sum())

print("Number of features:", len(df.columns))
print("Number of examples:", len(df))

Carnegie Classification 2018: Basic (HD2019) 0
Number of students receiving a Bachelor's degree (DRVC2019) 1
Percent of full-time first-time undergraduates awarded any financial aid (SFA1819) 9
Average amount of federal  state  local or institutional grant aid awarded (SFA1819) 9
Total price for out-of-state students living on campus 2018-19 (DRVIC2018) 13
Full-time retention rate  2018 (EF2018D) 4
Undergraduate enrollment (DRVEF2018) 0
Percent of undergraduate enrollment that are women (DRVEF2018) 1
Percent of first-time undergraduates - in-state (DRVEF2018) 2
Percent of first-time undergraduates - out-of-state (DRVEF2018) 2
Percent of first-time undergraduates - foreign countries (DRVEF2018) 2
Percent of first-time undergraduates - residence unknown (DRVEF2018) 2
Graduation rate  total cohort (DRVGR2018_RV) 4
Percent full-time first-time receiving an award - 4 years (DRVOM2018_RV) 5
Total FTE staff (DRVHR2018) 2
Instructional FTE (DRVHR2018) 2
Student and Academic Affairs and Other E

We now have a relatively coherent data set to work with: around 21 features and 1 target variable, across around 238
examples.
