# Barred Gates: College Admissions Predictions

## Introduction

Every year, four million students apply to college in the United States. The market for admission-related services is $7 billion ([NYTimes](http://boss.blogs.nytimes.com/2009/11/18/tune-in-start-up-drop-out/?_r=0)). The process is fraught with anxiety and uncertainty for students and their parents. A large source of this anxiety is the lack of reasonably accurate admissions forecasts, in spite of expensive consultants and experienced high school college counselors.

There are rational reasons why forecasting college admissions is notoriously inaccurate. Top colleges have such an abundance of choice that the final selections among many qualified candidates can be somewhat random. University admission boards also try to balance for diversity among ethnic, economic and geographic groups, so the choices depend, year to year, on the application pools.

The lack of forecasting prowess is exacerbated by organizations like the [College Board](https://www.collegeboard.org/), which has a financial incentive to offer expensive standardized tests and study guides that do little to predict a given student's suitability for a given college.

Through this project, we hope to provide some clarity, precision and reasonably accurate forecasting of a student's probability of being accepted into a given college. We will make this predictive model available on a publicly accessible website, [www.chanceme.info](www.chanceme.info), which is particularly timely given that the vast majority of college applications occur in the fall.

## Project Objectives
First and foremost, we want to know whether there is any rhyme or reason behind college admissions among the cream-of-the-crop US universities. Colleges are surprisingly opaque about admissions criteria. All of us would like to find out whether this is because they are trying to be secretive about their algorithms or, more unsettlingly, because admissions decisions come down to unprincipled guidelines applied at the whim of the selection committee or an individual admissions officer.

Secondly, if we can find clear profiles of admitted students, we want to test the common consensus and our own intuitions about which factors matter the most. Is it the case that we've all been so heavily focused on SAT scores that we've ignored the importance of breadth in AP courses? 

Thirdly, we aim to supplement our own technical skills by tackling a project with several moving parts. We will attempt to source data both from unstructured messages and from the structured contents of public websites, which we will scrape. We will need to confront the issues of missing data and selection bias. The people who report their statistics on web forums are probably not a representative population! We will learn how to evaluate and test several different models we haven't tried before, such as random forest with regression.

Lastly, we'll put our new-found visualization and communication skills to the test by designing a reactive web application for our results.

## What data?
We have two main sources of data. One provides student-based data, giving us the credentials of students who were either accepted or rejected by our target colleges. We will scrape this data from the website CollegeData.com. The other gives us state information about each college (details like admission rate and financial aid status). This college-based information will come from the College Board and from U.S. News and World Report lists of top schools. This part of the data will be small, as we only aim to support the top 25 schools in the US with our app.

### Assumptions

We have made the following assumptions about the collected data:

* Independence of data: we assume that there is no collusion between the colleges to barter acceptances behind the scenes.
* Constant variance: the variances of the sub-populations are all equal.
* Normality: the responses in each sub-population are Normally distributed around the estimated mean.
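
As a rough illustration of how the constant-variance assumption could be checked, the sketch below compares sample standard deviations across groups. The grouping, the function name `sd_ratio` and the toy values are all hypothetical; they are not taken from our data or our analysis. A commonly quoted rule of thumb treats the assumption as plausible when the largest standard deviation is less than about twice the smallest, but that threshold is a convention, not a formal test.

```python
from statistics import stdev


def sd_ratio(groups):
    """Return the ratio of the largest to the smallest sample standard deviation.

    `groups` maps a sub-population label to a list of numeric responses.
    Each list needs at least two values, and the smallest SD must be non-zero.
    """
    sds = [stdev(values) for values in groups.values()]
    return max(sds) / min(sds)


# Made-up illustration values (hypothetical), not the project's data.
toy_groups = {
    "group_a": [3.5, 3.7, 3.9, 4.0, 3.6],
    "group_b": [3.2, 3.8, 3.4, 3.9, 3.6],
}

ratio = sd_ratio(toy_groups)
print(f"largest/smallest SD ratio: {ratio:.2f}")
```

A ratio close to 1 is consistent with equal variances; a large ratio suggests the assumption deserves a closer look, for example with a formal test or a residual plot.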


### Caveats and Limitations

Since the only reliable sources of data we found are self-reported and unverified, we run the risk of selection bias in our analysis. This is somewhat mitigated by the volume we collected, which is approximately 5,000 students with 13,000 complete applications. Clearly this data is somewhat biased, as our baseline acceptance rate is 60%, whereas the typical acceptance rates for the top 25 schools range from 5% to 26% ([source](http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data)).
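
To make the size of this gap concrete, the sketch below shows one standard way to shift a probability estimated on a sample with a 60% acceptance rate toward a lower population rate, by rescaling the odds. This is an illustration only, not necessarily what our model does. It assumes that students are over- or under-represented purely according to whether they were accepted, which is a strong assumption. The function name `rebase_probability` and the 15% population rate (a value inside the 5-26% range quoted above) are hypothetical choices.

```python
def rebase_probability(p_sample, sample_rate, population_rate):
    """Shift a probability estimated at `sample_rate` to `population_rate`.

    Accepted and rejected cases are reweighted by how over- or
    under-represented each outcome is in the sample, then renormalised.
    All three arguments must lie strictly between 0 and 1.
    """
    accept = p_sample * (population_rate / sample_rate)
    reject = (1 - p_sample) * ((1 - population_rate) / (1 - sample_rate))
    return accept / (accept + reject)


sample_rate = 0.60      # baseline acceptance rate in the self-reported data
population_rate = 0.15  # assumed value within the published 5-26% range

for p in (0.40, 0.60, 0.80):
    adjusted = rebase_probability(p, sample_rate, population_rate)
    print(f"sample estimate {p:.2f} -> adjusted {adjusted:.3f}")
```

By construction, an estimate equal to the sample base rate maps to the population base rate, and the ordering of candidates is preserved; only the scale of the probabilities changes.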

The structured data we collected provides a narrower view of the candidate than an admissions officer would see. We are not performing any qualitative analysis. We have no input from alumni interviews or application essays. We are not interpreting significant non-academic achievements or weighing extensive extracurricular activities. Exceptional outlier candidates would not be scored accurately by this model. As an anecdote, we know of one seventeen-year-old student who brought electrical power to his village in India. Our model would likely predict a low admission probability for him, even though such circumstances clearly offset typical criteria like standardized test scores.

We aim for a model that represents the majority of candidates based on the data we obtained and expect errors in prediction at the tails of this distribution.

## Related Research

We attempted to find other research related to the prediction of college admissions. Most research we found focused only on regression based on one or two criteria, such as GPA or standardized test scores (see Appendix A). We were unable to locate any academic research that predicted probabilities based on the multitude of factors that we examined.

# Acknowledgements

We'd like to thank the professors, TAs and our fellow students in [CS109](www.cs109.org) Fall 2015 for all their instruction, help and patience during this course. We have all learned a tremendous amount and feel confident that we will be able to use these skills for many years to come. In particular, we would like to express gratitude to Andy for his guidance.

We'd also like to thank Professor Kevin Rader of the Harvard Statistics department, who assisted with the Linear Mixture Models.

Regards,

Morgan, Lauren, Kiran and David

December 2015