## Problem Statement

Every year, approximately 10,000 students from different Indian institutes apply to graduate schools abroad (United States, Canada, United Kingdom, etc.). [These](https://www.mbacrystalball.com/blog/2016/03/21/number-of-indian-students-in-usa-statistics-analysis/) [articles](https://in.usembassy.gov/u-s-hosts-million-international-students-second-consecutive-year/) indicate a steep rise in the number of Indian students enrolled abroad in undergraduate, graduate and doctoral programs. With such a large student population, one would expect an interpretable and coherent system that analyses graduate student intake, measures correlations between the different factors related to admissions, accounts for the noise and entropy in the admission process, and predicts the chances of admission into a given university. However, we could not find a holistic system that addresses these issues.

We therefore set out to disentangle the process of graduate school admissions by analysing the various factors at stake, and to apply our prior inductive biases about the randomness in the process to build predictive models. Our problem statement involves exploring and understanding the following aspects of graduate school admissions:
- Studying patterns in student profiles applying to different graduate schools across geographic locations.
- Comparing, differentiating and gaining insights about the admission process.
- Inferring and leveraging definitive existing correlations.
- Drawing conclusions about admit/reject ratios, the strength of an applicant profile, the probability of admission, and appropriate universities for a specific profile.


## Data Collection

## Analysis

## Predictive Modelling

We performed our predictive modelling experiments on three major case studies:
- Carnegie Mellon University
- University of Illinois, Urbana-Champaign
- University of California, Los Angeles

### Classification

As a first step, we treat the prediction problem as a deterministic binary classification problem. Our data cleaning and pre-processing steps have been explained above. For the binary classification task, we deliberately chose simple models so that the results remain interpretable. We also ensured that the importance of every factor could be retrieved so that a comprehensive analysis could be performed.

We use the following classifiers:
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- Support Vector Machine (SVM)

A concise summary of our results follows:

- Carnegie Mellon University

| Classifier | Logistic Regression | Decision Tree | Random Forest | Gradient Boosting | SVM |
| ------ | ------ | ------ | ------ | ------ | ------ |
| Accuracy | 0.72 | 0.65 | 0.7 | 0.67 | 0.75 |
| Precision (0) | 0.66 | 0.38 | 0 | 0.33 | 0.78 |
| Recall (0) | 0.27 | 0.29 | 0 | 0.13 | 0.3 |
| Precision (1) | 0.73 | 0.73 | 0.71 | 0.71 | 0.75 |
| Recall (1) | 0.93 | 0.81 | 1 | 0.89 | 0.96 |

The average feature importances obtained from our classifiers are shown in the figure below.

<p align="center">
  <img width="400" height="300" src="images/cmu_feature_imp.png">
</p>

We can infer that GPA is the most important factor in the admit decision. Two other important factors are the undergraduate major and GRE work experience. A slightly surprising observation is the low importance given to the number of publications. A common notion is that publications play a vital role in getting admitted to any university, but our analysis shows otherwise.

- University of Illinois, Urbana-Champaign

| Classifier | Logistic Regression | Decision Tree | Random Forest | Gradient Boosting | SVM |
| ------ | ------ | ------ | ------ | ------ | ------ |
| Accuracy | 0.74 | 0.844 | 0.726 | 0.87 | 0.75 |
| Precision (0) | 0.71 | 0.75 | 0 | 0.78 | 1 |
| Recall (0) | 0.09 | 0.66 | 0 | 0.74 | 0.1 |
| Precision (1) | 0.74 | 0.88 | 0.73 | 0.9 | 0.75 |
| Recall (1) | 0.99 | 0.92 | 1 | 0.92 | 1 |

<p align="center">
  <img width="400" height="300" src="images/uiuc_feature_imp.png">
</p>

We immediately see a large spike in the importance of the work experience factor. This may seem slightly counter-intuitive given our understanding of the admissions process, but we believe that these sorts of biases differ from college to college, and hence a uniform criterion cannot be applied. We also see that publications are an important factor for an admit to UIUC, in clear contrast to our results for CMU.

- University of California, Los Angeles

| Classifier | Logistic Regression | Decision Tree | Random Forest | Gradient Boosting | SVM |
| ------ | ------ | ------ | ------ | ------ | ------ |
| Accuracy | 0.73 | 0.76 | 0.77 | 0.75 | 0.74 |
| ROC Score | 0.51 | 0.66 | 0.58 | 0.65 | 0.53 |
| F1 Score | 0.63 | 0.75 | 0.71 | 0.74 | 0.66 |

<p align="center">
  <img width="400" height="300" src="images/ucla_feature_imp.png">
</p>

Here too, GPA is the most important factor. It is almost twice as important as any other factor, a result that might be slightly concerning to students with a low GPA but an otherwise strong profile. We also see that work experience is the second most important factor for getting an admit. As in the case of CMU, publications are not as important a factor as they are made out to be.
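For reference, the ROC score in the table above is presumably the area under the ROC curve (AUC); the text does not state how it was computed. The minimal sketch below uses only the Python standard library and invented toy data. `roc_auc` is a hypothetical helper written for this illustration, and it assumes labels are 0 or 1 with both classes present. It also shows that if AUC is computed from hard 0/1 predictions instead of continuous scores, a model that always predicts the same class gets exactly 0.5, whatever its accuracy.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random class-1 sample outscores a
    random class-0 sample, counting ties as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: four class-1 and two class-0 samples.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.5, 0.2]  # continuous scores from some model

print(roc_auc(y_true, scores))           # AUC from continuous scores
print(roc_auc(y_true, [1] * len(y_true)))  # AUC when every prediction is class 1
```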


On the basis of the above classification experiments, we can state with some confidence that different universities weigh the admission factors differently. Hence, students targeting particular universities cannot blindly follow a uniform 'gold standard' trajectory for getting an admit. Each university needs to be researched independently, and students must present their profiles to meet the specific requirements of each university.
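The per-class precision and recall rows in the CMU and UIUC tables can be computed from true and predicted labels as sketched below. This is a minimal standard-library illustration, not the code used for the experiments. `per_class_report` is a hypothetical helper, the toy labels are invented, and treating a zero denominator as a score of 0 is an assumed convention. The example uses a degenerate model that always predicts class 1. Such a model would produce the same pattern as the Random Forest columns above: precision and recall of 0 for class 0, recall of 1 for class 1, and an accuracy equal to the share of class 1 in the test set.

```python
from collections import Counter


def per_class_report(y_true, y_pred, labels=(0, 1)):
    """Return accuracy plus precision, recall and F1 for each class label."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    total = sum(pairs.values())
    report = {"accuracy": sum(pairs[(c, c)] for c in labels) / total}
    for c in labels:
        tp = pairs[(c, c)]
        predicted_c = sum(pairs[(t, c)] for t in labels)  # samples predicted as c
        actual_c = sum(pairs[(c, p)] for p in labels)     # samples that truly are c
        # Assumed convention: a metric with a zero denominator is reported as 0.
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels, invented for illustration: 7 samples of class 1, 3 of class 0.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
always_one = [1] * len(y_true)  # a degenerate model that only predicts class 1

report = per_class_report(y_true, always_one)
print(report["accuracy"])  # share of class 1 in y_true
print(report[0])           # metrics for class 0
print(report[1])           # metrics for class 1
```

Reading accuracy together with the per-class rows therefore helps to spot a model that has collapsed onto the majority class.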

### Regression

As a follow-up experiment, we wanted to estimate the chance of admission as a function of the student profile. For this, we use the public [UCLA admission dataset](https://stats.idre.ucla.edu/stat/data/binary.csv) to perform a regression analysis.

The data contains the following fields:
- <u>GRE scores</u>:
The GRE scores range from 260 to 340 and are distributed as follows:

<p align="center">
  <img width="400" height="300" src="images/ucla_gre.png">
</p>

- <u>TOEFL scores</u>:
TOEFL scores range from 0 to 120; however, no TOEFL score in our dataset is below 93. The TOEFL scores are distributed as follows:

<p align="center">
  <img width="400" height="300" src="images/ucla_toefl.png">
</p>

- <u>CGPA</u>:
The CGPA distribution of the candidates in the dataset is:

<p align="center">
  <img width="400" height="300" src="images/ucla_gpa.png">
</p>

- <u>University rating</u>:
This field gives the rating of the undergraduate university of the student on a relative scale of 1 to 5.

- <u>SoP and LoR</u>:
These two fields give relative indicators about the strength of the candidate's statement of purpose and recommendations.

- <u>Research</u>:
This is a binary attribute describing the candidate's research experience. It takes a value of 1 if the candidate has done research before, and 0 otherwise.
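The fields described above can be sketched as a small record type with range checks. This is only an illustration: the class name `ApplicantRecord`, the field names, and the example row are hypothetical and do not come from the dataset's actual column headers. Only the ranges stated above (GRE, TOEFL, university rating, research flag) are validated.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class ApplicantRecord:
    """One row of the regression dataset. All field names are hypothetical."""

    gre: int                # GRE score, 260 to 340
    toefl: int              # TOEFL score, 0 to 120
    cgpa: float             # scale is not stated above, so it is not range-checked
    university_rating: int  # relative rating of the undergraduate university, 1 to 5
    sop: float              # relative strength of the statement of purpose
    lor: float              # relative strength of the recommendations
    research: int           # 1 if the candidate has done research before, else 0

    def __post_init__(self):
        # Reject values outside the ranges given in the field descriptions.
        if not 260 <= self.gre <= 340:
            raise ValueError(f"GRE score out of range: {self.gre}")
        if not 0 <= self.toefl <= 120:
            raise ValueError(f"TOEFL score out of range: {self.toefl}")
        if not 1 <= self.university_rating <= 5:
            raise ValueError(f"University rating out of range: {self.university_rating}")
        if self.research not in (0, 1):
            raise ValueError(f"Research flag must be 0 or 1: {self.research}")

    def as_features(self):
        """Return the fields as a flat list of floats, in declaration order."""
        return [float(value) for value in astuple(self)]


# Invented example row for illustration.
row = ApplicantRecord(gre=320, toefl=110, cgpa=8.5, university_rating=4,
                      sop=4.0, lor=4.5, research=1)
print(row.as_features())
```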

We use the following regressors:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Bayesian Ridge Regression
- AdaBoost Regression
- Gradient Boosting Regression

The results we obtain are:

| Regressor | Linear | Ridge | Lasso | Bayesian Ridge | AdaBoost | Gradient Boosting |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| R2 score | 0.80 | 0.8 | 0.21 | 0.8 | 0.77 | 0.78 |
| MSE | 0.005 | 0.005 | 0.02 | 0.005 | 0.006 | 0.006 |
| MAE | 0.05 | 0.05 | 0.12 | 0.05 | 0.06 | 0.05 |

To check the goodness of fit, we analyse the correlation between the predicted and actual values of the dependent variable. The correlation plots obtained were:

<p align="center">
  <img width="400" height="300" src="images/ucla_corr.png">
</p>

<p align="right">
  <img width="400" height="300" src="images/ucla_corr_.png">
</p>

Therefore, we suggest that such a regression-based predictive model can be used to extend the previously reported classification technique. We believe that such a combination of predictive models could be a viable option for mitigating the randomness in the grad school admissions process.
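The scores in the regression results table, and the correlation between predicted and actual values, follow from the standard definitions sketched below. This is a minimal standard-library illustration with invented numbers, not the evaluation code used for the experiments. `regression_metrics` is a hypothetical helper, and it assumes that neither the actual nor the predicted values are all identical. Note that RMSE is the square root of MSE and can never be smaller than MAE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, RMSE, MAE and the Pearson correlation of two sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n

    # Pearson correlation between actual and predicted values.
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "pearson_r": cov / math.sqrt(ss_tot * var_pred),
    }


# Invented values for illustration.
actual = [0.90, 0.70, 0.50, 0.80, 0.60]
predicted = [0.85, 0.75, 0.55, 0.80, 0.50]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```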

## Conclusions