# Predict College Admission Likelihood with SAT Scores and Gender Using Logistic Regression

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Workaround for older statsmodels versions that still call scipy.stats.chisqprob,
# which has been removed from recent SciPy releases
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

## Load the data

In [2]:
raw_data = pd.read_csv('2.02.+Binary+predictors.csv')
raw_data

,SAT,Admitted,Gender
0,1363,No,Male
1,1792,Yes,Female
2,1954,Yes,Female
3,1653,No,Male
4,1593,No,Male
...,...,...,...
163,1722,Yes,Female
164,1750,Yes,Male
165,1555,No,Male
166,1524,No,Male


In [8]:
data = raw_data.copy()
data['Admitted'] = data['Admitted'].map({'Yes': 1, 'No': 0})
data['Gender'] = data['Gender'].map({'Female': 1, 'Male': 0})
data

,SAT,Admitted,Gender
0,1363,0,0
1,1792,1,1
2,1954,1,1
3,1653,0,0
4,1593,0,0
...,...,...,...
163,1722,1,1
164,1750,1,0
165,1555,0,0
166,1524,0,0


## Declare the dependent and the independent variables

In [22]:
y = data['Admitted']
x1 = data[['SAT','Gender']]

## Regression

In [10]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.120117
         Iterations 10


Dep. Variable:,Admitted,No. Observations:,168.0
Model:,Logit,Df Residuals:,165.0
Method:,MLE,Df Model:,2.0
Date:,"Sat, 02 Sep 2023",Pseudo R-squ.:,0.8249
Time:,23:03:40,Log-Likelihood:,-20.18
converged:,True,LL-Null:,-115.26
Covariance Type:,nonrobust,LLR p-value:,5.118e-42

,coef,std err,z,P>|z|,[0.025,0.975]
const,-68.3489,16.454,-4.154,0.000,-100.598,-36.100
SAT,0.0406,0.010,4.129,0.000,0.021,0.060
Gender,1.9449,0.846,2.299,0.022,0.287,3.603


Model Summary:

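Pseudo R-squared (0.8249): This is McFadden's pseudo R-squared, computed as 1 - (Log-Likelihood / LL-Null). It measures how much the fitted model improves the log-likelihood over an intercept-only model. Unlike R-squared in linear regression, it is not the share of variance explained. A value of 0.8249 indicates a very large improvement over the null model.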

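Log-Likelihood (-20.180): The log-likelihood measures how well the model fits the data. The value reported is the log-likelihood itself (not its negative), so higher values, meaning values closer to zero, indicate a better fit. Here it is -20.18, compared with -115.26 for the intercept-only model (LL-Null), and the LLR p-value of about 5.1e-42 shows that this improvement is statistically significant.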

Converged (True): The "True" value indicates that the optimization algorithm successfully converged to a solution during the estimation process.
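The pseudo R-squared and the LLR p-value can both be reproduced from the two log-likelihoods in the summary. The following is a minimal sketch that uses the rounded values printed above, so the results match the summary only approximately. The variable names are illustrative and not part of the original notebook.

```python
import math

# Values copied (rounded) from the regression summary above
ll_model = -20.18    # Log-Likelihood of the fitted model
ll_null = -115.26    # LL-Null: log-likelihood of the intercept-only model

# McFadden's pseudo R-squared: relative improvement over the null model
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic comparing the fitted model with the null model
llr = 2 * (ll_model - ll_null)

# The statistic is compared with a chi-squared distribution with
# Df Model = 2 degrees of freedom. For exactly 2 degrees of freedom the
# chi-squared survival function reduces to exp(-x / 2).
llr_p_value = math.exp(-llr / 2)

print(round(pseudo_r2, 4))
print(round(llr, 2))
print(llr_p_value)
```

Inside the notebook, the fitted results object should expose the unrounded quantities as `results_log.llf`, `results_log.llnull`, `results_log.prsquared`, `results_log.llr`, and `results_log.llr_pvalue`.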

Coefficient Summary:

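Intercept (const): The intercept is the log-odds of admission when all predictors are zero, that is, for a male applicant (Gender = 0) with an SAT score of 0. Here it is approximately -68.35. Because an SAT score of 0 lies far outside the observed data, the intercept has little practical meaning on its own.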

SAT: The coefficient for SAT is 0.0406, which means that each one-point increase in SAT score raises the log-odds of admission by 0.0406, holding gender constant. The coefficient is positive and statistically significant (p < 0.001), so higher SAT scores are associated with a higher likelihood of admission.

Gender: The coefficient for Gender is 1.9449. Gender was coded above as 1 for female and 0 for male, so at the same SAT score the log-odds of admission are 1.9449 higher for female applicants than for male applicants. This coefficient is statistically significant at the 5% level (p = 0.022), suggesting that gender is associated with admission outcomes in this sample.
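Log-odds are hard to read directly, so it can help to convert them into probabilities with the logistic function. The sketch below uses the rounded coefficients from the summary table and two hypothetical applicants with an SAT score of 1700. Because the SAT coefficient is rounded, the probabilities are only approximate. The function name `admission_probability` is an illustration, not part of the original notebook.

```python
import math

# Coefficients copied (rounded) from the summary table above
CONST, B_SAT, B_GENDER = -68.3489, 0.0406, 1.9449

def admission_probability(sat, gender):
    """Return P(Admitted = 1) for an SAT score and gender (1 = female, 0 = male)."""
    log_odds = CONST + B_SAT * sat + B_GENDER * gender
    # Logistic function: maps log-odds to a probability between 0 and 1
    return 1 / (1 + math.exp(-log_odds))

# Two hypothetical applicants with the same SAT score
p_male = admission_probability(1700, 0)
p_female = admission_probability(1700, 1)
print(p_male, p_female)

# The ratio of their odds equals exp(B_GENDER), whatever SAT score is chosen
odds_ratio = (p_female / (1 - p_female)) / (p_male / (1 - p_male))
print(odds_ratio)
```

For predictions based on the unrounded estimates, `results_log.predict(x)` returns the fitted probabilities for the rows of `x`.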

In [11]:
np.exp(1.9449)

6.992932526814459

An odds ratio of approximately 6.993 means that, at the same SAT score, the odds of admission for female applicants (Gender = 1) are about 7 times the odds for male applicants (Gender = 0).
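The point estimate alone hides how uncertain this odds ratio is. As a rough sketch, the 95% confidence bounds for the Gender coefficient in the summary table can be exponentiated in the same way to get an interval for the odds ratio. The numbers below are the rounded values from the table, and the variable names are illustrative.

```python
import math

# Gender coefficient and its 95% confidence bounds, copied from the summary table
coef_gender = 1.9449
ci_low, ci_high = 0.287, 3.603

# Exponentiating the log-odds bounds gives bounds on the odds-ratio scale
odds_ratio = math.exp(coef_gender)
or_low = math.exp(ci_low)
or_high = math.exp(ci_high)

print(f"odds ratio: {odds_ratio:.2f}, 95% CI: ({or_low:.2f}, {or_high:.2f})")
```

The interval is wide, running from roughly 1.3 to roughly 37, so the data are consistent with anything from a modest to a very large gender effect.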

## Accuracy

In [29]:
# An array containing the TRUE (actual) values
np.array(data['Admitted'])

array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

In [26]:
# A prediction table (confusion matrix): rows are the actual values, columns are the predicted values
results_log.pred_table()

array([[69.,  5.],
       [ 4., 90.]])
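By default, `pred_table()` classifies an observation as 1 when its predicted probability is above a threshold of 0.5. The sketch below rebuilds the same kind of table by hand to make that rule explicit. The helper `confusion_table` and the toy data are hypothetical and are not taken from the admissions dataset.

```python
def confusion_table(actual, predicted_prob, threshold=0.5):
    """Count outcomes in a 2x2 table: rows = actual class, columns = predicted class."""
    table = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted_prob):
        predicted = 1 if p > threshold else 0
        table[a][predicted] += 1
    return table

# Toy data for illustration only
actual = [0, 0, 1, 1, 1]
predicted_prob = [0.1, 0.7, 0.4, 0.8, 0.9]

table = confusion_table(actual, predicted_prob)
print(table)
```

A different cut-off can be tried in the notebook with `results_log.pred_table(threshold=...)`.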

In [27]:
# Some neat formatting to read the table (better when seeing it for the first time)
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
cm_df

,Predicted 0,Predicted 1
Actual 0,69.0,5.0
Actual 1,4.0,90.0


In [28]:
# Create an array (so it is easier to calculate the accuracy)
cm = np.array(cm_df)
# Calculate the accuracy of the model
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum()
accuracy_train

0.9464285714285714
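Accuracy treats both kinds of error alike, so it can be useful to look at each class separately. The following sketch derives a few additional metrics from the confusion matrix above. The metric choices and variable names are additions for illustration, not part of the original notebook, and all figures describe the training data only.

```python
# Confusion matrix from above: rows = actual, columns = predicted
(tn, fp), (fn, tp) = [[69, 5], [4, 90]]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total      # share of all cases classified correctly
sensitivity = tp / (tp + fn)      # share of admitted students predicted as admitted
specificity = tn / (tn + fp)      # share of rejected students predicted as rejected
precision = tp / (tp + fp)        # share of predicted admissions that were correct

print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"precision:   {precision:.4f}")
```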