# Graduation Lab (Week 6)

## Instructions:

Let's build a kNN model using the college completion data.
The data is messy and you have a degrees-of-freedom problem: there are too many features.

You've done most of the hard work already, so you should be ready to move forward with building your model.

1. Use the question/target variable you submitted and
build a model to answer the question you created for this dataset (make sure it is a classification problem, convert if necessary).
2. Build a kNN model to predict your target variable using 3 nearest neighbors. Make sure it is a classification problem, meaning
you change the target variable if needed.
3. Create a dataframe that includes the test target values, test predicted values,
and test probabilities of the positive class.
4. No code question: If you adjusted the k hyperparameter, what do you think would
happen to the threshold function? Would the confusion matrix look the same at the same threshold
levels or not? Why or why not?
5. Evaluate the results using the confusion matrix. Then "walk" through your question and summarize what
concerns or positive elements you have about the model as it relates to your question.
6. Create two functions: One that cleans the data & splits into training|test and one that
allows you to train and test the model with different k and threshold values, then use them to
optimize your model (test your model with several k and threshold combinations). Try not to use variable names
in the functions, but if you need to that's fine. (If you can't get the k function and threshold function to work in one
function just run them separately.)
7. How well does the model perform? Did the interaction of the adjusted thresholds and k values help the model? Why or why not?
8. Choose another variable as the target in the dataset and create another kNN model using the two functions you created in
step 6.

### 1. Use the question/target variable you submitted and build a model to answer the question you created for this dataset (make sure it is a classification problem, convert if necessary).

In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('/workspaces/DS-3021/data/cc_institution_details.csv')

print(data.head())

print(data.columns)

#make sure control and aid are numbers
#note: if control is stored as text (the head() output shows values like 'Public'),
#to_numeric with errors='coerce' turns every value into NaN and each row takes the else branch below
data['control'] = pd.to_numeric(data['control'], errors='coerce')
data['aid_value'] = pd.to_numeric(data['aid_value'], errors='coerce')

#combine private for-profit and not-for-profit together: 0 when control == 1, otherwise 1
data['inst_type'] = data['control'].apply(lambda x: 0 if x == 1 else 1).astype(int)
#create two categories for aid_value: 1 = above the median (high aid), 0 = otherwise
median_aid = data['aid_value'].median()
data['high_aid'] = (data['aid_value'] > median_aid).astype(int)

#clean data 
data = data.dropna(subset=['inst_type', 'aid_value', 'high_aid'])

#using institution type as the predictor to build the logistic regression model here
X = data[['inst_type']].astype(float)
X = sm.add_constant(X) 
y = data['high_aid'].astype(float)
logit_model = sm.Logit(y, X)
result = logit_model.fit(disp=0)

print(result.summary())

#predict probabilities
data['pred_prob'] = result.predict(X)
data['predicted'] = (data['pred_prob'] >= 0.5).astype(int)

#use confusion matrix and classification report
cm = confusion_matrix(y, data['predicted'])
print(cm)
print(classification_report(y, data['predicted']))

#print first 5 rows to check the actual and predicted data
print(data[['inst_type', 'aid_value', 'high_aid', 'pred_prob', 'predicted']].head(5))

   index  unitid                            chronname        city    state  \
0      0  100654               Alabama A&M University      Normal  Alabama   
1      1  100663  University of Alabama at Birmingham  Birmingham  Alabama   
2      2  100690                   Amridge University  Montgomery  Alabama   
3      3  100706  University of Alabama at Huntsville  Huntsville  Alabama   
4      4  100724             Alabama State University  Montgomery  Alabama   

    level                 control  \
0  4-year                  Public   
1  4-year                  Public   
2  4-year  Private not-for-profit   
3  4-year                  Public   
4  4-year                  Public   

                                               basic hbcu flagship  ...  \
0  Masters Colleges and Universities--larger prog...    X      NaN  ...   
1  Research Universities--very high research acti...  NaN      NaN  ...   
2            Baccalaureate Colleges--Arts & Sciences  NaN      NaN  ...   
3  Resea

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### 2. Build a kNN model to predict your target variable using 3 nearest neighbors. Make sure it is a classification problem, meaning you change the target variable if needed.

In [3]:
features = ['inst_type', 'student_count', 'endow_value', 'fte_value'] #use a couple other features to predict the aid value
data_clean = data.dropna(subset=features + ['high_aid'])

X = data_clean[features]
y = data_clean['high_aid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
#make knn classifier 
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)
y_prob = knn.predict_proba(X_test_scaled)[:, 1]

results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Probability': y_prob})
print(results.head(10))
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

      Actual  Predicted  Probability
2171       1          1     0.666667
1223       0          0     0.000000
2229       1          0     0.000000
2391       1          1     1.000000
2575       0          0     0.000000
705        1          1     1.000000
1333       1          1     0.666667
1417       1          1     1.000000
1419       1          1     1.000000
1273       0          1     0.666667
[[154  67]
 [ 74 402]]
              precision    recall  f1-score   support

           0       0.68      0.70      0.69       221
           1       0.86      0.84      0.85       476

    accuracy                           0.80       697
   macro avg       0.77      0.77      0.77       697
weighted avg       0.80      0.80      0.80       697



### 3. Create a dataframe that includes the test target values, test predicted values, and test probabilities of the positive class.

In [4]:
y_pred = knn.predict(X_test_scaled)

#get predicted probabilities for positive class
y_prob = knn.predict_proba(X_test_scaled)[:, 1]

#make df
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred,
    'Probability_Positive': y_prob
})
print(results_df.head())

      Actual  Predicted  Probability_Positive
2171       1          1              0.666667
1223       0          0              0.000000
2229       1          0              0.000000
2391       1          1              1.000000
2575       0          0              0.000000


### 4. No code question: If you adjusted the k hyperparameter, what do you think would happen to the threshold function? Would the confusion matrix look the same at the same threshold levels or not? Why or why not?

##### I think changing the k hyperparameter would change the distribution of the predicted probabilities. With a lower k, the model is more sensitive to noise; a higher k smooths out local variation because more neighbors are taken into account, which gives more stable probabilities. If you keep the same threshold, the confusion matrix will generally look different because the predicted probabilities are different.
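
To make the link between k and the threshold concrete, here is a minimal sketch in plain Python. It assumes the default uniform-weight kNN vote, where the positive-class probability is the fraction of the k neighbors that are positive, so it can only take the values 0, 1/k, ..., 1. The helper `min_positive_votes` is a hypothetical name used for illustration, not part of scikit-learn.

```python
def min_positive_votes(k, threshold):
    """Smallest number of positive neighbors j (out of k) with j / k >= threshold."""
    return min(j for j in range(k + 1) if j / k >= threshold)


# Same k and threshold values that appear in this notebook
for k in (1, 3, 5):
    for threshold in (0.3, 0.5, 0.7):
        votes = min_positive_votes(k, threshold)
        print(f"k={k}, threshold={threshold}: needs {votes} of {k} positive neighbors")
```

With k = 1 all three thresholds require the single neighbor to be positive, so the predictions cannot change with the threshold. With k = 3 the same thresholds require 1, 2, and 3 positive neighbors, so each threshold yields a different confusion matrix.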

### 5. Evaluate the results using the confusion matrix. Then "walk" through your question and summarize what concerns or positive elements you have about the model as it relates to your question.

##### From the confusion matrix we can tell that the model correctly identifies 154 universities as low-aid schools and 402 as high-aid schools. It also misidentifies 67 schools that are actually low aid as high aid, and 74 schools that are actually high aid as low aid. The overall performance is pretty good (about 80% accuracy). My question examines whether private (including for-profit and not-for-profit) universities or public schools tend to offer more financial aid. The model predicts that public schools are more likely to offer high aid, which is consistent with my expectation. Yet the 74 false negatives (high-aid schools predicted as low aid) and the 67 false positives are a concern, and recall for the low-aid class is only 0.70, suggesting that we may need more variables to accurately identify low-aid schools.
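
The counts in the confusion matrix can be turned into rates directly. A minimal sketch in plain Python, using the k = 3 confusion matrix printed in step 2 and assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, class 0 first):

```python
# Confusion matrix from step 2: [[154, 67], [74, 402]]
tn, fp = 154, 67   # actual low aid: predicted low, predicted high
fn, tp = 74, 402   # actual high aid: predicted low, predicted high

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall_high = tp / (tp + fn)      # share of high-aid schools the model finds
precision_high = tp / (tp + fp)   # share of high-aid predictions that are correct
recall_low = tn / (tn + fp)       # share of low-aid schools the model finds

print(f"accuracy={accuracy:.2f}, recall_high={recall_high:.2f}, "
      f"precision_high={precision_high:.2f}, recall_low={recall_low:.2f}")
```

These values agree with the classification report in step 2 (0.80 accuracy, 0.84 recall and 0.86 precision for class 1, 0.70 recall for class 0).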

### 6. Create two functions: One that cleans the data & splits into training|test and one that allows you to train and test the model with different k and threshold values, then use them to optimize your model (test your model with several k and threshold combinations). Try not to use variable names in the functions, but if you need to that's fine. (If you can't get the k function and threshold function to work in one function just run them separately.)

In [5]:
#function 1

def data_clean_split(data, control_col, aid_col, features, test_size=0.3, random_state=42):
    #clean the data, build the binary feature/target, then split and scale
    #note: this adds or overwrites columns on the dataframe that is passed in
    data[control_col] = pd.to_numeric(data[control_col], errors='coerce')
    data[aid_col] = pd.to_numeric(data[aid_col], errors='coerce')
    #0 when the control column equals 1, otherwise 1
    data['inst_type'] = data[control_col].apply(lambda x: 0 if x == 1 else 1)
    #1 when aid is above the median, otherwise 0
    median_val = data[aid_col].median()
    data['high_aid'] = (data[aid_col] > median_val).astype(int)
    #drop rows with missing values in any column the model uses
    required_cols = features + ['inst_type', 'high_aid']
    cleaned_data = data.dropna(subset=required_cols)

    #create the features and target
    X = cleaned_data[features + ['inst_type']]
    y = cleaned_data['high_aid']

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=test_size,
                                                        random_state=random_state)
    #scale X for knn; the scaler is fit on the training set only and then applied to the test set
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return X_train_scaled, X_test_scaled, y_train, y_test
print(data_clean_split)


<function data_clean_split at 0xffff302d71a0>
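
The lab asks to avoid hard-coded variable names where possible. As a hedged sketch of what that could look like, the hypothetical helper `clean_and_split` below takes the target column and the feature columns as arguments. It is not the function used for the results in this notebook. Unlike `data_clean_split`, it works on a copy, drops rows with a missing target, and computes the median after dropping rows; these are assumptions of the sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def clean_and_split(df, target_col, feature_cols, test_size=0.3, random_state=42):
    """Hypothetical generalization: no column names are hard-coded."""
    df = df.copy()  # leave the caller's dataframe untouched
    cols = list(feature_cols) + [target_col]
    # Coerce the model columns to numeric, then drop rows with missing values
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    df = df.dropna(subset=cols)
    # Binary target: 1 if above the median of the target column, else 0
    y = (df[target_col] > df[target_col].median()).astype(int)
    X = df[list(feature_cols)]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # Fit the scaler on the training set only, then apply it to the test set
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test


# Example call with column names from this notebook (not run here):
# clean_and_split(data, 'pell_value', ['student_count', 'endow_value', 'fte_value'])
```

Because the target column is a parameter, the same function could serve both the aid and the Pell targets without a second copy of the cleaning code.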


In [6]:
#function 2
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

def knn_train_test(X_train_scaled, X_test_scaled, y_train, y_test, k_values, thresholds):
    results = []

    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train_scaled, y_train)

        #predicted probabilities for the positive class 
        y_prob = knn.predict_proba(X_test_scaled)[:, 1]

        for thresh in thresholds:
            y_pred_thresh = (y_prob >= thresh).astype(int)
            cm = confusion_matrix(y_test, y_pred_thresh)
            acc = accuracy_score(y_test, y_pred_thresh)

            results.append({
                'k': k,
                'threshold': thresh,
                'confusion_matrix': cm,
                'accuracy': acc
            })

    return results

#define columns
control_col = 'control'
aid_col = 'aid_value'
features = ['student_count', 'endow_value', 'fte_value']  # Example features
X_train_scaled, X_test_scaled, y_train, y_test = data_clean_split(data, control_col, aid_col, features)

#try with multiple k and threshold values
k_list = [1, 3, 5, 7, 9]
threshold_list = [0.3, 0.5, 0.7]
results = knn_train_test(X_train_scaled, X_test_scaled, y_train, y_test, k_list, threshold_list)
for r in results:
    print(r)

{'k': 1, 'threshold': 0.3, 'confusion_matrix': array([[152,  69],
       [ 93, 383]]), 'accuracy': 0.7675753228120517}
{'k': 1, 'threshold': 0.5, 'confusion_matrix': array([[152,  69],
       [ 93, 383]]), 'accuracy': 0.7675753228120517}
{'k': 1, 'threshold': 0.7, 'confusion_matrix': array([[152,  69],
       [ 93, 383]]), 'accuracy': 0.7675753228120517}
{'k': 3, 'threshold': 0.3, 'confusion_matrix': array([[ 73, 148],
       [ 22, 454]]), 'accuracy': 0.7560975609756098}
{'k': 3, 'threshold': 0.5, 'confusion_matrix': array([[154,  67],
       [ 74, 402]]), 'accuracy': 0.7977044476327116}
{'k': 3, 'threshold': 0.7, 'confusion_matrix': array([[193,  28],
       [186, 290]]), 'accuracy': 0.6929698708751794}
{'k': 5, 'threshold': 0.3, 'confusion_matrix': array([[106, 115],
       [ 29, 447]]), 'accuracy': 0.793400286944046}
{'k': 5, 'threshold': 0.5, 'confusion_matrix': array([[144,  77],
       [ 69, 407]]), 'accuracy': 0.7905308464849354}
{'k': 5, 'threshold': 0.7, 'confusion_matrix': ar

### 7. How well does the model perform? Did the interaction of the adjusted thresholds and k values help the model? Why or why not?

##### From the output, the model’s performance (measured by accuracy and the confusion matrix) changes with both k and the threshold, so tuning these parameters does affect how well the classifier separates the classes. A smaller k (like 1 or 3) makes the model more sensitive to local data points, which can lead to higher variance in predictions, while a larger k (like 7 or 9) smooths things out by averaging over more neighbors. Adjusting the threshold shifts the trade-off between catching more true positives (lower threshold) and avoiding false positives (higher threshold): at k = 3, lowering the threshold from 0.5 to 0.3 cuts false negatives from 74 to 22 but raises false positives from 67 to 148. Among the results shown, k = 3 with a threshold of 0.5 has the highest accuracy (about 0.80), and with k = 1 the threshold makes no difference because every predicted probability is either 0 or 1. If certain k and threshold combinations consistently improved accuracy, that would be a good sign that tuning these parameters is worth it. Here the gains are small and inconsistent, so adding more features or trying a different modeling approach may be a better way to boost performance.
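
One way to compare the combinations is to rank them by accuracy. The sketch below is plain Python and uses only the accuracies visible in the step 6 output, rounded to four decimals; the rows that are cut off in that output are left out, so the ranking covers the visible rows only.

```python
# (k, threshold, accuracy) rows copied from the visible part of the step 6 output
visible_results = [
    {"k": 1, "threshold": 0.3, "accuracy": 0.7676},
    {"k": 1, "threshold": 0.5, "accuracy": 0.7676},
    {"k": 1, "threshold": 0.7, "accuracy": 0.7676},
    {"k": 3, "threshold": 0.3, "accuracy": 0.7561},
    {"k": 3, "threshold": 0.5, "accuracy": 0.7977},
    {"k": 3, "threshold": 0.7, "accuracy": 0.6930},
    {"k": 5, "threshold": 0.3, "accuracy": 0.7934},
    {"k": 5, "threshold": 0.5, "accuracy": 0.7905},
]

# Sort from most to least accurate and report the top combination
ranked = sorted(visible_results, key=lambda r: r["accuracy"], reverse=True)
best = ranked[0]
print(f"best visible combination: k={best['k']}, threshold={best['threshold']}, "
      f"accuracy={best['accuracy']:.4f}")
```

Accuracy alone hides which kind of error changed, so the confusion matrix for the chosen combination is still worth reading alongside this ranking.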

### 8. Choose another variable as the target in the dataset and create another kNN model using the two functions you created in step 6.

In [7]:
#using pell
def data_clean_split_pell(data, control_col, pell_col, features, test_size=0.3, random_state=42):
    data[control_col] = pd.to_numeric(data[control_col], errors='coerce')
    data[pell_col] = pd.to_numeric(data[pell_col], errors='coerce')
    data['inst_type'] = data[control_col].apply(lambda x: 0 if x == 1 else 1)
    median_val = data[pell_col].median()
    data['high_pell'] = (data[pell_col] > median_val).astype(int)
    required_cols = features + ['inst_type', 'high_pell']
    cleaned_data = data.dropna(subset=required_cols)
    X = cleaned_data[features + ['inst_type']]
    y = cleaned_data['high_pell']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled, y_train, y_test

#knn training/testing function (same as above, plus the individual confusion matrix counts)
def knn_train_test(X_train_scaled, X_test_scaled, y_train, y_test, k_values, thresholds):
    results = []
    
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train_scaled, y_train)
        y_prob = knn.predict_proba(X_test_scaled)[:, 1]
        
        for thresh in thresholds:
            y_pred_thresh = (y_prob >= thresh).astype(int)

            cm = confusion_matrix(y_test, y_pred_thresh)
            acc = accuracy_score(y_test, y_pred_thresh)
            tn, fp, fn, tp = cm.ravel()
            results.append({
                'k': k,
                'threshold': thresh,
                'confusion_matrix': cm,
                'accuracy': acc,
                'true_neg': tn,
                'false_pos': fp,
                'false_neg': fn,
                'true_pos': tp
            })
    
    return results

control_col = 'control'
pell_col = 'pell_value'
features = ['student_count', 'endow_value', 'fte_value']
X_train_scaled_pell, X_test_scaled_pell, y_train_pell, y_test_pell = data_clean_split_pell(data, control_col, pell_col, features)

#here we define the lists of k and threshold values used to test
k_list = [1, 3, 5, 7]
threshold_list = [0.3, 0.5, 0.7]
pell_results = knn_train_test(X_train_scaled_pell, X_test_scaled_pell, y_train_pell, y_test_pell, k_list, threshold_list)
for res in pell_results:
    print(res)


{'k': 1, 'threshold': 0.3, 'confusion_matrix': array([[339, 131],
       [122, 105]]), 'accuracy': 0.6370157819225251, 'true_neg': np.int64(339), 'false_pos': np.int64(131), 'false_neg': np.int64(122), 'true_pos': np.int64(105)}
{'k': 1, 'threshold': 0.5, 'confusion_matrix': array([[339, 131],
       [122, 105]]), 'accuracy': 0.6370157819225251, 'true_neg': np.int64(339), 'false_pos': np.int64(131), 'false_neg': np.int64(122), 'true_pos': np.int64(105)}
{'k': 1, 'threshold': 0.7, 'confusion_matrix': array([[339, 131],
       [122, 105]]), 'accuracy': 0.6370157819225251, 'true_neg': np.int64(339), 'false_pos': np.int64(131), 'false_neg': np.int64(122), 'true_pos': np.int64(105)}
{'k': 3, 'threshold': 0.3, 'confusion_matrix': array([[194, 276],
       [ 49, 178]]), 'accuracy': 0.533715925394548, 'true_neg': np.int64(194), 'false_pos': np.int64(276), 'false_neg': np.int64(49), 'true_pos': np.int64(178)}
{'k': 3, 'threshold': 0.5, 'confusion_matrix': array([[360, 110],
       [125, 102]]),