### Preprocessing for KNN

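This step prepares the dataset for a distance-based classifier. Rows containing the missing-value marker "?" are dropped because KNN needs a complete feature vector to compute distances. Categorical columns are one-hot encoded, and numerical columns are standardized so that no single large-scale feature dominates the distance. The encoder and scaler are only defined here; they are fit later, inside the pipeline, on the training data. The data is then split into training and test sets with a stratified 80/20 split, which keeps the class proportions the same in both sets.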
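To illustrate why standardization matters for a distance-based model, the minimal sketch below uses four made-up rows with an age-like column and a capital-gain-like column. The values and the helper `nearest_to_first` are illustrative assumptions, not taken from `adult.csv`. It compares which row is closest to the first row before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age-like value, capital-gain-like value].
X_toy = np.array([
    [20.0,    0.0],   # reference row
    [22.0, 3000.0],   # similar age, moderately different gain
    [60.0,  500.0],   # very different age, similar gain
    [58.0, 9500.0],   # different on both columns
])

def nearest_to_first(data):
    """Return the index of the row closest (Euclidean) to row 0."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nn = nearest_to_first(X_toy)
scaled_nn = nearest_to_first(StandardScaler().fit_transform(X_toy))

print("nearest on raw values:   row", raw_nn)
print("nearest on scaled values: row", scaled_nn)
```

On the raw values the gain column dominates the distance, so the 60-year-old row is picked; after scaling, both columns contribute on a comparable scale and the 22-year-old row becomes the nearest neighbor.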

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
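
# Load the dataset.
df = pd.read_csv("adult.csv")

# Treat "?" as missing and drop every row that contains it.
df_clean = df.replace("?", pd.NA).dropna()

# Separate features from the target; encode >50K as 1 and everything else as 0.
X = df_clean.drop("income", axis=1)
y = (df_clean["income"] == ">50K").astype(int)

# Text columns are treated as categorical, all remaining columns as numeric.
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numeric_cols = X.select_dtypes(exclude=["object"]).columns.tolist()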

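# One-hot encode categoricals and standardize numerics.
# Categories unseen during fitting are encoded as all zeros at predict time.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols)
])

# Stratified 80/20 split: both sets keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)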
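As a quick sanity check of what `stratify` does, the sketch below splits synthetic labels with an assumed 75/25 class ratio (not the real `income` column) and prints the positive-class share in each part. The names `X_demo`, `y_demo`, `y_tr`, and `y_te` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 75 negatives and 25 positives (assumed ratio).
y_demo = np.array([0] * 75 + [1] * 25)
X_demo = np.arange(100).reshape(-1, 1)

# Same split settings as the notebook cell, applied to the synthetic data.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share overall:", y_demo.mean())
print("positive share train:  ", y_tr.mean())
print("positive share test:   ", y_te.mean())
```

The same check can be run on the real `y_train` and `y_test` to confirm that the split preserved the income class ratio.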

### KNN Model Training

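This step builds a pipeline that applies the preprocessing and then fits a K-Nearest Neighbors classifier. K is set to 5 and neighbors are weighted by inverse distance, so closer neighbors count more in the vote. KNN is a lazy learner: fitting stores the transformed training set rather than estimating model parameters, and the real work happens at prediction time. Because the preprocessing sits inside the pipeline, the encoder and scaler are fit on the training data only.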

In [None]:
from sklearn.neighbors import KNeighborsClassifier

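# k = 5 neighbors; each neighbor's vote is weighted by the inverse of its distance.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")

# Chain preprocessing and classifier so both are fit together on the training data.
model = Pipeline([
    ("prep", preprocessor),
    ("knn", knn)
])

model.fit(X_train, y_train)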
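The effect of `weights="distance"` can be seen on a tiny one-dimensional example. The sketch below uses made-up points (`X_toy`, `y_toy`, and `query` are illustrative names) in which the query has one very close neighbor of class 1 and two farther neighbors of class 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four made-up training points on a line, with their class labels.
X_toy = np.array([[0.0], [1.0], [1.1], [5.0]])
y_toy = np.array([1, 0, 0, 0])
query = np.array([[0.1]])  # the 3 nearest neighbors are 0.0, 1.0 and 1.1

# Uniform weights: a plain majority vote among the 3 neighbors.
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
uniform_pred = uniform_knn.fit(X_toy, y_toy).predict(query)[0]

# Distance weights: each vote counts 1 / distance, favoring the closest point.
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
distance_pred = distance_knn.fit(X_toy, y_toy).predict(query)[0]

print("uniform vote: ", uniform_pred)
print("distance vote:", distance_pred)
```

With uniform weights the two class-0 neighbors outvote the single class-1 neighbor; with distance weights the neighbor at distance 0.1 outweighs both of them, so the prediction flips to class 1.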

### KNN Model Evaluation

This step evaluates the trained KNN model on the held-out test set. The reported metrics are accuracy, precision, recall, and F1 score; the last three are computed for the positive class, which is >50K (encoded as 1). A confusion matrix is plotted to show how predictions are distributed across the two income classes.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import seaborn as sns
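import matplotlib.pyplot as plt

# Predict on the held-out test set.
y_pred = model.predict(X_test)

# Precision, recall and F1 use class 1 (>50K) as the positive class by default.
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)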

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1 Score:", f1)

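# Rows are actual classes, columns are predicted classes (0 = <=50K, 1 = >50K).
cm = confusion_matrix(y_test, y_pred)
class_names = ["<=50K", ">50K"]

plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)
plt.title("KNN Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()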
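For reference, precision, recall, and F1 can all be read off the confusion matrix. The sketch below does this by hand on a small made-up set of labels (`y_true_demo` and `y_pred_demo` are illustrative, not notebook results) and compares the values with scikit-learn's functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print("manual :", precision, recall, f1_manual)
print("sklearn:",
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

The same unpacking applies to the notebook's `cm`, where the bottom row holds the actual >50K cases.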

### KNN Model Report

**Current Results**
- Accuracy: 0.83
- Precision: 0.67
- Recall: 0.62
- F1 Score: 0.65
- The confusion matrix shows strong performance on the <=50K class and weaker recall on the >50K class: at a recall of 0.62, roughly 38% of actual >50K cases are predicted as <=50K.

**Fit Assessment**
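- KNN shows moderate performance overall and is noticeably weaker on the >50K class. This may indicate underfitting on that class, but the test metrics alone do not confirm it; with `weights="distance"`, training accuracy is near-perfect by construction, so a cross-validated comparison across k values is the more reliable check.
- One-hot encoding expands the categorical columns into many sparse binary features, and distances become less discriminative as dimensionality grows, which limits KNN effectiveness.
- Imbalance in the target distribution means most neighbors of a borderline point tend to belong to the majority <=50K class, which lowers recall for the >50K class.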
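The growth in dimensionality from one-hot encoding can be measured directly. The sketch below applies the same kind of `ColumnTransformer` to a four-row toy frame; the column names mirror typical Adult columns, but the rows and the names `toy` and `prep` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: two categorical columns with 3 distinct values each, one numeric column.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
    "age": [25, 38, 52, 41],
})

prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass", "education"]),
    ("num", StandardScaler(), ["age"]),
])

encoded = prep.fit_transform(toy)
n_in, n_out = toy.shape[1], encoded.shape[1]
print(f"{n_in} input columns -> {n_out} encoded features")
```

On the fitted notebook pipeline, the equivalent check would be the number of columns returned by `model.named_steps["prep"].transform(X_train)`.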

**Next Steps**
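- Tune KNN hyperparameters (number of neighbors k, weighting scheme, distance metric) with cross-validation on the training set.
- Compare against other model families such as Logistic Regression, Random Forest, XGBoost, and SVM.
- Consider alternative handling of missing values (for example, imputation instead of dropping rows) or feature selection to reduce dimensionality.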
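A minimal sketch of the tuning step, assuming `GridSearchCV` with F1 as the selection metric. It runs on synthetic data from `make_classification` so that it is self-contained; the grid values, the 75/25 class weights, and the names `X_syn`, `y_syn`, and `pipe` are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the real data.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=42,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate settings; the "knn__" prefix routes each one to the pipeline step.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",  # F1 on the positive class, to account for imbalance
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_syn, y_syn)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", round(search.best_score_, 3))
```

For the notebook's own data, the same search could wrap the existing `model` pipeline and be fit on `X_train` and `y_train`, leaving the test set untouched until the final evaluation.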