## KNN Modeling: Further Data Preprocessing

We implemented a KNN-based recommender in which the user selects a reference county and the algorithm recommends the *k* counties with the smallest Euclidean distance to it across the features:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

We also considered and tested other metrics such as cosine similarity, but concluded that directionality matters less than overall proximity for this particular problem. The data is continuous and densely packed, so the magnitude of each feature vector should be captured, not just its direction.
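
To make the distinction concrete, here is a minimal sketch with two made-up feature vectors (hypothetical values, not rows from our data set) that point in the same direction but differ in magnitude. Cosine similarity treats them as identical, while Euclidean distance keeps them apart.

```python
import math

def euclidean(p, q):
    """Euclidean distance, matching d(p, q) above."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine_similarity(p, q):
    """Cosine of the angle between p and q (1.0 means same direction)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi ** 2 for pi in p))
    norm_q = math.sqrt(sum(qi ** 2 for qi in q))
    return dot / (norm_p * norm_q)

# Hypothetical normalized vectors: same direction, different magnitude.
low = [0.1, 0.2, 0.2]
high = [0.4, 0.8, 0.8]

print(f"cosine similarity:  {cosine_similarity(low, high):.3f}")
print(f"euclidean distance: {euclidean(low, high):.3f}")
```

A county that scores uniformly low and one that scores uniformly high would look interchangeable under cosine similarity, which is not what a relocation recommender should conclude.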

Because Euclidean distance sums squared differences across features, normalizing the data is crucial: otherwise a feature with a large numeric range (e.g. `Median_Income`) would contribute far more than one with a small range. We applied a min-max scaler to map each feature to the range $[0,1]$:

$$
x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$
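
As a quick illustration of the formula, the sketch below rescales a small made-up column by hand. The helper name `min_max_scale` and the income values are hypothetical and are not taken from our data set.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using the formula above."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical median incomes in dollars
incomes = [30_000, 45_000, 60_000, 90_000]
scaled = min_max_scale(incomes)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.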


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalizing the data
numeric_cols = ["Poverty_Percent","Bachelor_Or_Higher","Unemployment_Rate","Median_Income","Avg_Temp","Avg_Precipitation","Crime_Rate_Per_100000","Walkability","Population_Density"]
clean_data_norm = combined_data.copy()
clean_data_norm.drop(columns=["Population_Estimate","Land_Area"], inplace=True)
clean_data_norm[numeric_cols] = MinMaxScaler().fit_transform(combined_data[numeric_cols])
print(clean_data_norm.head())

## County-to-County Recommender

After normalizing the data, we are ready to build and run the model. We set `metric = "euclidean"` for the reasons given earlier and `algorithm = "brute"`, an exhaustive search that computes the distance between the input and every vector in the data set. This simple approach does not degrade as dimensionality grows the way tree-based indexes can, and it is affordable given the relatively small size of our data set (~3000 counties). This first system has the benefit of being intuitive (it finds the most numerically similar counties) and easy to implement.
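
To illustrate what the exhaustive search does, here is a minimal standard-library sketch. It is not scikit-learn's actual implementation; the function name `brute_force_neighbors` and the toy rows are hypothetical. It also shows why we request one extra neighbor in the cell that follows: when the query is itself a row of the data set, it comes back first at distance 0.

```python
import math

def brute_force_neighbors(data, query, n_neighbors):
    """Return (distance, row_index) pairs for the n_neighbors closest rows."""
    scored = [(math.dist(row, query), idx) for idx, row in enumerate(data)]
    scored.sort()  # ascending by distance, ties broken by row index
    return scored[:n_neighbors]

# Toy normalized rows standing in for counties (hypothetical values)
toy_data = [
    [0.10, 0.20],
    [0.15, 0.25],
    [0.90, 0.80],
    [0.12, 0.18],
]

query_idx = 0
neighbors = brute_force_neighbors(toy_data, toy_data[query_idx], n_neighbors=3)

# The first hit is the query row itself, so drop it before recommending.
recommendations = [idx for _, idx in neighbors if idx != query_idx]
print(recommendations)
```

Every query costs one distance computation per row, which is why this approach only makes sense for data sets of modest size.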

In [None]:

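import warnings

import numpy as np  # needed below for np.random
from sklearn.neighbors import NearestNeighbors

# Silence warnings so they do not clutter the notebook output
warnings.filterwarnings('ignore')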


# Fitting the KNN model
data_matrix = clean_data_norm[numeric_cols]
model_knn = NearestNeighbors(metric = "euclidean", algorithm = "brute")
model_knn.fit(data_matrix)

# Testing the model to return top 5 county recommendations (n_neighbors = 6 but we disregard the first recommendation which is the input county itself).
np.random.seed(1)
query_no = np.random.choice(clean_data_norm.shape[0]) # random county index
print(f"We will find recommendations for the county {clean_data_norm.iloc[query_no]['area_name'].title()}, {clean_data_norm.iloc[query_no]['state']}.")
distances, indices = model_knn.kneighbors(data_matrix.iloc[query_no, :].values.reshape(1, -1), n_neighbors=6)

While useful, this model clearly lacks flexibility. Our pure similarity model treats all features as equally important, yet different users may prioritize certain attributes over others (e.g. a retired person may not care about economic indicators like `Median_Income` or `Unemployment_Rate`). In addition, users may not always have a reference county in mind.

## User Preference-based Recommender

### Creating User Embeddings

To address this weakness, we extended our system into a **weighted** KNN model that incorporates user-defined preferences over the different features. Through UI-based survey questions, users describe their ideal county by rating how much they want a higher value of each feature on a scale of 1-5. These scores are mapped back to their corresponding features and linearly transformed to $[0,1]$, the same normalized space as the data set:

| Original Score (1–5) | Normalized (0–1) |
|----------------------|------------------|
| 1                    | 0.00             |
| 2                    | 0.25             |
| 3                    | 0.50             |
| 4                    | 0.75             |
| 5                    | 1.00             |


In doing so, we construct a user embedding that can be thought of as a new county vector tailored to the user's interests. It requires no reference county, offers much more personalization and flexibility, and still integrates smoothly into the existing KNN model.
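
The score mapping can be sketched as a small helper. The name `score_to_unit` and the `lower_is_better` flag are illustrative only; the flag mirrors the inversion (`6 - pref`) that the implementation further down applies to features such as `Poverty_Percent`, where agreeing with the survey statement means wanting a *lower* value.

```python
def score_to_unit(score, lower_is_better=False):
    """Map a 1-5 survey answer to [0, 1]; optionally flip it first."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if lower_is_better:
        score = 6 - score  # 5 ("strongly agree") becomes 1, i.e. target 0.0
    return (score - 1) / 4

# Reproduces the table above
print([score_to_unit(s) for s in range(1, 6)])

# Strong agreement with a "lower poverty rate" statement targets 0.0
print(score_to_unit(5, lower_is_better=True))
```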

Furthermore, users provide weights through similar survey questions asking the importance of each feature on a scale of 1-5, and the answers are stored internally as a weight vector.

### Adding Weights

The importance scores (1-5) are stored internally as a weight vector with one entry per feature, which lets us scale each feature's contribution to the Euclidean distance by its user-defined weight. Without a practical method of evaluating our models, we settled on a simple, admittedly arbitrary weighted Euclidean distance function:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} w_i (p_i - q_i)^2}
$$

where $w_i$ is also linearly scaled to $[0,1]$. Higher-priority features (high $w_i$) contribute more to the summed distance, allowing users to place an emphasis on certain features. To incorporate this into the model, we simply multiply both the user embedding and the entire data set by the square root of the weight vector, which is mathematically equivalent because $(\sqrt{w_i}\,p_i - \sqrt{w_i}\,q_i)^2 = w_i (p_i - q_i)^2$. The full implementation follows:
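
As a sanity check on the scaling trick, the sketch below compares two computations: the distance $\sqrt{\sum_i w_i (p_i - q_i)^2}$ evaluated directly, and the plain Euclidean distance after multiplying both vectors by the square root of the weights. The vectors, weights, and helper names are hypothetical.

```python
import math

def weighted_distance(p, q, w):
    """sqrt(sum_i w_i * (p_i - q_i)^2), computed directly."""
    return math.sqrt(sum(wi * (pi - qi) ** 2 for pi, qi, wi in zip(p, q, w)))

def scale(v, w):
    """Multiply each coordinate by the square root of its weight."""
    return [vi * math.sqrt(wi) for vi, wi in zip(v, w)]

# Hypothetical normalized vectors and weights
county_vec = [0.2, 0.9, 0.5]
user_vec = [0.6, 0.1, 0.5]
weights = [1.0, 0.25, 0.0]

direct = weighted_distance(county_vec, user_vec, weights)
via_scaling = math.dist(scale(county_vec, weights), scale(user_vec, weights))
print(direct, via_scaling)
```

Note that an importance score of 1 maps to a weight of 0, so that feature drops out of the distance entirely.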

In [None]:
# Builds the user embedding and weight vector
def get_user_preferences():
    # Map columns to user-friendly questions
    features = {
        "Poverty_Percent": ("I prefer living in an area with a lower poverty rate.", "poverty rate"),
        "Bachelor_Or_Higher": ("I prefer living in areas with highly educated populations", "higher education"),
        "Unemployment_Rate": ("I prefer living in regions with lower unemployment rates", "unemployment rate"),
        "Median_Income": ("I prefer living in generally more affluent areas", "median income"),
        "Avg_Temp": ("I prefer warmer climates", "temperature"),
        "Avg_Precipitation": ("I prefer seeing less rain and less snow", "rain/snow"),
        "Crime_Rate_Per_100000": ("I prefer living in an area with a lower crime rate", "crime rate"), 
        "Walkability": ("I prefer walking over other modes of transportation", "walkability"),
        "Population_Density": ("I prefer living in more densely populated regions", "urban lifestyle")
    }

    pref_values = []
    importance_values = []

    for col, (pref_question, feature) in features.items():
        while True:
            try:
                # Preference question
                pref = int(input(f"On a scale of 1-5, how much do you agree with the statement: {pref_question} "))
                if 1 <= pref <= 5:
                    # We have to invert to make sense with some of our questions.
                    if col in ["Poverty_Percent","Unemployment_Rate","Avg_Precipitation","Crime_Rate_Per_100000"]:
                        pref = 6 - pref
                    break
                else:
                    print("Please enter a number from 1 to 5.")
            except ValueError:
                print("Invalid input. Please enter an integer.")
        
        while True:
            try:
                # Importance question
                importance = int(input(f"On a scale of 1-5, how important is {feature} when choosing a place to live? "))
                if 1 <= importance <= 5:
                    break
                else:
                    print("Please enter a number from 1 to 5.")
            except ValueError:
                print("Invalid input. Please enter an integer.")
        
        pref_values.append(pref)
        importance_values.append(importance)

    # Normalize both to [0,1]: 1 maps to 0, 5 maps to 1
    user_embedding = (np.array(pref_values) - 1) / 4
    weight_vector = (np.array(importance_values) - 1) / 4

    # Show output
    print("\nUser Target Embedding (1 = wanting more of):")
    for (col, _), val in zip(features.items(), user_embedding):
        print(f"{col:30}: {val:.3f}")

    print("\nWeight Vector (how much each feature matters):")
    for (col, _), val in zip(features.items(), weight_vector):
        print(f"{col:30}: {val:.3f}")

    return user_embedding, weight_vector

# Feeds user embedding and weight vector into model.
def get_user_based_recommendations(target_embedding, weight_vector, n_neighbors=5):
    # Scaling every feature by sqrt(w_i) makes plain Euclidean distance on the
    # scaled vectors equal to sqrt(sum_i w_i * (p_i - q_i)^2).
    sqrt_weights = np.sqrt(weight_vector)
    weighted_data = data_matrix * sqrt_weights  # broadcast across rows
    user_embedding_weighted = target_embedding * sqrt_weights

    # Fit a new KNN model on the weighted data
    model_knn = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
    model_knn.fit(weighted_data.values)
    distances, indices = model_knn.kneighbors(user_embedding_weighted.reshape(1, -1))

    # The user embedding is not a row of the data set, so (unlike the
    # county-to-county query) the nearest neighbor is a real recommendation
    # and must not be skipped.
    # `county` is the unnormalized DataFrame; we assume its rows are in the
    # same order as data_matrix, so positional lookup with .iloc is valid.
    rows = county.iloc[indices.flatten()]
    recommendation = pd.DataFrame({
        "No": range(1, len(rows) + 1),
        "County Name": rows["area_name"].str.title().to_numpy(),
        "State": rows["state"].to_numpy(),
        "Distance": distances.flatten(),
        "Population Estimate": rows["Population_Estimate"].to_numpy(),
        "Poverty Percent": rows["Poverty_Percent"].to_numpy(),
        "Bachelor's Degree or Higher": rows["Bachelor_Or_Higher"].to_numpy(),
        "Unemployment Rate (%)": rows["Unemployment_Rate"].to_numpy(),
        "Crime Rate per 100,000": rows["Crime_Rate_Per_100000"].to_numpy(),
        "Median Income": rows["Median_Income"].to_numpy(),
        "Walkability Index": rows["Walkability"].to_numpy(),
    }).set_index("No")
    return recommendation.style.set_properties(**{"background-color": "white", "color": "black", "border": "1.5px solid black"})

def get_random_embeddings():
    # Simulate a user giving 1-5 responses, one per feature
    random_prefs = np.random.randint(1, 6, size=len(numeric_cols))
    random_importance = np.random.randint(1, 6, size=len(numeric_cols))

    # Normalize to [0,1]
    user_embedding = (random_prefs - 1) / 4
    weight_vector = (random_importance - 1) / 4

    print("\nUser Target Embedding (1 = want more of):")
    for col, val in zip(numeric_cols, user_embedding):
        print(f"{col:30}: {val:.3f}")

    print("\nWeight Vector (how much each feature matters):")
    for col, val in zip(numeric_cols, weight_vector):
        print(f"{col:30}: {val:.3f}")

    return user_embedding, weight_vector

user_embedding, weight_vector = get_random_embeddings()
get_user_based_recommendations(user_embedding, weight_vector)