In [1]:
#YC95
# HO2

QN 7.1

Calculating Distance with Categorical Predictors. This exercise with a tiny
dataset illustrates the calculation of Euclidean distance, and the creation of binary
dummies. The online education company Statistics.com segments its customers and
prospects into three main categories: IT professionals (IT), statisticians (Stat), and other
(Other). It also tracks, for each customer, the number of years since first contact
(years). Consider the following customers; information about whether they have taken
a course or not (the outcome to be predicted) is included:

Customer 1: Stat, 1 year, did not take course

Customer 2: Other, 1.1 year, took course


a. Consider now the following new prospect:

Prospect 1: IT, 1 year

Using the above information on the two customers and one prospect, create one
dataset for all three with the categorical predictor variable transformed into 2 binaries,
and a similar dataset with the categorical predictor variable transformed into 3
binaries.

In [2]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

In [3]:
# a. Creating the datasets
# Dataset with customers and prospect
data = {
    'Profession': ['Stat', 'Other', 'IT'],
    'Years': [1, 1.1, 1],
    'TookCourse': [0, 1, None]  # None for the prospect as we don't know if they took the course
}

In [4]:
df = pd.DataFrame(data)

In [5]:
df

  Profession  Years  TookCourse
0       Stat    1.0         0.0
1      Other    1.1         1.0
2         IT    1.0         NaN


In [6]:
# Two binaries
df['Is_Stat'] = (df['Profession'] == 'Stat').astype(int)
df['Is_Other'] = (df['Profession'] == 'Other').astype(int)


In [7]:
# Three binaries
df['Is_IT'] = (df['Profession'] == 'IT').astype(int)


In [8]:
print("Dataset with 2 binaries:\n", df[['Is_Stat', 'Is_Other', 'Years']])
print("\nDataset with 3 binaries:\n", df[['Is_IT', 'Is_Stat', 'Is_Other', 'Years']])

Dataset with 2 binaries:
    Is_Stat  Is_Other  Years
0        1         0    1.0
1        0         1    1.1
2        0         0    1.0

Dataset with 3 binaries:
    Is_IT  Is_Stat  Is_Other  Years
0      0        1         0    1.0
1      0        0         1    1.1
2      1        0         0    1.0


In [11]:
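# Transforming the 'Profession' categorical variable into binaries.
# Encode only the original columns (not the Is_* columns added above) and keep
# 'TookCourse' as the last column, which the later cells treat as the outcome.
features = df[['Profession', 'Years']]
df_with_2_binaries = pd.get_dummies(features, columns=['Profession'], drop_first=True, dtype=float).assign(TookCourse=df['TookCourse'])  # Two binaries
df_with_3_binaries = pd.get_dummies(features, columns=['Profession'], dtype=float).assign(TookCourse=df['TookCourse'])  # Three binaries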


b. For each derived dataset, calculate the Euclidean distance between the prospect and
each of the other two customers. (Note: While it is typical to normalize data for k-NN,
this is not an iron-clad rule and you may proceed here without normalization.)
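
The distances asked for here can also be worked by hand. The derivation below is a sketch that assumes the column orders (Is_Stat, Is_Other, Years) and (Is_IT, Is_Stat, Is_Other, Years) from part a, with IT as the all-zeros reference level in the two-dummy case. The symbols P_1, C_1, and C_2 are shorthand introduced here for Prospect 1, Customer 1, and Customer 2.

```latex
% Euclidean distance between feature vectors x and y
d(x, y) = \sqrt{\sum_{j} (x_j - y_j)^2}

% Two dummies, columns (Is_Stat, Is_Other, Years):
% P_1 = (0, 0, 1), C_1 = (1, 0, 1), C_2 = (0, 1, 1.1)
d(P_1, C_1) = \sqrt{1^2 + 0^2 + 0^2} = 1
d(P_1, C_2) = \sqrt{0^2 + 1^2 + 0.1^2} = \sqrt{1.01} \approx 1.005

% Three dummies, columns (Is_IT, Is_Stat, Is_Other, Years):
% P_1 = (1, 0, 0, 1), C_1 = (0, 1, 0, 1), C_2 = (0, 0, 1, 1.1)
d(P_1, C_1) = \sqrt{1^2 + 1^2 + 0^2 + 0^2} = \sqrt{2} \approx 1.414
d(P_1, C_2) = \sqrt{1^2 + 0^2 + 1^2 + 0.1^2} = \sqrt{2.01} \approx 1.418
```

In both encodings Customer 1 is slightly closer to the prospect than Customer 2, and the gap comes entirely from the 0.1-year difference in Years.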

In [9]:
# b. Calculating Euclidean distances
def calculate_euclidean_distances(dataset, prospect_index):
    # Use only the predictor columns; drop the outcome by name, not by position
    features = dataset.drop(columns='TookCourse').astype(float)
    prospect = features.iloc[prospect_index].to_numpy()
    distances = []
    for i, row in features.iterrows():
        if i != prospect_index:
            # Euclidean distance between this customer and the prospect
            distance = np.linalg.norm(row.to_numpy() - prospect)
            distances.append((i, distance))
    return distances

In [12]:
distances_2_binaries = calculate_euclidean_distances(df_with_2_binaries, 2)
distances_3_binaries = calculate_euclidean_distances(df_with_3_binaries, 2)


In [14]:
print("Distances with 2 binaries:", [(i, round(float(d), 3)) for i, d in distances_2_binaries])
print("Distances with 3 binaries:", [(i, round(float(d), 3)) for i, d in distances_3_binaries])

Distances with 2 binaries: [(0, 1.0), (1, 1.005)]
Distances with 3 binaries: [(0, 1.414), (1, 1.418)]


c. Using k-NN with k = 1, classify the prospect as taking or not taking a course
using each of the two derived datasets. Does it make a difference whether you use
two or three dummies?
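
As a cross-check on the k = 1 classification, the sketch below fits scikit-learn's `KNeighborsClassifier` on hand-written feature rows. The array names, the helper `predict_1nn`, and the column orders are assumptions made for illustration; the classifier's default metric is Euclidean, which matches part b.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hand-written feature rows; the column order is an assumption of this sketch.
# Two dummies: [Is_Stat, Is_Other, Years]; IT is the all-zeros reference level.
X_two = np.array([[1, 0, 1.0],    # Customer 1: Stat, 1 year
                  [0, 1, 1.1]])   # Customer 2: Other, 1.1 years
# Three dummies: [Is_IT, Is_Stat, Is_Other, Years]
X_three = np.array([[0, 1, 0, 1.0],
                    [0, 0, 1, 1.1]])
y = np.array([0, 1])  # 0 = did not take course, 1 = took course

prospect_two = np.array([[0, 0, 1.0]])       # Prospect 1: IT, 1 year
prospect_three = np.array([[1, 0, 0, 1.0]])

def predict_1nn(X, y, x_new):
    """Fit a 1-nearest-neighbor classifier and predict the class of x_new."""
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X, y)
    return int(model.predict(x_new)[0])

pred_two = predict_1nn(X_two, y, prospect_two)
pred_three = predict_1nn(X_three, y, prospect_three)
print("2 dummies:", pred_two)
print("3 dummies:", pred_three)
```

Under these assumptions both encodings pick Customer 1 (Stat, did not take course) as the nearest neighbor, so the prediction is the same. Here both customers differ from the prospect in profession, so the category mismatch adds the same amount to each squared distance (1 with two dummies, 2 with three) and the 0.1-year gap in Years decides. That need not hold in general: with two dummies, a row at the reference level is 1 unit from any other category, while two non-reference categories are sqrt(2) apart.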

In [15]:
# c. Classifying the prospect
def classify_with_knn(dataset, distances):
    # With k = 1 the prospect takes the class of its single nearest neighbor
    nearest_neighbor_index = min(distances, key=lambda x: x[1])[0]
    # Look up the outcome by column name rather than by position
    return dataset.loc[nearest_neighbor_index, 'TookCourse']

In [16]:
classification_2_binaries = classify_with_knn(df_with_2_binaries, distances_2_binaries)
classification_3_binaries = classify_with_knn(df_with_3_binaries, distances_3_binaries)

In [17]:
print("Classification with 2 binaries:", "Took course" if classification_2_binaries else "Did not take course")
print("Classification with 3 binaries:", "Took course" if classification_3_binaries else "Did not take course")

Classification with 2 binaries: Did not take course
Classification with 3 binaries: Did not take course


In [18]:
# HO1