# Wine Classifier with KNN

Train a K-Nearest Neighbors (KNN) model to predict the quality of a red wine from its chemical properties. Could an AI help you pick a wine worthy of a sommelier?

We will use the following red wine dataset, taken from the Wine Quality Data Set (UCI):

https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/refs/heads/main/winequality-red.csv

### Column description

Each row represents one wine. The columns describe its chemical composition:
* fixed acidity, volatile acidity, citric acid
* residual sugar, chlorides
* free sulfur dioxide, total sulfur dioxide
* density, pH, sulphates, alcohol

The target column is label:
* 0 = Low quality
* 1 = Medium quality
* 2 = High quality


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
url = 'https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/refs/heads/main/winequality-red.csv'
df = pd.read_csv(url, sep=';')

In [3]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
# feature (independent variable) columns and the target column
variables_ind = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides",
    "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol"
]
variable_obj = "label"

# check that every expected column exists in df
missing = [c for c in variables_ind + [variable_obj] if c not in df.columns]
if missing:
    print(f"Columns missing from df: {missing}")

Columns missing from df: ['label']


In [5]:
# OK, that is expected: label is the column we have to predict

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [7]:
X = df[variables_ind].copy()

# train test split
X_train, X_test = train_test_split(
    X, test_size=0.2, random_state=24
)

In [8]:
# scaling: fit on train only, then apply the same transform to test
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)
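
As a minimal sketch of what the scaling cell does, the example below fits a `StandardScaler` on a training split and checks that the test split is standardized with the training mean and standard deviation. The data is synthetic and the names `X_demo` and `demo_scaler` are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two features on very
# different scales, loosely resembling alcohol and chlorides.
rng = np.random.default_rng(24)
X_demo = np.column_stack([
    rng.normal(10.0, 1.0, size=200),
    rng.normal(0.08, 0.02, size=200),
])
X_demo_train, X_demo_test = X_demo[:160], X_demo[160:]

# Learn mean and standard deviation from the training split only.
demo_scaler = StandardScaler().fit(X_demo_train)
X_demo_train_s = demo_scaler.transform(X_demo_train)
X_demo_test_s = demo_scaler.transform(X_demo_test)

# The test split is standardized with the *training* statistics.
manual_test_s = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)

print(X_demo_train_s.mean(axis=0).round(6), X_demo_train_s.std(axis=0).round(6))
print(np.allclose(X_demo_test_s, manual_test_s))
```

After scaling, both features contribute comparably to the Euclidean distances that KMeans and KNN rely on.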

In [9]:
# labels from KMeans (3 labels, i.e. n_clusters=3)
kmeans = KMeans(n_clusters=3, n_init=20, random_state=24)
kmeans.fit(X_train_s)

In [10]:
# train labels (not ordered by quality yet)
y_train_raw = kmeans.labels_

In [11]:
# to get test labels, predict with the KMeans model fitted on train
y_test_raw = kmeans.predict(X_test_s)
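
The sketch below illustrates what `kmeans.predict` does for unseen rows: each point is assigned to the closest centroid learned during `fit`. It runs on synthetic, well-separated blobs rather than on the wine data, and `demo_kmeans`, `X_fit` and `X_new` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, well-separated blobs (an assumption for illustration).
rng = np.random.default_rng(24)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X_fit = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in centers])
X_new = np.vstack([c + rng.normal(0.0, 0.5, size=(10, 2)) for c in centers])

demo_kmeans = KMeans(n_clusters=3, n_init=20, random_state=24).fit(X_fit)

# Euclidean distance from every new point to every fitted centroid.
dists = np.linalg.norm(
    X_new[:, None, :] - demo_kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

# predict() should agree with the nearest-centroid assignment.
print(np.array_equal(demo_kmeans.predict(X_new), nearest))
```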

### Rule: quality rises with alcohol, citric acid and sulphates; it falls with volatile acidity and chlorides.
* Alcohol: In general, a higher alcohol content is associated with a more complete fermentation and a rounder flavor, which tends to correlate with better perceived quality.
* Citric acid: Adds freshness and a fruitier profile; moderate to high levels tend to appear in better-rated wines.
* Sulphates: Act as preservatives and enhance flavor; in adequate amounts they tend to correlate positively with quality.
* Volatile acidity: High levels produce unpleasant, vinegar-like aromas and flavors, so it tends to correlate negatively with quality.
* Chlorides: Represent salt content; high levels can be perceived as a defect and lower the quality.


In [12]:
# cols_up: higher values suggest higher quality; cols_down: higher values suggest lower quality
cols_up   = ["alcohol", "citric acid", "sulphates"]
cols_down = ["volatile acidity", "chlorides"]

In [None]:
# attach the KMeans cluster of each row to the training features
tmp = pd.DataFrame(X_train, columns=variables_ind)
tmp["cluster"] = kmeans.labels_
tmp

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 901 | 7.4 | 0.635 | 0.10 | 2.4 | 0.080 | 16.0 | 33.0 | 0.99736 | 3.58 | 0.69 | 10.80 | 1 |
| 1497 | 6.9 | 0.740 | 0.03 | 2.3 | 0.054 | 7.0 | 16.0 | 0.99508 | 3.45 | 0.63 | 11.50 | 1 |
| 203 | 7.0 | 0.420 | 0.35 | 1.6 | 0.088 | 16.0 | 39.0 | 0.99610 | 3.34 | 0.55 | 9.20 | 1 |
| 292 | 10.4 | 0.550 | 0.23 | 2.7 | 0.091 | 18.0 | 48.0 | 0.99940 | 3.22 | 0.64 | 10.30 | 2 |
| 1542 | 6.7 | 0.855 | 0.02 | 1.9 | 0.064 | 29.0 | 38.0 | 0.99472 | 3.30 | 0.56 | 10.75 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1425 | 8.3 | 0.260 | 0.37 | 1.4 | 0.076 | 8.0 | 23.0 | 0.99740 | 3.26 | 0.70 | 9.60 | 2 |
| 343 | 10.9 | 0.390 | 0.47 | 1.8 | 0.118 | 6.0 | 14.0 | 0.99820 | 3.30 | 0.75 | 9.80 | 2 |
| 192 | 6.8 | 0.630 | 0.12 | 3.8 | 0.099 | 16.0 | 126.0 | 0.99690 | 3.28 | 0.61 | 9.50 | 0 |
| 899 | 8.3 | 1.020 | 0.02 | 3.4 | 0.084 | 6.0 | 11.0 | 0.99892 | 3.48 | 0.49 | 11.00 | 1 |


In [23]:
# per-cluster means of the rule columns, used next to build a "quality score"
means = tmp.groupby("cluster")[cols_up + cols_down].mean()
means

| cluster | alcohol | citric acid | sulphates | volatile acidity | chlorides |
|---|---|---|---|---|---|
| 0 | 9.867489 | 0.290428 | 0.621151 | 0.534589 | 0.08723 |
| 1 | 10.454819 | 0.122358 | 0.611842 | 0.608494 | 0.078954 |
| 2 | 10.748054 | 0.467944 | 0.756523 | 0.405 | 0.098777 |


In [25]:
# cluster 2 fits the rule on every column except chlorides
# note: the means are in raw units, so alcohol (around 10) dominates this sum
quality_score = means[cols_up].sum(axis=1) - means[cols_down].sum(axis=1)
quality_score

cluster
0    10.157249
1    10.501571
2    11.468745
dtype: float64
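
Because the score sums means in raw units, alcohol carries most of the weight. One possible variation, sketched here and not part of the notebook's method, is to standardize each column of the means table across clusters before combining them. The values are copied from the table of cluster means, and `means_demo` and `quality_score_z` are hypothetical names.

```python
import pandas as pd

cols_up = ["alcohol", "citric acid", "sulphates"]
cols_down = ["volatile acidity", "chlorides"]

# Cluster means copied from the table of per-cluster averages.
means_demo = pd.DataFrame(
    {
        "alcohol": [9.867489, 10.454819, 10.748054],
        "citric acid": [0.290428, 0.122358, 0.467944],
        "sulphates": [0.621151, 0.611842, 0.756523],
        "volatile acidity": [0.534589, 0.608494, 0.405],
        "chlorides": [0.08723, 0.078954, 0.098777],
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Standardize each column across the three clusters so that every
# feature contributes on a comparable scale.
z = (means_demo - means_demo.mean()) / means_demo.std(ddof=0)
quality_score_z = z[cols_up].sum(axis=1) - z[cols_down].sum(axis=1)

print(quality_score_z.sort_values())
```

With these means the clusters keep the same order (0, then 1, then 2), so the standardized score supports the same mapping while weighting the five columns evenly.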

In [27]:
# order the clusters from lowest to highest score
cluster_to_label = {cl:i for i, cl in enumerate(quality_score.sort_values().index)}
cluster_to_label

{0: 0, 1: 1, 2: 2}

In [28]:
y_train = np.vectorize(cluster_to_label.get)(kmeans.labels_)
y_test  = np.vectorize(cluster_to_label.get)(kmeans.predict(X_test_s))
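
In the run above the mapping turned out to be the identity, which hides what it does. The toy example below, with hypothetical scores in `demo_score`, shows the same two steps (ranking clusters by score, then relabeling an array of cluster ids) when the order actually changes.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for three clusters, chosen so that the mapping
# is not the identity.
demo_score = pd.Series({0: 11.2, 1: 9.8, 2: 10.5})

# Rank clusters from lowest to highest score: the lowest gets label 0.
demo_mapping = {cl: i for i, cl in enumerate(demo_score.sort_values().index)}

# Translate raw cluster ids into ordered labels, element by element.
raw_clusters = np.array([0, 1, 2, 1, 0])
ordered = np.vectorize(demo_mapping.get)(raw_clusters)

print(demo_mapping)
print(ordered)
```

Here cluster 1 has the lowest score and becomes label 0, cluster 2 becomes label 1, and cluster 0 becomes label 2.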

In [15]:
# search for k (1..20)
# note: accuracy is measured on the test set, so best_k is tuned on test data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ks = range(1, 20 + 1)
models = [KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_train_s, y_train) for k in ks]
accs = [accuracy_score(y_test, m.predict(X_test_s)) for m in models]

best_idx = int(np.argmax(accs))
best_k = list(ks)[best_idx]
best_model = models[best_idx]
print(f"Best k: {best_k} | Accuracy: {accs[best_idx]:.4f}")

Best k: 7 | Accuracy: 0.9594
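
The search above picks k by its accuracy on the test set. A common alternative, sketched here on synthetic data and not taken from the notebook, is to choose k by cross-validation on the training split and score the test split once at the end. The dataset and the names `pipe`, `search` and `cv_best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 11 features (an assumption for illustration).
X_demo, y_demo = make_blobs(
    n_samples=600, centers=3, n_features=11, cluster_std=3.0, random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=24, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# only its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(weights="distance"))
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

cv_best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_acc = search.score(X_te, y_te)  # test set is used once, after k is chosen
print(f"Best k (CV): {cv_best_k} | Test accuracy: {test_acc:.4f}")
```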


In [30]:
# minimal evaluation
y_pred = best_model.predict(X_test_s)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=3))

Confusion matrix:
 [[ 64   3   0]
 [  3 137   3]
 [  2   2 106]]

Classification report:
               precision    recall  f1-score   support

           0      0.928     0.955     0.941        67
           1      0.965     0.958     0.961       143
           2      0.972     0.964     0.968       110

    accuracy                          0.959       320
   macro avg      0.955     0.959     0.957       320
weighted avg      0.960     0.959     0.959       320
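
To make the link between the two outputs explicit, this sketch recomputes the per-class precision, recall and F1 directly from the confusion matrix printed above, using plain NumPy. The variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true label, columns: predicted label).
cm = np.array([
    [64, 3, 0],
    [3, 137, 3],
    [2, 2, 106],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as that class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(precision.round(3), recall.round(3), f1.round(3), round(float(accuracy), 3))
```

The results agree with the classification report to three decimals, for example 64 / 69 for the precision of class 0 and 307 / 320 for the overall accuracy.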



In [38]:
mapping_txt = {0: "low", 1: "medium", 2: "high"}

# takes a vector of 11 values in this order:
# "fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"

def predict_wine_quality(values, via="kmeans"):
    # apply the same scaling that was fitted on the training data
    x = pd.DataFrame([values], columns=variables_ind, dtype=float)
    x_s = scaler.transform(x)
    if via == "knn":
        # use the best KNN classifier found in the k search
        label = int(best_model.predict(x_s)[0])
    else:
        # assign the nearest KMeans cluster and map it to its ordered label
        label = cluster_to_label[kmeans.predict(x_s)[0]]
    # map the numeric label to text
    return f"This wine is probably of {mapping_txt[label]} quality 🍷"

In [37]:
import random

def generar_listas_vino(n=10):
    # build n random samples, drawing each feature uniformly from a fixed range
    listas = []
    for _ in range(n):
        lista = [
            round(random.uniform(4, 15), 2),     # fixed acidity
            round(random.uniform(0, 1.5), 2),    # volatile acidity
            round(random.uniform(0, 1), 2),      # citric acid
            round(random.uniform(0.5, 15), 2),   # residual sugar
            round(random.uniform(0.01, 0.2), 3), # chlorides
            round(random.uniform(1, 72), 1),     # free sulfur dioxide
            round(random.uniform(6, 289), 1),    # total sulfur dioxide
            round(random.uniform(0.990, 1.005), 4), # density
            round(random.uniform(2.9, 4.0), 2),  # pH
            round(random.uniform(0.3, 2.0), 2),  # sulphates
            round(random.uniform(8.0, 15.0), 1)  # alcohol
        ]
        listas.append(lista)
    return listas

# Example usage:
muestras = generar_listas_vino()
for i, muestra in enumerate(muestras, 1):
    print(f"Sample {i}: {muestra} 🍷 {predict_wine_quality(muestra)}")


Sample 1: [12.18, 0.73, 0.46, 4.69, 0.17, 17.0, 9.3, 0.9911, 3.2, 0.92, 8.4] 🍷 This wine is probably of high quality 🍷
Sample 2: [4.82, 0.58, 0.47, 13.38, 0.194, 44.5, 181.6, 1.0021, 3.01, 0.32, 10.7] 🍷 This wine is probably of low quality 🍷
Sample 3: [14.32, 0.63, 0.35, 7.4, 0.16, 2.6, 251.5, 0.9957, 3.76, 0.82, 12.1] 🍷 This wine is probably of low quality 🍷
Sample 4: [6.22, 0.14, 0.56, 7.04, 0.044, 49.9, 12.4, 1.0049, 3.1, 1.96, 13.8] 🍷 This wine is probably of high quality 🍷
Sample 5: [10.55, 1.2, 0.68, 7.04, 0.025, 34.5, 227.3, 1.0034, 3.58, 0.5, 8.3] 🍷 This wine is probably of low quality 🍷
Sample 6: [13.95, 0.93, 0.59, 8.8, 0.172, 26.9, 239.6, 0.9928, 3.6, 1.16, 10.7] 🍷 This wine is probably of low quality 🍷
Sample 7: [4.17, 0.84, 0.07, 14.98, 0.156, 43.0, 155.7, 1.0025, 3.09, 1.75, 10.2] 🍷 This wine is probably of low quality 🍷
Sample 8: [10.7, 1.41, 0.45, 13.78, 0.183, 41.4, 135.0, 0.9968, 2.94, 1.73, 8.9] 🍷 This wine is proba

