# **7. One hot encoding**

### Objetivo
Que la o el estudiante aprenda a transformar los datos categóricos usando One hot encoding.

### Descripción del procedimiento a seguir
- Se carga el conjunto de datos Titanic
- Se seleccionan las caracteristicas y el target
- Se convierten las columnas categóricas usando One hot encoding
- Se separan los datos en entrenamiento y prueba
- Se entrena un modelo y se evalúa con validación cruzada
- Se predice sobre el conjunto de prueba
- Se deja como ejercicio trabajar con el conjunto de datos Car


### Datos
**Profesora**: Dra. Jessica Beltrán Márquez<br>
Maestría en Ciencia de Datos y Optimización<br>
Centro de Investigación en Matemáticas Aplicadas <br>
Universidad Autónoma de Coahuila


### Bibliografía
1. https://archive.ics.uci.edu/dataset/19/car+evaluation


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score


### **1. Cargamos el conjunto de datos Titanic**

In [None]:
df = pd.read_csv('/content/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [None]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

### **2. Seleccionamos las columnas que usaremos con características y el target**

In [None]:
# Select features and target variable
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived']


### **3. Convertimos las columnas categóricas a one hot encoding**

In [None]:
# Select categorical columns that you want to one-hot encode
categorical_columns = ['Sex', 'Embarked']

# Use pandas get_dummies to perform one-hot encoding
X = pd.get_dummies(X, columns=categorical_columns)

# Display the first few rows of the encoded DataFrame
print(X.head())

   Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  \
0       3  22.0      1      0   7.2500           0         1           0   
1       1  38.0      1      0  71.2833           1         0           1   
2       3  26.0      0      0   7.9250           1         0           0   
3       1  35.0      1      0  53.1000           1         0           0   
4       3  35.0      0      0   8.0500           0         1           0   

   Embarked_Q  Embarked_S  
0           0           1  
1           0           0  
2           0           1  
3           0           1  
4           0           1  


## **4. Lideamos con los valores faltantes en Age**

In [None]:
# Handle missing values if any (e.g., fill missing values in 'Age' with the mean)
X['Age'].fillna(X['Age'].mean(), inplace=True)


In [None]:
rows_with_nan = X[X.isna().any(axis=1)]
print("Rows with NaN values:\n", rows_with_nan)

Rows with NaN values:
 Empty DataFrame
Columns: [Pclass, Age, SibSp, Parch, Fare, Sex_female, Sex_male, Embarked_C, Embarked_Q, Embarked_S]
Index: []


In [None]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the KNN model
knn_model = KNeighborsClassifier()

# Fit the model to the training data
knn_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = knn_model.predict(X_test_scaled)

# Print classification report or other evaluation metrics
print("Classification Report on Test Set:\n", classification_report(y_test, y_pred))


Classification Report on Test Set:
               precision    recall  f1-score   support

           0       0.80      0.88      0.84       157
           1       0.80      0.68      0.74       111

    accuracy                           0.80       268
   macro avg       0.80      0.78      0.79       268
weighted avg       0.80      0.80      0.80       268



### **5. Separamos el conjunto de datos**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


### **6. Entrenamos y buscamos los mejores hiperparámetros usando CV**

In [None]:

# Define the KNN model
knn_model = KNeighborsClassifier()

# Define the scaler
scaler = StandardScaler()

# Create a pipeline with scaler and KNN classifier
pipeline = Pipeline([
    ('scaler', scaler),
    ('knn', knn_model)
])

# Define the hyperparameters you want to search over
param_grid = {
    'knn__n_neighbors': [1,3, 5, 7, 9,11]
}

# Define the cross-validation scheme
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Use GridSearchCV with the pipeline to find the best hyperparameter
grid_search = GridSearchCV(pipeline, param_grid, cv=kf, scoring='f1')
grid_search.fit(X_train, y_train)

# Print the best hyperparameter and its corresponding F1 score
print("Best hyperparameter k:", grid_search.best_params_['knn__n_neighbors'])
print("Best F1 score:", grid_search.best_score_)


Best hyperparameter k: 11
Best F1 score: 0.7291610560627451


### **7. Predecimos sobre el conjunto de datos de prueba**

In [None]:
# Extract the best model from the grid search
best_model = grid_search.best_estimator_

# Assuming X_test is your test data
X_test_scaled = best_model.named_steps['scaler'].transform(X_test)

# Make predictions on the scaled test data
y_test_pred = best_model.predict(X_test_scaled)

# Evaluar con las métricas exactitud y matriz de confusión
f_test = f1_score(y_test, y_test_pred)
print(f_test)
# Print classification report or other evaluation metrics
print("Classification Report on Test Set:\n", classification_report(y_test, y_test_pred))

0.6880000000000001
Classification Report on Test Set:
               precision    recall  f1-score   support

           0       0.88      0.62      0.73       168
           1       0.57      0.86      0.69       100

    accuracy                           0.71       268
   macro avg       0.73      0.74      0.71       268
weighted avg       0.77      0.71      0.71       268





### **8. Ejercicio, clasificar el conjunto de datos Car**
- Descargar el conjunto de datos de https://archive.ics.uci.edu/dataset/19/car+evaluation
- Leer sobre el conjunto de datos e identificar en que consisten las variables
- Aplicar one hot encoding sobre las variables categóricas.
- Crear X con las características y y con los targets
- Entrenar un clasificador KNN
- Evaluar sobre el conjunto de prueba

NOTA: Prueba leer el DataFrame directamente usando la url del dataset.

url = " Aqui la URL <URL>"  <br>
column_names = ['c1', 'c2', 'c3', .., 'class'] <br>
car_data = pd.read_csv(url, names=column_names)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [None]:
column_names = ['c1','c2','c3','c4','c5','c6','class']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', names= column_names)
df.head()

Unnamed: 0,c1,c2,c3,c4,c5,c6,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [None]:
# Select features and target variable
X = df[['c1', 'c2', 'c3', 'c4', 'c5', 'c6']]
y = df['class']

In [19]:
# Select categorical columns that you want to one-hot encode
categorical_columns = ['c1', 'c2', 'c3', 'c4']

# Use pandas get_dummies to perform one-hot encoding
X = pd.get_dummies(X, columns=categorical_columns)

# Display the first few rows of the encoded DataFrame
print(X.head())

   c5_big  c5_med  c5_small  c6_high  c6_low  c6_med  c1_high  c1_low  c1_med  \
0       0       0         1        0       1       0        0       0       0   
1       0       0         1        0       0       1        0       0       0   
2       0       0         1        1       0       0        0       0       0   
3       0       1         0        0       1       0        0       0       0   
4       0       1         0        0       0       1        0       0       0   

   c1_vhigh  ...  c2_low  c2_med  c2_vhigh  c3_2  c3_3  c3_4  c3_5more  c4_2  \
0         1  ...       0       0         1     1     0     0         0     1   
1         1  ...       0       0         1     1     0     0         0     1   
2         1  ...       0       0         1     1     0     0         0     1   
3         1  ...       0       0         1     1     0     0         0     1   
4         1  ...       0       0         1     1     0     0         0     1   

   c4_4  c4_more  
0     0      