# ***Notebook: Model Training and Evaluation***

This notebook trains and evaluates four machine learning models that predict a product's condition ("new" or "used"). The workflow covers data preprocessing, model training, and performance evaluation with Precision, Recall, F1-Score, and Accuracy.

***Importing Libraries***

Import necessary libraries for data preprocessing, model training, and evaluation.

In [10]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

***Loading the Processed Dataset***

Load the preprocessed dataset created during the Exploratory Data Analysis (EDA) phase.

In [None]:
df_model = pd.read_csv('data/processed_data.csv')

***Splitting the Dataset***

Split the dataset into training and testing sets, with 80% of the data used for training and 20% for testing.

In [None]:
X = df_model.drop('condition', axis=1)
y = df_model['condition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
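
The split above samples rows at random, so the class ratio can differ slightly between the training and testing sets. As a minimal sketch on hypothetical toy labels (not the notebook's data), passing `stratify` to `train_test_split` keeps the class proportions the same in both sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% class 0 and 70% class 1.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 30 + [1] * 70, name="condition")

# stratify=y_toy asks train_test_split to preserve the 30/70 ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_ratio = y_tr.mean()  # share of class 1 in the training set
test_ratio = y_te.mean()   # share of class 1 in the testing set
print(train_ratio, test_ratio)
```

Whether stratification changes the results here is untested; the classes in this dataset are fairly balanced, so the effect is likely small.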

### ***Data Preprocessing***

***One-Hot Encoding for Categorical Features***

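Apply One-Hot Encoding to the categorical features to convert them into numerical format. The remaining columns pass through unchanged, and any category not seen during fitting is encoded as all zeros.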

In [None]:
categorical_features = ['listing_type_id', 'buying_mode']
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)

***Feature Scaling***

Standardize the features to zero mean and unit variance, using statistics computed on the training set only. Scaling matters for models like Logistic Regression and K-Nearest Neighbors; the tree-based models are largely insensitive to it.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)
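
As a minimal sketch of why the scaler is fitted on the training data and only applied to the test data, the toy arrays below (hypothetical values) show that the test set is transformed with the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[4.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns mean and std from training rows
test_scaled = toy_scaler.transform(test_toy)        # reuses those training statistics

print("training mean:", toy_scaler.mean_[0])
print("training std:", toy_scaler.scale_[0])
print("scaled test value:", test_scaled[0, 0])
```

The scaled test value is computed as `(4.0 - mean) / std` with the training statistics, so no information from the test set leaks into preprocessing.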

### ***Training and Evaluating Models***

***Logistic Regression***

Train a Logistic Regression model and evaluate its performance using classification metrics.

In [12]:
print("=== Logistic Regression ===")
model_lr = LogisticRegression()
model_lr.fit(X_train_scaled, y_train)
y_pred_lr = model_lr.predict(X_test_scaled)

print(classification_report(y_test, y_pred_lr))
print("Accuracy:", accuracy_score(y_test, y_pred_lr))

=== Logistic Regression ===
              precision    recall  f1-score   support

           0       0.91      0.47      0.62      9283
           1       0.67      0.96      0.79     10717

    accuracy                           0.73     20000
   macro avg       0.79      0.71      0.70     20000
weighted avg       0.78      0.73      0.71     20000

Accuracy: 0.7303
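
The report above shows a large gap between precision (0.91) and recall (0.47) for class 0, and the F1-score summarizes both as their harmonic mean. The short sketch below recomputes the F1 values from the rounded precision and recall in the report; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the Logistic Regression report.
f1_class0 = f1_from(0.91, 0.47)
f1_class1 = f1_from(0.67, 0.96)

print(round(f1_class0, 2))  # 0.62
print(round(f1_class1, 2))  # 0.79
```

Because the harmonic mean is pulled toward the smaller value, the low recall for class 0 drags its F1-score down to 0.62 despite the high precision.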


***Decision Tree***

Train a Decision Tree model and evaluate its performance.

In [13]:
print("\n=== Decision Tree ===")
model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train_scaled, y_train)
y_pred_dt = model_dt.predict(X_test_scaled)
print(classification_report(y_test, y_pred_dt))
print("Accuracy:", accuracy_score(y_test, y_pred_dt))


=== Decision Tree ===
              precision    recall  f1-score   support

           0       0.77      0.88      0.82      9283
           1       0.88      0.77      0.82     10717

    accuracy                           0.82     20000
   macro avg       0.82      0.82      0.82     20000
weighted avg       0.83      0.82      0.82     20000

Accuracy: 0.8181


***Random Forest***

Train a Random Forest model and evaluate its performance.

In [None]:
print("\n=== Random Forest ===")
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train_scaled, y_train)
y_pred_rf = model_rf.predict(X_test_scaled)
print(classification_report(y_test, y_pred_rf))
print("Accuracy:", accuracy_score(y_test, y_pred_rf))



=== Random Forest ===
              precision    recall  f1-score   support

           0       0.78      0.87      0.82      9283
           1       0.87      0.79      0.83     10717

    accuracy                           0.82     20000
   macro avg       0.82      0.83      0.82     20000
weighted avg       0.83      0.82      0.82     20000

Accuracy: 0.8223


***K-Nearest Neighbors (KNN)***

Train a K-Nearest Neighbors model and evaluate its performance.

In [None]:
print("\n=== K-Nearest Neighbors (KNN) ===")
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X_train_scaled, y_train)
y_pred_knn = model_knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred_knn))
print("Accuracy:", accuracy_score(y_test, y_pred_knn))


=== K-Nearest Neighbors (KNN) ===


              precision    recall  f1-score   support

           0       0.79      0.76      0.78      9283
           1       0.80      0.83      0.81     10717

    accuracy                           0.80     20000
   macro avg       0.80      0.79      0.80     20000
weighted avg       0.80      0.80      0.80     20000

Accuracy: 0.79665


### ***Comparing Model Performance***

Create a summary table of support-weighted Precision, Recall, and F1-Score, together with Accuracy, for all models to compare their performance.

In [None]:
metrics = {
    "Model": [],
    "Precision": [],
    "Recall": [],
    "F1-Score": [],
    "Accuracy": []
}

def add_metrics(model_name, y_test, y_pred):
    """Append one model's weighted metrics and accuracy to the shared `metrics` dict."""
    metrics["Model"].append(model_name)
    # average='weighted' weights each class's score by its number of true samples.
    metrics["Precision"].append(precision_score(y_test, y_pred, average='weighted'))
    metrics["Recall"].append(recall_score(y_test, y_pred, average='weighted'))
    metrics["F1-Score"].append(f1_score(y_test, y_pred, average='weighted'))
    metrics["Accuracy"].append(accuracy_score(y_test, y_pred))

add_metrics("Logistic Regression", y_test, y_pred_lr)
add_metrics("Decision Tree", y_test, y_pred_dt)
add_metrics("Random Forest", y_test, y_pred_rf)
add_metrics("K-Nearest Neighbors", y_test, y_pred_knn)

metrics_df = pd.DataFrame(metrics)

print(metrics_df)

                 Model  Precision   Recall  F1-Score  Accuracy
0  Logistic Regression   0.782686  0.73030  0.710491   0.73030
1        Decision Tree   0.825589  0.81810  0.818157   0.81810
2        Random Forest   0.827197  0.82230  0.822479   0.82230
3  K-Nearest Neighbors   0.796493  0.79665  0.796346   0.79665
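
In the table above, the Recall and Accuracy columns are identical for every model. This follows from `average='weighted'`: weighting each class's recall by its support and summing gives the overall fraction of correct predictions. A minimal sketch with hypothetical toy labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 3 samples of class 0 and 7 samples of class 1.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

# Class 0 recall is 2/3 and class 1 recall is 5/7;
# weighting by support gives (3 * 2/3 + 7 * 5/7) / 10 = 7/10.
weighted_recall = recall_score(y_true_toy, y_pred_toy, average="weighted")
accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(weighted_recall, accuracy)
```

For a comparison that treats both classes equally regardless of size, `average='macro'` is one alternative; the notebook does not use it in the summary table.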


### ***Summary***

This notebook performs the following tasks:

1. Loads the preprocessed dataset.
2. Splits the dataset into training and testing sets.
3. Preprocesses the data using One-Hot Encoding and feature scaling.
4. Trains and evaluates four machine learning models: ***Logistic Regression***, ***Decision Tree***, ***Random Forest***, and ***K-Nearest Neighbors***.
5. Compares the performance of the models using a summary table of metrics.

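The results indicate that the ***Random Forest*** model achieved the best weighted Precision, Recall, F1-Score, and Accuracy (accuracy 0.822), narrowly ahead of the ***Decision Tree*** (accuracy 0.818), making it the strongest candidate for predicting product condition in this dataset. ***Logistic Regression*** scored lowest (accuracy 0.730), with a recall of only 0.47 for class 0. These figures come from a single train/test split, so the small gap between the top two models should be confirmed before drawing firm conclusions. Further hyperparameter tuning and feature engineering could improve the results.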
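
As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search with `GridSearchCV`. It uses synthetic data from `make_classification` instead of the notebook's dataset, and the grid values are arbitrary assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with the real training features and labels.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# Arbitrary illustrative grid (an assumption, not tuned for this problem).
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_weighted",  # same weighted F1 used in the summary table
    cv=3,                   # 3-fold cross-validation on the training data
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Cross-validating on the training set keeps the test set untouched until the final evaluation, which also gives a more stable comparison than a single split.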