<a href="https://colab.research.google.com/github/vamshitn/Samsung-innovation-campus/blob/main/KNN_%26_SVM(garbage).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

In [25]:
try:
    # Load your dataset
    garbage_df = pd.read_csv("garbage.csv")
    print("Dataset loaded successfully!")
    print(garbage_df.head())  # Show first few rows
except FileNotFoundError:
    print("Error: garbage.csv not found. Please make sure the file is in the correct directory.")
    raise
except Exception as e:
    print(f"An error occurred: {e}")
    raise

Dataset loaded successfully!
   Unnamed: 0  weight   volume  moisture_content  organic_content  \
0           0  288.20  1166.79             40.05             1.73   
1           1  220.01  1267.74             41.09            56.69   
2           2  248.94   873.31             10.33            63.33   
3           3  312.04  1031.41             51.91            12.54   
4           4  293.38  1068.42              9.68            61.05   

         source waste_type  label  
0     household      metal      0  
1     household      paper      0  
2  agricultural      paper      0  
3  agricultural      metal      0  
4  agricultural      paper      0  


In [29]:
# Select features (X) and target (y)
# Exclude 'Unnamed: 0' as it seems to be an index column
X = garbage_df.drop(['Unnamed: 0', 'label'], axis=1)
y = garbage_df['label']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['number']).columns

# Apply one-hot encoding to categorical columns
# dtype=float keeps the dummy columns numeric, so X converts to a float array
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype=float)

# Now X contains only numerical data
X = X.values # Convert DataFrame to numpy array for scikit-learn
y = y.values

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


# K-Nearest Neighbors (KNN) Classifier
KNN is a simple, instance-based learning algorithm: it classifies a new data point by the majority class among its k nearest neighbors in the training set. Performance depends heavily on the choice of 'k' (the number of neighbors). This example uses k = 5, a common starting point.
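
To make the majority-vote idea concrete, here is a minimal NumPy sketch of the prediction step. It is not how scikit-learn implements `KNeighborsClassifier`. The helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical names introduced only for this illustration, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_near_origin, pred_near_five)
```

Because the vote depends directly on distances, features on larger scales would dominate it, which is why the notebook standardizes the features before fitting.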

In [31]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [32]:
start_time_knn = time.time()

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train)

end_time_knn = time.time()

In [34]:
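# Predict on the scaled test set, then score (ROC AUC here uses hard labels, not probabilities)
y_pred_knn = knn_model.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
roc_auc_knn = roc_auc_score(y_test, y_pred_knn, multi_class="ovr") if len(set(y)) > 2 else roc_auc_score(y_test, y_pred_knn)
training_time_knn = end_time_knn - start_time_knn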
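
The ROC AUC in this notebook is computed from predicted class labels. For a binary target, that value equals balanced accuracy (the mean of the per-class recalls), and a score-based AUC from `predict_proba` usually says more about how well the model ranks samples. The sketch below shows the difference on synthetic data from `make_classification`, used as a stand-in because `garbage.csv` is not available here. All variable names are hypothetical, and the numbers it prints will not match the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: binary target with a 75/25 class split)
X_auc, y_auc = make_classification(
    n_samples=1000, n_features=6, weights=[0.75, 0.25], random_state=42
)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_auc, y_auc, test_size=0.3, random_state=42, stratify=y_auc
)

# Fit the scaler on the training split only, as in the notebook
scaler_auc = StandardScaler().fit(Xa_train)
knn_auc = KNeighborsClassifier(n_neighbors=5)
knn_auc.fit(scaler_auc.transform(Xa_train), ya_train)
Xa_test_scaled = scaler_auc.transform(Xa_test)

# AUC from hard labels: a single operating point on the ROC curve
labels_auc = knn_auc.predict(Xa_test_scaled)
auc_from_labels = roc_auc_score(ya_test, labels_auc)

# AUC from class-1 probabilities: uses the full ranking of test samples
auc_from_scores = roc_auc_score(ya_test, knn_auc.predict_proba(Xa_test_scaled)[:, 1])

# For a binary target, the label-based AUC equals balanced accuracy
bal_acc = balanced_accuracy_score(ya_test, labels_auc)

print(f"AUC from labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print(f"AUC from scores:   {auc_from_scores:.4f}")
```

For `SVC`, the analogous score would come from `decision_function`, or from `predict_proba` when the model is built with `probability=True`.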

In [35]:
print("=== K-Nearest Neighbors (KNN) ===")
print(f"Accuracy: {accuracy_knn:.4f}")
print(f"ROC AUC Score: {roc_auc_knn:.4f}")
print(f"Training Time: {training_time_knn:.4f} seconds")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn))

=== K-Nearest Neighbors (KNN) ===
Accuracy: 0.8833
ROC AUC Score: 0.8289
Training Time: 0.0080 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.94      0.92       225
           1       0.79      0.72      0.76        75

    accuracy                           0.88       300
   macro avg       0.85      0.83      0.84       300
weighted avg       0.88      0.88      0.88       300

# Support Vector Machine (SVM) Classifier
SVM is a powerful algorithm that looks for the hyperplane separating the classes with the largest margin. SVC (Support Vector Classifier) in scikit-learn is a versatile implementation that supports several kernels, such as the rbf (Radial Basis Function) kernel for non-linear decision boundaries.
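
As a small illustration of why the kernel choice matters for `SVC`, the sketch below compares a linear kernel with the rbf kernel on `make_circles`, a synthetic dataset that no straight line can separate. The dataset and all variable names are assumptions made for this example and are unrelated to `garbage.csv`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_circ, y_circ, test_size=0.3, random_state=42, stratify=y_circ
)

# Fit one SVC per kernel and record test accuracy
kernel_scores = {}
for kernel in ("linear", "rbf"):
    kernel_model = SVC(kernel=kernel, C=1.0, gamma="scale")
    kernel_model.fit(Xc_train, yc_train)
    kernel_scores[kernel] = kernel_model.score(Xc_test, yc_test)

print(kernel_scores)
```

On data like this the rbf kernel should score far higher than the linear one. On real data the gap depends on how non-linear the class boundary actually is, so it is worth checking both.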



In [36]:
start_time_svm = time.time()

svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_model.fit(X_train_scaled, y_train)

end_time_svm = time.time()

# Predictions
y_pred_svm = svm_model.predict(X_test_scaled)

# Metrics (ROC AUC here uses hard labels, not decision scores)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
roc_auc_svm = roc_auc_score(y_test, y_pred_svm, multi_class="ovr") if len(set(y)) > 2 else roc_auc_score(y_test, y_pred_svm)
training_time_svm = end_time_svm - start_time_svm

print("\n=== Support Vector Machine (SVM) ===")
print(f"Accuracy: {accuracy_svm:.4f}")
print(f"ROC AUC Score: {roc_auc_svm:.4f}")
print(f"Training Time: {training_time_svm:.4f} seconds")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm))


=== Support Vector Machine (SVM) ===
Accuracy: 0.9267
ROC AUC Score: 0.8978
Training Time: 0.0433 seconds

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.96      0.95       225
           1       0.86      0.84      0.85        75

    accuracy                           0.93       300
   macro avg       0.91      0.90      0.90       300
weighted avg       0.93      0.93      0.93       300



In this run both models perform well, and SVM comes out ahead: 0.9267 accuracy against 0.8833 for KNN. The gap is widest on the minority class (label 1), where SVM reaches 0.84 recall and KNN 0.72. With the rbf kernel, SVM can fit a smooth non-linear decision boundary, which likely helps on this dataset. KNN is simple to implement and was faster to fit (0.0080 seconds against 0.0433), although most of KNN's work happens at prediction time rather than during fitting, and its performance is more sensitive to the choice of 'k' and the distance metric.
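
One way to probe KNN's sensitivity to 'k' and the distance metric is a small cross-validated grid search. The sketch below runs on synthetic data from `make_classification` as a stand-in for `garbage.csv`. The grid values, step names, and variable names are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: binary target with a 75/25 class split)
X_grid, y_grid = make_classification(
    n_samples=600, n_features=6, weights=[0.75, 0.25], random_state=42
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid over k and the distance metric
knn_param_grid = {
    "knn__n_neighbors": [3, 5, 7, 11, 15],
    "knn__metric": ["euclidean", "manhattan"],
}

knn_search = GridSearchCV(knn_pipeline, knn_param_grid, cv=5, scoring="accuracy")
knn_search.fit(X_grid, y_grid)

print("Best parameters:", knn_search.best_params_)
print(f"Best CV accuracy: {knn_search.best_score_:.4f}")
```

Applied to the notebook's data, the same pattern would take `X_train` and `y_train` in place of the synthetic arrays, with the held-out test set reserved for the final comparison.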