# Importing Libraries

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Fetching the Data

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)



In [3]:
df = mnist.copy()

## Importing Models from Scikit-Learn

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [5]:
X, y = mnist["data"], mnist["target"]

## Applying Machine Learning Models
### Performing Train-Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=10000, random_state=42)
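
The two calls above first hold out 10,000 samples for testing, then another 10,000 for validation, and keep the remainder for training. A minimal sketch on a scaled-down synthetic array (the stand-in data and the `_demo` names are assumptions, not part of the notebook) confirms how the two-step split divides the samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MNIST (assumption): 70 samples instead of 70,000.
X_demo = np.arange(70 * 4).reshape(70, 4)
y_demo = np.arange(70) % 10

# Same two-step pattern as above, scaled down to 10 test and 10 validation samples.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=10, random_state=42)

# Sizes of the training, validation, and test portions.
split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (50, 10, 10)
```

Passing an integer `test_size` requests an absolute number of samples rather than a fraction, which is why the notebook's split sizes are fixed at 10,000 each.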


### Instantiating individual classifiers

In [7]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
et_classifier = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_classifier = SVC(probability=True, random_state=42)

### Fitting training data to classifiers

In [8]:
rf_classifier.fit(X_train, y_train)
et_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)

### Classifier Predictions on Validation Set

In [9]:
rf_val_preds = rf_classifier.predict(X_val)
et_val_preds = et_classifier.predict(X_val)
svm_val_preds = svm_classifier.predict(X_val)

### Accuracy Calculation on Validation Set

In [10]:
rf_val_accuracy = accuracy_score(y_val, rf_val_preds)
et_val_accuracy = accuracy_score(y_val, et_val_preds)
svm_val_accuracy = accuracy_score(y_val, svm_val_preds)

In [11]:
print(f"Random Forest Validation Accuracy: {rf_val_accuracy}")
print(f"Extra Trees Validation Accuracy: {et_val_accuracy}")
print(f"SVM Validation Accuracy: {svm_val_accuracy}")

Random Forest Validation Accuracy: 0.9692
Extra Trees Validation Accuracy: 0.9715
SVM Validation Accuracy: 0.9788


## Creating Ensemble with Soft Voting

In [12]:
ensemble = VotingClassifier(estimators=[
    ('rf', rf_classifier),
    ('et', et_classifier),
    ('svm', svm_classifier)
], voting='soft')
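
With `voting='soft'`, the ensemble averages the class probabilities returned by each estimator's `predict_proba` and predicts the class with the highest mean probability. This is also why the SVM above needs `probability=True`. The sketch below checks that behaviour by hand on a small synthetic dataset; the data, the smaller forests, and the names `voter`, `mean_proba`, and `manual_preds` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC

# Small synthetic 3-class problem (assumption), much faster to fit than MNIST.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

voter = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('et', ExtraTreesClassifier(n_estimators=20, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
], voting='soft')
voter.fit(X_demo, y_demo)

# Average the fitted estimators' class probabilities by hand...
mean_proba = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
# ...and pick the class with the highest mean probability.
manual_preds = voter.classes_[np.argmax(mean_proba, axis=1)]

# The manual result should agree with the ensemble's own predictions.
matches = np.array_equal(manual_preds, voter.predict(X_demo))
print(matches)
```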

### Fitting Ensemble Model onto Training Set

In [13]:
ensemble.fit(X_train, y_train)

### Prediction of Ensemble Model

In [14]:
ensemble_val_preds = ensemble.predict(X_val)

In [15]:
ensemble_val_accuracy = accuracy_score(y_val, ensemble_val_preds)
print("Ensemble Validation Accuracy (Soft Voting):", ensemble_val_accuracy)

Ensemble Validation Accuracy (Soft Voting): 0.9791


In [16]:
ensemble_test_preds = ensemble.predict(X_test)

In [17]:
ensemble_test_accuracy = accuracy_score(y_test, ensemble_test_preds)
print("Ensemble Test Accuracy (Soft Voting):", ensemble_test_accuracy)


Ensemble Test Accuracy (Soft Voting): 0.9767


# Findings and Conclusion:

Based on the accuracy scores above, we can analyze the performance of the individual classifiers as well as the Ensemble classifier, which uses soft voting.

The Random Forest classifier had a validation accuracy of 0.9692. This is a strong result overall, but of the three individual classifiers, the Random Forest has the lowest accuracy.

Meanwhile, the Extra-Trees classifier outperformed the Random Forest classifier with a validation accuracy of 0.9715. This is an improvement over the previous classifier when classifying the validation set. 

Finally, the SVM classifier achieved the highest validation accuracy among the three individual classifiers, with a score of 0.9788. Of the three models tested in this assignment, the SVM appears to be the most accurate on the validation data. This could be due to the nature of the dataset: an SVM tends to do well when the classes are cleanly separable, whether by a linear or a non-linear boundary, and less well when the data is "noisy".

The Ensemble model using soft voting reached a validation accuracy of 0.9791, marginally above the SVM's 0.9788, and a test accuracy of 0.9767. This suggests that the Ensemble model may have more external validity, since it combines several different classifiers, and it could perform better on further unseen data for that reason. However, the gain over the SVM alone is only 0.0003 on the validation set, so in this case the SVM was the most effective model when time and resource intensity are weighed against accuracy score.
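
Comparisons like the one above are easiest to read when every model is scored on the same split. The following is a minimal sketch of that idea, using a hypothetical helper `score_models` and scikit-learn's small built-in digits dataset as a stand-in for MNIST (both are assumptions, not part of the notebook above):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def score_models(models, X_eval, y_eval):
    """Return {name: accuracy} for already-fitted models on one shared split."""
    return {name: accuracy_score(y_eval, model.predict(X_eval))
            for name, model in models.items()}


# Small 8x8 digits dataset (assumption) so the example runs in seconds.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'rf': RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr),
    'et': ExtraTreesClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr),
}

# Every model is evaluated on the same held-out samples.
scores = score_models(models, X_te, y_te)
for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

In the notebook itself, the same helper could be called once with `X_val, y_val` and once with `X_test, y_test`, passing the fitted `rf_classifier`, `et_classifier`, `svm_classifier`, and `ensemble`.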