Machine Learning for Binary Classification

Author: Tai Luan Nguyen
This project performs various machine learning tasks on a dataset with 19,020 samples and 11 attributes. It demonstrates the application of several machine learning algorithms to the dataset and compares their performance using classification reports.
Additionally, it conducts a hyperparameter search for a neural network model.
The project includes:
- Importing Libraries
- Dataset Loading and Preprocessing
- Split dataset
- Data Scaling and Oversampling
- k-Nearest Neighbors (kNN) Classifier
- Naive Bayes Classifier
- Logistic Regression Classifier
- Support Vector Machine (SVM) Classifier
- Neural Network (Deep Learning) Classifier
Project link: https://github.com/luan30092000/binaryClassification
The data are Monte Carlo (MC) generated to simulate registration of high-energy gamma particles in an atmospheric Cherenkov telescope.
Reference: Bock, R. (2007). MAGIC Gamma Telescope. UCI Machine Learning Repository. https://doi.org/10.24432/C52C8B.
Necessary libraries and their purposes for this project:
- numpy: data manipulation and array operations.
- pandas: loading and manipulating the dataset, as well as data preprocessing and analysis.
- matplotlib: matplotlib.pyplot is used to create histograms for data visualization.
- sklearn.preprocessing: StandardScaler is used for feature scaling, which standardizes the data to have a mean of 0 and a standard deviation of 1.
- imblearn.over_sampling: RandomOverSampler is used to oversample the minority class to balance the class distribution.
- sklearn.neighbors: KNeighborsClassifier is used to create and evaluate a kNN model.
- sklearn.naive_bayes: GaussianNB is used to create and evaluate a Naive Bayes model.
- sklearn.linear_model: LogisticRegression is used to create and evaluate a logistic regression model.
- sklearn.svm: SVC is used to create and evaluate an SVM model.
- tensorflow: used to create and train a neural network for classification.
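A minimal import block covering the libraries above might look like the following (the exact imports in the repository may differ slightly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
```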
- Loads a dataset from a file named "magic04.data" using pandas and defines column names for the dataset.
- Converts the "class" column to binary values (0 or 1) by mapping "g" to 1 and "h" to 0.
- Plots histograms for each feature, comparing the distributions for class 0 (hadron) and class 1 (gamma).
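A sketch of this loading and preprocessing step; the column names follow the UCI dataset documentation, while the plotting details are assumptions and may differ from the repository:

```python
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv("magic04.data", names=cols)

# Map the class labels to binary values: gamma ("g") -> 1, hadron ("h") -> 0
df["class"] = (df["class"] == "g").astype(int)

# Histogram of each feature, split by class, to compare the two distributions
for label in cols[:-1]:
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma", alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron", alpha=0.7, density=True)
    plt.title(label)
    plt.xlabel(label)
    plt.ylabel("Probability")
    plt.legend()
    plt.show()
```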
- The dataset is split into train, validation, and test sets using the np.split function, with a 60-20-20 split ratio.
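A hedged sketch of the 60-20-20 split using np.split; shuffling via `sample` and the random seed are assumptions:

```python
# Shuffle the rows, then cut at the 60% and 80% marks to get train/validation/test
train, valid, test = np.split(df.sample(frac=1, random_state=0),
                              [int(0.6 * len(df)), int(0.8 * len(df))])
```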
- Scales the features and, optionally, oversamples the dataset.
- StandardScaler is used to scale the features; RandomOverSampler is used for the optional oversampling.
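A sketch of a scaling/oversampling helper in this spirit; the function name scale_dataset and its return signature are assumptions, not taken from the repository:

```python
def scale_dataset(dataframe, oversample=False):
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Standardize features to mean 0 and standard deviation 1
    X = StandardScaler().fit_transform(X)

    if oversample:
        # Duplicate minority-class samples until both classes are equally represented
        X, y = RandomOverSampler().fit_resample(X, y)

    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

train_data, X_train, y_train = scale_dataset(train, oversample=True)
valid_data, X_valid, y_valid = scale_dataset(valid, oversample=False)
test_data, X_test, y_test = scale_dataset(test, oversample=False)
```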
- Uses the scikit-learn library to create a k-Nearest Neighbors (kNN) model with 5 neighbors.
- Fits the model to the training data and makes predictions on the test data.
- Accuracy: 81%
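A minimal sketch of the kNN step, assuming the X_train/X_test arrays produced by the preprocessing sketch above:

```python
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
print(classification_report(y_test, knn_model.predict(X_test)))
```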
- Uses scikit-learn's Gaussian Naive Bayes classifier to create a Naive Bayes model.
- Fits the model to the training data and makes predictions on the test data.
- Accuracy: 73%
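The Naive Bayes step follows the same fit/predict/report pattern; a sketch:

```python
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
print(classification_report(y_test, nb_model.predict(X_test)))
```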
- Logistic regression model is created using scikit-learn.
- The model is trained on the training data and used to predict on the test data.
- Accuracy: 78%
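A sketch of the logistic regression step; using scikit-learn's defaults here is an assumption:

```python
lg_model = LogisticRegression()  # defaults: L2 regularization, lbfgs solver
lg_model.fit(X_train, y_train)
print(classification_report(y_test, lg_model.predict(X_test)))
```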
- Support Vector Machine (SVM) classifier is created using scikit-learn.
- Model is trained on the training data and used to predict on the test data.
- Accuracy: 85%
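A sketch of the SVM step; keeping SVC's default RBF kernel is an assumption about the configuration used:

```python
svm_model = SVC()  # RBF kernel by default
svm_model.fit(X_train, y_train)
print(classification_report(y_test, svm_model.predict(X_test)))
```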
- Neural network model is defined using TensorFlow/Keras.
- Performs a hyperparameter grid search, iterating over various combinations of the number of nodes, dropout probability, learning rate, and batch size.
- Trains multiple neural network models with different hyperparameters on the training data and selects the model with the lowest validation loss.
- The chosen neural network model achieves an accuracy of 87%.
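A condensed sketch of the model definition and grid search described above; the layer sizes, grid values, and epoch count are illustrative assumptions rather than the repository's exact settings:

```python
def build_model(num_nodes, dropout_prob, lr):
    # Two hidden layers with dropout, sigmoid output for binary classification
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),  # 10 input features
        tf.keras.layers.Dense(num_nodes, activation="relu"),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(num_nodes, activation="relu"),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

best_model, least_val_loss = None, float("inf")
for num_nodes in (16, 32, 64):
    for dropout_prob in (0.0, 0.2):
        for lr in (0.01, 0.005, 0.001):
            for batch_size in (32, 64, 128):
                model = build_model(num_nodes, dropout_prob, lr)
                history = model.fit(X_train, y_train, epochs=100,
                                    batch_size=batch_size,
                                    validation_data=(X_valid, y_valid),
                                    verbose=0)
                # Keep the model with the lowest validation loss seen so far
                val_loss = min(history.history["val_loss"])
                if val_loss < least_val_loss:
                    least_val_loss, best_model = val_loss, model
```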