This document presents a summary of six machine learning projects, each focusing on a different technique: Decision Tree classification, Ensemble Regression, K-Nearest Neighbors (KNN) classification, Logistic Regression, a Multi-Layer Perceptron (MLP) neural network, and Multiple Linear Regression. Each project follows structured steps including data preprocessing, model training, evaluation, and key observations.
**Objective:** To predict wine quality based on various chemical properties using a Decision Tree model.
- **Data Preprocessing**
  - Loaded the dataset (`Decision_Tree_train.csv`).
  - Handled missing values by filling them with column means.
- **Feature Selection**
  - Used `SelectKBest` with `f_classif` to select the top 5 most significant features (see the sketch after this list):
    - Volatile Acidity
    - Chlorides
    - Total Sulfur Dioxide
    - Density
    - Alcohol
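As a rough sketch of this step (assuming the target column is named `quality`, which the original does not state), the selection could look like:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Load the training data and fill missing values with column means.
df = pd.read_csv("Decision_Tree_train.csv")
df = df.fillna(df.mean(numeric_only=True))

X = df.drop(columns=["quality"])  # assumed target column name
y = df["quality"]

# Score each feature against the target with the ANOVA F-test
# and keep the five highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])
```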
- **Model Training** (sketched after this list)
  - Split the data into training (80%) and testing (20%) sets.
  - Trained a Decision Tree classifier on the selected features.
- **Evaluation Metrics**
  - Confusion Matrix: displayed class-wise performance.
  - F1 Score: 0.5385, indicating moderate classification performance.
  - Classification Report: the model struggled with the underrepresented classes (3 & 9) but performed well on the majority classes (5 & 6).
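A minimal sketch of the training and evaluation steps, reusing `X_selected` and `y` from the snippet above; the `average="weighted"` choice for the multi-class F1 score is an assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score, classification_report

# 80/20 split, then fit a plain decision tree on the selected features.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred, average="weighted"))
print(classification_report(y_test, y_pred))
```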
- **Test Data Prediction** (see the sketch below)
  - Preprocessed `Decision_Tree_test.csv` in the same way.
  - Saved predictions in `submission.csv`.
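A sketch of scoring the held-out file, reusing `selector`, `clf`, and `X` from the snippets above; the single-column `submission.csv` layout shown here is an assumption, not the project's confirmed format:

```python
# Apply the same preprocessing and feature selection to the test file.
test_df = pd.read_csv("Decision_Tree_test.csv")
test_df = test_df.fillna(test_df.mean(numeric_only=True))

X_unseen = selector.transform(test_df[X.columns])  # reduce to the 5 selected features
preds = clf.predict(X_unseen)

pd.DataFrame({"quality": preds}).to_csv("submission.csv", index=False)
```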
- **Key Observations**
  - The model struggled with overlapping classes (e.g., 5 & 6).
  - Performance could improve with hyperparameter tuning (max depth, minimum samples per leaf) or more advanced models such as Random Forest.
**Objective:** To predict a continuous target variable (`y`) using ensemble regression techniques.
- **Data Preprocessing**
  - Loaded the dataset (`Ensemble_Reg_train.csv`).
  - Standardized features using `StandardScaler`.
- **Train-Test Split**
  - Split the dataset into 80% training and 20% testing.
- **Model Selection & Tuning**
  - Used `GridSearchCV` to optimize (see the sketch after this list):
    - `RandomForestRegressor` (`max_depth=5`, `n_estimators=300`, RMSE = 52.15)
    - `SVR` (`C=1`, `epsilon=0.1`, `kernel='linear'`, RMSE = 51.75)
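A sketch of the tuning step, assuming an 80/20 split of the standardized data into `X_train`/`y_train` and `X_test`/`y_test`; the parameter grids are illustrative, not the project's actual search spaces:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Search over a small grid, scoring by (negated) RMSE.
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [3, 5, 7], "n_estimators": [100, 300]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_)  # e.g. {'max_depth': 5, 'n_estimators': 300}

svr_grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1, 10], "epsilon": [0.01, 0.1], "kernel": ["linear"]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
svr_grid.fit(X_train, y_train)
print(svr_grid.best_params_)  # e.g. {'C': 1, 'epsilon': 0.1, 'kernel': 'linear'}
```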
- **Ensemble Models Tested**
  - Voting Regressor (RandomForest + SVR): best performer (sketched below)
    - Test RMSE: 60.48
    - Cross-validation RMSE: 50.47
  - Bagging Regressor (RandomForest base model)
    - Test RMSE: 62.36
    - Cross-validation RMSE: 52.12
  - Stacking Regressor (RandomForest + SVR with a Linear Regression final estimator)
    - Test RMSE: 61.45
    - Cross-validation RMSE: 50.80
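The best performer could be reproduced along these lines, reusing the tuned estimators from the previous snippet:

```python
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Average the predictions of the two tuned base models.
voting = VotingRegressor(
    estimators=[("rf", rf_grid.best_estimator_), ("svr", svr_grid.best_estimator_)]
)
voting.fit(X_train, y_train)

test_rmse = mean_squared_error(y_test, voting.predict(X_test)) ** 0.5
cv_rmse = -cross_val_score(
    voting, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5
).mean()
print(f"Test RMSE: {test_rmse:.2f}, CV RMSE: {cv_rmse:.2f}")
```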
- **Final Model & Predictions**
  - Chose the Voting Regressor for the final predictions on `Ensemble_Reg_test.csv`.
  - Saved results in `submission.csv`.
- **Key Observations**
  - Ensemble methods improved cross-validation RMSE over the individually tuned models.
  - Hyperparameter tuning played a crucial role in reducing RMSE.
  - The Voting Regressor demonstrated the best generalization.
**Objective:** To classify objects based on mass, width, and height using a KNN classifier.
- **Data Exploration & Preprocessing**
  - Loaded the dataset (`KNN_train.csv`).
  - Features: `mass`, `width`, `height` (ID and label columns removed).
  - Standardized using `StandardScaler`.
- **Model Training**
  - Used a KNN classifier with `n_neighbors=1` (see the sketch below).
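A compact sketch of the whole pipeline; the label column name is an assumption:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("KNN_train.csv")
X = train[["mass", "width", "height"]]
y = train["label"]  # assumed label column name

# Standardize so that no single feature dominates the distance metric.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_scaled, y)
print(knn.score(X_scaled, y))  # 1.0: each training point is its own nearest neighbor
```

Standardization matters here because KNN is distance-based; unscaled features with large ranges (e.g., mass in grams) would otherwise dominate the neighbor search.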
- **Model Evaluation**
  - Achieved 100% accuracy on the training data, which is expected with `n_neighbors=1`: every training point is its own nearest neighbor.
  - The classification report showed perfect precision, recall, and F1-score.
- **Testing & Predictions**
  - Made predictions on `KNN_test.csv`.
  - Saved results in `submission.csv`.
- **Key Observations**
  - The model perfectly classified the training data but may overfit.
  - Performance on the test set still needs to be evaluated to judge generalization.
**Objective:** To classify data points into distinct categories using a logistic regression model.
- **Data Preprocessing**
  - Performed data cleaning and handled missing values.
  - Applied feature scaling to standardize the dataset.
  - Selected the most relevant features using correlation analysis.
- **Model Training**
  - Used scikit-learn's `LogisticRegression` (see the sketch below).
  - Split the data into training and testing sets.
  - Trained the model on the training data.
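A minimal sketch of this workflow; the file and column names are placeholders, since the original does not state them:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv").dropna()  # placeholder file name
X = StandardScaler().fit_transform(df.drop(columns=["target"]))  # placeholder target
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```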
- **Evaluation Metrics**
  - Accuracy: measured the share of correct predictions.
  - Confusion Matrix: visualized classification performance.
  - Precision, Recall, F1-Score: analyzed the balance between false positives and false negatives.
- **Key Observations**
  - The model performed well on the dataset but showed some limitations with class imbalance.
**Objective:** To implement a neural network for classification on the MNIST dataset.
- **Data Preprocessing**
  - Normalized pixel values of the images to the range 0 to 1.
  - Flattened the image data into feature vectors.
- **Model Training**
  - Used `MLPClassifier` from scikit-learn (see the sketch below).
  - Hidden layers: configured with different neuron counts.
  - Activation function: ReLU for the hidden layers, softmax for the output.
  - Optimizer: Adam, with learning-rate tuning.
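A sketch of this setup, assuming the data is fetched via scikit-learn's `fetch_openml`; the hidden-layer sizes are example values, and `MLPClassifier` applies softmax to the output automatically for multi-class problems:

```python
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# MNIST arrives already flattened to 784 features; scale pixels to [0, 1].
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),  # example neuron counts
    activation="relu",
    solver="adam",
    learning_rate_init=1e-3,
    max_iter=20,                   # small for a quick run; raise for full convergence
)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
# mlp.loss_curve_ holds the per-iteration training loss for plotting.
```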
- **Evaluation Metrics**
  - Accuracy Score: compared against the logistic regression results.
  - Loss Curve: tracked the training process.
  - Confusion Matrix: identified misclassified digits.
- **Key Observations**
  - Achieved higher accuracy than logistic regression.
  - Required hyperparameter tuning for optimal performance.
**Objective:** To predict a continuous target variable based on multiple independent variables.
- **Data Preprocessing**
  - Checked for multicollinearity using the Variance Inflation Factor (VIF); see the sketch below.
  - Standardized numerical features for consistency.
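The VIF check might be sketched as follows, using statsmodels' `variance_inflation_factor`; `X` is assumed to be a DataFrame of the numeric predictors:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Compute the VIF for each predictor; values above ~5-10 flag multicollinearity."""
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
```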
- **Model Training**
  - Applied `LinearRegression` from scikit-learn.
  - Trained the model on the training dataset.
  - Predicted values for the test data.
- **Evaluation Metrics** (sketched after this list)
  - Mean Squared Error (MSE): assessed prediction errors.
  - R-squared Score: measured the share of variance explained.
  - Residual Analysis: checked that the assumptions of linear regression were met.
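A sketch of the fit-and-evaluate steps, assuming `X_train`/`X_test` hold the standardized features and `y_train`/`y_test` the continuous target:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# Residuals should look like structureless noise if the linearity and
# constant-variance assumptions hold.
residuals = y_test - y_pred
```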
- **Key Observations**
  - Provided reasonable predictions but required feature engineering for improvement.
  - Outliers and non-linearity affected accuracy.
This showcase presents six machine learning applications with distinct approaches:
- Decision Tree Classification: moderate performance with room for improvement.
- Ensemble Regression: the Voting Regressor emerged as the best model for prediction.
- KNN Classification: perfect accuracy on training data, but requires validation on test data.
- Logistic Regression: performed well but had class-imbalance issues.
- MLP: outperformed logistic regression but needed careful tuning.
- Multiple Linear Regression: worked for continuous targets but was sensitive to feature selection.
This ML-Lab-Showcase demonstrates key ML techniques and their effectiveness in different scenarios, providing a foundation for further research and refinement.