Overview of different machine learning algorithms, both supervised and unsupervised. Includes exploratory data analysis and performance evaluation of the models.

rbhubert/machine-learning-overview

Machine Learning Overview

Projects for the course "2022 Python for Machine Learning and Data Science Masterclass".

Data exploration

We will go through an exploratory data analysis of the dataset that will be used later in the linear regression project. We start by looking at the different features in the dataset and visualizing the values of the outcome variable (SalePrice). We then look for and deal with outliers as well as missing values, which involves understanding the variables and filling in the NaN values accordingly. Finally, we convert the categorical variables to one-hot encoding.
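The NaN-filling and one-hot encoding steps can be sketched in pandas. The toy DataFrame below stands in for the real housing data; the column names are illustrative, not the actual schema:

```python
import pandas as pd

# Toy frame standing in for the housing dataset (columns are illustrative)
df = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "Lot Frontage": [65.0, None, 80.0, None],              # numeric, has NaNs
    "Garage Type": ["Attchd", None, "Detchd", "Attchd"],   # categorical, has NaNs
})

# Fill numeric NaNs with a summary statistic; give categorical NaNs an explicit label
df["Lot Frontage"] = df["Lot Frontage"].fillna(df["Lot Frontage"].mean())
df["Garage Type"] = df["Garage Type"].fillna("None")

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["Garage Type"], drop_first=True)
```

How to fill each NaN depends on what the variable means (e.g. a missing garage type may genuinely mean "no garage"), which is why the understanding step comes first.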

Supervised algorithms

Using the final version of the Ames_Housing_Data dataset, we scale the features and then create different regression models, namely basic linear regression, Ridge, Lasso, and ElasticNet, to predict the selling price of a house. We evaluate the candidates by displaying residuals and a probability plot to visualize model performance. Once we select the best models, we tune their hyperparameters using GridSearchCV. Finally, we perform another performance analysis using MAE, RMSE, and residual visualizations.
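The scale-then-tune-then-evaluate workflow can be sketched as follows. Synthetic data replaces the housing features here, and the hyperparameter grid is illustrative, not the one used in the project:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling and the model share one Pipeline so GridSearchCV scales inside each CV fold
pipe = Pipeline([("scaler", StandardScaler()), ("model", ElasticNet(max_iter=10000))])
param_grid = {"model__alpha": [0.1, 1.0, 10.0], "model__l1_ratio": [0.1, 0.5, 0.9]}
grid = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
grid.fit(X_train, y_train)

# Final evaluation: MAE, RMSE, and residuals for plotting
pred = grid.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
residuals = y_test - pred
```

The residuals array is what feeds the residual and probability plots mentioned above.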

For this project, we analyze the HeartDisease dataset and create a logistic regression model that predicts whether a person has heart disease based on their physical characteristics. We start with an exploratory data analysis, continue with feature scaling, and finally build the model, using cross-validation to find the best hyperparameter. Performance is assessed with a confusion matrix, a classification report (accuracy, recall, f1-score, and support), and precision-recall and ROC curves.
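A minimal sketch of this flow, with synthetic data in place of the heart-disease records; `LogisticRegressionCV` is one way scikit-learn folds the cross-validated hyperparameter search into the model itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification stand-in for the HeartDisease dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# LogisticRegressionCV searches the regularization strength C via cross-validation
clf = make_pipeline(StandardScaler(), LogisticRegressionCV(Cs=10, cv=5))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)
```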

In this project, we explore the Sonar dataset and create a KNN model capable of telling a rock apart from a mine. In this case, we create a Pipeline (containing the scaler and the KNN model), along with GridSearchCV to tune the hyperparameter k. We carry out the performance evaluation through a confusion matrix and a classification report.
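The Pipeline-plus-grid-search pattern looks like this; the data and the range of k values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sonar data (rock vs. mine)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Scaler and KNN live in one Pipeline; GridSearchCV tunes k across CV folds
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
grid.fit(X, y)

best_k = grid.best_params_["knn__n_neighbors"]
```

Putting the scaler inside the Pipeline matters: it is refit on each training fold, so the test fold never leaks into the scaling statistics.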

For this project, we performed an analysis of the Winefraude dataset and created an SVM model capable of detecting fraudulent wine samples. As we did in another project, we first performed an exploratory data analysis, scaled the features, and finally built the model using GridSearchCV to find the best C and gamma hyperparameters. Performance analysis was carried out with the confusion matrix and classification report.
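The C/gamma search can be sketched like so, again on synthetic data and with an illustrative grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine-fraud data
X, y = make_classification(n_samples=200, n_features=6, random_state=7)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
# C controls the margin/misclassification trade-off; gamma the RBF kernel width
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
```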

In this project, we perform an exploratory data analysis of the Telco Customer Churn dataset and create three different tree-based models: a decision tree, a random forest, and AdaBoost, and finally compare their performance using the confusion matrix and classification report.
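Fitting the three tree-based models side by side might look like this; the data is synthetic and the model settings are defaults rather than the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=3),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=3),
    "adaboost": AdaBoostClassifier(random_state=3),
}
# Fit each model and collect test accuracy for comparison
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
```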

In this project, we create a linear SVC model for the Movie Reviews dataset, using bag-of-words and TF-IDF representations to convert text into numeric vectors. Performance analysis was carried out using a confusion matrix and classification_report.
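The text-to-vector-to-classifier chain can be sketched with a tiny made-up corpus standing in for the movie reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus in place of the real reviews
texts = [
    "great film, loved it",
    "terrible movie, boring plot",
    "wonderful acting and story",
    "awful script, waste of time",
]
labels = ["pos", "neg", "pos", "neg"]

# TfidfVectorizer builds the bag-of-words vocabulary and applies TF-IDF weighting
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

pred = clf.predict(["boring and awful"])
```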

Unsupervised algorithms

For this project, we use the CIA Country Facts dataset and create a K-means model to explore the similarity between countries and regions of the world by experimenting with different numbers of clusters. The first step is to understand the data and then prepare the features for the model (dealing with missing values, one-hot encoding, scaling the features). The results of the model can be displayed on a world map.
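Experimenting with the number of clusters usually means comparing inertia (within-cluster sum of squares) across k. A sketch with random data standing in for the prepared country features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in numeric features (e.g. GDP, population, literacy) for 50 countries
X = rng.normal(size=(50, 4))
X_scaled = StandardScaler().fit_transform(X)

# Try a few values of k and record inertia for an elbow-style comparison
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
            for k in (2, 3, 4)}

# Final cluster labels, one per country, ready to color a world map
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```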

This little project tests K-means for color quantization. Basically, we first translate an image into an array of RGB pixel values, and then create a K-means model to reduce the number of colors to just k = 10.
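Color quantization treats each pixel as a 3-dimensional point and replaces it with its cluster centroid. A sketch using a random array in place of a real image:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in for an image: 100x100 pixels with RGB values in [0, 255]
image = rng.integers(0, 256, size=(100, 100, 3))
pixels = image.reshape(-1, 3).astype(float)   # one row per pixel

km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(pixels)

# Replace every pixel with its centroid color, then restore the image shape
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))
```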

The DBSCAN project starts by exploring the wholesale customer dataset, then scales the features, and finally builds the model. The last step is to display the results, with the cluster labels assigned by the model.
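Unlike K-means, DBSCAN infers the number of clusters from density and marks sparse points as outliers (label -1). A sketch on synthetic blobs rather than the wholesale data; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features
X, _ = make_blobs(n_samples=200, centers=3, random_state=4)
X_scaled = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples the density threshold for a core point
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

n_clusters = len(set(labels) - {-1})       # -1 marks outliers, not a cluster
n_outliers = int(np.sum(labels == -1))
```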

For this project, we analyzed the pen-based handwritten digit recognition dataset and used PCA to reduce the number of features needed for a model to identify a digit.
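The reduction step can be sketched with scikit-learn's bundled digits dataset standing in for the pen-based one; passing a float to n_components keeps the smallest number of components explaining that fraction of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load_digits stands in for the pen-based digits dataset (64 pixel features here)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
```

X_reduced can then be fed to any classifier in place of the original 64 features.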
