Overview of different machine learning algorithms, both supervised and unsupervised. Includes exploratory data analysis and performance evaluation of the models.

rbhubert/machine-learning-overview

Machine Learning Overview

Projects for the course "2022 Python for Machine Learning and Data Science Masterclass".

Data exploration

We will go through an exploratory data analysis of the dataset that will be used later in the linear regression project. We start by looking at the different features in the dataset and visualizing the values of the outcome variable (SalePrice). We then look for and deal with outliers as well as missing values, which involves understanding the variables and filling in the NaN values accordingly. Finally, we convert the categorical variables to one-hot encoding.
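The NaN-filling and one-hot encoding steps can be sketched in pandas. The toy DataFrame below stands in for the real housing data; the column names are illustrative, not the actual schema:

```python
import pandas as pd

# Toy frame standing in for the housing dataset (columns are illustrative)
df = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "Lot Frontage": [65.0, None, 80.0, None],              # numeric, has NaNs
    "Garage Type": ["Attchd", None, "Detchd", "Attchd"],   # categorical, has NaNs
})

# Fill numeric NaNs with a summary statistic; give categorical NaNs an explicit label
df["Lot Frontage"] = df["Lot Frontage"].fillna(df["Lot Frontage"].mean())
df["Garage Type"] = df["Garage Type"].fillna("None")

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["Garage Type"], drop_first=True)
```

How to fill each NaN depends on what the variable means (e.g. a missing garage type may genuinely mean "no garage"), which is why the understanding step comes first.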

Supervised algorithms

Using the final version of the Ames_Housing_Data dataset, we scale the features and then create different regression models, namely basic linear regression, Ridge, Lasso, and ElasticNet, to predict the selling price of a house. We evaluate the candidates by displaying residuals and a probability plot to visualize model performance. Once we select the best models, we tune their hyperparameters using GridSearchCV. Finally, we perform another performance analysis using MAE, RMSE, and residual visualizations.
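The scale-then-tune-then-evaluate workflow can be sketched as follows. Synthetic data replaces the housing features here, and the hyperparameter grid is illustrative, not the one used in the project:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling and the model share one Pipeline so GridSearchCV scales inside each CV fold
pipe = Pipeline([("scaler", StandardScaler()), ("model", ElasticNet(max_iter=10000))])
param_grid = {"model__alpha": [0.1, 1.0, 10.0], "model__l1_ratio": [0.1, 0.5, 0.9]}
grid = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
grid.fit(X_train, y_train)

# Final evaluation: MAE, RMSE, and residuals for plotting
pred = grid.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
residuals = y_test - pred
```

The residuals array is what feeds the residual and probability plots mentioned above.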

For this project, we analyze the HeartDisease dataset and create a logistic regression model that predicts whether a person has heart disease based on their physical characteristics. We start with an exploratory data analysis, continue with feature scaling, and finally build the model, using cross-validation to find the best hyperparameter. Performance is assessed with a confusion matrix, a classification report (accuracy, recall, f1-score, and support), and precision-recall and ROC curves.
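A minimal sketch of this flow, with synthetic data in place of the heart-disease records; `LogisticRegressionCV` is one way scikit-learn folds the cross-validated hyperparameter search into the model itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification stand-in for the HeartDisease dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# LogisticRegressionCV searches the regularization strength C via cross-validation
clf = make_pipeline(StandardScaler(), LogisticRegressionCV(Cs=10, cv=5))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)
```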

In this project, we explore the Sonar dataset and create a KNN model capable of telling a rock apart from a mine. In this case, we create a Pipeline (containing the scaler and the KNN model), along with GridSearchCV to tune the hyperparameter k. We carry out the performance evaluation through a confusion matrix and a classification report.
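The Pipeline-plus-grid-search pattern looks like this; the data and the range of k values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sonar data (rock vs. mine)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Scaler and KNN live in one Pipeline; GridSearchCV tunes k across CV folds
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
grid.fit(X, y)

best_k = grid.best_params_["knn__n_neighbors"]
```

Putting the scaler inside the Pipeline matters: it is refit on each training fold, so the test fold never leaks into the scaling statistics.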

For this project, we performed an analysis of the Winefraude dataset and created an SVM model capable of detecting fraudulent wine samples. As we did in another project, we first performed an exploratory data analysis, scaled the features, and finally built the model using GridSearchCV to find the best C and gamma hyperparameters. Performance analysis was carried out with the confusion matrix and classification report.
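The C/gamma search can be sketched like so, again on synthetic data and with an illustrative grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine-fraud data
X, y = make_classification(n_samples=200, n_features=6, random_state=7)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
# C controls the margin/misclassification trade-off; gamma the RBF kernel width
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
```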

In this project, we perform an exploratory data analysis of the Telco Customer Churn dataset and create three different tree-based models: a decision tree, a random forest, and AdaBoost, and finally compare their performance using the confusion matrix and classification report.
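Fitting the three tree-based models side by side might look like this; the data is synthetic and the model settings are defaults rather than the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=3),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=3),
    "adaboost": AdaBoostClassifier(random_state=3),
}
# Fit each model and collect test accuracy for comparison
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
```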

In this project, we create a linear SVC model for the Movie Reviews dataset, using bag-of-words and TF-IDF representations to convert text into numeric vectors. Performance analysis was carried out using a confusion matrix and classification_report.
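The text-to-vector-to-classifier chain can be sketched with a tiny made-up corpus standing in for the movie reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus in place of the real reviews
texts = [
    "great film, loved it",
    "terrible movie, boring plot",
    "wonderful acting and story",
    "awful script, waste of time",
]
labels = ["pos", "neg", "pos", "neg"]

# TfidfVectorizer builds the bag-of-words vocabulary and applies TF-IDF weighting
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

pred = clf.predict(["boring and awful"])
```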

Unsupervised algorithms

For this project, we use the CIA Country Facts dataset and create a K-means model to explore the similarity between countries and regions of the world by experimenting with different numbers of clusters. The first step is to understand the data and then prepare the features for the model (dealing with missing values, one-hot encoding, scaling the features). The results of the model can be displayed on a world map.
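Experimenting with the number of clusters usually means comparing inertia (within-cluster sum of squares) across k. A sketch with random data standing in for the prepared country features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in numeric features (e.g. GDP, population, literacy) for 50 countries
X = rng.normal(size=(50, 4))
X_scaled = StandardScaler().fit_transform(X)

# Try a few values of k and record inertia for an elbow-style comparison
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
            for k in (2, 3, 4)}

# Final cluster labels, one per country, ready to color a world map
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```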

This little project tests K-means for color quantization. Basically, we first translate an image into an array of RGB pixel values, and then create a K-means model to reduce the number of colors to just k = 10.
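Color quantization treats each pixel as a 3-dimensional point and replaces it with its cluster centroid. A sketch using a random array in place of a real image:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in for an image: 100x100 pixels with RGB values in [0, 255]
image = rng.integers(0, 256, size=(100, 100, 3))
pixels = image.reshape(-1, 3).astype(float)   # one row per pixel

km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(pixels)

# Replace every pixel with its centroid color, then restore the image shape
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))
```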

The DBSCAN project starts by exploring the wholesale customer dataset, then scales the features, and finally builds the model. The last step is to display the results, with the cluster labels assigned by the model.
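Unlike K-means, DBSCAN infers the number of clusters from density and marks sparse points as outliers (label -1). A sketch on synthetic blobs rather than the wholesale data; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features
X, _ = make_blobs(n_samples=200, centers=3, random_state=4)
X_scaled = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples the density threshold for a core point
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

n_clusters = len(set(labels) - {-1})       # -1 marks outliers, not a cluster
n_outliers = int(np.sum(labels == -1))
```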

For this project, we analyzed the pen-based handwritten digit recognition dataset and used PCA to reduce the number of features needed for a model to identify a digit.
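The reduction step can be sketched with scikit-learn's bundled digits dataset standing in for the pen-based one; passing a float to n_components keeps the smallest number of components explaining that fraction of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load_digits stands in for the pen-based digits dataset (64 pixel features here)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
```

X_reduced can then be fed to any classifier in place of the original 64 features.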
