This project aims to build a kNN (k-Nearest Neighbors) model for predicting the income levels of adults based on various features. Additionally, interactive visualizations using Plotly Dash have been implemented to explore the dataset and model results.
Before running the notebook, ensure that you have the following Python packages installed:
- dash
- jupyter_dash
- pandas
- numpy
- scikit-learn (imported as sklearn)
- plotly
You can install these packages using pip:
pip install dash jupyter_dash pandas numpy scikit-learn plotly
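To confirm the installation before opening the notebook, a small check along these lines may help. It is only a sketch: the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the project. It looks packages up by their PyPI distribution name, which for scikit-learn differs from the `sklearn` import name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (scikit-learn is imported as `sklearn`).
REQUIRED = ["dash", "jupyter-dash", "pandas", "numpy", "scikit-learn", "plotly"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing ones are easy to spot.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```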
The dataset used in this project is the "Adults Income" dataset. It contains various attributes such as age, workclass, education, marital status, occupation, relationship, race, gender, hours-per-week, native country, and income. The goal is to predict whether an adult earns more than $50K annually.
- Installing and Importing Packages: Necessary packages are installed and imported.
- Loading and Cleaning Data: The dataset is loaded and cleaned by dropping duplicate records and records with missing values.
- Encoding Categorical Data: Categorical variables are encoded using LabelEncoder from sklearn.
- Exploratory Data Analysis (EDA):
  - Correlation Heatmap: Visualizes the correlation between features using Plotly Express.
  - Feature Selection: Determines the most important features using SelectKBest, PCA, and ExtraTreesClassifier.
- Interactive Visualizations:
  - Distribution Analysis: Implements an interactive histogram for analyzing the distribution of variables.
  - Feature Importance: Visualizes the importance of features using Plotly Dash.
  - Confusion Matrix: Displays a confusion matrix for model evaluation.
  - kNN Classification Prediction Plot: Implements an interactive plot to visualize kNN classification predictions.
- Model Building and Evaluation:
- Conclusion: Summarizes the project and findings.
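The cleaning, encoding, and modeling steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy records, the three-column subset, `n_neighbors=3`, the 75/25 split, and the use of `StandardScaler` are all assumptions made for the example, and missing values are assumed to be represented as nulls.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy records standing in for the real dataset.
# The last two rows are a record with a missing value and an exact duplicate.
raw = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55, 65, 36, 26, 25],
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "11th",
                  "Bachelors", "HS-grad", "Doctorate", "11th", "Masters",
                  "HS-grad", "Bachelors", None, "HS-grad"],
    "hours-per-week": [40, 50, 40, 45, 30, 45, 40, 32, 35, 60, 40, 50, 40, 40],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K",
               ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Cleaning: drop records with missing values, then drop duplicates.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Encoding: replace each categorical column with integer labels.
encoded = clean.copy()
for col in ["education", "income"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.drop(columns="income")
y = encoded["income"]  # 0 for "<=50K", 1 for ">50K" (labels are sorted)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumption: features are standardized because kNN is distance-based;
# the project description does not say whether the notebook does this.
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

On the real dataset, the value of k and the choice of features would normally be tuned rather than fixed as they are here.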
This project demonstrates the process of building a kNN model for predicting adult income levels and creating interactive visualizations to explore the dataset and model results. By leveraging Plotly Dash, users can interactively analyze the data and understand the model's performance.
To fully interact with the visualizations, it's recommended to run the notebook in a Jupyter environment or Google Colab. Enjoy exploring the dataset and model predictions!