In this project, I use R to preprocess Spotify's "Unpopular Songs" and "Genre of Artists" datasets from Kaggle. Following Hadley Wickham’s “Tidy Data” principles, I clean the messy data so that the resulting dataset is ready for statistical analysis and supports accurate, ethical data practices.
Check out the detailed data preprocessing in the code.
You can download the datasets from Kaggle.
In the project, I applied various data cleaning techniques, including:
- Removing Duplicates
- Verifying Data Structures
- Ensuring the data follows the tidy data principles:
  - Every column is a variable.
  - Every row is an observation.
  - Every cell is a single value.
- Scanning for Missing Values
- Scanning for Special Values
- Scanning for Errors
- Scanning for Outliers
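
Several of the checks listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `songs` data frame, its column names, and its values are made up for the example and are not the actual Kaggle data or the project's code, and the 1.5 × IQR fence is just one common way to flag outliers.

```r
# Hypothetical toy data standing in for the Spotify data (all values made up).
songs <- data.frame(
  track_name   = c("Song A", "Song A", "Song B", "Song C", "Song D", "Song E", "Song F"),
  danceability = c(0.52, 0.52, 0.61, NA, 0.48, 0.55, 0.50),
  duration_ms  = c(210000, 210000, 198000, 205000, NaN, 2400000, 201000),
  loudness     = c(-7.1, -7.1, -6.5, -8.0, -Inf, -7.4, -6.9),
  stringsAsFactors = FALSE
)

# Removing duplicates: keep only the first copy of each identical row.
songs_unique <- songs[!duplicated(songs), ]

# Verifying data structures: inspect column types and dimensions.
str(songs_unique)

# Scanning for missing values: is.na() is TRUE for both NA and NaN.
missing_counts <- colSums(is.na(songs_unique))

# Scanning for special values: count NaN, Inf and -Inf in numeric columns.
count_special <- function(x) {
  if (!is.numeric(x)) return(0L)
  sum(is.nan(x) | is.infinite(x))
}
special_counts <- vapply(songs_unique, count_special, integer(1))

# Scanning for outliers: flag values outside the 1.5 * IQR fences.
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}
outlier_rows <- songs_unique[flag_outliers(songs_unique$duration_ms), ]

print(missing_counts)
print(special_counts)
print(outlier_rows)
```

Counting missing and special values separately is a deliberate choice in this sketch: `is.na()` does not catch `Inf` or `-Inf`, so a column can look complete while still containing values that would distort a mean or a model fit.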
This project is based on my own assignment for the “Data Wrangling” course (2022) in the RMIT Master of Analytics. I have slightly modified and refined it to showcase my data preprocessing techniques in R.
Feel free to read the step-by-step explanation in my blog.