Data rarely arrives ready for analysis or machine learning, so this package aims to speed up data cleaning and initial exploratory data analysis (EDA). It focuses on handling outliers and missing values, scaling, and correlation visualization.
While R packages with similar functionality exist, this package aims to reduce the amount of code needed to call these functions and produce their outputs. Packages with similar functionality are as follows:
The four functions contained in this package are as follows:
- `cor_map`: A function to plot a correlation matrix of numeric columns in the dataframe
- `outlier_identifier`: A function to identify and deal with outliers
- `scale`: A function to scale numerical values in the dataset
- `imputer`: A function to impute missing values
You can install the development version from GitHub with:
# install.packages("devtools") # run this line first if the `devtools` package is not installed locally
devtools::install_github("UBC-MDS/eda.utilsR")
dplyr
ggplot2
reshape2
stats
rlang
testthat
tibble
The eda.utilsR package helps you perform exploratory data analysis.
eda.utilsR includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both object and graphical form.
eda.utilsR can:
- Diagnose data quality: Identify missing data and outliers that skew the data and provide a corresponding remedy.
- Discover data: Plot a correlation matrix to help explore and understand the data and find scenarios for analysis.
- Prepare for machine learning: Perform column transformations and derive a scaler automatically to meet further machine learning needs.
Please find the detailed documentation in the vignette.
library(eda.utilsR)
For this demonstration we will use the following datasets.
data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2),
  Species = c('Iris-setosa', 'Iris-virginica', 'Iris-germanica')
)
data_with_NA <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, NA)
)
data_with_outlier <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8, 5.3),
  SepalWidthCm = c(1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3, 1.5),
  PetalWidthCm = c(0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5, 0.5)
)
- Impute: Identify missing values in the dataset and replace them with an imputed value.
# calling imputer function
imputer(data_with_NA)
#> SepalLengthCm SepalWidthCm PetalWidthCm
#> 1 5.1 1.4 0.20
#> 2 4.9 1.4 0.10
#> 3 4.7 1.3 0.15
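To make the idea behind imputation concrete, here is a minimal base R sketch of mean imputation. It is not the package's implementation: `impute_mean` is a hypothetical helper written for illustration, and the choice of the column mean as the fill value is an assumption that happens to agree with the output shown above.

```r
# Hypothetical helper (not part of eda.utilsR): fill NAs in numeric
# columns with the mean of the observed values in that column.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

data_with_NA <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, NA)
)

imputed <- impute_mean(data_with_NA)
imputed$PetalWidthCm
```

The missing `PetalWidthCm` entry is filled with the mean of the two observed values; non-numeric columns are left untouched.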
- Identify Outliers: Identify and deal with outliers in the dataset. In the example below, the outlying value 30 is replaced by 3.59, the mean of its column.
# calling outlier_identifier function
outlier_identifier(data_with_outlier, method = "mean")
#> SepalLengthCm SepalWidthCm PetalWidthCm
#> 1 5.1 1.4 0.20
#> 2 4.9 1.4 0.10
#> 3 4.7 1.3 3.59
#> 4 5.2 1.2 0.20
#> 5 5.1 1.2 0.30
#> 6 5.2 1.3 0.10
#> 7 5.1 1.6 0.40
#> 8 4.8 1.3 0.50
#> 9 5.3 1.5 0.50
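The following base R sketch shows one way such a replacement could work. The detection rule is an assumption: the README does not say how `outlier_identifier` flags outliers, so this sketch uses a z-score rule with a cutoff of 2, chosen only because it suits this small example. `replace_outliers_with_mean` and `z_cutoff` are hypothetical names.

```r
# Hypothetical helper (not part of eda.utilsR). Assumption: a value is an
# outlier when it lies more than `z_cutoff` standard deviations from the
# column mean; flagged values are replaced by that column mean.
replace_outliers_with_mean <- function(df, z_cutoff = 2) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!is.numeric(x)) next
    col_mean <- mean(x, na.rm = TRUE)
    col_sd <- stats::sd(x, na.rm = TRUE)
    # Skip columns where a z-score is undefined (constant or too short).
    if (is.na(col_sd) || col_sd == 0) next
    is_outlier <- !is.na(x) & abs(x - col_mean) / col_sd > z_cutoff
    x[is_outlier] <- col_mean
    df[[col]] <- x
  }
  df
}

data_with_outlier <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8, 5.3),
  SepalWidthCm = c(1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3, 1.5),
  PetalWidthCm = c(0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5, 0.5)
)

cleaned <- replace_outliers_with_mean(data_with_outlier)
round(cleaned$PetalWidthCm, 2)
```

Note that the replacement mean is computed with the outlier still included, which is why the filled value is much larger than its neighbours. A different rule, such as one based on the interquartile range, may flag different values.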
- Correlation Heatmap Plotting: Easily plot a correlation matrix along with its values to help explore data.
# defining the numeric columns
num_col <- c('SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm')
# calling cor_map function
cor_map(data, num_col, "purpleorange")
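For readers who want to see the numbers behind a correlation heatmap, the sketch below computes the correlation matrix in base R and reshapes it into the long format that plotting code typically consumes. It is an illustration only and does not reproduce `cor_map` itself; the names `cor_mat`, `cor_long`, `var_x`, and `var_y` are chosen for this example.

```r
data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2),
  Species = c("Iris-setosa", "Iris-virginica", "Iris-germanica")
)
num_col <- c("SepalLengthCm", "SepalWidthCm", "PetalWidthCm")

# Pearson correlation between every pair of numeric columns.
cor_mat <- stats::cor(data[, num_col])

# Long format: one row per (variable, variable) pair, which is the shape
# a heatmap layer usually expects.
cor_long <- as.data.frame(as.table(cor_mat), responseName = "correlation")
names(cor_long)[1:2] <- c("var_x", "var_y")
cor_long
```

With only three rows the correlations are not meaningful; the example just shows the shape of the data a heatmap is drawn from.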
- Scaling: Scale the data in preparation for future use in machine learning projects.
# defining the numeric columns
num_col <- c('SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm')
# calling scale function
scale(data, num_col, "minmax")
#> SepalLengthCm SepalWidthCm PetalWidthCm
#> 1 1.0 1 1
#> 2 0.5 1 0
#> 3 0.0 0 1
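Min-max scaling maps each column onto the interval [0, 1] using (x - min) / (max - min). The sketch below is a base R illustration of that formula, not the package's code; `min_max_scale` is a hypothetical helper, and mapping a constant column to 0 is an assumption made here to avoid dividing by zero.

```r
# Hypothetical helper (not part of eda.utilsR): min-max scale the
# selected numeric columns and return only those columns.
min_max_scale <- function(df, num_col) {
  for (col in num_col) {
    x <- df[[col]]
    rng <- range(x, na.rm = TRUE)
    if (rng[1] == rng[2]) {
      # Assumption: a constant column carries no spread, so map it to 0.
      df[[col]] <- ifelse(is.na(x), NA_real_, 0)
    } else {
      df[[col]] <- (x - rng[1]) / (rng[2] - rng[1])
    }
  }
  df[, num_col, drop = FALSE]
}

data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2),
  Species = c("Iris-setosa", "Iris-virginica", "Iris-germanica")
)
num_col <- c("SepalLengthCm", "SepalWidthCm", "PetalWidthCm")

scaled <- min_max_scale(data, num_col)
scaled
```

Each column's smallest value becomes 0 and its largest becomes 1, which matches the pattern in the output above.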
This package is authored by Chuang Wang, Fatime Selimi, Jiacheng Wang, and Micah Kwok as part of the course project in DSCI-524 (UBC-MDS program). You can see the list of all contributors in the contributors tab.
We welcome and recognize all contributions. If you wish to participate, please review our contributing guidelines.