eda.utilsR

Overview

Data rarely arrives ready for analysis or machine learning, so this package aims to speed up data cleaning and initial exploratory data analysis (EDA). It focuses on four tasks: handling outliers, imputing missing values, scaling, and visualizing correlations.

Our Place in the R Ecosystem

While R packages with similar functionality exist, this package aims to reduce the amount of code needed to run these functions and produce their outputs. Packages with similar functionality are as follows:

Functions

The four functions contained in this package are as follows:

cor_map: A function to plot a correlation matrix of numeric columns in the dataframe
outlier_identifier: A function to identify and deal with outliers
scale: A function to scale numerical values in the dataset
imputer: A function to impute missing values

Installation

You can install the development version from GitHub with:

# install.packages("devtools") # run this line first if the `devtools` package is not installed locally.
devtools::install_github("UBC-MDS/eda.utilsR")

Dependencies

dplyr
ggplot2
reshape2
stats
rlang
testthat
tibble

Usage

The eda.utilsR package helps you perform exploratory data analysis.

eda.utilsR includes several custom functions for initial exploratory analysis of input data, describing its structure and the relationships present in it. The output is available both as objects and as plots.

eda.utilsR can:

Diagnose data quality: Address data quality issues by identifying missing values and outliers and applying a corresponding remedy.
Discover data: Plot a correlation matrix to explore relationships in the data and identify directions for analysis.
Machine learning preparation: Perform column transformations and scale numeric columns for downstream machine learning.

Documentation

Please find the detailed documentation in the vignette.

Example

library(eda.utilsR)

Data

For this demonstration we will use the following datasets.

data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2),
  Species = c("Iris-setosa", "Iris-virginica", "Iris-germanica")
)

data_with_NA <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, NA)
)

data_with_outlier <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8, 5.3),
  SepalWidthCm = c(1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3, 1.5),
  PetalWidthCm = c(0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5, 0.5)
)

1. `imputer`

Impute: Fill in missing values in the dataset. In the example below, the missing PetalWidthCm value is replaced by 0.15, which equals the mean of the two observed values (0.2 and 0.1).

# calling imputer function
imputer(data_with_NA)
#>   SepalLengthCm SepalWidthCm PetalWidthCm
#> 1           5.1          1.4         0.20
#> 2           4.9          1.4         0.10
#> 3           4.7          1.3         0.15
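
The output above is consistent with mean imputation. As a rough illustration of that idea, the following base R sketch fills missing numeric values with the column mean. The helper name `impute_mean_sketch` is hypothetical and is not part of eda.utilsR; the package's actual implementation and options may differ.

```r
# Hypothetical helper (not part of eda.utilsR): replace NA in every numeric
# column with that column's mean, computed from the observed values only.
impute_mean_sketch <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

data_with_NA <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, NA)
)

imputed <- impute_mean_sketch(data_with_NA)
print(imputed)
```

Non-numeric columns are left untouched in this sketch, so a column such as Species would pass through unchanged.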

2. `outlier_identifier`

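Identify Outliers: Identify and deal with outliers in the dataset. In the example below, with method = "mean", the outlying PetalWidthCm value of 30 is replaced by the column mean, 3.59.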

# calling outlier_identifier function
outlier_identifier(data_with_outlier, method = "mean")
#>   SepalLengthCm SepalWidthCm PetalWidthCm
#> 1           5.1          1.4         0.20
#> 2           4.9          1.4         0.10
#> 3           4.7          1.3         3.59
#> 4           5.2          1.2         0.20
#> 5           5.1          1.2         0.30
#> 6           5.2          1.3         0.10
#> 7           5.1          1.6         0.40
#> 8           4.8          1.3         0.50
#> 9           5.3          1.5         0.50
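
To make the idea concrete, here is a minimal base R sketch of one way to flag and replace outliers. It assumes a z-score rule with a cutoff of 2 standard deviations; the source does not state which detection rule eda.utilsR uses, so both the rule and the helper name `replace_outliers_sketch` are assumptions made for illustration only.

```r
# Hypothetical helper (not part of eda.utilsR). Assumption: a value is an
# outlier when it lies more than `cutoff` standard deviations from the mean.
replace_outliers_sketch <- function(df, cutoff = 2) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!is.numeric(x)) next
    col_mean <- mean(x, na.rm = TRUE)
    col_sd <- stats::sd(x, na.rm = TRUE)
    # Skip columns whose spread is undefined or zero.
    if (!isTRUE(col_sd > 0)) next
    is_outlier <- !is.na(x) & abs(x - col_mean) > cutoff * col_sd
    # Replace flagged values with the column mean (computed with the outlier).
    x[is_outlier] <- col_mean
    df[[col]] <- x
  }
  df
}

data_with_outlier <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8, 5.3),
  SepalWidthCm = c(1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3, 1.5),
  PetalWidthCm = c(0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5, 0.5)
)

cleaned <- replace_outliers_sketch(data_with_outlier)
print(round(cleaned, 2))
```

With only nine rows, a single extreme value cannot sit more than about 2.67 standard deviations from the mean, which is why this sketch uses a cutoff of 2 rather than the more common 3.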

3. `cor_map`

Correlation Heatmap Plotting: Easily plot a correlation matrix along with its values to help explore data.

# defining the numeric columns
num_col <- c('SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm')

# calling cor_map function
cor_map(data, num_col, "purpleorange")
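
For readers curious about what feeds such a heatmap, the following base R sketch computes the correlation matrix and reshapes it into long form, one row per pair of variables. It is an illustration under assumptions, not the internals of `cor_map`; the object names `cor_matrix` and `cor_long` are made up for this example.

```r
data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2),
  Species = c("Iris-setosa", "Iris-virginica", "Iris-germanica")
)
num_col <- c("SepalLengthCm", "SepalWidthCm", "PetalWidthCm")

# Pairwise Pearson correlations of the selected numeric columns.
cor_matrix <- stats::cor(data[, num_col])

# Long form: one row per (Var1, Var2) pair with its correlation value,
# which is the shape a tile-based heatmap typically expects.
cor_long <- as.data.frame(as.table(cor_matrix), responseName = "correlation")
print(cor_long)
```

A table in this shape could then be drawn as a heatmap, for example with `ggplot2::geom_tile`.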

4. `scale`

Scaling: Scale the data in preparation for future use in machine learning projects.

# defining the numeric columns
num_col <- c('SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm')

# calling scale function
scale(data, num_col, "minmax")
#>   SepalLengthCm SepalWidthCm PetalWidthCm
#> 1           1.0            1            1
#> 2           0.5            1            0
#> 3           0.0            0            1
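
The "minmax" output above follows the usual min-max formula, (x - min) / (max - min), applied per column. A minimal base R sketch of that formula is shown below; `minmax_sketch` is a hypothetical name and is not the package's `scale` function, whose implementation may differ.

```r
# Hypothetical helper (not part of eda.utilsR): rescale each selected column
# to the [0, 1] range. A constant column would yield NaN (division by zero).
minmax_sketch <- function(df, num_col) {
  for (col in num_col) {
    rng <- range(df[[col]], na.rm = TRUE)
    df[[col]] <- (df[[col]] - rng[1]) / (rng[2] - rng[1])
  }
  df
}

data <- data.frame(
  SepalLengthCm = c(5.1, 4.9, 4.7),
  SepalWidthCm = c(1.4, 1.4, 1.3),
  PetalWidthCm = c(0.2, 0.1, 0.2)
)
num_col <- c("SepalLengthCm", "SepalWidthCm", "PetalWidthCm")

scaled <- minmax_sketch(data, num_col)
print(scaled)
```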

Contributors

This package is authored by Chuang Wang, Fatime Selimi, Jiacheng Wang, and Micah Kwok as part of the course project in DSCI-524 (UBC-MDS program). You can see the list of all contributors in the contributors tab.

We welcome and recognize all contributions. If you wish to participate, please review our contributing guidelines.
