eda_utils_py

Overview

Data rarely comes ready to be used and analyzed for machine learning right away, so this package aims to speed up data cleaning and initial exploratory data analysis (EDA). It focuses on four tasks: handling outliers, imputing missing values, scaling, and visualizing correlations.

Installation

$ pip install -i https://test.pypi.org/simple/ eda_utils_py

Functions

The four functions contained in this package are as follows:

imputer: A function to impute missing values
outlier_identifier: A function to identify and deal with outliers
cor_map: A function to plot a correlation matrix of numeric columns in the dataframe
scale: A function to scale numerical values in the dataset

Our Place in the Python Ecosystem

While Python packages with similar functionality exist, this package aims to reduce the amount of code needed to perform these tasks and produce their outputs. Packages with similar functionality are as follows:

Dependencies

Please see a list of dependencies here.

Usage

The eda_utils_py package helps with the exploratory data analysis portion of your work.

It includes several custom functions for initial exploratory analysis of any input data, describing the structure of the data and the relationships present in it. Depending on the function, the output is returned either as an object or as a plot.

import pandas as pd
from eda_utils_py import eda_utils_py

data = pd.DataFrame({
         'SepalLengthCm':[5.1, 4.9, 4.7],
         'SepalWidthCm':[1.4, 1.4, 1.3],
         'PetalWidthCm':[0.2, 0.1, 0.2],
         'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
         })

data_with_NA = pd.DataFrame({
         'SepalLengthCm':[5.1, 4.9, 4.7],
         'SepalWidthCm':[1.4, 1.4, 1.3],
         'PetalWidthCm':[0.2, 0.1, None]
         })

data_with_outlier = pd.DataFrame({
         'SepalLengthCm':[5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8],
         'SepalWidthCm':[1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3],
         'PetalWidthCm':[0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5]
         })
         
data_with_scale = pd.DataFrame({'SepalLengthCm':[1, 0, 0, 3, 4], 
         'SepalWidthCm':[4, 1, 1, 0, 1], 
         'PetalWidthCm':[2, 0, 0, 2, 1],
         'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica', 'Iris-virginica','Iris-germanica']})

The eda_utils_py package contains functions that will help you to:

Impute: Identify missing values in the data and replace them with imputed values.

eda_utils_py.imputer(data_with_NA)

Output of imputer():
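
For intuition about what imputation does, the sketch below fills each missing value with its column mean using plain pandas. It does not call imputer(); mean imputation is an assumed strategy chosen for illustration and may differ from the package's default behavior.

```python
import pandas as pd

data_with_NA = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7],
    'SepalWidthCm': [1.4, 1.4, 1.3],
    'PetalWidthCm': [0.2, 0.1, None],
})

# Assumption: mean imputation. Each NaN is replaced by the mean of the
# non-missing values in the same column. This is plain pandas, not imputer().
column_means = data_with_NA.mean(numeric_only=True)
imputed = data_with_NA.fillna(column_means)

print(imputed)
```

Columns without missing values are left unchanged, so only the last row of PetalWidthCm differs from the input.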

Identify Outliers: Identify and deal with outliers in the dataset.

eda_utils_py.outlier_identifier(data_with_outlier, method = "median")

Output of outlier_identifier():
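
As a rough picture of what identifying and replacing outliers can look like, the sketch below flags values with the 1.5 x IQR rule and replaces them with the column median using plain pandas. Both the IQR rule and the reading of method = "median" as "replace with the column median" are assumptions made for illustration; outlier_identifier() may use a different detection rule.

```python
import pandas as pd

data_with_outlier = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8],
    'SepalWidthCm': [1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3],
    'PetalWidthCm': [0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5],
})

# Assumption: a value is an outlier if it lies more than 1.5 * IQR
# below the first quartile or above the third quartile of its column.
q1 = data_with_outlier.quantile(0.25)
q3 = data_with_outlier.quantile(0.75)
iqr = q3 - q1
is_outlier = (data_with_outlier < q1 - 1.5 * iqr) | (data_with_outlier > q3 + 1.5 * iqr)

# Replace each flagged value with the median of its column.
cleaned = data_with_outlier.copy()
for col in cleaned.columns:
    cleaned.loc[is_outlier[col], col] = data_with_outlier[col].median()

print(is_outlier.sum())  # number of flagged values per column
print(cleaned)
```

The median is computed on the original column, outlier included, since the median is only weakly affected by a single extreme value.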

Correlation Heatmap Plotting: Easily plot a correlation matrix along with its values to help explore data.

numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']

eda_utils_py.cor_map(data, numerical_columns, col_scheme = 'purpleorange')

Output of cor_map():
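
The values shown in a correlation heatmap come from a pairwise correlation matrix. The sketch below computes one with plain pandas for the same columns; it assumes Pearson correlation (the pandas default) and does not reproduce the plot or call cor_map().

```python
import pandas as pd

data = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7],
    'SepalWidthCm': [1.4, 1.4, 1.3],
    'PetalWidthCm': [0.2, 0.1, 0.2],
    'Species': ['Iris-setosa', 'Iris-virginica', 'Iris-germanica'],
})

numerical_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']

# Pairwise Pearson correlation between the selected numeric columns.
# The result is a square, symmetric DataFrame with ones on the diagonal.
corr = data[numerical_columns].corr()

print(corr.round(2))
```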

Scaling: Scale the data in preparation for future use in machine learning projects.

numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']

eda_utils_py.scale(data_with_scale, numerical_columns, scaler="minmax")

Output of scale():
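
To make the idea of min-max scaling concrete, the sketch below rescales each numeric column to the range [0, 1] with plain pandas. It assumes that scaler="minmax" refers to this standard transformation, (x - min) / (max - min), and it does not call scale().

```python
import pandas as pd

data_with_scale = pd.DataFrame({
    'SepalLengthCm': [1, 0, 0, 3, 4],
    'SepalWidthCm': [4, 1, 1, 0, 1],
    'PetalWidthCm': [2, 0, 0, 2, 1],
    'Species': ['Iris-setosa', 'Iris-virginica', 'Iris-germanica',
                'Iris-virginica', 'Iris-germanica'],
})

numerical_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']

# Min-max scaling: subtract the column minimum and divide by the column range,
# so the smallest value maps to 0 and the largest to 1.
cols = data_with_scale[numerical_columns]
scaled = data_with_scale.copy()
scaled[numerical_columns] = (cols - cols.min()) / (cols.max() - cols.min())

print(scaled)
```

Non-numeric columns such as Species are passed through untouched. A column with a single repeated value would have a range of zero and would need special handling, which this sketch omits.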

Documentation

The official documentation is hosted on Read the Docs: https://eda_utils_py.readthedocs.io/en/latest/

Contributors

This package is authored by Chuang Wang, Fatime Selimi, Jiacheng Wang, and Micah Kwok as part of the course project in DSCI-524 (UBC-MDS program). You can see the list of all contributors in the contributors tab.

We welcome and recognize all contributions. If you wish to participate, please review our contributing guidelines.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci and audreyr/cookiecutter-pypackage project templates.
