# DataPrep

DataPrep lets you prepare your data using a single library with a few lines of code.

Currently, you can use DataPrep to:

- Collect data from common data sources (through dataprep.connector)
- Do your exploratory data analysis (through dataprep.eda)
- Clean and standardize data (through dataprep.clean)


# Install the latest development version of DataPrep

In [None]:
!pip install -U git+https://github.com/sfu-db/dataprep.git@develop --quiet

# Load Modules

In [8]:
# Import only the names used below; a wildcard import is unnecessary here.
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

# Load Dataset

In [11]:
df = load_dataset("titanic")

# Create Overview

In [12]:
plot(df)

AttributeError: scipy.stats.stats is deprecated and has no attribute _power_div_lambda_names. Try looking in scipy.stats instead.
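
The error above suggests that the installed SciPy no longer exposes a private attribute that some library on the import path still expects. Before pinning or upgrading anything, it can help to record which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the list of packages is an assumption about what is relevant to this error.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names are assumptions about what may be involved in the error.
for name in ("dataprep", "scipy", "dask", "pandas"):
    print(f"{name}: {installed_version(name)}")
```

The printed versions are a starting point for checking compatibility; the sketch does not say which combination resolves the error.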

# Understand missing values

## 1. Missing value overview

In [None]:
plot_missing(df)
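
To cross-check the numbers behind a missing-value overview, the per-column counts can be computed directly with pandas. This is a rough sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); it is not how DataPrep computes its plots.

```python
import pandas as pd

# Hypothetical toy frame; the real Titanic data has more rows and columns.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, None],
        "Sex": ["male", "female", "female", "male"],
        "Cabin": [None, "C85", None, None],
    }
)

# Number and share of missing cells in each column.
missing = pd.DataFrame(
    {"count": toy.isna().sum(), "share": toy.isna().mean()}
)
print(missing)
```

The same two lines work on `df` once it has been loaded.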

## 2. Impact of missing values

In [None]:
plot_missing(df, "Age")

# Understand correlation

## 1. Correlation overview

In [None]:
plot_correlation(df)

## 2. Understand how other columns correlate with a given column

In [None]:
plot_correlation(df, "Age")
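
As a numeric cross-check, the coefficients for one column against the others can be sketched with pandas. The toy frame below is hypothetical, and the choice of Pearson and Spearman is an assumption made for illustration, not a statement of what DataPrep computes.

```python
import pandas as pd

# Hypothetical toy frame with numeric columns only.
toy = pd.DataFrame(
    {
        "Age": [10.0, 20.0, 30.0, 40.0],
        "Fare": [5.0, 10.0, 15.0, 20.0],  # exactly linear in Age
        "SibSp": [3, 1, 2, 0],
    }
)

# Full correlation matrices under two different coefficients.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Correlation of every other column with "Age".
age_corr = pearson["Age"].drop("Age")
print(age_corr)
```

On real data, select the numeric columns first, for example with `df.select_dtypes("number")`.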

# Understand a single column

## 1. Numerical column

In [None]:
plot(df, "Age")

## 2. Categorical column

In [None]:
plot(df, "Sex")

# Understand column relationships

## 1. Numerical vs Numerical

In [None]:
plot(df, "Age", "Fare")

## 2. Numerical vs Categorical

In [None]:
plot(df, "Age", "Sex")

## 3. Categorical vs Categorical

In [None]:
plot(df, "Sex", "Survived")
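
The relationship between two categorical columns boils down to a contingency table. A minimal pandas sketch on a hypothetical toy frame (values made up for illustration) shows the counts and row-wise shares that such a comparison rests on.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "female"],
        "Survived": [0, 1, 1, 0, 0],
    }
)

# Raw counts for each (Sex, Survived) combination.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# Share of each Survived value within each Sex (rows sum to 1).
shares = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(shares)
```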

# Customize Visualization

## 1. Show only the visualizations of interest

In [None]:
plot(df, "Age", display=["Histogram"])

## 2. Configure the visualizations

In [None]:
plot(df, "Age", display=["Histogram"], config={"hist.bins": 1000, "height": 500})

# Compare two dataframes

In [None]:
df1 = df[df["Survived"] == 0]
df2 = df[df["Survived"] == 1]
plot_diff([df1, df2])
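
For a quick numeric counterpart to the visual comparison, the same split can be summarized side by side with pandas. The toy frame and the names `toy1`, `toy2`, and `summary` are hypothetical; this sketch compares only column means and is not a description of what `plot_diff` computes.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame(
    {
        "Survived": [0, 0, 1, 1],
        "Age": [40.0, 30.0, 20.0, 10.0],
        "Fare": [7.0, 9.0, 50.0, 70.0],
    }
)

# Same split as above: one subset per value of "Survived".
toy1 = toy[toy["Survived"] == 0]
toy2 = toy[toy["Survived"] == 1]

# Column means of each subset, placed side by side.
columns = ["Age", "Fare"]
summary = pd.DataFrame(
    {"df1": toy1[columns].mean(), "df2": toy2[columns].mean()}
)
print(summary)
```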

# Create an HTML report

In [None]:
create_report(df)

# Analyze geographical data

In [None]:
country = load_dataset("countries")
plot(country, "Country")