<a href="https://colab.research.google.com/github/softhints/Pandas-Exercises-Projects/blob/main/project/Exploratory%20Data%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reading the datasets

In [None]:
import pandas as pd
file = 'https://raw.githubusercontent.com/softhints/Pandas-Exercises-Projects/main/data/food_recipes.csv'
df = pd.read_csv(file, low_memory=False)

file_m = 'https://raw.githubusercontent.com/softhints/Pandas-Exercises-Projects/main/data/movies_metadata.csv'
df_m = pd.read_csv(file_m, low_memory=False)
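
Before handing a DataFrame to any of the tools below, a quick manual check of dtypes and missing values gives a baseline to compare the generated reports against. The sketch below is only an illustration: `sample` is a small made-up frame (not the recipes or movies data) and `quick_overview` is a hypothetical helper, so the cell runs without network access.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `df` / `df_m` (assumption: not the real data)
sample = pd.DataFrame({
    "title": ["Soup", "Salad", None, "Cake"],
    "minutes": [30, 10, 45, np.nan],
    "rating": [4.5, 3.9, 4.8, 4.1],
})


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": frame.isna().mean().mul(100).round(1),
        "unique": frame.nunique(),  # NaN is not counted as a value
    })


overview = quick_overview(sample)
print(sample.shape)
print(overview)
```

The same helper can be called on the real frames, for example `quick_overview(df_m)`.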

# 1. [sweetviz](https://pypi.org/project/sweetviz/)

> In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!

* `pip install sweetviz` 

[github - sweetviz](https://github.com/fbdesignpro/sweetviz)

### Features
* Target analysis
* Visualize and compare
* Mixed-type associations
* Type inference
* Summary information
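
Of the features above, mixed-type associations are the least self-explanatory. According to the sweetviz README, categorical-numerical pairs are scored with the correlation ratio. The sketch below is a plain-pandas illustration of that statistic, not sweetviz's own code; the helper name `correlation_ratio` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta): share of numeric variance explained by the category."""
    frame = pd.DataFrame({"cat": categories, "val": values}).dropna()
    overall_mean = frame["val"].mean()
    groups = frame.groupby("cat")["val"]
    # Between-group sum of squares, weighted by group size
    ss_between = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    # Total sum of squares around the overall mean
    ss_total = ((frame["val"] - overall_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    return float(np.sqrt(ss_between / ss_total))


# Toy data: two well-separated groups, so eta should be close to 1
toy = pd.DataFrame({
    "genre": ["a", "a", "a", "b", "b", "b"],
    "score": [1, 2, 3, 7, 8, 9],
})
eta = correlation_ratio(toy["genre"], toy["score"])
print(round(eta, 4))
```

A value near 0 means the category says little about the numeric column; a value near 1 means the groups barely overlap.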

In [None]:
import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_html()

## 1.1 haisweetviz

* `pip install haisweetviz` 

In [None]:
import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_html()

# 2. [autoviz](https://pypi.org/project/autoviz/)

> Automatically Visualize any dataset, any size with a single line of code. Now you can save these interactive charts as HTML files automatically with the "html" setting.

* `pip install autoviz`


[github - autoviz](https://github.com/AutoViML/AutoViz)

### Features
* Visualize and compare
* Summary information
* Outliers and missing values
* Data cleaning improvement suggestions
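
How AutoViz flags outliers internally is not covered here. As a rough point of comparison, the sketch below swaps in the common 1.5 × IQR (Tukey) rule, written in plain pandas; the helper `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]


# Made-up values with one obvious extreme
votes = pd.Series([10, 12, 11, 13, 12, 95], name="votes")
outliers = iqr_outliers(votes)
print(outliers)
```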

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()

# AutoViz reads the CSV itself, so pass the path/URL rather than a DataFrame
dft = AV.AutoViz(file, sep=",")

# 3. [pandas-profiling](https://pypi.org/project/pandas-profiling/)

> pandas-profiling generates profile reports from a pandas DataFrame. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding.

* `pip install pandas-profiling` 

[github - profiling](https://github.com/ydataai/pandas-profiling)

### Features
*   **Type inference**: detect the types of columns in a DataFrame
*   **Essentials**: type, unique values, indication of missing values
*   **Quantile statistics**: minimum value, Q1, median, Q3, maximum, range, interquartile range
*   **Descriptive statistics**: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
*   **Most frequent and extreme values**
*   **Histograms**: categorical and numerical
*   **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
*   **Missing values**: through counts, matrix, heatmap and dendrograms
*   **Duplicate rows**: list of the most common duplicated rows
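*   **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
*   **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata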
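
To make the quantile and descriptive statistics listed above concrete, the sketch below computes them by hand with pandas on a tiny made-up series. It is an illustration only: `describe_numeric` is a hypothetical helper, and the report's exact definitions (for example sample versus population standard deviation) may differ.

```python
import pandas as pd

# Made-up numeric column
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0], name="x")


def describe_numeric(s: pd.Series) -> dict:
    """Compute the quantile and descriptive statistics named in the feature list."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    mean = s.mean()
    std = s.std()  # pandas default: sample standard deviation (ddof=1)
    return {
        "min": s.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "mad": (s - median).abs().median(),  # median absolute deviation
        "cv": std / mean,                    # coefficient of variation
        "skewness": s.skew(),
        "kurtosis": s.kurt(),
    }


stats = describe_numeric(values)
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```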


In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, explorative=True)
profile

# 4. [dataprep](https://pypi.org/project/dataprep/)

> DataPrep lets you prepare your data using a single library with a few lines of code.

* `pip install -U dataprep` 

[github - dataprep](https://github.com/sfu-db/dataprep)

### Features
* Collect data from common data sources (through dataprep.connector)
* Do your exploratory data analysis (through dataprep.eda)
* Clean and standardize data (through dataprep.clean)

In [None]:
from dataprep.datasets import load_dataset
from dataprep.eda import create_report

# Optional: swap in a bundled sample dataset instead of the recipes data
# df = load_dataset("titanic")
create_report(df).show_browser()

# 5. [dabl](https://pypi.org/project/dabl/)

> Data Analysis Baseline Library.

* `pip install dabl` 

### Features
* Analyse single columns
* Grouped univariate histograms
* Scatter plot for categories
* Determine a good grid shape for subplots
* Create a mosaic plot from a dataframe
* Plots for categorical features in classification
* Visualize coefficients of a linear model

In [None]:
import dabl

dabl.plot(df_m, target_col="vote_count")

# 6. [dtale](https://pypi.org/project/dtale/)

> Web Client for Visualizing Pandas Objects.

* `pip install dtale` 

[dtale - github](https://github.com/man-group/dtale)

### Features

* Summarize Data
* Duplicates detection
* Missing Analysis
* Outlier Detection
* Custom Filter
* Network Viewer
* Correlations
* Predictive Power Score
* Heat Map
* Load Data & Sample Datasets

In [None]:
import dtale
import pandas as pd

dtale.show(df_m)

# 7. [klib](https://pypi.org/project/klib/)

> Customized data preprocessing functions for frequent tasks.

* `pip install klib` 

[klib - github](https://github.com/akanz1/klib)

### Features
**klib.describe** - functions for visualizing datasets
- `klib.cat_plot(df)` - returns a visualization of the number and frequency of categorical features
- `klib.corr_mat(df)` - returns a color-encoded correlation matrix
- `klib.corr_plot(df)` - returns a color-encoded heatmap, ideal for correlations
- `klib.dist_plot(df)` - returns a distribution plot for every numeric feature
- `klib.missingval_plot(df)` - returns a figure containing information about missing values

**klib.clean** - functions for cleaning datasets
- `klib.data_cleaning(df)` - performs data cleaning (drop duplicates & empty rows/cols, adjust dtypes, ...)
- `klib.clean_column_names(df)` - cleans and standardizes column names, also called inside data_cleaning()
- `klib.convert_datatypes(df)` - converts existing dtypes to more efficient ones, also called inside data_cleaning()
- `klib.drop_missing(df)` - drops missing values, also called in data_cleaning()
- `klib.mv_col_handling(df)` - drops features with high ratio of missing vals based on informational content
- `klib.pool_duplicate_subsets(df)` - pools subset of cols based on duplicates with min. loss of information
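
To give a feel for the kind of steps described for `klib.data_cleaning`, the sketch below approximates a few of them in plain pandas: standardizing column names, dropping empty rows and columns, dropping duplicates and adjusting dtypes. This is not klib's implementation; `basic_clean` and the `raw` frame are hypothetical, and klib's actual rules and thresholds may differ.

```python
import numpy as np
import pandas as pd

# Made-up messy frame
raw = pd.DataFrame({
    "Movie Title": ["A", "B", "B", None],
    "Vote Count": [10, 20, 20, np.nan],
    "Empty Col": [np.nan, np.nan, np.nan, np.nan],
})


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Rough pandas-only approximation of a generic cleaning pass."""
    out = frame.copy()
    # Standardize column names: trim, lowercase, whitespace -> underscore
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    out = out.dropna(axis=1, how="all")  # drop completely empty columns
    out = out.dropna(axis=0, how="all")  # drop completely empty rows
    out = out.drop_duplicates().reset_index(drop=True)
    return out.convert_dtypes()          # let pandas pick nullable dtypes


cleaned = basic_clean(raw)
print(cleaned)
print(cleaned.dtypes)
```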

In [None]:
import klib

klib.missingval_plot(df_m)

In [None]:
df_cleaned = klib.data_cleaning(df_m)

In [None]:
klib.corr_plot(df_m)

In [None]:
klib.corr_plot(df_cleaned, target='revenue')

In [None]:
klib.dist_plot(df_m)

In [None]:
klib.corr_mat(df_cleaned)