![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 5.2 Automated Exploratory Data Analysis (EDA) in Python

This tutorial introduces automated Exploratory Data Analysis (EDA) using Python libraries designed to streamline data exploration and visualization. Automated EDA tools generate comprehensive reports and visualizations with minimal code, allowing users to quickly understand dataset characteristics, including distributions, correlations, and missing values. This guide covers **ydata-profiling**, **sweetviz**, **dtale**, and **autoviz**, demonstrating how to use these libraries to perform efficient EDA. The tutorial assumes basic familiarity with Python and pandas.

## Prerequisites

### Install Required Packages

The following libraries will be used:

- **ydata-profiling**: Generates interactive HTML EDA reports with statistics, correlations, and missing value analysis.
- **sweetviz**: Creates comparison reports for datasets or feature-target distributions.
- **dtale**: Provides an interactive web interface for exploring DataFrames.
- **autoviz**: Automatically generates visualizations for all variables in a dataset.

Install the packages using pip:

```bash
pip install pandas ydata-profiling sweetviz dtale autoviz
```

### Verify Installation

Check if the packages are installed:

In [2]:
from importlib.metadata import PackageNotFoundError, version

packages = ['pandas', 'ydata-profiling', 'sweetviz', 'dtale', 'autoviz']
installed = {}
for pkg in packages:
    try:
        installed[pkg] = version(pkg)  # version string of the installed distribution
    except PackageNotFoundError:
        pass  # not installed yet; install it in the next cell
print("Installed packages:", installed)

Installed packages: {'pandas': '2.3.1'}


In [None]:
%pip install ydata-profiling sweetviz dtale autoviz

### Import Libraries

Import the required libraries with suppressed warnings for cleaner output:

In [5]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from ydata_profiling import ProfileReport
import sweetviz as sv
import dtale
from autoviz.AutoViz_Class import AutoViz_Class

Imported v0.1.905. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)


## Dataset

We will use a sample dataset, `gp_soil_data_na.csv`, loaded directly from a GitHub repository. It contains soil-related variables, including SOC (soil organic carbon), DEM (digital elevation model, i.e. elevation), and NLCD (land cover categories from the National Land Cover Database).

In [6]:
# Load dataset
url = "https://github.com/zia207/Python_for_Beginners/raw/refs/heads/main/Data/gp_soil_data_na.csv"
df = pd.read_csv(url)
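
Before handing a DataFrame to any of the automated tools, a quick structural check can confirm that it loaded as expected. The sketch below is one possible way to do that with plain pandas. It runs on a small made-up frame (`sample`) rather than the real file, and the helper name `quick_overview` and the example values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the soil data; all values are made up for illustration.
sample = pd.DataFrame({
    "SOC": [1.2, np.nan, 3.4, 2.1],
    "DEM": [2100.0, 1850.5, np.nan, 1990.0],
    "NLCD": ["Forest", "Shrubland", "Forest", None],
})


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


overview = quick_overview(sample)
print(sample.shape)   # (rows, columns)
print(overview)
```

The same helper could be applied to the tutorial's `df` with `quick_overview(df)`.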

## 1. ydata-profiling: Comprehensive EDA Report

**ydata-profiling** generates an interactive HTML report summarizing dataset characteristics, including variable types, quantiles, correlations, and missing values.

### Generate and Save Report

In [5]:
# Create profile report
profile = ProfileReport(df, title="Soil Data EDA Report", explorative=True)

# Save to HTML file
profile.to_file("soil_data_profile_report.html")

This code produces an HTML file (`soil_data_profile_report.html`) with sections for:
- Dataset overview (rows, columns, missing values)
- Variable summaries (distributions, statistics)
- Correlation matrices (Pearson, Spearman, etc.)
- Missing value patterns
- Sample data preview

Open the HTML file in a browser to explore the interactive report.
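
To make the report sections listed above less of a black box, here is a rough pandas sketch of the kind of numbers they summarize: descriptive statistics, Pearson and Spearman correlations, and the share of missing values per column. It is not how ydata-profiling computes them internally. The frame `demo`, its values, and the `Slope` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; values are made up for illustration.
demo = pd.DataFrame({
    "SOC": [1.0, 2.0, 3.0, 4.0, 5.0],
    "DEM": [1.0, 4.0, 9.0, 16.0, 25.0],      # monotonic but non-linear in SOC
    "Slope": [10.0, 8.0, np.nan, 4.0, 2.0],  # contains one missing value
})

summary = demo.describe()                 # count, mean, std, min, quartiles, max
pearson = demo.corr(method="pearson")     # linear association
spearman = demo.corr(method="spearman")   # rank-based (monotonic) association
missing_pct = demo.isna().mean() * 100    # percent missing per column

print(summary)
print(pearson.round(3))
print(spearman.round(3))
print(missing_pct)
```

Because `DEM` increases with `SOC` but not linearly, the Spearman coefficient for that pair is 1 while the Pearson coefficient is slightly below 1, which is the kind of difference the report's correlation tabs help reveal.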

### Key Features
- **Interactive**: Clickable tabs for variable details and correlations.
- **Comprehensive**: Includes histograms, boxplots, and missing value visualizations.
- **Customizable**: Supports minimal mode for large datasets or custom correlation methods.
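
As a sketch of the minimal mode mentioned above: `minimal=True` is an option of `ProfileReport` that switches off the more expensive computations. Pairing it with random sampling is one common way to keep profiling fast on large tables. The helper `sample_for_profiling`, the row threshold, and the synthetic `big` frame are assumptions for illustration, not part of the library.

```python
import pandas as pd

try:
    from ydata_profiling import ProfileReport
except ImportError:  # keep the sketch runnable when the library is not installed
    ProfileReport = None


def sample_for_profiling(frame: pd.DataFrame, max_rows: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Return the frame unchanged if it is small, else a reproducible random sample."""
    if len(frame) <= max_rows:
        return frame
    return frame.sample(n=max_rows, random_state=seed)


# Hypothetical large table standing in for a big dataset
big = pd.DataFrame({"x": range(50_000), "y": [i % 7 for i in range(50_000)]})
subset = sample_for_profiling(big, max_rows=5_000)

if ProfileReport is not None:
    # minimal=True requests the reduced report configuration
    profile_min = ProfileReport(subset, title="Minimal profile (sampled)", minimal=True)
    # profile_min.to_file("minimal_profile.html")  # uncomment to write the report
```

Sampling changes the statistics slightly, so a sampled report is best treated as a first look rather than a final summary.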

## 2. sweetviz: Comparative EDA Reports

**sweetviz** generates reports focused on dataset comparisons or feature-target analysis, producing visually appealing HTML outputs.

### Generate Report for Single Dataset

In [14]:
import sweetviz as sv

# Analyze the dataset with error handling
try:
    report = sv.analyze(df)
    # Save to HTML file
    report.show_html("soil_data_sweetviz_report.html")
except AttributeError as e:
    print("Error: Likely a compatibility issue with NumPy or Sweetviz. Try updating Sweetviz (`pip install --upgrade sweetviz`) or downgrading NumPy (`pip install numpy==1.23.5`).")
    print(f"Error details: {e}")

Error: Likely a compatibility issue with NumPy or Sweetviz. Try updating Sweetviz (`pip install --upgrade sweetviz`) or downgrading NumPy (`pip install numpy==1.23.5`).
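
For the comparison and feature-target use cases described at the start of this section, a sketch along the following lines may help. It uses `sv.compare` with `target_feat` and `show_html(..., open_browser=False)` from sweetviz. The helper names `split_halves` and `compare_halves`, the output file name, and the 50/50 split are assumptions for illustration. Rows with a missing target are dropped first because, as far as I am aware, sweetviz rejects target columns that contain missing values. The sweetviz import happens inside the function, so the same NumPy compatibility caveat as above applies when it is called.

```python
import pandas as pd


def split_halves(frame: pd.DataFrame, seed: int = 0):
    """Split a frame into two non-overlapping random halves."""
    part_a = frame.sample(frac=0.5, random_state=seed)
    part_b = frame.drop(part_a.index)
    return part_a, part_b


def compare_halves(frame: pd.DataFrame, target: str, out_path: str) -> None:
    """Write a sweetviz report comparing two random halves of the frame."""
    import sweetviz as sv  # imported lazily; requires a working sweetviz install

    # Assumption: the target column must not contain missing values
    usable = frame.dropna(subset=[target])
    part_a, part_b = split_halves(usable)
    report = sv.compare([part_a, "Subset A"], [part_b, "Subset B"], target_feat=target)
    report.show_html(out_path, open_browser=False)


# Example call with the tutorial's DataFrame (not executed here):
# compare_halves(df, target="SOC", out_path="soil_data_sweetviz_compare.html")
```

In practice the two subsets would more often be meaningful groups, such as a training and a test split, rather than random halves.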


## 3. dtale: Interactive Web Interface

**dtale** provides a web-based interface for exploring DataFrames, allowing filtering, sorting, and chart creation interactively.

### Launch dtale Interface

In [21]:
# Start dtale
d = dtale.show(df, open_browser=True)

This opens a web browser with an interactive interface where you can:
- View and filter the DataFrame
- Generate plots (histograms, scatter plots, boxplots)
- Calculate summary statistics
- Analyze correlations and missing values

### Key Features
- **Interactive**: Real-time filtering and sorting.
- **Charting**: Create and customize plots within the interface.
- **Exportable**: Export data or charts for further analysis.

Closing the browser tab does not stop the dtale server. To shut it down, call `d.kill()` on the object returned by `dtale.show()`, or terminate the Python process.

## 4. autoviz: Automatic Visualizations

**autoviz** generates visualizations for all variables in a dataset with minimal configuration, producing charts like histograms, scatter plots, and boxplots.

### Generate Visualizations

In [None]:
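# Initialize AutoViz
AV = AutoViz_Class()

# Generate visualizations; AutoViz returns the DataFrame it analyzed
report = AV.AutoViz(
    filename="",          # Empty because the data is passed in via dfte
    dfte=df,
    depVar="",            # No target variable for unsupervised EDA
    chart_format="html",  # Save charts as interactive HTML files
    verbose=2             # Save charts to disk instead of displaying them inline
)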


In [22]:
# AutoViz returns the analyzed DataFrame, so this prints the data itself
print(report)

     ID   FIPS   STATE_ID    STATE           COUNTY          Longitude  \
0      1  56041     56        Wyoming         Uinta County -111.011860   
1      2  56023     56        Wyoming       Lincoln County -110.982973   
2      3  56039     56        Wyoming         Teton County -110.806490   
3      4  56039     56        Wyoming         Teton County -110.734417   
4      5  56029     56        Wyoming          Park County -110.730790   
5      6  56039     56        Wyoming         Teton County -110.661850   
6      7  56039     56        Wyoming         Teton County -110.643480   
7      8  56039     56        Wyoming         Teton County -110.595819   
8      9  56039     56        Wyoming         Teton County -110.576980   
9     10  56035     56        Wyoming      Sublette County -110.517020   
10    11  56035     56        Wyoming      Sublette County -110.513510   
11    12  56035     56        Wyoming      Sublette County -110.482307   
12    13  56039     56        Wyoming 

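With `chart_format="html"`, AutoViz writes interactive HTML charts to a directory on disk (typically `AutoViz_Plots` in the working directory, unless `save_plot_dir` is set) instead of displaying them inline. The charts cover each variable, including: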
- Histograms for numerical variables
- Bar plots for categorical variables
- Scatter plots for relationships between variables

### Key Features
- **Automated**: Minimal code for comprehensive visualizations.
- **Customizable**: Supports target variable analysis and chart type selection.
- **Scalable**: Handles large datasets with sampling options.
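
The target-variable and sampling options mentioned above map onto parameters shown in AutoViz's own import banner: `depVar`, `max_rows_analyzed`, `max_cols_analyzed`, and `save_plot_dir`. The wrapper below is only a sketch of how they might be combined. The function name `autoviz_with_target` and the output directory name are assumptions, and the numeric limits simply repeat the defaults from the banner.

```python
import pandas as pd


def autoviz_with_target(frame: pd.DataFrame, target: str, out_dir: str = "autoviz_output"):
    """Run AutoViz against a target column and save HTML charts under out_dir."""
    if target not in frame.columns:
        raise KeyError(f"target column {target!r} not found")

    # Imported lazily so the check above works even without autoviz installed
    from autoviz.AutoViz_Class import AutoViz_Class

    AV = AutoViz_Class()
    return AV.AutoViz(
        filename="",               # data is passed in via dfte
        dfte=frame,
        depVar=target,             # charts are organized around this column
        chart_format="html",
        verbose=2,                 # save charts instead of displaying them
        max_rows_analyzed=150000,  # rows beyond this limit are sampled
        max_cols_analyzed=30,
        save_plot_dir=out_dir,
    )


# Example call with the tutorial's DataFrame (not executed here):
# autoviz_with_target(df, target="SOC")
```

Checking for the target column up front gives a clearer error than letting the plotting code fail partway through.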

## Handling Missing Values

Automated EDA tools also highlight missing values. For example, `ydata-profiling` and `sweetviz` include missing value visualizations. To programmatically handle missing values before EDA:

In [20]:
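# Option 1: remove every row that contains at least one missing value
df_clean = df.dropna()

# Option 2: impute missing numeric values with each column's median
# (non-numeric columns are left unchanged)
df_imputed = df.fillna(df.median(numeric_only=True))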

# Re-run EDA on cleaned data (e.g., with ydata-profiling)
profile_clean = ProfileReport(df_clean, title="Cleaned Soil Data EDA Report")
profile_clean.to_file("soil_data_cleaned_profile_report.html")
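
The difference between the two strategies is easiest to see on a tiny example. The frame `toy` below is made up for illustration. Dropping rows shrinks the data, while median imputation keeps every row but only fills numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are made up for illustration.
toy = pd.DataFrame({
    "SOC": [1.0, np.nan, 3.0, 10.0],
    "DEM": [100.0, 200.0, np.nan, 400.0],
    "NLCD": ["Forest", "Shrubland", None, "Forest"],
})

dropped = toy.dropna()                                # keeps only complete rows
imputed = toy.fillna(toy.median(numeric_only=True))   # fills numeric gaps only

print(len(toy), len(dropped), len(imputed))   # 4 2 4
print(imputed.isna().sum())                   # NLCD still has one missing value
```

Here half of the rows are lost with `dropna()`, and the categorical `NLCD` gap survives median imputation, so it would need its own treatment (for example, a mode or an explicit "Unknown" category).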

## Summary and Conclusion

Automated EDA tools like **ydata-profiling**, **sweetviz**, **dtale**, and **autoviz** simplify the process of exploring and visualizing datasets. These libraries provide quick insights into data distributions, correlations, and missing values, saving time compared to manual EDA. By generating interactive reports and visualizations, they enable both beginners and experienced analysts to uncover patterns and anomalies efficiently. Experiment with these tools on diverse datasets to leverage their strengths for specific use cases, such as comparing subgroups or exploring large datasets interactively.

## References

1. [ydata-profiling Documentation](https://ydata-profiling.ydata.ai/docs/master/)
2. [sweetviz Documentation](https://github.com/fbdesignpro/sweetviz)
3. [dtale Documentation](https://github.com/man-group/dtale)
4. [autoviz Documentation](https://github.com/AutoViML/AutoViz)