docs(introduction): intro and install improvements (#979)
* overall improvements to the docs initial page
* Adding details about missing report sections, ways to consume it, tweaking and uniformizing language
* details about file and image analysis
* installations: update install instructions for widgets version
dr-ydata authored and sbrugman committed May 15, 2022
1 parent 188e02b commit c31b627
Showing 2 changed files with 43 additions and 37 deletions.
38 changes: 17 additions & 21 deletions docsrc/source/pages/installation.rst
@@ -17,22 +17,22 @@ Using pip
:alt: PyPi Version
:target: https://pypi.org/project/pandas-profiling/

You can install using the pip package manager by running
You can install using the ``pip`` package manager by running:

.. code-block:: console
pip install -U pandas-profiling[notebook]
jupyter nbextension enable --py widgetsnbextension
If you are in a notebook (locally, at LambdaLabs, on Google Colab or Kaggle), you can run:
If you are in a notebook (locally, or on LambdaLabs, Google Colab or Kaggle), you can run:

.. code-block::
import sys
!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension
You may have to restart the kernel or runtime.
You may have to restart the kernel or runtime for the package to work.
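
After restarting, it can be useful to confirm that the packages are actually visible to the running interpreter. The snippet below is a minimal sketch that uses only the standard library (``importlib.metadata``, Python 3.8+); the helper name ``installed_version`` is hypothetical and not part of ``pandas-profiling``.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Check the package itself and the widgets dependency used in notebooks.
for name in ("pandas-profiling", "ipywidgets"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If a package is reported as not installed right after running ``pip``, the notebook kernel is probably using a different Python environment than the one ``pip`` installed into.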

Using conda
-----------
@@ -45,53 +45,49 @@ Using conda
:alt: Conda Version
:target: https://anaconda.org/conda-forge/pandas-profiling

You can install using the conda package manager by running
A new conda environment containing the module can be created via:

.. code-block:: console
conda env create -n pandas-profiling
conda activate pandas-profiling
conda install -c conda-forge pandas-profiling
This creates a new conda environment containing the module.

.. hint::

Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the main channel. See `frequent issues <Support.rst#frequent-issues>`_

Jupyter notebook/lab
--------------------
Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the ``main`` channel and will be installed. See `Frequent issues <Support.rst#frequent-issues>`_ for details.

For the Jupyter widgets extension to work, which is used for Progress Bars and the widget interface, you might need to activate the extensions. Installing with conda will enable the extension for you for Jupyter Notebooks (not lab).
Widgets in Jupyter Notebook/Lab
-------------------------------

For Jupyter notebooks:
For the Jupyter widgets extension to work (used for progress bars and the interactive widget-based report), you might need to activate the corresponding extensions.
This can be done via ``pip``:

.. code-block::
jupyter nbextension enable --py widgetsnbextension
pip install ipywidgets
For Jupyter lab:
Or via ``conda``:

.. code-block::
conda install -c conda-forge nodejs
jupyter labextension install @jupyter-widgets/jupyterlab-manager
conda install -c conda-forge ipywidgets
More information is available at the `ipywidgets documentation <https://ipywidgets.readthedocs.io/en/stable/user_install.html>`_.
In most cases, this will also automatically configure Jupyter Notebook and Jupyter Lab (``>=3.0``). For older versions of both or in more complex
environment configurations, refer to `the official ipywidgets documentation <https://ipywidgets.readthedocs.io/en/stable/user_install.html>`_.

From source
-----------

Download the source code by cloning the repository or by pressing `'Download ZIP' <https://github.com/ydataai/pandas-profiling/archive/master.zip>`_ on this page.
Install by navigating to the proper directory and running
Install it by navigating to the uncompressed directory and running:

.. code-block:: console
python setup.py install
This can also be done in one line:
This can also be done via the following one-liner:

.. code-block:: console
pip install https://github.com/ydataai/pandas-profiling/archive/master.zip
pip install https://github.com/ydataai/pandas-profiling/archive/master.zip
42 changes: 26 additions & 16 deletions docsrc/source/pages/introduction.rst
@@ -25,19 +25,29 @@ Introduction
:alt: Code style: black
:target: https://github.com/python/black

Generates profile reports from a pandas ``DataFrame``.
The pandas ``df.describe()`` function is great but a little basic for serious exploratory data analysis.
``pandas_profiling`` extends the pandas DataFrame with ``df.profile_report()`` for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* **Type inference**: detect the types of columns in a dataframe.
* **Essentials**: type, unique values, missing values
* **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range
* **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* **Most frequent values**
* **Histograms**
* **Correlations** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* **Missing values** matrix, count, heatmap and dendrogram of missing values
* **Duplicate rows** Lists the most occurring duplicate rows
* **Text analysis** learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
``pandas-profiling`` generates profile reports from a pandas ``DataFrame``.
The pandas ``df.describe()`` function is handy yet a little basic for exploratory data analysis. ``pandas_profiling`` extends pandas DataFrame with ``df.profile_report()``,
which automatically generates a standardized univariate and multivariate report for data understanding.

For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:

* **Type inference**: detect the types of columns in a ``DataFrame``
* **Essentials**: type, unique values, indication of missing values
* **Quantile statistics**: minimum value, Q1, median, Q3, maximum, range, interquartile range
* **Descriptive statistics**: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* **Most frequent and extreme values**
* **Histograms**: categorical and numerical
* **Correlations**: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér's V, Phik)
* **Missing values**: through counts, matrix, heatmap and dendrograms
* **Duplicate rows**: list of the most common duplicated rows
* **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
* **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata


The report contains three additional sections:

* **Overview**: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
* **Warnings**: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values and constant values, among others)
* **Reproduction**: technical details about the analysis (time, version and configuration)

The package can be used via code but also directly as a CLI utility. The generated interactive report can be consumed and shared as regular HTML or embedded in an interactive way inside Jupyter Notebooks.
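
As an illustration of the code route, the sketch below wraps report generation in a small helper. ``ProfileReport`` and its ``to_file`` method come from ``pandas_profiling``; the helper name ``write_profile_report``, its defaults and the example file name are assumptions made for this example only.

```python
from pathlib import Path


def write_profile_report(df, output_path="report.html", title="Profiling Report"):
    """Profile a pandas DataFrame and save the result as a standalone HTML file."""
    # Imported lazily so that the helper can be defined even where the
    # package is not installed yet; the import fails only when it is called.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title=title)
    profile.to_file(output_path)  # regular HTML that can be opened or shared
    return Path(output_path)


# Example call (requires pandas and pandas-profiling to be installed):
#
#     import pandas as pd
#     df = pd.DataFrame({"age": [22, 35, 58, None],
#                        "city": ["Lisbon", "Porto", "Lisbon", "Faro"]})
#     write_profile_report(df, "example_report.html", title="Example report")
```

Inside a notebook, the same ``ProfileReport`` object can be displayed inline rather than written to disk, which is the embedded mode mentioned above.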
