# YData Profiling Lesson

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **YData Profiling consists of a single unit.**
* By the end of this lesson, you should be able to:
  * Create a profile report for quick Exploratory Data Analysis (EDA) 
  * Add ydata-profiling capabilities to a Streamlit App

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand the data through data quality profiling reports for quick Exploratory Data Analysis (EDA)
* Add ydata-profiling capabilities to a Streamlit App



---

* YData Profiling generates profile reports from a Pandas DataFrame for quick exploratory data analysis in just a few lines of code. The report will help you to evaluate your data, whether datetime, numerical, categorical, or boolean. Text data (for example, a sentence, or a review) is handled as individual categories.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study YData Profiling?**
  * It is a tool that will save you time during data exploration and analysis, through intuitive reporting. This report will serve as a starting point when investigating your data.




## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e., play around with parameter values in a function/method, or consider additional function parameters, etc.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters.
  * Some of them are mandatory to declare, some have pre-defined values, and some are optional. We will cover the parameters most commonly used in Data Science for a particular function/method.
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For YData Profiling the link is [here](https://docs.profiling.ydata.ai/4.6/)**.

---

# <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Quick EDA: YData Profiling

YData Profiling is an automated dashboard for quick Exploratory Data Analysis (EDA). The documentation is found [here](https://docs.profiling.ydata.ai/4.6/).
* Instead of having multiple cells with commands like `.info()`, `.describe()`, `.plot()`, `.corr()`, etc., you run a single command to generate a useful and intuitive report for EDA
* Let's assume our data contains the following data types: Date, Numerical, Categorical and Boolean
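
To make the four data types concrete, here is a minimal sketch of a DataFrame you could profile. The column names and values are made up for illustration and are not the dataset used in this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration
rng = np.random.default_rng(seed=0)
n_rows = 100

df = pd.DataFrame({
    "date": pd.date_range(start="2024-01-01", periods=n_rows, freq="D"),  # Date
    "price": rng.normal(loc=50, scale=10, size=n_rows).round(2),          # Numerical
    "category": rng.choice(["A", "B", "C"], size=n_rows),                 # Categorical
    "in_stock": rng.choice([True, False], size=n_rows),                   # Boolean
})

# Inspect the type pandas assigned to each column
print(df.dtypes)
```

* A DataFrame like this one is the input you would pass to the report.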


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Report

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Here we only introduce the library functions and how to interpret the YData Profiling Report.
* We will cover in greater detail how to use YData Profiling Reports in practical terms in the second Walkthrough Project.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `ProfileReport()` to generate the report. The function documentation is [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/api/_autosummary/pandas_profiling.profile_report.ProfileReport.html). You pass the DataFrame to it as an argument.


* You need to render the `profile` with the method `.to_notebook_iframe()` to see the report in a Jupyter Notebook.



```
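from ydata_profiling import ProfileReport

# df is assumed to be an existing pandas DataFrame
profile = ProfileReport(df=df, minimal=True)

# Render the report inside the notebook cell output
profile.to_notebook_iframe()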
```


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There is a minimal mode, which uses a minimal configuration file where the more expensive computations are turned off. To use it, add the argument `minimal=True` when generating the report. In practical terms, the report then shows only two sections: Overview and Variables.



* Considerations
  * Most of the time, these two sections are adequate. You can run the report once with all sections for a detailed inspection; then, for later inspections (the exploratory process is iterative), you can use the minimal report, which is cheaper in terms of computing resources.
  * In addition, with big datasets, running a full report in ydata-profiling might take too long, so consider adding `minimal=True` and subsetting the data as a starting point for the initial analysis.
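
The subsetting idea above can be sketched as follows. The helper names `take_subset` and `quick_profile` and the default of 10,000 rows are assumptions made for this example; they are not part of ydata-profiling.

```python
import pandas as pd


def take_subset(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return a random sample of at most max_rows rows (hypothetical helper)."""
    if len(df) <= max_rows:
        return df
    # random_state makes the sample reproducible between runs
    return df.sample(n=max_rows, random_state=seed)


def quick_profile(df: pd.DataFrame, max_rows: int = 10_000):
    """Build a minimal report on a subset of the data (hypothetical helper)."""
    # Imported here so take_subset can be used even if ydata-profiling is not installed
    from ydata_profiling import ProfileReport

    return ProfileReport(df=take_subset(df, max_rows), minimal=True)


# Example with a made-up DataFrame of 50,000 rows
big_df = pd.DataFrame({"value": range(50_000)})
subset = take_subset(big_df, max_rows=1_000)
print(len(subset))
```

* In a notebook, you could then call `quick_profile(big_df).to_notebook_iframe()` to see the report for the sampled rows. Keep in mind that a random sample may hide rare values, so treat it as a first look only.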


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The report is displayed as an output in the notebook cell. The report has the following sections:
  * Overview
  * Variables
  * Interactions
  * Correlations
  * Missing values
  * Sample


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In general, you will check the report to look for information and insights like:
* Detect the data types from the columns
* Assess unique values, missing values, duplicate rows
* Statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range, mean, mode, standard deviation, sum, kurtosis, skewness, most frequent values
* Histograms and bar plots
* Correlations 
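
For comparison, the sketch below shows how some of these checks could be done by hand with pandas. It only illustrates what the report saves you from typing; it is not how ydata-profiling computes its results, and the data is made up.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58, 23],
    "city": ["Cork", "Dublin", "Dublin", "Galway", "Cork", "Cork"],
})

print(df.dtypes)                   # data type of each column
print(df.nunique())                # unique values per column
print(df.duplicated().sum())       # number of duplicate rows
print(df["age"].describe())        # count, mean, std, min, quartiles, max
print(df["age"].skew())            # skewness
print(df["age"].kurt())            # kurtosis
print(df["city"].value_counts())   # most frequent values
```

* Each line answers a single question, whereas the report gathers all of these answers in one place.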

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Overview

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In Overview, you have a global summary report about your data, including the number of variables, rows, missing cells, duplicate rows, size in memory and variable types.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The tab "Warnings" informs you about any potential issues with your data.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Reproduction tab tells when the analysis started and finished, and the YData Profiling version.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Variables

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Variables section scans over each variable, showing a brief report of the variable, depending on its data type. 
* On the right-hand side, you can see a visual representation, where you can explore the distribution. You will scroll down over this section and will likely spend a good amount of time understanding your variables individually.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> If you click on "Toggle detail", a section appears with tabs for Statistics and Charts.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Interactions

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the Interactions section, you can use the numerical variables and plot a scatter plot to quickly investigate the relationship and distribution.
* Note that you select the x-axis and y-axis variables using the tabs under "Interactions".

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlations

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A correlation study using multiple methods is available for the non-object variables.
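
As a rough idea of what "multiple methods" means, the sketch below computes Pearson and Spearman correlations with pandas on numerical columns only. The methods shown in your report may differ, and the data here is made up.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 90],
    "student": ["a", "b", "c", "d", "e", "f"],  # object column, excluded below
})

# Keep only the numerical columns before computing correlations
numeric_df = df.select_dtypes(include="number")

pearson = numeric_df.corr(method="pearson")    # linear relationship
spearman = numeric_df.corr(method="spearman")  # monotonic (rank-based) relationship

print(pearson)
print(spearman)
```

* Comparing methods is useful because a relationship can be strongly monotonic without being strictly linear.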

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Missing Values

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A set of bar charts indicates the missing values across the variables.
* In this case, there are no missing values in the studied dataset.
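
If you want to double-check missing values outside the report, a pandas-only sketch could look like the one below. The DataFrame is made up and, unlike the studied dataset, it contains gaps on purpose.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberate gaps, for illustration only
df = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, 1.65],
    "weight": [68.0, 75.5, np.nan, np.nan],
    "name": ["Ana", "Ben", "Cal", "Dee"],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(1),
})
print(missing)
```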


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Sample

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> This section shows you the first and last 10 records of the dataset, so you can get a sense of the actual data.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Save HTML report

You can save the report as an HTML file by using the method `.to_file()` on the report object, in this case `profile`. The argument is the file path, including the HTML file name.



```
from ydata_profiling import ProfileReport
profile = ProfileReport(df=df, minimal=True)
profile.to_file(output_file='report_output.html') 
```



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Render a YData Profiling report within Streamlit

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Disclaimer: the code shown here will not work in a Jupyter notebook cell, since Streamlit runs from Python scripts. <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">
* The idea is just to show how to combine ydata-profiling and Streamlit.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You will need to add **streamlit-pandas-profiling** to your requirements.txt
* You **don't need to add ydata-profiling**, since streamlit-pandas-profiling installs ydata-profiling as a dependency. On top of that, it installs the functionality to render the report within Streamlit.



* Add the streamlit-pandas-profiling library to your requirements.txt:
```
streamlit-pandas-profiling==0.1.2
```



This is an example of a Python script that combines Streamlit and YData Profiling.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note: If you run the cell below, it will not work! This code is designed to work as a script in a Streamlit application.



```
# This should be in a python script

import streamlit as st
from ydata_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report


df = ...  # get your DataFrame


# Create a function to display the ydata-profiling report within Streamlit
def PandasProfilingReport(df):
    # Build the report, then render it in the Streamlit page
    pr = ProfileReport(df, explorative=True, minimal=True)
    st_profile_report(pr)


# Call your function somewhere in the application
PandasProfilingReport(df)

```



