<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/ydata_profiling/ydata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install ydata_profiling

Now we need some data. Mount Google Drive and change into the folder that holds the example file:

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/My Drive/teaching/5540-6640 Materials Informatics/MaterialsInformatics/worked_examples/ydata_profiling

# YData Profiling

In this notebook we will run through the main uses of ydata-profiling and show how to use it to visualize a dataset and understand how its variables behave.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport 

## Visualization

The main use of ydata-profiling is to visualize a dataset, which gives insight into how the data behaves and how it should be processed.

Let's first load the dataset:

In [4]:
df = pd.read_csv('AgNP_dataset.csv')
df = df.dropna()

Next we will use ydata-profiling to visualize the dataset. It generates a profile report, an interactive view rendered inside the notebook. The report contains an overview, correlations between variables, variable interactions, missing-value statistics, and a sample of the dataset.

In [5]:
profile = ProfileReport(df, title='Profiling Report', explorative=True)
profile
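
The report can also be written to a standalone HTML file, which is handy for sharing it outside the notebook. The helper below is a minimal sketch: `save_profile` and the default file name are hypothetical names chosen for illustration, while `ProfileReport(..., minimal=True)` and `to_file` are part of the ydata-profiling API.

```python
def save_profile(df, path="profiling_report.html"):
    """Build a profile report for `df` and write it to an HTML file.

    `save_profile` and the default `path` are illustrative names,
    not part of ydata-profiling itself.
    """
    # Imported lazily so the helper can be defined before the package is installed.
    from ydata_profiling import ProfileReport

    # minimal=True switches off the most expensive computations,
    # which helps when the dataset is large.
    profile = ProfileReport(df, title="Profiling Report", minimal=True)
    profile.to_file(path)
    return path
```

Calling `save_profile(df)` after loading the dataset would write the report next to the notebook.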




### Interpretation

Let's run through the main sections of the report.

#### Overview
- Number of variables, observations, missing values, and memory usage
- Helpful for a quick look at the size of the dataset and how complete it is. If there are a large number of missing values, then further processing of the dataset may be needed
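
The same overview numbers can be cross-checked directly in pandas. This is a minimal sketch on a small made-up table; the column names and values are hypothetical and are not taken from `AgNP_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, used only to illustrate the overview statistics.
toy = pd.DataFrame({
    "temperature_C": [25.0, 40.0, np.nan, 60.0],
    "size_nm": [12.1, np.nan, 15.3, 20.8],
    "shape": ["sphere", "rod", "sphere", None],
})

n_obs, n_vars = toy.shape                       # rows, columns
n_missing = int(toy.isna().sum().sum())         # missing cells in the whole table
missing_fraction = n_missing / toy.size         # share of all cells that are missing
memory_bytes = int(toy.memory_usage(deep=True).sum())

print(f"{n_vars} variables, {n_obs} observations")
print(f"{n_missing} missing cells ({missing_fraction:.1%})")
print(f"{memory_bytes} bytes in memory")
```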

#### Data Types
- A breakdown of the different data types present in the dataset
- Tells you which variables might need to be encoded or left out before machine learning
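
As an illustration of acting on the data-type breakdown, the sketch below separates numeric from non-numeric columns and one-hot encodes the latter. The table is hypothetical, and one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical stand-in data; column names are made up for illustration.
toy = pd.DataFrame({
    "temperature_C": [25.0, 40.0, 60.0],
    "n_repeats": [1, 2, 3],
    "shape": ["sphere", "rod", "sphere"],
})

# Split columns by type, mirroring the report's data-type summary.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the non-numeric columns so a model can consume them.
encoded = pd.get_dummies(toy, columns=other_cols)

print(numeric_cols)
print(other_cols)
print(encoded.columns.tolist())
```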


#### Missing Values
- Missing-value statistics for each individual variable
- Helpful for checking which variables contribute most to the total number of missing values and which might need further processing
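
Note that this notebook calls `df.dropna()` before building the report, so the report above will show no missing values; profiling the DataFrame before that call would populate this section. The sketch below shows the per-variable counts in plain pandas, along with how many rows `dropna()` would discard. The table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with gaps in two columns.
toy = pd.DataFrame({
    "temperature_C": [25.0, 40.0, np.nan, 60.0],
    "size_nm": [12.1, np.nan, np.nan, 20.8],
    "shape": ["sphere", "rod", "sphere", "rod"],
})

missing_count = toy.isna().sum()    # missing cells per variable
missing_share = toy.isna().mean()   # fraction of rows missing per variable

rows_before = len(toy)
rows_after = len(toy.dropna())      # rows that survive dropping any row with a gap

print(missing_count)
print(missing_share)
print(f"dropna() keeps {rows_after} of {rows_before} rows")
```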


#### Descriptions
- Statistics for numerical columns such as count, mean, standard deviation, min, max, and quartiles
- Helps provide easy information about the distribution of the data
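
The same descriptive statistics are available from `describe()` in pandas. In this minimal sketch, on made-up values, one large value pulls the mean well above the median, which is the kind of skew these statistics help to spot.

```python
import pandas as pd

# Hypothetical particle sizes; the last value is a deliberate outlier.
toy = pd.DataFrame({"size_nm": [10.0, 12.0, 14.0, 16.0, 48.0]})

summary = toy["size_nm"].describe()  # count, mean, std, min, quartiles, max
print(summary)

# A mean far from the median (the 50% row) hints at a skewed distribution.
mean_minus_median = summary["mean"] - summary["50%"]
print(f"mean - median = {mean_minus_median}")
```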


#### Correlations
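- Correlation coefficients between variables (such as Pearson, Spearman, etc.)
- Pearson measures the strength of the linear relationship between two variables; Spearman measures how monotonic the relationship is, whether or not it is linear
- An absolute value close to 1 indicates a strong relationship (positive values are direct, negative values are inverse), while values near 0 indicate little or no relationship of that kind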
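
To see how the two coefficients differ, the sketch below uses a made-up monotonic but nonlinear relationship, `y = x**3`. Spearman, which works on ranks, reports a perfect relationship, while Pearson falls short of 1 because the relationship is not a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y rises with x, but along a curve rather than a line.
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```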


#### Distributions
- Histograms/Density plots showing the distribution of the variables
- Helps visualize the shape of the data and how different data points are spread across different values.
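
The numbers behind a histogram can be reproduced with `np.histogram`. This is a rough text-only sketch on made-up values, with an arbitrary choice of four bins; the report draws the equivalent plot for each variable.

```python
import numpy as np

# Hypothetical particle sizes: most values are small, two are much larger.
sizes = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 15.0, 30.0, 45.0])

# Four equal-width bins spanning the min and max of the data.
counts, edges = np.histogram(sizes, bins=4)

# Print a simple text histogram, one row per bin.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * int(count)}")
```

Most points fall in the first bin, with a long tail to the right.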


#### Variable Analysis
- Frequency counts and bar plots of unique values for each variable
- Identifies where points are concentrated, which also shows which values are rare
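
Frequency counts like these come from `value_counts()` in pandas. In the sketch below, the category labels are made up and the 20% cutoff for calling a category rare is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical categorical variable.
shapes = pd.Series(
    ["sphere", "sphere", "rod", "sphere", "triangle", "rod", "sphere"],
    name="shape",
)

counts = shapes.value_counts()               # occurrences per category
share = shapes.value_counts(normalize=True)  # same, as fractions of the total

# Flag categories below an arbitrary 20% share as rare.
rare = share[share < 0.2].index.tolist()

print(counts)
print("rare categories:", rare)
```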

#### Interactions
- Shows how pairs of variables interact, displayed as heat maps
- Helps to reveal relationships and patterns between variables
- Can help to highlight possible collinearity between pairs of variables
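
A numeric follow-up to the interaction plots is to list the variable pairs whose correlation is high enough to suggest collinearity. The sketch below uses synthetic data in which `b` is built from `a`; the column names and the 0.9 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is nearly a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and
# the diagonal (self-correlation) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# 0.9 is an arbitrary cutoff for "possibly collinear".
collinear = pairs[pairs > 0.9].index.tolist()
print(collinear)
```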

