<div align="center">
<font size="6"> Pandas Profiling for automated EDA </font>  
</div>

<div align="center">
<font size="4"> with SDSS dataset </font>  
</div>

&nbsp;



<img align="right" src="https://pandas-profiling.github.io/pandas-profiling/docs/assets/logo_header.png" data-canonical-src="https://pandas-profiling.github.io/pandas-profiling/docs/assets/logo_header.png" width="700" height="300" />

Get [Pandas Profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/)

Pandas Profiling generates profile reports from a pandas DataFrame. The pandas `df.describe()` function is useful but too basic for serious exploratory data analysis. `pandas_profiling` extends the pandas DataFrame with `df.profile_report()` for quick data analysis.
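
For comparison, here is a minimal sketch of the baseline that `df.describe()` provides. The toy DataFrame and its column names are hypothetical stand-ins, not values taken from the SDSS file.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the SDSS CSV instead.
toy = pd.DataFrame({
    "redshift": [0.01, 0.12, 0.45, 0.07, 1.30],
    "class": ["STAR", "GALAXY", "QSO", "GALAXY", "QSO"],
})

# By default, describe() summarises numeric columns only.
summary = toy.describe()

# include="all" adds count/unique/top/freq for non-numeric columns.
summary_all = toy.describe(include="all")

print(summary)
print(summary_all)
```

Even with `include="all"`, the output is a single table without histograms, correlations, or missing-value diagnostics, which is the gap a profile report is meant to fill.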

For each column, the following statistics are presented in an interactive HTML report, where relevant for the column type:

* **Type inference:** detect the types of columns in a DataFrame (**not available in version 2.3.0**)
* **Essentials:** type, unique values, missing values
* **Quantile statistics:** minimum value, Q1, median, Q3, maximum, range, interquartile range
* **Descriptive statistics:** mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* **Most frequent values**
* **Histogram**
* **Correlations:** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* **Missing values:** matrix, count, heatmap and dendrogram of missing values
* **Text analysis:** learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
* **File and Image analysis:** extract file sizes, creation dates and dimensions, and scan for truncated images or images containing EXIF information.

## Import Kaggle libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os                       # accessing directory structure

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Import custom libraries

In [None]:
import matplotlib.pyplot as plt # plotting
import seaborn as sns           # plotting

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

from IPython.display import Image
%matplotlib inline

## Import Pandas Profiling

In [None]:
import pandas_profiling

In [None]:
## Check version
## Releases: https://pypi.org/project/pandas-profiling/#history
!pip freeze | grep pandas-profiling

## Read data

In [None]:
df = pd.read_csv('/kaggle/input/sloan-digital-sky-survey-dr16/Skyserver_12_30_2019 4_49_58 PM.csv')
df.head(10)

In [None]:
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
df.columns

## Profiling with Pandas Profiling

In [None]:
from sklearn.utils import shuffle

# Shuffle the rows so that the subset below is a random sample
df = shuffle(df)

# Keep only the first 10,000 shuffled rows to save memory
df_cut = df.iloc[:10000].copy()
df_cut
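
As an alternative sketch, pandas can draw the random subset in one step with `DataFrame.sample`, and fixing `random_state` makes the subset reproducible between runs. The stand-in table, the `objid` column, and the seed value are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the full table loaded from the CSV.
full = pd.DataFrame({"objid": range(100_000)})

# Draw 10,000 rows without replacement; the seed is an arbitrary choice.
sample = full.sample(n=10_000, random_state=42)

print(sample.shape)
```

Calling `sample` again with the same `random_state` returns the same rows, which helps when a report needs to be regenerated later.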

In [None]:
profile = pandas_profiling.ProfileReport(df_cut, title='SDSS dataset Report')

In [None]:
profile

## The profile can be saved to an HTML file

In [None]:
# profile.to_file(output_file='SDSS_with_pandas_profiling.html')