**Note**: This exercise is adapted from the original [here](https://github.com/pandas-profiling/pandas-profiling/blob/master/examples/meteorites/meteorites.ipynb). As of September 2020, if you install [pandas_profiling on conda](https://anaconda.org/conda-forge/pandas-profiling) you might get an old version (1.41), because some conda channels for this package lag behind the latest version on [pypi](https://pypi.org/project/pandas-profiling/) (2.9.0 as of September 2020). To be clear about the setup, the exact environment and library versions used to run this exercise are listed in the Pipfile (see [pipenv](https://pipenv-fork.readthedocs.io/en/latest/) for more details) of this example [here](https://github.com/andrewm4894/pandas-profiling/blob/master/Pipfile).

## Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

In [1]:
%load_ext autoreload
%autoreload 2

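Make sure that pandas-profiling is installed at the version used in this exercise. Note that the imports further down use `ydata_profiling`, the name under which the package was later republished, so install whichever package matches the imports you run.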

In [2]:
# uncomment and run below if you need to pip install the pandas-profiling library
#import sys
#!{sys.executable} -m pip install -U pandas-profiling==2.9.0
#!jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.

### Import libraries

In [3]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import ydata_profiling
from ydata_profiling.utils.cache import cache_file

### Load and prepare example dataset
We add some fake variables to illustrate pandas-profiling capabilities.

In [7]:
# https://ydata-profiling.ydata.ai/docs/master/pages/reference/api/_autosummary/ydata_profiling.utils.cache.html
# cache_file(file_name, url)
# Check if file_name already is in the data path, otherwise download it from url.

file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)
    
df = pd.read_csv(file_name)
    
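# Note: nanosecond-resolution pandas timestamps cannot represent dates before 1677-09-21;
# errors='coerce' turns values that cannot be parsed into NaT, so they are ignored in this analysis.
# Caution: if 'year' holds plain numbers, to_datetime reads them as nanoseconds since 1970-01-01.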
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])
# df.shape[0] is the number of rows; for how shape differs between a DataFrame and a Series, see
# https://stackoverflow.com/questions/40902224/why-dataframe-shape0-prints-an-integer-but-dataframe-columnname-shape-prints

# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=len(df))
# numpy.random.normal(loc=0.0, scale=1.0, size=None)
# draws random samples from a normal (Gaussian) distribution.

# Example: Duplicate observations
# Take an explicit copy so the rename below does not touch (or warn about) the original rows
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

df = pd.concat([df,duplicates_to_add], ignore_index=True)
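
The comment on `cache_file` in the cell above describes a download-if-missing pattern. A minimal sketch of that idea, using only the standard library, might look like the following. The names `cache_file_sketch`, `data_dir`, and `fetch` are hypothetical and introduced only for illustration; this is not the library's actual implementation.

```python
from pathlib import Path
from urllib.request import urlopen


def _download(url):
    """Default fetcher for this sketch: return the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def cache_file_sketch(file_name, url, data_dir="data", fetch=_download):
    """Return the local path of file_name, downloading it only if it is missing.

    Hypothetical helper: data_dir and fetch are assumptions of this sketch.
    """
    data_path = Path(data_dir)
    data_path.mkdir(parents=True, exist_ok=True)
    file_path = data_path / file_name
    if not file_path.exists():
        # Only reached on the first call; later calls reuse the file on disk.
        file_path.write_bytes(fetch(url))
    return file_path
```

Passing the fetcher as a parameter is a design choice for this sketch: it lets the caching logic be exercised without network access.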

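One detail of the `mixed` column is worth checking: when `np.random.choice` receives the plain list `[1, "A"]`, NumPy first converts it to a single-dtype array, so the integer becomes the string `"1"`. The sketch below illustrates this and shows one possible way to keep genuinely mixed Python types, by passing an object array. The variable names are hypothetical, and whether the profiling report treats the two variants differently is not verified here.

```python
import numpy as np

# A plain list is coerced to one string dtype, so every drawn value is a str.
coerced = np.random.choice([1, "A"], 200)
print(coerced.dtype.kind)  # 'U' stands for a unicode string dtype
print({type(v).__name__ for v in coerced.tolist()})

# An object array keeps the original Python type of each element.
choices = np.array([1, "A"], dtype=object)
mixed = np.random.choice(choices, 200)
print({type(v).__name__ for v in mixed.tolist()})
```
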
In [29]:
duplicates_to_add

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation,source,boolean,mixed,reclat_city
0,Aachen copy,1,Valid,L5,21.0,Fell,1970-01-01 00:00:00.000001880,50.775,6.08333,"(50.775, 6.08333)",NASA,False,A,56.298197
1,Aarhus copy,2,Valid,H6,720.0,Fell,1970-01-01 00:00:00.000001951,56.18333,10.23333,"(56.18333, 10.23333)",NASA,False,1,63.79237
2,Abee copy,6,Valid,EH4,107000.0,Fell,1970-01-01 00:00:00.000001952,54.21667,-113.0,"(54.21667, -113.0)",NASA,True,1,48.260981
3,Acapulco copy,10,Valid,Acapulcoite,1914.0,Fell,1970-01-01 00:00:00.000001976,16.88333,-99.9,"(16.88333, -99.9)",NASA,True,1,13.841276
4,Achiras copy,370,Valid,L6,780.0,Fell,1970-01-01 00:00:00.000001902,-33.16667,-64.95,"(-33.16667, -64.95)",NASA,False,A,-31.683042
5,Adhi Kot copy,379,Valid,EH4,4239.0,Fell,1970-01-01 00:00:00.000001919,32.1,71.8,"(32.1, 71.8)",NASA,True,A,33.241399
6,Adzhi-Bogdo (stone) copy,390,Valid,LL3-6,910.0,Fell,1970-01-01 00:00:00.000001949,44.83333,95.16667,"(44.83333, 95.16667)",NASA,False,A,44.671004
7,Agen copy,392,Valid,H5,30000.0,Fell,1970-01-01 00:00:00.000001814,44.21667,0.61667,"(44.21667, 0.61667)",NASA,False,A,48.338963
8,Aguada copy,398,Valid,L6,1620.0,Fell,1970-01-01 00:00:00.000001930,-31.6,-65.23333,"(-31.6, -65.23333)",NASA,False,1,-25.049467
9,Aguila Blanca copy,417,Valid,L,1440.0,Fell,1970-01-01 00:00:00.000001920,-30.86667,-64.55,"(-30.86667, -64.55)",NASA,True,1,-31.929579
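
The `year` values in the output above read `1970-01-01 00:00:00.000001880` and similar, which suggests the numeric years were interpreted as nanoseconds since the epoch rather than as calendar years. A minimal sketch of one way to parse them as years, assuming the raw column holds numbers such as `1880.0` with some missing values (the sample series here is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample of numeric years with one missing value.
years = pd.Series([1880.0, 1951.0, None])

# Without a format, numbers are treated as nanoseconds since 1970-01-01.
as_nanoseconds = pd.to_datetime(years, errors="coerce")

# Turning the non-missing values into strings like "1880" and giving an
# explicit format parses them as calendar years; reindex restores the
# missing rows as NaT.
year_strings = years.dropna().astype(int).astype(str)
as_years = pd.to_datetime(year_strings, format="%Y", errors="coerce").reindex(years.index)

print(as_nanoseconds)
print(as_years)
```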


In [41]:
df[df['name'].str.contains('copy')]

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation,source,boolean,mixed,reclat_city
45716,Aachen copy,1,Valid,L5,21.0,Fell,1970-01-01 00:00:00.000001880,50.775,6.08333,"(50.775, 6.08333)",NASA,False,A,56.298197
45717,Aarhus copy,2,Valid,H6,720.0,Fell,1970-01-01 00:00:00.000001951,56.18333,10.23333,"(56.18333, 10.23333)",NASA,False,1,63.79237
45718,Abee copy,6,Valid,EH4,107000.0,Fell,1970-01-01 00:00:00.000001952,54.21667,-113.0,"(54.21667, -113.0)",NASA,True,1,48.260981
45719,Acapulco copy,10,Valid,Acapulcoite,1914.0,Fell,1970-01-01 00:00:00.000001976,16.88333,-99.9,"(16.88333, -99.9)",NASA,True,1,13.841276
45720,Achiras copy,370,Valid,L6,780.0,Fell,1970-01-01 00:00:00.000001902,-33.16667,-64.95,"(-33.16667, -64.95)",NASA,False,A,-31.683042
45721,Adhi Kot copy,379,Valid,EH4,4239.0,Fell,1970-01-01 00:00:00.000001919,32.1,71.8,"(32.1, 71.8)",NASA,True,A,33.241399
45722,Adzhi-Bogdo (stone) copy,390,Valid,LL3-6,910.0,Fell,1970-01-01 00:00:00.000001949,44.83333,95.16667,"(44.83333, 95.16667)",NASA,False,A,44.671004
45723,Agen copy,392,Valid,H5,30000.0,Fell,1970-01-01 00:00:00.000001814,44.21667,0.61667,"(44.21667, 0.61667)",NASA,False,A,48.338963
45724,Aguada copy,398,Valid,L6,1620.0,Fell,1970-01-01 00:00:00.000001930,-31.6,-65.23333,"(-31.6, -65.23333)",NASA,False,1,-25.049467
45725,Aguila Blanca copy,417,Valid,L,1440.0,Fell,1970-01-01 00:00:00.000001920,-30.86667,-64.55,"(-30.86667, -64.55)",NASA,True,1,-31.929579
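
Because `" copy"` is appended to `name`, the added rows differ from their originals in that one column, so a full-row duplicate check would not flag them. The sketch below, on a hypothetical three-row miniature of the table, contrasts an exact duplicate check with one that ignores `name`:

```python
import pandas as pd

# Hypothetical miniature of the meteorite table.
small = pd.DataFrame({
    "name": ["Aachen", "Aarhus", "Abee"],
    "id": [1, 2, 6],
    "mass (g)": [21.0, 720.0, 107000.0],
})

# Mimic the notebook: copy the first two rows and rename them.
copies = small.iloc[0:2].copy()
copies["name"] = copies["name"] + " copy"
combined = pd.concat([small, copies], ignore_index=True)

# Exact duplicates: none, because the names differ.
exact = combined.duplicated(keep=False)

# Duplicates on every column except name: the originals and their copies.
other_columns = [c for c in combined.columns if c != "name"]
near = combined.duplicated(subset=other_columns, keep=False)

print(exact.sum(), near.sum())  # 0 4
```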


### Inline report without saving object

In [9]:
report = df.profile_report(sort=None, html={'style':{'full_width': True}}, progress_bar=False)
report



### Save report to file

In [46]:
profile_report = df.profile_report(html={'style': {'full_width': True}})
profile_report.to_file("example.html")


### More analysis (Unicode) and Print existing ProfileReport object inline

In [47]:
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report




### Notebook Widgets

In [8]:
profile_report.to_widgets()


