<a href="https://colab.research.google.com/github/sanjaykshetri/springboard_HW/blob/master/Copy_of_meteorites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## YData Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv

The autoreload instruction (IPython's `%autoreload` magic) reloads modules automatically before each code execution, which is helpful after the package update below.

Make sure that we have the latest version of ydata-profiling (formerly pandas-profiling).

In [2]:
import sys

!{sys.executable} -m pip install -U ydata-profiling[notebook]
!pip install jupyter-contrib-nbextensions
!jupyter nbextension enable --py widgetsnbextension

Collecting ydata-profiling[notebook]
  Downloading ydata_profiling-4.17.0-py2.py3-none-any.whl.metadata (22 kB)
Collecting scipy<1.16,>=1.4.1 (from ydata-profiling[notebook])
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting visions<0.8.2,>=0.7.5 (from visions[type_image_path]<0.8.2,>=0.7.5->ydata-profiling[notebook])
  Downloading visions-0.8.1-py3-none-any.whl.metadata (11 kB)
Collecting minify-html>=0.15.0 (from ydata-profiling[notebook])
  Downloading minify_html-0.18.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting filetype>=1.0.0 (from ydata-profiling[notebook])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting phik<0.13,>=0.11.1 (from ydata-profiling[notebook])
  Downloading phik-0.12.5-cp312-cp312-manylinux_2_24_x86_64.manylinux_

You might want to restart the kernel now so that the newly installed packages are loaded.

### Import libraries

In [3]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests

import ydata_profiling
from ydata_profiling.utils.cache import cache_file

### Load and prepare example dataset
We add some fake variables to illustrate the capabilities of ydata-profiling.

In [4]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv",
)

df = pd.read_csv(file_name)

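# Note: with its default nanosecond resolution, pandas can only represent dates from
# roughly 1677 to 2262; errors="coerce" turns values it cannot parse or represent into NaT.
# Caution: a numeric "year" column is read as nanoseconds since 1970-01-01, not as
# calendar years, which is what the df.head() output further down shows.
df["year"] = pd.to_datetime(df["year"], errors="coerce")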

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

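# Example: Mixed with base types
# (NumPy converts the list [1, "A"] to a string array, so the sampled values are the strings "1" and "A")
df["mixed"] = np.random.choice([1, "A"], df.shape[0])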

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = df.iloc[0:10].copy()  # explicit copy, so the edit below does not touch df
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)
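
The appended rows differ from their originals only in the `name` column, so a whole-row duplicate check would not flag them. The following is a minimal sketch of that distinction on a small hand-written frame (the frame and the variable names `toy`, `copies`, and `combined` are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Small hand-written stand-in for the meteorite table (illustration only).
toy = pd.DataFrame(
    {
        "name": ["Aachen", "Aarhus", "Abee"],
        "mass (g)": [21.0, 720.0, 107000.0],
    }
)

# Same construction as in the cell above: copy the first rows and rename them.
copies = toy.iloc[0:2].copy()
copies["name"] = copies["name"] + " copy"
combined = pd.concat([toy, copies], ignore_index=True)

# Comparing whole rows finds nothing, because "name" differs on every copy.
exact_duplicates = int(combined.duplicated().sum())

# Ignoring "name" exposes the copied rows.
other_columns = [c for c in combined.columns if c != "name"]
near_duplicates = int(combined.duplicated(subset=other_columns).sum())

print(exact_duplicates, near_duplicates)
```

Whether a profiling report counts such rows as duplicates depends on how it compares rows, so it is worth checking the report's duplicates section against a count like this.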

In [5]:
df.head()

| | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | source | boolean | mixed | reclat_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 1970-01-01 00:00:00.000001880 | 50.775 | 6.08333 | (50.775, 6.08333) | NASA | False | 1 | 48.612275 |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 1970-01-01 00:00:00.000001951 | 56.18333 | 10.23333 | (56.18333, 10.23333) | NASA | False | A | 49.315957 |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 1970-01-01 00:00:00.000001952 | 54.21667 | -113.0 | (54.21667, -113.0) | NASA | True | 1 | 43.681403 |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 1970-01-01 00:00:00.000001976 | 16.88333 | -99.9 | (16.88333, -99.9) | NASA | True | 1 | 9.547757 |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 1970-01-01 00:00:00.000001902 | -33.16667 | -64.95 | (-33.16667, -64.95) | NASA | False | 1 | -28.265059 |
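
The `year` values in this output all fall on 1970-01-01, with the original year showing up in the nanosecond digits. That pattern suggests the column was numeric and was interpreted as an offset from the Unix epoch. The sketch below reproduces the effect on hypothetical values and shows one possible way to parse numeric years as calendar years. The series `years` and the intermediate names are assumptions for illustration, and the approach has not been checked against the full dataset:

```python
import pandas as pd

# Hypothetical numeric years, with one missing value.
years = pd.Series([1880.0, 1951.0, None, 1976.0])

# Numbers passed without a unit or format are treated as offsets from 1970-01-01.
as_offsets = pd.to_datetime(years, errors="coerce")

# Turning the valid years into digit strings lets format="%Y" parse them as calendar years.
valid = years.dropna().astype(int).astype(str)
as_years = pd.to_datetime(valid, format="%Y", errors="coerce").reindex(years.index)

print(as_offsets.dt.year.tolist())
print(as_years.dt.year.tolist())
```

Rows dropped before parsing come back as `NaT` after the `reindex`, so the result lines up with the original index.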


### Inline report without saving object

In [6]:
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report


### Save report to file

In [7]:
profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")
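
The hard-coded `/tmp` path only exists on Unix-like systems. As a hedged sketch, the helper below builds the output path from the system temporary directory instead. `save_report` is a hypothetical name introduced here, and it assumes only that the report object has a `to_file` method that accepts a path, as used in the cell above:

```python
import tempfile
from pathlib import Path


def save_report(report, stem="example", directory=None):
    """Write `report` as HTML into `directory` (default: the system temp dir)."""
    out_dir = Path(directory) if directory is not None else Path(tempfile.gettempdir())
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{stem}.html"
    report.to_file(target)  # the only call made on the report object
    return target
```

In the notebook this could be called as `save_report(profile_report)`, which returns the path of the written file.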


### More analysis (Unicode) and printing an existing ProfileReport object inline

In [8]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report



### Notebook Widgets

In [9]:
profile_report.to_widgets()




ValueError: ('widget type not understood', 'overview_tabs')
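
The cause of this error is not established here. One way to keep the notebook usable regardless is to fall back to the iframe view when widget rendering fails. The sketch below assumes the installed ydata-profiling version provides `to_notebook_iframe()` on the report object. `show_report` is a hypothetical helper name introduced for illustration:

```python
def show_report(report):
    """Try the widget view first; fall back to the iframe view if it fails."""
    try:
        return report.to_widgets()
    except (ValueError, ImportError) as exc:  # e.g. the ValueError shown above
        print(f"Widget rendering failed ({exc!r}); falling back to the iframe view.")
        return report.to_notebook_iframe()
```

Calling `show_report(profile_report)` in place of `profile_report.to_widgets()` would then display the report either way, at the cost of hiding the underlying widget problem.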