# Skimpy - A simple way to summarize your dataset

skimpy is a lightweight tool that prints summary statistics about the variables in a data frame to the console. Think of it as a more detailed version of pandas' df.describe().

```bash
pip install skimpy
```
https://pypi.org/project/skimpy/


### Table of Contents <a class="anchor" id="skimpy_toc"></a>

* [Table of Contents](#skimpy_toc)
    * [builtin seaborn datasets](#skimpy_datasets)
    * [about skim](#skimpy_skim)
    * [load external diabetes](#skimpy_diabetes)
    * [generate test data](#skimpy_generate_test_data)
    * [skimpy cli](#skimpy_cli)
    * [anagrams](#skimpy_anagrams)
    * [anscombe](#skimpy_anscombe)
    * [attention](#skimpy_attention)
    * [brain_networks](#skimpy_brain_networks)
    * [car_crashes](#skimpy_car_crashes)
    * [diamonds](#skimpy_diamonds)
    * [dots](#skimpy_dots)
    * [exercise](#skimpy_exercise)
    * [flights](#skimpy_flights)
    * [fmri](#skimpy_fmri)
    * [gammas](#skimpy_gammas)
    * [geyser](#skimpy_geyser)
    * [iris](#skimpy_iris)
    * [mpg](#skimpy_mpg)
    * [penguins](#skimpy_penguins)
    * [planets](#skimpy_planets)
    * [taxis](#skimpy_taxis)
    * [tips](#skimpy_tips)
    * [titanic](#skimpy_titanic)

In [1]:
# import the libraries needed to load the data and compute the summary statistics
import pandas as pd
from skimpy import skim, generate_test_data
import seaborn as sns

# list builtin seaborn datasets <a class="anchor" id="skimpy_datasets"></a>
[Back to Top](#skimpy_toc)

In [2]:
dataset_names = sns.get_dataset_names()

In [3]:
dataset_names

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']
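
Each dataset in this notebook gets the same load, describe and skim cells. As a rough sketch, that repetition could be driven from the list of names above, recording which datasets fail instead of stopping at the first error. The helper `summarize_all` and the stand-in `fake_load` are hypothetical names introduced for illustration; in the notebook the two callables would be `sns.load_dataset` and `skim`.

```python
from typing import Callable, Dict, Iterable


def summarize_all(
    names: Iterable[str],
    load: Callable[[str], object],
    summarize: Callable[[object], None],
) -> Dict[str, str]:
    """Run summarize(load(name)) for every name; return {name: error} for failures."""
    failures: Dict[str, str] = {}
    for name in names:
        try:
            summarize(load(name))
        except Exception as exc:  # keep going so one bad dataset does not end the run
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in loader so the sketch runs without seaborn or a network connection.
def fake_load(name: str) -> dict:
    if name == "broken":
        raise ValueError("could not parse")
    return {"name": name, "rows": 3}


seen = []  # collects every dataset that was summarized successfully
failures = summarize_all(["iris", "broken", "tips"], fake_load, seen.append)
print(failures)
print([d["name"] for d in seen])

# In the notebook, the equivalent call would be (assumption: datasets can be downloaded):
# failures = summarize_all(sns.get_dataset_names(), sns.load_dataset, skim)
```

Passing the loader and the summarizer in as arguments keeps the loop independent of seaborn and skimpy, so it can be tried on stubs first.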

# about skim <a class="anchor" id="skimpy_skim"></a>
[Back to Top](#skimpy_toc)

In [4]:
skim?

Signature:
skim(
    df: pandas.core.frame.DataFrame,
    header_style: str = 'bold cyan',
    **colour_kwargs: str,
) -> None
Docstring:
Skim a data frame and return statistics.

skim is an alternative to pandas.DataFrame.summary(), quickly providing
an overview of a data frame. It produces a different set of summary
functions based on the types of columns in the dataframe. You may get
better results from ensuring that you set the datatypes in your dataframe
you want before running skim.
The colour_kwargs (str) are defined in dataframe_to_rich_table.

Args:
    df (pd.DataFrame): Dataframe to skim
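
The docstring's advice about setting datatypes before calling skim can be illustrated with pandas alone. The sketch below is only an assumption about what "setting the datatypes" might look like in practice: the toy frame `raw` and its column names are hypothetical, and every column starts out as text, as it often does after reading a CSV.

```python
import pandas as pd

# Hypothetical toy frame: every column arrives as strings.
raw = pd.DataFrame(
    {
        "signup": ["2021-01-05", "2021-02-11", "2021-03-20"],
        "plan": ["free", "pro", "free"],
        "active": ["True", "False", "True"],
        "visits": ["3", "10", "7"],
    }
)

# Give each column the type it is meant to have.
typed = raw.assign(
    signup=pd.to_datetime(raw["signup"]),
    plan=raw["plan"].astype("category"),
    active=raw["active"].map({"True": True, "False": False}),
    visits=pd.to_numeric(raw["visits"]),
)

print(raw.dtypes)
print(typed.dtypes)

# skim(typed) could then choose a summary per column type;
# the call is left commented out so this sketch runs without skimpy installed.
```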
  

# diabetes <a class="anchor" id="skimpy_diabetes"></a>
[Back to Top](#skimpy_toc)

In [5]:
# download the CSV from Kaggle (https://www.kaggle.com/saurabh00007/diabetescsv) and place it in the Data folder
# load the dataset using a relative path
diabetes = pd.read_csv("../Data/Diabetes.csv")
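
Relative paths like the one above depend on the notebook's working directory and, on case-sensitive filesystems, on the exact capitalisation of the file name. A minimal sketch of a lookup that tolerates a case mismatch follows; `find_data_file` is a hypothetical helper, not part of pandas or skimpy, and the demo uses a temporary directory in place of the Data folder.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_data_file(directory: Path, filename: str) -> Optional[Path]:
    """Return the file in `directory` whose name matches `filename`, ignoring case."""
    if not directory.is_dir():
        return None
    wanted = filename.lower()
    for candidate in sorted(directory.iterdir()):
        if candidate.is_file() and candidate.name.lower() == wanted:
            return candidate
    return None


# Demo on a throwaway directory standing in for the Data folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "Diabetes.csv").write_text("a,b\n1,2\n")
    match = find_data_file(data_dir, "diabetes.csv")
    print(match.name if match else "not found")

# Hypothetical use in the notebook:
# path = find_data_file(Path("../Data"), "diabetes.csv")
# diabetes = pd.read_csv(path)
```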

In [6]:
diabetes.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [7]:
skim(diabetes)

# generate test data <a class="anchor" id="skimpy_generate_test_data"></a>
[Back to Top](#skimpy_toc)

In [None]:
# generate a test DataFrame with skimpy's generate_test_data()
test_data = generate_test_data()

In [None]:
generate_test_data?

In [None]:
test_data.describe()

In [None]:
test_data.info()

In [None]:
skim(test_data)

# command line (CLI) skimpy <a class="anchor" id="skimpy_cli"></a>
[Back to Top](#skimpy_toc)

In [None]:
# you can also run the skimpy command-line interface (CLI) on the data
!skimpy ../Data/Diabetes.csv

# anagrams <a class="anchor" id="skimpy_anagrams"></a>
[Back to Top](#skimpy_toc)

In [8]:
anagrams = sns.load_dataset("anagrams")

In [9]:
skim(anagrams)

In [10]:
anagrams.head()

| | subidr | attnr | num1 | num2 | num3 |
|---|---|---|---|---|---|
| 0 | 1 | divided | 2 | 4.0 | 7 |
| 1 | 2 | divided | 3 | 4.0 | 5 |
| 2 | 3 | divided | 3 | 5.0 | 6 |
| 3 | 4 | divided | 5 | 7.0 | 5 |
| 4 | 5 | divided | 4 | 5.0 | 8 |


# anscombe <a class="anchor" id="skimpy_anscombe"></a>
[Back to Top](#skimpy_toc)

In [11]:
anscombe = sns.load_dataset("anscombe")

In [12]:
skim(anscombe)

In [None]:
anscombe.head()

# attention <a class="anchor" id="skimpy_attention"></a>
[Back to Top](#skimpy_toc)

In [None]:
attention = sns.load_dataset("attention")

In [None]:
attention.describe()

In [None]:
skim(attention)

# brain_networks <a class="anchor" id="skimpy_brain_networks"></a>
[Back to Top](#skimpy_toc)

In [None]:
brain_networks = sns.load_dataset("brain_networks")

In [None]:
brain_networks.describe()

In [None]:
# skim raises an error on this dataset, so the call is left commented out
#skim(brain_networks)

# car_crashes <a class="anchor" id="skimpy_car_crashes"></a>
[Back to Top](#skimpy_toc)

In [None]:
car_crashes = sns.load_dataset("car_crashes")

In [None]:
car_crashes.describe()

In [None]:
skim(car_crashes)

# diamonds <a class="anchor" id="skimpy_diamonds"></a>
[Back to Top](#skimpy_toc)

In [None]:
diamonds = sns.load_dataset("diamonds")

In [None]:
diamonds.describe()

In [None]:
skim(diamonds)

# dots <a class="anchor" id="skimpy_dots"></a>
[Back to Top](#skimpy_toc)

In [None]:
dots = sns.load_dataset("dots")

In [None]:
dots.describe()

In [None]:
skim(dots)

# exercise <a class="anchor" id="skimpy_exercise"></a>
[Back to Top](#skimpy_toc)

In [None]:
exercise = sns.load_dataset("exercise")

In [None]:
exercise.describe()

In [None]:
skim(exercise)

# flights <a class="anchor" id="skimpy_flights"></a>
[Back to Top](#skimpy_toc)

In [None]:
flights = sns.load_dataset("flights")

In [None]:
flights.describe()

In [None]:
skim(flights)

In [None]:
flights.head()

In [None]:
flights.tail()

# fmri <a class="anchor" id="skimpy_fmri"></a>
[Back to Top](#skimpy_toc)

In [None]:
fmri = sns.load_dataset("fmri")

In [None]:
fmri.describe()

In [None]:
skim(fmri)

# gammas <a class="anchor" id="skimpy_gammas"></a>
[Back to Top](#skimpy_toc)

In [None]:
gammas = sns.load_dataset("gammas")

In [None]:
gammas.describe()

In [None]:
skim(gammas)

# geyser <a class="anchor" id="skimpy_geyser"></a>
[Back to Top](#skimpy_toc)

In [None]:
geyser = sns.load_dataset("geyser")

In [None]:
geyser.describe()

In [None]:
skim(geyser)

# iris <a class="anchor" id="skimpy_iris"></a>
[Back to Top](#skimpy_toc)

In [None]:
iris = sns.load_dataset("iris")

In [None]:
iris.describe()

In [None]:
skim(iris)

# mpg <a class="anchor" id="skimpy_mpg"></a>
[Back to Top](#skimpy_toc)

In [None]:
mpg = sns.load_dataset("mpg")

In [None]:
mpg.describe()

In [None]:
skim(mpg)

# penguins <a class="anchor" id="skimpy_penguins"></a>
[Back to Top](#skimpy_toc)

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.describe()

In [None]:
skim(penguins)

# planets <a class="anchor" id="skimpy_planets"></a>
[Back to Top](#skimpy_toc)

In [None]:
planets = sns.load_dataset("planets")

In [None]:
planets.describe()

In [None]:
skim(planets)

# taxis <a class="anchor" id="skimpy_taxis"></a>
[Back to Top](#skimpy_toc)

In [None]:
taxis = sns.load_dataset("taxis")

In [None]:
taxis.describe()

In [None]:
skim(taxis)

# tips <a class="anchor" id="skimpy_tips"></a>
[Back to Top](#skimpy_toc)

In [None]:
tips = sns.load_dataset("tips")

In [None]:
tips.describe()

In [None]:
skim(tips)

# titanic <a class="anchor" id="skimpy_titanic"></a>
[Back to Top](#skimpy_toc)

In [13]:
titanic = sns.load_dataset("titanic")

In [14]:
titanic.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
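
By default `describe()` summarises only the numeric columns of a mixed-type frame, and a `count` below the row total (714 for `age` against 891 for the other columns) is the only hint of missing values. The sketch below shows, on a hypothetical toy frame rather than the titanic data, how `include="all"` and `isna().sum()` surface the same information in plain pandas.

```python
import pandas as pd

# Hypothetical toy frame with one missing age and one text column.
toy = pd.DataFrame(
    {
        "age": [22.0, None, 38.0, 35.0],
        "fare": [7.25, 71.28, 8.05, 53.10],
        "sex": ["male", "female", "female", "male"],
    }
)

numeric_only = toy.describe()             # numeric columns only
everything = toy.describe(include="all")  # also covers the text column
missing = toy.isna().sum()                # missing values per column

print(list(numeric_only.columns))
print(list(everything.columns))
print(missing.to_dict())
```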


In [15]:
skim(titanic)