# Feature Profiling with PyRasgo

This notebook explains how to use `pyrasgo` to create feature profiles of a `pandas` dataframe.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [numpy](https://numpy.org/doc/stable/)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np

import pyrasgo

## Connect to Rasgo

Enter your email and password to create an account. This account gives you free access to the Rasgo API, which calculates dataframe profiles, generates feature importance scores, and produces feature explainability for your analysis. It also lets you keep access to your analyses and share them with your colleagues.

**Note:** This only needs to be run the first time you use `pyrasgo`.

In [2]:
#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

In [3]:
rasgo = pyrasgo.login(email='<your email>', password='<your password>')

## Reading the data

The data is the `flights` table from the `nycflights13` package in `rdatasets`, loaded using the Python package `statsmodels`.

In [4]:
df = sm.datasets.get_rdataset('flights', 'nycflights13').data

## Feature Engineering

### Convert the times from floats or ints to hours and minutes

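The time fields store clock times as numbers of the form HHMM (for example, 517 means 5:17). After dropping rows with missing values, split these fields into separate hour and minute fields to better understand when flights depart and arrive. The original time fields, now redundant, are then dropped, together with `dep_delay`.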

In [5]:
# Drop rows with missing values so the time fields can be converted to int
df.dropna(inplace=True)
# Times are encoded as HHMM: the hour is floor(x/100), the minute is the remainder
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
# The dataset already has hour and minute fields; rename them to match the new fields
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
# Drop the original time fields, plus time_hour and dep_delay
df.drop(columns=['time_hour', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'], inplace=True)
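
To make the arithmetic in the cell above concrete, here is a minimal sketch of the same HHMM split on single values. The helper name `split_hhmm` is hypothetical and is not part of the notebook or of `pyrasgo`; it only illustrates the calculation that the lambdas perform.

```python
def split_hhmm(t):
    """Split a clock time encoded as HHMM (int or float) into (hour, minute).

    Hypothetical helper for illustration only. It assumes t is not missing,
    which is why the notebook drops rows with missing values first.
    """
    # divmod(x, 100) returns (x // 100, x % 100), i.e. the hour and the minute
    hour, minute = divmod(int(t), 100)
    return hour, minute


print(split_hhmm(517.0))  # (5, 17)
print(split_hhmm(2359))   # (23, 59)
```

On a whole column, the same split can be written with pandas integer operators, for example `df['arr_time'].astype(int) // 100` for the hour and `df['arr_time'].astype(int) % 100` for the minute, provided missing values have been dropped first. Note that a value of 2400 maps to hour 24 rather than hour 0 under either approach.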

## Profile Features

In [6]:
response = rasgo.evaluate.profile(df)
response

{'columnProfiles': [{'columnName': 'year',
   'dataType': 'integer',
   'featureStats': {'recCt': 327346,
    'distinctCt': 1,
    'nullRecCt': 0,
    'zeroValRecCt': 0,
    'meanVal': 2013.0,
    'medianVal': 2013.0,
    'maxVal': 2013.0,
    'minVal': 2013.0,
    'sumVal': 658947498.0,
    'stdDevVal': 0.0,
    'varianceVal': 0.0,
    'skewVal': 0.0,
    'kurtosisVal': 0.0,
    'q1Val': 2013.0,
    'q3Val': 2013.0,
    'pct5Val': 2013.0,
    'pct95Val': 2013.0},
   'commonValues': [{'val': 2013, 'recCt': 327346, 'freq': 1.0}],
   'histogram': [{'height': 327346,
     'bucketFloor': 2012.5,
     'bucketCeiling': 2012.5}]},
  {'columnName': 'month',
   'dataType': 'integer',
   'featureStats': {'recCt': 327346,
    'distinctCt': 12,
    'nullRecCt': 0,
    'zeroValRecCt': 0,
    'meanVal': 6.564802991330274,
    'medianVal': 7.0,
    'maxVal': 12.0,
    'minVal': 1.0,
    'sumVal': 2148962.0,
    'stdDevVal': 3.4134443809918524,
    'varianceVal': 11.65160254212485,
    'skewVal': -0.0