## What Is a Dataset?

A dataset is a collection of data points with a common schema. The Cortex Python SDK provides transformations and visualizations to facilitate data cleaning, feature identification and feature construction. In this notebook we demonstrate how to build a dataset and how to view the contents of datasets.
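
To make "common schema" concrete, the sketch below checks that every record in a small in-memory collection has the same field names and value types. The helpers `infer_schema` and `has_common_schema` are hypothetical names used only for illustration; they are not part of the Cortex SDK.

```python
# Hypothetical helpers for illustration only; not part of the Cortex SDK.

def infer_schema(record):
    """Map each field name in a record to the name of its value's type."""
    return {field: type(value).__name__ for field, value in record.items()}


def has_common_schema(records):
    """Return True if every record has the same fields and value types."""
    schemas = [infer_schema(r) for r in records]
    return all(s == schemas[0] for s in schemas)


# Two data points that share the fields id, name, and score.
data_points = [
    {"id": 1, "name": "alpha", "score": 0.5},
    {"id": 2, "name": "beta", "score": 1.25},
]

# Adding a record with different fields breaks the common schema.
mismatched = data_points + [{"id": 3, "label": "gamma"}]

print(has_common_schema(data_points))
print(has_common_schema(mismatched))
```

In practice a dataset library would typically enforce or record the schema for you; the point here is only that all data points are expected to have the same shape.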

## How is a Dataset Built? 
First, import the Cortex library and instantiate a builder.

In [None]:
%run ./config.ipynb

In [None]:
# Run this cell if your token or session has expired.

from cortex import Cortex
Cortex.login()

In [None]:
from cortex import Cortex

builder = Cortex.client().builder()


Builder is the top-level factory object in the Cortex Python SDK. It returns a factory object that is customized to handle the context of the particular class it builds. A dataset needs a collection of data to be useful, so the factory object returns a dataset builder that can take data in a number of different forms.
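
The general shape of this factory-and-builder arrangement can be sketched in plain Python. Everything below (`SketchBuilder`, `SketchDatasetBuilder`, `SketchDataset`) is a hypothetical stand-in written for illustration; it is not the Cortex SDK implementation, and the dataset name `demo/example` is made up. The sketch only shows why each `from_*` call can be chained and why `build()` comes last.

```python
# Hypothetical classes that mimic the shape of a fluent builder.
# They are NOT the Cortex SDK implementation.

class SketchDataset:
    """Plain container for the values collected by the builder."""

    def __init__(self, name, source_kind, source):
        self.name = name
        self.source_kind = source_kind
        self.source = source


class SketchDatasetBuilder:
    """Collects a name and a data source, then builds a dataset object."""

    def __init__(self, name):
        self._name = name
        self._source_kind = None
        self._source = None

    def from_csv(self, path):
        self._source_kind, self._source = "csv", path
        return self  # returning self is what allows chained calls

    def from_json(self, path):
        self._source_kind, self._source = "json", path
        return self

    def build(self):
        if self._source is None:
            raise ValueError("a dataset needs a data source before build()")
        return SketchDataset(self._name, self._source_kind, self._source)


class SketchBuilder:
    """Top-level factory that hands out class-specific builders."""

    def dataset(self, name):
        return SketchDatasetBuilder(name)


ds = SketchBuilder().dataset("demo/example").from_csv("./data/sample_large.csv").build()
print(ds.name, ds.source_kind, ds.source)
```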

For example, you can associate a CSV file with a dataset:

In [None]:
# Prompt for a dataset name so that each participant's dataset is distinct.
# The name is stored in a variable and reused throughout this example.
dataset_name1 = input("namespace/dataset name")
    
csv_data_set_builder = builder.dataset(dataset_name1)

csv_example_data_set = csv_data_set_builder.from_csv('./data/sample_large.csv').build()

Or associate a JSON file with a dataset:

In [None]:
# Prompt for a dataset name so that each participant's dataset is distinct.
# The name is stored in a variable and reused throughout this example.
dataset_name2 = input("namespace/dataset name")

json_data_set_builder = builder.dataset(dataset_name2)

json_example_data_set = json_data_set_builder.from_json('./data/sample.json').build()

Or from a pandas DataFrame:

In [None]:
import numpy as np
import pandas as pd

# Prompt for a dataset name so that each participant's dataset is distinct.
# The name is stored in a variable and reused throughout this example.
dataset_name3 = input("namespace/dataset name")

# two columns of random numbers, indexed a through e
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
q = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

# make a data frame by composing the columns together and labeling them
pdf = pd.DataFrame({'c1':s,'c2':q})

pd_data_set_builder = builder.dataset(dataset_name3)

data_frame_data_set = pd_data_set_builder.from_df(pdf).build()

In [None]:
data_frame_data_set.as_pandas()

## Set the Title and Description of Your Datasets

The cells below set these properties on the in-memory dataset objects created above.

In [None]:
csv_example_data_set.title = 'A Title for the example csv dataset by <your name>'
csv_example_data_set.description = 'A somewhat longer piece of text that describes the purpose of the dataset by <your name>.'

In [None]:
json_example_data_set.title = 'A Title for the example json dataset by <your name>'
json_example_data_set.description = 'A somewhat longer piece of text that describes the purpose of the dataset by <your name>.'

Once a dataset is constructed, you can explicitly persist it. Here we use `csv_example_data_set`.

In [None]:
x = csv_example_data_set.save()

In [None]:
df = x.as_pandas()
print(df)

Save `json_example_data_set` in the same way and view its contents:

In [None]:
y = json_example_data_set.save()

In [None]:
df = y.as_pandas()
print(df)

Note that with the `Cortex.local()` client, the dataset is persisted to the local disk. When using the Cortex client `Cortex.client()`, the dataset is persisted in Cortex.

You can now see the persisted datasets and their relevant information in the Cortex CLI by using the `cortex datasets list` command. To see more information about a specific dataset, use the `cortex datasets describe <dataset_name>` command.

## Dataset Feature Construction

Datasets help in feature construction through the use of pipelines. Pipelines allow functions to be chained together to modify and combine columns to create and clarify new features in the dataset. To find out how to create and persist pipelines, see [Pipeline](https://docs.cortex.insights.ai/docs/cortex-python-sdk-guide/pipeline/).
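
As a rough illustration of how chained steps build up features, here is a minimal sketch in plain Python. `SketchPipeline` is a hypothetical stand-in, not the Cortex pipeline class; it borrows only the `add_step`, `reset`, and `run` method names and the `(pipeline, data)` step signature used later in this notebook. For simplicity the data is a list of dictionaries rather than a DataFrame, and the step functions `drop_missing` and `add_total` are invented for the example.

```python
# Hypothetical stand-in for a pipeline; the real Cortex class may differ.

class SketchPipeline:
    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, func):
        """Append a step; steps run in the order they were added."""
        self.steps.append(func)
        return self

    def reset(self):
        """Remove all steps so the pipeline can be redefined."""
        self.steps = []
        return self

    def run(self, data):
        """Feed the output of each step into the next one."""
        for step in self.steps:
            data = step(self, data)
        return data


def drop_missing(pipeline, rows):
    # Cleaning step: keep only rows where every value is present.
    return [row for row in rows if all(v is not None for v in row.values())]


def add_total(pipeline, rows):
    # Feature construction step: combine two columns into a new one.
    return [{**row, "total": row["c1"] + row["c2"]} for row in rows]


rows = [{"c1": 1.0, "c2": 2.0}, {"c1": None, "c2": 4.0}, {"c1": 3.0, "c2": 0.5}]
pipeline = SketchPipeline("clean_rows").add_step(drop_missing).add_step(add_total)
result = pipeline.run(rows)
print(result)
```

Because each step receives the output of the previous one, the order of `add_step` calls matters: here the row with a missing value is dropped before the new column is computed.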

## View Datasets

Datasets can be viewed in tables or through visualizations. 

### Data Dictionary
A Dataset can generate a data dictionary:

In [None]:
csv_example_data_set.get_dataframe()

### pandas DataFrame

Datasets can also generate pandas DataFrames. 

In [None]:
jdf = json_example_data_set.as_pandas()

In [None]:
cdf = csv_example_data_set.as_pandas()

pandas DataFrames include several different methods for [viewing data](https://pandas.pydata.org/pandas-docs/stable/10min.html#viewing-data).

In [None]:
jdf.head()

In [None]:
cdf.head()

### With Visualizations 

Here are the built-in visualizations that you get with datasets. Visualizations require a DataFrame. Most commonly, the DataFrame is constructed by running a pipeline on the dataset:

In [None]:
# A NameError here is expected on the first run: it only means the pipeline has not been defined and run yet.
clean_csv_pl.reset()

In [None]:
clean_csv_pl = csv_example_data_set.pipeline('clean_csv_pl')

def add_new_column(pipeline, df):
    # Note: this step ignores the incoming df and returns a new DataFrame
    # with three columns of random numbers, indexed a through e.
    x = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    y = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    z = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    pdf = pd.DataFrame({'c1': x, 'c2': y, 'c3': z})
    return pdf

In [None]:
clean_csv_pl.add_step(add_new_column)

In [None]:
cleaned_csv_df = clean_csv_pl.run(csv_example_data_set.as_pandas())

cleaned_csv_df.describe()

In [None]:
v = csv_example_data_set.visuals(cleaned_csv_df)

In [None]:
v.show_corr_heatmap()

In [None]:
v.show_corr('c1')

In [None]:
v.show_corr_pairs('c1')

In [None]:
v.show_dist('c1')

In [None]:
v.show_probplot('c1')
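
To give a sense of the numbers behind a correlation heatmap, the sketch below computes a pairwise Pearson correlation matrix in plain Python. It assumes the heatmap is based on Pearson correlations, which is the usual convention but is not confirmed by this notebook. The `pearson` and `correlation_matrix` helpers and the sample values are made up for illustration.

```python
import math

# Illustrative helpers; the SDK's own computation is an assumption here.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Sample columns: c2 rises with c1, c3 falls as c1 rises.
columns = {
    "c1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c3": [5.0, 4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A heatmap colors each cell of such a matrix by its value, so strongly related column pairs stand out at a glance. With the random columns produced by the pipeline step above, the off-diagonal values will differ on every run.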