<img src="https://assets-global.website-files.com/5b26e3fda3234fe366aa392d/5c2d0132868913d90eba845b_v7_logo_Nobg.png" loading="lazy" sizes="40px" width="200" srcset="https://assets-global.website-files.com/5b26e3fda3234fe366aa392d/5c2d0132868913d90eba845b_v7_logo_Nobg-p-500.png 500w, https://assets-global.website-files.com/5b26e3fda3234fe366aa392d/5c2d0132868913d90eba845b_v7_logo_Nobg-p-800.png 800w, https://assets-global.website-files.com/5b26e3fda3234fe366aa392d/5c2d0132868913d90eba845b_v7_logo_Nobg.png 1000w" alt="" class="nav-2022---logo">
<br>
<img data-canonical-src="https://img.shields.io/badge/GitHub-100000?logo=github&amp;logoColor=white" alt="GitHub" src="https://camo.githubusercontent.com/8fdc8a65f5384d2285b19d3985fa80f21c23634c6de3a0f0d2aff988c96bef9f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4769744875622d3130303030303f6c6f676f3d676974687562266c6f676f436f6c6f723d7768697465">

# V7 Connector for Databricks
In this notebook, we demonstrate how to use the **DarwinPyspark** and **DarwinPy** libraries to manage your data in Databricks and interact with the V7 platform. DarwinPyspark lets you upload data from a PySpark DataFrame to V7 and download V7 exports as PySpark DataFrames.


_If you have any questions or comments, please reach out to us [here](mailto:databricks+partners@v7labs.com)_

# Requirements

#### Data

The PySpark DataFrame you upload from needs two columns:

- An `object_url` column
  - this column must contain open or presigned URLs, each pointing to an asset to be uploaded.
- A `file_name` column
  - this column contains the name each asset should have in V7.
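
The column requirements above can be checked before any upload is attempted. The helper below is a minimal sketch, not part of either library: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
# Column names required for upload, as listed above.
REQUIRED_COLUMNS = ("object_url", "file_name")


def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example with a plain list of column names.
print(missing_columns(["file_name", "label"]))  # ['object_url']
```

With a PySpark DataFrame, pass `df.columns` and stop if the returned list is not empty.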

#### API Key
<br>
<div class="alert alert-block alert-info"> <b>Tip:</b> You can generate your API key <a style="font-size: 120%; text-decoration: underline; cursor: pointer;" href="https://darwin.v7labs.com/?settings=api-keys"> here </a> 

>  Run the cell below for a video guide on how to generate an API key </div>

In [0]:
from IPython.display import IFrame
IFrame("https://www.loom.com/embed/3d414eb9c0444b0f9309d49b8e393414", width=950, height=400)

# Code

## Set-Up & Imports

Install `darwinpyspark` and `darwin-py`

In [0]:
%pip install darwinpyspark
%pip install darwin-py

Import the required modules

In [0]:
# Darwin-Py Modules
import darwin
from darwin.client import Client

# DarwinPyspark Modules
from darwinpyspark import DarwinPyspark

# Other
import pandas as pd
from datetime import datetime

In [0]:
# Show full column contents (e.g. long URLs) instead of truncating them
pd.set_option('display.max_colwidth', None)

### Authenticate

In [0]:
API_KEY = "<your-api-key>"

client = Client.from_api_key(API_KEY)
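
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from an environment variable. The variable name `V7_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the V7 libraries define.

```python
import os


def load_api_key(env_var="V7_API_KEY"):
    """Read the API key from an environment variable (the name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your V7 API key.")
    return key


# Usage in place of the hard-coded string:
# API_KEY = load_api_key()
# client = Client.from_api_key(API_KEY)
```

On Databricks, a secret scope read with `dbutils.secrets.get(scope="<your-scope>", key="<your-key>")` serves the same purpose; the scope and key names are placeholders you would define yourself.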

## Interact With Datasets

### Create / Load Datasets

We now either create and connect to a new dataset, or connect to an existing one.
> Simply change the value of `dataset_name` in the cell below:

In [0]:
dataset_name = "Databricks Demo"
slug_dataset_name = dataset_name.lower().strip().replace(' ', '-')

try:
    dataset = client.create_dataset(dataset_name)
    print('New dataset created.')
except darwin.exceptions.NameTaken:
    dataset = client.get_remote_dataset(slug_dataset_name)
    print('Dataset name already taken; connected to the existing dataset.')
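
The slug in the cell above is built by lowercasing the name and replacing spaces with hyphens, which covers simple names such as "Databricks Demo". For names with punctuation or repeated spaces, a slightly more defensive helper might look like the sketch below. `to_slug` is a hypothetical helper, and the exact rule V7 uses to derive slugs is an assumption here, so compare the result with the `slug_name` values listed in the next section if the lookup fails.

```python
import re


def to_slug(name):
    """Lowercase `name` and join its alphanumeric runs with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


print(to_slug("Databricks Demo"))  # databricks-demo
```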

### View Datasets

To view information about your existing datasets, run the code below.

In [0]:
# One summary dict per remote dataset in your team
[
    {'name': d.name, 'id': d.dataset_id, 'slug_name': d.slug,
     'no_files': d.item_count, 'progress': d.progress}
    for d in client.list_remote_datasets()
]

### Upload Data To V7

##### Load Spark table
Load the visual data into a PySpark DataFrame. This DataFrame should contain the **file_name** and **object_url** of the assets you want to upload to V7.

In [0]:
df = spark.read.table("v7_sandbox.demo_table")
df.display()
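
Before uploading, it can help to sanity-check that the `object_url` values look like web URLs. The sketch below assumes that "open or presigned URLs" means `http` or `https` links; `looks_like_http_url` is a hypothetical helper and the sample values are made up for illustration.

```python
from urllib.parse import urlparse


def looks_like_http_url(value):
    """Return True if `value` is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Made-up sample mapping file_name -> object_url.
sample = {
    "cat.jpg": "https://example.com/images/cat.jpg",
    "dog.jpg": "s3://my-bucket/dog.jpg",
}
bad = [name for name, url in sample.items() if not looks_like_http_url(url)]
print(bad)  # ['dog.jpg']
```

On the real table, the same check could be run over a small sample, for example the rows returned by `df.select("file_name", "object_url").limit(100).collect()`.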

Initialise `DarwinPyspark` using your _API Key_, your _team name_ and your _dataset name_ in V7

In [0]:
dp = DarwinPyspark(API_KEY, "<your-team-name>", "<your-dataset-name>")

Then call the `upload_items()` method to iterate over your table and send the data to V7.

In [0]:
dp.upload_items(df)

### Download Data

#### Create an export

In [0]:
# Export names should be in slug format, so the timestamp uses '-' rather than ':'
export_name = f"databricks-export-{datetime.now().strftime('%H-%M')}"
dataset.export(export_name)
export_name

#### Download an export

In [0]:
export_df = dp.download_export(export_name)
export_df.toPandas()