Welcome to SimSearch! This tutorial shows you how to upload your data to SimSearch and perform a covariate search.<br>
The first step is to set up the project variables.

In [None]:
import requests

API_key = '<API_key>'
base_url = 'https://m18px3fd0m.execute-api.eu-west-1.amazonaws.com/beta'

client_id = '<client_id>'
project_name = '<project_name>' # choose a custom project name

## Create a new project

Our first step is creating a project. A project is essentially a space where you can upload your data.<br>
When you perform covariate searches, only the data uploaded to that project is taken into account.

In [74]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name": project_name, #project_name
}

url = f"{base_url}/project/create"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200, 'outcome': 'project has been created'}
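
Every call in this tutorial repeats the same headers and the same `client_id` and `project_name` fields. As a convenience, they can be assembled in one place. The sketch below is not part of the SimSearch API: `build_request` is a hypothetical helper name, and the URL and credential strings are placeholders standing in for the variables defined earlier.

```python
def build_request(base_url, api_key, client_id, project_name, endpoint, **extra):
    """Return (url, headers, payload) for one API call."""
    url = f"{base_url}/{endpoint.lstrip('/')}"
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    # Extra keyword arguments (e.g. json_batch, k) are merged into the payload
    payload = {"client_id": client_id, "project_name": project_name, **extra}
    return url, headers, payload

url, headers, payload = build_request(
    "https://example.invalid/beta",  # placeholder base URL
    "<API_key>", "<client_id>", "<project_name>",
    "project/list_stats",
)
print(url)
```

The returned values can then be passed on unchanged, for example as `requests.post(url, json=payload, headers=headers)`.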

Every project has a limit on samples and tags, which you can request to extend.<br><br>
Now that we have created a new project, let us check the project limits:<br>
as the output below shows, this plan allows for a maximum of 50,000 samples and 1,000 unique tags.

In [75]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name" : project_name
}

url = f"{base_url}/project/list_stats"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200,
 'stats': {'limit': {'tags': 1000, 'samples': 50000},
  'hosting': {},
  'storage': {'samples_processed': 0,
   'samples_uploaded': 0,
   'tags_processed': 0,
   'tag_uploaded': 0},
  'processing_id': '875f38c1-901e-48e2-bdd9-a28d684e4e9a'}}
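
Before uploading, it can be useful to compare the size of your dataset against these limits. The following is a minimal sketch that assumes the response keeps the shape shown above (`stats` → `limit` → `tags` / `samples`); `fits_limits` is a hypothetical helper, and the example dictionary is a trimmed copy of that output.

```python
def fits_limits(stats_response, n_samples, n_tags):
    """Return True if the dataset stays within the project limits."""
    limit = stats_response["stats"]["limit"]
    return n_samples <= limit["samples"] and n_tags <= limit["tags"]

# Trimmed copy of the response printed above
example = {"statuscode": 200,
           "stats": {"limit": {"tags": 1000, "samples": 50000}}}

print(fits_limits(example, n_samples=41895, n_tags=446))  # True
print(fits_limits(example, n_samples=60000, n_tags=446))  # False
```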

## Prepare your data

To prepare your data, you need to follow two steps:<br>
1. **Upload**<br>
Covariate search is different from semantic search: a model needs to be prepared **on a set of unique tags**, <br>
which means that you first have to upload all your data to a dedicated space.

2. **Processing**<br>
SimSearch will create a lightweight model out of that initial data: this phase is called **processing**.

We can start by loading the Steam library into a pandas dataframe.<br>
To upload the data into our API, it will have to be converted into a list of dictionaries.<br>

There are two mandatory fields when uploading our data:
- **_id**
- **_tags**

We can start by renaming our columns accordingly.

In [2]:
import pandas as pd

# make sure to use dropna() to avoid breaking the API with nan data
df = pd.read_parquet('games.parquet').dropna()
df = df[['Name', 'Tags']]
# make sure the ids and the indexes are aligned
df = df.reset_index(drop=True)
df = df.reset_index()
df.columns = ['_id', 'name', '_tags']
df['_tags'] = df['_tags'].apply(lambda x : x.split(','))
df

Unnamed: 0,_id,name,_tags
0,0,Galactic Bowling,"[Indie, Casual, Sports, Bowling]"
1,1,Train Bandit,"[Indie, Action, Pixel Graphics, 2D, Retro, Arc..."
2,2,TD Worlds,"[Tower Defense, Rogue-lite, RTS, Replay Value,..."
3,3,MazM: Jekyll and Hyde,"[Adventure, Simulation, RPG, Strategy, Singlep..."
4,4,Deadlings: Rotten Edition,"[Action, Indie, Adventure, Puzzle-Platformer, ..."
...,...,...,...
41890,41890,Drop Doll,"[Mature, Sexual Content, Casual, Relaxing, NSF..."
41891,41891,Ant Farm Simulator,"[Simulation, Casual, Sandbox, Farming Sim, Lif..."
41892,41892,The Holyburn Witches,"[Casual, Adventure, Point & Click, Exploration..."
41893,41893,Digital Girlfriend,"[Casual, Sexual Content, Nudity, Adventure, Ma..."
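
Note that `x.split(',')` keeps any whitespace around a tag and any repeated tag. If the raw `Tags` column contains either (an assumption; the sample output above does not show it), a slightly stricter parser may help keep the set of unique tags clean. `parse_tags` below is a hypothetical helper, shown only as a sketch.

```python
def parse_tags(raw):
    """Split a comma-separated tag string, trimming whitespace and
    dropping empty or repeated tags while keeping first-seen order."""
    tags = []
    for tag in raw.split(","):
        tag = tag.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags

print(parse_tags("Indie, Casual,Sports,,Indie"))
# ['Indie', 'Casual', 'Sports']
```

It could replace the lambda in the cell above, as in `df['_tags'].apply(parse_tags)`.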


The data needs to be prepared as lists of dictionaries, each list called a **batch**: each dictionary in a batch contains a **unique id**.<br>
In the following example, we split our data into 9 batches: eight of 5,000 samples each, plus a last batch holding the remaining 1,895 of the 41,895 samples.

In [3]:
def create_ranges(n, length):
    ranges = []
    start = 0
    while start < n:
        end = min(start + length, n)
        ranges.append([start, end])
        start += length
    return ranges

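# batches of at most 5000 samples each (the last one holds the remainder)
ranges = create_ranges(n=len(df), length=5000)

# convert the dataframe to records once, then slice one batch per range
records = df.to_dict(orient='records')
json_batch_list = list()
for start, end in ranges:
    json_batch_list.append(records[start:end])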
     
# let us have a look at one batch
json_batch_list[0][0:3]

[{'_id': 0,
  'name': 'Galactic Bowling',
  '_tags': ['Indie', 'Casual', 'Sports', 'Bowling']},
 {'_id': 1,
  'name': 'Train Bandit',
  '_tags': ['Indie',
   'Action',
   'Pixel Graphics',
   '2D',
   'Retro',
   'Arcade',
   'Score Attack',
   'Minimalist',
   'Comedy',
   'Singleplayer',
   'Fast-Paced',
   'Casual',
   'Funny',
   'Parody',
   'Difficult',
   'Gore',
   'Violent',
   'Western',
   'Controller',
   'Blood']},
 {'_id': 2,
  'name': 'TD Worlds',
  '_tags': ['Tower Defense',
   'Rogue-lite',
   'RTS',
   'Replay Value',
   'Perma Death',
   '2D',
   'Isometric',
   'Difficult',
   'Rogue-like',
   'Dynamic Narration',
   'Stylized',
   'Real Time Tactics',
   'Strategy',
   'Minimalist',
   'Abstract',
   'Tactical',
   'Atmospheric',
   'Singleplayer',
   'Sci-fi',
   'Mystery']}]
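
Since every record needs the two mandatory fields and a unique id, a quick local check before uploading can catch mistakes early. This is a sketch under the assumption that ids must be unique across all batches of a project; `validate_batches` is a hypothetical helper, and the two sample records are shortened versions of the ones shown above.

```python
def validate_batches(batches):
    """Raise ValueError if a record lacks a mandatory field or an _id repeats.
    Return the number of unique ids otherwise."""
    seen_ids = set()
    for batch_index, batch in enumerate(batches):
        for record in batch:
            missing = {"_id", "_tags"} - set(record)
            if missing:
                raise ValueError(f"batch {batch_index}: missing {sorted(missing)}")
            if record["_id"] in seen_ids:
                raise ValueError(f"batch {batch_index}: duplicate _id {record['_id']}")
            seen_ids.add(record["_id"])
    return len(seen_ids)

sample_batches = [
    [{"_id": 0, "name": "Galactic Bowling", "_tags": ["Indie", "Casual"]}],
    [{"_id": 1, "name": "Train Bandit", "_tags": ["Indie", "Action"]}],
]
print(validate_batches(sample_batches))  # 2
```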

### Upload

We can proceed to upload our dataset!

In [78]:
for batch_index in range(len(json_batch_list)):
	data = {
		"client_id": client_id,
		"project_name" : project_name,
		"json_batch" : json_batch_list[batch_index],
		"batch_id" : str(batch_index),
	}

	url = f"{base_url}/data/upload/add"
	response = requests.post(url, json=data, headers=headers)
	print(response.json())

{'statuscode': 200, 'batch_id': '0', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '1', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '2', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '3', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '4', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '5', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '6', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '7', 'outcome': 'batch added to queue'}
{'statuscode': 200, 'batch_id': '8', 'outcome': 'batch added to queue'}
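
The loop above prints each response but does not react to a rejected batch. The sketch below collects the ids of batches whose response does not carry `statuscode` 200. It assumes that a failed upload is reported through a different `statuscode`, which the outputs in this tutorial do not confirm. `upload_batches` and `fake_send` are hypothetical names; `fake_send` stands in for the real HTTP call so the example runs offline.

```python
def upload_batches(batches, send, client_id, project_name):
    """Send every batch through send(payload); return ids of rejected batches."""
    failed = []
    for batch_index, batch in enumerate(batches):
        payload = {
            "client_id": client_id,
            "project_name": project_name,
            "json_batch": batch,
            "batch_id": str(batch_index),
        }
        result = send(payload)
        if result.get("statuscode") != 200:
            failed.append(str(batch_index))
    return failed

# Stand-in sender that rejects batch '1', for illustration only
def fake_send(payload):
    ok = payload["batch_id"] != "1"
    return {"statuscode": 200 if ok else 500, "batch_id": payload["batch_id"]}

demo = [[{"_id": 0, "_tags": ["a"]}], [{"_id": 1, "_tags": ["b"]}]]
print(upload_batches(demo, fake_send, "<client_id>", "<project_name>"))  # ['1']
```

In the notebook, `send` could be something like `lambda p: requests.post(url, json=p, headers=headers).json()`.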


Remember that the data will not be processed until we call the **/data/process/start** command.<br>
This means that we are free to add/remove our batches at any time. We can always check out the list of uploaded batches with the following command:

In [79]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name" : project_name,
}

url = f"{base_url}/data/upload/list_batches"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200,
 'to_process_list': ['8', '0', '4', '2', '6', '1', '5', '3', '7']}
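
The batch ids come back in no particular order, so comparing them as a set is the easiest way to confirm that nothing is missing. A small sketch, with `missing_batches` as a hypothetical helper and `listed` copied from the output above:

```python
def missing_batches(expected_count, to_process_list):
    """Return the batch ids '0'..'expected_count-1' absent from the queue."""
    expected = {str(i) for i in range(expected_count)}
    return sorted(expected - set(to_process_list), key=int)

listed = ['8', '0', '4', '2', '6', '1', '5', '3', '7']
print(missing_batches(9, listed))  # []
```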

In addition, we can always check how many unique ids and tags have been uploaded by calling **/project/list_stats**.<br>
As we can see, we have successfully uploaded 41,895 ids, for a total of 446 tags.

In [80]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name" : project_name
}

url = f"{base_url}/project/list_stats"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200,
 'stats': {'limit': {'tags': 1000, 'samples': 50000},
  'hosting': {},
  'storage': {'samples_processed': 0,
   'samples_uploaded': 41895,
   'tags_processed': 0,
   'tag_uploaded': 446},
  'processing_id': '875f38c1-901e-48e2-bdd9-a28d684e4e9a'}}

### Process

We can start processing our data with the following endpoint.<br>
All our batches will be removed from the upload queue and sent to processing, which means that if we check our upload queue again, it will be empty.

In [81]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name" : project_name
}

url = f"{base_url}/data/process/start"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200, 'outcome': 'queue sent to processing'}

Just after processing has been started, you can check the status of the queue: initially, it will be shown as **"queued"**.<br>When processing begins, the status will change to **"in_progress"**.<br>
After processing has completed (this may take several minutes, depending on the overall demand), the status will change to **"successful"** or **"failed"**.<br>
Note that whenever you decide to process your data to create the encoding model, it will overwrite the available data.

In [86]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
	"project_name" : project_name
}

url = f"{base_url}/data/process/status"
response = requests.post(url, json=data, headers=headers)
response.json()

{'statuscode': 200, 'outcome': 'successful'}
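
Because processing can take several minutes, you may prefer to poll the status instead of re-running the cell by hand. The following is a minimal polling sketch built on the four status values described above. `wait_for_processing` is a hypothetical helper, the interval and timeout defaults are arbitrary choices, and the status source here is a stand-in iterator so the example runs without network access.

```python
import time

def wait_for_processing(get_status, interval=30.0, timeout=1800.0, sleep=time.sleep):
    """Call get_status() until it returns 'successful' or 'failed'."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("successful", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(interval)

# Stand-in status source replaying the lifecycle described above
statuses = iter(["queued", "in_progress", "successful"])
final = wait_for_processing(lambda: next(statuses), interval=0, sleep=lambda s: None)
print(final)  # successful
```

In the notebook, `get_status` could wrap the request above, for example `lambda: requests.post(url, json=data, headers=headers).json()['outcome']`.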

## Search

Our data is ready to be searched! We can perform a covariate search by providing the list of tags we wish to find and the number of desired results (k).<br>
You can choose k up to a limit of 100.

In [9]:
headers = {
	'x-api-key': API_key,
	'Content-Type': 'application/json'
}

data = {
	"client_id": client_id,
    "API_key": "",
	"project_name": project_name,
	"query_tag_list": ['Horror', 'Sci-fi'],
	"k": 5
}

url = f"{base_url}/search/covariate"
response = requests.post(url, json=data, headers=headers)
response.json()['hits']

[{'name': 'Shadowrain',
  '_tags': ['Indie',
   'Horror',
   'Dystopian',
   'Psychological Horror',
   'Sci-fi',
   'Survival Horror'],
  '_id': '18731'},
 {'name': 'WREST',
  '_tags': ['Adventure', 'VR', 'Horror', 'Sci-fi', 'Mystery'],
  '_id': '2075'},
 {'name': 'Quadrant',
  '_tags': ['Indie',
   'Horror',
   'Sci-fi',
   'Survival Horror',
   'Stealth',
   'Atmospheric'],
  '_id': '3141'},
 {'name': 'Smile',
  '_tags': ['Indie',
   'Strategy',
   'Horror',
   'First-Person',
   'Survival Horror',
   'Thriller',
   'Singleplayer',
   'Psychological Horror',
   'Sci-fi',
   'Futuristic'],
  '_id': '20942'},
 {'name': 'Hollow',
  '_tags': ['Horror',
   'Nudity',
   'Gore',
   'Action',
   'Adventure',
   'Violent',
   'Indie',
   'Survival Horror',
   'Sci-fi',
   'FPS'],
  '_id': '20685'}]
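
To inspect the results, it can help to list which of the query tags each hit actually carries. Note that `_id` comes back as a string in the output above, so the sketch converts it to an integer, assuming integer ids were uploaded as in this tutorial. `tag_overlap` is a hypothetical helper, and `hits` is a shortened copy of the response.

```python
def tag_overlap(hits, query_tags):
    """Return (id, name, matched query tags) for every hit."""
    query = set(query_tags)
    return [
        (int(hit["_id"]), hit["name"], sorted(query & set(hit["_tags"])))
        for hit in hits
    ]

# Shortened copy of the search response above
hits = [
    {"name": "WREST",
     "_tags": ["Adventure", "VR", "Horror", "Sci-fi", "Mystery"],
     "_id": "2075"},
]
print(tag_overlap(hits, ["Horror", "Sci-fi"]))
# [(2075, 'WREST', ['Horror', 'Sci-fi'])]
```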