## Bunny Studio
### Search Data Scientist Test Part II

Objective:

- Use the **SearchRecommender** class to find samples that match a search pattern (a string)


In [1]:
## Import the libraries used to read and manipulate data
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from samples_finder import *
import pickle
## Load the pre-defined search terms, one per line
with open("search_terms.txt","r") as f:
    samples_search=[line.strip() for line in f]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jhonp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\jhonp\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Creating an instance of SearchRecommender

This auxiliary class chains together every step involved in finding samples for a specific search pattern given by the client. To use the class, follow these steps:

1. Instantiate the class
2. Define a search pattern
3. Use the class's built-in method to suggest samples
4. (Optional) Use the built-in generator method to return fixed-size batches with no repeated pro ID within a batch
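
The batching idea in step 4 can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation of `SearchRecommender`: the helper `unique_pro_batches`, its `batch_size` parameter, and the toy data are hypothetical, and the results are assumed to be a pandas DataFrame with a unique index and a `pro_id` column, already sorted from best to worst match.

```python
import pandas as pd

def unique_pro_batches(results: pd.DataFrame, batch_size: int = 10):
    """Yield batches of at most `batch_size` rows with no repeated pro_id.

    Hypothetical helper: assumes `results` has a unique index and is
    already ordered from best to worst match.
    """
    remaining = results
    while not remaining.empty:
        # Keep the first (best-ranked) sample of each pro, then cap the batch size.
        batch = remaining.drop_duplicates(subset="pro_id").head(batch_size)
        yield batch
        # Rows that were not emitted stay available for later batches.
        remaining = remaining.drop(index=batch.index)

# Toy data for illustration only (not taken from the real dataset).
toy_results = pd.DataFrame({
    "sample_id": [1, 2, 3, 4, 5],
    "pro_id": ["A", "A", "B", "C", "B"],
})
toy_batches = list(unique_pro_batches(toy_results, batch_size=2))
```

With this approach a pro who owns several matching samples appears at most once per batch, and the remaining samples roll over into later batches.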


In [2]:
#1 Instantiate the class
finder=SearchRecommender()

In [3]:
#2 define search pattern
search_pattern="Child voice for videogame character"
#3 apply transformation steps and compute similarities
full_results=finder.get_suggestions(search=search_pattern)

[INFO] Similarities created


In [4]:
#4 results_by_batch returns two lists: the per-batch similarities and the batches of results.
# Each similarity indicates whether its batch matches the search pattern acceptably.
# The metric ranges from -1 to 1; the closer to 1, the better.
similarities,sample_batches=finder.results_by_batch(full_results)
print("[INFO] Average Similarity of result with the {}st batch: {:.2f}".format(1,similarities[0]))
sample_batches[0].head(10)

[INFO] Average Similarity of result with the 1st batch: 0.64


| Unnamed: 0 | sample_id | category | pro_id | attribute_value | tag_name | similarity | performance_score | bookings | expired_samples | samples_rejected_internally | speed_to_book | average_review | num_favorites | successful_bookings | successful_projects |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42157 | 68040 | audio | 1B363 | english american videogames child boy | | 0.639075 | 98.571911 | 88 | 1 | 0 | 12652.0 | 4.9 | 36 | 88 | 0 |
| 86535 | 168404 | audio | 45057 | english american videogames child boy | | 0.639075 | 94.760723 | 25 | 3 | 4 | 8149.5 | 5.0 | 5 | 25 | 58 |
| 74029 | 131483 | audio | 23C1 | english american videogames child boy | | 0.639075 | 84.596874 | 18 | 0 | 10 | 587.0 | 4.4 | 3 | 17 | 47 |
| 104893 | 379244 | audio | 64320 | english american videogames child boy | child neutral conversational | 0.657727 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 68062 | 119644 | audio | F4DE | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 35688 | 197699 | audio | 1FCE | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 39093 | 214916 | audio | 1ADE | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 42673 | 70401 | audio | 153B8 | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 43278 | 92948 | audio | EC2F | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 45491 | 81895 | audio | 1FC24 | english american videogames child boy | | 0.639075 | 70.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 |
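
A score bounded between -1 and 1 is characteristic of cosine similarity between embedding vectors. The notebook does not state which metric `SearchRecommender` uses internally, so the sketch below is an assumption meant only to show how such a score behaves. The function name and the toy vectors are illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # close to 1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])         # close to -1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])         # 0
```

Under this assumption, a score such as 0.64 means the sample's text points in broadly the same direction as the search pattern in embedding space without being an exact match.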


### Generating samples for every search pattern



In [5]:
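for search_ in samples_search:
    # Compute suggestions for this search term and split them into batches
    full_results=finder.get_suggestions(search=search_)
    similarities,sample_batches=finder.results_by_batch(full_results)
    # Build a folder name from the search term (lowercase, spaces replaced by underscores)
    search_out=search_.lower().replace(" ","_")

    out_path=f"recommendations/{search_out}"
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(out_path, exist_ok=True)
    # Persist the batches and their per-batch similarities
    with open(f'{out_path}/data_batches.pkl', 'wb') as f:
        pickle.dump(sample_batches, f)
    with open(f'{out_path}/similarites_per_batch.pkl', 'wb') as f:
        pickle.dump(similarities, f)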


[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
[INFO] Similarities created
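
The saved files can later be read back without recomputing the similarities. The sketch below assumes the folder layout and file names written by the loop above (including the `similarites_per_batch.pkl` spelling). The helpers `search_to_dirname` and `load_recommendations` are hypothetical names introduced for illustration, and the round trip runs against a temporary directory with placeholder data rather than real results.

```python
import os
import pickle
import tempfile

def search_to_dirname(search):
    """Mirror the folder naming used when saving: lowercase, spaces to underscores."""
    return search.lower().replace(" ", "_")

def load_recommendations(search, root="recommendations"):
    """Return (similarities, sample_batches) saved for a search term."""
    out_path = os.path.join(root, search_to_dirname(search))
    with open(os.path.join(out_path, "data_batches.pkl"), "rb") as f:
        sample_batches = pickle.load(f)
    # The file name keeps the spelling used by the saving cell.
    with open(os.path.join(out_path, "similarites_per_batch.pkl"), "rb") as f:
        similarities = pickle.load(f)
    return similarities, sample_batches

# Round trip with placeholder data in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    demo_dir = os.path.join(root, search_to_dirname("Child voice"))
    os.makedirs(demo_dir, exist_ok=True)
    with open(os.path.join(demo_dir, "data_batches.pkl"), "wb") as f:
        pickle.dump(["batch-0", "batch-1"], f)
    with open(os.path.join(demo_dir, "similarites_per_batch.pkl"), "wb") as f:
        pickle.dump([0.5, 0.25], f)
    loaded_similarities, loaded_batches = load_recommendations("Child voice", root=root)
```

Only unpickle files that you produced yourself, since `pickle.load` can execute arbitrary code from untrusted input.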


### Final Remarks

- SearchRecommender is a small solution to the problem of suggesting samples given a search pattern
- SearchRecommender can be improved by using a more robust performance score for pros and a dedicated tool, such as Haystack or Elasticsearch, to compute embeddings for attributes and tags