# Fetching data using `SqlFetcher`
This notebook demonstrates translation backed by a SQL database. It assumes that the ***Prepare for `SqlFetcher` demo*** step from the [PickleFetcher](../pickle-translation/PickleFetcher.ipynb) demo notebook has been completed.

In [1]:
import sys
import rics
import id_translation

# Print relevant versions
print(f"{rics.__version__=}")
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(rics_level="DEBUG", id_translation_level="DEBUG")
!git log --pretty=oneline --abbrev-commit -1

rics.__version__='3.0.0'
id_translation.__version__='0.3.1.dev1'
sys.version='3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]'
👻 Configured some stuff just the way I like it!
d2093a4 (HEAD, origin/main, origin/HEAD, main) Update help link in TOML files


## Load database

In [2]:
import tomli

with open("config.toml", "rb") as f:
    connection_string = tomli.load(f)["fetching"]["SqlFetcher"]["connection_string"]
    connection_string = connection_string.format(password="your_password")
    print(f"{connection_string=}")

connection_string='postgresql+pg8000://postgres:your_password@localhost:5432/imdb'
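The hard-coded `your_password` is fine for a demo. A minimal sketch of filling the same template from an environment variable instead, assuming the `connection_string` entry in `config.toml` contains a `{password}` placeholder (inferred from the printed result above). The variable name `IMDB_DB_PASSWORD` and the helper `build_connection_string` are hypothetical names chosen for this sketch:

```python
import os
from urllib.parse import quote_plus

# Assumed shape of the template stored in config.toml (inferred from the output above).
TEMPLATE = "postgresql+pg8000://postgres:{password}@localhost:5432/imdb"


def build_connection_string(template: str, env_var: str = "IMDB_DB_PASSWORD") -> str:
    """Fill the {password} placeholder from an environment variable.

    IMDB_DB_PASSWORD is a hypothetical variable name, not something the
    notebook or the library defines.
    """
    password = os.environ.get(env_var)
    if password is None:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    # Characters such as '@' or '/' must be URL-encoded inside a database URL.
    return template.format(password=quote_plus(password))
```

Calling `build_connection_string(TEMPLATE)` raises a `RuntimeError` when the variable is unset, so a missing password fails early rather than producing a broken URL.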


In [3]:
import sqlalchemy
from data import load_imdb

engine = sqlalchemy.create_engine(connection_string)

# Write each IMDb dataset to its own table. Dots in the dataset name become
# underscores, e.g. "name.basics" -> "name_basics".
for source in ["name.basics", "title.basics"]:
    df = load_imdb(source)[0]
    df.to_sql(source.replace(".", "_"), engine, if_exists="replace")

2023-03-25T11:35:34.942 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2023-03-25T11:36:11.181 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/title.basics.tsv.gz'.
2023-03-25T11:36:11.182 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2023-03-25T11:36:11.183 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_

## Create translator from config
Click [here](config.toml) to see the file.

In [4]:
from id_translation import Translator

translator = Translator.from_config("config.toml")
translator

2023-03-25T11:36:21.816 [id_translation.fetching.config-toml.sql.discovery:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Metadata created in 0.0342428 sec.
2023-03-25T11:36:21.817 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.818 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['id']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.819 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. A

Translator(online=True: fetcher=SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']))

## Make some data to translate

In [5]:
import pandas as pd

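# Reuse the fetcher's engine. These are private attributes and may change between versions.
engine = translator._fetcher._engine


def first_title(seed=None, n=1000):
    """Sample `n` rows from name_basics and keep the first ID in knownForTitles."""
    df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
    df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
    return df[["nconst", "firstTitle"]]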
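The `firstTitle` column is simply the first element of the comma-separated `knownForTitles` string. A plain-Python sketch of that step, without pandas or a database; the IDs below are made up for illustration and `first_known_title` is a hypothetical helper:

```python
from typing import Optional


def first_known_title(known_for_titles: Optional[str]) -> Optional[str]:
    """Return the first ID of a comma-separated knownForTitles value."""
    if known_for_titles is None:
        return None  # nothing to split for missing values
    return known_for_titles.split(",")[0]


# Made-up IDs, for illustration only.
rows = {
    "nm0000001": "tt0000010,tt0000020",
    "nm0000002": "tt0000030",
    "nm0000003": None,
}
first_titles = {nconst: first_known_title(value) for nconst, value in rows.items()}
print(first_titles)
```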

In [6]:
translator.store().cache

2023-03-25T11:36:21.853 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.854 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['to', 'name', 'from']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.855 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 3 matches={'to': 'endYear', 'name': 'primaryTitle', 'from': 'startYear'} in the given values=['to', 'name', 'from'].


TranslationMap('name_basics': 172326 IDs, 'title_basics': 48979 IDs)
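The last log line above reports which table column each wanted placeholder was matched to. A rough plain-Python sketch of what such a mapping amounts to once it is known. This is an illustration only, not how `id_translation` is implemented; the helper name and the example row are made up:

```python
# Mapping copied from the log line above: wanted placeholder -> actual column.
PLACEHOLDER_TO_COLUMN = {"to": "endYear", "name": "primaryTitle", "from": "startYear"}


def select_placeholders(row: dict, mapping: dict) -> dict:
    """Pick the mapped columns from a row and key them by placeholder name."""
    return {placeholder: row[column] for placeholder, column in mapping.items()}


# Made-up row, for illustration only.
row = {
    "tconst": "tt0000001",
    "primaryTitle": "Example Title",
    "startYear": 2000,
    "endYear": 2001,
    "genres": "Drama",
}
print(select_placeholders(row, PLACEHOLDER_TO_COLUMN))
```

Columns that no placeholder maps to, such as `genres` here, are simply left out of the result.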

## Get the name and the "first" appearance for actors
Here, "first" means the first entry in the IMDb `knownForTitles` list. I have no idea how that list is ordered.

In [7]:
df = first_title(seed=5)
df.head()

Unnamed: 0,nconst,firstTitle
102407,nm0807090,tt0063794
105345,nm0831760,tt0019348
1893,nm0008280,tt0012541
107276,nm0845966,tt0066501
170646,nm8380982,tt5990574


## Translate

In [8]:
translator.translate(df).head(5)

2023-03-25T11:36:23.814 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:23.815 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:23.816 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:23.817 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:23.818 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sour

Unnamed: 0,nconst,firstTitle
102407,nm0807090:Zoya Smirnova-Nemirovich *1909†1986,tt0063794 not translated; default name=Title unknown
105345,nm0831760:Billy Stone *1884†1931,tt0019348 not translated; default name=Title unknown
1893,nm0008280:Achmed Abdullah *1881†1945,tt0012541 not translated; default name=Title unknown
107276,nm0845966:André Tabet *1902†1981,tt0066501 not translated; default name=Title unknown
170646,nm8380982:Lee Botts *1928†2019,tt5990574 not translated; default name=Title unknown
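The translated strings above follow two visible patterns: one for IDs that were found and a fallback for IDs that were not. A plain-Python sketch that reproduces both patterns, assuming format strings inferred from this output; the real formats live in `config.toml` and may differ, and this is not how `id_translation` works internally:

```python
# Formats inferred from the output above (assumption, not read from config.toml).
FMT = "{id}:{name} *{from}†{to}"
DEFAULT_FMT = "{id} not translated; default name={name}"
DEFAULT_NAME = "Title unknown"


def render(id_: str, records: dict) -> str:
    """Format a known ID with its record, or fall back to the default format."""
    record = records.get(id_)
    if record is None:
        return DEFAULT_FMT.format(id=id_, name=DEFAULT_NAME)
    # format_map is used because "from" is a Python keyword and cannot be
    # passed as a literal keyword argument.
    return FMT.format_map({"id": id_, **record})


# One record copied from the output above.
people = {"nm0807090": {"name": "Zoya Smirnova-Nemirovich", "from": 1909, "to": 1986}}
print(render("nm0807090", people))
print(render("tt0063794", {}))
```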


In [9]:
translator.translate(df, inplace=True)  # modifies df in place and returns None
df.head(5)

2023-03-25T11:36:24.024 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:24.026 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:24.027 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:24.029 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:24.030 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sour

Unnamed: 0,nconst,firstTitle
102407,nm0807090:Zoya Smirnova-Nemirovich *1909†1986,tt0063794 not translated; default name=Title unknown
105345,nm0831760:Billy Stone *1884†1931,tt0019348 not translated; default name=Title unknown
1893,nm0008280:Achmed Abdullah *1881†1945,tt0012541 not translated; default name=Title unknown
107276,nm0845966:André Tabet *1902†1981,tt0066501 not translated; default name=Title unknown
170646,nm8380982:Lee Botts *1928†2019,tt5990574 not translated; default name=Title unknown
