This notebook loads `wikis.tsv` into the `canonical_data.wikis` table in the Data Lake.

Note that you must have permission to `sudo` as `analytics-product`, the user that owns the `canonical_data` database. You will usually get this by being part of the [`analytics-product-users` production access group](https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L959).

In [3]:
import sys
from pathlib import Path

import numpy as np
import pandas as pd

import wmfdata as wmf

sys.path.insert(0, "..")
from utils import assert_tsv_loaded_correctly

In [4]:
# To avoid having to run this code as the `analytics-product` user, we first upload the data to
# a database which the "normal" non-sudo user running this notebook can write to. A personal database
# would work, but using the `tmp` database eliminates the need for each new user to pick a different
# database.
DATABASE = "tmp"

wmf.hive.load_csv(
    "wikis.tsv",
    field_spec="""
        database_code       STRING  COMMENT 'Same as wiki_db in MediaWiki History, e.g. enwiki',
        domain_name         STRING  COMMENT 'e.g. en.wikipedia.org',
        database_group      STRING  COMMENT 'e.g. wikipedia',
        language_code       STRING  COMMENT 'e.g. en',
        mobile_domain_name  STRING  COMMENT 'e.g. en.m.wikipedia.org',
        language_name       STRING  COMMENT 'e.g. English',
        status              STRING  COMMENT 'open/closed',
        visibility          STRING  COMMENT 'public/private',
        editability         STRING  COMMENT 'public/private',
        english_name        STRING  COMMENT 'e.g. English Wikipedia'
    """,
    db_name=DATABASE,
    table_name="wikis",
    sep="\t"
)
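
The load above depends on the columns in `wikis.tsv` appearing in the same order as in `field_spec`. The following is a minimal sketch of a check that could be run before loading, assuming the TSV has a header row and that `field_spec` keeps one column definition per line (as it does above). The helper names `field_spec_columns`, `tsv_header`, and `check_header` are hypothetical and are not part of `wmfdata` or `utils`.

```python
import csv


def field_spec_columns(field_spec):
    """Return the column names from a field spec with one definition per line."""
    columns = []
    for line in field_spec.strip().splitlines():
        line = line.strip()
        if line:
            # The column name is the first whitespace-separated token on the line.
            columns.append(line.split()[0])
    return columns


def tsv_header(path):
    """Return the first row of a tab-separated file as a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter="\t"))


def check_header(path, field_spec):
    """Raise ValueError if the TSV header differs from the field spec columns."""
    expected = field_spec_columns(field_spec)
    actual = tsv_header(path)
    if actual != expected:
        raise ValueError(
            f"Header mismatch: file has {actual}, field spec has {expected}"
        )
```

Splitting on lines rather than commas is deliberate, since a `COMMENT` string may itself contain commas.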



Now, since it's not possible to run `sudo` commands in our Jupyter environment, open a plain SSH connection and run the command printed by the following cell there.

In [5]:
print(
    "sudo -u analytics-product kerberos-run-command analytics-product "
    "hive -e 'DROP TABLE IF EXISTS canonical_data.wikis; "
    f"CREATE TABLE canonical_data.wikis AS SELECT * FROM {DATABASE}.wikis'"
)

sudo -u analytics-product kerberos-run-command analytics-product hive -e 'DROP TABLE IF EXISTS canonical_data.wikis; CREATE TABLE canonical_data.wikis AS SELECT * FROM tmp.wikis'


Once that has completed, verify that the newly loaded data matches the local copy.

In [6]:
assert_tsv_loaded_correctly("wikis.tsv", "canonical_data.wikis")
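
The implementation of `assert_tsv_loaded_correctly` lives in `utils` and is not shown in this notebook. As a rough illustration of what such a check involves, here is a sketch under stated assumptions: it compares the data rows of a local TSV against rows already fetched from the table, ignoring row order and treating every value as a string. The names `read_tsv_rows` and `assert_rows_match` are hypothetical, and how `remote_rows` is obtained from the Data Lake is left out.

```python
import csv
from collections import Counter


def read_tsv_rows(path):
    """Read a tab-separated file and return its data rows as tuples of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # Skip the header row, if there is one.
        return [tuple(row) for row in reader]


def assert_rows_match(local_rows, remote_rows):
    """Assert that both collections hold the same rows, ignoring order."""
    local = Counter(tuple(row) for row in local_rows)
    remote = Counter(tuple(row) for row in remote_rows)
    missing = local - remote  # Rows in the file but not in the table.
    extra = remote - local    # Rows in the table but not in the file.
    assert not missing and not extra, (
        f"{sum(missing.values())} row(s) missing from the table, "
        f"{sum(extra.values())} unexpected row(s) in the table"
    )
```

Using a `Counter` rather than a `set` means duplicated rows are also compared by count.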