## Upload Twitter URLs as documents in Elasticsearch

**Purpose**: Index the Twitter URLs in Elasticsearch (ES) so that ES queries can find matches between Media Cloud URLs and Twitter URLs.

1. Load the URLs from `politicians_tweeted_urls_urlexpander.pkl`.
2. Add INCA-related fields for each URL.
3. Convert the DataFrame to JSON and write it to a `.jsonl` file in which each line is one URL document.
4. Use `inca.importers_exporters.import_json()` to add the Twitter URLs to Elasticsearch.

In [1]:
from inca import Inca
import pandas as pd
import os



In [2]:
from usrightmedia.shared.loggers import get_logger
LOGGER = get_logger(filename='01-upload-twitter-urls-to-inca', logger_type='main_file')

In [3]:
dir_url = os.path.join('..', '..', 'data', '02-intermediate', '02-twitter')

- load the URLs and inspect the columns

In [4]:
urls = pd.read_pickle(os.path.join(dir_url, 'politicians_tweeted_urls_urlexpander.pkl'))

In [5]:
urls.columns

Index(['tweet_id', 'created_at', 'created_week', 'created_month',
       'created_year', 'text', 'author_id', 'username', 'tweet_url', 'url_id',
       'url', 'expanded_url', 'display_url', 'unwound_url',
       'most_unrolled_url', 'most_unrolled_field', 'is_dupe', 'is_from_tw',
       'resolved_url', 'resolved_netloc', 'resolved_domain',
       'standardized_url', 'is_generic_url', 'urlexpander_error'],
      dtype='object')

- add fields:
    - a `PROJECT` field that labels the document as part of the same project
    - a `doctype` field
    - an `_id` field built from `username` and `url_id`

In [2]:
urls['PROJECT'] = "usmedia"
urls['doctype'] = "tweets2_url"
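# Build one ID per row; an f-string over whole columns would give every row the same string.
urls['_id'] = urls['username'].astype(str) + "_" + urls['url_id'].astype(str)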
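
A minimal sketch of the `_id` construction on a toy DataFrame, assuming the ID should combine `username` and `url_id` row by row. The toy values are made up for illustration. Documents indexed under the same `_id` generally replace one another in Elasticsearch, so a uniqueness check before uploading may be worthwhile.

```python
import pandas as pd

# Toy stand-in for the real `urls` DataFrame (values are invented for illustration).
toy = pd.DataFrame({
    "username": ["alice", "bob", "alice"],
    "url_id": [101, 102, 103],
})

# Element-wise string concatenation yields one ID per row.
toy["_id"] = toy["username"].astype(str) + "_" + toy["url_id"].astype(str)

# Check that no two rows share an ID before uploading.
ids_are_unique = toy["_id"].is_unique
print(toy["_id"].tolist(), ids_are_unique)
```

If the check fails on the real data, `urls[urls['_id'].duplicated(keep=False)]` lists the colliding rows.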

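- select the fields to import and create the mapping (the `created_week`, `created_month`, `created_year`, `is_dupe`, and `is_from_tw` columns are left out)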

In [8]:
mapping = {
    "PROJECT": "PROJECT",
    "doctype": "doctype",
    "_id": "_id",
    "tweet_id": "tweet_id",
    "created_at": "created_at",
    "text": "text",
    "author_id": "author_id",
    "username": "username",
    "tweet_url": "tweet_url",
    "url_id": "url_id",
    "url": "url",
    "expanded_url": "expanded_url",
    "display_url": "display_url",
    "unwound_url": "unwound_url",
    "most_unrolled_url": "most_unrolled_url",
    "most_unrolled_field": "most_unrolled_field",
    "resolved_url": "resolved_url",
    "resolved_netloc": "resolved_netloc",
    "resolved_domain": "resolved_domain",
    "standardized_url": "standardized_url",
    "is_generic_url": "is_generic_url",
    "urlexpander_error": "urlexpander_error",
}
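
A quick way to confirm which columns the mapping covers, sketched here with shortened, hypothetical versions of the column list and the mapping. Because every key maps to itself, the direction of the mapping does not matter for this check.

```python
# Shortened, hypothetical stand-ins for `urls.columns` and `mapping`.
columns = ["tweet_id", "created_at", "created_week", "url_id", "is_dupe"]
mapping = {"tweet_id": "tweet_id", "created_at": "created_at", "url_id": "url_id"}

# Mapping keys that do not exist as columns (likely typos).
missing = [key for key in mapping if key not in columns]

# Columns that the mapping leaves out.
unmapped = [col for col in columns if col not in mapping]

print("missing:", missing)
print("unmapped:", unmapped)
```

In the notebook, the same comprehensions could be run against `list(urls.columns)` and the full `mapping`.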

In [None]:
urls.head()

In [10]:
urls.to_json(os.path.join(dir_url, 'politicians_tweeted_urls_urlexpander.jsonl'), orient='records', lines=True)
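
To illustrate the JSON Lines layout that `to_json(..., orient='records', lines=True)` produces, here is a small round trip on a toy DataFrame written to a temporary directory. The file name and values are placeholders, not part of the project data.

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for the real `urls` DataFrame.
toy = pd.DataFrame({"username": ["alice", "bob"], "url_id": [101, 102]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.jsonl")
    # One JSON object per line, one line per row.
    toy.to_json(path, orient="records", lines=True)

    # Read the file back line by line to confirm each line parses on its own.
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

print(records)
```

Each parsed record corresponds to one row, which is the unit that becomes a single document on import.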

In [11]:
myinca = Inca()

In [12]:
myinca.importers_exporters.import_json(
    mapping=mapping,
    path=os.path.join(dir_url, "politicians_tweeted_urls_urlexpander.jsonl"),
    compression=False,
)