# Preparing the KEN datasets

In this notebook I look at the raw versions of the KEN datasets, then follow 
the pre-processing steps used for the KEN baseline. Note that I am using 
YAGO 2022, which might introduce discrepancies between these datasets and the 
original versions. 

Datasets:
- [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)
- [7+ Million Company Dataset](https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset)
- [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents)

In [63]:
cd ~/work/prepare-data-lakes

/home/soda/rcappuzz/work/prepare-data-lakes


In [64]:
import polars as pl
import pandas as pd
from pathlib import Path
import src.yago.utils as utils
import numpy as np

In [3]:
data_dir = Path("data/ken_datasets")

Loading the YAGO fact triplets, which are used below to drop entities not found in the knowledge base (KB).

In [4]:
yago_path = Path("/storage/store3/work/jstojano/yago3/")
facts_path = Path(yago_path, "facts_parquet/yago_updated_2022_part2")
fname = "yagoFacts"
yagofacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_categorical = utils.import_from_yago(yagofacts_path, engine="polars")
fname = "yagoLiteralFacts"
yagoliteralfacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_numerical = utils.import_from_yago(yagoliteralfacts_path, engine="polars")
fname = "yagoDateFacts"
yagodatefacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_dates = utils.import_from_yago(yagodatefacts_path, engine="polars")

yagofacts = pl.concat(
    [
        yagofacts_categorical,
        yagofacts_numerical,
        yagofacts_dates
    ]
)

# US Accidents dataset

The archive `us-accidents.zip` contains the file `US_Accidents_Dec21_updated.csv`, 
which I renamed manually to `us-accidents.csv` for simplicity. 

I also had to copy the file `datasets/us_accidents/state_codes.csv` from the 
KEN repository for some of the steps. 

In [52]:
dataset_dir = Path(data_dir, "us-accidents")

In [53]:
df = pl.read_csv(Path(dataset_dir, "us-accidents.csv"))
df = df.rename({"State": "Code"})

In [54]:
state_codes_path =  Path(dataset_dir,"state_codes.csv")
state_codes = pl.read_csv(state_codes_path)
df = df.join(
    state_codes, on="Code"
)

Adding a new column, `col_to_embed`, that combines the city and state names in the
format used for YAGO subjects (for example `<Akron,_Ohio>`). 

In [55]:
df = df.with_columns(
    ("<" + pl.col("City") + ",_"+ pl.col("State") + ">").alias("col_to_embed")
)

Filtering out the rows not found in `yagofacts["subject"]`.

In [56]:
df_filtered=df.lazy().filter(
    pl.col("col_to_embed").is_in(
        yagofacts["subject"]
    )
).collect()

Completing the preparation by counting the accidents in each city, then applying log10 to the count. This is 
the target used in KEN. 


In [57]:
df_counts = df_filtered.groupby(
    [
        "col_to_embed", "City", "Code"
    ]
    ).count()

In [58]:
df_final = df_counts.with_columns(
    (df_counts["City"] + ", " + df_counts["Code"]).alias("raw_entities") 
).select(
    [
        pl.col("raw_entities"),
        pl.col("col_to_embed"),
        pl.col("count").alias("target").log10()
    ]
).sort("raw_entities")
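
The target computation can also be sketched without polars. The snippet below is a minimal illustration on made-up rows (the entities and counts are hypothetical, not taken from the dataset): it counts the accidents per entity and takes the base-10 logarithm, so an entity with a single accident gets a target of 0.

```python
import math
from collections import Counter

# Hypothetical toy data: one YAGO-style entity string per accident record.
accidents = ["<Dayton,_Ohio>", "<Dayton,_Ohio>", "<Akron,_Ohio>"] + ["<Lima,_Ohio>"] * 10

# Count the accidents per entity, then apply log10 to obtain the target.
counts = Counter(accidents)
targets = {entity: math.log10(n) for entity, n in counts.items()}

print(targets["<Akron,_Ohio>"])  # 0.0 (a single accident)
print(targets["<Lima,_Ohio>"])   # 1.0 (ten accidents)
```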

In [61]:
df_filtered

ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,Number,Street,Side,City,County,Code,Zipcode,Country,Timezone,Airport_Code,Weather_Timestamp,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight,State,Abbrev,col_to_embed
str,i64,str,str,f64,f64,f64,f64,f64,str,f64,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,f64,str,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,str,str,str,str,str,str,str
"""A-1""",3,"""2016-02-08 00:…","""2016-02-08 06:…",40.10891,-83.09286,40.11206,-83.03187,3.23,"""Between Sawmil…",,"""Outerbelt E""","""R""","""Dublin""","""Franklin""","""OH""","""43017""","""US""","""US/Eastern""","""KOSU""","""2016-02-08 00:…",42.1,36.1,58.0,29.76,10.0,"""SW""",10.4,0.0,"""Light Rain""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Night""","""Night""","""Night""","""Night""","""Ohio""","""Ohio""","""<Dublin,_Ohio>…"
"""A-2""",2,"""2016-02-08 05:…","""2016-02-08 11:…",39.86542,-84.0628,39.86501,-84.04873,0.747,"""At OH-4/OH-235…",,"""I-70 E""","""R""","""Dayton""","""Montgomery""","""OH""","""45424""","""US""","""US/Eastern""","""KFFO""","""2016-02-08 05:…",36.9,,91.0,29.68,10.0,"""Calm""",,0.02,"""Light Rain""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Night""","""Night""","""Night""","""Night""","""Ohio""","""Ohio""","""<Dayton,_Ohio>…"
"""A-4""",2,"""2016-02-08 06:…","""2016-02-08 12:…",41.06213,-81.53784,41.06217,-81.53547,0.123,"""At Dart Ave/Ex…",,"""I-77 N""","""R""","""Akron""","""Summit""","""OH""","""44311""","""US""","""US/Eastern""","""KAKR""","""2016-02-08 06:…",39.0,,55.0,29.65,10.0,"""Calm""",,,"""Overcast""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Night""","""Night""","""Day""","""Day""","""Ohio""","""Ohio""","""<Akron,_Ohio>"""
"""A-6""",2,"""2016-02-08 08:…","""2016-02-08 14:…",39.06324,-84.03243,39.06731,-84.05851,1.427,"""At Dela Palma …",,"""State Route 32…","""R""","""Williamsburg""","""Clermont""","""OH""","""45176""","""US""","""US/Eastern""","""KI69""","""2016-02-08 08:…",35.6,29.2,100.0,29.66,10.0,"""WSW""",8.1,,"""Overcast""",false,false,false,false,false,false,false,false,false,false,false,true,false,"""Day""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Williamsburg,…"
"""A-7""",2,"""2016-02-08 08:…","""2016-02-08 14:…",39.77565,-84.18603,39.77275,-84.18805,0.227,"""At OH-4/Exit 5…",,"""I-75 S""","""R""","""Dayton""","""Montgomery""","""OH""","""45404""","""US""","""US/Eastern""","""KFFO""","""2016-02-08 08:…",33.8,,100.0,29.63,3.0,"""SW""",2.3,,"""Mostly Cloudy""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Day""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Dayton,_Ohio>…"
"""A-9""",2,"""2016-02-08 14:…","""2016-02-08 20:…",40.702247,-84.075887,40.69911,-84.084293,0.491,"""At OH-65/Exit …",,"""E Hanthorn Rd""","""R""","""Lima""","""Allen""","""OH""","""45806""","""US""","""US/Eastern""","""KAOH""","""2016-02-08 13:…",39.0,31.8,70.0,29.59,10.0,"""WNW""",11.5,,"""Overcast""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Day""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Lima,_Ohio>"""
"""A-10""",2,"""2016-02-08 15:…","""2016-02-08 21:…",40.10931,-82.96849,40.11078,-82.984,0.826,"""At I-71/Exit 2…",,"""Outerbelt W""","""R""","""Westerville""","""Franklin""","""OH""","""43081""","""US""","""US/Eastern""","""KCMH""","""2016-02-08 15:…",32.0,28.7,100.0,29.59,0.5,"""West""",3.5,0.05,"""Snow""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Day""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Westerville,_…"
"""A-14""",2,"""2016-02-08 17:…","""2016-02-08 23:…",39.582242,-83.677814,39.603013,-83.637319,2.59,"""Between OH-72/…",,"""I-71 N""","""R""","""Jamestown""","""Greene""","""OH""","""45335""","""US""","""US/Eastern""","""KSGH""","""2016-02-08 17:…",33.8,28.6,93.0,29.64,1.0,"""West""",5.8,0.01,"""Light Snow""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Day""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Jamestown,_Oh…"
"""A-15""",3,"""2016-02-08 18:…","""2016-02-09 00:…",40.151785,-81.312635,40.151747,-81.312682,0.004,"""At Shipley Rd …",48999.0,""" Titus Rd""","""R""","""Freeport""","""Guernsey""","""OH""","""43973""","""US""","""US/Eastern""","""KPHD""","""2016-02-08 18:…",33.1,,92.0,29.62,10.0,"""Calm""",,,"""Mostly Cloudy""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Night""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Freeport,_Ohi…"
"""A-16""",3,"""2016-02-08 18:…","""2016-02-09 00:…",40.151747,-81.312682,40.151785,-81.312635,0.004,"""At Titus Rd - …",22549.0,""" Cadiz Rd""","""L""","""Freeport""","""Harrison""","""OH""","""43973-8626""","""US""","""US/Eastern""","""KPHD""","""2016-02-08 18:…",33.1,,92.0,29.62,10.0,"""Calm""",,,"""Mostly Cloudy""",false,false,false,false,false,false,false,false,false,false,false,false,false,"""Night""","""Day""","""Day""","""Day""","""Ohio""","""Ohio""","""<Freeport,_Ohi…"


In [60]:
df_counts

col_to_embed,City,Code,count
str,str,str,u32
"""<Birnamwood,_W…","""Birnamwood""","""WI""",7
"""<Romeoville,_I…","""Romeoville""","""IL""",29
"""<Sumner,_Washi…","""Sumner""","""WA""",222
"""<Lacon,_Illino…","""Lacon""","""IL""",2
"""<Socorro,_Texa…","""Socorro""","""TX""",23
"""<Fordyce,_Arka…","""Fordyce""","""AR""",17
"""<Manor,_Pennsy…","""Manor""","""PA""",4
"""<Copperhill,_T…","""Copperhill""","""TN""",2
"""<Templeton,_Ma…","""Templeton""","""MA""",10
"""<Arnold,_Nebra…","""Arnold""","""NE""",1


In [24]:
df_final.write_parquet(Path(dataset_dir, "us-accidents-target.parquet"))

Reading original to see if the two versions look similar. 

In [25]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/us_accidents/counts.parquet")
df_counts_og = pl.read_parquet(path_original)
df_counts_og

raw_entities,col_to_embed,target
str,str,f64
"""Aaronsburg, PA…","""<Aaronsburg,_P…",0.30103
"""Abbeville, LA""","""<Abbeville,_Lo…",0.0
"""Abbotsford, WI…","""<Abbotsford,_W…",0.954243
"""Abbottstown, P…","""<Abbottstown,_…",1.041393
"""Aberdeen, MD""","""<Aberdeen,_Mar…",2.396199
"""Aberdeen, MS""","""<Aberdeen,_Mis…",0.477121
"""Aberdeen, OH""","""<Aberdeen,_Ohi…",0.0
"""Aberdeen, WA""","""<Aberdeen,_Was…",1.342423
"""Abernathy, TX""","""<Abernathy,_Te…",0.0
"""Abilene, TX""","""<Abilene,_Texa…",1.0


As expected, the number of entities in the original file is smaller. 

# Company Employees Dataset

In [14]:
dataset_dir = Path(data_dir, "company-employees")

In [15]:
df = pl.read_csv(Path(dataset_dir, "companies_sorted.csv"))
df_selected = df.filter(
    pl.col("current employee estimate") >= 1000
)

In [16]:
df_selected

Unnamed: 0_level_0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
i64,str,str,f64,str,str,str,str,str,i64,i64
5872184,"""ibm""","""ibm.com""",1911.0,"""information te…","""10001+""","""new york, new …","""united states""","""linkedin.com/c…",274047,716906
4425416,"""tata consultan…","""tcs.com""",1968.0,"""information te…","""10001+""","""bombay, mahara…","""india""","""linkedin.com/c…",190771,341369
21074,"""accenture""","""accenture.com""",1989.0,"""information te…","""10001+""","""dublin, dublin…","""ireland""","""linkedin.com/c…",190689,455768
2309813,"""us army""","""goarmy.com""",1800.0,"""military""","""10001+""","""alexandria, vi…","""united states""","""linkedin.com/c…",162163,445958
1558607,"""ey""","""ey.com""",1989.0,"""accounting""","""10001+""","""london, greate…","""united kingdom…","""linkedin.com/c…",158363,428960
3844889,"""hewlett-packar…","""hpe.com""",1939.0,"""information te…","""10001+""","""palo alto, cal…","""united states""","""linkedin.com/c…",127952,412952
2959148,"""cognizant tech…","""cognizant.com""",1994.0,"""information te…","""10001+""","""teaneck, new j…","""united states""","""linkedin.com/c…",122031,210020
5944912,"""walmart""","""walmartcareers…",1962.0,"""retail""","""10001+""","""withee, wiscon…","""united states""","""linkedin.com/c…",120753,272827
3727010,"""microsoft""","""microsoft.com""",1975.0,"""computer softw…","""10001+""","""redmond, washi…","""united states""","""linkedin.com/c…",116196,276983
3300741,"""at&t""","""att.com""",1876.0,"""telecommunicat…","""10001+""","""dallas, texas,…","""united states""","""linkedin.com/c…",115188,269659


I am using a slightly different heuristic from the one Alexis was using. 

Adding a new column to `yagofacts` with lowercased subjects. 

In [17]:
yagofacts = yagofacts.with_columns(
    pl.col("subject").str.to_lowercase().alias("subject_formatted")
)

In [18]:
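df_filtered=df_selected.lazy().with_columns(
    # NOTE: `str.replace` substitutes only the first space; names with several
    # spaces would need `str.replace_all` to be converted completely.
    ("<" + pl.col("name").str.to_lowercase().str.replace(" ", "_") + ">").alias("formatted_name")
).filter(
    pl.col("formatted_name").is_in(yagofacts["subject_formatted"])
).collect()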
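
One detail worth checking in the cell above: in polars, `str.replace` substitutes only the first match of the pattern, while `str.replace_all` substitutes every match, so names containing more than one space keep their later spaces. The plain-Python sketch below mimics both behaviours through the `count` argument of the built-in `str.replace`. The helper names and the three-word company name are hypothetical.

```python
def format_first_space(name: str) -> str:
    """Mimic polars `str.replace`: only the first space becomes an underscore."""
    return "<" + name.lower().replace(" ", "_", 1) + ">"


def format_all_spaces(name: str) -> str:
    """Mimic polars `str.replace_all`: every space becomes an underscore."""
    return "<" + name.lower().replace(" ", "_") + ">"


print(format_first_space("wells fargo"))           # <wells_fargo>
print(format_first_space("acme global services"))  # <acme_global services>
print(format_all_spaces("acme global services"))   # <acme_global_services>
```

Two-word names are unaffected, which may be why the issue is easy to miss when looking at the first rows of the output.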

Here I am preparing a mapping between each name in the original dataset and its match in YAGO.
Note that recall is relatively low, though higher than in the original version. 

In [19]:
mapping_name_subject = df_filtered.lazy().join(
    yagofacts.lazy(),
    left_on="formatted_name",
    right_on="subject_formatted"
).select(
    [
        pl.col("name"),
        pl.col("formatted_name"),
        pl.col("subject")
    ]
).unique().collect()
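
The idea behind this mapping can be sketched with a dictionary keyed on the lowercased subject. This is only an illustration under stated assumptions: the subjects and company names below are hypothetical stand-ins for `yagofacts["subject"]` and the `name` column.

```python
# Hypothetical YAGO subjects, with their original capitalization.
yago_subjects = ["<Accenture>", "<Wells_Fargo>", "<Nokia>"]

# Lowercased subject -> original subject (the role played by `subject_formatted`).
subject_by_lowercase = {subject.lower(): subject for subject in yago_subjects}

# Hypothetical lowercased company names, as found in the raw dataset.
company_names = ["accenture", "wells fargo", "unknown co"]

mapping = {}
for name in company_names:
    formatted = "<" + name.replace(" ", "_") + ">"
    if formatted in subject_by_lowercase:
        # Keep the YAGO spelling so that the entity can be looked up later.
        mapping[name] = subject_by_lowercase[formatted]

print(mapping)  # {'accenture': '<Accenture>', 'wells fargo': '<Wells_Fargo>'}
```

Unlike the join, a dictionary keeps a single subject per lowercased key, so two subjects that differ only by capitalization would collapse into one entry here.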


Joining with the mapping on `formatted_name` guarantees that `col_to_embed` uses the same format (and 
capitalization) as YAGO. 

In [20]:
df_final = df_filtered.join(
    mapping_name_subject, on="formatted_name"
).select(
    [
        pl.col("name").alias("raw_entities"),
        pl.col("subject").alias("col_to_embed"),
        pl.col("current employee estimate").alias("target").log10()
    ]
)
df_final


In [22]:
df_final.write_parquet(Path(dataset_dir, "company-employees-target.parquet"))

In [29]:
df_filtered.columns

['',
 'name',
 'domain',
 'year founded',
 'industry',
 'size range',
 'locality',
 'country',
 'linkedin url',
 'current employee estimate',
 'total employee estimate',
 'formatted_name']

In [33]:
df_filtered

Unnamed: 0_level_0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate,formatted_name
i64,str,str,f64,str,str,str,str,str,i64,i64,str
5872184,"""ibm""","""ibm.com""",1911.0,"""information te…","""10001+""","""new york, new …","""united states""","""linkedin.com/c…",274047,716906,"""<ibm>"""
21074,"""accenture""","""accenture.com""",1989.0,"""information te…","""10001+""","""dublin, dublin…","""ireland""","""linkedin.com/c…",190689,455768,"""<accenture>"""
3844889,"""hewlett-packar…","""hpe.com""",1939.0,"""information te…","""10001+""","""palo alto, cal…","""united states""","""linkedin.com/c…",127952,412952,"""<hewlett-packa…"
5944912,"""walmart""","""walmartcareers…",1962.0,"""retail""","""10001+""","""withee, wiscon…","""united states""","""linkedin.com/c…",120753,272827,"""<walmart>"""
3727010,"""microsoft""","""microsoft.com""",1975.0,"""computer softw…","""10001+""","""redmond, washi…","""united states""","""linkedin.com/c…",116196,276983,"""<microsoft>"""
3300741,"""at&t""","""att.com""",1876.0,"""telecommunicat…","""10001+""","""dallas, texas,…","""united states""","""linkedin.com/c…",115188,269659,"""<at&t>"""
3972223,"""wells fargo""","""wellsfargo.com…",,"""financial serv…","""10001+""","""san francisco,…","""united states""","""linkedin.com/c…",109532,264101,"""<wells_fargo>"""
1454663,"""infosys""","""infosys.com""",1981.0,"""information te…","""10001+""","""bangalore, kar…","""india""","""linkedin.com/c…",104752,215718,"""<infosys>"""
3221953,"""deloitte""","""deloitte.com""",1900.0,"""management con…","""10001+""","""new york, new …","""united states""","""linkedin.com/c…",104112,329145,"""<deloitte>"""
1685535,"""nokia""","""nokia.com""",1865.0,"""telecommunicat…","""10001+""","""espoo, uusimaa…","""finland""","""linkedin.com/c…",84327,251795,"""<nokia>"""


In [35]:
df_prepared = df_filtered.lazy().join(
    df_final.lazy(),
    left_on="name",
    right_on="raw_entities",
    how="inner"
).drop(
    "",
    "formatted_name",
    "current employee estimate",
    "total employee estimate",
).collect()

In [37]:
df_prepared.write_parquet(Path(dataset_dir, "company-employees-prepared.parquet"))

In [36]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/company_employees/target.parquet")
df_target_og = pl.read_parquet(path_original)
df_target_og

raw_entities,col_to_embed,target
str,str,f64
"""accenture""","""<Accenture>""",5.280326
"""walmart""","""<Walmart>""",5.081898
"""microsoft""","""<Microsoft>""",5.065191
"""infosys""","""<Infosys>""",5.020162
"""deloitte""","""<Deloitte>""",5.017501
"""nokia""","""<Nokia>""",4.925967
"""capgemini""","""<Capgemini>""",4.925204
"""google""","""<Google>""",4.875692
"""ericsson""","""<Ericsson>""",4.830537
"""boeing""","""<Boeing>""",4.827763


Here I am checking which values in `col_to_embed` are not found in `yagofacts["subject"]`. These companies are missing 
because the company name in the original dataset differs from the one in YAGO. 

In [34]:
df_target_og.filter(
    ~pl.col("col_to_embed").is_in(yagofacts["subject"])
)

raw_entities,col_to_embed,target
str,str,f64
"""raytheon""","""<Raytheon>""",4.375353
"""thales""","""<Thales>""",4.321868
"""herbalife""","""<Herbalife>""",4.289433
"""flextronics""","""<Flextronics>""",4.279644
"""adecco""","""<Adecco>""",4.160799
"""altran""","""<Altran>""",4.090187
"""statoil""","""<Statoil>""",4.081563
"""symantec""","""<Symantec>""",4.034628
"""syntel""","""<Syntel>""",3.994229
"""arup""","""<Arup>""",3.98304


# The Movies Dataset

In [65]:
dataset_dir = Path(data_dir, "the-movies-dataset")
df = pl.read_csv(Path(dataset_dir, "movies_metadata.csv"), infer_schema_length=0)


The target variable is the revenue, so movies with revenue 0 are dropped. I also truncate the release date to the year, which is needed to build the title mappings.

In [66]:
# Filtering
df_filtered = df.filter(
    pl.col("revenue").cast(int) > 0
).with_columns(
    (pl.col("release_date").str.slice(0, 4)).alias("release_date")
)

`title` and `release_date` together should approximate a unique key quite well; here I check whether this is the case. 
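
Before running the check on the real data, here is a plain-Python sketch of the same idea on hypothetical rows: count each `(title, release year)` pair and keep the pairs that occur more than once.

```python
from collections import Counter

# Hypothetical (title, release year) pairs; one pair is repeated on purpose.
movies = [
    ("Heat", "1995"),
    ("Heat", "1986"),
    ("Black Gold", "2011"),
    ("Black Gold", "2011"),
]

# A pair is a valid unique key only if it never occurs more than once.
key_counts = Counter(movies)
duplicates = {key: n for key, n in key_counts.items() if n > 1}

print(duplicates)  # {('Black Gold', '2011'): 2}
```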

In [67]:
df_filtered.groupby(["title", "release_date"]).count().sort("count",descending=True)

title,release_date,count
str,str,u32
"""Clockstoppers""","""2002""",2
"""Pokémon 4Ever:…","""2001""",2
"""A Farewell to …","""1932""",2
"""Le Samouraï""","""1967""",2
"""Black Gold""","""2011""",2
"""Force Majeure""","""2014""",2
"""The Congress""","""2013""",2
"""Camille Claude…","""2013""",2
"""Pokémon: Spell…","""2000""",2
"""Confessions of…","""2002""",2


In [68]:
df_filtered

adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""False""","""{'id': 10194, …","""30000000""","""[{'id': 16, 'n…","""http://toystor…","""862""","""tt0114709""","""en""","""Toy Story""","""Led by Woody, …","""21.946943""","""/rhIRbceoE9lR4…","""[{'name': 'Pix…","""[{'iso_3166_1'…","""1995""","""373554033""","""81.0""","""[{'iso_639_1':…","""Released""",,"""Toy Story""","""False""","""7.7""","""5415"""
"""False""",,"""65000000""","""[{'id': 12, 'n…",,"""8844""","""tt0113497""","""en""","""Jumanji""","""When siblings …","""17.015539""","""/vzmL6fP7aPKNK…","""[{'name': 'Tri…","""[{'iso_3166_1'…","""1995""","""262797249""","""104.0""","""[{'iso_639_1':…","""Released""","""Roll the dice …","""Jumanji""","""False""","""6.9""","""2413"""
"""False""",,"""16000000""","""[{'id': 35, 'n…",,"""31357""","""tt0114885""","""en""","""Waiting to Exh…","""Cheated on, mi…","""3.859495""","""/16XOMpEaLWkrc…","""[{'name': 'Twe…","""[{'iso_3166_1'…","""1995""","""81452156""","""127.0""","""[{'iso_639_1':…","""Released""","""Friends are th…","""Waiting to Exh…","""False""","""6.1""","""34"""
"""False""","""{'id': 96871, …","""0""","""[{'id': 35, 'n…",,"""11862""","""tt0113041""","""en""","""Father of the …","""Just when Geor…","""8.387519""","""/e64sOI48hQXyr…","""[{'name': 'San…","""[{'iso_3166_1'…","""1995""","""76578911""","""106.0""","""[{'iso_639_1':…","""Released""","""Just When His …","""Father of the …","""False""","""5.7""","""173"""
"""False""",,"""60000000""","""[{'id': 28, 'n…",,"""949""","""tt0113277""","""en""","""Heat""","""Obsessive mast…","""17.924927""","""/zMyfPUelumio3…","""[{'name': 'Reg…","""[{'iso_3166_1'…","""1995""","""187436818""","""170.0""","""[{'iso_639_1':…","""Released""","""A Los Angeles …","""Heat""","""False""","""7.7""","""1886"""
"""False""",,"""35000000""","""[{'id': 28, 'n…",,"""9091""","""tt0114576""","""en""","""Sudden Death""","""International …","""5.23158""","""/eoWvKD60lT95S…","""[{'name': 'Uni…","""[{'iso_3166_1'…","""1995""","""64350171""","""106.0""","""[{'iso_639_1':…","""Released""","""Terror goes in…","""Sudden Death""","""False""","""5.5""","""174"""
"""False""","""{'id': 645, 'n…","""58000000""","""[{'id': 12, 'n…","""http://www.mgm…","""710""","""tt0113189""","""en""","""GoldenEye""","""James Bond mus…","""14.686036""","""/5c0ovjT41KnYI…","""[{'name': 'Uni…","""[{'iso_3166_1'…","""1995""","""352194034""","""130.0""","""[{'iso_639_1':…","""Released""","""No limits. No …","""GoldenEye""","""False""","""6.6""","""1194"""
"""False""",,"""62000000""","""[{'id': 35, 'n…",,"""9087""","""tt0112346""","""en""","""The American P…","""Widowed U.S. p…","""6.318445""","""/lymPNGLZgPHuq…","""[{'name': 'Col…","""[{'iso_3166_1'…","""1995""","""107879496""","""106.0""","""[{'iso_639_1':…","""Released""","""Why can't the …","""The American P…","""False""","""6.5""","""199"""
"""False""","""{'id': 117693,…","""0""","""[{'id': 10751,…",,"""21032""","""tt0112453""","""en""","""Balto""","""An outcast hal…","""12.140733""","""/gV5PCAVCPNxlO…","""[{'name': 'Uni…","""[{'iso_3166_1'…","""1995""","""11348324""","""78.0""","""[{'iso_639_1':…","""Released""","""Part Dog. Part…","""Balto""","""False""","""7.1""","""423"""
"""False""",,"""44000000""","""[{'id': 36, 'n…",,"""10858""","""tt0113987""","""en""","""Nixon""","""An all-star ca…","""5.092""","""/cICkmCEiXRhvZ…","""[{'name': 'Hol…","""[{'iso_3166_1'…","""1995""","""13681765""","""192.0""","""[{'iso_639_1':…","""Released""","""Triumphant in …","""Nixon""","""False""","""7.1""","""72"""


In [69]:
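# Prepare 3 different title formats, to try and cover as many cases as possible.
# NOTE: `str.replace` substitutes only the first space; titles with several
# spaces would need `str.replace_all` to be converted completely.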
mapping_to_yago = df_filtered.select([pl.col("title"), pl.col("release_date"), pl.col("revenue")])

mapping_to_yago=mapping_to_yago.with_columns(
    [
        ("<" + pl.col("title").str.replace(" ", "_") + ">").alias("title_format_1"),
        ("<" + pl.col("title").str.replace(" ", "_") + "_(film)>").alias("title_format_2"),
        ("<" + pl.col("title").str.replace(" ", "_") +  "_(" + pl.col("release_date") + "_film)>").alias("title_format_3"),
        pl.Series(list(range(len(mapping_to_yago)))).alias("index")
    ]
)
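
The three candidate formats can be written as a small helper. This is a sketch rather than the code used above: the function name is hypothetical, and Python's built-in `str.replace` replaces every space, whereas polars `str.replace` replaces only the first match.

```python
def yago_title_candidates(title: str, year: str) -> list[str]:
    """Return the three candidate YAGO subjects for a movie, least specific first."""
    base = title.replace(" ", "_")
    return [
        f"<{base}>",                # title_format_1
        f"<{base}_(film)>",         # title_format_2
        f"<{base}_({year}_film)>",  # title_format_3
    ]


print(yago_title_candidates("Sudden Death", "1995"))
# ['<Sudden_Death>', '<Sudden_Death_(film)>', '<Sudden_Death_(1995_film)>']
```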

Here I am looking for movies that are present in YAGO according to one of the three formats defined above. 

I also reformat the output to reflect the "target dataset" schema. 

In [10]:
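matched_indices = []  # indices of the movies already matched by a previous format
selected = []
# Try the most specific format first, then fall back to the more generic ones.
for jj in [3, 2, 1]:
    new_indices = mapping_to_yago.filter(
        (pl.col(f"title_format_{jj}").is_in(yagofacts["subject"])) & 
        (~pl.col("index").is_in(matched_indices))
    )["index"].to_list()
    # Accumulate instead of overwriting, so that a movie matched by an earlier
    # format is not matched again by a later one.
    matched_indices.extend(new_indices)
    
    newdf = mapping_to_yago.filter(
        pl.col("index").is_in(new_indices)
    ).select(
        [
            pl.col("title"),
            pl.col("release_date"),
            pl.col(f"title_format_{jj}").alias("col_to_embed"),
            # `revenue` was read as a string, so cast it before taking the log.
            pl.col("revenue").cast(pl.Float64).log10(),
        ]
    )
    selected.append(newdf)

df_final=pl.concat(selected)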

# df_final.write_parquet(Path(dataset_dir, "movie-revenues-target.parquet"))
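
The intended fallback logic (try the most specific format first, and stop at the first match for each movie) can be sketched row by row in plain Python. The function name and the subjects below are hypothetical; the point is only that each movie ends up with at most one matched entity.

```python
def match_by_priority(candidates_per_movie, yago_subjects):
    """Return {movie index: matched subject}, trying the candidates in order."""
    matches = {}
    for index, candidates in enumerate(candidates_per_movie):
        for candidate in candidates:  # most specific format first
            if candidate in yago_subjects:
                matches[index] = candidate
                break  # once matched, the movie is not matched again
    return matches


# Hypothetical KB subjects: "<Heat>" exists, but it is not the movie.
yago_subjects = {"<Heat>", "<Heat_(1995_film)>", "<Casino_(1995_film)>"}
candidates_per_movie = [
    ["<Heat_(1995_film)>", "<Heat_(film)>", "<Heat>"],
    ["<Casino_(1995_film)>", "<Casino_(film)>", "<Casino>"],
    ["<Nixon_(1995_film)>", "<Nixon_(film)>", "<Nixon>"],
]

matches = match_by_priority(candidates_per_movie, yago_subjects)
print(matches)  # {0: '<Heat_(1995_film)>', 1: '<Casino_(1995_film)>'}
```

Stopping at the first match matters for titles such as the first one here, where the bare title also exists in the KB but refers to something other than the movie.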

In [13]:
df_final

title,release_date,col_to_embed,revenue
str,str,str,f64
"""Heat""","""1995""","""<Heat_(1995_fi…",8.272855
"""Sudden Death""","""1995""","""<Sudden_Death_…",7.80855
"""Casino""","""1995""","""<Casino_(1995_…",8.064879
"""Assassins""","""1995""","""<Assassins_(19…",7.481487
"""Mortal Kombat""","""1995""","""<Mortal_Kombat…",8.087057
"""Pocahontas""","""1995""","""<Pocahontas_(1…",8.539176
"""Friday""","""1995""","""<Friday_(1995_…",7.450494
"""Fair Game""","""1995""","""<Fair_Game_(19…",7.061998
"""Screamers""","""1995""","""<Screamers_(19…",6.762069
"""Black Sheep""","""1996""","""<Black_Sheep_(…",1.50515


In [15]:
df_joined = df_filtered.lazy().join(
    df_final.lazy(),
    on=["title", "release_date"],
    how="inner",
).collect()
df_joined.write_parquet(Path(dataset_dir, "movies-revenues.parquet"))

Now I load the target dataset that was prepared for KEN and compare it with what I have. 

In [57]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/movie_revenues/target.parquet")
df_target_og = pl.read_parquet(path_original)
df_target_og

raw_entities,col_to_embed,target
str,str,f64
"""Heat""","""<Heat_(1995_fi…",8.272855
"""Sudden Death""","""<Sudden_Death_…",7.80855
"""Cry, the Belov…","""<Cry,_the_Belo…",5.830284
"""Pocahontas""","""<Pocahontas_(1…",8.539176
"""Friday""","""<Friday_(1995_…",7.450494
"""Fair Game""","""<Fair_Game_(19…",7.061998
"""Bed of Roses""","""<Bed_of_Roses_…",7.279455
"""Screamers""","""<Screamers_(19…",6.762069
"""Black Sheep""","""<Black_Sheep_(…",1.50515
"""Broken Arrow""","""<Broken_Arrow_…",8.176873


Looking for movies whose `col_to_embed` is not found in the `subject` column (for whatever reason).

In [58]:
df_target_og.filter(
    ~pl.col("col_to_embed").is_in(yagofacts["subject"])
)

raw_entities,col_to_embed,target
str,str,f64
"""Dracula""","""<Dracula_(1931…",6.005262
"""Runaway Bride""","""<Runaway_Bride…",8.490601
"""Niagara""","""<Niagara_(1953…",6.929419
"""Love's Labour'…","""<Love's_Labour…",5.47682
"""The Fury""","""<The_Fury_(197…",7.380211
"""Salsa""","""<Salsa_(1988_f…",6.949028
"""The Changeling…","""<The_Changelin…",7.079181
"""Ocean's Eleven…","""<Ocean's_Eleve…",8.653904
"""Asoka""","""<Asoka_(2001_f…",7.278754
"""Party Monster""","""<Party_Monster…",5.870929


In [59]:
# Movies that are in the KEN version and not in the new version. 
df_target_og.filter(
    (pl.col("col_to_embed").is_in(yagofacts["subject"])) &
    (~pl.col("col_to_embed").is_in(df_final["col_to_embed"]))
)

raw_entities,col_to_embed,target
str,str,f64
"""Cry, the Belov…","""<Cry,_the_Belo…",5.830284
"""Bed of Roses""","""<Bed_of_Roses_…",7.279455
"""Man of the Yea…","""<Man_of_the_Ye…",5.322085
"""The Scarlet Le…","""<The_Scarlet_L…",7.016298
"""The Tie That B…","""<The_Tie_That_…",6.761928
"""Before the Rai…","""<Before_the_Ra…",5.883006
"""Miracle on 34t…","""<Miracle_on_34…",7.665247
"""I Love Trouble…","""<I_Love_Troubl…",7.792022
"""The Age of Inn…","""<The_Age_of_In…",7.508603
"""For Love or Mo…","""<For_Love_or_M…",7.04713


In [60]:
df_final.filter(
    ~(pl.col("col_to_embed").is_in(yagofacts["subject"]))
)

title,col_to_embed,revenue
str,str,f64
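
The three checks above amount to set operations between the original target, the new target, and the KB subjects. A minimal sketch, using hypothetical entity names in place of the real columns:

```python
# Hypothetical entity sets standing in for the three sources being compared.
kb_subjects = {"<Movie_A_(2000_film)>", "<Movie_B_(2001_film)>", "<Movie_C>"}
original_entities = {
    "<Movie_A_(2000_film)>",
    "<Movie_B_(2001_film)>",
    "<Movie_D_(1999_film)>",
}
new_entities = {"<Movie_A_(2000_film)>", "<Movie_C>"}

# In the original target, but not a subject in this version of the KB.
missing_from_kb = original_entities - kb_subjects

# In the original target and in the KB, but not recovered by the new pipeline.
dropped_by_new_pipeline = (original_entities & kb_subjects) - new_entities

# In the new target, but absent from the KB; empty by construction.
new_not_in_kb = new_entities - kb_subjects

print(missing_from_kb)          # {'<Movie_D_(1999_film)>'}
print(dropped_by_new_pipeline)  # {'<Movie_B_(2001_film)>'}
print(new_not_in_kb)            # set()
```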


In [61]:
df_final

title,col_to_embed,revenue
str,str,f64
"""Heat""","""<Heat_(1995_fi…",8.272855
"""Sudden Death""","""<Sudden_Death_…",7.80855
"""Casino""","""<Casino_(1995_…",8.064879
"""Assassins""","""<Assassins_(19…",7.481487
"""Mortal Kombat""","""<Mortal_Kombat…",8.087057
"""Pocahontas""","""<Pocahontas_(1…",8.539176
"""Friday""","""<Friday_(1995_…",7.450494
"""Fair Game""","""<Fair_Game_(19…",7.061998
"""Screamers""","""<Screamers_(19…",6.762069
"""Black Sheep""","""<Black_Sheep_(…",1.50515


# US Presidential elections

In [57]:
dataset_dir = Path(data_dir, "presidential-results")
df = pl.read_csv(Path(dataset_dir, "presidential-results.csv"), infer_schema_length=0)
df = df.to_pandas()

In [58]:
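# Keep only the 2020 results; copy so the assignments below do not act on a view.
df = df[df["year"] == "2020"].copy()
df["county_name"] = df["county_name"].str.title()
df["state"] = df["state"].str.title()
# Entity name in the YAGO format, e.g. <Autauga_County,_Alabama>.
df["col_to_embed"] = "<" + df["county_name"] + "_County,_" + df["state"] + ">"
# Add 1 before the log so that rows with zero votes stay finite.
df["target"] = np.log10(df["candidatevotes"].astype(int) + 1)
df["raw_entities"] = df["county_name"] + " " + df["state"]
# df = df[["raw_entities", "col_to_embed", "party", "target"]]
df.dropna(inplace=True)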


In [61]:
df.drop(["raw_entities", "candidatevotes"], axis=1).to_parquet(Path(dataset_dir, "presidential-results-prepared.parquet"), index=False)

In [47]:

df=df.groupby(["raw_entities", "col_to_embed", "party"], as_index=False).sum()

df["col_to_embed"] = df["col_to_embed"].str.replace(" ", "_")
mask = df["col_to_embed"].str.contains("Louisiana")
df.loc[mask, "col_to_embed"] = df.loc[mask, "col_to_embed"].str.replace(
    "County", "Parish"
)
df["col_to_embed"] = df["col_to_embed"].str.replace("_City_County", "")
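
The clean-up steps applied to `col_to_embed` and the target can be condensed into two small functions. This is a sketch under stated assumptions: the function names are hypothetical, and the inputs in the usage lines are made-up examples of how county and state names might be written in the raw file.

```python
import math


def county_entity(county_name: str, state: str) -> str:
    """Build the YAGO-style entity name for a county, mirroring the steps above."""
    entity = f"<{county_name.title()}_County,_{state.title()}>".replace(" ", "_")
    if "Louisiana" in entity:
        # Louisiana is divided into parishes rather than counties.
        entity = entity.replace("County", "Parish")
    # Drop the suffix produced for entries named "<Something> City".
    return entity.replace("_City_County", "")


def vote_target(candidate_votes: int) -> float:
    """log10 of the votes; the +1 keeps zero-vote rows finite."""
    return math.log10(candidate_votes + 1)


print(county_entity("AUTAUGA", "ALABAMA"))             # <Autauga_County,_Alabama>
print(county_entity("EAST BATON ROUGE", "LOUISIANA"))  # <East_Baton_Rouge_Parish,_Louisiana>
print(county_entity("BALTIMORE CITY", "MARYLAND"))     # <Baltimore,_Maryland>
print(vote_target(9))                                  # 1.0
```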


In [62]:
df

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode,col_to_embed,target,raw_entities
50524,2020,Alabama,AL,Autauga,01001,US PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,7503,27770,20220315,TOTAL,"<Autauga_County,_Alabama>",3.875293,Autauga Alabama
50525,2020,Alabama,AL,Autauga,01001,US PRESIDENT,OTHER,OTHER,429,27770,20220315,TOTAL,"<Autauga_County,_Alabama>",2.633468,Autauga Alabama
50526,2020,Alabama,AL,Autauga,01001,US PRESIDENT,DONALD J TRUMP,REPUBLICAN,19838,27770,20220315,TOTAL,"<Autauga_County,_Alabama>",4.297520,Autauga Alabama
50527,2020,Alabama,AL,Baldwin,01003,US PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,24578,109679,20220315,TOTAL,"<Baldwin_County,_Alabama>",4.390564,Baldwin Alabama
50528,2020,Alabama,AL,Baldwin,01003,US PRESIDENT,OTHER,OTHER,1557,109679,20220315,TOTAL,"<Baldwin_County,_Alabama>",3.192567,Baldwin Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72612,2020,Wyoming,WY,Washakie,56043,US PRESIDENT,DONALD J TRUMP,REPUBLICAN,3245,4032,20220315,TOTAL,"<Washakie_County,_Wyoming>",3.511349,Washakie Wyoming
72613,2020,Wyoming,WY,Weston,56045,US PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,360,3560,20220315,TOTAL,"<Weston_County,_Wyoming>",2.557507,Weston Wyoming
72614,2020,Wyoming,WY,Weston,56045,US PRESIDENT,JO JORGENSEN,LIBERTARIAN,46,3560,20220315,TOTAL,"<Weston_County,_Wyoming>",1.672098,Weston Wyoming
72615,2020,Wyoming,WY,Weston,56045,US PRESIDENT,OTHER,OTHER,47,3560,20220315,TOTAL,"<Weston_County,_Wyoming>",1.681241,Weston Wyoming
