# Preparing the KEN datasets

In this notebook I look at the raw versions of the KEN datasets, then 
follow the pre-processing steps used for the KEN baseline. Note that I am using 
YAGO 2022, which might introduce discrepancies between these results and the 
original version. 

Datasets:
- [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)
- [7+ Million Company Dataset](https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset)
- [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents)

In [3]:
cd ~/work/prepare-data-lakes

/home/soda/rcappuzz/work/prepare-data-lakes


In [1]:
import polars as pl
import pandas as pd
from pathlib import Path
import src.yago.utils as utils


ModuleNotFoundError: No module named 'src'

In [5]:
data_dir = Path("data/kaggle")

Loading YAGO fact triplets to drop entities not found in the KB.

In [6]:
yago_path = Path("/storage/store3/work/jstojano/yago3/")
facts_path = Path(yago_path, "facts_parquet/yago_updated_2022_part2")
fname = "yagoFacts"
yagofacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_categorical = utils.import_from_yago(yagofacts_path, engine="polars")
fname = "yagoLiteralFacts"
yagoliteralfacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_numerical = utils.import_from_yago(yagoliteralfacts_path, engine="polars")
fname = "yagoDateFacts"
yagodatefacts_path = Path(facts_path, f"{fname}.tsv.parquet")
yagofacts_dates = utils.import_from_yago(yagodatefacts_path, engine="polars")

yagofacts = pl.concat(
    [
        yagofacts_categorical,
        yagofacts_numerical,
        yagofacts_dates
    ]
)

# US Accidents dataset

Archive `us-accidents.zip` contains the file `US_Accidents_Dec21_updated.csv`, 
which I renamed manually to `us-accidents.csv` for simplicity. 

I also had to copy the file `datasets/us_accidents/state_codes.csv` from the 
KEN repository for some of the steps. 

In [7]:
dataset_dir = Path(data_dir, "us-accidents")

In [8]:
df = pl.read_csv(Path(dataset_dir, "us-accidents.csv"))
df = df.rename({"State": "Code"})

In [9]:
state_codes_path =  Path(dataset_dir,"state_codes.csv")
state_codes = pl.read_csv(state_codes_path)
df = df.join(
    state_codes, on="Code"
)

Adding a new column, `col_to_embed`, that combines the city and state names into 
the format used for subjects in YAGO. 

In [10]:
df = df.with_columns(
    ("<" + pl.col("City") + ",_"+ pl.col("State") + ">").alias("col_to_embed")
)
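
A minimal sketch of the same formatting in plain Python, to make the key format explicit. The function name `yago_city_key` is hypothetical, and I am assuming that `state_codes.csv` provides the full state name in a `State` column (for example `MD` mapped to `Maryland`).

```python
def yago_city_key(city: str, state: str) -> str:
    """Build a key shaped like the `col_to_embed` column: <City,_State>."""
    # Mirrors the Polars expression above: plain concatenation, nothing else.
    return "<" + city + ",_" + state + ">"


key = yago_city_key("Aberdeen", "Maryland")
print(key)
```

Like the Polars expression, this sketch does not modify the city or state names themselves.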

Filtering out the rows not found in `yagofacts["subject"]`.

In [11]:
df_filtered=df.lazy().filter(
    pl.col("col_to_embed").is_in(
        yagofacts["subject"]
    )
).collect()

Completing the preparation by counting the accidents per city, then applying log10 to the count. This is 
the target used in KEN. 


In [12]:
df_counts = df_filtered.groupby(
    [
        "col_to_embed", "City", "Code"
    ]
    ).count()

In [13]:
df_final = df_counts.with_columns(
    (df_counts["City"] + ", " + df_counts["Code"]).alias("raw_entities") 
).select(
    [
        pl.col("raw_entities"),
        pl.col("col_to_embed"),
        pl.col("count").alias("target").log10()
    ]
).sort("raw_entities")
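
To make the target concrete, here is a small sketch in plain Python of the count followed by log10. The rows are toy data (six accidents in one city, one in another), not taken from the dataset; a city with a single accident gets a target of 0.

```python
import math
from collections import Counter

# Toy accident records as (City, Code) pairs; the counts are made up.
rows = [("Abbeville", "AL")] * 6 + [("Abbeville", "LA")]

# Count accidents per (City, Code), then take log10 of each count.
counts = Counter(rows)
targets = {
    f"{city}, {code}": math.log10(n)
    for (city, code), n in counts.items()
}

for entity, target in sorted(targets.items()):
    print(entity, round(target, 6))
```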

In [14]:
df_final

raw_entities,col_to_embed,target
str,str,f64
"""Abbeville, AL""","""<Abbeville,_Al…",0.778151
"""Abbeville, LA""","""<Abbeville,_Lo…",0.0
"""Abbotsford, WI…","""<Abbotsford,_W…",0.954243
"""Abbottstown, P…","""<Abbottstown,_…",1.39794
"""Aberdeen, ID""","""<Aberdeen,_Ida…",0.0
"""Aberdeen, MD""","""<Aberdeen,_Mar…",2.71433
"""Aberdeen, MS""","""<Aberdeen,_Mis…",0.477121
"""Aberdeen, OH""","""<Aberdeen,_Ohi…",0.0
"""Aberdeen, WA""","""<Aberdeen,_Was…",1.556303
"""Abernathy, TX""","""<Abernathy,_Te…",0.0


In [15]:
df_final.write_parquet(Path(dataset_dir, "us-accidents-target.parquet"))

Reading the original file to check whether the two versions look similar. 

In [16]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/us_accidents/counts.parquet")
df_counts_og = pl.read_parquet(path_original)
df_counts_og

raw_entities,col_to_embed,target
str,str,f64
"""Aaronsburg, PA…","""<Aaronsburg,_P…",0.30103
"""Abbeville, LA""","""<Abbeville,_Lo…",0.0
"""Abbotsford, WI…","""<Abbotsford,_W…",0.954243
"""Abbottstown, P…","""<Abbottstown,_…",1.041393
"""Aberdeen, MD""","""<Aberdeen,_Mar…",2.396199
"""Aberdeen, MS""","""<Aberdeen,_Mis…",0.477121
"""Aberdeen, OH""","""<Aberdeen,_Ohi…",0.0
"""Aberdeen, WA""","""<Aberdeen,_Was…",1.342423
"""Abernathy, TX""","""<Abernathy,_Te…",0.0
"""Abilene, TX""","""<Abilene,_Texa…",1.0


As expected, the number of entities in the original file is smaller. 
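
One way to quantify the difference is to compare the sets of entities in the two files. The sketch below uses plain Python sets on a handful of toy values; `"Hypothetical, XX"` is a made-up placeholder, and in the notebook the sets would come from the `raw_entities` columns of `df_final` and `df_counts_og`.

```python
# Toy entity sets standing in for the two `raw_entities` columns.
new_entities = {"Abbeville, AL", "Abbeville, LA", "Aberdeen, ID", "Aberdeen, MD"}
original_entities = {"Abbeville, LA", "Aberdeen, MD", "Hypothetical, XX"}

only_new = sorted(new_entities - original_entities)       # added by the new version
only_original = sorted(original_entities - new_entities)  # lost in the new version
shared = sorted(new_entities & original_entities)         # present in both

print(len(only_new), len(only_original), len(shared))
```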

# Company Employees Dataset

In [17]:
dataset_dir = Path(data_dir, "company-employees")

In [18]:
df = pl.read_csv(Path(dataset_dir, "companies_sorted.csv"))
df_selected = df.filter(
    pl.col("current employee estimate") >= 1000
)

I am using a slightly different heuristic compared to what Alexis was using. 

Adding a new column to `yagofacts` with lowercased subjects. 

In [19]:
yagofacts = yagofacts.with_columns(
    pl.col("subject").str.to_lowercase().alias("subject_formatted")
)

In [20]:
df_filtered=df_selected.lazy().with_columns(
    ("<" + pl.col("name").str.to_lowercase().str.replace(" ", "_") + ">").alias("formatted_name")
).filter(
    pl.col("formatted_name").is_in(yagofacts["subject_formatted"])
).collect()

Here I am preparing a mapping between the name in the original dataset and the match found in YAGO.
Note that the recall is relatively low, though it is higher than in the original. 

In [21]:
mapping_name_subject = df_filtered.lazy().join(
    yagofacts.lazy(),
    left_on="formatted_name",
    right_on="subject_formatted"
).select(
    [
        pl.col("name"),
        pl.col("formatted_name"),
        pl.col("subject")
    ]
).unique().collect()
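
The idea behind the mapping can be sketched in plain Python as a case-insensitive lookup: names are matched on a lowercased key, and the lookup returns the subject with its original YAGO capitalization. The helper names `to_key` and `resolve` are hypothetical, and the subject list is a toy sample.

```python
# Toy sample of YAGO subjects, with their original capitalization.
yago_subjects = ["<IBM>", "<Wells_Fargo>", "<Accenture>"]

# Lowercased subject -> original subject (plays the role of `subject_formatted`).
by_lowercase = {subject.lower(): subject for subject in yago_subjects}


def to_key(name: str) -> str:
    """Format a company name like `formatted_name`."""
    # count=1 mirrors the notebook, where only the first space is replaced.
    return "<" + name.lower().replace(" ", "_", 1) + ">"


def resolve(name: str):
    """Return the matching YAGO subject, or None if there is no match."""
    return by_lowercase.get(to_key(name))


print(resolve("wells fargo"), resolve("ibm"), resolve("unknown co"))
```

A dictionary keeps a single subject per lowercased key. The join in the notebook instead returns one row per matching subject, so two subjects that differ only in capitalization would both appear in the mapping.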


Joining with the mapping on `formatted_name` guarantees that the column `col_to_embed` uses the same format (and 
capitalization) as YAGO. 

In [22]:
df_final = df_filtered.join(
    mapping_name_subject, on="formatted_name"
).select(
    [
        pl.col("name").alias("raw_entities"),
        pl.col("subject").alias("col_to_embed"),
        pl.col("current employee estimate").alias("target").log10()
    ]
)

df_final.write_parquet(Path(dataset_dir, "company-employees-target.parquet"))

In [23]:
df_final

raw_entities,col_to_embed,target
str,str,f64
"""ibm""","""<IBM>""",5.437825
"""accenture""","""<Accenture>""",5.280326
"""hewlett-packar…","""<Hewlett-Packa…",5.107047
"""walmart""","""<Walmart>""",5.081898
"""microsoft""","""<Microsoft>""",5.065191
"""at&t""","""<AT&T>""",5.061407
"""wells fargo""","""<Wells_Fargo>""",5.039541
"""infosys""","""<Infosys>""",5.020162
"""deloitte""","""<Deloitte>""",5.017501
"""nokia""","""<Nokia>""",4.925967


In [24]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/company_employees/target.parquet")
df_target_og = pl.read_parquet(path_original)
df_target_og

raw_entities,col_to_embed,target
str,str,f64
"""accenture""","""<Accenture>""",5.280326
"""walmart""","""<Walmart>""",5.081898
"""microsoft""","""<Microsoft>""",5.065191
"""infosys""","""<Infosys>""",5.020162
"""deloitte""","""<Deloitte>""",5.017501
"""nokia""","""<Nokia>""",4.925967
"""capgemini""","""<Capgemini>""",4.925204
"""google""","""<Google>""",4.875692
"""ericsson""","""<Ericsson>""",4.830537
"""boeing""","""<Boeing>""",4.827763


Here I am checking which values in `col_to_embed` are not found in `yagofacts["subject"]`. These companies are missing 
because the name of the company is not the same in the original dataset and in YAGO. 

In [25]:
df_target_og.filter(
    ~pl.col("col_to_embed").is_in(yagofacts["subject"])
)

raw_entities,col_to_embed,target
str,str,f64
"""raytheon""","""<Raytheon>""",4.375353
"""thales""","""<Thales>""",4.321868
"""herbalife""","""<Herbalife>""",4.289433
"""flextronics""","""<Flextronics>""",4.279644
"""adecco""","""<Adecco>""",4.160799
"""altran""","""<Altran>""",4.090187
"""statoil""","""<Statoil>""",4.081563
"""symantec""","""<Symantec>""",4.034628
"""syntel""","""<Syntel>""",3.994229
"""arup""","""<Arup>""",3.98304


# The Movies Dataset

In [26]:
dataset_dir = Path(data_dir, "the-movies-dataset")
df = pl.read_csv(Path(dataset_dir, "movies_metadata.csv"), infer_schema_length=0)


The target variable is the revenue, so movies with revenue 0 are dropped. I also truncate the release date to its year, which is needed to create the title mappings.

In [27]:
# Filtering
df_filtered = df.filter(
    pl.col("revenue").cast(int) > 0
).with_columns(
    (pl.col("release_date").str.slice(0, 4)).alias("release_date")
)
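
A plain-Python sketch of this step, on made-up rows (only the column names come from the dataset, and the `YYYY-MM-DD` date layout is an assumption). Because `infer_schema_length=0` makes Polars read every column as a string, the revenue has to be cast before the comparison, and the year is simply the first four characters of the date.

```python
# Toy rows; every value is a string, as after reading with infer_schema_length=0.
rows = [
    {"title": "Movie A", "revenue": "0", "release_date": "1995-12-15"},
    {"title": "Movie B", "revenue": "1500000", "release_date": "2001-06-01"},
]

# Keep rows with positive revenue and shorten the release date to the year.
kept = [
    {**row, "release_date": row["release_date"][:4]}
    for row in rows
    if int(row["revenue"]) > 0
]

print(kept)
```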

`title` and `release_date` together should approximate a unique key quite well: I am looking to see whether this is the case or not. 

In [68]:
df_filtered.groupby(["title", "release_date"]).count().sort("count",descending=True)

title,release_date,count
str,str,u32
"""A Farewell to …","""1932""",2
"""Confessions of…","""2002""",2
"""Black Gold""","""2011""",2
"""Force Majeure""","""2014""",2
"""Camille Claude…","""2013""",2
"""Pokémon 4Ever:…","""2001""",2
"""Le Samouraï""","""1967""",2
"""Clockstoppers""","""2002""",2
"""Pokémon: Spell…","""2000""",2
"""The Congress""","""2013""",2


In [69]:
# Prepare 3 different mappings, to try and cover as many cases as possible
mapping_to_yago = df_filtered.select([pl.col("title"), pl.col("release_date"), pl.col("revenue")])

mapping_to_yago=mapping_to_yago.with_columns(
    [
        ("<" + pl.col("title").str.replace(" ", "_") + ">").alias("title_format_1"),
        ("<" + pl.col("title").str.replace(" ", "_") + "_(film)>").alias("title_format_2"),
        ("<" + pl.col("title").str.replace(" ", "_") +  "_(" + pl.col("release_date") + "_film)>").alias("title_format_3"),
        pl.Series(list(range(len(mapping_to_yago)))).alias("index")
    ]
)
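
A note on the formatting above: in Polars, `str.replace` substitutes only the first match, while `str.replace_all` substitutes every match. A title with two or more spaces therefore keeps its later spaces and cannot match an underscore-separated YAGO subject. This may be one reason why multi-word titles show up among the movies found in the KEN version but not in the new one (an assumption, I have not verified it on the data). The sketch below mimics both behaviors in plain Python.

```python
title = "Bed of Roses"

# Python's optional count argument limits the number of substitutions,
# which mimics Polars' `str.replace` (first match only).
first_only = title.replace(" ", "_", 1)

# Without a count, every space is replaced, like Polars' `str.replace_all`.
every_space = title.replace(" ", "_")

print(first_only)
print(every_space)
```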

Here I am looking for movies that are present in YAGO according to one of the three formats defined above. 

I also reformat the output to reflect the "target dataset" schema. 

In [54]:
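tgt_indices = []  # indices already matched by a more specific format
selected = []
# Try the most specific format first, then fall back to the more generic ones
for jj in [3, 2, 1]:
    g1 = mapping_to_yago.filter(
        (pl.col(f"title_format_{jj}").is_in(yagofacts["subject"])) & 
        (~pl.col("index").is_in(tgt_indices))
    ).select(pl.col("index"))

    new_indices = g1["index"].to_list()
    
    newdf = mapping_to_yago.filter(
        pl.col("index").is_in(new_indices)
    ).select(
        [
            pl.col("title"),
            pl.col(f"title_format_{jj}").alias("col_to_embed"),
            # revenue was read as a string, so cast before taking the log
            pl.col("revenue").cast(pl.Float64).log10(),
        ]
    )
    selected.append(newdf)
    # Accumulate, so that later formats skip every movie matched so far
    tgt_indices.extend(new_indices)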

df_final=pl.concat(selected)
df_final.write_parquet(Path(dataset_dir, "movie-revenues-target.parquet"))
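
The matching strategy can be summarized as "first match wins": for each movie, the formats are tried from most to least specific, and the movie is assigned to the first key found among the YAGO subjects. Below is a minimal per-movie sketch in plain Python; the function names, titles and subjects are all hypothetical, and `base` is assumed to be the title already formatted with underscores.

```python
def candidate_keys(base: str, year: str) -> list:
    """Candidate YAGO keys, from most specific (format 3) to least (format 1)."""
    return [f"<{base}_({year}_film)>", f"<{base}_(film)>", f"<{base}>"]


def match_movie(base: str, year: str, subjects: set):
    """Return the first candidate key found in `subjects`, or None."""
    for key in candidate_keys(base, year):
        if key in subjects:
            return key
    return None


# Toy set of subjects; "<Alpha>" could be something other than the film.
subjects = {"<Alpha_(2001_film)>", "<Alpha>", "<Beta_(film)>"}

print(match_movie("Alpha", "2001", subjects))
print(match_movie("Beta", "1999", subjects))
print(match_movie("Gamma", "2010", subjects))
```

Trying the year-qualified format first reduces the risk of matching a page with the same title that is not the movie in question, and each movie ends up with at most one key.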

Now I am loading the target dataset that was prepared for KEN and comparing it to what I have. 

In [71]:
path_original=Path("/storage/store3/work/acvetkov/gitlab/KEN/experiments/datasets/movie_revenues/target.parquet")
df_target_og = pl.read_parquet(path_original)
df_target_og

raw_entities,col_to_embed,target
str,str,f64
"""Heat""","""<Heat_(1995_fi…",8.272855
"""Sudden Death""","""<Sudden_Death_…",7.80855
"""Cry, the Belov…","""<Cry,_the_Belo…",5.830284
"""Pocahontas""","""<Pocahontas_(1…",8.539176
"""Friday""","""<Friday_(1995_…",7.450494
"""Fair Game""","""<Fair_Game_(19…",7.061998
"""Bed of Roses""","""<Bed_of_Roses_…",7.279455
"""Screamers""","""<Screamers_(19…",6.762069
"""Black Sheep""","""<Black_Sheep_(…",1.50515
"""Broken Arrow""","""<Broken_Arrow_…",8.176873


Looking for movies whose `col_to_embed` is not found in the `subject` column (for whatever reason).

In [70]:
df_target_og.filter(
    ~pl.col("col_to_embed").is_in(yagofacts["subject"])
)

raw_entities,col_to_embed,target
str,str,f64
"""Dracula""","""<Dracula_(1931…",6.005262
"""Runaway Bride""","""<Runaway_Bride…",8.490601
"""Niagara""","""<Niagara_(1953…",6.929419
"""Love's Labour'…","""<Love's_Labour…",5.47682
"""The Fury""","""<The_Fury_(197…",7.380211
"""Salsa""","""<Salsa_(1988_f…",6.949028
"""The Changeling…","""<The_Changelin…",7.079181
"""Ocean's Eleven…","""<Ocean's_Eleve…",8.653904
"""Asoka""","""<Asoka_(2001_f…",7.278754
"""Party Monster""","""<Party_Monster…",5.870929


In [72]:
# Movies that are in the KEN version and not in the new version. 
df_target_og.filter(
    (pl.col("col_to_embed").is_in(yagofacts["subject"])) &
    (~pl.col("col_to_embed").is_in(df_final["col_to_embed"]))
)

raw_entities,col_to_embed,target
str,str,f64
"""Cry, the Belov…","""<Cry,_the_Belo…",5.830284
"""Bed of Roses""","""<Bed_of_Roses_…",7.279455
"""Man of the Yea…","""<Man_of_the_Ye…",5.322085
"""The Scarlet Le…","""<The_Scarlet_L…",7.016298
"""The Tie That B…","""<The_Tie_That_…",6.761928
"""Before the Rai…","""<Before_the_Ra…",5.883006
"""Miracle on 34t…","""<Miracle_on_34…",7.665247
"""I Love Trouble…","""<I_Love_Troubl…",7.792022
"""The Age of Inn…","""<The_Age_of_In…",7.508603
"""For Love or Mo…","""<For_Love_or_M…",7.04713


In [74]:
df_final.filter(
    ~(pl.col("col_to_embed").is_in(yagofacts["subject"]))
)

title,col_to_embed,revenue
str,str,f64
