# Motivation
The purpose of this notebook is to analyse the metadata for some manually reviewed texts provided by the organizers.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import os

### EDA + cleanup

In [None]:
path = '/kaggle/input/hackathon'
file = f'{path}/task_1-google_search_manually_reviewed_metadata.csv'
df = pd.read_csv(file, encoding = "ISO-8859-1")

In [None]:
f"There are {df.shape[0]} manually reviewed texts with metadata"

In [None]:
df.head()

In [None]:
df.info()

The query and char_number columns are empty.

In [None]:
df[df['filename'].isnull()]

We don't have a corresponding text file for those sources.

In [None]:
df.dropna(subset=['filename'], inplace=True)

In [None]:
df['language'].value_counts()

All reviewed texts are in English.

In [None]:
df['is_pdf'].value_counts()

In [None]:
df['is_translated'].value_counts() # All original

In [None]:
df['is_downloaded'].value_counts() # All downloaded

All texts are originally in English and were downloaded successfully.

In [None]:
df['Is Processed'].value_counts()

All were processed, whatever that means :)

In [None]:
df['country'].unique().size

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
df[df['Comments'].notna()]['Comments']

Comments are rare and don't seem very useful.

In [None]:
df['Snippet'].isna().mean()

34% of entries have no Snippet extracted, which seems odd. Those entries should be ignored.

In [None]:
df.dropna(subset=['Snippet'], inplace=True)

In [None]:
df['snippet_len'] = df['Snippet'].astype(str).apply(len)

In [None]:
df['snippet_len'].hist()

In [None]:
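def is_snippet_in_text(row):
    code = row['alpha_2_code']
    file = row['filename']
    if not file.endswith('.txt'):
        file += '.txt'
    filename = f'{path}/task_1-google_search_txt_files_v2/{code}/{file}'
    # Missing files yield None (no match can be determined)
    if not os.path.isfile(filename):
        return None
    # Use a separate name for the handle so it does not shadow `file`
    with open(filename, 'r') as f:
        data = f.read()
    # Exact substring match only
    return row['Snippet'] in data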

In [None]:
df['snippet_in_text'] = df.apply(is_snippet_in_text, axis=1)

In [None]:
df['snippet_in_text'].mean()

Only 37% of snippets actually appear in the text in the respective file. However, manual inspection showed that some snippets appear in the text with slight differences in punctuation or with missing words.
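
The near-misses seen during manual inspection could be caught with a more tolerant comparison. The sketch below is not part of the original analysis, and the helper names `normalize`, `snippet_in_text_normalized` and `snippet_word_coverage` are hypothetical. It assumes that lowercasing, stripping punctuation and collapsing whitespace is enough to absorb the punctuation differences, and it uses `difflib.SequenceMatcher` on word lists to estimate how much of a snippet survives when a few words are missing.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def snippet_in_text_normalized(snippet, text):
    """Whole-word substring match after normalization."""
    # Pad with spaces so a snippet cannot match inside a longer word
    return f" {normalize(snippet)} " in f" {normalize(text)} "


def snippet_word_coverage(snippet, text):
    """Fraction of snippet words that SequenceMatcher aligns with the text."""
    snippet_words = normalize(snippet).split()
    text_words = normalize(text).split()
    if not snippet_words:
        return 0.0
    matcher = SequenceMatcher(None, snippet_words, text_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(snippet_words)
```

A coverage close to 1 suggests the snippet is present with small gaps. Any cut-off for treating it as a match would need tuning against the manually inspected cases. `SequenceMatcher` can be slow on long documents, so this is better suited to spot checks than to a full pass over the corpus.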

In [None]:
def get_content_len(row):
    code = row['alpha_2_code']
    file = row['filename']
    if not file.endswith('.txt'):
        file += '.txt'
    filename = f'{path}/task_1-google_search_txt_files_v2/{code}/{file}'
    if os.path.isfile(filename):
        # Use a separate name for the handle so it does not shadow `file`
        with open(filename, 'r') as f:
            data = f.read()
        return len(data)
    else:
        # Missing files are reported and yield None
        print(f"Could not find file {filename} in folder {code}")
        return None

In [None]:
df['text_len'] = df.apply(get_content_len, axis=1)

In [None]:
df[df['text_len'].isna()]

We see an issue with two entries, where the filename does not correspond to the alpha_2_code and the country. Those should be ignored.

In [None]:
df.dropna(subset=['text_len'], inplace=True)

In [None]:
df.shape

In [None]:
df.drop(['query','language','is_translated','is_downloaded','char_number','Is Processed'], inplace=True, axis=1)

In [None]:
df.to_csv('/kaggle/working/manually_reviewed_cleaned.csv', index=False)

Finally, the cleaned file can be modified so that the snippets are split into smaller ones which answer the specific questions required for a submission. Then it can be used as an evaluation dataset for future models.