# Motivation
This notebook aims to:
* explore the metadata files provided by the organizers, i.e., *task_1-google_search_english_original_metadata* and *task_1-google_search_translated_to_english_metadata*
* gain insight about how the data was collected
* understand what useful information it contains about the source texts

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import os

# EDA on English original metadata

As the name suggests, this file contains metadata for the texts that were originally crawled in English. This does not mean, however, that only English-speaking countries are represented, as the analysis below shows.

In [None]:
path = '/kaggle/input/hackathon'
file = f'{path}/task_1-google_search_english_original_metadata.csv'
df = pd.read_csv(file)

In [None]:
f"There are {df.shape[0]} texts originally in English with metadata"

In [None]:
df.head()

### Country

In [None]:
df['country'].value_counts()

240 countries are represented in this data, which is a lot.

In [None]:
df['country'].value_counts().hist()

The distribution is uneven: some countries are represented by many more texts than others.
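
To put numbers on that imbalance, the per-country counts can be summarized directly. The following is a minimal sketch rather than part of the original analysis; the helper name `country_balance` is hypothetical, and it assumes only the `country` column used above.

```python
import pandas as pd


def country_balance(frame: pd.DataFrame, column: str = "country") -> pd.Series:
    """Summarize how evenly texts are spread across countries.

    `country_balance` is a hypothetical helper name used for illustration.
    """
    # Number of texts per distinct country value.
    counts = frame[column].value_counts()
    return pd.Series({
        "n_countries": counts.size,
        "min_texts": counts.min(),
        "median_texts": counts.median(),
        "max_texts": counts.max(),
    })
```

In this notebook it would be called as `country_balance(df)`. A large gap between `median_texts` and `max_texts` would confirm that a few countries dominate the data.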

### Pdf, translated, downloaded

In [None]:
df['is_pdf'].value_counts()

In [None]:
df['language'].value_counts()

Everything is in English, as expected.

In [None]:
df['is_translated'].value_counts()

Nothing is translated.

In [None]:
df['is_downloaded'].value_counts()

A small number of documents were not downloaded.

In [None]:
df[df['is_downloaded']==False]['char_number'].describe()

We should ignore all rows that were not downloaded, since they correspond to empty documents.

In [None]:
df.drop(df[df['is_downloaded']==False].index, inplace=True)

### Char number

In [None]:
(df['char_number']== 0).mean()

8% of the remaining texts have 0 characters; they should be ignored as well.

In [None]:
df.drop(df[df['char_number']==0].index, inplace=True)

Let's print the 20 texts with the smallest `char_number`.

In [None]:
# Folder holding the crawled texts, stored as <alpha_2_code>/<filename>.txt
txt_dir = f'{path}/task_1-google_search_txt_files_v2'

for _, row in df.sort_values('char_number').head(20).iterrows():
    # Use new names so the `file` path string defined earlier is not overwritten
    txt_path = f"{txt_dir}/{row['alpha_2_code']}/{row['filename']}.txt"

    with open(txt_path, 'r') as f:
        data = f.read()
    print(row['char_number'])
    print(data)
    print('--' * 10)

These texts are not meaningful: they contain "forbidden" error messages, which indicates that the source was not crawled correctly.
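
One way to flag such failed crawls automatically is a simple text heuristic. The sketch below is only an illustration: the function name `looks_like_crawl_error`, the marker list, and the 200-character cutoff are all assumptions that would need tuning against the actual files.

```python
def looks_like_crawl_error(text: str, markers=("forbidden",), max_chars: int = 200) -> bool:
    """Return True for short texts that contain an error marker.

    The marker list and the length cutoff are hypothetical defaults,
    not values derived from the dataset.
    """
    lowered = text.lower()
    # Only short texts are flagged, so a long document that merely
    # mentions the word "forbidden" is not discarded.
    return len(text) <= max_chars and any(marker in lowered for marker in markers)
```

Applied to the `data` string read in the loop above, `looks_like_crawl_error(data)` would mark the error pages while leaving longer documents untouched.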

### Url

In [None]:
df['url'].duplicated().mean()

Many URLs are duplicated: 54% of the rows repeat a URL that already appears elsewhere in the data.

The most common duplicated URLs are:

In [None]:
df['url'].value_counts().head()

This duplication means that the text is relevant for multiple countries. This can be problematic when trying to extract answer snippets from the texts, because we would have to determine which country the answer applies to. We might consider removing such sources to make the problem easier.
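
A minimal sketch of how such sources could be identified and removed is shown below. The function names `urls_shared_across_countries` and `drop_shared_urls` are hypothetical, and the sketch assumes only the `url` and `country` columns used above. URLs repeated within a single country are deliberately kept, since they do not create the ambiguity described here.

```python
import pandas as pd


def urls_shared_across_countries(frame: pd.DataFrame) -> pd.Series:
    """For each URL linked to more than one country, return the number of distinct countries."""
    countries_per_url = frame.groupby("url")["country"].nunique()
    shared = countries_per_url[countries_per_url > 1]
    return shared.sort_values(ascending=False)


def drop_shared_urls(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without rows whose URL appears for several countries."""
    shared = urls_shared_across_countries(frame).index
    return frame[~frame["url"].isin(shared)].copy()
```

In this notebook, `urls_shared_across_countries(df).head()` would list the most widely shared URLs, and `drop_shared_urls(df)` would give the reduced dataset.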

# EDA on translated to English metadata

In [None]:
path = '/kaggle/input/hackathon'
file = f'{path}/task_1-google_search_translated_to_english_metadata.csv'
df = pd.read_csv(file)

In [None]:
f"There are {df.shape[0]} texts translated to English with metadata"

In [None]:
df.head()

We'll repeat the same analysis as above.
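
Since the same checks are run on both files, they could also be collected in one helper. The sketch below is an optional convenience, not part of the original analysis; `summarize_metadata` is a hypothetical name, and it assumes the column names seen in the English original file (`country`, `is_downloaded`, `char_number`, `url`).

```python
import pandas as pd


def summarize_metadata(frame: pd.DataFrame) -> dict:
    """Collect the headline statistics used in this notebook for one metadata file."""
    return {
        "n_texts": len(frame),
        "n_countries": frame["country"].nunique(),
        # Share of rows flagged as not downloaded.
        "share_not_downloaded": frame["is_downloaded"].eq(False).mean(),
        # Share of rows whose text is empty.
        "share_empty": frame["char_number"].eq(0).mean(),
        # Share of rows whose URL already occurred in an earlier row.
        "share_duplicate_urls": frame["url"].duplicated().mean(),
    }
```

Calling `summarize_metadata(df)` right after each `pd.read_csv` would make the two files easy to compare side by side.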

In [None]:
df['country'].value_counts()

Only 74 countries are represented here.

In [None]:
df['is_pdf'].value_counts()

In [None]:
df['language'].value_counts()

In [None]:
df['is_translated'].value_counts()

In [None]:
df['is_downloaded'].value_counts()

The `is_downloaded` column does not seem to carry any useful information for this file.
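
A quick way to check whether a column carries any information is to test whether it holds more than one distinct value. The helper below is a small sketch with a hypothetical name, `constant_columns`; it makes no assumption about which columns the file contains.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value (NaN counts as a value)."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]
```

If `'is_downloaded'` shows up in `constant_columns(df)`, the column can be dropped without losing anything.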

In [None]:
(df['char_number']== 0).mean()

In [None]:
df['url'].duplicated().mean()

There are not as many duplicated URLs here as in the English original file.

In [None]:
df['url'].value_counts().head()