# Wikipedia - Image/Caption Matching EDA

This looks like a very interesting competition (too bad it doesn't award points!!). Let's take a look at the data and submit a super simple baseline to check that our understanding is right.

**Note: this is a new version. An earlier version helped uncover some issues with the test data :)**

In [None]:
!pip install rapidfuzz -qq

## Test data

Let's start with the **sample submission**. It has an *id* column and a *caption_title_and_reference_description* column, and for each id we predict 5 captions. We need to pick them from a predefined set of captions, which we'll look at next.

In [None]:
import pandas as pd
sub = pd.read_csv('../input/wikipedia-image-caption/sample_submission.csv')
sub.head(10)

In [None]:
captions = pd.read_csv('../input/wikipedia-image-caption/test_caption_list.csv')
print(len(captions))
captions.head()

The test file contains a list of ids and image URLs. Let's print a few of those URLs.

In [None]:
test = pd.read_csv('../input/wikipedia-image-caption/test.tsv', sep='\t')
test.head()

In [None]:
for i in range(5):
    print(test.image_url.loc[i])

Let's take a look at some of the **test images**. It looks like the test image data comes in 5 files (named `.csv`, but read here as tab-separated), and these include links to the images (upload and commons varieties) as well as the images themselves encoded as base64 strings.

In [None]:
ls ../input/wikipedia-image-caption/image_data_test/image_pixels

In [None]:
tst0 = pd.read_csv('../input/wikipedia-image-caption/image_data_test/image_pixels/test_image_pixels_part-00000.csv', sep='\t', header=None)
tst0.head()

In [None]:
import base64 
from PIL import Image
import io

image_64_decode = base64.b64decode(tst0[1].loc[0])
img = Image.open(io.BytesIO(image_64_decode))
img
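
Before handing a decoded string to PIL, it can help to confirm that the bytes actually look like an image. Here is a small stdlib-only sketch of that check. `sniff_image_format` is a hypothetical helper, and the sample payload is just the 8-byte PNG signature, not a real row from the test files.

```python
import base64

# Magic-number prefixes for a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image_format(b64_string):
    """Decode a base64 string and guess the image format from its first bytes."""
    raw = base64.b64decode(b64_string)
    for magic, name in SIGNATURES.items():
        if raw.startswith(magic):
            return name
    return None  # unknown or not an image

# Toy payload: the PNG signature, base64-encoded (not real competition data).
sample = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
print(sniff_image_format(sample))
```

Something like `sniff_image_format(tst0[1].loc[0])` could then be run before `Image.open`, assuming column 1 holds the base64 string as in the cell above.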

## Train data

Thanks to [this kernel](https://www.kaggle.com/udbhavpangotra/reading-the-data-datatable-works-like-a-charm) for showing how to speed up data reading with datatable, please give it an upvote! Here we have a bunch of columns including page and image URLs, as well as what looks like our target: `caption_title_and_reference_description`. There is a `[SEP]` string at the end of each entry, or in the middle when multiple entries are provided. I'm not fully sure where this comes from.


In [None]:
import datatable as dt
train0 = dt.fread('../input/wikipedia-image-caption/train-00000-of-00005.tsv')
train0.head()
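
Since a target can hold one or several `[SEP]`-delimited entries, a small helper can split it into its parts. This is a minimal sketch that assumes `[SEP]` is purely a delimiter. `split_caption` and the sample strings are made up for illustration.

```python
def split_caption(target, sep="[SEP]"):
    """Split a target string on the separator and drop empty pieces."""
    return [part.strip() for part in target.split(sep) if part.strip()]

# Made-up strings in the two shapes described above.
print(split_caption("Some title [SEP]"))
print(split_caption("Some title [SEP]Some description"))
```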

## Baseline

Looking at the train data, there seems to be a connection between the page URL and our target. We also have a URL for each row of the test data (the code below works from the `image_url` column), so let's try to exploit that and predict a caption from the URL alone, without looking at the image. We'll check the heuristic on a few train rows, then fuzzy match each predicted caption against the list of test captions.

In [None]:
from urllib.parse import unquote

t = test.image_url.loc[2]

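def convert(url):
    # Keep only the last path segment of the URL
    name = url.rsplit('/', 1)[1]
    # Decode percent-escapes and turn underscores into spaces
    name = unquote(name).replace('_', ' ')
    # Targets end with ' [SEP]', so append it to match their format
    return name + ' [SEP]'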

In [None]:
for i in range(5):
    print(f'target: {train0[i,-1]}')
    print(f'prediction: {convert(train0[i,1])}')
    print()
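
Eyeballing five rows is a weak check, so the heuristic could also be scored over many rows. Below is a sketch on made-up `(url, target)` pairs. `url_to_caption` mirrors `convert` above and `exact_match_rate` is a hypothetical helper. None of the rows come from the competition files.

```python
from urllib.parse import unquote

def url_to_caption(url):
    """Mirror of convert(): last URL segment, percent-decoded, underscores to spaces."""
    name = unquote(url.rsplit("/", 1)[1])
    return name.replace("_", " ") + " [SEP]"

def exact_match_rate(pairs):
    """Fraction of (url, target) pairs where the converted URL equals the target."""
    hits = sum(url_to_caption(url) == target for url, target in pairs)
    return hits / len(pairs)

# Made-up rows for illustration only.
pairs = [
    ("https://en.wikipedia.org/wiki/Eiffel_Tower", "Eiffel Tower [SEP]"),
    ("https://es.wikipedia.org/wiki/M%C3%A9xico", "México [SEP]"),
    ("https://en.wikipedia.org/wiki/Paris", "Paris [SEP]View of Paris"),
]
print(exact_match_rate(pairs))
```

The third row shows the case the heuristic cannot get exactly right: a target with extra text after `[SEP]`. That is one reason to fuzzy match rather than look for exact matches.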

In [None]:
test['prediction'] = test['image_url'].apply(convert)

In [None]:
test.head()

In [None]:
CAPTIONS = captions.caption_title_and_reference_description.values.tolist()
len(CAPTIONS)

In [None]:
from rapidfuzz import process, fuzz

In [None]:
%%time

for i in range(5):
    s = test.prediction.loc[i]
    print(f'prediction: {s}')
    # Each result is a (choice, score, index) tuple; we print only the caption text
    res = process.extract(s, CAPTIONS, scorer=fuzz.ratio, processor=None, limit=5)
    print('closest captions:')
    for c in res:
        print(c[0])
    print('*'*60)
    print()

In [None]:
def find_closest_match(s):
    res = process.extract(s, CAPTIONS, scorer=fuzz.ratio, processor=None, limit=5)
    res = [x[0] for x in res]
    return res
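
To see what the top-k lookup does without installing rapidfuzz, here is a stand-in built on the standard library's `difflib.SequenceMatcher`. It is a different scorer from `fuzz.ratio`, so scores and rankings may not agree with rapidfuzz, and it will be far slower on the full caption list. `top_k_matches` and the toy captions are made up.

```python
from difflib import SequenceMatcher
import heapq

def top_k_matches(query, choices, k=5):
    """Return the k choices most similar to query, best first."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in choices]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [c for _, c in best]

# Toy caption list for illustration only.
toy_captions = [
    "Eiffel Tower [SEP]",
    "Tower Bridge [SEP]",
    "Mount Everest [SEP]",
    "Eiffel Tower at night [SEP]",
]
print(top_k_matches("Eiffel Tower [SEP]", toy_captions, k=2))
```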

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
test['caption_title_and_reference_description'] = test['prediction'].progress_apply(find_closest_match)

In [None]:
sub = test[['id', 'caption_title_and_reference_description']]
sub = sub.explode('caption_title_and_reference_description')
sub.head()
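
`explode` turns each row's list of five captions into five rows that share the same id. Here is a pure-Python sketch of that reshaping on made-up data, with a check that every id ends up with five rows. `explode_rows` is a hypothetical helper, not a pandas function.

```python
from collections import Counter

def explode_rows(rows):
    """Turn (id, [captions]) pairs into one (id, caption) pair per caption."""
    return [(row_id, cap) for row_id, caps in rows for cap in caps]

# Made-up predictions: two ids, five captions each.
rows = [
    (0, ["a", "b", "c", "d", "e"]),
    (1, ["f", "g", "h", "i", "j"]),
]
long_rows = explode_rows(rows)
counts = Counter(row_id for row_id, _ in long_rows)
print(len(long_rows), dict(counts))
```

The same kind of check on the real frame would be something like `sub.groupby('id').size().eq(5).all()`, assuming every prediction list has exactly five entries.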

In [None]:
sub.to_csv('submission.csv', index=False)

We've now been able to score above 0.0000 on the leaderboard. To be honest, I'm not sure whether using page_url is expected by the host, so I asked about it on the forum. The more challenging and interesting part is certainly matching the captions directly with the images, and we'll try to tackle that next :)

## to be continued ...