# Wikipedia - Image/Caption Matching
## Retrieve captions based on images


<br>

### Description

A picture is worth a thousand words, yet sometimes a few will do. We all rely on online images for knowledge sharing, learning, and understanding. Even the largest websites are missing visual content and metadata to pair with their images. Captions and “alt text” increase accessibility and enable better search. The majority of images on Wikipedia articles, for example, don't have any written context connected to the image. Open models could help anyone improve accessibility and learning for all.

Current solutions rely on simple methods based on translations or page interlinks, which have limited coverage. Even the most advanced computer vision image captioning isn't suitable for images with complex semantics.


### Data

The objective of this competition is to predict the target caption_title_and_reference_description given information about an image. The targets for this competition are in multiple languages.

#### Files
* train-{0000x}-of-00005.tsv - the training data (tab delimited)
* test.tsv - the test data; the objective is to predict the target caption_title_and_reference_description for each row id
* sample_submission.csv - a sample submission file in the correct format; note that multiple predictions (up to 5) are allowed for each id in the test data.
* image_data_test/
  * image_pixels/test_image_pixels_part-{0000x}.csv.gz
    * image_url: url to the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
    * b64_bytes: base64 encoded bytes of the image file at a 300px resolution
    * metadata_url: url to the commons page of the image, e.g. https://commons.wikimedia.org/wiki/File:Hovden.jpg
  * resnet_embeddings/test_resnet_embeddings_part-{0000x}.csv.gz
    * image_url: url to the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
    * embedding: a comma separated list of 2048 float values
* image_data_train - Due to the size of the training image data (~275 Gb), it is hosted separately and can be found here. Note that not all of the training observations have corresponding image data.

<code> kaggle competitions download -c wikipedia-image-caption </code>
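The `embedding` column described above stores each ResNet vector as one comma separated string, so it has to be parsed before any numeric work. The following is a minimal sketch, assuming the string contains only plain float literals; `parse_embedding` is a hypothetical helper name, and the demo string is synthetic rather than real competition data.

```python
import numpy as np


def parse_embedding(text):
    """Turn a comma separated string of floats into a 1-D float32 array."""
    return np.array([float(value) for value in text.split(',')], dtype=np.float32)


# Synthetic stand-in for one `embedding` cell (2048 values, as in the file description)
demo_text = ','.join(str(i / 2048) for i in range(2048))
demo_vector = parse_embedding(demo_text)
print(demo_vector.shape)  # (2048,)
```

On the real files, the same helper could be applied to the whole column, for example with `df['embedding'].map(parse_embedding)`, assuming the file was loaded with an `embedding` column name.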

### Submission

Submissions will be evaluated using NDCG@5 (Normalized Discounted Cumulative Gain).
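
To get a feel for the metric, here is a minimal sketch of NDCG@5 for a single id. It assumes binary relevance with exactly one correct caption per id, in which case the ideal DCG is 1 and the score reduces to `1 / log2(rank + 1)`. The official scoring code is not shown here, so treat `ndcg_at_k` as a hypothetical helper for local validation only.

```python
import math


def ndcg_at_k(ranked_predictions, true_caption, k=5):
    """NDCG@k for one id, assuming a single relevant caption (binary relevance)."""
    for rank, caption in enumerate(ranked_predictions[:k], start=1):
        if caption == true_caption:
            # DCG of a single hit at this rank; the ideal DCG is 1 / log2(2) = 1
            return 1.0 / math.log2(rank + 1)
    # The correct caption is not among the top k predictions
    return 0.0


print(ndcg_at_k(['a', 'b', 'c'], 'a'))  # 1.0
print(ndcg_at_k(['a', 'b', 'c'], 'z'))  # 0.0
```

Under this assumption, a correct caption in second place scores about 0.63 and one in fifth place about 0.39, so placing the right caption first matters far more than merely including it.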

The submission should be a list of id,caption_title_and_reference_description pairs ranked from top to bottom according to their relevance (i.e., the top id is the most relevant caption_title_and_reference_description), with up to 5 predictions per id. Each line should be a single id,caption_title_and_reference_description pair.
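
The format described above can be sketched with pandas. This is an illustration only: `build_submission` and the example captions are hypothetical, and the column names are taken from the description of `sample_submission.csv`.

```python
import pandas as pd

TARGET = 'caption_title_and_reference_description'


def build_submission(ranked_by_id, max_per_id=5):
    """Flatten {id: [captions, best first]} into one row per (id, caption) pair."""
    rows = [
        {'id': row_id, TARGET: caption}
        for row_id, captions in ranked_by_id.items()
        for caption in captions[:max_per_id]  # keep at most 5 predictions per id
    ]
    return pd.DataFrame(rows, columns=['id', TARGET])


# Made-up ranked predictions for two ids (not real competition data)
demo_predictions = {
    0: ['caption A', 'caption B'],
    1: ['caption C', 'caption D', 'caption E', 'caption F', 'caption G', 'caption H'],
}
demo_submission = build_submission(demo_predictions)
print(demo_submission.shape)  # (7, 2)
```

The frame could then be written with `demo_submission.to_csv('submission.csv', index=False)`. Row order within each id carries the ranking, so the captions should already be sorted by relevance before calling the helper.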


### Prizes

The top three winning teams will receive Wikipedia-branded merchandise.

In [None]:
import os
import requests

# General packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px
import PIL.Image

from IPython.display import Image, display
import warnings
warnings.filterwarnings("ignore")

In [None]:
os.listdir('../input/wikipedia-image-caption/')

In [None]:
df = pd.read_csv('../input/wikipedia-image-caption/image_data_test/image_pixels/test_image_pixels_part-00000.csv', sep='\t', names=['image_url', 'b64_bytes', 'metadata_url'])
df

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
sub = pd.read_csv('../input/wikipedia-image-caption/sample_submission.csv')
sub

In [None]:
sub.shape

In [None]:
test_file = pd.read_csv('../input/wikipedia-image-caption/test.tsv', sep='\t')
test_file

In [None]:
# Count rows per language once and reuse the result for both the values and the labels
lang_counts = test_file['language'].value_counts()
fig = px.pie(values=lang_counts.values, names=lang_counts.index,
            title='Language Distribution', color_discrete_sequence=px.colors.sequential.RdBu, hole=.3
            )
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

In [None]:
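# Import the submodule explicitly; a bare `import urllib` does not guarantee urllib.request is loaded
import urllib.request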

In [None]:
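def get_links(df, num):
    """Return the first `num` image URLs from the dataframe."""
    return df.image_url[:num].values

links = get_links(df, 10)


def load_images(links):
    """Download each image and return the ones that load successfully as arrays."""
    images = []

    for link in links:
        try:
            # Save the response to a temporary file, then read it back with PIL
            with urllib.request.urlopen(link) as response:
                with open('./temp.jpg', 'wb') as f:
                    f.write(response.read())

            img = PIL.Image.open('./temp.jpg')
            img = np.asarray(img)
            images.append(img)
        except Exception as e:
            # Report and skip images that fail to download or decode
            print(f'Skipping {link}: {e}')
            continue
    return images

def display_images(images, title=None):
    """Show up to 10 images on a 2x5 grid."""
    f, ax = plt.subplots(2, 5, figsize=(18, 12))
    if title:
        f.suptitle(title, fontsize=30)

    for i, image in enumerate(images):
        ax[i//5, i%5].imshow(image)
        ax[i//5, i%5].axis('off')

    plt.show()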

In [None]:
images = load_images(links)

In [None]:
display_images(images)

In [None]:
links = df.image_url[20:30].values
images = load_images(links)
display_images(images)

In [None]:
links = df.image_url[30:40].values
images = load_images(links)
display_images(images)

In [None]:
links = df.image_url[40:50].values
images = load_images(links)
display_images(images)

In [None]:
links = df.image_url[50:60].values
images = load_images(links)
display_images(images)

#### Some images fail to download, leaving temp.jpg empty
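
One possible workaround is to skip the network and decode the `b64_bytes` column, which according to the data description already holds each image at 300px resolution. The sketch below is self-contained and uses a tiny synthetic image in place of a real row; `decode_b64_image` is a hypothetical helper, and it assumes the column contains standard base64 text of an image file that PIL can open.

```python
import base64
import io

import numpy as np
import PIL.Image


def decode_b64_image(b64_string):
    """Decode a base64 encoded image file into an RGB NumPy array."""
    raw = base64.b64decode(b64_string)
    with PIL.Image.open(io.BytesIO(raw)) as img:
        return np.asarray(img.convert('RGB'))


# Synthetic stand-in for one `b64_bytes` cell: a 4x3 red PNG encoded as base64
buffer = io.BytesIO()
PIL.Image.new('RGB', (4, 3), color=(255, 0, 0)).save(buffer, format='PNG')
demo_b64 = base64.b64encode(buffer.getvalue()).decode('ascii')

demo_array = decode_b64_image(demo_b64)
print(demo_array.shape)  # (3, 4, 3): height, width, channels
```

With the dataframe loaded earlier, something like `images = [decode_b64_image(s) for s in df.b64_bytes[:10]]` could replace the download step, assuming every row holds valid image data.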

### Work in progress...