# About the competition

## Wikipedia - Image/Caption Matching
### Retrieve captions based on images


### Description
A picture is worth a thousand words, yet sometimes a few will do. We all rely on online images for knowledge sharing, learning, and understanding. Even the largest websites are missing visual content and metadata to pair with their images. Captions and “alt text” increase accessibility and enable better search. The majority of images on Wikipedia articles, for example, don't have any written context connected to the image. Open models could help anyone improve accessibility and learning for all.

Current solutions rely on simple methods based on translations or page interlinks, which have limited coverage. Even the most advanced computer vision image captioning isn't suitable for images with complex semantics.

### Data
The objective of this competition is to predict the target caption_title_and_reference_description given information about an image. The targets for this competition are in multiple languages.

### Files
- train-{0000x}-of-00005.tsv - the training data (tab delimited)
- test.tsv - the test data; the objective is to predict the target caption_title_and_reference_description for each row id
- sample_submission.csv - a sample submission file in the correct format; note that multiple predictions (up to 5) are allowed for each id in the test data.
- image_data_test/
  - image_pixels/test_image_pixels_part-{0000x}.csv.gz
    - image_url: URL of the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
    - b64_bytes: base64-encoded bytes of the image file at a 300px resolution
    - metadata_url: URL of the Commons page of the image, e.g. https://commons.wikimedia.org/wiki/File:Hovden.jpg
  - resnet_embeddings/test_resnet_embeddings_part-{0000x}.csv.gz
    - image_url: URL of the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
    - embedding: a comma-separated list of 2048 float values
- image_data_train - Due to the size of the training image data (~275 GB), it is hosted separately and can be found here. Note that not all of the training observations have corresponding image data.
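
To make the two column formats above concrete, here is a minimal sketch of decoding one `b64_bytes` value and one `embedding` value. The helper names `decode_image_bytes` and `parse_embedding` and the sample values are made up for illustration; they are not competition data.

```python
import base64

def decode_image_bytes(b64_bytes):
    """Decode the base64 text of a b64_bytes cell into raw image bytes."""
    return base64.b64decode(b64_bytes)

def parse_embedding(embedding):
    """Split a comma-separated embedding cell into a list of floats."""
    return [float(value) for value in embedding.split(',')]

# Synthetic stand-ins for one row of each file (not real competition data)
fake_b64 = base64.b64encode(b'\xff\xd8\xff\xe0 fake jpeg bytes').decode('ascii')
fake_embedding = ','.join(['0.5'] * 2048)

raw = decode_image_bytes(fake_b64)
vector = parse_embedding(fake_embedding)
print(len(raw), len(vector))
```

With real rows, the decoded bytes could be wrapped in `io.BytesIO` and opened with `PIL.Image.open`, which avoids downloading the image again.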
 
*kaggle competitions download -c wikipedia-image-caption*

### Submission
Submissions will be evaluated using NDCG@5 (Normalized Discounted Cumulative Gain).

The submission should be a list of id,caption_title_and_reference_description pairs ranked from top to bottom according to their relevance (i.e., the top id is the most relevant caption_title_and_reference_description), with up to 5 predictions per id. Each line should be a single id,caption_title_and_reference_description pair.
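
As a rough illustration of the metric, the sketch below scores a single id under the assumption that exactly one caption is correct (binary relevance). In that case the ideal DCG is 1, so NDCG@5 reduces to 1 / log2(rank + 1) when the correct caption is in the top 5, and 0 otherwise. The function name `ndcg_at_5` and the sample captions are hypothetical, and the official scorer may differ in details.

```python
import math

def ndcg_at_5(ranked_captions, true_caption):
    """NDCG@5 for one id, assuming a single relevant caption."""
    for rank, caption in enumerate(ranked_captions[:5], start=1):
        if caption == true_caption:
            # DCG of one relevant item at this rank; the ideal DCG is 1
            return 1.0 / math.log2(rank + 1)
    return 0.0

predictions = ["a red bridge", "a stone church", "a mountain lake"]
print(ndcg_at_5(predictions, "a red bridge"))    # correct caption ranked first
print(ndcg_at_5(predictions, "a stone church"))  # correct caption ranked second
print(ndcg_at_5(predictions, "a city skyline"))  # correct caption missing
```

The takeaway is that placing the right caption second instead of first already costs more than a third of the score for that id.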

## Prizes
The top three winning teams will receive Wikipedia-branded merchandise.

# Importing the libs and data 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Standard library
import gc
import glob
import os
import re
import urllib.request  # load_images below calls urllib.request.urlopen
import warnings

# Third-party (the plotnine star import comes first so later imports take precedence)
from plotnine import *
import cv2
import folium
import matplotlib
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import PIL.Image
import plotly
import plotly.express as px
import plotly.graph_objs as go
import requests
import seaborn as sns
from folium.plugins import FastMarkerCluster, HeatMap
from geopy.geocoders import Nominatim
from IPython.display import Image, display
from plotly import tools
from plotly.offline import init_notebook_mode, iplot, plot
from tqdm.notebook import tqdm
from wordcloud import STOPWORDS, WordCloud

tqdm.pandas()
init_notebook_mode(connected=True)

# Silence warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
os.listdir('../input/wikipedia-image-caption/')

**Loading the test data**

In [None]:
test_file = pd.read_csv('../input/wikipedia-image-caption/test.tsv', sep='\t')
test_file.head(1)

In [None]:
test_captions = pd.read_csv('../input/wikipedia-image-caption/test_caption_list.csv')
test_captions.head()

**Loading the wiki data**

In [None]:
# Read the first part of the test image pixels once
wiki_df = pd.read_csv('../input/wikipedia-image-caption/image_data_test/image_pixels/test_image_pixels_part-00000.csv', 
                      sep='\t', names=['image_url', 'b64_bytes', 'metadata_url'])
df = wiki_df  # alias used by the image-loading cells in the EDA section
wiki_df.head(1)

**Loading the Submission data**

In [None]:
sub_file = pd.read_csv('../input/wikipedia-image-caption/sample_submission.csv')
sub_file.head(1)
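
The long format described in the Submission section (one `id,caption_title_and_reference_description` pair per line, at most 5 per id, best first) could be produced along these lines. This is a sketch using only the standard library; `build_submission_rows` and the sample ids and captions are hypothetical.

```python
import csv
import io

def build_submission_rows(ranked_predictions, max_per_id=5):
    """Flatten {id: [captions, best first]} into (id, caption) rows."""
    rows = []
    for row_id, captions in ranked_predictions.items():
        # Keep only the top `max_per_id` captions, preserving their order
        for caption in captions[:max_per_id]:
            rows.append((row_id, caption))
    return rows

# Made-up predictions: id 1 has six candidates, so the last one is dropped
ranked_predictions = {
    0: ["caption a", "caption b"],
    1: ["caption c", "caption d", "caption e", "caption f", "caption g", "caption h"],
}
rows = build_submission_rows(ranked_predictions)

# Write to an in-memory buffer; a real run would open 'submission.csv' instead
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "caption_title_and_reference_description"])
writer.writerows(rows)
print(buffer.getvalue())
```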

**Shape of the data we have!**

In [None]:
print("the shape of the wiki file is : " , wiki_df.shape)
print("the shape of the test file is : ", test_file.shape)
print("the shape of the sub file is  : " ,sub_file.shape)

# EDA

## Loading the images and exploring the images 

In [None]:
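def get_links(df, num):
    """Return the first `num` image URLs as an array."""
    return df.image_url[:num].values

links = get_links(df, 10)


def load_images(links):
    """Download each URL and return the images that loaded, as NumPy arrays."""
    images = []
    
    for link in links:
        try:
            with urllib.request.urlopen(link) as response:
                with open('./temp.jpg', 'wb') as f:
                    f.write(response.read())

            img = PIL.Image.open('./temp.jpg')
            images.append(np.asarray(img))
        except Exception:
            # Skip links that fail to download or decode
            continue
    return images

def display_images(images, title=None, ncols=5): 
    """Show the images in a grid with `ncols` columns and as many rows as needed."""
    nrows = max(1, -(-len(images) // ncols))  # ceiling division
    f, ax = plt.subplots(nrows, ncols, figsize=(18,12), squeeze=False)
    if title:
        f.suptitle(title, fontsize = 30)

    # Hide every axis first so unused grid cells stay blank
    for axis in ax.ravel():
        axis.axis('off')

    for i, image in enumerate(images):
        ax[i//ncols, i%ncols].imshow(image) 

    plt.show()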

In [None]:
images = load_images(links)

In [None]:
display_images(images)

In [None]:
links = df.image_url[20:32].values
images = load_images(links)
display_images(images)
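
Because `load_images` silently skips any link that fails, a grid can come back with fewer images than requested. One possible cause, stated here as an assumption rather than something verified in this notebook, is that Wikimedia's servers may reject requests that lack a descriptive User-Agent header. The sketch below only builds such a request and does not touch the network; `build_image_request` and the User-Agent string are hypothetical.

```python
import urllib.request

def build_image_request(url, user_agent="wiki-caption-eda/0.1 (example contact)"):
    """Wrap a URL in a Request object that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Example URL taken from the file description at the top of the notebook
request = build_image_request(
    "https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg"
)
print(request.full_url)
print(request.get_header("User-agent"))  # Request stores header names capitalized
```

Passing `request` to `urllib.request.urlopen` in place of the bare URL would send the header with the download.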

## Exploring the test file
### Languages in the test file

The plots in this section stopped working after the competition data was updated, so the cells below are commented out.

In [None]:
# from matplotlib import rcParams

# a = sns.displot(x='language', data=test_file,color='#73C6B6',height=8, aspect=20/8)

In [None]:
# import plotly.graph_objects as go    

# fig = go.Figure(
#     data=[ go.Bar(x=test_file['language'].value_counts().index, 
#             y=test_file['language'].value_counts().values,
#             text=test_file['language'].value_counts().values,
#             textposition='auto',name='Count',
#            marker_color='#73C6B6')],
#     layout_title_text="Language Distribution : using plotly v2"
# )
# fig.show()

In [None]:
# import matplotlib.pyplot as plt
# import squarify    # pip install squarify (algorithm for treemap)
# plt.figure(figsize=(25,8))
# squarify.plot(sizes=test_file['language'].value_counts().values, 
#               label=test_file['language'].value_counts().index, 
#               color=["#73C6B6","lightgreen","cyan", "c"],
#               alpha=.8 )
# plt.title("A square graph for the same :D")
# plt.axis('off')
# plt.show()

## A closer look at English articles, using just the first 10,000 rows

In [None]:
df = pd.read_csv('../input/wikipedia-image-caption/train-00000-of-00005.tsv', sep='\t',nrows=10000)
df.head(1)

In [None]:
df = df[df['language']=="en"]
df.head(1)

In [None]:
def display_images(images, title=None): 
    f, ax = plt.subplots(1,1, figsize=(18,12))
    if title:
        f.suptitle(title, fontsize = 30)

    for i, image_id in enumerate(images):
        ax.imshow(image_id) 
        ax.axis('off')

    plt.show()

links = df.image_url[0:1]
images = load_images(links)
print("The title of the image is ", df.page_title.iloc[0])
display_images(images)



In [None]:
import plotly.graph_objects as go    

fig = go.Figure(
    data=[ go.Bar(x=df['page_changed_recently'].value_counts().index, 
            y=df['page_changed_recently'].value_counts().values,
            text=df['page_changed_recently'].value_counts().values,
            textposition='auto',name='Count',
           marker_color='#73C6B6')],
    layout_title_text="Has the page been changed recently?"
)
fig.show()

In [None]:
import plotly.graph_objects as go    

fig = go.Figure(
    data=[ go.Bar(x=df['mime_type'].value_counts().index, 
            y=df['mime_type'].value_counts().values,
            text=df['mime_type'].value_counts().values,
            textposition='auto',name='Count',
           marker_color='#73C6B6')],
    layout_title_text="What is the distribution of the various file types?"
)
fig.show()

In [None]:
import plotly.graph_objects as go    

fig = go.Figure(
    data=[ go.Bar(x=df['is_main_image'].value_counts().index, 
            y=df['is_main_image'].value_counts().values,
            text=df['is_main_image'].value_counts().values,
            textposition='auto',name='Count',
           marker_color='#73C6B6')],
    layout_title_text="Is the image the main image of the article?"
)
fig.show()

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'},colormap='Greens',background_color='white').generate(" ".join(df['page_title'].astype(str)))
plt.figure(figsize=(16, 10))
plt.title('A WordCloud of the various pages in the file',fontsize=20,pad=40)
plt.imshow(cloud)
plt.axis('off')

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'},colormap='Greens',background_color='white').generate(" ".join(df['section_title'].astype(str)))
plt.figure(figsize=(16, 10))
plt.title('A WordCloud of the various sections of the pages in the file',fontsize=20,pad=40)
plt.imshow(cloud)
plt.axis('off')

## Now looking at some German-language data

In [None]:
df = pd.read_csv('../input/wikipedia-image-caption/train-00000-of-00005.tsv', sep='\t',nrows=10000)
# Keep only the German rows
df = df[df['language']=="de"]
df.head(1)

In [None]:
# Reuses the single-image display_images defined in the English section
links = df.image_url[0:1]
images = load_images(links)
print("The title of the image is ", df.page_title.iloc[0])
display_images(images)



In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'},colormap='Greens',background_color='white').generate(" ".join(df['page_title'].astype(str)))
plt.figure(figsize=(16, 10))
plt.title('A WordCloud of the various pages in the file',fontsize=20,pad=40)
plt.imshow(cloud)
plt.axis('off')

Next, I'll try to read the whole TSV.
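
One way to look at every training shard without holding a full file in memory is to stream the rows and keep only running counts. The sketch below uses the standard library's `csv` module instead of the `pd.read_csv` calls used elsewhere in this notebook, so quoting edge cases may be handled slightly differently. `count_languages` is a hypothetical helper, and the two tiny shards are generated on the fly so the example runs anywhere.

```python
import csv
import os
import tempfile
from collections import Counter

def count_languages(paths, column="language"):
    """Stream tab-separated files row by row and count values in `column`."""
    csv.field_size_limit(2**31 - 1)  # long text fields can exceed the default limit
    counts = Counter()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                counts[row[column]] += 1
    return counts

# Build two tiny synthetic shards (not competition data) to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for shard, languages in enumerate([["en", "de"], ["en"]]):
        path = os.path.join(tmp, f"shard-{shard}.tsv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(["language", "page_title"])
            for language in languages:
                writer.writerow([language, "Example"])
        paths.append(path)
    counts = count_languages(paths)

print(counts.most_common())
```

On Kaggle, the real shard paths could be collected with `sorted(glob.glob('../input/wikipedia-image-caption/train-*.tsv'))` and passed to the same function.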

If you have any suggestions for the notebook, please comment; that would be helpful. Also, please upvote if you liked it! Thank you!!

Some of my other works:

* [TPS- APR](https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model) 
* [HEART ATTACKS](https://www.kaggle.com/udbhavpangotra/heart-attacks-extensive-eda-and-visualizations) 
* [YOUTUBE DATA EXPLORATION](https://www.kaggle.com/udbhavpangotra/what-do-people-use-youtube-for-in-great-britain)
* [TPS MAY](https://www.kaggle.com/udbhavpangotra/tps-may-21-extensive-eda-catboost-shap)
* [COVID-19 DIGITAL LEARNING](https://www.kaggle.com/udbhavpangotra/how-did-covid-19-impact-digital-learning-eda)
* [TPS - SEPT](https://www.kaggle.com/udbhavpangotra/extensive-eda-baseline-shap)

* [also try this dataset ReliefWeb Crisis Figures Data](https://www.kaggle.com/udbhavpangotra/reliefweb-crisis-figures-data)

In [None]:
%%html
<marquee style='width: 90% ;height:70%; color: #45B39D ;'>
    <b>Do UPVOTE if you like my work, I will be adding some more content to this kernel post understanding the files :) </b></marquee>

Credits to the people whose earlier EDA helped me with mine!

1.  [KALILUR RAHMAN](https://www.kaggle.com/kalilurrahman/wikimedia-image-text-matching-eda), I loved the square plot! 
2.  [RADMIR ZOSIMOV](https://www.kaggle.com/hijest/wikipedia-image-caption-matching-starter-eda)
3.  [MARÍLIA PRATA](https://www.kaggle.com/mpwolke/wikimedia-urllib)