# Problem we want to address

Hello Kagglers! I hope everyone is busy with the competitions. This competition is very interesting (and a bit hard TBH). One thing I noticed while looking at the data is that the `TSV` files are huge. If you try to read one of them with `pandas`, it is highly likely that the kernel will run out of memory and eventually crash.

So, we will do something clever here: extract the relevant information from these TSVs without blowing up the memory. Let's start!

# Import the required libraries

In [None]:
import os
import gc
import cv2
import time
import requests
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path
import matplotlib.pyplot as plt

import dask
from dask import delayed
import dask.dataframe as dd

sns.set()
seed = 1234
np.random.seed(seed)

%config IPCompleter.use_jedi = False

# Dataset

In [None]:
# Path to the directory where data is stored
data_path = Path("../input/wikipedia-image-caption/")

# Selecting only a subset of columns as there
# are many columns that we don't need.
columns_to_select = ["language",
                     "page_url",
                     "image_url",
                     "caption_title_and_reference_description"
                    ]

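# Get the list of all tsv files we need to read for training
# (sorted() already returns a list)
tsvs = sorted(data_path.glob("*.tsv"))

# Remove test tsv file as we don't need it for now.
# Filtering by name avoids a ValueError if the file is absent.
tsvs = [tsv for tsv in tsvs if tsv.name != "test.tsv"]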

print("Number of TSV files found: ", len(tsvs))

# Data Processing

We will be using Dask for reading the data. Why, and what else happens in the processing step?
1. Dask Dataframe can read data in chunks/partitions. When you read data with Dask, you get a lazy Dask DataFrame rather than the data itself; nothing is loaded until you call `compute()`. This makes it easier to work with data that doesn't fit into RAM.
2. The files are tab-separated, so we need to pass this info while reading the file. Bonus point: Dask Dataframe has almost the same API as pandas, so most parameters you pass to `pandas.read_csv` can be passed here as well. There are some differences though because of the parallel stuff that Dask does. You can read about it in detail [here](https://docs.dask.org/en/latest/dataframe.html).
3. I am selecting only a few columns that seem relevant to me, but you can add/remove columns if you want.
4. I will be dropping the null values as well. Again, just a choice. It's up to you how you want to deal with the missing data.
5. We will convert the `dask dataframe` to a `pandas dataframe` afterward and save it in `feather` format for future use. Note that this conversion still loads the selected columns of a file into memory, which is why we keep only the columns we need. Why feather? Because it is much faster to read from feather than from csv/tsv, and in most cases it consumes less disk space as well.

In [None]:
for tsv in tsvs:
    # We will record the time taken to process
    # each file to convert it into desired format
    start_time = time.time()
    
    # 1. Name of the file to read
    name = tsv.name
    
    # 2. Use Dask dataframe
    df = dd.read_csv(data_path / name,
                     sep="\t",
                     quoting=3,
                     escapechar="\n",
                     usecols=columns_to_select,
                     on_bad_lines="skip",
                     dtype="string"
                    )
    # 3. Dropping the null values for now
    df = df.dropna()
    
    # 4. Convert to pandas dataframe now
    df = df.compute()
    
    # 5. Reset the index
    df = df.reset_index(drop=True)
    
    # 6. Save to feather format for future use
    df.to_feather(name.split(".tsv")[0])
    
    # 7. Del the dataframe and do garbage collection
    del df
    gc.collect()
    
    print(f"File {name} processed and saved successfully in feather format", end=" => ")
    print(f"Time taken: {time.time() - start_time:.2f} seconds")
    print("")
    print("="*50)

# Sanity-check

Let's load one of the processed files to see how long it takes to read. We will also look at a few data entries.

In [None]:
# File path
filepath = "train-00000-of-00005"

# Read using pandas now
start_time = time.time()
df = pd.read_feather(filepath)
print(f"Time taken to read {filepath} in feather format: {time.time()-start_time:.2f} seconds")
print("")

df.head()
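
The cell above only times the read. If you also want a rough idea of the memory footprint, one option is `DataFrame.memory_usage(deep=True)`. The sketch below uses a small made-up dataframe, and `memory_in_mb` is a hypothetical helper name, not something defined elsewhere in this kernel; you could call it on the `df` loaded above in the same way.

```python
import pandas as pd


def memory_in_mb(frame: pd.DataFrame) -> float:
    """Return the approximate in-memory size of a dataframe in megabytes."""
    # deep=True also counts the string contents, not just the pointers
    total_bytes = frame.memory_usage(deep=True).sum()
    return float(total_bytes) / 1024 ** 2


# Small synthetic dataframe (hypothetical values, for illustration only)
sample = pd.DataFrame(
    {
        "language": ["en", "fr", "de"],
        "page_url": ["u1", "u2", "u3"],
    },
    dtype="string",
)

print(f"Approximate memory usage: {memory_in_mb(sample):.6f} MB")
```

Keep in mind this reports the size of the dataframe once loaded, not the peak memory used while reading the file.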

Hooray! So, we just have to read these processed files now without waiting for too long. Hope you enjoyed this simple kernel. Moaarrrr coming soon! 