# Introduction

In this notebook, I'll explore the texts by turning each one into an image, coloring every word by its label. I don't know whether this has any particular use case in this competition, but I thought it would be fun to implement, and maybe someone can use it. Anyway, let's get going.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if dirname.split('/')[-1] not in ['train', 'test']:
            print(os.path.join(dirname, filename))

In [None]:
train_df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')
train_df.head()

# How can I implement this?

First, I need to decide how many words each image should hold. One option is to impose a cutoff on the length of texts, as is done when using transformers, so maybe this cutoff can be 1024 words. The other option is to size every image to fit the longest text.

So let's check the length of texts to get an idea of how to proceed.

In [None]:
def read_essay_txt(essay_id, path='train'):
    essay_file_path = f"../input/feedback-prize-2021/{path}/{essay_id}.txt"
    with open(essay_file_path, 'r') as essay_file:
        return essay_file.read()

In [None]:
texts = {id_: read_essay_txt(id_) for id_ in train_df.id.unique()}
texts_len = {id_: len(text.split()) for id_, text in texts.items()}

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


sns.histplot(list(texts_len.values()));
print('Mean Length:', np.mean(list(texts_len.values())))
print('Length > 1024: {:.2f}%'.format(np.mean([len_ > 1024 for len_ in texts_len.values()]) * 100))

It seems that the mean length is 421 tokens, and almost 1% of texts are longer than 1024 tokens. **So I'll stick with 1024**; the few longer texts are simply truncated.
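As a quick sanity check on the cutoff, here is a small sketch that is not part of the original analysis. It shows why 1024 is convenient: it is a perfect square, so 1024 tokens fill a 32 × 32 grid exactly. The helper names `square_side` and `share_over_cutoff` and the toy lengths are hypothetical, made up for illustration.

```python
import math


def square_side(max_len):
    """Return the side of the square grid that holds exactly max_len tokens."""
    side = math.isqrt(max_len)
    if side * side != max_len:
        raise ValueError(f"{max_len} is not a perfect square")
    return side


def share_over_cutoff(lengths, cutoff):
    """Fraction of texts with more than `cutoff` tokens (these get truncated)."""
    lengths = list(lengths)
    return sum(n > cutoff for n in lengths) / len(lengths)


# Toy lengths, made up for illustration; in the notebook use texts_len.values()
toy_lengths = [120, 421, 800, 1500]

print(square_side(1024))
print(share_over_cutoff(toy_lengths, 1024))
```

A cutoff that is not a perfect square (1000, for example) would need padding before it could be reshaped into a square image, which is one more reason to prefer 1024.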

## Getting RGB values

Now I need to assign an RGB value to each label, then build a 3D array of shape `(32, 32, 3)` for each text, which I can then visualize.

In [None]:
train_df.discourse_type.unique()

In [None]:
from matplotlib import colors

label_color_dict = {
    'Lead': 'royalblue',
    'Position': 'violet',
    'Evidence': 'crimson',
    'Claim': 'magenta',
    'Counterclaim': 'darkorange',
    'Rebuttal': 'lime',
    'Concluding Statement': 'red'
}


label_rgb_dict = {label: colors.to_rgb(color) for label, color in label_color_dict.items()}

label_rgb_dict

Now that we have the RGB values, the next part is supposed to be easy. For each text we create a 2D array of shape `(1024, 3)`, then reshape it into a 3D array of shape `(32, 32, 3)`.

The easiest way I can think of is to use the `predictionstring` of each labeled segment to create a dict whose keys are token indices and whose values are labels. If a token isn't in the dict, we skip it and don't set its value in the array, so it remains `(0, 0, 0)`, which is black.
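To make the mapping concrete, here is a minimal, self-contained sketch on a toy example. The segments, the two-color palette, and the helper name `segments_to_image` are hypothetical and only for illustration. It assumes each `predictionstring` is a space-separated list of token indices, which is how this notebook's code treats it.

```python
import numpy as np

# Hypothetical palette: RGB triples in [0, 1]
toy_rgb = {"Lead": (0.0, 0.0, 1.0), "Claim": (1.0, 0.0, 1.0)}

# Hypothetical annotations: (label, predictionstring) pairs
toy_segments = [("Lead", "0 1 2"), ("Claim", "4 5")]


def segments_to_image(segments, rgb, max_len=9):
    """Color each labeled token index, then fold the tokens into a square image."""
    side = int(np.sqrt(max_len))
    if side * side != max_len:
        raise ValueError("max_len must be a perfect square")
    arr = np.zeros((max_len, 3))  # every token starts black
    for label, prediction_string in segments:
        for token in prediction_string.split():
            idx = int(token)
            if idx < max_len:  # tokens beyond the cutoff are dropped
                arr[idx] = rgb[label]
    return arr.reshape(side, side, 3)


img = segments_to_image(toy_segments, toy_rgb)
print(img.shape)  # (3, 3, 3)
```

Because the reshape is row-major, token `i` lands at row `i // side`, column `i % side`, so each image reads like the essay: left to right, top to bottom. In the toy example, token 3 has no label and stays black.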

In [None]:
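def get_text_arr(text_id, max_len=1024):
    # max_len must be a perfect square so the array reshapes into a square image
    side = int(np.sqrt(max_len))
    assert side * side == max_len, 'max_len must be a perfect square'

    # One RGB triple per token; unlabeled tokens stay (0, 0, 0), i.e. black
    text_arr = np.zeros((max_len, 3))

    # Map each token index to the discourse type of the segment that covers it
    token_label_dict = {}
    rows = train_df.query('id == @text_id')[['discourse_type', 'predictionstring']]
    for _, row in rows.iterrows():
        for token in row['predictionstring'].split():
            token_label_dict[int(token)] = row['discourse_type']

    # Color the labeled tokens; tokens at index >= max_len are dropped
    for i, label in token_label_dict.items():
        if i < max_len:
            text_arr[i] = label_rgb_dict[label]

    return text_arr.reshape(side, side, 3)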

In [None]:
text_ids = train_df.id.unique()

ncols = 5
nrows = 10
max_len = 1024

for nrow in range(nrows):
    fig, axes = plt.subplots(1, ncols, figsize=(20, 5))
    for i, text_id in enumerate(text_ids[nrow*ncols:(nrow*ncols)+ncols]):
        axes[i].imshow(get_text_arr(text_id, max_len))
        axes[i].set_title(text_id)

### And that's it. Thanks for reading.