# What's this?
This notebook shows how I load the training texts and store them in a `pandas.DataFrame` with the columns described below.  
The resulting DataFrame is saved as a pickle file.

| Column name      | Description                                                                                                                                     |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `section_title`  | The section title.<br>Loaded from the given JSON files.                                                                                         |
| `text`           | The section text.<br>Loaded from the given JSON files.                                                                                          |
| `section_num`    | The zero-based index of the section within its document.                                                                                        |
| `entities`       | The positions of detected dataset mentions.<br>List of tuples (start_position, end_position, 'DATASET').<br>This is spaCy's data format.         |
| `spans`          | The positions of detected dataset mentions.<br>List of tuples (start_position, end_position).<br>Same content as `entities` without the label.   |
| `matched_labels` | The list of matched dataset labels, as they appear in the text.<br>Same order as `entities` and `spans`.                                        |
| `is_label_exist` | Whether the section contains at least one match. (Bool)                                                                                         |
| `document_id`    | The ID of the document that the section belongs to.                                                                                             |
| `true_labels`    | The `dataset_label` values listed for the document in `train.csv`.                                                                              |
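
To make the `entities`, `spans`, and `matched_labels` columns concrete, here is a minimal sketch of how they could be built for a single section. The toy text, the two labels, and the helper name `find_dataset_mentions` are made up for illustration. The sketch also escapes the labels with `re.escape` and tries longer labels first, which is an assumption about the intended matching behaviour rather than a copy of the notebook code.

```python
import re


def find_dataset_mentions(text, labels):
    """Return (entities, spans, matched_labels) for one section text."""
    # Longest labels first, so a long label is not cut short by one of its prefixes.
    ordered = sorted({label.lower() for label in labels}, key=len, reverse=True)
    # Escape each label so characters such as parentheses are matched literally.
    pattern = '|'.join(re.escape(label) for label in ordered)

    entities, spans, matched_labels = [], [], []
    for match in re.finditer(pattern, text.lower()):
        entities.append((match.start(), match.end(), 'DATASET'))
        spans.append((match.start(), match.end()))
        # Slice the original text so the label keeps its original casing.
        matched_labels.append(text[match.start():match.end()])
    return entities, spans, matched_labels


# Hypothetical section text and labels, not taken from the competition data.
text = 'We used the ADNI cohort and the National Education Longitudinal Study.'
labels = ['ADNI', 'National Education Longitudinal Study']

entities, spans, matched_labels = find_dataset_mentions(text, labels)
print(entities)
print(spans)
print(matched_labels)
```

Each tuple in `spans` can be used to slice `text` and recover the corresponding item in `matched_labels`, which is a quick way to sanity-check a row.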

# How to use this DataFrame?
You can use the DataFrame without rerunning anything by adding this notebook's output as a data source in your own notebook.

- Click "+ Add data" at the top right of Notebook edit page.
- Click "Notebook output files".
- In the search box, type "for-beginners-load-all-datas-as-dataframe".
- If this notebook appears in the results, click "Add data". (If it does not, as a last resort you can upvote this notebook and then find it under "Your Favorites".)

Then run the following code.

```python
import pandas as pd
train_text_df = pd.read_pickle('../input/for-beginners-read-train-texts-as-dataframe/train_text_df.pkl')
```

Now the DataFrame is available as `train_text_df`. Alternatively, you can copy and paste the code below and run it yourself.  
Note that the code takes about 10 minutes to run in the Kaggle environment.
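
Once the DataFrame is loaded, a common first step is to keep only the sections that contain a match and to put one matched label on each row. The sketch below uses a tiny hand-made DataFrame with the same column names in place of the real pickle, so every value in it is illustrative only.

```python
import pandas as pd

# Tiny stand-in for train_text_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    'section_title': ['Introduction', 'Data'],
    'text': ['No dataset is mentioned here.', 'We used the ADNI cohort.'],
    'section_num': [0, 1],
    'entities': [[], [(12, 16, 'DATASET')]],
    'spans': [[], [(12, 16)]],
    'matched_labels': [[], ['ADNI']],
    'is_label_exist': [False, True],
    'document_id': ['doc-0', 'doc-0'],
    'true_labels': [['ADNI'], ['ADNI']],
})

# Keep only the sections where at least one label was matched.
labeled_df = toy_df[toy_df['is_label_exist']]

# Expand the list column so that each matched label gets its own row.
per_match_df = labeled_df.explode('matched_labels', ignore_index=True)
print(per_match_df[['document_id', 'section_num', 'matched_labels']])
```

With the real data, replace `toy_df` with `train_text_df`.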

If there are any problems, please let me know. Good luck! 😉

In [None]:
import pandas as pd
import re
import pathlib
import pickle
import json

from collections import defaultdict
from tqdm import tqdm

In [None]:
input_path = pathlib.Path('../input/coleridgeinitiative-show-us-the-data')
train_df = pd.read_csv(input_path / 'train.csv')
submission_df = pd.read_csv(input_path / 'sample_submission.csv')
train_files = input_path.glob('train/*.json')
test_files = input_path.glob('test/*.json')

In [None]:
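def clean_text(text: str) -> str:
    """
    Lowercase the text and replace each run of non-alphanumeric characters
    with a single space, as described on the competition's evaluation page:
    https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/overview/evaluation
    """
    
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(text).lower())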

In [None]:
tmp_1 = [x.lower() for x in train_df['dataset_label'].unique()]
tmp_2 = [x.lower() for x in train_df['dataset_title'].unique()]
tmp_3 = [x.lower() for x in train_df['cleaned_label'].unique()]

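existing_labels = set(tmp_1 + tmp_2 + tmp_3)
# Escape regex metacharacters (e.g. parentheses) so that labels match literally,
# and try longer labels first so a label is not cut short by one of its prefixes.
matching_text = '|'.join(
    re.escape(label) for label in sorted(existing_labels, key=len, reverse=True)
)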

In [None]:
# This loop takes about 8 minutes.
train_text_dic = defaultdict(list)
test_text_dic = defaultdict(list)
# Re-create the generator, because a glob generator can only be iterated once.
train_files = input_path.glob('train/*.json')

# Load all training files.
for filepath in tqdm(train_files):
    
    with open(filepath, 'rb') as f:
        datas = json.load(f)
    
    for i, data in enumerate(datas):
        train_text_dic['section_title'].append(data['section_title'])
        train_text_dic['text'].append(data['text'])
        train_text_dic['section_num'].append(i)
        matches = re.finditer(matching_text, data['text'].lower())
        
        # Record the position and text of every match against the known labels.
        entities = []
        spans = []
        matched_labels = []
        for match in matches:
            entities.append((match.start(), match.end(), 'DATASET'))
            spans.append((match.start(), match.end()))
            matched_labels.append(data['text'][match.start():match.end()])
        
        train_text_dic['entities'].append(entities)
        train_text_dic['spans'].append(spans)
        train_text_dic['matched_labels'].append(matched_labels)
        
        train_text_dic['is_label_exist'].append(len(matched_labels) != 0)
    
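    fileid = filepath.stem
    # Use len(datas) rather than the loop variable, so that a file with no
    # sections cannot reuse a stale count from the previous file.
    train_text_dic['document_id'] += [fileid] * len(datas)
    
    true_labels = list(train_df.query('Id == @fileid')['dataset_label'])
    train_text_dic['true_labels'] += [true_labels] * len(datas)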
    
# # for test_files (without matching detection)
# for filepath in test_files:
#     with open(filepath, 'rb') as f:
#         datas = json.load(f)
    
#     for i, data in enumerate(datas):
#         test_text_dic['section_title'] += [data['section_title']]
#         test_text_dic['text'] += [data['text']]
#         test_text_dic['section_num'] += [i]
    
#     fileid = filepath.stem
#     test_text_dic['document_id'] += [fileid] * (i + 1)

train_text_df = pd.DataFrame(train_text_dic)
# test_text_df = pd.DataFrame(test_text_dic)

train_text_df.to_pickle('train_text_df.pkl')
# test_text_df.to_pickle('test_text_df.pkl')

In [None]:
train_text_df