Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Prepare the Dataset for FastText

This notebook demonstrates how to prepare the dataset used for FastText.
The data files are organized as follows:
```
repository
    data
        data_for_pipeline
            data.txt
            label.txt
            word_to_index.json
        data_for_batch_inference
            file1
            file2
            ...
            fileN
```

## Step 1
Following the structure above, create three directories: ```data```, and inside it ```data_for_pipeline``` and ```data_for_batch_inference```.

## Step 2
### Prepare the files for fasttext_pipeline.ipynb
1. ```data.txt```. Each line of ```data.txt``` contains a text and its label, separated by a tab character (```\t```).
For example:    

```sentence1``` **\t** ```label1```    

```sentence2``` **\t** ```label2```    

2. ```label.txt```. Each line of ```label.txt``` contains one unique label used in ```data.txt```. For example, if there are four kinds of labels in your dataset, then ```label.txt``` looks like:
```
label1
label2
label3
label4
```

3. ```word_to_index.json```. This file contains key-value pairs in which the key is a word and the value is that word's index. Not every word that occurs in ```data.txt``` needs to be included; only the most representative ones should be kept. You can prepare this file yourself, or use the simple method provided in this notebook, which selects words by their frequency of occurrence.
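
For item 2 above, the unique labels can be derived from the rows of ```data.txt``` instead of being typed by hand. The following is a minimal sketch under stated assumptions: the helper name `collect_labels` and the sample rows are hypothetical, and keeping labels in first-seen order is an arbitrary choice, since the notebook does not say whether the order in ```label.txt``` matters.

```python
def collect_labels(lines):
    """Return the unique labels, in first-seen order, from "text<TAB>label" lines."""
    labels = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        parts = line.split('\t')
        if len(parts) != 2:
            # Report malformed rows early rather than producing a wrong label
            raise ValueError(f'line {line_no}: expected "text<TAB>label", got {line!r}')
        label = parts[1]
        if label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


# Hypothetical sample rows standing in for the contents of data.txt
sample_lines = [
    'sentence1\tlabel1\n',
    'sentence2\tlabel2\n',
    'sentence3\tlabel1\n',
]
labels = collect_labels(sample_lines)
print(labels)
```

With a real dataset, the same function could be called on the open ```data.txt``` file object, and the result written to ```label.txt``` with one label per line.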

### Prepare the files for fasttext_batch_inference.ipynb
You can choose the file names yourself, for example file1, file2, and so on.

Each file contains one text for which you want a prediction. Remove any line breaks beforehand so that the text occupies a single line.
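
One possible way to put a text on a single line is sketched below. The helper name `to_single_line` is hypothetical. Note that this version also collapses repeated spaces into one, which is an assumption beyond what the notebook asks for; it is convenient here because the later cells tokenize on single spaces.

```python
def to_single_line(text):
    """Collapse every run of whitespace, including line breaks, into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return ' '.join(text.split())


# Hypothetical multi-line input mixing Unix and Windows line endings
raw = 'first line\nsecond line\r\nthird   line\n'
flat = to_single_line(raw)
print(flat)
```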

> **Tip**
If you don't want to prepare your own data, you can use the dataset we have prepared in advance and go directly to fasttext_pipeline.ipynb, fasttext_realtime_inference.ipynb, and fasttext_batch_inference.ipynb.

The outline of this notebook is as follows:

- Create word_to_index.json according to the word frequency.
- Calculate the average length of the text.
- Prepare the data for batch inference.
- Copy files for unittest.

### Create word_to_index.json according to the word frequency

In [1]:
import json
# Count the word frequency
word_count = {}
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        text = line.split('\t')[0]
        for word in text.split(' '):
            word_count[word] = word_count.get(word,0)+1
word_count_list = sorted(word_count.items(), key=lambda x : x[1], reverse=True)

# Keep the most frequent words so that word_to_index.json has vocab_size
# entries in total, including the two special tokens [PAD] and [UNK]
vocab_size = 10000
word_to_index = {"[PAD]": 0, "[UNK]": 1}
index = 2
for w_c in word_count_list:
    if index == vocab_size:
        break
    word = w_c[0]
    word_to_index[word] = index
    index +=1
with open('data/data_for_pipeline/word_to_index.json', 'w', encoding='utf-8') as f:
    json.dump(word_to_index, f)
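
To illustrate what the two special tokens are for, here is a sketch of how a mapping like this is commonly used to turn a text into a fixed-length list of indices. This is an assumption about downstream usage, not the pipeline's actual code: the helper name `encode` and the toy vocabulary are hypothetical, and `max_length` stands for the length limit discussed in the next section.

```python
def encode(text, word_to_index, max_length):
    """Map words to indices, then truncate or pad to exactly max_length."""
    unk = word_to_index['[UNK]']  # index used for words missing from the vocabulary
    pad = word_to_index['[PAD]']  # index used to fill short texts
    ids = [word_to_index.get(word, unk) for word in text.split(' ')]
    ids = ids[:max_length]  # truncate long texts
    return ids + [pad] * (max_length - len(ids))  # pad short texts


# Hypothetical toy vocabulary in the same shape as word_to_index.json
toy_vocab = {'[PAD]': 0, '[UNK]': 1, 'the': 2, 'cat': 3}
encoded = encode('the cat sat', toy_vocab, max_length=5)
print(encoded)  # [2, 3, 1, 0, 0]
```

Here `sat` is not in the toy vocabulary, so it maps to the `[UNK]` index, and the two trailing zeros are `[PAD]` entries.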

### Calculate the average length of the text
Our model has a parameter named ```max_length``` that controls the length of each text. Let ```avg_length``` be the average length of the texts, measured in words. To keep ```max_length``` from differing too much from ```avg_length```, we suggest setting ```max_length``` equal to ```avg_length```.

> **Tip**
If the variance of the text length is too large, use the median of the text length rather than ```avg_length```.

In [3]:
total_length = 0
line_num = 0
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        line_num += 1
        text = line.split('\t')[0]
        total_length += len(text.split(' '))
avg_length = int(total_length/line_num)
avg_length

10
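
For the case mentioned in the tip, the median can be computed alongside the mean using Python's standard `statistics` module. This is a minimal sketch: the helper name `length_stats` and the sample rows are hypothetical, and the texts are tokenized on single spaces as in the cell above.

```python
import statistics


def length_stats(lines):
    """Return the mean, median, and standard deviation of text lengths in words."""
    lengths = [
        len(line.split('\t')[0].split(' '))
        for line in lines
        if line.strip()  # ignore blank lines
    ]
    return {
        'mean': statistics.mean(lengths),
        'median': statistics.median(lengths),
        'stdev': statistics.pstdev(lengths),  # population standard deviation
    }


# Hypothetical rows: three short texts and one much longer one
sample_lines = [
    'a b c\tlabel1\n',
    'a b\tlabel2\n',
    'a b c d\tlabel1\n',
    'a b c d e f g h i j k l m n o\tlabel2\n',
]
stats = length_stats(sample_lines)
print(stats['mean'], stats['median'])
```

In this sample the single long text pulls the mean up to 6 words while the median stays at 3.5, which is the kind of gap that suggests preferring the median.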

### Prepare the data for batch inference
We take the first few texts from ```data.txt``` and save each one to its own file as input for batch inference.

In [4]:
import os
# Number of texts to select for the demo
num = 100
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    lines = []
    for i, line in enumerate(f.readlines()):
        if i==num:
            break
        lines.append(line.split('\t')[0])
       
# Save these texts to files
dir_ = 'data/data_for_batch_inference'
os.makedirs(dir_, exist_ok=True)
for index, line in enumerate(lines):
    # We use the index as the name of each file
    path = os.path.join(dir_, str(index))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(line)

### Copy files for unittest
To run the unit tests locally, copy the prepared files to the directories that the tests read from.

In [5]:
# Copy files for train/evaluation
import os
import shutil
src_dir = 'data/data_for_pipeline'
dst_dir = 'split_data_txt/data/split_data_txt/inputs/input_dir'
# Make sure the destination directory exists before copying into it
os.makedirs(dst_dir, exist_ok=True)
for file in os.listdir(src_dir):
    src = os.path.join(src_dir, file)
    shutil.copy(src, dst_dir)

# Copy files for score
src_dir = 'data/data_for_batch_inference'
dst_dir = 'fasttext_score/data/fasttext_score/inputs/input_files'
os.makedirs(dst_dir, exist_ok=True)
for file in os.listdir(src_dir):
    src = os.path.join(src_dir, file)
    shutil.copy(src, dst_dir)