Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## I: Using existing datasets
If you don't have a dataset of your own, you can use either of the two datasets we have prepared in advance.

### Chinese dataset: [THUCNews](https://www.dropbox.com/sh/k9w7icz74fe4f4k/AAATZBeIh4RBu1mLZ-NqonHXa?dl=0)
[This dataset](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) was built by filtering historical data from Sina News RSS subscription channels from 2005 to 2011. Here we use only a subset of THUCNews, 154,921 headlines covering nine categories: game, technology, entertainment, finance, society, realty, stock, education, and sport.

### English dataset: [IMDB](https://www.dropbox.com/sh/ifxoyyt0j9kuc8u/AAD_m2q3ghJqVWEeWCWnNiyYa?dl=0)
[This dataset](https://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 movie reviews. Each review is labeled "pos" or "neg" to indicate its sentiment polarity. The overall distribution of labels is balanced (25k pos and 25k neg).

**Just execute the following cells, then continue with fasttext_pipeline.ipynb, fasttext_realtime_inference.ipynb, and fasttext_batch_inference.ipynb.**

## II: Using your dataset
**You only need to supply one file named ```data.txt```.**

Each line of ```data.txt``` contains one text and its label, separated by a tab character (```\t```).
For example: ```sentence1``` **\t** ```label1```

**Please remove all line breaks within each text, so that every text and its label occupy exactly one line.**
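
The format rules above can be sketched as two small helpers. This is only an illustration, not part of the prepared notebooks: the names `to_data_line` and `find_malformed_lines` are hypothetical, and the sketch assumes that replacing line breaks and tabs inside a text with single spaces is acceptable for your data.

```python
def to_data_line(text, label):
    """Return one data.txt line: single-line text, a tab, then the label."""
    # str.split() with no arguments splits on any run of whitespace,
    # including '\n', '\r', and '\t', so re-joining with ' ' yields a
    # single-line text that contains no tab characters.
    clean_text = ' '.join(text.split())
    return '{}\t{}\n'.format(clean_text, label.strip())


def find_malformed_lines(lines):
    """Return the 1-based numbers of lines that are not 'text<TAB>label'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip('\r\n').split('\t')
        # A valid line has exactly two non-empty fields
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            bad.append(number)
    return bad


sample = to_data_line('a great movie,\nwould watch again', 'pos')
print(repr(sample))
print(find_malformed_lines([sample, 'no label here\n']))
```

Running `find_malformed_lines` over the lines of your own ```data.txt``` before executing the cells below reports which lines to fix.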

The other files will be created automatically through the following cells.

The structure of dataset files looks like:
```
repository
  data
    data_for_pipeline
      data.txt
      label.txt
      word_to_index.json
    data_for_batch_inference
      file1
      file2
      ...
      fileN
```

### Create three directories following the structure above
They are ```data```, ```data/data_for_pipeline```, and ```data/data_for_batch_inference```.

In [1]:
import os
# files for fasttext_pipeline.ipynb will be placed in data/data_for_pipeline
os.makedirs('data/data_for_pipeline', exist_ok=True)
# files for fasttext_batch_inference.ipynb will be placed in data/data_for_batch_inference
os.makedirs('data/data_for_batch_inference', exist_ok=True)

### Prepare files for fasttext_pipeline.ipynb
Put ```data.txt``` in ```data/data_for_pipeline```. The next cell checks that ```data.txt``` exists and raises an error if it does not.

In [2]:
if not os.path.exists('data/data_for_pipeline/data.txt'):
    raise FileNotFoundError('data/data_for_pipeline/data.txt')

#### create ```label.txt```
Each line of ```label.txt``` contains one unique label used in ```data.txt```. For example, if there are three kinds of labels in your dataset, then ```label.txt``` looks like:
```
label1
label2
label3
```

In [3]:
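labels = set()
# Collect all unique labels
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    for line_no, line in enumerate(f, start=1):
        parts = line.rstrip().split('\t')
        # Fail with the line number instead of a bare IndexError
        if len(parts) < 2:
            raise ValueError('line {} of data.txt has no tab-separated label'.format(line_no))
        labels.add(parts[1])
# Sort the labels so that label.txt is identical across runs (set order is arbitrary)
labels = sorted(labels)
# Save labels to 'label.txt'
with open('data/data_for_pipeline/label.txt', 'w', encoding='utf-8') as f:
    for label in labels:
        f.write(label + '\n')
print('number of labels: {}'.format(len(labels)))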

number of labels: 9
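
Beyond the number of labels, it can be useful to check how the examples are distributed over them, since the IMDB description above points out that its labels are balanced. The following is a minimal sketch on in-memory sample lines. The helper name `label_distribution` is hypothetical, and lines that are not in the `text<TAB>label` format are simply skipped here.

```python
from collections import Counter


def label_distribution(lines):
    """Count how many well-formed 'text<TAB>label' lines carry each label."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip('\r\n').split('\t')
        if len(parts) == 2:
            counts[parts[1]] += 1
    return counts


sample_lines = ['good movie\tpos\n', 'dull plot\tneg\n', 'loved it\tpos\n']
for label, count in sorted(label_distribution(sample_lines).items()):
    print('{}: {}'.format(label, count))
```

To apply it to your own data, pass the open file object for ```data/data_for_pipeline/data.txt``` instead of `sample_lines`.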


#### create ```word_to_index.json```
This is a vocabulary file of key-value pairs, where each key is a word and each value is that word's index. Not every word that occurs in ```data.txt``` is included; only the most representative ones are kept. In this notebook, we select these words by their frequency of occurrence.

In [4]:
import json
# Record the word frequency
word_count = {}
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        text = line.split('\t')[0]
        for word in text.split(' '):
            word_count[word] = word_count.get(word,0)+1
word_count_list = sorted(word_count.items(), key=lambda x : x[1], reverse=True)

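# Keep the most frequent words so that the vocabulary, including the two
# special tokens [PAD] and [UNK], has at most vocab_size entries
vocab_size = 10000
word_to_index = {"[PAD]": 0, "[UNK]": 1}
# Indices 0 and 1 are taken by the special tokens, so real words start at 2
for index, (word, _) in enumerate(word_count_list[:vocab_size - 2], start=2):
    word_to_index[word] = index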
with open('data/data_for_pipeline/word_to_index.json', 'w', encoding='utf-8') as f:
    json.dump(word_to_index, f)
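
To illustrate what the two special entries are for, here is a minimal sketch of how a text could be turned into a fixed-length list of indices with this vocabulary. It is an assumption about typical usage, not the code of the pipeline modules: the helper `encode_text` is hypothetical, `[UNK]` stands in for words missing from the vocabulary, and `[PAD]` fills texts shorter than `max_len`.

```python
def encode_text(text, word_to_index, max_len):
    """Map whitespace-separated words to indices, then truncate or pad to max_len."""
    unk = word_to_index['[UNK]']
    pad = word_to_index['[PAD]']
    # Words that are not in the vocabulary fall back to the [UNK] index
    ids = [word_to_index.get(word, unk) for word in text.split()]
    # Truncate long texts, then pad short ones up to max_len
    ids = ids[:max_len]
    ids += [pad] * (max_len - len(ids))
    return ids


# Tiny hand-made vocabulary with the same layout as word_to_index.json
vocab = {'[PAD]': 0, '[UNK]': 1, 'good': 2, 'movie': 3}
print(encode_text('good movie overall', vocab, max_len=5))  # [2, 3, 1, 0, 0]
```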

### Calculate the average length of the text
Our model requires a parameter named ```max_len```, which controls the length of each text. Suppose the average text length in the dataset is ```avg_len```. To obtain a well-performing model, the difference between ```max_len``` and ```avg_len``` should not become too large, so we suggest setting ```max_len``` equal to ```avg_len```.

> **Tip**
If the variance of the text length is too large, use the median text length rather than ```avg_len```.

In [5]:
total_length = 0
line_num = 0
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        line_num += 1
        text = line.split('\t')[0]
        total_length += len(text.split(' '))
avg_length = int(total_length/line_num)
print('avg_len:{}'.format(avg_length))

avg_len:11
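
The tip about the median can be sketched with the standard `statistics` module. This is an illustrative example on in-memory sample lines rather than on ```data.txt```. The helper name `length_stats` is hypothetical, and it splits on any whitespace, which can differ slightly from the `split(' ')` used in the cell above when a text contains consecutive spaces.

```python
import statistics


def length_stats(lines):
    """Return (mean, median) word counts of the text part of each line."""
    lengths = [len(line.split('\t')[0].split()) for line in lines if line.strip()]
    return int(statistics.mean(lengths)), int(statistics.median(lengths))


# Four short texts and one long outlier of 30 words
sample_lines = [
    'good movie\tpos\n',
    'not my taste\tneg\n',
    'loved it\tpos\n',
    'quite dull overall\tneg\n',
    ' '.join(['word'] * 30) + '\tpos\n',
]
avg_len, median_len = length_stats(sample_lines)
print('avg_len:{} median_len:{}'.format(avg_len, median_len))  # avg_len:8 median_len:3
```

In this sample a single long text pulls the average up to 8 while the median stays at 3, which is the situation where the median is the safer choice for ```max_len```.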


### Prepare files for fasttext_batch_inference.ipynb
We select a few texts from ```data.txt``` as the inputs for batch inference and save each one in its own file.

You can choose the file names yourself, such as file1, file2, and so on.

Each file contains one text that you want a prediction for. Please remove line breaks beforehand so that the text stays on one line.

In [6]:
# Select the first 100 texts for the demo; each one is saved to its own file
num = 100
with open('data/data_for_pipeline/data.txt', 'r', encoding='utf-8') as f:
    lines = []
    for i, line in enumerate(f.readlines()):
        if i==num:
            break
        lines.append(line.split('\t')[0])
       
# Save these texts to files
dir_ = 'data/data_for_batch_inference'
os.makedirs(dir_, exist_ok=True)
for index, line in enumerate(lines):
    # We use the index as the name of each file
    path = os.path.join(dir_, str(index))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(line)

### Copy files for unit tests
If you want to run the unit tests of the customized modules locally, you need to copy the related files into the directories that the tests read from.

In [7]:
# Copy files for customized modules: split_data_txt, fasttext_train, fasttext_evaluation, and compare_two_models
import shutil
src_dir = 'data/data_for_pipeline'
dst_dir = 'split_data_txt/data/split_data_txt/inputs/input_dir'
os.makedirs(dst_dir, exist_ok=True)
for file in os.listdir(src_dir):
    if not file.startswith('.'):
        src = os.path.join(src_dir, file)
        shutil.copy(src, dst_dir)

# Copy files for the customized module: fasttext_score
src_dir = 'data/data_for_batch_inference'
dst_dir = 'fasttext_score/data/fasttext_score/inputs/input_files'
os.makedirs(dst_dir, exist_ok=True)
for file in os.listdir(src_dir):
    if not file.startswith('.'):
        src = os.path.join(src_dir, file)
        shutil.copy(src, dst_dir)