In this notebook, we do some basic pre-processing to [Ubuntu Dialogue Corpus v2](https://github.com/rkadlec/ubuntu-ranking-dataset-creator). 
* Download tokenized and lower cased data
* Separate training csv file into `context`, `response` and `flag` files
* Sample validation file to have same distribution as training data: 1 positive and 1 negative
* Build a vocabulary file from training data. Only keep words that appear atleast 50 times. Also add UNK token

### P0: Download Data

We will work with [Ubuntu Dialogue Corpus v2](https://github.com/rkadlec/ubuntu-ranking-dataset-creator). 

[Petr Baudiš](https://github.com/pasky) has [kindly pre-processed and made available a tokenized version](https://github.com/brmson/dataset-sts/tree/master/data/anssel/ubuntu) of this corpus, and we will work with that!

We will lowercase the data though.

It might be a good idea to create a data directory (Let's call it `ubuntu-data`) and put all these files inside it.
Be careful, as this would donwnload three files of total size `196 MB`. On unzipping, this would take a space of `734 MB` so make sure you have that much free space (and bandwidth available)!

Open your bash terminal and download and unzip the data as follows:

```bash
    mkdir ubuntu-data
    cd ubuntu-data
    
    wget http://rover.ms.mff.cuni.cz/~pasky/ubuntu-dialog/v2-trainset.csv.gz
    wget http://rover.ms.mff.cuni.cz/~pasky/ubuntu-dialog/v2-valset.csv.gz
    wget http://rover.ms.mff.cuni.cz/~pasky/ubuntu-dialog/v2-testset.csv.gz
    gunzip v2*.gz
```

### P1: Take a peek!
The first thing you should do after downloading text data is to count number of lines in it, and have a peek at it ;) 

My favorite tools for counting is unix tool `wc` and for peeking `head`

```bash

wc -l *.csv
```

You should see something like this:

```bash
  189200 v2-testset.csv
 1000000 v2-trainset.csv
  195600 v2-valset.csv
 1384800 total
```

Okay, so we have `1M` lines in training csv, about `195K` in validation and `189K` in test.

Now let us have a peek at train file `v2-trainset.csv`.

```bash
head -1 v2-trainset.csv
```

I am not going to put the output returned here, as it will fill this notebook ;). But you would notice that the line has three parts: two sentences and a label. First sentence is the `context`. Second sentence is the `response`. Last part is the `flag`, which specifies if the response is correct.

Also, notice that each sentence has some special tokens
* End of utterance `__eou__`
* End of Turn `__eot__`

* Think of each sentence as a list of turns (User1 followed by User2). Each turn is in turn a list of utterances (from a single user)

For example, the sentence `i agree that we should kill the -novtswitch __eou__ __eot__ ok __eou__ __eot` should be understood as
```
User1: i agree that we should kill the -novtswitch
User2: ok
```

How about validation and test files? 
```bash
head -20 head v2-valset.csv
```
You would notice that the first sentence (`context`) is same for 10 sentences, and 9 sentences in 10 have label 0 and 1 sentence label 0. This dataset is designed to select correct response from 10.

### P2: Separate `v2-trainset.csv` into Context, Response and Flag

I still have not yet figured out how to separate CSV data directly using dataset API. I probably need to [read this more carefully](https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/examples/get_started/regression/imports85.py)!

I do however, understand how [`TextLineDataset`](https://www.tensorflow.org/versions/master/api_docs/python/tf/data/TextLineDataset) works when each line in text is a single sentence. Thus, in this section, we will separate out CSV to three different files: `train.context`, `train.response` and `train.flag` using plain old python! Probably an overkill, but we ought to make progress! :)

These three new files (`train.context`, `train.response` and `train.flag`) are definitely `TextLineDataset` friendly. Don't worry too much now, we will see how in much more detail than you like!

In [1]:
import os, csv
import numpy as np

#Set this to actual path of the folder where your CSV files reside.
DATA_DIR = 'ubuntu-data'

train_csv_file = os.path.join(DATA_DIR, 'v2-trainset.csv')
train_context_file = os.path.join(DATA_DIR, 'train.context')
train_response_file = os.path.join(DATA_DIR, 'train.response')
train_flag_file = os.path.join(DATA_DIR, 'train.flag')

In [2]:
fw_train_context = open(train_context_file, 'w')
fw_train_response = open(train_response_file, 'w')
fw_train_flag = open(train_flag_file, 'w')
    
num_rows = 0

with open(train_csv_file) as fr:
    reader = csv.reader(fr)
    for row in reader:
        assert len(row) == 3
        fw_train_context.write('%s\n'%row[0].strip().lower())
        fw_train_response.write('%s\n'%row[1].strip().lower())
        fw_train_flag.write('%s\n'%row[2].strip())
        num_rows += 1
print('Num rows in Train csv: {}'.format(num_rows))       

#Important, we do need to close the files, before using them.
fw_train_context.close()
fw_train_response.close()
fw_train_flag.close()

Num rows in Train csv: 1000000


Great, we should now have three new files in `ubuntu-data`. What do we do when we get text data? 

Count lines and take a peek to ensure everything looks good! 

* Check that number of lines is `1M` in all the files
    ```bash
    wc -l train.*
    ```
    
* Check first and last line of `v2-trainset.csv`. Does this match your new files? `train.context`, `train.response` and `train.flag`

### P3: sample `v2-valset.csv`
We saw in P1 that validation and test CSV files have 9 negative flags and 1 positive flags. 

Let us see what is the dsitribution of negative flags for training data. 

All we need to do is to count `0` in the new file we created in P2 `train.flag`

In [3]:
num_neg = 0
with open(train_flag_file) as fr:
    for line in fr:
        line = line.strip()
        if line == '0':
            num_neg += 1
print('Train neg_labels: {}'.format(num_neg))

Train neg_labels: 500127


Okay, so it seems the negative and positive labels are *almost* equally distributed.

The validation and test files however, do not have the same distribution. In order to check if our text correlation model has been trained well, we would like to compare loss on training and validation set. This would only be comparable if they *atleast* have similar distribution of labels!

Thus, here we would **sample** `v2-valset.csv`. We will randomly select 1 negative response out of 9. And ofcourse, keep the positive response.

In [4]:
#Paths for validation file
valid_csv_file = os.path.join(DATA_DIR, 'v2-valset.csv')

valid_context_file = os.path.join(DATA_DIR, 'valid.context')
valid_response_file = os.path.join(DATA_DIR, 'valid.response')
valid_flag_file = os.path.join(DATA_DIR, 'valid.flag')

In [5]:
#If you look closely at v2-valset.csv, you will see that the first example out of 10 always has `flag=1`
fw_valid_context = open(valid_context_file, 'w')
fw_valid_response = open(valid_response_file, 'w')
fw_valid_flag = open(valid_flag_file, 'w')

#Fixed random seed for reproducibility
np.random.seed(1543)

sample_num = 0
neg_responses = []
pos_response = None
context = None
with open(valid_csv_file) as fr:
    reader = csv.reader(fr)
    for row in reader:
        if sample_num == 0:
            context = row[0].strip().lower()
            pos_response = row[1].strip().lower()
        else:
            neg_responses.append(row[1].strip().lower())
        sample_num += 1

        #We 
        if sample_num == 10:
            random_int = np.random.randint(9)
            #Write positive data
            fw_valid_context.write('%s\n'%context)
            fw_valid_response.write('%s\n'%pos_response)
            fw_valid_flag.write('%d\n'%1)
            
            #Write negative data
            fw_valid_context.write('%s\n'%context)
            fw_valid_response.write('%s\n'%neg_responses[random_int])
            fw_valid_flag.write('%d\n'%0)

            neg_responses = []
            sample_num = 0
            
    assert not neg_responses

# Again, this step is important when writing to files within a notebook!
fw_valid_context.close()
fw_valid_response.close()
fw_valid_flag.close()

We will now have new validation files.`valid.context`,  `valid.response` and `valid.flag`

As always, verify number of lines and do a peek! 

You should find number of lines to be `19560 * 2 = 39120`

### P4: Vocab file

Okay, so we are done with separating our data.

Now, we are finally left with creating a vocab file. We will use [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) to count words, and only retain words that appear atleast 50 times. You can experiment with a higher minimum frequency. 

We will also add a special symbol `UNK` which will refer to all out of vocabulary words. We will **only** use training file to build the vocabulary.

Let us begin!

In [6]:
from collections import Counter

words = []
counter = Counter()

for index, (context, response) in enumerate(zip(open(train_context_file), open(train_response_file))):
    words.extend(context.split())
    words.extend(context.split())
    
    #This is to keep memory foot print low
    #If we get 10M words we update counter
    if len(words) > 10000000:
        counter.update(words)
        print ('Index: {} Vocab: {}'.format(index, len(counter)))
        words = []
        
#Final update
if words:
    counter.update(words)
    print ('Final Vocab: {}'.format(len(counter)))
    words = []

print(counter.most_common(10))

Index: 59414 Vocab: 96881
Index: 119609 Vocab: 153941
Index: 178763 Vocab: 203333
Index: 236851 Vocab: 249512
Index: 294173 Vocab: 292395
Index: 352036 Vocab: 332966
Index: 409958 Vocab: 372606
Index: 468534 Vocab: 411856
Index: 526642 Vocab: 451210
Index: 584394 Vocab: 489346
Index: 641870 Vocab: 527417
Index: 699277 Vocab: 566036
Index: 755791 Vocab: 603585
Index: 812726 Vocab: 641210
Index: 870317 Vocab: 678494
Index: 928442 Vocab: 700066
Index: 988980 Vocab: 705089
Final Vocab: 706028
[('__eou__', 12445416), ('__eot__', 7908302), ('i', 5435158), ('the', 4550778), (',', 4294744), ('to', 3902622), ('?', 3749674), ('.', 3462904), ('it', 3251880), ('a', 2699816)]


Okay, that seems fine. Top 10 words have eou eot and other common words in English.

Now, we just need to write the words to our vocab file!

In [7]:
def write_vocab(fw, counter, min_freq=50):
    fw.write('%s\n'%'UNK')
    for w, f in counter.most_common():
        if f < min_freq:
            return
        fw.write('%s\n'%w)

vocab_file = os.path.join(DATA_DIR, 'vocab.txt')       
with open(vocab_file, 'w') as fw:
    write_vocab(fw, counter)

As always, let us have a look at our vocab file. Count number of lines. See top words, to make sure everything looks OK!
```bash
head vocab.txt
```

```bash
UNK
__eou__
__eot__
i
the
,
to
?
.
it
```