In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

# What do we have here?

Before doing any serious programming, let's check what we have (the inputs) and what the expected submission (the output) looks like.

## Training Set - Quick Peek

Let's take a look at the training dataset.

In [2]:
df_train = pd.read_csv('../input/en_train.csv')

In [3]:
print(repr(df_train.head(20)))

It turns out that `class` is a reserved keyword in Python. To make life a bit easier downstream, let's rename the column to `token_class`. This is a good idea because attribute access such as `df.token_class` works, whereas `df.class` raises a `SyntaxError`.

In [4]:
df_train.columns = ["sentence_id", "token_id", "token_class", "before", "after"]
print(repr(df_train.head(20)))
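As a minimal sketch of the same rename, here is a toy frame (the rows are made up for illustration and only the column names mirror the training file). It assumes the original column is literally named `class`, and it uses `DataFrame.rename` so that only that one column is touched, whatever the column order is.

```python
import keyword

import pandas as pd

# Toy rows, invented for illustration; only the column names mirror the training file.
toy = pd.DataFrame({
    "sentence_id": [0, 0],
    "token_id": [0, 1],
    "class": ["PLAIN", "DATE"],
    "before": ["In", "2006"],
    "after": ["In", "two thousand six"],
})

# `class` is reserved, so `toy.class` cannot even be parsed; bracket access still works.
is_reserved = keyword.iskeyword("class")
via_brackets = toy["class"].tolist()

# Rename just that one column by name instead of reassigning the whole header.
toy = toy.rename(columns={"class": "token_class"})
via_attribute = toy.token_class.tolist()

print(is_reserved, via_brackets, via_attribute)
```

Renaming by name fails less silently than assigning a full list to `.columns`, which would mislabel the data if the file's column order ever differed.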

For each sentence (`sentence_id`):

- we read the written word (`before`) row-by-row.
- we read the spoken word (`after`) row-by-row.

For example, the written word `2006` is spoken as `two thousand six`.

Each word within a sentence is identified by its `token_id`.

Each token has an assigned class (e.g. `PLAIN` for normal English, `DATE` for dates, `LETTERS` for acronyms, etc.)
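To make the row-by-row reading concrete, here is a small sketch that stitches the tokens of each sentence back together. The toy rows and the helper name `join_tokens` are assumptions for illustration; the real training frame is much larger and may contain missing values that would need handling first.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1],
    "token_id": [0, 1, 0, 1],
    "before": ["In", "2006", "Hello", "world"],
    "after": ["In", "two thousand six", "Hello", "world"],
})


def join_tokens(df, column):
    """Join the tokens of each sentence, in token_id order, into one string."""
    ordered = df.sort_values(["sentence_id", "token_id"])
    return ordered.groupby("sentence_id")[column].apply(lambda tokens: " ".join(tokens))


written = join_tokens(toy, "before")  # written form, one string per sentence
spoken = join_tokens(toy, "after")    # spoken form, one string per sentence

print(written.loc[0])
print(spoken.loc[0])
```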

## Test Set - Quick Peek

Let's take a look at the test dataset.

In [5]:
df_test = pd.read_csv('../input/en_test.csv')

In [6]:
print(repr(df_test.head(20)))

Note that the test set only contains `before` (written word). There is no `after` (spoken word). We will need to predict what `after` is.

## Sample Submission - Quick Peek

In [7]:
df_sample_submission = pd.read_csv('../input/en_sample_submission.csv')

In [8]:
print(repr(df_sample_submission.head(20)))

Note that the sample submission contains the `after` column (predicted spoken word) for the corresponding `before` column (written word) in the test set. The `id` column has the form `<sentence_id>_<token_id>`. Both parts are zero-indexed, so the third token of the second sentence has an `id` of `1_2`.
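The `id` format can be sketched on a toy test frame as follows. The rows are made up, and copying `before` into `after` is only a placeholder prediction assumed here for illustration, not a method proposed in this notebook.

```python
import pandas as pd

# Toy test rows, invented for illustration.
toy_test = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 1],
    "token_id": [0, 1, 0, 1, 2],
    "before": ["Hi", "there", "In", "2006", "again"],
})

# Build "<sentence_id>_<token_id>" from the two integer columns.
toy_test["id"] = (
    toy_test["sentence_id"].astype(str) + "_" + toy_test["token_id"].astype(str)
)

# Placeholder prediction (an assumption): leave the written form unchanged.
submission = toy_test[["id"]].assign(after=toy_test["before"])

print(submission)
```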

# Explore Training Set

Now that we know our objective is to predict the `after` column (spoken form) from the `before` column (written form), using the `before`/`after` pairs in the training data, let's get a high-level overview of the training set. For example:

- How many unique sentences? (`sentence_id`)
- How many unique `token_class`?
- How many unique `before`? (token in written form)
- How many unique `after`? (token in spoken form)
- How many tokens (rows) per `token_class`?
- etc.

In [9]:
print("Unique sentences: {:,d}".format(df_train.sentence_id.unique().size))
print("Unique token classes: {:,d}".format(df_train.token_class.unique().size))
print("Unique before: {:,d}".format(df_train.before.unique().size))
print("Unique after: {:,d}".format(df_train.after.unique().size))

In [10]:
# count() counts rows, and each row is one token, so this is tokens per class
print('tokens per class...')
print('===================')
print(repr(df_train.groupby(['token_class'])['sentence_id'].count()))
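One thing to keep in mind: on a grouped column, `count()` counts rows (one row per token), while `nunique()` counts distinct values. A toy sketch of the difference, with made-up rows:

```python
import pandas as pd

# Toy rows, invented for illustration: sentence 0 has three tokens, sentence 1 has two.
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "token_class": ["PLAIN", "PLAIN", "DATE", "PLAIN", "DATE"],
})

grouped = toy.groupby("token_class")["sentence_id"]

tokens_per_class = grouped.count()       # number of rows (tokens) in each class
sentences_per_class = grouped.nunique()  # number of distinct sentences containing each class

print(tokens_per_class)
print(sentences_per_class)
```

Here `PLAIN` has three tokens but appears in only two sentences, so the two numbers differ.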

We have 16 unique token classes. Let's get a feel for what the tokens look like in each class. This is how we "peek" at one token class:

In [11]:
df_train[df_train['token_class'] == 'ADDRESS'].head(5)

To repeat this for all 16 token classes, we can write a simple function to do the job:

In [12]:
def peek_tokens_by_class(token_classes, view_x):
    for token_class in token_classes:
        print(df_train[df_train['token_class'] == token_class].head(view_x))

In [13]:
# Run it! Let's peek at 10 samples from each token_class.
peek_tokens_by_class(df_train.token_class.unique(), 10)
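As an alternative to the explicit loop, `groupby(...).head(n)` returns the first `n` rows of every group in a single frame. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "token_class": ["PLAIN", "DATE", "PLAIN", "PLAIN", "DATE"],
    "before": ["In", "2006", "the", "year", "2007"],
})

# First 2 rows of each class; rows keep their original order and index.
peek = toy.groupby("token_class").head(2)

# Optional: group the rows visually by class, keeping order within each class.
peek_sorted = peek.sort_values("token_class", kind="stable")

print(peek_sorted)
```

This scans the frame once instead of filtering it once per class, which may matter on a large training set.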

We have just learnt what the before (written form) and after (spoken form) columns look like for the 16 unique token classes. Notice that the `VERBATIM` class contains some non-English characters (I am not sure yet what these are).
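One possible way to find out what such characters are is to look up their Unicode names. The helper `non_ascii_names` and the sample tokens below are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
import unicodedata


def non_ascii_names(tokens):
    """Map every non-ASCII character found in `tokens` to its Unicode name."""
    found = {}
    for token in tokens:
        for ch in token:
            if ord(ch) > 127:
                found[ch] = unicodedata.name(ch, "UNKNOWN")
    return found


# Hypothetical sample tokens (not from the dataset).
sample_tokens = ["&", "\u03b1", "caf\u00e9"]
names = non_ascii_names(sample_tokens)

for ch, name in names.items():
    print(repr(ch), name)
```

On the real data this could be applied to something like `df_train[df_train['token_class'] == 'VERBATIM']['before'].dropna()`.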

**Note to self** More analysis to come (work in progress).