# Chapin-style conversation generator

In [1]:
import pandas as pd

## Transforming data into a script-like format

In [2]:
data_path = "data/velmaxdata.csv"
threads = pd.read_csv(data_path)

In [3]:
threads.head()

Unnamed: 0.1,Unnamed: 0,threadid,author,date,text
0,0,4420,Mitsu-Elias,"10-Jul-2006, 22:21",Cuando lo acelero hace un ruidito que juro ke ...
1,1,4420,Thunderboy,"11-Jul-2006, 08:41",Pues si vos no sabes nosotros menos!! akfdjads...
2,2,4420,Mitsu-Elias,"11-Jul-2006, 11:18",Pues si vos no sabes nosotros menos!! akfdjads...
3,3,4420,Thunderboy,"11-Jul-2006, 11:38","Pues si, mas de una vez se me han bajado las r..."
4,4,531900,Roberto.1®,"13-Aug-2016, 09:27",ya se que me recomendaran las mejores marcas d...


In [4]:
fields_to_drop = ['Unnamed: 0','date']
threads = threads.drop(fields_to_drop,axis=1)
threads.head()

Unnamed: 0,threadid,author,text
0,4420,Mitsu-Elias,Cuando lo acelero hace un ruidito que juro ke ...
1,4420,Thunderboy,Pues si vos no sabes nosotros menos!! akfdjads...
2,4420,Mitsu-Elias,Pues si vos no sabes nosotros menos!! akfdjads...
3,4420,Thunderboy,"Pues si, mas de una vez se me han bajado las r..."
4,531900,Roberto.1®,ya se que me recomendaran las mejores marcas d...


### Building script blocks

In [5]:
# Id of the first thread, used below to detect when a new thread starts
thread_id = threads['threadid'].iloc[0]
print(thread_id)

4420


In [6]:
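# Write one "author: text" line per post, with extra newlines between threads.
with open('./data/script.txt', 'w') as out_file:

    for index, thread in threads.iterrows():

        # A change of thread id marks the start of a new script block.
        if thread['threadid'] != thread_id:
            out_file.write('\n\n')

        line = "{}: {}".format(thread['author'], thread['text']).strip()
        # End every line with a period so line boundaries stay explicit.
        if not line.endswith('.'):
            line += '.'

        out_file.write(line + '\n')

        thread_id = thread['threadid']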
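
To make the resulting file layout concrete, here is a minimal sketch that reproduces the same formatting on a few made-up rows. The rows, the `format_line` helper, and the use of `itertools.groupby` are illustrative assumptions and not part of the notebook; `groupby` is used because it groups consecutive rows with the same key, which mirrors the change detection in the loop above.

```python
from itertools import groupby

# Made-up (threadid, author, text) rows standing in for the DataFrame.
rows = [
    (1, "Ana", "Hola"),
    (1, "Luis", "Que tal?"),
    (2, "Ana", "Otro tema."),
]

def format_line(author, text):
    """Build an 'author: text' line that always ends with a period."""
    line = "{}: {}".format(author, text).strip()
    return line if line.endswith('.') else line + '.'

# One block per run of consecutive rows that share a thread id.
blocks = [
    "\n".join(format_line(author, text) for _, author, text in group)
    for _, group in groupby(rows, key=lambda row: row[0])
]

# Each line ends with '\n' and a thread change adds '\n\n',
# so blocks end up separated by three newlines in total.
script = "\n\n\n".join(blocks) + "\n"
print(script)
```

Note that a line ending in "?" or "!" still gets a period appended, because only a trailing "." is checked.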

## Checkpoint 

Once the data has been processed, you can start from here and load the transformed data directly. The preprocessed data has been saved to disk.

In [1]:
import helper

data_dir = './data/script.txt'
text = helper.load_data(data_dir)

## Explore the data

In [8]:
view_sentence_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 3754730
Number of scenes: 444505
Average number of sentences in each scene: 23.253801419556584
Number of lines: 10780936
Average number of words in each line: 12.654039593593728

The sentences 0 to 10:
Mitsu-Elias: Cuando lo acelero hace un ruidito que juro ke no identifico que es, suena como si las bujias estubieran humedas, cosa que ya me asegure que no fuera. Que podria ser? Otra cosa, que beneficios me puede dar comprar un catalizador de marca y cuanto vale mas o menos.
Thunderboy: Pues si vos no sabes nosotros menos!! akfdjadsfkljkalsfd.
Mitsu-Elias: Pues si vos no sabes nosotros menos!! akfdjadsfkljkalsfd.
Thunderboy: Pues si, mas de una vez se me han bajado las revoluciones y ha perdido potencia, cuando lo compre venia con una bujia diferente a las demas y por descuido no me fije en eso, pero cambie las bujias y ahora son iguales, no se si tendra que ver que se me hayan mojado mas de una vez (con esos diluvios que caen por aki), 

## Implementing Preprocessing Functions

To preprocess the dataset, we will implement the following functions:
* Lookup Table
* Tokenize Punctuation

### Lookup Table

To create a word embedding, we first need to map each word to an integer id. We will create two dictionaries:
* A dictionary that maps each word to its id, which we'll call vocab_to_int
* A dictionary that maps each id back to its word, which we'll call int_to_vocab
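
As a quick illustration of the two mappings, the sketch below builds them for a toy word list and round-trips it. The sample words are made up for illustration, and ids are assigned by descending frequency as one reasonable convention.

```python
from collections import Counter

# Toy word list, invented for this example.
words = ["el", "carro", "hace", "un", "ruido", "el", "carro", "el"]

# Most frequent words get the smallest ids; ties keep first-seen order.
counts = Counter(words)
sorted_vocab = sorted(counts, key=counts.get, reverse=True)

vocab_to_int = {word: i for i, word in enumerate(sorted_vocab)}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}

# Encode the words as ids, then decode them back.
encoded = [vocab_to_int[word] for word in words]
decoded = [int_to_vocab[i] for i in encoded]

print(encoded)           # [0, 1, 2, 3, 4, 0, 1, 0]
print(decoded == words)  # True
```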

In [2]:
import numpy as np
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of velmax scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # Count word frequencies so the most common words get the smallest ids.
    word_counts = Counter(text)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return vocab_to_int, int_to_vocab


### Tokenize Punctuation

We'll split the script into an array of words using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to tell that "bye" and "bye!" are the same word.

The token_lookup function returns a dict that is used to turn symbols like "?" into tokens like "||Question_Mark||".

This dictionary is used to replace each symbol with its token and to add the delimiter (a space) around it. Each symbol then becomes its own word, which makes it easier for the neural network to predict the next word.

In [3]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    # Map each punctuation symbol to a placeholder token.
    punct_dic = {
        '.' : '||Period||',
        ',' : '||Comma||',
        '"' : '||Quotation_Mark||',
        ';' : '||Semicolon||',
        '!' : '||Exclamation_Mark||',
        '?' : '||Question_Mark||',
        '(' : '||Left_Parentheses||',
        ')' : '||Right_Parentheses||',
        '--': '||Dash||',
        '\n': '||Return||'
    }
    return punct_dic
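
The sketch below shows one way such a dictionary might be applied to raw text. The `apply_tokens` function and the three-entry `sample_tokens` dict are hypothetical stand-ins written for this example; the code of helper.preprocess_and_save_data is not shown in this notebook and may differ.

```python
def apply_tokens(text, token_dict):
    """Replace each symbol with ' token ' and split the text on whitespace."""
    # Lowercase first so the placeholder tokens keep their original casing.
    text = text.lower()
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()

# Small subset of the punctuation dictionary, used only for this example.
sample_tokens = {
    '!': '||Exclamation_Mark||',
    '?': '||Question_Mark||',
    '.': '||Period||',
}

words = apply_tokens('Que podria ser? No se.', sample_tokens)
print(words)
# ['que', 'podria', 'ser', '||Question_Mark||', 'no', 'se', '||Period||']
```

Padding each token with spaces is what lets a plain split() treat the punctuation as a separate word.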


## Preprocess all data and save it

In [None]:
# Preprocess Training, Validation, and Testing Data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

## Checkpoint #2

The preprocessed data has been saved to disk. We can start from here next time.