# Generador de Tweets de Obama

Basandonos en un listado de tweets de Barack Obama, queremos empezar una frase y ver cómo la terminaria Obama


![texto alternativo](https://)

## Preparación

Al estar el notebook alojado en Google Colab, hay que cargar el dataset en el entorno de ejecución.

In [0]:
!mkdir datasets
!mv BarackObama.json datasets

In [0]:
import numpy as np
import pandas as pd
import re

In [9]:
!pip install spacy



In [10]:
!python -m spacy download en_core_web_sm

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


In [0]:
twt = pd.read_json (r'datasets/BarackObama.json',typ='series')

In [0]:
df = twt.to_frame()

In [0]:
df.columns = ['tweet']

Vemos que hay tweets duplicados

In [14]:
df.shape

(2894, 1)

In [15]:
df = df.drop_duplicates()
df.shape

(2861, 1)

Limpiamos los tweets de la misma manera que en el primer ejercicio.

In [0]:
def replace_re (cad, regex, token):
    return re.sub(regex, token, cad)

In [0]:
# quitar espacios antes y despues del texto
df['tweet'] = df['tweet'].map(lambda SentimentText: SentimentText.strip())  
 

# Cambiar special HTML entities (de http://www.htmlhelp.com/reference/html40/entities/special.html)
#  &amp; 	&lt; 	&gt; &circ; &tilde;
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&amp;', '& '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lt;', '<'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&gt;', '>'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&circ;', '^'))

# &ensp; &emsp; &thinsp; -> ' '
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ensp;', ' '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&thinsp;', ' '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&emsp;', ' '))

# &ndash; 	&mdash; -> '-'
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ndash;', '-'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&mdash;', '-'))

# ' &quot; &lsquo; &rsquo; &sbquo; &ldquo; &rdquo; &bdquo; &lsaquo; &rsaquo;  -> "'"
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('"', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&quot;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lsquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rsquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&sbquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ldquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rdquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&bdquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lsaquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rsaquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&quot;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('“', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('”', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('’', "'"))

# euro sign	&euro;	&#8364;	&#x20AC;	€	€	€
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&euro;', '€'))

# quitar /n
escape_char_re = r'\n|\t'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, escape_char_re, ""))

# quitar @username (mentions)
twitterHandle_re = r'(^|[^@\w])@(\w{1,15})\b'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, twitterHandle_re, "")) 

# hashtag 
# los quito por 
hashtag_re = r'(?:^|\s|\')[＃#]{1}(\w+)'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, hashtag_re, "")) 

# quitar urls
url_re = r'[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, url_re, "")) 

# quitar espacios multiples en la cadena 'The    quick  lazy    fox'->'The quick lazy fox' 
extraSpaces_re = r' +'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, extraSpaces_re, " "))

# quitar espacios antes y despues del texto (los que se hayan podido meter despues de las sustituciones)
df['tweet'] = df['tweet'].map(lambda tweet: tweet.strip())  

# minusculas
df['tweet'] = df['tweet'].map(lambda tweet: tweet.lower()) 

Usaremos Spacy para generar el vocabulario

In [0]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [0]:
lista_tweets = df['tweet'].tolist()

In [0]:
frases = ""
for e in lista_tweets:
    frases += e + '\n'

## Modelo

Generamos un modelo con un LSTM, bla, bla, bla....

Basado en [Language Modelling and Text Generation using LSTMs — Deep Learning for NLP](https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275)

In [21]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
import numpy as np


Using TensorFlow backend.


In [0]:
tokenizer = Tokenizer()

def dataset_preparation(data):

	# basic cleanup
	corpus = data.split("\n")

	# tokenization	
	tokenizer.fit_on_texts(corpus)
	total_words = len(tokenizer.word_index) + 1

	# create input sequences using list of tokens
	input_sequences = []
	for line in corpus:
		token_list = tokenizer.texts_to_sequences([line])[0]
		for i in range(1, len(token_list)):
			n_gram_sequence = token_list[:i+1]
			input_sequences.append(n_gram_sequence)

	# pad sequences 
	max_sequence_len = max([len(x) for x in input_sequences])
	input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

	# create predictors and label
	predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
	label = ku.to_categorical(label, num_classes=total_words)

	return predictors, label, max_sequence_len, total_words


In [0]:
def create_model(predictors, label, max_sequence_len, total_words, epochs=100):
	
	model = Sequential()
	model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
	model.add(LSTM(150, return_sequences = True))
	# model.add(Dropout(0.2))
	model.add(LSTM(100))
	model.add(Dense(total_words, activation='softmax'))

	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
	model.fit(predictors, label, epochs=epochs, verbose=1, callbacks=[earlystop])
	print (model.summary())
	return model 

In [0]:
def generate_text(seed_text, next_words, max_sequence_len, model):
    for j in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen= 
                             max_sequence_len-1, padding='pre')
        predicted = model.predict_classes(token_list, verbose=0)
  
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(frases)

In [27]:
model = create_model(predictors, label, max_sequence_len, total_words, epochs=20)

Epoch 1/20
Epoch 2/20




Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 10)            43720     
_________________________________________________________________
lstm_3 (LSTM)                (None, 50, 150)           96600     
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               100400    
_________________________________________________________________
dense_2 (Dense)              (None, 4372)              441572    
Total params: 682,292
Trainable params: 682,292
Non-trainable params: 0
_________________________________________________________________
None


In [30]:
print (generate_text("fire up", 8, max_sequence_len, model))

fire up the the the the the the the the
