# Obama Tweet Generator

Starting from a list of Barack Obama's tweets, we want to begin a sentence and see how Obama would finish it.


## Setup

Because the notebook is hosted on Google Colab, the dataset has to be loaded into the runtime environment.

In [0]:
!mkdir datasets
!mv BarackObama.json datasets

mkdir: cannot create directory ‘datasets’: File exists
mv: cannot stat 'BarackObama.json': No such file or directory


We install spaCy.

In [0]:
!pip install spacy



In [0]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## Loading the Dataset

In [0]:
import numpy as np
import pandas as pd
import re

In [0]:
twt = pd.read_json (r'datasets/BarackObama.json',typ='series')

In [0]:
df = twt.to_frame()

In [0]:
df.columns = ['tweet']

Some tweets are duplicated, so we compare the shape before and after dropping them.

In [0]:
df.shape

(2894, 1)

In [0]:
df = df.drop_duplicates()
df.shape

(2861, 1)

We clean the tweets the same way as in the first exercise.

In [0]:
def replace_re (cad, regex, token):
    return re.sub(regex, token, cad)

In [0]:
# strip whitespace before and after the text
df['tweet'] = df['tweet'].map(lambda tweet: tweet.strip())
 
# Replace special HTML entities (from http://www.htmlhelp.com/reference/html40/entities/special.html)
#  &amp; 	&lt; 	&gt; &circ; &tilde;
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&amp;', '& '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lt;', '<'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&gt;', '>'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&circ;', '^'))

# &ensp; &emsp; &thinsp; -> ' '
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ensp;', ' '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&thinsp;', ' '))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&emsp;', ' '))

# &ndash; 	&mdash; -> '-'
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ndash;', '-'))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&mdash;', '-'))

# ' &quot; &lsquo; &rsquo; &sbquo; &ldquo; &rdquo; &bdquo; &lsaquo; &rsaquo;  -> "'"
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('"', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&quot;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lsquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rsquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&sbquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&ldquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rdquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&bdquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&lsaquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&rsaquo;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&quot;', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('“', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('”', "'"))
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('’', "'"))

# euro sign	&euro;	&#8364;	&#x20AC;	€	€	€
df['tweet'] = df['tweet'].map(lambda tweet: tweet.replace('&euro;', '€'))

# remove newlines and tabs
escape_char_re = r'\n|\t'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, escape_char_re, ""))

# remove @username (mentions)
twitterHandle_re = r'(^|[^@\w])@(\w{1,15})\b'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, twitterHandle_re, ""))

# hashtags: removed
hashtag_re = r'(?:^|\s|\')[＃#]{1}(\w+)'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, hashtag_re, ""))

# remove URLs (a group of schemes; the original used a character class here by mistake)
url_re = r'(?:https?|ftp|file)://\S+'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, url_re, ""))

# collapse multiple spaces: 'The    quick  lazy    fox' -> 'The quick lazy fox'
extraSpaces_re = r' +'
df['tweet'] = df['tweet'].apply(lambda x: replace_re(x, extraSpaces_re, " "))

# strip leading and trailing whitespace again (the substitutions may have introduced some)
df['tweet'] = df['tweet'].map(lambda tweet: tweet.strip())

# lowercase
df['tweet'] = df['tweet'].map(lambda tweet: tweet.lower())
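
As a side note, the long chain of entity replacements could probably be shortened with the standard library. The sketch below is not the notebook's code: `clean_entities` is a hypothetical helper that decodes every HTML entity with `html.unescape` and then maps quote, dash and space variants to plain ASCII. Unlike the cell above, it turns `&amp;` into `&` without a trailing space.

```python
import html

# Characters to normalise after decoding (written as escapes for readability)
QUOTES = "\"\u201c\u201d\u2018\u2019\u201a\u201e\u2039\u203a"  # straight and curly quotes
DASHES = "\u2013\u2014"                                        # en dash, em dash
SPACES = "\u2002\u2003\u2009"                                  # en, em and thin space

TABLE = {ord(c): "'" for c in QUOTES}
TABLE.update({ord(c): "-" for c in DASHES})
TABLE.update({ord(c): " " for c in SPACES})


def clean_entities(text):
    """Decode HTML entities, then map quote/dash/space variants to ASCII."""
    return html.unescape(text).translate(TABLE)


print(clean_entities("Jobs &amp; growth &mdash; &ldquo;yes we can&rdquo;"))
# Jobs & growth - 'yes we can'
```

In the notebook this could be applied with `df['tweet'].map(clean_entities)` before the regex-based steps.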

In my tests with the full dataset, getting a decent result took many epochs; otherwise the model kept repeating the output 'in in in in'.
With the full dataset each epoch takes 10 minutes and good results only appear from around epoch 200, which means roughly 33 hours of training and is not practical.

To have something to show, I built a much smaller dataset that trains relatively quickly and still produces sentences that make sense.

In [0]:
smalldf = df.sample(50, random_state=42)

In [0]:
lista_tweets = df['tweet'].tolist()
small_lista_tweets = smalldf['tweet'].tolist()

In [0]:
# one tweet per line, each terminated by a newline
frases = "".join(e + '\n' for e in lista_tweets)

small_frases = "".join(e + '\n' for e in small_lista_tweets)

## Model

We load spaCy's English model here, although the functions further down tokenize with the Keras `Tokenizer`.

In [0]:
import spacy

nlp = spacy.load('en_core_web_sm')

In class we saw language modelling with Bayes. My approach is similar, but instead of Bayes I use a neural network with LSTMs as the model.
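
For comparison, a count-based model of the kind alluded to above might look like the following minimal sketch. It is an assumption about what the classroom approach resembles, not code from the course: `train_bigram` and `most_likely_next` are hypothetical helpers that count which word follows which and return the most frequent follower.

```python
from collections import Counter, defaultdict


def train_bigram(lines):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


toy_lines = ["i will survive", "i will survive", "i will stay alive"]
bigram_counts = train_bigram(toy_lines)
print(most_likely_next(bigram_counts, "will"))  # survive
```

Such a model only looks at the previous word, whereas the LSTM below conditions on the whole prefix of the sentence.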


In [0]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
import numpy as np

Using TensorFlow backend.


Building a language model involves three distinct parts: preparing the dataset, training the model on the chosen dataset and, finally, generating sentences.

I could put everything in a single code block, but since I want to repeat the experiment with different datasets I split it into three functions, which must be called in sequence.

In [0]:
tokenizer = Tokenizer()

def dataset_preparation(data):
    # Start from a fresh tokenizer on every call: fit_on_texts accumulates counts,
    # so reusing one instance would leak vocabulary from earlier datasets.
    # generate_text reads this same global.
    global tokenizer
    tokenizer = Tokenizer()

    # The data already arrives in lowercase, but it is worth making sure
    # in general that every sentence is lowercase
    corpus = data.lower().split("\n")

    # tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    # Build the input sequences: every prefix of each line with at least two tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # pad the sequences on the left to a common length
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    # create predictors and labels: the last token of each sequence is the label
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)

    return predictors, label, max_sequence_len, total_words
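
To make the prefix construction concrete, here is a small pure-Python sketch of the same idea without Keras. The token ids and the helper `make_training_pairs` are made up for illustration. Each line yields one training example per prefix, the prefixes are left-padded with zeros, and the last token becomes the label.

```python
def make_training_pairs(token_ids_per_line):
    """Turn tokenised lines into (left-padded prefix, next token) pairs."""
    prefixes = []
    for ids in token_ids_per_line:
        for i in range(1, len(ids)):
            prefixes.append(ids[:i + 1])

    # pad on the left so that every row has the same length
    max_len = max(len(p) for p in prefixes)
    padded = [[0] * (max_len - len(p)) + p for p in prefixes]

    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels


toy_predictors, toy_labels = make_training_pairs([[4, 7, 2, 9]])
for row, target in zip(toy_predictors, toy_labels):
    print(row, "->", target)
# [0, 0, 4] -> 7
# [0, 4, 7] -> 2
# [4, 7, 2] -> 9
```

A line of n tokens therefore contributes n - 1 examples, which is why even a short song produces a few hundred training rows.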


In [0]:
def create_model(predictors, label, max_sequence_len, total_words, epochs=100, verbose=1):
    # Define the model
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
    model.add(LSTM(150, return_sequences = True))
    model.add(LSTM(100))
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # stop if the training loss has not improved for 50 epochs
    earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=50, verbose=0, mode='auto')
    model.fit(predictors, label, epochs=epochs, verbose=verbose, callbacks=[earlystop])
    if verbose != 0:
        model.summary()  # summary() prints by itself and returns None
    return model
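
To get a feel for the size of this network, the parameter counts can be estimated by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with biases. The vocabulary size of 1000 is a hypothetical value, since the notebook does not print `total_words`.

```python
def embedding_params(vocab_size, dim):
    # one vector of size `dim` per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units


def count_params(total_words, embed_dim=10, lstm1=150, lstm2=100):
    """Total trainable parameters of the Embedding-LSTM-LSTM-Dense stack."""
    return (embedding_params(total_words, embed_dim)
            + lstm_params(embed_dim, lstm1)
            + lstm_params(lstm1, lstm2)
            + dense_params(lstm2, total_words))


print(lstm_params(10, 150))   # 96600
print(lstm_params(150, 100))  # 100400
print(count_params(1000))     # 308000 for a hypothetical 1000-word vocabulary
```

The two LSTM layers have a fixed cost, while the embedding and the output layer grow linearly with the vocabulary, which is one reason the full dataset is so much slower to train.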

In [0]:

def generate_text(seed_text, next_words, max_sequence_len, model):
    for j in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes is not available in recent Keras releases;
        # the argmax of predict() gives the same class index
        probabilities = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(probabilities, axis=-1)[0])

        # look up the word that corresponds to the predicted index
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

Now all that is left is to prepare the dataset we want to use, train the network and generate the sentence.

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(small_frases)

# If you enjoy watching a model train for half a lifetime, uncomment the next line
#predictors, label, max_sequence_len, total_words = dataset_preparation(frases)


In [0]:
model = create_model(predictors, label, max_sequence_len, total_words, epochs=500)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78

![Barack Obama](https://media.metrolatam.com/2018/12/29/barackobama04-ba3c0c481c307ead52a4469c620532d5-1200x600.jpg)

In [0]:
print (generate_text("helping americans", 18, max_sequence_len, model))
print (generate_text("american conference is", 16, max_sequence_len, model))
print (generate_text("on jobs", 14, max_sequence_len, model))


helping americans as i'm president america will lead the world ' —president obama at hill air force base in utah
american conference is speaking at the consumer financial protection bureau may be but to be reformed ' —president obama
on jobs across the country are calling on senate leaders to fill the supreme court vacancy


Training the neural network on the full Obama dataset takes very long, whereas the reduced dataset alone already gives linguistically correct output.

Even so, the model has a small vocabulary, which limits how we can start the text if we want decent output. It also tends to repeat itself in its outputs.
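
One common way to reduce that repetition is to sample the next word from the predicted distribution instead of always taking the most likely one. The sketch below is not part of the notebook: `sample_with_temperature` is a hypothetical helper, and the probabilities are invented. A temperature below 1 sharpens the distribution and a temperature above 1 flattens it.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from `probs` after rescaling by `temperature`."""
    # work in log space; the small constant avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(logits)
    # subtract the maximum for numerical stability before exponentiating
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


rng = random.Random(0)
toy_probs = [0.05, 0.7, 0.25]  # made-up softmax output over three words
draws = [sample_with_temperature(toy_probs, 0.8, rng) for _ in range(1000)]
print({index: draws.count(index) for index in range(len(toy_probs))})
```

In `generate_text`, this would mean passing the softmax row returned by `model.predict` to the helper in place of the argmax. Whether it improves the generated tweets would have to be checked experimentally.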

Still, with a lot of time and a lot of hardware the model could be trained on the full dataset and then evaluated. As a proof of concept, it works.

## And for dessert...

I also wanted to check whether the model can be applied to other uses. So, purely for fun, I repeated the training with song lyrics to see how it goes.

In [0]:
elvis = """Maybe I didn't treat you
Quite as good as I should have
Maybe I didn't love you
Quite as often as I could have
Little things I should have said and done
I just never took the time
You were always on my mind
You were always on my mind
Maybe I didn't hold you
All those lonely, lonely times
And I guess I never told you
I'm so happy that you're mine
If I make you feel second best
Girl, I'm so sorry I was blind
You were always on my mind
You were always on my mind
Tell me, tell me that your sweet love hasn't died
Give me, give me one more chance
To keep you satisfied, satisfied
Little things I should have said and done
I just never took the time
You were always on my mind
You were always on my mind
You were always on my mind
Maybe I didn't treat you
Quite as good as I should have
Maybe I didn't love you
Quite as often as I could have
Maybe I didn't hold you
All those lonely, lonely times
And I guess I never told you
I'm so happy that you're mine
Maybe I didn't treat you
Quite as good as I should have"""

In [0]:
abba = """Ooh
You can dance
You can jive
Having the time of your life
Ooh, see that girl
Watch that scene
Dig in the dancing queen
Friday night and the lights are low
Looking out for a place to go
Where they play the right music
Getting in the swing
You come to look for a king
Anybody could be that guy
Night is young and the music's high
With a bit of rock music
Everything is fine
You're in the mood for a dance
And when you get the chance
You are the dancing queen
Young and sweet
Only seventeen
Dancing queen
Feel the beat from the tambourine, oh yeah
You can dance
You can jive
Having the time of your life
Ooh, see that girl
Watch that scene
Dig in the dancing queen
You're a teaser, you turn 'em on
Leave 'em burning and then you're gone
Looking out…"""

In [0]:
gloria = """At first I was afraid, I was petrified
Kept thinking I could never live without you by my side
But then I spent so many nights thinking how you did me wrong
And I grew strong
And I learned how to get along
And so you're back
From outer space
I just walked in to find you here with that sad look upon your face
I should have changed that stupid lock, I should have made you leave your key
If I'd known for just one second you'd be back to bother me
Go on now, go, walk out the door
Just turn around now
'Cause you're not welcome anymore
Weren't you the one who tried to hurt me with goodbye
Do you think I'd crumble
Did you think I'd lay down and die?
Oh no, not I, I will survive
Oh, as long as I know how to love, I know I'll stay alive
I've got all my life to live
And I've got all my love to give and I'll survive
I will survive, hey, hey
It took all the strength I had not to fall apart
Kept trying hard to mend the pieces of my broken heart
And I spent oh-so many nights just feeling sorry for myself
I used to cry
But now I hold my head up high and you see me
Somebody new
I'm not that chained-up little person and still in love with you
And so you felt like dropping in and just expect me to be free
Well, now I'm saving all my lovin' for someone who's loving me
Go on now, go, walk out the door
Just turn around now
'Cause you're not welcome anymore
Weren't you the one who tried to break me with goodbye
Do you think I'd crumble
Did you think I'd lay down and die?
Oh no, not I, I will survive
Oh, as long as I know how to love, I know I'll stay alive
I've got all my life to live
And I've got all my love to give and I'll survive
I will survive
Oh
Go on now, go, walk out the door
Just turn around now
'Cause you're not welcome anymore
Weren't you the one who tried to break me with goodbye
Do you think I'd crumble
Did you think I'd lay down and die?
Oh no, not I, I will survive
Oh, as long as I know how to love, I know I'll stay alive
I've got all my life to live
And I've got all my love to give and I'll survive
I will survive
I will survive
"""

### Gloria Gaynor - I Will Survive
![Gloria Gaynor](https://timedotcom.files.wordpress.com/2016/03/library-of-congress-national-recording-registry-gloria-gaynor-billy-joel-metallica.jpg)

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(gloria)

In [0]:
model = create_model(predictors, label, max_sequence_len, total_words, epochs=500, verbose=0)

In [0]:
print (generate_text("the pieces are", 14, max_sequence_len, model))
print (generate_text("i change for", 8, max_sequence_len, model))

the pieces are so many nights thinking i was not i know i'll stay alive pieces of
i change for now i was afraid i was petrified look


### Elvis Presley - Always On My Mind
![Elvis](https://www.duna.cl/media/2018/03/elvis-ok-e1520544500211.jpg)

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(elvis)
model = create_model(predictors, label, max_sequence_len, total_words, epochs=500, verbose=0)

In [0]:
print (generate_text("i should always guess", 6, max_sequence_len, model))
print (generate_text("you were second best", 4, max_sequence_len, model))


i should always guess i should have said and done
you were second best my best second best


### Abba - Dancing Queen
![Abba](https://culto.latercera.com/wp-content/uploads/2018/04/abba-900x600.jpg)

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(abba)
model = create_model(predictors, label, max_sequence_len, total_words, epochs=500, verbose=0)

In [0]:
print (generate_text("hard dancing girl", 5, max_sequence_len, model))
print (generate_text("gone girl in", 8, max_sequence_len, model))


hard dancing girl fine and the music's high
gone girl in the dancing queen fine and the dancing music


Because the song is so repetitive, the result is not very good.

### All together
Finally, let's see what happens if we train on all three songs.

![Oldies](https://images-na.ssl-images-amazon.com/images/I/9179F-G7-%2BL._SL1500_.jpg)

In [0]:
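# join with newlines so the last line of one song does not merge with the first line of the next
todos = "\n".join([abba, elvis, gloria])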

In [0]:
predictors, label, max_sequence_len, total_words = dataset_preparation(todos)
model = create_model(predictors, label, max_sequence_len, total_words, epochs=500, verbose=0)

In [0]:
print (generate_text("the pieces are", 14, max_sequence_len, model))
print (generate_text("i change for you", 8, max_sequence_len, model))

print (generate_text("i should always guess", 6, max_sequence_len, model))
print (generate_text("you were second best", 7, max_sequence_len, model))

print (generate_text("hard dancing girl", 5, max_sequence_len, model))
print (generate_text("gone girl in", 16, max_sequence_len, model))

the pieces are rock music to be free your your side with that sad look upon you
i change for you felt like dropping in and just expect me
i should always guess i should have changed that stupid
you were second best i spent have took the door free
hard dancing girl go walk out the door
gone girl in and just tell for myself so so many just expect me that your time to long
