<a href="https://colab.research.google.com/github/vladimiralencar/DeepLearning-LANA/blob/master/LSTM/StackedLSTMs_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating Tweets with Stacked LSTMs

We will use tweets from US President Donald Trump to train a 2-layer LSTM model and then use it to generate tweets automatically.

The dataset is available on the data science competition site <a href="https://www.kaggle.com/kingburrito666/better-donald-trump-tweets">Kaggle</a> and was extracted from Donald Trump's Twitter account: https://twitter.com/realDonaldTrump

### 1. Feature Engineering

Raw text cannot be fed directly into an LSTM model. We first need to engineer features from it before moving on to the modeling step.

In [0]:
# Imports
import numpy as np
import pandas as pd

In [2]:
# Load the dataset
!wget https://raw.githubusercontent.com/vladimiralencar/DeepLearning-LANA/master/LSTM/data/tweets.csv
data = pd.read_csv('tweets.csv')

--2019-01-23 01:44:03--  https://raw.githubusercontent.com/vladimiralencar/DeepLearning-LANA/master/LSTM/data/tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1703362 (1.6M) [text/plain]
Saving to: ‘tweets.csv’


2019-01-23 01:44:03 (47.7 MB/s) - ‘tweets.csv’ saved [1703362/1703362]



In [3]:
data.head()

| Unnamed: 0 | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | | |
| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | | |
| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | | |
| 3 | 16-11-11 | 2:19:44 | Just had a very open and successful presidenti... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/796... | 214001 | 67010 | | |
| 4 | 16-11-11 | 2:10:46 | A fantastic day in D.C. Met with President Oba... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/796... | 178499 | 36688 | | |


All we need is the **Tweet_Text** field. Let's combine all rows into a single text corpus by concatenating the tweets, separated by two newlines:

In [4]:
text = '\n\n'.join(data['Tweet_Text'].values)
print(text[:400])

Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z

Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!

Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!

Just h


To reduce the size of our feature space and the training time, we remove rare characters, here every character that appears fewer than 300 times in the corpus:

In [0]:
from collections import Counter
import re

In [0]:
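# Count how often each character occurs and collect those seen fewer than 300 times
cntr = Counter(text)
rare = [c for c, n in cntr.items() if n < 300]
for c in rare:
    # str.replace avoids the regex escaping problems that characters such as ']' or '\' would cause
    text = text.replace(c, '')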
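
To make the filtering step concrete, here is a minimal sketch on a toy string. The names `toy_text` and `min_count` are hypothetical, and the threshold of 2 merely stands in for the 300 used on the tweet corpus; it illustrates the idea rather than reproducing the notebook's exact code.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook applies the same idea to the tweet text
toy_text = "aaa bbb aab zq"
min_count = 2  # assumed threshold for this toy example (the notebook uses 300)

counts = Counter(toy_text)
# Characters seen fewer than min_count times are treated as rare
rare_chars = {c for c, n in counts.items() if n < min_count}
# Keep only the characters that are not rare
filtered = "".join(c for c in toy_text if c not in rare_chars)

print(sorted(rare_chars))
print(repr(filtered))
```

Here `z` and `q` each occur once, so both are dropped while the more frequent letters and spaces are kept.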

Here is what the beginning of the corpus looks like:

In [7]:
print(text[:1000])

Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z

Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!

Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!

Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!

A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!

Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4

Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before

Watching the returns at 9:45pm.
#ElectionNight #MAGA__ https://t.co/HfuJeRZbod

RT @IvankaT

The corpus has 857,177 characters, 78 of which are unique:

In [8]:
print('Total characters in corpus: {:,d}'.format(len(text)))
chars = sorted(list(set(text)))
print('Total unique characters:', len(chars))
# Lookup tables between characters and their integer indices
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Total characters in corpus: 857,177
Total unique characters: 78


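Now let's cut the text into semi-redundant sequences of *maxlen* characters so it can be fed to an LSTM model. A new sequence starts every *step* characters, and each one is paired with the character that immediately follows it, which is what the model learns to predict: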

In [9]:
maxlen = 50
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences: {:,d}'.format(len(sentences)))

Number of sequences: 285,709
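
The windowing is easier to follow on a short string. The sketch below wraps the same slicing in a hypothetical helper, `make_windows`, and runs it with a small `maxlen` so every window can be printed; the toy string and the helper name are assumptions made for illustration.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same slicing as the loop above."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])   # input sequence
        next_chars.append(text[i + maxlen])   # character to predict
    return windows, next_chars

windows, targets = make_windows("hello world", maxlen=4, step=3)
for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))

# The number of windows depends only on the corpus length, maxlen and step
n_sequences = len(range(0, 857177 - 50, 3))
print(n_sequences)
```

Consecutive windows overlap by `maxlen - step` characters, which is why the sequences are called semi-redundant. The last line reproduces the count reported above: 285709.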


Next, let's vectorize the sequences as one-hot encoded arrays:

In [0]:
# The np.bool alias was removed from recent NumPy releases; the built-in bool is equivalent
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [11]:
X[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
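
As a small illustration of the encoding, the sketch below one-hot encodes a four-character string over a three-character vocabulary and decodes it again. The names `toy_chars`, `toy_index` and `one_hot` are hypothetical and not part of the notebook.

```python
import numpy as np

toy_chars = sorted(set("abca"))                      # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}  # character -> column

def one_hot(sequence, index):
    """Encode a string as a (len(sequence), vocabulary size) boolean array."""
    encoded = np.zeros((len(sequence), len(index)), dtype=bool)
    for t, char in enumerate(sequence):
        encoded[t, index[char]] = True
    return encoded

encoded = one_hot("abca", toy_index)
print(encoded.astype(int))

# Decoding takes the position of the single True value in each row
decoded = "".join(toy_chars[i] for i in encoded.argmax(axis=1))
print(decoded)

# Number of elements in X for 285,709 sequences of 50 steps over 78 characters
n_elements = 285709 * 50 * 78
print(n_elements)
```

Since NumPy stores one boolean per byte, the full `X` array holds 1,114,265,100 elements, roughly 1.1 GB, which is worth keeping in mind when running the notebook on a machine with limited memory.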

### 2. Generative Model

In [12]:
import random
import sys
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

Using TensorFlow backend.


Let's create some reusable functions that generate text from our generative model.

In [0]:
# Relative frequency of each character in the corpus, in the same order as chars
cntr = Counter(text)
cntr_sum = sum(cntr.values())
char_probs = [cntr[c] / cntr_sum for c in chars]

In [0]:
def sample(preds):
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
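
A quick empirical check shows what `sample` does: it draws an index at random, with each index chosen in proportion to its predicted probability, rather than always returning the most likely one. The sketch below repeats the function so that it runs on its own; `toy_preds`, the seed and the number of draws are arbitrary assumptions.

```python
import numpy as np

def sample(preds):
    # Same logic as the notebook's sample(): renormalize, then draw one index
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

np.random.seed(0)  # fixed seed so the run is repeatable
toy_preds = [0.7, 0.2, 0.1]
draws = [sample(toy_preds) for _ in range(5000)]

# Fraction of draws that landed on each index
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)
```

The observed frequencies should come out close to 0.7, 0.2 and 0.1. This randomness is why repeated calls to the generator with the same seed text produce different continuations.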

In [0]:
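def generate(model, length, seed=''):
    """Generate `length` characters after `seed`, echoing them to stdout."""

    if len(seed) != 0:
        sys.stdout.write(seed)

    generated = seed
    # Keep at most the last maxlen characters so the window fits the model input
    sentence = seed[-maxlen:]

    for _ in range(length):
        x = np.zeros((1, maxlen, len(chars)))

        padding = maxlen - len(sentence)

        # Left-pad short windows with the corpus character frequencies (the priors)
        for p in range(padding):
            x[0, p] = char_probs

        # One-hot encode the current window
        for t, char in enumerate(sentence):
            x[0, padding + t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds)
        next_char = indices_char[next_index]

        # Slide the window one character forward (its length stays the same)
        sentence = sentence[1:] + next_char
        generated += next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()

    return generated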
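
The input that `generate` builds at each step can be inspected in isolation. The sketch below factors that part into a hypothetical helper, `build_input`, and runs it on a two-character vocabulary. It assumes the sentence is no longer than `maxlen`, and the toy vocabulary and probabilities are made up for illustration.

```python
import numpy as np

def build_input(sentence, maxlen, char_indices, char_probs):
    """Build the (1, maxlen, n_chars) array fed to the model for one step."""
    n_chars = len(char_indices)
    x = np.zeros((1, maxlen, n_chars))
    padding = maxlen - len(sentence)
    # Padded time steps carry the prior character distribution
    x[0, :padding] = char_probs
    # Remaining time steps carry the one-hot encoded sentence
    for t, char in enumerate(sentence):
        x[0, padding + t, char_indices[char]] = 1.0
    return x

toy_indices = {'a': 0, 'b': 1}   # hypothetical two-character vocabulary
toy_probs = [0.75, 0.25]         # hypothetical corpus frequencies
x = build_input("ab", maxlen=4, char_indices=toy_indices, char_probs=toy_probs)
print(x[0])
```

The first two rows hold the prior distribution and the last two hold the one-hot rows for `a` and `b`, so every time step still sums to 1 even though the padded steps are not valid one-hot vectors.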

Now let's build our neural network. The cell below first downloads a previously trained model and loads it if the file is present. Otherwise it trains the model, printing a generated sample after every epoch, and saves it at the end of training so that we can quickly reuse it later.

In [18]:
from os.path import isfile
from keras.models import load_model

!rm -f stacked-lstm-2-layers-128-hidden.h5
!wget https://github.com/vladimiralencar/DeepLearning-LANA/raw/master/LSTM/data/stacked-lstm-2-layers-128-hidden.h5

MODEL_PATH = 'stacked-lstm-2-layers-128-hidden.h5'

if isfile(MODEL_PATH):
    model = load_model(MODEL_PATH)
else:
    N_HIDDEN = 128

    # Model: two stacked LSTM layers followed by a softmax over the characters
    model = Sequential()
    model.add(LSTM(N_HIDDEN, dropout = 0.1, input_shape = (maxlen, len(chars)), return_sequences = True))
    model.add(LSTM(N_HIDDEN, dropout = 0.1))
    model.add(Dense(len(chars), activation = 'softmax'))

    # Optimizer (newer Keras releases name this argument learning_rate instead of lr)
    optimizer = RMSprop(lr = 0.01)
    
    # Compile
    model.compile(loss = 'categorical_crossentropy', optimizer = optimizer)

    # Train one epoch at a time and print a sample after each one
    for iteration in range(1, 40):
        print('\n')
        print('-' * 50)
        print('\nIteration', iteration)
        model.fit(X, y, batch_size=3000, epochs=1)

        print('\n-------------------- Tweet Generated by the Model in This Iteration ---------------------\n')

        rand = np.random.randint(len(text) - maxlen)
        seed = text[rand:rand + maxlen]
        generate(model, 400, seed)

    model.save(MODEL_PATH)

--2019-01-23 01:48:09--  https://github.com/vladimiralencar/DeepLearning-LANA/raw/master/LSTM/data/stacked-lstm-2-layers-128-hidden.h5
Resolving github.com (github.com)... 140.82.118.3, 140.82.118.4
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vladimiralencar/DeepLearning-LANA/master/LSTM/data/stacked-lstm-2-layers-128-hidden.h5 [following]
--2019-01-23 01:48:09--  https://raw.githubusercontent.com/vladimiralencar/DeepLearning-LANA/master/LSTM/data/stacked-lstm-2-layers-128-hidden.h5
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2004136 (1.9M) [application/octet-stream]
Saving to: ‘stacked-lstm-2-layers-128-hidden.h5’


2019-01-23 01:48:09 (50.
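
For a sense of scale, the number of trainable weights in the architecture defined in the `else` branch above can be worked out by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. `lstm_params` and `dense_params` are hypothetical helper names, not Keras functions.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights of a fully connected layer: one weight matrix plus one bias vector."""
    return input_dim * units + units

n_chars, n_hidden = 78, 128  # vocabulary size and N_HIDDEN from the notebook
layer_sizes = [
    lstm_params(n_chars, n_hidden),    # first LSTM layer reads one-hot characters
    lstm_params(n_hidden, n_hidden),   # second LSTM layer reads the first layer's output
    dense_params(n_hidden, n_chars),   # softmax layer over the characters
]
print(layer_sizes, sum(layer_sizes))
```

Under these assumptions the three layers have 105,984, 131,584 and 10,062 weights, or 247,630 in total. Calling `model.summary()` on the loaded model is the direct way to confirm the figure.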



Now let's try out the model!

Using the first sentence of this <a href='https://twitter.com/realDonaldTrump/status/890764622852173826'>tweet</a> as a seed, let's try to continue Trump's sentence and see what interesting things our model can say:

In [19]:
sample_tweet_start = 'Go Republican Senators, Go!'
_ = generate(model, 200, sample_tweet_start)

Go Republican Senators, Go! Polling - theyWith, and insteacing



Can do what you cant wait will disguard their #Ispairwitb, in her, Clinton is beuting be, you are great. We then edwatcyl mevided be thoughts!

RT @TheNastan25 @

In [20]:
sample_tweet_start = 'immigration'
for i in range(10):
    _ = generate(model, 200, sample_tweet_start)
    print('\n=========================================\n')

immigration I watched #Mogey #MakeAmericaGurhts #MakeAmericaGU. #CrookedHillary seon you, we are not to see that Issue? @SaralPrump: .@StateForTMrisy #FIPrece Prayers - anyone it we can down ower started! Genera

immigration #ThankABir #Bruz they she will her watching they hope register @annennahoe #ICNND!

Lating Ohioss #1 @realDonaldTrump #Trump2016

Thank you! #Trump2016 @jick_bal3: CNN has in the U.S.

RT @ChrisCulid

immigration Trump on increase Crooked Hillary Clintons. Great Hillary, this news Donald. You are almosticianCore: We cannot ewnel https://t.coHaPresann @tedcruz:
TRUMP:
https://t.coSeSher atherfully #2016

Thank

immigrationa_ strick #_HallaryColumh

RT @ABC @UnionLeam 40 Paris: https://t.coL2Ami in 1002- @RealHelkop RANS a funness amazing crowd! This the werss. Way a deal. CampPertrop_
Watched 1 #WNN newspead! #VoteTrum

immigrationaus opening needs are Tited, Mexico. Helitics what just needs TEA!!!! Thank you
Lets Dend Great subbost been anwuldy they would be in 201

In [0]:
sample_tweet_start = 'America'
for i in range(10):
    _ = generate(model, 200, sample_tweet_start)
    print('\n=========================================\n')

In [21]:
sample_tweet_start = 'China'
for i in range(10):
    _ = generate(model, 200, sample_tweet_start)
    print('\n=========================================\n')

Chinastenfer hit will devated
#MakeAmerica_ Thad former he most just were http:/2F016 TRUMP Dicemene Mirdley #BigLead https: CNN Zecal Ragins: https:/5t0 Make Americen 2006 TRUMP2016 KOU ATpringbam Get the

China the likes hell  @CNNSOS08: Debate Jeb ABA OUT cooker  @realDonald8Traumed #MakeAmerAcal Make Ad Nation #MakeAmerMcarson: https:_It
#Trump: Team in @megyn 40 in Joe News Trump your thuge will leads Sh

China

"@Rkkndmanoge upman #Backersz

I has get women

.@FrankAmenoms The @realDin Estable #18 of Carsona: Xm_ http:_ 

Cwan!

RT @TheRFIRSED   http:/2016 #LastUr, Hen illegatienamong illege_ http:/2016 @r

China disgrace:
https: https: @FoxNews

I will be in he they lead wontera4ze Away #Trump 5 http:/6016: Trump Congres Trump 10.9 http:/hippapier: UNDE Clevel https:
_"

"@heafhenndy #SNL@@CNBC Blacked @foxa

China shown #Trump, @Ridgy http:_  http:/2-16 #SCP @chucklier

Just http:/6 This very This 434 Medicst

"@jesfuckes 4 excepeon offices, @realDongly _badica

Crooker disaste

In [22]:
sample_tweet_start = 'Democratic'
for i in range(10):
    _ = generate(model, 200, sample_tweet_start)
    print('\n=========================================\n')

Democratic

Crooked Financials just be massive speech when AbL &amp; T.VANG supporters They strong stopfed
Beliess BOC will MOVEMENT IN THE AND THUTIO!
#MakeAmerica: #SepsJeston, Dogal was subjickers: https://t

Democratic

Get out in just really success the RYC rest evening story comes Trump Voter, https://t.cohZay or Mr. Interview #Bateres2 @JebBush American Departy #GOPDebates #TrumpLeash #BigLeagytI VOTE #VPDebates

Democratic

New @GWS @MarkSTrump @Holothelove says and Ire looked Hey suprosting president slould have need to bad over #sed @fightlebDand Trump.8p @RockDFUSA #WHC #TrumpInam &amp; RIcL IN TRUMP"

"@JoeDICUS &a

Democratic

RT @TeamTrump: Mikl against they heads for self and they on evening it would do them heard 26

Crooked Hillary Clintonsw Get the
Dems will remember true--IT, Behn Rue--

"@Jichina_ #MakeAmericaFyree

Democratic

Why would re and other clars devistuages: https://t.co/XaD I will interviewed Bether! Lutc_

I will be &amp; netapeys Reluting Secrate recor