**PART 3 - Generating text with Keras**

Reading the .csv files containing the male and female posts.

In [179]:
#!pip install pandas
import numpy as np
import pandas as pd

# Load the posts and keep only the message column.
data_female = pd.read_csv('female_posts.csv', sep=',', na_values=".", encoding='ISO-8859-1')
data_male = pd.read_csv('male_posts.csv', sep=',', na_values=".", encoding='ISO-8859-1')
data_female = data_female["status_message"]
data_male = data_male["status_message"]


Concatenating the female posts to a single text/string

In [180]:
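# Build one long string; " /n/r " is a literal separator between posts
# (the tokenizer later splits it into the tokens 'n' and 'r').
female_text = ""
for i in range(len(data_female)):
    female_text = female_text + data_female[i] + " /n/r "
female_text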
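The loop above can also be written with a single join. The following is a minimal sketch on made-up posts, assuming that missing messages show up as non-string values; the `posts` list and the `concatenate_posts` helper are hypothetical and not part of the notebook.

```python
# Hypothetical sample posts; None stands in for a missing status_message.
posts = ["first post", None, "second post"]
SEPARATOR = " /n/r "  # same literal separator as in the loop above


def concatenate_posts(posts, separator=SEPARATOR):
    """Join all string posts, appending the separator after each one."""
    kept = [p for p in posts if isinstance(p, str)]
    return "".join(p + separator for p in kept)


female_text_demo = concatenate_posts(posts)
print(repr(female_text_demo))
```

Skipping non-string entries avoids a TypeError when a post is missing, at the cost of silently dropping that row.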



In [181]:
len(female_text)

143516

In [182]:
from keras import backend as K
K.set_image_dim_ordering('th')

**Cleaning the data**

In this part we remove unwanted characters from the text we created. 

In [206]:
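sentence_start_token = "SENTENCESTART"
sentence_end_token = "SENTENCEEND"

# NOTE: str.replace matches its first argument literally, not as a regular
# expression, so the three pattern-style calls below only fire if that exact
# text occurs in the posts. Regex matching would need re.sub instead.
female_text = female_text.replace(r'\[.*?\]|\(.*http.+\)|\(.*https.+\)|\<.*http.+\>', '')
female_text = female_text.replace(r'Rado([^\s]+)|Skarp([^\s]+)', '')
female_text = female_text.replace(r'\=[A-Z|0-9][A-Z|0-9]|\=', '')
#female_text = female_text.replace('\n',' '+ line_break + ' ')
# Literal replacements: drop carriage returns, turn double dashes into spaces.
female_text = female_text.replace('\r','')
female_text = female_text.replace('--',' ')
female_text = female_text.lower()
female_text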
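Python's `str.replace` treats its first argument as plain text, while `re.sub` interprets it as a regular expression. The sketch below shows the difference on a made-up sample string, reusing two of the patterns from the cell above; the `sample` text is an assumption chosen for illustration.

```python
import re

# Hypothetical sample containing a bracketed tag and a leftover "=E2" code.
sample = "check this [photo] out =E2 now"

# str.replace searches for the pattern text itself, so nothing changes here.
literal = sample.replace(r'\[.*?\]', '')

# re.sub interprets the same pattern as a regular expression.
bracket_free = re.sub(r'\[.*?\]', '', sample)
cleaned = re.sub(r'\=[A-Z|0-9][A-Z|0-9]|\=', '', bracket_free)

print(literal)
print(cleaned)
```

Note that greedy patterns such as `\(.*http.+\)` can match across many posts once everything is joined into one string, so applying them post by post is likely the safer option.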



Now, we split the main text into individual words, keeping their order and any repeats.

In [207]:
from keras.preprocessing.text import text_to_word_sequence
female_text2 = text_to_word_sequence(female_text, lower=False, split=" ")  # text was already lower-cased above


In [208]:
female_text2[0:50]

['mohammed',
 'nazili',
 "suddicqui's",
 'post',
 'advertising',
 'a',
 'payment',
 'gateway',
 'is',
 'removed',
 'for',
 'a',
 'second',
 'time',
 'third',
 'instance',
 'will',
 'result',
 'in',
 'the',
 'member',
 'being',
 'removed',
 'from',
 'the',
 'group',
 'n',
 'r',
 'new',
 'member',
 'post',
 'advertising',
 'a',
 'payment',
 'gateway',
 'has',
 'been',
 'removed',
 'n',
 'r',
 'this',
 'is',
 'amazing',
 'n',
 'r',
 'we',
 'need',
 '20',
 'volunteers',
 'to']
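The `'n'` and `'r'` tokens in the output above come from the `" /n/r "` separator: the slash is filtered out as punctuation, which leaves the two letters as separate words. The sketch below approximates that behaviour in plain Python. The `FILTERS` string mirrors the default filter list documented for `text_to_word_sequence` in Keras 2 and is an assumption for other versions; `simple_word_sequence` is a hypothetical stand-in, not the Keras function.

```python
# Characters treated as separators (assumed to match the Keras default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, lower=True):
    """Replace filter characters with spaces, then split on whitespace."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.translate(table).split()


tokens = simple_word_sequence("This is amazing! /n/r We need 20 volunteers")
print(tokens)
```

The apostrophe is not in the filter list, which is consistent with `"suddicqui's"` surviving as a single token.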

Now we initialize the tokenizer and fit it on the word list, with nb_words=900 restricting the encoding to the most frequent words in the text.

In [209]:
from keras.preprocessing.text import Tokenizer
token = Tokenizer(nb_words=900,char_level=False)
token.fit_on_texts(female_text2)

In [210]:
text_mtx = token.texts_to_matrix(female_text2, mode='binary')

Each word is represented by a vector of size 900. A row contains a 1 in the column of its word if that word is among the top 900 words, and is all zeros otherwise.

For that, we use texts_to_matrix.
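To make the encoding concrete, here is a small plain-Python sketch of a frequency-ranked binary encoding. It is an approximation of what the tokenizer does, not the Keras implementation: `fit_top_words` and `to_binary_rows` are hypothetical helpers, and the convention that index 0 stays unused (so only `num_words - 1` words get a column) follows the Keras documentation as I understand it.

```python
from collections import Counter


def fit_top_words(tokens, num_words):
    """Map the most frequent words to indices 1..num_words-1 (0 stays unused)."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    return {word: i + 1 for i, word in enumerate(ranked[:num_words - 1])}


def to_binary_rows(tokens, word_index, num_words):
    """One row per token: a single 1 at the word's index, or all zeros."""
    rows = []
    for word in tokens:
        row = [0] * num_words
        if word in word_index:
            row[word_index[word]] = 1
        rows.append(row)
    return rows


tokens = ["a", "post", "a", "gateway", "a", "post"]
word_index = fit_top_words(tokens, num_words=3)
rows = to_binary_rows(tokens, word_index, num_words=3)
print(word_index)
print(rows)
```

With `num_words=3`, only `a` and `post` receive a column, and the rarer `gateway` becomes an all-zero row.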

In [211]:
text_mtx.shape

(23170, 900)

In [212]:
len(female_text2)

23170

In [213]:
vocab = pd.DataFrame({'word':female_text2,'code':np.argmax(text_mtx,axis=1)})

In [214]:
vocab=vocab.drop_duplicates()


In [215]:
vocab.sort_values(by="code")

Unnamed: 0,code,word
0,0,mohammed
11064,0,kaggle
11070,0,reddit
11077,0,coming
11081,0,conf
11085,0,keynotes
11089,0,daiquiris
11095,0,end
11105,0,bucks
11106,0,extractconf
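Many different words share code 0 in the table above. A likely explanation is that a word outside the top words has an all-zero row, and the argmax of an all-zero row is index 0. The sketch below reproduces that tie-breaking in plain Python; the `argmax` helper is a hypothetical stand-in for `np.argmax`, which also returns the first maximum.

```python
def argmax(row):
    """Index of the first maximum, the same tie-breaking rule as np.argmax."""
    return max(range(len(row)), key=row.__getitem__)


in_vocab_row = [0, 0, 1, 0]       # word with its own column
out_of_vocab_row = [0, 0, 0, 0]   # word outside the top words

print(argmax(in_vocab_row), argmax(out_of_vocab_row))
```

As a consequence, code 0 acts as a shared "unknown word" bucket, and looking up code 0 in `vocab` returns whichever of those words happens to come first.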


Shift to predict the next word.

In [216]:
input_ = text_mtx[:-1]
output_ = text_mtx[1:]

input_.shape, output_.shape

((23169, 900), (23169, 900))
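The shift pairs every word with the word that follows it, which is why both arrays have one row fewer than the full matrix. A minimal sketch on a made-up token list:

```python
tokens = ["we", "need", "20", "volunteers"]

inputs = tokens[:-1]   # every token except the last
targets = tokens[1:]   # every token except the first

# Each input word is aligned with the word that follows it.
pairs = list(zip(inputs, targets))
print(pairs)
```

Each training example therefore carries exactly one word of context.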

In [217]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten
from keras.layers.wrappers import TimeDistributed
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, SimpleRNN

Now we create a Sequential model, which is a linear stack of neural network layers. This is one of the model types we learned in class.

In [219]:
model = Sequential()

We start by adding an embedding layer, which turns positive integers (indexes) into dense vectors of fixed size.

This layer can only be used as the first layer in a model.

In [220]:
model.add(Embedding(input_dim=input_.shape[1],output_dim= 42, input_length=input_.shape[1]))
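Conceptually, an embedding layer is a lookup table with one trainable vector per integer index. The plain-Python sketch below illustrates the lookup only (no training); `make_embedding_table` and `embed` are hypothetical helpers, and the random initial values are an assumption.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Create input_dim vectors of size output_dim with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(indices, table):
    """Look up one vector per integer index."""
    return [table[i] for i in indices]


table = make_embedding_table(input_dim=5, output_dim=3)

word_ids = [3, 1, 4]            # a sequence of word indices
vectors = embed(word_ids, table)

binary_row = [0, 0, 1, 0, 0]    # a one-hot style row, as in text_mtx
used_rows = sorted(set(binary_row))
print(len(vectors), len(vectors[0]), used_rows)
```

When the input is a binary row rather than a list of word indices, every position looks up either row 0 or row 1 of the table. As far as the code shows, that is the situation in this notebook, since the layer receives the rows of `text_mtx` directly.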

Then, we flatten the embedding output and pass it to the dense output layer.

In [221]:
model.add(Flatten())
model.add(Dense(output_.shape[1], activation='sigmoid'))

In [222]:
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',metrics=["accuracy"])

Now we fit the model on the word pairs. We chose 10 epochs because each epoch takes over 2 minutes to run and we were short on time.

In [223]:
model.fit(input_, y=output_, batch_size=200, nb_epoch=10, verbose=1, validation_split=0.2)

Train on 18535 samples, validate on 4634 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
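The sample counts reported at the start of training follow from `validation_split=0.2`. According to the Keras documentation, the validation data is taken from the last samples of the arrays. The sketch below reproduces the arithmetic; `split_sizes` is a hypothetical helper, and the truncation with `int` is an assumption about how the split point is rounded.

```python
def split_sizes(num_samples, validation_split):
    """Return (train, validation) sizes when the last fraction is held out."""
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at


train_size, val_size = split_sizes(23169, 0.2)
print(train_size, val_size)
```

For the 23169 shifted rows this gives 18535 training and 4634 validation samples, matching the log line above.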


<keras.callbacks.History at 0x1c213080>

In [224]:
score = model.evaluate(input_,output_, verbose=0)
score

[3.1148451200151661, 0.23820622383357071]

We received an accuracy of about 24% (0.238), measured on the same data the model was fitted on.
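For one-hot style targets, the accuracy figure can be read as the share of rows where the highest-scoring predicted column equals the target column. The sketch below computes that on made-up numbers; `categorical_accuracy` here is a hypothetical plain-Python helper, and treating it as equivalent to the Keras metric is an assumption.

```python
def argmax(row):
    """Index of the first maximum value in the row."""
    return max(range(len(row)), key=row.__getitem__)


def categorical_accuracy(targets, predictions):
    """Fraction of rows where predicted and target argmax agree."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(targets, predictions))
    return hits / len(targets)


# Hypothetical targets and model outputs for four samples.
targets = [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
predictions = [[0.1, 0.7, 0.2], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]

accuracy = categorical_accuracy(targets, predictions)
print(accuracy)
```

Rows for out-of-vocabulary words are all zeros, so their target argmax is 0. Such rows can count as hits whenever the model also predicts column 0, which is worth keeping in mind when reading the score.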

In [226]:
get_next("hello",token,model,vocab)



'n'

In [227]:
vocab = vocab2

In [228]:
#vocab.shape
vocab

Unnamed: 0,code,word
0,0,Mohammed
1,0,Nazili
2,0,Suddicqui's
3,228,post
4,0,advertising
5,14,a
6,824,payment
7,0,gateway
8,16,is
9,631,removed


In [241]:
def get_next(text,token,model,vocabulary):
    """Return the word predicted to follow the first word of `text`."""
    # convert the word to its binary (one-hot) matrix representation
    tmp = text_to_word_sequence(text, lower=False, split=" ")
    tmp = token.texts_to_matrix(tmp, mode='binary')
    # predict the code of the next word and map it back to a word
    bestMatch=model.predict_classes(tmp)[0]
    return vocabulary[vocabulary['code']==bestMatch]['word'].values[0]

The following function returns a list of generated texts. First we generate FEMALE texts.

In [242]:
def generate_text(num_message,length,model,token,vocab):
    """Generate `num_message` texts of `length` predicted words each."""
    lst=[]
    for j in range(0,num_message):
        # pick a random seed word (randint's upper bound is exclusive)
        start = np.random.randint(0, len(vocab))
        pattern = vocab.iloc[start].word
        message=''+ pattern
        # generate words one at a time
        for i in range(length):
            # predict the next word from the previous one
            prediction = get_next(pattern,token,model,vocab)
            message=message+' '+prediction
            pattern = prediction
        lst.append(message)
    return lst
lst_generate=generate_text(10,40,model,token,vocab)
lst_generate





['SENTENCESTART \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 '500GB \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 '\r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 'ly split \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 'from r Join BIGDATA \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 'Two \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 'Calling hpc \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r',
 'SENTENCEEND \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \r

We convert the list to a dataframe and export the dataframe to a file. That file now contains all the texts generated.

In [245]:
female_df = pd.DataFrame(lst_generate)

female_df.to_csv('generated_female_posts.csv', sep=',', index=False)
female_df.head(10)

Unnamed: 0,0
0,SENTENCESTART \r \r \r \r \r \r \r \r \r \r \r...
1,500GB \r \r \r \r \r \r \r \r \r \r \r \r \r \...
2,\r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \...
3,ly split \r \r \r \r \r \r \r \r \r \r \r \r \...
4,from r Join BIGDATA \r \r \r \r \r \r \r \r \r...
5,Two \r \r \r \r \r \r \r \r \r \r \r \r \r \r ...
6,Calling hpc \r \r \r \r \r \r \r \r \r \r \r \...
7,SENTENCEEND \r \r \r \r \r \r \r \r \r \r \r \...
8,\r \r \r \r \r \r \r \r \r \r \r \r \r \r \r \...
9,and To goo with r Join BIGDATA \r \r \r \r \r ...


We now repeat the process with the MALE text.

In [246]:
male_text = ""
for i in range(len(data_male)):
    male_text = male_text + data_male[i] + " /n/r "  
male_text 



First we clean the text and remove any unwanted characters, then we separate the words. Note that the lower-casing line in this cell acts on female_text, so male_text keeps its original capitalization.

In [247]:
# NOTE: str.replace matches these patterns literally, not as regular expressions.
male_text = male_text.replace(r'\[.*?\]|\(.*http.+\)|\(.*https.+\)|\<.*http.+\>', '')
male_text = male_text.replace(r'Rado([^\s]+)|Skarp([^\s]+)', '')
male_text = male_text.replace(r'\=[A-Z|0-9][A-Z|0-9]|\=', '')
male_text = male_text.replace('\r','')
male_text = male_text.replace('--',' ')
# This line lower-cases female_text, not male_text (likely a copy-paste slip),
# which is why the male tokens below keep their capital letters.
female_text = female_text.lower()
male_text



In [248]:
male_text2 = text_to_word_sequence(male_text, lower=False, split=" ")

In [249]:
male_text2[0:50]

['Does',
 'Gmail',
 'sell',
 'information',
 'is',
 "one's",
 'private',
 'emails',
 'Judging',
 'by',
 'adverts',
 'in',
 'my',
 'Facebook',
 'news',
 'feed',
 'I',
 'would',
 'say',
 'Yes',
 'But',
 'perhaps',
 'this',
 "isn't",
 'news',
 'for',
 'anyone',
 'and',
 'I',
 'have',
 'been',
 'under',
 'a',
 'rock',
 'for',
 'years',
 'Clarification',
 'please',
 'Is',
 'private',
 'email',
 'private',
 'in',
 'name',
 'only',
 'n',
 'r',
 'Hold',
 'the',
 'applause']

Now we initialize the tokenizer and fit it on the word list, with nb_words=900 restricting the encoding to the most frequent words in the text.

Each word is represented by a vector of size 900. A row contains a 1 in the column of its word if that word is among the top 900 words, and is all zeros otherwise.

For that, we use texts_to_matrix.

In [250]:
from keras.preprocessing.text import Tokenizer
token = Tokenizer(nb_words=900,char_level=False)
token.fit_on_texts(male_text2)

In [251]:
text_mtx = token.texts_to_matrix(male_text2, mode='binary')

In [252]:
text_mtx.shape

(49466, 900)

In [253]:
vocab2 = pd.DataFrame({'word':male_text2,'code':np.argmax(text_mtx,axis=1)})
vocab2=vocab2.drop_duplicates()
vocab2.sort_values(by="code")


Unnamed: 0,code,word
17455,0,finally
24335,0,aux
24336,0,yeux
24340,0,viol
24341,0,vole
24343,0,tue
24345,0,cachette
24332,0,fois
24348,0,diable
24350,0,vient


Shift to predict the next word.

In [254]:
input_ = text_mtx[:-1]
output_ = text_mtx[1:]

input_.shape, output_.shape

((49465, 900), (49465, 900))

In [259]:
model2 = Sequential()

As before, we create a Sequential model, which is a linear stack of neural network layers.

In [260]:
model2.add(Embedding(input_dim=input_.shape[1],output_dim= 42, input_length=input_.shape[1]))

In [261]:
model2.add(Flatten())
model2.add(Dense(output_.shape[1], activation='sigmoid'))

In [262]:
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop',metrics=["accuracy"])

In [263]:
model2.fit(input_, y=output_, batch_size=200, nb_epoch=10, verbose=1, validation_split=0.2)

Train on 39572 samples, validate on 9893 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x6d0a9198>

In [265]:
score = model2.evaluate(input_,output_, verbose=0)

In [266]:
score

[3.1289339194437087, 0.13736985747136735]

We received an accuracy of about 14% (0.137).

In [267]:
lst_generate=generate_text(169,40,model2,token,vocab2)
lst_generate

['institutions in the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using',
 'Lands in the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using',
 'Transgender in the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using',
 'blueberries in the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data science and the same techniques Using data scien
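The repeating phrase in the output above is what one would expect from greedy generation with a single word of context: the same input word always yields the same next word, so the sequence falls into a cycle once a word repeats. The sketch below imitates this with a hand-written lookup table. The `next_word` table and the `generate` helper are hypothetical and only mimic the pattern seen in the output; they are not derived from the trained model.

```python
# Hypothetical "most likely next word" table standing in for the model.
next_word = {
    "in": "the", "the": "same", "same": "techniques",
    "techniques": "using", "using": "data", "data": "science",
    "science": "and", "and": "the",
}


def generate(seed, length, table, fallback="in"):
    """Greedily append the single most likely next word, `length` times."""
    words = [seed]
    current = seed
    for _ in range(length):
        # Unknown words fall back to one fixed prediction.
        current = table.get(current, fallback)
        words.append(current)
    return " ".join(words)


message = generate("lands", 10, next_word)
print(message)
```

Sampling from the predicted distribution instead of always taking the top word, or conditioning on more than one previous word, are common ways to break such loops.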

After generating the texts for the males, we output the results to a dataframe and then to a .csv file.

In [268]:
male_df = pd.DataFrame(lst_generate)

male_df.to_csv('generated_male_posts.csv', sep=',', index=False)
male_df.head(10)

Unnamed: 0,0
0,institutions in the same techniques Using data...
1,Lands in the same techniques Using data scienc...
2,Transgender in the same techniques Using data ...
3,blueberries in the same techniques Using data ...
4,1846529728965286 in the same techniques Using ...
5,greenish in the same techniques Using data sci...
6,Future of the same techniques Using data scien...
7,perhaps in the same techniques Using data scie...
8,techno in the same techniques Using data scien...
9,slots in the same techniques Using data scienc...
