<a href="https://colab.research.google.com/github/shalini9795/SuperheroNameGenerator-NLP/blob/main/Superhero_Name_Generator_Learner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Superhero (and Supervillain) Name Generator

---

[Superhero Names Dataset](https://github.com/am1tyadav/superhero)

## Task 2

1. Import the data
2. Create a tokenizer
3. Char to index and Index to char dictionaries

In [1]:
!git clone https://github.com/am1tyadav/superhero

Cloning into 'superhero'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (8/8), done.


In [2]:
with open('superhero/superheroes.txt','r') as f:
  data=f.read()

data[:100]

'jumpa\t\ndoctor fate\t\nstarlight\t\nisildur\t\nlasher\t\nvarvara\t\nthe target\t\naxel\t\nbattra\t\nchangeling\t\npyrrh'

In [3]:
import tensorflow as tf
print(tf.__version__)

2.5.0


Each name in the data ends with a tab character (\t), and we do not want the tokenizer to filter it out. We therefore pass our own filters string, which lists the usual punctuation characters but leaves out \t and \n.

In [4]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
    split='\n',
)
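
To see what the custom filters string changes, it can be compared with the tokenizer's default. This is a small sketch in plain Python; `DEFAULT_FILTERS` is copied from the Keras default for `Tokenizer(filters=...)`, and the variable names are only for illustration.

```python
# Default `filters` value of tf.keras.preprocessing.text.Tokenizer
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Remove tab and newline from the set of characters to be filtered out
custom_filters = DEFAULT_FILTERS.replace('\t', '').replace('\n', '')

print(repr(custom_filters))
print('\t' in custom_filters, '\n' in custom_filters)  # False False
```

The result is the same string that is passed to the tokenizer in the cell above, so the tab at the end of each name survives tokenization.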

Fit the tokenizer on the data. This is the first step in converting a sequence of characters into a sequence of numbers that the machine can work with.

In [5]:
tokenizer.fit_on_texts(data)

Fitting builds the vocabulary: every character the tokenizer sees is assigned an integer index. For example, whenever the tokenizer gets 'a' it will assign it the numeric value 2.

---
Why is it important?
---

Our neural network model does not understand characters, but it does understand numbers.

Therefore we have created a tokenizer.
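
The idea behind the vocabulary can be sketched without Keras. The tokenizer numbers entries by frequency, most frequent first, starting at 1 (0 is kept free for padding). The function below is a simplified, hypothetical stand-in for that behaviour, run on a toy string rather than the full dataset, so the indices it prints differ from the real ones.

```python
from collections import Counter

def build_char_index(text, skip='\n'):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for ch in text if ch not in skip)
    # Stable sort by descending count; ties keep first-seen order
    ordered = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i for i, ch in enumerate(ordered, start=1)}

sample = 'axel\t\nbattra\t\n'   # toy text, not the full dataset
toy_index = build_char_index(sample)
print(toy_index)
```

In this toy text 'a' appears three times, so it gets index 1. In the real data the tab is the most frequent character, because every name ends with one.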



In [6]:
char_to_index = tokenizer.word_index
index_to_char = dict((v,k) for k,v in char_to_index.items())

print("printing contents of index_to_char", index_to_char)

printing contents of index_to_char {1: '\t', 2: 'a', 3: 'e', 4: 'r', 5: 'o', 6: 'n', 7: 'i', 8: ' ', 9: 't', 10: 's', 11: 'l', 12: 'm', 13: 'h', 14: 'd', 15: 'c', 16: 'u', 17: 'g', 18: 'k', 19: 'b', 20: 'p', 21: 'y', 22: 'w', 23: 'f', 24: 'v', 25: 'j', 26: 'z', 27: 'x', 28: 'q'}


## Task 3

1. Converting between names and sequences

In [7]:
names = data.splitlines() #splitting by \n
names[:10]

['jumpa\t',
 'doctor fate\t',
 'starlight\t',
 'isildur\t',
 'lasher\t',
 'varvara\t',
 'the target\t',
 'axel\t',
 'battra\t',
 'changeling\t']

In [8]:
tokenizer.texts_to_sequences(names[0])

[[25], [16], [12], [20], [2], [1]]

In [9]:
def name_to_seq(name):
  return [tokenizer.texts_to_sequences(c)[0][0] for c in name]

In this function we have converted the list of lists into a single flat list, i.e. [[25], [16], [12], [20], [2], [1]] becomes [25, 16, 12, 20, 2, 1].
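
The flattening step, and an equivalent direct dictionary lookup, can be sketched in plain Python. The dictionary here is a hand-copied subset of the vocabulary printed earlier, and `name_to_seq_lookup` is a hypothetical alternative, not the function used in this notebook.

```python
# Subset of the character-to-index mapping, enough for 'jumpa\t'
toy_char_to_index = {'\t': 1, 'a': 2, 'm': 12, 'u': 16, 'p': 20, 'j': 25}

# Flatten the nested result returned by texts_to_sequences
nested = [[25], [16], [12], [20], [2], [1]]
flat = [inner[0] for inner in nested]
print(flat)  # [25, 16, 12, 20, 2, 1]

def name_to_seq_lookup(name):
    """Look each character up directly instead of calling the tokenizer."""
    return [toy_char_to_index[c] for c in name]

print(name_to_seq_lookup('jumpa\t'))  # [25, 16, 12, 20, 2, 1]
```

Both routes give the same sequence. The lookup version raises a KeyError for a character that is not in the dictionary.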

In [10]:
name_to_seq(names[0])

[25, 16, 12, 20, 2, 1]

In [11]:
def seq_to_name(seq):
  return ''.join([index_to_char[i] for i in seq if i!=0])

In [12]:
seq_to_name(name_to_seq(names[0])) # round trip through both helper functions returns the original name

'jumpa\t'

## Task 4

1. Creating sequences
2. Padding all sequences

In [15]:
sequences= []

for name in names:
  seq=name_to_seq(name)
  if len(seq) >= 2:
    sequences += [seq[:i] for i in range(2,len(seq)+1)]

For each name we build every prefix of length 2 or more: we start with the first two characters and then add one character at a time. In each prefix, the last character is the one the model has to learn to predict from the characters before it.
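
As a small illustration, here is the same prefix expansion applied to a single sequence, printed as "context -> next character". The helper name `prefixes` is made up for this sketch.

```python
def prefixes(seq, min_len=2):
    """Return every prefix of seq that has at least min_len items."""
    return [seq[:i] for i in range(min_len, len(seq) + 1)]

seq = [25, 16, 12, 20, 2, 1]   # 'jumpa\t'
for p in prefixes(seq):
    print(p[:-1], '->', p[-1])
```

A name with n characters (including the tab) yields n - 1 training examples, and a sequence with fewer than two items yields none.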

In [16]:
sequences[:10]

[[25, 16],
 [25, 16, 12],
 [25, 16, 12, 20],
 [25, 16, 12, 20, 2],
 [25, 16, 12, 20, 2, 1],
 [14, 5],
 [14, 5, 15],
 [14, 5, 15, 9],
 [14, 5, 15, 9, 5],
 [14, 5, 15, 9, 5, 4]]

In [19]:
max_len = max([len(x) for x in sequences])
print(max_len)
#tokenizer starts with 1 and not with 0

33


In [21]:
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences,padding = 'pre',
    maxlen=max_len
)
# pad with 0s at the front so that all sequences have the same length
# we pad at the front rather than the end so that the label is always the last element
print(padded_sequences[0])

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0 25 16]
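
To make the effect of padding at the front concrete, here is a plain-Python sketch that compares it with padding at the end. `pad_pre` and `pad_post` are hypothetical helpers written for this comparison, not the Keras implementation, and the sequences are toy data.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq to maxlen; if it is too long, keep only the last maxlen items."""
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + list(seq)

def pad_post(seq, maxlen, value=0):
    """Right-pad seq to maxlen (shown only for comparison)."""
    return list(seq[:maxlen]) + [value] * max(0, maxlen - len(seq))

toy_sequences = [[25, 16], [25, 16, 12], [25, 16, 12, 20]]
toy_maxlen = 5

pre = [pad_pre(s, toy_maxlen) for s in toy_sequences]
post = [pad_post(s, toy_maxlen) for s in toy_sequences]

print([row[-1] for row in pre])   # [16, 12, 20]
print([row[-1] for row in post])  # [0, 0, 0]
```

With padding at the front, the last column always holds the character to predict. With padding at the end, that column would mostly hold zeros.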


In [22]:
padded_sequences.shape

(88279, 33)

## Task 5: Creating Training and Validation Sets

1. Creating training and validation sets

In [23]:
x,y=padded_sequences[:,:-1],padded_sequences[:,-1]
print(x.shape,y.shape)

(88279, 32) (88279,)


In [24]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(66209, 32) (66209,)
(22070, 32) (22070,)
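
The shapes above can be checked by hand. The sketch below assumes that train_test_split uses its default test size of 0.25 and rounds the number of test rows up, which is consistent with the printed output.

```python
import math

n_samples = 88279   # rows in padded_sequences
test_size = 0.25    # assumed default when no size is passed

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 66209 22070
```

The split is shuffled at random, so the rows that land in each set change from run to run unless a random_state is passed to train_test_split.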


In [27]:
num_chars = len(char_to_index.keys())+1 # +1 because index 0 is used for padding
print(num_chars)

29


## Task 6: Creating the Model

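We pass the number of characters as the first argument to the Embedding layer because that is the size of our vocabulary.

---
The second argument, 8, is the dimension of the feature vector learned for each character. We do not need a large value here because the sequences are short: max_len is only 33.

Why are we using causal padding? With causal padding, the output at time step t does not depend on the inputs at t+1 or later, so the temporal order is not violated. This makes causal padding a good fit for sequential data such as time series.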
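
The causal property can be shown with a tiny one-channel convolution in plain Python. This is only a sketch of the idea: the function name and the kernel weights are made up, and the real Conv1D layer works on many channels at once.

```python
def causal_conv1d(xs, kernel):
    """1D convolution where output[t] only uses xs[t-k+1 .. t] (zeros padded on the left)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(xs))]

kernel = [0.5, 0.5]   # hypothetical weights
xs = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(xs, kernel)
print(out)  # [0.5, 1.5, 2.5, 3.5]

# Changing the last input leaves all earlier outputs unchanged
changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
print(changed[:3] == out[:3])  # True
```

The output has the same length as the input because the padding is added only on the left.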


In [29]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPool1D, LSTM
from tensorflow.keras.layers import Bidirectional, Dense

model=Sequential([
                  Embedding(num_chars, 8, input_length=max_len-1),
                  Conv1D(64,5,strides=1,activation='tanh', padding='causal'),
                  MaxPool1D(2),
                  LSTM(32),
                  Dense(num_chars, activation='softmax')

])

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 32, 8)             232       
_________________________________________________________________
conv1d (Conv1D)              (None, 32, 64)            2624      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 16, 64)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                12416     
_________________________________________________________________
dense (Dense)                (None, 29)                957       
Total params: 16,229
Trainable params: 16,229
Non-trainable params: 0
_________________________________________________________________
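
The parameter counts in the summary can be reproduced with a little arithmetic. The sketch below uses the standard formulas for each layer type (weights plus biases, and four gates for the LSTM). The variable names are only for illustration.

```python
vocab_size, emb_dim = 29, 8       # num_chars and the embedding dimension
n_filters, kernel_size = 64, 5    # Conv1D settings
lstm_units = 32

embedding = vocab_size * emb_dim
conv1d = kernel_size * emb_dim * n_filters + n_filters
# 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * (n_filters + lstm_units + 1) * lstm_units
dense = lstm_units * vocab_size + vocab_size

total = embedding + conv1d + lstm + dense
print(embedding, conv1d, lstm, dense, total)  # 232 2624 12416 957 16229
```

MaxPool1D has no weights, so it contributes nothing to the total.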


## Task 7: Training the Model

## Task 8: Generate Names!