# Character-Level Language Modeling

**As input I used the text of the novel "The Man in the Brown Suit" by Agatha Christie (Project Gutenberg edition), saved as a .txt file.**

In [32]:


# Read and preprocess the text
with open('61168.txt', 'r', encoding='utf-8') as fp:
  text = fp.read()

# Keep the span from the first occurrence of the title to the Project Gutenberg footer
start_indx = text.find('The Man in the Brown Suit')
end_indx = text.find('End of the Project Gutenberg')
text = text[start_indx:end_indx]
char_set = set(text)
print('Total length: ', len(text))
print('Unique characters: ', len(char_set))

Total length:  431300
Unique characters:  99


## Building the dictionary with a NumPy array

In [3]:
import numpy as np

chars_sorted = sorted(char_set)
char2int = {ch: i for i, ch in enumerate(chars_sorted)}  # character -> index
char_array = np.array(chars_sorted)  # index -> character

In [4]:
text_encoded = np.array([char2int[ch] for ch in text], dtype = np.int32)

In [5]:
print('Encoded text shape: ', text_encoded.shape)
print(text[:15], '== Encoding ==>', text_encoded[:15])
print(text_encoded[15:21], '== Decoding ==>', ''.join(char_array[text_encoded[15:21]]))

Encoded text shape:  (431300,)
The Man in the  == Encoding ==> [48 65 62  1 41 58 71  1 66 71  1 77 65 62  1]
[30 75 72 80 71  1] == Decoding ==> Brown 
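
The round trip between characters and integer codes can be checked on a toy string. The sketch below is standalone and uses plain Python lists instead of NumPy; the sample string and all `toy_`-prefixed names are illustrative assumptions, not part of the notebook.

```python
# Toy illustration of the char <-> int mapping (all names here are hypothetical).
toy_text = 'the man in the brown suit'

toy_chars_sorted = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars_sorted)}

# Encode: one integer per character.
toy_encoded = [toy_char2int[ch] for ch in toy_text]

# Decode: index back into the sorted character list.
toy_decoded = ''.join(toy_chars_sorted[i] for i in toy_encoded)

print(toy_encoded[:7])
print(toy_decoded == toy_text)
```

Because the mapping is built from the sorted set of characters, decoding is just an index lookup, which is why the notebook can decode with `char_array[...]`.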


## Creating a TensorFlow dataset from the NumPy array

In [6]:
import tensorflow as tf

ds_text_encoded = tf.data.Dataset.from_tensor_slices(text_encoded)

In [7]:
for ex in ds_text_encoded.take(5):
  print('{} -> {}'.format(ex.numpy(), char_array[ex.numpy()]))

48 -> T
65 -> h
62 -> e
1 ->  
41 -> M


## Obtaining the input x and the target y

In [8]:
seq_length = 40
chunk_size = seq_length + 1

In [9]:
ds_chunks = ds_text_encoded.batch(chunk_size, drop_remainder=True)

In [10]:
# Split each chunk into an input x and a target y shifted by one character
def split_input_target(chunk):
  input_seq = chunk[:-1]
  target_seq = chunk[1:]
  return input_seq, target_seq

ds_sequences = ds_chunks.map(split_input_target)

In [11]:
for example in ds_sequences.take(2):
  print('Input (x): ', repr(''.join(char_array[example[0].numpy()])))
  print('Target (y): ', repr(''.join(char_array[example[1].numpy()])))
  print()

Input (x):  'The Man in the Brown Suit, by Agatha Chr'
Target (y):  'he Man in the Brown Suit, by Agatha Chri'

Input (x):  'stie\n\nThis eBook is for the use of anyon'
Target (y):  'tie\n\nThis eBook is for the use of anyone'
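
The chunking and shifting above can be reproduced without TensorFlow. This is a minimal sketch on a short string with a shortened sequence length; `toy_text`, `toy_seq_length` and the other names are assumptions made for the illustration.

```python
# Toy version of the chunking step (sequence length shortened for readability).
toy_text = 'The Man in the Brown Suit'
toy_seq_length = 5
toy_chunk_size = toy_seq_length + 1

# Non-overlapping chunks; the incomplete tail is dropped,
# mirroring batch(chunk_size, drop_remainder=True).
n_chunks = len(toy_text) // toy_chunk_size
chunks = [toy_text[i * toy_chunk_size:(i + 1) * toy_chunk_size]
          for i in range(n_chunks)]

# The input is the chunk without its last character,
# the target is the chunk without its first one.
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for x, y in pairs:
    print(repr(x), '->', repr(y))
```

Each target is the input shifted left by one character, so at every position the model is trained to predict the next character.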



## Splitting into mini-batches

In [12]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000
ds = ds_sequences.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

## Building the RNN-based model

In [13]:
def build_model(vocab_size, embedding_dim, rnn_units):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim),
      tf.keras.layers.LSTM(
          rnn_units,
          return_sequences=True),
      tf.keras.layers.Dense(vocab_size)
  ])
  return model

## Setting the training parameters

In [14]:
charset_size = len(char_array)
embedding_dim = 256
rnn_units = 512

tf.random.set_seed(1)
model = build_model(
    vocab_size=charset_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units
)

## Model summary

In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 256)         25344     
                                                                 
 lstm (LSTM)                 (None, None, 512)         1574912   
                                                                 
 dense (Dense)               (None, None, 99)          50787     
                                                                 
Total params: 1,651,043
Trainable params: 1,651,043
Non-trainable params: 0
_________________________________________________________________
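
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes from this notebook; the variable names ending in `_params` are introduced here for illustration only.

```python
# Parameter counts for Embedding -> LSTM -> Dense, using the notebook's sizes.
vocab_size = 99
embedding_dim = 256
rnn_units = 512

# Embedding: one vector of size embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 weight blocks (input, forget and output gates plus the cell candidate),
# each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weight matrix plus bias, one output logit per character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The four numbers match the `Param #` column and the total reported by `model.summary()`.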


## Training the model

In [17]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True)
)

In [18]:
history = model.fit(ds, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Generating new text passages

In [19]:
tf.random.set_seed(1)
logits = [[1.0, 1.0, 1.0]]
print('Probabilities: ', tf.math.softmax(logits).numpy()[0])

Probabilities:  [0.33333334 0.33333334 0.33333334]


In [20]:
samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())

array([[0, 0, 1, 2, 0, 0, 0, 0, 1, 0]])


In [21]:
tf.random.set_seed(1)
logits = [[1.0, 1.0, 3.0]]
print('Probabilities: ', tf.math.softmax(logits).numpy()[0])

Probabilities:  [0.10650698 0.10650698 0.78698605]


In [22]:
samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())

array([[2, 0, 2, 2, 2, 0, 1, 2, 2, 0]])


In [26]:
# The sample() function
def sample(model, starting_str, 
           len_generated_text=500,
           max_input_length=40,
           scale_factor=1.0):
  encoded_input = [char2int[s] for s in starting_str]
  encoded_input = tf.reshape(encoded_input, (1, -1))

  generated_str = starting_str

  model.reset_states()
  for i in range(len_generated_text):
    logits = model(encoded_input)
    logits = tf.squeeze(logits, 0)

    scaled_logits = logits * scale_factor
    new_char_indx = tf.random.categorical(scaled_logits, num_samples=1)
    
    new_char_indx = tf.squeeze(new_char_indx)[-1].numpy()

    generated_str += str(char_array[new_char_indx])

    new_char_indx = tf.expand_dims([new_char_indx], 0)
    encoded_input = tf.concat(
        [encoded_input, new_char_indx], axis=1)
    encoded_input = encoded_input[:, -max_input_length:]

  return generated_str

In [27]:
# Text generation example
tf.random.set_seed(1)
print(sample(model, starting_str='Tne Man'))

Tne Man in the Project Gutenberg Literary Sup)) nodedicates cent on deck, and went into the terms—that it was a tharked the secret clemement she round the Mill House and the man who like the boat after here, I walked such a copy, he might have
a ship unwaring care—onle.

“Any one things you can received that his implined all the Africa, I knew over my story from the expity to Kimberley is incontinued painors after the departh he went
outside with a young men. She was back into though I had returned fro


## Scaling the logits

In [28]:
logits = np.array([[1.0, 1.0, 3.0]])
print('Probabilities before scaling: ', tf.math.softmax(logits).numpy()[0])
print('Probabilities after scaling by a factor of 0.5: ', tf.math.softmax(0.5*logits).numpy()[0])
print('Probabilities after scaling by a factor of 0.1: ', tf.math.softmax(0.1*logits).numpy()[0])

Probabilities before scaling:  [0.10650698 0.10650698 0.78698604]
Probabilities after scaling by a factor of 0.5:  [0.21194156 0.21194156 0.57611688]
Probabilities after scaling by a factor of 0.1:  [0.31042377 0.31042377 0.37915245]
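
The same numbers can be reproduced with a few lines of plain Python. This is a sketch with a hand-written `softmax` helper (an assumption for illustration, not the TensorFlow implementation). Multiplying the logits by a factor alpha is equivalent to dividing them by a temperature of 1/alpha.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 1.0, 3.0]
for alpha in (1.0, 0.5, 0.1):
    # Scale the logits first, then normalize.
    probs = softmax([alpha * v for v in logits])
    print(alpha, [round(p, 4) for p in probs])
```

A factor below 1 pulls the three probabilities toward the uniform value 1/3, while a factor above 1 would concentrate the mass on the largest logit.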


## Generated text with alpha = 2.0

In [29]:
tf.random.set_seed(1)
print(sample(model, starting_str='Tne Man', scale_factor=2.0))

Tne Man in the Brown Suit.’”

“Oh, yes, you know. I’ve got my eyes reading the station. In the nawing me he had seen the police come and the station. It was it about the secret men with a sickly beautiful time.”

“I don’t know that I had really done of the silences. The coint of my story. It was a little law from the platform. He was just in an orders to pretty of the situation of my neck and the secret the danger. I had displayed it as the mountains of the time. I don’t seem to understand, so that I h


## Generated text with alpha = 5.0

In [31]:
tf.random.set_seed(1)
print(sample(model, starting_str='Tne Man', scale_factor=5.0))

Tne Man in the Brown Suit.’”

“I should like the same time I did not see it all the secretary seemed to me that the man who had a good deal of the time. I don’t know that I had the things that I had a perfectly and seemed to me that the case that I was the time of the station. I was still help since of the time. I was in the secretary is some one who had started to me that the diamonds with a sigh of the station. I was a secretary of the time. I should have the terms of the things that I had the terms 


## Generated text with alpha = 0.5

In [30]:
tf.random.set_seed(1)
print(sample(model, starting_str='Tne Man', scale_factor=0.5))

Tne Man aid 25d.% we asliled
abod, really ruc man7 was kignafe washS begin if you mackatis,” I? dé?, I hadeief) th)ome horrinfie?”

“How think a?Mmwdlipava _itt_ Slafe.”
This I exjected airpact
VeM.4 .8 Hexop my Anne locked Juifies! Nevelbbat—" Hedclitexalr? I cem—truate wPry2chictivily
“Enqquicoon._ So that duptheskness?

Eurcha)nitribf, eecermstifivently ecretmend_ socrwast Officew
silvehed., Sir Eus, whten
Clarninf Marl?” Bedgriet drussime—’s—neu’tichply.m:, whot. Gaitle time ‘tot. De
dirfuzome casc


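These examples show that the larger the scale factor alpha, the more predictable the generated text: with alpha = 5.0 the model keeps repeating its most likely phrases. A smaller factor flattens the distribution and makes sampling more random, so with alpha = 0.5 the output is mostly gibberish.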
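
One way to put a number on this effect is the entropy of the sampling distribution. The sketch below is an illustration on the toy logits `[1.0, 1.0, 3.0]`, not a measurement on the trained model; the helpers `softmax` and `entropy` are written here for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more random sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [1.0, 1.0, 3.0]
alphas = [0.5, 1.0, 2.0, 5.0]
entropies = [entropy(softmax([a * v for v in logits])) for a in alphas]

for a, h in zip(alphas, entropies):
    print(f'alpha={a}: entropy={h:.4f}')
```

The entropy falls as alpha grows, which matches what the generated samples show: less randomness and more repetition at large alpha.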