# Text generation using RNN - Character Level (TO BE RUN IN GOOGLE COLAB)

To generate text using an RNN, we first need to convert raw text into a supervised learning format.

Take, for example, the following corpus:

"Her brother shook his head incredulously"

First, we divide the data into a tabular format containing input (X) and output (y) sequences. For a character-level model, X and y look like this:

|      X     |  y  |
|------------|-----|
|    Her b   |  r  |
|    er br   |  o  |
|    r bro   |  t  |
|     brot   |  h  |
|    broth   |  e  |
|    .....   |  .  |
|    .....   |  .  |
|    ulous   |  l  |
|    lousl   |  y  |

Note that in the above problem, the sequence length of X is five characters and that of y is one character. Hence, this is a many-to-one architecture. We can, however, change the number of input characters to any number of characters depending on the type of problem.
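The table above can be reproduced with a short sliding-window loop. This is only a minimal sketch: the names `corpus`, `SEQ_LENGTH`, `make_pairs`, `X_toy` and `y_toy` are illustrative and are not used elsewhere in the notebook.

```python
# Toy illustration of the sliding-window split described above.
corpus = "Her brother shook his head incredulously"
SEQ_LENGTH = 5  # number of input characters per sample, as in the table


def make_pairs(text, seq_length, step=1):
    """Return (X, y): each X entry is seq_length characters, y is the character that follows."""
    X, y = [], []
    for i in range(0, len(text) - seq_length, step):
        X.append(text[i: i + seq_length])
        y.append(text[i + seq_length])
    return X, y


X_toy, y_toy = make_pairs(corpus, SEQ_LENGTH)

# Print the first few rows of the table
for x_seq, target in list(zip(X_toy, y_toy))[:5]:
    print(repr(x_seq), "->", repr(target))
```

Increasing `step` skips windows and yields fewer, less overlapping samples.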

A model is trained on such data. To generate text, we give the model any five characters (the seed), from which it predicts the next character. The predicted character is then appended to the right end of the input sequence, and the leftmost character is discarded, so the input stays five characters long. The model predicts again using the new sequence, and the cycle continues for a fixed number of iterations. An example is shown below:

Seed text: "incre"

| X                                                                    | y                     |
|----------------------------------------------------------------------|-----------------------|
| incre                                                                | < predicted char 1 >  |
| ncre< predicted char 1 >                                             | < predicted char 2 >  |
| cre< predicted char 1 >< predicted char 2 >                          | < predicted char 3 >  |
| re< predicted char 1 >< predicted char 2 >< predicted char 3 >       | < predicted char 4 >  |
|                      ...                          |            ...           |
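The rolling window in the table can be sketched without a neural network. In the snippet below, a lookup table built from the example corpus stands in for the trained model. This is purely an assumption for illustration; the real model predicts a probability distribution rather than looking up an exact match. The names `lookup` and `generate` are hypothetical.

```python
corpus = "Her brother shook his head incredulously"
SEQ_LENGTH = 5

# Stand-in for a trained model: map every 5-character window to the character that follows it.
lookup = {
    corpus[i: i + SEQ_LENGTH]: corpus[i + SEQ_LENGTH]
    for i in range(len(corpus) - SEQ_LENGTH)
}


def generate(seed, n_chars):
    """Repeatedly predict a character, append it on the right and drop the leftmost one."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        next_char = lookup.get(window)
        if next_char is None:  # window never seen in the corpus: nothing to predict
            break
        generated += next_char
        window = window[1:] + next_char  # slide the window by one character
    return generated


print(generate("incre", 20))
```

The loop stops early here because the toy lookup has no entry once the end of the corpus is reached; a trained model always returns a prediction, so the real loop runs for the full number of iterations.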

# Notebook Overview
1. Preprocess data
2. LSTM model
3. Generate code

In [8]:
!pip install gitpython



In [9]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import os
import re
import numpy as np
import random
import sys
import io
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, LSTM
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import get_file

# 1. Preprocess data

We're going to build a C code generator by training an RNN on a large corpus of C code (the Linux kernel source). You can download the C code used as source text from the following link:
https://github.com/torvalds/linux/tree/master/kernel

We have already downloaded the entire kernel folder and stored it in a local directory.

## Load C code

In [10]:
import os
import git

# Define the repository and directory path
repo_url = "https://github.com/sreegithub19/upgrad_programming"
repo_path = "/content/upgrad_programming"
subdir_path = "2_Course_continuation/_2_Exam_2/4_Deep_learning/_6_Recurrent_Neural_Networks"

# Clone the repository
if not os.path.exists(repo_path):
    git.Repo.clone_from(repo_url, repo_path)

# Change the working directory to the specific folder
os.chdir(os.path.join(repo_path, subdir_path))
print("Current working directory:", os.getcwd())

Current working directory: /content/upgrad_programming/2_Course_continuation/_2_Exam_2/4_Deep_learning/_6_Recurrent_Neural_Networks


In [11]:
# set path where C files reside

print("Current working directory:", os.getcwd())

path = r"linux_kernel"

os.chdir(path)

file_names = os.listdir()
print(file_names)

Current working directory: /content/upgrad_programming/2_Course_continuation/_2_Exam_2/4_Deep_learning/_6_Recurrent_Neural_Networks
['module_signature.c', 'ucount.c', 'bpf', 'panic.c', 'tracepoint.c', 'sysctl-test.c', 'kallsyms.c', 'ksysfs.c', 'acct.c', 'Kconfig.hz', 'kallsyms_internal.h', 'scs.c', 'uid16.c', 'Kconfig.locks', 'uid16.h', 'params.c', 'events', 'pid_sysctl.h', 'smpboot.h', 'exec_domain.c', 'gen_kheaders.sh', 'exit.c', 'vmcore_info.c', 'locking', 'audit_tree.c', 'audit_watch.c', 'groups.c', 'user_namespace.c', 'workqueue.c', 'usermode_driver.c', 'user.c', 'umh.c', 'kexec_elf.c', 'taskstats.c', 'elfcorehdr.c', 'static_call_inline.c', 'kexec.c', 'backtracetest.c', 'irq', 'stackleak.c', 'smpboot.c', 'jump_label.c', 'ptrace.c', 'smp.c', 'up.c', 'torture.c', 'freezer.c', 'kallsyms_selftest.c', 'profile.c', 'utsname.c', 'audit_fsnotify.c', 'time', 'auditsc.c', 'fail_function.c', 'context_tracking.c', 'Kconfig.kexec', 'latencytop.c', 'notifier.c', 'cpu_pm.c', 'workqueue_internal.

In [12]:
# use regex to filter .c files (raw string so that "\." is not treated as a string escape)
import re
c_names = r".*\.c$"

c_files = list()

for file in file_names:
    if re.match(c_names, file):
        c_files.append(file)

print(c_files)

['module_signature.c', 'ucount.c', 'panic.c', 'tracepoint.c', 'sysctl-test.c', 'kallsyms.c', 'ksysfs.c', 'acct.c', 'scs.c', 'uid16.c', 'params.c', 'exec_domain.c', 'exit.c', 'vmcore_info.c', 'audit_tree.c', 'audit_watch.c', 'groups.c', 'user_namespace.c', 'workqueue.c', 'usermode_driver.c', 'user.c', 'umh.c', 'kexec_elf.c', 'taskstats.c', 'elfcorehdr.c', 'static_call_inline.c', 'kexec.c', 'backtracetest.c', 'stackleak.c', 'smpboot.c', 'jump_label.c', 'ptrace.c', 'smp.c', 'up.c', 'torture.c', 'freezer.c', 'kallsyms_selftest.c', 'profile.c', 'utsname.c', 'audit_fsnotify.c', 'auditsc.c', 'fail_function.c', 'context_tracking.c', 'latencytop.c', 'notifier.c', 'cpu_pm.c', 'relay.c', 'sys.c', 'cfi.c', 'resource.c', 'dma.c', 'kprobes.c', 'stop_machine.c', 'kcmp.c', 'static_call.c', 'user-return-notifier.c', 'sys_ni.c', 'utsname_sysctl.c', 'rseq.c', 'capability.c', 'stacktrace.c', 'crash_core.c', 'padata.c', 'bounds.c', 'compat.c', 'configs.c', 'delayacct.c', 'scftorture.c', 'kcov.c', 'kthread.

In [13]:
# load all C code into a list
full_code = list()
for file in c_files:
    # the context manager closes each file even if reading fails
    with open(file, "r", encoding='utf-8') as code:
        full_code.append(code.read())

In [14]:
# let's look at what a typical C file looks like
print(full_code[20])

// SPDX-License-Identifier: GPL-2.0-only
/*
 * The "user cache".
 *
 * (C) Copyright 1991-2000 Linus Torvalds
 *
 * We have a per-user structure to keep track of how many
 * processes, files etc the user has claimed, in order to be
 * able to have per-user limits for system resources. 
 */

#include <linux/init.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/key.h>
#include <linux/sched/user.h>
#include <linux/interrupt.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
#include <linux/binfmts.h>
#include <linux/proc_ns.h>

#if IS_ENABLED(CONFIG_BINFMT_MISC)
struct binfmt_misc init_binfmt_misc = {
	.entries = LIST_HEAD_INIT(init_binfmt_misc.entries),
	.enabled = true,
	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_misc.entries_lock),
};
EXPORT_SYMBOL_GPL(init_binfmt_misc);
#endif

/*
 * userns count is 1 for root user, 1 for init_uts_ns,
 * and 1 for... ?
 */
struct user_namespace init_user_ns = {
	.uid_map = {
		{
			.extent[0

In [15]:
# merge all C files into one large string
text = "\n".join(full_code)
print("Total number of characters in entire code: {}".format(len(text)))

Total number of characters in entire code: 2242645


In [16]:
# top_n: only consider first top_n characters and discard the rest for memory and computational efficiency
top_n = 400000
text = text[:top_n]

## Convert characters to integers

In [17]:
# create character to index mapping
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [18]:
print("Vocabulary size: {}".format(len(chars)))

Vocabulary size: 97
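To see how the two dictionaries work together, here is a small round-trip sketch on a toy string. The `toy_`-prefixed names and the sample string are assumptions made for illustration, chosen so they do not overwrite the notebook's own `chars`, `char_indices` and `indices_char`.

```python
# Build the same kind of mappings on a tiny string and check that encoding is reversible.
toy_text = "int main(void) { return 0; }"

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# characters -> integers
encoded = [toy_char_indices[c] for c in toy_text]
# integers -> characters
decoded = "".join(toy_indices_char[i] for i in encoded)

print("Vocabulary size:", len(toy_chars))
print("Encoded:", encoded)
print("Round trip OK:", decoded == toy_text)
```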


## Divide data into input (X) and output (y)

### Create sequences

In [19]:
# define length for each sequence
MAX_SEQ_LENGTH = 50          # number of input characters (X) in each sequence
STEP           = 3           # increment between each sequence
VOCAB_SIZE     = len(chars)  # total number of unique characters in dataset

sentences  = []              # X
next_chars = []              # y

for i in range(0, len(text) - MAX_SEQ_LENGTH, STEP):
    sentences.append(text[i: i + MAX_SEQ_LENGTH])
    next_chars.append(text[i + MAX_SEQ_LENGTH])

In [20]:
print('Number of training samples: {}'.format(len(sentences)))

Number of training samples: 133317
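The sample count follows directly from the loop bounds, so it can be checked without building any sequences. The sketch below assumes the text is exactly 400000 characters long after truncation (the full corpus is longer than `top_n`) and uses the batch size of 128 that is passed to `model.fit` later; the lowercase names are illustrative.

```python
import math

n_chars = 400_000      # length of the truncated text (top_n)
max_seq_length = 50    # MAX_SEQ_LENGTH
step = 3               # STEP
batch_size = 128       # batch size used later in model.fit

# One sample per start index 0, 3, 6, ... below n_chars - max_seq_length
n_samples = len(range(0, n_chars - max_seq_length, step))
steps_per_epoch = math.ceil(n_samples / batch_size)

print("Training samples:", n_samples)
print("Batches per epoch:", steps_per_epoch)
```

A smaller `step` gives more (and more overlapping) samples at the cost of memory and training time.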


## Create input and output using the created sequences

When you're not using the Keras Embedding layer as the very first layer, you need to convert your data into the following format:
#### input shape should be of the form :  (#samples, #timesteps, #features)
#### output shape should be of the form :  (#samples, #features) for a many-to-one model such as this one, or (#samples, #timesteps, #features) when an output is predicted at every timestep

![Tensor shape](./jupyter resources/rnn_tensor.png)

- #samples: the number of data points (or sequences).
- #timesteps: the length of each input sequence (the MAX_SEQ_LENGTH variable).
- #features: depends on the type of problem. In this problem, #features is the vocabulary size, that is, the length of the one-hot vector that represents each character. If you're working with **images**, the feature size will be (height, width, channels), and the input shape will be (#training_samples, #timesteps, height, width, channels).
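As a small illustration of these shapes, the sketch below one-hot encodes two toy sequences of three characters each. The data and the `toy_`/`_demo` names are made up for this example; the many-to-one target has no timestep axis.

```python
import numpy as np

toy_sentences = ["abc", "bca"]   # 2 samples, 3 timesteps each
toy_targets = ["a", "b"]         # one next character per sample

toy_vocab = sorted(set("".join(toy_sentences) + "".join(toy_targets)))
toy_index = {c: i for i, c in enumerate(toy_vocab)}

# (#samples, #timesteps, #features) for the input
X_demo = np.zeros((len(toy_sentences), 3, len(toy_vocab)), dtype=bool)
# (#samples, #features) for the many-to-one output
y_demo = np.zeros((len(toy_sentences), len(toy_vocab)), dtype=bool)

for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_demo[i, t, toy_index[char]] = True
    y_demo[i, toy_index[toy_targets[i]]] = True

print("Shape of X_demo:", X_demo.shape)
print("Shape of y_demo:", y_demo.shape)
print(X_demo[0].astype(int))  # one row per timestep, one column per character
```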

In [21]:
# create X and y
X = np.zeros((len(sentences), MAX_SEQ_LENGTH, VOCAB_SIZE), dtype=bool)
y = np.zeros((len(sentences), VOCAB_SIZE), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [22]:
print("Shape of X: {}".format(X.shape))
print("Shape of y: {}".format(y.shape))

Shape of X: (133317, 50, 97)
Shape of y: (133317, 97)


Here, X has the shape (#samples, #timesteps, #features). We build the third dimension (#features) explicitly because we don't use the Embedding() layer of Keras in this case: there are only 97 distinct characters, so each character can be represented as a one-hot encoded vector. There are no word embeddings for characters.

# 2. LSTM

In [23]:
from tensorflow.keras.optimizers import Adam  # same package as the other Keras imports above

# define model architecture - using a two-layer LSTM with 128 LSTM units in each layer
model = Sequential()
model.add(LSTM(128, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), return_sequences=True, dropout=0.5))
model.add(LSTM(128, dropout=0.5))
model.add(Dense(VOCAB_SIZE, activation = "softmax"))

optimizer = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics = ['acc'])

In [24]:
# check model summary
model.summary()
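The summary output is not shown above, but the parameter counts can be estimated by hand. The sketch below assumes the default Keras LSTM layout (four gates, each with input weights, recurrent weights and one bias vector) and a standard Dense layer; the helper names are hypothetical and the numbers are derived from the architecture, not copied from a run.

```python
def lstm_params(input_dim, units):
    """4 gates, each with (input_dim + units) weights plus 1 bias per unit."""
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


demo_vocab_size = 97   # vocabulary size reported earlier
demo_units = 128       # LSTM units per layer

layer_counts = {
    "lstm_1": lstm_params(demo_vocab_size, demo_units),
    "lstm_2": lstm_params(demo_units, demo_units),
    "dense": dense_params(demo_units, demo_vocab_size),
}

for name, count in layer_counts.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(layer_counts.values()):,}")
```

Dropout adds no trainable parameters, so it does not appear in the count.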

In [25]:
# fit model
model.fit(X, y, batch_size=128, epochs=20)

Epoch 1/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 11ms/step - acc: 0.1447 - loss: 3.3885
Epoch 2/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 11ms/step - acc: 0.2405 - loss: 2.8884
Epoch 3/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 11ms/step - acc: 0.2660 - loss: 2.7483
Epoch 4/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 11ms/step - acc: 0.2838 - loss: 2.6668
Epoch 5/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 11ms/step - acc: 0.2978 - loss: 2.6083
Epoch 6/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 11ms/step - acc: 0.3039 - loss: 2.5806
Epoch 7/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 11ms/step - acc: 0.3075 - loss: 2.5578
Epoch 8/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 11ms/step - acc: 0.3149 - loss: 2.5257
Epoch 9/20
[1m1042/1042[0m [32m━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7a3ea60fc790>

# 3. Generate code

Create a function that samples the next character from the predicted probabilities using a temperature parameter. If the temperature is greater than 1, the distribution is flattened, so the generated characters are more diverse. If the temperature is less than 1, the distribution is sharpened, so the generated characters are more conservative. A temperature of exactly 1 leaves the predicted probabilities unchanged.
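To make the effect of temperature concrete, the sketch below rescales a made-up three-way distribution at the three diversity values used later. `apply_temperature` and `base_probs` are illustrative names; the function performs the same rescaling as `sample` but returns the new probabilities instead of drawing from them.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: divide log-probabilities by temperature, then renormalize."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()


base_probs = [0.6, 0.3, 0.1]  # hypothetical model output over three characters

for temperature in (0.5, 1.0, 1.5):
    print(temperature, np.round(apply_temperature(base_probs, temperature), 3))
```

At 0.5 the most likely character gains probability mass, while at 1.5 the mass moves toward the less likely characters.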

In [26]:
# define function to sample the next character index from a probability array based on temperature
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    # rescale the log-probabilities by the temperature, then renormalize
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample; the result is one-hot, so argmax gives the sampled index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [27]:
np.random.multinomial(10, [0.05, 0.9, 0.05], size=2)

array([[ 1,  9,  0],
       [ 0, 10,  0]])

In [28]:
# generate code

start_index = random.randint(0, len(text) - MAX_SEQ_LENGTH - 1) # pick a random position in the text as the seed

for diversity in [0.5, 1.0, 1.5]:
    print('-'*50, 'diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + MAX_SEQ_LENGTH]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(1000):
        # one-hot encode the current window
        x_pred = np.zeros((1, MAX_SEQ_LENGTH, VOCAB_SIZE))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        # append the prediction and slide the window by one character
        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()

-------------------------------------------------- diversity: 0.5
----- Generating with seed: "
		/* reparent: our child is in a different pgrp t"

		/* reparent: our child is in a different pgrp to cure *olfot  se   h fe   fuopt stree,o							 * teopi  */
		ef cptlock_pool(pest_prqbuirer 
= * 		 *ofde = >pocito_gr__c tofters 	 nrt_cromedstipistfinfoets o _oner   stint grouptihg io eeo          *e i SPUPL_PO	L_ * metifget_st the),
}			 o We s i *_sor(ctaitse) 
) * f W            to wo_trak p sted_ at_ing_t ceuponthei sne tt     	et ch

g	/* Co tertron ilse ex dnd tnes i fe t memtiegring os    * m  me fo   aedore t phoc
stai acoovor te mo *orten tn
 * leer i
     e e_saintt modeld thathi l i.
  * t	ele inn= eader no th numitate a  nork, tro t ;
			 r    tid c(pparan)untrem)
 paocom(pereat, ret_stzt_rchert)	
		/*t	rnerwepta = retust_ otn ) 's 
 t	t ei f ret  et_r toble;
 trtade=ng)
			stet_unoop( ( str a; 		reepon_foot( ouk);e		 e* ->neeen_;ymutrro;     fite) {
		etuun   elrse_t_a
k	i   

In [29]:
# generate code

start_index = random.randint(0, len(text) - MAX_SEQ_LENGTH - 1) # pick random seed

for diversity in [0.5, 1.0, 1.5]:
    print('-'*50, 'diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + MAX_SEQ_LENGTH]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(1000):
        x_pred = np.zeros((1, MAX_SEQ_LENGTH, VOCAB_SIZE))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()

-------------------------------------------------- diversity: 0.5
----- Generating with seed: "flags);
}

/*
 * Return true if the calling CPU is"
flags);
}

/*
 * Return true if the calling CPU is noeed ool  tee  R etp nop      meselp_ath
t     thee * so eot      a v, we W/
	** 	*onomasercn an the    hhonn_td tme tor then to tocren ath chedd is peoue  tintnae fettion_lent onk st torngadl  Urent samtin bf rote ing      	        rstr i sither st  fpontreint_ dace ; le aenst  ort toom and tha  Mhe toee st_s sorg   tier  * the to  e nore it it r mgtdret acti iem fi stes onde fos
  * t o lock iteowrot to tt of)is seonser ltre  n soinp p s *o aned atabr wa wt wo pnc,ln
m  nomertaug work ator.
    
	in  _uore(& s/ {atchs= peq_p
or_ steing_p &lorkod
		o* Thed taokee work  to tot  oo thes a s anl che tha  dor toenre p  p pool pee the   p t todef toee 
	 * o i  fd tsf end ut e hernel ng the
 ane irg be wrrle pooler o et tee  wont atived loae the
e */	
u trace_onni_uso_act(oerer	
*				  ftderowd

In [30]:
import datetime, pytz
print("Current Time in IST:", datetime.datetime.now(pytz.utc).astimezone(pytz.timezone('Asia/Kolkata')).strftime('%Y-%m-%d %H:%M:%S'))

Current Time in IST: 2025-02-13 10:02:31
