# Urdu Text Generator using News Dataset
Urdu is one of the languages where a great deal of NLP research and tooling is still missing compared with English and other widely studied languages. I find this interesting and full of opportunities for young researchers who would like to work on these gaps.

A text generator is an application of a language model. A language model can be defined simply as a model that calculates the probability of a word being the next word, given the words that come before it. From these per-word probabilities, we can calculate the probability of a whole sentence using the chain rule of probability.
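
As a rough illustration of the chain rule, the sketch below scores a sentence under a toy bigram model, which is a simplifying assumption where each word depends only on the word before it. The three-sentence English corpus, the `<s>` start marker, and the helper names are made up for this example and are not part of the notebook's pipeline.

```python
from collections import Counter

# Toy corpus (made up for illustration, not the Urdu headlines dataset).
corpus = [
    "the match was won",
    "the match was lost",
    "the team won",
]

START = "<s>"  # hypothetical start-of-sentence marker

context_counts = Counter()  # how often a word appears as the previous word
bigram_counts = Counter()   # how often a (previous, next) pair appears
for sentence in corpus:
    words = [START] + sentence.split()
    for prev, nxt in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, nxt)] += 1


def next_word_probability(prev, nxt):
    """Estimate P(nxt | prev) by relative frequency (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / context_counts[prev]


def sentence_probability(sentence):
    """Chain rule: multiply the conditional probability of each word."""
    words = [START] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= next_word_probability(prev, nxt)
    return prob


print(sentence_probability("the match was won"))
print(sentence_probability("the team lost"))
```

The LSTM trained later in this notebook plays the role of `next_word_probability`, except that it conditions on all preceding words rather than only the last one.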

In this notebook, I am going to use the Urdu Headlines dataset provided by **Adnan Zaidi**. Thanks to him for this effort. I will use an LSTM in Keras with the TensorFlow backend to train this language model and then use it as a text generator for Urdu news. You have probably heard that deep learning models are always hungry for data, so this model will not perform very well, as we have only around 2300 news headlines. Performance should improve if more data becomes available in the future and the model is refined.

In [None]:
import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Flatten, Dense, Embedding

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading data
The CSV file in this dataset contains only news headlines. I am going to read the file and extract all the headlines into a list so that the text can be preprocessed and used for training.

In [None]:
# The CSV appears to have no header row, so pandas uses the first headline as the column name.
data = pd.read_csv('/kaggle/input/urdu-news-headlines/UrduNewsHeadlines.csv')
data = data['قومی\t وزیراعظم عمران خان نے کل ایسی جگہ جانے کا فیصلہ کر لیا کہ اپوزیشن رہنما بھی دنگ  ']
# Each row has the form "<category>\t<headline>"; keep only the headline text.
data = [x.split('\t')[1].strip() for x in data]
data[:10]

## Tokenization
We now have around 2300 news headlines. It is time to prepare our dataset. I am using the Keras built-in tokenizer to tokenize the sentences.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab_size = len(tokenizer.word_index) + 1
vocab_size
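
To make the `+ 1` in `vocab_size` less mysterious, here is a minimal pure-Python sketch of the kind of word index the tokenizer builds. It assumes plain whitespace splitting and skips the lowercasing and punctuation filtering that the Keras `Tokenizer` applies by default. The two English headlines are made up for readability.

```python
from collections import Counter

# Toy headlines (made up); the notebook itself uses the Urdu headlines list.
texts = [
    "prime minister visits city",
    "prime minister meets opposition",
]

# Count whitespace-separated words across all texts.
counts = Counter(word for text in texts for word in text.split())

# The most frequent word gets index 1; index 0 is left free for padding.
word_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

# One extra slot accounts for the reserved padding index 0.
vocab_size = len(word_index) + 1

# Replace each word by its integer id.
sequences = [[word_index[w] for w in text.split()] for text in texts]

print(word_index)
print(vocab_size)
print(sequences)
```

Because index 0 never maps to a real word, it can safely be used later as the padding value.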

## Dataset preparation for language modeling
Here is a tricky part that you need to understand: we need to prepare the dataset for a language model. The task of a language model is to predict the next word. I loop through each sentence and build training examples by taking the preceding words as input and the word that follows as the output to be predicted.

In [None]:
train_data = tokenizer.texts_to_sequences(data)
training_examples = []
labels = []

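for e in train_data:
    # Every prefix e[:i] becomes an input; the token that follows it is the label.
    for i in range(1, len(e)):
        training_examples.append(e[:i])
        labels.append(e[i])

# Left-pad (the pad_sequences default) every prefix to the longest prefix length.
max_len = max(len(x) for x in training_examples)
X = pad_sequences(training_examples, maxlen=max_len)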
X
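
The prefix-and-label construction is easier to see on a single sentence. The sketch below uses hypothetical token ids and two made-up helpers, `make_prefix_pairs` and `pad_pre`; the latter imitates the default left-padding and left-truncation of `pad_sequences` in plain Python and assumes `max_len` is at least 1.

```python
def make_prefix_pairs(sequence):
    """Return (prefix, next_token) pairs for one tokenized sentence."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def pad_pre(prefix, max_len):
    """Left-pad with zeros; keep only the last max_len tokens if too long."""
    trimmed = prefix[-max_len:]
    return [0] * (max_len - len(trimmed)) + trimmed


sentence = [4, 7, 2, 9]  # hypothetical token ids for a four-word headline

pairs = make_prefix_pairs(sentence)
print(pairs)   # [([4], 7), ([4, 7], 2), ([4, 7, 2], 9)]

max_len = max(len(prefix) for prefix, _ in pairs)
padded = [pad_pre(prefix, max_len) for prefix, _ in pairs]
print(padded)  # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
```

A headline of n words therefore contributes n - 1 training examples, which is why the number of rows in `X` is much larger than the number of headlines.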

## One-hot encoding of output labels
The main problem is to predict the next word, so this is a multiclass classification problem in which the number of classes equals the number of words in the vocabulary. To solve it, we place a softmax activation function at the output layer of our neural network and train it with categorical cross-entropy, which expects one-hot targets. For that reason, I one-hot encode our outputs in the next cell.

In [None]:
Y = to_categorical(labels, num_classes=vocab_size)
Y
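
For readers who have not met one-hot encoding before, this is roughly what happens to each label. The sketch is plain Python with a made-up `one_hot` helper and hypothetical label values; it is meant to show the shape of the result, not how `to_categorical` is implemented.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


labels = [2, 0, 3]  # hypothetical next-word ids
vocab_size = 5      # hypothetical vocabulary size

Y = [one_hot(label, vocab_size) for label in labels]
for row in Y:
    print(row)
```

Since `Y` has one column per vocabulary word, its size grows as the number of examples times `vocab_size`. If memory becomes a concern, Keras also provides `SparseCategoricalCrossentropy`, which accepts integer labels directly; that would be a change from the approach used in this notebook.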

## Model definition
In the cell below, I use a bidirectional LSTM on top of a word embedding layer to train our language generator model.

In [None]:
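model = Sequential([
    Embedding(vocab_size, 300),               # 300-dimensional word vectors
    Bidirectional(LSTM(20)),                  # 20 units per direction -> 40 features
    Flatten(),                                # no-op here: the LSTM output is already 2D
    Dense(vocab_size, activation='softmax')   # one probability per vocabulary word
])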

optim = Adam(learning_rate=0.01)
loss = CategoricalCrossentropy()
model.compile(optimizer=optim, loss=loss, metrics=['accuracy'])

In [None]:
BS = 32
EPOCHS = 20

history = model.fit(X, Y, batch_size=BS, epochs=EPOCHS)

In [None]:
model.summary()
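
As a sanity check on the numbers that `model.summary()` reports, the parameter counts can be worked out by hand. The sketch below uses the layer sizes from the model above (embedding dimension 300, 20 LSTM units per direction); the vocabulary size of 5000 is a hypothetical value chosen only to make the arithmetic concrete, and `count_parameters` is a made-up helper.

```python
def count_parameters(vocab_size, embedding_dim=300, lstm_units=20):
    """Hand-computed parameter counts for Embedding -> BiLSTM -> Dense."""
    # One embedding vector per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
    one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # The bidirectional wrapper holds a forward and a backward copy.
    bidirectional = 2 * one_direction
    # The dense layer sees the two directions concatenated (2 * lstm_units inputs).
    dense = (2 * lstm_units) * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "bidirectional": bidirectional,
        "dense": dense,
        "total": embedding + bidirectional + dense,
    }


params = count_parameters(vocab_size=5000)  # 5000 is a hypothetical vocabulary size
print(params)
```

Almost all of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size, while the recurrent part stays small. With only around 2300 headlines, that imbalance is one reason the model is likely to overfit.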

## Text generation
We have now trained an LSTM model that, given some words as input, outputs the next word. To generate text, I feed it some starting words and let it predict the next one. I then append the predicted word to the input, give the result to the model again, and so on.

In [None]:
seed_text = "وزیراعظم عمران خان نے"
next_words = 30
  
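for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad (or truncate) to the same length the model was trained on.
    token_list = pad_sequences([token_list], maxlen=max_len, padding='pre')
    # predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output.
    probabilities = model.predict(token_list, verbose=0)
    predicted = int(np.argmax(probabilities, axis=-1)[0])
    # index_word is the tokenizer's reverse mapping from id to word.
    output_word = tokenizer.index_word.get(predicted, "")
    seed_text += " " + output_word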
print(seed_text)
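
Stripped of the Keras details, the generation loop is a simple feedback cycle. The sketch below shows that structure with a made-up stand-in for the trained network: `toy_predict_next` just looks up the last word in a small hypothetical table, whereas the real model looks at the whole padded context.

```python
def generate(seed_words, predict_next, num_words):
    """Greedy autoregressive loop: append each prediction to the context."""
    words = list(seed_words)
    for _ in range(num_words):
        words.append(predict_next(words))
    return words


# Hypothetical stand-in for the trained model: depends only on the last word.
toy_table = {"a": "b", "b": "c", "c": "a"}


def toy_predict_next(words):
    return toy_table[words[-1]]


generated = generate(["a"], toy_predict_next, 4)
print(" ".join(generated))  # a b c a b
```

Always taking the single most likely word makes the output deterministic, and as the toy run shows, it can fall into a repeating cycle. The same effect may appear in the generated Urdu headlines, especially with a model trained on this little data.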