In [29]:
import tensorflow as tf
import numpy as np
import keras 
import hazm
from keras.callbacks import ModelCheckpoint,EarlyStopping,ReduceLROnPlateau
from keras.layers import BatchNormalization

In [None]:
!wget -O pos_tagger.model "https://drive.usercontent.google.com/u/0/uc?id=1Q3JK4NVUC2t5QT63aDiVrCRBV225E_B3&export=download"

In [30]:
token = []
count={}

# **Text Preprocessing and Data Preparation for Machine Learning Models Using Hazm**

This code preprocesses Persian textual data and prepares it for machine learning models, such as RNNs or LSTMs.

---

## **Code Overview**

The following steps are performed:

1. **Reading Data from a File**  
   The text is read from `data.txt` and stored in a variable.

2. **Removing Empty Lines**  
   Extra empty lines (`\n\n`) are removed from the text.

3. **Stemming and Lemmatization**  
   The `hazm` library is used to normalize the text:
   - **Stemming:** Reduces words to their root form.
   - **Lemmatization:** Converts words to their base dictionary form.

4. **Tokenization and POS Tagging**  
   - The text is tokenized into individual words.  
   - POS (Part-of-Speech) tagging is applied using the `pos_tagger.model`.  
   - Each word is tagged in the format `[word-POS_tag]`.

5. **Creating Unique Tokens and Counting Frequencies**  
   - A list of unique tokens is created.  
   - The frequency of each token in the text is calculated.

6. **Token-to-Index Conversion**  
   Tokens are replaced with their respective indices to prepare numerical input for models.

7. **Data Preparation for Model Training**  
   A sliding window of size 10 is applied:
   - `X_train` contains sequences of 10 consecutive words (as indices).  
   - `Y_train` contains the next word following each sequence.


### Reading Data from a File

In [32]:
# Persian text should be read with an explicit UTF-8 encoding
with open("data.txt", "r", encoding="utf-8") as file:
    contact = file.read()

### Removing Empty Lines

In [33]:
# Replace each blank line with a single newline so adjacent paragraphs are not glued together
contact = contact.replace("\n\n", "\n")
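A single `replace` pass only handles exact pairs of newlines. As a minimal sketch, assuming the goal is to collapse any run of blank lines (including lines that hold only whitespace) into one line break, a regular expression can be used. The helper name `collapse_blank_lines` is hypothetical and not part of the original notebook.

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Collapse runs of blank (or whitespace-only) lines into one newline."""
    return re.sub(r"\n\s*\n", "\n", text)

# Toy input: three newlines in a row, then a line holding only a space
sample = "first paragraph\n\n\nsecond paragraph\n \nthird paragraph"
cleaned = collapse_blank_lines(sample)
print(repr(cleaned))
```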

### Stemming and Lemmatization

In [34]:
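stemmer = hazm.Stemmer()
# Stemmer.stem expects a single word, so stem each token rather than the whole text
contact = " ".join(stemmer.stem(word) for word in hazm.word_tokenize(contact))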

In [35]:
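lemmatizer = hazm.Lemmatizer()
# Lemmatizer.lemmatize also works on one word at a time
contact = " ".join(lemmatizer.lemmatize(word) for word in hazm.word_tokenize(contact))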

### Tokenization and POS Tagging

In [36]:
def merge(arr):
    """Format each (word, tag) pair as a "[word-tag]" string."""
    arr1 = []
    for v in arr:
        arr1.append(f"[{v[0]}-{v[1]}]")
    return arr1
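A quick check of the output format, using hand-written `(word, tag)` pairs in place of real tagger output. The tag names below are illustrative assumptions; the actual tag set depends on the model stored in `pos_tagger.model`.

```python
def merge(arr):
    """Format (word, tag) pairs as "[word-tag]" strings."""
    return [f"[{word}-{tag}]" for word, tag in arr]

# Hypothetical tagger output: a list of (word, tag) tuples
tagged = [("کتاب", "NOUN"), ("خوب", "ADJ"), ("است", "VERB")]
merged = merge(tagged)
print(merged)
```

Keeping the tag inside the token means the same word with two different tags becomes two separate vocabulary entries.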

In [37]:
contact=hazm.word_tokenize(contact)

In [38]:
# The tagger comes from hazm, not spaCy, so name it accordingly
pos_tagger = hazm.POSTagger(model='pos_tagger.model')
contact = merge(pos_tagger.tag(tokens=contact))

### Creating Unique Tokens and Counting Frequencies

In [39]:
# Collect unique tokens in order of first appearance
for value in contact:
    if value not in token:
        token.append(value)

In [40]:
# Count how often each unique token occurs in the text
for value in token:
    count[value] = contact.count(value)
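Calling `list.count` inside a loop rescans the whole text once per unique token, so the cost grows with the text length times the vocabulary size. A minimal sketch of a single-pass alternative using `collections.Counter` from the standard library; the toy `tokens_in_text` list stands in for the tagged `contact` list and is an assumption for illustration.

```python
from collections import Counter

# Toy stand-in for the tagged text
tokens_in_text = ["[a-N]", "[b-V]", "[a-N]", "[c-ADJ]", "[a-N]", "[b-V]"]

# dict.fromkeys keeps first-appearance order while dropping duplicates
unique_tokens = list(dict.fromkeys(tokens_in_text))

# Counter tallies every token in one pass over the list
frequencies = Counter(tokens_in_text)

print(unique_tokens)
print(frequencies.most_common(2))
```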

### Token-to-Index Conversion

In [41]:
for index,value in enumerate(contact):
    contact[index] = token.index(value)
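`token.index(value)` performs a linear search for every word in the text. A sketch of the same mapping with a dictionary lookup, assuming the same first-appearance ordering of unique tokens; `token_to_index` is a hypothetical name and the lists are toy data.

```python
token = ["[a-N]", "[b-V]", "[c-ADJ]"]             # unique tokens, toy data
contact = ["[a-N]", "[b-V]", "[a-N]", "[c-ADJ]"]  # tagged text, toy data

# Build the lookup table once: token -> position in the unique list
token_to_index = {tok: i for i, tok in enumerate(token)}

# Encode the text with constant-time lookups
encoded = [token_to_index[tok] for tok in contact]
print(encoded)  # [0, 1, 0, 2]
```

The resulting indices are identical to those produced by `token.index`, so downstream cells would not need to change.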

### Data Preparation for Model Training

In [42]:
X_train=[]
Y_train=[]

In [43]:
for index,value in enumerate(contact):
    if len(contact)-10 > index:
        X_train.append(contact[index:index+10])
        Y_train.append(contact[index+10])

In [44]:
X_train = np.array(X_train)
Y_train = np.array(Y_train)
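To make the windowing concrete, here is a small sketch on a toy sequence with a window of 3 instead of 10. The function name `make_windows` is hypothetical; the loop bounds mirror the `len(contact) - 10 > index` condition used above.

```python
def make_windows(sequence, window):
    """Return (inputs, targets): each input holds `window` items, the target is the next item."""
    inputs, targets = [], []
    for start in range(len(sequence) - window):
        inputs.append(sequence[start:start + window])
        targets.append(sequence[start + window])
    return inputs, targets

X, y = make_windows([0, 1, 2, 3, 4, 5], window=3)
print(X)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y)  # [3, 4, 5]
```

A sequence of length `n` yields `n - window` training pairs, so texts shorter than the window produce no samples at all.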

# **Training an LSTM Model for Text Prediction**

This section defines and trains an LSTM-based model using Keras for predicting the next word in a sequence of Persian text.

---

## **Code Overview**

The following steps are performed:

1. **Model Architecture:**
   - An **Embedding layer** maps the tokenized words to dense vector representations.
   - An **LSTM layer** processes the sequences and learns long-term dependencies.
   - A **Dense layer** with a softmax activation predicts the next word.

2. **Model Compilation:**
   - **Optimizer:** Adam optimizer for efficient training.  
   - **Loss Function:** `sparse_categorical_crossentropy` is used for multi-class classification.  
   - **Metrics:** Accuracy is monitored.

3. **Callbacks for Training:**
   - **ModelCheckpoint:** Two checkpoints each save the best model so far, one monitoring `val_loss` and the other monitoring `loss`.

4. **Training:**
   - The model is trained using `X_train` and `Y_train` with a validation split of 1%.  
   - The model runs for 1000 epochs.


In [45]:
vocab_size = len(token) + 1 
maxlen=10

In [46]:
model = keras.Sequential()

### Model Architecture

In [47]:
model.add(keras.layers.Embedding(vocab_size, 7000, input_length=10))
model.add(keras.layers.LSTM(1024))  
model.add(keras.layers.Dense(vocab_size, activation='softmax'))
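A rough parameter count helps judge how large this network is. The sketch below assumes the standard Keras layer layouts (an LSTM with four gates and one bias vector per gate, a Dense layer with bias) and a hypothetical `vocab_size` of 5000; the real value is `len(token) + 1` and depends on the corpus.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

vocab_size = 5000  # hypothetical value for illustration
total = (embedding_params(vocab_size, 7000)
         + lstm_params(7000, 1024)
         + dense_params(1024, vocab_size))
print(f"{total:,} trainable parameters")
```

With these numbers the embedding and LSTM layers dominate, which suggests the 7000-dimensional embedding is the first setting to revisit if memory or overfitting becomes a problem.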

### Model Compilation

In [48]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
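`sparse_categorical_crossentropy` takes integer class indices as targets, which is why `Y_train` can stay as a plain array of token indices rather than one-hot vectors. For a single sample, the loss is the negative log of the probability the model assigns to the true class. The sketch below is a simplified illustration and omits details such as the numerical clipping and batch averaging that a real implementation applies.

```python
import math

def sparse_categorical_crossentropy(target_index, probabilities):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[target_index])

probs = [0.1, 0.7, 0.2]  # toy softmax output over three tokens

loss_correct = sparse_categorical_crossentropy(1, probs)  # true token is the likely one
loss_wrong = sparse_categorical_crossentropy(0, probs)    # true token is an unlikely one
print(loss_correct, loss_wrong)
```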

### Callbacks for Training

In [50]:
checkpoint_val_loss = ModelCheckpoint('token_generator_best_val_loss.keras', save_best_only=True,monitor="val_loss")
checkpoint_loss = ModelCheckpoint('token_generator_best_loss.keras', save_best_only=True,monitor="loss")
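With `save_best_only=True`, a checkpoint is written only when the monitored value improves on the best value seen so far. The following sketch simulates that rule in plain Python for a loss-like metric where lower is better; it is an illustration of the idea, not the Keras implementation, and the loss values are made up.

```python
def epochs_that_save(losses):
    """Return 1-based epochs where the monitored value beats the best so far."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:  # strict improvement only; ties do not trigger a save
            best = loss
            saved.append(epoch)
    return saved

val_losses = [2.0, 1.5, 1.7, 1.4, 1.4, 1.6]  # hypothetical validation losses
print(epochs_that_save(val_losses))  # [1, 2, 4]
```

Because training loss usually keeps falling while validation loss eventually rises, the two checkpoint files tend to capture different epochs.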

### Training

In [None]:
model.fit(X_train, Y_train, epochs=1000,validation_split=0.01,callbacks=[checkpoint_val_loss,checkpoint_loss])
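Keras takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. Since the windows here overlap and are in text order, the validation set is the final stretch of the text. The sketch below estimates the split sizes, assuming the split point is the floor of `num_samples * (1 - validation_split)`; the sample count is hypothetical.

```python
import math

def split_sizes(num_samples, validation_split):
    """Estimate (train, validation) sizes when the tail of the data is held out."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return split_at, num_samples - split_at

train_size, val_size = split_sizes(50_000, 0.01)  # hypothetical number of windows
print(train_size, val_size)
```

With only 1% held out, `val_loss` is computed on few samples and can be noisy, which affects which epoch the `val_loss` checkpoint keeps.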