# DSCI-D-590-Final-project
## Movie Script Generator
## Team: Sricharan Cheeti

## Preprocessing

#### 1. Splitting the Script: The script is first split into individual lines so that each line can be processed separately.

#### 2. Whitespace Removal: Leading and trailing whitespace is stripped from each line.

#### 3. Skipping Empty Lines: Lines that are empty after stripping are dropped, which keeps the script compact.

#### 4. Identifying Character Names: Lines that are entirely uppercase and contain five or fewer words are treated as character names (dialogue cues). This follows the common screenplay convention in which the speaker's name appears in all caps on its own short line before the dialogue. Each such line is wrapped in `<CHARACTER>...</CHARACTER>` markers so that later stages can tell speaker cues apart from dialogue and description. The heuristic can also match other short all-caps lines, such as scene headings.

#### 5. Reassembling the Script: Finally, the processed lines are joined back into a single text block, in their original order.
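
The five steps above can be illustrated on a toy excerpt. This is a minimal sketch: the function name `mark_character_cues` and the sample lines are made up for illustration, and the last sample line is included to show that a short all-caps scene heading is tagged as well.

```python
def mark_character_cues(text, max_words=5):
    """Toy version of the five preprocessing steps described above."""
    processed = []
    for line in text.split("\n"):          # 1. split into lines
        line = line.strip()                # 2. trim surrounding whitespace
        if not line:                       # 3. skip empty lines
            continue
        # 4. short all-caps lines are treated as character cues
        if line.isupper() and len(line.split()) <= max_words:
            line = f"<CHARACTER>{line}</CHARACTER>"
        processed.append(line)
    return "\n".join(processed)            # 5. reassemble in original order


# Made-up sample text, not taken from the actual script file
sample = "  COL. LANDA  \n\nMay I come in?\n\nINT. FARMHOUSE - DAY\n"
print(mark_character_cues(sample))
# <CHARACTER>COL. LANDA</CHARACTER>
# May I come in?
# <CHARACTER>INT. FARMHOUSE - DAY</CHARACTER>
```
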

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import re

def preprocess_script(text):
    """
    Preprocess the script by maintaining scene descriptions, dialogues, and special instructions.
    Cleans and formats the text for consistency.
    """
    # Split the script into lines
    lines = text.split('\n')

    # Preprocessed script
    preprocessed_script = []

    for line in lines:
        # Remove any extraneous whitespace
        line = line.strip()

        # Skip empty lines
        if not line:
            continue

        # Check for character names / dialogue cues (short lines in all caps)
        if line.isupper() and len(line.split()) <= 5:
            # Add a marker for character names
            line = f"<CHARACTER>{line}</CHARACTER>"

        # Add the processed line to the preprocessed script
        preprocessed_script.append(line)

    return '\n'.join(preprocessed_script)

# Read the entire script
file_path = '/content/inglorious_basterds_script.txt'  # Replace with your script file path
with open(file_path, 'r', encoding='utf-8') as file:
    entire_script_content = file.read()

# Preprocess the script
preprocessed_script = preprocess_script(entire_script_content)

# Save the preprocessed script to a new file
preprocessed_file_path = '/content/preprocessed_script.txt'  # Replace with your desired file path
with open(preprocessed_file_path, 'w', encoding='utf-8') as file:
    file.write(preprocessed_script)


## Feature Extraction

#### Tokenization: The script is broken down into individual words (tokens) by splitting each line on whitespace.

#### Removing Stop Words and Punctuation: Stop words (common words such as "the" and "is" that carry little meaning on their own) are removed, because their high frequency would otherwise dominate the word counts while contributing little to understanding the content. Punctuation is stripped from the start and end of each word, so that words are counted by their text alone.

#### Word Frequency Count: A frequency count of the remaining words is computed using a `Counter`. This count shows which words are most common in the script, which can be useful for thematic analysis.

#### Sorting and Indexing: The vocabulary is sorted by word frequency, with the most frequent words first. A word-to-index mapping is then created in which each unique word is assigned a unique integer, starting from 1 so that 0 remains free for padding.
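
As a rough illustration of these four steps, the sketch below builds a vocabulary from two made-up lines. To stay dependency-free it uses a tiny hypothetical stop word set (`STOP_WORDS`) in place of NLTK's English list, and `build_vocab` is an illustrative name, not a function from this notebook.

```python
import string
from collections import Counter

# Hypothetical stand-in for NLTK's English stop word list
STOP_WORDS = {"the", "a", "is", "in", "of", "and"}


def build_vocab(lines):
    # Tokenize on whitespace and strip punctuation from both ends of each word
    tokens = [w.strip(string.punctuation) for line in lines for w in line.split()]
    # Drop empty tokens and stop words (compared case-insensitively)
    tokens = [t for t in tokens if t and t.lower() not in STOP_WORDS]
    counts = Counter(tokens)
    # Most frequent first; index 0 is left unused so it can serve as padding
    ordered = sorted(counts, key=counts.get, reverse=True)
    return counts, {word: i for i, word in enumerate(ordered, start=1)}


lines = ["The farmer is in the house.", "The farmer sees a car, and the car stops."]
counts, word_to_index = build_vocab(lines)
print(word_to_index)
# {'farmer': 1, 'car': 2, 'house': 3, 'sees': 4, 'stops': 5}
```

Words with equal counts keep their order of first appearance, because Python's sort is stable and `Counter` preserves insertion order.
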

In [None]:
import string
from collections import Counter


def tokenize_script(script_lines):
    """
    Tokenizes the script by splitting each line into words, removing stop words and punctuation,
    and creating a word-to-index mapping.
    """
    stop_words = set(stopwords.words('english'))
    # Remove punctuation and split each line into words
    tokens = [word.strip(string.punctuation) for line in script_lines for word in line.split()]

    # Filter out stop words and empty tokens
    tokens = [word for word in tokens if word and word.lower() not in stop_words]

    # Creating a counter of all words
    word_counts = Counter(tokens)

    # Sorting words according to their frequency
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)

    # Creating a word to index mapping (word -> integer)
    word_to_index = {word: index for index, word in enumerate(sorted_vocab, 1)}  # starting index from 1

    # Tokenizing the script: map each cleaned word to its index.
    # Stop words and empty tokens are not in word_to_index, so they are skipped here.
    tokenized_script = [
        [word_to_index[token]
         for token in (word.strip(string.punctuation) for word in line.split())
         if token in word_to_index]
        for line in script_lines
    ]

    return tokenized_script, word_to_index

# Path to the preprocessed script file
preprocessed_file_path = '/content/preprocessed_script.txt'  # Replace with your file path

# Reading the preprocessed script
with open(preprocessed_file_path, 'r', encoding='utf-8') as file:
    script_content = file.read()

# Splitting the script into a list of lines
script_lines = script_content.split('\n')

# Tokenizing the script
tokenized_script, word_to_index = tokenize_script(script_lines)

# Displaying the first few tokenized lines and a snippet of the word_to_index mapping
print(tokenized_script[:5])
print({k: word_to_index[k] for k in list(word_to_index)[:10]})  # Showing first 10 words in the word_to_index dictionary


[[1112, 1113], [1114, 525, 611, 1524, 1525, 271, 16, 156, 1115, 401], [720, 464, 1526, 721], [402, 2373], [1527, 1528, 1529]]
{'CHARACTER>COL': 1, 'LANDA</CHARACTER': 2, 'CHARACTER>LT': 3, 'German': 4, 'CHARACTER>SHOSANNA:</CHARACTER': 5, 'Shosanna': 6, 'ALDO</CHARACTER': 7, 'one': 8, 'I’m': 9, 'back': 10}
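
Keys such as `CHARACTER>COL` and `LANDA</CHARACTER` appear in the mapping because `strip(string.punctuation)` removes the outer `<`, `>` and `.` characters from each whitespace-separated word, which leaves the marker fused to the name. One possible workaround, which is not part of the original notebook, is to split the markers off as separate tokens before stripping punctuation. The helper name `split_with_markers` below is hypothetical.

```python
import re
import string

# Matches the opening or closing marker added during preprocessing
MARKER = re.compile(r"(</?CHARACTER>)")


def split_with_markers(line):
    """Hypothetical helper: keep <CHARACTER> tags as standalone tokens."""
    tokens = []
    for piece in MARKER.split(line):
        if MARKER.fullmatch(piece):
            tokens.append(piece)  # keep the tag intact
        else:
            # Ordinary text: whitespace split, then strip surrounding punctuation
            stripped = (w.strip(string.punctuation) for w in piece.split())
            tokens.extend(t for t in stripped if t)
    return tokens


line = "<CHARACTER>COL. LANDA</CHARACTER>"
naive = [w.strip(string.punctuation) for w in line.split()]
print(naive)                     # ['CHARACTER>COL', 'LANDA</CHARACTER']
print(split_with_markers(line))  # ['<CHARACTER>', 'COL', 'LANDA', '</CHARACTER>']
```
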


## Main Functionality: Text Generation

#### 1. Model Architecture:

#### Embedding Layer: The model begins with an embedding layer, which maps each word index in the input sequence to a dense 100-dimensional vector. This lets the model learn a representation for each word during training.
#### LSTM Layers: Two LSTM (Long Short-Term Memory) layers with 128 units each follow. LSTM, a form of recurrent neural network (RNN), processes sequences step by step and can capture dependencies between distant words. This matters for text generation, where meaning and structure depend heavily on word order.
#### Dropout Layer: A dropout layer sits between the two LSTM layers. It helps prevent overfitting by randomly setting 20% of its input units to zero during training.
#### Dense Layer with Softmax Activation: The final layer is a dense (fully connected) layer with softmax activation. It outputs a probability distribution over the entire vocabulary for the next word in the sequence.
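
To get a feel for the size of this architecture, the sketch below counts trainable parameters per layer. It assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and the layer sizes used in this notebook (100-dimensional embeddings, 128 LSTM units). The vocabulary size of 3000 is a hypothetical value, since the notebook does not print the real one. Dropout has no parameters and is omitted.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias vector
    return 4 * (units * (input_dim + units) + units)


def model_params(vocab_size, embed_dim=100, units=128):
    n_classes = vocab_size + 1  # +1 for the padding index 0
    return {
        "embedding": n_classes * embed_dim,
        "lstm_1": lstm_params(embed_dim, units),
        "lstm_2": lstm_params(units, units),
        "dense": units * n_classes + n_classes,  # weights plus biases
    }


layer_counts = model_params(vocab_size=3000)  # 3000 is a hypothetical vocabulary size
print(layer_counts)
print(sum(layer_counts.values()))
```

The two LSTM layers have a fixed size (117,248 and 131,584 parameters), while the embedding and dense layers grow linearly with the vocabulary.
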
#### 2. Training Process:
#### Data Preparation: The training data is built from the tokenized script line by line. Every prefix of a line that contains at least two tokens becomes one training example: the prefix is left-padded with zeros to a fixed length of 50, the first 49 positions form the input, and the last position is the word to predict.
#### One-Hot Encoding: The output words (labels) are one-hot encoded, so that each label is a vector with a 1 at the target word's index and 0 elsewhere.
#### Model Compilation: The model is compiled with categorical cross-entropy as the loss function and the Adam optimizer, a common combination for multi-class classification.
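
The data preparation and loss can be sketched without Keras. The helper names below are illustrative and the token indices are a toy example. The padding mirrors `pad_sequences(..., padding='pre')`, which pads on the left and, for over-long inputs, keeps the most recent tokens.

```python
import math


def prefix_examples(token_ids, sequence_length):
    """Turn one tokenized line into (input, target) pairs, left-padded with 0."""
    examples = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[: i + 1][-sequence_length:]  # keep the most recent tokens
        padded = [0] * (sequence_length - len(prefix)) + prefix
        examples.append((padded[:-1], padded[-1]))  # input, next-word target
    return examples


def one_hot(index, num_classes):
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def categorical_crossentropy(y_true, y_pred):
    # With a one-hot label this reduces to -log(probability of the true word)
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


line = [4, 7, 2]  # a toy tokenized line
for x, y in prefix_examples(line, sequence_length=5):
    print(x, "->", y)
# [0, 0, 0, 4] -> 7
# [0, 0, 4, 7] -> 2

label = one_hot(2, num_classes=4)
loss = categorical_crossentropy(label, [0.1, 0.2, 0.6, 0.1])
print(label, loss)
```
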
#### 3. Text Generation:
#### Generating New Text: Once trained, the model can generate new text. Starting from a seed sequence, the model predicts the next word, which is appended to the sequence. The extended sequence is fed back into the model for the next prediction, and the process repeats to produce a chain of text.
#### Handling Sequences: Before each prediction, the input is padded or trimmed to the fixed input length, so that the model sees only the most recent words once the sequence grows longer than that.
#### Probabilistic Nature: Each output from the model is a probability distribution over the possible next words. The generation code in this notebook always picks the most probable word (greedy decoding), so a given trained model and seed yield the same text on every run; sampling from the distribution instead would produce more varied output.
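
The difference between picking the most probable word and sampling from the distribution can be sketched on a toy example. The notebook's `generate_text` uses `np.argmax`; the temperature-based sampling below is an alternative that is not part of the original code, and all function names here are illustrative.

```python
import math
import random


def greedy_choice(probs):
    # Always pick the single most probable index (what argmax does)
    return max(range(len(probs)), key=probs.__getitem__)


def sample_with_temperature(probs, temperature=1.0, rng=random):
    # Rescale log-probabilities by the temperature, then renormalize
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    # choices() draws an index in proportion to its weight
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def slide_window(context, new_index, window):
    # Append the prediction and keep only the most recent `window` indices
    return (context + [new_index])[-window:]


probs = [0.05, 0.60, 0.25, 0.10]  # toy distribution over a 4-word vocabulary
rng = random.Random(0)
print(greedy_choice(probs))  # 1
print([sample_with_temperature(probs, 1.0, rng) for _ in range(10)])
print(slide_window([3, 1, 2], 0, window=3))  # [1, 2, 0]
```

A temperature below 1 sharpens the distribution toward the greedy choice, while a temperature above 1 flattens it and increases variety.
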
#### 4. Application:
#### This functionality is particularly exciting for creative applications like:

#### Automated Script Writing: Generating new script content that aligns with the style and themes of the input script.
#### Predictive Text Generation: Offering suggestions for scriptwriters based on the current context of their writing.
#### 5. Significance:
#### The significance of this model lies in its ability to learn and mimic the linguistic patterns, style, and narrative flow of the input script, providing a tool that can assist in creative writing or generate new content based on learned patterns. It represents a blend of machine learning and creative writing, showing how AI can be used in artistic and creative domains.







In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Dropout
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# Assuming 'tokenized_script' and 'word_to_index' are available from your tokenization process

# Prepare data for training
def prepare_sequences(tokenized_script, sequence_length):
    X, y = [], []
    for line in tokenized_script:
        for i in range(1, len(line)):
            sequence = line[:i+1]
            sequence = pad_sequences([sequence], maxlen=sequence_length, padding='pre')[0]
            X.append(sequence[:-1])
            y.append(sequence[-1])
    return np.array(X), to_categorical(y, num_classes=len(word_to_index) + 1)


In [None]:
# Define sequence length and prepare sequences
sequence_length = 50  # You can adjust this
X, y = prepare_sequences(tokenized_script, sequence_length)

In [None]:
# Define the model
model = Sequential()
model.add(Embedding(input_dim=len(word_to_index) + 1, output_dim=100, input_length=sequence_length-1))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dense(len(word_to_index) + 1, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# Train the model
model.fit(X, y, epochs=100, batch_size=128)

Epoch 1/100
...
Epoch 78

<keras.src.callbacks.History at 0x798fbaf7d0c0>

In [None]:
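def generate_text(seed_text, num_words, model, word_to_index, index_to_word):
    """
    Greedily generate up to num_words words: at each step, pick the most probable
    next word and append it to the running context.
    """
    text = []
    for _ in range(num_words):
        # Encode the running context, skipping words outside the vocabulary
        encoded = [word_to_index[word] for word in seed_text.split() if word in word_to_index]
        # Left-pad with zeros, or keep only the most recent tokens, to fit the model input length
        encoded = pad_sequences([encoded], maxlen=sequence_length-1, padding='pre')
        y_pred = model.predict(encoded, verbose=0)

        # Get the index with the highest probability
        predicted_index = int(np.argmax(y_pred, axis=-1)[0])
        # Index 0 is the padding slot and has no word; stop if the model ever predicts it
        if predicted_index not in index_to_word:
            break
        predicted_word = index_to_word[predicted_index]

        seed_text += ' ' + predicted_word
        text.append(predicted_word)
    return ' '.join(text)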


In [None]:
seed_text = "The French farmer"
generated_text = generate_text(seed_text, 500, model, word_to_index, {v: k for k, v in word_to_index.items()})
print(generated_text)

sinks convoy en Cramming noodle back pants weight door killin’ sheer violence Marcel sat sympathetic French beam emanating neck violence lion HAMMERSMARK puffing underneath man soldier entire black feet gold tableau certain Palu NAZI ENLISTED MEN moved napkin wall gonna one truck opposite end table Nazi knife marquee saying behind French dialogue going Basterds worse German Herrman dead Nazi caricatures ladies burning precede makes grab KNIFE rodent skin Monsieur LaPadite rumors heard regarding back us agonizing Louisaiane swill star false often Fredrick’s black fishnet veil face FLASK fighting one says S.S cap German found occur stinks playing kaput without posters kiosks one table interpreter S.S takes studies taking German Nazi taking occur short kaput calabash rat scamper door momentarily LOCKING folder now-vacant cinema entrance Inside cigarette FIRST German classic films However anyway winning beyond lie hostess medical pop file cinema approaching gold Nancy pipe Shosanna’s cinem

### Personal Contribution Statement

#### As this is a solo project, all tasks were completed by me. The development of the text generation model from a movie script was planned over a three-week period covering the key stages of the project.

#### Week 1: Script Preparation and Feature Extraction

Script Preprocessing: Initiated the project by cleaning, formatting, and structurally organizing the movie script for computational analysis.

Feature Extraction: Implemented tokenization of the script, removing stop words and punctuation, and created a structured dataset. This included building a frequency-based vocabulary for model training.

#### Week 2: Model Development and Initial Training

Data Preparation: Prepared the data for the neural network by generating and formatting sequences from the tokenized script, including one-hot encoding of output labels.

Model Design and Commencement of Training: Designed the LSTM-based neural network model, selecting the architecture and configuring the embedding, LSTM, dropout, and dense layers. Initiated the model training.

#### Week 3: Model Optimization and Text Generation

Model Fine-Tuning: Continued with the training of the model, making iterative adjustments and refinements to optimize its performance and accuracy.

Text Generation Implementation and Testing: Developed the text generation functionality to enable the model to produce new script content. Conducted thorough testing and evaluation of the generated text, ensuring coherence and alignment with the style of the input script.