# Introduction

> In this notebook, we explore a unique approach to generating textual data that pivots around a specific theme: chess. The primary aim is to create a flexible generator capable of producing sentences that are either related to the game of chess or entirely unrelated, based on a simple toggle.


> Such data can be used to practise vectorization techniques and to compare classical machine-learning (ML) models with neural networks (NN) on this particular case. It also lets me refresh my memory of the vectorization techniques that apply in NLP.


# Strategy

> 1. Keyword Categorization: We categorize our vocabulary into two sets: one related to chess (e.g., "pawn", "checkmate", "bishop") and another comprising general words unrelated to chess (e.g., "apple", "music", "running"). This categorization serves as the foundation for thematic relevance.

> 2. Parameter-Driven Generation: Through a Boolean parameter (about_chess), users can specify whether the generated sentence should be about chess. This simple yet effective method allows for flexibility in the type of data generated.

> 3. Randomization Techniques: We use randomization to select words from the appropriate category, decide on the length of the sentence, and determine the placement of commas. This approach makes the output vary from call to call (although repeats are possible with such a small vocabulary) and gives it a semblance of the unpredictability inherent in natural language.

> 4. Iterative Development: The function loops a random number of times (5 to 10) to construct a sentence. Each iteration appends one randomly chosen word; before every word except the first, it randomly inserts either a comma followed by a space or a plain space. The result is a "sentence" that is a somewhat chaotic yet interesting assembly of words.

In [1]:
import random

def generate_sentence(about_chess=False):
    # Words not related to chess
    general_words = [
        "mountain", "river", "sky", "apple", "computer", "music", "book", "phone", 
        "running", "swimming", "jumping", "dancing", "eating", "sleeping"
    ]
    
    # Chess-related keywords
    chess_keywords = [
        "checkmate", "pawn", "knight", "bishop", "rook", "queen", "king", "opening", 
        "middlegame", "endgame", "castling", "stalemate", "Elo rating", "FIDE"
    ]
    
    # Decide on a list of words to use based on about_chess parameter
    words_to_use = chess_keywords if about_chess else general_words
    
    # Build the sentence from 5 to 10 randomly chosen vocabulary entries
    sentence = ""
    for _ in range(random.randint(5, 10)):
        if sentence:  # Before every entry except the first, add ", " or " " with equal probability
            sentence += ", " if random.random() > 0.5 else " "
        sentence += random.choice(words_to_use)
    
    return sentence

In [2]:
# Example usage
about_chess = True  # Change this to False to generate a sentence not about chess
generated_sentence = generate_sentence(about_chess)
generated_sentence

'endgame endgame, pawn, checkmate knight knight, Elo rating king'

In [3]:
about_chess = False
generated_sentence = generate_sentence(about_chess)
generated_sentence

'apple eating, jumping music running'
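
> The two outputs above change on every run because the `random` module is not seeded. A minimal sketch of how a run could be made repeatable follows. `seeded_sentence` and `generate_sentence_stub` are hypothetical names introduced here for illustration; the stub is a condensed stand-in with a shortened vocabulary so that the cell runs on its own.

```python
import random

# Condensed stand-in for the notebook's generate_sentence (assumption: shortened word lists).
CHESS = ["checkmate", "pawn", "knight", "bishop", "rook", "queen", "king"]
OTHER = ["mountain", "river", "sky", "apple", "computer", "music", "book"]

def generate_sentence_stub(about_chess=False):
    words = CHESS if about_chess else OTHER
    parts = []
    for _ in range(random.randint(5, 10)):
        if parts:  # separator before every word except the first
            parts.append(", " if random.random() > 0.5 else " ")
        parts.append(random.choice(words))
    return "".join(parts)

def seeded_sentence(func, seed, about_chess=False):
    """Seed the global random generator, then call the sentence generator."""
    random.seed(seed)
    return func(about_chess)

first = seeded_sentence(generate_sentence_stub, 42, about_chess=True)
second = seeded_sentence(generate_sentence_stub, 42, about_chess=True)
print(first == second)  # True: the same seed reproduces the same sentence
```

> In the notebook itself, the real `generate_sentence` could be passed in place of the stub. Seeding the global generator also affects any later `random.shuffle` call, which may or may not be desirable.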

In [6]:
import pandas as pd

def batch_generate_sentences(generate_sentence_func, n=10, randomness_percent=50, to_csv=False):
    """
    Generate a batch of sentences, some about chess and some not, based on a specified randomness percentage.
    
    :param generate_sentence_func: Function to generate individual sentences.
    :param n: Total number of sentences to generate.
    :param randomness_percent: Percentage (0 to 100) of the n sentences that should be about chess.
    :param to_csv: Whether to save the output as a CSV file.
    :return: Pandas DataFrame with the generated sentences and their classification.
    """
    data = []
    # Number of chess-related sentences; multiplying first and rounding avoids
    # float truncation (e.g. int((29 / 100) * 100) evaluates to 28, not 29)
    num_chess = round(n * randomness_percent / 100)
    
    for _ in range(num_chess):
        sentence = generate_sentence_func(True)
        data.append([sentence, "CHESS"])
        
    for _ in range(n - num_chess):
        sentence = generate_sentence_func(False)
        data.append([sentence, "OTHER"])
    
    # Shuffle the data to mix CHESS and OTHER sentences
    random.shuffle(data)
    
    # Create a DataFrame
    df = pd.DataFrame(data, columns=['Generated Sentence', 'About'])
    
    # Save to CSV if required
    if to_csv:
        df.to_csv("generated_sentences.csv", index=False)
    
    return df
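
> The number of chess sentences is derived from `n` and `randomness_percent`. That arithmetic can be sketched in isolation as below. `split_counts` and `chess_percent` are hypothetical names, not part of the notebook. The sketch multiplies before dividing and uses `round()`, because truncating a float product with `int()` can land one below the intended count.

```python
def split_counts(n, chess_percent):
    """Return (num_chess, num_other) for a batch of n sentences."""
    if not 0 <= chess_percent <= 100:
        raise ValueError("chess_percent must be between 0 and 100")
    # Multiply first so that whole-number inputs give an exact product before dividing.
    num_chess = round(n * chess_percent / 100)
    return num_chess, n - num_chess

print(split_counts(10000, 55))   # (5500, 4500)

# Truncation pitfall: 29 / 100 is stored slightly below 0.29 as a float.
print(int((29 / 100) * 100))     # 28
print(split_counts(100, 29)[0])  # 29
```

> Note that Python's `round()` rounds exact halves to the nearest even integer, so a batch whose chess share falls exactly on a half may round up or down by one.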

In [8]:
# Example usage with previously defined generate_sentence function and specific parameters
df_example = batch_generate_sentences(generate_sentence, n=10000, randomness_percent=55, to_csv=True)
df_example

,Generated Sentence,About
0,"king, endgame rook rook bishop FIDE pawn castling",CHESS
1,"dancing phone running jumping, mountain",OTHER
2,"bishop, Elo rating pawn pawn, knight, rook kni...",CHESS
3,"river apple, phone, mountain, eating apple, ri...",OTHER
4,"FIDE, knight pawn, pawn stalemate rook",CHESS
...,...,...
9995,"king, rook opening castling knight",CHESS
9996,"apple, music, music, music, running sky phone,...",OTHER
9997,"sky, computer dancing sleeping apple, running ...",OTHER
9998,"middlegame opening knight middlegame, queen, q...",CHESS
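
> Since the data is meant for practising vectorization, here is a minimal bag-of-words sketch in pure Python, applied to the first two sentences shown above. The helper names `tokenize` and `vectorize` are assumptions made for illustration. The tokenizer lowercases, drops commas and splits on whitespace, so a two-word keyword such as "Elo rating" would become two tokens.

```python
from collections import Counter

# Two sentences copied from the generated output above.
sentences = [
    "king, endgame rook rook bishop FIDE pawn castling",
    "dancing phone running jumping, mountain",
]

def tokenize(sentence):
    """Lowercase, drop commas, and split on whitespace."""
    return sentence.lower().replace(",", " ").split()

# Fixed, sorted vocabulary built from every token seen in the sample.
vocabulary = sorted({tok for s in sentences for tok in tokenize(s)})

def vectorize(sentence):
    """Return one count per vocabulary entry (bag-of-words vector)."""
    counts = Counter(tokenize(sentence))
    return [counts[tok] for tok in vocabulary]

vectors = [vectorize(s) for s in sentences]
print(vocabulary)
print(vectors[0])  # "rook" has a count of 2, every other chess token a count of 1
```

> Each vector has one position per vocabulary entry, so the chess and non-chess sentences occupy disjoint positions here. A library vectorizer could replace this sketch once the idea is clear.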
