### Step 1: Importing the dependencies

In [1]:
import pandas as pd
import re
import pickle
import tensorflow as tf
import tensorflow_datasets as tfds
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# Download NLTK resources (stopwords and WordNet)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

# Loading the pre-trained Word2Vec model (only the first 500,000 vectors)
from gensim.models import KeyedVectors
W2V_model = KeyedVectors.load_word2vec_format('../Datasets/archive/GoogleNews-vectors-negative300.bin.gz',binary=True,limit=500000)

2024-02-02 13:45:05.578988: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-02 13:45:05.778032: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-02 13:45:05.779423: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package stopwords to /home/yuvraj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yuvraj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Step 2: Creation of Questions dataframe
The questions are currently stored in 5 separate files, one per domain. In this step we combine them into a single dataframe with a total of 1600 questions, each labelled with the domain it belongs to. The domains are:
- DSA questions
- System design questions
- AI questions
- Computer fundamental questions
- Behavioural questions

In [2]:
file_paths = ['../Dataset/DSA_que.txt', '../Dataset/System_Design_que.txt', '../Dataset/Behavioural_que.txt',
              '../Dataset/CS_fundamentals.txt', '../Dataset/AI_que.txt']
que_type = {0: 'DSA',
            1: 'System_design',
            2: 'Behavioural',
            3: 'CS_fundamentals',
            4: 'AI'}
df = pd.DataFrame()

# Going over all the paths; enumerate keeps each file paired with its category
# even if an earlier file fails to load
for count, paths in enumerate(file_paths):
    try:
        with open(paths, 'r') as file:
            # Removing the leading and trailing white space from every line of the file
            que_ls = [line.strip() for line in file]

        # Creating a dataframe from the question list
        temp_df = pd.DataFrame({'Que': que_ls})

        # Adding a feature 'Category'
        temp_df['Category'] = que_type[count]

        # Concatenating the dataframes
        df = pd.concat([df, temp_df], axis=0)

    except FileNotFoundError:
        print(f"The file {paths} was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

In [3]:
# Shuffling the dataframe
df = df.sample(frac=1, random_state=42)

# Resetting the index
df.reset_index(drop=True,inplace=True)

In [4]:
# Saving the dataframe
# df.to_csv('../Dataset/Que_Classification.csv')

### Exploratory data analysis

### Loading data to tensorflow dataset
Loading the data into a tf.data.Dataset has several advantages, which is why it is the standard input format for TensorFlow pipelines:

- Efficient memory usage: The tf.data API can stream records from disk and load them on the fly, which helps when a dataset does not fit entirely into memory. The questions here are small enough to be held in memory, so from_tensor_slices is used.

- Parallelism: Data loading and preprocessing can run in parallel across CPU cores and overlap with model training or inference, so the model does not sit idle waiting for the next batch.

- Data transformation: tf.data.Dataset supports flexible transformations such as shuffling, batching, mapping, and filtering to prepare the data for training or inference.
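
As a rough illustration of the points above, the sketch below chains a parallel map, batching, and prefetching on a handful of made-up questions. The sample strings and the choice of tf.strings.lower as the transformation are assumptions for illustration only, not part of this notebook's pipeline.

```python
import tensorflow as tf

# Hypothetical sample questions, used only for this illustration
questions = [
    "What is a stack?",
    "How does DNS work?",
    "Explain overfitting.",
    "What is a mutex?",
]

ds = (
    tf.data.Dataset.from_tensor_slices(questions)
    # Transformation: lowercase each question, letting TensorFlow pick the parallelism
    .map(tf.strings.lower, num_parallel_calls=tf.data.AUTOTUNE)
    # Group the questions into batches of two
    .batch(2)
    # Prepare the next batch while the current one is being consumed
    .prefetch(tf.data.AUTOTUNE)
)

batches = [batch.numpy().tolist() for batch in ds]
print(batches)
```

Each step returns a new dataset, so the pipeline stays lazy until it is iterated.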

In [5]:
# Batch size and random seed used when shuffling and batching the dataset
batch_size = 32
seed = 42

In [6]:
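# Creating TensorFlow Dataset directly from the DataFrame
raw_text_ds = tf.data.Dataset.from_tensor_slices(df['Que'])

# Shuffling the items once and creating batches; reshuffle_each_iteration=False keeps
# the order fixed across passes, so a later take/skip split does not mix train and test items
raw_text_ds = raw_text_ds.shuffle(len(df), seed=seed, reshuffle_each_iteration=False).batch(batch_size)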

In [7]:
# Display some benchmark statistics
tfds.benchmark(raw_text_ds)


************ Summary ************




Examples/sec (First included) 1512.66 ex/sec (total: 53 ex, 0.04 sec)
Examples/sec (First only) 36.51 ex/sec (total: 1 ex, 0.03 sec)
Examples/sec (First excluded) 6800.10 ex/sec (total: 52 ex, 0.01 sec)


| | duration | num_examples | avg |
|---|---|---|---|
| first+lasts | 0.035038 | 53 | 1512.662239 |
| first | 0.027391 | 1 | 36.508854 |
| lasts | 0.007647 | 52 | 6800.098549 |


In [8]:
for text_batch in raw_text_ds.take(1):
  for i in range(5):
    print("Que", text_batch.numpy()[i])

Que b'What is the purpose of the I/O scheduler in an operating system?'
Que b'Describe a situation where you had to collaborate with other departments to enhance customer experience.'
Que b'What is the difference between breadth-first search (BFS) and depth-first search (DFS)?'
Que b'How would you design an e-commerce checkout system?'
Que b'Discuss the role of activation functions in preventing vanishing gradients.'


### Train, test split + Data loading optimization

In [9]:
print("Total Batches : ",len(raw_text_ds))
print("Total Training Batches (80:20) : ",len(raw_text_ds)*0.8)
print("Total Testing Batches : ",len(raw_text_ds)*0.2)

Total Batches :  52
Total Training Batches (80:20) :  41.6
Total Testing Batches :  10.4
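
The fractional counts printed above cannot be passed to take and skip directly, since a dataset can only be split on whole batches. The sketch below works through the arithmetic for 52 batches, assuming the counts are truncated with int(); the helper name split_batch_counts is hypothetical.

```python
def split_batch_counts(total_batches, train_frac, test_frac):
    """Return whole-batch counts for train and test, plus any batches left over."""
    train_batches = int(train_frac * total_batches)  # 41.6 is truncated to 41
    test_batches = int(test_frac * total_batches)    # 10.4 is truncated to 10
    unused_batches = total_batches - train_batches - test_batches
    return train_batches, test_batches, unused_batches


counts = split_batch_counts(52, 0.8, 0.2)
print(counts)  # (41, 10, 1)
```

With truncation, one of the 52 batches ends up in neither split.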


A slow dataset pipeline can bottleneck training, because the model sits idle while it waits for the next batch. tfds.benchmark is a simple tool for measuring the throughput of a dataset pipeline. It can be used to identify bottlenecks, compare different pipelines, and track progress over time. Because the datasets here are already batched, each "example" in its report is one batch of up to 32 questions.

In [10]:
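def create_train_test_val(ds,train_size,val_size):
    """
    Input : Batched dataset, fraction of batches for training, fraction of batches for testing
    Output: Training dataset, testing dataset

    Description: Splits the batched dataset into a training and a testing set. Despite the name,
    val_size sets the size of the testing set; no separate validation set is created.
    """
    # Calculating total batches
    total_batches = len(ds)
    
    # Number of whole batches in each split (int() rounds down, so 52 batches give 41 + 10 and one batch is unused)
    train_ds_batches = int(train_size*total_batches)
    test_ds_batches = int(val_size*total_batches)
    
    # The first train_ds_batches batches form the training set, the next test_ds_batches the testing set
    train_ds = ds.take(train_ds_batches) 
    test_ds = ds.skip(train_ds_batches).take(test_ds_batches)
    
    # Caching and prefetching the dataset to improve data pipeline performance
    train_ds = train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
    test_ds = test_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
    
    return train_ds,test_ds


# Calling the function (80:20 split)
train_ds,test_ds = create_train_test_val(raw_text_ds,0.8,0.2)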
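
A take/skip split like the one above is only disjoint if the dataset yields its elements in the same order on every pass. The sketch below checks this on a toy dataset of ten integers standing in for batches. It assumes the shuffle is created with a fixed seed and reshuffle_each_iteration=False; the names toy_ds, train_part, and test_part are illustrative only.

```python
import tensorflow as tf

# Toy stand-in for the batched question dataset: ten integer "batches".
# reshuffle_each_iteration=False fixes the shuffled order across passes.
toy_ds = tf.data.Dataset.range(10).shuffle(
    10, seed=42, reshuffle_each_iteration=False
)

train_part = toy_ds.take(8)
test_part = toy_ds.skip(8).take(2)

# Each split iterates over toy_ds independently
train_items = {int(x) for x in train_part.as_numpy_iterator()}
test_items = {int(x) for x in test_part.as_numpy_iterator()}

print("overlap:", train_items & test_items)
print("items covered:", len(train_items | test_items))
```

If the shuffle is allowed to reshuffle on each pass, the two splits are drawn from different orderings and the same item can land in both.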

In [11]:
# Display some benchmark statistics
tfds.benchmark(train_ds)


************ Summary ************




Examples/sec (First included) 1728.01 ex/sec (total: 42 ex, 0.02 sec)
Examples/sec (First only) 56.28 ex/sec (total: 1 ex, 0.02 sec)
Examples/sec (First excluded) 6271.10 ex/sec (total: 41 ex, 0.01 sec)


| | duration | num_examples | avg |
|---|---|---|---|
| first+lasts | 0.024305 | 42 | 1728.008144 |
| first | 0.017768 | 1 | 56.282497 |
| lasts | 0.006538 | 41 | 6271.099957 |


### Data processing pipeline

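String comparison is case-sensitive, so "cat" and "Cat" are treated as two different words unless the text is lowercased first.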

In [12]:
# Example to understand why we need to do lowercasing
word1 = "cat"
word2 = "Cat"
word2_lowercased = word2.lower()

def compare(word1,word2):
    if word1 == word2:
        print("Same")
    else:
        print("Different")

compare(word1,word2)
compare(word1,word2_lowercased)

Different
Same


In [13]:
# Function to process a single text entry in the dataset
def process_text(text):
    """
    Input : Single raw text
    Output: Cleaned text

    Description: This function takes a single raw text as input, lowercases and tokenizes it, removes stopwords, and strips punctuation.
    Finally, lemmatization is applied to the tokens and the cleaned text is returned.
    """
    # Lowercasing and Tokenizing the text
    tokens = tf.strings.lower(tf.strings.split(text))
    
    # Removing the stopwords as they carry little information for classification.
    # Note: this runs before punctuation removal, so a token such as "it." is not matched and is kept.
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.numpy().decode('utf-8') not in stop_words]
    
    # Remove punctuation
    tokens = [tf.strings.regex_replace(token, '[%s]' % re.escape(string.punctuation), '') for token in tokens]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token.numpy().decode('utf-8')) for token in tokens]
    
    # Join tokens back into a single string
    processed_text = tf.strings.reduce_join(tokens, separator=' ')
    
    return processed_text

In [23]:
# Named sample_text rather than str to avoid shadowing Python's built-in str type
sample_text = "The quick brown fox jumps over the lazy dogs, but they don't seem to care about it."
output = process_text(sample_text)
print(output.numpy())

b'quick brown fox jump lazy dog seem care it'
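
The trailing "it" in the output above survives because the stopword check sees the token "it." (with the full stop) before punctuation is stripped. The plain-Python sketch below compares the two orderings. It assumes a small hand-picked stopword set rather than NLTK's full list, skips lemmatization, and uses a hypothetical helper named clean.

```python
import string

# Small illustrative subset of stopwords (an assumption, not NLTK's full list)
STOP_WORDS = {"the", "over", "but", "they", "don't", "to", "about", "it"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text, strip_punct_first):
    """Lowercase, tokenize, then filter stopwords and strip punctuation in the chosen order."""
    tokens = text.lower().split()
    if strip_punct_first:
        tokens = [t.translate(PUNCT_TABLE) for t in tokens]
        tokens = [t for t in tokens if t not in STOP_WORDS]
    else:
        tokens = [t for t in tokens if t not in STOP_WORDS]
        tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    # Drop tokens that became empty after punctuation removal
    return " ".join(t for t in tokens if t)


sentence = "The quick brown fox jumps over the lazy dogs, but they don't seem to care about it."
stopwords_first = clean(sentence, strip_punct_first=False)
punct_first = clean(sentence, strip_punct_first=True)
print(stopwords_first)  # quick brown fox jumps lazy dogs seem care it
print(punct_first)      # quick brown fox jumps lazy dogs dont seem care
```

Neither order is perfect: stripping punctuation first removes "it" but turns "don't" into "dont", which no longer matches the stopword entry.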


tf.py_function is used here to run a Python function (process_text in this case) inside a tf.data pipeline. Dataset.map traces the mapped function into a TensorFlow graph, even when eager execution is enabled, so the tensors it receives are symbolic and have no .numpy() method. process_text depends on .numpy(), the NLTK stopword list, and the WordNet lemmatizer, none of which can run inside a graph. tf.py_function bridges the gap: it wraps the Python function as a single graph operation whose body executes eagerly when the pipeline runs.

Here is what to keep in mind when using tf.py_function in process_dataset_element:

- Integration with the TensorFlow graph: The wrapped function appears as one opaque operation in the graph. The rest of the pipeline (shuffling, batching, caching, prefetching) still runs as native TensorFlow operations.

- Eager execution inside the wrapper: The function receives eager tensors, so .numpy() and ordinary Python libraries work as they do outside a pipeline.

- Parallel execution and distributed training: The wrapped code runs in the Python interpreter and acquires the global interpreter lock, which prevents efficient parallelization. The function body is also not serialized with the graph and must run in the same process as the Python program, so it does not suit exported models or multi-process distributed training.

- Tensor shapes: TensorFlow cannot infer the shape of a tf.py_function output. The output dtype must be declared through Tout, and the static shape has to be set manually (for example with set_shape) if later steps depend on it.

In summary, tf.py_function is a wrapper that lets custom Python code run within a tf.data pipeline. It adds overhead on every call because execution switches between the TensorFlow runtime and the Python interpreter, but it is the simplest way to reuse non-TensorFlow code such as NLTK. Where throughput matters, two alternatives are to express the cleaning steps with native operations such as tf.strings.lower and tf.strings.regex_replace, or to clean the text once in pandas before building the dataset.
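
The sketch below shows the behaviour on a toy dataset: mapping a function that calls .numpy() directly fails, while the same function wrapped in tf.py_function works. The functions shout and shout_element and the sample strings are hypothetical and only stand in for process_text.

```python
import tensorflow as tf


def shout(text):
    # Plain Python string handling; needs a concrete value, hence .numpy()
    return text.numpy().decode("utf-8").upper()


def shout_element(element):
    # Wrap the Python function so it can be used inside Dataset.map
    out = tf.py_function(func=shout, inp=[element], Tout=tf.string)
    # py_function outputs have no static shape; each result is a scalar string
    out.set_shape([])
    return out


ds = tf.data.Dataset.from_tensor_slices(["what is a heap?", "explain tcp."])

# Mapping the raw Python function fails while the function is being traced
try:
    ds.map(shout)
    direct_map_failed = False
except Exception as error:
    direct_map_failed = True
    print("Direct map failed:", type(error).__name__)

results = [t.numpy() for t in ds.map(shout_element)]
print(results)
```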

In [14]:
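# Define the function to process a single element of the dataset
def process_dataset_element(element):

    processed_text = tf.py_function(func=process_text, inp=[element], Tout=tf.string)
    # tf.py_function drops static shape information; each cleaned text is a scalar string
    processed_text.set_shape([])
    return processed_text

# train_ds is batched, but process_text expects a single string,
# so unbatch before mapping and re-batch afterwards
cleaned_train_ds = train_ds.unbatch().map(process_dataset_element).batch(batch_size)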

