### Installing the dependencies

In [1]:
!pip install transformers
!pip install sentence-transformers
!pip install gpt-2-simple



### Selecting Tensorflow version

In [2]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [3]:
!nvidia-smi     # Check which GPU has been allotted to us

Mon Feb 14 09:00:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Importing gpt_2_simple and downloading the 124M GPT-2 model

In [4]:
import gpt_2_simple as gpt2_simple
gpt2_simple.download_gpt2(model_name='124M')   # '124M' is the size of the GPT-2 model and also the model's name.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Fetching checkpoint: 1.05Mit [00:00, 219Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 4.13Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 518Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:13, 38.2Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 495Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 7.76Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 6.88Mit/s]


### Mounting Google Drive

Mounting Google Drive lets us save the model checkpoint directly to Drive and load it back in later sessions, so the time-consuming fine-tuning does not have to be repeated.

In [5]:
gpt2_simple.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Loading the JSON file

In [6]:
import json

raw_data = '/content/ScrappedData.json'

with open(raw_data, 'r') as f:
    records = json.load(f)

data = []

# Each record contributes one prompt line and one reply line.
# Assumes 'Questions' and 'Context' are stored as plain strings in every record.
for record in records:
    question = '[YOU] : ' + record['Questions']
    answer = '[BOT] : ' + record['Context']
    data.append(question)
    data.append(answer)

In [7]:
# Create a new text file named chatbot.txt and write the data to it, one entry per line.
with open('chatbot.txt', 'w') as f:
    for line in data:
        f.write(line)
        f.write('\n')
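
For reference, here is a minimal sketch of the line format that ends up in `chatbot.txt`. The helper name `to_training_lines` and the sample record are hypothetical (they are not taken from `ScrappedData.json`), and the sketch assumes each record stores its question and context as plain strings.

```python
def to_training_lines(records):
    """Turn question/context records into alternating prompt and reply lines."""
    lines = []
    for record in records:
        lines.append('[YOU] : ' + record['Questions'])
        lines.append('[BOT] : ' + record['Context'])
    return lines


# Hypothetical sample record, used only to illustrate the format.
sample = [{'Questions': 'What is overfitting?',
           'Context': 'A model that memorises noise in the training data.'}]

for line in to_training_lines(sample):
    print(line)
```

Each question is followed directly by its answer, so the fine-tuned model can learn to continue a `[YOU]` line with a `[BOT]` line.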

In [8]:
file_name = "/content/chatbot.txt"

### Loading the checkpoint from Google Drive

> I fine-tuned the GPT-2 model once and then reloaded the checkpoint in the next Colab session, so I did not have to repeat the time-consuming fine-tuning.
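
One way to decide between a fresh fine-tune and reusing a saved run is to check whether the checkpoint folder already exists. The sketch below is only an illustration: the helper `has_saved_run` is hypothetical, and it assumes checkpoints live under `checkpoint/<run_name>`, which matches the `checkpoint/run1` path printed when the checkpoint is loaded later in this notebook.

```python
import os


def has_saved_run(run_name, checkpoint_dir='checkpoint'):
    """Return True if a checkpoint folder for this run exists and is not empty."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    return os.path.isdir(run_dir) and len(os.listdir(run_dir)) > 0


# Value that could be passed as restore_from to gpt2_simple.finetune:
# 'latest' resumes from a saved run, 'fresh' starts from the base model.
restore_from = 'latest' if has_saved_run('run1') else 'fresh'
print(restore_from)
```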

### Initial fine-tuning

In [None]:
session = gpt2_simple.start_tf_sess()
gpt2_simple.finetune(session, dataset=file_name, model_name='124M', run_name='run1',
                     steps=1000, print_every=10, sample_every=10, save_every=50,
                     restore_from='fresh', learning_rate=0.00001)


### Copying the checkpoint to Google Drive


> NOTE: This creates a tar file. Download it, extract the contents, copy the extracted folder back to Google Drive, and then load it.



In [None]:
gpt2_simple.copy_checkpoint_to_gdrive(run_name = 'run1')
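
The extraction step mentioned in the note can also be done in Python with the standard `tarfile` module. This is a minimal sketch, not part of the original workflow: the helper `extract_checkpoint` is hypothetical, and the archive name `checkpoint_run1.tar` is an assumption, so adjust it to the file that actually appears in your Drive.

```python
import os
import tarfile


def extract_checkpoint(tar_path, dest_dir='.'):
    """Extract a checkpoint archive into dest_dir and return its member names."""
    with tarfile.open(tar_path, 'r') as tar:
        names = tar.getnames()
        # Newer Python versions offer a safer extraction filter; use it when available.
        kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
        tar.extractall(dest_dir, **kwargs)
    return names


# Assumed archive name (hypothetical); nothing happens if the file is absent.
archive = 'checkpoint_run1.tar'
if os.path.exists(archive):
    print(extract_checkpoint(archive))
```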

Load the fine-tuned checkpoint created above. Once the checkpoint is loaded, you can generate text without fine-tuning again. If you want to make changes or continue training, you can resume fine-tuning from this checkpoint.

In [10]:
gpt2_simple.copy_checkpoint_from_gdrive("run1") 
session = gpt2_simple.start_tf_sess()
gpt2_simple.load_gpt2(session, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000
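
The mock-interview cell that follows scores each answer by the cosine similarity between two sentence embeddings and treats a score above 0.50 as correct. As a minimal sketch of that computation, the snippet below uses plain Python and made-up two-dimensional vectors in place of real embeddings; the helper names `cosine` and `grade` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def grade(score, threshold=0.50):
    """Label an answer using the same 0.50 cut-off as the notebook."""
    return 'Correct' if score > threshold else 'Wrong'


# Toy vectors standing in for sentence embeddings (not real model output).
a = [1.0, 0.0]
b = [1.0, 1.0]
score = cosine(a, b)  # 1 / sqrt(2), roughly 0.707
print(grade(score), round(score, 3))
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, which is why an unrelated answer can produce a slightly negative score.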


In [12]:
import pandas as pd
import numpy as np
import random
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

class Mock:
    def __init__(self, training_data):
        self.train_data = training_data
        self.questions = training_data['Questions'].tolist()
        # The counters are initialized here so that they accumulate across calls
        # to cos_similarity instead of being reset for every question.
        self.correct_cnt = 0
        self.incorrect_cnt = 0
        # Holds the most recent similarity score (a 1x1 array once computed).
        self.score = 0
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def cos_similarity(self, answer_by_gpt, answer_input):
        # answer_by_gpt is the reference answer generated by GPT-2;
        # answer_input is the answer typed by the user.
        sentences = [answer_input, answer_by_gpt]
        embed = self.model.encode(sentences)
        # cosine_similarity returns a 1x1 array for a single pair of embeddings.
        self.score = cosine_similarity([embed[0]], [embed[1]])
        # Compare the scalar value against the 0.50 threshold.
        if self.score[0][0] > 0.50:
            print('Correct', self.score)
            self.correct_cnt += 1
        else:
            print('Wrong', self.score)
            self.incorrect_cnt += 1



if __name__ == "__main__":
    mock_interview_dataset = pd.read_csv('Web Scrapped data.csv', encoding='unicode_escape')
    my_chatbot = Mock(mock_interview_dataset)  # scores answers by sentence similarity
    num_questions = 4                          # number of questions asked per interview
    while True:
        ready_or_not = input("Hello, I will be taking your Mock Interview. Are you Ready? Y/N \n")
        if ready_or_not.lower() == 'y':
            chosen = []
            while len(chosen) < num_questions:
                choice = random.choice(my_chatbot.questions)
                if choice not in chosen:  # never ask the same question twice
                    chosen.append(choice)
                    print("Question : ", np.unique(choice))
                    print('Your Answer :')
                    user_input = input()

                    # Generate a reference answer by prompting GPT-2 with the question.
                    output_by_gpt = gpt2_simple.generate(session,
                                                         length=50,
                                                         temperature=0.6,
                                                         include_prefix=False,
                                                         prefix=choice,
                                                         nsamples=1,
                                                         seed=42,
                                                         return_as_list=True)[0]
                    # Compare the generated answer with the user's answer using sentence similarity.
                    my_chatbot.cos_similarity(output_by_gpt, user_input)
            print('Total Correct answers :', my_chatbot.correct_cnt)
            print('Total Incorrect answers :', my_chatbot.incorrect_cnt)
            print('Final Score :' + str((my_chatbot.correct_cnt / num_questions) * 100) + '%')
            print("Thank you for taking this interview.")
            break
        elif ready_or_not.lower() == 'n':
            break

Hello, I will be taking your Mock Interview. Are you Ready? Y/N 
y
Question :  ['Differentiate between Statistical Modeling and Machine Learning?']
Your Answer :
i dont know
Wrong [[-0.0549138]]
Question :  ['What is the error term composed of in regression?']
Your Answer :
Error is a sum of bias error+variance error+ irreducible error in regression. Bias and variance error can be reduced but not the irreducible error.
Correct [[0.59368575]]
Question :  ['What are the hyperparameters of an SVM?']
Your Answer :
The gamma value, c value and the type of kernel are the hyperparameters of an SVM model.
Correct [[0.5763355]]
Question :  ['Explain the differences between Random Forest and Gradient Boosting machines.']
Your Answer :
Random forests are a significant number of decision trees pooled using averages or majority rules at the end. Gradient boosting machines also combine decision trees but at the beginning of the process unlike Random forests. Random forest creates each tree independe