# Amazon Review Generator
GPT-2 demonstration by Feng Zhang

#### Disclaimer: 
Since we don't have permission to install packages on Cheaha directly, we created a virtual environment under jwang96's home directory, installed all the required packages there, and linked the environment to Jupyter as tf-local. To keep the development environment consistent and make debugging easier, we developed our code under jwang96's Cheaha account and used the dataset in jwang96's scratch directory.

#### Warning: After running one model ('gpt-books' or 'gpt-elec'), restart the kernel before loading the other. gpt2 can load only one model at a time.




In [1]:
import gpt_2_simple as gpt2

## 1. One sample from the Dataset

In [2]:
data_file = '/data/scratch/jwang96/Books_5.json'
with open(data_file, 'r') as f:
    print(f.readline())
    


{"overall": 5.0, "verified": false, "reviewTime": "03 30, 2005", "reviewerID": "A1REUF3A1YCPHM", "asin": "0001713353", "style": {"Format:": " Hardcover"}, "reviewerName": "TW Ervin II", "reviewText": "The King, the Mice and the Cheese by Nancy Gurney is an excellent children's book.  It is one that I well remember from my own childhood and purchased for my daughter who loves it.\n\nIt is about a king who has trouble with rude mice eating his cheese. He consults his wise men and they suggest cats to chase away the mice. The cats become a nuisance, so the wise men recommend the king bring in dogs to chase the cats away.  The cycle goes on until the mice are finally brought back to chase away the elephants, brought in to chase away the lions that'd chased away the dogs.\n\nThe story ends in compromise and friendship between the mice and the king.  The story also teaches cause and effect relationships.\n\nThe pictures that accompany the story are humorous and memorable.  I was thrilled to 

## 2. Generate Dataset (1M reviews on Books)

* We only need the score (overall) and text (reviewText) of each record, so we process the raw data into our own training file. Newlines inside a review are replaced with spaces, and each record is written as "score: text". We also append ' <|endoftext|>\n\n' after each record, so that generated text can later be truncated at the end of a review.

In [18]:
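import json

output_file = '/data/scratch/jwang96/books.large'

def extract_corpus(data_file, n):
    """Build the training text from the first n lines of data_file."""
    records = []
    with open(data_file, 'r') as f:
        for i in range(n):
            line = f.readline()
            if not line:  # reached the end of the file before n lines
                break
            try:
                js = json.loads(line)
                revtxt = js['reviewText'].replace('\n', ' ')
                score = js['overall']
            except (ValueError, KeyError, AttributeError):
                # skip malformed lines and records without a usable text or score
                continue
            records.append(str(score) + ': ' + revtxt + ' <|endoftext|>\n\n')
    # join once at the end instead of growing one large string in the loop
    return ''.join(records)

with open(output_file, 'w') as f:
    f.write(extract_corpus(data_file, 1000000))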

## 3. Load fine-tuned model

#### Warning: Restart the kernel first if the model 'gpt-elec' was loaded earlier. gpt2 can load only one model at a time.
* The model is fine-tuned from the 124M pretrained model, which is downloaded through https://github.com/minimaxir/gpt-2-simple.
* The training code is in the GPT2_train.py file.
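
GPT2_train.py is not reproduced in this notebook. As a rough sketch only, fine-tuning with gpt-2-simple usually looks like the function below. The function name `finetune_reviews` is hypothetical, and `steps=1000` is an assumption based on the checkpoint name `model-1000` that appears when the model is loaded; the real training script may use different settings.

```python
import os


def finetune_reviews(dataset_path, run_name, base_model="124M", steps=1000):
    """Fine-tune a pretrained GPT-2 model on a review corpus (sketch)."""
    # Imported lazily so that defining this function does not load TensorFlow.
    import gpt_2_simple as gpt2

    # Download the pretrained weights once; gpt-2-simple keeps them under ./models.
    if not os.path.isdir(os.path.join("models", base_model)):
        gpt2.download_gpt2(model_name=base_model)

    sess = gpt2.start_tf_sess()
    # Checkpoints are written to checkpoint/<run_name>/, which is where
    # gpt2.load_gpt2(sess, run_name=...) looks for them later.
    gpt2.finetune(
        sess,
        dataset=dataset_path,
        model_name=base_model,
        steps=steps,  # assumed value, not taken from GPT2_train.py
        run_name=run_name,
    )
    return sess
```

For the books model this would be called as `finetune_reviews('/data/scratch/jwang96/books.large', 'gpt-books')`.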


In [2]:
model_name = 'gpt-books'
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=model_name)

W1201 21:04:32.874630 46912496440576 deprecation.py:323] From /home/jwang96/.conda/envs/tf/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Loading checkpoint checkpoint/gpt-books/model-1000


## 4. Example output:

### 4.1 General Review Text
* Generates review text for each rating from 1 to 5 stars.




In [5]:
gpt2.generate(sess, run_name=model_name, prefix='1.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='2.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='3.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='4.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='5.0', truncate='<|endoftext|>')

1.0: It is fun to talk about the history of the Indians in the early 1900's.  I have written more than once related to the lighthouse and the life of the woman who was once there.  I have also read about the Indian tribes that lived in the South during the early 1900's.  I love the fact that the author has a background in the history of the South.  It is not only interesting to read about the settlers of the South but also to hear about the history of the Indians.  I highly recommend this book to anyone interested in the history of the South. 
2.0: I was also given a copy of the book by the author.  This is a book that I recommend to people who are curious about the Bible. 
3.0: If you have been through the Heckman's Bible before, you will have read this book.  How else to explain how the Bible was written and what its meaning is.  I read this book once a year on a Sunday afternoon and by the next morning, I found it a home.  I had never read it, so I decided to read it again.  I had n
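
The five calls above print their results directly. If the generated reviews need to be kept for later comparison, a possible variant is to collect them with `return_as_list=True`, which makes `gpt2.generate` return the texts instead of printing them. The helper name `generate_by_rating` is hypothetical; this is a sketch, not the code that produced the output above.

```python
def generate_by_rating(sess, run_name,
                       ratings=("1.0", "2.0", "3.0", "4.0", "5.0")):
    """Generate one review per rating prefix and return them in a dict."""
    # Imported lazily so that defining this function does not load TensorFlow.
    import gpt_2_simple as gpt2

    reviews = {}
    for rating in ratings:
        texts = gpt2.generate(
            sess,
            run_name=run_name,
            prefix=rating,
            truncate="<|endoftext|>",
            return_as_list=True,  # return the samples instead of printing them
        )
        reviews[rating] = texts[0]  # one sample per call by default
    return reviews
```

Calling `generate_by_rating(sess, model_name)` gives a dictionary keyed by rating prefix, which is easier to inspect side by side than printed output.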

### 4.2 Specific Review Text
* Generates review text for specific products.

In [5]:
gpt2.generate(sess, run_name=model_name, prefix='5.0: The Harry Potter', truncate='<|endoftext|>')

5.0: The Harry Potter author has created a perfect world that is so unique.  The characters are so unique and so original that they bring out the food at the table.  Your children will love the books.  This book is a must read! 


In [8]:
gpt2.generate(sess, run_name=model_name, prefix='5.0: The Twilight', truncate='<|endoftext|>')

5.0: The Twilight Zone is the second best book of all time.  It's the new kind of medium.  It's the same old medium that plays out in movies.  It's the same old medium that weaves into the music of the 1950s.  And the results are amazing.  The Twilight Zone is the one book with the signal to the world that the world has turned into a movie.  Will this be the next book?  I don't know.  It's too good to be true and I wish I had read it years earlier.  I'm not a fan of the movie, but it's still a great book. 


### Performance Analysis:
* The outputs read like human-written reviews, and the content matches the given prefix.
* Some reviews are very similar no matter which book title we give as the prefix.
    - This may be because book reviewers in our dataset often leave out plot details (perhaps to avoid spoilers), so most reviews praise the product in much the same way ("interesting", "funny", etc.).
* The model is terrible at discriminating between low-score and high-score reviews.
    - The 5.0 review is not bad, while the 1.0 review does not make sense (the generated text reads as positive).
    - Low-score reviews are probably poor because there are too few of them in the training data. Only a small share of reviews carry a low score. This matches our own experience: we tend to give full marks as long as the product is not horrible.
    - Reviews with a 5.0 score are more likely to be short, because a large share of the 5.0 reviews in the training data are brief.
* According to Grammarly, the 4 examples score 62, 96, 90, and 78 respectively.
    - There are many small grammar mistakes, e.g., "its" should be "it's".
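
The claims about rating imbalance and short 5.0 reviews were not measured in this notebook. A minimal sketch of how they could be checked on the raw JSON lines is shown below; the function name `rating_stats` and the small in-memory sample are illustrative assumptions, not results from the dataset.

```python
import json
from collections import defaultdict
from statistics import median


def rating_stats(lines):
    """Return {score: (review_count, median_review_length_in_words)}."""
    lengths = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
            score = float(record["overall"])
            words = record["reviewText"].split()
        except (ValueError, KeyError, TypeError, AttributeError):
            # skip malformed lines and records without a usable text or score
            continue
        lengths[score].append(len(words))
    return {score: (len(v), median(v)) for score, v in sorted(lengths.items())}


# Made-up records in the same shape as Books_5.json, for illustration only.
sample = [
    '{"overall": 5.0, "reviewText": "Great book"}',
    '{"overall": 5.0, "reviewText": "Loved it"}',
    '{"overall": 1.0, "reviewText": "Not what I expected at all"}',
]
print(rating_stats(sample))
```

On the real data, the function can be fed the open file object for data_file directly, since iterating over a file yields one JSON record per line.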


## 5. Model fine-tuned on the Electronics dataset

#### Warning: Restart the kernel first if the model 'gpt-books' was loaded earlier. gpt2 can load only one model at a time.

* We also fine-tuned a model on another category, Electronics, to see whether we get different results.
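
A possible alternative to restarting the kernel, untested in this environment, is the `reset_session` helper that gpt-2-simple provides for clearing the TensorFlow graph before loading another model. The wrapper name `switch_model` is hypothetical, and whether this works depends on the installed gpt-2-simple and TensorFlow versions.

```python
def switch_model(sess, run_name):
    """Close the current session and load another fine-tuned run (sketch)."""
    # Imported lazily so that defining this function does not load TensorFlow.
    import gpt_2_simple as gpt2

    # reset_session clears the default TensorFlow graph, closes the old
    # session, and returns a fresh one.
    sess = gpt2.reset_session(sess)
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess
```

With the books model loaded, `sess = switch_model(sess, 'gpt-elec')` would then replace it with the electronics model. If loading still fails, restarting the kernel remains the safe option.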

In [2]:
model_name = 'gpt-elec' # fine-tuned from Electronic.json
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=model_name)

W1201 21:07:43.541576 46912496440576 deprecation.py:323] From /home/jwang96/.conda/envs/tf/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Loading checkpoint checkpoint/gpt-elec/model-1000


In [3]:
gpt2.generate(sess, run_name=model_name, prefix='1.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='2.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='3.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='4.0', truncate='<|endoftext|>')
gpt2.generate(sess, run_name=model_name, prefix='5.0', truncate='<|endoftext|>')

1.0: I have had this for over a year and it works perfectly.  I have had several other products.  This one works great as well.  It comes with a case, instructions and a usb mouse.  It works with my 1st gen mouse.  Its a little heavier than I expected, but I still use mine for the most part.  I use it for school.  Its a good deal. 
2.0: this is great. it does well for my needs, and the Mounts are the best it came with. 
3.0: I bought this for my Nikon D7000 and it works perfectly.  You can change settings and it's easy to set.  There are a couple of things that I really like: 1) it's small and compact enough to carry with you and it doesn't interfere with the camera's normal size. 2) It comes with a couple of lenses for the lens.  Once you figure it out, you can use the second lens for portraits and it's not really as big as the first one.  I would say that the second lens is pretty good and I got it for $5.  Again, my recommendation is that buy this for your D7000 and if you need it y

In [4]:
gpt2.generate(sess, run_name=model_name, prefix='5.0: The Kindle', truncate='<|endoftext|>')

5.0: The Kindle Fire II is awesome. I am happy with it. You can even turn it up to edit photos and videos. I can even use the Music player. 


In [5]:
gpt2.generate(sess, run_name=model_name, prefix='5.0: This Camera', truncate='<|endoftext|>')

5.0: This Camera is great for traveling. I have been using this camera for a month now and the pictures are so good. I use it with the Canon 17-55mm and have not had any problems. 


### Performance Analysis:
* The results are similar to those of the model fine-tuned on the books dataset.
* The content of the reviews becomes more specific, because electronic products differ from each other more than books do.
* The model is still terrible at discriminating between low-score and high-score reviews.
* The Grammarly scores are generally good, but some reviews score low because of grammar mistakes.







## 6. Summary

  In conclusion, GPT-2 does well at text generation. The generated content is relevant to the prefix, and the reviews clearly talk about different products when we change the training dataset. The generated text is readable and easy to understand. The model generates fitting content for the prefix provided, although some sentences get low Grammarly scores due to small grammar mistakes (e.g., "its" should be "it's"). However, the model is terrible at discriminating between scores. We think this is because there are too few 1.0 reviews in the training data.
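
One way to test this hypothesis, which was not done here, would be to fine-tune on a corpus with the same number of reviews per rating. The sketch below downsamples (score, text) pairs to at most `per_rating` reviews for each score; the function name `balance_by_rating` and its parameters are illustrative assumptions.

```python
import random
from collections import defaultdict


def balance_by_rating(records, per_rating, seed=0):
    """Keep at most per_rating (score, text) pairs for each score."""
    by_score = defaultdict(list)
    for score, text in records:
        by_score[score].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for score in sorted(by_score):
        texts = by_score[score]
        chosen = rng.sample(texts, min(per_rating, len(texts)))
        balanced.extend((score, t) for t in chosen)

    # Shuffle so that reviews with the same score are not grouped together.
    rng.shuffle(balanced)
    return balanced
```

The balanced pairs could then be written out in the same "score: text <|endoftext|>" format used for the original training file.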
