# TEXT GENERATION USING TOURISM REVIEWS - Vibha Rao



* **Problem -** *Tourism reviews are a valuable source of information for potential travelers. However, they can be difficult to read and understand, especially if they are long or contain a lot of technical jargon.*

* **Solution -** *A text generation system built on GPT-2 Large that automatically summarizes tourism reviews would be a valuable tool for potential travelers. The system takes a tourism review as input and outputs a summary that is concise, informative, and easy to understand.*



*   **Tech stack used - Python, GPT-2 Large, Hugging Face Transformers & PyTorch.**


*   Python: The code is written in Python.
*   Hugging Face Transformers: The Transformers library is used to load the    
    GPT-2 Large model and the tokenizer.
*   PyTorch: PyTorch is the backend that runs the model; the token IDs are passed to it as PyTorch tensors to generate the text.





*   **Benefits:** The benefits of such a system would include:

1.  Increased efficiency: Potential travelers would be able to quickly and easily get the information they need from tourism reviews.
2.  Improved understanding: Potential travelers would be able to better understand the pros and cons of different tourism destinations.
3.  Increased trust: Potential travelers would be more likely to trust tourism reviews that have been summarized by a machine learning system.

In [1]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
link = 'https://drive.google.com/file/d/1cvkIruzYVNRNTeOWptH6-bnFNdsbW27A/view'

import pandas as pd

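# Extract the file ID (the second-to-last path segment of the share link).
# Named file_id so that it does not shadow Python's built-in id().
file_id = link.split("/")[-2]

downloaded = drive.CreateFile({'id': file_id})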
downloaded.GetContentFile('places_review.csv')

df = pd.read_csv('places_review.csv')
print(df)

                      City                Place  \
0        Aamby Valley City      19 Degree North   
1        Aamby Valley City      19 Degree North   
2        Aamby Valley City      19 Degree North   
3        Aamby Valley City      19 Degree North   
4        Aamby Valley City      19 Degree North   
...                    ...                  ...   
1482461              Zuluk  Zuluk Wildlife Area   
1482462              Zuluk  Zuluk Wildlife Area   
1482463              Zuluk  Zuluk Wildlife Area   
1482464              Zuluk  Zuluk Wildlife Area   
1482465              Zuluk  Zuluk Wildlife Area   

                                                    Review  Rating       Name  \
0        aamby valley beautiful place clear blue skies ...       5  Anonymous   
1        executed obt akshay thanx team thoroughly enjo...       4  Anonymous   
2        awesome experience atv tracts obstacles mainta...       5  Anonymous   
3        visited aamby valley yesterday short excursion...     
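
The `link.split("/")[-2]` step above only works while the share link ends in a trailing segment such as `/view`. A slightly more defensive sketch, assuming the `/file/d/<id>/` URL shape seen in the link above (the helper name `extract_drive_file_id` is hypothetical and not part of PyDrive):

```python
import re


def extract_drive_file_id(link: str) -> str:
    """Return the file ID from a Drive share link of the form .../file/d/<id>/..."""
    # Assumption: the ID is the path segment that follows "/file/d/".
    match = re.search(r"/file/d/([^/?#]+)", link)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {link}")
    return match.group(1)


link = 'https://drive.google.com/file/d/1cvkIruzYVNRNTeOWptH6-bnFNdsbW27A/view'
file_id = extract_drive_file_id(link)
print(file_id)
```

Unlike the positional split, this also handles a link that has no `/view` suffix, and it fails loudly on a link of a different shape instead of silently returning the wrong segment.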

# Inspecting the Data

In [3]:
# shape - the dimensions of the DataFrame as (rows, columns)
df.shape

(1482466, 7)

In [4]:
# shows the first five rows of the DataFrame
df.head()

Unnamed: 0,City,Place,Review,Rating,Name,Date,Raw_Review
0,Aamby Valley City,19 Degree North,aamby valley beautiful place clear blue skies ...,5,Anonymous,,Aamby valley is a beautiful place with its cle...
1,Aamby Valley City,19 Degree North,executed obt akshay thanx team thoroughly enjo...,4,Anonymous,,Very well executed obt by Akshay.... Thanx as ...
2,Aamby Valley City,19 Degree North,awesome experience atv tracts obstacles mainta...,5,Anonymous,,Awesome experience at the ATV\nTracts and obst...
3,Aamby Valley City,19 Degree North,visited aamby valley yesterday short excursion...,4,Anonymous,,we visited the Aamby Valley yesterday for shor...
4,Aamby Valley City,19 Degree North,far mumbai place finest adventure places visit...,5,Anonymous,,"Not far from Mumbai, this place is one of the ..."


In [5]:
# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(df.City[i])
    print(df.Raw_Review[i])
    print()

Review # 1
Aamby Valley City
Aamby valley is a beautiful place with its clear blue skies and fresh green grass. My family and I visited aamby valley to celebrate my mother's birthday. My mom had the most splendid time. Thanks to Pinky Bharadwaj for handling our booking from Bombay. Together...

Review # 2
Aamby Valley City
Very well executed obt by Akshay.... Thanx as a team we thoroughly enjoyed especially A frame and treasure hunt

Review # 3
Aamby Valley City
Awesome experience at the ATV
Tracts and obstacles well maintained,
Very safe yet challenging

Had a blast

Good experience

Helpful instructors :)

Review # 4
Aamby Valley City
we visited the Aamby Valley yesterday for short excursion trip from Mumbai. We drove down and travel time was approx 3hours. The city is so clean and away from the polluted air of Mumbai. 
We had out lunch at woodpecker hotel. I was truly impressed...

Review # 5
Aamby Valley City
Not far from Mumbai, this place is one of the finest adventure places I h

In [6]:
# Inspecting some of the reviews

# Getting last 3 rows from df
df_last_3 = df.tail(3)

# Printing df_last_3
print(df_last_3)

          City                Place  \
1482463  Zuluk  Zuluk Wildlife Area   
1482464  Zuluk  Zuluk Wildlife Area   
1482465  Zuluk  Zuluk Wildlife Area   

                                                    Review  Rating       Name  \
1482463  excellent watched place east sikkim visited pl...       5  Anonymous   
1482464  beautiful areas sikkim falls eastern sikkim an...       4  Anonymous   

         Date                                         Raw_Review  
1482463   NaN  A excellent & must watched place for east sikk...  
1482464   NaN  One of the most beautiful areas in Sikkim... i...  


# Preparing the Data

In [7]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}
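
The cells shown here define `contractions` but do not show it being applied to the reviews. A minimal sketch of one common way to use such a mapping, assuming whole-token, case-insensitive replacement is what is wanted (the helper `expand_contractions` and the small `sample_contractions` subset are illustrative, not taken from the notebook):

```python
import re

# Small subset of the mapping above, repeated so that the sketch runs on its own.
sample_contractions = {
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
    "we've": "we have",
}


def expand_contractions(text: str, mapping: dict) -> str:
    """Replace each known contraction in `text` with its expansion."""
    # Longest keys first, so that "can't've" would win over "can't" in the full mapping.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    # The mapping keys are lowercase, so look each match up in lowercase.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


print(expand_contractions("It's a place we've loved, don't miss it.", sample_contractions))
```

Because the lookup is done in lowercase, the expansion is always lowercase too; that fits the lowercased `Review` column but would change the capitalization of `Raw_Review` text.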

In [8]:
# Installing the transformers library - provides APIs to quickly download
# and use pre-trained models for natural language processing tasks
!pip install transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [9]:
# GPT2Tokenizer - splits text into tokens
# GPT2LMHeadModel - GPT-2 with a language-modeling head, used here to generate text
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# 1. Convert the sentences into the tokens

In [10]:
# GPT-2 Large is a language model with roughly 774 million parameters,
# pre-trained on a large corpus of web text.
# tokenizer - splits text into the tokens used by the GPT-2 Large model.

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

**Create a GPT2LMHeadModel instance from the pre-trained "gpt2-large" model, with the pad token ID set to the EOS token ID**

In [11]:
# The EOS token marks the end of a text in the GPT-2 vocabulary.
# GPT-2 has no dedicated pad token, so the EOS token ID is reused for padding.
model = GPT2LMHeadModel.from_pretrained('gpt2-large', pad_token_id=tokenizer.eos_token_id)

Downloading model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [12]:
# eos_token_id is the ID of the end-of-text (EOS) token in the tokenizer's vocabulary
tokenizer.eos_token_id

50256

In [13]:
# takes a token ID as input and returns the corresponding token string
tokenizer.decode(tokenizer.eos_token_id)

'<|endoftext|>'

### We use the first review from the reviews data here ⬇️

In [14]:
# return_tensors='pt' argument tells the tokenizer.encode() method
# to return the token IDs as a PyTorch tensor.

sentence = 'Aamby valley is a beautiful place with its clear blue skies and fresh green grass.'
numeric_ids = tokenizer.encode(sentence, return_tensors = 'pt')

In [15]:
numeric_ids

tensor([[   32,   321,  1525, 19272,   318,   257,  4950,  1295,   351,   663,
          1598,  4171, 24091,   290,  4713,  4077,  8701,    13]])

In [16]:
tokenizer.decode(numeric_ids[0][3])

' valley'

# 2. Generate the text given the sentence

In [17]:
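# max_length=100 caps the total length (prompt tokens plus generated tokens).
# num_beams=5 runs beam search with 5 beams.
# no_repeat_ngram_size=2 prevents any 2-gram from being generated twice.
# early_stopping=True ends the search once num_beams finished candidates exist.
result = model.generate(numeric_ids, max_length = 100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)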


In [18]:
result

tensor([[   32,   321,  1525, 19272,   318,   257,  4950,  1295,   351,   663,
          1598,  4171, 24091,   290,  4713,  4077,  8701,    13,   632,   318,
           635,   530,   286,   262,  1178,  4113,   287,   262,   995,   810,
           345,   460,   766,   262, 34822,  6378,   422,   262,  1353,   286,
           257,  8598,    13,   198,   198,   464,  1703,  1525,  6916,   318,
          5140,   319,   262,  4865,  1022, 27026,   290, 16581, 37878,    13,
           383, 19272,   468,   587, 30671,   329,  4138,   286,   812,   290,
           318,  1363,   284,   257,  1271,   286, 22700,  4693,   884,   355,
           262, 42438, 22931,  2330,  9529,   259,   420, 27498,   290,   262,
         22700, 31877,  1885, 47329,    13, 50256]])
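
With `no_repeat_ngram_size=2`, no pair of consecutive token IDs should be generated twice. The pure-Python sketch below checks that property on the token IDs of the `result` tensor above, copied into a plain list; the helper name `has_repeated_ngram` is illustrative and not part of the Transformers library.

```python
def has_repeated_ngram(token_ids, n):
    """Return True if any run of n consecutive token IDs occurs more than once."""
    seen = set()
    for i in range(len(token_ids) - n + 1):
        ngram = tuple(token_ids[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


# Token IDs copied from the result tensor above.
result_ids = [
    32, 321, 1525, 19272, 318, 257, 4950, 1295, 351, 663,
    1598, 4171, 24091, 290, 4713, 4077, 8701, 13, 632, 318,
    635, 530, 286, 262, 1178, 4113, 287, 262, 995, 810,
    345, 460, 766, 262, 34822, 6378, 422, 262, 1353, 286,
    257, 8598, 13, 198, 198, 464, 1703, 1525, 6916, 318,
    5140, 319, 262, 4865, 1022, 27026, 290, 16581, 37878, 13,
    383, 19272, 468, 587, 30671, 329, 4138, 286, 812, 290,
    318, 1363, 284, 257, 1271, 286, 22700, 4693, 884, 355,
    262, 42438, 22931, 2330, 9529, 259, 420, 27498, 290, 262,
    22700, 31877, 1885, 47329, 13, 50256,
]

print(has_repeated_ngram(result_ids, 2))        # bigram check on the real output
print(has_repeated_ngram([1, 2, 3, 1, 2], 2))   # toy list where (1, 2) repeats
```

Single tokens such as `262` (" the") still repeat freely; only repeated pairs are ruled out, which is why the output stays readable while avoiding loops.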

In [19]:
generated_text = tokenizer.decode(result[0], skip_special_tokens=True)
print(generated_text)


Aamby valley is a beautiful place with its clear blue skies and fresh green grass. It is also one of the few places in the world where you can see the Milky Way from the top of a mountain.

The Amby Valley is located on the border between Nepal and Bhutan. The valley has been inhabited for thousands of years and is home to a number of endangered species such as the Himalayan white rhinoceros and the endangered Tibetan antelope.
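
The decoded output above starts with the prompt, because `generate` returns the prompt tokens followed by the new tokens. A sketch of separating the two by slicing at the prompt length, shown on plain Python lists with IDs copied from the tensors above (the helper name `split_continuation` is hypothetical; with the tensors themselves the equivalent slice would be `result[0][numeric_ids.shape[-1]:]`):

```python
def split_continuation(output_ids, prompt_length):
    """Return only the token IDs that follow the prompt."""
    return output_ids[prompt_length:]


# The 18 prompt token IDs from numeric_ids above.
prompt_ids = [32, 321, 1525, 19272, 318, 257, 4950, 1295, 351, 663,
              1598, 4171, 24091, 290, 4713, 4077, 8701, 13]

# The prompt plus the first six generated IDs from the result tensor above.
output_ids = prompt_ids + [632, 318, 635, 530, 286, 262]

continuation = split_continuation(output_ids, len(prompt_ids))
print(continuation)
```

Decoding only the continuation with `tokenizer.decode` would then print the generated text without echoing the review.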


# EXAMPLE 2

In [20]:
sentence2 = 'Connaught place is a microcosim of local Delhi culture.  The centre has a breathtakingly huge Indian flag.  Surrounding the central park area are merchants, shops, restsurants, streetfood vendors and both locals and tourists.'
numeric_ids = tokenizer.encode(sentence2, return_tensors = 'pt')

In [21]:
numeric_ids

tensor([[37321,  3413,  1295,   318,   257,  4580,  6966,   320,   286,  1957,
         12517,  3968,    13,   220,   383,  7372,   468,   257, 35589,   306,
          3236,  3942,  6056,    13,   220,  4198,   744,   278,   262,  4318,
          3952,  1989,   389, 21779,    11, 12437,    11,  1334, 11793,  1187,
            11,  4675, 19425, 17192,   290,  1111, 17205,   290, 15930,    13]])

In [22]:
tokenizer.decode(numeric_ids[0][3])

' is'

# INCREASING MAX LENGTH TO 200
`max_length` and the other generation parameters can be tuned further as required.
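
Since `max_length` caps the total sequence length, prompt included, the room left for new text shrinks as the prompt grows. A small worked sketch of that arithmetic, using the prompt lengths of the two examples in this notebook (18 and 50 tokens); the helper name `new_token_budget` is hypothetical:

```python
def new_token_budget(prompt_length: int, max_length: int) -> int:
    """Number of tokens that can still be generated under a total-length cap."""
    return max(max_length - prompt_length, 0)


print(new_token_budget(18, 100))   # Example 1: 18-token prompt, max_length=100
print(new_token_budget(50, 100))   # Example 2 prompt with the original cap
print(new_token_budget(50, 200))   # Example 2 prompt with max_length=200
```

The Transformers `generate` method also accepts `max_new_tokens`, which limits only the generated part and so does not depend on the prompt length.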

In [23]:
result = model.generate(numeric_ids, max_length = 200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

In [24]:
result

tensor([[37321,  3413,  1295,   318,   257,  4580,  6966,   320,   286,  1957,
         12517,  3968,    13,   220,   383,  7372,   468,   257, 35589,   306,
          3236,  3942,  6056,    13,   220,  4198,   744,   278,   262,  4318,
          3952,  1989,   389, 21779,    11, 12437,    11,  1334, 11793,  1187,
            11,  4675, 19425, 17192,   290,  1111, 17205,   290, 15930,    13,
           198,   198,   464,  3952,   318,   635,  1363,   284,   257,  1271,
           286, 27081,   290, 10157,  1127,    13,   383,   749,  5863,   286,
           777,   318,   262,  8882,   397,   375,  5303, 10857,    11,   543,
           373,  3170,   287,   262,  1467,   400,  4289,   290,   318,   530,
           286,   262, 13325, 27081,   287,  3794,    13,   632,   318,   531,
           284,   307,   262, 48145,   286,  8882,   265,  2611, 22081,    11,
           262,  9119,   286,  3794,   338, 10404,  3356,    13, 50256]])

In [25]:
generated_text = tokenizer.decode(result[0], skip_special_tokens=True)
print(generated_text)

Connaught place is a microcosim of local Delhi culture.  The centre has a breathtakingly huge Indian flag.  Surrounding the central park area are merchants, shops, restsurants, streetfood vendors and both locals and tourists.

The park is also home to a number of temples and shrines. The most famous of these is the Mahabodhi Temple, which was built in the 16th century and is one of the oldest temples in India. It is said to be the birthplace of Mahatma Gandhi, the founder of India's independence movement.
