**Data Scraping**

In [2]:
# Import Requests
import requests

# Import Beautiful Soup
from bs4 import BeautifulSoup

In [3]:
# Execute request
# If you're using a different site, just replace the URL, e.g. r = requests.get('put your url in here')
r = requests.get('https://www.yelp.com/biz/salmon-bar-san-francisco-2?osq=Japanese')

In [4]:
# Check request status
print(r.status_code)

200


In [5]:
# Check result
r.text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b05852393ae5/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

In [6]:
# Make the soup
soup = BeautifulSoup(r.text, 'html.parser')

In [7]:
# First get all of the review-content divs
results = soup.find_all(class_=' review__09f24__oHr9V border-color--default__09f24__NPAKY')
results

[<div class=" review__09f24__oHr9V border-color--default__09f24__NPAKY"><div class=" margin-b3__09f24__l9v5d border-color--default__09f24__NPAKY"><div class=" arrange__09f24__LDfbs gutter-auto__09f24__W9jlL border-color--default__09f24__NPAKY"><div class=" arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG border-color--default__09f24__NPAKY"><div aria-label="Matt G." class=" border-color--default__09f24__NPAKY" role="region"><div class=" arrange__09f24__LDfbs gutter-1-5__09f24__vMtpw vertical-align-middle__09f24__zU9sE border-color--default__09f24__NPAKY"><div class=" arrange-unit__09f24__rqHTg border-color--default__09f24__NPAKY"><div class=" css-w8rns border-color--default__09f24__NPAKY"><a class="css-1fkqezt" href="/user_details?userid=71Zlsb9bKUMZsV46E3eSLQ" target="_self"><img alt="Photo of Matt G." class=" css-1pz4y59" draggable="true" height="64" loading="lazy" src="https://s3-media0.fl.yelpcdn.com/photo/IPUjfhK-A3bVxXpJJVdsIw/60s.jpg" srcset="https://s3-media0.fl.yelpc

In [8]:
# Loop through review-content divs and extract paragraph text
reviews = []
for result in results:
  reviews.append(result.find('p').text)
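
Note that `result.find('p')` returns `None` when a div contains no `<p>` tag, in which case `.text` raises an `AttributeError`. The sketch below shows the same "first paragraph per review div" idea while skipping divs that have no paragraph. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs offline; the class name `review`, the `FirstParagraphParser` helper, and the sample HTML are all assumptions for illustration, not Yelp's real markup.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> inside each <div> carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # div nesting depth inside the current review (0 = outside)
        self.in_p = False     # currently inside the paragraph being captured
        self.done = False     # first <p> of the current review already captured
        self.buffer = []
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif self.target_class in classes:
                self.depth = 1
                self.done = False
        elif tag == "p" and self.depth and not self.done:
            self.in_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.reviews.append("".join(self.buffer).strip())
            self.in_p = False
            self.done = True
        elif tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


# Hypothetical markup: the third review div has no <p> and is skipped.
sample_html = """
<div class="review"><div><p>Great salmon.</p><p>Second paragraph.</p></div></div>
<div class="other"><p>Not a review.</p></div>
<div class="review"><span>No paragraph here</span></div>
<div class="review"><p>Friendly <b>service</b>!</p></div>
"""

parser = FirstParagraphParser("review")
parser.feed(sample_html)
sample_reviews = parser.reviews
print(sample_reviews)
```

With BeautifulSoup itself, the equivalent guard is to check the result of `find('p')` for `None` before reading `.text`.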

In [9]:
for review in reviews:
  print(review,'\n')

My partner and I loved it! Even in the Mission with a huge diversity of cuisines, this is something new. I'm not really a seafood person, but the food was delicious and I appreciated the exploration of some seafood dishes outside what one usually gets at an American seafood restaurant. Everything was great but I especially recommend the Lion King as a standout main. Plus, cheap drinks and small plates for happy hour!​ 

Very good food and excellent service!! We were driving around the area and saw a grand opening sign so we decided to give it a try. So glad we did! We ordered the lunch special which comes with a salad and miso soup. Will definitely be back. 

Just tried our new neighborhood restaurant. Friendly service and delicious, perfectly cooked salmon. We had the Lion King and Honey Lemon Salmon with the Curry Croquettes as starter together with a nice California Sauvignon Blanc. Looking forward to try another salmon offering soon. 

Fabulous, friendly and sometimes witty service



---

---

**Data Cleaning and Drawing Insights from the Data**

In [10]:
# Import pandas
import pandas as pd

#Import numpy
import numpy as np

In [11]:
# Create a pandas dataframe from array
df = pd.DataFrame(np.array(reviews), columns=['review'])

In [12]:
# Calculate word count
df['word_count'] = df['review'].apply(lambda x: len(str(x).split(" ")))

In [13]:
# Calculate character count
df['char_count'] = df['review'].str.len()

In [14]:
def avg_word(review):
  words = review.split()
  return (sum(len(word) for word in words) / len(words))

# Calculate average word length
df['avg_word'] = df['review'].apply(avg_word)
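
The word count above splits on a single space, while `avg_word` splits on any whitespace, so the two can disagree for reviews containing double spaces or newlines. A minimal plain-Python sketch of the three metrics on one string (the sample text is made up for illustration, not taken from the scraped reviews):

```python
# Hypothetical sample with a double space between "good" and "food".
sample = "Very good  food"

# split(" ") keeps the empty string produced by the double space.
word_count_space = len(sample.split(" "))

# split() collapses runs of whitespace, so no empty tokens appear.
words = sample.split()
word_count_ws = len(words)

# Character count includes spaces, like Series.str.len().
char_count = len(sample)

# Average word length over whitespace-separated tokens.
avg_word_len = sum(len(w) for w in words) / len(words)

print(word_count_space, word_count_ws, char_count, avg_word_len)
```

Using `split()` for both columns would keep the word count and the average word length consistent with each other.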

In [15]:
# Import stopwords
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [16]:
stop_words = stopwords.words('english')
df['stopword_coun'] = df['review'].apply(lambda x: len([w for w in x.split() if w in stop_words]))
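
Because NLTK's English stopword list is all lowercase and this count runs on the raw review text, capitalised stopwords such as "My" or "I" are not counted. A small sketch of the effect, assuming a hypothetical six-word stopword subset and a phrase adapted from the first review:

```python
# Hypothetical stopword subset; a set gives constant-time membership tests.
stop_words_demo = {"my", "and", "i", "it", "the", "in"}

text = "My partner and I loved it"

# Counting on the raw text misses "My" and "I" because of their capital letters.
raw_count = sum(1 for w in text.split() if w in stop_words_demo)

# Lowercasing first catches them.
lower_count = sum(1 for w in text.lower().split() if w in stop_words_demo)

print(raw_count, lower_count)
```

Whether to lowercase before counting depends on what the column is meant to measure; the sketch only shows that the two choices give different numbers.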

In [17]:
df.describe()

Unnamed: 0,word_count,char_count,avg_word,stopword_coun
count,10.0,10.0,10.0,10.0
mean,171.5,972.8,4.886251,59.1
std,215.45675,1228.540941,0.560755,67.959874
min,33.0,226.0,4.148936,6.0
25%,43.5,265.25,4.514364,11.5
50%,60.0,352.0,4.770384,22.0
75%,237.5,1285.25,5.232143,93.75
max,729.0,4198.0,5.878788,217.0


In [18]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27
1,Very good food and excellent service!! We were...,47,241,4.148936,17
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217


In [19]:
df

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27
1,Very good food and excellent service!! We were...,47,241,4.148936,17
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217
5,Taste: 4.5/5Service: 5/5Value: 4/5Happy Hour V...,33,226,5.878788,6
6,"Can you build a restaurant around salmon? Yes,...",203,1121,4.527094,90
7,So my partner and I decided to stop in for the...,249,1365,4.510121,104
8,We were wandering around SF for the day before...,251,1340,4.342629,95
9,"We ordered the bacon shrimp kabobs, miso glaze...",45,269,5.0,11


In [20]:
df.sort_values(by='stopword_coun')

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun
5,Taste: 4.5/5Service: 5/5Value: 4/5Happy Hour V...,33,226,5.878788,6
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11
9,"We ordered the bacon shrimp kabobs, miso glaze...",45,269,5.0,11
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13
1,Very good food and excellent service!! We were...,47,241,4.148936,17
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27
6,"Can you build a restaurant around salmon? Yes,...",203,1121,4.527094,90
8,We were wandering around SF for the day before...,251,1340,4.342629,95
7,So my partner and I decided to stop in for the...,249,1365,4.510121,104
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217


In [21]:
# Convert reviews to lowercase (this also collapses repeated whitespace)
df['review_lower'] = df['review'].apply(lambda x: " ".join(w.lower() for w in x.split()))

In [22]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun,review_lower
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27,my partner and i loved it! even in the mission...
1,Very good food and excellent service!! We were...,47,241,4.148936,17,very good food and excellent service!! we were...
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11,just tried our new neighborhood restaurant. fr...
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13,"fabulous, friendly and sometimes witty service..."
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217,4.5 starswhat to do when you go to salmon bar?...


In [23]:
# Remove Punctuation
df['review_nopunc'] = df['review_lower'].str.replace(r'[^\w\s]', '', regex=True)
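
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, so every such character is deleted rather than replaced with a space. A quick sketch with the standard `re` module, using a sample string adapted from review 4 (the exact wording is an assumption):

```python
import re

# Sample text loosely based on review 4.
text = "4.5 stars! what to do when you go to salmon bar?"

# Delete every character that is not a word character or whitespace.
no_punc = re.sub(r"[^\w\s]", "", text)

print(no_punc)
```

This is why "4.5" turns into "45" in the cleaned column: the decimal point is removed and the digits fuse together.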


In [24]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun,review_lower,review_nopunc
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27,my partner and i loved it! even in the mission...,my partner and i loved it even in the mission ...
1,Very good food and excellent service!! We were...,47,241,4.148936,17,very good food and excellent service!! we were...,very good food and excellent service we were d...
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11,just tried our new neighborhood restaurant. fr...,just tried our new neighborhood restaurant fri...
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13,"fabulous, friendly and sometimes witty service...",fabulous friendly and sometimes witty service ...
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217,4.5 starswhat to do when you go to salmon bar?...,45 starswhat to do when you go to salmon barob...


In [25]:
# Remove stopwords
df['review_nopunc_nostop'] = df['review_nopunc'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))

In [26]:
# Count the 30 most frequent words across all cleaned reviews
freq = pd.Series(" ".join(df['review_nopunc_nostop']).split()).value_counts()[:30]
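
The same join, split, and count idea can be sketched without pandas using `collections.Counter`. The three short documents below are invented for illustration:

```python
from collections import Counter

# Hypothetical cleaned reviews.
docs = ["salmon good salmon", "good service", "salmon roll"]

# Join all documents into one string, split into words, and count them.
counts = Counter(" ".join(docs).split())

# most_common(n) returns the n most frequent (word, count) pairs.
top2 = counts.most_common(2)

print(top2)
```

Inspecting the most frequent words is a common way to decide which extra, domain-specific stopwords to remove next.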

In [27]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun,review_lower,review_nopunc,review_nopunc_nostop
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27,my partner and i loved it! even in the mission...,my partner and i loved it even in the mission ...,partner loved even mission huge diversity cuis...
1,Very good food and excellent service!! We were...,47,241,4.148936,17,very good food and excellent service!! we were...,very good food and excellent service we were d...,good food excellent service driving around are...
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11,just tried our new neighborhood restaurant. fr...,just tried our new neighborhood restaurant fri...,tried new neighborhood restaurant friendly ser...
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13,"fabulous, friendly and sometimes witty service...",fabulous friendly and sometimes witty service ...,fabulous friendly sometimes witty service deli...
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217,4.5 starswhat to do when you go to salmon bar?...,45 starswhat to do when you go to salmon barob...,45 starswhat go salmon barobvs get salmon wast...


In [28]:
other_stopwords = ['get', 'us', 'see', 'use', 'said', 'asked', 'day', 'go',
  'even', 'ive', 'right', 'left', 'always', 'would', 'told',
  'one', 'also', 'ever', 'x', 'take', 'let']
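
One pitfall to watch for when hand-writing a list like this: Python concatenates adjacent string literals, so a missing comma silently merges two entries into one instead of raising an error. A short sketch (the four-word lists are illustrative only):

```python
# Missing comma after 'go': Python joins the adjacent literals into 'goeven'.
with_bug = ['day', 'go'
            'even', 'ive']

# With the comma in place, each word is its own entry.
fixed = ['day', 'go', 'even', 'ive']

# A set drops duplicate entries and gives constant-time membership tests.
fixed_set = set(fixed)

print(with_bug)
print(sorted(fixed_set))
```

Checking `len()` of the list against the number of words you intended to type is a cheap way to catch this.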

In [29]:
df['review_nopunc_nostop_nocommon'] = df['review_nopunc_nostop'].apply(lambda x: " ".join(w for w in x.split() if w not in other_stopwords))

In [30]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word,stopword_coun,review_lower,review_nopunc,review_nopunc_nostop,review_nopunc_nostop_nocommon
0,My partner and I loved it! Even in the Mission...,73,421,4.780822,27,my partner and i loved it! even in the mission...,my partner and i loved it even in the mission ...,partner loved even mission huge diversity cuis...,partner loved mission huge diversity cuisines ...
1,Very good food and excellent service!! We were...,47,241,4.148936,17,very good food and excellent service!! we were...,very good food and excellent service we were d...,good food excellent service driving around are...,good food excellent service driving around are...
2,Just tried our new neighborhood restaurant. Fr...,43,283,5.604651,11,just tried our new neighborhood restaurant. fr...,just tried our new neighborhood restaurant fri...,tried new neighborhood restaurant friendly ser...,tried new neighborhood restaurant friendly ser...
3,"Fabulous, friendly and sometimes witty service...",42,264,5.309524,13,"fabulous, friendly and sometimes witty service...",fabulous friendly and sometimes witty service ...,fabulous friendly sometimes witty service deli...,fabulous friendly sometimes witty service deli...
4,4.5 STARSWhat to do when you go to Salmon Bar?...,729,4198,4.759945,217,4.5 starswhat to do when you go to salmon bar?...,45 starswhat to do when you go to salmon barob...,45 starswhat go salmon barobvs get salmon wast...,45 starswhat salmon barobvs salmon wastedsalmo...


---
---
**Sentiment Analysis**

In [33]:
# Import textblob
from textblob import Word
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')

# Lemmatize final review format
df['cleaned_review']=df['review_nopunc_nostop_nocommon'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [34]:
from textblob import TextBlob

# Calculate polarity
df['polarity'] = df['cleaned_review'].apply(lambda x: TextBlob(x).sentiment[0])

In [35]:
df[['review','polarity']]

Unnamed: 0,review,polarity
0,My partner and I loved it! Even in the Mission...,0.286869
1,Very good food and excellent service!! We were...,0.509524
2,Just tried our new neighborhood restaurant. Fr...,0.622273
3,"Fabulous, friendly and sometimes witty service...",0.434375
4,4.5 STARSWhat to do when you go to Salmon Bar?...,0.152676
5,Taste: 4.5/5Service: 5/5Value: 4/5Happy Hour V...,0.1875
6,"Can you build a restaurant around salmon? Yes,...",0.519048
7,So my partner and I decided to stop in for the...,0.348431
8,We were wandering around SF for the day before...,0.137314
9,"We ordered the bacon shrimp kabobs, miso glaze...",0.271212


In [36]:
# Calculate subjectivity
df['subjectivity'] = df['cleaned_review'].apply(lambda x: TextBlob(x).sentiment[1])
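
`TextBlob(x).sentiment` is a named tuple of polarity (from -1 to 1) and subjectivity (from 0 to 1), so indexing with `[0]` and `[1]` is equivalent to reading the `.polarity` and `.subjectivity` attributes. The sketch below uses a stand-in `namedtuple` rather than TextBlob itself, with values copied from row 1 of this notebook's polarity and subjectivity outputs:

```python
from collections import namedtuple

# Stand-in for the (polarity, subjectivity) pair; this is not TextBlob itself.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

# Values for review 1 as reported in this notebook.
s = Sentiment(polarity=0.509524, subjectivity=0.695238)

# Index access and attribute access return the same values.
polarity_by_index = s[0]
polarity_by_name = s.polarity
subjectivity_by_index = s[1]
subjectivity_by_name = s.subjectivity

print(polarity_by_name, subjectivity_by_name)
```

The attribute names tend to read more clearly than bare indices, and computing the sentiment once per review and unpacking both fields avoids analysing each text twice.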


In [37]:
df[['review','subjectivity']]

Unnamed: 0,review,subjectivity
0,My partner and I loved it! Even in the Mission...,0.529192
1,Very good food and excellent service!! We were...,0.695238
2,Just tried our new neighborhood restaurant. Fr...,0.790909
3,"Fabulous, friendly and sometimes witty service...",0.744643
4,4.5 STARSWhat to do when you go to Salmon Bar?...,0.481609
5,Taste: 4.5/5Service: 5/5Value: 4/5Happy Hour V...,0.5
6,"Can you build a restaurant around salmon? Yes,...",0.70506
7,So my partner and I decided to stop in for the...,0.62592
8,We were wandering around SF for the day before...,0.535119
9,"We ordered the bacon shrimp kabobs, miso glaze...",0.424242


---
---
**Keyword Extraction**

In [38]:
!pip install rake_nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rake_nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6


In [39]:
from rake_nltk import Rake
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [40]:
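# Initialise the RAKE keyword extractor
rake = Rake()

In [41]:
# Keep the three top-ranked phrases for each review
keywords = []
for review in reviews:
  rake.extract_keywords_from_text(review)
  keywords.append(rake.get_ranked_phrases()[:3])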

In [42]:
keywords

[['one usually gets', 'happy hour !\u200b', 'seafood dishes outside'],
 ['grand opening sign', 'excellent service !!', 'miso soup'],
 ['try another salmon offering soon',
  'nice california sauvignon blanc',
  'perfectly cooked salmon'],
 ['delicious food stylishly presented',
  'sometimes witty service',
  'better sf hh'],
 ['4 ), chicken karaage ($ 4 ), gyoza ($ 4 ),',
  'dozen oysters ($ 18 ), housemade ankimo ($ 8 ), yakitori',
  'glass ($ 5 -$ 7 ). f'],
 ['would come back', 'create unique dishes', 'clam chowder croquettes'],
 ['parking gods deem', 'shish salmon leaf', 'restaurant around salmon'],
 ['crispy crunchy outer layer',
  'second time trying oysters',
  'really tasty complimentary items'],
 ['nicely fried korean style boneless chicken thighs',
  'seafood forward japanese style restaurant included',
  'divine ($ 8 ), shrimp wrapped'],
 ['spicy sauce would give', 'bacon shrimp kabobs', 'bacon shrimp kabobs']]
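
The last review lists 'bacon shrimp kabobs' twice, because the ranked phrase list can contain repeats. One possible way to keep three distinct phrases is to deduplicate before slicing, sketched below in plain Python. The fourth phrase is a hypothetical placeholder, since the output above only shows the first three:

```python
# Ranked phrases for the last review; the fourth entry is a made-up placeholder.
ranked = ['spicy sauce would give', 'bacon shrimp kabobs', 'bacon shrimp kabobs',
          'hypothetical fourth phrase']

# dict.fromkeys keeps the first occurrence of each phrase and preserves order.
unique_ranked = list(dict.fromkeys(ranked))

# Take the top three distinct phrases.
top3 = unique_ranked[:3]

print(top3)
```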

---
---
**Machine Translation**

In [43]:
!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/lts/1.8/torch_lts.html
Collecting torch==1.8.1+cu111
  Downloading https://download.pytorch.org/whl/lts/1.8/cu111/torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl (1982.2 MB)

In [46]:
!pip install transformers ipywidgets gradio --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting ipywidgets
  Downloading ipywidgets-8.0.4-py3-none-any.whl (137 kB)
Collecting gradio
  Downloading gradio-3.15.0-py3-none-any.whl (13.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting widgetsnbextension~=4.0
  Downloading widgetsnbextension-4.0.5-py3-none-any.whl (2.0 MB)

In [47]:

from transformers import pipeline     # Transformers pipeline

In [48]:
translation_pipeline = pipeline('translation_en_to_de')

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [49]:
# Translate the three keywords of review 5 from English to German
results = translation_pipeline(keywords[5])

In [50]:
results

[{'translation_text': 'zurückkehren'},
 {'translation_text': 'einzigartige Gerichte zu kreieren'},
 {'translation_text': 'clam chowder croquettes'}]

In [51]:
keywords[5]

['would come back', 'create unique dishes', 'clam chowder croquettes']
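
To read the translations side by side with their sources, the two lists can be paired with `zip`. The sketch below hard-codes the values shown in the outputs above so it runs without the model; the names `sources`, `translated`, `pairs`, and `untranslated` are illustrative:

```python
# Values copied from the keywords[5] and results outputs above.
sources = ['would come back', 'create unique dishes', 'clam chowder croquettes']
translated = [{'translation_text': 'zurückkehren'},
              {'translation_text': 'einzigartige Gerichte zu kreieren'},
              {'translation_text': 'clam chowder croquettes'}]

# Map each English phrase to its German translation.
pairs = {src: out['translation_text'] for src, out in zip(sources, translated)}

# Phrases the model returned unchanged.
untranslated = [src for src, de in pairs.items() if src == de]

for src, de in pairs.items():
    print(f"{src} -> {de}")
```

Here the third phrase comes back unchanged, which is easy to miss when the sources and translations are printed in separate cells.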