<a href="https://colab.research.google.com/github/saffarizadeh/INSY4054/blob/main/Automation_Project_Solution_Student_Version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *INSY 4054: Emerging Technologies*

# **Automation Project**

Instructor: Dr. Kambiz Saffarizadeh

---

## Please read carefully

In this project, we learn how to automate the analysis of user reviews on a specific webpage. If we were in charge of continuously monitoring specific products of our competitors, this automation could save us a lot of time.

The target webpage is https://saffarizadeh.com/ET/reviews.html.

Please open and view the webpage.

In the next steps, after importing all needed libraries, we first download the webpage. Then, using BeautifulSoup, we extract titles, reviews, and ratings from the webpage. We then create a table to hold these data. Next, we pass the reviews to a sentiment analysis model and store the sentiments in a new column of the table. Finally, we create a few reports based on the sentiment analysis and store all tables in an Excel file.

## Insert all needed libraries here

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Crawl the website

Use the `get` method to download the following webpage:

`https://saffarizadeh.com/ET/reviews.html`

In [None]:
response = requests.get("https://saffarizadeh.com/ET/reviews.html")

## Make a soup

Use Beautiful Soup to create/make an HTML soup!
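As a minimal sketch of what "making a soup" means, the snippet below parses a small, made-up HTML string instead of the downloaded page. The `html` content and the name `demo_soup` are assumptions for illustration only, and the sketch uses Python's built-in `html.parser` so that it runs even where `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; it is not the real reviews page.
html = "<html><body><h1>Reviews</h1><p>First paragraph.</p></body></html>"

# Parse the raw HTML text into a navigable tree (the "soup").
demo_soup = BeautifulSoup(html, "html.parser")

# Tags can be reached by name, and .text returns the text inside a tag.
heading = demo_soup.h1.text
paragraph = demo_soup.find("p").text
print(heading, "|", paragraph)
```

The same idea applies to the real page: the downloaded content replaces the `html` string.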

In [None]:
soup = BeautifulSoup(response.content, 'lxml')

## Using the soup, extract titles, reviews, and ratings

Note: do this in 3 separate steps. The steps are very similar, so after writing the first one, the next ones should be easy.

Note: use `attrs` to find all relevant elements for each step.

Note: to figure out which attribute(s) and attribute values you need to use, open https://saffarizadeh.com/ET/reviews.html in Chrome or Firefox, right-click on the element you want to extract, and select `Inspect` or `Inspect Element`. This way you can see the HTML code for this specific element. Using slides #13, #14, and #15 of "Automating Business Tasks II", you should be able to identify the attribute name and attribute values needed.
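To illustrate how an attribute name and value seen in the inspector translate into a `find_all` call, here is a small sketch on made-up markup. The class names `productName` and `price` are hypothetical and are not taken from the reviews page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names below are assumptions for illustration.
html = """
<div class="productName">Mouse A</div>
<div class="productName">Mouse B</div>
<div class="price">99</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# attrs maps an attribute name to the value that matching elements must have.
names = []
for element in demo_soup.find_all(attrs={"class": "productName"}):
    names.append(element.text)
print(names)
```

Only the two elements whose `class` is `productName` are collected; the `price` element is skipped.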

Titles: Store all review titles in a list named `list1`

In [None]:
list1 = []

for title in soup.find_all(attrs={"class": "reviewTitle"}):
  list1.append(title.text)

Reviews: Store all review body texts in a list named `list2`

In [None]:
list2 = []

for review in soup.find_all(attrs={"class": "reviewBody"}):
  list2.append(review.text)

Ratings: Store all review ratings in a list named `list3`

Note: for this step, you have two ways to extract the ratings. Both ways are fine, but using the `rating` attribute is easier.

Note: the extracted ratings will be in `str` (text) format; convert them to `int` or `float` before storing them in the list.

Note: if you could not convert the ratings into numeric values, you can continue with string values and come back at the end of the project to see whether you can fix the problem. You can do the next part of this project without this type conversion.
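The two ways of reading a rating, and the conversion from text to a number, can be sketched on a single made-up element. The markup below is an assumption for illustration; the real page's element may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical element: the real page's markup may differ.
html = '<span class="rating" rating="4">4 out of 5 stars</span>'
element = BeautifulSoup(html, "html.parser").find(attrs={"class": "rating"})

# Way 1: read the attribute value (always a string) and convert it.
from_attribute = int(element.get("rating"))

# Way 2: take the first character of the visible text and convert it.
from_text = int(element.text[:1])

print(from_attribute, from_text, type(from_attribute))
```

The second way depends on the visible text starting with a single digit, which is one reason the attribute is the easier choice.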

In [None]:
list3 = []

for rating in soup.find_all(attrs={"class": "rating"}):
  list3.append(int(rating.get("rating")))

# alternatively
# list3 = []
# for rating in soup.find_all(attrs={"class": "rating"}):
#   list3.append(int(rating.text[:1]))

## Create a `pandas` data frame and store the three lists that you created for titles, body texts, and ratings

Hint: you can first create a dictionary with `title`, `body`, and `rating` as keys and `list1`, `list2`, and `list3` as values. Then you can create a data frame from this dictionary.
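A minimal sketch of the hint, using made-up values in place of the three extracted lists (the names `titles`, `bodies`, `ratings`, and `demo_df` are for illustration only):

```python
import pandas as pd

# Made-up values standing in for list1, list2, and list3.
titles = ["Great", "Not good"]
bodies = ["Works well.", "Stopped working."]
ratings = [5, 1]

# All lists must have the same length, otherwise pandas raises a ValueError.
assert len(titles) == len(bodies) == len(ratings)

# Each dictionary key becomes a column name; each list becomes that column's values.
demo_df = pd.DataFrame({"title": titles, "body": bodies, "rating": ratings})
print(demo_df)
```

If the lengths differ, it usually means one of the `find_all` calls matched more or fewer elements than expected.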

In [None]:
df = pd.DataFrame({"title": list1, "body": list2, "rating": list3})

Show the data frame:

In [None]:
df

| | title | body | rating |
|---|---|---|---|
| 0 | Yeaaa USB C!!!!! But the dongle? Why? Why not ... | WHY Logitech! Why? Lets finally go FULL USB C ... | 3 |
| 1 | Logitech - how many tries do you need to get i... | The scroll wheel is awesome. The fit and finis... | 3 |
| 2 | Worthy Upgrade, Too Bad It’s Not In White | Purchased this product after accidentally purc... | 4 |
| 3 | Terrible scroll wheel issues | I have the former MX Master 2S and upgraded to... | 1 |
| 4 | Logitech made a great mouse even better | The MX Master 2s was a fantastic mouse, but I ... | 5 |
| 5 | Unconfortable downgrade from the Performance MX 1 | Cons in comparison to the original Performance... | 2 |
| 6 | Not for gamers | Many Youtubers recommend this mouse as their a... | 3 |
| 7 | More compatible with Mac than I expected | If you work with a Mac and are wondering if th... | 5 |
| 8 | Improvement over the last gen. Worth the upgrade | I have two of the previous generation and this... | 5 |
| 9 | The best scroll wheel ever. | The best Mx Master yet. And this time with rea... | 5 |


In [None]:
#@title Run this cell to train a sentiment analysis model. This model directly comes from Activity 6 in "Deep Learning IV" slides. Running this cell takes 1-2 minutes.
%%capture
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

train_data, validation_data, test_data = tfds.load(name="imdb_reviews", split=('train[:60%]', 'train[60%:]', 'test'), as_supervised=True)
model = tf.keras.models.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2", input_shape=[], dtype=tf.string, trainable=True),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
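# The last layer already applies a sigmoid, so the loss receives probabilities, not logits.
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=['accuracy'])
# A shuffle buffer of 1 does not shuffle at all; use a larger buffer to mix the training examples.
model.fit(train_data.shuffle(10000).batch(512), epochs=10, verbose=1)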

The sentiment analysis model we use is trained on movie reviews. So, it might not be the best fit for the specific context of our automation (product reviews). But first let's use the model and then judge the results.

After running the previous cell, pass the column containing the reviews to the model. To do so, run the following code after replacing `column_placeholder` with the actual column from the data frame:

`sentiment = model(column_placeholder).numpy()`

In [None]:
sentiment = model(df["body"]).numpy()

Store `sentiment` as a new column in the data frame.
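The model's final layer has a single unit, so its output has one row per review and a single column. The sketch below uses made-up scores in place of the trained model (the numbers and the name `scores` are assumptions) to show one way of flattening such an array so that the stored column's shape is explicit.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({"body": ["Loved it.", "Hated it."]})

# Stand-in for model(...).numpy(): one row per review, one column of scores.
# These numbers are made up for illustration.
scores = np.array([[0.91], [0.08]])

# ravel() flattens the (2, 1) array into a 1-D array of length 2.
demo_df["sentiment"] = scores.ravel()
print(demo_df)
```

Because the scores come from a sigmoid, they fall between 0 and 1, with higher values indicating more positive text.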

In [None]:
df["sentiment"] = sentiment

Show the data frame:

In [None]:
df

| | title | body | rating | sentiment |
|---|---|---|---|---|
| 0 | Yeaaa USB C!!!!! But the dongle? Why? Why not ... | WHY Logitech! Why? Lets finally go FULL USB C ... | 3 | 0.052188 |
| 1 | Logitech - how many tries do you need to get i... | The scroll wheel is awesome. The fit and finis... | 3 | 0.601579 |
| 2 | Worthy Upgrade, Too Bad It’s Not In White | Purchased this product after accidentally purc... | 4 | 0.097332 |
| 3 | Terrible scroll wheel issues | I have the former MX Master 2S and upgraded to... | 1 | 0.444877 |
| 4 | Logitech made a great mouse even better | The MX Master 2s was a fantastic mouse, but I ... | 5 | 0.998531 |
| 5 | Unconfortable downgrade from the Performance MX 1 | Cons in comparison to the original Performance... | 2 | 0.310158 |
| 6 | Not for gamers | Many Youtubers recommend this mouse as their a... | 3 | 0.426594 |
| 7 | More compatible with Mac than I expected | If you work with a Mac and are wondering if th... | 5 | 0.996798 |
| 8 | Improvement over the last gen. Worth the upgrade | I have two of the previous generation and this... | 5 | 0.577835 |
| 9 | The best scroll wheel ever. | The best Mx Master yet. And this time with rea... | 5 | 0.972707 |


Select the rows with sentiment values above average.

In [None]:
above_average = df[df["sentiment"] > df["sentiment"].mean()]

Select the rows with sentiment values below average.
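Both selections rely on a boolean mask. The sketch below rebuilds only the `sentiment` column from the data frame shown earlier and walks through the steps; the names `demo_df`, `mask`, `above`, and `below` are for illustration only.

```python
import pandas as pd

# Sentiment scores copied from the data frame shown earlier in this notebook.
demo_df = pd.DataFrame({"sentiment": [0.052188, 0.601579, 0.097332, 0.444877, 0.998531,
                                      0.310158, 0.426594, 0.996798, 0.577835, 0.972707]})

mean_score = demo_df["sentiment"].mean()

# Comparing a column to a number yields one True/False value per row (a boolean mask).
mask = demo_df["sentiment"] > mean_score

# Indexing the data frame with the mask keeps only the rows marked True.
above = demo_df[mask]
below = demo_df[demo_df["sentiment"] < mean_score]
print(round(mean_score, 4), len(above), len(below))
```

A row whose score equals the mean exactly would appear in neither selection, since both comparisons are strict.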

In [None]:
below_average = df[df["sentiment"] < df["sentiment"].mean()]

Create an Excel file with three sheets showing: 1) the main data frame, 2) the rows with sentiment values above average, and 3) the rows with sentiment values below average.
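As a rough sketch of how several sheets end up in one file, the snippet below writes two small, made-up tables and reads them back to check the result. It assumes an Excel engine such as `openpyxl` is installed (as it typically is on Colab); the file name, sheet names, and values are hypothetical.

```python
import os
import tempfile
import pandas as pd

demo_df = pd.DataFrame({"rating": [5, 1], "sentiment": [0.9, 0.2]})

# Write to a temporary folder so the sketch does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), "demo_sheets.xlsx")

# Each to_excel call inside the with-block adds one sheet to the same file.
with pd.ExcelWriter(path) as writer:
    demo_df.to_excel(writer, sheet_name="main")
    demo_df[demo_df["sentiment"] > 0.5].to_excel(writer, sheet_name="high")

# sheet_name=None reads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel(path, sheet_name=None, index_col=0)
print(list(sheets.keys()))
```

The file is saved when the `with` block ends, so no explicit save call is needed.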

In [None]:
with pd.ExcelWriter('output_all_sheets.xlsx') as writer:
  df.to_excel(writer, sheet_name='main', index=True)
  above_average.to_excel(writer, sheet_name='above average', index=True)
  below_average.to_excel(writer, sheet_name='below average', index=True)

Are sentiment values in line with the ratings? Why?
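One rough way to examine this question numerically is to compare the two columns directly. The sketch below rebuilds the `rating` and `sentiment` columns from the data frame shown earlier and computes a correlation and a per-rating average. With only ten reviews, the numbers are a hint rather than firm evidence.

```python
import pandas as pd

# Ratings and sentiment scores copied from the data frame shown earlier in this notebook.
demo_df = pd.DataFrame({
    "rating": [3, 3, 4, 1, 5, 2, 3, 5, 5, 5],
    "sentiment": [0.052188, 0.601579, 0.097332, 0.444877, 0.998531,
                  0.310158, 0.426594, 0.996798, 0.577835, 0.972707],
})

# Pearson correlation: +1 means the two columns rise together perfectly,
# 0 means no linear relationship, -1 means they move in opposite directions.
correlation = demo_df["rating"].corr(demo_df["sentiment"])

# Average sentiment score for each star rating.
mean_by_rating = demo_df.groupby("rating")["sentiment"].mean()
print(round(correlation, 2))
print(mean_by_rating)
```

Looking at individual rows where the rating and the score disagree can also help explain why a model trained on movie reviews may misread product reviews.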


Answer: _____________________

# Download the .ipynb version of your notebook and submit it on D2L.

# Just Code:

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get("https://saffarizadeh.com/ET/reviews.html")
soup = BeautifulSoup(response.content, 'lxml')

list1 = []
for title in soup.find_all(attrs={"class": "reviewTitle"}):
  list1.append(title.text)

list2 = []
for review in soup.find_all(attrs={"class": "reviewBody"}):
  list2.append(review.text)

list3 = []
for rating in soup.find_all(attrs={"class": "rating"}):
  list3.append(int(rating.get("rating")))

df = pd.DataFrame({"title": list1, "body": list2, "rating": list3})

# Run the model cell

sentiment = model(df["body"]).numpy()
df["sentiment"] = sentiment

above_average = df[df["sentiment"] > df["sentiment"].mean()]

below_average = df[df["sentiment"] < df["sentiment"].mean()]

with pd.ExcelWriter('output_all_sheets.xlsx') as writer:
  df.to_excel(writer, sheet_name='main', index=True)
  above_average.to_excel(writer, sheet_name='above average', index=True)
  below_average.to_excel(writer, sheet_name='below average', index=True)