# COMP41680 Assignment 2 - Task 1 (Data Collection)
## Student Name: Sanika Kulkarni     Student ID: 21200060

The objective of this assignment is to scrape a collection of product reviews from a set of web pages, preprocess the data, and evaluate the performance of different classifiers in the context of two related text classification tasks: 
- Predicting review sentiment
- Predicting review helpfulness

### Task 1 Objectives:

**1. Scrape the complete set of web pages from the personal website address:**
http://mlg.ucd.ie/modules/python/assign2/21200060/ 

**2. From the web pages above, parse every review across all years 2016-2021. For each product review, extract the following information:**
- The star rating of the review
- The title text of the review
- The main body text of the review 
- Review helpfulness information

**3. Store the parsed review data in an appropriate format**




## Importing libraries

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time

## Defining the required columns of dataframe

In [2]:
#List to store the star rating of each review
star_ratings = []

#List to store the title text of each review
titles = []

#List to store the main body text of each review
review_content = []

#List to store the helpfulness information of each review
usefulness_rating = []

## Defining functions to scrape the review data
- The `extract_url_years_months()` function builds a list of all review URLs linked from the main page and records the distinct years and months found in them. The `fetch()` function relies on these two lists.
- The `extract()` function pulls the required fields (star rating, title, body text, helpfulness) out of the parsed HTML of a single page.
- The `fetch()` function loops through all the review pages for each year and month and extracts the review information from each page using the `extract()` function.
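
The year and month parsing described above depends on the page file names following a `reviews-<year>-<month>-<page>.html` pattern, as in the URLs built by `fetch()`. A minimal sketch of that step in isolation is given here; the helper names `parse_year_month` and `unique_in_order` are hypothetical, and the example URLs are illustrative rather than a listing of the real index page.

```python
def parse_year_month(url):
    """Return (year, month) from a URL ending in reviews-<year>-<month>-<page>.html."""
    # Only look at the file name so hyphens elsewhere in the URL cannot interfere.
    filename = url.rsplit("/", 1)[-1]
    parts = filename.split("-")
    return parts[1], parts[2]


def unique_in_order(values):
    """Drop duplicates while keeping the order of first appearance."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


base = "http://mlg.ucd.ie/modules/python/assign2/21200060/"
# Illustrative URLs only; the real list comes from the links on the index page.
example_urls = [
    base + "reviews-2016-dec-01.html",
    base + "reviews-2021-dec-01.html",
    base + "reviews-2021-dec-05.html",
]

pairs = [parse_year_month(u) for u in example_urls]
years_found = unique_in_order(year for year, _ in pairs)
months_found = unique_in_order(month for _, month in pairs)
print(years_found, months_found)
```

Splitting only the file name, rather than the whole URL, is a small design choice that keeps the parsing correct even if the base address ever contains a hyphen.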

In [4]:
#Distinct years and months found in the review links (filled in place)
years = []
months = []

def extract_url_years_months():
    """Collect the distinct years and months from the review links on the main page."""
    base_url = 'http://mlg.ucd.ie/modules/python/assign2/21200060/'
    result = requests.get(base_url)
    soup = BeautifulSoup(result.text, 'html.parser')
    links = soup.find_all("a")
    all_urls = []
    for link in links[1:]:  #skip the first link, which points to index.html
        link_address = base_url + link.get("href")
        all_urls.append(link_address)

    for url in all_urls:
        #URLs are expected to look like .../reviews-<year>-<month>-<page>.html
        year = url.split("-")[1]
        month = url.split("-")[2]
        if year not in years:
            years.append(year)
        if month not in months:
            months.append(month)

In [6]:
extract_url_years_months()

In [9]:
def extract(soup):
  """Append the rating, title, body and helpfulness of every review on a page to the global lists."""

  #Extracting all review blocks from the first div of the page
  divs = soup.find('div')
  review_items = divs.find_all(class_="review")

  for review in review_items:
    #Extracting the star rating from the alt text of the rating image
    star_ratings.append(review.span.img.get('alt'))

    #Extracting the review title
    titles.append(review.h5.get_text())

    #Extracting the main body (last paragraph of the review block)
    review_content.append(review.find_all('p')[-1].get_text())

    #Extracting the helpfulness info (second-to-last paragraph)
    usefulness_rating.append(review.find_all('p')[-2].get_text())

In [10]:
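def fetch():
  """Download every review page for each year and month and pass it to extract()."""

  #Page numbers are zero-padded to two digits (01, 02, ...); naming beyond page 09 is assumed to follow the same scheme
  page_url = "http://mlg.ucd.ie/modules/python/assign2/21200060/reviews-%s-%s-%02d.html"

  for year in years:
    for month in months:
      url = page_url % (year, month, 1)
      print('\r %s' % (url), end='')
      result = requests.get(url)
      soup = BeautifulSoup(result.content, 'html.parser')

      #Extracting all information of page 1
      extract(soup)

      #The results heading ends with the total number of pages; read all
      #trailing digits so that counts above 9 are not truncated
      heading = soup.find('h4', class_="results").get_text().strip()
      num_pages = int(heading[len(heading.rstrip('0123456789')):])

      for i in range(2, num_pages + 1):
        url = page_url % (year, month, i)
        print('\r %s' % (url), end='')
        result = requests.get(url)
        soup = BeautifulSoup(result.content, 'html.parser')
        #Extracting all information from page 2 onwards
        extract(soup)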
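
The page loop in `fetch()` combines two small steps: reading the total page count from the end of the results heading, and building one URL per page. A minimal sketch of both steps without any network access follows. The helper names `trailing_int` and `page_urls` are hypothetical, the heading text `"Page 1 of 5"` is an assumed example rather than the site's actual wording, and two-digit zero padding for page numbers is assumed from the `-01.html` to `-05.html` names seen in the notebook.

```python
def trailing_int(text):
    """Return the integer formed by the run of digits at the end of text."""
    text = text.strip()
    without_digits = text.rstrip("0123456789")
    digits = text[len(without_digits):]
    if not digits:
        raise ValueError("no trailing digits in %r" % text)
    return int(digits)


def page_urls(base_url, year, month, num_pages):
    """Build the URL of every page (1..num_pages) for one year and month."""
    return [
        "%sreviews-%s-%s-%02d.html" % (base_url, year, month, page)
        for page in range(1, num_pages + 1)
    ]


base = "http://mlg.ucd.ie/modules/python/assign2/21200060/"
count = trailing_int("Page 1 of 5")  # assumed heading text, for illustration only
for url in page_urls(base, "2021", "dec", count):
    print(url)
```

Reading every trailing digit, instead of only the last character, means a heading that ends in a two-digit count would still be parsed correctly.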

### Fetching the data

In [11]:
fetch()

 http://mlg.ucd.ie/modules/python/assign2/21200060/reviews-2021-dec-05.html

## Creating a Reviews DataFrame
Each column holds the data from one of the lists populated above.

In [12]:
reviews = pd.DataFrame()

In [13]:
reviews['Title'] = titles
reviews['Review'] = review_content
reviews['Star_ratings'] = star_ratings
reviews['Helpfulness'] = usefulness_rating

In [14]:
reviews.head()

|   | Title | Review | Star_ratings | Helpfulness |
|---|---|---|---|---|
| 0 | The herbs were great...but the cherry tomatoe... | The herb kit that came with my Aerogarden was ... | 2-star | 15 out of 17 users found this review helpful |
| 1 | Even more useful than regular parchment paper | I originally bought this just because it was c... | 5-star | 19 out of 19 users found this review helpful |
| 2 | Shake it before you bake it | If you do it in reverse (bake before shaking),... | 2-star | 2 out of 13 users found this review helpful |
| 3 | Not what the picture describes | I bought this steak for my father in law for C... | 2-star | 7 out of 14 users found this review helpful |
| 4 | What a ripe off - GIVE ME A BREAK | Sorry but I had these noodles and they are no ... | 2-star | 10 out of 34 users found this review helpful |
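
The `Star_ratings` and `Helpfulness` columns are stored as raw strings such as `2-star` and `15 out of 17 users found this review helpful`. A later preprocessing step might need them as numbers, so the sketch below shows one way they could be parsed. The function names and the regular expression are illustrative assumptions based only on the sample rows shown in this notebook, not part of the original scraping code.

```python
import re

# Pattern assumed from the sample rows, e.g. "15 out of 17 users found this review helpful"
HELPFUL_PATTERN = re.compile(r"(\d+) out of (\d+) users found this review helpful")


def parse_star_rating(label):
    """Convert a label such as '2-star' to the integer 2."""
    return int(label.split("-")[0])


def parse_helpfulness(text):
    """Return (helpful_votes, total_votes), or None if the text does not match."""
    match = HELPFUL_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_star_rating("2-star"))
print(parse_helpfulness("15 out of 17 users found this review helpful"))
```

Returning `None` for unmatched text, rather than raising an error, lets unusual rows be inspected later instead of stopping the whole conversion.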


In [19]:
print("Total Reviews: ", len(reviews))

Total Reviews:  5546


In [16]:
reviews.tail()

|   | Title | Review | Star_ratings | Helpfulness |
|---|---|---|---|---|
| 5541 | Ovaltine has changed their formula | Ovaltine has updated their packaging and chang... | 1-star | 25 out of 27 users found this review helpful |
| 5542 | Perhaps too compostable? | I bought these bags to go with Trading ECO-200... | 3-star | 20 out of 21 users found this review helpful |
| 5543 | Nutiva Organic Shelled Hempseed, 5-Pound Bag | This item was brought up in a forum with a lin... | 5-star | 22 out of 26 users found this review helpful |
| 5544 | This gum is really great! | If you have problems with Aspartame (which is ... | 5-star | 17 out of 17 users found this review helpful |
| 5545 | Cat Scratch Fever!! | I opened up the cat scratcher, spread a little... | 5-star | 27 out of 27 users found this review helpful |


In [17]:
reviews["Review"][0]

'The herb kit that came with my Aerogarden was superb and I enjoyed caring for my little garden. Once it was time to replace it, I purchased the cherry tomato seed kit. These also grew rapidly, but one day I noticed that they had completely fallen over...how would I stake these on an Aerogarden? And yes, the lights were as close to the plants as possible. I kind of leaned them against each other to keep them upright. But they still fell over several times. So they grew and grew but I never got any tomatoes. :( I also did follow the directions to ensure that they had complete darkness for proper flowering. Fortunately we have several tomato and cherry tomato plants in our outdoor vegetable garden. Unfortunately, I spent $20 on 3 cherry tomato Aerogarden plants and got zero yield. …'

## Storing the data in .csv format for further analysis

In [18]:
reviews.to_csv('product_reviews.csv',index=False)