# Text Mining and ML Model Analysis for Yelp Reviews

The objective of this assignment is to scrape a collection of product reviews from a set of web pages, preprocess the data, and evaluate the performance of different classifiers in the context of two related text classification tasks: (i) predicting review sentiment; (ii) predicting review helpfulness.

### This notebook covers Task 1 - Data Collection.

In [2]:
# libraries used
import urllib.request  # imported explicitly: a bare "import urllib" does not guarantee the request submodule is loaded
from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd

Create directory for raw data storage, if it does not already exist:

In [3]:
#creating new directory
dir_raw = Path("raw")
dir_raw.mkdir(parents=True, exist_ok=True)

Define the years and months of reviews we want to collect:

In [4]:
# The years and months from the review list
years = ["2016","2017", "2018", "2019", "2020", "2021"]
months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
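
As a minimal sketch of how these two lists drive the scrape, the snippet below enumerates every (year, month) pair and builds the URL fragment that `collect_data` uses for each month (for example `-2016-jan-`). The names `periods` and `hrefs` are illustrative only and do not appear elsewhere in the notebook.

```python
from itertools import product

years = ["2016", "2017", "2018", "2019", "2020", "2021"]
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Every (year, month) pair, in the same order as a nested year/month loop.
periods = list(product(years, months))

# URL fragment identifying one month, e.g. "-2016-jan-".
hrefs = [f"-{year}-{month}-" for year, month in periods]

print(len(periods))  # 72
print(hrefs[0])      # -2016-jan-
```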

We have to scrape data from every available page, so we define a function that fetches and parses the page with a given page number.

In [5]:
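# function to fetch and parse a single review page
def find_page(page, prefix, href):
    # assumes two-digit, zero-padded page numbers (e.g. "01"), as used for the first page
    suffix = str(page).zfill(2) + ".html"
    url = prefix + href + suffix
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response, 'html.parser')
    return soup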
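
To make the URL scheme concrete, here is a small sketch of how a page URL is assembled from the three parts that `find_page` receives. `page_url` is a hypothetical helper, and the two-digit zero padding is an assumption based on the `01` suffix of the first page; the prefix is the one used later in `collect_data`.

```python
def page_url(prefix, href, page):
    """Return the URL of one review page (hypothetical helper)."""
    # Assumption: page numbers are zero-padded to two digits ("01", "02", ...).
    return f"{prefix}{href}{page:02d}.html"


prefix = "http://mlg.ucd.ie/modules/python/assign2/21201977/reviews"
print(page_url(prefix, "-2016-jan-", 1))
# http://mlg.ucd.ie/modules/python/assign2/21201977/reviews-2016-jan-01.html
```

Keeping the URL construction separate from the network call makes it possible to check the generated addresses without fetching anything.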

We are interested in only a few of the HTML tags, so we define a function that extracts the relevant tags and their values.<br>
**Required Fields:**<br>
- Main Title<br>
- Review Title<br>
- Review Body<br>
- Star Rating<br>
- Helpfulness<br>
<br>
This function returns a data frame for a particular page.<br>
The helpfulness information is stored as a percentage, which is easier to handle later for classification.

In [6]:
# function to extract the required fields from a parsed page
def filter_tags(soup):
    Titles = []
    Review_titles = []
    Ratings = []
    Helpfulness_infos = []
    Review_Bodys = []
    container = soup.find_all('div', class_=['review', 'review-alt'])
    title = soup.title.text
    for con in container:
        # assign the page's main title to each review
        Titles.append(title)
        # assign the rating: the number before the first hyphen in the image's alt text
        for star in con.find_all("img"):
            Ratings.append(int(star["alt"].split("-")[0]))
        # assign the body to each review
        for result in con.find_all("p", {"class": "review-body"}):
            Review_Bodys.append(result.text.strip())
        # assign helpfulness to each review: in the second "metadata" paragraph,
        # the first token is the helpful-vote count and the fourth is the total
        helpfulness = con.find_all("p", {"class": "metadata"})[1]
        helpful = helpfulness.text.strip().split(" ")
        num = int(helpful[0])
        deno = int(helpful[3])
        Helpfulness_infos.append(num * 100 / deno)
        # assign the review title to each review
        for result in con.find_all("h5"):
            Review_titles.append(result.text.strip())
    # combine all columns into one data frame
    review = pd.DataFrame({'Main Title': Titles,
                           'Review Title': Review_titles,
                           'Rating': Ratings,
                           'Helpfulness Information': Helpfulness_infos,
                           'Review Body': Review_Bodys})
    return review
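
The helpfulness conversion inside `filter_tags` can be looked at in isolation. The sketch below is illustrative only: `helpfulness_percent` is a hypothetical helper, the exact wording of the helpfulness text is an assumption (only the positions of the two numbers are taken from the code above), and returning `None` when there are no votes is a design choice that the original function does not make.

```python
def helpfulness_percent(text):
    """Convert text such as '3 out of 4 ...' into a percentage (hypothetical helper)."""
    tokens = text.split()
    helpful_votes = int(tokens[0])  # first token: number of helpful votes
    total_votes = int(tokens[3])    # fourth token: total number of votes
    if total_votes == 0:
        # Assumption: with no votes at all, helpfulness is treated as undefined.
        return None
    return helpful_votes * 100 / total_votes


# The sentences below are made-up examples of the assumed wording.
print(helpfulness_percent("3 out of 4 users found this review helpful"))  # 75.0
print(helpfulness_percent("0 out of 0 users found this review helpful"))  # None
```

Guarding against a zero denominator avoids a `ZeroDivisionError` if a review has received no votes.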

Define a function that collects the filtered data for one month across all of its pages.

In [7]:
# function to collect the data for one month
def collect_data(month, year):
    review_month = []
    href = "-" + year + "-" + month + "-"
    prefix = "http://mlg.ucd.ie/modules/python/assign2/21201977/reviews"
    # fetch the first page of the month to read the number of pages
    soup = find_page(1, prefix, href)
    # the page count is the seventh space-separated token of the <h4> heading
    pages = int(soup.find("h4").text.strip().split(" ")[6])
    for page in range(1, pages + 1):
        # extract the soup for each page
        soup = find_page(page, prefix, href)
        # filter the soup to form the review data frame
        review_page = filter_tags(soup)
        review_month.append(review_page)
    reviews_per_month = pd.concat(review_month)
    return reviews_per_month
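
Reading the page count from a fixed token position is fragile, because any change in the heading wording shifts the index. The sketch below shows one way to fail with a clear message instead of an obscure `IndexError` or `ValueError`. `page_count` is a hypothetical helper, and the heading string is made up for illustration: the real heading text is not shown in this notebook, only the token index 6 comes from `collect_data`.

```python
def page_count(heading, index=6):
    """Read the page count from a heading by token position (hypothetical helper)."""
    tokens = heading.strip().split(" ")
    if index >= len(tokens) or not tokens[index].isdigit():
        raise ValueError(f"no page count at token {index} of heading {heading!r}")
    return int(tokens[index])


# Made-up heading, chosen only so that the seventh token is the page count.
heading = "Showing reviews - page 1 of 3 total"
print(page_count(heading))  # 3
```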

Collect the data for every month of every year and concatenate it into a single data frame:

In [8]:
review_year = []
for year in years:
    for month in months:
        print("Filtering Data for: ", year, month)
        review = collect_data(month, year)
        review_year.append(review)
# concatenate every month's reviews across all years (done once, after the loops)
reviews_per_year = pd.concat(review_year)

Filtering Data for:  2016 jan
Filtering Data for:  2016 feb
Filtering Data for:  2016 mar
Filtering Data for:  2016 apr
Filtering Data for:  2016 may
Filtering Data for:  2016 jun
Filtering Data for:  2016 jul
Filtering Data for:  2016 aug
Filtering Data for:  2016 sep
Filtering Data for:  2016 oct
Filtering Data for:  2016 nov
Filtering Data for:  2016 dec
Filtering Data for:  2017 jan
Filtering Data for:  2017 feb
Filtering Data for:  2017 mar
Filtering Data for:  2017 apr
Filtering Data for:  2017 may
Filtering Data for:  2017 jun
Filtering Data for:  2017 jul
Filtering Data for:  2017 aug
Filtering Data for:  2017 sep
Filtering Data for:  2017 oct
Filtering Data for:  2017 nov
Filtering Data for:  2017 dec
Filtering Data for:  2018 jan
Filtering Data for:  2018 feb
Filtering Data for:  2018 mar
Filtering Data for:  2018 apr
Filtering Data for:  2018 may
Filtering Data for:  2018 jun
Filtering Data for:  2018 jul
Filtering Data for:  2018 aug
Filtering Data for:  2018 sep
Filtering 

Save the filtered data as a .csv file for later use:

In [9]:
#save the data in .csv format
filePath = dir_raw/"reviwes.csv"
reviews_per_year.to_csv(filePath, index=False)