## REVIEW RATINGS PREDICTION

#### Problem Statement

We have a client who has a website where people write different reviews for technical products. Now they are adding a new feature to their website i.e. The reviewer will have to add stars(rating) as well with the review. The rating is out 5 stars and it only has 5 options available 1 star, 2 stars, 3 stars, 4 stars, 5 stars. Now they want to predict ratings for the reviews which were written in the past and they don’t have a rating. So, we have to build an application which can predict the rating by seeing the review.

**You have to scrape at least 20000 rows of data. You can scrape more data as well, it’s up to you. More the data better the model

In this section you need to scrape the reviews of different laptops, Phones, Headphones, smart watches, Professional Cameras, Printers, Monitors, Home theater, Router, power bank from different ecommerce websites. Basically, we need these columns**

**1. Reviews of the product.**

**2. Rating of the product.**

You can fetch other data as well, if you think data can be useful or can help in the project. It completely depends on your imagination or assumption

Hints:
    
* Try to fetch data from different websites. If data is from different websites, it will help our model to remove the effect of over fitting.
* Try to fetch an equal number of reviews for each rating, for example if you are fetching 10000 reviews then all ratings 1,2,3,4,5 should be 2000. It will balance our data set.
* Convert all the ratings to their round number, as there are only 5 options for rating i.e., 1,2,3,4,5. If a rating is 4.5 convert it 5.

In [370]:
#importing the libraries

import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

#Exceptions
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, StaleElementReferenceException, ElementClickInterceptedException


import pandas as pd
import re
import time

import warnings
warnings.filterwarnings('ignore')

* We are scraping product reviews from Amazon.in and Flipkart.in. 
* We are scraping reviews for the following products:

`Laptops`    `Phones`    `Headphones`     `Smart watches`     `Professional Cameras`     `Printers`     `Monitors`     
`Home theater`     `Router`     `Power bank`

Details to be scraped:
    
1. `Product`
2. `Review`
3.`Rating`

Scraping the data from Amazon.in

In [265]:
products = ['Laptop', 'smart phone', 'headphone', 'smart watch', 'professional camera', 'printer', 'monitors', 'home theater', 'router', 'power bank']


delay = 10



In [189]:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome('chromedriver.exe', chrome_options = chrome_options)
driver.maximize_window()
time.sleep(1)

In [30]:
url = []

for pr in products:
    
    driver.get('https://www.amazon.in')
    time.sleep(2)

    search_box = driver.find_element_by_id("twotabsearchtextbox")
    search_box.send_keys(pr)

    search_button = driver.find_element_by_id("nav-search-submit-button")
    search_button.click()
    url_temp = []
    while len(url_temp)<25:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,'a.s-pagination-item.s-pagination-next.s-pagination-button.s-pagination-separator')))
        url_tags = driver.find_elements_by_xpath("//h2[@class = 'a-size-mini a-spacing-none a-color-base s-line-clamp-2']/a")
        for ur in url_tags:
            url_temp.append(ur.get_attribute("href"))
        
        #Clicking the next page button
        next_page = driver.find_element_by_xpath("//a[@class = 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator']")
        next_page.click()
    url.extend(url_temp[:25])
    time.sleep(1)
print(len(url))

250


In [195]:
driver.close()

In [212]:
rating_url = []
rating_url_temp = []

#User defined function for scraping the rating url from each product page
def rating_page(a,b):
    global rating_url_temp
    rating_url_temp = []
    try:
        driver.close()
    except:
        pass
    
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--incognito")

    driver = webdriver.Chrome('chromedriver.exe', chrome_options = chrome_options)
    driver.maximize_window()
    time.sleep(1)
    
    for i in url[a:b]:
        try:
            driver.get(i)
            time.sleep(1)
        except:
            time.sleep(1)
            driver.get(i)
            time.sleep(1)

        try:
            height = driver.execute_script("return document.body.scrollHeight")
            driver.execute_script("window.scrollTo(0, {})".format(height-2000))

            WebDriverWait(driver,delay).until(EC.presence_of_element_located((By.CLASS_NAME,"cr-widget-Histogram")))
            url_tag5 = driver.find_element_by_xpath('//a[@class="a-link-normal 5star"]')
            url_tag4 = driver.find_element_by_xpath('//a[@class="a-link-normal 4star"]')
            url_tag3 = driver.find_element_by_xpath('//a[@class="a-link-normal 3star"]')
            url_tag2 = driver.find_element_by_xpath('//a[@class="a-link-normal 2star"]')
            url_tag1 = driver.find_element_by_xpath('//a[@class="a-link-normal 1star"]')

            rating_url_temp.append(url_tag5.get_attribute("href"))
            rating_url_temp.append(url_tag4.get_attribute("href"))
            rating_url_temp.append(url_tag3.get_attribute("href"))
            rating_url_temp.append(url_tag2.get_attribute("href"))
            rating_url_temp.append(url_tag1.get_attribute("href"))
        except NoSuchElementException as e:
            pass

    print(len(rating_url_temp))

In [203]:
rating_page(0,125)

505


In [219]:
rating_url.extend(rating_url_temp)
print(len(rating_url))

505


In [222]:
rating_page(125,250)

535


In [223]:
rating_url.extend(rating_url_temp)
print(len(rating_url))

1040


In [231]:
reviews = []
review_title = []
ratings = []

try:
    driver.close()
except:
    pass

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome('chromedriver.exe', chrome_options = chrome_options)
driver.maximize_window()
time.sleep(1)

for ur in rating_url:
    try:
        driver.get(ur)
        time.sleep(2)
    except:
        time.sleep(1)
        driver.get(ur)
        time.sleep(2)

    reviews_temp = []
    ratings_temp = []
    title_temp = []

    while len(reviews_temp)<15:

        try:
            height = driver.execute_script("return document.body.scrollHeight")
            driver.execute_script("window.scrollTo(0, {})".format(height-2000))

            #Collecting the required data for each product

            #Scraping the reviews
            WebDriverWait(driver,delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,'div.a-text-left.a-fixed-left-grid-col.reviewNumericalSummary.celwidget.a-col-left > div.a-row.a-spacing-mini.customerReviewsTitle > h2')))
            try:    
                try:
                    review_tags = driver.find_elements_by_xpath("//span[@class = 'a-size-base review-text review-text-content']")
                    for i in review_tags:
                        reviews_temp.append(i.text)
                except NoSuchElementException as e:
                    reviews_temp.append("-")
            except StaleElementReferenceException as e:      
                try:
                    review_tags = driver.find_elements_by_css_selector("customer_review-RZLY8VF7RJZ0D > div.a-row.a-spacing-small.review-data > span > span")
                    for i in review_tags:
                        reviews_temp.append(i.text)
                except NoSuchElementException as e:
                    reviews_temp.append("-")



            #Scraping ratings
            WebDriverWait(driver,delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,'div.navFooterVerticalColumn.navAccessibility > div > div:nth-child(3) > div')))

            try:
                try:
                    rating_tags = driver.find_elements_by_xpath("//i[@data-hook = 'review-star-rating']/span")
                    for rat in rating_tags:
                        ratings_temp.append(rat.get_attribute("innerHTML"))
                except NoSuchElementException as e:
                    ratings_temp.append("-")
            except StaleElementReferenceException as e:
                try:
                    rating_tags = driver.find_elements_by_css_selector("customer_review-R12439CVGE5HDA >  > a:nth-child(1) > i > span")
                    for rat in rating_tags:
                        ratings_temp.append(rat.get_attribute("innerHTML"))
                except NoSuchElementException as e:
                    ratings_temp.append("-")

            #Scraping review title
            WebDriverWait(driver,delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,'td.aok-nowrap > span.a-size-base > a')))
            try:
                try:
                    title_tags = driver.find_elements_by_xpath("//a[@class = 'a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']")
                    for title in title_tags:
                        title_temp.append(title.text)
                except NoSuchElementException as e:
                    title_temp.append("-")
            except StaleElementReferenceException as e:

                try:
                    title_tags = driver.find_elements_by_css_selector("div.a-expander-content.a-expander-partial-collapse-content > > span.a-size-base.review-title.a-text-bold")
                    for title in title_tags:
                        title_temp.append(title.text)
                except NoSuchElementException as e:
                    title_temp.append("-")

            try:
                next_page_button = driver.find_element_by_class("//li[@class = 'a-last']")
                next_page_button.click()
            except AttributeError as e:
                break

        except NoSuchElementException as e:
            break

    reviews.extend(reviews_temp[:15])
    ratings.extend(ratings_temp[:15])
    review_title.extend(title_temp[:15])

    time.sleep(1)
            

            
print(len(reviews))
print(len(ratings))
print(len(review_title))

9319
9248
9248


In [234]:
reviews = reviews[:9248]
print(len(reviews))
print(len(ratings))
print(len(review_title))

9248
9248
9248


We have scraped 9248 records from Amazon.in

Scraping data from flipkart.com

In [329]:
driver = webdriver.Chrome('chromedriver.exe', chrome_options = chrome_options)
driver.maximize_window()
time.sleep(1)

In [319]:
flip_urls = []
for pr in products:
    driver.get('https://www.flipkart.com')
    time.sleep(1)
    try:
        login_popup = driver.find_element_by_xpath("//button[@class = '_2KpZ6l _2doB4z']")
        login_popup.click()
    except NoSuchElementException as e:
        pass
    
    search_field = driver.find_element_by_xpath("//input[@class = '_3704LK']")
    search_field.send_keys(pr)
    time.sleep(1)
    search_button = driver.find_element_by_css_selector("form > ul > li:nth-child(1) > div > a > div.lrtEPN._17d0yO")
    search_button.click()
    time.sleep(1)
    
    
    #scraping the urls for the products
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_1fQZEK")))
        flip_url_tags = driver.find_elements_by_xpath("//a[@class = '_1fQZEK']")
        for ur in flip_url_tags:
            flip_urls.append(ur.get_attribute("href"))
    except (NoSuchElementException, TimeoutException) as e: 
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_2rpwqI")))
        flip_url_tags = driver.find_elements_by_xpath("//a[@class = '_2rpwqI']")
        for ur in flip_url_tags:
            flip_urls.append(ur.get_attribute("href"))
        
    
    #clicking the next button
    height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, {})".format(height-2500))

    try:
        next_flip_page = driver.find_element_by_css_selector("div._36fx1h._6t1WkM._3HqJxg > div._1YokD2._2GoDe3 > div:nth-child(2) > div:nth-child(26) > div > div > nav > a._1LKTO3")
        next_flip_page.click()
        time.sleep(1)
    except (NoSuchElementException, TimeoutException) as e:
        next_flip_page = driver.find_element_by_css_selector("div._36fx1h._6t1WkM._3HqJxg > div._1YokD2._2GoDe3 > div:nth-child(2) > div:nth-child(12) > div > div > nav > a._1LKTO3 > span")
        next_flip_page.click()
        time.sleep(1)
    
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_1fQZEK")))
        flip_url_tags = driver.find_elements_by_xpath("//a[@class = '_1fQZEK']")
        for ur in flip_url_tags:
            flip_urls.append(ur.get_attribute("href"))
    except (NoSuchElementException, TimeoutException) as e: 
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_2rpwqI")))
        flip_url_tags = driver.find_elements_by_xpath("//a[@class = '_2rpwqI']")
        for ur in flip_url_tags:
            flip_urls.append(ur.get_attribute("href"))

print(len(flip_urls))

640


In [410]:
#Scraping the required data from the urls
flip_ratings = []
flip_reviews = []
flip_titles = []

flip_ratings_temp = []
flip_reviews_temp = []
flip_titles_temp = []

#User defined function for scraping the records from flipkart.com

def flip_scrape(a,b):
    
    driver = webdriver.Chrome('chromedriver.exe', chrome_options = chrome_options)
    driver.maximize_window()
    time.sleep(1)
    
    global flip_ratings_temp, flip_reviews_temp, flip_titles_temp
    flip_ratings_temp = []
    flip_reviews_temp = []
    flip_titles_temp = []

    for page in flip_urls[a:b]:
        

        try:
            driver.get(page)
            time.sleep(1)
        except WebDriverException as e:
            time.sleep(1)
            driver.get(page)
            time.sleep(1)

        try:
            flip_all_review = driver.find_element_by_xpath("//div[@class = '_3UAT2v _16PBlm']")
            flip_all_review.click()

                #Selecting the filter -'Most Recent'
            try:
                WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div._1YokD2._3Mn1Gg.col-12-12 > div > div:nth-child(2) > div > div > div > div > select")))
                flip_filter = driver.find_element_by_xpath("//select[@class = '_1EDlbo tVKh2S']")
                flip_filter.click()
                time.sleep(1)
            except (NoSuchElementException, TimeoutException) as e:
                flip_filter = driver.find_element_by_xpath("//div[@id='container']/div/div[3]/div/div/div[2]/div[1]/div/div[2]/div/div/div/div/select")
                flip_filter.click()
                time.sleep(1)
            try:
                WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div._1YokD2._2GoDe3.col-12-12 > div:nth-child(2) > div > div > div > div > select > option:nth-child(2)")))
                most_recent_button = driver.find_element_by_xpath("//div[@id='container']/div/div[3]/div/div/div[2]/div[1]/div/div[2]/div/div/div/div/select/option[2]")
                most_recent_button.click()
            except (NoSuchElementException,StaleElementReferenceException, TimeoutException) as e:
                most_recent_button = driver.find_element_by_xpath("//option[@value = 'MOST_RECENT]")
                most_recent_button.click()

            #Data collection
            #Scraping the ratings
            try:
                try:
                    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_3-dnWo")))
                    flip_ratings_tags = driver.find_elements_by_xpath("//div[@class = 'col _2wzgFH K0kLPL']/div[1]/div[1]")
                    for rat in flip_ratings_tags:
                        flip_ratings_temp.append(rat.text)
                except (StaleElementReferenceException, TimeoutException) as e:
                    most_recent_button.click()
                    flip_ratings_tags = driver.find_elements_by_xpath("//div[@class = 'col _2wzgFH K0kLPL']/div[1]/div[1]")
                    for rat in flip_ratings_tags:
                        flip_ratings_temp.append(rat.text)
            except NoSuchElementException as e:
                flip_ratings_temp.append("-")



            #Scraping the reviews
            try:
                try:
                    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"_3-dnWo")))
                    flip_reviews_tags = driver.find_elements_by_xpath("//div[@class = 'col _2wzgFH K0kLPL']/div[2]/div[1]")
                    for rev in flip_reviews_tags:
                        flip_reviews_temp.append(rev.text)
                except (StaleElementReferenceException, TimeoutException) as e:
                    most_recent_button.click()
                    flip_reviews_tags = driver.find_elements_by_xpath("//div[@class = 'col _2wzgFH K0kLPL']/div[2]/div[1]")
                    for rev in flip_reviews_tags:
                        flip_reviews_temp.append(rev.text)
            except NoSuchElementException as e:
                flip_reviews_temp.append("-")

            #Scraping the review titles
            try:
                try:
                    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME,"c_3-dnWo")))
                    flip_title_tags = driver.find_elements_by_xpath("//p[@class = '_2-N8zT']")
                    for title in flip_title_tags:
                        flip_titles_temp.append(title.text)
                except (StaleElementReferenceException, TimeoutException) as e:
                    most_recent_button.click()
                    flip_title_tags = driver.find_elements_by_xpath("//p[@class = '_2-N8zT']")
                    for title in flip_title_tags:
                        flip_titles_temp.append(title.text)
            except NoSuchElementException as e:
                flip_titles_temp.append("-")


            #clicking the next page

            try:
                try:
                    next_review_page = driver.find_element_by_xpath("//a[@class = '_1LKTO3']")
                    next_review_page.click()
                except ElementClickInterceptedException as e:
                    height = driver.execute_script("return document.body.scrollHeight")
                    driver.execute_script("window.scrollTo({}, 0)".format(height-2000))
                    next_review_page = driver.find_element_by_css_selector("div._1YokD2._3Mn1Gg.col-9-12 > div:nth-child(13) > div > div > nav > a._1LKTO3 > span")
                    next_review_page.click()

            except NoSuchElementException as e:
                pass
            
        except NoSuchElementException as e:
            pass




    print(len(flip_ratings_temp))
    print(len(flip_reviews_temp))
    print(len(flip_titles_temp))
    driver.close()

In [411]:
#User defined function

def flip_copy():
    for rating,review,title in zip(flip_ratings_temp, flip_reviews_temp, flip_titles_temp):
        flip_ratings.append(rating)
        flip_reviews.append(review)
        flip_titles.append(title)

    print(len(flip_ratings))
    print(len(flip_reviews))
    print(len(flip_titles))

In [412]:
flip_scrape(0,200)

2274
2018
1801


In [413]:
flip_copy()

1801
1801
1801


In [415]:
flip_scrape(200,400)

1726
1352
1261


In [416]:
flip_copy()

3062
3062
3062


In [419]:
flip_scrape(400,640)

2064
2317
1861


In [428]:
flip_copy()

12367
12367
12367


In [429]:
amazon = pd.DataFrame({})
amazon['Review Title'] = review_title
amazon['Reviews'] = reviews
amazon['Ratings'] = ratings

flipkart = pd.DataFrame({})
flipkart['Review Title'] = flip_titles
flipkart['Reviews'] = flip_reviews
flipkart['Ratings'] = flip_ratings

ratings_data = pd.concat([amazon,flipkart])
ratings_data

Unnamed: 0,Review Title,Reviews,Ratings
0,Satisfied with the product,,5.0 out of 5 stars
1,Nice..,Most of them reviewed negatively about the sel...,5.0 out of 5 stars
2,Nice Product Quality,Awsome Product always recommended,5.0 out of 5 stars
3,Very good laptop in its segment. Works well wi...,Battery is good.. With 75% brightness and cont...,5.0 out of 5 stars
4,Good product,Like,5.0 out of 5 stars
...,...,...,...
12362,Delightful,Nice power bank,3
12363,Moderate,Go for it,5
12364,Terrible product,Please don’t buy it . It doesn’t work after so...,4
12365,Could be way better,Very heating,5


In [430]:
ratings_data.to_csv('ratings.csv', index = False)

We have saved the dataset with 21615 records