# Summarizing Glassdoor Reviews

> *Advanced Customer Analytics*  
> *MSc in Data Science, Department of Informatics*  
> *Athens University of Economics and Business*

---

<p style='text-align: justify;'>Create a second Python notebook with a function called <code>summarize()</code>. The function should accept as a parameter the path to the CSV file created by the first notebook. It should then create a one-page PDF file that summarizes all the reviews in the CSV. The nature of the summary is entirely up to you: it can be text-based, visual, or a combination of both. It is also up to you to decide what is important enough to include. Focus on creating the summary that you think would be most informative for customers. The PDF must be created from within the notebook, using any Python-based library you want.</p>

---

##### *Libraries*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from fpdf import FPDF
from functions.graphs import *
from functions.preprocessing import *
from functions.frequencies import *
from functions.vectorization import *
from functions.summarization import *

##### *The filepath where the previously scraped data is located*

In [2]:
filepath = './data/glassdoor_reviews.csv'
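
Since the function defined below reads the `rating`, `date` and `text` columns from this file, it can help to check the CSV header before running the full pipeline. The following is only a minimal sketch: the helper `missing_columns` and the constant `REQUIRED_COLUMNS` are hypothetical names that do not exist in the notebook's `functions` package, and the demo uses in-memory data rather than the real file.

```python
import csv
import io

# Columns that the summary function reads from the CSV (rating, date, text).
REQUIRED_COLUMNS = ("rating", "date", "text")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header.

    `csv_file` is any iterable of lines, for example an open text file.
    """
    # the first row of the CSV is treated as the header; empty input gives []
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory demo with made-up rows (hypothetical data, not real reviews)
complete = io.StringIO("rating,date,text\n5,2021-03-01,Great team *separator* Long hours\n")
incomplete = io.StringIO("rating,date\n4,2020-07-15\n")

print(missing_columns(complete))    # []
print(missing_columns(incomplete))  # ['text']
```

With the real file, the call would look like `missing_columns(open(filepath, newline=''))`; an empty list means all three columns are present.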

##### *Define a function to create a PDF with a summary of the Glassdoor reviews*

In [3]:
def summary(filepath:str):
    """
    Create a PDF report summarizing the scraped Glassdoor reviews.

    Parameters
    ----------
    filepath: str
        The local path where the previously scraped data is located
    
    Returns
    -------
    None.
    """
    
    # read the data that was scraped from Glassdoor
    df = pd.read_csv(filepath, converters={'rating':int, 'date':pd.to_datetime})
    
    # get number of reviews and avg rating values
    # to pass them to the PDF summary creation
    no_reviews = str(len(df))
    avg_rating = str(round(df.rating.mean(), 2))
        
    # save an image showing the number of reviews per rating
    plot_number_of_reviews_per_rating(df)
    
    # save an image showing the number of reviews and the average rating per year
    plot_number_of_reviews_and_avg_rating_per_year(df)
    
    # preprocess the reviews and convert them to clean sentences
    # (the result holds one cleaned string per review)
    reviews = df.text.apply(text_preprocessing)
    
    # initialize empty lists
    # to store pros and cons of each review
    pros = []
    cons = []
    
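    # loop through reviews
    for review in reviews:
        # split once on the separator defined during scraping;
        # partition() returns an empty cons part instead of raising
        # an IndexError when the separator is missing
        review_pros, _, review_cons = review.partition(' *separator* ')
        pros.append(review_pros.strip())
        cons.append(review_cons.strip())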
    
    # convert lists to single text
    pros_sent = ' '.join(pros) # str of pros
    cons_sent = ' '.join(cons) # str of cons

    # https://spacy.io/usage/models
    # trained pipeline for the english language
    nlp = spacy.load('en_core_web_lg')
    
    # list of english stopwords
    stop_words = set(stopwords.words('english'))
    
    # count how many times each word appears
    # and get the result as a dataframe
    _, _, df_pros = count_word_frequencies(pros, stop_words, nlp)
    _, _, df_cons = count_word_frequencies(cons, stop_words, nlp)
    
    # get top k words appearing ONLY in pros or cons
    df_distinct_pros, df_distinct_cons = get_top_k_distinct_pros_and_cons_words(df_pros, df_cons)
    
    # get top k words appearing in BOTH pros and cons, but MOST FREQUENTLY in one of the two
    df_top_k_pros, df_top_k_cons, _ = get_top_k_mixed_pros_and_cons_words(df_pros, df_cons)
    
    # save an image showing...
    # top k words appearing ONLY in pros
    plot_top_k_words(df_distinct_pros)
    # top k words appearing ONLY in cons
    plot_top_k_words(df_distinct_cons, is_pros=False)
    # top k words appearing in BOTH pros and cons, but MOST FREQUENTLY in pros
    plot_top_k_words(df_top_k_pros, is_distinct=False)
    # top k words appearing in BOTH pros and cons, but MOST FREQUENTLY in cons
    plot_top_k_words(df_top_k_cons, is_pros=False, is_distinct=False)
    
    # check what people say in pros
    # and get the most representative sentences about what they most frequently mention
    freq_pros_sentences = get_most_similar_and_frequently_shown_sentences(pros_sent,
                                                                          ngram_range=(1,3),
                                                                          stop_words=stop_words,
                                                                          threshold=0.1)
    
    # check what people say in cons
    # and get the most representative sentences about what they most frequently mention
    freq_cons_sentences = get_most_similar_and_frequently_shown_sentences(cons_sent,
                                                                          ngram_range=(1,3),
                                                                          stop_words=stop_words,
                                                                          threshold=0.1)
    
    # openai - TL;DR summarization
    # https://beta.openai.com/examples/default-tldr-summary
    summary_pros = openai_tldr_summarization(' '.join(freq_pros_sentences))
    summary_cons = openai_tldr_summarization(' '.join(freq_cons_sentences))
    
    # openai - Grammar correction
    # https://beta.openai.com/examples/default-grammar
    summary_pros = openai_grammar_correction(summary_pros)
    summary_cons = openai_grammar_correction(summary_cons)
    
    # save pros summary to .txt (the ./summary directory must already exist)
    with open('./summary/summary_pros.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(summary_pros)
    
    # save cons summary to .txt
    with open('./summary/summary_cons.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(summary_cons)
        
    # create a summarization report
    create_pdf_report(no_reviews, avg_rating, summary_pros, summary_cons)
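
The step that keeps the top k words appearing ONLY in pros or ONLY in cons can be illustrated with plain word counts. This is a rough sketch under stated assumptions, not the code behind `get_top_k_distinct_pros_and_cons_words`: the function name `top_k_distinct_words` is hypothetical, the inputs are assumed to be already tokenized and stripped of stopwords, and the result is a list of tuples rather than a dataframe.

```python
from collections import Counter


def top_k_distinct_words(pros_tokens, cons_tokens, k=10):
    """Return the k most frequent words found only in pros, and only in cons."""
    pros_counts = Counter(pros_tokens)
    cons_counts = Counter(cons_tokens)

    # keep a word only if it never appears on the other side
    only_pros = Counter({w: c for w, c in pros_counts.items() if w not in cons_counts})
    only_cons = Counter({w: c for w, c in cons_counts.items() if w not in pros_counts})

    return only_pros.most_common(k), only_cons.most_common(k)


# made-up tokens for illustration only
pros_tokens = ["salary", "team", "team", "benefits", "benefits", "benefits"]
cons_tokens = ["salary", "overtime", "overtime", "management"]

distinct_pros, distinct_cons = top_k_distinct_words(pros_tokens, cons_tokens, k=2)
print(distinct_pros)  # [('benefits', 3), ('team', 2)]
print(distinct_cons)  # [('overtime', 2), ('management', 1)]
```

Here "salary" is dropped from both lists because it occurs in pros and in cons; words like that belong to the second, "mixed" ranking instead.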

In [4]:
summary(filepath)
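
Inside the function, the text sent to the summarizer is built from the sentences that best represent what reviewers mention most often. One way to picture that selection is sketched below. It deliberately swaps in word-overlap (Jaccard) similarity so that it runs with the standard library alone; it is not the implementation of `get_most_similar_and_frequently_shown_sentences`, and the names `tokenize`, `jaccard` and `most_representative` are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_representative(sentences, threshold=0.1, top_n=3):
    """Rank sentences by how many other sentences they resemble."""
    token_sets = [tokenize(s) for s in sentences]
    scored = []
    for i, tokens in enumerate(token_sets):
        # count the other sentences whose similarity reaches the threshold
        neighbours = sum(
            1
            for j, other in enumerate(token_sets)
            if i != j and jaccard(tokens, other) >= threshold
        )
        scored.append((neighbours, sentences[i]))
    # highest neighbour count first; the sort is stable, so ties keep input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_n]]


# made-up sentences for illustration only
sentences = [
    "Great work life balance",
    "Good work life balance and nice colleagues",
    "Free coffee",
    "Work life balance is great",
]

# keeps the three "work life balance" sentences; "Free coffee" has no neighbours
print(most_representative(sentences, threshold=0.3, top_n=3))
```

A sentence that many other sentences resemble is taken as a proxy for a frequently mentioned topic, which is why the isolated sentence falls out of the result.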

---

*Thank you!*

---