This document and script were created by Harun-Ur-Rashid on Kaggle about 6 years ago (around 2018).
Link to original document: https://www.kaggle.com/code/harunshimanto/summarization-with-wine-reviews-using-spacy


# &#127916; Introduction to Wine Reviews
![Imgur](https://i.imgur.com/0GFdU23.png)
> In this notebook, I explore the Wine Reviews dataset, which contains about 130k wine reviews. At the end of the notebook, I build a simple text summarizer that summarizes a given review. The summaries can also be used as review titles. I use spaCy as the natural language processing library for this project.

## &#128203; Objective of This Project
The objective of this project is to build a model that creates relevant summaries of wine reviews. The dataset contains more than 130k reviews and is hosted on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews).

## What Is Text Summarization?
![Imgur](https://i.imgur.com/LLfNlBS.png)
> Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

## Types of Text Summarization Methods
Text summarization methods can be classified into different types.
![Imgur](https://i.imgur.com/J5KyMBJ.png)
**i. Based on input type:**

1. Single Document, where the input length is short. Many of the early summarization systems dealt with single document summarization.

2. Multi Document, where the input can be arbitrarily long.

**ii. Based on the purpose:**

1. Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work that has been done revolves around generic summarization.

2. Domain-specific, where the model uses domain-specific knowledge to form a more accurate summary. For example, summarizing research papers of a specific domain, biomedical documents, etc.

3. Query-based, where the summary only contains information which answers natural language questions about the input text.

**iii. Based on output type:**

1. Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.

2. Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is more appealing, but much more difficult than extractive summarization.

# 1. Import Packages 


In [3]:
import numpy as np # linear algebra
import spacy
nlp = spacy.load('en_core_web_sm')
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from IPython.display import display
import base64
import string
import re
from collections import Counter
from time import time
# from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS as stopwords
from nltk.corpus import stopwords
import nltk
import heapq
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline

stopwords = stopwords.words('english')
sns.set_context('notebook')

# 2. Import Dataset 
> In this section, I load the dataset for this notebook. The full dataset has a large number of reviews, which makes it slow to work with, so I read only the first 5,000 rows and keep the `points`, `title`, and `description` columns.

In [7]:
# The file is UTF-8 encoded; reading it as latin1 garbles accented characters in the titles
reviews = pd.read_csv("winemag data/winemag-data-130k-v2.csv", nrows=5000, usecols=['points', 'title', 'description'], encoding='utf-8')
reviews = reviews.dropna()
reviews.head(15)

| | description | points | title |
|---|---|---|---|
| 0 | Aromas include tropical fruit, broom, brimston... | 87 | Nicosia 2013 Vulkà Bianco (Etna) |
| 1 | This is ripe and fruity, a wine that is smooth... | 87 | Quinta dos Avidagos 2011 Avidagos Red (Douro) |
| 2 | Tart and snappy, the flavors of lime flesh and... | 87 | Rainstorm 2013 Pinot Gris (Willamette Valley) |
| 3 | Pineapple rind, lemon pith and orange blossom ... | 87 | St. Julian 2013 Reserve Late Harvest Riesling ... |
| 4 | Much like the regular bottling from 2012, this... | 87 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... |
| 5 | Blackberry and raspberry aromas show a typical... | 87 | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... |
| 6 | Here's a bright, informal red that opens with ... | 87 | Terre di Giurfo 2013 Belsito Frappato (Vittoria) |
| 7 | This dry and restrained wine offers spice in p... | 87 | Trimbach 2012 Gewurztraminer (Alsace) |
| 8 | Savory dried thyme notes accent sunnier flavor... | 87 | Heinz Eifel 2013 Shine Gewürztraminer (Rheinh... |
| 9 | This has great depth of flavor with its fresh ... | 87 | Jean-Baptiste Adam 2012 Les Natures Pinot Gris... |


# 3. Text preprocessing
> In this step, I use spaCy to preprocess the text; in other words, I remove elements of the review descriptions that are not useful, such as punctuation and stopwords. Two useful Python libraries are available for this task: NLTK and spaCy. In this notebook, I work with spaCy because it is very fast and has many useful features compared to NLTK. So without further ado, let's get started!

In [9]:
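!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

def normalize_text(text):
    # Remove <pre>...</pre> and <code>...</code> blocks together with their contents
    tm1 = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
    tm2 = re.sub('<code>.*?</code>', '', tm1, flags=re.DOTALL)
    # Remove any remaining HTML tags (applied to tm2 so the <code> removal is not lost)
    tm3 = re.sub('<[^>]+>', '', tm2, flags=re.DOTALL)
    return tm3.replace("\n", "")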


In [10]:
# in this step we are going to remove code syntax from text 
reviews['description_Cleaned_1'] = reviews['description'].apply(normalize_text)

In [11]:
print('Before normalizing text-----\n')
print(reviews['description'][2])
print('\nAfter normalizing text-----\n')
print(reviews['description_Cleaned_1'][2])

Before normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.

After normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.


For this review, the text before and after normalization is identical: the wine descriptions are plain text, so there is no markup to strip. The step simply guarantees clean input, which helps when we explore the reviews and later build the summarizer.
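
To show what the normalization step is meant to do, here is a minimal sketch run on a made-up string that does contain markup. The sample string and the name `strip_markup` are illustrative assumptions; they are not part of the dataset or of the original notebook.

```python
import re

def strip_markup(text):
    """Drop <pre>/<code> blocks, then any remaining HTML tags and newlines."""
    text = re.sub(r'<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', '', text, flags=re.DOTALL)
    text = re.sub(r'<[^>]+>', '', text)
    return text.replace("\n", "")

# Hypothetical input: a review wrapped in HTML with an embedded <pre> block
sample = "<p>Crisp and <b>dry</b>.</p> <pre>x = 1\ny = 2</pre>Good value."
print(strip_markup(sample))  # Crisp and dry. Good value.
```

Plain text without tags passes through unchanged, which is what happens with the wine descriptions.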

The text still contains punctuation and stopwords, which we don't need for the analysis. I did not remove them in the first pass because the summarizer needs the original sentences later. So let's create another column that stores the normalized text without punctuation and stopwords.

## 3.1 Clean text before feeding it to spaCy

In [None]:
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'
# Clean up text by removing pronouns, stopwords, and punctuation
def cleanup_text(docs, logging=False):
    # The parser and NER are not needed for lemmas, so disable them for speed
    doc = nlp(docs, disable=['parser', 'ner'])
    # spaCy v3 no longer emits the '-PRON-' lemma, so filter pronouns by part of speech
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.pos_ != 'PRON']
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    # Return a plain string so that .apply() yields a Series of cleaned texts
    return ' '.join(tokens)
reviews['Description_Cleaned'] = reviews['description_Cleaned_1'].apply(cleanup_text)

In [None]:
print('Review description with punctuation and stopwords---\n')
print(reviews['description_Cleaned_1'][0])
print('\nReview description after removing punctuation and stopwords---\n')
print(reviews['Description_Cleaned'][0])

Wow! See! Now our text looks much more readable and less messy!
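
For readers without a spaCy model installed, the filtering part of this step can be sketched with the standard library alone. This is a simplified illustration, not the notebook's method: it does no lemmatization, the stopword set is a tiny made-up sample rather than NLTK's list, and `simple_cleanup` is a hypothetical name.

```python
import string

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list
STOPWORDS = {"the", "of", "and", "a", "is", "was", "all", "some", "with"}

def simple_cleanup(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(simple_cleanup("Tart and snappy, the flavors of lime flesh and rind dominate."))
# tart snappy flavors lime flesh rind dominate
```

Unlike the spaCy version, this keeps inflected forms such as "flavors" instead of reducing them to their lemma.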

# 4. Distribution of Points
In this section, I look at the distribution of points. Here, points are the score that WineEnthusiast gave the wine on a scale of 1 to 100; the dataset description notes that reviews are only published for wines that score 80 or above.

In [None]:
plt.subplot(1, 2, 1)
(reviews['points']).plot.hist(bins=30, figsize=(30,5), edgecolor='white',range=[0,150])
plt.xlabel('Number of points', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Number of points description', fontsize=17)
plt.show()

Most of the reviews have points between 80 and 100.

# 5. Analyze reviews description
In this section, I analyze the wine descriptions. The description is the core of each review, so it is worth checking whether its length is related to the points a wine receives. Let's see what we can find in the wine descriptions.

In [None]:
reviews['Title_len'] = reviews['Description_Cleaned'].str.split().str.len()
rev = reviews.groupby('Title_len')['points'].mean().reset_index()
trace1 = go.Scatter(
    x = rev['Title_len'],
    y = rev['points'],
    mode = 'lines+markers',
    name = 'lines+markers'
)
layout = dict(title= 'Average points by wine description Length',
              yaxis = dict(title='Average points'),
              xaxis = dict(title='wine description Length'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

# 6. Description Summarizer
![Imgur](https://i.imgur.com/DrvohGg.jpg?1)
> In this step, I build a description summarizer. There is a huge amount of ongoing research on text summarization, but I use a simple technique here. The technique is described below.

### 6.1 Convert Paragraphs to Sentences
> We first need to convert the whole paragraph into sentences. The most common way of converting paragraphs to sentences is to split the paragraph whenever a period is encountered.
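
The period-based split described above can be sketched with a regular expression. This is only an illustration of the naive approach, with `split_sentences` as a hypothetical helper name; the summarizer function later in this notebook relies on spaCy's `doc.sents` instead.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [p for p in parts if p]

text = ("Tart and snappy, the flavors of lime flesh and rind dominate. "
        "Some green pineapple pokes through, with crisp acidity underscoring the flavors. "
        "The wine was all stainless-steel fermented.")
for sentence in split_sentences(text):
    print(sentence)
```

A rule like this also splits after abbreviations such as "St.", which is one reason to prefer a trained sentence segmenter.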

### 6.2 Text Preprocessing
> After converting paragraph to sentences, we need to remove all the special characters, stop words and numbers from all the sentences.

### 6.3 Tokenizing the Sentences
> We need to tokenize all the sentences to get all the words that exist in the sentences.

### 6.4 Find Weighted Frequency of Occurrence
> Next we need to find the weighted frequency of occurrences of all the words. We can find the weighted frequency of each word by dividing its frequency by the frequency of the most occurring word.
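
As a small worked example of this normalization, the sketch below counts the words in a made-up token list and divides each count by the largest count. The word list and the name `weighted_frequencies` are assumptions for illustration.

```python
from collections import Counter

def weighted_frequencies(words):
    """Map each word to its count divided by the count of the most frequent word."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Hypothetical cleaned tokens: 'lime' appears 3 times, 'flavors' 2, 'rind' 1
words = "lime flavors lime rind flavors lime".split()
print(weighted_frequencies(words))
```

The most frequent word always gets weight 1.0, and every other word gets a value between 0 and 1 (here 2/3 for "flavors" and 1/3 for "rind").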

### 6.5 Replace Words by Weighted Frequency in Original Sentences
> The next step is to plug the weighted frequency in place of the corresponding words in the original sentences and find their sum. Note that the weighted frequency of words removed during preprocessing (stop words, punctuation, digits, etc.) is zero, so they do not need to be added.

### 6.6 Sort Sentences in Descending Order of Sum
> The final step is to sort the sentences in descending order of their sum. The sentences with the highest scores summarize the text.
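
Steps 6.5 and 6.6 can be sketched together: score each sentence by summing the weights of its words, then keep the highest-scoring ones. The sentences, the weights, and the name `top_sentences` below are made up for illustration and use a much simpler tokenization than the notebook's function.

```python
import heapq

def top_sentences(sentences, weights, n=1):
    """Score each sentence by the summed weights of its words; return the n best."""
    scores = {}
    for sent in sentences:
        tokens = sent.lower().replace(".", "").replace(",", "").split()
        # Words without a weight (e.g. stopwords) contribute zero
        scores[sent] = sum(weights.get(tok, 0.0) for tok in tokens)
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical weighted frequencies and sentences
weights = {"lime": 1.0, "flavors": 2 / 3, "rind": 1 / 3}
sentences = [
    "The flavors of lime and rind dominate.",
    "Crisp acidity underscores the flavors.",
    "The wine was fermented in steel.",
]
print(top_sentences(sentences, weights, n=2))
```

The first sentence scores highest because it contains all three weighted words, so it leads the summary.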

In [None]:
# this is function for text summarization
def generate_summary(text_without_removing_dot, cleaned_text):
    sample_text = text_without_removing_dot
    doc = nlp(sample_text)
    sentence_list=[]
    for idx, sentence in enumerate(doc.sents): # we are using spacy for sentence tokenization
        sentence_list.append(re.sub(r'[^\w\s]','',str(sentence)))

    stopwords = nltk.corpus.stopwords.words('english')

    word_frequencies = {}  
    for word in nltk.word_tokenize(cleaned_text):  
        if word not in stopwords:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1


    maximum_frequncy = max(word_frequencies.values())

    for word in word_frequencies.keys():  
        word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)


    sentence_scores = {}  
    for sent in sentence_list:  
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word]
                    else:
                        sentence_scores[sent] += word_frequencies[word]


    summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

    summary = ' '.join(summary_sentences)
    print("Original Text:\n")
    print(text_without_removing_dot)
    print('\n\nSummarized text:\n')
    print(summary)  

Now that we have written the function, let's try to summarize some descriptions.

In [None]:
generate_summary(reviews['description_Cleaned_1'][8], reviews['Description_Cleaned'][8])

In [None]:
generate_summary(reviews['description_Cleaned_1'][100], reviews['Description_Cleaned'][100])

In [None]:
generate_summary(reviews['description_Cleaned_1'][500], reviews['Description_Cleaned'][500])

That's awesome! We successfully made a simple winemag description summarizer.

# 7. Conclusion
> Thanks for reading this notebook. If you have any suggestions, feel free to reach me in the comments. And don't forget to upvote. 👍
> Stay in touch for more updates. Thank you. &#128526;