## Load spaCy model and create a document summariser

In [1]:
import spacy
from collections import Counter
from string import punctuation

nlp = spacy.load("en_core_web_lg")

In [2]:
import pandas as pd
import numpy as np
import datetime as dt
import os

### Top Sentence Function
Source: https://betterprogramming.pub/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97

In [3]:
def top_sentence(text, limit):
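    """
    Build an extractive summary from the highest-scoring sentences in the text.

    Args:
        text (str): The input text. Can be a short paragraph or a large chunk of text.
        limit (int): Maximum number of sentences to return.

    Returns:
        str: The selected sentences, capitalised and joined with spaces,
             in descending order of score.
    """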
    keyword = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    doc = nlp(text.lower())  # Lower-case the text, then run the spaCy pipeline.
    for token in doc:  # Loop over each token in the document.
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue  # Ignore stop words and punctuation.
        if token.pos_ in pos_tag:
            keyword.append(token.text)  # Keep tokens tagged with one of the listed parts of speech.
    
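    # Normalise keyword counts by the count of the most frequent keyword.
    freq_word = Counter(keyword)
    if not freq_word:
        return ''  # No keywords found (e.g. empty input), so there is nothing to rank.
    max_freq = freq_word.most_common(1)[0][1]
    for w in freq_word:
        freq_word[w] = freq_word[w] / max_freq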
        
    # Sentence strength: the sum of the normalised frequencies of the keywords in each sentence.
    sent_strength = {}
    for sent in doc.sents:
        for word in sent:  # Loop over each token in the sentence, using spaCy's tokenisation.
            if word.text in freq_word:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
    
    # Rank sentences by strength, highest first, and keep at most `limit` of them.
    sorted_x = sorted(sent_strength.items(), key=lambda kv: kv[1], reverse=True)
    summary = [str(sent).capitalize() for sent, _ in sorted_x[:limit]]

    return ' '.join(summary)
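
The scoring step in `top_sentence` can be followed by hand on a toy input. The sketch below is not part of the original notebook: it skips spaCy entirely and assumes the sentences are already lower-cased and tokenised, and that the keyword list has already been filtered. The names `score_sentences`, `sentences` and `keywords` are hypothetical and exist only for illustration.

```python
from collections import Counter


def score_sentences(sentences, keywords):
    """Score each tokenised sentence by summing normalised keyword frequencies."""
    freq = Counter(keywords)
    if not freq:
        return {}  # No keywords, so no sentence can be scored.
    max_freq = max(freq.values())
    # Normalise so the most frequent keyword has weight 1.0.
    weights = {word: count / max_freq for word, count in freq.items()}
    scores = {}
    for idx, tokens in enumerate(sentences):
        score = sum(weights.get(tok, 0.0) for tok in tokens)
        if score > 0:  # Sentences without any keyword are left out, as in top_sentence.
            scores[idx] = score
    return scores


# Hypothetical toy input (assumed pre-tokenised and lower-cased).
sentences = [
    ["ferrari", "listing", "ferrari"],
    ["fiat", "owns", "stake"],
    ["weather", "sunny"],
]
keywords = ["fiat", "ferrari", "ferrari", "listing"]

scores = score_sentences(sentences, keywords)
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)  # {0: 2.5, 1: 0.5}
print(ranked)  # [0, 1]
```

Here `ferrari` appears twice, so it gets weight 1.0 while `fiat` and `listing` get 0.5. The first sentence scores 1.0 + 0.5 + 1.0 = 2.5 and the third sentence, which contains no keyword, is never scored.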

Test the sentence summariser

In [4]:
all_texts = pd.read_csv('../data/02_intermediate/all_text_labels.csv')
all_texts = all_texts.drop(['Unnamed: 0'], axis=1)
all_texts.head()

   label     text
0  business  Ad sales boost Time Warner profit\n\nQuarterly...
1  business  Dollar gains on Greenspan speech\n\nThe dollar...
2  business  Yukos unit buyer faces loan claim\n\nThe owner...
3  business  High fuel prices hit BA's profits\n\nBritish A...
4  business  Pernod takeover talk lifts Domecq\n\nShares in...


In [5]:
example_text = all_texts.iloc[50]['text']
print(top_sentence(example_text, 5))

Fiat mulls ferrari market listing

ferrari could be listed on the stock market as part of an overhaul of fiat's carmaking operations, the financial times has reported. 

it said fiat was set to restructure its business after reaching a $2bn (1.53bn euros; £1.05bn) settlement with gm about fiat's ownership. The financial times said fiat may transfer maserati within its wholly- owned alfa romeo division in an effort to exploit commercial synergies. Fiat owns a 56% stake in ferrari -best known for its dominant formula one motor racing team - having first bought into the business in 1969. Steps being considered include listing ferrari and bringing maserati and alfa romeo closer together, it said.


In [6]:
example_text

"Fiat mulls Ferrari market listing\n\nFerrari could be listed on the stock market as part of an overhaul of Fiat's carmaking operations, the Financial Times has reported.\n\nIt said Fiat was set to restructure its business after reaching a $2bn (1.53bn euros; £1.05bn) settlement with GM about Fiat's ownership. Steps being considered include listing Ferrari and bringing Maserati and Alfa Romeo closer together, it said. Despite strong sales of Alfa Romeo, Fiat's car business is making a loss.\n\nUnder the proposals - which the paper said could be announced within days - the iconic sportscar maker could be listed separately on the market. Fiat owns a 56% stake in Ferrari -best known for its dominant Formula One motor racing team - having first bought into the business in 1969. It considered floating Ferrari in 2002 but opted to sell a minority stake to Italian bank Mediobanca for 775m euros ($1bn). That sale valued Ferrari - which owns the Maserati brand - at 2.3bn euros. The price tag w

## Using Transformers from Hugging Face

Hugging Face provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, and more in over 100 languages.
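
As a minimal sketch of calling one of these pre-trained models directly, the snippet below wraps the `pipeline` function from the `transformers` library with the `"summarization"` task. It assumes the package is installed and that the model can be downloaded on first use. The helper names `build_summarizer` and `summarise` and the default length limits are hypothetical choices for illustration, not part of the original notebook.

```python
def build_summarizer(model_name="sshleifer/distilbart-cnn-12-6"):
    """Load a Hugging Face summarization pipeline (downloads the model on first use)."""
    # Imported lazily so that defining the helpers does not require the model.
    from transformers import pipeline
    return pipeline("summarization", model=model_name)


def summarise(summarizer, text, max_length=60, min_length=10):
    """Run a summarizer callable on the text and return the summary string.

    Assumes the callable returns a list of dicts with a 'summary_text' key,
    which is the shape returned by the transformers summarization pipeline.
    """
    result = summarizer(text, max_length=max_length, min_length=min_length,
                        do_sample=False)
    return result[0]["summary_text"].strip()


# Usage (needs the `transformers` package and network access):
# summarizer = build_summarizer()
# print(summarise(summarizer, "Ferrari could be listed on the stock market ..."))
```

Keeping the model loading separate from the call makes it easy to swap in a different model name, or a stub callable when testing.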

### Using Gradio with Hugging Face
To create our app, we will be using Gradio, which allows us to create a UI for our Hugging Face model easily.

Reference: https://medium.com/bitgrit-data-science-publication/build-a-news-article-summarizer-app-with-hugging-face-and-gradio-99d173428204

In [9]:
from transformers import pipeline
import gradio as gr
from gradio.mix import Parallel, Series
import warnings
warnings.filterwarnings("ignore")

In [10]:
# Load four of the most-downloaded summarisation models to combine in the parallel interface.

# Workaround: an empty CURL_CA_BUNDLE is commonly used to bypass SSL certificate
# verification when fetching the models; avoid it outside trusted networks.
os.environ["CURL_CA_BUNDLE"] = ""

io1 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-12-6")
io2 = gr.Interface.load("huggingface/facebook/bart-large-cnn")
io3 = gr.Interface.load("huggingface/google/pegasus-xsum")
io4 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-6-6")


Fetching model from: https://huggingface.co/sshleifer/distilbart-cnn-12-6
Fetching model from: https://huggingface.co/facebook/bart-large-cnn
Fetching model from: https://huggingface.co/google/pegasus-xsum
Fetching model from: https://huggingface.co/sshleifer/distilbart-cnn-6-6


In [11]:
iface = Parallel(io1, io2, io3, io4,
                 theme='huggingface', 
                 inputs = gr.inputs.Textbox(lines = 10, label="Text"))

iface.launch(share=True)

Running on local URL:  http://127.0.0.1:7860/
Running on public URL: https://43943.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://www.huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x1f480639940>,
 'http://127.0.0.1:7860/',
 'https://43943.gradio.app')
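
Conceptually, `Parallel` sends the same input text to every loaded interface and shows the outputs side by side for comparison. The sketch below illustrates that fan-out pattern in plain Python. It is not Gradio's implementation: the summarisers are hypothetical stand-in functions rather than real models, and the name `compare_summarisers` is made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_summarisers(text, summarisers):
    """Send the same text to every summariser and collect the outputs by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(summarisers))) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in summarisers.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical stand-ins for real models, used only to show the data flow.
def first_sentence(text):
    return text.split(". ")[0].rstrip(".") + "."


def first_five_words(text):
    return " ".join(text.split()[:5])


stand_ins = {"first_sentence": first_sentence, "first_five_words": first_five_words}
outputs = compare_summarisers(
    "Fiat may list Ferrari. Maserati could move to Alfa Romeo.", stand_ins
)
for name, summary in outputs.items():
    print(f"{name}: {summary}")
```

Any callable that maps a string to a string can be placed in the dictionary, so real model wrappers could replace the stand-ins without changing `compare_summarisers`.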