# Text Summarization using NLP

In [1]:
# Install BeautifulSoup 4 - pip install beautifulsoup4
# Install lxml - pip install lxml

## Text Summarization Part 1 - Introduction to BeautifulSoup

In [2]:
# Importing the libraries
import bs4 as bs
import urllib.request

# Getting the data source
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming').read()

# Parsing the data / creating the BeautifulSoup object
soup = bs.BeautifulSoup(source,'lxml')


## Text Summarization Part 2 - Fetching the data

In [3]:
# Importing the libraries
import re

# Fetching the data
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text
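
The loop above concatenates the text of every `<p>` element and ignores headings, tables and other markup. The sketch below reproduces that idea on a small inline HTML string so it can be run without a network request. It swaps BeautifulSoup for the standard library's `html.parser`, and the `ParagraphText` class, the `paragraph_text` helper and the sample HTML are all illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while the parser is inside a <p> element
        self.chunks = []    # text fragments seen inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text of nested tags such as <b> is kept, like paragraph.text
        if self.depth:
            self.chunks.append(data)


def paragraph_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Joined with no separator, mirroring `text += paragraph.text`
    return "".join(parser.chunks)


# Made-up sample page standing in for the downloaded article
sample_html = ("<html><body><h1>Title</h1>"
               "<p>First <b>bold</b> paragraph.</p>"
               "<div>skip</div><p>Second.</p></body></html>")
print(paragraph_text(sample_html))
```

Because the pieces are joined without a separator, paragraphs that do not end in whitespace run together, which can hide a sentence boundary from the tokenizer later on.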


## Text Summarization Part 3 - Preprocessing the data 

In [4]:
# Importing the libraries
import nltk

# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
clean_text = text.lower()
clean_text = re.sub(r'\W',' ',clean_text)
clean_text = re.sub(r'\d',' ',clean_text)
clean_text = re.sub(r'\s+',' ',clean_text)
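
To see what each substitution does, the same sequence can be applied to a short string. This is a minimal sketch: the `preprocess` wrapper and the sample sentence are assumptions made for illustration, while the regular expressions are the ones used in the cell above.

```python
import re


def preprocess(raw):
    """Return (text, clean_text) using the notebook's substitutions."""
    # Replace numeric citation markers such as [1] or [23] with a space
    text = re.sub(r'\[[0-9]*\]', ' ', raw)
    # Collapse runs of whitespace (including newlines) into one space
    text = re.sub(r'\s+', ' ', text)

    clean_text = text.lower()
    # Replace every non-word character (punctuation, apostrophes) with a space
    clean_text = re.sub(r'\W', ' ', clean_text)
    # Replace digits with a space
    clean_text = re.sub(r'\d', ' ', clean_text)
    # Collapse the extra spaces introduced above
    clean_text = re.sub(r'\s+', ' ', clean_text)
    return text, clean_text


# Made-up sample standing in for the scraped article text
sample = "Global warming[1] rose by 2 degrees.[23]\nIt's   still rising!"
text, clean_text = preprocess(sample)
print(text)        # keeps case and punctuation, used for sentence splitting
print(clean_text)  # lowercase words only, used for word counts
```

`text` keeps its punctuation so that sentences can still be split, while `clean_text` is reduced to lowercase words. Note that `\W` also splits contractions, so "It's" becomes the two tokens "it" and "s".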


## Text Summarization Part 4 - Tokenization

In [5]:
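import nltk
nltk.download('stopwords')
# sent_tokenize and word_tokenize also need the 'punkt' tokenizer data:
# nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead)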

# Tokenize sentences
sentences = nltk.sent_tokenize(text)

# Stopword list
stop_words = nltk.corpus.stopwords.words('english')


[nltk_data] Downloading package stopwords to /Users/Pi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Text Summarization Part 5 - Word counts and weights

In [6]:
# Word counts 
word2count = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1

# Converting counts to weights (each count divided by the highest count)
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key]/max_count
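
The counting and normalisation steps can also be written with `collections.Counter`. The sketch below is one possible compact form, not the notebook's own code: the `word_weights` function, the token list and the two-word stopword set are assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def word_weights(tokens, stop_words):
    """Map each non-stopword to count / highest count (hypothetical helper)."""
    counts = Counter(t for t in tokens if t not in stop_words)
    if not counts:
        # max() would raise ValueError on an empty sequence
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


# Made-up tokens; the notebook obtains them from nltk.word_tokenize(clean_text)
tokens = ("climate change drives warming and warming drives change "
          "in climate and warming").split()
weights = word_weights(tokens, stop_words={"and", "in"})
print(weights)
```

The most frequent word always receives a weight of 1.0 and every other word a fraction of it, so the scores of different articles stay on a comparable scale.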


## Text Summarization Part 6 - Sentence scores

In [7]:
# Computing sentence scores (sentences of 25 or more words are skipped)
sent2score = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence.split(' ')) < 25:
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]
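
The scoring rule is that a sentence's score is the sum of the weights of its words, and sentences of 25 or more space-separated words are left out. A minimal sketch of that rule as a function follows. The `score_sentences` name, its parameters and the sample data are assumptions, and plain `str.split` stands in for `nltk.word_tokenize` so that the example needs no downloads.

```python
def score_sentences(sentences, weights, max_words=25, tokenize=str.split):
    """Sum word weights per sentence, skipping long sentences."""
    scores = {}
    for sentence in sentences:
        # Same length test as the notebook, checked once per sentence
        if len(sentence.split(' ')) >= max_words:
            continue
        for word in tokenize(sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[word]
    return scores


# Made-up weights and sentences for illustration
weights = {"warming": 1.0, "climate": 0.5}
sentences = [
    "Warming alters climate",
    "Nothing relevant here",
    "climate climate",
    " ".join(["warming"] * 30),   # too long, so it is never scored
]
print(score_sentences(sentences, weights))
```

As in the original loop, a sentence with no weighted words never gets an entry, and a repeated word contributes its weight once per occurrence. Passing `tokenize=nltk.word_tokenize` would restore the notebook's tokenizer, which also separates punctuation from words.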


## Text Summarization Part 7 - Getting the Summary

In [9]:
# Importing the libraries
import heapq

                    
# Getting the 5 best sentences
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

print('---------------------------------------------------------')
for sentence in best_sentences:
    print(sentence)


---------------------------------------------------------
Global warming usually refers to human-induced warming of the Earth system, whereas climate change can refer to natural as well as anthropogenic change.
 Climate change includes both global warming driven by human-induced emissions of greenhouse gases and the resulting large-scale shifts in weather patterns.
Climate change impacts can be mitigated by reducing greenhouse gas emissions and by enhancing sinks that absorb greenhouse gases from the atmosphere.
The long-term effects of climate change include further ice melt, ocean warming, sea level rise, and ocean acidification.
To determine the human contribution to climate change, known internal climate variability and natural external forcings need to be ruled out.
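
`heapq.nlargest` returns the selected sentences from highest to lowest score, so the summary above does not follow the order of the article. If reading order matters, the selected sentences can be sorted by their original position afterwards. The sketch below shows one way to do that. The `best_sentences_in_order` function, its `original_order` parameter and the sample scores are assumptions, not part of the notebook.

```python
import heapq


def best_sentences_in_order(scores, n=5, original_order=None):
    """Pick the n highest-scoring sentences, optionally in document order."""
    top = heapq.nlargest(n, scores, key=scores.get)
    if original_order is not None:
        # Position of each sentence in the source text
        position = {s: i for i, s in enumerate(original_order)}
        top.sort(key=position.__getitem__)
    return top


# Made-up sentences and scores for illustration
sentences = ["s1", "s2", "s3", "s4"]
scores = {"s1": 0.4, "s2": 1.2, "s3": 0.9, "s4": 1.7}

print(best_sentences_in_order(scores, n=2))
print(best_sentences_in_order(scores, n=2, original_order=sentences))
```

The first call lists the sentences by score, the second by where they occur in the text. In the notebook this would correspond to passing `sent2score` and `sentences`.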
