In [1]:
import os
os.getcwd()

'C:\\Users\\Anindita\\Documents\\ISB\\Capstone Project\\Projects\\Capstone Peak Health\\Code\\Text Summarization'

### 1. NLTK

- Step 1. Read from source — Read the unabridged content from the source, a file in the case of this exercise.

- Step 2. Perform formatting and cleanup — Format and clean up the input so that it is free of extra white space or other issues.

- Step 3. Tokenize input — Take the input and break it up into its individual words.

- Step 4. Scoring — Score (count) the frequency of each word in the input and score sentences based on word score.

- Step 5. Selection — Choose the top N sentences based on their score.
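
As a rough illustration of the steps above, here is a minimal standard-library sketch on a toy text. It is not the NLTK implementation used in the cells that follow. The `toy_summarize` function, the tiny `TOY_STOP_WORDS` set and the regex-based sentence splitter are all assumptions made for this example, and Step 1 (reading from a file) is skipped because the text is passed in directly.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stop-word list (illustration only)
TOY_STOP_WORDS = {"the", "is", "a", "and", "but", "of", "to", "are"}

def toy_summarize(text, n=1):
    # Step 2: collapse runs of whitespace into single spaces
    clean = re.sub(r"\s+", " ", text).strip()
    # Step 3: naive sentence split after ., ! or ?, plus lowercase word tokens
    sentences = re.split(r"(?<=[.!?])\s+", clean)
    words = [w for w in re.findall(r"[a-z']+", clean.lower())
             if w not in TOY_STOP_WORDS]
    # Step 4: count word frequencies; a sentence scores the sum of its word counts
    # (stop words are absent from the Counter, so they contribute 0)
    freq = Counter(words)
    scores = {
        i: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    # Step 5: keep the n best sentences, restored to document order
    best = sorted(nlargest(n, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)

toy_text = "The pay is low. The work culture is good and the work is easy. Managers are fine."
print(toy_summarize(toy_text, n=1))
# The work culture is good and the work is easy.
```

The middle sentence wins because "work" occurs twice in the text, so every occurrence adds 2 to the sentence score.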

In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.probability import FreqDist
from heapq import nlargest
from collections import defaultdict

In [12]:
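def read_file(path):
    # Return the full file contents as a single string
    try:
        with open(path, 'r') as file:
            return file.read()
    except IOError:
        print("Fatal Error: File ({}) could not be located or is not readable.".format(path))
        # Re-raise so later cells do not silently receive None
        raise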

In [13]:
def sanitize_input(data):
    replace = {
        ord('\f') : ' ',
        ord('\t') : ' ',
        ord('\n') : ' ',
        ord('\r') : None
    }
    return data.translate(replace)

In [14]:
def tokenize_content(content):
    stop_words = set(stopwords.words('english') + list(punctuation))
    words = word_tokenize(content.lower())
    
    return [
        sent_tokenize(content),
        [word for word in words if word not in stop_words]    
    ]

In [15]:
def score_tokens(filtered_words, sentence_tokens):
    word_freq = FreqDist(filtered_words)
    ranking = defaultdict(int)
    for i, sentence in enumerate(sentence_tokens):
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                ranking[i] += word_freq[word]
    return ranking

In [16]:
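def summarize(ranks, sentences, length):
    # Validate the requested length; raising is safer than exit(), which would stop the notebook kernel
    length = int(length)
    if length > len(sentences):
        raise ValueError("More sentences requested ({}) than available ({}).".format(length, len(sentences)))
    # Pick the indexes of the top-scoring sentences, then restore document order
    indexes = nlargest(length, ranks, key=ranks.get)
    final_sentences = [sentences[j] for j in sorted(indexes)]
    return ' '.join(final_sentences)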
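
The selection step relies on `heapq.nlargest` being called on a dictionary. A small sketch with made-up scores (the `toy_ranks` values are hypothetical) shows what it returns and why the indexes are sorted afterwards:

```python
from heapq import nlargest

# Hypothetical sentence scores keyed by sentence index
toy_ranks = {0: 5, 1: 12, 2: 9, 3: 20}

# Iterating a dict yields its keys, so nlargest returns the indexes whose
# scores (looked up through key=toy_ranks.get) are highest, best first
top_indexes = nlargest(2, toy_ranks, key=toy_ranks.get)
print(top_indexes)          # [3, 1]

# Sorting the indexes puts the chosen sentences back in document order
print(sorted(top_indexes))  # [1, 3]
```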

In [19]:
content = read_file("testtext.txt")
content = sanitize_input(content)
#print(content,"\n")
sentence_tokens, word_tokens = tokenize_content(content)  
#print(sentence_tokens,"\n")
#print(word_tokens,"\n")
    
sentence_ranks = score_tokens(word_tokens, sentence_tokens)
#print(sentence_ranks,"\n")

print(summarize(sentence_ranks, sentence_tokens, 1))

Not much growth in terms of skills you will be idle in most situations client is god for higher management Good company to start with but no great benefits are there if you are still in your 30s and are in need of a good package Company culture and word life balance is good but the compensation is not there Other companies do provide night shift allowance and pay on national holidays and festivals or at least provide gift to their employees but HCL do nothing like this just provide a comp off that's it that's something need to be changed You need to provide some strong reson to your employees to be in your company for longer period of time you have a job security here and can enjoy along with work as it has gym and games related options.


### 2. Gensim

In [20]:
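# Note: the gensim.summarization module was removed in Gensim 4.0, so this cell needs gensim < 4.0.
# This import also shadows the summarize() function defined in the NLTK section above.
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords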



In [21]:
print('Summary:')
print(summarize(content, ratio=0.01))

Summary:
Job security is there Client or Customer centric company and to make client or customer happy they forgot about the engineer Management always listen what the manager says they have set criteria to judge the ratings of engineer or person who is working there but all are fake who makes satisfy the manager only gets good rating and appraisal Before covid they never allowed to engineers any work from home facility It is a growing organization and has multiple mechanical engineering projects currently.
I have gone through lot of them most of them are useless.Regarding the hike,there isn't even a hike There are 4 quarterly reviews to capture how good you performed in each quarter but it's just a fake game.Everyone in hcl gets almost same hike(usually 6-15%) I joined around 2019 with a fresher salary of 2.5LPA,Then got deployed to these scrap work .Somehow i got lucky .I influenced one of my customers they gave me some important work.Then i learnt and i left .But people who joined a

### 3. Lex Rank

In [2]:
!pip install lexrank

!pip install sumy



In [8]:
from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS
from path import Path

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

In [10]:
parser = PlaintextParser.from_file("testtext.txt", Tokenizer("english"))
# Using LexRank
summarizer = LexRankSummarizer()
# Summarize the document with 1 sentence
summary = summarizer(parser.document, 1)
for sentence in summary:
    print(sentence)

Not much growth in terms of skills you will be idle in most situations client is god for higher management Good company to start with but no great benefits are there if you are still in your 30s and are in need of a good package Company culture and word life balance is good but the compensation is not there Other companies do provide night shift allowance and pay on national holidays and festivals or at least provide gift to their employees but HCL do nothing like this just provide a comp off that's it that's something need to be changed You need to provide some strong reson to your employees to be in your company for longer period of time you have a job security here and can enjoy along with work as it has gym and games related options.


In [103]:
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.reduction import ReductionSummarizer

In [106]:
LANGUAGE = "english"
SENTENCES_COUNT = 2

if __name__ == "__main__":
    parser = PlaintextParser.from_file("testtext.txt", Tokenizer(LANGUAGE))

 
    print ("\033[1m LsaSummarizer \033[0m")
    summarizer = LsaSummarizer(Stemmer(LANGUAGE))#Latent Semantic Analysis
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
         
    print ("\033[1m LuhnSummarizer \033[0m")     
    summarizer = LuhnSummarizer(Stemmer(LANGUAGE)) #Word freq method
    summarizer.stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this",)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
         
    print ("\033[1m EdmundsonSummarizer \033[0m")     
    summarizer = EdmundsonSummarizer() #cue phrase method
    words = ("deep", "learning", "neural" )
    summarizer.bonus_words = words
    words = ("another", "and", "some", "next",)
    summarizer.stigma_words = words
    words = ("another", "and", "some", "next",)
    summarizer.null_words = words
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)   
        
    print("\033[1m KL Summarizer \033[0m")
    summarizer = KLSummarizer()
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print("\033[1m TextRank Summarizer \033[0m")
    summarizer = TextRankSummarizer()
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print("\033[1m Reduction Summarizer \033[0m")
    summarizer = ReductionSummarizer()
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)

[1m LsaSummarizer [0m
There was a notification of walkin interview on naukri.com , the company clearly mentioned , to show the notification to the ground floor security , other wise they'll not let you enter , I reached the company on time , the security guard asked me about the mail , I showed him the notification , but he refused to let me enter , I also told him that this is what companies permission is but still he refused , I showed him the date and venue in the notification but he did not bulge , on contrary he's behaviour was very rude, and seems to be throwing his weight around , when I asked him again he started raising his voice , shouting and misbehaving , I went to the watchmans manager who was seeing all this from distance , and he asked me to leave , I again politely explained him every thing , but he just showed me the gate , I had placed a leave for this interview , and came all the way from L B nagar , quite far away from hitec city, and this is what you get , and it

### TL-DR 

In [32]:
import nltk
import string
from heapq import nlargest

In [47]:
# If the text has more than 500 sentence breaks (". "), take a tenth of that count
if content.count(". ") > 500:
    length = int(round(content.count(". ")/10, 0))
# Otherwise return a single sentence
else:
    length = 1
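
The same length rule can be wrapped in a small function so that the threshold and the fraction are explicit. This is only a sketch: the name `choose_length` and its parameters are assumptions for illustration, not part of the original notebook.

```python
def choose_length(text, threshold=500, fraction=10):
    # Approximate the number of sentences by counting ". " occurrences
    approx_sentences = text.count(". ")
    if approx_sentences > threshold:
        # Long text: keep roughly one sentence in every `fraction`
        return int(round(approx_sentences / fraction))
    # Short text: fall back to a single sentence
    return 1

print(choose_length("Short review. Only two sentences."))  # 1
print(choose_length("A sentence. " * 600))                 # 60
```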

In [48]:
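# Remove punctuation
nopunc = ''.join(char for char in content if char not in string.punctuation)
# Remove stopwords; build the stopword set once instead of reloading the list for every word,
# and lowercase each word so it matches the lowercased tokens used when scoring sentences
stop_words = set(nltk.corpus.stopwords.words('english'))
processed_text = [word.lower() for word in nopunc.split() if word.lower() not in stop_words]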

In [49]:
# Create a dictionary to store word frequency
word_freq = {}
# Enter each word and its number of occurrences
for word in processed_text:
    if word not in word_freq:
        word_freq[word] = 1
    else:
        word_freq[word] = word_freq[word] + 1

In [50]:
# Divide all frequencies by the max frequency to give scores in the range (0, 1]
max_freq = max(word_freq.values())
for word in word_freq.keys():
    word_freq[word] = (word_freq[word]/max_freq)
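
To make the normalisation concrete, here is a toy example with a made-up word list (`toy_words` is hypothetical). It uses `collections.Counter` in place of the manual counting loop, which is an alternative to the cell above rather than what it does:

```python
from collections import Counter

toy_words = ["good", "work", "good", "pay", "work", "good"]

# Raw counts: good=3, work=2, pay=1
toy_freq = Counter(toy_words)
toy_max = max(toy_freq.values())

# Dividing by the largest count maps the most frequent word to 1.0
toy_norm = {word: count / toy_max for word, count in toy_freq.items()}
print(toy_norm["good"])  # 1.0
```

Here "work" ends up at 2/3 and "pay" at 1/3, so every score lies in (0, 1].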

In [51]:
# Create a list of the sentences in the text
sent_list = nltk.sent_tokenize(content)
# Create an empty dictionary to store sentence scores
sent_score = {}
for sent in sent_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_freq.keys():
            if sent not in sent_score.keys():
                sent_score[sent] = word_freq[word]
            else:
                sent_score[sent] = sent_score[sent] + word_freq[word]

In [52]:
summary_sents = nlargest(length, sent_score, key = sent_score.get)
summary = ' '.join(summary_sents)
print(summary)

Not much growth in terms of skills you will be idle in most situations client is god for higher management Good company to start with but no great benefits are there if you are still in your 30s and are in need of a good package Company culture and word life balance is good but the compensation is not there Other companies do provide night shift allowance and pay on national holidays and festivals or at least provide gift to their employees but HCL do nothing like this just provide a comp off that's it that's something need to be changed You need to provide some strong reson to your employees to be in your company for longer period of time you have a job security here and can enjoy along with work as it has gym and games related options.


### PyTextRank

In [84]:
!pip install pytextrank
!python -m spacy download en_core_web_sm

Collecting pytextrank
  Using cached pytextrank-3.2.3-py3-none-any.whl (30 kB)
Collecting graphviz>=0.13
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
Collecting icecream>=2.1
  Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Collecting asttokens>=2.0.1
  Using cached asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting executing>=0.3.1
  Downloading executing-0.9.0-py2.py3-none-any.whl (16 kB)
Installing collected packages: executing, asttokens, icecream, graphviz, pytextrank
Successfully installed asttokens-2.0.5 executing-0.9.0 graphviz-0.20.1 icecream-2.1.3 pytextrank-3.2.3
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




In [85]:
import spacy
import pytextrank

In [86]:
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank")
doc = en_nlp(content)

In [87]:
for combination in doc._.phrases:
    print(combination.text, combination.rank, combination.count)

Good work 0.091465941190635 1
Good work environment 0.08851008437063486 5
good work life balance 0.08724648347656681 1
good job security 0.0855892799422587 1
higher management Good company 0.08332342256445688 1
good salary hike 0.0831587028186492 7
good banking projects 0.08300550671783283 1
Good training 0.08028788994474875 1
good opportunity 0.07983464999239352 2
good culture 0.07903111656661684 1
good place 0.07644995334101981 2
Good learning 0.07544460778957043 1
good rating 0.07527916090313254 3
good package 0.07472332866029903 1
Good learning environment 0.07468425687280983 1
Good cafeteria 0.07430376605789342 1
good working culture 0.0740938778844844 1
good worker 0.07375625636872718 1
good learner 0.07347023684182166 1
good rest 0.07335794708793617 1
Good Experience 0.07293555900745397 1
avery good organization 0.06956938826420953 1
Company job security 0.06933894073611542 1
Overall good experience 0.068504808201687 1
good increment orelse 0.06829581811564385 1
Overall good Com

the product 0.0071161214188473435 1
- Transport coverage 0.007086460716308198 1
1.I 0.007068027635456652 1
2.HCL 0.007068027635456652 2
Freedom 0.007068027635456652 1
Learning 0.007068027635456652 1
People 0.007068027635456652 4
Trainer 0.007068027635456652 1
case 0.007068027635456652 1
juniors 0.007068027635456652 1
wil 0.007068027635456652 1
2 lakhs 0.00700980738170173 2
no one 0.006931578997120872 2
every week 0.006917761870186199 1
a brand 0.006781705856719527 1
100 reasons 0.006750187391506544 1
any reason 0.006750187391506544 1
the date 0.006670431413857427 1
no clue 0.006424989296690451 1
a comp 0.0063771348071187985 1
a billable resource 0.00631881305408109 1
a experienced resource 0.00631881305408109 1
one (contribution 0.0062211693831159195 1
my Project 0.00621346601620178 1
two years.my request 0.006182482839974246 1
the multinationals 0.006040330050651913 1
any kind 0.005850052894746712 1
these kind 0.005850052894746712 1
all codes 0.005794687341198205 1
all section 0.00575

In [88]:
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = en_nlp(content)
for phrase in doc._.phrases[:5]:
    print(phrase)

Phrase(text='Good work', chunks=[Good work], count=1, rank=0.09143881676920286)
Phrase(text='Good work environment', chunks=[Good work environment, Good work environment, Good work environment, Good work environment, Good work environment], count=5, rank=0.0884850950162702)
Phrase(text='good work life balance', chunks=[good work life balance], count=1, rank=0.08713796492577786)
Phrase(text='good job security', chunks=[good job security], count=1, rank=0.08555192670598187)
Phrase(text='higher management Good company', chunks=[higher management Good company], count=1, rank=0.08330559422272783)


In [90]:
tr = doc._.textrank

In [91]:
for sent in tr.summary(limit_phrases=10, limit_sentences=2):
    print(sent)

company policies are good , appraisals are fair and they give good salary hike and bonus every year Management is good work life balance is good.
Good work and environment and culture.temmates are supportive and the management is transparent.tjere is a huge cafeteria and gaming arcade for chilling .


### Using SUMMA Package

In [92]:
!pip install summa



In [93]:
from summa import summarizer
from summa import keywords

In [98]:
print(summarizer.summarize(content))

Job security is there Client or Customer centric company and to make client or customer happy they forgot about the engineer Management always listen what the manager says they have set criteria to judge the ratings of engineer or person who is working there but all are fake who makes satisfy the manager only gets good rating and appraisal Before covid they never allowed to engineers any work from home facility It is a growing organization and has multiple mechanical engineering projects currently.
Hcl is avery good organization,But the middle management is shit.
You wil get career growth if you joined in internal projects.If it is a service-based project (external project from other vendors ) my request is please don't join.Because you will end being nothing.Most of the product developement companies give the documentation andless important works to these kind of service companies.If you joined here you will get deployed these mokka(scrap) works.These companies uses their own tools fo

In [99]:
print(keywords.keywords(content))

work
works
worked
management
managers
company
job
project
hcl
manager says
employees
employee
developement companies
salaries
working interview weekend
training
trained
trainings
learning
learned
overall good experience
time
timing
times
timely
engineering projects
opportunities
opportunity
like
liked
salary differes
team
teams
skilled
skills
skill
learn new
u
years
year
yearly
month
monthly
months
appraisal
appraisals
based
development
developers
develop
developer
developed
policy
policies
great
seniors
ask
asked
asking
asks
provided
provides
provide
providing
culture
cultural
allowance
allowed
allow
benefits
balanced
expectation
expect
expected
expectations
people
leave benefit
joined
join
joining
technical growth
pay
paying
secured
security
secure
leaving
leaves
role
roles
politely
politics
political
poor
environment
different
client
clients
senior leads
banking
campus
performed
performance
perform
facility
facilities
tech
annual
annually
life balance
subjects
subjected
got
bank rel

In [102]:
summarizer.summarize(content, ratio=0.01)

"Not much growth in terms of skills you will be idle in most situations client is god for higher management Good company to start with but no great benefits are there if you are still in your 30s and are in need of a good package Company culture and word life balance is good but the compensation is not there Other companies do provide night shift allowance and pay on national holidays and festivals or at least provide gift to their employees but HCL do nothing like this just provide a comp off that's it that's something need to be changed You need to provide some strong reson to your employees to be in your company for longer period of time you have a job security here and can enjoy along with work as it has gym and games related options.\nSeniors, team leaders, managers as well as your co-workers are very helpful and cooperative, good for freshers to learn and built skills and to get exposure of corporate world, 5 days working, WFH during pandemic, company will provide systems /laptop