In [1]:
import numpy as np
import pandas as pd
import nltk
import re

In [2]:
nltk.download('punkt') # one time execution

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vijay.shankar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
import pdfhandler

In [3]:
txt = pdfhandler.readpdftext("Heather A to NCP Nov 07 IP survey.pdf")

reading pdf as text file....




file ready. file format:  <class 'str'>


## Tokenizing Sentences

In [6]:
from nltk.tokenize import sent_tokenize

# txt is a single string, so sent_tokenize already returns a flat list of sentences
sentences = sent_tokenize(txt)


In [7]:
sentences[:5]

['      Heather Alpha to Ninian Central Petrofac Facilities Management Group Limited 16 inch Crude Oil Pipeline   Executive Summary  A survey of the Petrofac Facilities Management Group Limited Heather Alpha to Ninian Central pipeline was completed by PII Pipeline Solutions from 3rd to 4th November 2007.',
 'A total of 5306 metal loss features have been detected on the inspection survey of which the deepest was 52% (internal corrosion).',
 'These are mainly concentrated from the launch to 1.5km orientated around the 6:00 o™clock position.',
 'Approximately 7% of the total number of spools have metal loss reported within them.',
 'The majority of these are internal and are characteristic of corrosion.']

## Word Tokenizing

In [8]:
from nltk.tokenize import word_tokenize

# txt is a single string, so word_tokenize already returns a flat list of tokens
tokenisedwords = word_tokenize(txt)

In [9]:
tokenisedwords[:5]

['Heather', 'Alpha', 'to', 'Ninian', 'Central']

In [10]:
nltk2 = nltk.Text(tokenisedwords)

In [14]:
nltk2.concordance('integrity')

Displaying 7 of 7 matches:
) Determine the current structural integrity of the pipeline related to the ope
ked '*** ' . Dents will affect the integrity of the pipel ine and are potential
ng : - Assessment This involves an Integrity Assessment which relates the sever
anies world-wide ; - pioneered new integrity assessment methods now accepted by
, which could pose a threat to the integrity of the Pipeline , the Contractor a
efore should pose no threat to the integrity of the Pipeline . PII Ref : - 1070
res which may be of concern to the integrity of the Pipeline are highlighted an


## Getting the Word Embeddings from GloVe (Global Vectors for Word Representation)

In [15]:
# Extract word vectors: each line holds a token followed by its 100 coefficients
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [16]:
len(word_embeddings)

400000

The dictionary word_embeddings now holds vectors for 400,000 distinct terms.
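
To make the file layout concrete, here is a minimal sketch of the same parsing step run on an in-memory sample instead of the real file. It assumes the whitespace-separated layout used above; the helper name `parse_glove` and the two 3-dimensional sample lines are hypothetical and only there for illustration.

```python
import io
import numpy as np

def parse_glove(handle):
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:  # skip blank lines
            continue
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Hypothetical 3-dimensional sample; glove.6B.100d.txt has 100 values per line.
sample = io.StringIO("pipeline 0.1 0.2 0.3\ncorrosion 0.4 0.5 0.6\n")
toy_embeddings = parse_glove(sample)
print(sorted(toy_embeddings))
print(toy_embeddings["pipeline"].shape)
```

Passing an open file handle instead of the `StringIO` object should work the same way, since the function only iterates over lines.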

In [17]:
# remove punctuation, numbers and special characters
# (regex=True is required: pandas 2.0+ treats the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vijay.shankar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [19]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [20]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [21]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
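
The cleaning steps so far (letters only, lowercase, stop-word removal) can be checked on a single sentence without pandas or NLTK. This is a minimal sketch, not the notebook's exact pipeline: `clean_sentence` is a hypothetical helper and `TOY_STOP_WORDS` is a made-up subset standing in for NLTK's English stop-word list.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"a", "of", "the", "have", "been", "on"}

def clean_sentence(sentence, stop_words=TOY_STOP_WORDS):
    """Keep letters only, lowercase, then drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)

example = "A total of 5306 metal loss features have been detected."
cleaned = clean_sentence(example)
print(cleaned)  # total metal loss features detected
```

Using a set for the stop words keeps each membership test cheap, which may matter on long documents.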

## Vector Representation of Sentences

Now, let’s create vectors for our sentences. For each sentence we fetch the 100-dimensional vectors of its words and average them to get a single sentence vector. Words missing from the GloVe vocabulary contribute a zero vector.

In [24]:
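sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # average the word vectors; unknown words map to a zero vector,
        # and the +0.001 in the denominator guards against division by zero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)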
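
The averaging step is easy to verify on toy data. The sketch below uses hypothetical 3-dimensional embeddings and a hypothetical helper `sentence_vector`. Unlike the cell above, it divides by the exact word count, because the empty case is handled separately.

```python
import numpy as np

DIM = 3  # toy dimensionality; the notebook uses 100

# Hypothetical embeddings for illustration only.
toy_embeddings = {
    "metal": np.array([1.0, 0.0, 0.0]),
    "loss": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim=DIM):
    """Average the word vectors; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.sum(vectors, axis=0) / len(words)

v_known = sentence_vector("metal loss", toy_embeddings)        # both words known
v_mixed = sentence_vector("metal loss xyzzy", toy_embeddings)  # one unknown word
v_empty = sentence_vector("", toy_embeddings)                  # nothing left after cleaning
print(v_known, v_mixed, v_empty)
```

Note that an unknown word still counts in the denominator, so it pulls the sentence vector towards zero. That matches the behaviour of the notebook's loop.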

## Similarity Matrix Preparation

The next step is to measure how similar the sentences are to one another, using cosine similarity. We start with an n × n zero matrix, where n is the number of sentences, and then fill it with the pairwise cosine similarity scores.

In [25]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

We will use scikit-learn’s cosine_similarity function to score each pair of sentences.

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

Now fill the matrix with the cosine similarity of every pair of distinct sentences; the diagonal stays at zero.

In [27]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
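
The nested loop calls cosine_similarity once per pair, which can get slow for long documents. As an alternative, here is a vectorised sketch in plain NumPy: normalise each row to unit length and take a single matrix product. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical; all-zero rows are treated as having zero similarity to everything.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # all-zero rows stay zero instead of dividing by zero
    unit = X / norms
    sim = unit @ unit.T          # dot products of unit vectors are cosine similarities
    np.fill_diagonal(sim, 0.0)   # mirror the `if i != j` check in the loop
    return sim

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```

Under these assumptions, `cosine_similarity_matrix(sentence_vectors)` should give the same values as the loop, up to floating-point rounding.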

## Applying the PageRank Algorithm

Before proceeding further, let’s convert the similarity matrix sim_mat into a graph. The nodes of this graph represent the sentences and the edge weights are the similarity scores between them. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

In [28]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
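
To show what the ranking step does, here is a minimal power-iteration sketch of PageRank on a small weighted matrix. It is an illustration, not networkx's implementation: the function name `pagerank_power`, the tolerance, and the toy matrix are assumptions. The damping factor of 0.85 is the same as the default `alpha` in `nx.pagerank`, so the scores should be close, though not necessarily identical.

```python
import numpy as np

def pagerank_power(sim, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a non-negative similarity matrix."""
    W = np.asarray(sim, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Normalise rows into transition probabilities; rows with no weight jump uniformly.
    P = np.where(row_sums > 0, W / np.where(row_sums > 0, row_sums, 1.0), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (rank @ P)
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank

# Hypothetical similarities: sentence 1 is strongly linked to both others.
toy_sim = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.8],
                    [0.1, 0.8, 0.0]])
toy_rank = pagerank_power(toy_sim)
print(toy_rank.argmax())  # 1
```

The sentence with the strongest overall links to the others gets the highest score, which is why well-connected sentences rise to the top of the summary.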

## Extracting Summary

Time to extract the top N sentences by score for the summary. Let us see if the code summarises the document well.

In [29]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [30]:
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])

3.13 Nominal Wall Thickness Listing  All pipe spool nominal wall thickness changes detected by the Inspection System will be reported in the following format:-  (a) The number of the girth weld at which the change in the pipe spool nominal wall thickness occurs  (b) Distance of the girth weld from the start of the Component Line (absolute distance)  (c) Distance from the girth weld to the next identified pipe spool nominal wall thickness change (length)  (d) Nominal wall thickness of the spools downstream from the girth weld  (e) Indication as to whether the spools are in Major or Minor Segments  (f) Estimated Repair Factor   (i) Internal Design Pressure assigned to the spools when determining the Calculated Pressure   (ii) MAOP assigned to the spools when determining the ERF 3.14 Pipeline Listing  The Pipeline Listing will provide the following information:-  (a) Girthweld number   PII  Ref:- 107048 ITT/PFM/LBL/5128  Schedule 5  (b) Distance to the downstream girth weld, detected Meta
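
Printing sentences in score order can make a summary hard to follow, and indexing a fixed `range(10)` fails when a document has fewer than ten sentences. One possible variation, sketched here with a hypothetical helper `top_n_in_document_order` and made-up scores, picks the top N by score and then restores the original sentence order.

```python
def top_n_in_document_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n])  # slicing copes with fewer than n sentences
    return [sentences[i] for i in keep]

toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = {0: 0.10, 1: 0.40, 2: 0.20, 3: 0.30}  # hypothetical PageRank scores
toy_summary = top_n_in_document_order(toy_sentences, toy_scores, 2)
print(toy_summary)  # ['Second.', 'Fourth.']
```

The `scores` argument has the same shape as the dictionary returned by `nx.pagerank` above, keyed by sentence index.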

In [18]:
import gensim

In [19]:
# Note: gensim.summarization was removed in gensim 4.0, so this cell needs gensim 3.x
print(gensim.summarization.summarizer.summarize(txt, ratio=1, word_count=300))

Schematic Location Summary: FeatureRef1Ref212.8m51052012.3m50051012.8m49050012.7m48049012.4m470480FLOWGirth WeldNumberPipe Length  ¬ Heather Alpha Ninian Central ®   Inspection Sheet Number 15 107048_16A  Feature Description  Type: Internal Metal Loss  Orientation: 05:45 (o™clock)  Axial length:  412 mm   Circumferential width:  224 mm   Depth - Peak:  31% WT   Pressure Ratio (ERF):  0.371  Feature Selection Rule: 7  Nominal Pipe wall thickness for spool: 15.90 mm   Absolute Distance from Launch: 852.2 metres Comments:  This metal loss feature has the appearance of corrosion.
Schematic Location Summary: FeatureRef1Ref212.8m51052012.3m50051012.8m49050012.7m48049012.4m470480FLOWGirth WeldNumberPipe Length  ¬ Heather Alpha Ninian Central ®   Inspection Sheet Number 15 107048_16A  Feature Description  Type: Internal Metal Loss  Orientation: 05:45 (o™clock)  Axial length:  412 mm   Circumferential width:  224 mm   Depth - Peak:  31% WT   Pressure Ratio (ERF):  0.371  Feature Selection Rule:

In [4]:
import sumy

In [5]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

In [6]:
parser = PlaintextParser.from_string(txt,Tokenizer("english"))

In [7]:
# Using LexRank
summarizer = LexRankSummarizer()
#Summarize the document with 2 sentences
summary = summarizer(parser.document, 2)

In [8]:
for sentence in summary:
    print(sentence)

It should be noted that mid-wall metal loss features would be classified as external; - the predicted peak depth of the metal loss feature; - the predicted axial length of the metal loss feature; - the orientation of the metal loss feature, as viewed in the direction of flow; - the calculated ERF value for the metal loss feature; and, - those metal loss features whic h have undergone detailed processing and analysis are indicated by a *.
These are: Metal Loss Features Each entry for a metal loss feature consists of: - the upstream girth weld number; - the relative distance along the pipeline to the upstream edge of the metal loss feature from the previous (upstream) girth weld; - the absolute distance along the pipeline to the upstream edge of the metal loss feature;  - the predicted axial length of the metal loss feature; - the predicted circumferential widt h of the metal loss feature;  - the wall thickness of the spool; - ML to denote that the entry refers to a metal loss feature; -

In [9]:
from sumy.summarizers.luhn import LuhnSummarizer

In [10]:
summarizer_luhn = LuhnSummarizer()
summary_1 = summarizer_luhn(parser.document, 2)

In [11]:
for sentence in summary_1:
    print(sentence)

Pipeline ListingHeather Alpha to Ninian CentralGirth WeldNumber Relative Distance(metres)  Absolute Distance(metres) CommentPeak DepthLength(mm)ERFOrientation(hrs:mins)20   0.0  3.1   0.0  3.1   SEAMLESS START HEATHER ALPHA1.1  4.2   BALL VALVE1.5  4.6   INT ML12%290.257 06:301.6  4.7   INT ML27%330.260 05:451.8  4.9   INT ML9%260.257 05:151.8  4.9   INT ML15%520.260 06:3030   2.2  5.3   0.3  5.6   SUPPORT0.4  5.7   300 MM OFFTAKE-WELDOLET 12:000.9  6.2   50 MM OFFTAKE-WELDOLET 03:001.2  6.5   50 MM OFFTAKE-WELDOLET 12:001.4  6.8   50 MM OFFTAKE-WELDOLET 06:001.9  7.2   50 MM OFFTAKE-WELDOLET 09:0040   2.0  7.3   0.4  7.7   INT ML21%840.268 05:300.5  7.8   INT ML17%470.260 06:150.6  7.9   INT ML13%720.262 05:150.7  8.0   INT ML15%720.263 06:151.1  8.3   BALL VALVE1.4  8.7  *INT ML26%280.259 06:151.4  8.7  *INT ML17%180.257 05:451.5  8.8  *INT ML21%1140.273 06:151.5  8.8  *INT ML16%310.258 05:301.7  8.9  *INT ML23%860.269 06:151.7  9.0  *INT ML17%190.257 05:301.8  9.1  *INT ML26%940.273

In [12]:
from sumy.summarizers.lsa import LsaSummarizer

In [13]:
summarizer_lsa = LsaSummarizer()
summary_2 = summarizer_lsa(parser.document, 2)

In [14]:
for sentence in summary_2:
    print(sentence)

Once this has been done, any external metal loss , dents or the girth weld that contains an anomaly should be easily identified.
The Severity Table includes all pipe spools containing a Metal Loss Feature with Pressure Sentenced Ratio equal to or greater than unity.


In [15]:
# Alternative method using a stemmer and stop words
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
summarizer_lsa2 = LsaSummarizer(Stemmer("english"))
summarizer_lsa2.stop_words = get_stop_words("english")

In [16]:
for sentence in summarizer_lsa2(parser.document,2):
    print(sentence)

It should be noted that the ASME B31 pressure sentencing formulae strictly applies to isolated areas of corrosion in the main body of line pipe operating at stress levels not exceeding 72% SMYS (Specified Minimum Yield Strength).
Extreme care should be exercised when attempting to measure remaining ligament thicknesses directly within an area of external damage bec ause there is extra couplant under the transducer when mounted on concave surfaces which results in an overestimated reading.
