# Rapid Automatic Keyword Extraction (RAKE) - Project

<img src= "https://www.fleetscience.org/sites/default/files/images/customer%20ai.gif">

## Summary of RAKE

Rapid Automatic Keyword Extraction (RAKE) extracts meaningful keywords and key phrases from an individual document, such as a text file. It needs no reference corpus, and it is considered relatively easy to implement compared with most other keyword-extraction algorithms. Its main drawback is that the extracted phrases are not always meaningful or accurate. RAKE tends to perform better at extracting long key phrases than other text-mining algorithms.

## Analytic Approach

The approach for this project is a descriptive statistical analysis. The goal is to apply RAKE to the documents and extract meaningful keywords and phrases.

## Data Requirements & Collection

The input data for this approach can be PDF, CSV, or JSON files. This notebook focuses on a dataset of scientific papers stored as a CSV file, because words are much easier to extract from CSV. PDFs can also be used, but the phrases extracted from them are too vague.

## Coding 

### Import Statements

In [13]:
import numpy as np
import re
import string
import operator
import PyPDF2
import nltk
import pandas as pd

from collections import Counter, defaultdict
from enum import Enum
from itertools import chain, groupby, product
from typing import Callable, DefaultDict, Dict, List, Optional, Set, Tuple
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

In [14]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

Word = str
Sentence = str
Phrase = Tuple[str, ...]


class Metric(Enum):

    DEGREE_TO_FREQUENCY_RATIO = 0  # Uses d(w)/f(w) as the metric
    WORD_DEGREE = 1  # Uses d(w) alone as the metric
    WORD_FREQUENCY = 2  # Uses f(w) alone as the metric


class Rake:

    def __init__(
            self,
            stopwords: Optional[Set[str]] = None,
            punctuations: Optional[Set[str]] = None,
            language: str = 'english',
            ranking_metric: Metric = Metric.DEGREE_TO_FREQUENCY_RATIO,
            max_length: int = 100000,
            min_length: int = 1,
            include_repeated_phrases: bool = True,
            sentence_tokenizer: Optional[Callable[[str], List[str]]] = None,
            word_tokenizer: Optional[Callable[[str], List[str]]] = None,
    ):

        if isinstance(ranking_metric, Metric):
            self.metric = ranking_metric
        else:
            self.metric = Metric.DEGREE_TO_FREQUENCY_RATIO

        self.stopwords: Set[str]
        if stopwords:
            self.stopwords = stopwords
        else:
            self.stopwords = set(nltk.corpus.stopwords.words(language))

        
        self.punctuations: Set[str]
        if punctuations:
            self.punctuations = punctuations
        else:
            self.punctuations = set(string.punctuation)

        self.to_ignore: Set[str] = set(chain(self.stopwords, self.punctuations))

        self.min_length: int = min_length
        self.max_length: int = max_length

        self.include_repeated_phrases: bool = include_repeated_phrases

        self.sentence_tokenizer: Callable[[str], List[str]]
        if sentence_tokenizer:
            self.sentence_tokenizer = sentence_tokenizer
        else:
            self.sentence_tokenizer = nltk.tokenize.sent_tokenize
        self.word_tokenizer: Callable[[str], List[str]]
        if word_tokenizer:
            self.word_tokenizer = word_tokenizer
        else:
            self.word_tokenizer = nltk.tokenize.wordpunct_tokenize

        self.frequency_dist: Dict[Word, int]
        self.degree: Dict[Word, int]
        self.rank_list: List[Tuple[float, Sentence]]
        self.ranked_phrases: List[Sentence]
  

    def extract_keywords_from_text(self, text: str):
        sentences: List[Sentence] = self._tokenize_text_to_sentences(text)
        self.extract_keywords_from_sentences(sentences)

    def extract_keywords_from_sentences(self, sentences: List[Sentence]):
        phrase_list: List[Phrase] = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurrence_graph(phrase_list)
        self._build_ranklist(phrase_list)
        
    def preprocess(self, sentence: str) -> str:
        # Normalise raw text: lowercase, then strip HTML tags, URLs and digits.
        sentence = str(sentence).lower()
        sentence = sentence.replace('{html}', "")
        cleantext = re.sub(r'<.*?>', '', sentence)   # HTML tags
        rem_url = re.sub(r'http\S+', '', cleantext)  # URLs
        rem_num = re.sub(r'[0-9]+', '', rem_url)     # digits
        # Tokenising on \w+ also drops punctuation.
        tokenizer = RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(rem_num)
        # Build the stopword set once instead of reloading the list for every token.
        english_stopwords = set(stopwords.words('english'))
        filtered_words = [w for w in tokens if len(w) > 2 and w not in english_stopwords]
        return " ".join(filtered_words)

    def get_ranked_phrases(self) -> List[Sentence]:
        return self.ranked_phrases

    def get_ranked_phrases_with_scores(self) -> List[Tuple[float, Sentence]]:
        return self.rank_list

    def get_word_frequency_distribution(self) -> Dict[Word, int]:
        return self.frequency_dist

    def get_word_degrees(self) -> Dict[Word, int]:
        return self.degree

    def _tokenize_text_to_sentences(self, text: str) -> List[Sentence]:
        return self.sentence_tokenizer(text)

    def _tokenize_sentence_to_words(self, sentence: Sentence) -> List[Word]:
        return self.word_tokenizer(sentence)

    def _build_frequency_dist(self, phrase_list: List[Phrase]) -> None:
        self.frequency_dist = Counter(chain.from_iterable(phrase_list))

    def _build_word_co_occurrence_graph(self, phrase_list: List[Phrase]) -> None:
        co_occurrence_graph: DefaultDict[Word, DefaultDict[Word, int]] = defaultdict(lambda: defaultdict(lambda: 0))
        for phrase in phrase_list:
            for (word, coword) in product(phrase, phrase):
                co_occurrence_graph[word][coword] += 1
        self.degree = defaultdict(lambda: 0)
        for key in co_occurrence_graph:
            self.degree[key] = sum(co_occurrence_graph[key].values())

    def _build_ranklist(self, phrase_list: List[Phrase]):
        self.rank_list = []
        for phrase in phrase_list:
            rank = 0.0
            for word in phrase:
                if self.metric == Metric.DEGREE_TO_FREQUENCY_RATIO:
                    rank += 1.0 * self.degree[word] / self.frequency_dist[word]
                elif self.metric == Metric.WORD_DEGREE:
                    rank += 1.0 * self.degree[word]
                else:
                    rank += 1.0 * self.frequency_dist[word]
            self.rank_list.append((rank, ' '.join(phrase)))
        self.rank_list.sort(reverse=True)
        self.ranked_phrases = [ph[1] for ph in self.rank_list]

    def _generate_phrases(self, sentences: List[Sentence]) -> List[Phrase]:
        phrase_list: List[Phrase] = []
        for sentence in sentences:
            word_list: List[Word] = [word.lower() for word in self._tokenize_sentence_to_words(sentence)]
            phrase_list.extend(self._get_phrase_list_from_words(word_list))

        if not self.include_repeated_phrases:
            unique_phrase_tracker: Set[Phrase] = set()
            non_repeated_phrase_list: List[Phrase] = []
            for phrase in phrase_list:
                if phrase not in unique_phrase_tracker:
                    unique_phrase_tracker.add(phrase)
                    non_repeated_phrase_list.append(phrase)
            return non_repeated_phrase_list

        return phrase_list

    def _get_phrase_list_from_words(self, word_list: List[Word]) -> List[Phrase]:
        groups = groupby(word_list, lambda x: x not in self.to_ignore)
        phrases: List[Phrase] = [tuple(group[1]) for group in groups if group[0]]
        return list(filter(lambda x: self.min_length <= len(x) <= self.max_length, phrases))
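
The default scoring used by the class above (`DEGREE_TO_FREQUENCY_RATIO`) can be hard to follow inside the full implementation. The following is a minimal standalone sketch of the same idea, not the class itself: text is split into candidate phrases at stopwords, each word gets a frequency f(w) and a degree d(w), and a phrase scores the sum of d(w)/f(w) over its words. The names `STOPWORDS`, `candidate_phrases` and `rake_scores` are hypothetical, and the tiny stopword set is an assumption so the sketch runs without NLTK data.

```python
from collections import Counter, defaultdict
from itertools import groupby
import re

# Hypothetical, deliberately tiny stopword set (assumption: enough for this example only).
STOPWORDS = {"the", "of", "and", "a", "for", "is", "in", "to"}


def candidate_phrases(text):
    """Split lowercased word tokens into phrases at every stopword."""
    words = re.findall(r"\w+", text.lower())
    groups = groupby(words, key=lambda w: w not in STOPWORDS)
    return [tuple(group) for keep, group in groups if keep]


def rake_scores(phrases):
    """Score each phrase as the sum of degree(w) / frequency(w) over its words."""
    freq = Counter(w for phrase in phrases for w in phrase)
    degree = defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            # Each occurrence of w co-occurs with every word of its phrase, itself included.
            degree[w] += len(phrase)
    scored = [(sum(degree[w] / freq[w] for w in phrase), " ".join(phrase)) for phrase in phrases]
    return sorted(scored, reverse=True)


phrases = candidate_phrases("The calculation of new balances and the recording of old balances")
ranked = rake_scores(phrases)
for score, phrase in ranked:
    print(round(score, 2), phrase)
```

In this example "balances" occurs twice with degree 4, so "new balances" and "old balances" each score 2/1 + 4/2 = 4.0, while the single-word phrases score 1.0.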


In [15]:
r = Rake()

In [None]:
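# PdfFileReader, getNumPages, getPage and extractText belong to the PyPDF2 API before version 3.0.
with open('Entropy.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    num_pages = read_pdf.getNumPages()
    for x in range(num_pages):
        pageObj = read_pdf.getPage(x)
        page = pageObj.extractText()
        text = str(page)  # reassigned on every iteration: after the loop, text holds the last page only
        print(text)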
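
The loop above reassigns `text` for each page, so only the last page of the PDF reaches the extraction step. A minimal sketch of collecting every page into one string follows. `join_pages` is a hypothetical helper, and `FakePage` is a stand-in (an assumption, not part of PyPDF2) that only mimics the `extractText()` method so the example runs without a PDF file.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page object, used only to keep this sketch runnable."""

    def __init__(self, content):
        self.content = content

    def extractText(self):
        return self.content


def join_pages(pages):
    """Concatenate the extracted text of every page, separated by newlines."""
    return "\n".join(str(page.extractText()) for page in pages)


pages = [FakePage("first page"), FakePage("second page")]
full_text = join_pages(pages)
print(full_text)

# With the reader from the cell above, the call would look like this (while the file is still open):
# full_text = join_pages(read_pdf.getPage(i) for i in range(read_pdf.getNumPages()))
```

Passing `full_text` to `extract_keywords_from_text` would then rank phrases from the whole document rather than from its final page.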

In [17]:
r.preprocess(text)  # the returned string is discarded; the extraction below runs on the raw text
r.extract_keywords_from_text(text)
r.get_ranked_phrases()

['digging language model — maximum entropy phrase extraction',
 'halyomorpha halys using molecular gut content analysis',
 '30 august – 3 september 2010',
 '11 – 14 august 1999',
 '12 – 16 september 2016',
 'keyword extraction using word co',
 '1 – 5 september 2008',
 '4 – 6 december 2010',
 '25 – 29 october 2014',
 '25 – 26 july 2004',
 'neural probabilistic language model',
 'practical automatic keyphrase extraction',
 'rapid automatic keyword extraction',
 'information extraction algorithm based',
 'open access article distributed',
 'keyword extraction based',
 'planning using triz',
 'hidden markov model',
 '1 – 4',
 'natural language processin',
 'natural language processing',
 '917 – 921',
 '564 – 577',
 '54 – 58',
 '54 – 58',
 '51 – 68',
 '491 – 509',
 '46 – 53',
 '4348 – 4360',
 '404 – 411',
 '254 – 255',
 '1532 – 1543',
 '1216 – 1247',
 '1137 – 1155',
 'based hmm approach',
 'http :// creativecommons',
 'emnlp ), doha',
 'creative commons attribution',
 'chinese news document

In [18]:
r.get_ranked_phrases_with_scores()

[(54.166666666666664,
  'digging language model — maximum entropy phrase extraction'),
 (42.8, 'halyomorpha halys using molecular gut content analysis'),
 (25.515151515151512, '30 august – 3 september 2010'),
 (24.18181818181818, '11 – 14 august 1999'),
 (22.015151515151516, '12 – 16 september 2016'),
 (21.666666666666668, 'keyword extraction using word co'),
 (21.015151515151516, '1 – 5 september 2008'),
 (19.68181818181818, '4 – 6 december 2010'),
 (18.848484848484848, '25 – 29 october 2014'),
 (18.848484848484848, '25 – 26 july 2004'),
 (17.5, 'neural probabilistic language model'),
 (16.666666666666668, 'practical automatic keyphrase extraction'),
 (16.166666666666668, 'rapid automatic keyword extraction'),
 (14.666666666666668, 'information extraction algorithm based'),
 (14.5, 'open access article distributed'),
 (11.500000000000002, 'keyword extraction based'),
 (11.0, 'planning using triz'),
 (11.0, 'hidden markov model'),
 (10.681818181818182, '1 – 4'),
 (10.5, 'natural langua

In [19]:
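# PdfFileReader, getNumPages, getPage and extractText belong to the PyPDF2 API before version 3.0.
with open('DataDriven.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    num_pages = read_pdf.getNumPages()
    for x in range(num_pages):
        pageObj = read_pdf.getPage(x)
        page = pageObj.extractText()
        text = str(page)  # reassigned on every iteration: after the loop, text holds the last page only
        print(text)
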
Data-driven materials research enabled by naturallanguage processing and information extraction


Cite as: Appl. Phys. Rev.7, 041317 (2020);doi: 10.1063/5.0021106Submitted: 7 July 2020.Accepted: 19 November 2020.
Published Online: 21 December 2020



Elsa A.Olivetti,1,a)
Jacqueline M.Cole,2,3,4
EdwardKim,5
OlgaKononova,6,7GerbrandCeder,6,7
Thomas Yong-JinHan,8
and Anna M.Hiszpanski8


AFFILIATIONS
1Department of Materials Science and Engineering, MIT, Cambridge, Massachusetts 02139, USA
2Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue,Cambridge CB3 0HE, United Kingdom
3ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus,Didcot OX11 0QX, United Kingdom
4Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site,Philippa Fawcett Drive, Cambridge CB3 0AS, United Kingdom
5Science, Evaluation, and Measurement, Xero, Toronto, Ontario M5H 

researchers to perform signiﬁcant background research, test hypothe-ses, survey the ﬁeld, and form a sound basis for designing andperforming experimental work, saving hours if not months of labor-intensive literature surveying and wasted experiments. Text extractioncan provide data that drive search-engine development in the scientiﬁcdomain and a beginning of active learning systems tied to automatedmaterials discovery and synthesis platforms.
35–37
Beyond data extraction and visualization, researchers may alsoleverage NLP to derive fundamental insight across these data; forexample, NLP may be used to ﬁnd relationships between compoundsby mapping materials mentioned in the text to corresponding chemi-cal structures. This identiﬁcation of relationships and trends is fre-quently done by using various ML techniques on the extracted data.Scientists can search for similar chemical structures or substructures,meaning that text information can be combined with knowledge fromestablished comput

algorithm and hyperparameters but obviously also on the source data.Recent work explored the impact of similarity between pre-trainingdata and target task, particularly in the area of word embeddings.
62
This work proposed to select pre-trained data using the target vocabu-lary covered rate (percentage of the target vocabulary that is also pre-sent in the source data) and language-model perplexity (if the modelﬁnds a sentence very unlikely, in other words dissimilar from the datawhere this language model is trained on, it will assign a low probabilityand therefore high perplexity). The authors found that the effective-ness of pre-trained word vectors depends on whether the source datahave a high vocabulary intersection with target data, while pre-trainedlanguage models can gain more beneﬁts from a similar source. Wenote, therefore, that the choice of corpus used for the training processis critical, pointing to the quality of text and domain-speciﬁcityrequirements.
38A range of word-emb

more of a physical focus include polymers,68,69,102Curie and N/C19eelmagnetic phase-transition temperatures,
51and pulsed-laser depositionprocessing conditions of complex oxides.
103Efforts that can be linkedto physical properties, but are currently focused on materials chemis-try, include solid-state reactions for all inorganic materials, synthesisof inorganic oxides,
47,48,104zeolites,105and nanomaterials.57
Repositories of materials metrology data are also being curated usingNLP tools. For example, a database of UV/vis absorption spectralcharacteristics was auto-generated by mining the experimental valuesof the wavelength of maximum absorption,k
max, and molar extinctioncoefﬁcients,e, of chemicals from the literature.
106Metrology data offera more general data platform to serve an entire physics community;the example given will aid a wide range of optical and optoelectronicapplications. We offer some speciﬁcity around each of these examples.Within the domain of polymers, leading tex

greater value in terms of the much greater effort that is expended toproduce these types of more specialized data. For example, STEMimages that display defects in steels
122or defects that cause structuraltransformations in tungsten sulﬁde
123have been analyzed quantita-tively using deep-learning methods. Interatomic interaction potentialshave also been extracted from STM images using Bayesian infer-ence.
124However, none of these efforts are generalizable or scalable tothe high-throughput data extraction and quantitative analysis ofmicroscopy images, which is needed for data-driven approaches tomaterials physics.The software tool, ImageDataExtractor,
125begins to address thisissue, shifting from assisting manual analysis of images to a generictool that auto-extracts and quantiﬁes microscopy images from docu-ments. This tool executes an autonomous pipeline of image-recognition methods to detect particles in a series of microscopyimages and quantify them in terms of shape, size, and rad

123A. Maksovet al., “Deep learning analysis of defect and phase evolution duringelectron beam-induced transformations in WS 2,”NPJ Comput. Mater.5,12(2019).
124L. Vlcek, A. Maksov, M. Pan, R. K. Vasudevan, and S. V. Kalinin, “Knowledgeextraction from atomically resolved images,”ACS Nano11, 10313–10320(2017).
125K. T. Mukaddem, E. J. Beard, B. Yildirim, and J. M. Cole,“ImageDataExtractor: A tool to extract and quantify data from microscopyimages,”J. Chem. Inf. Model.60, 2492 (2020).
126R. Smith, “An overview of the Tesseract OCR engine,”inNinth InternationalConference on Document Analysis and Recognition (ICDAR 2007)(IEEE,2007), Vol. 2, pp. 629–633.
127C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deepconvolutional networks,”IEEE Trans. Pattern Anal. Mach. Intell.38, 295–307(2016).
128C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional net-work for image super-resolution,” inEuropean Conference on ComputerVision(Springer, 2014), pp. 184–199.
12

In [20]:
r.preprocess(text)

'maksovet deep learning analysis defect phase evolution duringelectron beam induced transformations npj comput mater vlcek maksov pan vasudevan kalinin knowledgeextraction atomically resolved images acs nano mukaddem beard yildirim cole imagedataextractor tool extract quantify data microscopyimages chem inf model smith overview tesseract ocr engine inninth internationalconference document analysis recognition icdar ieee vol dong loy tang image super resolution using deepconvolutional networks ieee trans pattern anal mach intell dong loy tang learning deep convolutional net work image super resolution ineuropean conference computervision springer visual pattern recognition moment invariants ire trans inf theory kim han han machine vision driven automatic recogni tion particle size morphology sem images nanoscale szegedy vanhoucke ioffe shlens wojna rethinking theinception architecture computer vision proceedings ieeeconference computer vision pattern recognition xia nan inception ﬂower 

In [21]:
r.extract_keywords_from_text(text)

In [22]:
r.get_ranked_phrases()

['selective carbon nanotube growth conditions via auto',
 'chemical structure information fromdigital raster images ,” chem',
 '“ optical structure recognition software torecover chemical information',
 'resolution using deepconvolutional networks ,” ieee trans',
 'atomically resolved images ,” acs nano11',
 'maksovet al ., “ deep learning analysis',
 'context dataset ,” in2018 ieee winterapplications',
 'ﬂower classiﬁcation ,” in2017 2ndinternational conference',
 'tesseract ocr engine ,” inninth internationalconference',
 'lation without parallel data ,” arxiv',
 'type ii porous liquids ,” chem',
 'image captioningand visual question answering ,”',
 'parket al ., “ automated extraction',
 'natural language processing tasks ,” inf',
 'optical chemical structure recognition ,” j',
 'mated experimentation ,” acs nano8',
 'moment invariants ,” ire trans',
 'inverse temperature crystallization ,” chemrxiv',
 'xuet al ., “ show',
 'liet al ., “ robot',
 'grosmanet al ., “ eras',
 'anderson

In [23]:
r.get_ranked_phrases_with_scores()

[(49.0, 'selective carbon nanotube growth conditions via auto'),
 (45.366666666666674,
  'chemical structure information fromdigital raster images ,” chem'),
 (44.87499999999999,
  '“ optical structure recognition software torecover chemical information'),
 (38.7952380952381,
  'resolution using deepconvolutional networks ,” ieee trans'),
 (34.11666666666667, 'atomically resolved images ,” acs nano11'),
 (33.66666666666667, 'maksovet al ., “ deep learning analysis'),
 (33.2952380952381, 'context dataset ,” in2018 ieee winterapplications'),
 (32.86666666666667,
  'ﬂower classiﬁcation ,” in2017 2ndinternational conference'),
 (32.36666666666667,
  'tesseract ocr engine ,” inninth internationalconference'),
 (31.366666666666667, 'lation without parallel data ,” arxiv'),
 (31.2, 'type ii porous liquids ,” chem'),
 (31.066666666666666, 'image captioningand visual question answering ,”'),
 (30.5, 'parket al ., “ automated extraction'),
 (28.422222222222224, 'natural language processing tasks

In [24]:
corpus = "The invention relates to a book-keeping machine intended for the calculation and recording of new balances resulting from old balances of an account and the amounts received and paid out. A balance mechanism, which in known manner comprises two kinds of counting wheels, viz. adding and subtracting wheels, serves for the calculation of the new balances."

In [25]:
dp = pd.read_csv('papers.csv', encoding='unicode_escape')

In [26]:
dp

| Unnamed: 0 | paper_id | title | keywords | abstract | session | year |
|---|---|---|---|---|---|---|
| 0 | 1 | Ensemble Statistical and Heuristic Models for ... | statistical word alignment, ensemble learning,... | Statistical word alignment models need large a... | Ensemble Methods | 2014 |
| 1 | 2 | Improving Spectral Learning by Using Multiple ... | representation, spectral learning, discrete fo... | Spectral learning algorithms learn an unknown ... | Ensemble Methods | 2014 |
| 2 | 3 | Applying Swarm Ensemble Clustering Technique f... | software defect prediction, particle swarm opt... | Number of defects remaining in a system provid... | Ensemble Methods | 2014 |
| 3 | 4 | Reducing the Effects of Detrimental Instances | filtering, label noise, instance weighting | Not all instances in a data set are equally be... | Ensemble Methods | 2014 |
| 4 | 5 | Concept Drift Awareness in Twitter Streams | twitter, adaptation models, time-frequency ana... | Learning in non-stationary environments is not... | Ensemble Methods | 2014 |
| ... | ... | ... | ... | ... | ... | ... |
| 443 | 444 | A Machine Learning Tool for Supporting Advance... | machine-learning,unsupervised-learning,knowled... | In the current era of big data, high volumes o... | Machine Learning on Big Data | 2017 |
| 444 | 445 | Advanced ECHMM-Based Machine Learning Tools fo... | workload characterization,hmm,cepstral coeffic... | We present a novel approach for accurate chara... | Machine Learning on Big Data | 2017 |
| 445 | 446 | A Cluster Analysis of Challenging Behaviors in... | cluster analysis,autism spectrum disorder,chal... | We apply cluster analysis to a sample of 2,116... | Machine Learning Applications in Psychiatric R... | 2017 |
| 446 | 447 | Predicting Psychosis Using the Experience Samp... | predicting psychosis,esm,mhealth,svm,gaussian ... | Smart phones have become ubiquitous in the rec... | Machine Learning Applications in Psychiatric R... | 2017 |


In [27]:
with open('papers.csv', 'r') as csvfile:
    csvtext = csvfile.readlines()

mylist = []
for line in csvtext:
    mylist.append(tuple(line.strip().split(', ')))
print(mylist)



In [28]:
text = str(mylist)
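
Splitting each line on `', '` and then calling `str()` on the list of tuples leaves quotes, brackets and commas in `text`, and these show up in the ranked phrases as fragments such as `"',`. A minimal sketch of an alternative, assuming the goal is to rank phrases from a single text column: parse the file with the standard `csv` module and join only that column. `column_text` is a hypothetical helper, and the two sample rows are made up for illustration; only the column names come from the table above.

```python
import csv
import io


def column_text(csv_file, column):
    """Join the non-empty values of one CSV column into a single string."""
    reader = csv.DictReader(csv_file)
    return " ".join(row[column] for row in reader if row.get(column))


# Made-up sample rows; quoted fields may contain commas without breaking the parse.
sample = io.StringIO(
    'paper_id,title,abstract\n'
    '1,"First title","First abstract, with a comma."\n'
    '2,"Second title","Second abstract."\n'
)
abstracts = column_text(sample, "abstract")
print(abstracts)

# With the real file, the call would look like this:
# with open('papers.csv', newline='', encoding='unicode_escape') as f:
#     abstracts = column_text(f, 'abstract')
```

Because the CSV parser handles quoting, the joined text contains sentences only, so the punctuation left behind by `str(mylist)` never reaches the phrase extractor.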

In [29]:
r.preprocess(text)  # the returned string is discarded; the extraction below runs on the raw text
r.extract_keywords_from_text(text)
r.get_ranked_phrases()

['arctic sea ice extent forecasting using support vector regression ," arctic sea ice \',',
 'lecture vdeo indexing using boosted margin maximizing neural networks ," video indexing',
 'extensive empirical study utilizing four bootstrap approaches within bagging framework using three feature rankers along',
 'next generation wireless networks using coloured petri nets ," next generation wireless networks',
 'topic novelty detection using infinite variational inverted dirichlet mixture models ," inverted dirichlet',
 'sunspot number using minimum error entropy cost based kernel adaptive filters ," kernel methods',
 'shot periodic activity recognition using convolutional neural networks ," human activity recognition \',',
 'uncertainty quantified matrix completion using bayesian hierarchical matrix factorization ," bayesian analysis \',',
 'based inferential control system using kernel principle component analysis ," control systems \',',
 'audio signal reconstruction using cartesian gen

In [30]:
data = r.get_ranked_phrases_with_scores()

In [31]:
print(data)



In [32]:
df = pd.DataFrame(data, columns=['score', 'phrase'])
print(df)

           score                                             phrase
0      75.698352  arctic sea ice extent forecasting using suppor...
1      74.915661  lecture vdeo indexing using boosted margin max...
2      67.043804  extensive empirical study utilizing four boots...
3      64.746550  next generation wireless networks using colour...
4      63.029834  topic novelty detection using infinite variati...
...          ...                                                ...
28618   1.000000                                                 *.
28619   1.000000                                                "',
28620   1.000000                                                "',
28621   1.000000                                                "',
28622   1.000000                                                "',

[28623 rows x 2 columns]


In [18]:
r.preprocess(corpus)  # the returned string is discarded; the extraction below runs on the raw corpus
r.extract_keywords_from_text(corpus)
r.get_ranked_phrases()

['known manner comprises two kinds',
 'keeping machine intended',
 'new balances resulting',
 'new balances',
 'old balances',
 'subtracting wheels',
 'invention relates',
 'counting wheels',
 'balance mechanism',
 'amounts received',
 'viz',
 'serves',
 'recording',
 'paid',
 'calculation',
 'calculation',
 'book',
 'adding',
 'account']

In [25]:
data = r.get_ranked_phrases_with_scores()

In [32]:
print(data)

[(25.0, 'known manner comprises two kinds'), (9.0, 'keeping machine intended'), (7.833333333333334, 'new balances resulting'), (4.833333333333334, 'new balances'), (4.333333333333334, 'old balances'), (4.0, 'subtracting wheels'), (4.0, 'invention relates'), (4.0, 'counting wheels'), (4.0, 'balance mechanism'), (4.0, 'amounts received'), (1.0, 'viz'), (1.0, 'serves'), (1.0, 'recording'), (1.0, 'paid'), (1.0, 'calculation'), (1.0, 'calculation'), (1.0, 'book'), (1.0, 'adding'), (1.0, 'account')]


In [44]:
df = pd.DataFrame(data, columns=['score', 'phrase'])
print(df)

        score                            phrase
0   25.000000  known manner comprises two kinds
1    9.000000          keeping machine intended
2    7.833333            new balances resulting
3    4.833333                      new balances
4    4.333333                      old balances
5    4.000000                subtracting wheels
6    4.000000                 invention relates
7    4.000000                   counting wheels
8    4.000000                 balance mechanism
9    4.000000                  amounts received
10   1.000000                               viz
11   1.000000                            serves
12   1.000000                         recording
13   1.000000                              paid
14   1.000000                       calculation
15   1.000000                       calculation
16   1.000000                              book
17   1.000000                            adding
18   1.000000                           account


In [45]:
df.to_csv('rake.csv', index = False)

In [33]:
import sys

In [34]:
rake_object = stopwords.words('english')  # NLTK's English stopword list, not a Rake instance

In [35]:
type(rake_object)

list

In [36]:
top = len(data)

test_score = df['score'] 
test_name= df['phrase'] 

total_precision = 0
total_recall = 0

In [37]:
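correct = 0
num_manual_keywords = len(rake_object)  # number of stopwords, not of manually assigned keywords
scores = set(df['score'])
for i in range(0, len(data)):
    # df was built from data, so every score is found and correct ends up equal to len(data).
    if data[i][0] in scores:
        correct += 1
        total_precision += correct/float(len(data))
        total_recall += correct/float(len(rake_object))
print('correct:', correct, 'out of', num_manual_keywords)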

correct: 28623 out of 179


In [40]:
# Scaling differs between the two: precision is multiplied by 100, recall is not.
avg_precision = round(total_precision*100/float(len(data)), 2)
avg_recall = round(total_recall/float(len(data)), 2)

avg_fmeasure = round(2*avg_precision*avg_recall/(avg_precision + avg_recall), 2)
avg_accuracy = round((total_precision + total_recall)/(float(len(data))*2),2)

print("Precision", avg_precision, "Recall", avg_recall, "F-Measure", avg_fmeasure, "Accuracy", avg_accuracy)

Precision 50.0 Recall 79.96 F-Measure 61.53 Accuracy 40.23
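
The evaluation above compares the scores with themselves and uses the stopword list as the reference set, so it does not measure how well the extracted phrases match human-assigned keywords. A minimal sketch of the usual set-based definitions follows, assuming a gold-standard keyword list is available (for example the `keywords` column of `papers.csv`). `precision_recall_f1` is a hypothetical helper and the `extracted` list is made up for illustration; the gold keywords are those shown for paper 4 in the table.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted phrases with gold keywords using exact, case-insensitive matching."""
    extracted_set = {phrase.strip().lower() for phrase in extracted}
    gold_set = {keyword.strip().lower() for keyword in gold}
    correct = len(extracted_set & gold_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical extractor output for one paper.
extracted = ["label noise", "instance weighting", "data set"]
# Gold keywords as they appear in the keywords column for paper 4.
gold = "filtering, label noise, instance weighting".split(",")

precision, recall, f1 = precision_recall_f1(extracted, gold)
print("Precision", round(precision, 2), "Recall", round(recall, 2), "F-Measure", round(f1, 2))
```

Here two of the three extracted phrases appear among the three gold keywords, so precision, recall and F-measure all equal 2/3. Exact matching is a deliberately strict choice; partial or stemmed matching would give different numbers.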
