In [90]:
text = "The goal of Google Research is to work on long-term, ambitious problems, with an emphasis on solving ones that will dramatically help people throughout their daily lives. In pursuit of that goal in 2019, we made advances in a broad set of fundamental research areas, applied our research to new and emerging areas such as healthcare and robotics, open sourced a wide variety of code and continued collaborations with Google product teams to build tools and services that are dramatically more helpful for our users. As we start 2020, it’s useful to take a step back and assess the research work we’ve done over the past year, and also to look forward to what sorts of problems we want to tackle in the upcoming years. In that spirit, this blog post is a survey of some of the research-focused work done by Google researchers and engineers during 2019 (in the spirit of similar reviews for 2018, and more narrowly focused reviews of some work in 2017 and 2016). For a more comprehensive look, please see our research publications in 2019. In 2018, we published a set of AI Principles that provide a framework by which we evaluate our own research and applications of technologies such as machine learning in our products. In June 2019, we published a one-year update about how these principles are being put into practice in many different aspects of our research and product development life cycles. Since many of the areas touched on by the principles are active areas of research in the broader AI and machine learning research community (such as bias, safety, fairness, accountability, transparency and privacy in machine learning systems), our goals are to apply the best currently-known techniques in these areas to our work, and also to do research to continue to advance the state of the art in these important areas. There is enormous potential for machine learning to help with many important societal issues. 
We have been doing work in several such areas, as well as working to enable others to apply their creativity and skills to solving such problems. Floods are the most common and the most deadly natural disaster on the planet, affecting approximately 250 million people each year. We have been using machine learning, computation and better sources of data to make significantly more accurate flood forecasts, and then to deliver actionable alerts to the phones of millions of people in the affected regions. We also hosted a workshop that brought together researchers with expertise in flood forecasting, hydrology and machine learning from Google and the broader research community to discuss ways to collaborate further on this important problem."

In [91]:
text

'The goal of Google Research is to work on long-term, ambitious problems, with an emphasis on solving ones that will dramatically help people throughout their daily lives. In pursuit of that goal in 2019, we made advances in a broad set of fundamental research areas, applied our research to new and emerging areas such as healthcare and robotics, open sourced a wide variety of code and continued collaborations with Google product teams to build tools and services that are dramatically more helpful for our users. As we start 2020, it’s useful to take a step back and assess the research work we’ve done over the past year, and also to look forward to what sorts of problems we want to tackle in the upcoming years. In that spirit, this blog post is a survey of some of the research-focused work done by Google researchers and engineers during 2019 (in the spirit of similar reviews for 2018, and more narrowly focused reviews of some work in 2017 and 2016). For a more comprehensive look, please 

In [92]:
# This summarizer is based on the “TextRank” algorithm by Mihalcea et al.
# The algorithm was later improved by Barrios et al., who introduced a “BM25” ranking function.
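To make the idea behind TextRank concrete, here is a minimal pure-Python sketch. It is not gensim's implementation: the helper names (`split_sentences`, `textrank`, `summarize_sketch`) are hypothetical, the similarity is the word-overlap measure from the original TextRank paper (computed on sets of distinct words) rather than BM25, and the sentence splitter is a naive regex.

```python
import math
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    # Lowercased set of alphabetic tokens; no stemming or stopword removal.
    return set(re.findall(r'[a-z]+', sentence.lower()))

def similarity(a, b):
    # Shared words, normalized by the log of each sentence's size.
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    return len(a & b) / denom if denom > 0 else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph (no self-loops).
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

def summarize_sketch(text, top_n=2):
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(best)]

toy = ("Floods affect many people. "
       "Machine learning improves flood forecasts for people. "
       "Flood forecasts help people in affected regions. "
       "Cats sleep all day.")
print(summarize_sketch(toy))
```

Sentences that share vocabulary with many other sentences accumulate score, while an unrelated sentence stays at the baseline value of `1 - damping`.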

In [93]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [94]:
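from pprint import pprint as print  # note: shadows the built-in print for the rest of the notebook
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0 (for example 3.8.x)
from gensim.summarization import summarize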

In [95]:
print(summarize(text))

2020-01-12 20:08:44,349 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-01-12 20:08:44,350 : INFO : built Dictionary(134 unique tokens: ['ambiti', 'daili', 'dramat', 'emphasi', 'goal']...) from 13 documents (total 211 corpus positions)


('In pursuit of that goal in 2019, we made advances in a broad set of '
 'fundamental research areas, applied our research to new and emerging areas '
 'such as healthcare and robotics, open sourced a wide variety of code and '
 'continued collaborations with Google product teams to build tools and '
 'services that are dramatically more helpful for our users.\n'
 'We also hosted a workshop that brought together researchers with expertise '
 'in flood forecasting, hydrology and machine learning from Google and the '
 'broader research community to discuss ways to collaborate further on this '
 'important problem.')


In [96]:
print(summarize(text, split=True))

2020-01-12 20:08:44,373 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-01-12 20:08:44,375 : INFO : built Dictionary(134 unique tokens: ['ambiti', 'daili', 'dramat', 'emphasi', 'goal']...) from 13 documents (total 211 corpus positions)


['In pursuit of that goal in 2019, we made advances in a broad set of '
 'fundamental research areas, applied our research to new and emerging areas '
 'such as healthcare and robotics, open sourced a wide variety of code and '
 'continued collaborations with Google product teams to build tools and '
 'services that are dramatically more helpful for our users.',
 'We also hosted a workshop that brought together researchers with expertise '
 'in flood forecasting, hydrology and machine learning from Google and the '
 'broader research community to discuss ways to collaborate further on this '
 'important problem.']


In [97]:
# The “ratio” parameter specifies what fraction of the sentences in the original text should be returned as output.
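As a rough check on what a given ratio means for this text, the sketch below assumes that the number of sentences kept is the sentence count times the ratio, truncated to an integer, and that the default ratio is 0.2. Both are assumptions about gensim's behaviour, though they agree with the log line reporting 13 sentences ("documents") and the two-sentence default summary. `sentences_kept` is a hypothetical helper.

```python
def sentences_kept(n_sentences, ratio):
    """Sentences retained, assuming the product is truncated to an int."""
    return int(n_sentences * ratio)

for ratio in (0.2, 0.5):
    print(f"ratio={ratio}: {sentences_kept(13, ratio)} of 13 sentences")
```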

In [98]:
print(summarize(text, ratio=0.5))

2020-01-12 20:08:44,400 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-01-12 20:08:44,402 : INFO : built Dictionary(134 unique tokens: ['ambiti', 'daili', 'dramat', 'emphasi', 'goal']...) from 13 documents (total 211 corpus positions)


('The goal of Google Research is to work on long-term, ambitious problems, '
 'with an emphasis on solving ones that will dramatically help people '
 'throughout their daily lives.\n'
 'In pursuit of that goal in 2019, we made advances in a broad set of '
 'fundamental research areas, applied our research to new and emerging areas '
 'such as healthcare and robotics, open sourced a wide variety of code and '
 'continued collaborations with Google product teams to build tools and '
 'services that are dramatically more helpful for our users.\n'
 'In 2018, we published a set of AI Principles that provide a framework by '
 'which we evaluate our own research and applications of technologies such as '
 'machine learning in our products.\n'
 'Since many of the areas touched on by the principles are active areas of '
 'research in the broader AI and machine learning research community (such as '
 'bias, safety, fairness, accountability, transparency and privacy in machine '
 'learning system

In [99]:
# The “word_count” parameter specifies roughly how many words we want in the summary.
# Whole sentences are returned, so the result can run slightly over the requested count.
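The summary printed for `word_count=50` in this notebook is a single sentence of 58 words, so the limit is evidently not a hard cap. One plausible explanation, sketched below as an assumption rather than as gensim's actual code, is a greedy rule: take ranked sentences in order and stop as soon as adding the next one would move the total further from the target. `select_by_word_count` is a hypothetical helper.

```python
def select_by_word_count(ranked_sentences, word_count):
    """Greedily keep ranked sentences while the total gets no further from the target."""
    selected = []
    length = 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        # Stop if including this sentence would increase the distance to the target.
        if abs(word_count - length - n_words) > abs(word_count - length):
            break
        selected.append(sentence)
        length += n_words
    return selected

# Placeholder sentences of 58 and 37 words, in rank order.
ranked = [" ".join(["w"] * 58), " ".join(["w"] * 37)]
picked = select_by_word_count(ranked, 50)
print(len(picked))
```

Under this rule the 58-word sentence is accepted because 8 words over is closer to 50 than an empty summary, and the second sentence is rejected because it would overshoot by 45.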

In [100]:
print(summarize(text, word_count=50))

2020-01-12 20:08:44,430 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-01-12 20:08:44,431 : INFO : built Dictionary(134 unique tokens: ['ambiti', 'daili', 'dramat', 'emphasi', 'goal']...) from 13 documents (total 211 corpus positions)


('In pursuit of that goal in 2019, we made advances in a broad set of '
 'fundamental research areas, applied our research to new and emerging areas '
 'such as healthcare and robotics, open sourced a wide variety of code and '
 'continued collaborations with Google product teams to build tools and '
 'services that are dramatically more helpful for our users.')


In [101]:
# Keyword extraction

In [103]:
from gensim.summarization import keywords
print(keywords(text))

('research\n'
 'researchers\n'
 'year\n'
 'years\n'
 'product\n'
 'products\n'
 'floods\n'
 'flood\n'
 'people\n'
 'natural\n'
 'affecting\n'
 'affected\n'
 'areas\n'
 'problems\n'
 'problem\n'
 'important\n'
 'safety\n'
 'accountability')
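The keyword list above reports inflected variants separately (research/researchers, flood/floods). A small post-processing sketch can group them. The suffix stripper below is a deliberately crude, hypothetical helper for illustration, not the stemmer gensim uses internally.

```python
SUFFIXES = ("ers", "ing", "ed", "s")

def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def group_variants(keyword_output):
    """Group a newline-separated keyword string by crude stem, preserving order."""
    groups = {}
    for word in keyword_output.split("\n"):
        groups.setdefault(crude_stem(word), []).append(word)
    return list(groups.values())

sample = "research\nresearchers\nyear\nyears\nfloods\nflood\naffecting\naffected\nsafety"
print(group_variants(sample))
```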


In [114]:
# Try it on a longer text

In [110]:
with open(r'C:\Users\bnawa\turing_imitation-game.txt', 'r') as f2:
    data = f2.read()
    print(data[:10])

'433\nVOL. L'


In [112]:
type(data)

str

In [115]:
print(summarize(data, ratio=0.02))

2020-01-12 20:16:46,738 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-01-12 20:16:46,750 : INFO : built Dictionary(1457 unique tokens: ['vol', 'lix', 'octob', 'quarterli', 'review']...) from 1450 documents (total 5063 corpus positions)


('I PROPOSE to consider the question, ‘Can machines think?’ This should\n'
 'We now ask the question, ‘What will happen when a machine takes\n'
 'means to put the appropriate instruction table into the machine so that it\n'
 'This example is typical of discrete state machines.\n'
 'It will seem that given the initial state of the machine and the input\n'
 'But the number of states of which such a machine\n'
 'the number of states of three Manchester machines put together.\n'
 'Given the table corresponding to a discrete state machine it is possible\n'
 'behaviour of any discrete state machine.\n'
 'played with the machine in question (as B) and the mimicking digital\n'
 'discrete state machine, is described by saying that they are universal\n'
 'various new machines to do various computing processes.\n'
 'suggested tentatively that the question, ‘Can machines think?’ should be\n'
 'general and ask ‘Are there discrete state machines which would do well?’\n'
 'ready to proceed to the deb
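The summary above consists of line fragments rather than full sentences, and the log reports 1450 "documents", which suggests that each hard-wrapped line of the file was treated as a separate sentence. A possible remedy, sketched here with a hypothetical helper `unwrap_hard_breaks`, is to join the lines within each paragraph before summarizing. It assumes paragraphs are separated by blank lines and does not repair words hyphenated across line breaks.

```python
import re

def unwrap_hard_breaks(raw):
    """Join hard-wrapped lines inside each paragraph; keep blank-line paragraph breaks."""
    paragraphs = re.split(r'\n\s*\n', raw)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

wrapped = ("I PROPOSE to consider the question. This should\n"
           "begin with definitions.\n"
           "\n"
           "Next paragraph\n"
           "continues here.")
print(unwrap_hard_breaks(wrapped))
```

The cleaned string could then be passed to the summarizer in place of `data`.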

In [118]:
print(keywords(data, ratio=0.02))

('machines\n'
 'machine\n'
 'number\n'
 'numbers\n'
 'state\n'
 'states\n'
 'stated\n'
 'certain\n'
 'digital\n'
 'digits\n'
 'computing\n'
 'computers\n'
 'computations\n'
 'computation\n'
 'computable\n'
 'time\n'
 'times\n'
 'man\n'
 'form\n'
 'forms\n'
 'forming\n'
 'argument\n'
 'arguments\n'
 'human\n'
 'answer\n'
 'answers\n'
 'answered')


In [119]:
# Montemurro and Zanette’s entropy-based keyword extraction algorithm
# identifies words that play a significant role in the large-scale structure of a text.
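The intuition is that a word tied to particular topics clusters in certain parts of a text, while function words are spread evenly. The sketch below illustrates only that intuition with a plain Shannon entropy over equal-sized blocks. It is a simplification, not gensim's `mz_keywords` implementation (the full method also compares against a randomly shuffled text), and `block_entropy` and the toy data are hypothetical.

```python
import math

def block_entropy(words, target, n_blocks):
    """Shannon entropy (bits) of how `target` is distributed across equal-sized blocks."""
    size = math.ceil(len(words) / n_blocks)
    counts = [words[i:i + size].count(target) for i in range(0, len(words), size)]
    total = sum(counts)
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the blocks where the word occurs.
    return sum((c / total) * math.log2(total / c) for c in counts if c)

# Toy text of 16 words: "the" is spread evenly, "soul" appears only in the first block.
toy_words = ["the", "soul", "the", "soul",
             "the", "a", "the", "b",
             "the", "c", "the", "d",
             "the", "e", "the", "f"]

print(block_entropy(toy_words, "the", 4))
print(block_entropy(toy_words, "soul", 4))
```

An evenly spread word reaches the maximum entropy for four blocks (2 bits), while a word confined to one block scores 0, so a low entropy relative to the maximum marks a word as structurally informative.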

In [120]:
from gensim.summarization import mz_keywords

In [131]:
print(mz_keywords(data, scores=True, threshold=0.001)[:10])

[('the', 0.014861524445267834),
 ('to', 0.007767813733759304),
 ('of', 0.007746836544575982),
 ('a', 0.007642469804406734),
 ('in', 0.00491688390579003),
 ('is', 0.004873761611889092),
 ('be', 0.004746023441471577),
 ('that', 0.004656688509147202),
 ('it', 0.004414697682557859),
 ('computers', 0.003700867781298998)]


In [130]:
print(mz_keywords(data, scores=True, weighted=False, threshold=1.0)[:10])

[('laws', 3.0072808371918014),
 ('conduct', 2.350930600530324),
 ('soul', 2.350930600530324),
 ('child', 2.295789851187828),
 ('punishments', 2.1465665962663456),
 ('rewards', 2.1465665962663456),
 ('souls', 2.1465665962663456),
 ('thinks', 2.1465665962663456),
 ('winter', 2.1465665962663456),
 ('instruction', 1.99605438170079)]
