## Keyword Extraction

RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that identifies key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text.
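
The scoring idea can be sketched in plain Python. This is a simplified illustration, not the `rake_nltk` implementation: the stopword list is a tiny hand-picked assumption, the function names are hypothetical, and tokenization is a plain regex split. Candidate phrases are the runs of words between stopwords and punctuation; each word is scored as degree / frequency, where degree counts the words it co-occurs with inside phrases (itself included), and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption for this sketch;
# rake_nltk uses NLTK's full English stopword list).
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is",
             "of", "on", "that", "the", "to", "with"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text, stopwords=STOPWORDS):
    """Return (score, phrase) pairs ranked highest to lowest."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # co-occurrences within phrases, self included
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(((s, " ".join(p)) for p, s in scored.items()), reverse=True)

sample = "Keyword extraction is fast. Rapid keyword extraction of key phrases."
for score, phrase in rake_scores(sample):
    print(score, phrase)
# 8.0 rapid keyword extraction
# 5.0 keyword extraction
# 4.0 key phrases
# 1.0 fast
```

Longer phrases tend to rank higher under this metric because every word in them gains degree from its neighbours, which is worth keeping in mind when reading the ranked lists later in this notebook.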

In [1]:
import os
import sys
import requests
print(sys.version)

import re
import pandas as pd

3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]


In [2]:
# !pip install --upgrade rake_nltk

In [3]:
from rake_nltk import Rake

In [4]:
pd.set_option('display.max_colwidth', 500)

#### Copy files to local FS from GCP bucket

In [5]:
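def get_gcs_data(bucket_name, folder_name, file_name, path_local):
    """Download one public GCS object into a local directory."""
    url = 'https://storage.googleapis.com/' + bucket_name + '/' + folder_name + '/' + file_name
    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page
    # Use a context manager so the file handle is closed after writing
    with open(os.path.join(path_local, file_name), 'wb') as f:
        f.write(r.content)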

In [6]:
path_books = '/home/jupyter/data/books'

In [7]:
bucket_name = 'msca-bdp-data-open'
folder_name = 'books'
file_name = ['3boat10.txt']
path_local = path_books

os.makedirs(path_local, exist_ok=True)

for file in file_name:
    get_gcs_data (bucket_name = bucket_name,
                 folder_name = folder_name,
                 file_name = file,
                 path_local = path_local)
    print('Downloaded: ' + file)

Downloaded: 3boat10.txt


#### Extract keywords

In [8]:
text = '''

The University of Chicago is an urban research university that has driven new ways of thinking since 1890. Our commitment to free and open inquiry draws inspired scholars to our global campuses, where ideas are born that challenge and change the world.

We empower individuals to challenge conventional thinking in pursuit of original ideas. Students in the College develop critical, analytic, and writing skills in our rigorous, interdisciplinary core curriculum. Through graduate programs, students test their ideas with UChicago scholars, and become the next generation of leaders in academia, industry, nonprofits, and government.

UChicago research has led to such breakthroughs as discovering the link between cancer and genetics, establishing revolutionary theories of economics, and developing tools to produce reliably excellent urban schooling. We generate new insights for the benefit of present and future generations with our national and affiliated laboratories: Argonne National Laboratory, Fermi National Accelerator Laboratory, and the Marine Biological Laboratory in Woods Hole, Massachusetts.

The University of Chicago is enriched by the city we call home. In partnership with our neighbors, we invest in Chicago's mid-South Side across such areas as health, education, economic growth, and the arts. Together with our medical center, we are the largest private employer on the South Side.

In all we do, we are driven to dig deeper, push further, and ask bigger questions—and to leverage our knowledge to enrich all human life. Our diverse and creative students and alumni drive innovation, lead international conversations, and make masterpieces. Alumni and faculty, lecturers and postdocs go on to become Nobel laureates, CEOs, university presidents, attorneys general, literary giants, and astronauts. 
'''

In [9]:
print(text)



The University of Chicago is an urban research university that has driven new ways of thinking since 1890. Our commitment to free and open inquiry draws inspired scholars to our global campuses, where ideas are born that challenge and change the world.

We empower individuals to challenge conventional thinking in pursuit of original ideas. Students in the College develop critical, analytic, and writing skills in our rigorous, interdisciplinary core curriculum. Through graduate programs, students test their ideas with UChicago scholars, and become the next generation of leaders in academia, industry, nonprofits, and government.

UChicago research has led to such breakthroughs as discovering the link between cancer and genetics, establishing revolutionary theories of economics, and developing tools to produce reliably excellent urban schooling. We generate new insights for the benefit of present and future generations with our national and affiliated laboratories: Argonne National Labo

In [10]:
r = Rake() # Uses stopwords for english from NLTK, and all punctuation characters.

r.extract_keywords_from_text(text)

keywords = r.get_ranked_phrases() # To get keyword phrases ranked highest to lowest.

keywords[:20]

['produce reliably excellent urban schooling',
 'open inquiry draws inspired scholars',
 'ask bigger questions —',
 'fermi national accelerator laboratory',
 'marine biological laboratory',
 'thinking since 1890',
 'lead international conversations',
 'largest private employer',
 'interdisciplinary core curriculum',
 'generate new insights',
 'establishing revolutionary theories',
 'college develop critical',
 'argonne national laboratory',
 'urban research university',
 'south side across',
 'driven new ways',
 'challenge conventional thinking',
 'become nobel laureates',
 'alumni drive innovation',
 'uchicago scholars']

### Extracting keywords from the book with RAKE

In [11]:
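directory = '/home/jupyter/data/books/'
book = '3boat10.txt'
# Read the whole book, closing the file handle when done
with open(directory + book) as f:
    textRaw = f.read()
# Replace newlines with spaces to get one continuous string
text = re.sub(r'\n', ' ', textRaw)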

In [12]:
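# Note: `%time` on its own line times an empty statement (hence the microsecond readings below);
# `%%time` as the first line of the cell would time the whole cell instead.
%time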
r.extract_keywords_from_text(text)
keywords = r.get_ranked_phrases()

keywords[:20]

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 11.2 µs


['french mercenaries hover like crouching wolves without',
 'old j ., anyhow ," rejoined harris',
 'chilly churches behind wheezy old men',
 'project gutenberg etext three men',
 'caught fifteen dozen perch yesterday evening ;"',
 'bellow forth roystering drinking songs',
 'eighteen pounds six ounces ," said',
 'bally old coffin ," observed george',
 'aunt maria would mildly observe',
 'scowled fiercely round upon us',
 'marlow manor owned saxon algar',
 'launch would give one final shriek',
 'boards rouses every evil instinct',
 'stout old ladies simply nowhere',
 'ever heard herr slossenn boschen',
 'herr slossenn boschen accompanied',
 'would get herr slossenn boschen',
 'everything ," uncle podger would reply',
 'known disease going within ten miles',
 'completely ruined one late autumn']
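
Several of the phrases above carry tokenization residue such as `,"` or `;"`, and some run to seven tokens. A minimal post-filter, assuming we only want phrases made of plain lowercase words within a chosen length range, might look like the sketch below. `clean_phrases` and its thresholds are hypothetical choices for illustration, not part of `rake_nltk` (whose `Rake` constructor, as far as I know, also accepts `min_length` and `max_length` for the length part).

```python
import re

def clean_phrases(phrases, min_words=2, max_words=5):
    """Keep phrases whose tokens are all alphabetic words, within a word-count range."""
    kept = []
    for phrase in phrases:
        words = phrase.split()
        if not (min_words <= len(words) <= max_words):
            continue
        # Reject phrases containing punctuation tokens or single letters.
        if all(re.fullmatch(r"[a-z]{2,}", w) for w in words):
            kept.append(phrase)
    return kept

sample = ['old j ., anyhow ," rejoined harris',
          'aunt maria would mildly observe',
          'eighteen pounds six ounces ," said',
          'herr slossenn boschen accompanied']
print(clean_phrases(sample))
# ['aunt maria would mildly observe', 'herr slossenn boschen accompanied']
```

In the notebook this could be applied as `clean_phrases(keywords)[:20]`; the filter is deliberately strict and would also drop phrases containing digits or accented characters.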

### Extracting keywords from news articles with RAKE

In [13]:
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/news_toyota.json'

news_df = pd.read_json(news_path, orient='records', lines=True)
news_df.shape

(100, 4)

In [14]:
# Filter non-English news
news_df = news_df[news_df['language']=='english'].reset_index(drop=True)

In [15]:
# Replace \n characters with a period and spaces to avoid problems with analysis
news_df['text'] = news_df['text'].map(lambda x: re.sub(r'\n', '.  ', str(x)))
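
One side effect of replacing every newline with a period is that lines already ending in punctuation, or blank lines, produce sequences such as `. .` in the text. A gentler variant, sketched here as an assumption about what is wanted rather than as the notebook's method, collapses runs of newlines and adds a period only where the preceding line lacks closing punctuation. The function name is hypothetical.

```python
import re

def newlines_to_sentences(text):
    """Turn line breaks into sentence breaks without doubling punctuation."""
    # Collapse each run of newlines (and surrounding whitespace) into one newline.
    text = re.sub(r"\s*\n\s*", "\n", text.strip())
    # Insert a period only where the line does not already end with punctuation.
    text = re.sub(r"(?<![.!?:;,])\n", ". ", text)
    # Any remaining newlines follow punctuation, so a space is enough.
    return text.replace("\n", " ")

print(newlines_to_sentences("declining overall auto market.\n\nFord and Fiat"))
# declining overall auto market. Ford and Fiat
print(newlines_to_sentences("Cash sale\nNo trades"))
# Cash sale. No trades
```

It could be dropped into the cell above as `news_df['text'].map(lambda x: newlines_to_sentences(str(x)))`.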

In [16]:
pd.set_option('display.max_colwidth', 200)
news_df[['text']].head(5)

Unnamed: 0,text
0,"QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No trades. 6477478013"
1,"0 . NEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and pickup trucks continuing to provide a cushion in a declining overall auto market. . Ford and Fiat..."
2,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH INTERIOR NICE. Just has a Little Hale damage car runs. GREAT 167300 MILEAGE. CALL show contact info . ...
3,favorite this post Brand New Toyota Avalon Floor Mats - $115 (New Britain) hide this posting unhide QR Code Link to This Post I have a set of front and rear original Toyota Avalon floor mats in bl...
4,more ads by this user QR Code Link to This Post Black w/Piano Black w/Perforated NuLuxe Seat Trim. CARFAX One-Owner. 31/21 Highway/City MPG Obsidian Priced below KBB Fair Purchase Price!Locally ow...


In [17]:
r = Rake() # Uses stopwords for english from NLTK, and all punctuation characters.

def rake_implement(x, r):
    """Return the RAKE phrases for text `x`, ranked highest to lowest."""
    r.extract_keywords_from_text(x)
    return r.get_ranked_phrases()

In [18]:
news_df['rake_phrases']=news_df['text'].apply(lambda x: rake_implement(x,r))

#### Appending RAKE keywords to Pandas DF

In [19]:
news_df[['text', 'rake_phrases']].head(5)

Unnamed: 0,text,rake_phrases
0,"QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No trades. 6477478013","[qr code link, one owner truck, maintenance receipts available, cash sale, trades, post, 6477478013]"
1,"0 . NEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and pickup trucks continuing to provide a cushion in a declining overall auto market. . Ford and Fiat...","[solid 16 million vehicles amid low unemployment, ram truck brand fell 16 percent, automakers reported mixed us car sales, us car sales fell last year, vehicles ,” said mark laneve, detroit auto s..."
2,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH INTERIOR NICE. Just has a Little Hale damage car runs. GREAT 167300 MILEAGE. CALL show contact info . ...,"[automatic 2005 toyota camry le 4 door 4 cyl automatic, little hale damage car runs, clean inside cloth interior nice, call show contact info, great 167300 mileage, 2450 6473894894, transmission]"
3,favorite this post Brand New Toyota Avalon Floor Mats - $115 (New Britain) hide this posting unhide QR Code Link to This Post I have a set of front and rear original Toyota Avalon floor mats in bl...,"[post brand new toyota avalon floor mats, rear original toyota avalon floor mats, posting unhide qr code link, offers post id, show contact info, mats based, new britain, original wrapping, toyota..."
4,more ads by this user QR Code Link to This Post Black w/Piano Black w/Perforated NuLuxe Seat Trim. CARFAX One-Owner. 31/21 Highway/City MPG Obsidian Priced below KBB Fair Purchase Price!Locally ow...,"[red carpet elite programs define lexus luxury, 5l v6 dohc dual vvt, user qr code link, toyota vehicles unless certified, perforated nuluxe seat trim, kbb fair purchase price, city mpg obsidian pr..."


#### Selecting on RAKE keywords

In [20]:
news_df['rake_phrases']=news_df['text'].apply(lambda x: rake_implement(x,r)).apply(', '.join)

news_df.loc[news_df['rake_phrases'].str.contains("camry", na=False), ['text', 'rake_phrases']].head(5)

Unnamed: 0,text,rake_phrases
1,"0 . NEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and pickup trucks continuing to provide a cushion in a declining overall auto market. . Ford and Fiat...","solid 16 million vehicles amid low unemployment, ram truck brand fell 16 percent, automakers reported mixed us car sales, us car sales fell last year, vehicles ,” said mark laneve, detroit auto sh..."
2,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH INTERIOR NICE. Just has a Little Hale damage car runs. GREAT 167300 MILEAGE. CALL show contact info . ...,"automatic 2005 toyota camry le 4 door 4 cyl automatic, little hale damage car runs, clean inside cloth interior nice, call show contact info, great 167300 mileage, 2450 6473894894, transmission"
7,transmission: automatic QR Code Link to This Post I'm selling this really nice Camry. Automatic transmission. Leather. Moon roof. Toyota certified parts on car. New tires. Alloy wheels. Smog cert ...,"tagged til next september, automatic qr code link, text show contact info, toyota certified parts, really nice camry, offers post id, automatic transmission, unsolicited services, smog cert, new t..."
9,"Filed under: Earnings/Financials , Chrysler , Fiat , Ford , Honda , Nissan , RAM , Toyota . Continue reading Trucks, SUVs — and Camry — shine in mixed U.S. January vehicle sales . Trucks, SUVs —...","january vehicle sales originally appeared, 01 feb 2018 17, autoblog http :// ift, january vehicle sales, feeds .. permalink, camry — shine, camry — shine, continue reading trucks, suvs —, suvs —, ..."
20,"cars As-it-happens update ⋅ February 2, 2018 NEWS 5 hot vehicles in January: Toyota, Nissan among leaders USA TODAY Here are five of the best-selling vehicles in the industry in January, several o...","nissan among leaders usa today, happens update ⋅ february 2, 2018 news 5 hot vehicles, rss feed send feedback, stalwart camry sedan, selling vehicles, remarkable feat, recently overhauled, plungin..."
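
The selection above is a substring match on the comma-joined phrases, which works for a distinctive term like "camry" but would over-match a short one: searching for "ram" would also hit "programs". A whole-word check, sketched below with a hypothetical helper name, avoids that.

```python
import re

def mentions(phrases, term):
    """True if `term` occurs as a whole word in the joined phrase string."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, phrases.lower()) is not None

print(mentions("really nice camry, automatic transmission", "camry"))  # True
print(mentions("ram truck brand fell 16 percent", "ram"))              # True
print(mentions("red carpet elite programs define lexus luxury", "ram"))  # False
```

With pandas the same effect should be available without a helper by passing a regular expression to the existing call, for example `news_df['rake_phrases'].str.contains(r'\bram\b', na=False)`.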


In [21]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Sat, 29 October 2022 10:37:37'