# Keyphrase Extraction

In [1]:
import requests
from bs4 import BeautifulSoup
import os.path
from dateutil import parser
import pandas as pd
import numpy as np

### Wikipedia as a data source
https://en.wikipedia.org/wiki/GameStop

In [2]:
import wikipediaapi

# Note: recent Wikipedia-API releases also require a `user_agent` argument here.
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)

In [3]:
# https://en.wikipedia.org/wiki/GameStop
p_wiki = wiki_wiki.page('GameStop')
print(p_wiki.text)


GameStop Corp. is an American video game, consumer electronics, and gaming merchandise retailer. The company is headquartered in Grapevine, Texas (a suburb of Dallas), and is the largest video game retailer worldwide. As of January 30, 2021, the company operated 4,816 stores including 3,192 in the United States, 253 in Canada, 417 in Australia and New Zealand and 954 in Europe under the GameStop, EB Games, EB Games Australia, Micromania-Zing, ThinkGeek and Zing Pop Culture brands. The company was founded in Dallas in 1984 as Babbage's, and took on its current name in 1999. 
The company's performance declined during the mid-late 2010s due to the shift of video game sales to online shopping and downloads and failed investments by GameStop in smartphone retail. In 2021 however, the company's stock price skyrocketed due to a short squeeze orchestrated by users of the Internet forum r/wallstreetbets. The company received significant media attention during January and February 2021 due to th

## Keyphrase Extraction
http://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/

In [4]:
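def extract_candidate_words(text, good_tags=frozenset(['JJ','JJR','JJS','NN','NNP','NNS','NNPS'])):
    """Return the lowercased adjectives and nouns of `text`, in document order.

    Needs the NLTK tokenizer, POS tagger and stop word data to be downloaded.
    """
    import itertools, nltk, string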

    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # tokenize and POS-tag words
    tagged_words = itertools.chain.from_iterable(nltk.pos_tag_sents(nltk.word_tokenize(sent)
                                                                    for sent in nltk.sent_tokenize(text)))
    # filter on certain POS tags and lowercase all words
    candidates = [word.lower() for word, tag in tagged_words
                  if tag in good_tags and word.lower() not in stop_words
                  and not all(char in punct for char in word)]

    return candidates
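
The filter inside `extract_candidate_words` can be tried in isolation on a hand-tagged sentence. The following is a minimal sketch that skips NLTK entirely: the `tagged` list and the tiny `stop_words` set are made up for illustration and stand in for the output of `nltk.pos_tag_sents` and the NLTK stop word list.

```python
import string

# Hypothetical hand-tagged tokens (word, Penn Treebank tag); in the notebook
# these pairs come from nltk.pos_tag_sents.
tagged = [("The", "DT"), ("largest", "JJS"), ("video", "NN"), ("game", "NN"),
          ("retailer", "NN"), ("operated", "VBD"), ("4,816", "CD"),
          ("stores", "NNS"), (".", ".")]

good_tags = {"JJ", "JJR", "JJS", "NN", "NNP", "NNS", "NNPS"}
stop_words = {"the", "a", "an", "of"}  # illustrative subset, not the NLTK list
punct = set(string.punctuation)

# Same three conditions as in extract_candidate_words: keep adjectives and
# nouns, drop stop words, drop tokens made only of punctuation.
candidates = [word.lower() for word, tag in tagged
              if tag in good_tags
              and word.lower() not in stop_words
              and not all(ch in punct for ch in word)]

print(candidates)
# ['largest', 'video', 'game', 'retailer', 'stores']
```

The verb, the number and the determiner are removed by the tag filter alone; the stop word and punctuation checks only matter for tokens the tagger labels as nouns or adjectives.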

In [5]:
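def score_keyphrases_by_textrank(text, percentKeywords=0.05, maxKeywords=-1):
    """Rank keyphrases in `text` with TextRank; return (phrase, score) pairs, best first.

    percentKeywords: fraction of candidate tokens kept as keywords
                     (falls back to 0.05 when outside the open interval (0, 1)).
    maxKeywords: optional upper bound on the number of keywords (ignored when <= 0).
    """
    from itertools import takewhile, tee
    from nltk.stem import WordNetLemmatizer

    import networkx, nltk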
    
    lemmatizer = WordNetLemmatizer()
    
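    # tokenize and lemmatize all words; the merge step below walks this stream
    words = [lemmatizer.lemmatize(word.lower())
             for sent in nltk.sent_tokenize(text)
             for word in nltk.word_tokenize(sent)]
    # extract *candidate* words; note that these are lowercased but NOT lemmatized,
    # so a plural candidate such as "stores" never matches the lemmatized `words`
    # stream and cannot end up in a merged keyphrase
    candidates = extract_candidate_words(text)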
    # build graph, each node is a unique candidate
    graph = networkx.Graph()
    graph.add_nodes_from(set(candidates))
    # iterate over word-pairs, add unweighted edges into graph
    def pairwise(iterable):
        """s -> (s0,s1), (s1,s2), (s2, s3), ..."""
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)
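    for w1, w2 in pairwise(candidates):
        # every pair from pairwise() is complete, so no emptiness check is needed;
        # sorting is cosmetic because the graph is undirected
        graph.add_edge(*sorted([w1, w2]))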
    # score nodes using default pagerank algorithm, sort by score, keep top percentKeywords
    ranks = networkx.pagerank(graph)
    if 0 < percentKeywords < 1:
        percentKeywordsMaxIdx = int(round(len(candidates) * percentKeywords))
    else:
        percentKeywordsMaxIdx = int(round(len(candidates) * 0.05))
    if (maxKeywords > 0):
        percentKeywordsMaxIdx = int(min(maxKeywords,percentKeywordsMaxIdx))

    word_ranks = {word_rank[0]: word_rank[1]
                  for word_rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True)[:percentKeywordsMaxIdx]}
    keywords = set(word_ranks.keys())
    # merge keywords into keyphrases
    keyphrases = {}
    j = 0
    for i, word in enumerate(words):
        if i < j:
            continue
        if word in keywords:
            kp_words = list(takewhile(lambda x: x in keywords, words[i:i+10]))
            avg_pagerank = sum(word_ranks[w] for w in kp_words) / float(len(kp_words))
            keyphrases[' '.join(kp_words)] = avg_pagerank
            # counter as hackish way to ensure merged keyphrases are non-overlapping
            j = i + len(kp_words)

    return sorted(keyphrases.items(), key=lambda x: x[1], reverse=True)
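
To make the graph and scoring steps less opaque, here is a dependency-free sketch of what happens between `graph.add_edge` and `networkx.pagerank`. It is not the networkx implementation: the helper names `cooccurrence_graph` and `pagerank` are hypothetical, the iteration count is fixed instead of using a convergence tolerance, and it assumes every node has at least one neighbour.

```python
from itertools import tee

def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def cooccurrence_graph(candidates):
    """Undirected, unweighted graph stored as {node: set of neighbours}."""
    graph = {word: set() for word in candidates}
    for w1, w2 in pairwise(candidates):
        if w1 != w2:  # simplification: ignore self-loops from repeated words
            graph[w1].add(w2)
            graph[w2].add(w1)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration; assumes no isolated nodes."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            # each neighbour shares its current rank equally among its edges
            incoming = sum(ranks[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / n + damping * incoming
        ranks = updated
    return ranks

# Toy candidate stream (made up for illustration)
candidates = ["video", "game", "retailer", "video", "game", "store"]
ranks = pagerank(cooccurrence_graph(candidates))
for word, score in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(word, round(score, 4))
```

In this toy stream `game` is adjacent to three distinct words and therefore ends up with the highest score, while `store`, which only ever appears next to `game`, ends up with the lowest.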

In [6]:
keyphrases = score_keyphrases_by_textrank(p_wiki.text,0.01)
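
The second argument controls how many top-ranked words survive as keywords. The sketch below isolates that cutoff logic from `score_keyphrases_by_textrank`; the helper name `keyword_cutoff` and the sample counts are hypothetical, chosen only to show the three branches.

```python
def keyword_cutoff(n_candidates, percent_keywords=0.05, max_keywords=-1):
    """Number of top-ranked words kept as keywords.

    n_candidates counts candidate tokens including repeats, mirroring
    len(candidates) in score_keyphrases_by_textrank.
    """
    if not 0 < percent_keywords < 1:
        percent_keywords = 0.05          # same fallback as the notebook code
    cutoff = int(round(n_candidates * percent_keywords))
    if max_keywords > 0:
        cutoff = min(max_keywords, cutoff)
    return cutoff

print(keyword_cutoff(1000, 0.01))                    # 1% of 1000 tokens
print(keyword_cutoff(200, 1.5))                      # invalid fraction, falls back to 5%
print(keyword_cutoff(1000, 0.05, max_keywords=20))   # capped by max_keywords
```

Because the fraction is applied to the token count rather than to the number of unique graph nodes, a long text with many repeated words keeps proportionally more of its distinct vocabulary.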

In [7]:
for phrase, score in keyphrases:
    print(phrase, score)

gamestop 0.03776213050237676
gamestop company 0.027921758604550824
gamestop store 0.02236003681664294
gamestop chairman 0.02124046013745743
company 0.01808138670672488
game 0.007732813690265353
babbage 0.007505791895491477
march 0.007180009623425898
store 0.006957943130909118
video game store 0.006697646343757228
video game 0.006567497950181282
new video game 0.0064743995163550755
new 0.006288202648702662
neostar 0.005848345567265955
video game retailer 0.005719021623155195
store due 0.00557331002463594
new business 0.005286391211012499
neostar chairman 0.005283567669902027
chairman 0.004718789772538099
february 0.0044437721915037325
business 0.004284579773322336
due 0.004188676918362761
retailer 0.004022068969103022
january 0.003940766126619485
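
A multi-word phrase is scored as the mean PageRank of its member words, which can be checked against the listing above. The sketch below copies three single-word scores from that output; the helper name `phrase_score` is hypothetical.

```python
# Single-word scores copied from the output above
word_ranks = {
    "gamestop": 0.03776213050237676,
    "company": 0.01808138670672488,
    "store": 0.006957943130909118,
}

def phrase_score(phrase):
    """Mean PageRank of the words in a phrase, as in the merge step."""
    words = phrase.split()
    return sum(word_ranks[w] for w in words) / float(len(words))

print(phrase_score("gamestop company"))
print(phrase_score("gamestop store"))
```

Both values agree with the printed scores for `gamestop company` and `gamestop store`. Averaging also explains why those phrases rank above `company` and `store` on their own: the high score of `gamestop` lifts any phrase that contains it.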


In [8]:
newsArticle = u"""
AMC, Other Meme Stocks Turn Options Market Upside Down
Flurry of activity in meme-stock options underscores investors’ fear of missing out on surges

Traders last week spent $11.6 billion on options contracts tied to AMC.
By Gunjan Banerji
June 8, 2021 5:30 am ET

The meteoric rally in meme stocks such as AMC Entertainment Holdings Inc. AMC +4.67% and GameStop Corp. GME +16.52% has unleashed a burst of options trading, upending traditional dynamics in the market for stock bets.

The rush into the stocks coincided with frenzied trading for options—contracts that allow investors to bet on price moves in stocks or protect their portfolios. The once-obscure corner of the market has boomed this year like never before, with many new investors trying their hands during the pandemic shutdowns.

The complicated contracts can be risky to use but have mushroomed into a feature of the meme mania this year. Some individual investors have said that they are drawn to the thrill of options trading, happy to take on higher risks for the prospect of big payouts. They have used the bets to turbocharge their positions, eager to ride the relentless momentum in stocks like GameStop and AMC.

Call options, which allow investors the right to purchase stocks at a set price in the future, have recorded particularly heavy trading. Internet traders and others have favored them for making bullish bets in pursuit of mammoth gains. Their relatively low cost—with just one contract covering 100 shares—has lured many into the market, with activity rising to a fever pitch in recent sessions.

Traders last week spent $11.6 billion on options contracts tied to AMC, more than on the SPDR S&P 500 ETF Trust, Invesco QQQ Trust and Tesla Inc. combined, according to Cboe Global Markets data. Options on those stocks are typically among the market’s most popular.

The recent activity in meme stock options underscores investors’ fear of missing out on the surges. Many traders were positioning for even greater gains in AMC shares. The stock soared 83% last week, surpassing its record hit six years ago. Some of the most popular options contracts on AMC have been bullish calls pegged to shares jumping to $145 or $100.

The stock surged another 15% Monday to start the week and settled at $55, bringing gains for the year to 2494%. Other retail favorites such as GameStop, BlackBerry Ltd. and Koss Corp. also rallied.

“The perceived risk is not that AMC is going to go down to $10. The risk that everybody is worried about is AMC going up to $1000,” said Henry Schwartz, head of product intelligence at Cboe Global Markets. “It does kind of challenge all the normal assumptions that especially professionals tend to make.”

The options-trading activity at times can stoke bigger moves in the shares themselves, traders say, exacerbating swings. The intense activity in meme stocks has also overturned dynamics within the world of options and volatility trading.

Market volatility is a key input to pricing options. The higher the volatility, the pricier options can be: If a stock is recording more extreme swings, that increases the chances the options will pay out. Implied volatility, a measure of how turbulent traders expect stocks to be over a given time frame, typically drops as stocks go up, and climbs when they fall.

Some of the meme stocks have defied those expectations. As AMC share prices hit a record last week, implied volatility for the stock jumped to the highest level in around four months, according to Susquehanna Financial Group. Meanwhile, expected swings in GameStop and BlackBerry hit the highest levels in months—even as the stocks surged.

“If the market crashed tomorrow, would things get quieter or they’d go crazy? Well, they’d get more crazy, that spooks everybody,” Mr. Schwartz said. “What happens in these meme stocks is they also get much more volatile when the stocks go up.”

And typically, investors pay more to protect themselves from stock declines than they do for bullish wagers. That hasn’t been the case at times in meme stocks and a handful of other bets over the past year, like some special-purpose acquisition companies, analysts said.

“These traditional relationships between volatility and stocks have been turned on their heads in meme stocks,” said Chris Murphy, co-head of derivatives strategy at Susquehanna.
"""

In [9]:
keyphrases = score_keyphrases_by_textrank(newsArticle, 0.05)
for phrase, score in keyphrases:
    print(phrase, score)

amc 0.026848854944986993
market 0.01962557833692728
market volatility 0.018226449906846014
volatility 0.01682732147676475
stock 0.016513564914189503
meme stock 0.016460878822269313
meme 0.016408192730349123
volatility trading 0.014992205392970567
activity 0.014058117158663045
trading 0.013157089309176388
year 0.012474610692923296
gamestop 0.010878518109953075


In [10]:
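# Requires a textacy release that still provides `textacy.ke`;
# in textacy 0.11 and later these functions live under `textacy.extract.keyterms`.
import textacy
import textacy.ke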

en = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
doc = textacy.make_spacy_doc(newsArticle, lang=en)

In [11]:
print("SGRank output: ", [term for term, weight in textacy.ke.sgrank(doc)])

SGRank output:  ['Cboe Global Markets', 'trader last week', 'AMC', 'stock', 'meme stock', 'stock option', 'option contract', 'option trading', 'market', 'investor']
