# spaCy Introduction
spaCy is a Python library for natural language processing (NLP).
It provides neural network models for tagging, parsing, named entity recognition (NER), and text classification.

Below are a few examples of how to use spaCy for tokenization and NER.

## Install dependencies

In [None]:
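import importlib
import importlib.util
import sys

def installModule(projectName:str, moduleName:str=None):
    '''Installs the given project with pip if its module cannot be found'''
    if moduleName is None:
        moduleName=projectName
    # find_spec checks whether the module is installed,
    # not merely whether it has already been imported
    if importlib.util.find_spec(moduleName) is None:
        # use the interpreter of the running kernel so the package lands in the right environment
        !{sys.executable} -m pip install --no-input {projectName}
        # make the freshly installed package visible to the import system
        importlib.invalidate_caches()
        print(f'{projectName} installed')
    else:
        print(f'{projectName} found')

installModule('spacy')
installModule('newspaper3k', 'newspaper')
# tabulate is used below to print the found entities as a table
installModule('tabulate')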

## Download Models

In [None]:
!python -m spacy download en_core_web_sm

## Examples
### Initial Example
Taken from https://spacy.io/

In [None]:
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
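
The introduction mentions tokenization, which the example above performs implicitly as the first step of `nlp(text)`. A minimal sketch of tokenization on its own, assuming a blank English pipeline created with `spacy.blank("en")` (tokenizer only, so no model download is needed); the sample sentence is made up for illustration:

```python
import spacy

# Blank English pipeline: contains only the rule-based tokenizer,
# no tagger, parser or NER (assumption: this is enough for a tokenization demo)
nlp_blank = spacy.blank("en")

# Hypothetical sample sentence, not taken from the examples above
doc = nlp_blank("spaCy splits text into tokens.")

# Each Token exposes its original text via .text
tokens = [token.text for token in doc]
print(tokens)
```

Note that the final period becomes a token of its own, while `spaCy` stays in one piece.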

## Finding Locations

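The named entity labels __LOC__ (non-GPE locations such as mountain ranges and bodies of water) and __GPE__ (geopolitical entities such as countries, cities, and states) mark the location entities that were found.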

In [None]:
import spacy
from tabulate import tabulate
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
# taken from https://obamawhitehouse.archives.gov/issues/foreign-policy/asia-trip-2012
text="""
 Trip Schedule

SUNDAY

    President Obama visits Wat Pho Royal Monastery in Bangkok, Thailand
    President Obama receives a Royal Audience with King Bhumibol Adulyadej
    President Obama takes part in a formal welcome ceremony at Thai Koo Fah Building
    President and Prime Minister Yingluck Shinawatra of Thailand will hold a joint press conference, and then attend an official dinner together. This press conference was open press.
    Read more about President Obama's first stop in Asia 

MONDAY

    President Obama travels to Burma where he meets with President Thein Sein and Aung San Suu Kyi and delivers a speech to encourage Burma’s ongoing democratic transition.
    Read more about President Obama promised support for the people of Burma
    In the evening, the President will travel to Cambodia, where he will attend the East Asia Summit and meet with the leaders of the Association of Southeast Asian Nations.

TUESDAY

    President Obama attended the East Asia Summit in Cambodia

"""
doc = nlp(text)
foundEntities=[{"Text":entity.text, "Entity Tag":entity.label_} for entity in doc.ents]
print(tabulate(foundEntities, headers="keys"))
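
The table above lists every entity, not only the locations. One possible way to keep just the __LOC__ and __GPE__ entries is sketched below. The helper name `filterLocations` and the sample pairs are hypothetical; the pairs stand in for `[(entity.text, entity.label_) for entity in doc.ents]` so the sketch runs without a loaded model.

```python
# Labels that are treated as locations in this notebook
LOCATION_LABELS = {"LOC", "GPE"}

def filterLocations(entities):
    '''Keeps only the (text, label) pairs whose label marks a location'''
    return [(text, label) for text, label in entities if label in LOCATION_LABELS]

# Hand-written sample pairs (assumption: illustrative only, not actual model output)
sampleEntities = [
    ("Obama", "PERSON"),
    ("Bangkok", "GPE"),
    ("Thailand", "GPE"),
    ("Asia", "LOC"),
    ("MONDAY", "DATE"),
]
locations = filterLocations(sampleEntities)
print(locations)
```

With a processed document, the same helper could be called as `filterLocations([(entity.text, entity.label_) for entity in doc.ents])`.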

## URL as Text Input

This example downloads the page at https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm with newspaper and uses the extracted article text as input.

In [None]:
from newspaper import Article
from tabulate import tabulate
import spacy
url="https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm"
article = Article(url)
article.download()
article.parse()
text=article.text

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
# Find named entities, phrases and concepts
foundEntities=[{"Text":entity.text, "Entity Tag":entity.label_} for entity in doc.ents]
print(tabulate(foundEntities, headers="keys"))