# Introduction to NLP with spaCy
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human (natural) languages. It enables computers to read, understand, and derive meaning from human languages.

spaCy is a powerful and fast NLP library with pre-trained models for various languages, designed specifically for production use. spaCy helps us perform multiple NLP tasks, such as tokenization, named entity recognition (NER), and semantic similarity assessments.

## Installation and Setup
Before we start, ensure that spaCy is installed and that you have downloaded the language model for English. If not, you can install spaCy and download the English model using the following commands:

In [10]:
# Uncomment the pair of commands that matches your platform, then run the cell.

# Linux
# !pip install spacy
# !python -m spacy download en_core_web_sm

# macOS
# !pip3 install spacy
# !python3 -m spacy download en_core_web_sm

# Windows
# %pip install spacy
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
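
Before loading a model, it can help to confirm that both spaCy and the model are importable in the current environment. The sketch below uses only the standard library and relies on the fact that the model is installed as a regular Python package (the wheel shown in the output above). The helper name `is_installed` is hypothetical, introduced here for illustration.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return find_spec(package_name) is not None


# Report the status of spaCy itself and of the small English model
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, run the matching install command from the cell above and restart the kernel.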


## Getting Started with spaCy
First, let's import spaCy and load the English language model.

In [1]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')


## Tokenization
Tokenization is the process of splitting a text into meaningful segments, called tokens, such as words, punctuation marks, and symbols. It is typically the first step in an NLP pipeline, and every later analysis builds on it.

In [25]:
# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion. Apple trees are awesome."

# Process the text
doc = nlp(text)

# Iterate over tokens
for token in doc:
    print(token.text)


Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
Apple
trees
are
awesome
.
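
To see why a dedicated tokenizer is worth having, it may help to compare the output above with a naive whitespace split. The sketch below uses only built-in string methods. The `spacy_tokens` list is copied by hand from the output above rather than computed, so the snippet runs without any model installed.

```python
# Same sample text as in the cell above
text = "Apple is looking at buying U.K. startup for $1 billion. Apple trees are awesome."

# Naive tokenization: split on whitespace only
naive_tokens = text.split()

# Tokens as printed by spaCy in the output above (copied by hand)
spacy_tokens = [
    "Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for",
    "$", "1", "billion", ".", "Apple", "trees", "are", "awesome", ".",
]

# Items the naive split produced that spaCy broke up further
differs = [tok for tok in naive_tokens if tok not in spacy_tokens]

print(f"Naive split: {len(naive_tokens)} tokens")
print(f"spaCy:       {len(spacy_tokens)} tokens")
print(f"Handled differently: {differs}")
```

The whitespace split leaves the currency symbol and the sentence-final periods attached to their neighbours, while spaCy separates them and still keeps the abbreviation `U.K.` as a single token.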


## Part-of-Speech (POS) Tagging and Dependency Parsing
After tokenization, spaCy can also perform POS tagging and dependency parsing to understand the grammatical structure of a sentence.

In [28]:
for token in doc:
    print(f"{token.text:12} {token.pos_:16} {token.dep_}")


Apple        PROPN            nsubj
is           AUX              aux
looking      VERB             ROOT
at           ADP              prep
buying       VERB             pcomp
U.K.         PROPN            dobj
startup      NOUN             dep
for          ADP              prep
$            SYM              quantmod
1            NUM              compound
billion      NUM              pobj
.            PUNCT            punct
Apple        NOUN             compound
trees        NOUN             nsubj
are          AUX              ROOT
awesome      ADJ              acomp
.            PUNCT            punct


In [32]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True, options={'distance': 200})

## Named Entity Recognition (NER)
NER identifies the named entities in a text (names of things, such as companies, locations, and quantities) and assigns each one a label. This is useful in many applications, such as information retrieval, fact extraction, and content classification.



In [61]:
! py -m spacy download en_core_web_lg


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)


In [60]:
# Load the large English model
nlp = spacy.load('en_core_web_lg')
text = "Apple decided to buy U.K. startup for $1 billion in 2021. Twitter is battling Apple for top dog spot."

# Process the text
doc = nlp(text)

# Iterate over the detected entities and print each with its label
for ent in doc.ents:
    print(f"{ent.text:17} {ent.label_}")


Apple             ORG
U.K.              GPE
$1 billion        MONEY
2021              DATE
Y                 ORG
Apple             ORG


## Semantic Similarity Assessments
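spaCy can compare two documents, spans, or tokens and estimate how similar they are. By default, the score is the cosine similarity of their word vectors, and a document's vector is the average of its token vectors. Similarity scores are useful for recommendation systems, search engines, and more.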

In [81]:
# To avoid the missing-word-vectors warning raised with the small model,
# load a model that ships with word vectors
# !python3 -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

# Compare two short texts
doc1 = nlp("A")
doc2 = nlp("a")

# Compute similarity
similarity = doc1.similarity(doc2)
print(f"Document similarity: {similarity:.2f}")

# The next examples reuse this model, since they also rely on word vectors

Document similarity: 0.63
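
To make the score less of a black box, here is a minimal pure-Python sketch of the calculation, assuming the default behaviour in which a document vector is the average of its token vectors and the score is their cosine similarity. The three-dimensional `toy_vectors` are hypothetical values invented for illustration (real models use a few hundred dimensions), and the helper names `average_vector` and `cosine_similarity` are not part of spaCy.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # guard against division by zero (a choice made for this sketch)
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", for illustration only
toy_vectors = {
    "plane": [0.9, 0.1, 0.0],
    "flight": [0.8, 0.2, 0.1],
    "romance": [0.0, 0.1, 0.9],
}

# A "document" vector is the average of its token vectors
doc_a = average_vector([toy_vectors["plane"], toy_vectors["flight"]])
doc_b = average_vector([toy_vectors["flight"]])
doc_c = average_vector([toy_vectors["romance"]])

print(f"a vs b: {cosine_similarity(doc_a, doc_b):.2f}")
print(f"a vs c: {cosine_similarity(doc_a, doc_c):.2f}")
```

Because the document vector is an average, word order is ignored and long texts tend to drift toward a generic middle, which is worth keeping in mind when interpreting the scores.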


## Building a Basic Film Recommendation Engine
For the film recommendation engine, we'll start by creating a dummy dataset of film descriptions. Then, we'll use spaCy to compute semantic similarities between a query film and the dataset to recommend similar films.

In [91]:
# Dummy dataset
films = {
    "Film A": "A sci-fi adventure set in the future",
    "Film B": "A documentary about the history of aviation",
    "Film C": "A romantic comedy set in New York",
}

# Query
query = "An adventure through the skies"

# Process the query and each film description
query_doc = nlp(query)
scores = {}

for film, description in films.items():
    film_doc = nlp(description)
    similarity = query_doc.similarity(film_doc)
    scores[film] = round(similarity, 2)

# Recommend the film with the highest similarity
recommended_film = max(scores, key=scores.get)
print(scores)
print(f"Recommended Film: {recommended_film}")


{'Film A': 0.52, 'Film B': 0.62, 'Film C': 0.44}
Recommended Film: Film B
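
A recommender usually returns a short ranked list, not just one title. The sketch below shows one possible way to do that with the standard library. The `scores` dictionary is copied from the output above, and `top_k` with its `k` and `min_score` parameters is a hypothetical helper introduced for illustration.

```python
# Scores copied from the output above
scores = {'Film A': 0.52, 'Film B': 0.62, 'Film C': 0.44}


def top_k(scores, k=2, min_score=0.0):
    """Return up to k (film, score) pairs with score >= min_score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(film, score) for film, score in ranked if score >= min_score][:k]


# Print the two best matches for the query
for film, score in top_k(scores, k=2):
    print(f"{film}: {score:.2f}")
```

With these scores, the loop prints Film B first and Film A second. Setting `min_score` lets you drop weak matches instead of always returning `k` results.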


## Conclusion and Next Steps
In this lecture, we covered the basics of NLP with spaCy: tokenization, POS tagging and dependency parsing, NER, and semantic similarity. We also built a basic film recommendation engine. spaCy is a powerful tool for NLP tasks, and we encourage you to explore its capabilities further.

For more advanced topics and applications, consider exploring dependency parsing in more depth, training custom NER models, and integrating spaCy with machine learning frameworks for comprehensive NLP solutions.