# Introduction to NLP with spaCy
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human (natural) languages. It enables computers to read, understand, and derive meaning from human languages.

spaCy is a powerful and fast NLP library with pre-trained models for various languages, designed specifically for production use. spaCy helps us perform multiple NLP tasks, such as tokenization, named entity recognition (NER), and semantic similarity assessments.

## Installation and Setup
Before we start, make sure that spaCy is installed and that the English language model has been downloaded. If either is missing, install spaCy and download the model with the following commands:

In [1]:
# On systems where `pip` and `python` already point to Python 3, use these instead:
# !pip install spacy
# !python -m spacy download en_core_web_sm
!pip3 install spacy
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
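
If you are unsure whether the installation worked, a quick check like the one below can help. It is a minimal sketch that uses only the standard library: the helper name `is_installed` is made up for this example, and it assumes the model was installed as a regular Python package named `en_core_web_sm`, as the download output above suggests.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the library and of the English model package
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the corresponding install or download command.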


## Getting Started with spaCy
First, let's import spaCy and load the English language model.

In [2]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')


## Tokenization
Tokenization splits a text into meaningful segments called tokens, such as words, punctuation marks, and symbols. It is the first step in a spaCy pipeline, and every later analysis builds on it.

In [3]:
# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text
doc = nlp(text)

# Iterate over tokens
for token in doc:
    print(token.text)


Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
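
To see why a dedicated tokenizer matters, the sketch below compares the spaCy tokens printed above with a naive whitespace split of the same sentence. It is only an illustration: `spacy_tokens` is copied by hand from the output above rather than recomputed, and the variable names are made up for this example.

```python
# Same sentence as in the cell above
text = "Apple is looking at buying U.K. startup for $1 billion"

# Naive approach: split on whitespace only
naive_tokens = text.split()

# Tokens copied from the spaCy output above (not recomputed here)
spacy_tokens = ["Apple", "is", "looking", "at", "buying", "U.K.",
                "startup", "for", "$", "1", "billion"]

# Tokens that only one of the two approaches produces
only_naive = [t for t in naive_tokens if t not in spacy_tokens]
only_spacy = [t for t in spacy_tokens if t not in naive_tokens]

print(f"Whitespace split: {len(naive_tokens)} tokens, spaCy: {len(spacy_tokens)} tokens")
print(f"Only in whitespace split: {only_naive}")
print(f"Only in spaCy output: {only_spacy}")
```

The whitespace split keeps the currency symbol attached to the number, while spaCy separates them and still treats the abbreviation `U.K.` as a single token.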


## Part-of-Speech (POS) Tagging and Dependency Parsing
After tokenization, spaCy assigns each token a part-of-speech tag (`token.pos_`) and a dependency label (`token.dep_`) that describes its grammatical relation to another token in the sentence.

In [4]:
for token in doc:
    print(f"{token.text:12} {token.pos_:6} {token.dep_}")


Apple        PROPN  nsubj
is           AUX    aux
looking      VERB   ROOT
at           ADP    prep
buying       VERB   pcomp
U.K.         PROPN  dobj
startup      NOUN   dep
for          ADP    prep
$            SYM    quantmod
1            NUM    compound
billion      NUM    pobj
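
Once the tags are available, ordinary Python is enough to query them. The sketch below counts the POS tags and looks up the root and the subject of the sentence. To stay self-contained it works on `rows`, a hand-copied version of the output above; the variable names are illustrative and not part of spaCy.

```python
from collections import Counter

# (text, pos, dep) triples copied from the output above
rows = [
    ("Apple", "PROPN", "nsubj"),
    ("is", "AUX", "aux"),
    ("looking", "VERB", "ROOT"),
    ("at", "ADP", "prep"),
    ("buying", "VERB", "pcomp"),
    ("U.K.", "PROPN", "dobj"),
    ("startup", "NOUN", "dep"),
    ("for", "ADP", "prep"),
    ("$", "SYM", "quantmod"),
    ("1", "NUM", "compound"),
    ("billion", "NUM", "pobj"),
]

# How often each part-of-speech tag occurs
pos_counts = Counter(pos for _, pos, _ in rows)

# The root of the dependency tree and the nominal subject(s)
root = next(text for text, _, dep in rows if dep == "ROOT")
subjects = [text for text, _, dep in rows if dep == "nsubj"]

print(pos_counts.most_common())
print(f"Root: {root}, subjects: {subjects}")
```

With a processed document at hand, the same triples can be built directly with `rows = [(t.text, t.pos_, t.dep_) for t in doc]`.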


## Named Entity Recognition (NER)
NER identifies spans of text that name real-world things, such as companies, locations, and quantities, and assigns each span a label. This is useful in many applications, such as information retrieval, fact extraction, and content classification.



In [5]:
# A new sample text
text = "Apple decided to buy U.K. startup for $1 billion in 2021."

# Process the text
doc = nlp(text)

# Iterate over the detected entities
for ent in doc.ents:
    print(f"{ent.text:17} {ent.label_}")


Apple             ORG
U.K.              GPE
$1 billion        MONEY
2021              DATE
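
In applications such as fact extraction, it is often handy to group the detected entities by label. The following is a minimal sketch of that idea. The `entities` list is copied by hand from the output above, and the names `entities` and `by_label` are chosen only for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [
    ("Apple", "ORG"),
    ("U.K.", "GPE"),
    ("$1 billion", "MONEY"),
    ("2021", "DATE"),
]

# Collect entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label:6} {texts}")
```

With a processed document, the same pairs can be built with `entities = [(ent.text, ent.label_) for ent in doc.ents]`.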


## Semantic Similarity Assessments
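spaCy can compare two documents, spans, or tokens and estimate how similar they are from their word vectors. By default, the score is the cosine similarity of the averaged vectors, so its quality depends on the vectors shipped with the loaded model. This is useful for recommendation systems, search engines, and more.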

In [6]:
# Comparing two sentences
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Compute similarity
similarity = doc1.similarity(doc2)
print(f"Document similarity: {similarity:.2f}")


Document similarity: 0.37


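In [12]:
# The small model (en_core_web_sm) ships without word vectors, so the call above
# triggers a spaCy warning and the score is not a reliable similarity estimate.
# Repeat the comparison with the medium model, which includes word vectors.
# !python3 -m spacy download en_core_web_md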
nlp = spacy.load('en_core_web_md')

# Comparing two sentences
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Compute similarity
similarity = doc1.similarity(doc2)
print(f"Document similarity: {similarity:.2f}")

# Keep en_core_web_md loaded: the next examples also rely on its word vectors

Document similarity: 0.69
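
To build some intuition for where such a score comes from, here is a rough pure-Python sketch of the default behaviour as the spaCy documentation describes it: average the word vectors of each text, then take the cosine similarity of the two averages. The three-dimensional `toy_vectors` are invented for illustration (real model vectors have hundreds of dimensions), and `mean_vector` and `cosine_similarity` are hypothetical helpers, not spaCy functions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy vectors, for illustration only
toy_vectors = {
    "fries": [0.9, 0.1, 0.0],
    "hamburgers": [0.8, 0.2, 0.1],
    "food": [0.7, 0.3, 0.1],
    "aviation": [0.0, 0.1, 0.9],
}

# Each "document" is represented by the average of its word vectors
doc_a = mean_vector([toy_vectors["fries"], toy_vectors["hamburgers"]])
doc_b = mean_vector([toy_vectors["food"]])
doc_c = mean_vector([toy_vectors["aviation"]])

print(f"fries/hamburgers vs food:     {cosine_similarity(doc_a, doc_b):.2f}")
print(f"fries/hamburgers vs aviation: {cosine_similarity(doc_a, doc_c):.2f}")
```

Because the vectors are averaged, word order is ignored, which is one reason these scores are best read as a rough signal rather than a precise measure of meaning.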


## Building a Basic Film Recommendation Engine
For the film recommendation engine, we'll start with a small dummy dataset of film descriptions. Then, we'll use spaCy to score each description against a short text query and recommend the film whose description is most similar.

In [13]:
# Dummy dataset
films = {
    "Film A": "A sci-fi adventure set in the future",
    "Film B": "A documentary about the history of aviation",
    "Film C": "A romantic comedy set in New York",
}

# Query
query = "A futuristic adventure"

# Process the query and each film description
query_doc = nlp(query)
scores = {}

for film, description in films.items():
    film_doc = nlp(description)
    similarity = query_doc.similarity(film_doc)
    scores[film] = similarity

# Recommend the film with the highest similarity
recommended_film = max(scores, key=scores.get)
print(f"Recommended Film: {recommended_film}")


Recommended Film: Film A
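
A recommender usually returns several candidates rather than a single best match. The sketch below shows one way to rank a `scores` dictionary like the one built above and keep the top results. The helper `rank_films`, its parameters `top_n` and `min_score`, and the numbers in `example_scores` are all hypothetical; the values are placeholders, not real spaCy output.

```python
def rank_films(scores, top_n=2, min_score=0.0):
    """Return up to top_n (film, score) pairs with score >= min_score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(film, score) for film, score in ranked if score >= min_score][:top_n]


# Placeholder scores for illustration only (not produced by spaCy)
example_scores = {"Film A": 0.82, "Film B": 0.41, "Film C": 0.55}

for film, score in rank_films(example_scores, top_n=2):
    print(f"{film}: {score:.2f}")
```

In the notebook, the same helper could be called as `rank_films(scores)` on the dictionary computed in the cell above. The `min_score` cutoff is a simple way to avoid recommending films that only weakly match the query.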


## Conclusion and Next Steps
In this lecture, we covered the basics of NLP with spaCy: tokenization, POS tagging and dependency parsing, NER, and semantic similarity. We also built a basic film recommendation engine on top of document similarity. spaCy is a powerful tool for NLP tasks, and we encourage you to explore its capabilities further.

For more advanced topics and applications, consider exploring dependency parsing in more depth, training custom NER models, and integrating spaCy with machine learning frameworks for comprehensive NLP solutions.