# spaCy
spaCy is a fast, open-source Python library for Natural Language Processing (NLP). It handles common NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. It is widely used in both academic and industrial applications because of its speed, accuracy, and ease of use.
###### Key Features of SpaCy:
- Tokenization (token.text)
- Part-of-Speech Tagging (token.pos_)
- Named Entity Recognition (for ent in doc.ents: print(ent.text, ent.label_))
- Dependency Parsing (token.dep_)
- Lemmatization (token.lemma_)
- Vectorization (token1.similarity(token2))
- Text Classification
- Pre-trained Models, e.g. nlp = spacy.load('en_core_web_sm')
- Pipeline Customization
- Fast and Efficient

In [4]:
#!pip install spacy
#!pip install --upgrade numpy
#!pip install --upgrade h5py
#!python -m spacy download en_core_web_sm

import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("data science and ai has great career ahead")

1. English
- en_core_web_sm — Small model (fast, less accurate)
- en_core_web_md — Medium model (balanced between speed and accuracy)
- en_core_web_lg — Large model (most accurate, slower)
- en_core_web_trf — Transformer-based model (very accurate, slower, but good for complex tasks)

2. French
- fr_core_news_sm — Small model
- fr_core_news_md — Medium model
- fr_core_news_lg — Large model

3. German
- de_core_news_sm — Small model
- de_core_news_md — Medium model
- de_core_news_lg — Large model

4. Similar models are available for many other languages

5. Other Multilingual Models 
- xx_ent_wiki_sm — A multilingual named entity recognition model trained on Wikipedia data that can handle many languages. Useful for entity extraction when a language-specific model isn’t available. It supports over 50 languages.
- xx_sent_ud_sm — A multilingual sentence boundary detection model for many languages.
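
The package names listed above follow spaCy's naming convention of language, type, genre, and size, separated by underscores. As a small illustration, the sketch below splits a name into those four parts. `parse_model_name` and `ModelName` are hypothetical helpers written for this note, not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"; "xx" marks multilingual packages
    kind: str   # capabilities, e.g. "core", "ent", "sent"
    genre: str  # kind of training text, e.g. "web", "news", "wiki"
    size: str   # "sm", "md", "lg" or "trf"


def parse_model_name(name: str) -> ModelName:
    """Split a spaCy-style package name into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    return ModelName(*parts)


for name in ["en_core_web_sm", "fr_core_news_md", "xx_ent_wiki_sm"]:
    print(parse_model_name(name))
```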

In [7]:
for token in doc:
    print(token.text,":",token.pos,'-->',token.lemma_,token.dep_)
# token.pos is the integer ID of the part-of-speech tag; token.pos_ gives the readable label (e.g. NOUN).
# token.dep_ (syntactic dependency): the label shows how the token is grammatically connected to other tokens in the sentence.

data : 92 --> data compound
science : 92 --> science nsubj
and : 89 --> and cc
ai : 100 --> ai conj
has : 100 --> have ROOT
great : 84 --> great amod
career : 92 --> career dobj
ahead : 86 --> ahead advmod


In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text,";", token.lemma_,";", token.pos_,";",token.tag_,";",token.dep_,";",token.shape_,";",token.is_alpha,";",token.is_stop)

Apple ; Apple ; PROPN ; NNP ; nsubj ; Xxxxx ; True ; False
is ; be ; AUX ; VBZ ; aux ; xx ; True ; True
looking ; look ; VERB ; VBG ; ROOT ; xxxx ; True ; False
at ; at ; ADP ; IN ; prep ; xx ; True ; True
buying ; buy ; VERB ; VBG ; pcomp ; xxxx ; True ; False
U.K. ; U.K. ; PROPN ; NNP ; nsubj ; X.X. ; False ; False
startup ; startup ; VERB ; VBD ; ccomp ; xxxx ; True ; False
for ; for ; ADP ; IN ; prep ; xxx ; True ; True
$ ; $ ; SYM ; $ ; quantmod ; $ ; False ; False
1 ; 1 ; NUM ; CD ; compound ; d ; False ; False
billion ; billion ; NUM ; CD ; pobj ; xxxx ; True ; False


text = """There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[4] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured """

In [10]:
text="""The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes. It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and the island tigers of the Sunda Islands.

Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. The tiger is an apex predator and preys mainly on ungulates, which it takes by ambush. It lives a mostly solitary life and occupies home ranges, defending these from individuals of the same sex. The range of a male tiger overlaps with that of multiple females with whom he mates. Females give birth to usually two or three cubs that stay with their mother for about two years. When becoming independent, they leave their mother's home range and establish their own.

Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China and on the islands of Java and Bali. Today, the tiger's range is severely fragmented. It is listed as Endangered on the IUCN Red List of Threatened Species, as its range is thought to have declined by 53% to 68% since the late 1990s. Major threats to tigers are habitat destruction and fragmentation due to deforestation, poaching for fur and the illegal trade of body parts for medicinal purposes. Tigers are also victims of human–wildlife conflict as they attack and prey on livestock in areas where natural prey is scarce. The tiger is legally protected in all range countries. National conservation measures consist of action plans, anti-poaching patrols and schemes for monitoring tiger populations. In several range countries, wildlife corridors have been established and tiger reintroduction is planned.

The tiger is among the most popular of the world's charismatic megafauna. It has been kept in captivity since ancient times and has been trained to perform in circuses and other entertainment shows. The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide."""

In [11]:
text

"The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes. It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and the island tigers of the Sunda Islands.\n\nThroughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. The tiger is an apex predator and preys mainly on ungulates, which it takes by ambush. It lives a mostly solitary life and occupies home ranges, defending these from individuals of the same sex. The range of a male tiger overlaps with that of multiple females with whom he mates. Females give birth to usually two or three cubs that stay with their 

In [12]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [13]:
stopwords = list(STOP_WORDS)
stopwords

['n‘t',
 'someone',
 'keep',
 'ten',
 'already',
 'itself',
 'throughout',
 'was',
 'across',
 'during',
 'top',
 'third',
 'used',
 'doing',
 'get',
 'she',
 'have',
 '’ll',
 'unless',
 'do',
 'indeed',
 'until',
 'then',
 'can',
 'and',
 'anything',
 'therein',
 'these',
 '’ve',
 'make',
 'together',
 '‘s',
 "'re",
 'due',
 'quite',
 'bottom',
 'myself',
 'just',
 'will',
 'further',
 'i',
 'give',
 'cannot',
 'in',
 'everywhere',
 'anyway',
 'made',
 'ca',
 'five',
 "'d",
 'wherein',
 'would',
 'has',
 '‘ll',
 'should',
 'twelve',
 'really',
 'which',
 'back',
 'nevertheless',
 'one',
 'out',
 'their',
 'them',
 'had',
 'upon',
 'somehow',
 'fifteen',
 'who',
 "'ll",
 'beforehand',
 'perhaps',
 'neither',
 'yourselves',
 'while',
 'done',
 'per',
 'alone',
 'after',
 'only',
 'something',
 'since',
 'last',
 'thereby',
 'latterly',
 'at',
 'anyone',
 'almost',
 'whom',
 'but',
 'before',
 'afterwards',
 'none',
 'whereafter',
 'besides',
 'else',
 'although',
 'when',
 'some',
 'man

In [14]:
len(stopwords)

326

In [15]:
doc=nlp(text)
doc

The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes. It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and the island tigers of the Sunda Islands.

Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia. The tiger is an apex predator and preys mainly on ungulates, which it takes by ambush. It lives a mostly solitary life and occupies home ranges, defending these from individuals of the same sex. The range of a male tiger overlaps with that of multiple females with whom he mates. Females give birth to usually two or three cubs that stay with their mot

In [16]:
# let's get the tokens from the text

tokens = [token.text for token in doc]
print(tokens)

['The', 'tiger', '(', 'Panthera', 'tigris', ')', 'is', 'a', 'large', 'cat', 'and', 'a', 'member', 'of', 'the', 'genus', 'Panthera', 'native', 'to', 'Asia', '.', 'It', 'has', 'a', 'powerful', ',', 'muscular', 'body', 'with', 'a', 'large', 'head', 'and', 'paws', ',', 'a', 'long', 'tail', 'and', 'orange', 'fur', 'with', 'black', ',', 'mostly', 'vertical', 'stripes', '.', 'It', 'is', 'traditionally', 'classified', 'into', 'nine', 'recent', 'subspecies', ',', 'though', 'some', 'recognise', 'only', 'two', 'subspecies', ',', 'mainland', 'Asian', 'tigers', 'and', 'the', 'island', 'tigers', 'of', 'the', 'Sunda', 'Islands', '.', '\n\n', 'Throughout', 'the', 'tiger', "'s", 'range', ',', 'it', 'inhabits', 'mainly', 'forests', ',', 'from', 'coniferous', 'and', 'temperate', 'broadleaf', 'and', 'mixed', 'forests', 'in', 'the', 'Russian', 'Far', 'East', 'and', 'Northeast', 'China', 'to', 'tropical', 'and', 'subtropical', 'moist', 'broadleaf', 'forests', 'on', 'the', 'Indian', 'subcontinent', 'and', 'S

In [17]:
tokens

['The',
 'tiger',
 '(',
 'Panthera',
 'tigris',
 ')',
 'is',
 'a',
 'large',
 'cat',
 'and',
 'a',
 'member',
 'of',
 'the',
 'genus',
 'Panthera',
 'native',
 'to',
 'Asia',
 '.',
 'It',
 'has',
 'a',
 'powerful',
 ',',
 'muscular',
 'body',
 'with',
 'a',
 'large',
 'head',
 'and',
 'paws',
 ',',
 'a',
 'long',
 'tail',
 'and',
 'orange',
 'fur',
 'with',
 'black',
 ',',
 'mostly',
 'vertical',
 'stripes',
 '.',
 'It',
 'is',
 'traditionally',
 'classified',
 'into',
 'nine',
 'recent',
 'subspecies',
 ',',
 'though',
 'some',
 'recognise',
 'only',
 'two',
 'subspecies',
 ',',
 'mainland',
 'Asian',
 'tigers',
 'and',
 'the',
 'island',
 'tigers',
 'of',
 'the',
 'Sunda',
 'Islands',
 '.',
 '\n\n',
 'Throughout',
 'the',
 'tiger',
 "'s",
 'range',
 ',',
 'it',
 'inhabits',
 'mainly',
 'forests',
 ',',
 'from',
 'coniferous',
 'and',
 'temperate',
 'broadleaf',
 'and',
 'mixed',
 'forests',
 'in',
 'the',
 'Russian',
 'Far',
 'East',
 'and',
 'Northeast',
 'China',
 'to',
 'tropical'

In [18]:
len(tokens)

453

In [19]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
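
Note that `punctuation` is a single string, so `x in punctuation` is a substring test rather than a per-character lookup. The minimal sketch below shows what that implies for multi-character tokens such as the paragraph break `'\n\n'`. `is_punct_only` is a hypothetical helper name.

```python
from string import punctuation

PUNCT_CHARS = set(punctuation)  # one entry per punctuation character


def is_punct_only(text: str) -> bool:
    """True if text is non-empty and every character is ASCII punctuation."""
    return bool(text) and all(ch in PUNCT_CHARS for ch in text)


# Substring semantics of `in` on a string:
print("," in punctuation)     # a single punctuation mark is a substring
print("\n\n" in punctuation)  # whitespace is not a substring, so it is not filtered out
print("..." in punctuation)   # repeated marks are not a substring either

# Character-wise check:
print(is_punct_only("..."))   # every character is punctuation
print(is_punct_only("\n\n"))  # whitespace is not punctuation
```

With spaCy tokens, the `token.is_punct` and `token.is_space` flags offer a similar check without relying on `string.punctuation`.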

In [20]:
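word_frequencies = {}

# Count every token that is neither a stop word nor punctuation.
# Note: the checks use the lowercased text, but the key keeps the original casing,
# and whitespace tokens such as '\n\n' pass both checks and are counted too.
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies:
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1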
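
The loop above tests the lowercased text but stores the original casing as the key, and it lets whitespace tokens through. One possible variant, sketched under the assumption that lowercase keys are wanted, uses `collections.Counter` together with the spaCy token flags `is_stop`, `is_punct` and `is_space`. The `FakeToken` class is a hypothetical stand-in so the sketch runs without a downloaded model; with spaCy, `doc` could be passed directly.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


def count_words(tokens) -> Counter:
    """Count lowercased token texts, skipping stop words, punctuation and whitespace."""
    return Counter(
        tok.text.lower()
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    )


tokens = [
    FakeToken("The", is_stop=True),
    FakeToken("Tiger"),
    FakeToken("is", is_stop=True),
    FakeToken("a", is_stop=True),
    FakeToken("tiger"),
    FakeToken(".", is_punct=True),
    FakeToken("\n\n", is_space=True),
]
print(count_words(tokens))
```

Lowercasing the keys merges variants such as "Tiger" and "tiger" into one entry, and it keeps the keys consistent with any later lookup that lowercases the word first.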

In [21]:
word_frequencies

{'tiger': 11,
 'Panthera': 2,
 'tigris': 1,
 'large': 3,
 'cat': 1,
 'member': 1,
 'genus': 1,
 'native': 1,
 'Asia': 3,
 'powerful': 1,
 'muscular': 1,
 'body': 2,
 'head': 1,
 'paws': 1,
 'long': 1,
 'tail': 1,
 'orange': 1,
 'fur': 2,
 'black': 1,
 'vertical': 1,
 'stripes': 1,
 'traditionally': 1,
 'classified': 1,
 'recent': 1,
 'subspecies': 2,
 'recognise': 1,
 'mainland': 1,
 'Asian': 1,
 'tigers': 3,
 'island': 1,
 'Sunda': 1,
 'Islands': 1,
 '\n\n': 3,
 'range': 9,
 'inhabits': 1,
 'mainly': 2,
 'forests': 3,
 'coniferous': 1,
 'temperate': 1,
 'broadleaf': 2,
 'mixed': 1,
 'Russian': 1,
 'Far': 1,
 'East': 1,
 'Northeast': 1,
 'China': 2,
 'tropical': 1,
 'subtropical': 1,
 'moist': 1,
 'Indian': 1,
 'subcontinent': 1,
 'Southeast': 1,
 'apex': 1,
 'predator': 1,
 'preys': 1,
 'ungulates': 1,
 'takes': 1,
 'ambush': 1,
 'lives': 1,
 'solitary': 1,
 'life': 1,
 'occupies': 1,
 'home': 2,
 'ranges': 1,
 'defending': 1,
 'individuals': 1,
 'sex': 1,
 'male': 1,
 'overlaps': 1,


In [22]:
max_frequency = max(word_frequencies.values())
max_frequency

11

In [23]:
# to get normalized (weighted) frequencies, divide every frequency by the maximum frequency (11 here)

for word in word_frequencies.keys():
    word_frequencies[word] =  word_frequencies[word]/max_frequency
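
As a small worked example of this normalization: each count is divided by the largest count, so the most frequent word gets 1.0 and every other word gets a value between 0 and 1. The toy counts below are made up for illustration, and `normalize` is a hypothetical helper name.

```python
def normalize(frequencies: dict) -> dict:
    """Return a new dict with each count divided by the largest count."""
    if not frequencies:
        return {}
    max_frequency = max(frequencies.values())
    return {word: count / max_frequency for word, count in frequencies.items()}


toy_counts = {"tiger": 4, "range": 2, "forest": 1}  # illustrative values only
print(normalize(toy_counts))
```

Returning a new dictionary, rather than updating in place, also avoids dividing twice if the cell is run again.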

In [24]:
#print(word_frequencies)
word_frequencies
# these are the normalized frequencies of each word

{'tiger': 1.0,
 'Panthera': 0.18181818181818182,
 'tigris': 0.09090909090909091,
 'large': 0.2727272727272727,
 'cat': 0.09090909090909091,
 'member': 0.09090909090909091,
 'genus': 0.09090909090909091,
 'native': 0.09090909090909091,
 'Asia': 0.2727272727272727,
 'powerful': 0.09090909090909091,
 'muscular': 0.09090909090909091,
 'body': 0.18181818181818182,
 'head': 0.09090909090909091,
 'paws': 0.09090909090909091,
 'long': 0.09090909090909091,
 'tail': 0.09090909090909091,
 'orange': 0.09090909090909091,
 'fur': 0.18181818181818182,
 'black': 0.09090909090909091,
 'vertical': 0.09090909090909091,
 'stripes': 0.09090909090909091,
 'traditionally': 0.09090909090909091,
 'classified': 0.09090909090909091,
 'recent': 0.09090909090909091,
 'subspecies': 0.18181818181818182,
 'recognise': 0.09090909090909091,
 'mainland': 0.09090909090909091,
 'Asian': 0.09090909090909091,
 'tigers': 0.2727272727272727,
 'island': 0.09090909090909091,
 'Sunda': 0.09090909090909091,
 'Islands': 0.09090909

In [25]:
sentence_tokens = [sent for sent in doc.sents]
# doc.sents yields one Span per sentence; the comprehension collects them into a list
sentence_tokens

[The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia.,
 It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes.,
 It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and the island tigers of the Sunda Islands.
 ,
 Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.,
 The tiger is an apex predator and preys mainly on ungulates, which it takes by ambush.,
 It lives a mostly solitary life and occupies home ranges, defending these from individuals of the same sex.,
 The range of a male tiger overlaps with that of multiple females with whom he mates.,
 Females give birth to usually two or three cubs that sta

In [26]:
len(sentence_tokens)

20

In [28]:
# score each sentence by summing the normalized frequencies of the words it contains
# note: the lookup uses word.text.lower(), so only keys stored in lowercase are matched
sentence_scores={}

for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
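
The score of a sentence is the sum of the normalized frequencies of its words. The sketch below is self-contained and uses plain strings. It assumes lowercase keys and simple whitespace splitting, both of which are simplifications rather than what the spaCy pipeline does. `score_sentences` is a hypothetical helper name.

```python
def score_sentences(sentences, weights):
    """Map each sentence to the sum of the weights of its words."""
    scores = {}
    for sent in sentences:
        words = (w.strip(".,").lower() for w in sent.split())
        total = sum(weights.get(w, 0.0) for w in words)
        if total > 0:  # mirror the loop above: unmatched sentences get no entry
            scores[sent] = total
    return scores


weights = {"tiger": 1.0, "range": 0.5, "forest": 0.25}  # toy normalized frequencies
sentences = [
    "The tiger has a large range.",
    "Its range covers forest.",
    "It sleeps.",
]
print(score_sentences(sentences, weights))
```

Because scores are plain sums, longer sentences tend to score higher simply by containing more words.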

In [29]:
sentence_scores

{The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia.: 1.7272727272727268,
 It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes.: 1.5454545454545452,
 It is traditionally classified into nine recent subspecies, though some recognise only two subspecies, mainland Asian tigers and the island tigers of the Sunda Islands.
 : 1.818181818181818,
 Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.: 3.9090909090909074,
 The tiger is an apex predator and preys mainly on ungulates, which it takes by ambush.: 1.7272727272727268,
 It lives a mostly solitary life and occupies home ranges, defending these from individuals of the same sex.: 0.9090909090909092,
 The range of a male tige

In [30]:
from heapq import nlargest  # to retrieve the n largest elements from an iterable

In [31]:
# suppose the case study asks for the 40% of sentences with the highest scores
select_length = int(len(sentence_tokens)*0.4) #calculates 40% of the length of sentence_tokens
select_length

8

In [32]:
# select the select_length (8 here) highest-scoring sentences
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
# nlargest(n, iterable, key=None)
# n: the number of largest elements to return
# iterable: the collection to select from (iterating a dict yields its keys)
# key: a function that takes an element and returns the value used for comparison;
#      if key is not provided, the elements themselves are compared
summary

[Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.,
 Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China and on the islands of Java and Bali.,
 The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide.,
 In several range countries, wildlife corridors have been established and tiger reintroduction is planned.
 ,
 The range of a male tiger overlaps with that of multiple females with whom he mates.,
 The tiger is legally protected in all range countries.,
 National conservation measures consist of action plans, anti-poaching patrols and schemes for monitoring tiger po
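
To see how `nlargest` behaves with a dictionary: iterating a dict yields its keys, and `key=scores.get` ranks those keys by their values. A toy example with made-up scores:

```python
from heapq import nlargest

scores = {"s1": 1.5, "s2": 3.9, "s3": 0.9, "s4": 2.4, "s5": 1.8}  # illustrative

# Keep 40% of the sentences; int() truncates any fractional part.
select_length = int(len(scores) * 0.4)

top = nlargest(select_length, scores, key=scores.get)
print(select_length, top)
```

The result is ordered from highest to lowest score, not by position in the original text.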

In [34]:
# extract the text of each selected sentence (summary holds spaCy Span objects)
final_summary = [sent.text for sent in summary]
final_summary

["Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia.",
 'Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China and on the islands of Java and Bali.',
 'The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide.',
 'In several range countries, wildlife corridors have been established and tiger reintroduction is planned.\n\n',
 'The range of a male tiger overlaps with that of multiple females with whom he mates.',
 'The tiger is legally protected in all range countries.',
 'National conservation measures consist of action plans, anti-poaching patrols and schemes for moni
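
`nlargest` returns the sentences ordered by score, not by where they occur in the text. If a readable paragraph is wanted, one option is to restore document order before joining. This is an assumption about the desired output, not something the notebook does. The sketch uses plain strings and list positions; with spaCy spans, `sent.start` could serve as the sort key.

```python
from heapq import nlargest

sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = {"First point.": 1.2, "Minor aside.": 0.3,
          "Key finding.": 2.5, "Closing remark.": 0.8}  # toy scores

top = nlargest(2, sentences, key=scores.get)  # ranked by score
in_order = sorted(top, key=sentences.index)   # back to document order
summary_text = " ".join(in_order)
print(summary_text)
```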

In [35]:
print(summary)  # the selected sentences form the extractive summary

[Throughout the tiger's range, it inhabits mainly forests, from coniferous and temperate broadleaf and mixed forests in the Russian Far East and Northeast China to tropical and subtropical moist broadleaf forests on the Indian subcontinent and Southeast Asia., Since the early 20th century, tiger populations have lost at least 93% of their historic range and are locally extinct in West and Central Asia, in large areas of China and on the islands of Java and Bali., The tiger featured prominently in the ancient mythology and folklore of cultures throughout its historic range and has continued to appear in culture worldwide., In several range countries, wildlife corridors have been established and tiger reintroduction is planned.

, The range of a male tiger overlaps with that of multiple females with whom he mates., The tiger is legally protected in all range countries., National conservation measures consist of action plans, anti-poaching patrols and schemes for monitoring tiger populati