# Text Processing with SpaCy

- **Introduction:**

This script showcases a standard text processing pipeline using the SpaCy library. SpaCy is a modern and fast NLP library that allows you to perform a variety of text analysis tasks. From tokenization to Named Entity Recognition (NER), SpaCy provides easy-to-use tools and pre-trained models to help you process text data efficiently.

- **Objective:**

The primary objective of this code is to demonstrate various text preprocessing steps including tokenization, stop word removal, lemmatization, part-of-speech tagging, named entity recognition, and additional preprocessing such as removing punctuation and numbers. These steps are crucial in preparing text data for downstream tasks like text classification, sentiment analysis, etc.

## 1. Importing the necessary library

First, ensure you have SpaCy installed along with the language model. 

You can install SpaCy and download the language model using the following commands:

In [1]:
#pip install spacy
#python -m spacy download en_core_web_sm

In [2]:
import spacy

- `spacy` is a library for advanced Natural Language Processing in Python.

## 2. Loading a SpaCy model

In [3]:
nlp = spacy.load('en_core_web_sm')

`spacy.load('en_core_web_sm')` loads a small pre-trained English pipeline. Besides identifying token boundaries (tokenization), it provides the linguistic annotations used below: lemmas, part-of-speech tags, and named entities.
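If the model has not been downloaded, `spacy.load` raises an `OSError`. The sketch below is one possible way to guard against that, assuming a tokenizer-only blank English pipeline is an acceptable fallback for your use case; the `MODEL` constant is an illustrative name, not part of the original notebook. It also prints `nlp.pipe_names`, which lists the components the loaded pipeline runs.

```python
import spacy

MODEL = "en_core_web_sm"  # illustrative constant holding the model name

# spacy.util.is_package reports whether the model is installed as a package
if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
else:
    # Assumption: a blank pipeline (tokenizer only) is an acceptable fallback
    nlp = spacy.blank("en")

# Names of the pipeline components that will run on each text
print(nlp.pipe_names)
```

With the fallback, tokenization and lexical flags such as `is_stop` still work, but lemmas, POS tags, and entities stay empty.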

## 3. Input text

In [4]:
text = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California."

This is the text we want to process using SpaCy.

## 4. Processing text with SpaCy

In [5]:
doc = nlp(text)

In [6]:
type(doc)

spacy.tokens.doc.Doc

Here, `nlp` is applied to the text, which invokes the model on the text and produces a `Doc` object enriched with linguistic annotations.
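A `Doc` behaves like a sequence of tokens. As a small illustration, indexing returns a `Token` and slicing returns a `Span`. The sketch uses a blank English pipeline so that it runs without a model download; that choice is an assumption of this example, not part of the original notebook.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank pipeline is enough here because only the tokenizer is needed
nlp = spacy.blank("en")
doc = nlp("Apple Inc. is an American multinational technology company.")

first = doc[0]   # a single Token
span = doc[0:2]  # a Span covering the first two tokens

print(len(doc), repr(first.text), repr(span.text))
print(type(first).__name__, type(span).__name__)
```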

### 1. Tokenization

In [7]:
tokenization_example = [token.text for token in doc]
print("Tokenization:", tokenization_example)

Tokenization: ['Apple', 'Inc.', 'is', 'an', 'American', 'multinational', 'technology', 'company', 'headquartered', 'in', 'Cupertino', ',', 'California', '.']


- Tokenization breaks text into words, phrases, symbols, or other meaningful elements called tokens. In this code, we extract the text of each token from the `doc` object.
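Tokenization in SpaCy is non-destructive: each token remembers its character offset and trailing whitespace, so the original string can be rebuilt from the tokens. A minimal sketch, again assuming a blank English pipeline is sufficient because only the tokenizer is involved:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no model download needed
text = "Apple Inc. is headquartered in Cupertino, California."
doc = nlp(text)

for token in doc:
    # token.i is the token index, token.idx the character offset in the text
    end = token.idx + len(token)
    print(f"{token.i:>2} {token.idx:>3}-{end:<3} {token.text!r}")

# Joining each token with its trailing whitespace restores the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```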

### 2. Stop Word Removal

In [8]:
stop_word_removal_example = [token.text for token in doc if not token.is_stop]
print("Stop Word Removal:", stop_word_removal_example)

Stop Word Removal: ['Apple', 'Inc.', 'American', 'multinational', 'technology', 'company', 'headquartered', 'Cupertino', ',', 'California', '.']


- Get the list of stop words in English

In [9]:
# Get the list of stop words
stop_words = nlp.Defaults.stop_words

# If you want to print the stop words, you can convert the set to a list and print it
print(list(stop_words))

['quite', 'toward', 'myself', 'third', '’ve', 'none', 'full', 'have', 'some', 'more', 'out', 'therefore', 'become', 'beyond', 're', 'thereupon', '‘d', 'while', 'again', 'nobody', 'without', 'here', 'had', 'yourself', 'our', 'one', 'nothing', 'off', 'thereby', 'name', 'next', 'noone', 'its', 'nowhere', 'elsewhere', '’m', 'alone', 'himself', 'those', 'within', '’d', 'six', 'through', 'than', 'four', 'an', 'does', 'unless', 'anything', 'meanwhile', 'behind', 'ourselves', 'formerly', 'keep', 'over', 'due', 'anyhow', 'becomes', 'each', 'last', 'they', 'hers', 'make', 'from', 'several', 'much', 'whereas', 'go', 'still', '’ll', 'whose', 'together', 'ten', 'show', 'further', 'on', 'various', 'neither', 'except', '’re', 'against', 'ever', 'can', 'whether', 'the', 'and', 'beforehand', 'often', 'before', 'five', 'anyway', 'after', 'fifteen', 'other', 'former', 'therein', 'both', 'him', 'perhaps', 'another', 'who', 'has', 'latterly', "'m", 'twelve', 'whenever', 'wherever', 'a', 'really', 'via', 'p

- Stop words are common words that are often removed in the preprocessing step. `token.is_stop` checks if a token is a stop word.
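The default list can be adjusted when a domain-specific word carries little information. The sketch below marks `company` as a stop word by setting the flag on its vocabulary entry; treating that word as a stop word is purely an illustrative assumption. The flag is stored per vocabulary entry, so a differently cased form such as `Company` would need its own entry.

```python
import spacy

nlp = spacy.blank("en")  # stop word flags do not require a trained model

# Assumption for illustration: "company" is uninformative in our corpus
nlp.vocab["company"].is_stop = True

doc = nlp("Apple is a technology company")
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

For a readable listing of the defaults, `sorted(nlp.Defaults.stop_words)` prints them in alphabetical order.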

### 3. Lemmatization

In [10]:
lemmatization_example = [token.lemma_ for token in doc]
print("Lemmatization:", lemmatization_example)

Lemmatization: ['Apple', 'Inc.', 'be', 'an', 'american', 'multinational', 'technology', 'company', 'headquarter', 'in', 'Cupertino', ',', 'California', '.']



- Lemmatization reduces words to their base or dictionary form; for example, `is` becomes `be` and `headquartered` becomes `headquarter`. Here, `token.lemma_` gets the lemma of each token.

**Note:** SpaCy provides lemmatization but no stemmer. A lemmatizer can use vocabulary and part-of-speech context, so it tends to give more accurate base forms than the suffix stripping a stemmer performs.
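To see where lemmatization actually changes something, it can help to list only the tokens whose lemma differs from the surface form. The helper name `changed_lemmas` and the demo sentence are hypothetical, and the demo only runs if `en_core_web_sm` is installed.

```python
import spacy


def changed_lemmas(doc):
    """Return (text, lemma) pairs for tokens whose lemma differs from the text."""
    return [
        (token.text, token.lemma_)
        for token in doc
        if token.lemma_ and token.lemma_ != token.text
    ]


MODEL = "en_core_web_sm"
if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    doc = nlp("The companies were opening new offices.")
    print(changed_lemmas(doc))
else:
    print(f"{MODEL} is not installed; skipping the demo")
```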

### 4. Part-of-Speech (POS) Tagging

In [12]:
pos_tagging_example = [(token.text, token.pos_) for token in doc]
print("Part-of-Speech Tagging:", pos_tagging_example)

Part-of-Speech Tagging: [('Apple', 'PROPN'), ('Inc.', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('American', 'ADJ'), ('multinational', 'ADJ'), ('technology', 'NOUN'), ('company', 'NOUN'), ('headquartered', 'VERB'), ('in', 'ADP'), ('Cupertino', 'PROPN'), (',', 'PUNCT'), ('California', 'PROPN'), ('.', 'PUNCT')]


- POS tagging assigns a part of speech label to each token (e.g., noun, verb, adjective). `token.pos_` gets the POS tag of each token.
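The tag names are abbreviations. `spacy.explain` looks up a short description for a label, which is handy when reading output like the one above. The list of tags below is simply copied from that output.

```python
import spacy

# Tags taken from the POS tagging output shown above
pos_tags = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "VERB", "ADP", "PUNCT"]

for tag in pos_tags:
    # spacy.explain returns a human-readable description of the label
    print(f"{tag:<6} {spacy.explain(tag)}")
```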

### 5. Named Entity Recognition (NER)

In [13]:
ner_example = [(ent.text, ent.label_) for ent in doc.ents]
print("Named Entity Recognition:", ner_example)

Named Entity Recognition: [('Apple Inc.', 'ORG'), ('American', 'NORP'), ('Cupertino', 'GPE'), ('California', 'GPE')]


- NER identifies and categorizes entities within the text (e.g., person names, organizations, locations). `doc.ents` retrieves the named entities, and `ent.label_` gets the label of each entity.
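Each entity is a span that also records where it occurs in the text. One way to organize the results is to group entities by label together with their character offsets; the helper `entities_by_label` is a hypothetical name, and the demo only runs if `en_core_web_sm` is installed.

```python
from collections import defaultdict

import spacy


def entities_by_label(doc):
    """Group entities as label -> list of (text, start_char, end_char)."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append((ent.text, ent.start_char, ent.end_char))
    return dict(grouped)


MODEL = "en_core_web_sm"
if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    doc = nlp("Apple Inc. is headquartered in Cupertino, California.")
    print(entities_by_label(doc))
else:
    print(f"{MODEL} is not installed; skipping the demo")
```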

### 6. Additional Preprocessing - Removing Punctuation and Numbers

In [14]:
additional_preprocessing_example = [token.text for token in doc if token.is_alpha]
print("Additional Preprocessing:", additional_preprocessing_example)

Additional Preprocessing: ['Apple', 'is', 'an', 'American', 'multinational', 'technology', 'company', 'headquartered', 'in', 'Cupertino', 'California']


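- This step filters out tokens that are not purely alphabetic. `token.is_alpha` checks if a token consists of alphabetic characters only, so punctuation and numbers are dropped. Note that it also drops tokens that merely contain a non-alphabetic character, which is why `Inc.` is missing from the output above.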
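If abbreviations such as `Inc.` should survive, a narrower filter is one option: drop only tokens flagged by `token.is_punct` or `token.like_num`. The sketch compares both filters on a made-up sentence and assumes a blank English pipeline, since these flags do not need a trained model.

```python
import spacy

nlp = spacy.blank("en")  # lexical flags work without a trained model
doc = nlp("Apple Inc. was founded in 1976, in Cupertino.")

# Strict filter: keep purely alphabetic tokens only
alpha_only = [token.text for token in doc if token.is_alpha]

# Narrower filter: drop only punctuation and number-like tokens
no_punct_or_num = [
    token.text for token in doc if not token.is_punct and not token.like_num
]

print("is_alpha:          ", alpha_only)
print("not punct/like_num:", no_punct_or_num)
```

Note that `like_num` also matches spelled-out numbers such as "ten" in English, so it is slightly broader than a digits-only check.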


### Performing all preprocessing steps at once

In [15]:
# Performing all preprocessing steps at once
cleaned_text = ' '.join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])

# Displaying the cleaned text
print(cleaned_text)

Apple american multinational technology company headquarter Cupertino California


Here, a single expression combines stop word removal, lemmatization, and the filtering of punctuation and numbers. It iterates over each token in the document, keeps the token only if it is not a stop word and is purely alphabetic, and takes its lemma. The resulting lemmas are then joined with spaces to form the cleaned text, which is printed.
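For more than one text, the same expression can be wrapped in a function and applied to documents produced by `nlp.pipe`, which processes texts in batches. The names `clean` and `clean_many` and the second sample sentence are hypothetical; the demo only runs if `en_core_web_sm` is installed.

```python
import spacy


def clean(doc):
    """Join the lemmas of tokens that are alphabetic and not stop words."""
    return " ".join(
        token.lemma_ for token in doc if not token.is_stop and token.is_alpha
    )


def clean_many(nlp, texts):
    """Apply clean() to every text; nlp.pipe streams the texts in batches."""
    return [clean(doc) for doc in nlp.pipe(texts)]


MODEL = "en_core_web_sm"
if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    texts = [
        "Apple Inc. is an American multinational technology company.",
        "The companies were opening 3 new offices.",
    ]
    print(clean_many(nlp, texts))
else:
    print(f"{MODEL} is not installed; skipping the demo")
```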

Together, these steps form a basic preprocessing pipeline that prepares text data for further analysis or machine learning tasks.