# <center> <font size = 24 color = 'steelblue'> <b> Document vectors

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Doc2vec allows us to directly learn the representations for texts of arbitrary lengths (phrases, sentences, paragraphs and documents), by taking the context of words in the text into account.
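- This notebook takes a simpler route: it builds a document vector by averaging the vectors of the document's tokens with spaCy.
- spaCy is a Python library for natural language processing (NLP) with many built-in capabilities and features.
- spaCy offers several pretrained models of different sizes. The default English language model is `en_core_web_sm`.
- Note that the small (`sm`) models do not include static word vectors, so `.vector` returns context-sensitive tensors instead; the larger `en_core_web_md` and `en_core_web_lg` models provide real word vectors.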
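
To make the averaging idea concrete, here is a minimal pure-Python sketch that does not use spaCy at all. The helper `average_vectors` and the three-dimensional toy vectors are hypothetical and exist only for illustration; real token vectors come from the loaded model and have many more dimensions.

```python
def average_vectors(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 3-dimensional vectors standing in for three tokens.
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

# The "document vector" is the mean of the token vectors.
doc_vector = average_vectors(toy_token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Averaging ignores word order, so two documents containing the same words in a different order would get the same vector under this scheme.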

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b>  Install packages (spaCy) and necessary dependencies

In [None]:
!pip install spacy==2.2.4
!python -m spacy download en_core_web_sm

<font size = 5 color = seagreen> <b> Import spacy and load the model

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

<font size = 5 color = seagreen> <b> Define document set.
<div class="alert alert-block alert-success">
<font size = 4> 
    
**Each sentence below is treated as a separate document.**

In [None]:
documents = ["The analyst reviews the dataset to identify trends and patterns.",
             "Data analysis helps businesses make informed decisions based on facts and figures.",
             "In a research project the team gathers data for subsequent analysis.",
             "Charts and graphs are used to visually represent the results of data analysis.",
             "Analyzing customer feedback data provides valuable insights for product improvement."]

In [None]:
# Lowercase each document and strip the periods.
processed = [doc.lower().replace(".","") for doc in documents]
print("Documents after pre-processing:",processed)
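
The cell above removes only periods. If the documents contained other punctuation, one possible variation (a sketch, not part of the original notebook) is to strip every ASCII punctuation character with `str.translate`. The helper name `preprocess` and the sample sentence are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


sample = "Charts, graphs and tables: are they useful?"
print(preprocess(sample))  # charts graphs and tables are they useful
```

Keep in mind that `string.punctuation` also contains the hyphen and the apostrophe, so hyphenated words and contractions would be merged into single tokens by this version.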

<div class="alert alert-block alert-success">
<font size = 4>  <b>Iterate over each document and pass it through the `nlp` pipeline to create a spaCy `Doc` object.

In [None]:
for doc in processed:
    # Create a spaCy Doc object, a container for accessing linguistic annotations.
    doc_nlp = nlp(doc)
    print("-"*30)

    # doc_nlp.vector is the document vector: the average of the token vectors.
    print("Average Vector of '{}'\n".format(doc),doc_nlp.vector)

    # Print the text of each token in the doc alongside its own vector.
    for token in doc_nlp:
        print()
        print(token.text,token.vector)
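
Once each document has a vector, a common next step is to compare documents by the cosine of the angle between their vectors. The following is a minimal sketch under stated assumptions: the function `cosine_similarity` and the toy vectors `doc_a`, `doc_b` and `doc_c` are hypothetical stand-ins, not output from the model above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document vectors (toy values for illustration only).
doc_a = [1.0, 2.0, 2.0]
doc_b = [2.0, 4.0, 4.0]   # same direction as doc_a
doc_c = [2.0, -1.0, 0.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))  # 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

With real data, the lists would be replaced by the `doc_nlp.vector` arrays collected in the loop above.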
