# <center> <font size = 24 color = 'steelblue'> <b> Document vectors

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">
<font size = 4>
    
- Doc2vec allows us to directly learn the representations for texts of arbitrary lengths (phrases, sentences, paragraphs and documents), by taking the context of words in the text into account.
- This notebook demonstrates the creation of a document vector using averaging with spaCy.
- spaCy is a python library for natural language processing (NLP) which has a lot of built-in capabilities and features.
- spaCy has different types of models. "The default English language model is `en_core_web_sm`.

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b>  Install packages (spaCy) and necessary dependencies

In [1]:
!pip install spacy==2.2.4
!python -m spacy download en_core_web_sm

Collecting spacy==2.2.4
  Downloading spacy-2.2.4.tar.gz (6.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.1/6.1 MB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h  Installing build dependencies ... [?25lerror
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpip subprocess to install build dependencies[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[40 lines of output][0m
  [31m   [0m Collecting setuptools
  [31m   [0m   Using cached setuptools-75.2.0-py3-none-any.whl.metadata (6.9 kB)
  [31m   [0m Collecting wheel
  [31m   [0m   Downloading wheel-0.44.0-py3-none-any.whl.metadata (2.3 kB)
  [31m   [0m Collecting cython>=0.25
  [31m   [0m   Using cached Cython-3.0.11-py2.py3-none-any.whl.metadata (3.2 kB)
  [31m   [0m Collecting cymem<2.1.0,>=2.0.2
  [31m   [0m   Using cached cymem-2.0.8-cp312-cp312-macosx_11_0_arm64.whl.metadata (8.4 kB)
  [31m   [0m Col

<font size = 5 color = seagreen> <b> Import spacy and load the model

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

<font size = 5 color = seagreen> <b> Define document set.
<div class="alert alert-block alert-success">
<font size = 4>
    
**Assuming each statement corresponds to a separate document**

In [3]:
documents = ["The analyst reviews the dataset to identify trends and patterns."
             "Data analysis helps businesses make informed decisions based on facts and figures.",
             "In a research project the team gathers data for subsequent analysis.",
             "Charts and graphs are used to visually represent the results of data analysis.",
             "Analyzing customer feedback data provides valuable insights for product improvement."]

In [8]:
processed = [doc.lower().replace("."," ") for doc in documents]
print("Document After Pre-Processing:",processed)

Document After Pre-Processing: ['the analyst reviews the dataset to identify trends and patterns data analysis helps businesses make informed decisions based on facts and figures ', 'in a research project the team gathers data for subsequent analysis ', 'charts and graphs are used to visually represent the results of data analysis ', 'analyzing customer feedback data provides valuable insights for product improvement ']


<div class="alert alert-block alert-success">
<font size = 4>  <b>Iterate over each document and instantiate an nlp instance.

In [10]:
for doc in processed:
    # Create a spacy object which is a container for accessing linguistic annotations.
    oc_nlp = nlp(doc)
    print("-"*30)
    print

    # This gives the average vector of each document.
    print("Average Vector of '{}'\n".format(doc),oc_nlp.vector) # Use oc_nlp

    # This gives the text of each word in the doc and their respective vectors.
    for token in oc_nlp: # Use oc_nlp
        print()
        print(token.text,token.vector)

------------------------------
Average Vector of 'the analyst reviews the dataset to identify trends and patterns data analysis helps businesses make informed decisions based on facts and figures '
 [ 0.07785316  0.31736797 -0.14752856  0.29918545  0.29225147 -0.00840273
  0.7117144   0.16469043  0.2152286   0.14899766 -0.18650521 -0.03729999
 -0.25050917 -0.73720896  0.25997713  0.04653712 -0.22721027  0.5016644
 -0.15721451 -0.11012223 -0.11394881 -0.24627508 -0.2319234  -0.3713493
  0.06043805 -0.18123116  0.59020734  0.59882826  0.06332148 -0.09449042
 -0.1467828  -0.35565498 -0.07215864 -0.08004297  0.25423113 -0.21949397
  0.09762263  0.1079042  -0.33125502  0.22857122 -0.1757588  -0.03613307
 -0.2863995   0.01829899  0.24756137  0.11859591  0.24029852  0.36100113
 -0.12577805 -0.33827826 -0.39439002 -0.39212787  0.23721431 -0.6999867
 -0.09101016  0.3397203   0.0045377  -0.11772589  0.03711363  0.04490144
  0.07073646 -0.03534032 -0.15815006 -0.292013    0.9458074   0.30380163
 