### Document-term matrix

**Sparse; one dimension/column per unique word in the corpus**

In [66]:
from tabletext import to_text

corpus = ['When the clouds rain they pours', 'The rest is history']

tokenized_corpus = [text.lower().split(' ') for text in corpus]

# Vocabulary: every distinct token across all documents, in sorted order
unique_words = sorted(set(token for document in tokenized_corpus for token in document))

document_term_matrix = [[' '] + unique_words]

for document_number, document in enumerate(tokenized_corpus):
    
    document_counts = [0 for w in unique_words]
    
    for token in document:
        document_counts[unique_words.index(token)] += 1
        
    document_term_matrix.append(['document ' + str(document_number)] + document_counts)
    
print to_text(document_term_matrix)
    

┌────────────┬────────┬─────────┬────┬───────┬──────┬──────┬─────┬──────┬──────┐
│            │ clouds │ history │ is │ pours │ rain │ rest │ the │ they │ when │
├────────────┼────────┼─────────┼────┼───────┼──────┼──────┼─────┼──────┼──────┤
│ document 0 │      1 │       0 │  0 │     1 │    1 │    0 │   1 │    1 │    1 │
├────────────┼────────┼─────────┼────┼───────┼──────┼──────┼─────┼──────┼──────┤
│ document 1 │      0 │       1 │  1 │     0 │    0 │    1 │   1 │    0 │    0 │
└────────────┴────────┴─────────┴────┴───────┴──────┴──────┴─────┴──────┴──────┘
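
The table above stores every cell, zeros included. As a rough sketch of what "sparse" means here, the same counts can be kept as one `Counter` per document, storing only the words that actually occur. The names `sparse_rows`, `vocabulary`, and `to_dense` are illustrative choices, not part of the notebook above, and the code is written to run under either Python 2 or Python 3.

```python
from collections import Counter

corpus = ['When the clouds rain they pours', 'The rest is history']

# One Counter per document: only words that actually occur are stored.
sparse_rows = [Counter(text.lower().split(' ')) for text in corpus]

# The vocabulary is the union of all words seen in any document.
vocabulary = sorted(set(word for row in sparse_rows for word in row))


def to_dense(row, vocabulary):
    """Expand one sparse row into a full list of counts (zeros included)."""
    # Looking up a missing key in a Counter returns 0 without storing it.
    return [row[word] for word in vocabulary]


dense_rows = [to_dense(row, vocabulary) for row in sparse_rows]

n_cells = len(dense_rows) * len(vocabulary)
n_nonzero = sum(len(row) for row in sparse_rows)

print(vocabulary)
print(dense_rows)
print('stored entries: %d of %d cells' % (n_nonzero, n_cells))
```

With only two short documents the saving is small, but the share of zero cells grows quickly as the vocabulary grows.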


### Natural Language Processing (NLP) packages

Tokenizing, sentence splitting, stopword identification, part-of-speech tagging, lemmatization, named-entity recognition, [syntactic] dependency parsing.


* [**Spacy**](https://spacy.io/) (English, German, Spanish, Portuguese, French, Italian, Dutch) and [**textacy**](https://textacy.readthedocs.io/en/stable/).  *My preferred module for English.*
* [**TextBlob**](http://textblob.readthedocs.io/en/dev/) (English), and [textblob-de](http://textblob-de.readthedocs.io/en/latest/) (German). *Mostly for limited prototyping uses.*  **Best-in-class intro documentation**
* [Parzu](https://github.com/rsennrich/ParZu) (German). *My preferred module for German.  Prolog.*
* [morphadorner](http://morphadorner.northwestern.edu/morphadorner/).  *For early modern English; perhaps the most accurate POS-tagging of the bunch.  Java.  Command-line interface.*


* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (English, Arabic, Chinese, French, German, and Spanish).  *Java. Command-line interface.*
* [OpenNLP](https://opennlp.apache.org/) (English, German, and varied support for others).  *Convenient command-line interface. Java.*
* [pattern](https://www.clips.uantwerpen.be/pattern) (English, German, etc).
* [nltk](https://www.nltk.org/).  *Packaged corpora.  Access to Wordnet.*
* [NodeBox Linguistics](https://www.nodebox.net/code/index.php/Linguistics).  *Lemma-to-morphological form transformations.*

### Me, installing the latest spacy . . . 

I should have just run

    !conda update -y spacy
    
but I expected that I would have problems with the environment(s) on my workstation.

Note that I had to restart the kernel after removing spacy: the kernel had already imported a copy of spacy, and removing it from the file system (which is what "conda remove" does) doesn't affect modules already imported in the notebook.
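
A small sketch of why the restart is needed, using a throwaway module rather than spacy itself. The module name `throwaway_demo_module` and its `VERSION` attribute are made up for this illustration. Deleting the file stands in, roughly, for what removing a package does on disk.

```python
import os
import shutil
import sys
import tempfile

# Hypothetical throwaway module standing in for an installed package.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, 'throwaway_demo_module.py')
with open(module_path, 'w') as f:
    f.write("VERSION = 'old'\n")

sys.path.insert(0, module_dir)
import throwaway_demo_module

# Delete the files, roughly what uninstalling a package does on disk.
shutil.rmtree(module_dir)
sys.path.remove(module_dir)

# The running interpreter still holds the module object it already loaded.
still_loaded = 'throwaway_demo_module' in sys.modules
print(still_loaded)
print(throwaway_demo_module.VERSION)
```

The already-loaded module keeps working until the process ends, which is why a fresh kernel is needed before the newly installed version is picked up.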

In [67]:
#!conda remove -y spacy
# Restart Kernel after remove
#!conda install -y spacy
#!python -m spacy download en

In [68]:
import spacy
spacy.__version__

'2.0.11'

### Problems installing spacy models?

# TO-DO: make next cell on macbook

### Spacy 101

The documentation is fairly good; start with ["Usage"](https://spacy.io/usage/) and ["Linguistic Features"](https://spacy.io/usage/linguistic-features).  I use [the API guide for Token](https://spacy.io/api/token) a lot.

### First, load the model

Loading takes a moment, so we do it once in a cell by itself rather than repeating it in every cell.

**nlp** (spacy.lang.en.English) is the model (i.e., language-specific data necessary to process text in that language) along with code to apply the model to passages of raw text.

In [69]:
import spacy

nlp = spacy.load('en')

print 'type(nlp)', type(nlp)

type(nlp) <class 'spacy.lang.en.English'>


### Part of speech tagging, etc

[Penn Treebank part-of-speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).  See also [the spacy doc](https://spacy.io/api/annotation).

Is [stemming](http://snowballstem.org/demo.html) better than lemmatization?
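
One way to frame the question: a stemmer chops suffixes by rule and may return something that is not a word, while a lemmatizer maps each form to a dictionary headword. The toy sketch below contrasts the two on the "beauty" words used in the sample text. The suffix list and the lemma table are hand-written assumptions for illustration only; they are neither the Snowball algorithm nor spacy's lemmatizer, and their output should not be read as what those tools produce.

```python
# Hand-written toy rules and lookup table (assumptions for illustration only).
SUFFIXES = ('ifully', 'ified', 'iful', 'ies', 'y')
LEMMAS = {
    'beautiful': 'beautiful',
    'beautified': 'beautify',
    'beautifully': 'beautifully',
    'beauties': 'beauty',
}


def crude_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


words = ['beautiful', 'beautified', 'beautifully', 'beauties']
stems = [crude_stem(w) for w in words]
lemmas = [LEMMAS.get(w, w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print('%-12s stem=%-6s lemma=%s' % (word, stem, lemma))
```

In this toy, all four words collapse to the single stem `beaut`, while the lemma table keeps the adjective, verb, adverb, and noun apart. Which behavior is "better" depends on whether the task wants those forms merged.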

In [70]:
# text sort of from http://www.gutenberg.org/ebooks/1286

import re
from tabletext import to_text

text = """There was a certain island in the sea, the only inhabitants of which were an old man, 
whose name was Bob Prospero, and his daughter June Miranda, a very beautiful young lady. 
She came to Chicago so young that she had no memory of having seen any other 
human face than the World Trade Organization.  In June, the beautiful island beautified
beautifully the beauties of the sea.
"""

# Replacing newlines with spaces only so the prints below make sense
doc = nlp(unicode(re.sub('\n', ' ', text)))

for s in doc.sents:
    
    print
    print 's', type(s), s
    
    for t in s:
        print
        print 't', type(t), t
        break
        
    break
    
# https://spacy.io/api/token

results = [['t.i', 't.text', 't.lemma_', 't.pos_', 't.tag_', 't.ent_type_', 't.ent_iob_', 't.is_stop']]

n_not_stopwords_or_punct = 0

print
for s in doc.sents:
    for t in s:
        
        if t.pos_ not in ['SPACE', 'PUNCT'] and t.is_stop == 0:
            n_not_stopwords_or_punct += 1
        
        results.append([t.i, t.text, t.lemma_, t.pos_, t.tag_,
                        t.ent_type_, t.ent_iob_, t.is_stop])
    
    results.append([''])
    
print 'n_not_stopwords_or_punct', n_not_stopwords_or_punct
print
        
print to_text(results)


s <type 'spacy.tokens.span.Span'> There was a certain island in the sea, the only inhabitants of which were an old man,  whose name was Bob Prospero, and his daughter June Miranda, a very beautiful young lady.  

t <type 'spacy.tokens.token.Token'> There

n_not_stopwords_or_punct 35

┌─────┬──────────────┬──────────────┬────────┬────────┬─────────────┬────────────┬───────────┐
│ t.i │ t.text       │ t.lemma_     │ t.pos_ │ t.tag_ │ t.ent_type_ │ t.ent_iob_ │ t.is_stop │
├─────┼──────────────┼──────────────┼────────┼────────┼─────────────┼────────────┼───────────┤
│   0 │ There        │ there        │ ADV    │ EX     │             │ O          │         0 │
├─────┼──────────────┼──────────────┼────────┼────────┼─────────────┼────────────┼───────────┤
│   1 │ was          │ be           │ VERB   │ VBD    │             │ O          │         1 │
├─────┼──────────────┼──────────────┼────────┼────────┼─────────────┼────────────┼───────────┤
│   2 │ a            │ a            │ DET    │ DT

### Dependency parsing

[Clear dependency labels](http://www.mathcs.emory.edu/~choi/doc/cu-2012-choi.pdf) ("Clear" is the name of the label set, and not necessarily descriptive, at least to the uninitiated).  See [the spacy doc](https://spacy.io/api/annotation).

See the [displaCy demo](https://demos.explosion.ai/displacy/).  Note that it uses noun chunks as terminals; spacy will do this.
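
The relationship between a token's head index and its ancestors and children (the columns tabulated in this section) can be sketched without spacy. The parse below is hand-assigned for a four-word fragment and is an assumption, not spacy output. The only convention taken from the notebook is that the root token is its own head, as the `ROOT` row of the table shows. The helper names `ancestors` and `children` are illustrative.

```python
# Hand-assigned toy parse (an assumption, not spacy output).
tokens = ['She', 'came', 'to', 'Chicago']
# heads[i] is the index of token i's head; the root is its own head.
heads = [1, 1, 1, 2]


def ancestors(i, heads):
    """Follow head links upward from token i, nearest ancestor first."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def children(i, heads):
    """Tokens whose immediate head is token i (the root is not its own child)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


rows = []
for i, text in enumerate(tokens):
    rows.append((i, text, heads[i],
                 [tokens[a] for a in ancestors(i, heads)],
                 [tokens[c] for c in children(i, heads)]))

for row in rows:
    print(row)
```

The head index alone is enough to rebuild the whole tree: ancestors come from following it upward, children from inverting it.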

In [71]:
from tabletext import to_text

results = [['t.i', 't.text', 't.head.i', 't.dep_', 'ancestors', 'children']]

for s in doc.sents:
    print
    print s
    print
    for t in s:
        
        ancestors = ', '.join([a.text for a in t.ancestors])
        if len(ancestors) > 25:
            ancestors = ancestors[:22] + '...'
            
        children = ', '.join([c.text for c in t.children])
        if len(children) > 25:
            children = children[:22] + '...'
        
        results.append([t.i, t.text, t.head.i, t.dep_, 
                        ancestors, 
                        children])
    
    results.append([''])
        
print to_text(results)


There was a certain island in the sea, the only inhabitants of which were an old man,  whose name was Bob Prospero, and his daughter June Miranda, a very beautiful young lady.  


She came to Chicago so young that she had no memory of having seen any other  human face than the World Trade Organization.  


In June, the beautiful island beautified beautifully the beauties of the sea.

┌─────┬──────────────┬──────────┬──────────┬───────────────────────────┬───────────────────────────┐
│ t.i │ t.text       │ t.head.i │ t.dep_   │ ancestors                 │ children                  │
├─────┼──────────────┼──────────┼──────────┼───────────────────────────┼───────────────────────────┤
│   0 │ There        │        1 │ expl     │ was                       │                           │
├─────┼──────────────┼──────────┼──────────┼───────────────────────────┼───────────────────────────┤
│   1 │ was          │        1 │ ROOT     │                           │ There, island, were, .    │
├─────

https://spacy.io/usage/visualizers#section-jupyter

In [72]:
from spacy import displacy

for s in doc.sents:
    s_doc = nlp(unicode(s))
    
    displacy.render(s_doc, jupyter=True, style='dep')


In [73]:
print
for s in doc.sents:
    print
    for c in s.noun_chunks:
        print c.text



a certain island
the sea
the only inhabitants
an old man
whose name
Bob Prospero
his daughter
June Miranda
a very beautiful young lady

She
Chicago
she
no memory
any other  human face
the World Trade Organization

June
the beautiful island
the beauties
the sea


### word2vec vectors


https://code.google.com/archive/p/word2vec/

https://github.com/facebookresearch/fastText

https://nlp.stanford.edu/projects/glove/

https://en.wikipedia.org/wiki/Word2vec

https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/


In [74]:
# ONLY PRINTING THE FIRST FIVE ITEMS IN EACH VECTOR . . . 

print
for s in doc.sents:
    print
    for t in s:
        print t.text, '\t', len(t.vector), t.vector[:5]



There 	384 [ 1.09459674  2.25503922 -0.75144684  0.93780494  1.16016471]
was 	384 [-1.8380121  -1.2267915  -3.30175519 -1.46033335  3.94812846]
a 	384 [ 3.27277517 -2.23957086  1.3413285   7.08262014  2.90397859]
certain 	384 [-1.74500895 -3.7057085  -2.66868448  4.53605461 -1.33578479]
island 	384 [ 0.22992104  1.70915306  0.60042489  3.12509537 -0.75114262]
in 	384 [ 0.87824786  0.87629151 -0.17617959  1.92673516  0.39568865]
the 	384 [ 0.90704072  1.28875542  1.28403533  0.79325992  1.01080835]
sea 	384 [ 0.65881044  0.34315869  3.44948959  3.85775518 -0.62723953]
, 	384 [ 0.29537982  1.14307058  1.09943604 -1.17121279  2.99736214]
the 	384 [ 1.07120979 -2.15241051  4.08927917  3.63449717 -1.40206635]
only 	384 [-0.83788848 -2.25628996 -2.92216754  3.19361138 -0.05209458]
inhabitants 	384 [ 2.53459692  3.56259108 -0.94348133  0.60543412 -0.85527706]
of 	384 [ 5.1220293   0.89033198 -0.24673772  0.25163919  0.46142197]
which 	384 [-3.1575079   2.15728068  6.60509968 -1.77766466  0.
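
Vectors like these are usually compared rather than read directly, and cosine similarity is one common measure. A minimal sketch follows, using made-up three-dimensional vectors in place of the 384-dimensional ones printed above. The numbers are invented for illustration and do not come from any model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumptions, not model output).
vectors = {
    'sea': [1.0, 2.0, 0.0],
    'ocean': [2.0, 4.0, 0.0],
    'purse': [0.0, 0.0, 3.0],
}

sea_ocean = cosine_similarity(vectors['sea'], vectors['ocean'])
sea_purse = cosine_similarity(vectors['sea'], vectors['purse'])

print('sea vs ocean: %.2f' % sea_ocean)
print('sea vs purse: %.2f' % sea_purse)
```

Vectors pointing the same way score near 1 regardless of their length, and orthogonal vectors score 0.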

### Bonus track. 

[Jerome McGann and Lisa Samuels, "Deformance and Interpretation"](http://raley.english.ucsb.edu/wp-content/Engl800/Deformance.pdf)

[Mark Sample, "Notes towards a Deformed Humanities"](http://www.samplereality.com/2012/05/02/notes-towards-a-deformed-humanities/)

In [75]:
import codecs, re, textwrap

text = codecs.open('Moby_Dick.txt', 'r', encoding='utf-8').read().replace('\r', '')
#text = codecs.open('War_and_Peace.txt', 'r', encoding='utf-8').read().replace('\r', '')

paragraphs = re.split('\n\n+', text)

for p in paragraphs[:10]:
    if p:
        if p.startswith('CHAPTER') or p.startswith('BOOK'):
            print
            print '\t', p
        else:
            pdoc = nlp(p)
                
            all_noun_chunks = []
                
            for s in pdoc.sents:
                
                last_punct = ''
                for t in s:
                    if t.pos_ == 'PUNCT' and t.text not in [u'—', u'”', '(', ')', u'“']:
                        last_punct = t.text.replace(u'—', '')
                for c in s.noun_chunks:
                    
                    is_just_pronoun = False
                    if len(c) == 1:
                        for t in c:
                            if t.pos_ == 'PRON':
                                is_just_pronoun = True
                    
                    if not is_just_pronoun:
                        all_noun_chunks.append((c.text[0].upper() + c.text[1:] + last_punct).replace(u'“', ''))
                
            if len(all_noun_chunks) > 0:

                print
                print '\n'.join(textwrap.wrap(' '.join(all_noun_chunks), 70))



	CHAPTER 1. Loomings.

Little or no money. My purse. Nothing. Shore. The watery part. The
world. A way. The spleen. The circulation. The mouth, A damp, My soul,
Coffin warehouses, The rear, Every funeral, My hypos, Such an upper
hand, A strong moral principle, The street, People's hats, Sea. My
substitute. Pistol. Ball. Cato. His sword. The ship. Nothing. Almost
all men. Their degree. Some time. Very nearly the same feelings. The
ocean.

Your insular city. The Manhattoes. Wharves. Indian isles. Coral
reefs—commerce. Surf. The streets. Its extreme downtown. The battery.
That noble mole. Waves. Breezes. Sight. Land. The crowds. Water-
gazers.

The city. Corlears. Hook. Thence. Whitehall. What? Silent sentinels.
The town. Mortal men. Ocean reveries. The spiles. The pier-heads. The
bulwarks. Ships. China. The rigging. A still better seaward peep.
Landsmen. Week days. Lath. Plaster. Counters. Benches. Desks. The
green fields? What?

More crowds. The water. A dive. Nothing. The extremest li