# Experiments with Textacy Library

In [50]:
!pip3 install spacy textacy



## Basics: Working with text

In [0]:
import textacy

text = ("Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "
         "much Natural Language Processing research has relied heavily on machine learning. "
        "Formerly, many language-processing tasks typically involved the direct hand coding "
        "of rules, which is not in general robust to natural language variation. "
        "The machine-learning paradigm calls instead for using statistical inference "
        "to automatically learn such rules through the analysis of large corpora "
        "of typical real-world examples.")

#### Looking for keywords in context:
*How is a word (or phrase) used in the text?*

In [8]:
list(textacy.text_utils.KWIC(text, 'language', window_width=35))

 1980s and mid 1990s, much Natural  Language  Processing research has relied hea
n machine learning. Formerly, many  language -processing tasks typically involve
s not in general robust to natural  language  variation. The machine-learning pa


[]
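To show what a keyword-in-context search does, here is a minimal pure-Python sketch. It is not textacy's implementation: the `kwic` helper and the `sample` sentence are made up for illustration, and the only assumption carried over from the call above is a case-insensitive match with a fixed character window on each side.

```python
import re

def kwic(text, keyword, window_width=35):
    """Yield (left context, match, right context) for each case-insensitive hit."""
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start, end = match.span()
        left = text[max(0, start - window_width):start]
        right = text[end:end + window_width]
        yield left, match.group(), right

# Hypothetical sample sentence, not the notebook's text.
sample = "Natural Language Processing studies how computers handle natural language."
hits = list(kwic(sample, "language", window_width=10))
for left, word, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>10}  {word}  {right:<10}")
```

Returning tuples instead of printing keeps the hits available for further processing.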

#### Basic text preprocessing

In [0]:
from textacy import preprocessing

In [10]:
preprocessing.normalize_whitespace(preprocessing.remove_punctuation(text))[:80]

'Since the so called statistical revolution in the late 1980s and mid 1990s much '
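The two preprocessing steps can be approximated with the standard library alone. This is a sketch under stated assumptions, not textacy's code: it handles only ASCII punctuation and collapses every run of whitespace (including newlines) into one space, which may differ from what `textacy.preprocessing` does.

```python
import re
import string

def remove_punctuation(text):
    """Replace each ASCII punctuation character with a space."""
    return text.translate(str.maketrans({ch: " " for ch in string.punctuation}))

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

raw = 'Since the so-called "statistical revolution" in the late 1980s,'
cleaned = normalize_whitespace(remove_punctuation(raw))
print(cleaned)
```

Replacing punctuation with a space (rather than deleting it) is what keeps "so-called" from becoming "socalled".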

## Making a Doc for processing
##### Textacy includes automated language detection to apply the right pipeline to the text

In [11]:
doc = textacy.make_spacy_doc(text)
doc._.preview

100%|██████████| 66.7M/66.7M [00:00<00:00, 73.1MB/s]


'Doc(85 tokens: "Since the so-called "statistical revolution" in...")'

#### Customizing pipeline

In [0]:
en = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
doc = textacy.make_spacy_doc(text, lang=en)

#### Keeping text metadata (author, URL, date) during processing

In [13]:
metadata = {
    "title": "Natural-language processing",
    "url": "https://en.wikipedia.org/wiki/Natural-language_processing",
    "source": "wikipedia",
}
doc = textacy.make_spacy_doc((text, metadata))
doc._.meta["title"]

'Natural-language processing'

## Analyzing a Doc

In [14]:
list(textacy.extract.ngrams(
    doc, 3, filter_stops=True, filter_punct=True, filter_nums=False
))

[1980s and mid,
 Natural Language Processing,
 Language Processing research,
 research has relied,
 heavily on machine,
 processing tasks typically,
 tasks typically involved,
 involved the direct,
 direct hand coding,
 coding of rules,
 robust to natural,
 natural language variation,
 learning paradigm calls,
 paradigm calls instead,
 inference to automatically,
 learn such rules,
 analysis of large,
 corpora of typical]

In [17]:
list(textacy.extract.ngrams(doc, 2, min_freq=2))

[Natural Language, natural language]
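For intuition about n-gram extraction with a minimum frequency, here is a small sketch on plain token lists. The helper names are hypothetical, and the case-insensitive counting is an assumption inferred from the output above, where both "Natural Language" and "natural language" pass `min_freq=2`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_ngrams(tokens, n, min_freq=1):
    """Keep n-grams (lowercased) that occur at least min_freq times."""
    counts = Counter(tuple(t.lower() for t in gram) for gram in ngrams(tokens, n))
    return [gram for gram, count in counts.items() if count >= min_freq]

# Hypothetical token list, split on whitespace for simplicity.
tokens = "Natural Language Processing handles natural language variation".split()
bigrams = frequent_ngrams(tokens, 2, min_freq=2)
print(bigrams)
```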

In [18]:
list(textacy.extract.entities(doc, drop_determiners=True))

[late 1980s, mid 1990s, Natural Language Processing]

In [19]:
pattern = textacy.constants.POS_REGEX_PATTERNS["en"]["NP"]
pattern

'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'

In [20]:
list(textacy.extract.pos_regex_matches(doc, pattern))

[statistical revolution,
 the late 1980s,
 mid 1990s,
 much Natural Language Processing research,
 machine learning,
 many language,
 processing tasks,
 the direct hand coding,
 rules,
 natural language variation,
 The machine,
 paradigm,
 statistical inference,
 such rules,
 the analysis,
 large corpora,
 typical real-world examples]
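The idea behind matching a part-of-speech pattern can be sketched without spaCy by encoding the tag sequence as a string and running an ordinary regular expression over it. This is an illustration only: the `pos_regex_matches` helper below is a hypothetical stand-in, not the textacy function, and the POS tags in `tagged` are assigned by hand rather than by a tagger.

```python
import re

def pos_regex_matches(tagged_tokens, pattern):
    """Yield token spans whose POS-tag sequence matches a '<TAG>'-style pattern."""
    # Encode the tag sequence as one string, e.g. "<DET><ADJ><NOUN>".
    tag_string = "".join(f"<{pos}>" for _, pos in tagged_tokens)
    # Drop the spaces and wrap each tag in a non-capturing group so that
    # quantifiers such as ? and * apply to the whole tag, not just to ">".
    regex = re.sub(r"(<[A-Z]+>)", r"(?:\1)", pattern.replace(" ", ""))
    for match in re.finditer(regex, tag_string):
        # Counting "<" before a character offset converts it to a token index.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        yield " ".join(token for token, _ in tagged_tokens[first:last])

# Hand-tagged tokens (assumed tags, for illustration only).
tagged = [("the", "DET"), ("direct", "ADJ"), ("hand", "NOUN"),
          ("coding", "NOUN"), ("of", "ADP"), ("rules", "NOUN")]
np_pattern = "<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+"
matches = list(pos_regex_matches(tagged, np_pattern))
print(matches)
```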

#### Identify key terms in a document using several algorithms

In [21]:
import textacy.ke
textacy.ke.textrank(doc, normalize='lemma', topn=10)

[('Natural Language Processing research', 0.059959246697826624),
 ('natural language variation', 0.04488350959275309),
 ('direct hand coding', 0.037736661821063354),
 ('statistical inference', 0.03432557996664981),
 ('statistical revolution', 0.034007535820683756),
 ('machine learning', 0.03305919655573349),
 ('mid 1990', 0.026993994406706995),
 ('late 1980', 0.026499549123496648),
 ('processing task', 0.0256684200517989),
 ('general robust', 0.024835834233545625)]
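As a rough picture of the TextRank idea, the sketch below runs PageRank over an undirected word co-occurrence graph. It is a simplification under stated assumptions: single words only, no lemmatization or part-of-speech filtering, unweighted edges, and a fixed iteration count. The function and parameter names are hypothetical, and `textacy.ke.textrank` differs in its details.

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the words that follow it inside the window.
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 / len(neighbors) for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / len(neighbors)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical word sequence in which "language" co-occurs with everything.
words = ["language", "processing", "language", "variation", "language", "research"]
ranked = textrank(words)
print(ranked[0][0])
```

Words that co-occur with many other words accumulate score from all of them, which is why the hub word ranks first here.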

In [22]:
textacy.ke.sgrank(doc, ngrams=(1, 2, 3, 4), normalize="lower", topn=0.1)

[('natural language processing research', 0.3127324263926516),
 ('direct hand coding', 0.09689526575072321),
 ('natural language variation', 0.09379267126456722),
 ('mid 1990s', 0.051042393931910755),
 ('processing tasks', 0.04879465955786855)]

#### Compute basic counts and various readability statistics:

In [23]:
ts = textacy.TextStats(doc)
ts.n_unique_words

57

In [24]:
ts.basic_counts

{'n_chars': 414,
 'n_long_words': 30,
 'n_monosyllable_words': 38,
 'n_polysyllable_words': 19,
 'n_sents': 3,
 'n_syllables': 134,
 'n_unique_words': 57,
 'n_words': 73}

In [25]:
ts.flesch_kincaid_grade_level

15.56027397260274

In [26]:
ts.readability_stats

{'automated_readability_index': 17.448173515981736,
 'coleman_liau_index': 16.32928468493151,
 'flesch_kincaid_grade_level': 15.56027397260274,
 'flesch_reading_ease': 26.84351598173518,
 'gulpease_index': 44.61643835616438,
 'gunning_fog_index': 20.144292237442922,
 'lix': 65.42922374429223,
 'smog_index': 17.5058628484301,
 'wiener_sachtextformel': 11.857779908675797}
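Two of these scores can be reproduced by hand from `basic_counts` using the standard published Flesch formulas. The function names below are my own, and textacy's implementation may differ in detail, but with `n_words=73`, `n_sents=3`, and `n_syllables=134` the formulas give the same values as the output above.

```python
def flesch_kincaid_grade_level(n_words, n_sents, n_syllables):
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59

def flesch_reading_ease(n_words, n_sents, n_syllables):
    """Standard Flesch reading ease formula (higher means easier)."""
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)

# Counts taken from ts.basic_counts above.
grade = flesch_kincaid_grade_level(n_words=73, n_sents=3, n_syllables=134)
ease = flesch_reading_ease(n_words=73, n_sents=3, n_syllables=134)
print(round(grade, 2), round(ease, 2))
```

Both formulas depend only on average sentence length and average syllables per word, so three long sentences are enough to push the text to a college-level grade.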

In [28]:
bot = doc._.to_bag_of_terms(
    ngrams=(1, 2, 3), entities=True, weighting='count',
    as_strings=True
)
sorted(bot.items(), key=lambda x: x[1], reverse=True)[:15]

[('call', 2),
 ('statistical', 2),
 ('machine', 2),
 ('language', 2),
 ('rule', 2),
 ('learn', 2),
 ('revolution', 1),
 ('late', 1),
 ('1980', 1),
 ('mid', 1),
 ('1990', 1),
 ('Natural', 1),
 ('Language', 1),
 ('Processing', 1),
 ('research', 1)]

## Working with Many Texts
Textacy makes it easy to efficiently stream text and (text, metadata) pairs from disk, regardless of the format or compression of the data.

In [29]:
!wget 'https://github.com/bdewilde/textacy-data/releases/download/capitol_words_py3_v1.0/capitol-words-py3.json.gz'

--2020-04-30 21:03:09--  https://github.com/bdewilde/textacy-data/releases/download/capitol_words_py3_v1.0/capitol-words-py3.json.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/112271207/fdec3610-d3b7-11e7-817c-694c51f29888?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200430%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200430T210309Z&X-Amz-Expires=300&X-Amz-Signature=250cee7c6d7100502ae21613427538c0223c2dfe94976da7f8635a35fb942d75&X-Amz-SignedHeaders=host&actor_id=0&repo_id=112271207&response-content-disposition=attachment%3B%20filename%3Dcapitol-words-py3.json.gz&response-content-type=application%2Foctet-stream [following]
--2020-04-30 21:03:09--  https://github-production-release-asset-2e65be.s3.amazonaws.com/112271207/fdec3610-d3b7-11e7-817c-694c51f29888?X-

In [0]:
records = textacy.io.read_json(
    'capitol-words-py3.json.gz',
    mode='rt', lines=True
)

In [31]:
for record in records:
    doc = textacy.make_spacy_doc((record['text'], {"title": record["title"]}))
    print(doc._.preview)
    print('meta:', doc._.meta)
    break

Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
meta: {'title': 'JOIN THE SENATE AND PASS A CONTINUING RESOLUTION'}
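The streaming pattern used here (one JSON record per line, read lazily from a gzip file) can be sketched with the standard library. This is not `textacy.io.read_json`; the `read_json_lines` helper, the file name, and the two records are hypothetical and exist only to make the example self-contained.

```python
import gzip
import json
import os
import tempfile

def read_json_lines(path):
    """Lazily yield one decoded record per line of a gzip-compressed file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Write a tiny made-up dataset so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "speeches.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as f:
    for rec in ({"title": "A", "text": "First speech."},
                {"title": "B", "text": "Second speech."}):
        f.write(json.dumps(rec) + "\n")

stream = read_json_lines(path)
first = next(stream)
# Build the (text, metadata) pair shape that the loop above passes on.
pair = (first["text"], {"title": first["title"]})
print(pair)
```

Because the reader is a generator, only one record is held in memory at a time, which matters for large archives.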


#### Convenient Dataset classes are already implemented in textacy

In [33]:
import textacy.datasets
ds = textacy.datasets.CapitolWords()
ds.download()
records = ds.records(speaker_name={"Barack Obama", "Joe Biden"})
next(records)

("Mr. President, a few days ago, the world watched as the seeds of democracy began to take root in Iraq. As a result of the sheer courage of the Iraqi people and the untold sacrifices of American soldiers, the success of the elections showed just how far people will go to achieve self-government and rule of law.\nAs Americans, we can take enormous pride in the fact that this kind of courage has been inspired by our own struggle for freedom, by the tradition of democratic law secured by our forefathers and enshrined in our Constitution. It is a tradition that says all men are created equal under the law and that no one is above it.\nThat is why even within the executive branch there is an office dedicated to enforcing the law of the land and applying it to people and to Presidents alike.\nIn this sense, the Attorney General is not like the other Cabinet posts. Unlike the Secretary of State, who is the public face of the President's foreign policy, or the Secretary of Education, whose jo

## Make a Corpus

In [0]:
corpus = textacy.Corpus('en', data=records)

In [38]:
textacy.Corpus(
    textacy.load_spacy_lang("en_core_web_sm", disable=("parser", "tagger")),
    data=ds.texts(speaker_party="R", chamber="House", limit=100),
)


<textacy.corpus.Corpus at 0x7efc204bf320>

##### You can use basic indexing as well as flexible boolean queries to select documents in a corpus:

In [39]:
corpus[100]._.preview

'Doc(1578 tokens: "Mr. President, in the immediate aftermath of Hu...")'

In [40]:
[doc._.preview for doc in corpus[10:15]]

['Doc(15380 tokens: "Mr. President, I am pleased to join as a cospon...")',
 'Doc(296 tokens: "Today I wish to commend Congressman Bobby Rush ...")',
 'Doc(3479 tokens: " There being no objection, the bill was ordered...")',
 'Doc(17 tokens: "Mr. President, I ask unanimous consent that the...")',
 'Doc(875 tokens: "Mr. President, I rise today to urge my colleagu...")']

In [42]:
obama_docs = list(corpus.get(lambda doc: doc._.meta["speaker_name"] == "Barack Obama"))
len(obama_docs)

410
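A boolean query is just a predicate applied to every document. The sketch below mimics that behaviour on plain dictionaries; `get_docs` is a hypothetical helper rather than the `Corpus.get` method, and the records and token counts are made up.

```python
from itertools import islice

def get_docs(docs, match_func, limit=None):
    """Lazily yield documents for which match_func(doc) is true, up to limit."""
    return islice((doc for doc in docs if match_func(doc)), limit)

# Made-up records standing in for documents with metadata.
speeches = [
    {"speaker_name": "Barack Obama", "n_tokens": 1578},
    {"speaker_name": "Joe Biden", "n_tokens": 296},
    {"speaker_name": "Barack Obama", "n_tokens": 17},
]
obama_long = list(get_docs(
    speeches,
    lambda doc: doc["speaker_name"] == "Barack Obama" and doc["n_tokens"] > 100,
))
print(len(obama_long))
```

Any condition that can be written as a function of one document can be combined this way with `and`, `or`, and `not`.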

## Analyze a Corpus
Basic stats are computed on the fly as documents are added to (or removed from) a corpus:

In [43]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(411, 10914, 270636)

Transform a corpus into a document-term matrix, with flexible tokenization, weighting, and filtering of terms:

In [46]:
import textacy
import textacy.vsm

# Vectorizer lives in textacy.vsm; it is not exported at the package top level.
vectorizer = textacy.vsm.Vectorizer(
    tf_type="linear", apply_idf=True, idf_type="smooth", norm="l2",
    min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform(
    (doc._.to_terms_list(ngrams=1, entities=True, as_strings=True)
     for doc in corpus))
print(repr(doc_term_matrix))
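To make the weighting options concrete (linear term frequency, smoothed inverse document frequency, L2 row normalization), here is a minimal pure-Python sketch of a document-term matrix. It is not the textacy vectorizer: the smoothing formula `log((n_docs + 1) / (df + 1)) + 1` is an assumption borrowed from the common scikit-learn convention, `min_df`/`max_df` filtering is left out, and the three tiny documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(tokenized_docs):
    """Return (vocabulary, rows) with smoothed tf-idf weights and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(df)
    # Assumed smoothing; adding 1 keeps terms found in every document above zero.
    idf = {term: math.log((n_docs + 1) / (df[term] + 1)) + 1 for term in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        row = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Made-up tokenized documents.
docs = [["budget", "vote", "budget"], ["vote", "senate"], ["senate", "budget"]]
vocab, matrix = tfidf_matrix(docs)
print(vocab)
```

A real vectorizer stores the result as a sparse matrix, since most terms are absent from most documents.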

In [47]:
dir(textacy)

['Corpus',
 'DEFAULT_DATA_DIR',
 'TextStats',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'about',
 'cache',
 'constants',
 'corpus',
 'datasets',
 'extract',
 'io',
 'ke',
 'lang_utils',
 'load_spacy_lang',
 'logger',
 'logging',
 'make_spacy_doc',
 'network',
 'preprocessing',
 'set_doc_extensions',
 'similarity',
 'spacier',
 'text_stats',
 'text_utils',
 'utils',
 'vsm']

In [52]:
textacy.Vectorizer

AttributeError: ignored