# Natural Language Processing with Python

- GitHub repository for this workshop: https://github.com/unmrds/cc-nlp

The processing and analysis of [natural languages](https://en.wikipedia.org/wiki/Natural_language) is a core requirement for extracting structured information from spoken, signed, or written language and for feeding that information into systems or processes that generate insights from, or responses to, the provided language data. Because natural languages evolved naturally rather than being designed for a specific purpose, they pose significant challenges when developing automated systems.

Natural Language Processing - the class of activities in which language analysis, interpretation, and generation play key roles - is used in many disciplines, as demonstrated by this random sample of recent papers that use NLP to address very different research problems:

* "Unsupervised entity and relation extraction from clinical records in Italian" (1)
* "Candyflipping and Other Combinations: Identifying Drug–Drug Combinations from an Online Forum" (2)
* "How Can Linguistics Help to Structure a Multidisciplinary Neo Domain such as Exobiology?" (3)
* "Bag of meta-words: A novel method to represent document for the sentiment classification" (4)
* "Information Needs and Communication Gaps between Citizens and Local Governments Online during Natural Disasters" (5)
* "Mining the Web for New Words: Semi-Automatic Neologism Identification with the NeoCrawler" (6)
* "Distributed language representation for authorship attribution" (7)
* "Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts" (8)
* "Ecological momentary interventions for depression and anxiety" (9)

... and many of us interact with NLP on a daily basis (from [*Damn You Autocorrect*](http://www.damnyouautocorrect.com) and [boredpanda](https://www.boredpanda.com/best-funny-siri-responses/?utm_source=google&utm_medium=organic&utm_campaign=organic))...

![](images/combined.jpg)

The increasing availability of electronic textual materials: 

* The [Hathi Trust Research Center](https://analytics.hathitrust.org)
* [Project Gutenberg](https://www.gutenberg.org)
* [US Government Documents](https://www.govinfo.gov)
* [50 different text collections](https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table) from [UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html)
* Social networks such as [Twitter](https://developer.twitter.com/en/docs.html)
* The many text corpora available through NLP tools such as [NLTK](http://www.nltk.org/nltk_data/)
* and [many others](https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/)

creates a rich collection of data that may be used to address myriad questions including:

* Changes in language use through time
* Comparisons of writing styles between genres
* Sentiment analysis based on text content
* Developing systems that can interpret written or spoken language and respond in kind
* Automatically extracting information from large collections of electronic documents
* and many others ...

## A conceptual model for NLP

![Natural Language Processing Pyramid](https://nlpforhackers.io/wp-content/uploads/2016/11/NLP-Pyramid.png)
- from [*Natural Language Processing for Hackers*](https://nlpforhackers.io/intro-natural-language-processing/)

Or, put another way, from a workflow perspective:

![Generic workflow](images/workflow.png)

## A worked example of this workflow

This is an actual analysis that was performed in support of UNM's Research Strategic Planning effort in 2016. The workflow includes:

1. Acquisition and import of the analysis text - abstracts of electronic dissertations and theses stored in UNM's Institutional Repository
2. Construction of a stop word list for words to be excluded from the analysis based on:
    a. The base NLTK stopword list
    b. An additional word list of words commonly used in academic writing that aren't useful in text analysis
    c. A long additional stopword list
    d. A custom punctuation symbol list
    e. Some additional annoying characters or symbols that arose in the analysis
3. Generation of n-grams ranging in length from 1 to 4 based on the input abstract text
4. Export of the generated n-grams (and their associated frequencies in the provided text) for further analysis outside of the script

[https://github.com/karlbenedict/2016-10_ETD_Text](https://github.com/karlbenedict/2016-10_ETD_Text)
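
The four workflow steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code from the linked repository: the sample abstracts, the short stop word sets, the regular-expression tokenizer, and the function names (`tokenize`, `ngrams`, `count_ngrams`, `to_csv`) are all assumptions made for the example. The original analysis used the NLTK stop word list plus several longer custom lists, and it may have ordered the filtering and n-gram steps differently.

```python
import csv
import io
import re
from collections import Counter

# Step 1 (assumed data): two made-up abstracts stand in for the
# dissertation and thesis abstracts used in the real analysis.
ABSTRACTS = [
    "This study examines the water quality of the Rio Grande.",
    "Water quality data were collected along the Rio Grande in 2015.",
]

# Step 2 (assumed lists): tiny stand-ins for the NLTK stop word list
# and the additional academic-writing word list.
BASE_STOPWORDS = {"the", "of", "in", "were", "this", "along"}
ACADEMIC_STOPWORDS = {"study", "examines", "data", "collected"}
STOPWORDS = BASE_STOPWORDS | ACADEMIC_STOPWORDS


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens.

    Punctuation and digits are dropped here, so no separate
    punctuation list is needed in this sketch.
    """
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, stopwords, max_n=4):
    """Step 3: count n-grams of length 1..max_n after stop word removal."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in tokenize(text) if t not in stopwords]
        for n in range(1, max_n + 1):
            counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts


def to_csv(counts):
    """Step 4: serialize n-grams and frequencies as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["ngram", "frequency"])
    for ngram, frequency in counts.most_common():
        writer.writerow([ngram, frequency])
    return buffer.getvalue()


counts = count_ngrams(ABSTRACTS, STOPWORDS)
print(to_csv(counts))
```

Removing stop words before building n-grams means that words on either side of a removed word become neighbors, so an n-gram such as "quality rio" can appear even though those words were never adjacent in the text. Whether that is acceptable depends on the question being asked of the data.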



# Additional Resources

**Start Here for NLTK Foundation**: [*Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit*](https://www.nltk.org/book/) by Steven Bird, Ewan Klein, and Edward Loper

Other Resources
* [A Comparison of Top 6 Python NLP Libraries](https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb)
* [Top 10 Python Libraries for Natural Language Processing (2018)](https://kleiber.me/blog/2018/02/25/top-10-python-nlp-libraries-2018/)
* [A Roundup of Python NLP Libraries](https://nlpforhackers.io/libraries/)
* [LEARNING PATH: Natural Language Processing with Python: A Complete Guide](https://www.safaribooksonline.com/learning-paths/learning-path-natural/9781789539905/?autoplay=false)
* [Lower-level text processing tools in Python](https://www.computerhope.com/unix/pylibtx.htm)

# References Cited

1. Alicante, A., Corazza, A., Isgrò, F., & Silvestri, S. (2016). Unsupervised entity and relation extraction from clinical records in Italian. *Computers in Biology and Medicine*, 72, 263–275. https://doi.org/10.1016/j.compbiomed.2016.01.014

2. Chary, M., Yi, D., & Manini, A. F. (2018). Candyflipping and Other Combinations: Identifying Drug–Drug Combinations from an Online Forum. *Frontiers in Psychiatry*, 9. https://doi.org/10.3389/fpsyt.2018.00135

3. Condamines, A. (2014). How Can Linguistics Help to Structure a Multidisciplinary Neo Domain such as Exobiology? *BIO Web of Conferences*, 2, 06001. https://doi.org/10.1051/bioconf/20140206001

4. Fu, M., Qu, H., Huang, L., & Lu, L. (2018). Bag of meta-words: A novel method to represent document for the sentiment classification. *Expert Systems with Applications*, 113, 33–43. https://doi.org/10.1016/j.eswa.2018.06.052

5. Hong, L., Fu, C., Wu, J., & Frias-Martinez, V. (2018). Information Needs and Communication Gaps between Citizens and Local Governments Online during Natural Disasters. *Information Systems Frontiers*, 20(5), 1027–1039. https://doi.org/10.1007/s10796-018-9832-0

6. Kerremans, D., & Prokic, J. (2018). Mining the Web for New Words: Semi-Automatic Neologism Identification with the NeoCrawler. *Anglia-Zeitschrift Fur Englische Philologie*, 136(2), 239–268. https://doi.org/10.1515/ang-2018-0032

7. Kocher, M., & Savoy, J. (2018). Distributed language representation for authorship attribution. *Digital Scholarship in the Humanities*, 33(2), 425–441. https://doi.org/10.1093/llc/fqx046

8. Nanni, F., Dietz, L., & Ponzetto, S. P. (2018). Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts. *Digital Scholarship in the Humanities*, 33(3), 612–620. https://doi.org/10.1093/llc/fqx062

9. Schueller, S. M., Aguilera, A., & Mohr, D. C. (2017). Ecological momentary interventions for depression and anxiety. *Depression and Anxiety*, 34(6), 540–545. https://doi.org/10.1002/da.22649

