### Title: Words - Getting Started with NLP
##### Author: Amo
<strong>email:</strong> kwaby.dad@gmail.com

### 1. Introduction:
The first steps in natural language processing begin with understanding the basic units of text data: words. In any document, words (in various forms) are combined to develop an idea or a concept. In this workbook, I demonstrate how you can take an arbitrary piece of text and extract tokens for further processing. I also introduce some powerful NLP tools that you can use for tasks ranging from simple tokenization to more sophisticated ones such as dependency parsing and named entity extraction.
<br>
<br><strong>Goal: </strong> Get a basic understanding of text preprocessing
<br>

Approach:
* manual approach
* using specialized modules



In [1]:
#imports
import json


### 2. Get data
<strong> Data Description:</strong>
* The text used here is a very short but interesting one about insects. It's quite simple and that's why I used it for this demo.
* source: https://breakingnewsenglish.com/1907/190716-insect-pain.html

In [29]:
# load data
with open('insects.txt','r') as inf:
    raw_text = inf.readlines()
    
    
# preview what's in the first few lines of the file

raw_text[:5] 

['## title: insects can feel pain\n',
 '## source: https://breakingnewsenglish.com/1907/190716-insect-pain.html\n',
 'New research shows that insects feel pain. The researchers say it isn\'t the same kind of pain that humans feel. The pain that insects feel is a sensation that is like pain. The research was conducted at the University of Sydney in Australia. Professor Greg Neely, co-author of the research report, said: "People don\'t really think of insects as feeling any kind of pain, but it\'s already been shown in lots of different invertebrate animals that they can sense and avoid dangerous [things] that we [think of] as painful." He added: "We knew that insects could sense \'pain\' but what we didn\'t know is that an injury could lead to long-lasting hyper-sensitivity...in a similar way to human patients\' experiences."\n',
 'The researchers looked at how fruit flies reacted to injuries. The scientists damaged one leg on the flies and allowed the leg to heal. They found that after

### 2. Tokenize text
* <strong>Goal:</strong> As you can see, the text above is stored as a list of lines, because that is how we read in our
data. We can easily convert the list into one long string of text as follows:

In [30]:
raw_text = raw_text[2:] # exclude the first 2 lines as they are just metadata
raw_string = ''.join(raw_text)
raw_string

'New research shows that insects feel pain. The researchers say it isn\'t the same kind of pain that humans feel. The pain that insects feel is a sensation that is like pain. The research was conducted at the University of Sydney in Australia. Professor Greg Neely, co-author of the research report, said: "People don\'t really think of insects as feeling any kind of pain, but it\'s already been shown in lots of different invertebrate animals that they can sense and avoid dangerous [things] that we [think of] as painful." He added: "We knew that insects could sense \'pain\' but what we didn\'t know is that an injury could lead to long-lasting hyper-sensitivity...in a similar way to human patients\' experiences."\nThe researchers looked at how fruit flies reacted to injuries. The scientists damaged one leg on the flies and allowed the leg to heal. They found that after the leg fully healed, the flies became more sensitive and tried harder to protect their legs. Professor Neely said the pa

### Note: 
<hr>
The above conversion may seem trivial at this point, and in fact it doesn't really matter yet. How you choose to represent
your text data will depend on the intended use. In the next task, however, we'll use the raw_string version for convenience.
<hr>
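
The slice `raw_text[2:]` assumes the metadata always sits in the first two lines. As a hedged alternative, the sketch below filters by prefix instead, assuming (as in this file) that metadata lines start with `##`. The helper name `lines_to_string` and the shortened sample lines are illustrative and not part of the original notebook.

```python
def lines_to_string(lines, meta_prefix="##"):
    """Drop lines that start with the metadata prefix and join the rest."""
    content = [line for line in lines if not line.startswith(meta_prefix)]
    return "".join(content)


# Shortened stand-in for the result of inf.readlines()
sample_lines = [
    "## title: insects can feel pain\n",
    "## source: https://breakingnewsenglish.com/1907/190716-insect-pain.html\n",
    "New research shows that insects feel pain.\n",
    "The researchers looked at how fruit flies reacted to injuries.\n",
]

raw_string_demo = lines_to_string(sample_lines)
print(raw_string_demo)
```

Unlike the positional slice, this version can be re-run safely because it never removes content lines.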


### 2A. Simple Whitespace tokenization

Quick online reference: <a href='https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization'>Tokenization</a>

In [31]:
#we perform a simple tokenization based on whitespaces between words:

tokenized_text = raw_string.split(' ') # split the string on single spaces
tokenized_text[:15]

['New',
 'research',
 'shows',
 'that',
 'insects',
 'feel',
 'pain.',
 'The',
 'researchers',
 'say',
 'it',
 "isn't",
 'the',
 'same',
 'kind']
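
Splitting on a single space leaves punctuation attached to words (`pain.`) and does not split on newlines. The snippet below is a minimal sketch of a slightly more careful manual approach: it compares `split(' ')`, `split()` and a small regular expression. The pattern and the sample string are my own illustrative choices, not a standard tokenizer.

```python
import re

sample = "New research shows that insects feel pain.\nThe researchers say it isn't the same."

# split(' ') only splits on the space character, so '\n' stays inside a token
space_tokens = sample.split(' ')

# split() with no argument splits on any run of whitespace, including newlines
whitespace_tokens = sample.split()

# Regex: a word (optionally with internal apostrophes) or a single punctuation mark
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sample)

print(space_tokens[6])       # token that still contains the newline
print(whitespace_tokens[6])  # 'pain.' with the full stop attached
print(regex_tokens[6:8])     # 'pain' and '.' as separate tokens
```

The regular expression keeps `isn't` as one token, which is one of the places where it differs from more advanced tokenizers.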

### Notes: 
<hr>
Now we have a list of tokens ready for further processing. We can do the same using the nltk module, which uses more advanced
rules to generate tokens. Let's try it.

### 2B. Tokenization using advanced tools

In [8]:
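# word_tokenize needs NLTK's Punkt models; if they are missing, run nltk.download('punkt') once (newer releases ask for 'punkt_tab')
from nltk import word_tokenize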

In [32]:
nltk_tokens  =  word_tokenize(raw_string,language='english')
nltk_tokens[:15] # preview the first 15 tokens

['New',
 'research',
 'shows',
 'that',
 'insects',
 'feel',
 'pain',
 '.',
 'The',
 'researchers',
 'say',
 'it',
 'is',
 "n't",
 'the']

### Notes:
<hr>
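There is a clear difference between the output of our whitespace tokenization and the one produced by nltk: nltk separates punctuation from words (`pain.` becomes `pain` and `.`) and splits contractions (`isn't` becomes `is` and `n't`). It follows the conventions specified by the Penn <a href='ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html'>treebank</a>.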
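
To make the difference concrete, here is a rough sketch that imitates two of these conventions with regular expressions: splitting contractions and separating punctuation. It is only an approximation for illustration, not NLTK's implementation. The function name `treebank_like_tokenize` and the patterns are assumptions of mine.

```python
import re


def treebank_like_tokenize(text):
    """Approximate two Treebank-style conventions (illustrative only)."""
    # Split "n't" off the preceding word: "isn't" -> "is n't"
    text = re.sub(r"(\w+)(n't)\b", r"\1 \2", text)
    # Split common clitics: "it's" -> "it 's", "we'll" -> "we 'll"
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Put spaces around common punctuation marks so they become separate tokens
    text = re.sub(r'([.,!?;:"()\[\]])', r" \1 ", text)
    return text.split()


sentence = "The researchers say it isn't the same kind of pain that humans feel."
tokens = treebank_like_tokenize(sentence)
print(tokens)
```

On this sentence the sketch yields `is` and `n't` as separate tokens and a standalone `.`, matching the pattern in the nltk output above. Real tokenizers handle many more cases (quotes, ellipses, abbreviations).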

### 3. Further Operations on words
Unless your goal is just to do simple word counts or frequencies, most language processing tasks will require additional
operations on tokens to provide more context on what the tokens actually represent in the text. Some of the common steps include:
- Part of speech (POS) tagging
- Lemmatization
- Stemming
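
For the simple word-count case mentioned above, a minimal sketch using the standard library's `collections.Counter` could look like this. The sample sentence is taken from the article. The punctuation stripping and lowercasing are simplifying assumptions of mine.

```python
from collections import Counter

sample = "The pain that insects feel is a sensation that is like pain."

# Whitespace tokenization, then strip trailing punctuation and lowercase
tokens = [token.strip(".,").lower() for token in sample.split()]

# Count how often each token occurs
freq = Counter(tokens)
print(freq.most_common(3))
```

Counts like these ignore what role a word plays in the sentence, which is why the steps listed above are usually needed.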

In the following lines, I'll demonstrate what POS-tagged text looks like and also perform a quick lemmatization task. (You can use either stemming or lemmatization to reduce words to a common base form; I prefer to use word lemmas.)

#### Online reference: <a href='https://en.wikipedia.org/wiki/Lemmatisation'>Lemmatization</a> 
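
To illustrate the difference between the two, here is a deliberately crude toy: a suffix-chopping "stemmer" next to a dictionary-lookup "lemmatizer". Neither is a real algorithm. The suffix list, the function names and the tiny lemma table (copied from the spacy output further down in this notebook) are assumptions for illustration only.

```python
# Toy suffix list for illustration; real stemmers (e.g. Porter) use many more rules
SUFFIXES = ("ers", "ing", "ed", "s")

# Tiny lemma table copied from the lemmas shown in this notebook's spacy output
LEMMAS = {"shows": "show", "insects": "insect", "researchers": "researcher"}


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["shows", "insects", "researchers"]:
    print(w, crude_stem(w), lookup_lemma(w))
```

Note how the toy stemmer collapses `researchers` to `research`, while the lemma lookup returns the dictionary form `researcher`. Stemming works on the surface string and lemmatization relies on vocabulary knowledge.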

### Note:
<hr>
We use spacy, an advanced NLP tool, for the demonstration here. We first load the module and initialize the pretrained English
language model. We disable a few of the steps in the processing pipeline since we won't need them at this point.

In [37]:
import pandas as pd # we'll store our results in data frame just for presentation purposes

import spacy
# 'en' is a spaCy 2.x shortcut name; spaCy 3+ needs the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en',disable=['ner','parser'])

In [39]:
# we first create a spacy doc that will automatically parse our text and do a lot for us behind the scenes

doc = nlp(raw_string)
token_info = [(token.text,token.lemma_,token.pos_) for token in doc]

token_df = pd.DataFrame(token_info,columns=['original','lemma','Part of Speech'])
token_df.head(10)

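|   | original | lemma | Part of Speech |
|---|---|---|---|
| 0 | New | new | ADJ |
| 1 | research | research | NOUN |
| 2 | shows | show | VERB |
| 3 | that | that | ADP |
| 4 | insects | insect | NOUN |
| 5 | feel | feel | VERB |
| 6 | pain | pain | NOUN |
| 7 | . | . | PUNCT |
| 8 | The | the | DET |
| 9 | researchers | researcher | NOUN |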
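
Once each token carries a lemma and a POS tag, filtering becomes straightforward. The sketch below uses plain Python on the ten rows shown in the preview above (hard-coded here so it runs without spacy). The choice of which tags count as "content words" is my own assumption.

```python
# (original, lemma, part of speech) rows copied from the preview above
token_info = [
    ("New", "new", "ADJ"),
    ("research", "research", "NOUN"),
    ("shows", "show", "VERB"),
    ("that", "that", "ADP"),
    ("insects", "insect", "NOUN"),
    ("feel", "feel", "VERB"),
    ("pain", "pain", "NOUN"),
    (".", ".", "PUNCT"),
    ("The", "the", "DET"),
    ("researchers", "researcher", "NOUN"),
]

# Keep only the lemmas of nouns
noun_lemmas = [lemma for _, lemma, pos in token_info if pos == "NOUN"]

# Keep lemmas of "content words" (assumed here to be nouns, verbs and adjectives)
content_tags = {"NOUN", "VERB", "ADJ"}
content_lemmas = [lemma for _, lemma, pos in token_info if pos in content_tags]

print(noun_lemmas)
print(content_lemmas)
```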


In [43]:
# save dataframe output
token_df.to_csv('token_df.csv')

# save text
## I like using json format to structure data for storing. There are many options for saving your text though

data ={'title':'insects can feel pain',
     'source':'https://breakingnewsenglish.com/1907/190716-insect-pain.html',
     'content':raw_string}
with open('text.json','w') as out:
    json.dump(data,out)
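
As a quick sanity check on the JSON approach, the following sketch writes a small record and reads it back. It uses a temporary directory so it does not overwrite `text.json`, and the `content` value is shortened for illustration.

```python
import json
import os
import tempfile

data = {
    "title": "insects can feel pain",
    "source": "https://breakingnewsenglish.com/1907/190716-insect-pain.html",
    "content": "New research shows that insects feel pain.",  # shortened sample
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text.json")

    # Write the record to disk
    with open(path, "w", encoding="utf-8") as out:
        json.dump(data, out)

    # Read it back and confirm nothing was lost
    with open(path, "r", encoding="utf-8") as inf:
        restored = json.load(inf)

print(restored == data)
```

Keeping the title and source next to the content means the metadata travels with the text.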
    

### Conclusion

In this notebook I demonstrate how to kickstart an NLP task by understanding what makes up a large body of text. I also introduce some advanced tools to help you perform simple to advanced language processing tasks. In future notebooks, I'll show how to perform additional text preprocessing and how to turn a piece of text into a neat body of knowledge.