# Natural Language Processing (NLP) Prerequisites
___
In this notebook, we will discuss some of the fundamental (or prerequisite) knowledge needed to be successful in NLP. 

Specifically, we'll be talking about the following:
- [Natural Language Toolkit (NLTK)](#Natural-Language-Toolkit-(NLTK))
- [Reading Text Data](#Reading-Text-Data)
- [Understanding the Data](#Understanding-the-Data)
- [Regular Expressions (RegEx)](#Regular-Expression-(RegEx))
- [Stemming and Lemmatizing](#Stemming-and-Lemmatizing)

We finish the notebook off with a review of everything discussed, as well as instructions for the assignment associated with the material covered. 
- [Review](#Review)

> ### Natural Language Toolkit (NLTK)
> ___
> **NLTK** is a suite of open-source tools created to make NLP in Python easier. For installation instructions, see https://www.nltk.org/install.html.
>
> Below, we'll experiment with NLTK a bit. 

In [None]:
# To download packages/corpora:
import nltk
nltk.download()

In [None]:
# To see what you have installed:
dir(nltk)

In [None]:
# To see stopwords (common words that don't have much meaning in the sentence):
from nltk.corpus import stopwords
stopwords.words('english')[0:30]

There is plenty more, so feel free to experiment on your own!

> ### Reading Text Data
> ___
> Without data, there is nothing to process. In this section, we focus on how to load data.
>
> We also take a look at **Pandas** and see how we can load data into dataframes.

In [None]:
# Read in file (the with-block closes the file handle once reading is done):
with open('filename.txt') as f:
    file = f.read()

# Print some of the data in the file:
file[0:50]

Imagine each line in the file has a label followed by a sentence, separated by a tab. We can split the file into a flat list of alternating labels and sentences, which is easier to work with:

In [None]:
split_file = file.replace('\t','\n').split('\n')
split_file[0:4]

We separate the labels and sentences as follows:

In [None]:
labels = split_file[0::2]
sentences = split_file[1::2]
print('Labels:\n',labels)
print('Sentences:\n',sentences)
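
The `replace`/`split` approach above assumes exactly one tab per line and no trailing newline; otherwise the even/odd slicing can pick up stray or misaligned entries. A slightly more defensive sketch splits each line on its first tab only. The sample text and the name `raw_text` below are illustrative stand-ins for the contents of `filename.txt`, not data from this notebook.

```python
# Hypothetical stand-in for the contents of 'filename.txt':
# one "label<TAB>sentence" pair per line, with a trailing newline.
raw_text = (
    "confidential\tThe merger closes on Friday.\n"
    "non-confidential\tLunch is at noon.\n"
    "confidential\tPasswords rotate\tevery 90 days.\n"
)

labels, sentences = [], []
for line in raw_text.splitlines():
    if not line.strip():
        continue  # skip blank lines instead of letting them shift the pairing
    # maxsplit=1 splits on the first tab only, so tabs inside a sentence survive
    label, sentence = line.split('\t', 1)
    labels.append(label)
    sentences.append(sentence)

print('Labels:\n', labels)
print('Sentences:\n', sentences)
```

Splitting line by line keeps each label paired with its own sentence even when a sentence happens to contain a tab.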

We will be using Pandas dataframes to work with a corpus (a collection of texts). For more information on Pandas, see LINK. 

Here, we'll create a Pandas dataframe from the labels and sentences and preview its first few rows. 

In [None]:
import pandas as pd
corpus = pd.DataFrame({
    'Labels':labels,
    'Sentences':sentences
})
corpus.head()

Note that Pandas can load a tab-delimited file (such as the one described above) directly, so let's see how we can do that. 

In [None]:
corpus = pd.read_csv('filename.txt', sep='\t', header=None)
corpus.columns = ['Labels','Sentences']
corpus.head()
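
To try this without a file on disk, the same call can be pointed at an in-memory buffer. The following is a minimal sketch; the sample rows and the name `corpus_demo` are made up for illustration. It also uses the `names` parameter of `read_csv` to set the column labels at load time instead of assigning `columns` afterwards.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'filename.txt' so the cell runs without a file on disk.
sample = io.StringIO(
    "confidential\tThe merger closes on Friday.\n"
    "non-confidential\tLunch is at noon.\n"
)

# header=None tells pandas the first line is data rather than a header row;
# names= assigns the column labels while loading.
corpus_demo = pd.read_csv(sample, sep='\t', header=None, names=['Labels', 'Sentences'])
print(corpus_demo.head())
```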

> ### Understanding the Data
> ___
> Now that we've loaded our data, we want to take a closer look at it. 
> 
> How is the data shaped/structured? When working with labeled data, how many samples correspond to each label? Are there more samples of one label than of the other label(s)? Are we missing any data? 
>
> Let's dive in. 

We start by looking at the shape of our corpus, printing out the number of samples (rows) and the number of features (columns) of our data.

In [None]:
print('Our corpus has {} samples with {} features.'.format(len(corpus),len(corpus.columns)))

Next, we look to see how many samples (rows) correspond to each label (the first feature of our data). Assume the data we are working with has the labels 'confidential' and 'non-confidential'. 

In [None]:
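# Sum the boolean mask to count matching rows; len() of the mask would return the total row count.
print('There are {} samples labeled confidential and {} samples labeled non-confidential'.format(
    (corpus['Labels'] == 'confidential').sum(),
    (corpus['Labels'] == 'non-confidential').sum()))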
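
Counting each label by hand works for two labels but does not scale to many. As an alternative, pandas' `value_counts` tallies every distinct label in one call. The sketch below uses a made-up toy dataframe (`toy`), not the notebook's data.

```python
import pandas as pd

# Hypothetical toy corpus, not the notebook's real data.
toy = pd.DataFrame({
    'Labels': ['confidential', 'non-confidential', 'confidential', None],
    'Sentences': ['a', 'b', 'c', 'd'],
})

# Absolute count per label; rows with a missing label are excluded by default.
counts = toy['Labels'].value_counts()
# Share of each label among the labeled rows, useful for spotting imbalance.
shares = toy['Labels'].value_counts(normalize=True)

print(counts)
print(shares)
```

With `normalize=True` the result is a fraction per label, which makes skew between classes easy to read off at a glance.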

Now, let's see how many (if any) samples are missing data. 

In [None]:
print('# of missing labels: {}'.format(corpus['Labels'].isnull().sum()))
print('# of missing sentences: {}'.format(corpus['Sentences'].isnull().sum()))

Look at the outputs above. Was your data skewed? Was there missing data? We will want to keep that in mind moving forward.  

> ### Regular Expression (RegEx)
> ___
> **Regular expressions** are text strings that describe a search pattern. We will use regular expressions to simplify our data searches. 
> 
> Regular expressions will prove to be very helpful when dealing with unstructured data. In those cases, we will have to search for specific patterns to try to extract meaning from the data. 
>
> Here are a few tasks that regular expressions can help us with: 
> - identifying whitespace between words/tokens
> - removing punctuation from text
> - cleaning HTML tags from text
> - confirming text meets specific criteria (such as password rules)
> - and many more! 
> 
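> Some commonly used regex patterns are shown below: 
>
> | Pattern | Matches |
> | --- | --- |
> | `\s` | a single whitespace character (space, tab, newline) |
> | `\S` | a single non-whitespace character |
> | `\w` | a single word character (letter, digit, or underscore) |
> | `\W` | a single non-word character |
> | `\d` | a single digit |
> | `+` | one or more repetitions of the preceding pattern |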
> 
> For a more extensive cheat sheet, see #. 
> 
> Below, we will use regex to **tokenize** strings. **Tokenization** refers to separating text into smaller units called tokens, which can be words, characters, or subwords.

To use regex in Python, we make use of the built-in `re` module. 

Let's start by importing it and defining a few strings to practice on. 

In [10]:
import re
string1 = 'I am excited to be learning natural language processing'
string2 = 'I        am excited to be         learning         natural language processing'
string3 = 'I|am|excited|to|be|learning|>>>>>|natural|language|processing'

What if we wanted to split the strings up by whitespace? We can use the `re.split` method, using `'\s'` to split the string on a whitespace character (recall the table above). 

In [11]:
# Raw strings (r'...') pass the backslash through to the regex engine unchanged.
print(re.split(r'\s', string1),'\n')
print(re.split(r'\s', string2),'\n')
print(re.split(r'\s', string3),'\n')

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 

['I', '', '', '', '', '', '', '', 'am', 'excited', 'to', 'be', '', '', '', '', '', '', '', '', 'learning', '', '', '', '', '', '', '', '', 'natural', 'language', 'processing'] 

['I|am|excited|to|be|learning|>>>>>|natural|language|processing'] 



We see that splitting by whitespace only helped us for `string1`. What regular expression would help us with `string2` and `string3`? 

Well, `string2` has multiple whitespaces in succession, so perhaps `'\s+'` would handle those. And `string3` has non-alphanumeric characters, so perhaps `'\W+'` would handle those. Let's try this out. 

In [19]:
print(re.split(r'\s', string1),'\n')
print(re.split(r'\s+', string2),'\n')
print(re.split(r'\W+', string3),'\n')

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 



Instead of splitting the strings, let's try using the `re.findall` method to obtain the same results. Whereas `re.split` takes a pattern describing the separators, `re.findall` takes a pattern describing the tokens themselves, so we need the complementary (negated) character classes.

For example, we want to find all runs of non-whitespace characters in `string1` and `string2` (note that even `string1` needs a `+` here; without it, `\S` would match one character at a time). Similarly, we want to find all runs of alphanumeric characters in `string3`. 

In [17]:
print(re.findall(r'\S+', string1),'\n')
print(re.findall(r'\S+', string2),'\n')
print(re.findall(r'\w+', string3),'\n')

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 

['I', 'am', 'excited', 'to', 'be', 'learning', 'natural', 'language', 'processing'] 



Awesome! We just learned how to **tokenize** strings using the methods `re.split()` and `re.findall()`. 
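
The two methods gave identical results on our strings, but they can differ at the edges of a string. The short sketch below, using a made-up string `padded` with leading and trailing spaces, illustrates the difference.

```python
import re

# Hypothetical string with leading and trailing spaces.
padded = '   natural   language processing  '

# Splitting on separators leaves an empty string wherever the text
# starts or ends with a separator.
split_tokens = re.split(r'\s+', padded)
print(split_tokens)

# Matching the tokens themselves never yields empty strings.
found_tokens = re.findall(r'\S+', padded)
print(found_tokens)
```

If you use `re.split` on text that may be padded, calling `.strip()` on the string first (or filtering out empty strings afterwards) avoids the empty tokens.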

There are plenty of other regex methods that will prove to be useful, such as:
- `re.sub()`
- `re.search()`
- `re.match()`
- `re.fullmatch()`
- `re.finditer()`
- `re.escape()`
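
As a quick tour of the methods listed above, here is a minimal sketch that applies each one to a made-up sentence. The string `text` and the phone-number pattern are illustrative assumptions, not part of the notebook's data.

```python
import re

# Hypothetical example sentence.
text = 'Call 555-1234 or 555-9876 (ext. 42) today!'

# re.sub: replace every match; here, drop anything that is neither a word
# character nor whitespace, which removes punctuation.
no_punct = re.sub(r'[^\w\s]', '', text)

# re.search: first match anywhere in the string (returns None if there is none).
first_number = re.search(r'\d{3}-\d{4}', text).group()

# re.match: only matches at the start of the string.
starts_with_call = re.match(r'Call', text) is not None

# re.fullmatch: the entire string must fit the pattern.
is_phone = re.fullmatch(r'\d{3}-\d{4}', '555-1234') is not None

# re.finditer: iterate over match objects, which also carry positions.
positions = [(m.group(), m.start()) for m in re.finditer(r'\d{3}-\d{4}', text)]

# re.escape: escape special characters so they are matched literally.
literal_paren = re.findall(re.escape('(ext.'), text)

print(no_punct)
print(first_number, starts_with_call, is_phone)
print(positions)
print(literal_paren)
```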

> ### Stemming and Lemmatizing
> ___

> ### Review
> ___
> 

Before we even start thinking about plugging our data into machine learning models, we typically have to perform these steps on our data first:
1. Remove Punctuation
2. Tokenize
3. Remove Stopwords
4. Lemmatize/Stem
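
The four steps above can be chained into one function. The following is only a rough sketch under stated assumptions: `STOPWORDS` is a tiny hand-picked set standing in for NLTK's English stopword list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer. Both names are hypothetical, and the sketch also lowercases the text, which is not one of the listed steps.

```python
import re

# Assumption: a tiny hand-picked set standing in for
# nltk.corpus.stopwords.words('english').
STOPWORDS = {'i', 'am', 'to', 'be', 'the', 'a', 'an', 'and', 'is'}

def naive_stem(token):
    """Crude suffix stripping; a placeholder, NOT a real stemming algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text.lower())         # 1. remove punctuation
    tokens = re.findall(r'\S+', text)                    # 2. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. remove stopwords
    return [naive_stem(t) for t in tokens]               # 4. stem (naively)

result = preprocess('I am excited to be learning natural language processing!')
print(result)
```

In practice, the placeholder pieces would be swapped for NLTK's stopword list and one of its stemmers or lemmatizers.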

Assignment #1 can be used to check that we know how to do all of these things before moving on. Give it a go, making sure to compare your solution with the one given ONLY AFTER you've completed the assignment on your own. 