# Introduction to NLP in Python
## Q.1: NLP Basics for Text Preprocessing

### Tokenization

Tokenizers divide strings into lists of substrings. After installing the `nltk` library, import it along with two of its built-in functions, `sent_tokenize` and `word_tokenize`.

In [1]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

1. `sent_tokenize`

The first function, `sent_tokenize`, splits the given text into sentences. This is especially useful when you are working with large chunks of text made up of long sentences.

We will use the following sample paragraph about what makes a good data analyst. Run the cell below to see the output.

In [2]:
text = "To get hired for a tech product startup, we all know just doing reporting alone won't distinguish a potential data analyst, a good data analyst is one who has an absolute passion for data. He/she has a strong understanding of the business/product you are running, and will be always seeking meaningful insights to help the team make better decisions"
sent_tokenize(text)

["To get hired for a tech product startup, we all know just doing reporting alone won't distinguish a potential data analyst, a good data analyst is one who has an absolute passion for data.",
 'He/she has a strong understanding of the business/product you are running, and will be always seeking meaningful insights to help the team make better decisions']

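If you encounter a "Resource punkt not found" error when running the cell above, run `nltk.download('punkt')` and then re-run the cell. Some newer NLTK releases report `punkt_tab` as the missing resource instead; in that case, download the name shown in the error message.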
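
One way to avoid hitting this error partway through the notebook is to check for the resource first and download it only when it is missing. The sketch below is a minimal illustration and not part of the original quest: the helper name `ensure_nltk_resource` and its `downloader` parameter are hypothetical, while `nltk.data.find` and `nltk.download` are the library calls it relies on.

```python
import nltk


def ensure_nltk_resource(resource_path, package_name, downloader=nltk.download):
    """Download `package_name` only if `resource_path` is not installed yet.

    Returns True when a download was triggered, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(resource_path)
        return False
    except LookupError:
        downloader(package_name)
        return True
```

Calling `ensure_nltk_resource("tokenizers/punkt", "punkt")` before the tokenizer cell would then fetch the model only on the first run. The `downloader` parameter is there only so the helper can be tried out without network access.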
<br/><br/>

2. `word_tokenize`

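Likewise, the `word_tokenize` function splits the paragraph into individual word and punctuation tokens. Notice in the output that contractions are split as well: `won't` becomes `wo` and `n't`. Run the cell below to compare the outputs.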

In [3]:
word_tokenize(text)

['To',
 'get',
 'hired',
 'for',
 'a',
 'tech',
 'product',
 'startup',
 ',',
 'we',
 'all',
 'know',
 'just',
 'doing',
 'reporting',
 'alone',
 'wo',
 "n't",
 'distinguish',
 'a',
 'potential',
 'data',
 'analyst',
 ',',
 'a',
 'good',
 'data',
 'analyst',
 'is',
 'one',
 'who',
 'has',
 'an',
 'absolute',
 'passion',
 'for',
 'data',
 '.',
 'He/she',
 'has',
 'a',
 'strong',
 'understanding',
 'of',
 'the',
 'business/product',
 'you',
 'are',
 'running',
 ',',
 'and',
 'will',
 'be',
 'always',
 'seeking',
 'meaningful',
 'insights',
 'to',
 'help',
 'the',
 'team',
 'make',
 'better',
 'decisions']

Feel free to experiment by passing different sentences and pieces of text through each tokenizer.

The nltk library includes many more tokenizers, each suited to producing a different kind of token depending on the data you need. You can learn more about them in the nltk documentation [here](https://www.nltk.org/api/nltk.tokenize.html).
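
As a small illustration of how the choice of tokenizer changes the result, the sketch below runs two other classes from `nltk.tokenize` on a short made-up sentence (the `snippet` string is an assumption for this example, not quest data). Both are pattern-based, so they should not need the `punkt` download.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

snippet = "He/she won't stop seeking insights."

# Keep only runs of word characters, which drops punctuation entirely
words_only = RegexpTokenizer(r"\w+").tokenize(snippet)

# Split on whitespace only, which leaves punctuation attached to words
by_whitespace = WhitespaceTokenizer().tokenize(snippet)

print(words_only)     # ['He', 'she', 'won', 't', 'stop', 'seeking', 'insights']
print(by_whitespace)  # ['He/she', "won't", 'stop', 'seeking', 'insights.']
```

Compare these with the `word_tokenize` output, which keeps `He/she` together and emits punctuation as separate tokens.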

Return to the StackUp platform, where we will continue with the quest.

<br/><br/>

### Removing stop words

Stop words are common words that add little meaning to a text. English stop words include conjunctions such as *for*, *and*, *but*, *or*, *yet*, and *so*, and articles such as *a*, *an*, and *the*.

NLTK ships with a predefined list of English stop words. Let's import it by running the cell below.

In [4]:
# nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a separate name so the imported `stopwords` corpus reader is not overwritten
stop_words = set(stopwords.words('english'))

The set `stop_words` now contains the NLTK predefined stop words. Using the tokenized text from earlier, let's remove the stop words and return the remaining tokens.

In [5]:
tokens = word_tokenize(text)
tokens_no_stopwords = [token for token in tokens if token not in stop_words]
print(tokens_no_stopwords)

['To', 'get', 'hired', 'tech', 'product', 'startup', ',', 'know', 'reporting', 'alone', 'wo', "n't", 'distinguish', 'potential', 'data', 'analyst', ',', 'good', 'data', 'analyst', 'one', 'absolute', 'passion', 'data', '.', 'He/she', 'strong', 'understanding', 'business/product', 'running', ',', 'always', 'seeking', 'meaningful', 'insights', 'help', 'team', 'make', 'better', 'decisions']
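
Notice that `'To'` survives in the output above: the NLTK list is lowercase and the membership test is case-sensitive. Punctuation tokens such as `','` also remain, because they are not stop words. The sketch below shows one possible way to handle both cases. It uses a small hand-written stop word set and token list (both assumptions for illustration, so it runs without any NLTK download), and the helper `is_punctuation` is hypothetical.

```python
import string

# Illustrative subset only; NLTK's English list is much longer
demo_stop_words = {"to", "for", "a", "we", "all", "just"}
demo_tokens = ["To", "get", "hired", "for", "a", "tech", "product", "startup", ","]


def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return all(char in string.punctuation for char in token)


cleaned = [
    token
    for token in demo_tokens
    # Lowercase before the lookup so 'To' matches 'to'
    if token.lower() not in demo_stop_words and not is_punctuation(token)
]
print(cleaned)  # ['get', 'hired', 'tech', 'product', 'startup']
```

Whether to drop punctuation depends on the task, so treat this as one option rather than a required step.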


<br/><br/>

### Stemming and Lemmatization

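Here, we will experiment with the `PorterStemmer` and the `WordNetLemmatizer`. Recall from the quest that stemming strips suffixes using fixed rules, so the result is not always a real word, while lemmatization reduces a word to its dictionary form, ideally taking into account the role the word plays in the sentence. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless you pass a part of speech through its `pos` argument.

Experiment with different words to compare the outputs produced by a stemmer and a lemmatizer!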

In [6]:
# Run these lines if the resources have not been downloaded yet.
# Once downloaded, you can comment the lines out again.
# nltk.download('wordnet')
# nltk.download('omw-1.4')

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()
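
Before moving on to plurals, it can help to see the rule-based nature of stemming on a few inflected words. The snippet below is a minimal sketch using only `PorterStemmer`, which needs no extra downloads; the word list and the names `porter` and `stems` are chosen here for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Map each word to its stem so the pairs are easy to inspect
stems = {word: porter.stem(word) for word in ["running", "studies", "decisions"]}

for word, stem in stems.items():
    print(word, "->", stem)
```

Some stems are real words (`running` becomes `run`), while others are not (`studies` becomes `studi`). That is expected: the stemmer strips suffixes without consulting a dictionary.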

Let's test both methods on various pluralised words.

In [7]:
plurals = ['apples', 'octopuses', 'categories', 'criteria', 'tomatoes', 'matrices', 'hypotheses', 'radii', 'algae', 'cacti']

plurals_stem = [stemmer.stem(plural) for plural in plurals]
plurals_lemma = [lemma.lemmatize(plural) for plural in plurals]

print("Stemming results: ", plurals_stem)
print("Lemmatization results: ", plurals_lemma)

Stemming results:  ['appl', 'octopus', 'categori', 'criteria', 'tomato', 'matric', 'hypothes', 'radii', 'alga', 'cacti']
Lemmatization results:  ['apple', 'octopus', 'category', 'criterion', 'tomato', 'matrix', 'hypothesis', 'radius', 'algae', 'cactus']
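
The lemmatizer handled these plurals well because `lemmatize` treats its input as a noun by default. For verbs or adjectives, the part of speech has to be passed explicitly through the `pos` argument. The sketch below illustrates the difference; the helper `noun_and_verb_lemmas` and the name `wnl_demo` are hypothetical, and the code assumes the `wordnet` resource has already been downloaded.

```python
from nltk.stem import WordNetLemmatizer

wnl_demo = WordNetLemmatizer()


def noun_and_verb_lemmas(word):
    """Return the lemma of `word` read as a noun and as a verb.

    (Hypothetical helper, for illustration only.)
    """
    as_noun = wnl_demo.lemmatize(word, pos="n")  # same as the default behaviour
    as_verb = wnl_demo.lemmatize(word, pos="v")
    return as_noun, as_verb
```

With the WordNet data available, `noun_and_verb_lemmas("running")` returns `('running', 'run')`: read as a noun the word is already in its base form, while read as a verb it reduces to `run`.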


In [8]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Define sample text
sample = "In the healthcare industry, NLP is used to analyze large amounts of healthcare-related data. This includes clinical notes and medical imaging reports, and many more. With the help of NLP, healthcare providers can quickly and accurately identify patterns and insights from patient data. For example, NLP can predict patient outcomes, such as the likelihood of readmission or the risk of developing a particular condition. NLP can also be used to extract key information from medical imaging reports. An example can be the size and location of tumours. This can help healthcare providers make more informed treatment decisions. Overall, NLP is a powerful tool that can help improve patient outcomes and enhance the quality of care in the healthcare industry."

# Tokenize into sentences and words
sentences = sent_tokenize(sample)
words = word_tokenize(sample)

# Remove stopwords (build the set once so each lookup is fast)
english_stopwords = set(stopwords.words("english"))
stopwords_removed = [word for word in words if word.lower() not in english_stopwords]

# Perform stemming and lemmatization
ps = PorterStemmer()
wnl = WordNetLemmatizer()
sample_stem = [ps.stem(word) for word in stopwords_removed]
sample_lemma = [wnl.lemmatize(word) for word in stopwords_removed]

# Print the results
print(sample, "\n")
print(sentences, "\n")
print(words, "\n")
print(stopwords_removed, "\n")
print("Stemming results: ", sample_stem, "\n")
print("Lemmatization results: ", sample_lemma, "\n")


In the healthcare industry, NLP is used to analyze large amounts of healthcare-related data. This includes clinical notes and medical imaging reports, and many more. With the help of NLP, healthcare providers can quickly and accurately identify patterns and insights from patient data. For example, NLP can predict patient outcomes, such as the likelihood of readmission or the risk of developing a particular condition. NLP can also be used to extract key information from medical imaging reports. An example can be the size and location of tumours. This can help healthcare providers make more informed treatment decisions. Overall, NLP is a powerful tool that can help improve patient outcomes and enhance the quality of care in the healthcare industry. 

['In the healthcare industry, NLP is used to analyze large amounts of healthcare-related data.', 'This includes clinical notes and medical imaging reports, and many more.', 'With the help of NLP, healthcare providers can quickly and accurate