# Natural Language Processing (NLP)
---

- Natural language processing (NLP) refers to the branch of computer science
concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. 
- NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models.
- NLTK, or Natural Language Toolkit, is a Python package that can be used for NLP.

## Stemming

---

- Stemming is the process of reducing inflected or derived words to a common root/base form (the stem), so that morphological variants such as "sciences" and "science" map to the same token.
- Programs that perform stemming are commonly called stemming algorithms or stemmers.
- Stemming is a common preprocessing step in a natural language processing pipeline.
- The input to a stemmer is a list of tokenized words. Some stemming
algorithms are:

### Porter’s Stemmer

The Porter stemmer is based on the idea that suffixes in the English language are built
from combinations of smaller and simpler suffixes, which it strips in a series of steps.
It is known for its speed and simplicity. Its main applications include data mining and
information retrieval. However, it is limited to English words.
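
As a rough illustration, the snippet below runs NLTK's `PorterStemmer` on a few words taken from the sample paragraph used later in this notebook. The word list is chosen only for demonstration; the resulting stems match the ones that appear in this notebook's Porter output.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# words picked from the sample paragraph, for illustration only
sample_words = ["sciences", "chemistry", "observations", "experiments"]

# stem() lowercases its input and strips suffixes
porter_stems = [porter.stem(w) for w in sample_words]
print(porter_stems)
```

Note that stems such as `scienc` or `chemistri` are not dictionary words; a stemmer only guarantees that related word forms collapse to the same string.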

### Snowball Stemmer

Unlike the Porter stemmer, the Snowball stemmer also supports languages other than
English, so it can be called a multilingual stemmer. Like the Porter stemmer, it can be
imported from the NLTK package.
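
A minimal sketch of the multilingual support, assuming NLTK is installed: `SnowballStemmer.languages` lists the supported languages, and a stemmer is created for one language at a time. The example words are chosen only for illustration.

```python
from nltk.stem import SnowballStemmer

# languages supported by NLTK's Snowball implementation
print(SnowballStemmer.languages)

# one stemmer instance per language
english_stemmer = SnowballStemmer("english")
german_stemmer = SnowballStemmer("german")

english_stem = english_stemmer.stem("running")
german_stem = german_stemmer.stem("Autobahnen")
print(english_stem, german_stem)
```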

In [11]:
# importing libraries
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import string

In [12]:
# download data for nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/totoro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/totoro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
paragraph = """Science is what we do to find out about the natural world. Natural sciences include, chemistry, biology, geology, astronomy, and physics. Science uses mathematics and logic, which are sometimes called "formal sciences". Natural science makes observations and experiments. Science produces accurate facts, scientific laws and theories. 'Science' also refers to the large amount of knowledge that has been found using this process. Research uses the scientific method. Scientific research uses hypotheses based on ideas or earlier knowledge, which can be categorized through different topics. Then those hypotheses are tested by experiments. People who study and research science and try to find out everything about it are called scientists. Scientists study things by looking at them very carefully, by measuring them, and by doing experiments and tests. Scientists try to explain why things act the way they do, and predict what will happen."""

## Stemming using PorterStemmer

In [14]:
# split the paragraph into sentences before stemming
sentences = nltk.sent_tokenize(paragraph)
sentences

['Science is what we do to find out about the natural world.',
 'Natural sciences include, chemistry, biology, geology, astronomy, and physics.',
 'Science uses mathematics and logic, which are sometimes called "formal sciences".',
 'Natural science makes observations and experiments.',
 'Science produces accurate facts, scientific laws and theories.',
 "'Science' also refers to the large amount of knowledge that has been found using this process.",
 'Research uses the scientific method.',
 'Scientific research uses hypotheses based on ideas or earlier knowledge, which can be categorized through different topics.',
 'Then those hypotheses are tested by experiments.',
 'People who study and research science and try to find out everything about it are called scientists.',
 'Scientists study things by looking at them very carefully, by measuring them, and by doing experiments and tests.',
 'Scientists try to explain why things act the way they do, and predict what will happen.']

In [15]:
stemmer = PorterStemmer()

In [16]:
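# build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # drop stop words (case-sensitive match), then stem the remaining tokens
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)
# output of porter stemmer
sentences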

['scienc find natur world .',
 'natur scienc includ , chemistri , biolog , geolog , astronomi , physic .',
 "scienc use mathemat logic , sometim call `` formal scienc '' .",
 'natur scienc make observ experi .',
 'scienc produc accur fact , scientif law theori .',
 "'scienc ' also refer larg amount knowledg found use process .",
 'research use scientif method .',
 'scientif research use hypothes base idea earlier knowledg , categor differ topic .',
 'then hypothes test experi .',
 'peopl studi research scienc tri find everyth call scientist .',
 'scientist studi thing look care , measur , experi test .',
 'scientist tri explain thing act way , predict happen .']
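
In the Porter output above, "then" survives in the ninth sentence even though it is an English stop word: the membership test is case-sensitive, and the token is still "Then" when it is checked. A minimal sketch of a case-insensitive filter follows. `STOP_WORDS` is a tiny hand-picked stand-in for `stopwords.words('english')`, and `remove_stop_words` is a hypothetical helper, not part of NLTK.

```python
# tiny hand-picked stand-in for stopwords.words('english') (assumption for this sketch)
STOP_WORDS = {"then", "those", "are", "by"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word (hypothetical helper)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["Then", "those", "hypotheses", "are", "tested", "by", "experiments", "."]
kept = remove_stop_words(tokens)
print(kept)
```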

## Stemming using SnowballStemmer

In [17]:
stemmer2 = SnowballStemmer('english')

In [18]:
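# note: `sentences` already holds the Porter-stemmed text from the loop above,
# so the Snowball stemmer runs on Porter output here, not on the original sentences
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  words = [stemmer2.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)
# output using snowball stemmer
sentences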

['scienc find natur world .',
 'natur scienc includ , chemistri , biolog , geolog , astronomi , physic .',
 'scienc use mathemat logic , sometim call `` formal scienc `` .',
 'natur scienc make observ experi .',
 'scienc produc accur fact , scientif law theori .',
 "scienc ' also refer larg amount knowledg found use process .",
 'research use scientif method .',
 'scientif research use hypoth base idea earlier knowledg , categor differ topic .',
 'hypoth test experi .',
 'peopl studi research scienc tri find everyth call scientist .',
 'scientist studi thing look care , measur , experi test .',
 'scientist tri explain thing act way , predict happen .']
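
Because the loops above overwrite `sentences` in place, the Snowball pass receives text that Porter has already stemmed. One way to compare the two stemmers on the same input is to return new lists and leave the originals alone. The sketch below assumes a hypothetical helper `stem_sentences` and uses `str.split` in place of `nltk.word_tokenize` so that it runs without downloaded data; stop-word removal is omitted for brevity.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

def stem_sentences(raw_sentences, stemmer):
    """Return a new list of stemmed sentences; the input list is not modified."""
    return [
        " ".join(stemmer.stem(word) for word in sentence.split())
        for sentence in raw_sentences
    ]

# sample sentences shortened from the paragraph, for illustration only
raw = ["Natural science makes observations", "Research uses hypotheses"]

porter_out = stem_sentences(raw, PorterStemmer())
snowball_out = stem_sentences(raw, SnowballStemmer("english"))

print(porter_out)
print(snowball_out)
print(raw)  # still the original text
```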

## Removing Punctuation


In [19]:
filtered = []
s = set(string.punctuation)
print(' '.join(s))

" { ` ^ = - + < & / . % $ > ! : ( ~ # ? _ , ' ) [ @ \ | } * ; ]


In [20]:
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])

  for j in words:
    if j not in s:
      filtered.append(j)
' '.join(filtered)

'scienc find natur world natur scienc includ chemistri biolog geolog astronomi physic scienc use mathemat logic sometim call `` formal scienc `` natur scienc make observ experi scienc produc accur fact scientif law theori scienc also refer larg amount knowledg found use process research use scientif method scientif research use hypoth base idea earlier knowledg categor differ topic hypoth test experi peopl studi research scienc tri find everyth call scientist scientist studi thing look care measur experi test scientist tri explain thing act way predict happen'
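
The joined string above still contains the double-backtick quote tokens produced by the tokenizer, because each token is compared as a whole against a set of single punctuation characters. A minimal sketch of a stricter filter, which drops any token made up entirely of punctuation characters, is shown below. `is_punctuation` is a hypothetical helper, and the token list is a hand-written sample.

```python
import string

PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """True if every character of the token is a punctuation character."""
    return all(ch in PUNCTUATION for ch in token)

# sample tokens in the shape produced by the tokenizer (chosen for illustration)
tokens = ["call", "``", "formal", "scienc", "''", "."]
cleaned = [tok for tok in tokens if not is_punctuation(tok)]
print(cleaned)
```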

In [1]:
t = """Hello, I am ram and I like to study science books."""

In [None]:
# tokenize the sentence once; looping over range(len(t)) would re-tokenize it for every character
words = nltk.word_tokenize(t)
words