In [1]:
%autosave 30

Autosaving every 30 seconds


### How do we make machines identify sentiment in texts and paragraphs?

For that we've got **`NLP (Natural Language Processing)`**.

In [2]:
import nltk

# --> NLTK: a classic toolkit built mostly on hand-written rules and lexicons rather than learned neural models.

In [3]:
## Opens NLTK's interactive downloader, from which all packages can be downloaded at once

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
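
Calling `nltk.download()` with no arguments opens an interactive downloader. If only the resources this notebook relies on are wanted, a sketch like the one below may be enough. The resource names `punkt`, `stopwords`, and `wordnet` are assumptions based on the functions used here (`sent_tokenize`/`word_tokenize`, `stopwords.words`, `WordNetLemmatizer`); newer NLTK releases may additionally ask for `punkt_tab`. The helper name `download_required` is hypothetical.

```python
# Assumed resource names for the tokenizers, stop-word lists and lemmatizer.
REQUIRED_RESOURCES = ["punkt", "stopwords", "wordnet"]


def download_required(resources=REQUIRED_RESOURCES):
    """Download each named NLTK resource; map each name to True/False."""
    import nltk  # imported here so the list above can be inspected without NLTK

    # nltk.download returns True when the resource is available afterwards.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Running `download_required()` once per environment should be sufficient, since NLTK caches downloaded data on disk.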

In [3]:
para = """One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. It has been serialized in Shueisha's shōnen manga magazine Weekly Shōnen Jump since July 1997, with its individual chapters compiled into 104 tankōbon volumes as of November 2022. The story follows the adventures of Monkey D. Luffy, a boy whose body gained the properties of rubber after unintentionally eating a Devil Fruit. With his pirate crew, the Straw Hat Pirates, Luffy explores the Grand Line in search of the deceased King of the Pirates Gol D. Roger's ultimate treasure known as the "One Piece" in order to become the next King of the Pirates.

The manga spawned a media franchise, having been adapted into a festival film produced by Production I.G, and an anime series produced by Toei Animation, which began broadcasting in 1999. Additionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials. Several companies have developed various types of merchandising and media, such as a trading card game and numerous video games. The manga series was licensed for an English language release in North America and the United Kingdom by Viz Media and in Australia by Madman Entertainment. The anime series was licensed by 4Kids Entertainment for an English-language release in North America in 2004 before the license was dropped and subsequently acquired by Funimation in 2007."""

para

'One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. It has been serialized in Shueisha\'s shōnen manga magazine Weekly Shōnen Jump since July 1997, with its individual chapters compiled into 104 tankōbon volumes as of November 2022. The story follows the adventures of Monkey D. Luffy, a boy whose body gained the properties of rubber after unintentionally eating a Devil Fruit. With his pirate crew, the Straw Hat Pirates, Luffy explores the Grand Line in search of the deceased King of the Pirates Gol D. Roger\'s ultimate treasure known as the "One Piece" in order to become the next King of the Pirates.\n\nThe manga spawned a media franchise, having been adapted into a festival film produced by Production I.G, and an anime series produced by Toei Animation, which began broadcasting in 1999. Additionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials. Several companies hav

### # Tokenization: 

**`Tokenization`** is used in **natural language processing** to split text into smaller units (paragraphs into sentences, sentences into words) that can be more easily assigned meaning. It is typically the first step of an NLP pipeline: gather the raw text, then break it into understandable parts.
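
To make the idea concrete, here is a minimal regex-based sketch of both kinds of tokenization. It is not how NLTK works internally; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def simple_word_tokenize(text):
    """Return runs of word characters, or single punctuation marks, as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Luffy explores the Grand Line. He wants the One Piece!"
print(simple_sent_tokenize(sample))
# ['Luffy explores the Grand Line.', 'He wants the One Piece!']
print(simple_word_tokenize("He wants the One Piece!"))
# ['He', 'wants', 'the', 'One', 'Piece', '!']
```

The naive sentence rule breaks on abbreviations: `simple_sent_tokenize("Monkey D. Luffy sails.")` yields `['Monkey D.', 'Luffy sails.']`. Handling such cases is one reason to prefer a dedicated tokenizer like the ones NLTK provides.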

In [4]:
## Paragraph into Sentence tokenization

sentences = nltk.sent_tokenize(para)
sentences

['One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda.',
 "It has been serialized in Shueisha's shōnen manga magazine Weekly Shōnen Jump since July 1997, with its individual chapters compiled into 104 tankōbon volumes as of November 2022.",
 'The story follows the adventures of Monkey D. Luffy, a boy whose body gained the properties of rubber after unintentionally eating a Devil Fruit.',
 'With his pirate crew, the Straw Hat Pirates, Luffy explores the Grand Line in search of the deceased King of the Pirates Gol D. Roger\'s ultimate treasure known as the "One Piece" in order to become the next King of the Pirates.',
 'The manga spawned a media franchise, having been adapted into a festival film produced by Production I.G, and an anime series produced by Toei Animation, which began broadcasting in 1999.',
 'Additionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials.',
 '

In [5]:
## Paragraph into Words tokenization

words = nltk.word_tokenize(para)
print(len(words))
words

258


['One',
 'Piece',
 '(',
 'stylized',
 'in',
 'all',
 'caps',
 ')',
 'is',
 'a',
 'Japanese',
 'manga',
 'series',
 'written',
 'and',
 'illustrated',
 'by',
 'Eiichiro',
 'Oda',
 '.',
 'It',
 'has',
 'been',
 'serialized',
 'in',
 'Shueisha',
 "'s",
 'shōnen',
 'manga',
 'magazine',
 'Weekly',
 'Shōnen',
 'Jump',
 'since',
 'July',
 '1997',
 ',',
 'with',
 'its',
 'individual',
 'chapters',
 'compiled',
 'into',
 '104',
 'tankōbon',
 'volumes',
 'as',
 'of',
 'November',
 '2022',
 '.',
 'The',
 'story',
 'follows',
 'the',
 'adventures',
 'of',
 'Monkey',
 'D.',
 'Luffy',
 ',',
 'a',
 'boy',
 'whose',
 'body',
 'gained',
 'the',
 'properties',
 'of',
 'rubber',
 'after',
 'unintentionally',
 'eating',
 'a',
 'Devil',
 'Fruit',
 '.',
 'With',
 'his',
 'pirate',
 'crew',
 ',',
 'the',
 'Straw',
 'Hat',
 'Pirates',
 ',',
 'Luffy',
 'explores',
 'the',
 'Grand',
 'Line',
 'in',
 'search',
 'of',
 'the',
 'deceased',
 'King',
 'of',
 'the',
 'Pirates',
 'Gol',
 'D.',
 'Roger',
 "'s",
 'ulti

### # Stemming and Lemmatization:

**`Stemming`** is the process of reducing a word to its **stem** by stripping affixes (suffixes and prefixes). Stemming is important in natural language understanding (NLU) and natural language processing (NLP). The stem might or might not be a meaningful word.

**`Lemmatization`** is a **text normalization** technique used in Natural Language Processing (NLP) that reduces a word to its meaningful base form, known as the **"lemma"**. Lemmatization groups the different inflected forms of a word under one root form that has the same meaning.
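
The difference can be illustrated with a toy sketch. This is neither the Porter algorithm nor WordNet: `toy_stem` only chops a few suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary (`LEMMAS`). All names here are hypothetical and chosen only for illustration.

```python
SUFFIXES = ("ies", "ing", "ed", "es", "s")  # checked in this order


def toy_stem(word):
    """Chop the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary mapping inflected forms to their lemmas.
LEMMAS = {"stories": "story", "volumes": "volume", "media": "medium"}


def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word if it is unknown."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ["stories", "volumes", "media"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# stories -> stor | story
# volumes -> volum | volume
# media -> media | medium
```

The stemmer applies rules blindly and can produce non-words such as `stor`, while the lemmatizer depends on a vocabulary and returns dictionary forms.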

In [6]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

**`Stop words`** are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. 

In NLP and text mining applications, stop words are filtered out of the text, allowing applications to focus on the more informative words instead.
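
As a minimal sketch of the filtering step, assuming a tiny hand-written stop list (`STOP_WORDS` and `remove_stop_words` are hypothetical names; NLTK's English list is much longer), the idea looks like this. Stop-word lists are usually lowercase, so the comparison lowercases each token first.

```python
# A tiny stop list for illustration only.
STOP_WORDS = {"the", "is", "and", "a", "of", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "story", "is", "a", "tale", "of", "pirates"]

print(remove_stop_words(tokens))
# ['story', 'tale', 'pirates']

# A case-sensitive check lets the capitalized 'The' slip through.
print([tok for tok in tokens if tok not in STOP_WORDS])
# ['The', 'story', 'tale', 'pirates']
```

Using a `set` keeps each membership test fast, which matters when the filter runs over every token in a large corpus.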

In [11]:
## Stop words in English (not displayed: a cell only shows its last expression)
stopwords.words('english')

## Stop words in German
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [12]:
stemmer = PorterStemmer()
stemmer

<PorterStemmer>

In [13]:
sentences

['One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda.',
 "It has been serialized in Shueisha's shōnen manga magazine Weekly Shōnen Jump since July 1997, with its individual chapters compiled into 104 tankōbon volumes as of November 2022.",
 'The story follows the adventures of Monkey D. Luffy, a boy whose body gained the properties of rubber after unintentionally eating a Devil Fruit.',
 'With his pirate crew, the Straw Hat Pirates, Luffy explores the Grand Line in search of the deceased King of the Pirates Gol D. Roger\'s ultimate treasure known as the "One Piece" in order to become the next King of the Pirates.',
 'The manga spawned a media franchise, having been adapted into a festival film produced by Production I.G, and an anime series produced by Toei Animation, which began broadcasting in 1999.',
 'Additionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials.',
 '

**# Remove `stop words` and then stem each word.**

In [32]:
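## Stemming

# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

stemmed_sentences = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    # The check is case-sensitive, so capitalized words such as 'The' are kept.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    stemmed_sentences.append(' '.join(words))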

In [33]:
stemmed_sentences

['one piec ( styliz cap ) japanes manga seri written illustr eiichiro oda .',
 "it serial shueisha 's shōnen manga magazin weekli shōnen jump sinc juli 1997 , individu chapter compil 104 tankōbon volum novemb 2022 .",
 'the stori follow adventur monkey d. luffi , boy whose bodi gain properti rubber unintent eat devil fruit .',
 "with pirat crew , straw hat pirat , luffi explor grand line search deceas king pirat gol d. roger 's ultim treasur known `` one piec '' order becom next king pirat .",
 'the manga spawn media franchis , adapt festiv film produc product i.g , anim seri produc toei anim , began broadcast 1999 .',
 'addit , toei develop fourteen anim featur film , one origin video anim , thirteen televis special .',
 'sever compani develop variou type merchandis media , trade card game numer video game .',
 'the manga seri licens english languag releas north america unit kingdom viz media australia madman entertain .',
 'the anim seri licens 4kid entertain english-languag releas n

**=>** We can clearly see that some of the stemmed words are not real words, which is expected from a stemmer. To get meaningful base forms, we need to **lemmatize** the words instead.

### The problem with stemming is that it produces an intermediate representation of a word, which may not have any meaning.

**# Remove `stop words` and then lemmatize each word.**

In [34]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
lemma

<WordNetLemmatizer>

In [35]:
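## Lemmatization

# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

lemmatized_sentences = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    # lemmatize() treats every word as a noun unless a pos argument is given.
    words = [lemma.lemmatize(word) for word in words if word not in stop_words]
    lemmatized_sentences.append(' '.join(words))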

In [37]:
lemmatized_sentences

['One Piece ( stylized cap ) Japanese manga series written illustrated Eiichiro Oda .',
 "It serialized Shueisha 's shōnen manga magazine Weekly Shōnen Jump since July 1997 , individual chapter compiled 104 tankōbon volume November 2022 .",
 'The story follows adventure Monkey D. Luffy , boy whose body gained property rubber unintentionally eating Devil Fruit .',
 "With pirate crew , Straw Hat Pirates , Luffy explores Grand Line search deceased King Pirates Gol D. Roger 's ultimate treasure known `` One Piece '' order become next King Pirates .",
 'The manga spawned medium franchise , adapted festival film produced Production I.G , anime series produced Toei Animation , began broadcasting 1999 .',
 'Additionally , Toei developed fourteen animated feature film , one original video animation , thirteen television special .',
 'Several company developed various type merchandising medium , trading card game numerous video game .',
 'The manga series licensed English language release North 

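**=>** Got what we wanted: every token is still a readable word. **Lemmatization prevailed!** Note that `lemmatize` treats each word as a noun by default, so verb forms such as "stylized" and "follows" stay unchanged unless a part-of-speech tag (for example `pos='v'`) is passed.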