# Introduction to Text Mining with NLTK

A short introduction to processing textual data and to some basic applications such as sentiment analysis.

# Basic Setup


Install the nltk library for text processing and download the extensions it requires. We also install the wordcloud library to plot our results as word clouds.

In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

!pip install wordcloud

In [None]:
# we import a series of specific functions from the nltk package for processing the texts.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk import FreqDist

# we import the WordCloud class and the STOPWORDS set from the wordcloud library; matplotlib.pyplot is used for plotting the word cloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# we import pandas for reading in files
import pandas as pd

## Read in the data

In [None]:
corpus = pd.read_csv("https://github.com/casbdai/notebooks2023/raw/main/Module2/Textmining/fake_news.csv")
corpus.head()

We extract the document at index 1 (the second row of the data) and save it as an object named text.

In [None]:
text = corpus["text"][1]
print(text)

## Pre-Processing Textual Data

### Convert text to lower case:

In [None]:
lower_text = text.lower()
print (lower_text)

### Tokenize text

Break the text down into tokens, i.e., split the sentences into single words for analysis.

In [None]:
word_tokens = nltk.word_tokenize(lower_text)
print (word_tokens)

We need a better tokenizer, because punctuation and numbers are also retained as tokens. In addition, very short words are kept as tokens.


In [None]:
better_tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')

# [a-zA-Z] means that only letters are retained as tokens
# {3,} means that only tokens with at least three characters are retained

In [None]:
word_tokens = better_tokenizer.tokenize(lower_text)
print(word_tokens)
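
To see what the pattern keeps and what it drops, here is a minimal sketch using Python's built-in `re` module on a made-up sentence. The sample sentence is an assumption for illustration, and `re.findall` serves as a stand-in that applies the same pattern; it is not the tokenizer object created above.

```python
import re

# Hypothetical sample sentence, not taken from the corpus
sample = "in 2020, the cdc said: stay at home if you are ill!"

# Same pattern as above: runs of letters that are at least three characters long
tokens = re.findall(r'[a-zA-Z]{3,}', sample)
print(tokens)
```

Numbers, punctuation, and the short words "in", "at", and "if" do not match the pattern and are therefore dropped.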

## Remove stop words

Remove irrelevant words such as "is", "the", or "a" using the nltk stop word list, as they carry little information.

In [None]:
stopword = stopwords.words('english')
stopword

To get rid of stop words, we must compare each token against the words in the stop word list. This can easily be done with a list comprehension. List comprehensions are a compact alternative to for loops.

A for loop that prints out every token:

In [None]:
for word in word_tokens:
    print(word)

We can reformulate the for loop as a list comprehension. List comprehensions are considered very readable and are therefore used frequently by Pythonistas.

In [None]:
[word for word in word_tokens]

We extend our list comprehension so that only tokens that are NOT in the stop word list are retained.

In [None]:
clean_tokens = [word for word in word_tokens if word not in stopword]
print (clean_tokens)
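
The filtering step can also be tried in isolation on a toy example. The token list and the three-word stop list below are made up for illustration; converting the stop list to a `set` is an optional choice that makes each membership check faster on long texts.

```python
# Hypothetical tokens and a tiny hand-made stop word list
toy_tokens = ["the", "virus", "spread", "and", "the", "news", "followed"]
toy_stopwords = set(["the", "and", "was"])

# Keep only tokens that are not in the stop word set
toy_clean = [word for word in toy_tokens if word not in toy_stopwords]
print(toy_clean)
```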

## Lemmatization / Stemming

Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be mapped to "walk".

Stemming and lemmatization are two rule-based ways to find the root word.

- Stemming: Trying to shorten a word with simple cutoff rules
- Lemmatization: Trying to find the root word with linguistic rules
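
The difference between the two ideas can be sketched with a toy example. The suffix list and the lookup table below are hypothetical and deliberately tiny; they are NOT the Snowball or WordNet algorithms, only an illustration of "cutoff rules" versus "looking up the root word".

```python
# Toy illustration only: not the Snowball stemmer or the WordNet lemmatizer.
TOY_SUFFIXES = ("ing", "ed", "s")  # hypothetical cutoff rules

def toy_stem(word):
    """Cut off the first matching suffix if at least three letters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary that maps word forms to their root word
TOY_LEMMAS = {"walks": "walk", "walking": "walk", "walked": "walk", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)

words = ["walks", "walking", "walked", "mice"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The cutoff rules handle the regular forms but leave "mice" untouched, whereas the lookup maps it to "mouse". This is the kind of case where lemmatization tends to do better than stemming.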

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [wordnet_lemmatizer.lemmatize(word) for word in clean_tokens]
print (lemmatized_tokens)

In [None]:
snowball_stemmer = SnowballStemmer('english')

stemmed_tokens = [snowball_stemmer.stem(word) for word in lemmatized_tokens]
print(stemmed_tokens)

## Get word frequency

Count the most frequently used words in a text document.

In [None]:
freq = FreqDist(stemmed_tokens)
print (freq.most_common(5))
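
`FreqDist` builds on Python's `collections.Counter`, so the counting logic can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Hypothetical stemmed tokens
toy_tokens = ["news", "virus", "news", "gun", "news", "virus"]

# Count how often each token occurs and take the two most frequent ones
toy_freq = Counter(toy_tokens)
top_two = toy_freq.most_common(2)
print(top_two)
```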

## Create a Word Cloud

As a last step, we can now plot our results. The wordcloud library performs its own basic pre-processing (such as tokenization and stop word removal) under the hood, so we can pass it the raw text.

In [None]:
wordcloud = WordCloud(max_words=25, background_color="white").generate(corpus["text"][1])

In [None]:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# Compare Fake and Real News

## Fake News

In [None]:
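# join the text of all fake articles into one string (str() on a large array would abbreviate it with "...")
fakenews = " ".join(corpus.loc[corpus["label"] == "Fake", "text"].astype(str))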

In [None]:
wordcloud = WordCloud(max_words=25, background_color="white").generate(str(fakenews))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

## Real News

In [None]:
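# join the text of all real articles into one string (str() on a large array would abbreviate it with "...")
realnews = " ".join(corpus.loc[corpus["label"] == "Real", "text"].astype(str))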

wordcloud = WordCloud(max_words=25, background_color="white").generate(str(realnews))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
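
Besides comparing the two word clouds by eye, the word counts of both groups can be compared directly. The following is a minimal sketch with made-up, already pre-processed tokens; the token lists are assumptions and do not come from the corpus.

```python
from collections import Counter

# Hypothetical pre-processed tokens for each group
fake_tokens = ["virus", "hoax", "virus", "secret", "hoax", "virus"]
real_tokens = ["virus", "study", "report", "study", "vaccine"]

fake_counts = Counter(fake_tokens)
real_counts = Counter(real_tokens)

# Words that occur in one group but never in the other
only_fake = sorted(set(fake_counts) - set(real_counts))
only_real = sorted(set(real_counts) - set(fake_counts))

print(fake_counts.most_common(2))
print(only_fake, only_real)
```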

# Very Basic Sentiment Analysis

Using a dictionary of positive and negative words, we can now perform a very basic sentiment analysis.

In [None]:
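# stemmed words that we treat as positive
pos_words = ["correct", "good", "increas"]

# one entry per token that appears in the positive word list
pos_sent = [1 for word in stemmed_tokens if word in pos_words]

sum(pos_sent)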

In [None]:
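# stemmed words that we treat as negative
neg_words = ["virus", "infect", "gun"]

# one entry per token that appears in the negative word list
neg_sent = [1 for word in stemmed_tokens if word in neg_words]

sum(neg_sent)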

In [None]:
sum(pos_sent) - sum(neg_sent)
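
The same calculation can be wrapped in a small reusable function. This is a sketch under assumptions: the function name `sentiment_score` and the toy token list are hypothetical, and the word lists simply mirror the stemmed examples used above.

```python
# Hypothetical word lists, mirroring the stemmed examples used above
POSITIVE = {"correct", "good", "increas"}
NEGATIVE = {"virus", "infect", "gun"}

def sentiment_score(tokens):
    """Return the number of positive tokens minus the number of negative tokens."""
    pos = sum(1 for word in tokens if word in POSITIVE)
    neg = sum(1 for word in tokens if word in NEGATIVE)
    return pos - neg

# Made-up tokens: three positive hits, two negative hits, one neutral word
toy_tokens = ["good", "virus", "correct", "gun", "good", "news"]
score = sentiment_score(toy_tokens)
print(score)
```

A score above zero suggests more positive than negative words, a score below zero the opposite. Because the score is a raw count difference, longer texts tend to produce larger absolute values.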