# Learner/Facilitator Guide to Text Analytics with Python


In [None]:
import nltk
nltk.download('all')

## Topic 2 Text Cleaning and Pre-processing


### Read Text using Standard Python Package

Python offers standard and third-party libraries for reading many types of file into Python variables.

In [None]:
#Read the file using standard Python libraries
with open("Spark-Course-Description.txt", 'r') as fh:  
    filedata = fh.read()
    
#Print first 200 characters in the file
print("Data read from file : ", filedata[0:200] )

### Read Text using NLTK CorpusReader

Read the same text file using a corpus reader.

NLTK provides several CorpusReader classes, one for each kind of data source. Details are available at http://www.nltk.org/howto/corpus.html

In [None]:
#Install nltk first if needed, e.g. run "pip install nltk" from the Anaconda prompt
import nltk
#Download the punkt package, which the tokenizers used below depend on
nltk.download('punkt')

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

#Read the file into a corpus. The same command can read an entire directory
corpus=PlaintextCorpusReader("","Spark-Course-Description.txt")

#Print raw contents of the corpus
print(corpus.raw())

### Explore the Corpus

In [None]:
#Extract the file IDs from the corpus
print("Files in this corpus : ", corpus.fileids())

#Extract paragraphs from the corpus
paragraphs=corpus.paras()
print("\n Total paragraphs in this corpus : ", len(paragraphs))

#Extract sentences from the corpus
sentences=corpus.sents()
print("\n Total sentences in this corpus : ", len(sentences))
print("\n The first sentence : ", sentences[0])

#Extract words from the corpus
print("\n Words in this corpus : ",corpus.words() )


### Analyze the Corpus

The NLTK library provides a number of functions to analyze the distributions and aggregates for data in the corpus.


In [None]:
#Find the frequency distribution of words in the corpus
course_freq_dist=nltk.FreqDist(corpus.words())

#Print most commonly used words
print("Top 10 words in the corpus : ", course_freq_dist.most_common(10))

#find the distribution for a specific word
print("\n Distribution for \"Spark\" : ",course_freq_dist.get("Spark"))
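
`FreqDist` behaves like a dictionary of token counts (in NLTK it subclasses `collections.Counter`), so the idea can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the course file.

```python
from collections import Counter

# Hypothetical tokens standing in for corpus.words()
sample_words = ["Spark", "is", "fast", ".", "Spark", "runs", "on", "clusters", ".", "spark"]

# Count every distinct token; counting is case-sensitive
word_counts = Counter(sample_words)

# Most frequent tokens first
top_two = word_counts.most_common(2)
print("Top 2 tokens :", top_two)

# Lookup for one token; .get() returns None for unseen tokens
print("Count for 'Spark' :", word_counts.get("Spark"))
print("Count for 'Hadoop' :", word_counts.get("Hadoop"))
```

Note that "Spark" and "spark" are counted separately, and that punctuation tokens compete with words for the top spots. This is one reason the cleansing steps later in this topic matter.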

### Tokenization

Tokenization refers to converting a text string into individual tokens. Tokens may be words or punctuation marks.

In [None]:
import nltk
import os


#Read the base file into a raw text variable
base_file = open("Spark-Course-Description.txt", 'rt')
raw_text = base_file.read()
base_file.close()

#Extract tokens
token_list = nltk.word_tokenize(raw_text)
print("Token List : ",token_list[:20])
print("\n Total Tokens : ",len(token_list))
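
To make the idea of "words or punctuation marks" concrete, here is a minimal regular-expression tokenizer. It is a rough sketch and not the algorithm `nltk.word_tokenize` uses; the function name `simple_tokenize` and the sample sentence are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches any single character that is neither of those nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Spark is fast, and it scales."
tokens = simple_tokenize(sample)
print(tokens)
```

A pattern this simple splits contractions such as "isn't" into three tokens, which is one reason to prefer a trained tokenizer for real text.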

### Cleansing Text

The examples below remove punctuation and convert tokens to lower case.

#### Remove Punctuation

In [None]:
#Use Punkt's token class to keep only the non-punctuation tokens
token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))
print("Token List after removing punctuation : ",token_list2[:20])
print("\nTotal tokens after removing punctuation : ", len(token_list2))

#### Convert to Lower Case

In [None]:
token_list3=[word.lower() for word in token_list2 ]
print("Token list after converting to lower case : ", token_list3[:20])
print("\nTotal tokens after converting to lower case : ", len(token_list3))
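
The two cleansing steps can also be sketched without NLTK. The rule used here (keep a token if it contains at least one letter or digit) is an assumption chosen for simplicity and may not match Punkt's notion of punctuation in every case; `is_word_like` and the token list are hypothetical.

```python
def is_word_like(token):
    """True when the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Made-up tokens for illustration
raw_tokens = ["Spark", "is", "FAST", ",", "--", "isn't", "it", "?"]

# Drop tokens made up only of punctuation, then normalise case
cleaned = [tok.lower() for tok in raw_tokens if is_word_like(tok)]
print(cleaned)
```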

### Stop word Removal

Remove stop words using the standard English stop word list that ships with NLTK.

In [None]:
#Download the standard stopword list
nltk.download('stopwords')
from nltk.corpus import stopwords

#Remove stopwords (load the list once into a set instead of once per token)
english_stopwords = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in english_stopwords]
print("Token list after removing stop words : ", token_list4[:20])
print("\nTotal tokens after removing stop words : ", len(token_list4))
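
Stop word removal is a membership test against a fixed word list. The sketch below uses a deliberately tiny, hypothetical stop list (NLTK's English list is much longer) and stores it in a set so each lookup is fast.

```python
# Hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop list."""
    # Set membership is a constant-time lookup on average
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["spark", "is", "a", "fast", "engine", "in", "the", "cloud"]
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

The comparison is case-sensitive, so tokens should be lower-cased before this step, as they are in this notebook.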

### Stemming

In [None]:
#Use the PorterStemmer library for stemming.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#Stem data
token_list5 = [stemmer.stem(word) for word in token_list4 ]
print("Token list after stemming : ", token_list5[:20])
print("\nTotal tokens after Stemming : ", len(token_list5))

### Lemmatization

In [None]:
#Use the wordnet library to map words to their lemmatized form
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]
print("Token list after Lemmatization : ", token_list6[:20])
print("\nTotal tokens after Lemmatization : ", len(token_list6))

#### Comparison of tokens between raw, stemming and lemmatization

In [None]:
#Check for token technologies
print( "Raw : ", token_list4[20]," , Stemmed : ", token_list5[20], " , Lemmatized : ", token_list6[20])
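
The difference between the two approaches can be illustrated with toy versions: a stemmer chops suffixes by rule and may produce non-words, while a lemmatizer looks words up and returns dictionary forms. The rules and the lookup table below are invented for this example; they are not what `PorterStemmer` or `WordNetLemmatizer` do internally.

```python
# Toy rules and a toy dictionary, both invented for this illustration
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]
LEMMA_TABLE = {"studies": "study", "geese": "goose", "running": "run"}

def toy_stem(word):
    """Apply the first matching suffix rule; the result may not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "geese", "running"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```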

### N-gram


In [None]:
#Prepare data for use in this exercise

import nltk
import os
#Download the punkt package, which word_tokenize depends on
nltk.download('punkt')

#Read the base file into a token list
base_file = open("Spark-Course-Description.txt", 'rt')
raw_text = base_file.read()
base_file.close()

#Execute the same pre-processing done in module 3
token_list = nltk.word_tokenize(raw_text)

token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))

token_list3=[word.lower() for word in token_list2 ]

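nltk.download('stopwords')
from nltk.corpus import stopwords
#Load the stop word list once into a set instead of once per token
english_stopwords = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in english_stopwords]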

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]

print("\n Total Tokens : ",len(token_list6))

In [None]:
from nltk.util import ngrams
from collections import Counter

#Find bigrams and print the most common 5
bigrams = ngrams(token_list6,2)
print("Most common bigrams : ")
print(Counter(bigrams).most_common(5))

#Find trigrams and print the most common 5
trigrams = ngrams(token_list6,3)
print(" \n Most common trigrams : " )
print(Counter(trigrams).most_common(5))
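
An n-gram is simply a run of n consecutive tokens. A minimal standard-library sketch of the same idea follows; `make_ngrams` is a hypothetical helper written for this example, and the token list is made up.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n staggered copies of the list; it stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["big", "data", "big", "data", "tools"]
bigram_counts = Counter(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
print(bigram_counts.most_common(1))
```

A list of k tokens yields k - n + 1 n-grams. As far as I know, `nltk.util.ngrams` returns an iterator rather than a list, so wrap it in `list()` if the same n-grams are needed more than once.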

### TF-IDF

In [None]:
#Use scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

#Use a small corpus for easy visualization
vector_corpus = [
    'NBA is a Basketball league',
    'Basketball is popular in America.',
    'TV in America telecast BasketBall.',
]

#Create a vectorizer for english language
vectorizer = TfidfVectorizer(stop_words='english')

#Create the vector
tfidf=vectorizer.fit_transform(vector_corpus)

print("Tokens used as features are : ")
#get_feature_names_out() requires scikit-learn 1.0 or later
print(vectorizer.get_feature_names_out())

print("\n Size of array. Each row represents a document. Each column represents a feature/token")
print(tfidf.shape)

print("\n Actual TF-IDF array")
tfidf.toarray()
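
To show where the numbers in the array come from, here is a pure-Python sketch of TF-IDF. It uses raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row, which to my understanding matches scikit-learn's documented defaults. Tokenization here is a plain whitespace split, and the documents are a hand-cleaned version of the corpus above with stop words already removed, so treat this as an approximation rather than a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hand-cleaned stand-in for vector_corpus (assumption: stop words already removed)
docs = ["nba basketball league", "basketball popular america", "tv america telecast basketball"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 2) for w in row])
```

"basketball" appears in every document, so it gets the lowest weight in each row, while a term unique to one document, such as "nba", gets the highest.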

## Topic 3 Text Analytics


### Parts-of-Speech (POS) Tagging

Some examples of Parts-of-Speech abbreviations:

- NN : noun, singular
- NNS : noun, plural
- VBP : verb, present tense, not third-person singular

In [None]:
#download the tagger package
nltk.download('averaged_perceptron_tagger')

#Tag and print the first 10 tokens
nltk.pos_tag(token_list4)[:10]

### Named Entity Recognition (NER)


In [None]:
# Step One: Import NLTK and download the required resources
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# Step Two: Load Data
 
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

# Step Three: Tokenise, find parts of speech and chunk words

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity subtrees have a label; plain tokens are tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

### Text Clustering

#### Preparing Text for Clustering

In [None]:
import pandas as pd

#Load course hashtags
hashtags_df=pd.read_csv("Course-Hashtags.csv")
print("\nSample hashtag data :")
print(hashtags_df[:2])

#Separate hashtags and titles into lists
hash_list = hashtags_df["HashTags"].tolist()
title_list = hashtags_df["Course"].tolist()

#Do TF-IDF conversion of hashtags
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
hash_matrix=vectorizer.fit_transform(hash_list)
print("\n Feature names Identified :\n")
print(vectorizer.get_feature_names_out())

#### Clustering TF-IDF data

In [None]:
#Use KMeans clustering from scikit-learn
from sklearn.cluster import KMeans

#Split data into 3 clusters
kmeans = KMeans(n_clusters=3).fit(hash_matrix)

#get Cluster labels.
clusters=kmeans.labels_

#Print cluster label and Courses under each cluster
for group in set(clusters):
    print("\nGroup : ",group, "\n-------------------")
    
    for i in hashtags_df.index:
        if ( clusters[i] == group):
            print(title_list[i])
    

#### Finding optimal Cluster size

In [None]:
#Find optimal cluster size by finding sum-of-squared-distances

sosd = []
#Run clustering for sizes 1 to 14 and capture inertia
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(hash_matrix)
    sosd.append(km.inertia_)
    
print("Sum of squared distances : " ,sosd)

#Plot sosd against number of clusters
import matplotlib.pyplot as mpLib
mpLib.plot(K, sosd, 'bx-')
mpLib.xlabel('Cluster count')
mpLib.ylabel('Sum_of_squared_distances')
mpLib.title('Elbow Method For Optimal Cluster Size')
mpLib.show()
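
The quantity plotted here, inertia, is the sum of squared distances from each point to its nearest cluster centre. The sketch below computes it by hand for made-up 2-D points and hand-picked centroids. It does not run k-means itself; `inertia` is a hypothetical helper, not scikit-learn's implementation.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        # Squared Euclidean distance to every centroid; keep the smallest
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total

# Made-up 2-D points forming two tight groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

one_cluster = inertia(points, [(5.0, 5.5)])
two_clusters = inertia(points, [(0.0, 0.5), (10.0, 10.5)])
print(one_cluster, two_clusters)
```

Going from one centroid to two makes the value drop sharply because the data really has two groups. Adding further centroids would reduce it only slightly, and that flattening is the "elbow" to look for in the plot.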

## Topic 4 Sentiment Analysis

### Sentiment Analysis using TextBlob

#### Preparing Data for Sentiment Analysis

In [None]:
#Import the movie reviews corpus
with open("Movie-Reviews.txt", 'r') as fh:  
    reviews = fh.readlines()
print(reviews[:2])

#### Finding Sentiments by Review

In [None]:
#install textblob if not already installed using "pip install -U textblob"
from textblob import TextBlob

print('{:40} : {:10} : {:10}'.format("Review", "Polarity", "Subjectivity") )

for review in reviews:
    #Find sentiment of a review
    sentiment = TextBlob(review)
    #Print individual sentiments
    print('{:40} :   {: 01.2f}    :   {:01.2f}'.format(review[:40]\
                , sentiment.polarity, sentiment.subjectivity) )

#### Summarizing Sentiment

In [None]:
#Categorize Polarity into Positive, Neutral or Negative
labels = ["Negative", "Neutral", "Positive"]
#Initialize count array
values =[0,0,0]

#Categorize each review
for review in reviews:
    sentiment = TextBlob(review)
    
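    #Convert polarity (range -1 to 1) to a class index
    # 0 = (Negative) 1 = (Neutral) 2=(Positive)
    #Scores within 1/3 of zero are treated as Neutral
    if sentiment.polarity < -1/3:
        polarity = 0
    elif sentiment.polarity > 1/3:
        polarity = 2
    else:
        polarity = 1
    
    #Increment the count for that class
    values[polarity] = values[polarity] + 1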
    
print("Final summarized counts :", values)

import matplotlib.pyplot as plt
#Set colors in label order: Negative, Neutral, Positive
colors=["Red","Blue","Green"]

print("\n Pie Representation \n-------------------")
#Plot a pie chart
plt.pie(values, labels=labels, colors=colors, \
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
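
One way to bucket a polarity score into three classes is a pair of symmetric thresholds around zero. The sketch below assumes a cut-off of 1/3, which splits the range -1 to 1 into equal thirds. That value, the function name `polarity_label` and the sample scores are all assumptions for illustration; a different cut-off may suit other data better.

```python
def polarity_label(polarity, threshold=1/3):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"

# Made-up scores standing in for TextBlob(review).polarity
sample_scores = [-0.8, -0.2, 0.0, 0.5, 1.0]
sample_labels = [polarity_label(score) for score in sample_scores]
print(sample_labels)
```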

### Text Classification

#### Preparing Data for Classification

In [None]:
#Read course descriptions
with open("Course-Descriptions.txt", 'r') as fh:  
    descriptions = fh.read().splitlines()
print("Sample course descriptions :", descriptions[:2])

#Setup stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#setup wordnet for lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from sklearn.feature_extraction.text import TfidfVectorizer

#Custom tokenizer that will perform tokenization, stopword removal
#and lemmatization
english_stopwords = set(stopwords.words('english'))

def customtokenize(text):
    tokens=nltk.word_tokenize(text)
    nostop = [token for token in tokens if token not in english_stopwords]
    lemmatized=[lemmatizer.lemmatize(word) for word in nostop ]
    return lemmatized

#Generate TFIDF matrix
vectorizer = TfidfVectorizer(tokenizer=customtokenize)
tfidf=vectorizer.fit_transform(descriptions)

print("\nSample feature names identified : ", vectorizer.get_feature_names_out()[:25])
print("\nSize of TFIDF matrix : ",tfidf.shape)


#### Building and Training the model

In [None]:
#Loading the pre-built classifications for training
with open("Course-Classification.txt", 'r') as fh:  
    classifications = fh.read().splitlines()

#Create Labels and integer classes
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(classifications)
print("Classes found : ", le.classes_)

#Convert classes to integers for use with ML
int_classes = le.transform(classifications)
print("\nClasses converted to integers :", int_classes)

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#Split as training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(tfidf, int_classes,random_state=0)

#Build the model
classifier= MultinomialNB().fit(xtrain, ytrain)
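
The label-encoding step in this cell assigns each distinct class name an integer, with the classes taken in sorted order. A minimal standard-library sketch of that mapping follows; the labels are made up and are not the contents of Course-Classification.txt, and `fit_label_encoder` is a hypothetical helper.

```python
def fit_label_encoder(labels):
    """Return the sorted unique classes and a class-to-integer lookup."""
    classes = sorted(set(labels))
    lookup = {name: index for index, name in enumerate(classes)}
    return classes, lookup

# Made-up category labels for illustration
labels = ["Programming", "Business", "Design", "Business"]
classes, lookup = fit_label_encoder(labels)
encoded = [lookup[name] for name in labels]
print(classes)
print(encoded)
```

The integers are identifiers only. Their order carries no meaning for the classifier.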


#### Model Evaluation

In [None]:
from sklearn import metrics

print("Testing with Test Data :\n------------------------")
#Predict on test data
predictions=classifier.predict(xtest)
print("Confusion Matrix : ")
print(metrics.confusion_matrix(ytest, predictions))
print("\n Prediction Accuracy : ",  \
      metrics.accuracy_score(ytest, predictions) )

print("\nTesting with Full Corpus :\n--------------------------")
#Predict on entire corpus data (this includes the training rows, so the score is optimistic)
predictions=classifier.predict(tfidf)
print("Confusion Matrix : ")
print(metrics.confusion_matrix(int_classes, predictions))
print("\n Prediction Accuracy : ",  \
      metrics.accuracy_score(int_classes, predictions) )
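
To read the printed matrices, it helps to see how they are built: each row is a true class, each column a predicted class, and accuracy is the share of predictions on the diagonal. The sketch below reproduces both with the standard library on made-up labels; the helper names are hypothetical and shadow nothing in scikit-learn.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)

# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
acc = accuracy(y_true, y_pred)
print(cm)
print(acc)
```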


## Topic 5 Text Summarization

### Text summarization using the default transformers pipeline

In [None]:
%%capture
!pip install transformers

In [None]:
sample_text = "Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017."

In [None]:
from transformers import pipeline
from pprint import pprint

#Create a summarization pipeline with the library's default model
summarizer = pipeline("summarization")
summary_text = summarizer(sample_text)
pprint(summary_text)

### Text summarization using transformer BART Model

In [None]:
sample_text = """There is a massive shift underway in the global economy. Fueled by a global pandemic, we are seeing economic, cultural and social trends colliding.

This has led to a rapid transformation in how we work, where we work and even why we work. The only constant is change. And change that previously took decades, has happened in just two years. At Linkedin, we call this “The Great Reshuffle”.

Running in parallel with “The Great Reshuffle” is the world’s urgent need to save itself. Sir David Attenborough tells us “the future of humanity and indeed, all life on earth, now depends on us”.

Our hope is that we can collectively turn the climate crisis into an opportunity for change. We can all help save the planet by creating and pursuing greener careers for the world’s workforce. This green transition can see existing jobs apply more green skills and new green jobs emerge in tandem.

Such jobs include Sustainability Manager, Wind Turbine Technician, Solar Consultant, Ecologist, Environmental Health & Safety Specialist: roles that barely existed just a decade ago yet today are the five fastest-growing green jobs globally.

So what does this mean for you? How can you get yourself ready for the green economy that lies ahead?

In my LinkedIn Learning course, Closing the Green Skills Gap to Power a Greener Economy, I’ll help you better understand the green transition and the opportunities it presents. You’ll learn more about green skills and understand the rise of green jobs. I’ll look at what different countries and sectors are doing to lead green change. I’ll also give practical advice about what you can do to embrace and take advantage of this rapid economic shift. Watch this course for free until May 19, 2022. 

Remember - change starts with you! So whether you’re taking the first steps in your career or looking for a new challenge in your existing one, here are three things you can do to turbocharge your career in the green economy.

Upskill for green skills and green careers
Improving your green skills is important to pursue emerging career opportunities. Globally, members with green skills have been more resilient to economic downturns than the rest of the workforce. So improving or adding to your green skills is a great place to start. You could consider self-directed learning, offline or online. You could also enroll in a green-related higher education course to help you upskill.

Meanwhile, in the workplace, you could brush up on green skills as part of your organization's learning curriculum. Why not ask your employer or organization what green skills training may be available and even how you could volunteer to help make it happen?

Nurture your network of green skilled workers
Your network is something you can’t put a price on. That’s because you’ve built it by nurturing relationships and showcasing your capabilities. But you need to keep it up!

Indeed our data shows that green skilled workers tend to have stronger networks, with two and sometimes three times more connections on average. Your network is the way to build new relationships, ignite conversations and find new opportunities.

So start seeking out ‘green’ related content that interests you and consider engaging with it, for example, by commenting on it or sharing it. 

Why not join a LinkedIn Group dedicated to a green topic that you’re passionate about? And if you’re feeling more adventurous, you can even set up your own LinkedIn group, especially if you can’t find a green topic specific to your job.

That way, you’ll ignite new conversations which could present new green opportunities for not only you but also your network."""

In [None]:
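#Build a summarizer from a BART checkpoint (framework="tf" requires TensorFlow)
#Note: facebook/bart-large is the base checkpoint; facebook/bart-large-cnn is the variant fine-tuned for summarization
bart_summarizer = pipeline("summarization", model="facebook/bart-large", tokenizer="facebook/bart-large", framework="tf")
#Truncate long input to the model's limit and bound the summary length in tokens
bart_summarized = bart_summarizer(sample_text, min_length=25, max_length=50, truncation=True)
pprint(bart_summarized)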