<h1>Introduction</h1>

This notebook covers some advanced NLP techniques and their implementations in Python. We will focus on three major tasks:

* Summarization of text - feature-based and TextRank
* Clustering of documents
* NLP in search engines

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install gensim==1.0.0

In [None]:
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h2>Text summarization</h2>

There is a multitude of books, blogs and other text material available everywhere. We may find good resources for learning NLP, but they are often too long to read in full. Summarization captures the important information in a text while saving reading time. It can be implemented in a variety of ways:

* TextRank - graph-based ranking
* Feature-based summarization of text
* Topic linkage
* Usage of sentence embeddings
* Encoder-decoders - Deep learning

We'll be working with the first two approaches in this notebook.

<h3>TextRank</h3>

**This approach is not recommended: the gensim summarization module is no longer supported and was removed entirely in gensim 4.0.** TextRank is a graph-based ranking algorithm built on core NLP concepts. It is based on PageRank, the algorithm used by Google's search engine, but is designed specifically for text. It turns units of text (sentences for summarization, words for keyword extraction) into the nodes of a graph, connects them with edges that capture the relationship between them, and ranks the nodes. Let us capture some data from Wikipedia.
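
To make the graph idea concrete, here is a minimal pure-Python sketch of TextRank-style sentence ranking. It is not gensim's implementation: the naive sentence splitter, the word-overlap similarity and the function names (`split_sentences`, `textrank_scores`, `summarize_sketch`) are assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def overlap(words_a, words_b):
    # Number of shared words, normalised by the log-lengths of both sentences
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    words = [set(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    # Weighted undirected graph: one node per sentence, edge weight = overlap
    weights = [[0.0 if i == j else overlap(words[i], words[j]) for j in range(n)]
               for i in range(n)]
    totals = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # PageRank-style update: a sentence scores highly when it is strongly
        # connected to other high-scoring sentences
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n) if totals[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize_sketch(text, n_sentences=2):
    sentences = split_sentences(text)
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    # Return the selected sentences in their original order
    return [sentences[i] for i in sorted(top)]


sample = ("Cats chase mice in the garden. Mice fear cats in the garden. "
          "Dogs also chase cats. Stock markets fell sharply today.")
print(summarize_sketch(sample, n_sentences=2))
```

The last sentence shares no words with the others, so it has no edges and keeps only the baseline score of `1 - damping`.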

In [None]:
def scrape(link):
    website = urlopen(link)
    s = BeautifulSoup(website, 'html.parser')
    # Collect the text of every paragraph tag <p>
    # and join the pieces into a single string
    text = ' '.join(p.text for p in s.find_all('p'))
    print(text)
    return s.title.text, text

In [None]:
link = "https://en.wikipedia.org/wiki/Natural_language_processing"
paragraphs = scrape(link)

In [None]:
len(''.join(paragraphs))

In [None]:
# scrape() returns (title, text); preview the first 100 characters of the text
paragraphs[1][:100]

In [None]:
# Join the page title and the body text into one string
total = ' '.join(paragraphs)

In [None]:
total

In [None]:
type(total)

In [None]:
# Summarize the text with ratio=0.1, i.e. keep roughly 10% of the sentences
summarize(total, ratio=0.1)

In [None]:
print(keywords(total, ratio=0.1))

<h2>Feature-based text summarization</h2>

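This approach extracts features from each sentence, scores the sentence's importance from them and then ranks the sentences. The features may include sentence length, word position, word frequency, named entities etc. Luhn's algorithm, which scores sentences by their significant (frequent, non-stop) words, is one option for this task.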
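
As an illustration of feature-based scoring, here is a simplified Luhn-style sketch in pure Python. It is not sumy's `LuhnSummarizer`: the tiny stop-word list, the `min_count` threshold and the choice to treat each sentence as a single cluster of significant words are assumptions made to keep the example short.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); real lists are much longer
STOP_WORDS = {"a", "an", "the", "is", "was", "are", "of", "in", "to", "and"}


def luhn_scores(sentences, min_count=2):
    tokenized = [re.findall(r'[a-z]+', s.lower()) for s in sentences]
    # Word frequencies over the whole text, ignoring stop words
    freq = Counter(w for words in tokenized for w in words if w not in STOP_WORDS)
    # "Significant" words are those that occur at least min_count times
    significant = {w for w, count in freq.items() if count >= min_count}
    scores = []
    for words in tokenized:
        positions = [i for i, w in enumerate(words) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Span from the first to the last significant word in the sentence
        span = positions[-1] - positions[0] + 1
        # Luhn's score: (number of significant words)^2 / span length
        scores.append(len(positions) ** 2 / span)
    return scores


docs = [
    "SQL injection attacks target database queries.",
    "Weather was nice.",
    "Injection attacks exploit unsanitised database input.",
]
for sentence, score in zip(docs, luhn_scores(docs)):
    print(round(score, 2), sentence)
```

Sentences in which the significant words sit close together score highest, and a sentence without any significant word scores zero.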

In [None]:
!pip install sumy

In [None]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

In [None]:
# Number of sentences to keep in the summary
sentences = 15
link = "https://en.wikipedia.org/wiki/SQL_injection"
parser = HtmlParser.from_url(link, Tokenizer("english"))

In [None]:
# Note: this cell uses sumy's LSA summarizer; the LuhnSummarizer imported above
# is constructed and called the same way
lsa = LsaSummarizer(Stemmer("english"))
lsa.stop_words = get_stop_words("english")
for sentence in lsa(parser.document, sentences):
    print(sentence)

With this, we were able to summarize our documents in a more flexible manner. Deep learning techniques can be incorporated to further improve the quality of the summaries.

<h2>Document clustering</h2>

This is also known as text clustering: the application of cluster analysis to text documents. One of its main uses is document management.

The process shares several steps with basic NLP tasks:

1. Tokenization
2. Stemming and lemmatization
3. Removing stop words and punctuation 
4. Counting term frequencies or TF-IDF
5. Clustering through a K-means/hierarchical technique
6. Evaluation and final visualizations
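
Step 4 can be sketched in pure Python as follows. The weighting used here (relative term frequency times `log(N / df) + 1`) is only one common variant and an assumption of this sketch; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for words in tokenized for w in set(words))
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        # Relative term frequency weighted by inverse document frequency
        vectors.append([tf[w] / len(words) * idf[w] for w in vocab])
    return vocab, vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy_docs = [
    "credit card charged twice",
    "card charged late fee",
    "mortgage loan payment delayed",
]
vocab, vectors = tfidf_vectors(toy_docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Documents that share terms get a positive cosine similarity, which is what lets a clustering algorithm group them together.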

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

We will use the finance complaints dataset, which we used in our other notebook for classification purposes:

In [None]:
df = pd.read_csv('../input/us-consumer-finance-complaints/consumer_complaints.csv', encoding='latin-1')

In [None]:
# Extracting the required column
df = df[['consumer_complaint_narrative']]
df = df[df['consumer_complaint_narrative'].notnull()]

In [None]:
df.rename(columns={'consumer_complaint_narrative':'description'}, inplace=True)

In [None]:
df.head()

In [None]:
df.shape

Let us work with a random sample of 200 documents for now.

In [None]:
sampling = df.sample(200)

Preprocessing steps

In [None]:
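# Remove the XXXX / XX redaction placeholders from the sampled complaints
# (the sample, not df, is what gets clustered below)
sampling['description'] = sampling['description'].str.replace('XXXX', '', regex=False)
sampling['description']

In [None]:
sampling['description'] = sampling['description'].str.replace('XX', '', regex=False)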

In [None]:
# Conversion to list
data = sampling['description'].tolist()

In [None]:
# Running document number, 1..len(data); will be used later
ranks = list(range(1, len(data) + 1))

In [None]:
stopwords = nltk.corpus.stopwords.words('english')
ss = SnowballStemmer('english')

In [None]:
# Function to clean up all our data
def cleaning(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered.append(token)
    stem = [ss.stem(t) for t in filtered]
    return stem

In [None]:
def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered.append(token)
    return filtered

In [None]:
sampling.shape

In [None]:
len(data)

In [None]:
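tfidf = TfidfVectorizer(min_df=0.2, max_df=0.8, max_features=200000, stop_words='english', use_idf=True, tokenizer=cleaning, ngram_range=(1, 3))
sampling_tf = tfidf.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use tfidf.get_feature_names() instead
terms = tfidf.get_feature_names_out()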

In [None]:
print(sampling_tf.shape)

In [None]:
len(terms)

<h2>Clustering using K-means</h2>

In [None]:
km = KMeans(n_clusters=6)
km.fit(sampling_tf)
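
For intuition about what the K-means fit does, here is a minimal pure-Python sketch of the assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation: initialising the centroids with the first `k` points is an assumption made to keep the example deterministic, whereas scikit-learn's `KMeans` uses k-means++ initialisation by default.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(points, k, iterations=20):
    # Naive initialisation (assumption): use the first k points as centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


toy_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

In the notebook the points are the rows of the TF-IDF matrix rather than 2D coordinates, but the loop is the same.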

In [None]:
clusters = km.labels_.tolist()

In [None]:
complaints_data = {'rank': ranks, 'complaints': data, 'cluster': clusters}
frame = pd.DataFrame(complaints_data, index=[clusters], columns=['rank', 'cluster'])

In [None]:
frame

In [None]:
frame['cluster'].value_counts().sort_values()

<h3>Identifying cluster behaviour</h3>
We will find the top 6 terms for each cluster, i.e. the terms with the largest weights in the cluster centroid:

In [None]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in data:
    a = cleaning(i)
    totalvocab_stemmed.extend(a)
    b = tokenize_only(i)
    totalvocab_tokenized.extend(b)

In [None]:
vocab_frame = pd.DataFrame({'words':totalvocab_tokenized}, index=totalvocab_stemmed)

In [None]:
vocab_frame.head()

In [None]:
# Sort the term indices of each cluster centre by descending weight, so that
# the most characteristic terms of every centroid come first
ordering = km.cluster_centers_.argsort()[:, ::-1]
# Iterating over each cluster
for i in range(6):
    print("Cluster %d words" % i, end="")
    # Indices of the six highest-weighted terms for this cluster
    for ind in ordering[i, :6]:
        print(" At index", ind, end=" ")
        # Look up the TF-IDF term at this index (a stemmed n-gram), then use its stems
        # to find a matching original word in vocab_frame (which is indexed by stem)
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0], end=', ')
    print()

In [None]:
ordering

In [None]:
# We search for the term through the stemmed value in the tfidf feature names
vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0]

<h2>Plotting clusters</h2>

In [None]:
# Cosine distance between documents (1 - cosine similarity)
sim_d = 1 - cosine_similarity(sampling_tf)
# Reducing the features to a 2D space
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(sim_d)  # shape: (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]

# Colors to use and cluster names:
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e', 5: '#D2691E'}
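# Cluster names taken from the top terms printed above; they change between runs
# because the sample and the K-means initialisation are random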
cluster_names = {0: 'property, based, assist',
 1: 'business, card',
 2: 'authorized, approved, believe',
 3: 'agreement, application,business',
 4: 'closed, applied, additional',
 5: 'applied, card'}

In [None]:
sim_d.shape

In [None]:
xs.shape

In [None]:
ys.shape

In [None]:
df1 = pd.DataFrame(dict(x=xs, y=ys, label=clusters)) 

In [None]:
df1.head()

In [None]:
df1.groupby('label').count()

In [None]:
groups = df1.groupby('label')

In [None]:
fig, ax = plt.subplots(figsize=(17, 9))
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=20, label=cluster_names[name], color=cluster_colors[name], mec='none')
ax.set_aspect('auto')
# Hide ticks and tick labels: the MDS coordinates have no meaning on their own
ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)

ax.legend(numpoints=1)
plt.show()

Through this, we were able to cluster 200 complaints into 6 distinct groups. Deep learning techniques such as word embeddings could be used to improve the clustering.

<h2>NLP in search engines</h2>

The major NLP processes in a search engine include the following:

**Preprocessing**
* Removal of noise and stop words
* Tokenization
* Stemming
* Lemmatization

**Entity extraction model** - We can build customized models for this purpose or use libraries such as NLTK and Stanford NER. For an e-commerce website, the entity recognition model can extract the following:
* Gender
* Color
* Brand
* Product category
* Price
* Size 
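
A very small rule-based sketch shows what extracting these entities from a search query could look like. This is not a trained NER model: the gazetteers, the regular expressions and the function name `extract_entities` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical gazetteers; a real system would build these from the product catalogue
GAZETTEER = {
    "gender": {"men", "women", "boys", "girls"},
    "color": {"red", "blue", "black", "white"},
    "brand": {"nike", "adidas", "puma"},
    "category": {"shoes", "shirt", "jacket"},
}
# Hypothetical patterns for price limits ("under $50") and sizes ("size 10")
PRICE_PATTERN = re.compile(r'(?:under|below)\s+\$?(\d+)')
SIZE_PATTERN = re.compile(r'\bsize\s+(\w+)')


def extract_entities(query):
    text = query.lower()
    entities = {}
    # Dictionary lookup on whole tokens, keeping the first match per entity type
    for token in re.findall(r'[a-z]+', text):
        for label, vocabulary in GAZETTEER.items():
            if token in vocabulary:
                entities.setdefault(label, token)
    price = PRICE_PATTERN.search(text)
    if price:
        entities["max_price"] = int(price.group(1))
    size = SIZE_PATTERN.search(text)
    if size:
        entities["size"] = size.group(1)
    return entities


print(extract_entities("Red Nike shoes for men size 10 under $50"))
```

The extracted fields can then be used as structured filters on the search index instead of matching the raw query text.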


We can also build named entity recognition and disambiguation (NERD) models using RNNs and LSTMs. Disambiguation helps in understanding the context in which an entity is used, e.g. "bank" can be a riverbank or a financial institution. Building a NERD model involves the following steps:
* Data cleaning and preprocessing
* Training the NER model
* Testing and validation
* Deployment

The NERD model can be trained through:
* Named entity recognition/disambiguation
* RNNs, LSTMs
* Joint named entity recognition

**Query enhancement and expansion** - It is important to understand the different possible meanings of entities so that relevant results are not missed. We can use locally trained word embeddings, built with methods such as GloVe or Word2Vec, to achieve this.
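
The idea of embedding-based query expansion can be sketched with a few toy vectors. The three-dimensional embeddings below are invented for illustration, and `expand_query` with its `top_k` and `min_similarity` parameters is a hypothetical helper; in practice the vectors would come from a trained GloVe or Word2Vec model.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only
EMBEDDINGS = {
    "sneakers": [0.9, 0.1, 0.0],
    "trainers": [0.85, 0.15, 0.05],
    "shoes": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.9, 0.4],
    "notebook": [0.1, 0.85, 0.5],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_query(term, top_k=2, min_similarity=0.9):
    # Unknown terms are returned unchanged
    if term not in EMBEDDINGS:
        return [term]
    scored = [
        (cosine(EMBEDDINGS[term], vector), other)
        for other, vector in EMBEDDINGS.items()
        if other != term
    ]
    scored.sort(reverse=True)
    # Keep the closest neighbours that are similar enough to the query term
    neighbours = [word for score, word in scored[:top_k] if score >= min_similarity]
    return [term] + neighbours


print(expand_query("sneakers"))
print(expand_query("laptop"))
```

The similarity threshold keeps the expansion from adding loosely related terms, which would hurt the relevance of the results.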

**The search platform** - Some search platforms offer full-text search, hit highlighting, the Elastic Stack, real-time indexing, dynamic clustering etc. This is less about NLP and more about end-to-end application features.

**Rankings** - The search results fetched from Solr or Elasticsearch should be ranked based on user preferences and other ranking algorithms.