<div style="background-color:#05806c;display:block;padding:10px;"><h1 style="color:#ffffff;">Text Mining using Gensim Library</h1></div>

## Overview
Text mining, also known as text analytics, is a branch of **Artificial Intelligence (AI)** that uses **Natural Language Processing (NLP)** to transform unstructured text into normalized, structured data that machine learning models can use.

Textual data also has significant business value: for example, companies can use it to profile customers and understand their needs.

## NLP Pipeline

A Natural Language Processing (NLP) pipeline is a set of text preprocessing and feature extraction steps performed in sequence. The length of the pipeline varies with the use case and with the kind of text data involved.
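
As a rough illustration of "steps performed in sequence", the sketch below chains a few plain-Python functions so that each step receives the output of the previous one. The step functions, `run_pipeline`, and `STEPS` are hypothetical names chosen for this example; they are not part of gensim.

```python
import re
from functools import reduce

def to_lower(text):
    return text.lower()

def drop_punctuation(text):
    # Replace anything that is not a word character or whitespace with a space
    return re.sub(r"[^\w\s]", " ", text)

def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text, steps):
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda acc, step: step(acc), steps, text)

# Hypothetical step list; a real pipeline would be tailored to the use case
STEPS = [to_lower, drop_punctuation, collapse_whitespace]

cleaned = run_pipeline("Hello,   NLP World!", STEPS)
tokens = cleaned.split()  # very simple feature extraction: whitespace tokens
print(tokens)
```

Adding, removing, or reordering entries in `STEPS` changes the pipeline without touching the code that runs it.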



## Import Libraries

In [None]:
import re
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Show full column contents (None replaces -1, which is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

# dataset
from sklearn.datasets import fetch_20newsgroups

# Gensim packages
from gensim.parsing import strip_tags, strip_numeric, strip_multiple_whitespaces, stem_text, strip_punctuation, remove_stopwords
from gensim.parsing import preprocess_string



## Loading Dataset

In [None]:
# loading dataset
news_group = fetch_20newsgroups(subset='train')

news_group_data = news_group.data
news_group_target_names = news_group.target_names
news_group_target = news_group.target

In [None]:
# Creating a dataframe from the loaded data
news_df = pd.DataFrame({'news': news_group_data, 
                        'class': news_group_target})

### Random sampling 

We will draw a random sample of records from the loaded dataset.

In [None]:
news_extracts = news_df.sample(2000)

news_extracts.reset_index(drop=True, inplace=True)

In [None]:
news_extracts.head(2)
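
`DataFrame.sample()` draws different rows on every run unless a seed is fixed. The sketch below, assuming repeatable results are wanted, passes a `random_state`. The frame `toy_df` and the seed value 42 are illustrative assumptions, not part of the dataset used here.

```python
import pandas as pd

# Hypothetical toy frame standing in for the news dataframe
toy_df = pd.DataFrame({'news': [f'document {i}' for i in range(10)],
                       'class': [i % 2 for i in range(10)]})

# Fixing random_state makes the draw repeatable across runs
sample_a = toy_df.sample(n=4, random_state=42).reset_index(drop=True)
sample_b = toy_df.sample(n=4, random_state=42).reset_index(drop=True)

# Both draws contain the same rows in the same order
print(sample_a.equals(sample_b))
```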

## Data Preprocessing

### 1. Data Cleaning Pipeline

This is the first step of text analysis. We apply a series of cleaning filters to remove unwanted elements from the text data.


In [None]:
# Custom filter methods
transform_to_lower = lambda s: s.lower()

# Replace standalone single characters with a space so neighbouring words are not merged
remove_single_char = lambda s: re.sub(r'\b\w\b', ' ', s)

# Filters to be executed in pipeline
CLEAN_FILTERS = [strip_tags,
                strip_numeric,
                strip_punctuation, 
                strip_multiple_whitespaces, 
                transform_to_lower,
                remove_stopwords,
                remove_single_char]

# Filters out all the irrelevant text elements from a document
def cleaning_pipe(document):
    # Invoking gensim.parsing.preprocess_string method with set of filters
    processed_words = preprocess_string(document, CLEAN_FILTERS)
    
    return processed_words

In [None]:
# Apply the cleaning pipe on the news data

news_extracts['clean_text'] = news_extracts['news'].apply(cleaning_pipe)

In [None]:
news_extracts['clean_text'][0:2]

### 2. Stemming & Lemmatization

**Stemming**: Stemming reduces a word to its root form, typically by stripping suffixes. For example, the stem of 'running' is 'run'.

**Lemmatization**: Lemmatization finds the base (dictionary) form of a word, called the lemma, through vocabulary and morphological analysis.
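
To make the difference concrete, here is a toy comparison. It is only a sketch: `toy_stem` is a naive suffix stripper (not the Porter algorithm), and `LEMMA_TABLE` is a hypothetical four-word dictionary standing in for the vocabulary a real lemmatizer would consult.

```python
# Toy suffix-stripping "stemmer": NOT the Porter algorithm, illustration only
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix if at least 3 characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lemmatizer's vocabulary
LEMMA_TABLE = {"studies": "study", "better": "good", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "better", "ran", "mice"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer can only chop characters, so it may produce non-words and cannot relate irregular forms, while the dictionary lookup maps each form to a real base word.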

### Stemming Approaches

#### #1 gensim.parsing.stem_text()
The **parsing** package of gensim has a built-in function, **stem_text()**, which applies Porter stemming to the given text.

#### #2 PorterStemmer 
The basic approach to stemming uses a `PorterStemmer` object. Gensim provides a Porter stemmer class in the `gensim.parsing.porter` module. The class has methods that accept a single word (`stem()`), a sentence (`stem_sentence()`), or a list of sentences (`stem_documents()`).

#### #3 Chain along pipes & filters
Another approach is to chain the `stem_text` function into the cleaning pipeline's filter list and pass that list to `preprocess_string()`.

In [None]:
# import stemmer from gensim
from gensim import parsing
from gensim.parsing.porter import PorterStemmer
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim 3.x
from gensim.summarization import textcleaner

# Initialize PorterStemmer
porter = PorterStemmer()

def basic_stemming(text):
    return parsing.stem_text(text)

# Stem the incoming word
def get_stemword(stemmer, word):
    return stemmer.stem(word)

# Stem all the words in the passed sentence
def get_stem_sentence(stemmer, sentence):
    return stemmer.stem_sentence(sentence)

# stem all the sentences given as a document
def get_stem_documents(stemmer, document):
    return stemmer.stem_documents(document)



Suppose the following paragraph needs to be processed to find the stems of its nouns, verbs, adjectives, and so on. It can first be broken into sentences and then passed to the `stem_documents()` method.

In [None]:
document = """A computer is a machine that can be instructed to carry out sequences of arithmetic or logical operations automatically via computer programming. 
Modern computers have the ability to follow generalized sets of operations, called programs. 
These programs enable computers to perform an extremely wide range of tasks. 
A complete computer including the hardware, the operating system (main software), and peripheral equipment required and used for full operation can be referred to as a computer system. 
This term may as well be used for a group of computers that are connected and work together, in particular a computer network or computer cluster."""

In [None]:
# Stem the given paragraph text 
stemmed_text = basic_stemming(document)

print(stemmed_text)

In [None]:
# Break the paragraph into sentences
sentences = textcleaner.get_sentences(document)

# Sentences will be parsed by stem method
stem_doc = get_stem_documents(porter, sentences)

print(stem_doc)