<a href="https://colab.research.google.com/github/yuvaravii/BBC-News-article-Topic-Identification/blob/main/preprocessing_colab_nlp_stage1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Description**

In this project, your task is to identify the major themes/topics across a collection of BBC news articles. You can use topic-modelling algorithms such as Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA).

In [None]:
# for dataframes
import pandas as pd
import numpy as np
import re

#for ignoring warnings
import warnings
warnings.filterwarnings("ignore")

import json
import glob
import os

#gensim
import gensim
import gensim.corpora as corpora 
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


from spacy import displacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel

import sklearn
import keras

#spacy
import spacy 
from nltk.corpus import stopwords

# for visualisation of data
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
filepath='/content/drive/MyDrive/Colab Notebooks/Capstone Project/BBC article/1. bbc -raw Data set/to_csv file/data.csv'
raw_df=pd.read_csv(filepath)

#retaining the original file
df=raw_df.copy()

#changing the name of the columns
df=df.rename(columns={'collection':'docs','Topics':'topics'})

#dropping the unnecessary columns
df=df.drop(columns=['Unnamed: 0'])

In [None]:
df.head()

In [None]:
# add a column holding the length of each document (in characters)
df['doc_len']=df['docs'].apply(len)
df.head()

In [None]:
df.topics.unique()

In [None]:
print(df.groupby(by='topics',as_index=True).agg({'topics':'count'}))
df.groupby(by='topics',as_index=True).agg({'topics':'count'}).plot(kind='barh',figsize=(12,4),fontsize=12,color='green')

In [None]:
# Pick an arbitrary document and view its full text
df['docs'][1234]

## **Cleaning Data**

Cleaning data involves steps such as:

1. Converting to lower case - to avoid case sensitivity
2. Removing HTML tags - to remove noise from content downloaded from the internet
3. Removing accented characters - é to e
4. Stop word removal
5. Punctuation removal
6. Number removal
7. Expanding contractions - like I'd = I would, you're = you are
8. Removing special characters - !@# etc.
9. Standardisation - expanding acronyms, like nlp = natural language processing; however, this is manual in nature
10. Normalisation - reduces the number of unique tokens, reduces the variation in the words of the text, and so reduces redundant information
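
A few of the steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in this notebook: `basic_clean` and the tiny `CONTRACTIONS` map are hypothetical names made up for the example, and a real contraction list would be far larger.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny contraction map; a real one would be much larger.
CONTRACTIONS = {"i'd": "i would", "you're": "you are", "can't": "cannot"}

def basic_clean(text):
    """Apply a few of the cleaning steps to one string."""
    text = text.lower()                          # lower case
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags
    text = html.unescape(text)                   # decode entities such as &amp;
    # split accented letters into base letter + combining mark, then drop the marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # expand contractions before the apostrophes are stripped
    # (plain substring replacement, good enough for a sketch)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)        # punctuation, digits, special characters
    return re.sub(r"\s+", " ", text).strip()     # tidy whitespace

print(basic_clean("<p>You're reading Café news &amp; 3 updates!</p>"))
# prints: you are reading cafe news updates
```

The order matters: contractions are expanded while the apostrophe is still present, and only then is everything that is not a letter replaced by a space.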


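**Stemming** = reducing words to a stem (base form) by chopping off affixes (prefixes and suffixes), so that jump, jumping, jumped and jumps all become jump. However, it has some inherent flaws: overstemming, where unrelated words collapse to the same stem (a commonly cited example is universe and university both becoming univers), and understemming, where related words may keep different stems (for example data and datum). So this is not the best normalization technique.
* Faster and easier to run, but not accurate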
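
To make the idea concrete, here is a toy suffix stripper. It is only a sketch: `naive_stem` is a hypothetical helper written for this example and is much cruder than real stemmers such as Porter's, but it shows both the benefit and the kind of error stemming can make.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jump", "jumping", "jumped", "jumps", "news", "sing"]:
    print(w, "->", naive_stem(w))
```

The four forms of jump collapse to one stem, which is the goal. The word news, however, is cut down to new, an unrelated word, which is the sort of mistake a rule-based stemmer cannot notice.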


**Lemmatization** = a step-by-step procedure for reducing words to their base (dictionary) form. It takes the part of speech into consideration: running (verb) is converted to run, while running (noun) is left unchanged.
* Slower, but more accurate

Based on the applicability, you can choose any of the lemmatizers below:
* Wordnet Lemmatizer
* Spacy Lemmatizer
* TextBlob
* CLiPS Pattern
* Stanford CoreNLP
* Gensim Lemmatizer
* TreeTagger
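
The role of the part of speech can be illustrated with a toy lookup. This is a sketch under stated assumptions: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, the part-of-speech tag is supplied by hand, and none of the lemmatizers listed above works from a three-entry table.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); real lemmatizers
# ship large dictionaries and rules, and usually tag the part of speech for you.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; unknown pairs are returned unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("running", "VERB"))  # verb reading
print(toy_lemmatize("running", "NOUN"))  # noun reading, left as-is
```

The same surface word gives different results depending on its tag, which is exactly what a suffix-stripping stemmer cannot do.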


**Why are such processes required?**

A computer understands only binary, so its input has to be numerical, and every word has to be converted into numbers. This conversion is called vectorization.

Vectorization - words are stored as numbers with some direction (a vector). It is like placing the Harry Potter book in the 'H' row of a library: every book is given an index. Similarly, words are tokenized, mapped to numbers and stored in an array.
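
As a rough illustration of turning words into numbers, the sketch below gives every distinct word an index and then counts words per document (a bag-of-words vector). The token lists and the helper `to_vector` are made up for the example; the notebook itself relies on library tooling for this.

```python
from collections import Counter

docs = [["harry", "potter", "book"], ["library", "book", "book"]]  # made-up token lists

# Give every distinct word an integer index, like a shelf position in the library.
vocab = {}
for tokens in docs:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

print(vocab)                                    # {'harry': 0, 'potter': 1, 'book': 2, 'library': 3}
print([to_vector(tokens) for tokens in docs])   # [[1, 1, 1, 0], [0, 0, 2, 1]]
```

Every extra distinct word adds one more position to every vector, which is why a smaller vocabulary means less memory and computation.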

You might still be wondering why we are not getting to the topics yet... on the way, pal. More words --> more tokens --> more space and computation --> more complex computation required.
So we either increase the computing power (upgrade our system or move to cloud computing), or we take the other way: feed in only the necessary information.

How are words like when, to, a, the, from etc. going to help in picking a topic? Not only that, how about punctuation and numerals? None of them is going to help. We should also throw some light on grammar, parts of speech and tenses.
I think Shakespeare must be angry at us for killing English grammar, however deeply we revere him.

How do words like run, runs, ran, running affect us? Opportunity! We can reduce these kinds of words to one base form; stemming is one such process.

Lemmatization is another technique of a similar kind. How about inform, information, informed (verb, noun, adjective)? These also add to the burden, so we overcome it in the same way.

It is often said of Machine Learning and NLP algorithms: garbage in, garbage out. We can't have state-of-the-art results without data that is just as good. Let's spend this section cleaning and understanding our data set. NLTK is a popular choice for pre-processing, but it is rather dated, so we will also be checking out spaCy, an industry-grade text-processing package.

## Remove Punctuations

In [None]:
################################################ BLOCK 0 - PUNCTUATION REMOVAL AND LOWER CASE ####################

# work on a string copy of the raw documents
df['data']=df['docs'].astype(str)

# replace selected characters with a space (raw strings avoid invalid-escape warnings)
df['data']=df['data'].str.replace(r'\s*@\s*',' ',regex=True)   # '@' and surrounding whitespace
df['data']=df['data'].str.replace(r'\?',' ',regex=True)        # question marks
df['data']=df['data'].str.replace(r'_',' ',regex=True)         # underscores
df['data']=df['data'].str.replace(r'\s+',' ',regex=True)       # collapse runs of whitespace
df['data']=df['data'].str.replace(r"'",' ',regex=True)         # apostrophes

# lower case everything
df['data']=df['data'].str.lower()

In [None]:
df['docs'][0]

In [None]:
df['data'][0] # lower-cased; '@', '?', '_' and apostrophes replaced by spaces (remaining punctuation is dropped later by simple_preprocess)

In [None]:
df.head()

We should not remove stop words in applications such as machine translation or text summarization, where dropping them is likely to defeat the objective.

In [None]:
sample=df['data'][0]

In [None]:
sample

In [None]:
##################################################### Block 1 -SENTENCE TO TOKEN OF WORDS #####################################################
# creation of tokens of words

def data_to_words(sentences_in_doc):
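  '''
  Convert a single document string into a list of lower-cased word tokens.
  '''
  # yields exactly one token list, so list(data_to_words(x)) is a one-element list of lists
  # simple_preprocess lower-cases, strips accents (deacc=True) and drops numbers, punctuation (even %) and special characters
  yield gensim.utils.simple_preprocess(str(sentences_in_doc),deacc=True)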

#df['doc_words']= list(data_to_words(df['data']))
token_words=list(data_to_words(sample)) # returns lists of list tokenized words
print(token_words)

### **Stop words removal**

Stop words are the very common words (a, the, to, from, ...) that hold a sentence together: they keep it grammatical and help human beings communicate effectively, but they carry little topical meaning. In NLP every word is tokenized and converted to a number, so we reduce noise by removing stop words.

The computer, in any case, does not understand grammar or effective communication.
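
The filtering itself is a simple membership test. The sketch below uses a hypothetical hand-picked `STOP_WORDS` set and a made-up sentence purely for illustration; the notebook itself uses NLTK's English stop word list.

```python
# Hypothetical hand-picked stop list, for illustration only.
STOP_WORDS = {"the", "a", "to", "from", "when", "is", "of", "not"}

def split_stop_words(tokens):
    """Return (kept tokens, removed stop words) for one tokenized document."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    removed = [t for t in tokens if t in STOP_WORDS]
    return kept, removed

tokens = "the price of oil is not expected to fall".split()
kept, removed = split_stop_words(tokens)
print(kept)                                   # ['price', 'oil', 'expected', 'fall']
print(len(tokens), len(kept), len(removed))   # 9 4 5
```

More than half of the tokens disappear, yet the topical words survive. Note that dropping "not" changes the meaning of the sentence, which is acceptable when we only care about topics but not for tasks that need the full sentence.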

In [None]:
################################################ Block 2 - STOP WORDS REMOVAL ###############################################
# Removal of stop words

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# a set gives fast membership tests
stop_words = set(stopwords.words("english"))
stop_words.update(['from','subject','re','edu','use','mr'])

stp_wrd_rmd_list=[]       # words kept after removing stop words
stp_wrd_in_given_list=[]  # stop words found in the document
for lists in token_words:
  number_of_words_in_doc=len(lists) # number of words in the given doc
  for word in lists:
    if word not in stop_words:
      stp_wrd_rmd_list.append(word)
    else:
      stp_wrd_in_given_list.append(word)

print(number_of_words_in_doc,
      len(stp_wrd_rmd_list),  # number of words without stop words
      len(stp_wrd_in_given_list)) # number of stop words present

# join the remaining words into a single sentence
cleaned_sentence=' '.join(stp_wrd_rmd_list)

In [None]:
cleaned_sentence # string format

In [None]:
############################################################# Block 3 (lemmatization)#####################################
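nlp = spacy.load('en_core_web_sm') # the 'en' shortcut was removed in spaCy v3; needs: python -m spacy download en_core_web_sm
doc = nlp(cleaned_sentence) # a spaCy Doc, i.e. a sequence of tokens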
lemma_doc=" ".join([token.lemma_ for token in doc])
lemma_doc

In [None]:
sample

In [None]:
################################### Block 4 - One function for cleaning #########################

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stop word set and load the spaCy model once, not on every call
stop_words = set(stopwords.words("english"))
stop_words.update(['from','subject','re','edu','use','mr'])
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3

def cleaned_sentence(sentences):
  '''
  Clean one document: tokenize, remove stop words, lemmatize.
  Returns (lemmatized text, tokens without stop words, number of words in doc,
  number of words after stop word removal, number of stop words in doc).
  '''
  # Block 1 - tokens of words
  # simple_preprocess lower-cases and removes numbers, punctuation (even %) and special characters
  token_words=gensim.utils.simple_preprocess(str(sentences),deacc=True)
  number_of_words_in_doc=len(token_words)

  # Block 2 - removal of stop words
  stp_wrd_rmd_list=[word for word in token_words if word not in stop_words]
  stpwd_num_in_doc=number_of_words_in_doc-len(stp_wrd_rmd_list)

  # Block 3 - lemmatization of the stop-word-free sentence
  doc=nlp(' '.join(stp_wrd_rmd_list))
  lemma_doc=" ".join([token.lemma_ for token in doc])

  return lemma_doc,stp_wrd_rmd_list,number_of_words_in_doc,len(stp_wrd_rmd_list),stpwd_num_in_doc



In [None]:
cleaned_sentence(sample)

In [None]:
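# create a data frame containing the cleaned documents: one row of outputs per document
new_cols=['cleaned_doc','cleaned_doc_token','num_words_in_doc','aft_rm_stpwd_wrd_num','stpwd_wrd_num_in_doc']
results=pd.DataFrame([cleaned_sentence(doc) for doc in df['data']],columns=new_cols,index=df.index)
df=df.join(results)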

In [None]:
df.head()

In [None]:
# Convert to .csv file as it truly test the patience of a human being to run each time
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Project/BBC article/2. Cleaned and Preprocessed data/cleaned_dataset_stg1.csv')

# Load Data Set from Storage

In [None]:
processed_data_filepath='/content/drive/MyDrive/Colab Notebooks/Capstone Project/BBC article/2. Cleaned and Preprocessed data/cleaned_dataset_stg1.csv'
new_df=pd.read_csv(processed_data_filepath)
df1=new_df.copy()
df1.head(3)

In [None]:
details_dict=df1.groupby(by='topics').agg({'docs':'count','num_words_in_doc':'sum','aft_rm_stpwd_wrd_num':'sum','stpwd_wrd_num_in_doc':'sum'}).to_dict()
details_df=pd.DataFrame(details_dict)
details_df.columns
details_df['% reduction']=details_df['stpwd_wrd_num_in_doc']/details_df['num_words_in_doc']*100
details_df

In [None]:

# Figure Size
fig, ax = plt.subplots(figsize =(12, 5))
 
# Horizontal Bar Plot
ax.barh(details_df.index, details_df['num_words_in_doc'],color='orange')
ax.barh(details_df.index, details_df['aft_rm_stpwd_wrd_num'],color='black')

ax.set_title('Total Number of words vs Number of words after cleaning')

# Remove x, y Ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
 
# Add padding between axes and labels
ax.xaxis.set_tick_params(pad = 5)
ax.yaxis.set_tick_params(pad = 5)
 
# Add x, y gridlines
ax.grid(True, color ='black',   # pass the flag positionally; the 'b' keyword was removed in newer Matplotlib
        linestyle ='--', linewidth = 0.5,
        alpha = 0.2)
 
# Show top values
ax.invert_yaxis()
 
# Add annotation to bars
for i in ax.patches:
    plt.text(i.get_width()+0.2, i.get_y()+0.5,
             str(round((i.get_width()), 2)),
             fontsize = 10, fontweight ='bold',
             color ='grey')

# Show Plot
plt.show()

# **Phase 2 - Unigrams & Bigrams** 
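
An n-gram is a run of n adjacent tokens, so a bigram is a pair of neighbouring words. As a minimal sketch of the idea, independent of NLTK, the hypothetical helper `make_ngrams` below builds them by zipping a token list against shifted copies of itself; the token list is made up for the example.

```python
def make_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["bbc", "news", "article", "topic"]
print(make_ngrams(tokens, 2))  # [('bbc', 'news'), ('news', 'article'), ('article', 'topic')]
print(make_ngrams(tokens, 3))  # [('bbc', 'news', 'article'), ('news', 'article', 'topic')]
```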

In [None]:
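from nltk.util import ngrams
nltk.download('punkt')  # tokenizer data needed by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)

def bi_tri_grams(text):
  '''
  Return the list of bigrams (pairs of adjacent tokens) in the text.
  Despite the name, only bigrams are returned.
  '''
  tokens = nltk.word_tokenize(text)
  return list(ngrams(tokens,2))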

In [None]:
sample_df=pd.DataFrame()
sample_df['bi']=bi_tri_grams(df['cleaned_doc'][0])
sample_df

In [None]:
df['bigrams_word']=df['cleaned_doc'].apply(bi_tri_grams)

In [None]:
df.head()