# <strong>Data Preprocessing and Cleaning</strong>

This notebook preprocesses the text files in the dataset and produces the final, cleaned data.

<hr>

Here is a sample of <strong>uncleaned, noisy</strong> data from the dataset:

```text
BOARD OF DIRECTORS
Shri Ghanshyam Das Agarwal
-
Non-executive Chairman
Shri Jugal Kishore Agarwal
-
Non-executive Director
Shri Nirmal Kumar Agarwal
-
Non-executive Director
Shri Mohan Lal Agarwal

MANAGEMENT DISCUSSION & ANALYSIS
ADHUNIK METALIKS - AN OVERVIEW
Your Company operates in a specialised segment of steel industry,
producing, special alloy steel, ferro alloys, iron billets and rolled
products at it manufacturing facility at Odisha. Though integrated
with iron ore and manganese ore mines and a 1.6 MMTPA pellet
making facility set up under its wholly owned subsidiary, Orissa
Manganese & Minerals Limited, the fortune of your industry are
dependent upon the growth and fall of iron & steel segment of
the economy. During the year under review, the iron & steel
industry has been plagued with several challenges relating to
negative growth, issues with the mining sector and uncontrolled
imports from countries with surplus capacities. Though a preferred
supplier to many major industrial houses, your Company's
performance has been marred due to the sharp decline in the
performance of important customers of the Company.
```

Noise like this (for example, the board of directors listing that precedes the MD&A text) appears throughout almost the entire dataset, so the data must be cleaned before it is passed to the next phase of our methodology.

### Overview

Several files don't contain any MD&A data (because of inaccuracies when scraping the financial reports). These files must be removed from the dataset, as they are of no use for model building and training. Only the valid files from this set are kept.

In [1]:
# list the report files of each class; they are checked for an MD&A section below

import os
import re

bankrupt_companies = os.listdir('Dataset/Final Dataset/Bankrupt')
healthy_companies = os.listdir('Dataset/Final Dataset/Healthy')


### <strong>1. Select Acceptable Data</strong>

Select only the files that contain an MD&A section; all other files are excluded from the later phases.

In [2]:
acceptable_bankrupt = []
acceptable_healthy = []

# only consider files that have an MD&A section; skip all others
if os.access('Dataset/Final Dataset/Bankrupt', os.R_OK):
    for company in bankrupt_companies:
        with open('Dataset/Final Dataset/Bankrupt/' + company, 'r', encoding='utf-8') as f:
            text = f.read()
            if re.search('management discussion and analysis', text, re.IGNORECASE):
                acceptable_bankrupt.append(company)

if os.access('Dataset/Final Dataset/Healthy', os.R_OK):
    for company in healthy_companies:
        with open('Dataset/Final Dataset/Healthy/' + company, 'r', encoding='utf-8') as f:
            text = f.read()
            if re.search('management discussion and analysis', text, re.IGNORECASE):
                acceptable_healthy.append(company)
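
The two loops above differ only in the directory they read, so the same check can be sketched as one helper. This is a minimal sketch, not the notebook's own code: the helper name `files_with_mda` and the file names in the demo are hypothetical. It also shows one property of the phrase being searched: a file whose only heading is written as "MANAGEMENT DISCUSSION & ANALYSIS" does not match the "and" form, so it may be worth checking whether that contributes to the rejected files.

```python
import os
import re
import tempfile

# Same phrase as in the notebook, compiled once and matched case-insensitively
MDA_HEADING = re.compile(r"management discussion and analysis", re.IGNORECASE)

def files_with_mda(directory):
    """Return the sorted names of files in `directory` that mention the MD&A heading."""
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            if MDA_HEADING.search(f.read()):
                selected.append(name)
    return selected

# Demo on a throwaway directory; the file names and contents are made up
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "A_2017_MDA.txt": "Management Discussion and Analysis\nOutlook for the year",
        "B_2018_MDA.txt": "BOARD OF DIRECTORS\nNon-executive Chairman",
        "C_2019_MDA.txt": "MANAGEMENT DISCUSSION & ANALYSIS\nOverview",
    }
    for name, body in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    demo_result = files_with_mda(tmp)

print(demo_result)  # only the file that uses the "and" form is selected
```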

In [3]:
print(f'Acceptable bankrupt companies: {len(acceptable_bankrupt)} out of {len(bankrupt_companies)} companies.')
print(f'Acceptable healthy companies: {len(acceptable_healthy)} out of {len(healthy_companies)} companies.')

Acceptable bankrupt companies: 131 out of 201 companies.
Acceptable healthy companies: 130 out of 298 companies.


So only 131 of the 201 bankrupt-company files (about 65%) and 130 of the 298 healthy-company files (about 44%) move on to the next phase of the methodology.

### <strong>2. Preprocessing Data</strong>

Preprocess the data by lowercasing, removing punctuation, removing stopwords, lemmatizing, and applying other techniques that are generally used to preprocess text.

In [4]:
import spacy
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the NLTK tokenizer models and the stopword list
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))


In [5]:
def preprocess_mda(mda_text):
    """
    1. Lowercasing
    2. Tokenizing
    3. Removing punctuation
    4. Removing stopwords
    5. Lemmatization
    6. NER
    """
    # Convert text to lowercase
    mda_text = mda_text.lower()
    
    # Sentence tokenization with NLTK
    sentences = sent_tokenize(mda_text)
    
    # Process each sentence
    processed_sentences = []
    for sentence in sentences:
        # Tokenize each sentence into words
        words = word_tokenize(sentence)
        
        # Remove punctuation (except for full stops) and stopwords
        filtered_words = [word for word in words if (word.isalnum() or word == '.') and word not in stop_words]
        
        # Join words back into a sentence
        processed_sentence = ' '.join(filtered_words)
        processed_sentences.append(processed_sentence)
    
    # Join processed sentences back into a single text
    cleaned_text = ' '.join(processed_sentences)
    
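    # Lemmatization with spaCy
    doc = nlp(cleaned_text)
    # '-PRON-' was the placeholder lemma for pronouns in spaCy 2.x; keep the original token in that case
    lemmatized_tokens = [token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc]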
    
    # Named Entity Recognition (NER) with spaCy
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    # Join lemmatized tokens back to form a preprocessed text string
    lemmatized_text = ' '.join(lemmatized_tokens)
    
    return lemmatized_text, named_entities
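
To make the filtering step concrete, the following is a small standard-library sketch of the same list comprehension. The stopword set is a tiny stand-in for NLTK's English list, and the tokens are written by hand, so both are assumptions; the real tokenizer may split text differently. It shows that `isalnum()` discards any token containing a non-alphanumeric character, which includes decimals such as `1.6`, ranges such as `6-7`, and possessives such as `company's`, not only pure punctuation.

```python
# Tiny stand-in stopword set (assumption: the notebook uses NLTK's full English list)
STOP_WORDS = {"is", "to", "the", "of", "and"}

def filter_tokens(tokens):
    """Keep alphanumeric tokens and full stops, then drop stopwords."""
    return [t for t in tokens if (t.isalnum() or t == ".") and t not in STOP_WORDS]

# Hand-written, already lowercased tokens for illustration
tokens = ["steel", "demand", "is", "expected", "to", "grow", "6-7", "%", ",",
          "company's", "1.6", "mmtpa", "."]

kept = filter_tokens(tokens)
print(kept)  # ['steel', 'demand', 'expected', 'grow', 'mmtpa', '.']
```

If figures such as `1.6` matter for the later phases, the filter would need a looser test than `isalnum()`.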

In [6]:
# def preprocess_mda(mda_text):
#     mda_text = mda_text.lower()         # lowercasing
#     sentences = sent_tokenize(mda_text) # sentence tokenization
#     processed_sentences = []                    
#     for sentence in sentences:  
#         words = word_tokenize(sentence) # removing punctuation and stopwords
#         filtered_words = [word for word 
#                     in words if (word.isalnum() or word == '.')
#                     and word not in stop_words]
#         processed_sentence = ' '.join(filtered_words)
#         processed_sentences.append(processed_sentence)
#     cleaned_text = ' '.join(processed_sentences)
    
#     # Lemmatization with spaCy
#     doc = nlp(cleaned_text)
#     lemmatized_tokens = [
#         token.lemma_ if token.lemma_ != '-PRON-' 
#         else token.text for token in doc
#         ]
#     lemmatized_text = ' '.join(lemmatized_tokens)
    
#     return lemmatized_text

### <strong>3. Extract MD&A from each file</strong>

Not every file contains an MD&A section, and the files that do also contain other, irrelevant text. Regular expressions are therefore used to extract the MD&A section from each large text document.

In [7]:
# extract the md&a section from the text
def extract_mda_section(text):
    # Define the start and end patterns for the MD&A section
    start_pattern = r"(?:MANAGEMENT DISCUSSION & ANALYSIS|management discussion and analysis|MD&A|MDA|Management Discussion and Analysis|management discussion)"
    end_pattern = r"(?:DIRECTORS' REPORT|BOARD OF DIRECTORS|CORPORATE GOVERNANCE|CEO CERTIFICATION)"

    # Find the start heading, then look for an end heading only in the text after it
    # (end_match positions are therefore relative to start_match.end())
    start_match = re.search(start_pattern, text, re.IGNORECASE)
    end_match = re.search(end_pattern, text[start_match.end():], re.IGNORECASE) if start_match else None
    
    # If both start and end are found, extract the section between them
    if start_match and end_match:
        mda_section = text[start_match.start():start_match.end() + end_match.start()]
        return mda_section.strip()
    elif start_match:
        # If only the start is found, extract from start to the end of the document
        mda_section = text[start_match.start():].strip()
        return mda_section
    # No start heading found: return None explicitly so callers can skip the file
    return None
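
The slicing logic can be checked on a small string. This is a minimal sketch with shortened patterns and a made-up sample text, so `slice_section`, `START`, and `END` are illustrative names rather than the notebook's own. It shows that an end marker appearing before the MD&A heading is ignored, because the end search runs only on the text after the start match.

```python
import re

# Shortened markers for illustration (assumption: real reports need the longer alternations)
START = r"management discussion (?:&|and) analysis"
END = r"corporate governance|directors' report"

def slice_section(text):
    """Return the text from the start marker up to the next end marker, or None."""
    start = re.search(START, text, re.IGNORECASE)
    if start is None:
        return None
    # Search for the end marker only in the text that follows the start marker
    end = re.search(END, text[start.end():], re.IGNORECASE)
    stop = start.end() + end.start() if end else len(text)
    return text[start.start():stop].strip()

sample = (
    "DIRECTORS' REPORT\nDear members\n"
    "MANAGEMENT DISCUSSION & ANALYSIS\nSteel demand is expected to grow.\n"
    "CORPORATE GOVERNANCE\nThe board met four times."
)

section = slice_section(sample)
print(section)
```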

In [8]:
import nltk
# recent NLTK releases look up the 'punkt_tab' resource when tokenizing
nltk.download('punkt_tab', quiet=True)

True

In [9]:
acceptable_bankrupt[0]

'ADHUNIK_2017_MDA.txt'

In [10]:
# try the extraction and preprocessing on the first acceptable bankrupt file
with open('Dataset/Final Dataset/Bankrupt/' + acceptable_bankrupt[0], 'r', encoding='latin-1') as f:
    text = f.read()
    mda_section = extract_mda_section(text)
    text_data, ne = preprocess_mda(mda_section)

In [11]:
text_data

'management discussion analysis report adhunik metaliks limit 10 annual report india expect big turnaround economy also one fast grow economy however india set challenge tepid infrastructure manufacturing sector . assume average annual price crude petroleum product current account expect decrease . though may take see full policy change india decline improve current account balance provide picture previous year . growth growth expect move upward . infrastructure development increase revival manufacturing sector expect provide necessary trigger steel demand . steel demand expect grow 6 7 . however much sharper expect increase higher budget key downside risk outlook highlight world steel . risk opportunity threat steel sector intrinsically link economic growth . high economic growth india last 10 year lead increase demand steel move indian steel industry new stage growth development . increase result india become 4th large producer crude steel large producer sponge world . per capita ste

In [12]:
acceptable_bankrupt[0], acceptable_healthy[0]

('ADHUNIK_2017_MDA.txt', 'ADVENZYM_2021_MDA.txt')

In [13]:
# os.makedirs('Dataset/Final_Dataset_Cleaned/Bankrupt', exist_ok=True)
# os.makedirs('Dataset/Final_Dataset_Cleaned/Healthy', exist_ok=True)

def preprocess_mda_and_save(acceptable_companies, output_dir, bankrupt=True):
    # Refuse to overwrite the results of an earlier run
    if os.listdir(output_dir):
        print(f'{output_dir} is not empty. Please empty the directory before running this script.')
        return

    # Pick the source folder that matches the class label
    label = 'Bankrupt' if bankrupt else 'Healthy'
    input_dir = 'Dataset/Final Dataset/' + label

    for company in acceptable_companies:
        input_path = input_dir + '/' + company
        if not os.path.exists(input_path):
            print(f'{company} does not exist in the {label} dataset.')
            continue
        with open(input_path, 'r', encoding='utf-8') as f_in:
            text = f_in.read()
        mda_section = extract_mda_section(text)
        if mda_section is None:
            # No MD&A heading found; nothing to preprocess for this file
            print(f'No MD&A section found in {company}; skipping.')
            continue
        # The named entities are not saved, only the cleaned text
        text_data, _ = preprocess_mda(mda_section)
        with open(f'{output_dir}/{company}', 'w', encoding='utf-8') as f_out:
            f_out.write(text_data)


os.makedirs('Dataset/Final_Processed_Dataset/Bankrupt', exist_ok=True)
os.makedirs('Dataset/Final_Processed_Dataset/Healthy', exist_ok=True)

preprocess_mda_and_save(acceptable_bankrupt, 'Dataset/Final_Processed_Dataset/Bankrupt')
preprocess_mda_and_save(acceptable_healthy, 'Dataset/Final_Processed_Dataset/Healthy', bankrupt=False)


Dataset/Final_Processed_Dataset/Bankrupt is not empty. Please empty the directory before running this script.
Dataset/Final_Processed_Dataset/Healthy is not empty. Please empty the directory before running this script.
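
Because the function returns early when the output directory already contains files, the messages above do not say whether the earlier run wrote every file. A quick check like the following can confirm that. It is a minimal sketch: `missing_outputs` is a hypothetical helper, and the demo uses a temporary directory with made-up file names instead of the real dataset.

```python
import os
import tempfile

def missing_outputs(expected_names, output_dir):
    """Return the expected file names that are not present in output_dir."""
    present = set(os.listdir(output_dir))
    return [name for name in expected_names if name not in present]

# Demo: pretend only one of two hypothetical files was written
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "A_2017_MDA.txt"), "w", encoding="utf-8") as f:
        f.write("management discussion analysis")
    still_missing = missing_outputs(["A_2017_MDA.txt", "B_2018_MDA.txt"], tmp)

print(still_missing)  # ['B_2018_MDA.txt']
```

In the notebook this would be called with `acceptable_bankrupt` or `acceptable_healthy` and the matching output directory; an empty list means every acceptable file has a processed counterpart.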
