# <span style="color:red">DATA EXTRACTION AND NLP</span>

#### Problem Statement: Extracting Article Text and Headings from URLs provided and performing Text Analysis

***NOTE:** Multiple files have been provided as a part of the problem statement and are present alongwith this notebook. Kindly refer to the same for better understanding!*

In [1]:
# Importing relevant modules
import pandas as pd
import numpy as np
import requests
import os
import nltk
import re
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup

Problem Input is given in the form of an excel file hence translating into a DataFrame.

In [8]:
df=pd.read_excel("./assign_ques_details/Input.xlsx")
df.head()

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...


##### Now, as we have to extract article text and headings from URLs, we’re using BeautifulSoup library to extract the article headings and text using the class name in a function as shown below.

In [9]:
# Function to get the text and heading of an article from url
def get_article_text_and_heading(url):
    try:
        response = requests.get(url)
        response.raise_for_status() #raising error message incase of response error
        soup = BeautifulSoup(response.content, 'html.parser')
        article_container = soup.find('div', class_='td-post-content')
        if not article_container:
            return 'No heading found', 'No content found' # handling no content

        heading = soup.find('h1')
        heading_text = heading.get_text(strip=True) if heading else 'No heading found' # finding article heading and removing whitespaces
        
        content_elements = article_container.find_all(['p', 'li']) # finding all article text using paragraph and list tags
        article_text = ' '.join([element.get_text(strip=True) for element in content_elements]) 
        return heading_text, article_text 

    except requests.RequestException as e:
        return 'Error fetching page', str(e)

##### We then iterate over the DataFrame’s rows containing the URLs and scrap and assign the article headings as well as text into new columns. This is done using the code block below.

In [10]:
# Iterating over DataFrame's rows to assign headings and text
for index, row in df.iterrows():
    url = row['URL']
    heading, text = get_article_text_and_heading(url)
    # Storing the results in the dataframe
    df.at[index, 'Heading'] = heading
    df.at[index, 'Text'] = text
    
df.head() # Dataframe now contains individual URL's Article Heading and Text

Unnamed: 0,URL_ID,URL,Heading,Text
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,Rising IT cities and its impact on the economy...,We have seen a huge development and dependence...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,Rising IT Cities and Their Impact on the Econo...,"Throughout history, from the industrial revolu..."
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,"Internet Demand’s Evolution, Communication Imp...",Introduction In the span of just a few decades...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,Rise of Cybercrime and its Effect in upcoming ...,"The way we live, work, and communicate has unq..."
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,OTT platform and its impact on the entertainme...,The year 2040 is poised to witness a continued...


##### We can see the DataFrame now contains each URLs Article Heading and Text.

Now, in the problem statement, we have been given a list of Stopwords which are to be excluded from the text. Hence, we first bring the list of StopWords into python and then use it to clean our “Text” column in the DataFrame.

For this, we have defined two functions: read_and_tokenize_files and remove_stopwords.
read_and_tokenize_files:

1. **read_and_tokenize_files** :
Here, for a given “path”, we open all the files in the path one by one and read the content of 
the same. Using nltk, we tokenize the text into words and then filter them so that they only 
contain characters which are further made lowercase. The content of all the files in the path are compiled into a list which is returned to the user (effectively making a list of words from all the files present in the provided path).

In [11]:
## Function for reading files and tokenizing 
def read_and_tokenize_files(path):
    all_files_in_path = os.listdir(path)
    all_tokens = []
    for file in all_files_in_path:
        with open(f"{path}/{file}","r") as f:
            file_text = f.read()
            tokens = word_tokenize(file_text)
            only_words = [token for token in tokens if token.isalpha()] # filtering only words
            all_tokens.extend(only_words)
            all_tokens= [token.lower() for token in all_tokens] 
    return all_tokens

2. **remove_stopwords** :
We use this function to filter out the stopwords from the provided text using the mentioned 
list of stop tokens. We tokenize the provided text into words and subsequently filter by 
removing stopwords using list comprehension. We return the obtained list.

In [12]:
# Function to tokenize and remove stopwords from text
def remove_stopwords(text,stop_tokens):
    tokens = word_tokenize(text)
    article_words = [token for token in tokens if token.isalpha()]
    article_words_wstpw = [word for word in article_words if word.lower() not in stop_tokens] # filtering out StopWords
    return article_words_wstpw

Now, firstly, using the provided StopWords file, we read and tokenize the same into a list 
which is further passed along with each article text in the DataFrame one by one effectively 
filtering out the stopwords. We then save the clean tokenized article words list into another 
column. This is done using the code block below.

In [13]:
# Reading all the stopwords files and compiling into one list for easier checking
path = "./assign_ques_details/StopWords/"
all_stop_tokens = read_and_tokenize_files(path)
df["Text_without_stopwords"]=df["Text"].apply(remove_stopwords, args=(all_stop_tokens,))

We can see that we have a column containing the clean text now : 

In [14]:
df["Text_without_stopwords"]

0     [huge, development, dependence, people, techno...
1     [history, industrial, revolution, century, dev...
2     [Introduction, span, decades, internet, underg...
3     [live, work, communicate, unquestionably, chan...
4     [poised, witness, continued, revolution, world...
                            ...                        
95    [Epidemics, general, direct, indirect, costs, ...
96    [COVID, bought, world, knees, businesses, shut...
97    [Handicrafts, making, crafts, called, handicra...
98    [pay, make, online, payment, lockdown, lockdow...
99    [business, prevent, transmission, financial, c...
Name: Text_without_stopwords, Length: 100, dtype: object

We have 2 text files (from the MasterDictionary file provided) containing positive and negative tokens. We read and tokenize the same into 2 lists for further use in the code snippet below.

In [15]:
# Creating separate lists for positive and negative tokens from Master Dictionary provided
m_dict_path = "./assign_ques_details/MasterDictionary/"
files_in_path = os.listdir(m_dict_path)
p_tokens = [] 
n_tokens = []

for file in files_in_path:
    with open(f"{m_dict_path}{file}","r") as f:
        file_text = f.read()
        tokens = word_tokenize(file_text)
        if file=="positive-words.txt":
            p_tokens.extend(tokens)
        else:
            n_tokens.extend(tokens)
        p_tokens= [token.lower() for token in p_tokens]
        n_tokens= [token.lower() for token in n_tokens]


*Now, as mentioned in the Objective file, we need to calculate various output parameters as defined in the "Text Analysis.docx" file.*

For calculating the positive and negative scores, we define another function called 
score_calculator:

In [16]:
## General function for calculating positive and negative scores
def score_calculator(text, token_list):
    score = 0
    for word in text:
        if word.lower() in token_list:
            score+=1
    return score

We calculate the score on the basis of each word’s presence in the provided token list and 
then return the same. 

Using this function, we calculate the positive and negative scores in the code below :

In [17]:
# Calculating all scores by iterating over clean text columns
df["POSITIVE SCORE"] = df["Text_without_stopwords"].apply(score_calculator, args=(p_tokens,))
df["NEGATIVE SCORE"] = df["Text_without_stopwords"].apply(score_calculator, args=(n_tokens,))
df["POLARITY SCORE"] = (df["POSITIVE SCORE"] - df["NEGATIVE SCORE"])/ ((df["POSITIVE SCORE"] + df["NEGATIVE SCORE"]) + 0.000001)
df["SUBJECTIVITY SCORE"] = (df["POSITIVE SCORE"] + df["NEGATIVE SCORE"])/ (len(df["Text_without_stopwords"]) + 0.000001)

We calculate the syllable count using the following function:

In [18]:
# Function to count syllables in a word
def syllable_count(word):
    vowels = 'aeiou'
    syllable_counter = 0
    for ch in word.lower():
        if ch in vowels:
            syllable_counter+=1
    if word.endswith("es") or word.endswith("ed"): # handling required exceptions
        syllable_counter-=1
    return syllable_counter

    ## Alternate approach using CMU dictionary (not utilised since too computationally expensive)
    # d = nltk.corpus.cmudict.dict() # Loading CMU Dictionary
    # if word.lower() not in d:
    #     return 0 # If word is not found in the dictionary, return 0
    # pronunciations = d[word.lower()]
    # syllable_counter=0
    # # Iterating over all pronunciations of the word and increasing count when last character of an item 
    # # in the pronounciation list is a digit (indicating a syllable)
    # for prn in pronunciations:
    #     for itm in prn:
    #         if itm[-1].isdigit():
    #             syllable_counter+=1
    # return syllable_counter
    

Next, for calculating the multiple output parameters, we have defined a function called 
‘aor_scores’:

In [19]:
# Function for calculating multiple scores/output labels
def aor_scores(text):
    complex_word_count = 0
    total_word_length_sum = 0
    total_syllables = 0
    
    sentences = nltk.sent_tokenize(text)
    all_words = word_tokenize(text)
    only_words = [word for word in all_words if word.isalpha()]
    
    avg_sentence_length = len(only_words) / len(sentences) 
    
    for word in only_words:
        total_word_length_sum+= len(word) # calculating sum of number of letters in all words
        total_syllables+=syllable_count(word)
        if syllable_count(word)>=2:
            complex_word_count+=1

    syllable_count_per_word = total_syllables / len(only_words)
    avg_word_length = total_word_length_sum / len(only_words) 
    complex_words_percentage = ( complex_word_count / len(only_words) )*100
    fog_index = 0.4 * (avg_sentence_length + complex_words_percentage)
    
    return avg_sentence_length, complex_word_count, len(only_words), syllable_count_per_word, complex_words_percentage, fog_index, avg_word_length

We use this function to calculate various output parameters: Firstly, we tokenize the provided article text into sentences as well as words. Using this, we calculate multiple output parameters.

*Note: For output parameter description, kindly refer to "Text Analysis.docx" file.*

We return the same and assign them to new columns in our pandas DataFrame :

In [20]:
# Calculating parameter values
df[["AVG SENTENCE LENGTH","COMPLEX WORD COUNT","WORD COUNT", "SYLLABLE COUNT PER WORD","PERCENTAGE OF COMPLEX WORDS","FOG INDEX",
    "AVG WORD LENGTH"]] = df["Text"].apply(aor_scores).apply(pd.Series)

We calculate the number of pronouns using count_personal_pronouns function and save them into a new column: 

In [21]:
# Function for counting pronouns using regex
def count_personal_pronouns(text):
    pattern = r'\b(I|WE|MY|OURS|We|My|Ours|Us|we|my|ours|us|i)\b' # US case handling by mentioning all possible cases explicitly
    matches = re.findall(pattern, text)
    count = len(matches)
    return count

In [22]:
df["PERSONAL PRONOUNS"] = df["Text"].apply(count_personal_pronouns)

Using the given range, we can also check for any outliers:

In [23]:
## Checking Outliers for Subjectivity and Polarity Scores
# df[df["SUBJECTIVITY SCORE"]>1]
# df[(df["POLARITY SCORE"]>1)|(df["POLARITY SCORE"]<-1)]

**Now, we have a DataFrame with all the required columns.**

In [24]:
df.head()

Unnamed: 0,URL_ID,URL,Heading,Text,Text_without_stopwords,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE COUNT PER WORD,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG WORD LENGTH,PERSONAL PRONOUNS
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,Rising IT cities and its impact on the economy...,We have seen a huge development and dependence...,"[huge, development, dependence, people, techno...",26,6,0.625,0.32,15.207792,558.0,1171.0,1.738685,47.65158,25.143749,4.574722,12
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,Rising IT Cities and Their Impact on the Econo...,"Throughout history, from the industrial revolu...","[history, industrial, revolution, century, dev...",51,29,0.275,0.8,18.350649,813.0,1413.0,2.05874,57.537155,30.355122,5.454352,4
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,"Internet Demand’s Evolution, Communication Imp...",Introduction In the span of just a few decades...,"[Introduction, span, decades, internet, underg...",36,23,0.220339,0.59,18.446429,614.0,1033.0,2.240077,59.438529,31.153983,6.042594,13
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,Rise of Cybercrime and its Effect in upcoming ...,"The way we live, work, and communicate has unq...","[live, work, communicate, unquestionably, chan...",35,74,-0.357798,1.09,20.137255,600.0,1027.0,2.179163,58.42259,31.423938,5.937683,5
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,OTT platform and its impact on the entertainme...,The year 2040 is poised to witness a continued...,"[poised, witness, continued, revolution, world...",18,8,0.384615,0.26,17.289474,360.0,657.0,1.981735,54.794521,28.833598,5.394216,6


We remove the columns which are not required and re-order the same. We then save the DataFrame to an output file. 

In [25]:
## Final output in another dataframe
df2=df.drop(["Text","Heading","Text_without_stopwords"],axis=1)

In [26]:
## Reordering columns and saving to a final dataframe
df_final = df2.loc[:,["URL_ID", "URL",	"POSITIVE SCORE", "NEGATIVE SCORE", "POLARITY SCORE", "SUBJECTIVITY SCORE",	"AVG SENTENCE LENGTH",	"PERCENTAGE OF COMPLEX WORDS",	"FOG INDEX",	"COMPLEX WORD COUNT",	"WORD COUNT",	"SYLLABLE COUNT PER WORD",	"PERSONAL PRONOUNS", "AVG WORD LENGTH"]]

In [27]:
df_final.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE COUNT PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,26,6,0.625,0.32,15.207792,47.65158,25.143749,558.0,1171.0,1.738685,12,4.574722
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,51,29,0.275,0.8,18.350649,57.537155,30.355122,813.0,1413.0,2.05874,4,5.454352
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,36,23,0.220339,0.59,18.446429,59.438529,31.153983,614.0,1033.0,2.240077,13,6.042594
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,35,74,-0.357798,1.09,20.137255,58.42259,31.423938,600.0,1027.0,2.179163,5,5.937683
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,18,8,0.384615,0.26,17.289474,54.794521,28.833598,360.0,657.0,1.981735,6,5.394216


In [None]:
# Converting to excel file if required
df_final.to_excel("final_output_vishrut.xlsx",index=False)

#### This completes the required task.

##### *Additional Help*
**<u>How to run the file</u>:** 

The file can be run in any IDE directly but special care has to be taken to edit the paths 
mentioned in the code for the provided files (positive-words, negative-words, input.xlsx 
etc.).  
Please note, the code block containing the URL scrapping may take some time to execute 
based on the system capability.