## Textual Analysis of Extracted Articles: Understanding Sentiment, Readability, and Linguistic Features

Data Extraction:
- The Python program extracts article text from URLs provided in the "Input.xlsx" file.
- It saves each extracted article in a separate text file named after the URL_ID.
- The program uses Python libraries such as BeautifulSoup and requests for web scraping.

Data Analysis:
- The program performs text analysis on the extracted articles.
- It computes various variables, including positive and negative word counts, polarity score, subjectivity score, average sentence length, percentage of complex words, FOG Index, average number of words per sentence, complex word count, word count, syllables per word, personal pronouns count, and average word length.
- The results are organized and saved in the same order as specified in the "Output Data Structure.xlsx" file.

Objective:
- The objective is to extract and analyze textual data from articles provided via URLs.
- The program ensures that only article titles and texts are extracted while excluding website headers, footers, or any irrelevant content.

Tools and Libraries:
- Python programming is used for both data extraction and analysis.
- Libraries like BeautifulSoup, requests, Pandas, and NLTK (Natural Language Toolkit) may be utilized for web scraping and text analysis.

Output:
- The program generates an output in compliance with the structure specified in the "Output Data Structure.xlsx" file.

In [None]:
pip install syllables

Collecting syllables
  Downloading syllables-1.0.7-py3-none-any.whl (15 kB)
Collecting cmudict<2.0.0,>=1.0.11 (from syllables)
  Downloading cmudict-1.0.13-py3-none-any.whl (939 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m939.3/939.3 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting importlib-metadata<6.0.0,>=5.1.0 (from syllables)
  Downloading importlib_metadata-5.2.0-py3-none-any.whl (21 kB)
Collecting importlib-resources<6.0.0,>=5.10.1 (from cmudict<2.0.0,>=1.0.11->syllables)
  Downloading importlib_resources-5.13.0-py3-none-any.whl (32 kB)
Installing collected packages: importlib-resources, importlib-metadata, cmudict, syllables
  Attempting uninstall: importlib-resources
    Found existing installation: importlib-resources 6.0.1
    Uninstalling importlib-resources-6.0.1:
      Successfully uninstalled importlib-resources-6.0.1
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 6.8.0
    Uninstalling 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Necessary Libraries

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import syllables as syllables_module
import matplotlib.pyplot as plt

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Read The Excel File

In [None]:
df = pd.read_excel('/content/drive/MyDrive/Data Extraction and NLP/Input.xlsx')
df.head()

Unnamed: 0,URL_ID,URL
0,123.0,https://insights.blackcoffer.com/rise-of-telem...
1,321.0,https://insights.blackcoffer.com/rise-of-e-hea...
2,2345.0,https://insights.blackcoffer.com/rise-of-e-hea...
3,4321.0,https://insights.blackcoffer.com/rise-of-telem...
4,432.0,https://insights.blackcoffer.com/rise-of-telem...


In [None]:
df.tail()

Unnamed: 0,URL_ID,URL
109,50921.0,https://insights.blackcoffer.com/coronavirus-i...
110,51382.8,https://insights.blackcoffer.com/coronavirus-i...
111,51844.6,https://insights.blackcoffer.com/what-are-the-...
112,52306.4,https://insights.blackcoffer.com/marketing-dri...
113,52768.2,https://insights.blackcoffer.com/continued-dem...


## Web Data Scraping and DataFrame Creation

In [None]:
results = []
for index, rows in df.iterrows():
  url = rows['URL']
  response = requests.get(url)

  if response.status_code == 200:
    doc = BeautifulSoup(response.text, 'html.parser')

    title = (doc.title).string[:-23]
    for text in doc.find_all(class_="td-post-content tagdiv-type"):
      Text = text.get_text().replace('\n', ' ')


      result_entry = {
            'URL_ID': rows['URL_ID'],
            'Title': title,
            'Text': Text
        }
      results.append(result_entry)
  else:
        print(f"Failed to fetch data for URL_ID {rows['URL_ID']}")

# Create a DataFrame from the results list
result_df = pd.DataFrame(results)
result_df


Failed to fetch data for URL_ID 11668.0
Failed to fetch data for URL_ID 17671.4


Unnamed: 0,URL_ID,Title,Text
0,123.0,Rise of telemedicine and its Impact on Livelih...,"Telemedicine, the use of technology to diagno..."
1,321.0,Rise of e-health and its impact on humans by t...,"The rise of e-health, or the use of electroni..."
2,4321.0,Rise of telemedicine and its Impact on Livelih...,"“More gains on quality, affordability and acc..."
3,432.0,Rise of telemedicine and its Impact on Livelih...,"“More gains on quality, affordability and acc..."
4,2893.8,Rise of Chatbots and its impact on customer su...,The human race is known to come up with inven...
...,...,...,...
97,50921.0,Coronavirus: Impact on the Hospitality Industry,Before jumping on the topic I would like to g...
98,51382.8,Coronavirus impact on energy markets,As the coronavirus spreads around the world a...
99,51844.6,impacts of COVID-19 on the world of work - Wh,"From Alibaba to Ping An and Google to Ford, c..."
100,52306.4,Marketing Drives Results With A Focus On Problems,"When the British ruled India, many Indians ac..."


The following two URL show `ERROR 404`.
- **Error Message:** Failed to fetch data for [URL_ID 11668.0](https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/)
- **Error Message:** Failed to fetch data for [URL_ID 17671.4](https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/)


## Storing the data into an Excel File For Further Use

In [9]:
url = '/content/drive/MyDrive/Data Extraction and NLP/Text_file.xlsx'
result_df.to_excel(url, index=False)

In [10]:
df2 = pd.read_excel('/content/drive/MyDrive/Data Extraction and NLP/Text_file.xlsx')
df2.head()

Unnamed: 0,URL_ID,Title,Text
0,123.0,Rise of telemedicine and its Impact on Livelih...,"Telemedicine, the use of technology to diagno..."
1,321.0,Rise of e-health and its impact on humans by t...,"The rise of e-health, or the use of electroni..."
2,4321.0,Rise of telemedicine and its Impact on Livelih...,"“More gains on quality, affordability and acc..."
3,432.0,Rise of telemedicine and its Impact on Livelih...,"“More gains on quality, affordability and acc..."
4,2893.8,Rise of Chatbots and its impact on customer su...,The human race is known to come up with inven...


In [11]:
df2.tail()

Unnamed: 0,URL_ID,Title,Text
97,50921.0,Coronavirus: Impact on the Hospitality Industry,Before jumping on the topic I would like to g...
98,51382.8,Coronavirus impact on energy markets,As the coronavirus spreads around the world a...
99,51844.6,impacts of COVID-19 on the world of work - Wh,"From Alibaba to Ping An and Google to Ford, c..."
100,52306.4,Marketing Drives Results With A Focus On Problems,"When the British ruled India, many Indians ac..."
101,52768.2,Continued Demand for Sustainability,The business of business is no longer to do j...


In [12]:
df.shape, df2.shape

((114, 2), (102, 3))

## Collecting Stop Words from Multiple Files in a Directory

In [13]:
stop_dir = '/content/drive/MyDrive/Data Extraction and NLP/StopWords'
stop_words = set()
for filename in os.listdir(stop_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(stop_dir, filename), 'r', encoding='latin-1') as stop_words_file:
            stop_words.update(set(stop_words_file.read().splitlines()))

In [14]:
print(f"Stop Words: {stop_words}")

Stop Words: {'BRET', 'UTLEY', 'MARTINI', 'DUPRE', 'ROSEANN', 'UPTON', 'BRIDGETT', 'NADINE', 'HORTON', 'DELANA', 'LATANYA', 'FRANCO', 'besides', 'PAK', 'AUDRIA', 'FENWICK', 'THIGPEN', 'HEIN', 'FARBER', 'ETSUKO', 'WOODWARD', 'SERRATO', 'PIAZZA', 'LESTER', 'PHAM', 'FORSTER', 'ROSENBERGER', 'QUINTON', 'GOLDSBERRY', 'ANGELINA', 'BRASHEAR', 'BRUCE', 'BARD', 'ETHRIDGE', 'DAMARIS', 'BARBER', 'SHANER', 'CLEMENTINE', 'LINDSAY', 'SHERLYN', 'MERAZ', 'QUICK', 'OZELLA', 'DUENAS', 'OLEVIA', 'HINTZ', 'CARREON', 'SKIDMORE', 'CHAVES', 'MCCOLLUM', 'DEVOE', 'CHRISTOPHER', 'YAMADA', 'DINAR | Algeria ', 'HYON', 'LINN', 'HAZEN', 'RAMBO', 'LEVINSON', 'JANAE', 'SHERER', 'MINCEY', 'EDDINGS', 'TUGGLE', 'BARBAR', 'APPLEBY', 'PAULA', 'CARNES', 'HART', 'MINTZ', 'LIRA', 'MASTIN', 'LUCKY', 'MICAH', 'CUSICK', 'ROSSER', 'CAMBELL', 'PHILOMENA', 'DIONE', 'BETHANY', 'BEARDSLEY', 'LEFLORE', 'ALINA', 'TYRONE', 'CROWELL', 'therefore', 'CRITES', 'SIMONA', 'CASSIDY', 'FELDER', 'ARETHA', 'MIKI', 'WATERS', 'TOWNSEND', 'MCADOO', 

## Loading Positive and Negative Words from MasterDictionary

In [15]:
master_dict = '/content/drive/MyDrive/Data Extraction and NLP/MasterDictionary'
positive_words = set()
negative_words = set()
master_list = set()
for filename in os.listdir(master_dict):
    if filename.endswith(".txt"):
        with open(os.path.join(master_dict, filename), 'r', encoding = 'latin-1') as pos_neg_words:
            words = set(pos_neg_words.read().splitlines())
            if 'positive' in filename.lower():
                positive_words.update(words)
            elif 'negative' in filename.lower():
                negative_words.update(words)

In [16]:
print(f"Positive Words: {positive_words}")
print(f"Negative Words: {negative_words}")

Positive Words: {'unbound', 'liberation', 'preferably', 'astounded', 'harmonize', 'joyously', 'upliftingly', 'strongest', 'lavishly', 'unencumbered', 'wonderfully', 'liberate', 'fascination', 'empathy', 'effectual', 'rightness', 'invaluablely', 'trustworthy', 'maturely', 'rock-stars', 'well-educated', 'greatness', 'energy-saving', 'painless', 'eased', 'apotheosis', 'steadiness', 'skillfully', 'rewarding', 'propitious', 'temptingly', 'dignified', 'clearer', 'flourish', 'excelled', 'elite', 'inpressed', 'autonomous', 'resilient', 'steadfastly', 'strikingly', 'outsmart', 'significant', 'appreciatively', 'instantly', 'groundbreaking', 'refunded', 'wowing', 'excallent', 'delightful', 'charmingly', 'revitalize', 'capability', 'amiable', 'amazes', 'enthralled', 'gentlest', 'monumentally', 'usable', 'affection', 'influential', 'authentic', 'praiseworthy', 'upgraded', 'helpful', 'incredible', 'heartening', 'avid', 'breathtaking', 'successful', 'complimentary', 'gracefully', 'versatility', 'plus

## Writing DataFrame Text Data to Individual Text Files

In [17]:
target_dir = '/content/drive/MyDrive/Data Extraction and NLP/TextFile'
for index, rows in df2.iterrows():
    url_id = rows['URL_ID']
    text = rows['Text']
    filenames = f'{url_id}.txt'
    path_file = os.path.join(target_dir, filenames)
    with open(path_file, 'w', encoding='utf-8') as words:
        words.write(text)

## Text Data Collection

In [18]:
text = []
for filename in os.listdir(target_dir):
    if filename.endswith(".txt"):
         path_file = os.path.join(target_dir, filename)
         with open(path_file, 'r', encoding = 'utf-8') as file:
             text = file.read()

In [19]:
print(f"Texts: {text}")

Texts:  The business of business is no longer to do just business or increase the bottom line to maximize shareholder value. Rather, the concept of business is moving towards a new dimension of sustainable business, the triple bottom line. People, planet, and profits are the core ideologies that are rooted in sustainable business. Sustainability is taken into account when companies want to create long-term value creation along with strategies that promote the longevity of the company. As corporate accountability rises, expectations and need for transparency among stakeholders increases therefore companies have started to recognize the need to be sustainable to stay alert and alive.  Business globalization that has happened over the previous few decades has made some companies more powerful than some national governments, making it easy for them to exploit inexpensive labor, plunder natural resources, causing severe impacts through pollution on the natural environment, human health, and

In [20]:
df3 = df2.copy()

## Text Data Analysis and Feature Extraction

Explaination of each variable:
1. **POSITIVE SCORE**: The count of words in the text that are considered positive or indicative of a positive sentiment.
   
2. **NEGATIVE SCORE**: The count of words in the text that are considered negative or indicative of a negative sentiment.

3. **POLARITY SCORE**: A numerical value indicating the overall sentiment polarity of the text, calculated as the difference between positive and negative word counts normalized by their sum.

4. **SUBJECTIVITY SCORE**: A numerical representation of the text's subjectivity, computed as the ratio of the sum of positive and negative word counts to the text's length.

5. **AVG SENTENCE LENGTH**: The average number of words in each sentence within the text, providing insights into sentence complexity and readability.

6. **PERCENTAGE OF COMPLEX WORDS**: The proportion of words in the text that are considered complex, typically having multiple syllables, expressed as a percentage.

7. **FOG INDEX**: A readability index that estimates the number of years of formal education required to understand the text, based on sentence length and the percentage of complex words.

8. **AVG NUMBER OF WORDS PER SENTENCE**: The mean count of words per sentence, providing an additional measure of text readability and sentence structure.

9. **COMPLEX WORD COUNT**: The total count of complex words in the text, which are often multisyllabic and can impact text comprehension.

10. **WORD COUNT**: The total number of words in the text, serving as a fundamental measure of text length.

11. **SYLLABLE PER WORD**: The average number of syllables per word in the text, offering insights into word complexity and pronunciation.

12. **PERSONAL PRONOUNS**: The total count of personal pronouns (e.g., "I," "you," "he," "she") used in the text, indicating the text's focus on individuals or personalities.

13. **AVG WORD LENGTH**: The mean length of words in the text, calculated by dividing the total number of letters by the total word count, providing insights into word complexity and language sophistication.

In [21]:
output = []
for i in range(len(df3)):
  text = df3.loc[i, 'Text']
  url_id = df3.loc[i, 'URL_ID']
  url = df.loc[i, 'URL']

  words = word_tokenize(text.lower())
  cleaned_words = [word for word in words if word not in stop_words]
  positive_score = len([word for word in cleaned_words if word.lower() in positive_words])
  negative_score = len([word for word in cleaned_words if word.lower() in negative_words])
  polarity_score =((positive_score - negative_score)/((positive_score + negative_score) + 0.000001))
  subjectivity_score = ((positive_score + negative_score / (len(text)) + 0.000001))
  sentences = sent_tokenize(text)
  avg_sent_len = len(words)/len(sentences)
  complex_word_count = len([word for word in words if syllables_module.estimate(word) > 2])
  if len(words) == 0:
        percent_complex_word = 0
        syllable_per_word = 0
        avg_word_length = 0

  else:
        percent_complex_word = (complex_word_count / len(words)) * 100
        syllable_per_word = sum(syllables_module.estimate(word) for word in words) / len(words)
        avg_word_length = sum(len(word) for word in words) / len(words)


  for_index = (0.4 * (avg_sent_len + percent_complex_word))
  avg_num_word = len(words) / len(sentences)
  word_count = len(words)
  personal_pronoun = len([word for word in words if word.lower() in ['i', 'me', 'you', 'he', 'she', 'it', 'we', 'they']])
  output_entry ={ "URL_ID": url_id,
                  "URL": url,
                  "Positive Score": positive_score,
                  "Negative Score": negative_score,
                  "Polarity Score": round(polarity_score,2),
                  "Subjectivity Score": round(subjectivity_score, 2),
                  "Avg Sentence Length": round(avg_sent_len,2),
                  "Percentage of Complex Words": percent_complex_word,
                  "FOG Index": round(for_index, 2),
                  "Avg Number of Words per Sentence": round(avg_num_word,2),
                  "Complex Word Count": complex_word_count,
                  "Word Count": word_count,
                  "Syllable per Word": round(syllable_per_word, 2),
                  "Personal Pronouns": personal_pronoun,
                  "Avg Word Length": round(avg_word_length, 2)}
  output.append(output_entry)

output_df = pd.DataFrame(output)

In [22]:
output_df

Unnamed: 0,URL_ID,URL,Positive Score,Negative Score,Polarity Score,Subjectivity Score,Avg Sentence Length,Percentage of Complex Words,FOG Index,Avg Number of Words per Sentence,Complex Word Count,Word Count,Syllable per Word,Personal Pronouns,Avg Word Length
0,123.0,https://insights.blackcoffer.com/rise-of-telem...,80,24,0.54,80.00,23.12,27.405405,20.21,23.12,507,1850,1.91,21,5.14
1,321.0,https://insights.blackcoffer.com/rise-of-e-hea...,38,13,0.49,38.00,26.56,29.216867,22.31,26.56,194,664,1.90,14,5.12
2,4321.0,https://insights.blackcoffer.com/rise-of-e-hea...,35,27,0.13,35.00,22.90,25.254731,19.26,22.90,347,1374,1.81,17,5.05
3,432.0,https://insights.blackcoffer.com/rise-of-telem...,35,27,0.13,35.00,22.90,25.254731,19.26,22.90,347,1374,1.81,17,5.05
4,2893.8,https://insights.blackcoffer.com/rise-of-telem...,48,12,0.60,48.00,19.98,25.327175,18.12,19.98,329,1299,1.83,17,5.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,50921.0,https://insights.blackcoffer.com/lessons-from-...,5,29,-0.71,5.01,25.47,21.073298,18.62,25.47,161,764,1.76,3,4.84
98,51382.8,https://insights.blackcoffer.com/lessons-from-...,24,64,-0.45,24.01,37.70,20.106101,23.12,37.70,379,1885,1.74,5,4.73
99,51844.6,https://insights.blackcoffer.com/coronavirus-i...,89,33,0.46,89.00,28.01,21.876594,19.96,28.01,429,1961,1.75,24,4.77
100,52306.4,https://insights.blackcoffer.com/why-scams-lik...,28,21,0.14,28.00,26.78,18.417722,18.08,26.78,291,1580,1.69,34,4.64


## Saving DataFrame to Excel

In [None]:
output_url = '/content/drive/MyDrive/Data Extraction and NLP/Output_File.xlsx'
output_df.to_excel(output_url, index = False)