# Approach to the Solution

- Running the script and performing the text analysis involves four stages: setting up the environment, extracting the article text, analyzing the text, and saving the results. Each stage is outlined below.

### 1. Setting Up the Environment

1. Install necessary libraries.
2. Prepare a list of positive and negative words.

### 2. Extracting Article Text

1. Read URLs from the input Excel file.
2. Fetch and parse the HTML content of each URL to extract the article title and body text.

### 3. Performing Text Analysis

1. Compute various text metrics using the extracted article text.
2. Use natural language processing techniques to calculate these metrics.

### 4. Saving the Results

1. Save the computed metrics to an output Excel file.

# Running the Python Script

### 1. Install Dependencies

- Before running the script, ensure all necessary Python packages are installed. You can install them using "pip":

In [2]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install requests beautifulsoup4 pandas nltk textstat syllapy textblob


Note: you may need to restart the kernel to use updated packages.
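Once the install finishes, a quick check can confirm that every package is importable before any script runs. The sketch below is a minimal example, not part of the original notebook. The helper name `find_missing` is hypothetical, and `openpyxl` is included on the assumption that pandas needs it to read and write `.xlsx` files.

```python
import importlib.util

# Import names can differ from pip names: beautifulsoup4 is imported as "bs4".
# "openpyxl" is an assumption: pandas typically uses it as its .xlsx engine.
REQUIRED_MODULES = [
    "requests", "bs4", "pandas", "nltk",
    "textstat", "syllapy", "textblob", "openpyxl",
]

def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, install it with "pip" and restart the kernel before continuing.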


### 2. Prepare Positive and Negative Words Lists

- Create a Python script ("write_words_to_file.py") to generate "positive-words.txt" and "negative-words.txt" files:

In [4]:
# Lists of positive and negative words
positive_words = [
    "able", "abundance", "accomplish", "achievement", "active", "admire", "adventure", "affirmative",
    "amazing", "angelic", "awesome", "beautiful", "blessing", "brilliant", "celebrate", "champion",
    "charming", "cheerful", "clever", "confident", "courageous", "creative", "delight", "divine",
    "elegant", "enchanting", "encouraging", "energetic", "enjoy", "enthusiastic", "excellent", 
    "fabulous", "fantastic", "flourish", "fortunate", "free", "friendly", "fun", "generous",
    "genius", "glorious", "good", "great", "happy", "harmonious", "healthy", "heroic", "imaginative",
    "impressive", "incredible", "inspirational", "intelligent", "joyful", "kind", "knowledgeable",
    "laugh", "love", "lucky", "marvelous", "motivated", "outstanding", "perfect", "phenomenal", 
    "positive", "prosperous", "remarkable", "resilient", "resourceful", "spectacular", "splendid",
    "successful", "terrific", "thriving", "trustworthy", "unique", "uplifting", "victorious", "vivid",
    "wonderful", "worthy"
]

negative_words = [
    "abandon", "abuse", "afraid", "aggressive", "angry", "annoy", "anxiety", "arrogant", "ashamed", 
    "awful", "bad", "bitter", "bored", "broken", "clumsy", "collapse", "conflict", "confused", 
    "corrupt", "crazy", "crime", "cruel", "cry", "damage", "danger", "dark", "dead", "depressed", 
    "desperate", "destroy", "difficult", "disaster", "disgusting", "distress", "dreadful", "dull", 
    "embarrass", "enemy", "error", "evil", "fail", "fear", "filthy", "fool", "frighten", "frustrate", 
    "gloomy", "greed", "grief", "guilty", "harm", "hate", "heartbreaking", "horrible", "hostile", 
    "hurt", "ignorant", "ill", "immoral", "impatient", "imperfect", "impossible", "insecure", 
    "insult", "jealous", "lazy", "lost", "lousy", "mad", "mean", "messy", "miserable", "negative", 
    "nervous", "offensive", "pain", "pathetic", "poor", "reject", "repulsive", "revenge", "sad", 
    "scared", "selfish", "shame", "sick", "stress", "stupid", "terrible", "ugly", "unhappy", 
    "upset", "useless", "weak", "worry", "worthless"
]

# Function to write a list of words to a file
def write_words_to_file(filename, words):
    with open(filename, 'w') as file:
        for word in words:
            file.write(word + '\n')

# Create the files
write_words_to_file('positive-words.txt', positive_words)
write_words_to_file('negative-words.txt', negative_words)

print("Files created successfully.")


Files created successfully.
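The generated files hold one word per line, so they can be read back into sets for fast membership tests. The following is a minimal sketch of that round trip. The helper names `load_word_set` and `count_matches` are hypothetical, and the demo writes a small temporary file instead of touching the real word lists.

```python
import tempfile
from pathlib import Path

def load_word_set(path):
    """Read one word per line into a lowercase set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def count_matches(tokens, lexicon):
    """Count tokens that appear in the lexicon, ignoring case."""
    return sum(1 for token in tokens if token.lower() in lexicon)

# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample-words.txt"
    sample_path.write_text("good\nGreat\n\nhappy\n", encoding="utf-8")
    sample_lexicon = load_word_set(sample_path)

print(sorted(sample_lexicon))
print(count_matches(["A", "GOOD", "day", "great"], sample_lexicon))
```

In the real workflow, the same loader could be pointed at "positive-words.txt" and "negative-words.txt".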


### 3. Extracting Article Text

- Create a Python script ("extract_articles.py") to extract text from the provided URLs:

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to extract article text
def extract_article_text(url):
    try:
        # Set a timeout so a stalled server cannot hang the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract title
        title = soup.find('h1').get_text()

        # Extract article body
        article_body = soup.find('div', class_='td-post-content')
        paragraphs = article_body.find_all('p')

        article_text = "\n".join([p.get_text() for p in paragraphs])
        
        return title, article_text
    except Exception as e:
        print(f"Error extracting article text from {url}: {e}")
        return None, None

# Load input data (adjust this path to wherever Input.xlsx is stored)
input_file = r"C:\Users\SUSHIL\Downloads\Input.xlsx"
input_data = pd.read_excel(input_file)

# Iterate through each URL and extract the text
for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    
    title, article_text = extract_article_text(url)
    
    if title and article_text:
        file_name = f"{url_id}.txt"
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(title + "\n\n")
            file.write(article_text)
        print(f"Successfully extracted and saved article for URL_ID: {url_id}")
    else:
        print(f"Failed to extract article for URL_ID: {url_id}")

Successfully extracted and saved article for URL_ID: blackassign0001
Successfully extracted and saved article for URL_ID: blackassign0002
Successfully extracted and saved article for URL_ID: blackassign0003
Successfully extracted and saved article for URL_ID: blackassign0004
Successfully extracted and saved article for URL_ID: blackassign0005
Successfully extracted and saved article for URL_ID: blackassign0006
Successfully extracted and saved article for URL_ID: blackassign0007
Successfully extracted and saved article for URL_ID: blackassign0008
Successfully extracted and saved article for URL_ID: blackassign0009
Successfully extracted and saved article for URL_ID: blackassign0010
Successfully extracted and saved article for URL_ID: blackassign0011
Successfully extracted and saved article for URL_ID: blackassign0012
Successfully extracted and saved article for URL_ID: blackassign0013
Successfully extracted and saved article for URL_ID: blackassign0014
Successfully extracted and saved a
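Each saved file contains the title, a blank line, and then the article body, so the text can be reloaded later without fetching the page again. The sketch below shows one way to split a saved file back into its two parts. The helper names `split_saved_article` and `read_saved_article` are hypothetical, and the split assumes the title itself contains no blank line.

```python
from pathlib import Path

def split_saved_article(raw_text):
    """Split saved article text into (title, body) at the first blank line."""
    title, separator, body = raw_text.partition("\n\n")
    if not separator:
        # No blank line found: treat the whole text as the title.
        return raw_text.strip(), ""
    return title.strip(), body.strip()

def read_saved_article(path):
    """Load a saved "<URL_ID>.txt" file and return (title, body)."""
    return split_saved_article(Path(path).read_text(encoding="utf-8"))

sample = "My Title\n\nFirst paragraph.\nSecond paragraph."
title, body = split_saved_article(sample)
print(title)
print(body)
```

Reading from the saved files keeps the analysis step repeatable even if a URL later becomes unavailable.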

### 4. Performing Text Analysis

- Create a Python script ("text_analysis.py") to perform text analysis:

In [5]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
import textstat
import syllapy

# Ensure you have the required NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')

# Load positive and negative word lists (context managers close the files)
with open('positive-words.txt') as f:
    positive_words = set(f.read().split())
with open('negative-words.txt') as f:
    negative_words = set(f.read().split())

# Function to extract article text
def extract_article_text(url):
    try:
        # Set a timeout so a stalled server cannot hang the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract title
        title = soup.find('h1').get_text()

        # Extract article body
        article_body = soup.find('div', class_='td-post-content')
        paragraphs = article_body.find_all('p')

        article_text = "\n".join([p.get_text() for p in paragraphs])
        
        return title, article_text
    except Exception as e:
        print(f"Error extracting article text from {url}: {e}")
        return None, None
    
# Function to calculate positive score
def positive_score(text):
    words = word_tokenize(text)
    return sum(1 for word in words if word.lower() in positive_words)

# Function to calculate negative score
def negative_score(text):
    words = word_tokenize(text)
    return sum(1 for word in words if word.lower() in negative_words)

# Function to calculate polarity score
def polarity_score(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity

# Function to calculate subjectivity score
def subjectivity_score(text):
    analysis = TextBlob(text)
    return analysis.sentiment.subjectivity

# Function to calculate average sentence length
def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    return len(words) / len(sentences)

# Function to calculate percentage of complex words
def percentage_complex_words(text):
    words = word_tokenize(text)
    complex_words = [word for word in words if syllapy.count(word) >= 3]
    return len(complex_words) / len(words) * 100

# Function to calculate Fog Index
def fog_index(text):
    return textstat.gunning_fog(text)

# Function to calculate average number of words per sentence
# (same definition as avg_sentence_length, so reuse it)
def avg_words_per_sentence(text):
    return avg_sentence_length(text)

# Function to calculate complex word count
def complex_word_count(text):
    words = word_tokenize(text)
    return sum(1 for word in words if syllapy.count(word) >= 3)

# Function to calculate word count
def word_count(text):
    words = word_tokenize(text)
    return len(words)

# Function to calculate syllables per word
def syllable_per_word(text):
    words = word_tokenize(text)
    syllables = sum(syllapy.count(word) for word in words)
    return syllables / len(words)

# Function to calculate personal pronouns
def personal_pronouns(text):
    words = word_tokenize(text)
    personal_pronouns_list = ['I', 'we', 'my', 'ours', 'us', 'We', 'My', 'Ours', 'Us']
    return sum(1 for word in words if word in personal_pronouns_list)

# Function to calculate average word length
def avg_word_length(text):
    words = word_tokenize(text)
    return sum(len(word) for word in words) / len(words)

# Load input data (adjust this path to wherever Input.xlsx is stored)
input_file = r"C:\Users\SUSHIL\Downloads\Input.xlsx"
input_data = pd.read_excel(input_file)

# Prepare a list to hold results
results = []

# Iterate through each URL and process the text
for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    
    title, article_text = extract_article_text(url)
    
    if title and article_text:
        analysis_results = {
            "URL_ID": url_id,
            "URL": url,
            "Title": title,
            "Positive Score": positive_score(article_text),
            "Negative Score": negative_score(article_text),
            "Polarity Score": polarity_score(article_text),
            "Subjectivity Score": subjectivity_score(article_text),
            "Avg Sentence Length": avg_sentence_length(article_text),
            "Percentage of Complex Words": percentage_complex_words(article_text),
            "Fog Index": fog_index(article_text),
            "Avg Words Per Sentence": avg_words_per_sentence(article_text),
            "Complex Word Count": complex_word_count(article_text),
            "Word Count": word_count(article_text),
            "Syllable Per Word": syllable_per_word(article_text),
            "Personal Pronouns": personal_pronouns(article_text),
            "Avg Word Length": avg_word_length(article_text)
        }
        results.append(analysis_results)
    else:
        print(f"Failed to process article for URL_ID: {url_id}")
        
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Save results to an Excel file
output_file = 'Output Data Structure.xlsx'
results_df.to_excel(output_file, index=False)

print("Textual analysis complete and results saved.")
        

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SUSHIL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SUSHIL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Error extracting article text from https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Failed to process article for URL_ID: blackassign0036
Error extracting article text from https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/
Failed to process article for URL_ID: blackassign0049
Textual analysis complete and results saved.
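Two details of the metrics are worth spelling out. First, `word_tokenize` returns punctuation marks as separate tokens, so the word-based counts in the script include them. Second, the Gunning Fog index is conventionally defined as 0.4 × (average sentence length + percentage of complex words). The sketch below illustrates both points using only the standard library. The helper names are hypothetical, the regular expression is a simplified stand-in for a real tokenizer, and `textstat` may count complex words differently, so its value need not match this formula exactly.

```python
import re

def word_tokens(text):
    """Return alphabetic tokens only, dropping punctuation and digits."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (e.g. empty text)."""
    return numerator / denominator if denominator else 0.0

def gunning_fog(avg_sentence_length, pct_complex_words):
    """Gunning Fog index from its two components."""
    return 0.4 * (avg_sentence_length + pct_complex_words)

tokens = word_tokens("Hello, world! It's 2024.")
print(tokens)                         # punctuation and digits are gone
print(safe_ratio(len(tokens), 0))     # guarded division on empty input
print(gunning_fog(20.0, 10.0))        # 20 words per sentence, 10% complex words
```

Filtering tokens this way before counting would lower the word count and change the per-word averages, so any comparison with the saved output should take that into account.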


# Summary

1. Setup: Ensure all required libraries are installed.
2. Prepare Word Lists: Generate "positive-words.txt" and "negative-words.txt".
3. Extract Articles: Scrape article content from the URLs in "Input.xlsx" and save them.
4. Analyze Text: Compute textual metrics and save the results in "Output Data Structure.xlsx".

Following these steps will allow us to perform the required text analysis and generate the desired output.