### Approach to the Solution

1. **Data Extraction:**
   - **Reading Input:** The script begins by reading an Excel file (Input.xlsx) using pandas to obtain URLs and their corresponding IDs.
   - **Fetching Article Text:** For each URL, it uses requests and BeautifulSoup to extract the article text from the webpage. The extracted text is saved into separate text files named after their URL_IDs in a directory (articles/).

2. **Textual Analysis:**
   - **Loading Dependencies:** The script uses several libraries and tools:
     - pandas: For data handling and manipulation.
     - nltk: For tokenization (word_tokenize, sent_tokenize), stopword removal, and syllable counting via the CMU Pronouncing Dictionary (cmudict).
     - positive-words.txt and negative-words.txt: Word lists used to compute the positive, negative, polarity, and subjectivity scores.
     - textstat and TextBlob: Listed as dependencies (TextBlob is imported), but the code below computes the readability and sentiment metrics directly instead of calling them.
   - **Calculating Metrics:** A single function, analyze_text, computes every metric for one article: positive and negative scores, polarity, subjectivity, average sentence length, percentage of complex words, fog index, word count, syllables per word, personal pronouns, and average word length. A helper, syllable_count, looks up each word in cmudict.
   - **Iterating through Data:** It iterates through each row of the input DataFrame, reads the corresponding article text from the saved file (or fetches it if the file does not exist yet), calculates the required metrics, and appends one row per article to the results list. Fetch or parsing failures are written to error_log.txt.

3. **Output Generation:**
   - **Creating Output DataFrame:** After computing all metrics for all articles, it creates a pandas DataFrame (results_df) from results and merges it with the input DataFrame on URL_ID to produce final_df.
   - **Saving to Excel:** Finally, the script saves final_df to an Excel file named Output Data Structure.xlsx, using the column names and order specified for that file.
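
As a rough illustration of the readability metrics named in step 2, the sketch below computes average sentence length, the fraction of complex words, and the fog index on a short sample. It is not the notebook's implementation: the regex-based sentence and word splitting and the vowel-group syllable heuristic are simplifying assumptions chosen so the example runs without NLTK data, and the names rough_syllables, readability, and sample are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def rough_syllables(word):
    """Approximate syllables as the number of vowel groups (a heuristic, not cmudict)."""
    return len(VOWEL_GROUP.findall(word.lower()))


def readability(text):
    """Return average sentence length, complex-word fraction, and fog index."""
    # Naive splitting: sentences end at '.', '!' or '?'; words are alphabetic runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "complex_fraction": 0.0, "fog_index": 0.0}

    # A word counts as complex when it has more than two (approximate) syllables.
    complex_words = [w for w in words if rough_syllables(w) > 2]
    avg_sentence_length = len(words) / len(sentences)
    complex_fraction = len(complex_words) / len(words)
    return {
        "avg_sentence_length": avg_sentence_length,
        "complex_fraction": complex_fraction,
        # Same shape as the formula used in this project: 0.4 * (ASL + complex share).
        "fog_index": 0.4 * (avg_sentence_length + complex_fraction),
    }


sample = "The cat sat. Analysis requires information."
metrics = readability(sample)
print(metrics)
```

The sample has 6 words in 2 sentences, and 3 of the words have more than two vowel groups, so the average sentence length is 3.0, the complex fraction is 0.5, and the fog index is about 1.4. The standard Gunning Fog formula expresses the complex-word share as a 0-100 percentage, which would give a different value.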

### Running the Notebook to Generate Output

To run the notebook (analysis.ipynb) and generate the output (Output Data Structure.xlsx), follow these steps:

1. **Setup Dependencies:**
   - Ensure you have installed the required libraries. From a notebook cell, run:

         !pip install pandas requests beautifulsoup4 nltk textstat textblob

2. **Prepare Input Data:**
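   - Place your Input.xlsx file in the same directory as the notebook (analysis.ipynb). Ensure it contains the URL_ID and URL columns. The code also reads positive-words.txt and negative-words.txt from the working directory, so place them there as well.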

3. **Run the Script:**
   - Open Jupyter Notebook.
   - Navigate to the directory containing analysis.ipynb and Input.xlsx, and open analysis.ipynb.
   - Run all cells in the notebook.
   - This will start the script, which will fetch articles from the URLs, perform textual analysis, and save the results to 'Output Data Structure.xlsx'.
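
Because each article is saved to articles/ under its URL_ID, a rerun can reuse files that are already on disk instead of downloading them again. A minimal sketch of that fetch-or-reuse pattern follows. The names load_or_fetch, fetch, and fake_fetch are hypothetical, and the fake fetcher stands in for the requests and BeautifulSoup step so the example runs offline.

```python
import tempfile
from pathlib import Path


def load_or_fetch(url_id, fetch, out_dir="articles"):
    """Return the article text for url_id, calling fetch() only when no saved file exists.

    `fetch` is a hypothetical zero-argument callable that returns the article text.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{url_id}.txt"
    if path.exists():
        # Reuse the text saved by an earlier run.
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text


# Demonstration with a fake fetcher and a temporary directory.
calls = []


def fake_fetch():
    calls.append(1)  # record each simulated download
    return "Title\nBody text"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("demo_id", fake_fetch, out_dir=tmp)
    second = load_or_fetch("demo_id", fake_fetch, out_dir=tmp)

print(len(calls))
```

In this demonstration the second call reads the saved file, so the fake fetcher runs only once. One consequence of this design is that a stale or truncated file in articles/ is reused as-is; deleting the file forces a fresh download.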

### Dependencies Required

Ensure that you have the following dependencies installed:

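- pandas: For data manipulation and Excel file handling. pandas relies on the openpyxl package to read and write .xlsx files, so install openpyxl as well if it is missing.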
- requests: For making HTTP requests to fetch web pages.
- beautifulsoup4: For parsing HTML and extracting content from web pages.
- nltk: For natural language processing tasks such as tokenization and stopword removal.
- textstat: For computing readability metrics like syllable count and Gunning Fog index.
- textblob: For sentiment analysis.

You can install these dependencies using pip:

    pip install pandas requests beautifulsoup4 nltk textstat textblob
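
Before running the notebook, it can help to confirm that the listed packages are importable. The sketch below is one way to do that with the standard library; REQUIRED and missing_packages are hypothetical names, and the mapping reflects that beautifulsoup4 is imported as bs4.

```python
import importlib.util

# pip package name -> import name (they differ for beautifulsoup4).
REQUIRED = {
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "nltk": "nltk",
    "textstat": "textstat",
    "textblob": "textblob",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All listed dependencies are importable.")
```

This only checks that each module can be located; it does not verify versions or the NLTK data downloads.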


By following these steps and ensuring the dependencies are set up correctly, you should be able to run the notebook and generate the required textual analysis output in Excel format.


In [24]:
#!pip install pandas requests beautifulsoup4 nltk textstat textblob


In [25]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import re
from textblob import TextBlob
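import nltk
from nltk.corpus import cmudict, stopwords

# Download the NLTK resources used below (already-present packages are skipped)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('cmudict')  # pronouncing dictionary required by syllable_count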

# Load positive and negative word lists (whitespace-separated words, read from the working directory)
with open('positive-words.txt') as f:
    positive_words = set(f.read().split())
with open('negative-words.txt') as f:
    negative_words = set(f.read().split())
stop_words = set(stopwords.words('english'))

# Load the input Excel file (expected in the same directory as the notebook)
input_df = pd.read_excel('Input.xlsx')

# Create a directory to save the extracted articles
os.makedirs('articles', exist_ok=True)

# Log file to keep track of errors
error_log = open('error_log.txt', 'w', encoding='utf-8')

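# Function to count syllables using the CMU Pronouncing Dictionary
d = cmudict.dict()
def syllable_count(word):
    pronunciations = d.get(word.lower())
    if not pronunciations:
        return 0  # words missing from cmudict are counted as zero syllables
    # Vowel phonemes end in a stress digit; use the first listed pronunciation
    return sum(1 for phoneme in pronunciations[0] if phoneme[-1].isdigit())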

# Function to analyze text
def analyze_text(text):
    # Tokenize the text into sentences and words
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    
    # Clean words by removing stop words and punctuation
    cleaned_words = [word for word in words if word.isalnum() and word.lower() not in stop_words]
    
    # Calculate word counts and syllables
    word_count = len(cleaned_words)
    syllable_per_word = sum(syllable_count(word) for word in cleaned_words) / word_count if word_count else 0
    avg_word_length = sum(len(word) for word in cleaned_words) / word_count if word_count else 0
    
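    # Calculate positive, negative scores
    # (compare in lowercase so capitalized words still match; assumes the word lists are lowercase)
    positive_score = sum(1 for word in cleaned_words if word.lower() in positive_words)
    negative_score = sum(1 for word in cleaned_words if word.lower() in negative_words)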
    
    # Calculate polarity and subjectivity
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (word_count + 0.000001)
    
    # Calculate sentence length and complex words
    avg_sentence_length = word_count / len(sentences) if sentences else 0
    complex_words = [word for word in cleaned_words if syllable_count(word) > 2]
    complex_word_count = len(complex_words)
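    # Note: despite the name, this is a fraction in [0, 1]; the standard Gunning Fog
    # formula uses the complex-word share as a 0-100 percentage
    percentage_of_complex_words = complex_word_count / word_count if word_count else 0
    fog_index = 0.4 * (avg_sentence_length + percentage_of_complex_words)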
    
    # Calculate average number of words per sentence
    avg_words_per_sentence = word_count / len(sentences) if sentences else 0
    
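    # Count personal pronouns on the raw tokens: several of the listed pronouns
    # (for example 'i', 'we', 'my') are NLTK stop words and never appear in cleaned_words
    pronouns = {'i', 'we', 'my', 'ours', 'us'}
    personal_pronouns = sum(1 for word in words if word.lower() in pronouns)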
    
    return [
        positive_score,
        negative_score,
        polarity_score,
        subjectivity_score,
        avg_sentence_length,
        percentage_of_complex_words,
        fog_index,
        avg_words_per_sentence,
        complex_word_count,
        word_count,
        syllable_per_word,
        personal_pronouns,
        avg_word_length
    ]

# Analyze each article and compile the results
results = []
for index, row in input_df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    file_path = f'articles/{url_id}.txt'
    
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
            analysis_results = analyze_text(text)
            results.append([url_id] + analysis_results)
    else:
        try:
            # Set a timeout so one unresponsive server cannot stall the whole run
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # This will raise an HTTPError for bad responses
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Extract the article title and text (adjust the selector based on the website structure)
            title = soup.find('h1').get_text(strip=True)
            article_body = soup.find('div', class_='td-post-content')
            paragraphs = article_body.find_all('p')
            article_text = ' '.join([para.get_text(strip=True) for para in paragraphs])
            
            # Save the extracted text to a file named after the URL_ID
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(title + '\n' + article_text)
            
            # Perform analysis on the extracted text
            analysis_results = analyze_text(title + ' ' + article_text)
            results.append([url_id] + analysis_results)
        
        except requests.RequestException as e:
            error_log.write(f"Failed to fetch article for URL_ID {url_id}: {e}\n")
        except Exception as e:
            error_log.write(f"Failed to extract article for URL_ID {url_id}: {e}\n")

error_log.close()

# Convert results to DataFrame
columns = ['URL_ID', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE', 'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH', 
           'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX', 'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 
           'WORD COUNT', 'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH']
results_df = pd.DataFrame(results, columns=columns)

# Merge with input data
final_df = pd.merge(input_df, results_df, on='URL_ID')

# Save the final results to Excel
final_df.to_excel('Output Data Structure.xlsx', index=False)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SIDDHARTH\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SIDDHARTH\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SIDDHARTH\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
