<a href="https://colab.research.google.com/github/ritwiks9635/Natural_Language_Processing_Model/blob/main/Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Approach
1. Loaded input URLs from the provided "Input.xlsx".
2. Extracted article text from each URL using BeautifulSoup and saved it in text files.
3. Performed textual analysis on the extracted text:
    - Calculated positive, negative, polarity, and subjectivity scores using TextBlob.
    - Calculated average sentence length, average word length, word count, and complex word count.
    - Calculated FOG index and percentage of complex words.
    - Counted personal pronouns and calculated syllables per word.
4. Stored the results in the specified output structure.

# How to Run the Script
1. Ensure all dependencies are installed:
    ```
    pip install pandas nltk textblob openpyxl
    ```
2. Place the script `text_analysis.py` in the same directory as the following files:
    - Input.xlsx
    - Output Data Structure.xlsx
    - Extracted articles in a folder named `extracted_articles` (each article saved as a text file with its URL_ID as the filename).
3. Run the script:
    ```
    python text_analysis.py
    ```
4. The output will be saved as `Output_Analysis.xlsx` in the same directory.

# Dependencies
- pandas
- nltk
- textblob
- openpyxl

In [1]:
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

In [2]:
# Load the Excel file
file_path = '/content/Input.xlsx'
df = pd.read_excel(file_path)

# Load the structure of the output Excel file
output_structure_file_path = '/content/Output Data Structure.xlsx'
output_structure_df = pd.read_excel(output_structure_file_path)

In [3]:
df.head()

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...


In [4]:
df.shape

(100, 2)

In [5]:
output_structure_df.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,,,,,,,,,,,,,
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,,,,,,,,,,,,,
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,,,,,,,,,,,,,
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,,,,,,,,,,,,,
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,,,,,,,,,,,,,


In [6]:
# Create a directory to save the extracted articles
output_dir = 'extracted_articles'
os.makedirs(output_dir, exist_ok=True)

# Function to extract article title and text
def extract_article(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the title
    title = soup.find('h1').get_text()

    # Extract the article text
    article_body = soup.find('div', class_='td-post-content')
    paragraphs = article_body.find_all('p')
    article_text = '\n'.join([para.get_text() for para in paragraphs])

    return title, article_text

# Iterate over the rows in the dataframe
for index, row in df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']

    try:
        title, article_text = extract_article(url)
        content = f"Title: {title}\n\n{article_text}"

        # Save the content to a text file
        with open(os.path.join(output_dir, f"{url_id}.txt"), 'w', encoding='utf-8') as file:
            file.write(content)

        print(f"Successfully extracted {url_id}")

    except Exception as e:
        print(f"Failed to extract {url_id}: {e}")

Successfully extracted blackassign0001
Successfully extracted blackassign0002
Successfully extracted blackassign0003
Successfully extracted blackassign0004
Successfully extracted blackassign0005
Successfully extracted blackassign0006
Successfully extracted blackassign0007
Successfully extracted blackassign0008
Successfully extracted blackassign0009
Successfully extracted blackassign0010
Successfully extracted blackassign0011
Successfully extracted blackassign0012
Successfully extracted blackassign0013
Successfully extracted blackassign0014
Successfully extracted blackassign0015
Successfully extracted blackassign0016
Successfully extracted blackassign0017
Successfully extracted blackassign0018
Successfully extracted blackassign0019
Successfully extracted blackassign0020
Successfully extracted blackassign0021
Successfully extracted blackassign0022
Successfully extracted blackassign0023
Successfully extracted blackassign0024
Successfully extracted blackassign0025
Successfully extracted bl

In [7]:
with open("/content/extracted_articles/blackassign0100.txt", "r") as f:
    files = f.read()
print(files)

Title: How will COVID-19 affect the world of work?

As business close to help prevent transmission of COVID-19, financial concerns and job losses are one of the first human impacts of the virus;
COVID-19 is in decline in China. There are now more new cases every day in Europe than there were in China at the epidemic’s peak and Italy has surpassed it as the country with the most deaths from the virus It took 67 days to reach the first 100,000 confirmed cases worldwide, 11 days for this to increase to 200,000and just four to reach 300,000 confirmed cases – a figure now exceeded.
In recent weeks, we have seen the significant economic impact of the coronavirus on financial markets and vulnerable industries such as manufacturing, tourism, hospitality and travel. Travel and tourism account for 10% of the global GDP and 50 million jobs are at risk worldwide.
Global tourism, travel and hospitality companies closing down affects SMEs globally. This, in turn, affects many people, typically the l

In [8]:
import nltk
from nltk.corpus import cmudict
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
import pandas as pd
import os
import re

In [11]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('cmudict')

# Load the CMU Pronouncing Dictionary for syllable counting
d = cmudict.dict()

# Function to count syllables in a word
def syllable_count(word):
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]][0] if word.lower() in d else 0

# Function to count complex words (words with more than two syllables)
def complex_word_count(text):
    words = word_tokenize(text)
    return sum(1 for word in words if syllable_count(word) > 2)

# Function to calculate the FOG index
def fog_index(text):
    total_words = len(word_tokenize(text))
    total_sentences = len(sent_tokenize(text))
    complex_words = complex_word_count(text)
    if total_sentences == 0:
        return 0
    return 0.4 * ((total_words / total_sentences) + 100 * (complex_words / total_words))

# Function to count personal pronouns
def personal_pronouns_count(text):
    pronouns = ['I', 'we', 'my', 'ours', 'us']
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    return sum(1 for word, pos in tagged_words if word in pronouns and pos == 'PRP')

# Function to calculate positive, negative, polarity, and subjectivity scores
def sentiment_scores(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    words = blob.words
    positive_score = sum(1 for word in words if TextBlob(word).sentiment.polarity > 0)
    negative_score = sum(1 for word in words if TextBlob(word).sentiment.polarity < 0)
    return positive_score, negative_score, polarity, subjectivity

# Function to calculate average sentence length
def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    total_words = len(word_tokenize(text))
    if len(sentences) == 0:
        return 0
    return total_words / len(sentences)

# Function to calculate average word length
def avg_word_length(text):
    words = word_tokenize(text)
    total_length = sum(len(word) for word in words)
    if len(words) == 0:
        return 0
    return total_length / len(words)

# Function to calculate word count
def word_count(text):
    return len(word_tokenize(text))

# Function to calculate percentage of complex words
def percentage_complex_words(text):
    total_words = len(word_tokenize(text))
    if total_words == 0:
        return 0
    return complex_word_count(text) / total_words

# Load extracted articles
extracted_articles_dir = '/content/extracted_articles'
articles = {}
for filename in os.listdir(extracted_articles_dir):
    if filename.endswith('.txt'):
        with open(os.path.join(extracted_articles_dir, filename), 'r', encoding='utf-8') as file:
            articles[filename[:-4]] = file.read()

# Initialize the output dataframe
output_data = output_structure_df.copy()

# Perform analysis for each article
for url_id, text in articles.items():
    positive_score, negative_score, polarity_score, subjectivity_score = sentiment_scores(text)
    avg_sent_length = avg_sentence_length(text)
    perc_complex_words = percentage_complex_words(text)
    fog_idx = fog_index(text)
    avg_words_per_sent = avg_sentence_length(text)
    complex_words = complex_word_count(text)
    total_word_count = word_count(text)
    syllables_per_word = syllable_count(text) / total_word_count if total_word_count else 0
    personal_pronouns = personal_pronouns_count(text)
    avg_word_len = avg_word_length(text)

    # Save the computed values in the output dataframe
    output_data.loc[output_data['URL_ID'] == url_id, 'POSITIVE SCORE'] = positive_score
    output_data.loc[output_data['URL_ID'] == url_id, 'NEGATIVE SCORE'] = negative_score
    output_data.loc[output_data['URL_ID'] == url_id, 'POLARITY SCORE'] = polarity_score
    output_data.loc[output_data['URL_ID'] == url_id, 'SUBJECTIVITY SCORE'] = subjectivity_score
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG SENTENCE LENGTH'] = avg_sent_length
    output_data.loc[output_data['URL_ID'] == url_id, 'PERCENTAGE OF COMPLEX WORDS'] = perc_complex_words
    output_data.loc[output_data['URL_ID'] == url_id, 'FOG INDEX'] = fog_idx
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG NUMBER OF WORDS PER SENTENCE'] = avg_words_per_sent
    output_data.loc[output_data['URL_ID'] == url_id, 'COMPLEX WORD COUNT'] = complex_words
    output_data.loc[output_data['URL_ID'] == url_id, 'WORD COUNT'] = total_word_count
    output_data.loc[output_data['URL_ID'] == url_id, 'SYLLABLE PER WORD'] = syllables_per_word
    output_data.loc[output_data['URL_ID'] == url_id, 'PERSONAL PRONOUNS'] = personal_pronouns
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG WORD LENGTH'] = avg_word_len

# Save the results to a new Excel file
output_file_path = '/content/Output_Analysis.xlsx'
output_data.to_excel(output_file_path, index=False)

print("Textual analysis completed and results saved.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


Textual analysis completed and results saved.


In [12]:
output_data.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,23.0,3,0.243518,0.583195,15.88,0.108312,10.684494,15.88,43.0,397.0,0.0,1.0,4.183879
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,74.0,23,0.121024,0.429174,21.181818,0.20049,16.492347,21.181818,327.0,1631.0,0.0,3.0,4.876763
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,39.0,14,0.100987,0.431038,21.839286,0.270646,19.561552,21.839286,331.0,1223.0,0.0,13.0,5.474244
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,41.0,22,0.041421,0.406442,23.784314,0.225062,18.516199,23.784314,273.0,1213.0,0.0,4.0,5.305853
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,19.0,3,0.11389,0.520523,19.641026,0.181462,15.114896,19.641026,139.0,766.0,0.0,6.0,5.061358


In [14]:
%%writefile text_analysis.py

import nltk
from nltk.corpus import cmudict
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
import pandas as pd
import os
import re

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('cmudict')

# Load the CMU Pronouncing Dictionary for syllable counting
d = cmudict.dict()

# Function to count syllables in a word
def syllable_count(word):
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]][0] if word.lower() in d else 0

# Function to count complex words (words with more than two syllables)
def complex_word_count(text):
    words = word_tokenize(text)
    return sum(1 for word in words if syllable_count(word) > 2)

# Function to calculate the FOG index
def fog_index(text):
    total_words = len(word_tokenize(text))
    total_sentences = len(sent_tokenize(text))
    complex_words = complex_word_count(text)
    if total_sentences == 0:
        return 0
    return 0.4 * ((total_words / total_sentences) + 100 * (complex_words / total_words))

# Function to count personal pronouns
def personal_pronouns_count(text):
    pronouns = ['I', 'we', 'my', 'ours', 'us']
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    return sum(1 for word, pos in tagged_words if word in pronouns and pos == 'PRP')

# Function to calculate positive, negative, polarity, and subjectivity scores
def sentiment_scores(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    words = blob.words
    positive_score = sum(1 for word in words if TextBlob(word).sentiment.polarity > 0)
    negative_score = sum(1 for word in words if TextBlob(word).sentiment.polarity < 0)
    return positive_score, negative_score, polarity, subjectivity

# Function to calculate average sentence length
def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    total_words = len(word_tokenize(text))
    if len(sentences) == 0:
        return 0
    return total_words / len(sentences)

# Function to calculate average word length
def avg_word_length(text):
    words = word_tokenize(text)
    total_length = sum(len(word) for word in words)
    if len(words) == 0:
        return 0
    return total_length / len(words)

# Function to calculate word count
def word_count(text):
    return len(word_tokenize(text))

# Function to calculate percentage of complex words
def percentage_complex_words(text):
    total_words = len(word_tokenize(text))
    if total_words == 0:
        return 0
    return complex_word_count(text) / total_words

# Load the structure of the output Excel file
output_structure_file_path = 'Output Data Structure.xlsx'
output_structure_df = pd.read_excel(output_structure_file_path)

# Load input Excel file
input_file_path = 'Input.xlsx'
input_df = pd.read_excel(input_file_path)

# Load extracted articles
extracted_articles_dir = 'extracted_articles'
articles = {}
for filename in os.listdir(extracted_articles_dir):
    if filename.endswith('.txt'):
        with open(os.path.join(extracted_articles_dir, filename), 'r', encoding='utf-8') as file:
            articles[filename[:-4]] = file.read()

# Initialize the output dataframe
output_data = output_structure_df.copy()

# Perform analysis for each article
for url_id, text in articles.items():
    positive_score, negative_score, polarity_score, subjectivity_score = sentiment_scores(text)
    avg_sent_length = avg_sentence_length(text)
    perc_complex_words = percentage_complex_words(text)
    fog_idx = fog_index(text)
    avg_words_per_sent = avg_sentence_length(text)
    complex_words = complex_word_count(text)
    total_word_count = word_count(text)
    syllables_per_word = syllable_count(text) / total_word_count if total_word_count else 0
    personal_pronouns = personal_pronouns_count(text)
    avg_word_len = avg_word_length(text)

    # Save the computed values in the output dataframe
    output_data.loc[output_data['URL_ID'] == url_id, 'POSITIVE SCORE'] = positive_score
    output_data.loc[output_data['URL_ID'] == url_id, 'NEGATIVE SCORE'] = negative_score
    output_data.loc[output_data['URL_ID'] == url_id, 'POLARITY SCORE'] = polarity_score
    output_data.loc[output_data['URL_ID'] == url_id, 'SUBJECTIVITY SCORE'] = subjectivity_score
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG SENTENCE LENGTH'] = avg_sent_length
    output_data.loc[output_data['URL_ID'] == url_id, 'PERCENTAGE OF COMPLEX WORDS'] = perc_complex_words
    output_data.loc[output_data['URL_ID'] == url_id, 'FOG INDEX'] = fog_idx
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG NUMBER OF WORDS PER SENTENCE'] = avg_words_per_sent
    output_data.loc[output_data['URL_ID'] == url_id, 'COMPLEX WORD COUNT'] = complex_words
    output_data.loc[output_data['URL_ID'] == url_id, 'WORD COUNT'] = total_word_count
    output_data.loc[output_data['URL_ID'] == url_id, 'SYLLABLE PER WORD'] = syllables_per_word
    output_data.loc[output_data['URL_ID'] == url_id, 'PERSONAL PRONOUNS'] = personal_pronouns
    output_data.loc[output_data['URL_ID'] == url_id, 'AVG WORD LENGTH'] = avg_word_len

# Save the results to a new Excel file
output_file_path = 'Output_Analysis.xlsx'
output_data.to_excel(output_file_path, index=False)

print("Textual analysis completed and results saved.")

Writing text_analysis.py


In [13]:
%%writefile README.txt

# Approach
1. Loaded input URLs from the provided "Input.xlsx".
2. Extracted article text from each URL using BeautifulSoup and saved it in text files.
3. Performed textual analysis on the extracted text:
    - Calculated positive, negative, polarity, and subjectivity scores using TextBlob.
    - Calculated average sentence length, average word length, word count, and complex word count.
    - Calculated FOG index and percentage of complex words.
    - Counted personal pronouns and calculated syllables per word.
4. Stored the results in the specified output structure.

# How to Run the Script
1. Ensure all dependencies are installed:
    ```
    pip install pandas nltk textblob openpyxl
    ```
2. Place the script `text_analysis.py` in the same directory as the following files:
    - Input.xlsx
    - Output Data Structure.xlsx
    - Extracted articles in a folder named `extracted_articles` (each article saved as a text file with its URL_ID as the filename).
3. Run the script:
    ```
    python text_analysis.py
    ```
4. The output

Writing README.txt


In [18]:

import zipfile
import os

# Define the list of files to be included in the zip file
files_to_zip = ['/content/text_analysis.py', '/content/Output_Analysis.xlsx', '/content/README.txt']

# Create a zip file
with zipfile.ZipFile('text_analysis_project.zip', 'w') as zipf:
    for file in files_to_zip:
        zipf.write(file)

print("Zip file created successfully.")

Zip file created successfully.


In [21]:

from google.colab import drive
from google.colab import files

# Mount Google Drive
drive.mount('/content/drive')

# Upload the file
uploaded = files.upload()

# Move the file to Google Drive
for filename in uploaded.keys():
    !cp {"/content/drive/MyDrive/Colab Notebooks"} /content/drive/MyDrive/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Saving text_analysis_project.zip to text_analysis_project (1).zip
cp: cannot stat '{/content/text_analysis_project.zip}': No such file or directory


In [22]:

!cp /content/text_analysis_project.zip /content/drive/MyDrive/