# Project Task Structure Explanation

## 🗂 Directories and Files

### 📝 Cleaned Articles
- **`cleaned_articles`**: Contains the articles that have been cleaned and are ready for analysis. This directory includes text files with preprocessed content.

### 📂 Extracted Articles
- **`extracted_articles`**: Holds the raw articles that were initially extracted for the project. This directory is used as the source for further processing and analysis.

### 📚 Master Dictionary
- **`master dictionary`**: A collection of files used for sentiment analysis and text processing.
  - **`cleaned negative words.txt`**: Contains a list of negative words after cleaning.
  - **`cleaned positive words.txt`**: Contains a list of positive words after cleaning.
  - **`negative-words.txt`**: Raw file with negative words for sentiment analysis.
  - **`positive-words.txt`**: Raw file with positive words for sentiment analysis.

### 📑 Project Introduction
- **`project Introduction`**: Provides an overview and objectives of the project. Includes introductory material and project goals.

### 🧪 Test Assessment
- **`test assessment`**: Contains test assignments and notebooks related to the assessment.
  - **`dataextraction.ipynb`**: Jupyter Notebook for data extraction tasks.
  - **`testassessment.ipynb`**: Jupyter Notebook for additional test assessments.

### 💻 Code and Markdown
- **`testassignment`**: Includes code and markdown files related to test assignments.

### 🚫 Stop Words
- **`Stop Words`**: Directory containing various stop words files used to clean and preprocess text.
  - **`StopWords Auditor.txt`**: Stop words used for auditing and review.
  - **`StopWords Currencies.txt`**: Stop words related to currency names and symbols.
  - **`StopWords Datasand Numbers.txt`**: Stop words related to data and numbers.
  - **`StopWords Generic.txt`**: General stop words list.
  - **`StopWords GenericLong.txt`**: Extended list of generic stop words.
  - **`StopWords Geographic.txt`**: Stop words related to geographic locations.
  - **`StopWords Names.txt`**: Stop words related to personal names.

### 📊 Text Analysis
- **`Text Analysis`**: Contains various files and notebooks for performing text analysis.
  - **`textanalysis.ipynb`**: Jupyter Notebook for conducting text analysis.
  - **`textanalysisoverflow.ipynb`**: Additional notebook for overflow analysis tasks.
  - **`sentiment analysis.log`**: Log file recording the results of sentiment analysis.
  - **`textblob sentiment result.csv`**: CSV file with sentiment analysis results using TextBlob.
  - **`textual analysis metrics.xlsx`**: Excel file with metrics from textual analysis.
  - **`average word length.csv`**: CSV file containing results for average word length analysis.
  - **`average word_per_sentence_results.csv`**: CSV file with results for average words per sentence.
  - **`personal pronouns count.csv`**: CSV file with counts of personal pronouns found in the text.
  - **`readability analysis results.csv`**: CSV file with readability analysis results.
  - **`syllable count result.csv`**: CSV file with results for syllable counting in words.
  - **`word count result.csv`**: CSV file with word count results.
  - **`word frequency visualization.png`**: PNG file with a visualization of word frequency.
  - **`Text Analysis.docx`**: Word document summarizing text analysis results.
  - **`textual analysis results.docx`**: Word document with detailed results from textual analysis.
  - **`sentiment analysis results.csv`**: CSV file with comprehensive sentiment analysis results.

### 📈 Additional Files
- **`analysis results.csv`**: CSV file with various analysis results.
- **`analysis results.xlsx`**: Excel file with summarized analysis results.
- **`Objective.docx`**: Document outlining the objectives and goals of the project.
- **`Output Data Structure.xlsx`**: Excel file describing the structure of the output data.
- **`final text analysis results.xlsx`**: Final Excel file with all text analysis results compiled.

## 📌 Notes
- Ensure all files are properly saved and backed up to avoid data loss.
- Maintain consistency in file naming and formats to facilitate easier navigation and data management.
- Verify that all analysis results are correctly generated and logged for accurate reporting and future reference.

# Blackcoffer Test Assignment

## 🌐 Company Information
- **Consulting Website**: [Blackcoffer](https://blackcoffer.com) | [LSA Lead](https://lsalead.com/)
- **Web App Products**: [Netclan](https://netclan.com/) | [Insights](https://insights.blackcoffer.com/) | [Hire Kingdom](https://hirekingdom.com/) | [Workcroft](https://workcroft.com/)
- **Mobile App Products**: Netclan | Bwstory

## 📋 Assignment Overview

### 1. Objective
Extract textual data from the provided URLs and perform text analysis to compute specified variables.

### 2. Data Extraction
- **File**: `Input.xlsx`
- **Task**: Extract the article title and text from each URL in `Input.xlsx`. Save the extracted content as text files named by `URL_ID`.
- **Tools**: Use Python with libraries such as BeautifulSoup, Selenium, or Scrapy.
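
The parsing and saving steps could be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard-library `html.parser` as a dependency-free stand-in for BeautifulSoup, it assumes the title lives in `<title>` and the body in `<p>` elements (real pages usually need site-specific selectors), and the names `ArticleParser`, `extract_title_and_text`, and `save_article` are hypothetical. Downloading the page is left out.

```python
from html.parser import HTMLParser
from pathlib import Path


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_paragraph:
            self.paragraphs[-1] += data


def extract_title_and_text(html):
    """Return (title, body text) for one HTML document."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    text = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return title, text


def save_article(url_id, title, text, out_dir="extracted_articles"):
    """Write the article to <out_dir>/<url_id>.txt (assumed layout)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    file_path = out_path / f"{url_id}.txt"
    file_path.write_text(f"{title}\n\n{text}", encoding="utf-8")
    return file_path


sample_html = (
    "<html><head><title> Demo </title></head>"
    "<body><p>First <b>bold</b> para.</p><p>Second.</p></body></html>"
)
sample_title, sample_text = extract_title_and_text(sample_html)
```

Calling `save_article(url_id, sample_title, sample_text)` for each row of `Input.xlsx` would then produce one text file per `URL_ID`.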

### 3. Data Analysis
- **File**: `Output Data Structure.xlsx`
- **Task**: Perform textual analysis on the extracted articles and compute the required variables.
- **Tools**: Python programming for data analysis.

### 4. Variables
Refer to `Text Analysis.docx` for definitions. Compute the following:
1. Positive Score
2. Negative Score
3. Polarity Score
4. Subjectivity Score
5. Avg Sentence Length
6. Percentage of Complex Words
7. Fog Index
8. Avg Number of Words Per Sentence
9. Complex Word Count
10. Word Count
11. Syllable Per Word
12. Personal Pronouns
13. Avg Word Length

### 5. Output Data Structure
- **Include**:
  - All input variables from `Input.xlsx`
  - Computed variables: Positive Score, Negative Score, Polarity Score, Subjectivity Score, Avg Sentence Length, Percentage of Complex Words, Fog Index, Avg Number of Words Per Sentence, Complex Word Count, Word Count, Syllable Per Word, Personal Pronouns, Avg Word Length
- **File Format**: CSV or Excel, following the format in `Output Data Structure.xlsx`.

### 6. Timeline
- **Duration**: 6 days (sooner is better).

### 7. Submission
- **Submit via**: [Google Form](https://forms.gle/nvWAgrCBdq1JkKou8)
- **Requirements**:
  - `.py` file
  - Output in CSV or Excel
  - Instructions documentation:
    1. Approach explanation
    2. How to run the `.py` file
    3. List of dependencies required

**Note**: Do not include any other files.

---

Good luck with your assignment!


# Text Analysis

## Objective
This document explains the methodology used to perform text analysis and derive sentiment opinions and scores, readability, passive words, personal pronouns, and other metrics.

## Table of Contents
1. [Sentimental Analysis](#sentimental-analysis)
   - 1.1. [Cleaning using Stop Words Lists](#cleaning-using-stop-words-lists)
   - 1.2. [Creating Dictionary of Positive and Negative Words](#creating-dictionary-of-positive-and-negative-words)
   - 1.3. [Extracting Derived Variables](#extracting-derived-variables)
2. [Analysis of Readability](#analysis-of-readability)
3. [Average Number of Words Per Sentence](#average-number-of-words-per-sentence)
4. [Complex Word Count](#complex-word-count)
5. [Word Count](#word-count)
6. [Syllable Count Per Word](#syllable-count-per-word)
7. [Personal Pronouns](#personal-pronouns)
8. [Average Word Length](#average-word-length)

## 1. Sentimental Analysis
Sentiment analysis determines whether a piece of writing is positive, negative, or neutral. The algorithm below is designed for financial texts:

### 1.1. Cleaning using Stop Words Lists
- **Purpose**: To clean the text so that Sentiment Analysis can be performed by excluding the words found in the Stop Words List.
- **Source**: Stop Words Lists (located in the `StopWords` folder).
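
A minimal sketch of this cleaning step is shown below. It assumes each stop-word file holds one entry per line, that anything after a `|` on a line is an annotation to discard, and that the files are readable as `latin-1`; all three are assumptions, as are the helper names. A simple regex tokenizer is used here to keep the example self-contained.

```python
import re
from pathlib import Path


def load_word_list(lines):
    """Build a lower-cased word set from an iterable of text lines."""
    words = set()
    for line in lines:
        # Assumption: text after "|" is an annotation, not part of the word.
        word = line.split("|")[0].strip().lower()
        if word:
            words.add(word)
    return words


def load_stop_words(folder):
    """Merge every .txt file in `folder` into one stop-word set."""
    stop_words = set()
    for path in Path(folder).glob("*.txt"):
        # Assumption: latin-1 decodes every file without errors.
        stop_words |= load_word_list(path.read_text(encoding="latin-1").splitlines())
    return stop_words


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


demo_stop_words = load_word_list(["THE", "IS", "SMITH | Surname", ""])
demo_tokens = tokenize("The outlook is strong, Smith said.")
cleaned_tokens = remove_stop_words(demo_tokens, demo_stop_words)
# cleaned_tokens == ['outlook', 'strong', 'said']
```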

### 1.2. Creating Dictionary of Positive and Negative Words
- **Purpose**: To create a dictionary of Positive and Negative words.
- **Source**: Master Dictionary (located in the `MasterDictionary` folder).
- **Process**: Add words to the dictionary only if they are not found in the Stop Words Lists.
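
One way to express this filtering, assuming the positive, negative, and stop-word lists have already been read into Python iterables of strings (the function name and the dictionary keys are illustrative, not taken from the project):

```python
def build_sentiment_dictionary(positive_words, negative_words, stop_words):
    """Return positive/negative word sets with stop words excluded."""

    def normalise(words):
        # Lower-case and trim so that comparisons are case-insensitive.
        return {w.strip().lower() for w in words if w.strip()}

    stop = normalise(stop_words)
    return {
        "positive": normalise(positive_words) - stop,
        "negative": normalise(negative_words) - stop,
    }


demo_dictionary = build_sentiment_dictionary(
    positive_words=["good", "Great", "us"],
    negative_words=["bad", "loss", ""],
    stop_words=["US", "the"],
)
# "us" is dropped from the positive set because it is also a stop word.
```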

### 1.3. Extracting Derived Variables
Convert the text into a list of tokens using the NLTK tokenize module and calculate the following variables:
- **Positive Score**: Assign a value of +1 for each word found in the Positive Dictionary and sum all values.
- **Negative Score**: Assign a value of -1 for each word found in the Negative Dictionary, then sum all values and multiply by -1 to get a positive number.
- **Polarity Score**: Determines if the text is positive or negative:
  \[
  \text{Polarity Score} = \frac{\text{Positive Score} - \text{Negative Score}}{\text{Positive Score} + \text{Negative Score} + 0.000001}
  \]
  Range: -1 to +1
- **Subjectivity Score**: Determines if the text is objective or subjective:
  \[
  \text{Subjectivity Score} = \frac{\text{Positive Score} + \text{Negative Score}}{\text{Total Words after cleaning} + 0.000001}
  \]
  Range: 0 to +1
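
The four variables above can be computed along these lines. The sketch assumes `tokens` is the already cleaned token list and that the two dictionaries are plain Python sets; tokenization itself is omitted, and the function name is hypothetical.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Compute the four derived sentiment variables for one text."""
    # +1 for every token found in the positive dictionary.
    positive_score = sum(1 for t in tokens if t in positive_words)
    # -1 for every token found in the negative dictionary, then flip the sign.
    negative_score = -1 * sum(-1 for t in tokens if t in negative_words)

    polarity = (positive_score - negative_score) / (
        positive_score + negative_score + 0.000001
    )
    subjectivity = (positive_score + negative_score) / (len(tokens) + 0.000001)

    return {
        "positive_score": positive_score,
        "negative_score": negative_score,
        "polarity_score": polarity,
        "subjectivity_score": subjectivity,
    }


demo_scores = sentiment_scores(
    tokens=["profit", "growth", "loss", "report", "strong"],
    positive_words={"profit", "growth", "strong"},
    negative_words={"loss"},
)
# Positive Score 3, Negative Score 1, polarity close to 0.5,
# subjectivity close to 0.8.
```

The small constant in both denominators keeps the division defined when a text contains no dictionary words or no tokens at all.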

## 2. Analysis of Readability
Readability is calculated using the Gunning Fog Index formula:
- **Average Sentence Length**: Number of words / Number of sentences
- **Percentage of Complex Words**: Number of complex words / Number of words
- **Fog Index**:
  \[
  \text{Fog Index} = 0.4 \times (\text{Average Sentence Length} + \text{Percentage of Complex Words})
  \]
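
As a worked illustration, the three readability quantities can be computed from raw counts. The function name and the sample counts are made up for the example, and returning zeros for empty input is an assumption rather than something the assignment specifies.

```python
def readability_metrics(word_count, sentence_count, complex_word_count):
    """Return average sentence length, complex-word ratio, and Fog Index."""
    if word_count == 0 or sentence_count == 0:
        # Assumption: an empty text scores zero on every metric.
        return {"avg_sentence_length": 0.0, "pct_complex_words": 0.0, "fog_index": 0.0}

    avg_sentence_length = word_count / sentence_count
    pct_complex_words = complex_word_count / word_count
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex_words,
        "fog_index": fog_index,
    }


# 120 words in 8 sentences, 18 of them complex:
# average sentence length 15.0, complex-word ratio 0.15, Fog Index about 6.06.
demo_readability = readability_metrics(120, 8, 18)
```

The code follows the formula exactly as given here, where the complex-word term is a ratio between 0 and 1. The commonly published Gunning Fog formula multiplies that ratio by 100 before adding it, so values produced this way are not directly comparable with standard Fog scores.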

## 3. Average Number of Words Per Sentence
Calculated as:
\[
\text{Average Number of Words Per Sentence} = \frac{\text{Total Number of Words}}{\text{Total Number of Sentences}}
\]
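
A rough sketch of this calculation follows. It splits sentences on `.`, `!`, and `?` with a regex, which is an assumption made to keep the example dependency-free; abbreviations and decimal numbers will be split incorrectly, and a dedicated sentence tokenizer would be more robust.

```python
import re


def split_sentences(text):
    """Split on runs of sentence-ending punctuation and drop empty pieces."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def average_words_per_sentence(text):
    """Total number of words divided by total number of sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences)
    return total_words / len(sentences)


demo_text = "Revenue rose sharply. Costs fell! Is growth sustainable?"
demo_average = average_words_per_sentence(demo_text)
# 8 words across 3 sentences
```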

## 4. Complex Word Count
- **Definition**: The number of words in the text that contain more than two syllables.

## 5. Word Count
- **Process**:
  1. Remove stop words (using the `stopwords` corpus of the NLTK package).
  2. Remove punctuation marks such as `?`, `!`, `,`, and `.` before counting.
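
The two steps might look like this in practice. To stay runnable without downloading NLTK data, the example passes in a small hand-written stop-word set; in the real pipeline that set would come from `nltk.corpus.stopwords.words("english")`. Punctuation is stripped first so that a token such as `is,` still matches the stop word `is`, which is a design choice of this sketch.

```python
import string


def cleaned_word_count(text, stop_words):
    """Count the words left after removing punctuation and stop words."""
    # Delete every ASCII punctuation character from the text.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return len([t for t in tokens if t not in stop_words])


demo_stop_set = {"is", "the"}  # stand-in for the NLTK English stop words
demo_count = cleaned_word_count(
    "Is the market growing? Yes, the market is growing!", demo_stop_set
)
# Remaining words: market, growing, yes, market, growing
```

Note that deleting punctuation outright merges hyphenated words ("long-term" becomes "longterm") and contractions ("don't" becomes "dont"), which may or may not be the intended behaviour.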

## 6. Syllable Count Per Word
- **Method**: Count the syllables in each word by counting its vowels. Handle exceptions such as words ending in "es" or "ed", where the ending is not counted as a syllable.
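
A literal reading of this rule, together with the complex word count from section 4, could be sketched as below. The treatment of the "es"/"ed" exception (subtracting one syllable) and the decision not to count "y" as a vowel are assumptions of this sketch, and the function names are hypothetical.

```python
def count_syllables(word):
    """Approximate syllables as the number of vowels, minus one for es/ed endings."""
    word = word.lower()
    count = sum(1 for ch in word if ch in "aeiou")
    # Assumption: a trailing "es" or "ed" does not add a syllable.
    if word.endswith(("es", "ed")):
        count -= 1
    return max(count, 0)


def complex_word_count(words):
    """Count the words that have more than two syllables."""
    return sum(1 for w in words if count_syllables(w) > 2)


demo_words = ["analysis", "makes", "created", "profitability"]
demo_syllables = {w: count_syllables(w) for w in demo_words}
demo_complex = complex_word_count(demo_words)
# "analysis" and "profitability" exceed two syllables under this rule.
```

Because every vowel counts separately, words with adjacent vowels (for example "beautiful") are overestimated; counting groups of consecutive vowels instead is a common refinement.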

## 7. Personal Pronouns
- **Process**: Use a regex to count occurrences of the words "I", "we", "my", "ours", and "us". Make sure the country name "US" is not counted.
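
One possible regex for this is sketched below. It matches the five pronouns case-insensitively on word boundaries and then skips any match written exactly as upper-case `US`; treating only that exact spelling as the country name is an assumption, so text written entirely in capitals would lose genuine uses of "us".

```python
import re

# Word boundaries stop the pattern from matching inside longer words.
PRONOUN_PATTERN = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count I, we, my, ours, and us, ignoring the country name "US"."""
    count = 0
    for match in PRONOUN_PATTERN.finditer(text):
        if match.group(0) == "US":
            continue  # assumed to be the country, not the pronoun
        count += 1
    return count


demo_sentence = "We think the US market will reward us, and I expect my estimate to hold."
demo_pronouns = count_personal_pronouns(demo_sentence)
# Counted: We, us, I, my
```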

## 8. Average Word Length
- **Formula**:
  \[
  \text{Average Word Length} = \frac{\text{Sum of the Total Number of Characters in Each Word}}{\text{Total Number of Words}}
  \]
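
For completeness, the formula translates directly into a few lines of Python. The input is assumed to be a list of word tokens, and returning `0.0` for an empty list is an assumption.

```python
def average_word_length(words):
    """Sum of the characters in each word divided by the number of words."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


demo_length = average_word_length(["market", "growth", "is", "strong"])
# (6 + 6 + 2 + 6) / 4 = 5.0
```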