# Financial News Sentiment Analysis Project

## Overview
This project analyzes financial news data to uncover correlations between news sentiment and stock market movements for Nova Financial Solutions.

## Objectives
1. **Sentiment Analysis**: Analyze sentiment in financial news headlines using NLP techniques
2. **Correlation Analysis**: Study statistical relationships between news sentiment and stock price movements
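
The two objectives can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: the word lists, the `headline_sentiment` and `pearson` helpers, and the daily return values are all invented for illustration, and a real analysis would likely use an NLP library's sentiment model and actual price data.

```python
import math

# Toy lexicons invented for this illustration (an assumption, not the
# project's vocabulary); a real project would use a proper sentiment model.
POSITIVE = {"upgrade", "upgrades", "beats", "highs", "gains", "surges"}
NEGATIVE = {"downgrade", "downgrades", "misses", "lows", "falls", "plunges"}


def headline_sentiment(headline: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = [w.strip(".,:;!?'\"").lower() for w in headline.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily data: one headline and one made-up return per day.
headlines = [
    "Stocks That Hit 52-Week Highs On Friday",
    "Stocks That Hit 52-Week Lows On Thursday",
    "71 Biggest Movers From Friday",
]
daily_returns = [0.012, -0.020, 0.001]  # invented values, not real prices

scores = [headline_sentiment(h) for h in headlines]
print(scores)
print(pearson(scores, daily_returns))
```

With only three invented data points the resulting coefficient carries no statistical meaning; the sketch only shows how a per-headline score and a return series line up for a correlation test.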




In [1]:
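# Import necessary libraries and modules
import os
import sys
import warnings

# Add the parent of the current working directory (assumed to be the project
# root) to the import path so that the `scripts` package can be found
sys.path.insert(0, os.path.dirname(os.getcwd()))
from scripts.financial_analysis import FinancialAnalysis

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)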

## Data Loading and Initial Analysis
 
First, we'll load the financial news dataset and perform some initial exploratory analysis to understand the data structure and content.



In [2]:
# Initialize the FinancialAnalysis class
analysis = FinancialAnalysis()

# Load the dataset
analysis.load_data()

# Display the first few rows of the dataset
analysis.df.head()

| Unnamed: 0.1 | Unnamed: 0 | headline | url | publisher | date | stock |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A |
| 1 | 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A |
| 2 | 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A |
| 3 | 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A |
| 4 | 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A |


 #### Initial Data Exploration
 
 Let's explore the key characteristics of our dataset:
 
 1. **Dataset Size**: Examining record count and column structure
 2. **Data Quality**: Checking for missing values and duplicates
 3. **Statistical Overview**: Basic statistics for numerical fields
 4. **Categorical Analysis**: Distribution of text-based columns
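
The checks listed above can be sketched in plain Python. The internals of `explore_data()` are not shown in this notebook, so the helper names (`missing_counts`, `duplicate_count`, `top_values`) and the sample rows below are hypothetical. On a pandas DataFrame, roughly equivalent checks would be `df.isna().sum()`, `df.duplicated().sum()`, and `df[column].value_counts()`.

```python
from collections import Counter

# Hypothetical sample rows shaped like the dataset; values are illustrative.
rows = [
    {"Unnamed: 0": 0, "headline": "Stocks That Hit 52-Week Highs On Friday",
     "publisher": "Benzinga Insights", "stock": "A"},
    {"Unnamed: 0": 1, "headline": "71 Biggest Movers From Friday",
     "publisher": "Lisa Levin", "stock": "A"},
    {"Unnamed: 0": 2, "headline": "71 Biggest Movers From Friday",
     "publisher": "Lisa Levin", "stock": "A"},
    {"Unnamed: 0": 3, "headline": None,
     "publisher": "Vick Meyer", "stock": "A"},
]


def missing_counts(records):
    """Count None or empty-string values per column."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            if value is None or value == "":
                counts[column] += 1
    return dict(counts)


def duplicate_count(records, ignore=()):
    """Count rows that repeat an earlier row, ignoring the given columns."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple((c, record[c]) for c in sorted(record) if c not in ignore)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


def top_values(records, column, n=3):
    """Return the n most common non-missing values of a column."""
    counts = Counter(r[column] for r in records if r[column] is not None)
    return counts.most_common(n)


print("Number of records:", len(rows))
print("Missing values:", missing_counts(rows))
print("Duplicate rows:", duplicate_count(rows))
print("Duplicates (excluding Unnamed: 0):",
      duplicate_count(rows, ignore=("Unnamed: 0",)))
print("Top publishers:", top_values(rows, "publisher"))
```

Ignoring the running index column matters because it makes every row unique, which would hide rows that are otherwise identical.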


In [3]:
analysis.explore_data()


Dataset Overview:
------------------------------
Number of records: 1407328
Columns: Unnamed: 0, headline, url, publisher, date, stock

Missing Values:
------------------------------
Empty DataFrame
Columns: [Missing Count, Missing %]
Index: []

Duplicates:
------------------------------
Duplicate rows: 0 (0.0%)

Duplicates (excluding Unnamed: 0): 1 (0.0%)

Numerical Statistics:
------------------------------
         Unnamed: 0
count  1.407328e+06
mean   7.072454e+05
std    4.081009e+05
min    0.000000e+00
25%    3.538128e+05
50%    7.072395e+05
75%    1.060710e+06
max    1.413848e+06

Categorical Columns:
------------------------------

headline:
Unique values: 845770
Top 3 most common:
headline
Benzinga's Top Upgrades       5449
Benzinga's Top Downgrades     5372
Benzinga's Top Initiations    4241
Name: count, dtype: int64

url:
Unique values: 883429
Top 3 most common:
url
https://www.benzinga.com/news/20/03/15538835/stocks-that-hit-52-week-lows-on-thursday    1704
https://www.benz

<scripts.financial_analysis.FinancialAnalysis at 0x1371ff0b1f0>

#### Data Cleaning and Preprocessing
Next, we clean and standardize the dataset, then run a second exploratory pass on the cleaned data.
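
As a rough illustration of what a cleaning step of this kind can involve, the sketch below trims text fields, parses the timezone-aware timestamps, and derives `hour` and `headline_length` columns like those summarized in the exploratory output. The `clean_record` helper and its rules (for example, dropping rows whose date cannot be parsed) are assumptions for illustration; the actual `clean_data()` logic is not shown in this notebook.

```python
from datetime import datetime


def clean_record(record):
    """Return a cleaned copy of one raw row, or None if it cannot be used."""
    headline = (record.get("headline") or "").strip()
    raw_date = (record.get("date") or "").strip()
    if not headline or not raw_date:
        return None
    try:
        # Accepts ISO-style strings such as "2020-05-26 04:30:07-04:00"
        published = datetime.fromisoformat(raw_date)
    except ValueError:
        return None
    return {
        "headline": headline,
        "publisher": (record.get("publisher") or "").strip(),
        "stock": (record.get("stock") or "").strip().upper(),
        "date": published,
        "hour": published.hour,
        "headline_length": len(headline),
    }


# Illustrative rows: the first is taken from the dataset preview, the
# second is invented to show an unparseable date being dropped.
raw = [
    {"headline": " 71 Biggest Movers From Friday ", "publisher": "Lisa Levin",
     "date": "2020-05-26 04:30:07-04:00", "stock": "A"},
    {"headline": "Undated headline", "publisher": "Lisa Levin",
     "date": "not a date", "stock": "a"},
]

cleaned = [c for c in (clean_record(r) for r in raw) if c is not None]
for row in cleaned:
    print(row["date"].isoformat(), row["hour"], row["headline_length"])
```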


In [4]:
analysis.clean_data()
analysis.exploratory_analysis()


=== Exploratory Analysis ===

Dataset Shape: (55982, 10)
Date Range: 2011-04-27 21:01:48-04:00 to 2020-06-11 17:12:35-04:00

Top 10 Most Covered Stocks:
stock
GRUB    10
TSLA    10
FIVE    10
DEJ     10
CRIS    10
GDL     10
HTZ     10
UAL     10
GLTR    10
CHS     10
Name: count, dtype: int64

Articles by Hour of Day:
hour
0       67
1       14
2       57
3       93
4     1469
5     1829
6     2475
7     5032
8     5526
9     5965
10    7668
11    5701
12    5732
13    2710
14    2075
15    1612
16    3939
17    2799
18     704
19     227
20     131
21      82
22      48
23      27
Name: count, dtype: int64

Headline Length Summary:
count    55982.000000
mean        80.013951
std         56.127638
min         12.000000
25%         42.000000
50%         63.000000
75%         91.000000
max        512.000000
Name: headline_length, dtype: float64

Number of Unique Publishers: 225

Top 5 Publishers by Article Count:
publisher
Benzinga Newsdesk    14745
Lisa Levin           12408
Etf Profe

<scripts.financial_analysis.FinancialAnalysis at 0x1371ff0b1f0>

 #### 📊 Initial Data Analysis Summary:
 
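 1. 📈 Dataset Scale:
    - Large dataset with 1,407,328 records (about 1.4M)
    - No missing values in any column
    - Negligible duplication: no fully duplicated rows, and only 1 duplicate when the `Unnamed: 0` column is ignored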

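 2. 📰 Content Analysis:
    - 845,770 unique headlines
    - The most common headlines are recurring analyst-rating roundups: "Benzinga's Top Upgrades" (5,449), "Benzinga's Top Downgrades" (5,372), and "Benzinga's Top Initiations" (4,241)
    - 883,429 unique URLs, so some URLs repeat; the most frequent one appears 1,704 times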

 3. 👥 Publisher Insights:
    - 1,034 unique publishers
    - Top publishers: Paul Quintaro (228K articles), Lisa Levin (186K)
    - High concentration among top publishers

 4. 📅 Temporal Coverage:
    - ~40K unique dates
    - Peak activity in March 2020 (market volatility period)
    - Covers multiple stocks with MRK, MS, NVDA most frequent

 5. 📏 Headline Length Analysis:
    - Mean length: 73 characters
    - Median length: 64 characters
    - Mode length: 47 characters
    - Right-skewed distribution with long tail

 6. ⏰ Temporal Publishing Patterns:
    - Peak publishing: Thursdays (302,595 articles)
    - Lowest activity: Saturdays (7,753 articles)
    - Strong weekday bias, consistent with the trading week

 7. 📈 Time Series Characteristics:
    - Upward trend in monthly article volume
    - Periodic activity spikes observed
    - Financial news coverage expanding over time
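
The headline-length and temporal figures in the summary can be derived along the following lines. This is a minimal sketch using only the standard library and the four complete headlines from the dataset preview, so its numbers will not match the full-dataset figures above; the variable names are illustrative.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# (headline, publication timestamp) pairs taken from the dataset preview.
articles = [
    ("Stocks That Hit 52-Week Highs On Friday", "2020-06-05 10:30:54-04:00"),
    ("Stocks That Hit 52-Week Highs On Wednesday", "2020-06-03 10:45:20-04:00"),
    ("71 Biggest Movers From Friday", "2020-05-26 04:30:07-04:00"),
    ("46 Stocks Moving In Friday's Mid-Day Session", "2020-05-22 12:45:06-04:00"),
]

# Headline length statistics (in characters).
lengths = [len(headline) for headline, _ in articles]
length_stats = {
    "mean": mean(lengths),
    "median": median(lengths),
    # Most frequent length; with all-unique lengths this is simply the first.
    "mode": Counter(lengths).most_common(1)[0][0],
}

# Publishing patterns by day of week and by month.
dates = [datetime.fromisoformat(stamp) for _, stamp in articles]
weekday_counts = Counter(WEEKDAYS[d.weekday()] for d in dates)
monthly_counts = Counter(d.strftime("%Y-%m") for d in dates)

print(length_stats)
print(weekday_counts.most_common())
print(sorted(monthly_counts.items()))
```

Sorting the monthly counts by their `YYYY-MM` key gives a chronological series, which is the form needed to inspect trends and spikes in article volume.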
