# KAIM Week 1 Challenges Task 3

## Business Objective

**Nova Financial Solutions** aims to enhance its predictive analytics capabilities to boost its financial forecasting accuracy and operational efficiency through advanced data analysis. As a Data Analyst at Nova Financial Solutions, your primary task is to conduct a rigorous analysis of the financial news dataset. The analysis has two focus areas:

* **Sentiment Analysis:** Perform sentiment analysis on the 'headline' text to quantify the tone and sentiment expressed in financial news. This will involve using natural language processing (NLP) techniques to derive sentiment scores, which can be associated with the respective 'Stock Symbol' to understand the emotional context surrounding stock-related news.
* **Correlation Analysis:** Establish statistical correlations between the sentiment derived from news articles and the corresponding stock price movements. This involves tracking stock price changes around the date the article was published and analyzing the impact of news sentiment on stock performance. This analysis should consider the publication date and, if it is available or can be inferred, the time the article was published.

Your recommendations should leverage insights from this sentiment analysis to suggest investment strategies. These strategies should utilize the relationship between news sentiment and stock price fluctuations to predict future movements. The final report should provide clear, actionable insights based on your analysis, offering innovative strategies to use news sentiment as a predictive tool for stock market trends.


## Dataset Overview

### Financial News and Stock Price Integration Dataset

**FNSPID (Financial News and Stock Price Integration Dataset)** is a comprehensive financial dataset designed to enhance stock market predictions by combining quantitative and qualitative data.

- The structure of the [data](https://drive.google.com/file/d/1tLHusoOQOm1cU_7DtLNbykgFgJ_piIpd/view?usp=drive_link) is as follows:
    - `headline`: Article release headline, the title of the news article, which often includes key financial actions like stocks hitting highs, price target changes, or company earnings.
    - `url`: The direct link to the full news article.
    - `publisher`: Author or creator of the article.
    - `date`: The publication date and time, including timezone information (UTC-4).
    - `stock`: Stock ticker symbol, a unique series of letters assigned to a publicly traded company (for example, AAPL for Apple).


### Correlation between news and stock movement

**Tasks:**
- Date Alignment: Ensure that both datasets (news and stock prices) are aligned by dates. This might involve normalizing timestamps.
- Sentiment Analysis: Conduct sentiment analysis on news headlines to quantify the tone of each article (positive, negative, neutral). Tools: Use Python libraries such as nltk or TextBlob for sentiment analysis.
- Analysis:
    - Calculate Daily Stock Returns: Compute the percentage change in daily closing prices to represent stock movements.
    - Correlation Analysis: Use statistical methods to test the correlation between daily news sentiment scores and stock returns.

**KPIs**
- Proactivity to self-learn - sharing references.
- Sentiment Analysis
- Correlation Strength


### Minimum Essential To Do:
- Merge the necessary branches from task-2 into the main branch using a Pull Request (PR)
- Create at least one new branch called "task-3" for the ongoing development of this task.
- Commit your work with a descriptive commit message.
- Data preparation
    - Normalize Dates: Align dates in both news and stock datasets to ensure each news item matches the corresponding stock trading day.
    - Perform Sentiment Analysis: Use a simple and effective sentiment analysis tool to assign sentiment scores to headlines.
- Calculate Stock Movements
    - Compute Daily Returns: Calculate daily percentage changes in stock prices to represent movements.
- Correlation Analysis
    - Aggregate Sentiments: Compute average daily sentiment scores if multiple articles appear on the same day.
    - Calculate Correlation: Determine the Pearson correlation coefficient between average daily sentiment scores and stock daily returns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

```python
data = pd.read_csv('/kaggle/input/kaim-w1/yfinance_data/yfinance_data/AAPL_historical_data.csv')
data.head()
```

To complete Task 3, you need to analyze the correlation between news sentiment and stock movement. The following detailed steps will guide you through this process, including data preparation, sentiment analysis, and correlation analysis.

### Steps to Complete Task 3

#### 1. **Setup GitHub Repository**

1. **Create a New Branch:**
   ```bash
   git checkout main
   git pull origin main
   git checkout -b task-3
   ```

2. **Commit Your Work:**
   ```bash
   git add .
   git commit -m "Set up repository for Task 3 and created branch."
   git push origin task-3
   ```

#### 2. **Data Preparation**

**Load and Normalize Data:**

1. **Load Datasets:**

```python
import pandas as pd

# Load news data
news_file_path = 'path_to_your_news_data.csv'  # Replace with your dataset path
news_df = pd.read_csv(news_file_path)

# Load stock price data
stock_file_path = 'path_to_your_stock_data.csv'  # Replace with your dataset path
stock_df = pd.read_csv(stock_file_path)

# Convert 'date' columns to datetime
news_df['date'] = pd.to_datetime(news_df['date'])
stock_df['Date'] = pd.to_datetime(stock_df['Date'])
stock_df.set_index('Date', inplace=True)
```

2. **Perform Sentiment Analysis:**

Score each headline first, because the daily aggregation in the next step averages the `sentiment` column.

```python
from textblob import TextBlob

# Add a sentiment column; TextBlob polarity ranges from -1 (negative) to +1 (positive)
news_df['sentiment'] = news_df['headline'].apply(lambda x: TextBlob(x).sentiment.polarity)
```

3. **Normalize Dates:**

Ensure that both datasets are aligned by date. If the news dataset has multiple entries per day, aggregate the news sentiment on a daily basis.

```python
# Drop the time of day but keep a datetime64 dtype, so the news dates
# can be matched against stock_df's DatetimeIndex when merging
news_df['date'] = pd.to_datetime(news_df['date'].dt.date)

# Aggregate news sentiment by day: headline count and mean sentiment
news_daily_sentiment = news_df.groupby('date').agg({'headline': 'count', 'sentiment': 'mean'}).reset_index()
news_daily_sentiment.rename(columns={'date': 'Date', 'headline': 'Headline_Count', 'sentiment': 'Avg_Sentiment'}, inplace=True)
news_daily_sentiment.set_index('Date', inplace=True)
```
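Headlines published on weekends or market holidays have no stock price row for the same date, so an exact date match silently drops them. One possible way to keep them is to assign each headline to the first trading day on or after its publication date. The following is a minimal sketch on toy data; the `calendar` frame, the `trading_day` column name, and the sentiment values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy headlines with already-computed sentiment scores (illustrative values).
news = pd.DataFrame({
    'date': pd.to_datetime(['2020-06-05', '2020-06-06', '2020-06-07']),  # Fri, Sat, Sun
    'sentiment': [0.2, -0.4, 0.6],
})

# Hypothetical trading calendar; in practice this could come from stock_df.index.
calendar = pd.DataFrame({
    'trading_day': pd.to_datetime(['2020-06-04', '2020-06-05', '2020-06-08']),
})

# merge_asof requires both frames to be sorted by their keys.
# direction='forward' picks the first trading day on or after each news date.
aligned = pd.merge_asof(
    news.sort_values('date'),
    calendar.sort_values('trading_day'),
    left_on='date',
    right_on='trading_day',
    direction='forward',
)

# Average sentiment per trading day, with weekend news rolled into Monday.
daily_sentiment = aligned.groupby('trading_day')['sentiment'].mean()
print(daily_sentiment)
```

Whether weekend news should be attributed to the next trading day is a modelling choice; dropping those headlines is equally defensible and simpler.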

#### 3. **Calculate Stock Movements**

1. **Compute Daily Returns:**

```python
# Calculate daily percentage change in stock price
stock_df['Daily_Return'] = stock_df['Close'].pct_change() * 100
```
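To make the return calculation concrete, here is a small worked example on three made-up closing prices. It compares `pct_change()` with the explicit formula, return = (price today / price yesterday - 1) * 100. The prices are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Three illustrative closing prices on consecutive trading days.
close = pd.Series(
    [100.0, 102.0, 99.96],
    index=pd.to_datetime(['2020-06-03', '2020-06-04', '2020-06-05']),
    name='Close',
)

# pct_change() compares each price with the previous row.
daily_return = close.pct_change() * 100

# The same calculation written out explicitly.
manual_return = (close / close.shift(1) - 1) * 100

print(daily_return)
# The first value is NaN because the first day has no previous close.
print(np.allclose(daily_return, manual_return, equal_nan=True))
```

Because each return depends on the previous row, the price data should be sorted by date before calling `pct_change()`.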

#### 4. **Correlation Analysis**

1. **Aggregate Sentiments:**

Ensure the average daily sentiment scores are available and aligned with stock daily returns.

```python
# Merge daily news sentiment with stock data
merged_df = pd.merge(stock_df[['Daily_Return']], news_daily_sentiment, left_index=True, right_index=True)
```

2. **Calculate Correlation:**

```python
# Calculate Pearson correlation coefficient
correlation = merged_df[['Avg_Sentiment', 'Daily_Return']].corr().iloc[0, 1]
print(f'Pearson correlation coefficient between average daily sentiment and stock daily returns: {correlation:.2f}')
```
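The business objective asks whether sentiment can help predict future movements, which a same-day correlation does not test. A possible extension is to correlate today's sentiment with the next trading day's return by shifting the return series. The sketch below uses synthetic data in which returns are constructed to follow the previous day's sentiment; the `demo` frame and its coefficients are assumptions chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic sentiment, and returns built to depend on yesterday's sentiment.
sentiment = pd.Series(rng.normal(size=n))
returns = 0.5 * sentiment.shift(1) + rng.normal(scale=0.5, size=n)
demo = pd.DataFrame({'Avg_Sentiment': sentiment, 'Daily_Return': returns})

# Same-day correlation: sentiment on day t vs. return on day t.
same_day = demo['Avg_Sentiment'].corr(demo['Daily_Return'])

# Next-day correlation: shift(-1) moves the return of day t+1 onto row t.
next_day = demo['Avg_Sentiment'].corr(demo['Daily_Return'].shift(-1))

print(f'Same-day correlation: {same_day:.2f}')
print(f'Next-day correlation: {next_day:.2f}')
```

On real data, the same two lines can be applied to `merged_df`, provided its rows are consecutive trading days sorted by date.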

#### 5. **Visualizations**

Visualize the relationship between sentiment scores and stock returns.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot daily returns and average sentiment
plt.figure(figsize=(14, 7))

# Plot stock returns
plt.subplot(2, 1, 1)
plt.plot(merged_df.index, merged_df['Daily_Return'], label='Daily Return', color='blue')
plt.title('Daily Stock Returns')
plt.xlabel('Date')
plt.ylabel('Return (%)')
plt.legend()

# Plot average sentiment
plt.subplot(2, 1, 2)
plt.plot(merged_df.index, merged_df['Avg_Sentiment'], label='Average Sentiment', color='orange')
plt.title('Average Daily Sentiment')
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.legend()

plt.tight_layout()
plt.show()

# Scatter plot to visualize correlation
plt.figure(figsize=(8, 6))
sns.scatterplot(data=merged_df, x='Avg_Sentiment', y='Daily_Return', alpha=0.5)
plt.title('Sentiment Score vs. Stock Returns')
plt.xlabel('Average Sentiment Score')
plt.ylabel('Daily Return (%)')
plt.show()
```
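A single correlation coefficient can hide periods where the relationship strengthens or reverses. One way to look at this is a rolling correlation, computed here on synthetic data as a sketch. The 30-day window is an arbitrary assumption, and the `demo` frame stands in for `merged_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for merged_df, indexed by business days.
dates = pd.bdate_range('2020-01-01', periods=120)
demo = pd.DataFrame(
    {
        'Avg_Sentiment': rng.normal(size=120),
        'Daily_Return': rng.normal(size=120),
    },
    index=dates,
)

window = 30  # assumed window length, in trading days

# Pearson correlation over each trailing 30-row window.
# The first window - 1 values are NaN because the window is not yet full.
rolling_corr = demo['Avg_Sentiment'].rolling(window).corr(demo['Daily_Return'])

print(rolling_corr.dropna().describe())
```

The resulting series can be drawn with `rolling_corr.plot()` in the same way as the line plots above.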

### Example Commit Messages

1. “Set up the repository and created branch for Task 3.”
2. “Loaded and normalized news and stock price data.”
3. “Performed sentiment analysis on news headlines.”
4. “Calculated daily stock returns and aggregated sentiment scores.”
5. “Conducted correlation analysis and visualized results.”

### Final Notes

- **Data Alignment:** Ensure that dates in both datasets are aligned correctly, and handle missing values or discrepancies if necessary.
- **Sentiment Analysis:** Choose a sentiment analysis method that fits your needs; TextBlob is used here, but other methods or libraries could be used for more advanced analysis.
- **Correlation Analysis:** The Pearson correlation coefficient provides a measure of the linear relationship between sentiment scores and stock returns. Consider additional statistical tests or models if needed.
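
As one example of an additional statistical test, `scipy.stats.pearsonr` returns a p-value alongside the coefficient, which helps judge whether a weak correlation could be due to chance. This sketch assumes SciPy is installed and uses synthetic data in place of `merged_df`.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Synthetic stand-in for merged_df; the first return is NaN, as after pct_change().
demo = pd.DataFrame({
    'Avg_Sentiment': rng.normal(size=100),
    'Daily_Return': rng.normal(size=100),
})
demo.loc[0, 'Daily_Return'] = np.nan

# pearsonr does not skip missing values, so drop incomplete rows first.
clean = demo.dropna()

r, p_value = pearsonr(clean['Avg_Sentiment'], clean['Daily_Return'])
print(f'Pearson r = {r:.3f}, p-value = {p_value:.3f}')
```

A small p-value suggests the observed linear relationship is unlikely under the assumption of no correlation; it does not by itself indicate that sentiment causes price moves.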

Feel free to adjust the code and analysis based on the specific features of your datasets and analysis requirements.

To ensure Task 3 is completed comprehensively and in detail, let’s break down each step carefully. We’ll ensure every requirement is addressed, from aligning dates to calculating correlations.

### Task 3: Correlation Between News and Stock Movement

### 1. **Setup GitHub Repository**

1. **Create a New Branch:**
   ```bash
   git checkout main
   git pull origin main
   git checkout -b task-3
   ```

2. **Commit Your Work:**
   ```bash
   git add .
   git commit -m "Set up repository for Task 3 and created branch."
   git push origin task-3
   ```

### 2. **Data Preparation**

#### **Load and Normalize Data**

1. **Load Datasets:**
   ```python
   import pandas as pd

   # Load news data
   news_file_path = 'path_to_your_news_data.csv'  # Replace with your dataset path
   news_df = pd.read_csv(news_file_path)

   # Load stock price data
   stock_file_path = 'path_to_your_stock_data.csv'  # Replace with your dataset path
   stock_df = pd.read_csv(stock_file_path)

   # Display initial data
   print("News Data:")
   print(news_df.head())
   print("\nStock Data:")
   print(stock_df.head())
   ```

2. **Convert Dates and Set Index:**
   ```python
   # Convert 'date' columns to datetime
   news_df['date'] = pd.to_datetime(news_df['date'])
   stock_df['Date'] = pd.to_datetime(stock_df['Date'])
   stock_df.set_index('Date', inplace=True)

   # Drop the time of day but keep a datetime64 dtype, so the news dates
   # can be matched against stock_df's DatetimeIndex when merging
   news_df['date'] = pd.to_datetime(news_df['date'].dt.date)
   ```

3. **Align Dates:**
   Stock prices exist only for trading days, while news can appear on any day. Once each headline has a sentiment score (see **Sentiment Analysis** below), the scores are averaged per calendar day, and the merge in the Correlation Analysis step keeps only the dates present in both datasets.
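
The dataset description states that the `date` column carries UTC-4 timezone information, so the calendar day of a headline depends on which timezone is used when the time of day is dropped. The sketch below parses a few made-up timestamps to UTC, converts them to a fixed UTC-4 offset, and only then truncates to the day. The sample strings are assumptions for illustration.

```python
from datetime import timedelta, timezone

import pandas as pd

# Illustrative timestamps with explicit offsets.
raw = pd.Series([
    '2020-06-05 10:30:54-04:00',
    '2020-06-05 23:15:00-04:00',
    '2020-06-06 02:00:00+00:00',
])

# Parse everything to a single timezone-aware dtype (UTC).
parsed_utc = pd.to_datetime(raw, utc=True)

# Convert to a fixed UTC-4 offset, as described for the dataset.
utc_minus_4 = timezone(timedelta(hours=-4))
local = parsed_utc.dt.tz_convert(utc_minus_4)

# Drop the timezone and the time of day, keeping a datetime64 dtype.
news_day = local.dt.tz_localize(None).dt.normalize()

print(pd.DataFrame({'utc_date': parsed_utc.dt.date, 'utc_minus_4_date': news_day.dt.date}))
```

In this toy example the last two headlines fall on 6 June in UTC but on 5 June in UTC-4, which would change the trading day they are matched to.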

#### **Sentiment Analysis**

1. **Perform Sentiment Analysis:**
   ```python
   from textblob import TextBlob

   # Add sentiment scores to the news dataframe
   news_df['sentiment'] = news_df['headline'].apply(lambda x: TextBlob(x).sentiment.polarity)
   ```

2. **Aggregate Sentiment Scores:**
   ```python
   # Group by date and calculate the average sentiment
   news_daily_sentiment = news_df.groupby('date').agg({'sentiment': 'mean'}).reset_index()
   news_daily_sentiment.rename(columns={'date': 'Date', 'sentiment': 'Avg_Sentiment'}, inplace=True)
   news_daily_sentiment.set_index('Date', inplace=True)
   ```
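
The task description also asks for each headline to be classed as positive, negative, or neutral. One simple way to do that is to threshold the polarity score. The helper `label_sentiment` and its `threshold` of 0.05 are hypothetical choices for this sketch, not values prescribed by TextBlob or by the challenge.

```python
import pandas as pd


def label_sentiment(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Illustrative polarity scores, such as those stored in news_df['sentiment'].
scores = pd.Series([0.6, 0.0, -0.25, 0.03])

labels = scores.apply(label_sentiment)
print(labels.value_counts())
```

On the real data this could be applied as `news_df['sentiment'].apply(label_sentiment)` to see how the headlines split across the three classes.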

### 3. **Calculate Stock Movements**

1. **Compute Daily Returns:**
   ```python
   # Calculate daily percentage change in closing price
   stock_df['Daily_Return'] = stock_df['Close'].pct_change() * 100
   ```

### 4. **Correlation Analysis**

1. **Merge Datasets:**
   ```python
   # Merge daily news sentiment with stock data
   merged_df = pd.merge(stock_df[['Daily_Return']], news_daily_sentiment, left_index=True, right_index=True)

   # Drop any rows with missing data
   merged_df.dropna(inplace=True)

   # Display the merged data
   print("Merged Data:")
   print(merged_df.head())
   ```

2. **Calculate Pearson Correlation:**
   ```python
   # Calculate Pearson correlation coefficient
   correlation = merged_df[['Avg_Sentiment', 'Daily_Return']].corr().iloc[0, 1]
   print(f'Pearson correlation coefficient between average daily sentiment and stock daily returns: {correlation:.2f}')
   ```
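
Because the merge keeps only dates present in both datasets, it is worth checking how many days were dropped before relying on the coefficient. A possible diagnostic is an outer merge with `indicator=True`, sketched here on toy frames whose values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for stock_df[['Daily_Return']] and news_daily_sentiment.
stock_demo = pd.DataFrame(
    {'Daily_Return': [0.5, -1.2, 0.3]},
    index=pd.to_datetime(['2020-06-04', '2020-06-05', '2020-06-08']),
)
news_demo = pd.DataFrame(
    {'Avg_Sentiment': [0.1, 0.4]},
    index=pd.to_datetime(['2020-06-05', '2020-06-06']),
)

# how='outer' keeps every date; indicator=True adds a '_merge' column
# showing whether a date came from the left frame, the right frame, or both.
check = pd.merge(
    stock_demo, news_demo,
    left_index=True, right_index=True,
    how='outer', indicator=True,
)

coverage = check['_merge'].value_counts()
print(coverage)
```

Here `left_only` counts trading days without news, `right_only` counts news days without prices, and `both` is the number of rows an inner merge would keep.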

### 5. **Visualizations**

1. **Plot Daily Returns and Average Sentiment:**
   ```python
   import matplotlib.pyplot as plt
   import seaborn as sns

   # Plot Daily Returns and Average Sentiment
   plt.figure(figsize=(14, 7))

   # Plot stock returns
   plt.subplot(2, 1, 1)
   plt.plot(merged_df.index, merged_df['Daily_Return'], label='Daily Return', color='blue')
   plt.title('Daily Stock Returns')
   plt.xlabel('Date')
   plt.ylabel('Return (%)')
   plt.legend()

   # Plot average sentiment
   plt.subplot(2, 1, 2)
   plt.plot(merged_df.index, merged_df['Avg_Sentiment'], label='Average Sentiment', color='orange')
   plt.title('Average Daily Sentiment')
   plt.xlabel('Date')
   plt.ylabel('Sentiment Score')
   plt.legend()

   plt.tight_layout()
   plt.show()
   ```

2. **Scatter Plot to Visualize Correlation:**
   ```python
   plt.figure(figsize=(8, 6))
   sns.scatterplot(data=merged_df, x='Avg_Sentiment', y='Daily_Return', alpha=0.5)
   plt.title('Sentiment Score vs. Stock Returns')
   plt.xlabel('Average Sentiment Score')
   plt.ylabel('Daily Return (%)')
   plt.show()
   ```

### Example Commit Messages

1. “Set up repository for Task 3 and created branch.”
2. “Loaded and prepared news and stock price data, and normalized dates.”
3. “Performed sentiment analysis on news headlines and aggregated daily sentiment.”
4. “Calculated daily stock returns and merged with news sentiment data.”
5. “Conducted correlation analysis and created visualizations of sentiment and stock returns.”

### Final Notes

- **Data Alignment:** Ensure that all dates match correctly, and handle any discrepancies or missing values.
- **Sentiment Analysis:** Verify the sentiment analysis method aligns with your needs. Consider more advanced models if required.
- **Correlation Analysis:** The Pearson correlation coefficient measures the linear relationship. Consider additional statistical tests or analyses for more insights.
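
One such additional analysis is the Spearman rank correlation, which captures monotonic relationships that are not linear. The sketch below builds a deliberately nonlinear synthetic relationship to show how the two coefficients can differ; the data is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic data: the "return" is a monotonic but nonlinear function of sentiment.
x = rng.normal(size=300)
demo = pd.DataFrame({'Avg_Sentiment': x, 'Daily_Return': x ** 3})

# Pearson measures linear association; Spearman works on ranks.
pearson = demo.corr(method='pearson').loc['Avg_Sentiment', 'Daily_Return']
spearman = demo.corr(method='spearman').loc['Avg_Sentiment', 'Daily_Return']

print(f'Pearson:  {pearson:.2f}')
print(f'Spearman: {spearman:.2f}')
```

If the two coefficients differ noticeably on the real `merged_df`, that can be a hint that outliers or a nonlinear relationship are affecting the Pearson value.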

Feel free to adjust the code and approach based on the specific needs of your datasets and analysis objectives.