Let's break down the code step by step:

### 1. Setup & Data Loading
This section initializes the necessary libraries and loads the primary data. It prepares the environment for the analysis and sets up the base dataframes we'll be working with.



*   **Import Libraries**: We import `pandas` for data manipulation, `numpy` for numerical operations, `yfinance` to download stock data, `nltk` for natural language processing (specifically sentiment analysis), `sklearn` for machine learning utilities (model selection and metrics), `pandas_ta` for technical indicators, and `xgboost` for our gradient boosting model.
*   **NLTK VADER Lexicon**: `nltk.download('vader_lexicon')` ensures the necessary lexicon for sentiment analysis is available, and `SentimentIntensityAnalyzer()` is initialized.
*   **Download S&P 500 Data**: `yf.download("^GSPC", start="2020-01-01", end="2024-01-01")` fetches historical stock data for the S&P 500 index (`^GSPC`) and stores it in `stock_df`.
*   **Flatten MultiIndex Columns**: `yfinance` sometimes returns a DataFrame with MultiIndex columns, such as `('Close', '^GSPC')`. The code checks for this and keeps only the first level of each column, so `('Close', '^GSPC')` becomes `Close`.
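*   **Create Dummy News Data**: Since we don't have a real news CSV, a `news_df` DataFrame is created with dummy headlines. In a real scenario, you would load a CSV file containing actual financial news headlines here. The four-headline pattern is repeated and sliced so that there is exactly one headline per stock date. Because the pattern simply cycles and has no connection to actual market events, the resulting sentiment scores carry no real predictive signal; they only exercise the pipeline.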
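
The column flattening and headline repetition described above can be tried on a toy frame without downloading anything. This is a minimal sketch, assuming a two-level column layout like the one `yfinance` may return; the name `toy_df` and all values are invented for illustration.

```python
import pandas as pd

# Toy frame with two-level columns, mimicking ('Close', '^GSPC')-style output.
# The prices are made-up placeholders, not real index values.
columns = pd.MultiIndex.from_tuples(
    [("Date", ""), ("Close", "^GSPC"), ("Open", "^GSPC")]
)
toy_df = pd.DataFrame(
    [["2020-01-02", 100.0, 99.0], ["2020-01-03", 101.0, 100.5]],
    columns=columns,
)

# Keep only the first level of each column tuple.
if isinstance(toy_df.columns, pd.MultiIndex):
    toy_df.columns = list(toy_df.columns.get_level_values(0))

# Repeat a short headline pattern, then slice to the number of rows needed.
headlines_pattern = [
    "Economy shows strong growth",
    "Tech stocks crash over interest rates",
]
n_rows = 5
repeated = (headlines_pattern * (n_rows // len(headlines_pattern) + 1))[:n_rows]

print(list(toy_df.columns))
print(len(repeated), repeated[-1])
```

The slice at the end matters: repeating the pattern overshoots the required length whenever the row count is not a multiple of the pattern length.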

### 2. Sentiment Analysis (NLP)
This part of the code analyzes the sentiment of the news headlines to extract a numerical score representing positivity or negativity.

*   **`get_sentiment(text)` function**: This function takes a piece of text (a headline) and uses `sia.polarity_scores(text)['compound']` to calculate a compound sentiment score. This score ranges from -1 (most negative) to +1 (most positive).
*   **Apply to News Data**: `news_df['Sentiment_Score'] = news_df['Headline'].apply(get_sentiment)` applies this function to every headline in our `news_df` to generate a new column named `Sentiment_Score`.
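
To see why the compound score stays between -1 and +1, it helps to look at the squashing step in isolation. The sketch below assumes the normalization used by the VADER reference implementation, which maps a raw valence sum `x` to `x / sqrt(x^2 + 15)`; the function name `normalize` and the sample inputs are illustrative, and the raw sums are not actual lexicon lookups.

```python
import math


def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums, from strongly negative to strongly positive.
for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"{raw:+.1f} -> {normalize(raw):+.4f}")
```

The mapping is symmetric around zero and flattens out for large inputs, so a headline with several strongly charged words approaches but does not exceed the bounds.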

### 3. Add Technical Indicators
Here, we calculate various technical indicators from the stock's price data. These indicators are widely used in financial analysis to provide insights into market trends, momentum, and potential buy/sell signals.

*   **Simple Moving Averages (SMA)**: `stock_df['Close'].rolling(window=X).mean()` calculates the average closing price over `X` periods (5, 10, and 20 days), helping to smooth out price data and identify trends.
*   **Exponential Moving Averages (EMA)**: `stock_df['Close'].ewm(span=X, adjust=False).mean()` calculates an EMA for 5, 10, and 20 days. EMAs give more weight to recent prices, making them more responsive to new information.
*   **Relative Strength Index (RSI)**: `ta.rsi(stock_df['Close'], length=14)` computes the 14-period RSI, a momentum oscillator that measures the speed and change of price movements, indicating overbought or oversold conditions.
*   **Moving Average Convergence Divergence (MACD)**: `ta.macd(stock_df['Close'], fast=12, slow=26, signal=9)` calculates the MACD line, signal line, and histogram. The code keeps the MACD line (`MACD_12_26_9`) and the signal line (`MACDs_12_26_9`) and does not use the histogram. MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a security's price.
*   **Handle NaN Values**: Technical indicators often produce `NaN` (Not a Number) values at the beginning of the series because they require historical data. The code does not fill these values; the affected rows are removed later by `final_df.dropna()`, after the merge in the next section.
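
The moving-average calculations are easy to check by hand on a short series. The following sketch uses only `pandas`; the toy prices and the short spans (2, 4, 3 instead of 12, 26, 9) are assumptions chosen to keep the numbers readable. The MACD here is built from plain `ewm` calls, so its early values may differ from what `pandas_ta` returns.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Close")

# SMA: undefined (NaN) until a full window of prices is available.
sma_3 = close.rolling(window=3).mean()

# EMA with adjust=False follows the recursion
#   e[0] = close[0];  e[t] = alpha * close[t] + (1 - alpha) * e[t-1]
# with alpha = 2 / (span + 1), so it has no leading NaN values.
ema_3 = close.ewm(span=3, adjust=False).mean()

# MACD line = fast EMA - slow EMA; signal line = EMA of the MACD line.
fast, slow, signal = 2, 4, 3  # toy spans, not the 12/26/9 used in the walkthrough
macd_line = (
    close.ewm(span=fast, adjust=False).mean()
    - close.ewm(span=slow, adjust=False).mean()
)
signal_line = macd_line.ewm(span=signal, adjust=False).mean()

print(pd.DataFrame({
    "Close": close,
    "SMA_3": sma_3,
    "EMA_3": ema_3,
    "MACD": macd_line,
    "Signal": signal_line,
}))
```

With a window of 3, the first two SMA values are `NaN`, which is the same warm-up effect that makes the longer indicators undefined at the start of the real series.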

### 4. Data Alignment (Feature Engineering)
This section prepares the data for machine learning by defining the target variable and merging the sentiment and technical indicators.

*   **Calculate `Price_Diff` and `Target`**: `stock_df['Price_Diff'] = stock_df['Close'].diff()` calculates the difference in closing price from the previous day. `stock_df['Target'] = (stock_df['Price_Diff'].shift(-1) > 0).astype(int)` creates our target variable: `1` if the stock price increased *the next day* (shifted by -1), and `0` otherwise. This is crucial for predicting future movement.
*   **Merge DataFrames**: `final_df = pd.merge(stock_df, news_df, on='Date')` combines the `stock_df` (now with technical indicators and target) and `news_df` (with sentiment scores) into a single `final_df` based on the common 'Date' column.
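*   **Drop Remaining NaNs**: `final_df = final_df.dropna()` removes the rows that still contain `NaN` values, which here are the warm-up rows at the start of the series where the moving averages, RSI, and MACD are undefined. The last day is not dropped by this step: its shifted `Price_Diff` is `NaN`, the comparison `NaN > 0` evaluates to `False`, and `.astype(int)` turns that into a `0` label even though there is no 'next day' to predict.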
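
The target construction can be traced on four toy prices. The first `Target` column mirrors the expression used in the walkthrough; the `Target_Masked` column is a hypothetical variant, not part of the original code, that leaves the label missing when the next-day change is unknown so that `dropna()` can remove that row.

```python
import pandas as pd

toy = pd.DataFrame({"Close": [100.0, 101.0, 100.0, 102.0]})  # invented prices
toy["Price_Diff"] = toy["Close"].diff()

# Next day's change, aligned to today's row; the last row has no next day.
next_diff = toy["Price_Diff"].shift(-1)

# Same expression as the walkthrough: NaN > 0 is False, so the last row becomes 0.
toy["Target"] = (next_diff > 0).astype(int)

# Hypothetical variant: keep NaN where the next-day change is unknown.
toy["Target_Masked"] = toy["Target"].where(next_diff.notna())

print(toy)
print("Rows left after dropping unknown targets:",
      len(toy.dropna(subset=["Target_Masked"])))
```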

### 5. Model Training with XGBoost
This is where we define our machine learning model, train it, and evaluate its performance.

*   **Define Features (X) and Target (y)**: `X` is created using the sentiment score and all the financial features (`Open`, `High`, `Low`, `Volume`, SMA, EMA, RSI, MACD components). `y` is our `Target` variable (whether the stock went up or down).
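*   **Split Data**: `train_test_split(X, y, test_size=0.2, random_state=42)` divides the data into training (80%) and testing (20%) sets, so the model is trained on one portion and evaluated on rows it has not seen. Note that `train_test_split` shuffles rows by default, so the test set is a random sample of days interleaved with the training days rather than a later, held-out period.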
*   **Initialize XGBoost Model**: `model_xgb = xgb.XGBClassifier(eval_metric='logloss', random_state=42)` creates an instance of the XGBoost classifier. `eval_metric='logloss'` specifies the evaluation metric during training, and `random_state` ensures reproducibility.
*   **Train Model**: `model_xgb.fit(X_train, y_train)` trains the XGBoost model using our training features and target.
*   **Make Predictions**: `y_pred_xgb = model_xgb.predict(X_test)` uses the trained model to make predictions on the test set.
*   **Evaluate Performance**: Finally, `accuracy_score(y_test, y_pred_xgb)` calculates the overall accuracy, and `classification_report(y_test, y_pred_xgb)` provides a detailed report including precision, recall, and f1-score for each class (stock up or stock down).
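
For price data, a time-ordered split is a common alternative to a shuffled one, because it evaluates the model only on days that come after everything it was trained on. The sketch below is not part of the original code: `chronological_split` is a hypothetical helper, and the toy frame is invented. Passing `shuffle=False` to `train_test_split` on date-sorted data would be another way to get a similar result.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str = "Date",
                        test_size: float = 0.2):
    """Return (train, test) where every test row is later than every train row."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    split_at = len(ordered) - n_test
    return ordered.iloc[:split_at], ordered.iloc[split_at:]


# Toy data: ten business days with made-up labels.
toy = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=10, freq="B"),
    "Target": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})

train, test = chronological_split(toy, test_size=0.2)
print(len(train), len(test))
print(train["Date"].max(), "<", test["Date"].min())
```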

In [None]:
pip install yfinance nltk xgboost pandas_ta

Collecting pandas_ta
  Downloading pandas_ta-0.4.71b0-py3-none-any.whl.metadata (2.3 kB)
Collecting numba==0.61.2 (from pandas_ta)
  Downloading numba-0.61.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.8 kB)
Collecting numpy>=1.16.5 (from yfinance)
  Downloading numpy-2.4.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Collecting pandas>=1.3.0 (from yfinance)
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting llvmlite<0.45,>=0.44.0dev0 (from numba==0.61.2->pandas_ta)
  Downloading llvmlite-0.44.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.0 kB)
Collecting numpy>=1.16.5 (from yfinance)
  Downloading numpy-2.2.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import nltk
import pandas_ta as ta
import xgboost as xgb

# 1. Setup & Data Loading
# Ensure you have the lexicon downloaded
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Download S&P 500 data (for the target variable)
stock_df = yf.download("^GSPC", start="2020-01-01", end="2024-01-01")
stock_df = stock_df.reset_index()

# Flatten column MultiIndex if it exists (common with yfinance output for single or multiple tickers)
if isinstance(stock_df.columns, pd.MultiIndex):
    # Keep only the first level: ('Open', '^GSPC') -> 'Open', ('Date', '') -> 'Date'
    stock_df.columns = list(stock_df.columns.get_level_values(0))

# Sample news data (In your project, load the Kaggle CSV here)
# news_df = pd.read_csv('financial_news.csv')
# For demonstration, we'll create dummy news aligned with stock dates
headlines_pattern = [
    "Economy shows strong growth",
    "Tech stocks crash over interest rates",
    "Market remains stable despite inflation",
    "Fed hints at rate cuts next month"
]
num_stock_dates = len(stock_df)
# Repeat the pattern enough times to cover all stock dates, then slice to the exact length
repeated_headlines = (headlines_pattern * (num_stock_dates // len(headlines_pattern) + 1))[:num_stock_dates]

news_data = {
    'Date': stock_df['Date'],
    'Headline': repeated_headlines
}
news_df = pd.DataFrame(news_data)

# 2. Sentiment Analysis (NLP)
def get_sentiment(text):
    return sia.polarity_scores(text)['compound']

news_df['Sentiment_Score'] = news_df['Headline'].apply(get_sentiment)

# 3. Add Technical Indicators
stock_df['SMA_5'] = stock_df['Close'].rolling(window=5).mean()
stock_df['SMA_10'] = stock_df['Close'].rolling(window=10).mean()
stock_df['SMA_20'] = stock_df['Close'].rolling(window=20).mean()

stock_df['EMA_5'] = stock_df['Close'].ewm(span=5, adjust=False).mean()
stock_df['EMA_10'] = stock_df['Close'].ewm(span=10, adjust=False).mean()
stock_df['EMA_20'] = stock_df['Close'].ewm(span=20, adjust=False).mean()

stock_df['RSI'] = ta.rsi(stock_df['Close'], length=14)

macd = ta.macd(stock_df['Close'], fast=12, slow=26, signal=9)
stock_df['MACD'] = macd['MACD_12_26_9']
stock_df['MACD_Signal'] = macd['MACDs_12_26_9']

# 4. Data Alignment (Feature Engineering)
# We want to predict TOMORROW'S movement using TODAY'S sentiment.
stock_df['Price_Diff'] = stock_df['Close'].diff()
stock_df['Target'] = (stock_df['Price_Diff'].shift(-1) > 0).astype(int) # 1 if Up, 0 if Down

# Merge news sentiment with stock data
final_df = pd.merge(stock_df, news_df, on='Date')

# Handle NaN values introduced by the indicator calculations
# Note: the last row is NOT dropped here; its Target is 0 because NaN > 0 evaluates to False
final_df = final_df.dropna() # Remove the initial warm-up rows where indicators are NaN

# 5. Model Training with XGBoost
X = final_df[['Sentiment_Score', 'Open', 'High', 'Low', 'Volume',
              'SMA_5', 'SMA_10', 'SMA_20', 'EMA_5', 'EMA_10', 'EMA_20',
              'RSI', 'MACD', 'MACD_Signal']]
y = final_df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_xgb = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
model_xgb.fit(X_train, y_train)

y_pred_xgb = model_xgb.predict(X_test)

print(f"XGBoost Model Accuracy: {accuracy_score(y_test, y_pred_xgb):.2%}")
print("\nXGBoost Classification Report:")
print(classification_report(y_test, y_pred_xgb))

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[*********************100%***********************]  1 of 1 completed


XGBoost Model Accuracy: 45.13%

XGBoost Classification Report:
              precision    recall  f1-score   support

           0       0.42      0.41      0.42        92
           1       0.48      0.49      0.48       103

    accuracy                           0.45       195
   macro avg       0.45      0.45      0.45       195
weighted avg       0.45      0.45      0.45       195
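
For context, the reported accuracy can be compared with a trivial baseline that always predicts the majority class. The sketch below uses only the support counts and the accuracy printed in the report above; the variable names are illustrative.

```python
# Class supports taken from the classification report above.
support_down, support_up = 92, 103
n_test = support_down + support_up
reported_accuracy = 0.4513  # "XGBoost Model Accuracy: 45.13%"

# Always predicting the larger class ("up") gets every "up" day right.
majority_baseline = max(support_down, support_up) / n_test

# Number of correct predictions implied by the reported accuracy.
correct = round(reported_accuracy * n_test)

print(f"Test rows:          {n_test}")
print(f"Correct (model):    {correct}")
print(f"Model accuracy:     {reported_accuracy:.2%}")
print(f"Majority baseline:  {majority_baseline:.2%}")
```

On this run the model scores below the always-up baseline, which is consistent with the sentiment feature being built from dummy headlines that carry no information about the market.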

