**Noah Severin Grad HW Assignment**

Hi Eric,

The assignment below was based on the following criteria modifications, which we discussed during your office hours on 4/17/25:



*   Due to differences in the companies' 10Q naming conventions, it would be acceptable for full credit to analyze only the NVDA 10Qs
*   It would be acceptable for full credit to manually upload the cleaned 10Q documents to Gemini and perform the sentiment analysis outside of the code environment


I have uploaded this notebook to the dropbox, along with the CSV file the code requires and a transcript of the prompts I used to perform the sentiment analysis.


In [73]:
#STEP 1: Download the 10Qs from SEC EDGAR
!pip install sec-edgar-downloader
from sec_edgar_downloader import Downloader
import os
from datetime import datetime

# Company tickers and report type
tickers = ['NVDA', 'INTC']
report_type = '10-Q'
account_email = "noah.severin@marquette.edu"  # Replace with your actual email

# Date range
after_date = '2020-01-01'
before_date = datetime.today().strftime('%Y-%m-%d')

# Loop through each ticker and download 10-Qs
for ticker in tickers:
    try:
        dl = Downloader(ticker, account_email)
        dl.get(report_type, ticker, after=after_date, before=before_date)
        base_dir = os.getcwd()
        target_dir = os.path.join(base_dir, "sec-edgar-filings", ticker, report_type)
        print(f"Successfully downloaded 10-Q filings for {ticker} between {after_date} and {before_date}")
        print(f"Files saved to: {target_dir}")
    except Exception as e:
        print(f"Error downloading 10-Q filings for {ticker}: {e}")

Successfully downloaded 10-Q filings for NVDA between 2020-01-01 and 2025-04-18
Files saved to: /content/sec-edgar-filings/NVDA/10-Q
Successfully downloaded 10-Q filings for INTC between 2020-01-01 and 2025-04-18
Files saved to: /content/sec-edgar-filings/INTC/10-Q


In [74]:
#STEP 2: Clean the 10Qs to remove the HTML format
import os
from bs4 import BeautifulSoup
import re
from datetime import datetime

def extract_nvda_10q_cleaned(base_dir, report_type='10-Q'):
    ticker = "NVDA"
    reports_path = os.path.join(base_dir, "sec-edgar-filings", ticker, report_type)
    output_dir = os.path.join(base_dir, "cleaned-filings", ticker, report_type)
    os.makedirs(output_dir, exist_ok=True)

    month_map = {
        "January": "01", "February": "02", "March": "03", "April": "04",
        "May": "05", "June": "06", "July": "07", "August": "08",
        "September": "09", "October": "10", "November": "11", "December": "12"
    }

    for root, dirs, files in os.walk(reports_path):
        for file in files:
            if file.endswith(".txt"):
                file_path = os.path.join(root, file)
                with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                    html = f.read()
                    soup = BeautifulSoup(html, "html.parser")
                    full_text = soup.get_text(" ")

                    # Look for pattern like "For the Quarter Ended March 31, 2023"
                    date_match = re.search(
                        r"For the Quarter Ended\s+([A-Za-z]+)\s+(\d{1,2}),\s+(\d{4})",
                        full_text, re.IGNORECASE
                    )

                    if date_match:
                        month_str, day, year = date_match.groups()
                        month = month_map.get(month_str.capitalize(), "01")
                        filing_date = f"{year}-{month}-{int(day):02d}"
                    else:
                        filing_date = "unknown-date"

                    # Extract and clean content
                    docs = soup.find_all("document")
                    if docs:
                        text = "\n".join(doc.get_text(separator="\n") for doc in docs)
                    else:
                        text = soup.get_text(separator="\n")

                    lines = [line.strip() for line in text.splitlines()]
                    clean_text = "\n".join([line for line in lines if line])

                    # Save cleaned file
                    cleaned_filename = f"{ticker}_{filing_date}_cleaned.txt"
                    cleaned_path = os.path.join(output_dir, cleaned_filename)
                    with open(cleaned_path, "w", encoding="utf-8") as out_f:
                        out_f.write(clean_text)

                    print(f"Saved: {cleaned_filename}")

# Run this for NVDA
base_dir = os.getcwd()
extract_nvda_10q_cleaned(base_dir)

Saved: NVDA_2024-07-28_cleaned.txt
Saved: NVDA_2020-10-25_cleaned.txt
Saved: NVDA_2022-10-30_cleaned.txt
Saved: NVDA_2023-04-30_cleaned.txt
Saved: NVDA_2023-07-30_cleaned.txt
Saved: NVDA_2020-07-26_cleaned.txt
Saved: NVDA_2020-04-26_cleaned.txt
Saved: NVDA_2021-05-02_cleaned.txt
Saved: NVDA_2021-08-01_cleaned.txt
Saved: NVDA_2023-10-29_cleaned.txt
Saved: NVDA_2021-10-31_cleaned.txt
Saved: NVDA_2024-10-27_cleaned.txt
Saved: NVDA_2022-05-01_cleaned.txt
Saved: NVDA_2022-07-31_cleaned.txt
Saved: NVDA_2024-04-28_cleaned.txt
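
The quarter-end date lookup in Step 2 can also be sketched with `datetime.strptime`, which parses the month name directly instead of going through a hand-built month map. This is a minimal sketch, not the code that produced the files above: `parse_quarter_end` is a hypothetical helper name, the sample string is made up, and `%B` is assumed to run under an English locale.

```python
import re
from datetime import datetime

# Same pattern as Step 2: "For the Quarter Ended <Month> <day>, <year>"
QUARTER_END = re.compile(
    r"For the Quarter Ended\s+([A-Za-z]+)\s+(\d{1,2}),\s+(\d{4})",
    re.IGNORECASE,
)

def parse_quarter_end(text):
    """Return the quarter-end date as YYYY-MM-DD, or None if it cannot be parsed."""
    match = QUARTER_END.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        # %B expects a full month name such as "July" (locale dependent).
        parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
    except ValueError:
        # Unknown month name or impossible day, e.g. "June 31".
        return None
    return parsed.strftime("%Y-%m-%d")

# Hypothetical cover-page text, for illustration only.
print(parse_quarter_end("FORM 10-Q for the quarter ended July 28, 2024"))
```

Returning `None` leaves the decision about unparseable dates to the caller. That may matter here because every file that falls back to the `unknown-date` label would be written to the same output filename.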


In [75]:
#STEPS 3 & 4: Extract Key Sections and Perform LLM Sentiment Analysis

#As we discussed, I did this portion manually by downloading the cleaned 10Q files and
#uploading them to Gemini. I included a copy of the Gemini transcript, along with the output table
#it generated, with the submission of this notebook. The output table was uploaded to this notebook
#to perform the step below

#Read in Gemini Output Table

import kagglehub
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
    accuracy_score
)
import seaborn as sns

df = pd.read_csv('/content/Gemini Sentiment Analysis Output Table.csv')
df.head(15)


Unnamed: 0,Filing Date,Sentiment Rating,Quantified Sentiment Rating
0,2024-10-27,Positive,8
1,2024-07-28,Positive,9
2,2024-04-28,Positive,7
3,2023-10-29,Positive,6
4,2023-07-30,Positive,5
5,2023-04-30,Positive,2
6,2022-10-30,Negative,-3
7,2022-07-31,Negative,-4
8,2022-05-01,Negative,-5
9,2021-10-31,Positive,4


In [76]:
#STEP 5: Correlation with stock performance

import yfinance as yf
import pandas as pd

# Download NVDA data
nvda_data = yf.download("NVDA", start="2020-01-01", end=pd.Timestamp.today().strftime('%Y-%m-%d'))

# Create a dataframe with Date and Close Price columns
nvda_close = nvda_data[["Close"]].copy()
nvda_close = nvda_close.rename(columns={"Close": "Close Price"}) #Rename "Close" to "Close Price"

# Display the dataframe
nvda_close.head()

[*********************100%***********************]  1 of 1 completed


Price,Close Price
Ticker,NVDA
Date,Unnamed: 1_level_2
2020-01-02,5.972161
2020-01-03,5.876571
2020-01-06,5.901216
2020-01-07,5.972659
2020-01-08,5.983862


In [77]:
# Ensure index is sorted and a DatetimeIndex
nvda_close = nvda_close.sort_index()
nvda_close.index = pd.to_datetime(nvda_close.index)

# Original list of target dates
target_dates = [
    "2020-04-26", "2020-07-26", "2020-10-25", "2021-05-02", "2021-08-01",
    "2021-10-31", "2022-05-01", "2022-07-31", "2022-10-30", "2023-04-30",
    "2023-07-30", "2023-10-29", "2024-04-28", "2024-07-28", "2024-10-27"
]
target_dates = pd.to_datetime(target_dates)

# Reindex to include target dates
nvda_close_reindexed = nvda_close.reindex(nvda_close.index.union(target_dates))

# Forward fill missing values
nvda_close_reindexed = nvda_close_reindexed.ffill()

# Select the target dates
filtered_nvda_close = nvda_close_reindexed.loc[target_dates]

# Show result
print(filtered_nvda_close)

Price      Close Price
Ticker            NVDA
2020-04-26    7.213175
2020-07-26   10.161715
2020-10-25   13.550594
2021-05-02   14.975291
2021-08-01   19.459021
2021-10-31   25.519075
2022-05-01   18.517660
2022-07-31   18.138102
2022-10-30   13.819138
2023-04-30   27.730993
2023-07-30   46.724495
2023-10-29   40.481236
2024-04-28   87.706192
2024-07-28  113.032143
2024-10-27  141.517227
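
The target dates above fall on weekends, so the reindex and forward fill carry the last trading day's close onto each of them. `Series.asof` performs the same "last value at or before this date" lookup in one call. The sketch below uses made-up prices, not NVDA data, and assumes the price index is sorted.

```python
import pandas as pd

# Made-up closes on three trading days (Thu, Fri, Mon); values are illustrative only.
closes = pd.Series(
    [10.0, 11.0, 12.0],
    index=pd.to_datetime(["2024-07-25", "2024-07-26", "2024-07-29"]),
    name="Close Price",
)

# 2024-07-28 is a Sunday, so the price series has no row for it.
targets = pd.to_datetime(["2024-07-28", "2024-07-29"])

# asof returns the last non-NaN value at or before each target date.
looked_up = closes.asof(targets)

# The reindex-and-forward-fill route gives the same values.
filled = closes.reindex(closes.index.union(targets)).ffill().loc[targets]

print(looked_up)
```

Either route works; `asof` avoids building the intermediate reindexed frame.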




In [78]:
# Calculate percentage return from one row to the next
filtered_nvda_close["Pct Return"] = filtered_nvda_close["Close Price"].pct_change() * 100

# Optional: round for readability
filtered_nvda_close["Pct Return"] = filtered_nvda_close["Pct Return"].round(2)

# Show result
print(filtered_nvda_close)

Price      Close Price Pct Return
Ticker            NVDA           
2020-04-26    7.213175        NaN
2020-07-26   10.161715      40.88
2020-10-25   13.550594      33.35
2021-05-02   14.975291      10.51
2021-08-01   19.459021      29.94
2021-10-31   25.519075      31.14
2022-05-01   18.517660     -27.44
2022-07-31   18.138102      -2.05
2022-10-30   13.819138     -23.81
2023-04-30   27.730993     100.67
2023-07-30   46.724495      68.49
2023-10-29   40.481236     -13.36
2024-04-28   87.706192     116.66
2024-07-28  113.032143      28.88
2024-10-27  141.517227      25.20
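
As a quick check on the arithmetic, `pct_change() * 100` computes (current close / previous close - 1) × 100 for each row. The sketch below reproduces the first non-NaN value using the two closes copied from the table above; the variable names are only for illustration.

```python
# Closes copied from the table above.
close_2020_04_26 = 7.213175
close_2020_07_26 = 10.161715

# Percentage return from the earlier date to the later one.
pct_return = (close_2020_07_26 / close_2020_04_26 - 1) * 100

print(round(pct_return, 2))  # 40.88, matching the Pct Return column
```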


In [79]:
print(type(filtered_nvda_close.index))      # Check row index type
print(type(filtered_nvda_close.columns))    # Check column index type

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
<class 'pandas.core.indexes.multi.MultiIndex'>


In [80]:
filtered_nvda_close.columns = [col if not isinstance(col, tuple) else col[-1] for col in filtered_nvda_close.columns]
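
The cell above keeps the last level of each column tuple, which is why the columns later appear as `'NVDA'` and `''`. A possible alternative, sketched here on a toy frame rather than the real data, keeps the first level so the descriptive names survive. The frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the two-level (Price, Ticker) columns shown above.
columns = pd.MultiIndex.from_tuples(
    [("Close Price", "NVDA"), ("Pct Return", "")],
    names=["Price", "Ticker"],
)
frame = pd.DataFrame([[7.21, None], [10.16, 40.88]], columns=columns)

# Keeping the last level reproduces the 'NVDA' and '' names.
last_level = [col[-1] for col in frame.columns]

# Keeping the first level preserves the descriptive names instead.
flat = frame.copy()
flat.columns = frame.columns.get_level_values(0)

print(last_level)          # ['NVDA', '']
print(list(flat.columns))  # ['Close Price', 'Pct Return']
```

With first-level names, the columns would not need renaming from `'NVDA'` and `''` after the merge. This only works cleanly for a single ticker, since two tickers would produce duplicate first-level names.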

In [81]:
# Convert 'Filing Date' columns to datetime objects
df['Filing Date'] = pd.to_datetime(df['Filing Date'])
filtered_nvda_close.index = pd.to_datetime(filtered_nvda_close.index)

#Rename the index in filtered_nvda_close
filtered_nvda_close = filtered_nvda_close.rename_axis('Filing Date')


# Perform the merge
merged_df = pd.merge(df, filtered_nvda_close, on='Filing Date', how='left')

# Display the merged DataFrame
print(merged_df.head(15))

   Filing Date Sentiment Rating  Quantified Sentiment Rating        NVDA  \
0   2024-10-27         Positive                            8  141.517227   
1   2024-07-28         Positive                            9  113.032143   
2   2024-04-28         Positive                            7   87.706192   
3   2023-10-29         Positive                            6   40.481236   
4   2023-07-30         Positive                            5   46.724495   
5   2023-04-30         Positive                            2   27.730993   
6   2022-10-30         Negative                           -3   13.819138   
7   2022-07-31         Negative                           -4   18.138102   
8   2022-05-01         Negative                           -5   18.517660   
9   2021-10-31         Positive                            4   25.519075   
10  2021-08-01         Positive                            5   19.459021   
11  2021-05-02         Positive                            6   14.975291   
12  2020-10-

In [82]:
# Display the headers of the merged DataFrame
print(merged_df.columns)

Index(['Filing Date', 'Sentiment Rating', 'Quantified Sentiment Rating',
       'NVDA', ''],
      dtype='object')


In [83]:
# Rename specific columns using a dictionary
merged_df = merged_df.rename(columns={'NVDA': 'Filing Date Close Price', '': '% Change Last Filing Date'})

# Display the headers of the merged DataFrame
print(merged_df.columns)

Index(['Filing Date', 'Sentiment Rating', 'Quantified Sentiment Rating',
       'Filing Date Close Price', '% Change Last Filing Date'],
      dtype='object')


In [84]:
# Display the merged DataFrame (first 15 rows)
print(merged_df.head(15))

   Filing Date Sentiment Rating  Quantified Sentiment Rating  \
0   2024-10-27         Positive                            8   
1   2024-07-28         Positive                            9   
2   2024-04-28         Positive                            7   
3   2023-10-29         Positive                            6   
4   2023-07-30         Positive                            5   
5   2023-04-30         Positive                            2   
6   2022-10-30         Negative                           -3   
7   2022-07-31         Negative                           -4   
8   2022-05-01         Negative                           -5   
9   2021-10-31         Positive                            4   
10  2021-08-01         Positive                            5   
11  2021-05-02         Positive                            6   
12  2020-10-25         Positive                            5   
13  2020-07-26         Positive                            4   
14  2020-04-26         Positive         

In [85]:
#Move the '% Change Last Filing Date' column up by one to compare sentiment vs next period performance

# Shift the '% Change Last Filing Date' column up by one
merged_df['% Change Next Filing Date'] = merged_df['% Change Last Filing Date'].shift(-1)

# Create a correlation column
merged_df['Correlation'] = (merged_df['Quantified Sentiment Rating'] > 0) == (merged_df['% Change Next Filing Date'] > 0)

# Display the updated DataFrame
print(merged_df[['Filing Date', 'Sentiment Rating', 'Quantified Sentiment Rating', '% Change Next Filing Date', 'Correlation']])

# Analyze the correlation (e.g., calculate the percentage of True values in the 'Correlation' column)
correlation_percentage = merged_df['Correlation'].sum() / len(merged_df) * 100
print(f"\nCorrelation Percentage: {correlation_percentage:.2f}%")

   Filing Date Sentiment Rating  Quantified Sentiment Rating  \
0   2024-10-27         Positive                            8   
1   2024-07-28         Positive                            9   
2   2024-04-28         Positive                            7   
3   2023-10-29         Positive                            6   
4   2023-07-30         Positive                            5   
5   2023-04-30         Positive                            2   
6   2022-10-30         Negative                           -3   
7   2022-07-31         Negative                           -4   
8   2022-05-01         Negative                           -5   
9   2021-10-31         Positive                            4   
10  2021-08-01         Positive                            5   
11  2021-05-02         Positive                            6   
12  2020-10-25         Positive                            5   
13  2020-07-26         Positive                            4   
14  2020-04-26         Positive         
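
When lining up a rating with the return of the following period, the direction of `shift` depends on how the rows are sorted. `shift(-1)` pulls each row's value from the row below it, which is the later date in an oldest-first table and the earlier date in a newest-first one. The sketch below uses made-up dates and returns to show the two equivalent forms. The column name `Next Period Return` is hypothetical.

```python
import pandas as pd

# Made-up returns; each value is the change from the previous date to this one.
oldest_first = pd.DataFrame({
    "Filing Date": pd.to_datetime(["2024-01-31", "2024-04-30", "2024-07-31"]),
    "Pct Return": [float("nan"), 10.0, -5.0],
})

# Oldest-first: the later date is one row below, so shift(-1) pulls it up.
oldest_first["Next Period Return"] = oldest_first["Pct Return"].shift(-1)

# Newest-first: the later date is one row above, so shift(1) is the equivalent.
newest_first = (
    oldest_first[["Filing Date", "Pct Return"]].iloc[::-1].reset_index(drop=True)
)
newest_first["Next Period Return"] = newest_first["Pct Return"].shift(1)

print(oldest_first)
print(newest_first)
```

Sorting by date in ascending order before shifting is one way to keep the direction unambiguous.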

In [87]:
#Remove the last two rows to capture an accurate correlation %

# 1. Remove rows with NaN in '% Change Next Filing Date':
merged_df = merged_df.dropna(subset=['% Change Next Filing Date'])

# 2. Recalculate the correlation percentage:
correlation_percentage_no_nan = merged_df['Correlation'].sum() / len(merged_df) * 100

# 3. Display the modified DataFrame:
print("\nModified DataFrame:")
print(merged_df)

# 4. Print the result:
print(f"\nCorrelation Percentage (Excluding NaN Rows): {correlation_percentage_no_nan:.2f}%")


Modified DataFrame:
   Filing Date Sentiment Rating  Quantified Sentiment Rating  \
0   2024-10-27         Positive                            8   
1   2024-07-28         Positive                            9   
2   2024-04-28         Positive                            7   
3   2023-10-29         Positive                            6   
4   2023-07-30         Positive                            5   
5   2023-04-30         Positive                            2   
6   2022-10-30         Negative                           -3   
7   2022-07-31         Negative                           -4   
8   2022-05-01         Negative                           -5   
9   2021-10-31         Positive                            4   
10  2021-08-01         Positive                            5   
11  2021-05-02         Positive                            6   
12  2020-10-25         Positive                            5   

    Filing Date Close Price  % Change Last Filing Date  \
0                141.517

**Analysis**
This analysis had Gemini look at each 10Q, specifically the MD&A and Risk Factors sections, and assign it a rating of positive or negative. It then captured whether the stock increased or decreased over the period from when the 10Q was filed until the next 10Q was filed. Surprisingly, the direction of Gemini's sentiment rating matched the direction of the stock's performance over the next period 76.92% of the time (10 of the 13 periods compared). This does not imply causation, however, and could be largely due to NVDA's strong performance over the past 5 years.
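
The percentage above is a directional hit rate: it counts how often the sign of the rating matches the sign of the return. A Pearson coefficient is a complementary measure that also uses the size of the rating and of the return. The sketch below computes both on made-up numbers, not the values from this notebook. The column name `Next Period Return` is hypothetical.

```python
import pandas as pd

# Made-up ratings and returns, for illustration only.
sample = pd.DataFrame({
    "Quantified Sentiment Rating": [8, 5, -3, -4, 6, 2],
    "Next Period Return": [25.0, -13.0, 40.0, -24.0, 30.0, 12.0],
})

rating = sample["Quantified Sentiment Rating"]
returns = sample["Next Period Return"]

# Share of periods where the rating and the return have the same sign.
hit_rate = ((rating > 0) == (returns > 0)).mean() * 100

# Pearson r measures linear association and ranges from -1 to 1.
pearson = rating.corr(returns)

print(f"Hit rate: {hit_rate:.2f}%")
print(f"Pearson r: {pearson:.3f}")
```

With only a handful of quarters, either number is best read as descriptive rather than as evidence of a reliable relationship.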