***NOTEBOOK 1 : DATA COLLECTION AND SENTIMENT ANALYSIS***

**DETAILS : COMPANIES STOCKS INFO AND DATA SCRAPING**

In order to get the dataset required for sentiment analysis on financial stocks, we first install the **Finviz API** and the **YFinance API**.

We use the Finviz API to download the following information for each company:
|__ Ticker : The ticker symbol of the company
|__ Company Name : The name of the company behind the ticker
|__ Sector : The sector the company belongs to
|__ Industry : The industry the company is part of
|__ Description : A description of what the company does
|__ Charts : A link to the chart image of the company's stock
|__ Country : The country in which the company is located
|__ News : A dataframe of compact news on the company with 3 columns (|__ date	|__ title	|__ link)

We use the yfinance API to download each company's stock prices for the days on which we have news, in order to compare the sentiment values with their effect on the stock prices. The data is downloaded as a dataframe (table name "trade") with the following columns:
(\_ date	\_ open 	\_ close  \_ volume). We then append	(|__ company_name	|__ industry_name	|__ sector_name) to the initial trade dataframe in order to properly identify the values during sentiment analysis and model training.
	

**STEP 1.1: INSTALLING REQUIRED PACKAGES**

**INSTALLING THE FINVIZ API**

In [None]:
!pip install finviz
!pip install finvizfinance

**INSTALLING THE YFINANCE API**

In [None]:
!pip install yfinance

**STEP 1.2 : IMPORTING LIBRARIES**

**IMPORTING NECESSARY LIBRARIES**

In [1]:
# Import necessary libraries

# Used to (optionally) filter the news down to today's date
import datetime

# Importing the finviz library
import finviz
from finviz.screener import Screener
# The finvizfinance class lives in the quote submodule; importing only the
# top-level package would make the later finvizfinance(ticker) calls fail
from finvizfinance.quote import finvizfinance

# Importing Pandas Library for dataframe manipulations
import pandas as pd

# Libraries used for Sentiment Analysis
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# We use the warnings library to ignore FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\liemu\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


**STEP 1.3 : USING FINVIZ SCREENER TO SCRAPE COMPANIES CSV FILE**

**Reading the FINVIZ stocks data "finviz_stocks_list.csv"**

We instantiate the Finviz Screener class; this loads every company's name, ticker symbol, Sector, Industry, Country, Market Cap, PE, Price, Change, and Volume.
We then save this information in CSV format as **"finviz_stocks_list.csv"** so we can use it later for other tasks.

In [None]:
stock_list = Screener()
stock_list.to_csv("finviz_stocks_list.csv")

**STEP 1.3.1 : CONVERTING THE FINVIZ CSV FILE TO DICTIONARY**

From the **finviz_stocks_list.csv** file we just downloaded, we are only interested in the ("Ticker", "Company", "Sector", "Industry", "Country") values. We load these into a pandas dataframe **df** and convert it into a dictionary **result**, keyed by ticker symbol, so we can easily attach other values to it.

In [2]:
# Reading the FINVIZ stocks data 
df = pd.read_csv('./finviz_stocks_list.csv')
df = df[["Ticker", "Company", "Sector", "Industry", "Country"]]

# Convert the DataFrame to a dictionary
df_dict = df.to_dict(orient='records')
# Create a new dictionary with Ticker as the key
result = {
          row['Ticker']: {'Company': row['Company'], 'Sector': row['Sector'],
                         'Industry': row['Industry'], 'Country': row['Country']}
          for row in df_dict}

The result dict has the following format:

{
  'AA': {'Company': 'Alcoa Corporation',
  'Sector': 'Basic Materials',
  'Industry': 'Aluminum',
  'Country': 'USA'},

 'AAAU': {'Company': 'Goldman Sachs Physical Gold ETF',
  'Sector': 'Financial',
  'Industry': 'Exchange Traded Fund',
  'Country': 'USA'},
  
  ...}
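
The same ticker-keyed dictionary can also be built without the intermediate list of records. The sketch below is only an illustration: it uses a small in-memory dataframe in place of the CSV file, and `sample_df` and `sample_result` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two illustrative rows standing in for finviz_stocks_list.csv
sample_df = pd.DataFrame({
    "Ticker": ["AA", "AAAU"],
    "Company": ["Alcoa Corporation", "Goldman Sachs Physical Gold ETF"],
    "Sector": ["Basic Materials", "Financial"],
    "Industry": ["Aluminum", "Exchange Traded Fund"],
    "Country": ["USA", "USA"],
})

# Index by ticker, then turn each row into a {column: value} dictionary
sample_result = sample_df.set_index("Ticker").to_dict(orient="index")

print(sample_result["AA"]["Company"])
```

Note that `to_dict(orient="index")` requires the ticker symbols to be unique, whereas the dictionary comprehension used in the cell above silently keeps the last row for a repeated ticker.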

In [None]:
result

**STEP 1.4 : USING FINVIZFINANCE TO SCRAPE COMPANIES DESCRIPTION, CHARTS, AND NEWS**

**GETTING THE STOCKS DESCRIPTION, STOCK CHARTS, AND STOCK NEWS**

**Next Step:**

After the step above, we use the finvizfinance methods to download additional company data. These methods are:

**ticker_description** : Downloads the company's description (what the company does). We attach this value as **"description"** to every ticker symbol in the result dictionary above.

**ticker_charts** : Downloads the company's stock chart (trading history chart). We attach this value as **"charts"** to every ticker symbol in the result dictionary above.

**ticker_news** : Downloads the company's news data from the finviz website (in compact format: around 100 news items for each company) as a dataframe. We attach this value as **"news"** to every ticker symbol in the result dictionary above.

In [None]:
# Getting each company's description, charts, and news
error_symbol = []
for key in result:
  try:
    stock = finvizfinance(key)
    # Stock description
    description = stock.ticker_description()
    result[key]["description"] = description
    # Stock charts
    result[key]["charts"] = stock.ticker_charts(urlonly=True)
    # Stock News
    news = stock.ticker_news()
    news['Date'] = news['Date'].dt.date
    # To keep only today's news, uncomment the two lines below
    # today = datetime.datetime.now().date()
    # news = news[news['Date'] == today]
    result[key]["news"] = news
  except Exception:
    # Record the ticker so the failure can be investigated later
    error_symbol.append(key)

While downloading this data (ticker_description, ticker_charts, and ticker_news), we might hit errors for some ticker symbols. To investigate them, we append every failing symbol to the **error_symbol** list.

In [None]:
len(error_symbol)

After checking the ticker symbols in the error list against the finviz site, we found that some of them are no longer valid and have no data on the site, while others failed because of a network glitch.

To fix this, we iterate through the **error_symbol** list and scrape their information once more, so that we get all the data needed.
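
For failures caused by network glitches, one option is to retry each call a few times before giving up on a ticker. The helper below is a minimal sketch of that idea, not the approach used in this notebook: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` merely stands in for a scraper call that fails twice before succeeding.

```python
import time

def fetch_with_retries(fetch, key, attempts=3, delay_seconds=1.0):
    """Call fetch(key), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(key)
        except Exception as error:  # network glitches and invalid tickers land here
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay_seconds * attempt)
    raise last_error

# Stand-in for a scraper call that fails twice before succeeding
calls = {"count": 0}

def flaky_fetch(key):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network glitch")
    return f"description for {key}"

description = fetch_with_retries(flaky_fetch, "AA", delay_seconds=0)
print(description, calls["count"])
```

A ticker that is genuinely invalid still fails after the last attempt, so it would still need to be collected in an error list.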

**STEP 1.4.1 : RE-USING FINVIZFINANCE TO SCRAPE COMPANIES DESCRIPTION, CHARTS, AND NEWS**

**DIDN'T GET ALL THE COMPANIES' INFORMATION EARLIER**

In [None]:
# Re-downloading the description, charts, and news for tickers that failed earlier
tick = []
# Iterating through the error_symbol list
for key in error_symbol:
  try:
    stock = finvizfinance(key)
    # Stock description
    # Download and attach the description only if the ticker does not have one yet
    if 'description' not in result[key]:
        description = stock.ticker_description()
        result[key]["description"] = description
    # Stock charts
    # Download and attach the charts link only if the ticker does not have one yet
    if 'charts' not in result[key]:
        result[key]["charts"] = stock.ticker_charts(urlonly=True)
    # Stock News
    # Download and attach the news only if the ticker does not have it yet
    if 'news' not in result[key]:
        news = stock.ticker_news()
        news['Date'] = news['Date'].dt.date
        # To keep only today's news, uncomment the two lines below
        # today = datetime.datetime.now().date()
        # news = news[news['Date'] == today]
        result[key]["news"] = news
  except Exception:
    # Tickers that fail again are collected for inspection
    tick.append(key)

In [None]:
len(tick)

The cell above downloads all the requested data for the tickers in the error_symbol list that are still valid; tickers that fail again are collected in the **tick** list.

In [None]:
result

**STEP 1.5: GETTING THE NEWS SENTIMENT SCORES**


In order to perform sentiment analysis on each news headline for each ticker symbol, we use **VADER**, a submodule of the Natural Language ToolKit (**NLTK**).

We chose the VADER submodule because we are dealing with an unlabelled text dataset, and this module calculates both the positive/negative polarity and the intensity (strength) of emotion, which will all be helpful later on in this project.



In [None]:
# Getting the sentiment scores for every news headline
# Instantiating the vader SentimentIntensityAnalyzer once, outside the loop
vader = SentimentIntensityAnalyzer()

# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "news" in the result dictionary, and proceed if it does else skip
  if 'news' in result[key]:
    # This is the same dataframe object stored in result, so new columns are added in place
    news = result[key]['news']

    # Stock sentiment calculation
    # Calculate the compound, negative, positive, and neutral polarity value for each headline
    # and attach them to the "compound", "neg", "pos", and "neu" columns of the "news" dataframe
    for column in ['compound', 'neg', 'pos', 'neu']:
      news[column] = news['Title'].apply(lambda title: vader.polarity_scores(title)[column])

    # Getting the sentiment score in the range (-10 to +10) for every headline in the "news" dataframe:
    # multiply the "compound" value by 10, round it to the nearest whole number,
    # and convert the data type from float to integer
    news['sentiment'] = (news['compound'] * 10).round().astype(int)
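
To make the scaling step concrete, the sketch below applies it to a few compound scores. The values are made up for illustration and are not taken from the scraped news; `compound` and `sentiment` here are standalone variables, not the notebook's dataframe columns.

```python
import pandas as pd

# Illustrative compound scores in VADER's [-1, 1] range (made-up values)
compound = pd.Series([0.4404, -0.5719, 0.0, 0.9])

# Scale to [-10, 10], round to the nearest whole number, store as integers
sentiment = (compound * 10).round().astype(int)

print(sentiment.tolist())
```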

In [None]:
result

Now we have the news for each company and their corresponding sentiment scores


**STEP 1.6: GETTING THE AVERAGE SENTIMENT SCORES BY DAYS AND ATTACHING IT TO THE "comp" VALUE IN EACH "ticker"**



This enables us to compare the daily sentiment scores with the trading data to be downloaded from yfinance. We group the sentiment scores by day because the data we download from yfinance comes at **daily** resolution, without a **time** of day.

Hence we group all the news scores by day and take the mean of the "compound", "neg", "pos", "neu", and "sentiment" values in the "news" dataframe.

This lets us check whether there is a relationship between the sentiment scores and stock prices.

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "news" in the result dictionary, and proceed if it does else skip
  if 'news' in result[key]:
    # Attach the mean values by day of the "compound", "neg", "pos", "neu" and "sentiment" columns
    # in the "news" dataframe to the "comp" dataframe
    # numeric_only=True skips text columns such as "Title", which cannot be averaged
    result[key]['comp'] = result[key]['news'].groupby('Date').mean(numeric_only=True)
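
As a small illustration of the grouping step, the sketch below averages the scores of three headlines spread over two days. The dates, titles, and scores are made up, and `news` and `daily` are standalone variables rather than entries of the result dictionary.

```python
import datetime
import pandas as pd

day1 = datetime.date(2023, 1, 2)
day2 = datetime.date(2023, 1, 3)

# Three made-up headlines: two on the first day, one on the second
news = pd.DataFrame({
    "Date": [day1, day1, day2],
    "Title": ["headline a", "headline b", "headline c"],
    "compound": [0.25, 0.75, -0.5],
    "sentiment": [2, 8, -5],
})

# One row per day; text columns are dropped because they cannot be averaged
daily = news.groupby("Date").mean(numeric_only=True)

print(daily)
```

Each day ends up with one row holding the plain arithmetic mean of that day's scores, so a day with many headlines and a day with a single headline carry the same weight in the later comparison with prices.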

**STEP 1.7: GETTING THE START AND END DATE FROM THE 'comp' KEY IN EACH TICKER TO ENABLE US DOWNLOAD THE STOCKS PRICING DATA WITHIN THOSE RANGES**



Next, we extract the first date and the most recent date in the "comp" dataframe and attach them as "start_date" and "latest_date" respectively, so that we download the stock prices for that date range only.

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "comp" in the result dictionary, and proceed if it does else skip
  if 'comp' in result[key]:
    # Extract the first date, convert it to date string format, and attach it to the "start_date" value of the "ticker" symbol in result dictionary
    result[key]['start_date'] = result[key]['comp'].index[0].strftime('%Y-%m-%d')
    # Extract the most recent date, convert it to date string format, and attach it to the "latest_date" value of the "ticker" symbol in result dictionary
    result[key]['latest_date'] = result[key]['comp'].index[-1].strftime('%Y-%m-%d')
    # Extract the most recent sentiment score, and attach it to the "sentiment" value of the ticker symbol in result dictionary
    # which would be used later on while constructing the Knowledge Graph
    # .iloc[-1] selects the last row by position, since the index holds dates rather than integers
    result[key]['sentiment'] = result[key]['comp']['sentiment'].iloc[-1]
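
Taking the first and last index entries relies on the "comp" dataframe being sorted by date, which holds here because `groupby` sorts its keys by default. The sketch below shows a variant that does not depend on the row order; the dates and scores are made up, and `comp` is a standalone dataframe used only for illustration.

```python
import datetime
import pandas as pd

# Made-up daily sentiment averages indexed by date
comp = pd.DataFrame(
    {"sentiment": [3.0, -1.5, 4.0]},
    index=pd.Index(
        [datetime.date(2023, 1, 2), datetime.date(2023, 1, 3), datetime.date(2023, 1, 5)],
        name="Date",
    ),
)

# Earliest and latest dates, independent of row order
start_date = min(comp.index).strftime("%Y-%m-%d")
latest_date = max(comp.index).strftime("%Y-%m-%d")

# Last row by position
latest_sentiment = comp["sentiment"].iloc[-1]

print(start_date, latest_date, latest_sentiment)
```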


In [None]:
# Confirm our work
result['AA']

**STEP 1.8 : DOWNLOADING THE STOCKS PRICING DATA FROM YFINANCE**



In this step we take the ("ticker", "start_date", and "latest_date") of each ticker symbol in the result dictionary and use them to download the daily ['Open', 'Close', 'Volume'] values for each stock. We then attach the resulting dataframe to the "trade" key of each "ticker" key in result.

In [None]:
# Importing the yfinance library
import yfinance as yf
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Proceed only if the ticker symbol has a value for "comp" and does not yet have a value for "trade"
  if 'comp' in result[key] and 'trade' not in result[key]:
    try:
      # Downloading the "Open", "Close", and "Volume" value for each stock by days
      # Note: yfinance treats the end date as exclusive, so the row for "latest_date" itself is not returned
      data = yf.download(key, start=result[key]['start_date'], end=result[key]['latest_date'], progress=False)[['Open', 'Close', 'Volume']]
      # Resetting the index to integers and not dates
      data = data.reset_index()
      # Attaching the trade dataframe to the "trade" key for each "ticker" symbol in the result dictionary
      result[key]['trade'] = data
    except Exception:
      print(f"no data for {key} symbol between the dates {result[key]['start_date']} - {result[key]['latest_date']}")

In [None]:
result['AA']

**STEP 1.9: CHANGING THE DATA TYPE OF THE 'news', 'comp', AND 'trade' IN result DICTIONARY FROM DATAFRAMES TO DICTIONARIES**

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "news" in the result dictionary, and proceed if it does else skip
  if 'news' in result[key]:
    try:
      # Converting the "news" of each ticker in result dictionary from dataframe format to dictionary format
      result[key]['news'] = result[key]['news'].to_dict(orient='records')
    except AttributeError:
      # A list has no to_dict method, so this means the conversion already happened
      print(f"check if {key} news has been converted already to dict")

  # Check if the ticker symbol has a value for "comp" in the result dictionary, and proceed if it does else skip
  if 'comp' in result[key]:
    # Converting the "comp" of each ticker in result dictionary from dataframe format to dictionary format
    result[key]['comp'] = result[key]['comp'].reset_index().to_dict(orient='records')

  # Check if the ticker symbol has a value for "trade" in the result dictionary, and proceed if it does else skip
  if 'trade' in result[key]:
    # Converting the "trade" of each ticker in result dictionary from dataframe format to dictionary format
    result[key]['trade'] = result[key]['trade'].to_dict(orient='records')

**STEP 1.9.1: CHANGING THE DATE TYPE FORMAT TO STRING**



Now we convert the dates in the "news", "comp", and "trade" records of each ticker symbol in the result dictionary to date strings.

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
    # Check if the ticker symbol has a value for "news" in the result dictionary, and proceed if it does else skip
    if 'news' in result[key].keys():
        # Iterate through the length of the news data
        for i in range(len(result[key]['news'])):
            try: 
                # Convert Date to Date string format
                result[key]['news'][i]['Date'] = result[key]['news'][i]['Date'].strftime('%Y-%m-%d')
            except Exception:
                # The Date has already been converted, so there is nothing to do
                pass

    # Check if the ticker symbol has a value for "comp" in the result dictionary, and proceed if it does else skip
    if 'comp' in result[key].keys():
        # Iterate through the length of the comp data
        for i in range(len(result[key]['comp'])):
            try: 
                # Convert Date to Date string format
                result[key]['comp'][i]['Date'] = result[key]['comp'][i]['Date'].strftime('%Y-%m-%d')
            except Exception:
                # The Date has already been converted, so there is nothing to do
                pass

    # Check if the ticker symbol has a value for "trade" in the result dictionary, and proceed if it does else skip
    if 'trade' in result[key].keys():
        # Iterate through the length of the trade data
        for i in range(len(result[key]['trade'])):
            # Convert Date to Date string format
            result[key]['trade'][i]['Date'] = result[key]['trade'][i]['Date'].strftime('%Y-%m-%d')
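
An alternative to catching the exception is to check the type before converting, which makes the cell safe to run more than once. This is a minimal sketch with made-up records; `date_to_string` and `records` are hypothetical names that are not part of the notebook.

```python
import datetime

def date_to_string(value):
    """Return value as 'YYYY-MM-DD'; leave already converted strings untouched."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.strftime("%Y-%m-%d")
    return value

records = [
    {"Date": datetime.date(2023, 1, 2), "sentiment": 4},
    {"Date": "2023-01-03", "sentiment": -1},  # already a string
]

for record in records:
    record["Date"] = date_to_string(record["Date"])

print([record["Date"] for record in records])
```

Since pandas timestamps are a subclass of `datetime.datetime`, the same check also covers the "Date" values of the "trade" records.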

**STEP 1.9.2: APPENDING THE COMPANY NAME, INDUSTRY, SECTOR, AND COUNTRY TO EACH "trade" RECORD**



Now we append the company name, industry name, sector name, and country name to each record in the "trade" list of each ticker symbol. This enables us to correctly identify the trade information in a later task, when concatenating the "trade" and "comp" data.

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "trade" in the result dictionary, and proceed if it does else skip
  if 'trade' in result[key].keys():
    # Iterate through the length of the trade data
    for i in range(len(result[key]['trade'])):
      # Append the name of the Company to each value in the 'trade' list
      result[key]['trade'][i]['Company'] = result[key]['Company']
      # Append the name of the Industry to each value in the 'trade' list
      result[key]['trade'][i]['Industry'] = result[key]['Industry']
      # Append the name of the Sector to each value in the 'trade' list
      result[key]['trade'][i]['Sector'] = result[key]['Sector']
      # Append the name of the Country to each value in the 'trade' list
      result[key]['trade'][i]['Country'] = result[key]['Country']
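
The four assignments can also be written as a single `dict.update` call per record. The sketch below uses made-up trade values; `company_info` and `trade_records` are hypothetical names standing in for one ticker's entries in the result dictionary.

```python
# Metadata of one ticker, as stored at the top level of its result entry
company_info = {
    "Company": "Alcoa Corporation",
    "Sector": "Basic Materials",
    "Industry": "Aluminum",
    "Country": "USA",
}

# Made-up trade records for illustration only
trade_records = [
    {"Date": "2023-01-02", "Open": 10.0, "Close": 11.0, "Volume": 1000},
    {"Date": "2023-01-03", "Open": 11.0, "Close": 10.5, "Volume": 1200},
]

# Copy every metadata field into each trade record
for record in trade_records:
    record.update(company_info)

print(trade_records[0])
```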

**STEP 1.9.3: APPENDING THE COMPANY NAME, INDUSTRY, SECTOR, AND COUNTRY TO EACH 'comp' RECORD**



We also append the company name, industry name, sector name, and country name to each record in the "comp" list of each ticker symbol. This enables us to correctly identify the sentiment information in a later task, when concatenating the "trade" and "comp" data.

In [None]:
# Iterating through all ticker symbols in the result dictionary
for key in result:
  # Check if the ticker symbol has a value for "comp" in the result dictionary, and proceed if it does else skip
  if 'comp' in result[key].keys():
    # Iterate through the length of the comp data
    for i in range(len(result[key]['comp'])):
      # Append the name of the Company to each value in the 'comp' list
      result[key]['comp'][i]['Company'] = result[key]['Company']
      # Append the name of the Industry to each value in the 'comp' list
      result[key]['comp'][i]['Industry'] = result[key]['Industry']
      # Append the name of the Sector to each value in the 'comp' list
      result[key]['comp'][i]['Sector'] = result[key]['Sector']
      # Append the name of the Country to each value in the 'comp' list
      result[key]['comp'][i]['Country'] = result[key]['Country']
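
One possible way to combine the two record lists later is an inner join on the "Date" string, which keeps only the days that have both a sentiment value and a trade row. This is a hedged sketch with made-up values, not the method used in the later notebooks; `comp_records`, `trade_records`, and `merged` are hypothetical names.

```python
import pandas as pd

# Made-up daily sentiment records
comp_records = [
    {"Date": "2023-01-02", "sentiment": 4.0},
    {"Date": "2023-01-03", "sentiment": -1.0},
]

# Made-up daily trade records
trade_records = [
    {"Date": "2023-01-03", "Open": 10.0, "Close": 11.0, "Volume": 1000},
    {"Date": "2023-01-04", "Open": 11.0, "Close": 10.5, "Volume": 1200},
]

# Keep only the dates present in both lists
merged = pd.merge(
    pd.DataFrame(comp_records),
    pd.DataFrame(trade_records),
    on="Date",
    how="inner",
)

print(merged)
```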

In [None]:
result

**STEP 1.10: SAVING THE DICTIONARY AS A PICKLE FILE**



Finally, we save our result dictionary as a pickle file so we can use it later for other exploration tasks.

In [None]:
import pickle
# This will create a binary file called finviz.pickle and store all result dictionary info in the file
with open("./finviz.pickle", "wb") as f:
    pickle.dump(result, f)
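
Reading the file back works the same way with `pickle.load`. The sketch below round-trips a small made-up dictionary through a temporary file so that it does not touch finviz.pickle; `sample`, `path`, and `restored` are hypothetical names.

```python
import os
import pickle
import tempfile

# A tiny stand-in for the result dictionary
sample = {"AA": {"Company": "Alcoa Corporation", "sentiment": 4}}

# Write to a temporary location instead of the notebook's real file
path = os.path.join(tempfile.mkdtemp(), "finviz_sample.pickle")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Load the dictionary back from disk
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == sample)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.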