# Part 2: Data Analytics
***
## Importing Libraries

In [None]:
import json
import wikipedia
import pandas as pd
import requests
import numpy as np

## Step 1: Crawling Real-World Datasets
***
The extracted dataset covers S&P 500 stocks. The S&P 500 is a widely followed equity index that tracks 500 of the largest companies listed on stock exchanges in the United States.

First, the table of S&P 500 companies is scraped from Wikipedia's __[S&P 500 Companies](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)__ page. The columns of interest from this table are: Symbol of the stock (e.g. AAPL for Apple Inc.), Security (i.e. the company name), Global Industry Classification Standard (GICS) sector, and Headquarters Location.

Second, common key metrics used in analysing stocks are scraped from __[Yahoo Finance](https://finance.yahoo.com/)__. The key metrics of interest are: Market Capitalisation, Revenue, Profit Margin, Earnings per Share, Price to Earnings ratio and Price to Earnings Growth ratio. These metrics are scraped by taking each symbol from the Wikipedia table and using it to build the URL of the respective stock's statistics page (e.g. __https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL__ for Apple Inc.).

The two tables are then merged and saved as `SnP500_raw_data.csv` in the directory of this Jupyter notebook.

<div class="alert alert-block alert-info">
<b>Note:</b> The data extracted from Yahoo Finance is accurate as of 02/12/2023.
</div>
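
The URL construction described above can be sketched in isolation. This is a minimal illustration only: `build_statistics_url` is a hypothetical helper name, and the `.` to `-` substitution mirrors the one used in the scraping function in this notebook (BRK.B is used here as an example of a symbol containing a dot).

```python
def build_statistics_url(symbol: str) -> str:
    """Return the Yahoo Finance key-statistics URL for a ticker symbol.

    Hypothetical helper for illustration: a '.' in the symbol is replaced
    with '-' before the URL is built (e.g. BRK.B becomes BRK-B).
    """
    yahoo_symbol = symbol.replace(".", "-")
    return f"https://finance.yahoo.com/quote/{yahoo_symbol}/key-statistics?p={yahoo_symbol}"


# Print the URL that would be requested for two example symbols
for symbol in ["AAPL", "BRK.B"]:
    print(build_statistics_url(symbol))
```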


In [None]:
# extracting the list of S&P 500 companies from Wikipedia
url_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
response = requests.get(url_link)
# read_html returns a list of dataframes; the first one is the table of companies
SnP500_raw = pd.read_html(response.text)[0]
SnP500_raw.head()

In [None]:
def yahoo_api_statistics(symbol: str) -> list:
    '''
    Returns a list of metrics commonly used to evaluate stocks.

            Parameters:
                    symbol (str): Stock symbol; e.g. AAPL for Apple

            Returns:
                    metrics_list (list): List of key metrics in the given order:
                    1. Market Capitalisation (Market_cap)
                    2. Revenue (Revenue)
                    3. Profit Margin (Profit_margin)
                    4. Earnings per Share (EPS)
                    5. Price to Earnings ratio (PE_ratio)
                    6. Price to Earnings Growth ratio (PEG)
    '''

    # replaces any '.' with '-' in the symbol to produce the correct url to scrape data
    symbol = symbol.replace(".", "-")

    statistics_url = f"https://finance.yahoo.com/quote/{symbol}/key-statistics?p={symbol}"
    response = requests.get(statistics_url, headers={'User-agent': 'Mozilla/5.0'})
    df = pd.read_html(response.text)

    # Extracts individual metrics from the tables by position
    Market_cap = df[0].iloc[0, 1]
    Revenue = df[-3].iloc[0, 1]
    Profit_margin = df[5].iloc[0, 1]
    EPS = df[-3].iloc[6, 1]
    PE_ratio = df[0].iloc[2, 1]
    PEG = df[0].iloc[4, 1]

    metrics_list = [Market_cap, Revenue, Profit_margin, EPS, PE_ratio, PEG]

    return metrics_list


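# selecting the required columns so that the symbols are available to the loop below
SnP500_sliced = SnP500_raw[['Symbol', 'Security', 'GICS Sector', 'Headquarters Location']]

# creating an empty list to store the metrics
metrics = []

# loops through the symbols to extract the metrics for each stock
for symbol in SnP500_sliced['Symbol']:

    # try/except to catch any pages with errors
    try:
        metrics.append(yahoo_api_statistics(symbol))
    except Exception:
        # prints the failed symbol and stores a placeholder row so that
        # the metrics stay aligned with the rows of the company table
        print(symbol)
        metrics.append([np.nan] * 6)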


In [None]:
# columns with units, where B = Billions and TTM = Trailing Twelve Months
columns_to_add = ['Market Cap / B', 'Revenue (TTM) / B', 'Profit Margin / %', 'Earnings per Share (TTM) / $', 'Price to Earning ratio (TTM)', 'Price to Earnings Growth ratio (5yr expected)']

# converting lists of lists into a dataframe
metrics = pd.DataFrame(data = metrics, columns = columns_to_add)

# slicing the raw dataframe to acquire the required columns only
SnP500_sliced = SnP500_raw[['Symbol', 'Security', 'GICS Sector', 'Headquarters Location']]

# merging the two dataframes
SnP500_metrics_raw = pd.concat([SnP500_sliced,metrics], axis = 1)

# saving the merged dataframe as csv file in the directory of this Jupyter Notebook
SnP500_metrics_raw.to_csv("SnP500_raw_data.csv", index= False)
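
The positional `pd.concat` above relies on both dataframes having the same number of rows in the same order. As a hedged alternative, the sketch below merges on the `Symbol` column instead, assuming the symbol is stored with each scraped row. The company names and values are made-up placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for the company table and the scraped metrics (placeholder values)
companies = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "Security": ["Alpha Corp", "Beta Inc", "Gamma Ltd"],
})
scraped = pd.DataFrame({
    "Symbol": ["AAA", "CCC"],  # 'BBB' is missing, as if its page failed to scrape
    "Market Cap / B": ["1.2T", "45.6B"],
})

# A left merge keeps every company and leaves NaN where no metrics were scraped
merged = companies.merge(scraped, on="Symbol", how="left")
print(merged)
```

With a keyed merge, a failed scrape leaves a gap in that company's row rather than shifting the metrics of every company after it.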

## Step 2: Data Preparation & Cleaning
***
The following steps are performed to prepare the data:
- Units of the key metrics are standardised for all stocks and removed from each cell (e.g. for Market Capitalisation, any value in T (Trillions) is converted to B (Billions), and 302.1B is changed to 302.1 since the units are stated in the column headings)
- Null values are identified using the open-source data observability tool great_expectations, and the affected rows are either removed or filled in via manual calculations
- The list of companies is sorted from the largest Market Capitalisation to the smallest and a ranking index is produced
- Metrics are rounded to 4 significant figures
- Column headings are simplified where possible (e.g. Security is changed to Company)
- Headquarters locations of the companies are standardised (e.g. for US-based companies only the state is given, which can be standardised by replacing it with the country)
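
Several of the steps above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's actual cleaning code: `to_billions` and `round_sig` are hypothetical helpers, the suffixes T, B, M and k are assumed to be the only units present, the sample rows are made up, and the null count uses plain pandas rather than great_expectations.

```python
import math

import pandas as pd

# Multipliers that express a value in billions (assumed suffixes: T, B, M, k)
_TO_BILLIONS = {"T": 1e3, "B": 1.0, "M": 1e-3, "k": 1e-6}


def to_billions(value):
    """Convert a string such as '2.95T' or '302.1B' to a float in billions."""
    if not isinstance(value, str):
        return float("nan")
    text = value.strip().replace(",", "")
    if text and text[-1] in _TO_BILLIONS:
        try:
            return float(text[:-1]) * _TO_BILLIONS[text[-1]]
        except ValueError:
            return float("nan")
    return float("nan")  # unrecognised format is treated as missing


def round_sig(x, sig=4):
    """Round x to `sig` significant figures; NaN and 0 pass through unchanged."""
    if x == 0 or math.isnan(x):
        return x
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))


# Made-up sample rows for illustration only
raw = pd.DataFrame({
    "Security": ["Alpha Corp", "Beta Inc", "Gamma Ltd"],
    "Market Cap / B": ["302.1B", "2.95T", "500M"],
})

# Simplify the heading, strip the units and round to 4 significant figures
clean = raw.rename(columns={"Security": "Company"})
clean["Market Cap / B"] = clean["Market Cap / B"].map(to_billions).map(round_sig)

# Sort from largest to smallest market cap and add a 1-based ranking column
clean = clean.sort_values("Market Cap / B", ascending=False).reset_index(drop=True)
clean.insert(0, "Rank", clean.index + 1)

print(clean.isna().sum())  # per-column null count (plain pandas)
print(clean)
```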

## Step 3: Exploratory Analysis
***

## Step 4: Investigating the Dataset with Questions
***

## Step 5: Conclusion
***