<h1>Scraping Fidelity.com</h1>
In this assignment, you will scrape data from fidelity.com. The goal is to get the latest sector performance data for the US markets, including the total market capitalization of each sector.

Your task is to write a function, <i>get_us_sector_performance()</i>, that returns a list of tuples. Each tuple corresponds to one sector and contains the following data:
<li>the sector name
<li>the amount the sector has moved
<li>the market capitalization of the sector
<li>the market weight of the sector
<li>a link to the fidelity page for that sector

<p>
The list should be sorted in decreasing order of market weight, i.e., the sector with the highest weight should be in the first tuple.

<h3>Process</h3>
<li>Get a list of sectors and the links to the sector detail pages from the url (see function)
<li>Loop through the list and call the function <i>get_sector_change_and_market_cap(sector_page_link)</i> for each sector
<li>Accumulate the name, the change, the capitalization, the weight and the link for each sector in output_list (see function)
<li>Sort the list by market weight
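
The loop, accumulate and sort steps can be tried out without touching the network. This is only a minimal sketch: <i>build_sector_table</i> and <i>fake_details</i> are hypothetical names, and the sample figures are invented for illustration. In the real function, the details would come from <i>get_sector_change_and_market_cap(sector_page_link)</i>.

```python
def build_sector_table(sectors, fetch_details):
    """Accumulate one tuple per sector and sort by market weight, largest first.

    sectors: iterable of (name, link) pairs.
    fetch_details: callable taking a link and returning (change, market_cap, weight).
    """
    output_list = []
    for name, link in sectors:
        # look up the change, market cap and weight for this sector
        change, market_cap, weight = fetch_details(link)
        output_list.append((name, change, market_cap, weight, link))
    # element 3 of each tuple is the market weight
    output_list.sort(key=lambda row: row[3], reverse=True)
    return output_list


# Hypothetical stand-in for the real page scraper; the values are invented.
def fake_details(link):
    sample = {
        "/sector-a": (0.5, "$1.00T", 10.0),
        "/sector-b": (-0.2, "$3.00T", 30.0),
    }
    return sample[link]


table = build_sector_table(
    [("Sector A", "/sector-a"), ("Sector B", "/sector-b")], fake_details
)
print([row[0] for row in table])  # ['Sector B', 'Sector A']
```

Passing the fetching function in as an argument keeps the sorting logic testable even when the site is unreachable.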

<b>Notes:</b>
<li>The market weight is a string ending in a % sign. You will need to strip the % and convert the string to a float before you can sort by it
<li>Your starting data is the url listed below. You need to extract all data, including links to the sector pages, from the page at this url
<li>To sort a list of tuples by an arbitrary element, use the example at the bottom of this notebook
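
The conversion matters because strings compare character by character, not numerically. A minimal sketch of the problem and the fix follows; <i>percent_to_float</i> is a hypothetical helper name, and the sample weights are taken from outputs shown in this notebook.

```python
def percent_to_float(text):
    """Convert a string such as '6.80%' to the float 6.8."""
    return float(text.strip().rstrip("%"))


weights = ["6.80%", "26.75%", "11.66%"]

# Sorting the raw strings compares characters, so '6.80%' wrongly comes first.
print(sorted(weights, reverse=True))   # ['6.80%', '26.75%', '11.66%']

# Sorting the converted floats gives the intended numeric order.
print(sorted(map(percent_to_float, weights), reverse=True))   # [26.75, 11.66, 6.8]
```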

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
def get_us_sector_performance():
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    output_list = []
    response = requests.get(url)
    if response.status_code != 200:
        return None
    try:
        results_page = BeautifulSoup(response.content, 'lxml')
        table = results_page.find('tbody', id="tbody_id")
        names = table.find_all('div', class_="heading")
        for name in names:
            sector_name = name.find('a', class_="heading1").get_text()   # sector name
            link = name.find('a').get('href')   # relative link to the page for this sector
            new_link = "https://eresearch.fidelity.com" + link
            sector_change, sector_market_cap, sector_market_weight = get_sector_change_and_market_cap(new_link)
            # one tuple per sector: (name, change, market cap, market weight, link)
            output_list.append((sector_name, sector_change, sector_market_cap, sector_market_weight, new_link))
        output_list.sort(key=lambda a: a[3], reverse=True)   # sort by market weight, highest first
        return output_list
    except Exception:
        # scraping failed, e.g. because the page layout changed
        return None

In [3]:
def get_sector_change_and_market_cap(sector_page_link):
    new_response = requests.get(sector_page_link)
    new_page = BeautifulSoup(new_response.content, 'lxml')
    # the code reads the first row of the second <tbody> on the page, which is expected
    # to hold the change, the market cap and the market weight, in that order
    table = new_page.find_all('tbody')
    tr = table[1].find_all('tr')
    data = tr[0].find_all('td')
    sector_data = [i.find('span').get_text() for i in data]

    sector_change = float(sector_data[0].replace("%", ""))          # strip the % sign and convert
    sector_market_cap = sector_data[1]                              # kept as a string, e.g. '$3.58T'
    sector_market_weight = float(sector_data[2].replace("%", ""))   # strip the % sign and convert

    return sector_change, sector_market_cap, sector_market_weight

In [4]:
#Test get_sector_change_and_market_cap()
link = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
get_sector_change_and_market_cap(link)
#Should return a tuple such as (-0.40, '$3.58T', 6.80); the exact values will vary over time
#Note that neither the -0.40, nor the 6.80, end with a % sign

(-0.04, '$6.49T', 11.66)

In [5]:
#Test get_us_sector_performance()
get_us_sector_performance()

[('Information Technology',
  -0.2,
  '$11.29T',
  26.75,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=45'),
 ('Health Care',
  0.05,
  '$7.06T',
  14.87,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=35'),
 ('Consumer Discretionary',
  -0.04,
  '$6.49T',
  11.66,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25'),
 ('Financials',
  0.41,
  '$6.90T',
  10.95,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=40'),
 ('Communication Services',
  0.11,
  '$4.03T',
  8.08,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=50'),
 ('Industrials',
  -0.03,
  '$4.46T',
  7.83,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=20'),
 

In [382]:
#Running the above cell should return (example: obviously the results will vary over time!)
"""
[('Information Technology',
  2.16,
  '$14.64T',
  28.02,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=45'),
 ('Health Care',
  0.19,
  '$7.60T',
  13.22,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=35'),
 ('Consumer Discretionary',
  1.96,
  '$8.32T',
  11.94,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25'),
 ('Financials',
  0.63,
  '$8.83T',
  11.32,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=40'),
 ('Communication Services',
  0.78,
  '$5.74T',
  10.01,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=50'),
 ('Industrials',
  0.43,
  '$5.58T',
  8.06,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=20'),
 ('Consumer Staples',
  -0.16,
  '$4.26T',
  6.27,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=30'),
 ('Energy',
  0.12,
  '$3.19T',
  3.27,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=10'),
 ('Real Estate',
  0.69,
  '$1.71T',
  2.72,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=60'),
 ('Utilities',
  0.55,
  '$1.63T',
  2.6,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=55'),
 ('Materials',
  0.25,
  '$2.53T',
  2.58,
  'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=15')]
  
"""

In [350]:
#Print the different sector names from the above output in alphabetical order

In [6]:
def sorted_names():
    output = get_us_sector_performance()    # list of sector tuples from the previously defined function
    sector_names = [i[0] for i in output]   # keep only the sector names
    return sorted(sector_names)             # sort alphabetically with the built-in sorted() and return the list

In [7]:
sorted_names()

['Communication Services',
 'Consumer Discretionary',
 'Consumer Staples',
 'Energy',
 'Financials',
 'Health Care',
 'Industrials',
 'Information Technology',
 'Materials',
 'Real Estate',
 'Utilities']

In [8]:
### Find the sector with the lowest market cap by creating a dict of all sectors and their market caps
##    example of the dict: {'Materials': '$2.01T', ...}
def dict_sectors():
    output = get_us_sector_performance()            # list of sector tuples from the previously defined function
    dict_of_values = {i[0]: i[2] for i in output}   # sector name -> market cap string
    # compare the caps numerically; this assumes every cap has the form '$<number>T'
    lowest_sector = min(dict_of_values,
                        key=lambda name: float(dict_of_values[name].replace('$', '').replace('T', '')))
    return lowest_sector                            # name of the sector with the lowest market cap

In [9]:
dict_sectors()

'Real Estate'
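
Market caps are strings such as '$1.71T', so they need a numeric conversion before they can be compared. The sketch below is one hedged way to do that. It assumes the values may carry a 'T', 'B' or 'M' suffix, although only 'T' appears in the outputs in this notebook, and <i>market_cap_to_float</i> is a hypothetical helper name. The sample values are copied from the example output quoted earlier, so the result differs from the live run above.

```python
# Multipliers for the suffixes this sketch assumes; only 'T' is confirmed by the notebook outputs.
SUFFIXES = {"T": 1e12, "B": 1e9, "M": 1e6}


def market_cap_to_float(text):
    """Convert a string such as '$1.71T' to a number of dollars."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    multiplier = SUFFIXES.get(cleaned[-1].upper())
    if multiplier is None:
        # no recognised suffix: treat the value as plain dollars
        return float(cleaned)
    return float(cleaned[:-1]) * multiplier


caps = {"Real Estate": "$1.71T", "Utilities": "$1.63T", "Materials": "$2.53T"}

# min() with a key picks the sector whose converted cap is smallest
lowest = min(caps, key=lambda name: market_cap_to_float(caps[name]))
print(lowest, caps[lowest])   # Utilities $1.63T
```

Converting to dollars rather than stripping the 'T' keeps the comparison correct if a sector is ever quoted in billions.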

<h3>Hint: How to sort tuples by an arbitrary element?</h3>

In [355]:
x = [('a',23.2,'b'),('c',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]
#Sort by the first element of the tuple
sorted(x, key=lambda t: t[0])


In [356]:
x = [('a',23.2,'b'),('c',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]
#Sort by the element at position 1
sorted(x, key=lambda t: t[1])
