# Lab Question 5

So far, when scraping multiple pages we've saved each page's extracted data to a separate file.
However, to do analysis, we usually want all the data in a single file or database for easy loading and processing.

Update your solution to question 4 so that each ticker is a row in a Pandas DataFrame, rather than a separate file.
The columns should be named `"ticker"` and `"market_cap"`.

For example, the first row of the DataFrame might be:

| ticker | market_cap |
| ----- | ----- | 
| AAPL | 2.118T |

### Solution
There are several approaches to this, but the simplest is to create a dictionary on each pass through the loop.
Each dictionary contains a `ticker` key and a `market_cap` key.
By collecting the dictionaries in a list, we can pass the whole list to `pd.DataFrame()`, which turns each dictionary into a row and each key into a column.
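
Before adding the scraping code, it may help to see the list-of-dictionaries pattern on its own. The sketch below skips the web requests and hard-codes two of the market caps from this lab, so the values are placeholders rather than freshly scraped data; the variable name `example_df` is just for illustration.

```python
import pandas as pd

# Each dictionary represents one row; the keys become the column names.
rows = []
rows.append({'ticker': 'AAPL', 'market_cap': '2.118T'})
rows.append({'ticker': 'MSFT', 'market_cap': '1.896T'})

# pd.DataFrame() accepts a list of dictionaries directly.
example_df = pd.DataFrame(rows)
print(example_df)
```

The solution below uses the same pattern, except that each dictionary is built from scraped values inside the loop.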

In [1]:
import requests
from bs4 import BeautifulSoup
# Remember we need to import Pandas to use DataFrames!
import pandas as pd

In [2]:
tickers = ['AAPL', 'MSFT', 'GOOG', 'AMZN']
rows = []

for ticker in tickers:
    url = 'https://finance.yahoo.com/quote/' + ticker

    response = requests.get(url)
    bs = BeautifulSoup(response.content, 'html.parser')
    quote_summary = bs.find(name='div', id='quote-summary')
    market_cap_td = quote_summary.find(name='td', attrs={'data-test':'MARKET_CAP-value'})
    market_cap = market_cap_td.span.string
    # Create a dictionary representing the information about this ticker.
    ticker_dict = {'ticker': ticker, 'market_cap': market_cap}
    # Add this dictionary to our list of rows.
    rows.append(ticker_dict)

# Create a DataFrame from our rows list
ticker_df = pd.DataFrame(rows)

In [3]:
ticker_df

|  | ticker | market_cap |
| ----- | ----- | ----- |
| 0 | AAPL | 2.118T |
| 1 | MSFT | 1.896T |
| 2 | GOOG | 1.599T |
| 3 | AMZN | 1.644T |


## Bonus Challenge

The current format of our market caps isn't ideal for numeric computation. Not only is each value a string, but it's not trivially convertible to a number because it's abbreviated (e.g. `2.118T` needs to become 2,118,000,000,000). Write code to update the `market_cap` column to be numeric, and then compute the average market cap of these companies.

### Solution

Since all our companies are listed in trillions, we can just multiply their caps by 1,000,000,000,000 (after we remove the `"T"` and convert them to numbers).

This would be quite a bit trickier if some companies' caps were in millions or billions; we'd have to take different actions based on whether the string ended with `"T"`, `"B"`, or `"M"`.
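
As a rough illustration of that trickier case, here is a minimal sketch of a helper that picks a multiplier based on the last character of the string. The function name `parse_market_cap` and the lookup table are hypothetical, and the sketch assumes the only suffixes that appear are `"M"`, `"B"`, and `"T"`; the `'850.5B'` input is a made-up value, not a scraped one.

```python
# Multipliers for each abbreviation suffix (assumed: only M, B, and T occur).
SUFFIX_MULTIPLIERS = {
    'M': 1_000_000,
    'B': 1_000_000_000,
    'T': 1_000_000_000_000,
}

def parse_market_cap(text):
    """Convert an abbreviated string such as '2.118T' into a float."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # Drop the suffix, convert the rest, and scale it up.
        return float(text[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    # No recognized suffix: assume the string is already a plain number.
    return float(text)

print(parse_market_cap('2.118T'))  # 2118000000000.0
print(parse_market_cap('850.5B'))  # 850500000000.0
```

A helper like this could be applied to the whole column with `ticker_df['market_cap'].map(parse_market_cap)`. It does not handle empty strings or placeholders such as `"N/A"`, so real scraped data would likely need extra checks.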

In [4]:
# Get rid of the trailing "T"
market_caps = ticker_df['market_cap'].str.replace('T', '')
# Make the column numeric (remember to use floats here, not ints, so we don't lose decimal points!)
market_caps = market_caps.astype('float')
# Multiply every value by 1 trillion. Python lets us put underscores in numeric literals,
# which makes big numbers like this easier to read.
market_caps = market_caps * 1_000_000_000_000
market_caps

0    2.118000e+12
1    1.896000e+12
2    1.599000e+12
3    1.644000e+12
Name: market_cap, dtype: float64

In [5]:
# Now overwrite the original column
ticker_df['market_cap'] = market_caps
ticker_df

|  | ticker | market_cap |
| ----- | ----- | ----- |
| 0 | AAPL | 2118000000000.0 |
| 1 | MSFT | 1896000000000.0 |
| 2 | GOOG | 1599000000000.0 |
| 3 | AMZN | 1644000000000.0 |

In [6]:
# Get the average market cap
ticker_df['market_cap'].mean()

1814250000000.0