# Pulling market research from a website

**Last Updated: October 7, 2020**

From: https://www.cnbc.com/2020/06/16/meet-the-2020-cnbc-disruptor-50-companies.html

    These are the 2020 CNBC Disruptor 50 companies

    In the eighth annual Disruptor 50 list, CNBC identifies private companies whose breakthroughs are influencing business and market competition at an accelerated pace. They are poised to emerge from the coronavirus pandemic with tech platforms that have the power to dominate. The start-ups making the 2020 Disruptor list are at the epicenter of a world changing in previously unimaginable ways, turning ideas in cybersecurity, education, health IT, logistics/delivery, fintech and agriculture into a new wave of billion-dollar businesses. 

    A majority of them, in fact, already are billion-dollar businesses: 36 disruptors this year are unicorns that have already reached or passed the $1 billion valuation mark. Maybe more important this year: 37 have hired new employees since the pandemic began, and 19 have pivoted their products or launched new ones to meet the challenges of the pandemic.

    The 50 companies selected using the proprietary Disruptor 50 methodology have raised over $74 billion in venture capital, according to PitchBook, at an implied Disruptor 50 list market valuation of near-$277 billion. Technology is already a major part of our daily lives and the public markets, and that will only increase on the other side of Covid-19, from the future of food supply to health-care diagnostics and the way we shop, study, work and pay.

In [2]:
# Import relevant libraries
import time
import requests
import pandas as pd 
from bs4 import BeautifulSoup

## Start with summary website

In [19]:
url = 'https://www.cnbc.com/2020/06/16/meet-the-2020-cnbc-disruptor-50-companies.html'
page = requests.get(url)

### Grab HTTP response status code

From Wikipedia: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

    This is a list of Hypertext Transfer Protocol (HTTP) response status codes. Status codes are issued by a server in response to a client's request made to the server. It includes codes from IETF Request for Comments (RFCs), other specifications, and some additional codes used in some common applications of the HTTP. The first digit of the status code specifies one of five standard classes of responses. The message phrases shown are typical, but any human-readable alternative may be provided. Unless otherwise stated, the status code is part of the HTTP/1.1 standard (RFC 7231).
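
The "first digit" rule in the passage above can be sketched with the standard library alone. This is a minimal illustration, not part of the original notebook: `STATUS_CLASSES` and `describe_status` are hypothetical names, and the class labels are my own shorthand for the five standard classes.

```python
from http import HTTPStatus

# Hypothetical lookup: first digit of the status code -> response class
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return the code, its standard phrase (if known), and its class."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not registered in the standard library's enum
        phrase = "unrecognized code"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

For a `requests` response such as `page`, the integer to pass in is `page.status_code`.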

In [20]:
page

<Response [200]>

### Parse the main webpage html using Beautiful Soup

In [21]:
soup = BeautifulSoup(page.text, "html.parser")

### Save Beautiful Soup output to text file

In [208]:
with open('pages/cnbc_50_website.txt', 'w', encoding='utf-8') as f:
    # prettify() returns a single string, so write it in one call
    f.write(soup.prettify())

### Look at page header

In [22]:
print(page.headers)

{'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '131279', 'X-Request-Id': 'cb16efdb-8f1e-4c27-98d6-d537d6f1d05f', 'Content-Encoding': 'gzip', 'Access-Control-Allow-Origin': '*', 'X-Aicache-OS': 'xxx.x1.5.164:81, x.xx.246.254:80', 'Expires': 'Fri, 19 Jun 2020 02:08:34 GMT', 'Cache-Control': 'max-age=0, no-cache', 'Pragma': 'no-cache', 'Date': 'Fri, 19 Jun 2020 02:08:34 GMT', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding, User-Agent', 'Set-Cookie': 'region=USA; expires=Thu, 17-Sep-2020 02:08:34 GMT; path=/; domain=.cnbc.com, akaas_CNBC_Audience_Segmentation=1595124514~rv=49~id=189bdf7381eed43a427000cb6f7315a5; path=/; Expires=Sun, 19 Jul 2020 02:08:34 GMT; Domain=.www.cnbc.com; Secure; SameSite=None', 'Content-security-policy': "frame-ancestors 'self' *.cnbc.com *.acorns.com;"}
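
The header values printed above are plain strings. A minimal sketch of pulling structured values out of two of them, using only the standard library, might look like the following. `summarize_headers` is a hypothetical helper name; the sample values are copied from the output above. Note that `page.headers` from `requests` is case-insensitive, while the plain `dict` used here is not.

```python
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Mapping


def summarize_headers(headers: Mapping[str, str]) -> dict:
    """Extract the media type, charset, and server date from response headers."""
    summary = {"media_type": None, "charset": None, "served_at": None}

    content_type = headers.get("Content-Type")
    if content_type:
        # email.message.Message knows how to split "type/subtype; param=value"
        msg = Message()
        msg["Content-Type"] = content_type
        summary["media_type"] = msg.get_content_type()
        summary["charset"] = msg.get_content_charset()

    date = headers.get("Date")
    if date:
        # HTTP dates use the same format as email dates
        summary["served_at"] = parsedate_to_datetime(date)

    return summary


sample = {
    "Content-Type": "text/html; charset=utf-8",
    "Date": "Fri, 19 Jun 2020 02:08:34 GMT",
}
print(summarize_headers(sample))
```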


### Find results within table and save to initial DataFrame

In [23]:
results = soup.find_all('tr')
print('Number of results', len(results))

Number of results 50


In [24]:
company = []
cnbc_link = []
cnbc_sum = []
for row in results:
    # The first text cell holds the linked company name; the second holds the summary
    cells = row.find_all('td', attrs={'class': 'BasicTable-textData'})
    link = cells[0].find('a')
    company.append(link.get_text())
    cnbc_link.append(link.attrs['href'])
    cnbc_sum.append(cells[1].get_text())
df = pd.DataFrame({"company": company, "cnbc_link": cnbc_link, "cnbc_sum": cnbc_sum})
df

Unnamed: 0,company,cnbc_link,cnbc_sum
0,Stripe,https://www.cnbc.com/id/106539909,Unlocking the lockdown's biggest value
1,Coupang,https://www.cnbc.com/id/106539921,Beating Bezos at his own online game?
2,Indigo Agriculture,https://www.cnbc.com/id/106539922,The future of farming is carbon negative
3,Coursera,https://www.cnbc.com/id/106539923,Online ed's biggest test begins
4,Klarna,https://www.cnbc.com/id/106539925,No online sale left behind
5,Tempus,https://www.cnbc.com/id/106539930,Precision medicine for the Covid crisis
6,Zipline,https://www.cnbc.com/id/106539932,Medicine takes flight autonomously
7,SoFi,https://www.cnbc.com/id/106539934,The future of your financial future
8,Neteera,https://www.cnbc.com/id/106539935,Contactless health
9,Gojek,https://www.cnbc.com/id/106539937,"Indonesia's original ridehail, growing up"


## Webscrape the 50 additional websites and add contents to DataFrame

### Testing output from one of the websites, starting with Stripe

In [10]:
page = requests.get(df.cnbc_link[0])
soup = BeautifulSoup(page.text, 'html.parser')

### HTML snippet 

In [11]:
soup.find_all('div', {'class':'group'})[0].p

<p><strong>Founders:</strong> Patrick Collison (CEO), John Collison<br/><strong>Launched:</strong> 2010<br/><strong>Headquarters:</strong> San Francisco<strong><br/>Funding:</strong> $1.6 billion<br/><strong>Valuation:</strong> $36 billion<strong><br/>Industry:</strong> Global e-payments<br/><strong>Previous appearances on Disruptor 50 List: </strong>5<strong> </strong>(<a href="https://www.cnbc.com/2019/05/14/stripe-2019-disruptor-50.html">No. 13</a> in 2019)</p>

### Parsing is a success!

In [17]:
data_chunk = {}
for element in soup.find_all("div", {"class": "group"})[0].p.find_all('strong'):
    sibling = element.next_sibling
    # Only plain text after a <strong> label can be a value; skip tags and missing siblings
    if not isinstance(sibling, str):
        continue
    # strip() takes a set of characters to remove from both ends
    key = element.get_text().strip(': \xa0')
    val = sibling.strip(' \xa0(')
    if key:
        data_chunk[key] = val
data_chunk

{'Founders': 'Patrick Collison (CEO), John Collison',
 'Launched': '2010',
 'Headquarters': 'San Francisco',
 'Funding': '$1.6 billion',
 'Valuation': '$36 billion',
 'Industry': 'Global e-payments',
 'Previous appearances on Disruptor 50 List': '5'}
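
The cleanup in the cell above relies on how `str.strip` treats its argument: as a set of individual characters to remove from both ends, not as a pattern. A minimal sketch of that behavior, with `clean_label` and `clean_value` as hypothetical helper names and inputs taken from the Stripe snippet:

```python
def clean_label(text: str) -> str:
    """Remove colons, spaces, and non-breaking spaces from both ends of a label."""
    return text.strip(": \xa0")


def clean_value(text: str) -> str:
    """Remove spaces, non-breaking spaces, and opening parentheses from both ends."""
    return text.strip(" \xa0(")


# Characters in the middle of the string are never touched
print(repr(clean_label("Previous appearances on Disruptor 50 List: ")))
print(repr(clean_value(" Patrick Collison (CEO), John Collison")))
# A label made only of strippable characters collapses to an empty string
print(repr(clean_label(" ")))
```

An empty result is what lets the loop discard the stray `<strong> </strong>` element in the snippet.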

### Now scrape the 50 additional pages and add to DataFrame

In [542]:
for i in range(df.shape[0]):
    page = requests.get(df.cnbc_link[i])
    soup = BeautifulSoup(page.text, 'html.parser')
    for element in soup.find_all("div", {"class": "group"})[0].p.find_all('strong'):
        sibling = element.next_sibling
        # Skip labels that are not followed by plain text
        if not isinstance(sibling, str):
            continue
        key = element.get_text().strip(': \xa0')
        val = sibling.strip(' \xa0(')
        if not key:
            continue
        # Create a missing column filled with 0 (object dtype so strings fit)
        if key not in df.columns:
            df[key] = pd.Series(0, index=df.index, dtype=object)
        # .loc writes into df itself, avoiding the chained-assignment warning
        df.loc[i, key] = val
    # Pause between requests to avoid overloading the server
    time.sleep(10)


### Save the 50 webpages html using Beautiful Soup

In [None]:
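for i in range(df.shape[0]):
    page = requests.get(df.cnbc_link[i])
    soup = BeautifulSoup(page.text, 'html.parser')
    with open('pages/page_{}.txt'.format(i), 'w', encoding='utf-8') as f:
        # prettify() returns a single string, so write it in one call
        f.write(soup.prettify())
    # Pause between requests to avoid overloading the server
    time.sleep(10)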
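
The parsing loop and the saving loop each download all 50 pages. One way to avoid the second round of requests, sketched here under assumptions, is to save each page the first time it is fetched and read it from disk afterwards. `fetch_cached` and its `fetch` parameter are hypothetical names, not part of the original notebook; the downloader is passed in as a function so the sketch runs without network access.

```python
import time
from pathlib import Path
from typing import Callable


def fetch_cached(
    url: str,
    cache_file: Path,
    fetch: Callable[[str], str],
    delay: float = 10.0,
) -> str:
    """Return the page text for url, downloading it only if no saved copy exists."""
    if cache_file.exists():
        # Reuse the saved copy; no request and no pause needed
        return cache_file.read_text(encoding="utf-8")

    html = fetch(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    # Pause only after a real download, to stay polite to the server
    time.sleep(delay)
    return html
```

In this notebook the call might look like `fetch_cached(df.cnbc_link[i], Path('pages/page_{}.txt'.format(i)), lambda u: requests.get(u).text)`, with the returned text handed to `BeautifulSoup`. Note that this stores the raw HTML rather than the prettified version.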

## Save results to csv

In [4]:
df.to_csv('output/cnbc_50_webscraped_list.csv', index=False)
pd.read_csv('output/cnbc_50_webscraped_list.csv').head(5)

Unnamed: 0,company,cnbc_link,cnbc_sum,Founders,Launched,Headquarters,Funding,Valuation,Industry,Previous appearances on Disruptor 50 List,Founder,Key technologies,CEO
0,Stripe,https://www.cnbc.com/id/106539909,Unlocking the lockdown's biggest value,"Patrick Collison (CEO), John Collison",2010,San Francisco,$1.6 billion,$36 billion,Global e-payments,5,0,0,0
1,Coupang,https://www.cnbc.com/id/106539921,Beating Bezos at his own online game?,0,2010,"Seoul, South Korea",$3.4 billion (PitchBook),$9 billion (PitchBook),"E-commerce, retail",0,Bom Kim (CEO),"Artificial intelligence, cloud computing, mach...",0
2,Indigo Agriculture,https://www.cnbc.com/id/106539922,The future of farming is carbon negative,"David Berry, Geoffrey von Maltzahn and Noubar ...",2014,Boston,$850 million,$3.5 billion,"Agriculture, farming",2,0,"Artificial intelligence, machine learning",David Perry
3,Coursera,https://www.cnbc.com/id/106539923,Online ed's biggest test begins,"Andrew Ng, Daphne Koller",2012,"Mountain View, California",$315 million,$1.6 billion (PitchBook),"Higher education, online learning",4,0,"AI, cloud computing, deep learning, Internet o...",Jeff Maggioncalda
4,Klarna,https://www.cnbc.com/id/106539925,No online sale left behind,"Sebastian Siemiatkowski (CEO), Niklas Adalbert...",2005,Stockholm,$970 million,$5.5 billion,"E-commerce, financial services, fintech",2,0,"AI, cloud computing, edge computing, machine l...",0
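
The `Funding` and `Valuation` columns hold strings such as `$1.6 billion` or `$3.4 billion (PitchBook)`, plus the `0` placeholder for missing values. A possible follow-up step, sketched here and not part of the original notebook, is to convert them to numbers before analysis. `parse_usd` is a hypothetical helper, and the set of unit words it recognizes is an assumption based on the rows shown above.

```python
import re
from typing import Optional

# Assumed unit words; only "million" and "billion" appear in the rows above
_MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
_AMOUNT = re.compile(
    r"\$\s*([\d,]*\.?\d+)\s*(thousand|million|billion|trillion)?",
    re.IGNORECASE,
)


def parse_usd(text) -> Optional[float]:
    """Convert a string like '$3.4 billion (PitchBook)' to a dollar amount.

    Returns None when no dollar amount is found, e.g. for the 0 placeholder.
    """
    match = _AMOUNT.search(str(text))
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return number * _MULTIPLIERS.get(unit, 1.0)


for raw in ["$1.6 billion", "$3.4 billion (PitchBook)", "$850 million", 0]:
    print(raw, "->", parse_usd(raw))
```

Applied to the DataFrame, this could be `df['Funding'].map(parse_usd)`; trailing source notes such as `(PitchBook)` are ignored by the pattern.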
