# Web Scraper Tutorial

In [1]:
from bs4 import BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML. It uses a parser to represent the document as a nested data structure, which makes it easy to extract information from an HTML page. BeautifulSoup tolerates badly formed HTML and still lets you extract the data you need.
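
As a small illustration of that tolerance, the sketch below parses a deliberately broken fragment with an unquoted attribute and missing closing tags. The markup and variable names are made up for this example, and it assumes Python's built-in `html.parser` rather than lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately malformed HTML:
# the id attribute is unquoted and <span> and <div> are never closed.
broken_html = "<div id=quote><b>Open</b> <span>25.64"

soup = BeautifulSoup(broken_html, "html.parser")

label = soup.b.get_text()      # text inside the <b> tag
value = soup.span.get_text()   # text inside the unclosed <span> tag
div_id = soup.div["id"]        # the unquoted attribute is still readable

print(label, value, div_id)
```

The parser closes the dangling tags itself, so the nested structure can be navigated as if the page had been well formed.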

In [2]:
import urllib.request, urllib.parse, urllib.error

Since HTTP is so common, Python ships with the urllib library, which does all the work for us and makes a web page look like a file.
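
To show what "looks like a file" means in practice, here is a minimal sketch. It assumes a `data:` URL (which carries its own payload) instead of a real web address, so it runs without network access; with an `https://` URL the calls would be the same.

```python
import urllib.request

# A data: URL embeds the content itself, so nothing is downloaded here.
# %20 is an encoded space and %0A an encoded newline.
url = "data:text/plain,line%20one%0Aline%20two"

# urlopen returns a file-like object: it can be used in a with-block
# and read() returns the raw bytes, just like a file opened in binary mode.
with urllib.request.urlopen(url) as response:
    body = response.read()

lines = body.decode("utf-8").splitlines()
print(lines)
```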

In [3]:
import ssl
import csv

The ssl module provides access to Transport Layer Security (often known as “Secure Sockets Layer”) encryption and peer authentication facilities for network sockets, both client-side and server-side.
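
To make the peer authentication part concrete, the sketch below compares a default client context with one where verification is switched off, which is what the scraping code in this tutorial does. The variable names are illustrative only.

```python
import ssl

# A default client context verifies the server certificate and its hostname.
default_ctx = ssl.create_default_context()
print(default_ctx.check_hostname)
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)

# A relaxed context skips both checks. check_hostname has to be disabled
# first; setting CERT_NONE while it is still enabled raises a ValueError.
relaxed_ctx = ssl.create_default_context()
relaxed_ctx.check_hostname = False
relaxed_ctx.verify_mode = ssl.CERT_NONE
print(relaxed_ctx.verify_mode == ssl.CERT_NONE)
```

Skipping verification is convenient for a class project, but it means the program no longer checks that it is talking to the real server.
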
The csv module implements classes to read and write tabular data in CSV format.
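
A short sketch of the csv module in use, assuming an in-memory buffer (`io.StringIO`) in place of a real file so that nothing is written to disk. The sample rows reuse names that appear in the output further down.

```python
import csv
import io

rows = [
    ["Category", "Symbol", "Name"],
    ["Most Actives", "WFC", "Wells Fargo & Co"],
    ["Gainers", "GPS", "Gap Inc"],
]

# Write the rows as CSV text into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Rewind the buffer and parse the CSV text back into lists of strings.
buffer.seek(0)
read_back = list(csv.reader(buffer))

print(read_back == rows)
```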

__INSTRUCTIONS:__

__1. The program may run slowly, so please restart the kernel before running it.__

__2. The 'stocks.csv' file takes time to load, so please wait a few seconds after the program finishes and prints its output. If the file is still empty, try closing, refreshing, and then reopening it.__

The function below gets the details of a stock based on the ticker symbol entered by the user. It extracts four values for each stock: OPEN price, PREV CLOSE price, VOLUME, and MARKET CAP.

In [None]:
# Function for getting the stock data from Yahoo Finance.

def stock_data(s_inp):
    # SSL context with certificate verification turned off for this request
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    s_url = "https://finance.yahoo.com/quote/{0}?p={0}&.tsrc=fin-srch-v1".format(s_inp)  # URL of the page with the details of the requested ticker
    html = urllib.request.urlopen(s_url, context=ctx).read()  # load the HTML
    soup = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser
    tag = soup.find_all('div', attrs={'id': "quote-summary"})  # find the division with the stock details
    table = tag[0].find_all("table")  # tables containing the stock details
    data = []  # list for storing the values, in the order they appear on the page
    for rows in table:
        for value in rows.find_all("td"):
            # a label cell is followed by the cell that holds its value
            if value.text in ["Previous Close", "Open", "Volume", "Market Cap"]:
                data.append(value.find_next('td').text)
    return data  # return the list containing the required stock details
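
The label-then-value lookup used in `stock_data` can be tried offline. The sketch below is only an approximation under stated assumptions: `sample_html` is hypothetical markup that imitates a label/value table (not the real Yahoo Finance page), `extract_fields` is a helper invented for this example, and it uses the built-in `html.parser`. It returns a dictionary keyed by label, so the caller does not depend on the order of the cells on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates a label/value summary table.
sample_html = """
<div id="quote-summary">
  <table>
    <tr><td>Previous Close</td><td>7.42</td></tr>
    <tr><td>Open</td><td>7.67</td></tr>
    <tr><td>Some Other Field</td><td>ignored</td></tr>
    <tr><td>Volume</td><td>10,877,042</td></tr>
    <tr><td>Market Cap</td><td>3.018B</td></tr>
  </table>
</div>
"""

WANTED = ["Previous Close", "Open", "Volume", "Market Cap"]

def extract_fields(html, wanted=WANTED):
    """Map each wanted label to the text of the cell that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for cell in soup.find_all("td"):
        label = cell.get_text(strip=True)
        if label in wanted:
            # find_next("td") returns the next <td> in document order
            found[label] = cell.find_next("td").get_text(strip=True)
    return found

fields = extract_fields(sample_html)
print(fields["Open"], fields["Market Cap"])
```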

The script below collects the lists of most actives, gainers, and losers from the CNN Money hot stocks page. It takes the ticker symbols and names of these companies (and their categories) and builds a CSV file (called stocks.csv) with data about each stock.

In [1]:
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
print("This is a program to scrape data from the https://money.cnn.com/data/hotstocks/ for a class project.")
print("Which stock are you interested in:\n")
url = "https://money.cnn.com/data/hotstocks/"  # URL containing all the stocks
html = urllib.request.urlopen(url, context=ctx).read()  # load the HTML from the above URL
soup = BeautifulSoup(html, "lxml")  # parse the HTML with the lxml parser
tag = soup.find_all('table', attrs={'class': "wsod_dataTable wsod_dataTableBigAlt"})  # the tables which contain the data
Header = ["Most Actives", "Gainers", "Losers"]  # stock categories list
stocks = {}  # maps each ticker symbol to the full name of the stock
csvl = []  # rows of [category, ticker, full name] used for the csv file
for head, rows in zip(Header, tag):
    print(head + ":")  # print the category name
    for line in rows.find_all('td'):  # loop through all the cells in the table
        if line.a is not None:  # only cells with a link hold a stock
            ticker = line.a.text  # get the ticker symbol of the stock
            fullname = line.span.text  # get the full name of the stock given in the span text
            print(ticker, fullname)  # print the ticker and full name of the stock
            stocks[ticker] = fullname  # store each stock under its ticker symbol
            csvl.append([head, ticker, fullname])  # this list is used for storing details in the csv file
    print('\n')

while True:  # loop until the user enters a ticker symbol that is in the list
    usr_inp = input("User Inputs:")  # read the ticker symbol from the user
    if usr_inp not in stocks:  # only ticker symbols are accepted, not names or categories
        print("Wrong Input, the stock is not in the list!")
        continue
    else:
        print("The data for", usr_inp, stocks[usr_inp], "is the following:")
        result = stock_data(usr_inp)  # pass the user input to the stock_data function
        print("Open:", result[1])  # opening value of the stock
        print("PREV CLOSE:", result[0])  # previous close value of the stock
        print("VOLUME:", result[2])  # volume of the stock
        print("MARKET CAP:", result[3])  # market cap of the stock
        with open('stocks.csv', 'w', newline='') as file:  # newline='' avoids blank rows on Windows
            writer = csv.writer(file, delimiter=',')  # writer object for the csv file
            writer.writerow(["Category", "Symbol", "Name", "Previous Close", "Open", "Volume", "Market Cap"])  # row of column headings
            for s_inp in csvl:  # for each stock in the list, add its details to the csv file
                writer.writerow(s_inp + stock_data(s_inp[1]))
        break

This is a program to scrape data from the https://money.cnn.com/data/hotstocks/ for a class project.
Which stock are you interested in:

Most Actives:
F Ford Motor Co
GE General Electric Co
NCLH Norwegian Cruise Line Holdings Ltd
WFC Wells Fargo & Co
BAC Bank of America Corp
DAL Delta Air Lines Inc
CCL Carnival Corp
MRO Marathon Oil Corp
OXY Occidental Petroleum Corp
DIS Walt Disney Co


Gainers:
CARR Carrier Global Corp
VTR Ventas Inc
LW Lamb Weston Holdings Inc
LEG Leggett & Platt Inc
HFC HollyFrontier Corp
LB L Brands Inc
GPS Gap Inc
TDG TransDigm Group Inc
HP Helmerich and Payne Inc
KIM Kimco Realty Corp


Losers:
MSI Motorola Solutions Inc
FLT Fleetcor Technologies Inc
DLR Digital Realty Trust Inc
FLS Flowserve Corp
BLL Ball Corp
MTD Mettler-Toledo International Inc
TMO Thermo Fisher Scientific Inc
CTVA Corteva Inc
MLM Martin Marietta Materials Inc
DHR Danaher Corp


User Inputs:WFC
The data for WFC Wells Fargo & Co is the following:
Open: 25.64
PREV CLOSE: 25.23
VOLUME: 51,470,08

__EXECUTION:__

User Inputs:GPS

The data for GPS Gap Inc is the following:

Open: 7.67

PREV CLOSE: 7.42

VOLUME: 10,877,042

MARKET CAP: 3.018B
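
The CSV loop in the script calls `stock_data` once per listed stock, so a single failed request, or a page without the expected `quote-summary` block (which raises `IndexError` at `tag[0]`), stops the run part-way. One possible way to make the loop more forgiving is sketched below. The `safe_row` helper and the stand-in `fake_fetch` function are hypothetical and not part of the original program; in the real script `stock_data` would be passed in place of `fake_fetch`.

```python
import urllib.error

def safe_row(base, fetch, n_fields=4):
    """Return base + fetched details, padded with blanks if the fetch fails."""
    try:
        details = list(fetch(base[1]))  # base[1] is the ticker symbol
    except (urllib.error.URLError, IndexError):
        details = []                    # leave the detail columns empty
    # pad or trim so every row has the same number of columns
    details = (details + [""] * n_fields)[:n_fields]
    return base + details

def fake_fetch(ticker):
    """Stand-in for stock_data: fails for one made-up ticker."""
    if ticker == "BAD":
        raise urllib.error.URLError("simulated network failure")
    return ["7.42", "7.67", "10,877,042", "3.018B"]

good = safe_row(["Gainers", "GPS", "Gap Inc"], fake_fetch)
bad = safe_row(["Losers", "BAD", "Hypothetical Corp"], fake_fetch)
print(good)
print(bad)
```

With this approach a failing stock still gets a row in the file, just with empty detail columns, instead of ending the whole export.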