# DS3000 Day 2

Sept. 11/12, 2025

Admin
- Homework 1 is due Tomorrow/Tonight, Sept. 12 by 11:59 PM
- Homework 2 will be posted on Sept 12th, due Oct 3rd by 11:59 PM
      - Note: you have three weeks to do this, but **do not** put it off! The sooner you complete everything the better.
- Lab 1 is scheduled for **next Tuesday (Sept. 16th)**; please bring a **charged up** computer with the ability to edit Jupyter notebooks.


Push-Up Tracker
- Section 03:0
- Section 04:1
- Section 05:0

Content:
- Data Pipeline
- Intro to Web Scraping

# Data Pipeline: What is it?

A data pipeline is a collection of functions* which splits the work of collecting and processing our data into separate steps, each handled by its own function.

(*can be other structures too, but it may be easier to first understand each as a function)


# Why build a data pipeline?

- Allows pipeline to be run in parts (rather than the whole thing)
- Allows pipeline to be built by different programmers working on different parts in parallel
- Allows us to test each piece of our code separately
- Allows for modification / re-use of different sections

What we call a "Data Pipeline" here is a specific instance of "Factoring" a piece of software, splitting up its functionality into pieces.
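
As a minimal sketch of what factoring looks like (the function names and toy data below are made up for illustration and are not part of the OpenWeather activity), a script can be split into collect, clean and summarize steps, each of which can be run and tested on its own:

```python
def collect_temps():
    """ returns raw temperature readings (hard-coded toy data) """
    return ['61.3', '58.0', 'n/a', '65.5']

def clean_temps(raw_temps):
    """ converts readings to floats, dropping any that are not numbers """
    clean = []
    for raw in raw_temps:
        try:
            clean.append(float(raw))
        except ValueError:
            # skip readings like 'n/a'
            continue
    return clean

def mean_temp(temps):
    """ returns the average of a list of temperatures """
    return sum(temps) / len(temps)

# the "script" is now just the pieces chained together
raw_temps = collect_temps()
temps = clean_temps(raw_temps)
avg = mean_temp(temps)
print(avg)
```

Because `clean_temps` only depends on its input, it can be checked with a hand-written list without ever running the collection step.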
    


# OpenWeather API Pipeline Activity

OpenWeather API offers a few different queries (see [here](https://openweathermap.org/api) for details):
- One Call API (which we have access to)
- Solar Radiation API
- etc.


**Goal:**

Build a library of functions which can be pieced together to support the collection, cleaning and display of features from OpenWeather into a scatter plot of two features.

### Let's design one together:

(think: input/outputs -> handwritten notes)

# Plan out a pipeline

Write a few 'empty' functions including little more than the docstring:

```python
def some_fnc(a_string, a_list):
    """ processes a string and a list (somehow)
    
    Args:
        a_string (str): an input string which ...
        a_list (list): a list which describes ...
        
    Returns:
        output (dict): the output dict which is ...
    """
    pass
```

and a script which uses them:

```python
# inputs (not necessarily complete)
lat = 42
lon = 71

some_output = some_fnc(lat, lon)
some_other_output = some_other_fnc(some_output)

```

which would, if the functions worked, produce a graph like this (note: this is from an earlier semester; our graph will look different):

<img src="https://i.ibb.co/Ct0JtRJ/newplot-1.png" width="500">

**NOTE:** we haven't talked about creating plots yet, but we will next week! For now, I will provide everything you need in the examples.

# What might these empty functions look like?

In [1]:
def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude            
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    pass
    
def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    pass
    
def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """ 
    pass

When the pipeline above is complete, the following script should plot a daily max temp scatter for Boston:

# Let's go **SLOWLY** through this solution

In [2]:
# !pip3 install plotly --break-system-packages
# packages for today
# suppresses future warnings (careful here; only use on your own if you know what you're doing)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# for calling API and cleaning data
import requests
import json
from datetime import datetime
# pandas (for data frames) and plotly (for plotting)
import pandas as pd
import plotly
import plotly.express as px

In [3]:
def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    # build url
    lat, lon = latlon_tuple
    url = f'https://api.openweathermap.org/data/3.0/onecall?lat={lat}&lon={lon}&appid={api_key}&units={units}'
    
    # get url as a string
    url_text = requests.get(url).text
    
    # convert json to a nested dict
    weather_dict = json.loads(url_text)

    # another, perhaps cleaner option
    # weather_dict = requests.get(url).json()
    
    return weather_dict

def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    # format to dataframe
    df_weather = pd.DataFrame()
    for daily_dict in daily_dict_list:
        daily_series = pd.Series(dtype='object')

        # build datetime data (.fromtimestamp() assumes local time zone)
        daily_series['date'] = datetime.fromtimestamp(daily_dict['dt'])
        daily_series['sunrise'] = datetime.fromtimestamp(daily_dict['sunrise'])
        daily_series['sunset'] = datetime.fromtimestamp(daily_dict['sunset'])

        


        # build temp data
        temp_dict = daily_dict['temp']
        for temp_feat, temp in temp_dict.items():
            daily_series[f'temp_{temp_feat}'] = temp

        # build prob of precipitation
        # NOTE: I did confirm that the rain column appears only if there is rain forecasted in the next 48 hours
        daily_series['pop'] = daily_dict['pop']

                
        # collect row in df_weather
        df_weather = pd.concat([df_weather, daily_series.to_frame().T])

    return df_weather     

def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """
    # create scatter plot
    fig = px.scatter(df, x=feat_x, y=feat_y)

    
    # export scatter to html
    plotly.offline.plot(fig, filename=f_html)
    
    return f_html
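
The comment in `get_clean_df_daily` notes that `.fromtimestamp()` assumes the local time zone of the machine running the code. If that matters (say, the notebook runs on a computer in a different time zone than the location being queried), one option is to use the `timezone_offset` field that the One Call response includes. The sketch below is one possible approach and is not part of the solution above; `to_local_datetime` is a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

def to_local_datetime(unix_time, offset_sec):
    """ converts a unix timestamp to a datetime in the location's time zone

    Args:
        unix_time (int): seconds since 1970-01-01 UTC (e.g. 'dt', 'sunrise')
        offset_sec (int): shift from UTC in seconds ('timezone_offset')

    Returns:
        dt_local (datetime): timezone-aware datetime
    """
    tz_local = timezone(timedelta(seconds=offset_sec))
    return datetime.fromtimestamp(unix_time, tz=tz_local)

# values taken from the Boston response in this notebook
sunrise = to_local_datetime(1757240166, offset_sec=-14400)
print(sunrise.isoformat())
```

The result is the same no matter where the code runs, since the time zone is passed in explicitly rather than read from the computer's settings.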

In [4]:
# inputs
feat_x = 'date'
feat_y = 'temp_max'
latlon_tuple = 42.3601, -71.0589
units = 'imperial'
api_key = 'cf758020c3c57082bbfd8b62d88ca683'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)
weather_dict

{'lat': 42.3601,
 'lon': -71.0589,
 'timezone': 'America/New_York',
 'timezone_offset': -14400,
 'current': {'dt': 1757277131,
  'sunrise': 1757240166,
  'sunset': 1757286519,
  'temp': 61.3,
  'feels_like': 61.32,
  'pressure': 1016,
  'humidity': 89,
  'dew_point': 58.05,
  'uvi': 0.75,
  'clouds': 100,
  'visibility': 10000,
  'wind_speed': 8.05,
  'wind_deg': 60,
  'weather': [{'id': 501,
    'main': 'Rain',
    'description': 'moderate rain',
    'icon': '10d'}],
  'rain': {'1h': 1.26}},
 'minutely': [{'dt': 1757277180, 'precipitation': 1.002},
  {'dt': 1757277240, 'precipitation': 0.744},
  {'dt': 1757277300, 'precipitation': 0.486},
  {'dt': 1757277360, 'precipitation': 0.4434},
  {'dt': 1757277420, 'precipitation': 0.4008},
  {'dt': 1757277480, 'precipitation': 0.3582},
  {'dt': 1757277540, 'precipitation': 0.3156},
  {'dt': 1757277600, 'precipitation': 0.273},
  {'dt': 1757277660, 'precipitation': 0.449},
  {'dt': 1757277720, 'precipitation': 0.625},
  {'dt': 1757277780, 'prec

In [5]:
# clean weather dict (make dataframe from dict, process timestamps etc)
# print(weather_dict['daily'])

df_daily = get_clean_df_daily(weather_dict['daily'])
# df_daily

In [6]:
# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)
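
The response printed earlier is a nested dictionary: dictionaries inside dictionaries, with the occasional list. As a small sketch of how to pull single values out of it, the example below uses a trimmed, hand-copied piece of that response. The `'snow'` key is an assumption used only to show what happens when a key is missing:

```python
# a trimmed copy of the response shown earlier (most keys removed)
weather_dict = {
    'lat': 42.3601,
    'lon': -71.0589,
    'timezone': 'America/New_York',
    'current': {
        'temp': 61.3,
        'humidity': 89,
        'weather': [{'id': 501,
                     'main': 'Rain',
                     'description': 'moderate rain'}],
        'rain': {'1h': 1.26}},
}

# dictionaries are indexed by key, lists by position
current_temp = weather_dict['current']['temp']
description = weather_dict['current']['weather'][0]['description']

# some keys only appear sometimes; .get() with a default avoids a KeyError
rain_1h = weather_dict['current'].get('rain', {}).get('1h', 0)
snow_1h = weather_dict['current'].get('snow', {}).get('1h', 0)

print(current_temp, description, rain_1h, snow_1h)
```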

## Web Scraping

![i](https://crawlbase.com/blog/best-data-memes/web-scraping-memes.jpg)
* Using programs or scripts to pretend to browse websites, examine the content on those websites, retrieve and extract data from those websites
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project (for example Twitter)
  ![h1](https://pbs.twimg.com/media/Fn84g5DXwAAvLOB.jpg)
  ![h2](https://i.kym-cdn.com/photos/images/original/002/525/430/ee4)
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral grey zone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), with a crawl step up front:
        * **Crawl**: open a given URL using requests and get the HTML source
        * **Extract**: extract interesting content from the webpage's source
        * **Transform**: our usual unit conversions, etc.
        * **Load**: represent the data in an easy way for storage and analysis
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
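
As a rough sketch of the pro tip above, the crawl step can write the raw HTML to disk before any extraction happens. Everything here is an illustrative assumption rather than a fixed recipe: `save_raw_html` is a hypothetical helper, the file naming scheme is arbitrary, and a hard-coded string stands in for the downloaded page so that the example runs without a network connection:

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_raw_html(str_html, name, dir_out='raw_html'):
    """ stores raw html on disk so it can be re-extracted later

    Args:
        str_html (str): html source exactly as downloaded
        name (str): short label for the page (used in the file name)
        dir_out (str): folder which collects the raw html files

    Returns:
        path (Path): file the html was written to
    """
    folder = Path(dir_out)
    folder.mkdir(parents=True, exist_ok=True)

    # a timestamp in the file name keeps every crawl of the same page
    stamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    path = folder / f'{name}_{stamp}.html'
    path.write_text(str_html, encoding='utf-8')
    return path

# crawl step (a hard-coded string stands in for requests.get(url).text)
str_html = '<html><body><h1>Heading 1</h1></body></html>'

# a temporary folder is used here only to keep the demo tidy
path = save_raw_html(str_html, name='example', dir_out=tempfile.mkdtemp())

# later: extraction can be re-run from disk without touching the website
str_html_again = path.read_text(encoding='utf-8')
print(str_html_again == str_html)
```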
    

## Best case scenario
Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas' `read_html` function to scrape this data:

https://www.espn.com/soccer/team/squad/_/id/363/eng.chelsea

In [8]:
# !pip3 install lxml --break-system-packages
import pandas as pd
# read html extracts all the <table> elements from html and returns a list of DataFrames created from them
tables = pd.read_html('https://www.espn.com/soccer/team/squad/_/id/363/eng.chelsea')
# tables.to_csv('hhshs.csv',index=False)
len(tables)

2

In [11]:
tables[0]
# tables[1]
# tables[2]
# tables[3]

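| Unnamed: 0 | Name | POS | Age | HT | WT | NAT | APP | SUB | SV | GA | A | FC | FA | YC | RC | Unnamed: 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Robert Sánchez1 | G | 27 | 6' 6" | 196 lbs | Spain | 3 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 1 | Filip Jørgensen12 | G | 23 | 6' 4" | 185 lbs | Denmark | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 2 | Gaga Slonina44 | G | 21 | 6' 4" | 192 lbs | USA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |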
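
Even in this best case the data still needs a Transform step: in the output above the `Name` column has digits stuck to the end of each name. The sketch below assumes those trailing digits are the jersey number (the source does not say so) and splits them off with a regular expression; `split_name_number` is a hypothetical helper:

```python
import re

def split_name_number(name_raw):
    """ splits a scraped name like 'Robert Sánchez1' into name and number

    Args:
        name_raw (str): player name with digits stuck on the end

    Returns:
        name (str): player name
        number (int): trailing number (None if there are no trailing digits)
    """
    # lazy group grabs the name, second group grabs the digits at the end
    match = re.fullmatch(r'(.*?)(\d+)', name_raw)
    if match is None:
        return name_raw, None
    return match.group(1).strip(), int(match.group(2))

for name_raw in ['Robert Sánchez1', 'Filip Jørgensen12', 'Gaga Slonina44']:
    print(split_name_number(name_raw))
```

A function like this could be applied to the whole column of the dataframe once it has been checked on a few names by hand.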


# Beautiful Soup

Even if the .html does look relatively clean, it's still just a big string. How can we deal with it? Luckily there is a module made for just this purpose (Beautiful Soup), and we can install it directly in a Jupyter notebook with a `pip` magic command:

In [17]:
#pip install bs4

In [12]:
from bs4 import BeautifulSoup
import requests
url = 'https://sapiezynski.com/ds3000/scraping/01.html' 
str_html = requests.get(url).text
soup = BeautifulSoup(str_html, 'html.parser')

In [13]:
soup

<html>
<head>
<!-- comments in HTML are marked like this -->
<!-- the head tag contains the meta information not displayed but helps browsers render the page -->
</head>
<body>
<!-- This is the body of the document that contains all the visible elements.-->
<h1>Heading 1</h1>
<h2>This is what heading 2 looks like</h2>
<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>
<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>
<p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
</body>
</html>

In [14]:
## getting elements by their tag name:
soup.find_all('p')

soup.find_all('p')[0]

# find_all returns a list, where each element is an instance of the specified tag

<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [15]:
# getting the content of the page
url = 'https://sapiezynski.com/ds3000/scraping/02.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# looping over all paragraphs: each one is a subtree, and .a is its first link
for var in soup.find_all('p'):
    print(var.a)


<a href="https://duckduckgo.com">DuckDuckGo</a>
<a href="https://firefox.com">Firefox</a>
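
To reach the stated goal (links from the first paragraph only), `.find_all()` can be called on the first paragraph rather than on the whole soup. This is a minimal sketch: it parses a shortened, hand-typed copy of the page so that no request is needed, and it passes `'html.parser'` explicitly as the parser:

```python
from bs4 import BeautifulSoup

# same structure as the page above (shortened), so no request is needed
str_html = """
<html>
    <body>
        <p>Search engines: <a href="https://duckduckgo.com">DuckDuckGo</a>,
           <a href="https://google.com">Google</a>,
           <a href="https://bing.com">Bing</a></p>
        <p>Browsers: <a href="https://firefox.com">Firefox</a>,
           <a href="https://chrome.com">Chrome</a>,
           <a href="https://opera.com">Opera</a></p>
    </body>
</html>
"""
soup = BeautifulSoup(str_html, 'html.parser')

# the first paragraph is itself a subtree we can search
first_p = soup.find_all('p')[0]

# all links in that paragraph only
link_list = first_p.find_all('a')

# attributes are accessed like dictionary keys, .text gives the visible text
url_list = [link['href'] for link in link_list]
name_list = [link.text for link in link_list]
print(url_list)
print(name_list)
```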


## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to locate a particular part of a web page.

In [16]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [17]:
soup

<!DOCTYPE html>
<html class="comp searchTemplate static-html html mntl-html no-js" data-ab="99,99,99,89,99,99,99,99,99,99,99,99,62,99,99,99,99,77,99,99,85,99,58,78,99,64" data-allrecipes-resource-version="4.56.0" data-ddm-standard-tracking="true" data-ddm-standard-video="true" data-mantle-resource-version="4.2.131" data-mm-ads-resource-version="2.2.41" data-mm-digital-issues-resource-version="3.0.7" data-mm-myrecipes-resource-version="3.3.6" data-mm-recipes-resource-version="3.0.13" data-mm-transactional-resource-version="3.0.34" data-mm-video-resource-version="3.0.12" data-resource-version="4.56.0" data-tracking-container="true" id="searchTemplate_1-0" lang="en"><!--
<globe-environment environment="k8s-prod" application="allrecipes" dataCenter="us-east-1"/>
-->
<head class="loc head">
<script type="text/javascript">
var MMads = window.MMads || {};
var MMAds = window.MMAds || {};</script>
<link href="//c.amazon-adsystem.com" rel="preconnect"/>
<link href="//securepubads.g.doubleclick.n

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

In [18]:
# finding via tag ... problematic as we have too many div tags!
len(soup.find_all('div'))

286

In [19]:
len(soup.find_all('span'))

100

Tags can belong to one or more "classes" (listed in the tag's `class` attribute).  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe title is wrapped in these html tags:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
These are two nested span tags, each with its own class:
- the outer span belongs to `card__title`
- the inner span (which holds the recipe name) belongs to `card__title-text`
    
I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:

In [20]:
recipe_list = soup.find_all(class_='card__title-text')

len(recipe_list)

24

In [21]:
recipe_list

[<span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Best Formula Three-Cheese Fondue</span>,
 <span class="card__title-text">Beer Cheese Fondue</span>,
 <span class="card__title-text">Classic Cheese Fondue</span>,
 <span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe</span>,
 <span class="card__title-text">Basic Fondue</span>,
 <span class="card__title-text">Quick Fontina Cheese Fondue</span>,
 <span class="card__title-text">Classic Swiss Fondue</span>,
 <span class="card__title-text">YouTube + Chill: For Serious Cheese-Lovers Only</span>,
 <span class="card__title-text">Crab Cheese Fondue</span>,
 <span class="card__title-text">Cheese</span>,
 <span class="card__title-text">25 Best Appetizers to Make if You Absolutley Love Cheese</span>,
 <span class="card__title-text">The Most Popular Recipes of the 1970s</span>,
 <span class="card__title-text">How
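
Each item in `recipe_list` is still a tag, not a plain string. As a small sketch of the next step, the example below pulls just the recipe names out with `.text`. It runs on a hand-written stand-in for the search page (the third span and its `other` class are made up), since the live page may change:

```python
from bs4 import BeautifulSoup

# a small hand-written stand-in for the search results page
str_html = """
<div>
  <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
  <span class="card__title"><span class="card__title-text">Beer Cheese Fondue</span></span>
  <span class="other">Not a recipe</span>
</div>
"""
soup = BeautifulSoup(str_html, 'html.parser')

# only the inner spans carry the card__title-text class
recipe_list = soup.find_all(class_='card__title-text')

# .text drops the tag and keeps only what is between <span> and </span>
recipe_names = [recipe.text.strip() for recipe in recipe_list]
print(recipe_names)
```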

# Let me show you how one can scrape Yelp!


In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import pandas as pd
import requests
import urllib
import csv
import re
import time


def html_parse(restUrl,businessname):
	review = []
	frndCnt = []
	revNum = []
	revStars = []
	revDate = []
	revName = []
	revplace=[]
	name_bus=[]
	pages=[]
	userImage=[]
	mylist=[]
	# Change the cookies of the header, each time you start the code.
	headers = {'Host': 'www.yelp.com',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:96.0) Gecko/20100101 Firefox/96.0',
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
	'Accept-Language': 'en-US,en;q=0.5',
	'Accept-Encoding': 'gzip, deflate, br',
	'Referer': 'https://www.yelp.com/biz/babes-chicken-dinner-house-arlington',
	'Connection': 'keep-alive',
	'Cookie': 'location=%7B%22city%22%3A+%22Arlington%22%2C+%22state%22%3A+%22TX%22%2C+%22country%22%3A+%22US%22%2C+%22latitude%22%3A+32.735949939078246%2C+%22longitude%22%3A+-97.10804452636717%2C+%22max_latitude%22%3A+32.787698%2C+%22min_latitude%22%3A+32.633367%2C+%22max_longitude%22%3A+-97.046431%2C+%22min_longitude%22%3A+-97.193004%2C+%22zip%22%3A+%22%22%2C+%22address1%22%3A+%22%22%2C+%22address2%22%3A+%22%22%2C+%22address3%22%3A+null%2C+%22neighborhood%22%3A+null%2C+%22borough%22%3A+null%2C+%22provenance%22%3A+%22YELP_GEOCODING_ENGINE%22%2C+%22display%22%3A+%22Arlington%2C+TX%22%2C+%22unformatted%22%3A+%22Arlington%2C+TX%22%2C+%22isGoogleHood%22%3A+null%2C+%22usingDefaultZip%22%3A+null%2C+%22accuracy%22%3A+4.0%2C+%22language%22%3A+null%7D; hl=en_US; wdi=1|074C9D39FB9D3030|0x1.8a4c7aa035c74p+30|0e02339f8f47ee66; xcj=1|FZLQ4LWbaVj16JK0vGR_KdY1amXpgIMKbm-hD_99nI8; _ga=GA1.2.074C9D39FB9D3030; OptanonConsent=isIABGlobal=false&datestamp=Fri+Jun+17+2022+15%3A41%3A30+GMT%2B0530+(India+Standard+Time)&version=6.34.0&hosts=&consentId=0afa044a-61ac-43a4-b6cb-ac083cb80861&interactionCount=1&landingPath=NotLandingPage&groups=C0003%3A1%2CC0002%3A1%2CC0001%3A1%2CC0004%3A1%2CBG40%3A1&AwaitingReconsent=false&isGpcEnabled=0; __qca=P0-245944813-1641321769732; __adroll_fpc=e571db94111ecf4acd659df7506a17ef-1641321770400; __ar_v4=%7CBHPKS4B4ONEJJMGH4QCJZR%3A20220431%3A1%7CQB5JPFIKRZDSBOZSULG4YB%3A20220431%3A1%7C7YX6SJQ4RZAMPB6LZ7CHFF%3A20220431%3A1; _clck=1d8mcd|1|f25|0; _gcl_au=1.1.123541649.1651385235; _fbp=fb.1.1651385238172.1782538474; G_ENABLED_IDPS=google; _tt_enable_cookie=1; _ttp=1c53e490-7fd4-478b-8c4e-629277b2e8f0; zs=0xo4H7b5BVHMsIAeNGzhlZBMo72RYg; zss=epnEQ4yagE06xTLejnF36E8oo72RYg; uuac=NgSrlf-frzQoUUkj1joyyL8hhbTs4jio1p-SJJoulVM; _uetvid=1d7f19c06d8e11eca230515ee5bcd359; bse=c034339cf55d4b0ca3eeb4eeec9a7045; pid=d8b9c679071bdd2b; _gid=GA1.2.1801390908.1655460692; _gat_www=1; _gat_global=1',
	'Upgrade-Insecure-Requests': '1',
	'Sec-Fetch-Dest': 'document',
	'Sec-Fetch-Mode': 'navigate',
	'Sec-Fetch-Site': 'same-origin',
	'Sec-Fetch-User': '?1',
	'Cache-Control': 'max-age=0'}
	html = requests.get(restUrl,headers=headers)

	soup = BeautifulSoup(html.text, "html.parser")

    ## We will fill here

	wrCsv = pd.DataFrame(list(zip(*[review, frndCnt, revNum, revDate, revName,revplace,revStars,userImage]))).add_prefix('Col')


	
def main():
	# each line of the input file holds one Yelp business identifier
	with open('batch-newadd.txt', 'r') as f:
		for line in f:
			dat1 = line.replace('\n','')
			try:
				# pause between requests so we do not overload the server
				time.sleep(4)
				restUrl = 'https://www.yelp.com/not_recommended_reviews/'+dat1
				check = requests.get(restUrl)
				if check.status_code == 200:
					print(restUrl)
					html_parse(restUrl,dat1)
			except Exception:
				# keep a record of identifiers that could not be fetched or parsed
				with open('not-fount.txt','a') as file:
					file.write(str(dat1))
					file.write('\n')
main()