# Project 2: Web Scraping and API Access

In [2]:
!pip install beautifulsoup4



## Part 1: Explore the HTML for Wikipedia articles

### A. Using inspect element, copy the HTML code for a table.

##### Wikipedia page I chose: https://en.wikipedia.org/wiki/Doughnut

<table class="wikitable">
<thead>
<tr>
<th>Type</th>
<th>Serving size</th>
<th>Calories</th>
<th>Sugar (g)</th>
<th>Fat (g)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Glazed</td>
<td>52 g</td>
<td>190</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<td>Chocolate Frosted</td>
<td>52 g</td>
<td>250</td>
<td>13</td>
<td>14</td>
</tr>
<tr>
<td>Cream-filled</td>
<td>76 g</td>
<td>340</td>
<td>23</td>
<td>20</td>
</tr>
</tbody>
</table>
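
To show how markup like this maps onto rows and columns, here is a minimal sketch that turns a shortened copy of the table into a list of records. It uses only Python's built-in `html.parser` so it runs without a network connection or extra installs; the class name `TableParser` is my own, and the same result could be reached with BeautifulSoup's `find_all("tr")`.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened copy of the table above (two columns, two body rows)
html = """<table class="wikitable">
<tr><th>Type</th><th>Calories</th></tr>
<tr><td>Glazed</td><td>190</td></tr>
<tr><td>Cream-filled</td><td>340</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)

# First row is the header; pair it with each body row
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Every value comes back as a string, so numeric columns such as `Calories` would still need converting before any arithmetic.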

### B. Using inspect element, find the HTML syntax for a link.

<a href="/wiki/Coffee" title="Coffee">coffee</a>

### C. Using inspect element, find the HTML syntax for linking an image.

<img alt="A glazed ring doughnut" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Glazed-Donut.jpg/220px-Glazed-Donut.jpg" decoding="async" width="220" height="163" class="thumbimage">

## Part 2: Explore one Wikipedia page with the beautifulsoup package

In [1]:
import bs4
import requests
import pandas as pd

In [3]:
#save and print the text content of a page with all tags removed

url = "https://en.wikipedia.org/wiki/Doughnut"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")
plain_text = soup.get_text()
clean_text = "\n".join([line.strip() for line in plain_text.splitlines() if line.strip()])
print(clean_text)

Doughnut - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
Contribute
HelpLearn to editCommunity portalRecent changesUpload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
History
Toggle History subsection
1.1
Forerunner
1.2
England and North America
2
Etymology
Toggle Etymology subsection
2.1
"Dough nut"
2.2
"Donut"
3
Types
Toggle Types subsection
3.1
Rings
3.1.1
Topping
3.2
Holes
3.3
Filled
3.4
Other shapes
4
Science
Toggle Science subsection
4.1
Cake vs yeast style
4.2
Physical structure
4.3
Molecular composition
4.4
Health effects
4.5
Dough rheology
5
Regional variations
Toggle Regional variations subsection
5.1
Asia
5.1.1
Cambodia
5.1.2
China
5.1.3
India
5.1.4
Indonesia
5.1.5
Japan
5.1.6
Malaysia
5.1.7
Nepal
5.1.8
Pakistan
5.1.9
Phil

In [11]:
#download an image with beautifulsoup and save it in this repository

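from urllib.parse import urljoin

response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

# soup.find("img") returns the FIRST <img> on the page, which may be
# the site logo rather than an image from the article body
image_tag = soup.find("img")

# urljoin handles both "/path" and protocol-relative "//host/path" sources;
# prefixing a fixed base URL would break the "//host/path" form
image_url = urljoin(url, image_tag["src"])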

image_data = requests.get(image_url).content
with open("donut_image.jpg", "wb") as file:
    file.write(image_data)

print(f"Image downloaded from {image_url}")

Image downloaded from https://en.wikipedia.org/static/images/icons/wikipedia.png
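
The download above picked up the Wikipedia logo because it is the first `<img>` on the page. A minimal sketch of one way to skip it, assuming that a photo from the article is the first image whose path ends in `.jpg` or `.jpeg`; the function name `first_photo_url` and that heuristic are my own assumptions, not anything Wikipedia guarantees. The two sample `src` values are the logo path from the output above and the doughnut image from Part 1C.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://en.wikipedia.org/wiki/Doughnut"


def first_photo_url(srcs, page_url=PAGE_URL):
    """Return the first src that resolves to a .jpg/.jpeg URL, or None."""
    for src in srcs:
        # Resolve "/path" and "//host/path" against the page URL
        absolute = urljoin(page_url, src)
        if urlparse(absolute).path.lower().endswith((".jpg", ".jpeg")):
            return absolute
    return None


srcs = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Glazed-Donut.jpg/220px-Glazed-Donut.jpg",
]
photo = first_photo_url(srcs)
print(photo)
```

In the notebook, the list of sources could come from `[img["src"] for img in soup.find_all("img", src=True)]`.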


In [12]:
#find all the links in a page with beautifulsoup
#print the first 100 characters of ten of these links

links = soup.find_all("a", href=True)

for i, link in enumerate(links[:10]):
    print(f"Link {i + 1}: {link['href'][:100]}")

Link 1: #bodyContent
Link 2: /wiki/Main_Page
Link 3: /wiki/Wikipedia:Contents
Link 4: /wiki/Portal:Current_events
Link 5: /wiki/Special:Random
Link 6: /wiki/Wikipedia:About
Link 7: //en.wikipedia.org/wiki/Wikipedia:Contact_us
Link 8: /wiki/Help:Contents
Link 9: /wiki/Help:Introduction
Link 10: /wiki/Wikipedia:Community_portal
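
The hrefs printed above mix page fragments, relative paths and protocol-relative URLs. As a rough sketch of how they could be narrowed down to article links and made absolute, the function below keeps hrefs that start with `/wiki/` and contain no colon after that prefix. Treating a colon as a namespace marker (`Wikipedia:`, `Help:`, `Special:`) is a heuristic of mine and would wrongly drop the few article titles that contain a colon.

```python
from urllib.parse import urljoin


def article_links(hrefs, page_url):
    """Keep hrefs that look like article links and return them as absolute URLs."""
    prefix = "/wiki/"
    result = []
    for href in hrefs:
        # Skip fragments, external links and namespaced pages such as Help:Contents
        if href.startswith(prefix) and ":" not in href[len(prefix):]:
            result.append(urljoin(page_url, href))
    return result


hrefs = [
    "#bodyContent",
    "/wiki/Main_Page",
    "/wiki/Wikipedia:Contents",
    "/wiki/Special:Random",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Coffee",
]
absolute = article_links(hrefs, "https://en.wikipedia.org/wiki/Doughnut")
print(absolute)
```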


## Part 3: Downloading scripts

In [23]:
scripts = pd.read_csv("pudding_data.csv")

In [24]:
scripts

Unnamed: 0,imdb_id,script_id,title,year,gross (inflation-adjusted),link
0,tt0019777,4031,The Cocoanuts,1929,,http://www.pages.drexel.edu/~ina22/splaylib/Sc...
1,tt0021884,8521,Frankenstein,1931,298.0,Frankenstein (Florey & Fort) [1931-5-23] [Scan...
2,tt0022054,1086,The Last Flight,1931,,"film_20100519/all_imsdb_05_19_10/Last-Flight,-..."
3,tt0022626,1631,American Madness,1932,,http://www.imsdb.com/Movie Scripts/American Ma...
4,tt0022958,2438,Grand Hotel,1932,,http://www.imsdb.com/Movie Scripts/Grand Hotel...
...,...,...,...,...,...,...
1995,tt3733778,8533,Pay the Ghost,2015,,"Pay The Ghost (Dan Kay, 9-1-09).pdf"
1996,tt3808342,5499,Son of Saul,2015,0.0,http://gointothestory.blcklst.com/wp-content/u...
1997,tt3850214,8056,Dope,2015,18.0,Dope (2013.10.31) [Digital].pdf
1998,tt3859076,5507,Truth,2015,2.0,http://gointothestory.blcklst.com/wp-content/u...
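
The `link` column visibly mixes full URLs, bare file names and relative paths, so it may help to sort the values before requesting anything. The sketch below is one possible way to do that with the standard library; the label strings and the function name `classify_link` are my own, and the two `example.com` URLs are hypothetical samples rather than rows from the dataset.

```python
from urllib.parse import urlparse


def classify_link(link):
    """Return a rough label for one value from the 'link' column."""
    if not isinstance(link, str):
        return "missing"  # e.g. NaN from pandas
    try:
        parsed = urlparse(link.strip())
    except ValueError:
        return "not a URL"  # malformed input that urlparse rejects
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "not a URL"  # bare file names or relative paths
    if parsed.path.lower().endswith(".pdf"):
        return "PDF URL"
    return "web URL"


samples = [
    "Dope (2013.10.31) [Digital].pdf",      # from the table above
    "Pay The Ghost (Dan Kay, 9-1-09).pdf",  # from the table above
    float("nan"),                           # a missing value
    "https://example.com/script.html",      # hypothetical
    "https://example.com/script.pdf",       # hypothetical
]
labels = [classify_link(s) for s in samples]
print(labels)
```

Applied to the dataframe, `scripts["link"].apply(classify_link).value_counts()` would show how many rows are worth fetching at all.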


In [25]:
#using the links in the "link" column, download the first 1000 characters of each script
#use requests and bs4, remember to remove all html tags

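from bs4 import BeautifulSoup

def fetch_cleaned_script(url):
    # Missing values (NaN) and bare file names are not fetchable URLs
    if not isinstance(url, str) or not url.startswith("http"):
        return "Invalid URL"

    try:
        # A finite timeout (10 s is an arbitrary choice) keeps one
        # unresponsive server from stalling the whole run
        response = requests.get(url, stream=True, timeout=10)
        content_type = response.headers.get("Content-Type", "")

        if "text/html" not in content_type:  # Skip non-HTML content such as PDFs
            response.close()
            return "Non-HTML content"

        html = response.content  # the body is only downloaded here
    except requests.RequestException:
        return "Request failed"

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(strip=True)
    return text[:1000]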

scripts["script_text"] = scripts["link"].apply(fetch_cleaned_script)
scripts.to_csv("pudding_texts.csv", index=False)
print("Downloaded and saved cleaned script text in 'pudding_texts.csv'")

KeyboardInterrupt: 

In [22]:
#add a new column to the df with the text downloaded
#save this new dataframe as "pudding_texts.csv"

scripts["script_text"] = scripts["link"].apply(fetch_cleaned_script)
scripts.to_csv("pudding_texts.csv", index=False)

print("New DataFrame saved as 'pudding_texts.csv'")

KeyboardInterrupt: 

Note: both cells above are as correct as I could get them, but each run stalled on my system and had to be interrupted by hand (hence the `KeyboardInterrupt`).
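
One reason a run like this takes so long is that `apply` fetches the links one at a time. Below is a hedged sketch of running the fetches concurrently with a thread pool instead. The helper `fetch_all` and the stand-in `fake_fetch` are my own names; in the notebook the real fetching function would take the place of `fake_fetch`. Threads only overlap the waiting, so each request still needs its own timeout.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch_one, max_workers=8):
    """Apply fetch_one to every link concurrently, preserving input order."""

    def safe(link):
        try:
            return fetch_one(link)
        except Exception as exc:  # one bad link should not abort the batch
            return f"Error: {type(exc).__name__}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(safe, links))


def fake_fetch(link):
    """Stand-in for a real download so the sketch runs offline."""
    if not link.startswith("http"):
        raise ValueError("not a URL")
    return link.upper()[:1000]


results = fetch_all(["http://a.example", "local.pdf"], fake_fetch)
print(results)
```

The returned list lines up with the input order, so it could be assigned straight to a new dataframe column.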

## Part 4: TMDB database

#### Browse the documentation at https://developer.themoviedb.org/reference/intro/getting-started. Create an account to authenticate.

In [None]:
#create a dataset of the movies in theaters now. Include metadata fields you are interested in. 

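import os

# Read the key from an environment variable instead of hard-coding it in the notebook
# (the variable name TMDB_API_KEY is my choice, not something TMDB requires)
api_key = os.environ["TMDB_API_KEY"]
url = f"https://api.themoviedb.org/3/movie/now_playing?api_key={api_key}&language=en-US&page=1"

response = requests.get(url)
response.raise_for_status()  # stop early on an authentication or rate-limit error
movies = response.json()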

movie_data = [
    {
        "title": movie["title"],
        "release_date": movie["release_date"],
        "overview": movie["overview"],
        "vote_average": movie["vote_average"],
        "poster_path": movie["poster_path"],
    }
    for movie in movies["results"]
]

movies_df = pd.DataFrame(movie_data)
movies_df.to_csv("movies_in_theaters.csv", index=False)
print("Dataset of movies in theaters saved as 'movies_in_theaters.csv'")

Dataset of movies in theaters saved as 'movies_in_theaters.csv'
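
The request above only asks for `page=1`, so the dataset holds a single page of results. Here is a minimal sketch of looping over further pages. It assumes the response carries a `total_pages` field next to `results`, which is my understanding of TMDB list endpoints and worth checking against the documentation. The helper `collect_results` and the fake pages are hypothetical, and the page getter is passed in as a function so the sketch runs without an API key.

```python
def collect_results(get_page, max_pages=5):
    """Call get_page(1), get_page(2), ... and concatenate the 'results' lists."""
    results = []
    page = 1
    while page <= max_pages:
        payload = get_page(page)
        results.extend(payload.get("results", []))
        # Stop once the last available page has been read
        if page >= payload.get("total_pages", 1):
            break
        page += 1
    return results


# Fake two-page response standing in for the real API
fake_pages = {
    1: {"results": [{"title": "A"}, {"title": "B"}], "total_pages": 2},
    2: {"results": [{"title": "C"}], "total_pages": 2},
}
titles = [movie["title"] for movie in collect_results(lambda p: fake_pages[p])]
print(titles)
```

With the real API, `get_page` could be a small function that formats the page number into the request URL and returns `response.json()`.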


In [28]:
#download the movie posters for 10 of these movies and save them to this repository

poster_base_url = "https://image.tmdb.org/t/p/w500"

# Download and save posters for the first 10 movies that have a poster
# (a missing poster_path would break the string concatenation below)
for i, poster_path in enumerate(movies_df["poster_path"].dropna().head(10)):
    poster_url = poster_base_url + poster_path
    response = requests.get(poster_url)
    with open(f"poster_{i + 1}.jpg", "wb") as file:
        file.write(response.content)

print("Downloaded movie posters for the first 10 movies.")

Downloaded movie posters for the first 10 movies.
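
Files named `poster_1.jpg` to `poster_10.jpg` are hard to match back to their movies. The sketch below shows one way to build the image URL and a more descriptive file name. Both helper names are my own, the sample path and title are made up for illustration, and the `size` argument assumes the same `w500` path segment used above.

```python
import re


def poster_url(poster_path, size="w500"):
    """Build the image URL, or return None when the movie has no poster."""
    if not isinstance(poster_path, str) or not poster_path:
        return None
    return f"https://image.tmdb.org/t/p/{size}{poster_path}"


def poster_filename(title, index):
    """Turn a movie title into a filesystem-safe file name."""
    # Replace every run of characters other than a-z and 0-9 with one underscore
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"poster_{index:02d}_{slug or 'untitled'}.jpg"


print(poster_url("/abc123.jpg"))  # hypothetical poster path
print(poster_filename("Spider-Man: No Way Home", 3))
```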
