# Week 1 Lab: Data Collection for Machine Learning

**CS 203: Software Tools and Techniques for AI**

---

## Lab Overview

In this lab, you will learn to collect data from the web using:

1. **HTTP fundamentals** - Understanding how the web works
2. **curl** - Command-line HTTP client
3. **Python requests** - Programmatic API calls
4. **BeautifulSoup** - Web scraping when APIs don't exist

**Goal**: Build a movie data collection pipeline for Netflix-style movie prediction.

---

## Setup

First, let's install and import the required libraries.

In [None]:
# Install required packages (uncomment if needed)
# !pip install requests beautifulsoup4 pandas

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time

print("All imports successful!")

All imports successful!


---

# Part 1: HTTP Fundamentals

Before we start collecting data, we need to understand how the web works.

## 1.1 Understanding URLs

A URL (Uniform Resource Locator) has several components:

```
https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details
└─┬──┘ └──────┬───────┘└┬─┘└───┬───┘└─────────┬────────┘└───┬───┘
  │           │         │      │              │             │
Protocol    Host      Port   Path          Query        Fragment
```

### Question 1.1 (Solved): Parse a URL

Use Python's `urllib.parse` to break down a URL into its components.

In [None]:
# SOLVED EXAMPLE
from urllib.parse import urlparse, parse_qs

url = "https://api.omdbapi.com/?apikey=demo&t=Inception&y=2010"

parsed = urlparse(url)

print(f"Scheme (protocol): {parsed.scheme}")
print(f"Host (domain): {parsed.netloc}")
print(f"Path: {parsed.path}")
print(f"Query string: {parsed.query}")

# Parse query parameters into a dictionary
params = parse_qs(parsed.query)
print(f"\nParsed parameters: {params}")

Scheme (protocol): https
Host (domain): api.omdbapi.com
Path: /
Query string: apikey=demo&t=Inception&y=2010

Parsed parameters: {'apikey': ['demo'], 't': ['Inception'], 'y': ['2010']}
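
The solved example covers the scheme, host, path, and query string. As a small extension, the sketch below parses the URL from the component diagram in section 1.1, which also carries an explicit port and a fragment. The variable names `diagram_url` and `parts` are illustrative only.

```python
from urllib.parse import urlparse

# URL taken from the component diagram in section 1.1
diagram_url = "https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details"
parts = urlparse(diagram_url)

print(f"Netloc (host:port): {parts.netloc}")
print(f"Host only: {parts.hostname}")   # netloc without the port
print(f"Port: {parts.port}")            # parsed as an integer
print(f"Path: {parts.path}")
print(f"Fragment: {parts.fragment}")    # text after '#'
```

Note that `netloc` includes the port when one is present, so `hostname` is usually the safer attribute when only the domain is needed.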


### Question 1.2: Parse a Different URL

Parse the following GitHub API URL and extract:
1. The host
2. The path
3. All query parameters as a dictionary

URL: `https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc`

In [None]:
# YOUR CODE HERE
url = "https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc"

# Parse the URL

parsed = urlparse(url)

# Print the host

print(f"Host: {parsed.netloc}")

# Print the path
print(f"Path: {parsed.path}")

# Print the query parameters as a dictionary
params = parse_qs(parsed.query)
print(f"Params: {params}")


Host: api.github.com
Path: /search/repositories
Params: {'q': ['machine learning'], 'sort': ['stars'], 'order': ['desc']}


---

## 1.2 HTTP Status Codes

HTTP status codes tell you what happened with your request:

| Range | Category | Common Examples |
|-------|----------|----------------|
| 2xx | Success | 200 OK, 201 Created |
| 3xx | Redirect | 301 Moved, 302 Found |
| 4xx | Client Error | 400 Bad Request, 401 Unauthorized, 404 Not Found |
| 5xx | Server Error | 500 Internal Error, 503 Service Unavailable |
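
The ranges in the table can be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper named `status_category`; the reason phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

def status_category(code):
    """Map a status code to the category used in the table above."""
    categories = {2: "Success", 3: "Redirect", 4: "Client Error", 5: "Server Error"}
    return categories.get(code // 100, "Other")

for code in (200, 301, 404, 429, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, status_category(code), "-", HTTPStatus(code).phrase)
```

Checking the first digit this way is often enough to decide whether a request should be retried (5xx) or fixed by the caller (4xx).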

### Question 1.3: Match Status Codes

Match each scenario to the most likely HTTP status code:

1. You requested a movie that doesn't exist in the database
2. You made too many requests and hit the rate limit
3. Your API key is invalid
4. The request was successful and data was returned
5. The server crashed while processing your request

Status codes to choose from: `200`, `401`, `404`, `429`, `500`

In [None]:
# YOUR ANSWERS HERE
answers = {
    "movie_not_found": 404,      # Replace None with the status code
    "rate_limited": 429,
    "invalid_api_key": 401,
    "success": 200,
    "server_crashed": 500
}

print(answers)

{'movie_not_found': 404, 'rate_limited': 429, 'invalid_api_key': 401, 'success': 200, 'server_crashed': 500}


---

# Part 2: Making Requests with `curl`

`curl` is a command-line tool for making HTTP requests. It's essential for quick testing.

## 2.1 Basic curl Commands

You can run shell commands in Jupyter using the `!` prefix.

### Question 2.1 (Solved): Your First API Call

Let's call a simple public API that requires no authentication.

In [None]:
# SOLVED EXAMPLE
# JSONPlaceholder is a free fake API for testing
!curl -s "https://jsonplaceholder.typicode.com/posts/1"

{
  "userId": 1,
  "id": 1,
  "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}

### Question 2.2: Pretty Print with jq

The output above is hard to read. Use `jq` to format it nicely.

**Hint**: Pipe the curl output to jq: `curl ... | jq .`

In [None]:
# YOUR CODE HERE
# Fetch the same post but format the output with jq
!curl -s "https://jsonplaceholder.typicode.com/posts/1" | jq .

{
  "userId": 1,
  "id": 1,
  "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}


### Question 2.3: Extract Specific Fields with jq

Fetch all posts from `https://jsonplaceholder.typicode.com/posts` and extract only the `title` field from each post.

**Hint**: Use `jq '.[].title'` to get the title from each element in the array.

In [None]:
# YOUR CODE HERE
!curl -s "https://jsonplaceholder.typicode.com/posts" | jq '.[].title'

"sunt aut facere repellat provident occaecati excepturi optio reprehenderit"
"qui est esse"
"ea molestias quasi exercitationem repellat qui ipsa sit aut"
"eum et est occaecati"
"nesciunt quas odio"
"dolorem eum magni eos aperiam quia"
"magnam facilis autem"
"dolorem dolore est ipsam"
"nesciunt iure omnis dolorem tempora et accusantium"
"optio molestias id quia eum"
"et ea vero quia laudantium autem"
"in quibusdam tempore odit est dolorem"
"dolorum ut in voluptas mollitia et saepe quo animi"
"voluptatem eligendi optio"
"eveniet quod temporibus"
"sint suscipit perspiciatis velit dolorum rerum ipsa laboriosam odio"
"fugit voluptas sed molestias voluptatem provident"
"voluptate et itaque vero tempora molestiae"
"adipisci placeat illum aut reiciendis qui"
"doloribus ad provident

### Question 2.4: View Response Headers

Use the `-I` flag to fetch only the response headers (no body) from:
`https://api.github.com`

What is the value of the `X-RateLimit-Limit` header?

In [None]:
# YOUR CODE HERE
!curl -s https://api.github.com -I

# The value of X-RateLimit-Limit is 60 (requests per hour for unauthenticated clients)

HTTP/2 200
date: Tue, 13 Jan 2026 03:10:18 GMT
cache-control: public, max-age=60, s-maxage=60
vary: Accept,Accept-Encoding, Accept, X-Requested-With
x-github-api-version-selected: 2022-11-28
access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset
access-control-allow-origin: *
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 0
referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
content-security-policy: default-src 'none'
server: github.com
content-type: application/json; charset=utf-8
x

### Question 2.5: Add Custom Headers

Make a request to `https://httpbin.org/headers` with the following custom headers:
- `User-Agent: CS203-Lab/1.0`
- `Accept: application/json`

**Hint**: Use `-H "Header-Name: value"` for each header.

In [None]:
# YOUR CODE HERE
!curl -s https://httpbin.org/headers -H "User-Agent: CS203-Lab/1.0" -H "Accept: application/json"

{
  "headers": {
    "Accept": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "CS203-Lab/1.0", 
    "X-Amzn-Trace-Id": "Root=1-6965b7b5-44edbd7d7361c6d15e91f85a"
  }
}


---

# Part 3: Python `requests` Library

While `curl` is great for testing, we need Python for automation.

## 3.1 Basic GET Requests

### Question 3.1 (Solved): Simple GET Request

Make a GET request and inspect the response object.

In [None]:
# SOLVED EXAMPLE
import requests

response = requests.get("https://jsonplaceholder.typicode.com/posts/1")

print(f"Status Code: {response.status_code}")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"Response OK: {response.ok}")
print(f"\nJSON Data:")
print(response.json())

Status Code: 200
Content-Type: application/json; charset=utf-8
Response OK: True

JSON Data:
{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}


### Question 3.2: Fetch Multiple Posts

Fetch posts from `https://jsonplaceholder.typicode.com/posts` and:
1. Print the total number of posts
2. Print the titles of the first 5 posts

In [None]:
# YOUR CODE HERE
response = requests.get("https://jsonplaceholder.typicode.com/posts")
data = response.json()

total_post = len(data)
title = [p["title"] for p in data[:5]]

print("total post: ", total_post)
print("title of first 5:\n", title)

total post:  100
title of first 5:
 ['sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'qui est esse', 'ea molestias quasi exercitationem repellat qui ipsa sit aut', 'eum et est occaecati', 'nesciunt quas odio']


### Question 3.3 (Solved): Using Query Parameters

The proper way to add query parameters is using the `params` argument.

In [None]:
# SOLVED EXAMPLE
import requests

# Bad way (manual string building)
# url = "https://jsonplaceholder.typicode.com/posts?userId=1"

# Good way (using params)
response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 1}
)

posts = response.json()
print(f"User 1 has {len(posts)} posts")
print(f"\nActual URL used: {response.url}")

User 1 has 10 posts

Actual URL used: https://jsonplaceholder.typicode.com/posts?userId=1
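
One reason manual string building is discouraged is that values containing spaces or reserved characters must be encoded. The sketch below illustrates the difference with the standard library's `urlencode`; `requests` performs similar encoding when given `params`. The query value is a made-up example.

```python
from urllib.parse import urlencode

base = "https://api.github.com/search/repositories"
params = {"q": "machine learning & AI", "sort": "stars"}

# Manual string building leaves the spaces and '&' unescaped,
# so the server would see a broken query string
manual = f"{base}?q={params['q']}&sort={params['sort']}"

# urlencode escapes reserved characters and joins the pairs with '&'
encoded = f"{base}?{urlencode(params)}"

print(manual)
print(encoded)
```

In the encoded version the space becomes `+` and the literal `&` becomes `%26`, so it can no longer be mistaken for a parameter separator.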


### Question 3.4: Filter Posts by User

Fetch all posts by user 5 and user 7. Compare how many posts each user has.

**Hint**: Make two separate requests with different `userId` values.

In [None]:
# YOUR CODE HERE
response5 = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 5}
)
response7 = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 7}
)

data5 = response5.json()
data7 = response7.json()

print("total post of user 5", len(data5))
print("total post of user 7", len(data7))

total post of user 5 10
total post of user 7 10


---

## 3.2 Working with Real APIs

Let's work with some real-world APIs.

### Question 3.5 (Solved): GitHub API - Public Repositories

The GitHub API is free to use (with rate limits) and doesn't require authentication for public data.
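
Because the API is rate limited, it can help to check how much quota is left before sending more requests. The header names below appear in the `access-control-expose-headers` output from Question 2.4. The helper `seconds_until_reset` is hypothetical, and the sketch assumes the reset header holds UTC epoch seconds.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before the rate limit window resets.

    Assumes the remaining count is an integer and the reset time is
    given in UTC epoch seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    return max(0.0, reset_at - now)

# Hypothetical header values for illustration (not from a real response)
example = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(example, now=1700000000))
```

With `requests`, `response.headers` can be passed directly; it is case-insensitive, whereas the plain dictionary used here is not.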

In [None]:
# SOLVED EXAMPLE
import requests

# Fetch information about a popular repository
response = requests.get(
    "https://api.github.com/repos/pandas-dev/pandas",
    headers={"Accept": "application/vnd.github.v3+json"}
)

if response.ok:
    repo = response.json()
    print(f"Repository: {repo['full_name']}")
    print(f"Description: {repo['description']}")
    print(f"Stars: {repo['stargazers_count']:,}")
    print(f"Forks: {repo['forks_count']:,}")
    print(f"Language: {repo['language']}")
else:
    print(f"Error: {response.status_code}")

Repository: pandas-dev/pandas
Description: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Stars: 47,562
Forks: 19,505
Language: Python


### Question 3.6: Compare Popular ML Libraries

Fetch information about these ML-related repositories and create a comparison table:
- `scikit-learn/scikit-learn`
- `pytorch/pytorch`
- `tensorflow/tensorflow`

Show: name, stars, forks, and primary language.

**Hint**: Loop through the repos and collect data into a list of dictionaries, then create a DataFrame.

In [None]:
# YOUR CODE HERE
repos = [
    "scikit-learn/scikit-learn",
    "pytorch/pytorch",
    "tensorflow/tensorflow"
]

df = []

# Fetch data for each repo
for repo in repos:
  response = requests.get(
      f"https://api.github.com/repos/{repo}",
      headers={"Accept": "application/vnd.github.v3+json"}
  )
  data = response.json()
  dic = {}
  dic['name']= data['full_name']
  dic['description'] = data['description']
  dic['stars'] = data['stargazers_count']
  dic['forks'] = data['forks_count']
  dic['language'] = data['language']

  df.append(dic)

# Create a DataFrame
df = pd.DataFrame(df)
df


| | name | description | stars | forks | language |
|---|------|-------------|-------|-------|----------|
| 0 | scikit-learn/scikit-learn | scikit-learn: machine learning in Python | 64608 | 26599 | Python |
| 1 | pytorch/pytorch | Tensors and Dynamic neural networks in Python ... | 96569 | 26487 | Python |
| 2 | tensorflow/tensorflow | An Open Source Machine Learning Framework for ... | 193318 | 75152 | C++ |


### Question 3.7: Search GitHub Repositories

Use the GitHub search API to find the top 10 most starred repositories with "machine learning" in their description.

API endpoint: `https://api.github.com/search/repositories`

Parameters:
- `q`: search query (e.g., "machine learning")
- `sort`: "stars"
- `order`: "desc"
- `per_page`: 10

Print the name and star count of each repository.

In [None]:
# YOUR CODE HERE
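res = requests.get(
    "https://api.github.com/search/repositories",
    headers={"Accept": "application/vnd.github.v3+json"},
    params={"q": "machine learning", "sort": "stars", "order": "desc", "per_page": 10},
    timeout=10,
)

# An error payload (for example, a rate-limit response) has no "items" key,
# so indexing it directly raises the KeyError recorded below this cell.
if not res.ok:
    print(f"Error: {res.status_code}")
items = res.json().get("items", []) if res.ok else []

repo_rows = []
# enumerate yields (index, item); start=1 gives a 1-based rank
for rank, repo in enumerate(items, start=1):
    repo_rows.append({
        "rank": rank,
        "name": repo["name"],
        "stars": repo["stargazers_count"],
    })

repo_df = pd.DataFrame(repo_rows)
repo_df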

KeyError: 'items'

---

## 3.3 Error Handling

Real-world APIs fail. We need to handle errors gracefully.

### Question 3.8 (Solved): Handling HTTP Errors

In [None]:
# SOLVED EXAMPLE
import requests

def fetch_with_error_handling(url):
    """Fetch URL with proper error handling."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Timeout: Request took too long")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e.response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None

# Test with valid URL
print("Valid URL:")
data = fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/1")
if data:
    print(f"  Got post: {data['title'][:50]}...")

# Test with invalid URL (404)
print("\nInvalid URL (404):")
fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/99999")

Valid URL:
  Got post: sunt aut facere repellat provident occaecati excep...

Invalid URL (404):
HTTP Error: 404


### Question 3.9: Robust Fetcher Function

Write a function `safe_fetch(url, max_retries=3)` that:

1. Attempts to fetch the URL
2. If it fails with a 5xx error, retries up to `max_retries` times
3. Waits 1 second between retries
4. Returns the JSON data if successful, None otherwise

Test it with `https://httpbin.org/status/500` (always returns 500) and `https://jsonplaceholder.typicode.com/posts/1` (always works).

In [None]:
# YOUR CODE HERE
import time

def safe_fetch(url, max_retries=3):
    """Fetch URL with retry logic for server errors."""
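    while True:
        try:
            if max_retries < 0:
                return None
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            # Retry only on 5xx server errors
            if 500 <= e.response.status_code < 600:
                time.sleep(1)
                print("retrying")
                max_retries -= 1
            else:
                # 4xx errors will not succeed on retry; without this
                # branch the loop would never terminate
                return None
        except requests.exceptions.RequestException:
            # Timeouts, connection failures, and similar problems
            return None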


# Test your function
print("Testing with working URL:")
result = safe_fetch("https://jsonplaceholder.typicode.com/posts/1")
print(f"Result: {result}")

print("\nTesting with failing URL (500):")
result = safe_fetch("https://httpbin.org/status/500")
print(f"Result: {result}")

Testing with working URL:
Result: {'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}

Testing with failing URL (500):
retrying
retrying
retrying
retrying
Result: None
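
The retry logic can also be separated from the network call, which makes it possible to test without contacting a server. The following is a sketch of that idea, not part of `requests`: `retry_call`, `TransientError`, and `flaky` are hypothetical names, and `TransientError` stands in for a 5xx response.

```python
import time

class TransientError(Exception):
    """Stand-in for a 5xx response (hypothetical, for illustration only)."""

def retry_call(func, max_retries=3, delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError retry up to max_retries times."""
    for attempt in range(max_retries + 1):  # first try + max_retries retries
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                return None  # out of retries
            sleep(delay)

# Simulated endpoint that fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated 500")
    return {"status": "ok"}

# Pass a no-op sleep so the demo does not actually wait
result = retry_call(flaky, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

Injecting `sleep` as a parameter is a design choice that keeps tests fast; in real use the default `time.sleep` applies the delay between attempts.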


---

# Part 4: The OMDb Movie API

Now let's work with the OMDb API - our main data source for the Netflix project.

**Note**: You need an API key from https://www.omdbapi.com/apikey.aspx (free tier available).

For this lab, we'll use a demo key that has limited functionality.

In [None]:
# Set your API key here
# Get a free key from: https://www.omdbapi.com/apikey.aspx
OMDB_API_KEY = "4c3965c5"  # Replace with your actual key

# For demo purposes, you can try with key "demo" but it's very limited
# OMDB_API_KEY = "demo"
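
An alternative to writing the key into the notebook is to read it from an environment variable, so the key is not saved with the file. This is a minimal sketch; the helper `load_api_key` and the variable name `OMDB_API_KEY` in the environment are assumptions, not something the OMDb API requires.

```python
import os

def load_api_key(env_var="OMDB_API_KEY", default="demo"):
    """Read the API key from the environment, falling back to the limited demo key."""
    return os.environ.get(env_var, default)

key = load_api_key()
# Report only where the key came from, never the key itself
source = "environment" if "OMDB_API_KEY" in os.environ else "default"
print(f"API key loaded from: {source}")
```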

### Question 4.1 (Solved): Fetch a Single Movie

In [None]:
# SOLVED EXAMPLE
import requests

def fetch_movie(title, year=None, api_key=OMDB_API_KEY):
    """Fetch movie data from OMDb API."""
    params = {
        "apikey": api_key,
        "t": title,  # Search by title
        "type": "movie"
    }
    if year:
        params["y"] = year

    response = requests.get("https://www.omdbapi.com/", params=params)

    if response.ok:
        data = response.json()
        if data.get("Response") == "True":
            return data
        else:
            print(f"Movie not found: {data.get('Error')}")
    return None

# Fetch The 13th Warrior
movie = fetch_movie("The 13th Warrior", 1999)
print(movie)
if movie:
    print(f"Title: {movie['Title']}")
    print(f"Year: {movie['Year']}")
    print(f"Director: {movie['Director']}")
    print(f"IMDB Rating: {movie['imdbRating']}")
    print(f"Genre: {movie['Genre']}")

{'Title': 'The 13th Warrior', 'Year': '1999', 'Rated': 'R', 'Released': '27 Aug 1999', 'Runtime': '102 min', 'Genre': 'Action, Adventure, History', 'Director': 'John McTiernan', 'Writer': 'Michael Crichton, William Wisher, Warren Lewis', 'Actors': 'Antonio Banderas, Diane Venora, Dennis Storhøi', 'Plot': 'A man, having fallen in love with the wrong woman, is sent by the sultan himself on a diplomatic mission to a distant land as an ambassador. Stopping at a Viking village port to restock on supplies, he finds himself unwittingly em...', 'Language': 'English, Latin, Swedish, Norse, Old, Danish, Arabic', 'Country': 'United States', 'Awards': '2 wins & 2 nominations total', 'Poster': 'https://m.media-amazon.com/images/M/MV5BY2IwMTYyNjctYzhjZi00Y2Y3LWE3NjktMGVjMjFhNzk0NWFjXkEyXkFqcGc@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '6.6/10'}, {'Source': 'Rotten Tomatoes', 'Value': '34%'}, {'Source': 'Metacritic', 'Value': '42/100'}], 'Metascore': '42', 'imdbRatin

### Question 4.2: Explore the Response

Fetch data for "The Dark Knight" and print ALL available fields in the response.

Which fields might be useful for predicting movie success?

In [None]:
# YOUR CODE HERE
movie = fetch_movie("The Dark Knight")
movie

# Useful fields: 'Rated', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'BoxOffice'

{'Title': 'The Dark Knight',
 'Year': '2008',
 'Rated': 'PG-13',
 'Released': '18 Jul 2008',
 'Runtime': '152 min',
 'Genre': 'Action, Crime, Drama',
 'Director': 'Christopher Nolan',
 'Writer': 'Jonathan Nolan, Christopher Nolan, David S. Goyer',
 'Actors': 'Christian Bale, Heath Ledger, Aaron Eckhart',
 'Plot': 'When a menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman, James Gordon and Harvey Dent must work together to put an end to the madness.',
 'Language': 'English, Mandarin',
 'Country': 'United States, United Kingdom',
 'Awards': 'Won 2 Oscars. 163 wins & 165 nominations total',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '9.1/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '94%'},
  {'Source': 'Metacritic', 'Value': '85/100'}],
 'Metascore': '85',
 'imdbRating': '9.1',
 'imdbVotes': '3,115,102',
 'imdbID': 'tt0468569',
 

### Question 4.3: Fetch Multiple Movies

Create a function `fetch_movies(titles)` that:
1. Takes a list of movie titles
2. Fetches data for each movie
3. Returns a list of movie dictionaries (only successful fetches)
4. Adds a 0.5 second delay between requests (to respect rate limits)

Test it with: `["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]`

In [None]:
def fetch_movie(title, api_key=OMDB_API_KEY):
    """Fetch movie data from OMDb API. Returns None on any failure."""
    params = {
        "apikey": api_key,
        "t": title,  # Search by title
        "type": "movie"
    }

    try:
        response = requests.get("https://www.omdbapi.com/", params=params, timeout=10)
    except requests.exceptions.RequestException as e:
        # Network problems (timeouts, DNS failures, ...) should not stop the batch
        print(f"Request failed for {title}: {e}")
        return None

    if response.ok:
        data = response.json()
        if data.get("Response") == "True":
            return data
        print(f"Movie not found: {title}")
    return None

# YOUR CODE HERE
def fetch_movies(titles):
    """Fetch multiple movies from OMDb API."""
    fetched_movies = []
    for title in titles:
        fetched_movie = fetch_movie(title)  # None when the fetch failed
        if fetched_movie:
            fetched_movies.append(fetched_movie)
        time.sleep(0.5)  # delay between requests to respect rate limits

    return fetched_movies


# Test
test_titles = ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]
movies = fetch_movies(test_titles)
print(f"Successfully fetched {len(movies)} out of {len(test_titles)} movies")

Movie not found: NonExistentMovie123
Successfully fetched 3 out of 4 movies


### Question 4.4: Create a Movie DataFrame

Using the movies you fetched, create a pandas DataFrame with these columns:
- title
- year (as integer)
- genre
- director
- imdb_rating (as float)
- imdb_votes (as integer, remove commas)
- runtime_minutes (as integer, extract from "148 min")
- box_office (keep as string for now)

**Hint**: You'll need to clean the data types.

In [None]:
# YOUR CODE HERE
import pandas as pd

data = []
for movie in movies:
  data.append({
      "title": movie['Title'],
      "year": int(movie['Year']),
      "genre": movie['Genre'],
      "director": movie['Director'],
      "imdb_rating": float(movie['imdbRating']),
      "imdb_votes":int(movie['imdbVotes'].replace(',', '')),
      "runtime_minutes": int(movie['Runtime'].split(" ")[0]),
      "box_office": movie['BoxOffice']
  })

df = pd.DataFrame(data)
df

| | title | year | genre | director | imdb_rating | imdb_votes | runtime_minutes | box_office |
|---|-------|------|-------|----------|-------------|------------|-----------------|------------|
| 0 | Inception | 2010 | Action, Adventure, Sci-Fi | Christopher Nolan | 8.8 | 2767518 | 148 | $292,587,330 |
| 1 | The Matrix | 1999 | Action, Sci-Fi | Lana Wachowski, Lilly Wachowski | 8.7 | 2217731 | 136 | $177,559,005 |
| 2 | Interstellar | 2014 | Adventure, Drama, Sci-Fi | Christopher Nolan | 8.7 | 2454660 | 169 | $203,227,580 |
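
Direct `int()` and `float()` conversions work for these three movies but fail as soon as a field is missing. The sketch below shows one defensive approach, assuming missing values appear as the string `"N/A"` or as an absent key. The helpers `to_int` and `to_float` and the `sample` record are hypothetical.

```python
def to_int(value):
    """Convert strings like '2,767,518', '148 min', or '$292,587,330' to int.

    Returns None for missing values such as 'N/A' or an absent key.
    """
    if not value or value == "N/A":
        return None
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return int(digits) if digits else None

def to_float(value):
    """Convert a rating string like '8.8' to float, or None when missing."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Hypothetical record with a missing box office value
sample = {"imdbVotes": "2,767,518", "Runtime": "148 min",
          "imdbRating": "8.8", "BoxOffice": "N/A"}
print(to_int(sample["imdbVotes"]), to_int(sample["Runtime"]),
      to_float(sample["imdbRating"]), to_int(sample.get("BoxOffice")))
```

Returning `None` rather than raising lets pandas represent the gap as a missing value, which can be handled later during cleaning.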


### Question 4.5: Search Movies by Title

OMDb also has a search endpoint that returns multiple results.

Use the `s` parameter instead of `t` to search for movies containing "Star Wars".

API endpoint: `https://www.omdbapi.com/?apikey=YOUR_KEY&s=Star Wars&type=movie`

Print the title and year of each result.

In [None]:
# YOUR CODE HERE
params = {
    "apikey": OMDB_API_KEY,
    "s": "Star Wars",
    "type": "movie",
}

res = requests.get("https://www.omdbapi.com/", params=params)
data = res.json()
for movie in data['Search']:
  print("Title: ", movie['Title'])
  print("Year: ", movie['Year'])
  print()

Title:  Star Wars: Episode IV - A New Hope
Year:  1977

Title:  Star Wars: Episode V - The Empire Strikes Back
Year:  1980

Title:  Star Wars: Episode VI - Return of the Jedi
Year:  1983

Title:  Star Wars: Episode VII - The Force Awakens
Year:  2015

Title:  Star Wars: Episode I - The Phantom Menace
Year:  1999

Title:  Star Wars: Episode III - Revenge of the Sith
Year:  2005

Title:  Star Wars: Episode II - Attack of the Clones
Year:  2002

Title:  Rogue One: A Star Wars Story
Year:  2016

Title:  Star Wars: Episode VIII - The Last Jedi
Year:  2017

Title:  Star Wars: Episode IX - The Rise of Skywalker
Year:  2019



### Question 4.6: Handle Pagination

The OMDb search API returns 10 results per page and includes a `totalResults` field.

Write a function `search_all_movies(query)` that:
1. Searches for movies matching the query
2. Fetches ALL pages of results (use the `page` parameter)
3. Returns a list of all movies found

**Hint**: `totalResults` tells you how many movies exist. Divide by 10 and round up to get the number of pages, so that a final partial page is not missed.

Test with a query that has many results like "Batman".
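
The page count is a ceiling division: any remainder means one extra page. A minimal sketch follows, with a hypothetical helper `page_count`; it assumes `totalResults` may arrive as a string and converts it first.

```python
import math

def page_count(total_results, per_page=10):
    """Number of pages needed to cover total_results items."""
    return math.ceil(int(total_results) / per_page)

# Try a few totals, passed as strings to mimic JSON values
for total in ("9", "10", "11", "515"):
    print(total, "results ->", page_count(total), "page(s)")
```

Plain floor division (`//`) would report 51 pages for 515 results and silently drop the last five movies.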

In [None]:
# YOUR CODE HERE
def search_all_movies(query, api_key=OMDB_API_KEY):
    """Search OMDb and return ALL matching movies across all pages."""
    params = {
        "apikey": api_key,
        "s": query,
        "type": "movie",
        "page": 1,
    }
    movies = []
    # Fetch each page exactly once. Requesting page 1 up front and then
    # looping from page 1 again stores its 10 results twice (the 520
    # reported in the recorded run below includes those duplicates).
    while True:
        res = requests.get("https://www.omdbapi.com/", params=params, timeout=10)
        data = res.json()
        if data.get("Response") != "True":
            break  # no results, or an error payload
        movies += data["Search"]
        # Round up so a final partial page is not dropped
        total_pages = (int(data["totalResults"]) + 9) // 10
        if params["page"] >= total_pages:
            break
        params["page"] += 1
        time.sleep(0.01)

    return movies

# Test
all_batman = search_all_movies("Batman")
print(all_batman[0])
print(f"Found {len(all_batman)} Batman movies")

{'Title': 'Batman Begins', 'Year': '2005', 'imdbID': 'tt0372784', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMzA2NDQzZDEtNDU5Ni00YTlkLTg2OWEtYmQwM2Y1YTBjMjFjXkEyXkFqcGc@._V1_SX300.jpg'}
Found 520 Batman movies


---

# Part 5: Web Scraping with BeautifulSoup

When APIs don't exist or don't have what we need, we scrape.

## 5.1 HTML Basics

### Question 5.1 (Solved): Parse HTML

In [None]:
# SOLVED EXAMPLE
from bs4 import BeautifulSoup

html = """
<html>
<body>
    <div class="movie" id="movie-1">
        <h2 class="title">Inception</h2>
        <span class="year">2010</span>
        <span class="rating">8.8</span>
        <a href="/movies/inception">More Info</a>
    </div>
    <div class="movie" id="movie-2">
        <h2 class="title">The Matrix</h2>
        <span class="year">1999</span>
        <span class="rating">8.7</span>
        <a href="/movies/matrix">More Info</a>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find all movie divs
movies = soup.find_all('div', class_='movie')
print(f"Found {len(movies)} movies\n")

# Extract data from each
for movie in movies:
    title = movie.find('h2', class_='title').text
    year = movie.find('span', class_='year').text
    rating = movie.find('span', class_='rating').text
    link = movie.find('a')['href']

    print(f"{title} ({year}) - Rating: {rating} - Link: {link}")

Found 2 movies

Inception (2010) - Rating: 8.8 - Link: /movies/inception
The Matrix (1999) - Rating: 8.7 - Link: /movies/matrix
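
The extracted links are relative paths, so they need to be combined with the page's address before they can be requested. The sketch below uses the standard library's `urljoin` for this. The base URL is hypothetical, since the sample HTML does not say where it was served from.

```python
from urllib.parse import urljoin

# Hypothetical page address for illustration
base_url = "https://example.com/catalog/page1.html"

for href in ["/movies/inception", "/movies/matrix", "details.html"]:
    # A leading '/' resolves against the site root; otherwise against the current directory
    print(href, "->", urljoin(base_url, href))
```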


### Question 5.2: CSS Selectors

Rewrite the above extraction using CSS selectors (`.select()` and `.select_one()`) instead of `.find()` and `.find_all()`.

**Hint**:
- `.movie` selects elements with class "movie"
- `.movie .title` selects elements with class "title" inside class "movie"

In [None]:
# YOUR CODE HERE
# Use the same 'soup' from above

# Extract using CSS selectors
movies = soup.select('.movie')
print(f"found {len(movies)} movies")
# print(movies)
for movie in movies:
  title = movie.select_one('.movie .title').text
  year = movie.select_one('.movie .year').text
  rating = movie.select_one('.movie .rating').text
  link = movie.select_one('.movie a')['href']

  print(f"{title} ({year}) - Rating: {rating} - Link: {link}")


found 2 movies
Inception (2010) - Rating: 8.8 - Link: /movies/inception
The Matrix (1999) - Rating: 8.7 - Link: /movies/matrix


### Question 5.3: Scrape a Real Website

Let's scrape the example website `http://quotes.toscrape.com/` which is designed for scraping practice.

Extract all quotes from the first page, including:
- The quote text
- The author name
- The tags

Return the results as a list of dictionaries.

In [None]:
# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "http://quotes.toscrape.com/"

# Parse the HTML
res = requests.get(url)
html = res.text
soup = BeautifulSoup(html, 'html.parser')

# Extract quotes
scraped_quotes = []
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
  text = quote.find('span', class_='text').text
  author = quote.find('small', class_='author').text
  tags = [tag.text for tag in quote.find_all('a', class_='tag')]
  scraped_quotes.append({
      "text": text,
      "author": author,
      "tags": tags
  })

scraped_quotes

[{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
 {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices']},
 {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']},
 {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor']},
 {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'author': 'Marilyn Monroe',
  'tags': ['be-

### Question 5.4: Handle Pagination in Scraping

The quotes website has multiple pages. Scrape the first 3 pages and collect all quotes.

Pages follow the pattern:
- Page 1: `http://quotes.toscrape.com/page/1/`
- Page 2: `http://quotes.toscrape.com/page/2/`
- etc.

**Remember**: Add a delay between requests to be polite!
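
The URL pattern above can be generated rather than typed out. This is a small sketch with a hypothetical helper `page_urls`; a scraping loop would then request each URL in turn, with a `time.sleep()` call between requests.

```python
def page_urls(first_page, last_page, base="http://quotes.toscrape.com"):
    """Build the list of page URLs following the /page/N/ pattern."""
    return [f"{base}/page/{n}/" for n in range(first_page, last_page + 1)]

urls = page_urls(1, 3)
for url in urls:
    print(url)
```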

In [None]:
import time

# YOUR CODE HERE
def scrape_page(page_number):
  print("scraping page: ", page_number)
  res = requests.get(f"http://quotes.toscrape.com/page/{page_number}/")
  html = res.text
  soup = BeautifulSoup(html, 'html.parser')

  # Extract quotes
  scraped_quotes = []
  quotes = soup.find_all('div', class_='quote')
  for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    scraped_quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })
  time.sleep(0.5)
  return scraped_quotes

quotes = []
for page in range(1, 4):
  quotes += scrape_page(page)

quotes

scraping page:  1
scraping page:  2
scraping page:  3


[{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
 {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices']},
 {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']},
 {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor']},
 {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'author': 'Marilyn Monroe',
  'tags': ['be-

### Question 5.5: Extract Table Data

Scrape the table from `https://www.w3schools.com/html/html_tables.asp`.

The table contains company data. Extract all rows and create a pandas DataFrame.

**Hint**: Look for `<table>`, `<tr>` (table row), `<th>` (header), and `<td>` (data cell) elements.
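
To make the row/header/cell structure concrete, here is a dependency-free sketch that walks a small inline table with Python's built-in `html.parser`. The `TableParser` class and the sample HTML are illustrative assumptions; in the lab you would more likely do the same walk with BeautifulSoup's `find_all('tr')`, `find_all('th')`, and `find_all('td')`.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Ernst Handel</td><td>Roland Mendel</td><td>Austria</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect <th> text as headers and each row's <td> text as a list."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._row = None       # Cells of the <tr> currently being read
        self._cell_tag = None  # "th" or "td" while inside a cell
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell_tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._cell_tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            text = "".join(self._buf).strip()
            if tag == "th":
                self.headers.append(text)
            else:
                self._row.append(text)
            self._cell_tag = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # Header-only rows contain no <td> cells
                self.rows.append(self._row)
            self._row = None

parser = TableParser()
parser.feed(SAMPLE_HTML)
records = [dict(zip(parser.headers, row)) for row in parser.rows]
print(records)
```

Reading the column names from `<th>` rather than hardcoding them means the same code keeps working if the table gains a column.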

In [None]:
# YOUR CODE HERE
# Hint: pandas has a read_html() function that can do this automatically!
# But try doing it manually first to understand the process.
import pandas as pd

res = requests.get("https://www.w3schools.com/html/html_tables.asp", timeout=10)
res.raise_for_status()
html = res.text
soup = BeautifulSoup(html, 'html.parser')

# find_all('tr') returns every row; the first one holds the <th> headers
rows = soup.find("table", class_="ws-table-all").find_all('tr')

records = []

for row in rows[1:]:  # Skip the header row
  fields = row.find_all('td')
  records.append({
      "Company": fields[0].text,
      "Contact": fields[1].text,
      "Country": fields[2].text,
  })

df = pd.DataFrame(records)
df

Unnamed: 0,Company,Contact,Country
0,Alfreds Futterkiste,Maria Anders,Germany
1,Centro comercial Moctezuma,Francisco Chang,Mexico
2,Ernst Handel,Roland Mendel,Austria
3,Island Trading,Helen Bennett,UK
4,Laughing Bacchus Winecellars,Yoshi Tannamuri,Canada
5,Magazzini Alimentari Riuniti,Giovanni Rovelli,Italy


---

# Part 6: Building the Movie Data Pipeline

Now let's put everything together to build a complete data collection pipeline for our Netflix project.

## 6.1 The Complete Pipeline

### Question 6.1 (Solved): Movie Data Collector Class

In [None]:
# SOLVED EXAMPLE
import requests
import pandas as pd
import time
from typing import List, Dict, Optional

class MovieDataCollector:
    """Collect movie data from OMDb API."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://www.omdbapi.com/"  # HTTPS keeps the API key out of clear text
        self.delay = 0.1  # Seconds between requests

    def fetch_movie(self, title: str, year: Optional[int] = None) -> Optional[Dict]:
        """Fetch a single movie by title."""
        params = {
            "apikey": self.api_key,
            "t": title,
            "type": "movie"
        }
        if year:
            params["y"] = year

        try:
            response = requests.get(self.base_url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()

            if data.get("Response") == "True":
                return data
        except Exception as e:
            print(f"Error fetching {title}: {e}")

        return None

    def fetch_movies(self, titles: List[str]) -> List[Dict]:
        """Fetch multiple movies."""
        movies = []

        for i, title in enumerate(titles):
            print(f"Fetching {i+1}/{len(titles)}: {title}")
            movie = self.fetch_movie(title)

            if movie:
                movies.append(movie)

            time.sleep(self.delay)

        return movies

    def to_dataframe(self, movies: List[Dict]) -> pd.DataFrame:
        """Convert movie data to cleaned DataFrame."""
        if not movies:
            return pd.DataFrame()

        # Extract relevant fields
        rows = []
        for m in movies:
            rows.append({
                "title": m.get("Title"),
                "year": m.get("Year"),
                "genre": m.get("Genre"),
                "director": m.get("Director"),
                "actors": m.get("Actors"),
                "imdb_rating": m.get("imdbRating"),
                "imdb_votes": m.get("imdbVotes"),
                "runtime": m.get("Runtime"),
                "box_office": m.get("BoxOffice"),
                "imdb_id": m.get("imdbID")
            })

        df = pd.DataFrame(rows)

        # Clean data types
        df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
        df["imdb_rating"] = pd.to_numeric(df["imdb_rating"], errors="coerce")
        df["imdb_votes"] = df["imdb_votes"].str.replace(",", "").pipe(pd.to_numeric, errors="coerce").astype("Int64")
        # Fix: str.extract returns a DataFrame, we need column 0 to get a Series
        df["runtime_min"] = df["runtime"].str.extract(r"(\d+)").iloc[:, 0].pipe(pd.to_numeric, errors="coerce").astype("Int64")

        return df

# Usage example
# collector = MovieDataCollector(OMDB_API_KEY)
# movies = collector.fetch_movies(["Inception", "The Matrix"])
# df = collector.to_dataframe(movies)
# df
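
`to_dataframe` keeps `box_office` as the raw string (for example `"$50,016,394"`). If a numeric column is wanted, one possible helper is sketched below. `parse_box_office` is a hypothetical name, and treating `"N/A"`, empty strings, and non-string values as missing is an assumption about how gaps should be represented.

```python
import re
from typing import Optional

def parse_box_office(value: object) -> Optional[int]:
    """Convert a string like "$50,016,394" to an int; return None when missing."""
    if not isinstance(value, str):
        return None  # Covers None and NaN floats
    if value.strip().upper() in ("", "N/A"):
        return None
    digits = re.sub(r"[^\d]", "", value)  # Drop "$", commas, and spaces
    return int(digits) if digits else None

print(parse_box_office("$50,016,394"))  # 50016394
print(parse_box_office("N/A"))          # None
```

With pandas, this could be applied as `df["box_office"].map(parse_box_office)`.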

### Question 6.2: Add Search Functionality

Extend the `MovieDataCollector` class to add a `search_movies(query, max_results=50)` method that:
1. Searches for movies matching the query
2. Handles pagination to get up to `max_results` movies
3. For each search result, fetches the full movie details
4. Returns the detailed movie data

**Hint**: Search results only contain basic info (title, year, poster, imdbID). You need to use the imdbID to fetch full details.
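
The pagination arithmetic is easy to get off by one, so it may help to isolate it. The sketch below assumes 10 results per search page and 1-based page numbers; `pages_to_fetch` is a hypothetical helper, not part of the collector.

```python
import math
from typing import List

def pages_to_fetch(total_results: int, max_results: int, per_page: int = 10) -> List[int]:
    """Return the 1-based page numbers needed to collect up to max_results items."""
    wanted = min(total_results, max_results)
    n_pages = math.ceil(wanted / per_page)  # Round up: 25 results need 3 pages
    return list(range(1, n_pages + 1))

print(pages_to_fetch(total_results=25, max_results=50))   # [1, 2, 3]
print(pages_to_fetch(total_results=480, max_results=50))  # [1, 2, 3, 4, 5]
print(pages_to_fetch(total_results=480, max_results=5))   # [1]
```

Integer division (`//`) would drop the partial last page in the first case, which is why the sketch rounds up.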

In [None]:
# YOUR CODE HERE
# Extend the MovieDataCollector class or add a method
import math

class ExtendedMovieDataCollector(MovieDataCollector):
  def __init__(self, api_key: str):
    super().__init__(api_key)
    self.base_url = "https://www.omdbapi.com/"  # HTTPS endpoint used for all requests below

  def fetch_movies_by_id(self, ids):
    """Fetch full details for each IMDb ID, skipping lookups that fail."""
    movies = []
    for imdb_id in ids:  # Avoid shadowing the built-in id()
      params = {"apikey": self.api_key, "type": "movie", "i": imdb_id}
      try:
        res = requests.get(self.base_url, params=params, timeout=10)
        res.raise_for_status()
        movie = res.json()
        if movie.get("Response") == "True":
          movies.append(movie)
      except Exception as e:
        print(f"Error fetching {imdb_id}: {e}")
      time.sleep(self.delay)
    return movies

  def search_movies(self, query, max_results=50):
    """Search by keyword, page through results, then fetch full details."""
    params = {"apikey": self.api_key, "type": "movie", "s": query}
    ids = []
    page = 1  # Pages are 1-based; looping from here avoids fetching page 1 twice
    try:
      while len(ids) < max_results:
        params["page"] = page
        res = requests.get(self.base_url, params=params, timeout=10)
        res.raise_for_status()
        data = res.json()
        if data.get("Response") != "True":
          break  # No matches, or an API-level error
        ids += [movie["imdbID"] for movie in data["Search"]]
        # Round up so a partial last page is still counted (10 results per page)
        total_pages = math.ceil(int(data["totalResults"]) / 10)
        if page >= total_pages:
          break
        page += 1
        time.sleep(self.delay)
    except Exception as e:
      print("Error", e)

    return self.fetch_movies_by_id(ids[:max_results])


collector = ExtendedMovieDataCollector(OMDB_API_KEY)
movies = collector.search_movies("action", max_results=5)
df = collector.to_dataframe(movies)
df

Unnamed: 0,title,year,genre,director,actors,imdb_rating,imdb_votes,runtime,box_office,imdb_id,runtime_min
0,Last Action Hero,1993,"Action, Adventure, Comedy",John McTiernan,"Arnold Schwarzenegger, F. Murray Abraham, Art ...",6.5,171378,130 min,"$50,016,394",tt0107362,130
1,Back in Action,2025,"Action, Comedy",Seth Gordon,"Jamie Foxx, Cameron Diaz, McKenna Roberts",5.9,64120,114 min,,tt21191806,114
2,Looney Tunes: Back in Action,2003,"Animation, Adventure, Comedy",Joe Dante,"Brendan Fraser, Jenna Elfman, Steve Martin",5.8,41828,91 min,"$20,991,364",tt0318155,91
3,An Action Hero,2022,"Action, Comedy, Crime",Anirudh Iyer,"Ayushmann Khurrana, Jaideep Ahlawat, Gautam Jo...",7.0,32711,130 min,,tt15600222,130
4,A Civil Action,1998,"Biography, Drama",Steven Zaillian,"John Travolta, Robert Duvall, Tony Shalhoub",6.6,32240,115 min,"$56,709,981",tt0120633,115


### Question 6.3: Build a Genre-Based Dataset

Use your collector to build a dataset of popular movies from different genres:

1. Search for 10 movies each for: "action", "comedy", "drama", "horror", "sci-fi"
2. Combine all results into a single DataFrame
3. Remove any duplicates (some movies might appear in multiple searches)
4. Save to CSV

**Note**: This might take a while due to rate limiting. Start with fewer movies for testing.
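
Steps 2 to 4 can be sketched without pandas to show the logic: combine the batches, keep the first occurrence of each `imdb_id`, and write CSV. The function names are hypothetical and the sample batches are a hand-built illustration (the repeated entry in the second batch is contrived to show de-duplication). The pandas equivalent would be `pd.concat(...)` followed by `drop_duplicates(subset=["imdb_id"])` and `to_csv(..., index=False)`.

```python
import csv
import io
from typing import Dict, Iterable, List

def combine_unique(batches: Iterable[List[Dict]], key: str = "imdb_id") -> List[Dict]:
    """Concatenate batches, keeping only the first row seen for each key."""
    seen = set()
    combined = []
    for batch in batches:
        for row in batch:
            if row[key] not in seen:
                seen.add(row[key])
                combined.append(row)
    return combined

def to_csv_text(rows: List[Dict]) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

action = [
    {"title": "Last Action Hero", "imdb_id": "tt0107362", "query_genre": "action"},
    {"title": "Action Jackson", "imdb_id": "tt0094612", "query_genre": "action"},
]
comedy = [
    {"title": "The King of Comedy", "imdb_id": "tt0085794", "query_genre": "comedy"},
    # Contrived repeat: the same film returned by a second search
    {"title": "Last Action Hero", "imdb_id": "tt0107362", "query_genre": "comedy"},
]

movies = combine_unique([action, comedy])
print(to_csv_text(movies))
```

Keeping the first occurrence means a film is labeled with the first genre query that returned it, which matches the default behavior of `drop_duplicates`.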

In [None]:
# YOUR CODE HERE
collector = ExtendedMovieDataCollector(OMDB_API_KEY)
moviesdata = pd.DataFrame()
genres = ['action', 'comedy', 'drama', 'horror', 'sci-fi']
for genre in genres:
  movies = collector.search_movies(genre, max_results=10)
  df = collector.to_dataframe(movies)
  df['query_genre'] = genre
  moviesdata = pd.concat([moviesdata, df], ignore_index=True)

moviesdata = moviesdata.drop_duplicates(subset=['imdb_id'])

moviesdata.to_csv('moviesdata.csv', index=False)  # Skip the index so no "Unnamed: 0" column appears on reload

moviesdata

Unnamed: 0,title,year,genre,director,actors,imdb_rating,imdb_votes,runtime,box_office,imdb_id,runtime_min,query_genre
0,Last Action Hero,1993,"Action, Adventure, Comedy",John McTiernan,"Arnold Schwarzenegger, F. Murray Abraham, Art ...",6.5,171378,130 min,"$50,016,394",tt0107362,130.0,action
1,Back in Action,2025,"Action, Comedy",Seth Gordon,"Jamie Foxx, Cameron Diaz, McKenna Roberts",5.9,64120,114 min,,tt21191806,114.0,action
2,Looney Tunes: Back in Action,2003,"Animation, Adventure, Comedy",Joe Dante,"Brendan Fraser, Jenna Elfman, Steve Martin",5.8,41828,91 min,"$20,991,364",tt0318155,91.0,action
3,An Action Hero,2022,"Action, Comedy, Crime",Anirudh Iyer,"Ayushmann Khurrana, Jaideep Ahlawat, Gautam Jo...",7.0,32711,130 min,,tt15600222,130.0,action
4,A Civil Action,1998,"Biography, Drama",Steven Zaillian,"John Travolta, Robert Duvall, Tony Shalhoub",6.6,32240,115 min,"$56,709,981",tt0120633,115.0,action
5,Missing in Action,1984,"Action, Adventure, Drama",Joseph Zito,"Chuck Norris, M. Emmet Walsh, David Tress",5.5,18026,101 min,"$22,812,411",tt0087727,101.0,action
6,Action Jackson,1988,"Action, Comedy, Crime",Craig R. Baxley,"Carl Weathers, Craig T. Nelson, Vanity",5.6,13166,96 min,"$20,256,975",tt0094612,96.0,action
7,Action Point,2018,Comedy,Tim Kirkby,"Johnny Knoxville, Eleanor Worthington-Cox, Chr...",5.1,12551,85 min,"$5,059,608",tt6495770,85.0,action
8,321 Action,2020,Drama,Shady Al Ramly,"Rakan Abdulwahed, Dyler, Majed Fawaz",1.0,10229,100 min,,tt13423846,100.0,action
9,Missing in Action 2: The Beginning,1985,"Action, Drama, Thriller",Lance Hool,"Chuck Norris, Soon-Tek Oh, Steven Williams",5.3,10068,100 min,"$10,755,447",tt0089604,100.0,action


### Question 6.4: Data Quality Analysis

Using the dataset you created:

1. How many movies have missing IMDb ratings?
2. How many movies have missing box office data?
3. What's the distribution of ratings? (min, max, mean, median)
4. Which directors appear most frequently?
5. What's the average runtime by genre?

These quality checks will be important for Week 2 (Data Validation)!

In [None]:
# YOUR CODE HERE

# imdb_rating was coerced to numeric in to_dataframe, so missing ratings are NaN
total_missing_imdb_ratings = moviesdata['imdb_rating'].isna().sum()
print(total_missing_imdb_ratings, "movies have missing IMDb ratings")

# box_office is still raw text, where missing values appear as the string "N/A"
total_missing_box_office = (moviesdata['box_office'] == 'N/A').sum()
print(total_missing_box_office, "movies have missing box office data")

print("min of ratings", moviesdata['imdb_rating'].min())
print("max of ratings", moviesdata['imdb_rating'].max())
print("mean of ratings", moviesdata['imdb_rating'].mean())
print("median of ratings", moviesdata['imdb_rating'].median())

# Note: value_counts() counts the placeholder "N/A" like any other director name
most_appearing_director = moviesdata['director'].value_counts().idxmax()
print(most_appearing_director, "is the most appearing director")
# runtime is text such as "130 min", so average the numeric runtime_min column instead
# average_runtime_by_genre = moviesdata.groupby('query_genre')['runtime_min'].mean()
# print("average runtime by genre")
# print(average_runtime_by_genre)
# moviesdata.drop(columns=['query_genre'], inplace=True)

0 movies have missing IMDb ratings
32 movies have missing box office data
min of ratings 1.0
max of ratings 8.3
mean of ratings 6.183999999999998
median of ratings 6.5
N/A is the most appearing director
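
The last line of output shows `N/A` winning the director count, because missing values arrive as the string `"N/A"` rather than as empty cells. A possible way to handle this, and to answer question 5, is sketched below in plain Python. The helper names are hypothetical, and the sample rows are a small hand-built set: three use values from the tables in this lab, and the fourth is a made-up row with a missing director and runtime.

```python
from collections import Counter, defaultdict
from statistics import mean

MISSING = {"N/A", "", None}  # Assumption: values to treat as "no data"

rows = [
    {"director": "John McTiernan", "runtime_min": 130, "query_genre": "action"},
    {"director": "Tim Kirkby", "runtime_min": 85, "query_genre": "action"},
    {"director": "Martin Scorsese", "runtime_min": 109, "query_genre": "comedy"},
    # Made-up row to show how gaps are skipped
    {"director": "N/A", "runtime_min": None, "query_genre": "comedy"},
]

def top_directors(rows, n=3):
    """Count directors, ignoring placeholder values."""
    counts = Counter(r["director"] for r in rows if r["director"] not in MISSING)
    return counts.most_common(n)

def mean_runtime_by_genre(rows):
    """Average runtime per query genre, skipping rows without a runtime."""
    by_genre = defaultdict(list)
    for r in rows:
        if r["runtime_min"] is not None:
            by_genre[r["query_genre"]].append(r["runtime_min"])
    return {genre: mean(values) for genre, values in by_genre.items()}

print(top_directors(rows))
print(mean_runtime_by_genre(rows))
```

In pandas, the same idea would be to replace `"N/A"` with a missing value first, then use `value_counts()` on `director` and `groupby('query_genre')['runtime_min'].mean()`.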


---

# Part 7: Challenge Problems

These are optional advanced exercises for those who finish early.

### Challenge 7.1: Rate Limit Handler

Create a `RateLimiter` class that:
1. Tracks how many requests have been made
2. Automatically adds delays to stay under a rate limit
3. Handles 429 (Too Many Requests) responses by waiting and retrying

```python
limiter = RateLimiter(requests_per_minute=30)
response = limiter.get("https://api.example.com/data")
```
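
For step 3, the wait before a retry can come from the server's `Retry-After` header when it is present, with exponential backoff as a fallback. The sketch below only computes the delay. `retry_delay` and its parameters are hypothetical, and it handles only the delta-seconds form of `Retry-After` (the header may also carry an HTTP date, which is ignored here in favor of backoff).

```python
from typing import Mapping

def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Note: a plain dict lookup is case-sensitive; requests' response.headers is not
    value = headers.get("Retry-After", "")
    if value.strip().isdigit():
        return min(float(value), cap)       # Server told us how long to wait
    return min(base * (2 ** attempt), cap)  # Otherwise 1s, 2s, 4s, ... up to cap

print(retry_delay({"Retry-After": "5"}, attempt=0))  # 5.0
print(retry_delay({}, attempt=3))                    # 8.0
```

Capping the delay keeps a misbehaving server or a long retry chain from stalling the collector indefinitely.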

In [None]:
# YOUR CODE HERE
import asyncio
import time
from collections import deque

import requests

class ParentRateLimiter():
  """Sliding-window bookkeeping shared by the sync and async limiters."""
  def __init__(self, requests_per_minute: int, max_retries: int = 3):
    self.requests_per_minute = requests_per_minute
    self.window = 60.0
    self.request_times = deque()  # Timestamps of requests inside the current window
    self.max_retries = max_retries

  def _wait_time(self):
    """Drop expired timestamps; return seconds to wait (0 if a slot is free)."""
    now = time.time()
    while self.request_times and now - self.request_times[0] >= self.window:
      self.request_times.popleft()
    if len(self.request_times) >= self.requests_per_minute:
      return self.window - (now - self.request_times[0])
    return 0.0

class SyncRateLimiter(ParentRateLimiter):
  def get(self, url, **kwargs):
    backoff = 1
    for attempt in range(self.max_retries + 1):
      wait = self._wait_time()
      while wait > 0:  # Re-check after sleeping in case the slot is still taken
        time.sleep(wait)
        wait = self._wait_time()

      self.request_times.append(time.time())
      response = requests.get(url, **kwargs)
      if response.status_code != 429 or attempt == self.max_retries:
        return response

      time.sleep(backoff)  # 429 Too Many Requests: back off, then retry
      backoff *= 2


class AsyncRateLimiter(ParentRateLimiter):
  def __init__(self, requests_per_minute: int, max_retries: int = 3):
    super().__init__(requests_per_minute, max_retries)
    self.lock = asyncio.Lock()  # One shared lock; a new Lock per call protects nothing

  async def get(self, session, url, **kwargs):
    async with self.lock:  # Serialize only the bookkeeping, not the request itself
      wait = self._wait_time()
      while wait > 0:
        await asyncio.sleep(wait)
        wait = self._wait_time()
      self.request_times.append(time.time())

    return await session.get(url, **kwargs)

limiter = SyncRateLimiter(requests_per_minute=10)

params = {
    "apikey": OMDB_API_KEY,
    "type": "movie",
    "s": "Batman",
}

response = limiter.get("https://www.omdbapi.com/", params=params)
response

<Response [200]>

### Challenge 7.2: Async Movie Collector

The synchronous approach is slow because we wait for each request to complete.

Create an async version using `aiohttp` that can fetch multiple movies concurrently (while still respecting rate limits).

Compare the time to fetch 20 movies using the sync and async approaches.
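
To see why concurrency helps without depending on the network, the sketch below simulates each request with `asyncio.sleep` and caps the number of in-flight tasks with a semaphore. `fake_fetch`, the 0.05 s latency, and the limit of 5 are made-up values for illustration; with `aiohttp`, the body of `fake_fetch` would be replaced by a real `session.get` call. The script uses `asyncio.run`, which works in a plain Python file; in a notebook cell, `await fetch_concurrent(titles)` would be used directly instead.

```python
import asyncio
import time

async def fake_fetch(title: str, latency: float = 0.05) -> str:
    """Stand-in for an HTTP request: wait, then return a dummy payload."""
    await asyncio.sleep(latency)
    return f"data for {title}"

async def fetch_sequential(titles):
    """Wait for each request to finish before starting the next."""
    return [await fake_fetch(t) for t in titles]

async def fetch_concurrent(titles, max_in_flight: int = 5):
    """Start all requests, but let at most max_in_flight run at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(title):
        async with sem:
            return await fake_fetch(title)

    return await asyncio.gather(*(bounded(t) for t in titles))

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

titles = [f"Movie {i}" for i in range(20)]
seq_result, seq_time = timed(fetch_sequential(titles))
con_result, con_time = timed(fetch_concurrent(titles))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

`asyncio.gather` returns results in the order the tasks were passed in, so the two result lists can be compared directly.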

In [None]:
!pip install aiohttp



In [None]:
import aiohttp
import asyncio
import time
import requests

# YOUR CODE HERE
# Hint: You'll need to install aiohttp: pip install aiohttp
# And use asyncio to run the async code
class AsyncSyncMovieCollector():
  def __init__(self, asynclimiter, synclimiter):
    self.asynclimiter = asynclimiter
    self.synclimiter = synclimiter
  async def get_async(self, urls, **kwargs):
    async with aiohttp.ClientSession() as session:
      tasks = [
          self.asynclimiter.get(session, url, **kwargs)
          for url in urls
      ]
      return await asyncio.gather(*tasks)

  def get_sync(self, urls, **kwargs):
    responses = []
    for url in urls:
      response = self.synclimiter.get(url, **kwargs)
      responses.append(response)
    return responses  # Without this, the caller receives None

urls = ['http://python.org']*20

asynclimiter = AsyncRateLimiter(requests_per_minute=20)
synclimiter = SyncRateLimiter(requests_per_minute=20)

async_sync_collector = AsyncSyncMovieCollector(asynclimiter, synclimiter)
st = time.time()
responses = await async_sync_collector.get_async(urls)
en = time.time()
async_time = en-st

st = time.time()
responses = async_sync_collector.get_sync(urls)
en = time.time()
sync_time = en-st

print("Async time", async_time)
print("Sync time", sync_time)

Async time 0.20731663703918457
Sync time 1.8067951202392578


### Challenge 7.3: Multi-Source Data Fusion

Create a data collection pipeline that:
1. Fetches basic movie data from OMDb
2. Enriches it with additional data from another source (e.g., Wikipedia API for plot summaries)
3. Merges the data based on movie title/year
4. Handles cases where data is missing from one source

Wikipedia API example:
```
https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)
```
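
Steps 3 and 4 amount to a left join on a normalized key: every OMDb row is kept, and the second source fills in a field when a match exists. The sketch below is one possible shape. The helper names are hypothetical, and the summary string is a placeholder, not real Wikipedia text.

```python
from typing import Dict, List, Optional, Tuple

def merge_key(title: str, year: Optional[int]) -> Tuple[str, Optional[int]]:
    """Normalize case and whitespace so small formatting differences still match."""
    return (" ".join(title.lower().split()), year)

def merge_sources(omdb_rows: List[Dict], wiki_rows: List[Dict]) -> List[Dict]:
    """Keep every OMDb row; attach a description when the other source has one."""
    summaries = {
        merge_key(w["title"], w.get("year")): w.get("extract")
        for w in wiki_rows
    }
    merged = []
    for row in omdb_rows:
        extract = summaries.get(merge_key(row["title"], row.get("year")))
        merged.append({**row, "description": extract or None})  # None marks a gap
    return merged

omdb_rows = [
    {"title": "Last Action Hero", "year": 1993},
    {"title": "The Sci-Fi Boys", "year": 2006},
]
wiki_rows = [
    {"title": "last  action hero", "year": 1993, "extract": "placeholder summary"},
]

for row in merge_sources(omdb_rows, wiki_rows):
    print(row)
```

Including the year in the key helps separate remakes that share a title; rows without a match keep `description` as `None` so the gap stays visible for later validation.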

In [None]:
import requests
import pandas as pd

# YOUR CODE HERE
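from urllib.parse import quote

def fetch_additional_data(titles):
  """Look up a Wikipedia summary for each title; use "" when none is found."""
  headers = {
      "User-Agent": "MyWikiApp/1.0 (https://example.com; contact@example.com)",
      "Accept": "application/json"
  }
  descriptions = []
  for title in titles:
    # Wikipedia page titles use underscores; quote() escapes remaining special characters
    page = quote(title.replace(" ", "_"), safe="")
    try:
      response = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/summary/{page}",
                              headers=headers, timeout=10)
      data = response.json()
    except (requests.RequestException, ValueError) as e:
      print(f"Error fetching summary for {title}: {e}")
      data = {}
    descriptions.append(data.get("extract", ""))

  return pd.Series(descriptions, index=titles.index)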

collector = ExtendedMovieDataCollector(OMDB_API_KEY)
moviesdata = pd.DataFrame()
genres = ['action', 'comedy', 'drama', 'horror', 'sci-fi']
for genre in genres:
  movies = collector.search_movies(genre, max_results=5)
  df = collector.to_dataframe(movies)
  df['genre'] = genre
  df['description'] = fetch_additional_data(df['title'])
  moviesdata = pd.concat([moviesdata, df], ignore_index=True)

moviesdata

Unnamed: 0,title,year,genre,director,actors,imdb_rating,imdb_votes,runtime,box_office,imdb_id,runtime_min,description
0,Last Action Hero,1993,action,John McTiernan,"Arnold Schwarzenegger, F. Murray Abraham, Art ...",6.5,171378,130 min,"$50,016,394",tt0107362,130,Last Action Hero is a 1993 American fantasy ac...
1,The King of Comedy,1982,comedy,Martin Scorsese,"Robert De Niro, Jerry Lewis, Diahnne Abbott",7.8,129135,109 min,"$2,536,242",tt0085794,109,(The) King of Comedy may refer to:
2,Confessions of a Teenage Drama Queen,2004,drama,Sara Sugarman,"Lindsay Lohan, Megan Fox, Adam Garcia",4.7,33006,89 min,"$29,331,068",tt0361467,89,Confessions of a Teenage Drama Queen is a 2004...
3,The Rocky Horror Picture Show,1975,horror,Jim Sharman,"Tim Curry, Susan Sarandon, Barry Bostwick",7.4,178887,100 min,"$113,028,197",tt0073629,100,The Rocky Horror Picture Show is a 1975 indepe...
4,The Sci-Fi Boys,2006,sci-fi,Paul Davids,"Peter Jackson, Ray Harryhausen, Leonard Maltin",6.9,490,80 min,,tt0800191,80,


---

# Summary

In this lab, you learned:

1. **HTTP Fundamentals**: URLs, status codes, headers
2. **curl**: Command-line HTTP requests
3. **Python requests**: Programmatic data collection
4. **Error handling**: Timeouts, retries, status codes
5. **OMDb API**: Real-world movie data
6. **BeautifulSoup**: Web scraping when APIs don't exist
7. **Data pipelines**: Building reusable collection code

## Next Week

**Week 2: Data Validation & Quality**

The data we collected today is messy! Next week we'll learn:
- Schema validation with Pydantic
- Data type cleaning
- Handling missing values
- Quality metrics

---

## Submission

Save your completed notebook and submit:
1. This notebook with all cells executed
2. The CSV file of movies you collected
3. A brief summary (1 paragraph) of what you learned