# Data Acquisition via API Calls (NewsAPI)

---

## Phase 1: Setup and Security

### Install Dependencies

In [16]:
# Install the necessary library for loading environment variables (like API keys)
!pip install python-dotenv

print("Dependencies installed successfully.")

Dependencies installed successfully.


### Import Modules
Import all necessary Python libraries. Note that `requests` is the primary tool for submitting HTTP requests.

In [17]:
import numpy as np
import pandas as pd
import os
import requests
import json
import dotenv
from datetime import datetime

print("Modules imported.")

Modules imported.


### Load Environment Variables
This loads your `.env` file so the NewsAPI key is read from the environment instead of being hard-coded in the notebook.

In [18]:
# Load the .env file to access environment variables (e.g., your API Key)
dotenv.load_dotenv()

print("Environment variables loaded.")

Environment variables loaded.


### Retrieve API Key
We read the stored API key with `os.getenv()`, which returns `None` when the variable is not set.

In [19]:
# Get the News API key from the environment variables
newskey = os.getenv('newskey')

# Don't print the key itself; just confirm that a non-empty value was loaded
if newskey:
    print("News API key successfully retrieved.")
else:
    print("ERROR: News API key not found. Check your .env file.")

News API key successfully retrieved.
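
The check above can also be written as a small helper that fails fast. This is a minimal sketch, not part of the original notebook: `require_env` is a hypothetical helper name, and it assumes the `.env` file contains a line of the form `newskey=<your key>` (the variable name `newskey` comes from the cell above).

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is missing or empty.

    Hypothetical helper for illustration; the notebook itself uses a plain if/else.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Check that your .env file defines it and that load_dotenv() has run."
        )
    return value


# Example with a placeholder value (never hard-code a real key like this)
os.environ["newskey"] = "placeholder-key"
key = require_env("newskey")
print(f"Key loaded ({len(key)} characters).")
```

Raising an exception stops the notebook at the point of failure, rather than letting a missing key surface later as a confusing API error.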


---

## Phase 2: Building the API Request Components
The request is assembled from four components: the URL root, the endpoint, the headers (client metadata and the API key), and the query parameters.

### Build User Agent Header
Many APIs require a `User-Agent` header to identify the client application. Here we call httpbin.org, which echoes back the `User-Agent` string that `requests` sends by default.

In [20]:
# 1. Get a standard User-agent string
r = requests.get('https://httpbin.org/user-agent')
useragent = json.loads(r.text)['user-agent']

# 2. Build the full headers dictionary
# The API key is sent in the 'X-Api-Key' header, which keeps it out of the request URL.
headers = {'User-agent': useragent,
           'X-Api-Key': newskey}

print(f"User-agent: {useragent}")
print("Headers dictionary created, containing the API key.")

User-agent: python-requests/2.32.4
Headers dictionary created, containing the API key.
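
Because httpbin.org only echoes back what the client sent, the same headers can be built without the extra network call. Here is a minimal sketch, assuming a hypothetical application name and version for the `User-Agent` value (`build_headers`, `my-news-notebook`, and `0.1` are illustrative names, not from the original notebook):

```python
def build_headers(api_key, app_name="my-news-notebook", app_version="0.1"):
    """Build request headers with a descriptive User-Agent and the API key."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        "User-Agent": f"{app_name}/{app_version}",
        "X-Api-Key": api_key,
    }


example_headers = build_headers("placeholder-key")
# Print only the header names so the key never appears in notebook output
print(sorted(example_headers))
```

If you prefer the library default, `requests.utils.default_user_agent()` returns the same kind of `python-requests/<version>` string that httpbin.org reported above.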


### Define URL Root and Endpoint
This defines the fixed address for the NewsAPI service we want to use.

In [21]:
# Define the fixed URL parts
root = 'https://newsapi.org'
endpoint = '/v2/everything' # This endpoint searches all articles

print(f"API Base URL: {root + endpoint}")

API Base URL: https://newsapi.org/v2/everything


### Define Query Parameters
Parameters are used to customize the search (e.g., topic, language, date).

In [22]:
# Define the search parameters as a Python dictionary
params = {'q': '"tallest mountain"',  # Topic to search for (using quotes for exact phrase)
         'searchIn': 'content',      # Search within the article content
         'language': 'en',           # Restrict to English articles
         'pageSize': 100}            # Request up to 100 articles

print("Query parameters defined:")
print(params)

Query parameters defined:
{'q': '"tallest mountain"', 'searchIn': 'content', 'language': 'en', 'pageSize': 100}
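
To see how these parameters end up in the request URL, the sketch below encodes them with the standard library's `urllib.parse.urlencode`. It illustrates the resulting query string and is not a description of `requests` internals; `requests` performs a comparable encoding when it receives `params`.

```python
from urllib.parse import urlencode

root = "https://newsapi.org"
endpoint = "/v2/everything"
params = {
    "q": '"tallest mountain"',
    "searchIn": "content",
    "language": "en",
    "pageSize": 100,
}

# urlencode percent-encodes the double quotes and turns the space into '+'
query_string = urlencode(params)
full_url = f"{root}{endpoint}?{query_string}"
print(full_url)
```

Note that the API key is absent from this URL because it travels in the headers.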


---

## Phase 3: API Execution and Parsing

### Submit the GET Request
This cell sends the request and prints the response object. A `<Response [200]>` indicates success.

In [23]:
# Combine the components and submit the GET request
r = requests.get(root + endpoint,
                headers = headers,
                params = params)

# Display the response object (it should show <Response [200]>)
print(f"Request submitted. Status: {r}")

Request submitted. Status: <Response [200]>
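
Printing the response object shows the status, but a script usually needs to stop when the call fails. The sketch below is one way to do that. It rests on assumptions: `check_newsapi_response` is a hypothetical helper, and the `code` and `message` fields are assumed to appear in NewsAPI error payloads, so they are read defensively.

```python
def check_newsapi_response(status_code, payload):
    """Raise RuntimeError unless the HTTP status is 200 and the payload status is 'ok'.

    Hypothetical helper; the 'code' and 'message' keys are assumed to be
    present on error payloads and are read defensively with .get().
    """
    if status_code != 200 or payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message provided")
        raise RuntimeError(
            f"NewsAPI request failed (HTTP {status_code}, {code}): {message}"
        )
    return payload


# Hand-written sample payloads, not real API output
ok_payload = {"status": "ok", "totalResults": 0, "articles": []}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Invalid key."}

check_newsapi_response(200, ok_payload)
try:
    check_newsapi_response(401, error_payload)
except RuntimeError as exc:
    print(exc)
```

In the notebook this could be called as `check_newsapi_response(r.status_code, r.json())`.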


### Parse the JSON Response
The response (`r.text`) is a single string containing the JSON data. We use `json.loads()` to convert this string into a usable Python dictionary.

In [24]:
# Convert the JSON response string into a Python dictionary
myjson = json.loads(r.text)

print(f"JSON response converted to a Python dictionary (Type: {type(myjson)})")
print("Top-level keys in the response:")
print(list(myjson.keys()))

JSON response converted to a Python dictionary (Type: <class 'dict'>)
Top-level keys in the response:
['status', 'totalResults', 'articles']
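
The conversion can be tried offline on a small sample. The JSON text below is hand-written to mirror the three top-level keys shown above and is not real API output. In the notebook, `r.json()` on the response object performs the same conversion in one step.

```python
import json

# Hand-written sample with the same top-level keys as the response above
sample_text = """
{
  "status": "ok",
  "totalResults": 1,
  "articles": [
    {
      "source": {"id": null, "name": "Example Source"},
      "author": "A. Writer",
      "title": "An example headline",
      "publishedAt": "2025-09-20T16:00:00Z"
    }
  ]
}
"""

sample = json.loads(sample_text)

print(list(sample.keys()))
first = sample["articles"][0]
print(first["source"]["name"])        # nested dictionaries are accessed key by key
print(first["source"]["id"] is None)  # JSON null becomes Python None
```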


### View Raw JSON Data Structure (Optional)
This cell is often left commented out to avoid printing a massive wall of text but serves as a way to inspect the data structure.

In [25]:
# Uncomment this line to inspect the full structure of the JSON dictionary
# print(json.dumps(myjson, indent=4))

### Normalize JSON to DataFrame
The key data is nested under the `articles` key. The pandas `json_normalize()` function flattens this nested data into a clean, tabular DataFrame.

In [26]:
# Use json_normalize to extract the list of article dictionaries ('articles')
news_df = pd.json_normalize(myjson, record_path = ['articles'])

print(f"DataFrame created with {len(news_df)} articles.")
print("First 5 rows of the DataFrame:")
display(news_df.head())

DataFrame created with 22 articles.
First 5 rows of the DataFrame:


| | author | title | description | url | urlToImage | publishedAt | content | source.id | source.name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Lydia Mansel | This Historic Train Climbs the Tallest Mountai... | Riding the Mount Washington Cog Railway is one... | https://www.travelandleisure.com/mount-washing... | https://s.yimg.com/ny/api/res/1.2/JmSXRiaGiWkv... | 2025-09-20T16:00:00Z | `Key Points\r\n<ul><li>The Mount Washington Cog...` | | Travel+Leisure |
| 1 | Express Web Desk | Polish skier becomes first to climb, ski down ... | In 2018, the Polish climber was the first pers... | https://indianexpress.com/article/world/polish... | https://images.indianexpress.com/2025/09/polan... | 2025-09-28T04:50:59Z | Polish skier Andrzej Bargiel made history this... | | The Indian Express |
| 2 | Georgie English | CLEAREST EVER SIGNS OF LIFE ON MARS... | NASA has revealed the clearest signs of life o... | https://www.the-sun.com/tech/15159152/nasa-mar... | https://www.the-sun.com/wp-content/uploads/sit... | 2025-09-10T17:34:37Z | NASA has revealed the clearest signs of life o... | | The-sun.com |
| 3 | ABC News | Sudan landslide claims 1,000 lives, village 'c... | The Sudan Liberation Movement/Army is appealin... | https://www.abc.net.au/news/2025-09-02/sudan-l... | https://live-production.wcms.abc-cdn.net.au/73... | 2025-09-02T07:52:51Z | At least 1,000 people were killed in a landsli... | abc-news-au | ABC News (AU) |
| 4 | sascha.pare@futurenet.com (Sascha Pare) , Sasc... | The geology that holds up the Himalayas is not... | A 100-year-old theory explaining how Asia can ... | https://www.livescience.com/planet-earth/geolo... | https://cdn.mos.cms.futurecdn.net/KMwyEqed8eMT... | 2025-08-30T15:50:00Z | Scientists may have just toppled a 100-year-ol... | | Live Science |
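
The `source.id` and `source.name` columns come from the nested `source` dictionary inside each article. The pure-Python sketch below illustrates that flattening idea on a hand-written sample. It is not how pandas implements `json_normalize()`, and `flatten` is a hypothetical helper.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the parent key as a prefix
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-written sample article, shaped like one entry of myjson['articles']
article = {
    "source": {"id": None, "name": "Example Source"},
    "author": "A. Writer",
    "title": "An example headline",
}

row = flatten(article)
print(row)
```

Each flattened dictionary corresponds to one row of the DataFrame, with the dotted keys as column names.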


---

## Phase 4: Data Analysis and Export

### Clean and Prepare for Export
Before saving, convert `publishedAt` to a datetime type and keep only the columns needed for the export.

In [27]:
# Clean up the publishedAt column (convert to datetime)
news_df['publishedAt'] = pd.to_datetime(news_df['publishedAt'])

# Select a final set of columns for the CSV
final_df = news_df[['publishedAt', 'title', 'description', 'url', 'source.name', 'author']].copy()

print("DataFrame prepared for export.")

DataFrame prepared for export.
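
`publishedAt` arrives as ISO 8601 text with a trailing `Z`, which denotes UTC. As a rough illustration of what the conversion does for a single value, here is a standard-library sketch. `parse_published_at` is a hypothetical helper that assumes the exact format seen in the output above, whereas `pd.to_datetime` converts the whole column at once.

```python
from datetime import datetime, timezone


def parse_published_at(value):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


published = parse_published_at("2025-09-20T16:00:00Z")
print(published.isoformat())
print(published.year, published.month, published.day)
```

Once the values are real datetimes, they can be sorted or filtered by date range instead of being compared as strings.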


### Export to CSV
This is the Load (L) step that completes the data acquisition process: the cleaned DataFrame is written to a timestamped CSV file.

In [28]:
# Define a filename based on the current date and time for organization
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_filename = f'news_articles_{timestamp}.csv'

# Export the final clean DataFrame to a CSV file
final_df.to_csv(output_filename, index=False)

print(f"Successfully exported {len(final_df)} articles to {output_filename}")

Successfully exported 22 articles to news_articles_20250930_212832.csv
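
The timestamped filename can be factored into a helper so the naming rule lives in one place. This is a small sketch; `make_output_filename` and its `prefix` and `when` parameters are hypothetical names introduced for illustration.

```python
from datetime import datetime


def make_output_filename(prefix, when=None):
    """Return '<prefix>_<YYYYMMDD_HHMMSS>.csv' for the given (or current) time."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.csv"


# A fixed datetime makes the result reproducible
example = make_output_filename("news_articles", datetime(2025, 9, 30, 21, 28, 32))
print(example)
```

Because the format puts the year first, the files sort chronologically when listed by name.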


---

## Function Definition (Optional Consolidation)
This final cell consolidates the steps above into a single reusable function, showing how a script would run the entire API call process.

In [29]:
def grab_latest_articles():
    """
    Consolidates all steps: builds request, calls API, parses JSON, and returns a DataFrame.

    Assumes dotenv.load_dotenv() has already been run so that 'newskey' is available.
    """

    # Prompt the user
    topic = input("Please enter your topic of interest: ")

    # Build our headers
    newskey = os.getenv('newskey')
    r = requests.get('https://httpbin.org/user-agent')
    useragent = json.loads(r.text)['user-agent']
    headers = {'User-agent': useragent,
               'X-Api-Key': newskey}

    # Build our URL and parameters
    root = 'https://newsapi.org'
    endpoint = '/v2/everything'
    params = {'q': topic,
              'searchIn': 'content',
              'language': 'en',
              'pageSize': 100}

    # Submit our request
    r = requests.get(root + endpoint,
                     headers = headers,
                     params = params)

    # Stop with an HTTPError if the API returned a 4xx/5xx status
    r.raise_for_status()

    # Create and return the pandas DataFrame
    myjson = json.loads(r.text)
    news_df = pd.json_normalize(myjson, record_path = ['articles'])

    return news_df

In [30]:
grab_latest_articles()

| | author | title | description | url | urlToImage | publishedAt | content | source.id | source.name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Elissa Welle | Grammarly can now fix your Spanish and French ... | For 16 years, a team of linguists carefully cr... | https://www.theverge.com/news/775144/grammarly... | https://platform.theverge.com/wp-content/uploa... | 2025-09-10T12:51:11Z | `<ul><li></li><li></li><li></li></ul>\r\nIt can...` | the-verge | The Verge |
| 1 | Ece Yildirim | 5 Things to Know About Why Salesforce Stock is... | Earlier this week, CEO Marc Benioff said that ... | https://gizmodo.com/5-things-to-know-salesforc... | https://gizmodo.com/app/uploads/2025/09/beniof... | 2025-09-04T15:08:57Z | Salesforce stock was down almost 8% this morni... | | Gizmodo.com |
| 2 | Zeyi Yang | China Turns Legacy Chips Into a Trade Weapon | As Washington pushes for a TikTok deal, Beijin... | https://www.wired.com/story/china-probe-us-chi... | https://media.wired.com/photos/68cb4d9a4eff22d... | 2025-09-18T15:00:00Z | While the Trump administration was trying to m... | wired | Wired |
| 3 | Daniel Geiger | What a $9 billion takeover battle says about t... | Core Scientific shareholders are pushing back ... | https://www.businessinsider.com/ai-data-center... | https://i.insider.com/68cd9506183847aa39d7222b... | 2025-09-22T09:06:01Z | Assets with access to power are increasingly v... | business-insider | Business Insider |
| 4 | Lucas Ropek | Now, Netflix Is Rumored to Want Warner Bros. | It appears there may be some competition to ac... | https://gizmodo.com/now-netflix-is-rumored-to-... | https://gizmodo.com/app/uploads/2025/09/Warner... | 2025-09-20T11:30:03Z | The media industry is in a state of tumult, an... | | Gizmodo.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Sean O'Neill | Certares Founder Greg O’Hara at Skift Global F... | Travel investor Greg O'Hara discussed Middle E... | http://skift.com/2025/09/20/certares-founder-g... | https://skift.com/wp-content/uploads/2025/09/g... | 2025-09-20T18:31:40Z | `Key Points\r\n<ul><li>Greg OHara of Certares d...` | | Skift |
| 96 | Abdul Rahman | Snowflake Inc. (SNOW) Transitions into a Data ... | Snowflake Inc. (NYSE:SNOW) is one of the best ... | https://finance.yahoo.com/news/snowflake-inc-s... | https://media.zenfs.com/en/insidermonkey.com/8... | 2025-09-13T13:53:25Z | Snowflake Inc. (NYSE:SNOW) is one of the best ... | | Yahoo Entertainment |
| 97 | Nikola Balić | Vibe Coding Through the Berghain Challenge | How my AI coding partner and I obsessed over a... | https://nibzard.com/berghain/ | https://nibzard.com/api/og/berghain | 2025-09-06T14:01:25Z | # Part 1: The Billboard That Started Everythin... | | Nibzard.com |
| 98 | Godfrey Benjamin | UK’s First Bitcoin Treasury Company B HODL Buy... | B HODL Plc., a UK-listed Bitcoin BTC $113 031 ... | https://www.coinspeaker.com/uks-first-bitcoin-... | https://media.zenfs.com/en/coinspeaker_us_106/... | 2025-09-24T12:12:27Z | B HODL Plc., a UK-listed Bitcoin treasury comp... | | Coinspeaker |
