# Collect articles covering Hurricane Helene

1. Use the Wayback Machine to scrape everything from a news website.

2. Use <a href="https://guides.lib.unc.edu/news-Stories/NC-News">NewsBank</a>, a global news database that provides online archives of media publications.
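For the first option, one possible starting point is the Wayback Machine's CDX index, which lists the captures stored for a URL prefix. The sketch below only builds the query and converts result rows into replay URLs. The site `example.com/news/` is a placeholder and the helper names (`build_cdx_url`, `fetch_rows`, `snapshot_urls`) are made up for illustration; this is not the scraper used in this notebook.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site, start, end, limit=None):
    """Build a CDX query that lists captures of every URL under `site`."""
    params = {
        "url": site,
        "matchType": "prefix",       # all URLs that start with `site`
        "from": start,               # YYYYMMDD
        "to": end,                   # YYYYMMDD
        "output": "json",
        "filter": "statuscode:200",  # successful captures only
        "collapse": "urlkey",        # one capture per distinct URL
    }
    if limit is not None:
        params["limit"] = limit
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def fetch_rows(url):
    """Download the CDX result (network call, not executed in this sketch)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def snapshot_urls(rows):
    """Turn CDX JSON rows (the first row is the column header) into replay URLs."""
    if not rows:
        return []
    header, *records = rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


# Same date window as the NewsBank search below; the site is a placeholder
query = build_cdx_url("example.com/news/", "20240923", "20241023", limit=100)
print(query)
```

Calling `snapshot_urls(fetch_rows(query))` would then give the archived page URLs to download and parse.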

## 2. NewsBank PDFs about "Helene" coverage

Use Selenium to download news article PDFs from NewsBank.

Keyword: "helene"

Date: Sep 23, 2024 (when the first 3 articles about Helene were published) to Oct 23, 2024

Location: USA - North Carolina

In the end, the 5,000+ articles were downloaded manually in batches, because NewsBank requires identity authentication.

In [1]:
import os
import re
import time

import pandas as pd
import pdfplumber
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


Extract text from the PDFs with <a href="https://github.com/jsvine/pdfplumber">pdfplumber</a>.

In [3]:
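#   FUNC: extract_text_from_pdf
#   Input: pdf_path (str) - path to one PDF
#          headers (list) - article header blocks, extended in place
#          articles (list) - article texts, extended in place
#   Output: None (headers and articles are modified in place)
def extract_text_from_pdf(pdf_path, headers, articles):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may yield None or "" for a page without a text layer
            text = page.extract_text() or ""
            # The header block ends at the line "OpenURL Link"; the article body follows it
            texts = text.split("\nOpenURL Link\n")
            if len(texts) == 1:
                # No "OpenURL Link" on this page: it continues the previous article
                if articles:
                    articles[-1] += texts[0]
            else:
                # Assumes at most one article starts per page; pieces after texts[1] are ignored
                headers.append(texts[0])
                articles.append(texts[1])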
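The page-splitting rule can be tried without any PDF by feeding it plain strings. The sketch below uses the same "OpenURL Link" marker; the function name `split_pages` and the sample pages are invented for illustration.

```python
MARKER = "\nOpenURL Link\n"


def split_pages(page_texts):
    """Group page texts into (headers, articles) using the "OpenURL Link" marker."""
    headers, articles = [], []
    for text in page_texts:
        parts = (text or "").split(MARKER)
        if len(parts) == 1:
            # Continuation page: append to the article that is currently open
            if articles:
                articles[-1] += parts[0]
        else:
            # A new article starts here: header before the marker, body after it
            headers.append(parts[0])
            articles.append(parts[1])
    return headers, articles


# Hypothetical pages: one article spread over two pages, then an empty page
pages = [
    "Storm closes roads\nOctober 1, 2024 | Example Times (NC)\nAuthor: A. Writer"
    "\nOpenURL Link\nFirst part of the story. ",
    "Second part of the story.",
    None,
]
headers, articles = split_pages(pages)
print(headers)
print(articles)
```

Continuation pages are concatenated without a separator, which matches the notebook function; adding a newline between pages would be a reasonable variation.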

In [7]:
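#   FUNC: decompose_header
#   Input: header (str)
#   Output: None; appends one value each to the global lists
#           titles, dates, newspapers, authors, and word_counts
def decompose_header(header):
    # Line two starts with the date, e.g. "October 13, 2024 | Charlotte Observer, The (NC) | ..."
    date_match = re.search(r"\b(September|October) \d{1,2}, \d{4}\b", header)
    date = date_match.group(0) if date_match else ""
    dates.append(date)
    # Everything before the date is the title; without a date, the whole header is the title
    loc = date_match.start() if date_match else len(header)
    titles.append(header[:loc].replace("\n", ""))

    texts = header[loc:].split('\n')
    newspapers.append(texts[0][len(date)+1:].replace('| ', ""))

    # Line three: "Author: ... Section: ... N Words" (may be missing)
    line = texts[1] if len(texts) > 1 else ""
    match = re.search(r"Author: (.*?)Section: ", line)
    author = match.group(1) if match else ""
    if author == "":
        # No "Section: " field: take everything after the "Author: " prefix
        author = line[8:]
    authors.append(author)
    match_word = re.search(r"(\d+)\s*Words", line)
    word_count = match_word.group(1) if match_word else ""
    word_counts.append(word_count)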
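As an illustration of the header layout, here is a side-effect-free variant that returns a dict instead of appending to global lists. It is a sketch rather than the function used in this notebook: the name `parse_header` is made up, it splits line two on "|" to isolate the newspaper field, and it joins wrapped title lines with a space. The first sample header appears later in this notebook; the second is hypothetical.

```python
import re

DATE_RE = re.compile(r"\b(?:September|October) \d{1,2}, \d{4}\b")


def parse_header(header):
    """Split a NewsBank header block into its fields (missing fields stay "")."""
    fields = {"title": "", "date": "", "newspaper": "", "author": "", "word_count": ""}
    m = DATE_RE.search(header)
    if m is None:
        # No date found: treat the whole block as the title
        fields["title"] = header.replace("\n", " ").strip()
        return fields
    fields["date"] = m.group(0)
    fields["title"] = header[:m.start()].replace("\n", " ").strip()
    lines = header[m.start():].split("\n")
    # Line two: "date | newspaper | location | page"
    parts = [p.strip() for p in lines[0].split("|")]
    if len(parts) > 1:
        fields["newspaper"] = parts[1]
    if len(lines) > 1:
        # Line three: "Author: ... [Section: ...] [N Words]"
        author = re.search(r"Author:\s*(.*?)(?:\s*Section:|$)", lines[1])
        words = re.search(r"(\d+)\s*Words", lines[1])
        fields["author"] = author.group(1) if author else ""
        fields["word_count"] = words.group(1) if words else ""
    return fields


sample = (
    "Panthers CB Dane Jackson preparing to return from IR\n"
    "October 13, 2024 | Charlotte Observer, The (NC) | Charlotte, North Carolina | Page 36\n"
    "Author: Mike Kaye"
)
parsed = parse_header(sample)
print(parsed)

# Hypothetical header that also carries a section and a word count
other = parse_header(
    "Example headline\nOctober 2, 2024 | Example Times (NC)\n"
    "Author: Jane Doe Section: News 512 Words"
)
print(other)
```

Returning a dict makes it possible to build the columns in one step, for example with `pd.DataFrame(list(map(parse_header, headers)))`.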

In [9]:
# FUNC: correct_news_name
# Input: newspaper name from NewsBank (str), e.g. "Watauga Democrat, The (Boone, NC)"
# Output: normalized newspaper name without "The" or location, e.g. "Watauga Democrat"
def correct_news_name(newspaper):
    if ", The" in newspaper:
        # Drop the trailing ", The" and everything after it
        newspaper = newspaper[:newspaper.find(", The")]
    elif "(" in newspaper:
        # Drop the location in parentheses
        newspaper = newspaper[:newspaper.find('(')]

    # Get rid of "The" at the beginning
    newspaper = re.sub(r"^The\s", "", newspaper)

    # Strip leftover whitespace, e.g. the space before a removed "("
    return newspaper.strip()
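A quick way to see what the normalization does is to run it on a few names. The block below uses a self-contained copy of the helper that also strips surrounding whitespace. The first two names come from this notebook; the last two are hypothetical.

```python
import re


def correct_news_name(newspaper):
    """Self-contained copy of the notebook helper, ending with strip()."""
    if ", The" in newspaper:
        newspaper = newspaper[:newspaper.find(", The")]
    elif "(" in newspaper:
        newspaper = newspaper[:newspaper.find("(")]
    newspaper = re.sub(r"^The\s", "", newspaper)
    return newspaper.strip()


samples = [
    "Watauga Democrat, The (Boone, NC)",     # from the comment above the function
    "Charlotte Observer, The (NC)",          # from a header in this notebook
    "The Example Gazette (Sampletown, NC)",  # hypothetical
    "Example Mountain News",                 # hypothetical
]
for raw in samples:
    print(f"{raw!r} -> {correct_news_name(raw)!r}")
```

Stripping matters for the merge later in the notebook: a name such as `"Example Gazette "` with a trailing space would not match `"Example Gazette"`.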

In [None]:
# Get all files under news_bank_pdf
directory = 'news_bank_pdf'
pdf_paths = [os.path.join(directory, file) for file in os.listdir(directory) if file.endswith(".pdf")]

headers, articles = [],[]
# Extract articles and headers from pdfs
for path in pdf_paths:
    extract_text_from_pdf(path, headers, articles)

# Save articles and headers to dataframe
temp = pd.DataFrame({"header":headers,
                    "article": articles})


In [None]:
# Decompose headers
titles, dates, newspapers, authors, word_counts = [],[],[],[],[]
temp['header'].apply(lambda x: decompose_header(x))

temp['title'] = titles
temp['date'] = dates
temp['newspaper'] = newspapers
temp['author'] = authors
temp['word_count'] = word_counts
temp['newspaper'] = temp['newspaper'].apply(lambda x: correct_news_name(x))

## Extract Texts from New Readable PDFs

### Manually Extract Articles

Import the PDFs into <a href="https://notebooklm.google/">NotebookLM</a>, an AI document assistant by Google, and extract the article text from there.

In [4]:
import pandas as pd

pdfs = pd.read_csv("pdfs.csv")
pdfs.head()

Unnamed: 0,header,article
0,"Images of Destruction - and Hope\nOctober 6, 2...","Hurricane Helene swept across the Southeast, c..."
1,They were in the basement frantically preparin...,They were in the basement frantically preparin...
2,Free legal assistance available for Helene sto...,As thousands of North Carolinians continue to ...
3,No power but only minor damage: Spruce Pine qu...,The world's main producer of high-purity quart...
4,Milton takes turn to target Florida on 'destru...,Orlando Sentinel/Tribune Content Agency \nORLA...


In [5]:
for_str = """Dane Jackson won't offer up any spoilers on his status for Sunday's game against the Atlanta Falcons.\nThe veteran cornerback, who has been on injured reserve since the season started, could make his Carolina Panthers debut this weekend. But for now, he's just doing what he's told, and not sharing those directions with anyone outside of Bank of America Stadium.\n"I'm just following the plan that they've got for me," Jackson said with a big smile on Thursday after practice.\nJackson signed a two-year deal with the team in free agency in March. He was projected to be the favorite at the No. 2 cornerback spot opposite Jaycee Horn, but he suffered a notable hamstring injury in training camp in August.\nAnd he has been sidelined ever since.\n"It's been a process, for sure," Jackson said. "Never had a (hamstring injury) to this extent, so it's definitely been a process. But I've been working with the strength staff, with the training room staff - doing my own thing on the side, too - just trying to get to it and get back as healthy as I can."\nJackson built a bond with teammates in trainers room\nDuring Jackson's stint on the sideline, he bonded with fellow veterans D.J. Wonnum and Amare Barno, who have been on the physically unable to perform (PUP) list since July.\nThe trio worked in the trainers room together as they went through their respective rehab assignments. The bond between Wonnum and Jackson, in particular, helped the pair get back on the right track to returning to the field.\n"We've definitely (grown) closer since we've both been hurt, we've both been out," Jackson said. "We both like to play around a lot. Getting each other through the day - sometimes, you come in here hurt, and you've got to find it yourself. Just getting each other through the days and bonding with each other and growing together as teammates for sure."\nJackson, who played four seasons with the Buffalo Bills, is eager to play. 
He signed with Carolina largely due to his relationship and background with GM Dan Morgan.\nThe GM bet on Jackson, who wants to make the most of his opportunity with his """
head = "Panthers CB Dane Jackson preparing to return from IR\nOctober 13, 2024 | Charlotte Observer, The (NC) | Charlotte, North Carolina | Page 36\nAuthor: Mike Kaye"
# Append the missing text with a single .loc assignment; the chained form
# pdfs[mask]['article'] += ... only modifies a temporary copy and leaves pdfs unchanged
pdfs.loc[pdfs['header'] == head, 'article'] += for_str

In [10]:
# Basic cleaning
import re

titles, dates, newspapers, authors, word_counts = [],[],[],[],[]
pdfs['header'].apply(lambda x: decompose_header(x))

pdfs['title'] = titles
pdfs['date'] = dates
pdfs['newspaper'] = newspapers
pdfs['author'] = authors
pdfs['word_count'] = word_counts

pdfs['newspaper'] = pdfs['newspaper'].apply(lambda x: correct_news_name(x))

pdfs.head()

Unnamed: 0,header,article,title,date,newspaper,author,word_count
0,"Images of Destruction - and Hope\nOctober 6, 2...","Hurricane Helene swept across the Southeast, c...",Images of Destruction - and Hope,"October 6, 2024",Charlotte Observer,THE CHARLOTTEOBSERVER,
1,They were in the basement frantically preparin...,They were in the basement frantically preparin...,They were in the basement frantically preparin...,"October 6, 2024",Charlotte Observer,MARTHA QUILLIN,
2,Free legal assistance available for Helene sto...,As thousands of North Carolinians continue to ...,Free legal assistance available for Helene sto...,"October 6, 2024",Charlotte Observer,CHYNA BLACKMON cblackmon@charlotteobserver.com,
3,No power but only minor damage: Spruce Pine qu...,The world's main producer of high-purity quart...,No power but only minor damage: Spruce Pine qu...,"October 6, 2024",Charlotte Observer,BRIAN GORDON,
4,Milton takes turn to target Florida on 'destru...,Orlando Sentinel/Tribune Content Agency \nORLA...,Milton takes turn to target Florida on 'destru...,"October 9, 2024",Charlotte Observer,RICHARD TRIBOU,


## Filter news article data

In [None]:
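# Concat with other articles
temp['newspaper'] = temp['newspaper'].apply(lambda x: correct_news_name(x))
# Keep rows that have a word count. decompose_header stores a missing count as "",
# which only becomes NaN after a round trip through CSV
temp = temp[temp['word_count'].notna()]

# Combine the NewsBank articles with the manually extracted ones
articles = pd.concat([temp, pdfs], ignore_index=True)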

In [None]:
county = pd.read_csv('helene_county_newsrooms/WNC Helene counties.csv')

# Clean the county name: remove the literal suffix " (County)"
county['County'] = county['County'].str.replace(' (County)', '', regex=False)

news_census = pd.read_excel('helene_county_newsrooms/NC-News-Census-1.xlsx')

# Join two datasets
helene_news = pd.merge(county, news_census, how='left')

# Filter by 'newspaper' & 'digital' types only
helene_news['Type'] = helene_news['Type'].str.lower()
# .copy() makes the filtered frame independent, so adding a column below
# does not trigger SettingWithCopyWarning
helene_newspaper = helene_news[helene_news['Type'].isin(['newspaper','digital'])].copy()

# Save it locally
# helene_newspaper.to_csv('wnc_newspaper.csv', index=False)

# Normalize outlet names the same way as the article names
helene_newspaper['newspaper'] = helene_newspaper['Outlet'].apply(lambda x: correct_news_name(x))

# Keep necessary info only
h_news = helene_newspaper[['County', 'Outlet','newspaper']]

# Merge on the shared 'newspaper' column; articles from outlets not in h_news are dropped
helene_articles = pd.merge(articles, h_news, how="inner", on="newspaper")

# Save
# helene_articles.to_csv('helene_all_articles.csv', index=False)
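Because the inner merge keeps only exact name matches, it can help to list which newspaper names fail to match before trusting the result. The sketch below compares two collections of names with plain sets; `match_report` and the sample names are hypothetical.

```python
def match_report(article_names, census_names):
    """Compare two collections of outlet names and report exact-match overlap."""
    # Keep strings only, so missing values (NaN) do not break sorting
    article_set = {n for n in article_names if isinstance(n, str)}
    census_set = {n for n in census_names if isinstance(n, str)}
    return {
        "matched": sorted(article_set & census_set),
        "articles_only": sorted(article_set - census_set),
        "census_only": sorted(census_set - article_set),
    }


# Hypothetical names after normalization
report = match_report(
    ["Watauga Democrat", "Charlotte Observer", "Example Gazette", float("nan")],
    ["Watauga Democrat", "Example Mountain News"],
)
for key, names in report.items():
    print(key, names)
```

In this notebook the call would be `match_report(articles['newspaper'], h_news['newspaper'])`. Names under `articles_only` are the ones the inner merge drops, whether because the outlet is outside the selected counties or because the two sources spell the name differently.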
