# Author: Kenny Nguyen
# HCDE 530
# Mini Project 2b

This notebook describes scraping the [Dear Abby letter archives](https://www.uexpress.com/life/dearabby/archives), loading the results into a DataFrame, and cleaning the data. The CSV file made from this DataFrame was then imported into Clarifai for machine-learning text classification.

## Load Libraries
Loading relevant libraries for scraping and data cleaning.

In [None]:
# Before running this cell, run the following line of code:
# pip install beautifulsoup4

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import requests # Sends request to get website HTML
from bs4 import BeautifulSoup # Parses HTML


import datetime as dt # Library for reading or converting date formats
from collections import namedtuple # Cleaner way to access a tuple by field name instead of position index
from calendar import monthrange # Indicates days in each month for specified year

from tqdm import tqdm # Creates loading bar that is useful for displaying iterator progress

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Gather List of Webpages
As of June 2021, each "Dear Abby" letter is displayed on its own webpage. The URL for each webpage encodes the year, month, and day of the letter.
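
As a minimal sketch of that URL pattern, the standard library's `datetime.date` can step through a date range while `strftime` zero-pads the month and day. The helper name `letter_urls` is hypothetical, and the `/YYYY/MM/DD` path layout is assumed from the description above.

```python
from datetime import date, timedelta

BASE = "https://www.uexpress.com/life/dearabby"

def letter_urls(start, end):
    """Yield one URL per calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        # strftime zero-pads the month and day to two digits.
        yield f"{BASE}/{current.strftime('%Y/%m/%d')}"
        current += timedelta(days=1)

urls = list(letter_urls(date(2010, 1, 1), date(2020, 12, 31)))
print(len(urls))
print(urls[0])
```

Stepping one day at a time handles leap years without any special casing.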

In [None]:
# Create a named tuple whose fields can be accessed by name rather than index.
Date = namedtuple("Date", ["year", "month", "day"])

# Create a function that returns all dates in a given year.
# (Works for leap years!)
def all_dates_in_year(year):
    for month in range(1, 13):
        for day in range(1, monthrange(year, month)[1] + 1):
            yield Date(year, month, day)

# Initialize an empty list that will store the URLs to be scraped.
url_list = []

# Use a for loop to add the URL for each "Dear Abby" letter to url_list.
base = "https://www.uexpress.com/life/dearabby"
for calendar_year in range(2010, 2021):
    for date in all_dates_in_year(calendar_year):
        # Zero-pad single-digit months and days to two digits (e.g. 3 -> "03").
        month = f"{date.month:02d}"
        day = f"{date.day:02d}"

        # Build the full URL from base, year, month, and day, then append it to url_list.
        full_link = f"{base}/{date.year}/{month}/{day}"
        url_list.append(full_link)

## Scrape Webpages

This cell scrapes each webpage in the list of URLs for the date, title, and letter text. The outermost iterator is wrapped in tqdm() to display a progress bar for the scrape. There are over 4,000 URLs and the initial scrape took over 1.5 hours, so run this code block only once!
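
Because the scrape is slow, one option is to append each scraped row to a CSV file as it arrives, so an interrupted run can skip URLs that were already saved. This is only a sketch using the standard `csv` module. The file name `checkpoint.csv` and the helpers `append_row` and `scraped_urls` are hypothetical and not part of the original notebook.

```python
import csv
import os

HEADER = ["year", "month", "day", "url", "title", "text"]

def append_row(path, row):
    """Append one row to the checkpoint file, writing the header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        writer.writerow(row)

def scraped_urls(path):
    """Return the set of URLs already saved in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {record["url"] for record in csv.DictReader(f)}
```

Inside the scraping loop, a check such as `if url in scraped_urls("checkpoint.csv"): continue` would skip pages saved by an earlier run. Pages that contained no letter never reach the file, so they would still be requested again.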

In [None]:
list_header = ["year", "month", "day", "url", "title", "text"] # Becomes header for DataFrame
data = [] # Stores data from webscraping

# Iterate through each URL in the list to scrape its contents.
for url in tqdm(url_list):
    # Send a GET request for the HTML of the webpage then parse it for readability.
    req = requests.get(url)
    soup = BeautifulSoup(req.content,'html.parser')
    
    # Target part of the webpage containing the letter.
    block = soup.find(class_="ContentSidebar_content__main__3P2AH")
    if block is not None:
        
        # Some pages contain a series of letters. Target the webpage contents even
        # further to return every letter.
        block = block.find_all(class_="Article_article__section__2lhpN")
        
        # Use a for loop to scan each letter on a webpage for contents.
        for letter in block:
            sub_data = []

            # Convert datetime string on page into datetime object.
            date_time_str = letter.find("time").get("datetime")
            date_time_obj = dt.datetime.strptime(date_time_str, '%Y-%m-%d')

            title = letter.find("h1").get_text()
            
            # The text of an individual letter may be split into multiple paragraphs.
            # To avoid a fencepost error, record the first paragraph before using a for-loop
            # to add in every paragraph after it.
            # Check the tag itself for None: calling get_text() on a missing tag would raise an error.
            first_paragraph = letter.find("p")
            if first_paragraph is not None:
                text = first_paragraph.get_text()
                for paragraph in letter.find_all("p")[1:]:
                    # End text scraping when it reaches the author's reply to the reader's problem.
                    if paragraph.get_text().startswith("DEAR"):
                        break
                    else:
                        text = text + " " + paragraph.get_text()

                sub_data.append(date_time_obj.year)
                sub_data.append(date_time_obj.month)
                sub_data.append(date_time_obj.day)
                sub_data.append(url)
                sub_data.append(title)
                sub_data.append(text)
                data.append(sub_data)
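
The paragraph-joining rule used in the cell above can be illustrated without any HTML. The sketch below works on a plain list of paragraph strings. The function name `join_letter_paragraphs` and the sample paragraphs are made up for illustration.

```python
def join_letter_paragraphs(paragraphs):
    """Join a letter's paragraphs, stopping at the first later paragraph that starts with "DEAR"."""
    if not paragraphs:
        return None
    kept = [paragraphs[0]]  # The first paragraph is always part of the letter.
    for paragraph in paragraphs[1:]:
        if paragraph.startswith("DEAR"):
            break  # The reply to the reader begins here.
        kept.append(paragraph)
    return " ".join(kept)

sample = [
    "DEAR ABBY: My neighbor borrows my tools.",
    "He never returns them. -- FED UP",
    "DEAR FED UP: Ask for them back.",
]
print(join_letter_paragraphs(sample))
```

The first paragraph is kept unconditionally because it starts with "DEAR" itself, which is why the loop begins at the second paragraph.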

## Create and Clean DataFrame
After the webpages are scraped, the results are converted into a DataFrame and saved as a CSV file. This saves us from scraping the pages again, which is a long process. Instead, we can read in the CSV file directly whenever we want to manipulate the data.

In [None]:
# Only run this line the first time you scrape the pages. Subsequent manipulation should be done by
# reading in the CSV file that is created.
# dataFrame = pd.DataFrame(data, columns = list_header)

# Convert the DataFrame into a CSV file. This can be downloaded from the output folder.
# index=False keeps the row index from being written as an extra unnamed column.
# dataFrame.to_csv("dearabby.csv", index=False)

# Run this line to read the CSV after it is uploaded. This should be the beginning step after the DataFrame
# was created and converted into a CSV.
dataFrame = pd.read_csv("../input/dear-abby-20102020/dearabby.csv")

# Filter for letters that are addressed to Abby. This drops letters starting with "DEAR READERS",
# i.e. letters from the authors to the readers of the column. na=False also drops rows with missing text,
# and .copy() avoids a SettingWithCopyWarning when the column is modified below.
dataFrame = dataFrame[dataFrame["text"].str.startswith("DEAR ABBY", na=False)].copy()

# Remove the addressee text from the letters. regex=False treats the pattern as a literal string.
dataFrame["text"] = dataFrame["text"].str.replace("DEAR ABBY: ", "", regex=False)

# Remove any double-quotes from the letters.
dataFrame["text"] = dataFrame["text"].str.replace("\"", "", regex=False)

# Wrap commas in double-quotes, which is a requirement for inputting text data into Clarifai.
dataFrame["text"] = dataFrame["text"].str.replace(",", "\",\"", regex=False)
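
To make the effect of these cleaning steps concrete, here is a sketch that applies the same operations to a single string in plain Python. The function name `clean_letter` and the sample text are hypothetical. The notebook itself applies the steps column-wise with pandas.

```python
def clean_letter(text):
    """Return the cleaned letter body, or None if the letter is not addressed to Abby."""
    if not text.startswith("DEAR ABBY"):
        return None
    text = text.replace("DEAR ABBY: ", "")  # Remove the addressee text.
    text = text.replace('"', "")            # Remove double-quotes.
    return text.replace(",", '","')         # Wrap each comma in double-quotes.

print(clean_letter('DEAR ABBY: My sister, "Jane," is upset.'))
print(clean_letter("DEAR READERS: Happy holidays!"))
```

The order matters: existing double-quotes are stripped before the commas are wrapped, so the only quotes left in the output are the ones added around commas.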