# Cleaning C19 American literature records from LC MDS books data

- This notebook continues from the previous notebook.
- The overarching purpose of the cleaning steps undertaken here is to **normalize places of publication** for long-nineteenth-century works of US literature.
- Extract the `PS` range for American literature
  - See the [LC Classification Outline](https://www.loc.gov/aba/cataloging/classification/lcco/) for information about specific ranges
- Filter `PS` range for American literature (`PS` includes Canadian lit, too)
- Get publication years and `PS` lead digits as integers for sorting
- Clean places of publication

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.append("../src")

from data_collection import *
from data_cleaning import *

In [2]:
# change to your local path based on the output of the previous notebook:
output_dir = "/Users/erik/Desktop/LC_PS"

df = load_pickles_to_dataframe(output_dir)

# Confirm expected shape
df.shape == (303567, 7)

True

In [3]:
df["ps_digits"] = df["classifications"].apply(
    lambda x: [get_ps_digits(y) for y in x if "PS" in y][0]
)
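
`get_ps_digits` is imported from `../src` and is not shown in this notebook. As a rough illustration of the parsing this step relies on, here is a minimal sketch, assuming classification strings look like `PS3545.I5365` or `PS 1305 .A1`. The names `extract_ps_digits` and `first_ps_digits` and the sample strings are hypothetical; this is not the project's implementation.

```python
import re

# Hypothetical stand-in for get_ps_digits; the real helper lives in ../src.
_PS_PATTERN = re.compile(r"\bPS\s*(\d+)")


def extract_ps_digits(classification):
    """Return the integer that follows 'PS' in an LC call number, or None."""
    match = _PS_PATTERN.search(classification)
    return int(match.group(1)) if match else None


def first_ps_digits(classifications):
    """Return the digits of the first PS call number in a list, or None."""
    for item in classifications:
        digits = extract_ps_digits(item)
        if digits is not None:
            return digits
    return None
```

Returning `None` when nothing matches is a deliberate choice in this sketch: indexing a list comprehension with `[0]`, as the cell above does, raises `IndexError` for any record that has no `PS` classification.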

In [4]:
# PS(8001-8599) == Canadian literature
df["is_canadian"] = df["ps_digits"] > 8000

In [5]:
# Drop Canadian records
df = df[~df["is_canadian"]].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

In [6]:
df["year_int"] = df["year"].apply(get_years_ints)

In [7]:
df["publisher_year_int"] = df["publishers"].apply(get_publishers_year_ints)
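
`get_years_ints` and `get_publishers_year_ints` are also imported from `../src` and are not shown here. The sketch below illustrates the general idea of pulling a year out of a catalog string, assuming inputs such as `c1899.` or `[1902]`. The name `first_year` and the 1400 to 2099 range are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a standalone four-digit year between 1400 and 2099.
_YEAR_PATTERN = re.compile(r"(?<!\d)(1[4-9]\d{2}|20\d{2})(?!\d)")


def first_year(text):
    """Return the first plausible four-digit year in text as an int, or None."""
    if not isinstance(text, str):
        return None
    match = _YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None
```

Incomplete dates such as `18--` yield `None` in this sketch, which is why a fallback column (here, the year parsed from `publishers`) is useful.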

In [8]:
# Prefer the year parsed from `year`; where it is missing, fall back to the year parsed from `publishers`
df["year_publisher_int"] = df["year_int"].combine_first(df["publisher_year_int"])

In [9]:
df["min_year"] = df.apply(get_min_year, axis=1)

In [10]:
# Fill missing min_year values with 0 (a sentinel for "unknown") and cast to int
df["min_year"] = df["min_year"].fillna(0).astype(int)

In [11]:
df["places_clean"] = df["places"].apply(lambda x: [clean_string(s) for s in x])

In [12]:
df["places_clean"] = df["places_clean"].apply(flatten_list)
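
`clean_string` and `flatten_list` come from `../src`, so their exact behavior is not visible in this notebook. The following is a minimal sketch of the two operations, assuming raw place strings carry cataloging punctuation such as `[Boston] :` and that a cleaned entry may itself be a list. The names `normalize_place` and `flatten` are hypothetical.

```python
import re


def normalize_place(raw):
    """Remove square brackets, collapse whitespace, and trim edge punctuation."""
    text = re.sub(r"[\[\]]", "", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" :;,")


def flatten(items):
    """Flatten one level of nesting, leaving non-list items in place."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat
```

Periods are left alone here so that abbreviations such as `Pa.` survive; whether the project's helper does the same is not something this notebook shows.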

In [13]:
cols = ["lccn", "classifications", "year", "places", "publishers"]

for col in cols:
    df[col] = clean_col(df, col)

In [14]:
df["decade"] = df["min_year"].apply(get_decade)
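
`get_decade` is not defined in this notebook. A minimal sketch, assuming it maps a year to the first year of its decade; the name `decade_of` is hypothetical.

```python
def decade_of(year):
    """Return the first year of the decade containing year."""
    return (year // 10) * 10
```

Under this assumption, records whose missing year was filled with 0 land in decade 0, so they are worth excluding before grouping or plotting by decade.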

In [15]:
# Flag records that list more than one place of publication
df["multiple_places"] = df["places_clean"].apply(lambda x: len(x) > 1)

In [16]:
df_exploded = df.explode("places_clean")

In [17]:
df_exploded["target_city"] = df_exploded["places_clean"].apply(get_target_cities)

In [18]:
df_exploded = df_exploded.explode("target_city")
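
Because `explode` produces one row per list element, a record with several places appears several times afterwards. The toy example below (invented values, column names borrowed from this notebook) shows the effect on row counts and on the index.

```python
import pandas as pd

# Toy records; the column names mirror the notebook, the values are invented.
toy = pd.DataFrame(
    {
        "lccn": ["a1", "b2"],
        "places_clean": [["new york", "london"], ["boston"]],
    }
)

exploded = toy.explode("places_clean")

# One row per (record, place) pair; the original index labels are repeated.
lccn_counts = exploded["lccn"].value_counts().to_dict()
index_labels = exploded.index.tolist()

# reset_index(drop=True) gives each exploded row its own label.
exploded_unique = exploded.reset_index(drop=True)
```

Row counts in the exploded frame therefore count record-place pairs, not records. Counting distinct `lccn` values is one way to recover the number of records.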

In [22]:
# Create the data set
df_c19 = df_exploded.copy()
df_c19 = df_c19[
    [
        "lccn",
        "classifications",
        "ps_digits",
        "first_author",
        "title",
        "publishers",
        "min_year",
        "places",
        "multiple_places",
        "target_city",
    ]
]
# Keep 1600-1945; this also drops records whose missing year was filled with 0
df_c19 = df_c19[(df_c19["min_year"] <= 1945) & (df_c19["min_year"] >= 1600)]
df_c19 = df_c19.sort_values("ps_digits")
df_c19.to_csv("../data/data.csv", index=False)