# Prepping Data for Preprocessing
This notebook performs exploratory data analysis to decide which publications to use for training and testing the model. The output is a CSV file listing the publication title and file location of each of the 15,000 articles to be used (1,500 from each of the top 10 publications).

## Importing Data

In [1]:
import pandas as pd

df = pd.read_excel("19thCenturyUSNewspapers.xlsx")
df.dropna(inplace=True) # drop rows with missing values, which removes the per-publication metadata rows
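
The `df.head()` output further down shows a metadata row with an empty `IssueDate`, which is why a plain `dropna()` removes it. As a minimal sketch, assuming `IssueDate` is the column that reliably marks those rows, the drop can be restricted to that column so that a stray gap elsewhere does not discard a real issue. The toy frame and its file names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the spreadsheet; titles and file names are made up.
toy = pd.DataFrame({
    "PublicationTitle": ["A", "A", "A"],
    "IssueDate": [None, "October 11, 1836", "October 18, 1836"],
    "Filename": [
        "A_PublicationMetadata.xml",
        "A-1836-OCT11_Issue.xml",
        "A-1836-OCT18_Issue.xml",
    ],
})

# Drop only the rows whose IssueDate is missing (assumed to be metadata rows).
issues = toy.dropna(subset=["IssueDate"]).reset_index(drop=True)
print(issues)
```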

### Top 10 Publications 
The goal is to determine the top 10 publications by number of articles published.

In [2]:
# Collect the distinct publication titles
publications = set(df["PublicationTitle"])
# print(len(publications))
pubs = []

# Count how many rows belong to each publication
for pub in publications:
  pubs.append([pub, len(df[df["PublicationTitle"] == pub])])

# Keep the 10 publications with the most rows
pubsDF = pd.DataFrame(pubs, columns=["PublicationTitle", "NumArticles"])
pubsDF = pubsDF.sort_values("NumArticles", ascending=False).head(10)
# pubsDF.to_csv("publications.csv")
pubsDF


Unnamed: 0,PublicationTitle,NumArticles
378,"National Intelligencer (Washington, DC)",19495
329,"North American (Philadelphia, PA)",19371
330,"Milwaukee Daily Sentinel (Milwaukee, WI)",17466
54,"Bangor Daily Whig and Courier (Bangor, ME)",15794
394,"Boston Daily Advertiser (Boston, MA)",13468
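
The same ranking can be sketched with `value_counts()`, which counts and sorts in one pass instead of filtering the frame once per title. This is an alternative under the assumption that only the title and its count are needed; the toy data below is made up, and the resulting index will differ from the one shown in the output above.

```python
import pandas as pd

# Toy stand-in for the newspaper index; titles are invented for illustration.
toy = pd.DataFrame({
    "PublicationTitle": ["A", "B", "A", "C", "A", "B"],
})

# value_counts() counts rows per title and sorts them in descending order.
top = (
    toy["PublicationTitle"]
    .value_counts()
    .head(2)  # the notebook would use head(10)
    .rename_axis("PublicationTitle")
    .reset_index(name="NumArticles")
)
print(top)
```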


In [19]:
# print((pubsDF["NumArticles"] > 1500).sum())
# pubsToUse = pubsDF["PublicationTitle"].to_list()
# articlesToUse = df.loc[df["PublicationTitle"].isin(pubsDF["PublicationTitle"])]
# pubsDF[pubsDF["NumArticles"] > 1500].min()
# df.head()

Unnamed: 0,PublicationTitle,IssueDate,ImageLocation,DataLocation,Filename
0,"Arkansas State Gazette (Little Rock, AR)",,,\19thCenturyUSNewspapers_01\XML\NEWSPAPERS\5AHK\,5AHK_PublicationMetadata.xml
1,"Arkansas State Gazette (Little Rock, AR)","October 11, 1836",\19thCenturyUSNewspapers_02\Images\NEWSPAPERS\...,\19thCenturyUSNewspapers_01\XML\NEWSPAPERS\5AH...,5AHK-1836-OCT11_Issue.xml
2,"Arkansas State Gazette (Little Rock, AR)","October 18, 1836",\19thCenturyUSNewspapers_02\Images\NEWSPAPERS\...,\19thCenturyUSNewspapers_01\XML\NEWSPAPERS\5AH...,5AHK-1836-OCT18_Issue.xml
3,"Arkansas State Gazette (Little Rock, AR)","October 25, 1836",\19thCenturyUSNewspapers_02\Images\NEWSPAPERS\...,\19thCenturyUSNewspapers_01\XML\NEWSPAPERS\5AH...,5AHK-1836-OCT25_Issue.xml
4,"Arkansas State Gazette (Little Rock, AR)","November 01, 1836",\19thCenturyUSNewspapers_02\Images\NEWSPAPERS\...,\19thCenturyUSNewspapers_01\XML\NEWSPAPERS\5AH...,5AHK-1836-NOV01_Issue.xml


In [55]:
trainDF = pd.DataFrame()
testDF = pd.DataFrame()
for pub in pubsDF["PublicationTitle"]:
  pubRows = df.loc[df["PublicationTitle"] == pub]
  # The first 1000 rows of each publication go to training, the next 500 to testing
  trainDF = pd.concat([trainDF, pubRows[:1000]], ignore_index=True)
  testDF = pd.concat([testDF, pubRows[1000:1500]], ignore_index=True)

trainDF["Label"] = "training"
testDF["Label"] = "testing"

# len(trainDF)
# len(testDF)
# testDF.head()
finalDF = pd.concat([trainDF, testDF], ignore_index=True)

finalDF.to_csv("dataset.csv")

# len(finalDF)
# len(set(finalDF["PublicationTitle"]))
# (finalDF["PublicationTitle"] == 'Bangor Daily Whig and Courier (Bangor, ME)').sum()
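
The per-publication split above can also be sketched without a loop by numbering the rows inside each publication with `groupby(...).cumcount()`. This is a rough equivalent, not the notebook's method: the sizes are shrunk to fit the toy data, the file names are invented, and the rows stay in their original order rather than being grouped with all training rows first.

```python
import pandas as pd

# Toy stand-in: two publications with five rows each (names are made up).
toy = pd.DataFrame({
    "PublicationTitle": ["A"] * 5 + ["B"] * 5,
    "Filename": [f"file_{i}.xml" for i in range(10)],
})

n_train, n_test = 3, 1  # the notebook uses 1000 and 500

# Position of each row within its own publication: 0, 1, 2, ...
pos = toy.groupby("PublicationTitle").cumcount()

# Label by position, then keep the first n_train + n_test rows per publication.
labels = pos.lt(n_train).map({True: "training", False: "testing"})
subset = toy.assign(Label=labels)[pos < n_train + n_test]
print(subset)
```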

In [3]:
finalDF = pd.DataFrame()

for pub in pubsDF["PublicationTitle"]:
  finalDF = pd.concat([finalDF, df[df["PublicationTitle"] == pub][:1500]], ignore_index=True)
# All three replacements are literal strings, so regex=False (a lone backslash is not a valid regex)
finalDF["Location"] = "D:" + finalDF["DataLocation"] + finalDF["Filename"].str.replace("Issue", "Text", regex=False)
finalDF["Location"] = finalDF["Location"].str.replace("\\", "/", regex=False)
finalDF["Location"] = finalDF["Location"].str.replace("19thCenturyUSNewspapers_01", "19cUSNewspapers_01", regex=False)
finalDF.drop(["IssueDate", "ImageLocation", 'DataLocation', 'Filename'], axis=1, inplace=True)

finalDF.head()
finalDF.to_csv("articles.csv")
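
To make the path rewriting concrete, here is a small sketch that applies the same three substitutions to a single hypothetical row. The `XXXX` directory and file name are placeholders, not real entries from the spreadsheet, and the `D:` prefix is carried over from the cell above.

```python
import pandas as pd

# Hypothetical row; the XXXX directory and file name are invented placeholders.
toy = pd.DataFrame({
    "DataLocation": ["\\19thCenturyUSNewspapers_01\\XML\\NEWSPAPERS\\XXXX\\"],
    "Filename": ["XXXX-1836-OCT11_Issue.xml"],
})

# Point at the text file instead of the issue file, then prepend the drive.
location = "D:" + toy["DataLocation"] + toy["Filename"].str.replace("Issue", "Text", regex=False)
# Convert Windows separators to forward slashes (literal replacement).
location = location.str.replace("\\", "/", regex=False)
# Match the directory name used on disk.
location = location.str.replace("19thCenturyUSNewspapers_01", "19cUSNewspapers_01", regex=False)

print(location[0])
# D:/19cUSNewspapers_01/XML/NEWSPAPERS/XXXX/XXXX-1836-OCT11_Text.xml
```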

