## Problem Statement

The tables are currently unlabeled. The goals are to infer each table's name from its contents and to format the tables appropriately.

My first attempt was to find column names that are unique to each table. That approach does not work because a given column can appear in more than one table. Alternatively, the script could identify each table by its unique combination of columns.
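As a minimal sketch of the column-combination idea, the snippet below maps a full set of column names to a table label and also shows why single columns are not enough. The labels and column names in `SIGNATURES` are hypothetical placeholders, not values taken from the scraped files.

```python
# Hypothetical signatures: labels and column names are illustrative assumptions.
SIGNATURES = {
    frozenset({"Year", "Tm", "G", "Cmp", "Att", "Yds", "TD", "Int"}): "passing",
    frozenset({"Year", "Tm", "G", "Att", "Yds", "TD", "Lng"}): "rushing",
    frozenset({"Year", "Tm", "G", "Att", "Yds", "TD"}): "summary",
}


def identify_table(columns):
    """Return the label whose column set exactly matches `columns`, else None."""
    return SIGNATURES.get(frozenset(columns))


def columns_unique_to(label):
    """Return the columns that appear in `label`'s signature and in no other."""
    own = next(sig for sig, name in SIGNATURES.items() if name == label)
    others = set().union(*(sig for sig, name in SIGNATURES.items() if name != label))
    return own - others
```

In this toy data, `columns_unique_to("summary")` is empty, so no single column can identify that table, while `identify_table` still resolves it from the full combination.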

A potential problem with this approach is that there may be several versions of the same table. For example, the passing-statistics table for a quarterback from the 1970s may have different columns than the one for a quarterback from the 2020s.
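One possible way to tolerate such variants is to match on a small set of core columns and ignore the rest of the header. The sketch below is only an illustration: the `CORE_COLUMNS` entries and the sample headers are assumptions, and the real core sets would have to be chosen after inspecting the actual headers.

```python
# Hypothetical core columns per table type; these are illustrative assumptions.
CORE_COLUMNS = {
    "passing": {"Cmp", "Att", "Yds", "TD", "Int"},
    "rushing": {"Att", "Yds", "TD", "Lng"},
}


def match_by_core(columns):
    """Return every label whose core columns are all present in `columns`."""
    present = set(columns)
    return sorted(label for label, core in CORE_COLUMNS.items() if core <= present)


# Two made-up versions of a passing table: the newer one has extra columns.
older_header = ["Year", "Tm", "Cmp", "Att", "Yds", "TD", "Int"]
newer_header = ["Year", "Tm", "Cmp", "Att", "Yds", "TD", "Int", "Sk", "QBR"]
```

Both `match_by_core(older_header)` and `match_by_core(newer_header)` return `["passing"]`. Returning a list makes ambiguous headers visible, because a header that satisfies more than one core set yields more than one label.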

It seems the solution requires a combination of approaches. The first step is to count how many unique column combinations exist across the tables.

In [24]:
# loop through each folder in player_tables
# for each file, record its column names (header) and its first data row
# count how often each unique header occurs
# keep the unique first rows
# write the header counts to tables_header_counts.csv
# write the unique first rows to table_first_rows.csv

import os
from collections import Counter

import pandas as pd

base_dir = 'scraping/player_tables'

header_counts = Counter()  # header tuple -> number of files with that header
first_rows = []            # first data row of each non-empty file

# loop through each folder in player_tables
for folder in sorted(os.listdir(base_dir)):
    folder_path = os.path.join(base_dir, folder)
    if not os.path.isdir(folder_path):
        continue
    for file in sorted(os.listdir(folder_path)):
        # read only the header and the first data row of each file
        df = pd.read_csv(os.path.join(folder_path, file), nrows=1)
        header_counts[tuple(df.columns)] += 1
        # skip the first row for files that contain a header but no data
        if not df.empty:
            first_rows.append(df.iloc[0])

# one row per unique header, most common first, with its count in the first column
df1 = pd.DataFrame([(count, *header) for header, count in header_counts.most_common()])
df1 = df1.rename(columns={0: 'count'})

# unique first rows, aligned by column name
df2 = pd.DataFrame(first_rows).drop_duplicates()

# write df1 to tables_header_counts.csv
df1.to_csv('tables_header_counts.csv', index=False)

# write df2 to table_first_rows.csv
df2.to_csv('table_first_rows.csv', index=False)

KeyboardInterrupt: 