## Problem Statement

The tables are currently unlabeled. There are two goals: infer each table's name from its contents, and format the tables appropriately.

My first attempt to solve this problem involved finding a column name unique to each table. That approach does not work because columns are not necessarily unique to specific tables. Alternatively, the script could identify each table by its unique combination of columns.
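
A minimal sketch of the column-combination idea. The table names, the column names, and the helpers `SIGNATURES` and `identify_table` are hypothetical and chosen only for illustration; they are not taken from the scraped files. Note that `Yds` and `TD` appear in both tables, so no single column identifies a table, but the full set does.

```python
# Hypothetical mapping from an exact set of columns to a table name.
SIGNATURES = {
    frozenset({"Year", "Team", "G", "Cmp", "Att", "Yds", "TD", "Int"}): "passing",
    frozenset({"Year", "Team", "G", "Rush", "Yds", "TD"}): "rushing",
}


def identify_table(columns):
    """Return the table name whose column set matches exactly, or None."""
    return SIGNATURES.get(frozenset(columns))


# Column order does not matter because the lookup uses a set.
print(identify_table(["Year", "Team", "G", "Rush", "Yds", "TD"]))
print(identify_table(["Year", "Team", "G"]))
```

An exact match is strict by design: a table with even one extra or missing column returns `None`.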

A potential problem with this approach is that there may be several versions of the same table. For example, a passing-statistics table for a quarterback from the 1970s may differ from one for a quarterback from the 2020s.
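
One way to tolerate such versions is to score how much a table's columns overlap with a reference set instead of requiring an exact match. The sketch below uses Jaccard similarity, which is a technique I am swapping in for illustration, not something already implemented in this notebook. The reference columns, the extra columns `QBR` and `Sk`, the `0.6` threshold, and the helper names are all assumptions.

```python
# Hypothetical reference column sets, one per table type.
REFERENCE_COLUMNS = {
    "passing": {"Year", "Team", "G", "Cmp", "Att", "Yds", "TD", "Int"},
    "rushing": {"Year", "Team", "G", "Rush", "Yds", "TD"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(columns, threshold=0.6):
    """Return the most similar table name, or None if below the threshold."""
    name, score = max(
        ((n, jaccard(columns, ref)) for n, ref in REFERENCE_COLUMNS.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None


# A passing table with two extra (hypothetical) columns still matches.
modern_passing = ["Year", "Team", "G", "Cmp", "Att", "Yds", "TD", "Int", "QBR", "Sk"]
print(best_match(modern_passing))
```

The threshold is a judgment call: too low and unrelated tables get merged, too high and this degenerates into exact matching.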

It seems the solution requires a combination of approaches. The first step is to determine how many unique column combinations exist.

In [33]:
import os
import pandas as pd

# Folder whose subfolders contain the CSV files
base_path = 'scraping/player_tables'

# List to store the file name and column names
column_data = []

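# Iterate through each CSV file in each subfolder
for folder in os.listdir(base_path):
    folder_path = os.path.join(base_path, folder)
    # Skip stray files that sit next to the subfolders
    if not os.path.isdir(folder_path):
        continue
    for filename in os.listdir(folder_path):
        if filename.endswith('.csv'):
            file_path = os.path.join(folder_path, filename)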
            
            # Read only the header row; the data rows are not needed here
            df = pd.read_csv(file_path, nrows=0)
            
            # Get the column names
            columns = df.columns.tolist()
            
            # Append the file name and column names as a row in the list
            column_data.append([file_path] + columns)

# Create a DataFrame with the file name and column names
columns_df = pd.DataFrame(column_data)

# Rename the first column as 'file' and the rest as 'column1', 'column2', etc.
columns_df.columns = ['file'] + [f'column{i+1}' for i in range(len(columns_df.columns) - 1)]

# Drop duplicates, ignoring the 'file' column
unique_columns_df = columns_df.drop_duplicates(subset=columns_df.columns[1:])

# Save the resulting DataFrame to a CSV file
unique_columns_df.to_csv('unique_columns.csv', index=False)

print("Unique column names saved to 'unique_columns.csv'")


Unique column names saved to 'unique_columns.csv'
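
The cell above saves one row per unique combination but does not report how many files share each one. A rough sketch of that count follows. The small `columns_df` here is a stand-in with made-up file paths and columns so the example runs on its own; in the notebook, the real `columns_df` from the previous cell would be used instead.

```python
from collections import Counter

import pandas as pd

# Stand-in for columns_df from the cell above (hypothetical paths and columns).
columns_df = pd.DataFrame(
    [
        ["a/1.csv", "Year", "Team", "Yds"],
        ["b/1.csv", "Year", "Team", "Yds"],
        ["a/2.csv", "Year", "Team", None],
    ],
    columns=["file", "column1", "column2", "column3"],
)

# Build one tuple of column names per file, dropping the padding that
# appears when a file has fewer columns than the widest file.
signatures = [
    tuple(value for value in row if pd.notna(value))
    for row in columns_df.drop(columns="file").itertuples(index=False, name=None)
]

# Count how many files share each column combination.
combo_counts = Counter(signatures)

print(f"{len(combo_counts)} unique column combinations")
for combo, n_files in combo_counts.most_common():
    print(n_files, combo)
```

Combinations that occur in many files are likely the standard table layouts, while those that occur once or twice are candidates for the older or newer variants described earlier.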
