## Slice historical questions into years

Take `questions_from_current_justices.csv` and split it into one file per year, so that more recent years can be processed first.

Save each slice to `'../datasets/historical_questions/{year}/{year}_questions_from_current_justices.csv'`.

In [1]:
import pandas as pd
import os

In [2]:
input_fp = '../datasets/questions_from_current_justices.csv'
df = pd.read_csv(input_fp)
df.head()

Unnamed: 0.1,Unnamed: 0,transcript_id,question_addressee,justice,question_text,opening_statement
0,0,2009.08-1529-t01,respondent,Sonia Sotomayor,Can you tell me how many PHS personnel work in...,<speaker>Pratik A. Shah</speaker><text>Mr. Chi...
1,1,2009.08-1529-t01,respondent,Sonia Sotomayor,--And is there a reason Congress would want to...,<speaker>Pratik A. Shah</speaker><text>Mr. Chi...
2,2,2009.08-1529-t01,respondent,"Samuel A. Alito, Jr.",Are they paid less than other -- than other Fe...,<speaker>Pratik A. Shah</speaker><text>Mr. Chi...
3,3,2009.08-1529-t01,respondent,"Samuel A. Alito, Jr.","If section 2679(b)(2), instead of saying parag...",<speaker>Pratik A. Shah</speaker><text>Mr. Chi...
4,4,2009.08-1529-t01,petitioner,"John G. Roberts, Jr.",You're not abandoning it; you're taking it fur...,<speaker>Elaine J. Goldenberg</speaker><text>I...


In [4]:
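def extract_year(transcript_id):
    # transcript IDs look like '2009.08-1529-t01'; the year is the part before the first '.'
    return transcript_id.split('.')[0]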

df['year'] = df['transcript_id'].apply(extract_year)
df['year'].value_counts()

2023    4598
2022    4429
2021    3625
2020    3095
2018    2598
2019    2106
2010    2029
2011    1829
2012    1765
2015    1688
2017    1674
2009    1564
2016    1545
2013    1545
2014    1084
2008     961
2007     900
2006     767
2005     574
1991      21
1994      15
1996      14
1998      13
1997      12
1992       9
2002       8
1999       7
2001       7
2000       4
1995       3
1993       3
Name: year, dtype: int64
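
The year extraction above trusts every `transcript_id` to start with a year followed by a dot. A minimal sketch of a stricter variant is shown here; the four-digit-year pattern is an assumption drawn from the sample rows, and `extract_year_checked` and `YEAR_PREFIX` are hypothetical names, not part of the original notebook.

```python
import re

# Assumed ID shape, based on the sample rows: '<4-digit year>.<rest>'
YEAR_PREFIX = re.compile(r'^(\d{4})\.')

def extract_year_checked(transcript_id):
    """Return the leading four-digit year, or raise if the ID does not match."""
    match = YEAR_PREFIX.match(str(transcript_id))
    if match is None:
        raise ValueError(f'unexpected transcript_id format: {transcript_id!r}')
    return match.group(1)

print(extract_year_checked('2009.08-1529-t01'))
```

It could be dropped in with `df['transcript_id'].apply(extract_year_checked)`, so that a malformed ID fails loudly instead of producing a stray year folder.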

In [5]:
base_out_dir = '../datasets/historical_questions/'

# group dataset by year and save each group to a separate subdir
for year, group in df.groupby('year'):
    # create subdir
    year_directory = os.path.join(base_out_dir, str(year))
    os.makedirs(year_directory, exist_ok=True)
    
    # save slice to the file
    output_path = os.path.join(year_directory, f'{year}_questions_from_current_justices.csv')
    group.to_csv(output_path, index=False)
    print(f"Saved questions for year {year} to {output_path}")


Saved questions for year 1991 to ../datasets/historical_questions/1991/1991_questions_from_current_justices.csv
Saved questions for year 1992 to ../datasets/historical_questions/1992/1992_questions_from_current_justices.csv
Saved questions for year 1993 to ../datasets/historical_questions/1993/1993_questions_from_current_justices.csv
Saved questions for year 1994 to ../datasets/historical_questions/1994/1994_questions_from_current_justices.csv
Saved questions for year 1995 to ../datasets/historical_questions/1995/1995_questions_from_current_justices.csv
Saved questions for year 1996 to ../datasets/historical_questions/1996/1996_questions_from_current_justices.csv
Saved questions for year 1997 to ../datasets/historical_questions/1997/1997_questions_from_current_justices.csv
Saved questions for year 1998 to ../datasets/historical_questions/1998/1998_questions_from_current_justices.csv
Saved questions for year 1999 to ../datasets/historical_questions/1999/1999_questions_from_current_justi

## Slice historical questions in each year into chunks of 1000 samples each
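
The chunking in this section uses ceiling division to count how many 1000-row slices a year needs. As a rough illustration of that arithmetic on its own, the sketch below computes the row offsets for each chunk; `chunk_bounds` is a hypothetical helper, and unlike the notebook loop it clamps the final end offset explicitly rather than relying on slicing to do so.

```python
def chunk_bounds(n_rows, chunk_size=1000):
    """Return (start, end) row offsets for consecutive chunks of at most chunk_size rows."""
    # ceiling division: a partial final chunk still counts as one chunk
    num_chunks = (n_rows + chunk_size - 1) // chunk_size
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_rows))
        for i in range(num_chunks)
    ]

print(chunk_bounds(2598))
```

With 2598 rows (the 2018 count above) this yields three chunks, which matches the three slices saved for 2018.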

In [7]:
base_dir = '../datasets/historical_questions/'

In [9]:
for year_folder in os.listdir(base_dir):
    year_path = os.path.join(base_dir, year_folder)
    year_csv = os.path.join(year_path, f'{year_folder}_questions_from_current_justices.csv')
    if os.path.exists(year_csv):  # may not exist if the previous section has not been run
        df = pd.read_csv(year_csv)

        slices_dir = os.path.join(year_path, 'slices')
        os.makedirs(slices_dir, exist_ok=True)

        # Split the dataframe into chunks of 1000 rows (ceiling division gives the chunk count)
        num_chunks = (len(df) + 999) // 1000
        for i in range(num_chunks):
            start_idx = i * 1000
            end_idx = start_idx + 1000
            chunk = df.iloc[start_idx:end_idx]

            # Save chunk
            output_file = os.path.join(
                slices_dir, f'{year_folder}_{i}_questions_from_current_justices.csv'
            )
            chunk.to_csv(output_file, index=False)
            print(f"Saved slice {i} for year {year_folder} to {output_file}")

Saved slice 0 for year 2018 to ../datasets/historical_questions/2018/slices/2018_0_questions_from_current_justices.csv
Saved slice 1 for year 2018 to ../datasets/historical_questions/2018/slices/2018_1_questions_from_current_justices.csv
Saved slice 2 for year 2018 to ../datasets/historical_questions/2018/slices/2018_2_questions_from_current_justices.csv
Saved slice 0 for year 2019 to ../datasets/historical_questions/2019/slices/2019_0_questions_from_current_justices.csv
Saved slice 1 for year 2019 to ../datasets/historical_questions/2019/slices/2019_1_questions_from_current_justices.csv
Saved slice 2 for year 2019 to ../datasets/historical_questions/2019/slices/2019_2_questions_from_current_justices.csv
Saved slice 0 for year 2005 to ../datasets/historical_questions/2005/slices/2005_0_questions_from_current_justices.csv
Saved slice 0 for year 2006 to ../datasets/historical_questions/2006/slices/2006_0_questions_from_current_justices.csv
Saved slice 0 for year 2014 to ../datasets/histo