## Getting the GSS Data

The data files are about 40GB zipped, so neither a compressed nor an uncompressed version can be stored on GitHub, and the entire dataset can't be loaded into memory in Colab.

One option is to use Rivanna: download the data, unzip it, and work on it in a persistent environment.

The other option is to avoid opening the entire file at once, and instead work with chunks of the data. That's what this code does for you.

On GitHub, the data are broken into three smaller files, saved in .parquet format. The code below loads these chunks into memory one at a time. You specify the variables you want in `var_list`, and the results are saved in `selected_gss_data.csv`.

You can add more cleaning instructions between the line where the data are loaded (`df = pd.read_parquet(url)`) and the lines where the data are saved (`df.loc...`). It's probably easiest to use this code to get only the variables you want, and then clean that subset of the data.
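As a minimal sketch of where a cleaning step could sit between loading and saving, the example below uses two small in-memory DataFrames as stand-ins for the downloaded chunks. The `clean_chunk` function, the `chunks` list, and the temporary output path are hypothetical names introduced for illustration; the real code gets each chunk from `pd.read_parquet(url)`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the downloaded chunks; in the real code each
# one comes from pd.read_parquet(url).
chunks = [
    pd.DataFrame({'wrkstat': ['working full time', None],
                  'prestige': [50.0, 45.0],
                  'id': [1, 2]}),
    pd.DataFrame({'wrkstat': ['retired', 'keeping house'],
                  'prestige': [None, 40.0],
                  'id': [3, 4]}),
]

var_list = ['wrkstat', 'prestige']
# Written to a temporary folder so the sketch does not overwrite real output.
output_path = os.path.join(tempfile.mkdtemp(), 'selected_gss_data.csv')


def clean_chunk(df):
    """Hypothetical cleaning step: keep the selected columns, drop incomplete rows."""
    return df.loc[:, var_list].dropna()


for k, df in enumerate(chunks):
    cleaned = clean_chunk(df)      # cleaning happens after load, before save
    first = (k == 0)               # first chunk writes the file and its header
    cleaned.to_csv(output_path,
                   mode='w' if first else 'a',
                   header=first,
                   index=False)

result = pd.read_csv(output_path)
print(result)
```

Dropping rows with missing values is only one possible cleaning rule; whatever you put inside `clean_chunk` runs once per chunk, so the full dataset never has to be in memory at the same time.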

In [12]:
import pandas as pd

# List of variables you want to save
var_list = ['abany', 'abrape', 'childs_rec', 'con_sc_judge', 'conmedic', 'men_vs_women_politics']
output_filename = 'my_selected_gss_data.csv'  # name of the file you want to save

# URLs for the CSV files in the lab_data folder:
csv_file_urls = [
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abany.csv',
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abrape.csv',
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/childs_rec.csv',
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/con_sc_judge.csv',
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/conmedic.csv',
    'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/men_vs_women_politics.csv'
]

phase = 0  # Controls write vs. append mode

# Loop through each CSV file
for url in csv_file_urls:
    print(f"Downloading: {url}")  # Debugging check
    try:
        df = pd.read_csv(url)  # Load the CSV file
        print(df.head())  # Inspect first few rows

        # Save this file's rows to the output CSV (all columns are kept; var_list is not applied here)
        if phase == 0:
            df.to_csv(output_filename, mode='w', header=True, index=False)  # Overwrite on first file
            phase = 1  # Switch to append mode
        else:
            df.to_csv(output_filename, mode='a', header=False, index=False)  # Append next files

    except Exception as e:
        print(f"Error loading {url}: {e}")  # Debugging error message


Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abany.csv
          category   name                                   label response  \
0  Current Affairs  abany  Abortion if woman wants for any reason      Yes   
1  Current Affairs  abany  Abortion if woman wants for any reason      Yes   
2  Current Affairs  abany  Abortion if woman wants for any reason      Yes   
3  Current Affairs  abany  Abortion if woman wants for any reason      Yes   
4  Current Affairs  abany  Abortion if woman wants for any reason      Yes   

  breakdown breakdown_category  year  percent  std_err  
0     Total              Total  1977    36.72     1.75  
1     Total              Total  1978    32.99     1.94  
2     Total              Total  1980    39.20     1.72  
3     Total              Total  1982    40.46     1.94  
4     Total              Total  1983    32.15     1.63  
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abrape.

I chose variables that I felt overlap in expressing survey respondents' confidence in the American government's handling of currently controversial issues, specifically abortion. The six variables are: abany (abortion for any reason), abrape (abortion in the case of rape), childs_rec (number of children per respondent), con_sc_judge (confidence in the current Supreme Court justices), conmedic (confidence in American medicine), and men_vs_women_politics (belief that men are better suited for politics than women). Together, I felt these variables would capture current beliefs about the American healthcare system as well as confidence in the current government's choices and directives, especially actions regarding abortion, since that is a primary interest of mine and a focus of the data.

In [13]:
import pandas as pd
#
var_list = ['wrkstat', 'prestige'] # List of variables you want to save
output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to
#
phase = 0 # Starts in write mode; after one iteration of loop, switches to append mode
#
for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    print(df.head()) # Visually inspect the first few rows
    if phase == 0 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='w', # control write versus append
                                header=var_list, # variable names
                                index=False) # no row index saved
        phase = 1 # Switch from write mode to append mode
    elif phase == 1 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='a', # append to the existing file
                                header=False, # header was already written with the first chunk
                                index=False) # no row index saved

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year  id            wrkstat  hrs1  hrs2 evwork    occ  prestige  \
0  1972   1  working full time   NaN   NaN    NaN  205.0      50.0   
1  1972   2            retired   NaN   NaN    yes  441.0      45.0   
2  1972   3  working part time   NaN   NaN    NaN  270.0      44.0   
3  1972   4  working full time   NaN   NaN    NaN    1.0      57.0   
4  1972   5      keeping house   NaN   NaN    yes  385.0      40.0   

         wrkslf wrkgovt  ...  agehef12 agehef13 agehef14  hompoph wtssps_nea  \
0  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
1  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
2  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
3  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
4  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   

   wtssnrps_nea  wtssps_next wt

In [16]:
test_df = pd.read_csv("selected_gss_data.csv")
test_df.columns

Index(['wrkstat', 'prestige'], dtype='object')

In [17]:
import pandas as pd

# List of variables you want as columns
var_list = ['abany', 'abrape', 'childs_rec', 'con_sc_judge', 'conmedic', 'men_vs_women_politics']
output_filename = 'selected_gss_data.csv'  # Output file

# URLs for the CSV files in your GitHub repository
csv_file_urls = {
    'abany': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abany.csv',
    'abrape': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abrape.csv',
    'childs_rec': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/childs_rec.csv',
    'con_sc_judge': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/con_sc_judge.csv',
    'conmedic': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/conmedic.csv',
    'men_vs_women_politics': 'https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/men_vs_women_politics.csv'
}

# Initialize an empty DataFrame
df_combined = pd.DataFrame()

# Loop through each CSV file and merge them into one DataFrame
for var, url in csv_file_urls.items():
    print(f"Downloading: {url}")  # Debugging check
    try:
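        df = pd.read_csv(url)  # Load CSV
        # Note: the first column of each file is `category`, so this relabels `category` with the variable name
        df = df.rename(columns={df.columns[0]: var})
        # Place this table's columns beside the earlier ones; rows are aligned by position, not by year
        df_combined = pd.concat([df_combined, df], axis=1)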

    except Exception as e:
        print(f"Error loading {url}: {e}")

# Save the final DataFrame to a CSV file
df_combined.to_csv(output_filename, index=False)
print(f"Final dataset saved as: {output_filename}")


Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abany.csv
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/abrape.csv
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/childs_rec.csv
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/con_sc_judge.csv
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/conmedic.csv
Downloading: https://raw.githubusercontent.com/yuthimreddy/EDA_HW03/main/lab/lab_data/men_vs_women_politics.csv
Final dataset saved as: selected_gss_data.csv


In [18]:
df_combined.head(5)

Unnamed: 0,abany,name,label,response,breakdown,breakdown_category,year,percent,std_err,abrape,...,std_err.1,men_vs_women_politics,name.1,label.1,response.1,breakdown.1,breakdown_category.1,year.1,percent.1,std_err.2
0,Current Affairs,abany,Abortion if woman wants for any reason,Yes,Total,Total,1977.0,36.72,1.75,Current Affairs,...,,Politics,fepol_r,Men are better suited for politics than are wo...,Agree,Total,Total,1974.0,43.3,
1,Current Affairs,abany,Abortion if woman wants for any reason,Yes,Total,Total,1978.0,32.99,1.94,Current Affairs,...,,Politics,fepol_r,Men are better suited for politics than are wo...,Agree,Total,Total,1975.0,46.89,1.37
2,Current Affairs,abany,Abortion if woman wants for any reason,Yes,Total,Total,1980.0,39.2,1.72,Current Affairs,...,1.27,Politics,fepol_r,Men are better suited for politics than are wo...,Agree,Total,Total,1977.0,46.83,1.5
3,Current Affairs,abany,Abortion if woman wants for any reason,Yes,Total,Total,1982.0,40.46,1.94,Current Affairs,...,1.36,Politics,fepol_r,Men are better suited for politics than are wo...,Agree,Total,Total,1978.0,42.43,1.34
4,Current Affairs,abany,Abortion if woman wants for any reason,Yes,Total,Total,1983.0,32.15,1.63,Current Affairs,...,1.79,Politics,fepol_r,Men are better suited for politics than are wo...,Agree,Total,Total,1982.0,33.57,2.29
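
The output above shows duplicated column names (`name.1`, `year.1`, `percent.1`) and rows whose years do not line up (1977 next to 1974), because the tables were placed side by side. One possible alternative, sketched here under assumptions, is to stack the tables row-wise and then pivot so each variable becomes a column indexed by year. The two small DataFrames are hypothetical miniatures that reuse column names and a few values visible in the printed output; they are not the full downloaded files.

```python
import pandas as pd

# Hypothetical miniature versions of two of the downloaded tables, using
# column names that appear in the df.head() output.
abany = pd.DataFrame({
    'name': ['abany', 'abany'],
    'response': ['Yes', 'Yes'],
    'breakdown': ['Total', 'Total'],
    'year': [1977, 1978],
    'percent': [36.72, 32.99],
})
fepol = pd.DataFrame({
    'name': ['fepol_r', 'fepol_r'],
    'response': ['Agree', 'Agree'],
    'breakdown': ['Total', 'Total'],
    'year': [1977, 1978],
    'percent': [46.83, 42.43],
})

# Stack rows (axis=0) so every table shares a single set of columns.
long_df = pd.concat([abany, fepol], axis=0, ignore_index=True)

# Reshape to one row per year and one column per variable.
wide_df = long_df.pivot_table(index='year', columns='name', values='percent')
print(wide_df)
```

With the real files you would likely filter to a single `response` and `breakdown` value before pivoting, since `pivot_table` averages any duplicate year and variable combinations by default.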
