## Getting the GSS Data

Because the data files are about 40GB zipped, neither the compressed nor the uncompressed version can be stored on GitHub, and the full dataset is too large to load into memory in Colab.

One option is to use Rivanna: download the data, unzip it, and work on it in a persistent environment.

The other option is to avoid opening the entire file at once, and instead work with chunks of the data. That's what this code does for you.

On GitHub, the data are broken into three smaller files, saved in .parquet format. The code below loads these chunks into memory one at a time. You specify the variables you want in `var_list`, and the results are saved to the file named in `output_filename` (here, `cleaned_selected_gss_data.csv`).
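
To make the write-then-append idea concrete, here is a minimal sketch that uses three small made-up DataFrames in place of the downloaded chunks. The `chunks` list, its values, and the `demo_selected.csv` file name are assumptions for illustration only, not part of the GSS files. The first chunk is written with a header row and the later chunks are appended without one, so the combined file has a single header.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the three downloaded chunks (made-up values)
chunks = [
    pd.DataFrame({'year': [1972, 1973], 'id': [1, 2], 'conpress': ['only some', 'hardly any']}),
    pd.DataFrame({'year': [1998], 'id': [3], 'conpress': ['a great deal']}),
    pd.DataFrame({'year': [2022, 2022], 'id': [4, 5], 'conpress': ['only some', 'a great deal']}),
]
var_list = ['year', 'conpress']  # Variables to keep
out_path = os.path.join(tempfile.mkdtemp(), 'demo_selected.csv')  # Throwaway output file

for k, chunk in enumerate(chunks):
    first = (k == 0)
    chunk.loc[:, var_list].to_csv(out_path,
                                  mode='w' if first else 'a',  # Write first, then append
                                  header=first,  # Header row only for the first chunk
                                  index=False)

combined = pd.read_csv(out_path)  # Read the combined file back to check it
print(combined)
```

Only one chunk is held in memory at a time, which is the point of the approach; the combined file contains just the selected columns.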

You can add more cleaning instructions between the line where the data are loaded (`df = pd.read_parquet(url)`) and the line where the data are saved (`df.loc...`). It's probably easiest to use this code to get only the variables you want, and then clean that subset of the data.
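
One way to organize such cleaning steps is to collect them in a small function that is called on each chunk before saving. The sketch below is only an illustration: `clean_chunk` is a hypothetical helper, and the `demo` DataFrame contains made-up rows rather than real GSS responses.

```python
import pandas as pd


def clean_chunk(df, var_list, year=None):
    """Hypothetical helper: keep the selected columns, optionally keep one
    survey year, and drop rows with a missing value in any other column."""
    out = df.loc[:, var_list]
    if year is not None:
        out = out[out['year'] == year]
    return out.dropna(subset=[v for v in var_list if v != 'year'])


# Made-up rows standing in for one chunk of the data
demo = pd.DataFrame({
    'year': [2021, 2022, 2022],
    'conpress': ['only some', None, 'hardly any'],
    'wrkstat': ['a', 'b', 'c'],
})

cleaned = clean_chunk(demo, ['year', 'conpress'], year=2022)
print(cleaned)
```

Inside the loop, a call such as `df = clean_chunk(df, var_list, year=2022)` would sit between the load and the save.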

In [9]:
import pandas as pd

# Variables to keep in the saved file
var_list = ['year', 'income16', 'conpress', 'conmedic', 'consci', 'confed', 'conjudge', 'conlegis', 'conarmy']
output_filename = 'cleaned_selected_gss_data.csv'  # Name of the file the data are saved to

modes = ['w', 'a']  # Write mode for the first chunk, append mode afterwards
phase = 0  # Starts in write mode; after one iteration of the loop, switches to append mode

for k in range(3):  # For each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1 + k) + '.parquet'  # URL of the chunk to be processed
    print(url)  # Check that the URL is correct
    df = pd.read_parquet(url)  # Download this chunk of data
    df = df[df['year'] == 2022]  # Keep only the 2022 survey year
    df = df.dropna(subset=[v for v in var_list if v != 'year'])  # Drop rows missing any selected variable
    print(df.head())  # Visually inspect the first few rows
    df.loc[:, var_list].to_csv(output_filename,  # Target file to save the chunk to
                               mode=modes[phase],  # Control write versus append
                               header=(phase == 0),  # Write the variable names once, with the first chunk
                               index=False)  # No row index saved
    phase = 1  # Switch from write mode to append mode

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
Empty DataFrame
Columns: [year, id, wrkstat, hrs1, hrs2, evwork, occ, prestige, wrkslf, wrkgovt, commute, industry, occ80, prestg80, indus80, indus07, occonet, found, occ10, occindv, occstatus, occtag, prestg10, prestg105plus, indus10, indstatus, indtag, marital, martype, agewed, divorce, widowed, spwrksta, sphrs1, sphrs2, spevwork, cowrksta, cowrkslf, coevwork, cohrs1, cohrs2, spocc, sppres, spwrkslf, spind, spocc80, sppres80, spind80, spocc10, spoccindv, spoccstatus, spocctag, sppres10, sppres105plus, spind10, spindstatus, spindtag, coocc10, coind10, paocc16, papres16, pawrkslf, paind16, paocc80, papres80, paind80, paocc10, paoccindv, paoccstatus, paocctag, papres10, papres105plus, paind10, paindstatus, paindtag, maocc80, mapres80, mawrkslf, maind80, maocc10, maoccindv, maoccstatus, maocctag, mapres10, mapres105plus, maind10, maindstatus, maindtag, sibs, childs, age, agekdbrn, educ, paeduc, maeduc, speduc, coeduc, cod

In [6]:
unique_values = df['income16'].astype(str).unique()
print(unique_values)

['nan' '$170,000 or over' '$50,000 to $59,999' '$75,000 to $89,999'
 '$60,000 to $74,999' '$30,000 to $34,999' 'under $1,000'
 '$8,000 to $9,999' '$12,500 to $14,999' '$40,000 to $49,999'
 '$5,000 to $5,999' '$35,000 to $39,999' '$25,000 to $29,999'
 '$90,000 to $109,999' '$22,500 to $24,999' '$20,000 to $22,499'
 '$110,000 to $129,999' '$150,000 to $169,999' '$130,000 to $149,999'
 '$1,000 to $2,999' '$17,500 to $19,999' '$6,000 to $6,999'
 '$10,000 to $12,499' '$15,000 to $17,499' '$7,000 to $7,999'
 '$3,000 to $3,999' '$4,000 to $4,999']
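
Because `income16` is stored as bracket labels, a numeric version can be convenient for sorting or plotting. The sketch below is one possible approach, not part of the original workflow: `income_lower_bound` is a hypothetical helper that returns the lower dollar bound of each bracket, treating `under $1,000` as 0 (an assumption) and leaving missing values as `None`.

```python
import re

import pandas as pd


def income_lower_bound(label):
    """Hypothetical helper: lower dollar bound of an income16 bracket label."""
    if pd.isna(label) or str(label) == 'nan':
        return None  # Missing response
    text = str(label)
    if text.startswith('under'):
        return 0  # Assumption: the lowest bracket starts at zero
    match = re.search(r'\$([\d,]+)', text)  # First dollar amount in the label
    return int(match.group(1).replace(',', '')) if match else None


# A few labels taken from the output above, plus a missing value
labels = ['$50,000 to $59,999', 'under $1,000', '$170,000 or over', None]
bounds = [income_lower_bound(x) for x in labels]
print(bounds)
```

The same function could be applied to the full column with `df['income16'].map(income_lower_bound)`.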


In [7]:
unique_values = df['conpress'].astype(str).unique()
print(unique_values)


['nan' 'only some' 'hardly any' 'a great deal']


In [8]:
unique_values = df['conmedic'].astype(str).unique()
print(unique_values)


['nan' 'a great deal' 'only some' 'hardly any']


In [9]:
unique_values = df['consci'].astype(str).unique()
print(unique_values)


['nan' 'a great deal' 'only some' 'hardly any']


In [10]:
unique_values = df['confed'].astype(str).unique()
print(unique_values)


['nan' 'only some' 'a great deal' 'hardly any']


In [11]:
unique_values = df['conlegis'].astype(str).unique()
print(unique_values)


['nan' 'hardly any' 'only some' 'a great deal']


In [12]:
unique_values = df['conarmy'].astype(str).unique()
print(unique_values)


['nan' 'a great deal' 'only some' 'hardly any']
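
The confidence variables all share the same three response labels, so they can be inspected and recoded in a single loop rather than one cell per variable. The sketch below is an illustration under stated assumptions: the ordering in `levels` (lowest to highest confidence) is assumed, and `demo` is a made-up DataFrame standing in for the GSS data.

```python
import pandas as pd

con_vars = ['conpress', 'conmedic', 'consci', 'confed', 'conjudge', 'conlegis', 'conarmy']
levels = ['hardly any', 'only some', 'a great deal']  # Assumed order, lowest to highest confidence

# Made-up responses standing in for the GSS data
demo = pd.DataFrame({v: ['only some', 'a great deal', 'hardly any', None] for v in con_vars})

for v in con_vars:
    # Convert to an ordered categorical so the labels can be sorted and compared
    demo[v] = pd.Categorical(demo[v], categories=levels, ordered=True)
    print(v, demo[v].value_counts(dropna=False).to_dict())

# Integer codes follow the order of `levels`; missing values are coded -1
codes = pd.DataFrame({v: demo[v].cat.codes for v in con_vars})
print(codes.head())
```

With the real data, replacing `demo` with `df` would apply the same recoding to each confidence column.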
