
## Import libraries

In [1]:
# importing the basic packages we expect to need: Seaborn for plots and NumPy/pandas for wrangling and cleaning
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

The full GSS dataset was too large to load properly into Python. Because of this, we first loaded the data into R and selected the columns we thought would be the most interesting. The file loaded below is that reduced dataset.

In [2]:
# the data was uploaded directly into the environment
gss = pd.read_csv('/content/new_gss.csv', low_memory=False)
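
As an aside, a column subset like the one we made in R can also be taken at read time in pandas. The sketch below is only an illustration and is not how our file was produced: the toy CSV text and the `wanted_cols` list are made up for the example, and we have not tested this approach against the full GSS file.

```python
import io

import pandas as pd

# Toy CSV text standing in for a wide survey export (made-up values).
raw_csv = io.StringIO(
    "wrkstat,sibs,hrs1,sex,degree\n"
    "working full time,3,40,female,bachelor's\n"
    "retired,4,,male,less than high school\n"
)

# Hypothetical list of columns to keep.
wanted_cols = ["hrs1", "sex", "degree"]

# usecols makes the parser keep only the listed columns,
# which reduces memory use when the file has many columns.
subset = pd.read_csv(raw_csv, usecols=wanted_cols)
print(subset.shape)
```

With a real file, the `io.StringIO` object would be replaced by the file path.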

## Examining the Data

The following cells take a quick look at the data to understand its structure and to identify topics of interest.

In [3]:
gss.shape

(72390, 27)

In [4]:
# examining the data
gss.head()

Unnamed: 0,wrkgovt1,indus10,wrkstat,occ10,sibs,conclerg,coneduc,confed,conlabor,conmedic,...,industry,wrkyears,degree,hrs1,agewed,sex,born,granborn,grass,xmovie
0,,5170.0,working full time,"wholesale and retail buyers, except farm products",3.0,,,,,,...,609.0,,bachelor's,,,female,,,,
1,,6470.0,retired,first-line supervisors of production and opera...,4.0,,,,,,...,338.0,,less than high school,,21.0,male,,,,
2,,7070.0,working part time,real estate brokers and sales agents,5.0,,,,,,...,718.0,,high school,,20.0,female,,,,
3,,5170.0,working full time,accountants and auditors,5.0,,,,,,...,319.0,,bachelor's,,24.0,female,,,,
4,,6680.0,keeping house,telephone operators,2.0,,,,,,...,448.0,,high school,,22.0,female,,,,


In [5]:
# counting how many missing values each column has
gss.isna().sum()

wrkgovt1    68474
indus10      5219
wrkstat        36
occ10        5138
sibs         1785
conclerg    25009
coneduc     23990
confed      24663
conlabor    25987
conmedic    23889
contv       24098
conjudge    25224
childs        261
marital        51
major1      65165
major2      71709
agekdbrn    43523
industry    48183
wrkyears    70895
degree        196
hrs1        30830
agewed      45847
sex           112
born         9357
granborn    13197
grass       33721
xmovie      29630
dtype: int64

## Cleaning the Data

After looking at the number of missing values per column, we see that some columns have more than half of their values missing. If more than 75% of a column's values are missing, we remove the column.

Because of the nature of the General Social Survey, we know that respondents are allowed to skip questions whenever they would like. We therefore expect that few rows will have every value filled in.

There were two rounds of cleaning performed. We first observed our data and renamed a few columns that we might be interested in to give them a more descriptive name. We also dropped rows that had missing values in the columns we were interested in.

For the second round of cleaning, we took our subsetted data with its 9 selected columns and checked that the character/string values were consistent and that no data was entered incorrectly (e.g., 'male' denoted as 'M' instead of 'male'). We also checked the numeric data for errors, such as someone's age being entered as 188 instead of 18. These errors are unlikely given the source of this data, but we still want to check for inconsistencies.

In [31]:
# removing columns with more than 75% of the values missing:
# 75% of 72,390 rows is 54,292.5, so we drop any column with at least 54,293 missing values
dropcol = ['wrkgovt1','major1','major2','wrkyears']
# columns= already identifies the axis, so axis=1 is not needed
cl_gss = gss.drop(columns=dropcol)
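
The list of columns to drop can also be derived from the data rather than typed by hand. The following is a minimal sketch of the 75% rule on a toy frame; the frame `toy` and its values are made up for illustration, and `threshold`, `missing_share`, and `sparse_cols` are names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for gss.
toy = pd.DataFrame({
    "hrs1": [40.0, np.nan, 35.0, 50.0, 45.0],
    "sex": ["male", "female", "female", "male", "female"],
    "major2": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "wrkyears": [np.nan, np.nan, np.nan, 3.0, np.nan],
})

threshold = 0.75  # drop a column when more than 75% of its values are missing

# isna().mean() gives the share of missing values in each column.
missing_share = toy.isna().mean()

# Keep the names of the columns whose missing share exceeds the threshold.
sparse_cols = missing_share[missing_share > threshold].index.tolist()

cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(cleaned.columns.tolist())
```

Applied to `gss`, the same steps should select the four columns listed in `dropcol`, since each has more than 54,292 missing values.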

In [32]:
# renaming columns that we want to use but have ambiguous variable names
cl_gss = cl_gss.rename(columns = {'hrs1':'hrs_wrked_per_week',
                    'occ10':'occupation',
                    'conclerg':'conf_religion',
                    'coneduc':'conf_education',
                    'confed':'conf_fedgov',
                    'conmedic':'conf_medical',
                    'childs':'numb_children',
                    'marital':'marital_status',
                    'agekdbrn':'age_at_frstkid',
                    'agewed':'age_married',
                    'grass':'marijuana_usage'}) # Rename group of variables

In [33]:
# cleaning the hrs_wrked_per_week variable - removing any row that is missing the hrs worked per week
cl_gss['hrs_wrked_per_week'].describe()

cl_gss = cl_gss.dropna(subset=['hrs_wrked_per_week'])


In [34]:
# we can see that of our original data set, only 41,560 rows had worked hours filled in
cl_gss.shape

(41560, 23)

In [35]:
# of our subsetted dataset, we then subset the data again in order to choose specific columns that we are interested in. We also remove
# any rows that do not have all of the information filled in for our desired columns.

fin_cols = ['hrs_wrked_per_week','conf_religion','conf_education','conf_medical', 'numb_children', 'marital_status', 'marijuana_usage', 'degree', 'sex']

cl_gss = cl_gss[fin_cols]

cl_gss = cl_gss.dropna()

# we can see that there are 21644 rows that have all of the information that we need.
cl_gss.shape

(21644, 9)

In [38]:
# the following code checks that the categorical variables we are choosing to examine are clean --> all variables look fairly clean

# checking to make sure our confidence variables are not differing in phrases
print(cl_gss['conf_education'].value_counts())
print('----------------------------------------')
print(cl_gss['conf_religion'].value_counts())
print('----------------------------------------')
print(cl_gss['conf_medical'].value_counts())

# checking to make sure other string columns are not differing in phrases
print('----------------------------------------')
print(cl_gss['marital_status'].value_counts())
print('----------------------------------------')
print(cl_gss['marijuana_usage'].value_counts())
print('----------------------------------------')
print(cl_gss['degree'].value_counts())
print('----------------------------------------')
print(cl_gss['sex'].value_counts())



only some       12721
a great deal     5682
hardly any       3241
Name: conf_education, dtype: int64
----------------------------------------
only some       11621
a great deal     5031
hardly any       4992
Name: conf_religion, dtype: int64
----------------------------------------
only some       10085
a great deal     9725
hardly any       1834
Name: conf_medical, dtype: int64
----------------------------------------
married          11791
never married     5221
divorced          3174
separated          737
widowed            721
Name: marital_status, dtype: int64
----------------------------------------
should not be legal    13866
should be legal         7778
Name: marijuana_usage, dtype: int64
----------------------------------------
high school                 11407
bachelor's                   4046
less than high school        2696
graduate                     1949
associate/junior college     1546
Name: degree, dtype: int64
----------------------------------------
male      111

We can conclude that there are no inconsistencies in the categorical values of our new dataset.
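
The visual check above could also be automated by comparing each column against a set of expected labels. This is a rough sketch on made-up data: the helper `unexpected_labels` and the `expected` dictionary are hypothetical names introduced here, and the label sets are copied from the values seen in the outputs of this notebook.

```python
import pandas as pd

# Toy frame with deliberately messy labels (made-up values).
toy = pd.DataFrame({
    "sex": ["male", "Female ", "M", "female"],
    "conf_medical": ["only some", "a great deal", "hardly any", "Only Some"],
})

# Expected labels per column (assumed from the notebook's outputs).
expected = {
    "sex": {"male", "female"},
    "conf_medical": {"a great deal", "only some", "hardly any"},
}


def unexpected_labels(df, expected):
    """Map each column to the labels that are not in its expected set.

    Labels are stripped of surrounding whitespace and lowercased first,
    so only genuinely different spellings are reported.
    """
    problems = {}
    for col, allowed in expected.items():
        normalized = df[col].str.strip().str.lower()
        extra = sorted(set(normalized.dropna()) - allowed)
        if extra:
            problems[col] = extra
    return problems


issues = unexpected_labels(toy, expected)
print(issues)
```

An empty dictionary would mean that every label matches an expected value once case and whitespace are ignored.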

In [39]:
cl_gss['hrs_wrked_per_week'].value_counts()

40.0    7474
50.0    1606
60.0    1154
45.0    1076
30.0     718
        ... 
69.0       4
71.0       3
81.0       2
79.0       2
87.0       1
Name: hrs_wrked_per_week, Length: 90, dtype: int64
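
For the numeric columns, a range check is one way to catch entries such as 188 typed instead of 18. The sketch below uses made-up values, and the bounds are our own assumptions for illustration (a week has 168 hours, and 20 children is an arbitrary upper limit); `out_of_range` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frame with two deliberately implausible entries (made-up values).
toy = pd.DataFrame({
    "hrs_wrked_per_week": [40.0, 188.0, 27.0, -5.0],
    "numb_children": [2.0, 0.0, 8.0, 1.0],
})

# Assumed plausible ranges, inclusive on both ends.
bounds = {
    "hrs_wrked_per_week": (0, 168),
    "numb_children": (0, 20),
}


def out_of_range(df, bounds):
    """Return the rows where any checked column falls outside its bounds.

    Missing values also count as out of range, because
    Series.between returns False for NaN.
    """
    mask = pd.Series(False, index=df.index)
    for col, (low, high) in bounds.items():
        mask |= ~df[col].between(low, high)
    return df[mask]


suspect = out_of_range(toy, bounds)
print(suspect)
```

An empty result would suggest that no value falls outside the assumed ranges.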

## Our Research Question:
After cleaning and subsetting our data, we can now turn to our research question.

How do certain aspects of life impact the number of hours worked? We will look at the relationship between weekly work hours and different socioeconomic factors such as:
- Sex
- Marital status
- Number of kids
- Confidence in major institutions
- Degree
- Marijuana Usage

In [36]:
# examining the dataset:
cl_gss.head(5)


Unnamed: 0,hrs_wrked_per_week,conf_religion,conf_education,conf_medical,numb_children,marital_status,marijuana_usage,degree,sex
1613,27.0,hardly any,a great deal,only some,1.0,married,should not be legal,less than high school,male
1615,40.0,hardly any,hardly any,only some,8.0,married,should be legal,less than high school,female
1616,40.0,only some,only some,only some,2.0,married,should not be legal,high school,male
1618,52.0,a great deal,only some,a great deal,3.0,divorced,should not be legal,high school,female
1620,35.0,hardly any,only some,hardly any,0.0,never married,should be legal,bachelor's,male
