
## Import libraries

In [1]:
# import the packages we expect to need: seaborn for plots and NumPy/pandas for wrangling and cleaning
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

The full GSS dataset was too large to load properly into Python. We therefore loaded it into R, selected the columns we thought would be most interesting, and saved them as a new dataset. The file loaded below is that reduced dataset.

In [2]:
# the data was uploaded directly into the environment
gss = pd.read_csv('/content/new_gss.csv', low_memory=False)
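
As an aside, column selection can also be done inside pandas at read time. The sketch below is not what we did (we used R); it only illustrates the `usecols` option of `pd.read_csv` on a tiny in-memory stand-in. The columns `extra_a` and `extra_b` are hypothetical placeholders for the columns we did not need.

```python
import io
import pandas as pd

# Tiny stand-in for a wide CSV export; the values are made up for illustration.
raw_csv = io.StringIO(
    "hrs1,degree,sex,extra_a,extra_b\n"
    "40,bachelor's,female,x,y\n"
    ",high school,male,x,y\n"
)

# Columns to keep (names taken from new_gss.csv).
wanted = ["hrs1", "degree", "sex"]

# usecols tells the parser to skip every other column, which reduces memory use.
subset = pd.read_csv(raw_csv, usecols=wanted)
print(subset.shape)  # (2, 3)
```

Whether this would have been enough for the full GSS file is untested here. For very large files, the `chunksize` argument of `pd.read_csv` is another option to consider.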

## Examining the Data

The following cells take a quick look at the data to understand its structure and to identify topics of interest.

In [3]:
gss.shape

(72390, 27)

In [4]:
# examining the data
gss.head()

Unnamed: 0,wrkgovt1,indus10,wrkstat,occ10,sibs,conclerg,coneduc,confed,conlabor,conmedic,...,industry,wrkyears,degree,hrs1,agewed,sex,born,granborn,grass,xmovie
0,,5170.0,working full time,"wholesale and retail buyers, except farm products",3.0,,,,,,...,609.0,,bachelor's,,,female,,,,
1,,6470.0,retired,first-line supervisors of production and opera...,4.0,,,,,,...,338.0,,less than high school,,21.0,male,,,,
2,,7070.0,working part time,real estate brokers and sales agents,5.0,,,,,,...,718.0,,high school,,20.0,female,,,,
3,,5170.0,working full time,accountants and auditors,5.0,,,,,,...,319.0,,bachelor's,,24.0,female,,,,
4,,6680.0,keeping house,telephone operators,2.0,,,,,,...,448.0,,high school,,22.0,female,,,,


In [5]:
# count the missing values in each column
gss.isna().sum()

wrkgovt1    68474
indus10      5219
wrkstat        36
occ10        5138
sibs         1785
conclerg    25009
coneduc     23990
confed      24663
conlabor    25987
conmedic    23889
contv       24098
conjudge    25224
childs        261
marital        51
major1      65165
major2      71709
agekdbrn    43523
industry    48183
wrkyears    70895
degree        196
hrs1        30830
agewed      45847
sex           112
born         9357
granborn    13197
grass       33721
xmovie      29630
dtype: int64

## Cleaning the Data

After looking at the number of missing values per column, we see that several columns are missing more than half of their values. If more than 75% of a column's values are missing, we remove the column.

Because of the nature of the General Social Survey, respondents may skip any question they like, so we expect few rows to have every value filled in.

Because we are interested in how work hours are affected by other variables, we clean the work hours column first by removing any rows that have no value for it. For the other columns we are interested in, such as occupation type, number of children, and marital status, we also remove rows with missing values.

We will also apply other cleaning steps, such as renaming columns.

In [6]:
# remove columns with more than 75% of values missing: 75% of 72,390 rows is 54,292.5,
# so a column is dropped when it has 54,293 or more missing values
dropcol = ['wrkgovt1','major1','major2','wrkyears']
cl_gss = gss.drop(columns=dropcol)
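
The column list above was typed by hand from the missing-value counts. As a minimal sketch, the same 75% rule can be computed from the data. The helper name `columns_over_missing_threshold` and the toy frame are hypothetical and are not part of our notebook.

```python
import numpy as np
import pandas as pd

def columns_over_missing_threshold(df, threshold=0.75):
    """Return the names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame with made-up values: 'a' is 100% missing, 'b' exactly 75%, 'c' 0%.
toy = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, np.nan],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

to_drop = columns_over_missing_threshold(toy)
print(to_drop)  # ['a']

cleaned = toy.drop(columns=to_drop)
print(list(cleaned.columns))  # ['b', 'c']
```

The comparison is strict, so a column that is exactly 75% missing is kept. Judging by the counts printed earlier, applying this to `gss` should select the same four columns as `dropcol`, since only those four have more than 54,292 missing values.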

In [7]:
# renaming columns that we want to use but that have ambiguous variable names
cl_gss = cl_gss.rename(columns = {'hrs1':'hrs_wrked_per_week',
                    'occ10':'occupation',
                    'conclerg':'conf_religion',
                    'coneduc':'conf_education',
                    'confed':'conf_fedgov',
                    'conmedic':'conf_medical',
                    'childs':'numb_children',
                    'marital':'marital_status',
                    'agekdbrn':'age_at_frstkid',
                    'agewed':'age_married',
                    'grass':'marijuana_usage'}) # Rename group of variables
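
One thing to watch for with `rename`: by default it silently ignores keys that do not match any column, so a typo in the mapping goes unnoticed. The sketch below uses a toy frame and a deliberately misspelled key (`hrs_1`, hypothetical) to show the `errors="raise"` option, which we did not use above.

```python
import pandas as pd

# Toy frame with two of the original GSS column names.
toy = pd.DataFrame({"hrs1": [40.0], "occ10": ["accountants and auditors"]})

# 'hrs_1' is a deliberate typo: the default rename leaves the columns unchanged.
silent = toy.rename(columns={"hrs_1": "hrs_wrked_per_week"})
print(list(silent.columns))  # ['hrs1', 'occ10']

# errors="raise" turns the same typo into a KeyError.
try:
    toy.rename(columns={"hrs_1": "hrs_wrked_per_week"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)  # True
```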

In [8]:
# summary statistics for hrs_wrked_per_week (not displayed, because this is not the last expression in the cell)
cl_gss['hrs_wrked_per_week'].describe()

# remove any row that is missing the hours worked per week
cl_gss = cl_gss.dropna(subset=['hrs_wrked_per_week'])


In [9]:
# of the 72,390 rows in our original dataset, only 41,560 have hours worked filled in
cl_gss.shape

(41560, 23)
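
The row count is consistent with the missing-value table: 72,390 rows minus the 30,830 rows missing `hrs1` leaves 41,560. The sketch below shows the same relationship on a toy frame with made-up values, and also that `dropna(subset=...)` only checks the named column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; two rows are missing the hours column.
toy = pd.DataFrame({
    "hrs_wrked_per_week": [40.0, np.nan, 25.0, np.nan],
    "degree": ["bachelor's", "high school", None, None],
})

n_missing = toy["hrs_wrked_per_week"].isna().sum()
kept = toy.dropna(subset=["hrs_wrked_per_week"])

# rows kept = rows before - rows missing the subset column
assert len(kept) == len(toy) - n_missing
print(kept.shape)  # (2, 2)

# Missing values in other columns survive, because only the subset column is checked.
print(kept["degree"].isna().sum())  # 1
```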

In [27]:
# from the remaining rows, keep only the columns we are interested in, then remove
# any rows that are missing a value in any of those columns.

fin_cols = ['hrs_wrked_per_week', 'occupation','conf_religion','conf_education','conf_medical', 'numb_children', 'marital_status', 'marijuana_usage', 'degree']

cl_gss = cl_gss[fin_cols]

cl_gss = cl_gss.dropna()

# we can see that there are 21317 rows that have all of the information that we need.
cl_gss.shape

(21317, 9)
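
Dropping every incomplete row roughly halves the data (41,560 to 21,317 rows), so it can help to see which column is responsible for most of the loss before calling `dropna()`. This is a minimal sketch on a toy frame with made-up values, not a step we ran above.

```python
import pandas as pd

# Toy frame with made-up values for three of our selected columns.
toy = pd.DataFrame({
    "hrs_wrked_per_week": [40.0, 38.0, 25.0, 50.0],
    "marital_status": ["married", None, "divorced", "married"],
    "marijuana_usage": [None, None, "legal", "not legal"],
})

# Missing values per column, largest first: the top entry costs the most rows.
missing_per_col = toy.isna().sum().sort_values(ascending=False)
print(missing_per_col)

# Keep only the rows with every selected column filled in.
complete = toy.dropna()
print(len(toy), "->", len(complete))  # 4 -> 2
```

In the raw data, `grass` (renamed `marijuana_usage`) had 33,721 missing values, so it is a plausible candidate for the largest share of the dropped rows. The cells above do not confirm this.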

## Our Research Question:
After cleaning and subsetting our data, we can now turn to our research question.

How do certain aspects of life affect the number of hours worked? We will look at the relationship between weekly work hours and different socioeconomic factors such as:
- Occupation
- Marital status
- Number of kids
- Confidence in major institutions
- Degree
- Marijuana Usage
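
As a first pass at any of these factors, weekly hours can be summarized within each category. The sketch below does this for marital status on a toy frame. The values are made up for illustration and are not results from the GSS.

```python
import pandas as pd

# Toy frame with made-up values, using two of our cleaned column names.
toy = pd.DataFrame({
    "hrs_wrked_per_week": [40.0, 45.0, 20.0, 35.0, 50.0, 30.0],
    "marital_status": ["married", "married", "never married",
                       "never married", "divorced", "divorced"],
})

# Count, mean, and median weekly hours per marital status, highest mean first.
summary = (
    toy.groupby("marital_status")["hrs_wrked_per_week"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

The same pattern should work on `cl_gss` by swapping in `occupation`, `degree`, or any of the other categorical columns as the grouping variable.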