# COGS 108 - Data Checkpoint

# Names

- Alessandro Todaro
- Richard Yu
- Vivek Rayalu
- Mengyu Zhang
- Hedy Wang

<a id='research_question'></a>
# Research Question

How does college GPA correlate with salary and job satisfaction, and how does that correlation differ across fields and degrees?

# Dataset(s)

- Dataset Name: IPUMS Higher Ed
- Link to the dataset: https://highered.ipums.org/highered/index.shtml
- Number of observations: ~276,000

This dataset contains data from the National Survey of College Graduates (NSCG) and the National Survey of Recent College Graduates (NSRCG). Specifically, we extracted the records from 1993 to 1999 pertaining to undergraduate GPA, college major, salary, and job satisfaction.

# Setup

In [19]:
import pandas as pd

df = pd.read_csv('data.csv')

# Data Cleaning

Columns are renamed for easier manipulation. Rows containing the IPUMS codes for missing or non-applicable data are removed. Empty data is NOT removed: the survey's questions changed between years, so some years lack certain variables, but those years can still be used in analyses that do not need those variables. Finally, the numeric codes used by IPUMS are replaced with the descriptors they represent, for ease of use.

In [20]:
# give columns friendlier names
df.columns = ['id', 'year', 'survey', 'major', 'gpa', 'yearsexperience', 'jobrelated', 'salary', 'jobsatisfaction']

# salary values of 9999998 or 9999999 represent missing or error data
# GPA value of 8 represents respondent did not recall GPA or did not have one
# job related value of 98 represents missing or error data
# major value of 96 represents blank response
# job satisfaction value of 98 represents missing or error data
df = df[df['salary'] < 9999998]
df = df[df['gpa'] != 8]
df = df[df['jobrelated'] != 98]
df = df[df['major'] != 96]
df = df[df['jobsatisfaction'] != 98]


# replace numeric values with the category/descriptor they represent
# (results are assigned back because replace(inplace=True) on a selected
# column may not update df in recent pandas versions)
df['survey'] = df['survey'].replace([1,3],['NSCG','NSRCG'])
df['major'] = df['major'].replace([1,2,3,4,5,6,7,9],['Computer and mathematical sciences',
                                                     'Life and related sciences',
                                                     'Physical and related sciences',
                                                     'Social and related sciences',
                                                     'Engineering',
                                                     'Science and engineering-related fields',
                                                     'Non-science and engineering fields',
                                                     'Others'])
df['gpa'] = df['gpa'].replace([1,2,3,4,5],['3.75 - 4.00',
                                           '3.25 - 3.74',
                                           '2.75 - 3.24',
                                           '2.25 - 2.74',
                                           '< 2.25'])
df['jobrelated'] = df['jobrelated'].replace([1,2,3],['Closely related',
                                                     'Somewhat related',
                                                     'Not related'])
df['jobsatisfaction'] = df['jobsatisfaction'].replace([1,2,3,4],['Very satisfied',
                                                                 'Somewhat satisfied',
                                                                 'Somewhat dissatisfied',
                                                                 'Very dissatisfied'])
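
The cleaning step uses two kinds of comparison (`<` for salary, `!=` for the other columns), and they treat empty cells differently. The sketch below shows this on a small made-up frame; the `toy` data and the variable names are assumptions for illustration only, not values from the IPUMS extract. If the extract has empty salary cells, it may be worth checking which behavior is intended.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not taken from the IPUMS extract)
toy = pd.DataFrame({
    'gpa':    [1, 8, 3, np.nan],
    'salary': [52000, 61000, 9999998, np.nan],
})

# '!=' keeps rows whose value is empty, because NaN != 8 evaluates to True
kept_by_gpa = toy[toy['gpa'] != 8]

# '<' drops rows whose value is empty, because NaN < 9999998 evaluates to False
kept_by_salary = toy[toy['salary'] < 9999998]

# adding an explicit isna() check keeps the empty cells as well
kept_with_empty = toy[(toy['salary'] < 9999998) | toy['salary'].isna()]

print(len(kept_by_gpa), len(kept_by_salary), len(kept_with_empty))
```

Comparing `len(kept_by_salary)` with `len(kept_with_empty)` on the real data would show how many rows, if any, are affected.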

In [23]:
# save cleaned data
df.to_csv('cleaned_data.csv')
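
By default `to_csv` also writes the DataFrame index as an unnamed first column, so a later notebook that reloads the file may want to pass `index_col=0`. A minimal sketch of the round trip, using an in-memory buffer and a made-up `toy` frame in place of `cleaned_data.csv`:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by row filtering
toy = pd.DataFrame({'id': [101, 102], 'gpa': ['3.75 - 4.00', '< 2.25']},
                   index=[0, 5])

buffer = io.StringIO()
toy.to_csv(buffer)   # default index=True writes the index as the first column
buffer.seek(0)

# index_col=0 reads that first column back as the index instead of as data
reloaded = pd.read_csv(buffer, index_col=0)
print(list(reloaded.columns))
```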

In [24]:
df.shape

(225442, 9)
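
The GPA bands are ordered, but once they are stored as plain strings pandas sorts them alphabetically. One possible way to keep the ordering for later steps is an ordered categorical. This is a sketch under that assumption, not part of the cleaning above; `gpa_order` and `toy` are hypothetical names, and the band labels are copied from the cleaning cell.

```python
import pandas as pd

# Band labels from the cleaning step, listed from lowest to highest
gpa_order = ['< 2.25', '2.25 - 2.74', '2.75 - 3.24', '3.25 - 3.74', '3.75 - 4.00']

toy = pd.DataFrame({'gpa': ['3.75 - 4.00', '< 2.25', '2.75 - 3.24']})
toy['gpa'] = pd.Categorical(toy['gpa'], categories=gpa_order, ordered=True)

# Sorting now follows the band order rather than alphabetical order
sorted_bands = toy.sort_values('gpa')['gpa'].tolist()

# Integer position of each band (0 = lowest), usable for rank-based statistics
gpa_rank = toy['gpa'].cat.codes.tolist()
print(sorted_bands, gpa_rank)
```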

# Project Timeline Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/28  |  6 PM | Brainstorm topics/questions  | Discuss and choose topic, assign workload, begin proposal writing | 
| 2/11  | 5:30 PM | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/12  | 5:30 PM  | Import & Wrangle Data | Finalize data submission   |
| 2/20  | TBD | Begin EDA | Discuss/Finalize EDA, Begin talking about analysis & assign roles   |
| 2/25  | TBD | Work on analysis  | Review/edit/critique analysis  |
| 3/10  | TBD | Complete analysis | Begin/assign work on finalizing project  |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |