# DS Fellows Project: College Majors

## By: Robin Hollingsworth

For this assignment, we will work with a dataset on recent graduates across various majors. It includes variables such as major category, the number of employed and unemployed graduates, and the number of part-time and full-time workers, which we can use to weigh the pros and cons of majors offered in the US.

The dataset originates from the Census Bureau’s American Community Survey (ACS) Public Use Microdata Sample (PUMS) files and was featured in an article on FiveThirtyEight called "The Economic Guide To Picking A College Major". 

Dataset: https://github.com/fivethirtyeight/data/tree/master/college-majors

Original source (ACS PUMS): https://www.census.gov/programs-surveys/acs/microdata.html

Article: https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

*PLAN/GOALS:*
1. Clean data to remove unnecessary columns, etc.
2. Find best majors in regard to unemployment rate, full-time, etc.
3. Explore the women to men ratios

In [7]:
import numpy as np
from datascience import *

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

### Question 0: Load and Clean Dataset

We can see that the columns correspond to various features/variables and that each row represents a particular major.

In [9]:
grads_full = Table.read_table("recent-grads.csv")
grads_full

Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
1,2419,PETROLEUM ENGINEERING,2339,2057,282,Engineering,0.120564,36,1976,1849,270,1207,37,0.0183805,110000,95000,125000,1534,364,193
2,2416,MINING AND MINERAL ENGINEERING,756,679,77,Engineering,0.101852,7,640,556,170,388,85,0.117241,75000,55000,90000,350,257,50
3,2415,METALLURGICAL ENGINEERING,856,725,131,Engineering,0.153037,3,648,558,133,340,16,0.0240964,73000,50000,105000,456,176,0
4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258,1123,135,Engineering,0.107313,16,758,1069,150,692,40,0.0501253,70000,43000,80000,529,102,0
5,2405,CHEMICAL ENGINEERING,32260,21239,11021,Engineering,0.341631,289,25694,23170,5180,16697,1672,0.0610977,65000,50000,75000,18314,4440,972
6,2418,NUCLEAR ENGINEERING,2573,2200,373,Engineering,0.144967,17,1857,2038,264,1449,400,0.177226,65000,50000,102000,1142,657,244
7,6202,ACTUARIAL SCIENCE,3777,2110,1667,Business,0.441356,51,2912,2924,296,2482,308,0.0956522,62000,53000,72000,1768,314,259
8,5001,ASTRONOMY AND ASTROPHYSICS,1792,832,960,Physical Sciences,0.535714,10,1526,1085,553,827,33,0.0211674,62000,31500,109000,972,500,220
9,2414,MECHANICAL ENGINEERING,91227,80320,10907,Engineering,0.119559,1029,76442,71298,13101,54639,4650,0.0573423,60000,48000,70000,52844,16384,3253
10,2408,ELECTRICAL ENGINEERING,81527,65511,16016,Engineering,0.19645,631,61928,55450,12695,41413,3895,0.0591738,60000,45000,72000,45829,10874,3170


### 0.1 Clean Columns

We are going to look at majors and major categories with regard to the ratio of men to women and the number of graduates who are employed. So, we need the following columns: **"Major", "Major_category", "Men", "Women", "Employed", "Unemployed"**.

In [11]:
grads_clean = grads_full.select(["Major","Major_category","Men","Women","Employed","Unemployed"])
grads_clean

Major,Major_category,Men,Women,Employed,Unemployed
PETROLEUM ENGINEERING,Engineering,2057,282,1976,37
MINING AND MINERAL ENGINEERING,Engineering,679,77,640,85
METALLURGICAL ENGINEERING,Engineering,725,131,648,16
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1123,135,758,40
CHEMICAL ENGINEERING,Engineering,21239,11021,25694,1672
NUCLEAR ENGINEERING,Engineering,2200,373,1857,400
ACTUARIAL SCIENCE,Business,2110,1667,2912,308
ASTRONOMY AND ASTROPHYSICS,Physical Sciences,832,960,1526,33
MECHANICAL ENGINEERING,Engineering,80320,10907,76442,4650
ELECTRICAL ENGINEERING,Engineering,65511,16016,61928,3895
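
The cleaned table no longer has `Unemployment_rate` or `ShareWomen`, but both plan items can still be worked out from the columns we kept. Below is a minimal sketch in plain Python (not the `datascience` API) on four rows copied from the preview above. It assumes the unemployment rate is `Unemployed / (Employed + Unemployed)`, which reproduces the `Unemployment_rate` values shown in the raw table for these rows. The helper names `unemployment_rate` and `women_to_men` are made up for illustration.

```python
# Rows copied from the grads_clean preview above:
# (Major, Men, Women, Employed, Unemployed)
rows = [
    ("PETROLEUM ENGINEERING", 2057, 282, 1976, 37),
    ("MINING AND MINERAL ENGINEERING", 679, 77, 640, 85),
    ("ACTUARIAL SCIENCE", 2110, 1667, 2912, 308),
    ("ASTRONOMY AND ASTROPHYSICS", 832, 960, 1526, 33),
]


def unemployment_rate(employed, unemployed):
    """Assumed definition: unemployed share of employed + unemployed."""
    return unemployed / (employed + unemployed)


def women_to_men(men, women):
    """Number of women per man in the major."""
    return women / men


# One summary tuple per major: (Major, unemployment rate, women-to-men ratio)
summary = [
    (major, unemployment_rate(emp, unemp), women_to_men(men, women))
    for major, men, women, emp, unemp in rows
]

# Sort so the lowest unemployment rate comes first.
by_unemployment = sorted(summary, key=lambda r: r[1])

for major, rate, ratio in by_unemployment:
    print(f"{major}: unemployment {rate:.4f}, women per man {ratio:.2f}")
```

In the notebook, the same arithmetic could be applied to whole columns of `grads_clean` instead of hand-copied rows.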


### 0.2 Count in Major Categories

To get a better sense of the data, we use the `group` method to count the number of occurrences of each major category. We then sort the result from most to fewest occurrences.

In [18]:
grads_clean.group("Major_category").sort('count', descending=True)

Major_category,count
Engineering,29
Education,16
Humanities & Liberal Arts,15
Biology & Life Science,14
Business,13
Health,12
Computers & Mathematics,11
Agriculture & Natural Resources,10
Physical Sciences,10
Psychology & Social Work,9
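
To make the group-then-sort step concrete, here is a rough plain-Python equivalent using `collections.Counter`. It is only an illustration of the idea, not how `group` is implemented. The list holds the `Major_category` values of the ten preview rows shown earlier, so the counts are much smaller than the full-table counts above.

```python
from collections import Counter

# Major_category values of the ten rows in the preview, in rank order.
categories = [
    "Engineering", "Engineering", "Engineering", "Engineering",
    "Engineering", "Engineering", "Business", "Physical Sciences",
    "Engineering", "Engineering",
]

# Count how many times each category occurs (the "group" step).
counts = Counter(categories)

# most_common() returns (category, count) pairs, largest count first
# (the "sort descending" step).
ranked = counts.most_common()

for category, count in ranked:
    print(f"{category},{count}")
```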


We also care about the number of 