## Project Introduction

Using some simple visualization techniques, I will explore the relationship between college majors and employment prospects.

In [None]:
!pip install pandas --upgrade

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
%matplotlib inline

from pandas.plotting import scatter_matrix 

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
grads = pd.read_csv('/kaggle/input/college-earnings-by-major/recent-grads.csv')

In [None]:
grads.head(10)

In [None]:
grads.tail()

In [None]:
grads.describe()

In [None]:
grads.describe(include='object')

In [None]:
grads.info()

Drop rows that contain missing (null) values.

In [None]:
grads.dropna(inplace=True)
grads.info()
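Before dropping rows, it can help to see exactly which ones are about to be removed. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `rows_with_missing` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def rows_with_missing(df):
    """Return only the rows of `df` that contain at least one missing value."""
    return df[df.isna().any(axis=1)]


# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [40000, 50000, 60000],
})

missing = rows_with_missing(toy)
print(missing)
print(f"{len(missing)} of {len(toy)} rows would be dropped")
```

In this notebook, calling `rows_with_missing(grads)` before the `dropna` cell would list the affected majors.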

### Some scatter plots to explore the relationships between various columns

`Total` and `Median`

In [None]:
grads.plot(kind='scatter', x='Total', y='Median', xlabel='Major population', ylabel='Median income', title='Total vs. Median', xlim=(0, 100000))

There doesn't seem to be much of a relationship between a major's median income and the number of people enrolled in it. On this evidence, expected pay does not appear to drive what people choose to study.

`Total` and `Unemployment_rate`

In [None]:
grads.plot(kind='scatter', x='Total', y='Unemployment_rate', title='Total vs. Unemployment rate', xlim=(0, 100000))

There doesn't seem to be much of a relationship between a major's unemployment rate and the number of people enrolled in it either.
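Visual impressions from scatter plots like the ones above can be backed with a correlation coefficient. The sketch below is one possible check, not part of the original analysis: the helper name `correlation_summary` and the toy data are hypothetical. It reports Pearson (linear) and Spearman (rank-based) correlations, where Spearman is computed as the Pearson correlation of the ranks.

```python
import pandas as pd


def correlation_summary(df, x, y):
    """Return Pearson and Spearman correlations between columns x and y."""
    # Series.corr defaults to the Pearson (linear) correlation.
    pearson = df[x].corr(df[y])
    # Spearman is the Pearson correlation of the ranked values.
    spearman = df[x].rank().corr(df[y].rank())
    return {"pearson": pearson, "spearman": spearman}


# Hypothetical toy data: y grows monotonically but not linearly with x.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})
summary = correlation_summary(toy, "x", "y")
print(summary)
```

In the notebook, `correlation_summary(grads, 'Total', 'Median')` or `correlation_summary(grads, 'Total', 'Unemployment_rate')` would give a number to set beside each plot; values near zero would support the "not much of a relationship" reading.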

`Full_time` and `Median`

In [None]:
grads.plot(kind='scatter', x='Full_time', y='Median', title='Full-time employment vs. Median income', xlim=(0, 50000), ylim=(20000, 80000))

There doesn't seem to be much of a relationship between the number of graduates employed full-time and the median income.

`ShareWomen` and `Unemployment_rate`

In [None]:
grads.plot(kind='scatter', x='ShareWomen', y='Unemployment_rate', title='Share of women in Major vs. Unemployment rate', ylim=(0, 0.125))

There doesn't seem to be much of a relationship between the unemployment rate and the proportion of women enrolled in a major. This suggests that employment prospects do not differ much between majors favored by women and majors favored by men.

`Men` and `Median`

In [None]:
grads.plot(kind='scatter', x='Men', y='Median', title='Men in Major vs. Median income')

`Women` and `Median`

In [None]:
grads.plot(kind='scatter', x='Women', y='Median', title='Women in Major vs. Median income')

The median income doesn't seem to be related to the number of men or the number of women enrolled in a major.
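The `Men` and `Women` columns are head counts, so they largely reflect how big a major is. A more direct way to compare male- and female-dominated majors is to group majors into `ShareWomen` bands and summarize income per band. This is a rough sketch under stated assumptions: the helper name, the band edges, and the toy data are all hypothetical.

```python
import pandas as pd


def median_income_by_share_band(df, share_col="ShareWomen", income_col="Median"):
    """Group majors into ShareWomen bands and summarize income per band."""
    # Band edges are an assumption chosen for illustration.
    bands = pd.cut(df[share_col], bins=[0, 0.25, 0.5, 0.75, 1.0], include_lowest=True)
    return df.groupby(bands, observed=True)[income_col].agg(["count", "median"])


# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.20, 0.40, 0.60, 0.80, 0.90],
    "Median": [60000, 50000, 45000, 40000, 35000, 30000],
})
by_band = median_income_by_share_band(toy)
print(by_band)
```

Calling `median_income_by_share_band(grads)` would show whether the typical income shifts from band to band in the real data.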

### Next, we'll draw some histograms for certain columns

In [None]:
grads['Total'].describe()

In [None]:
grads['Total'].hist(bins=10, range=(0, 400000))

Most majors have fewer than 200,000 people enrolled.

In [None]:
grads['Median'].describe()

In [None]:
grads['Median'].hist(bins=11, range=(15000, 110000))

The median income for most majors is between \$20,000 and \$60,000.

In [None]:
employment_rate = (grads['Employed'] / grads['Total'])
employment_rate.describe()

In [None]:
employment_rate.hist(bins=10, range=(0, 1))

For most majors, the share of graduates who are employed is between 50% and 90%.
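Statements of the form "most values fall between A and B" can be checked numerically rather than read off the histogram. A minimal sketch, assuming a pandas Series; the helper name `share_between` and the toy values are hypothetical.

```python
import pandas as pd


def share_between(values, low, high):
    """Return the fraction of non-missing values in the closed range [low, high]."""
    values = values.dropna()
    # Series.between is inclusive on both ends by default.
    return values.between(low, high).mean()


# Hypothetical toy employment rates.
toy_rates = pd.Series([0.45, 0.55, 0.70, 0.80, 0.95])
fraction = share_between(toy_rates, 0.5, 0.9)
print(f"{fraction:.0%} of values fall between 50% and 90%")
```

In the notebook, `share_between(employment_rate, 0.5, 0.9)` or `share_between(grads['Median'], 20000, 60000)` would put a figure on "most".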

In [None]:
grads['ShareWomen'].describe()

In [None]:
grads['ShareWomen'].hist(bins=10, range=(0, 1))

Few majors are overwhelmingly favored by men or women (around 20 out of 172).

### A Scatter Matrix of some employment-related data

In [None]:
scatter_matrix(grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(15,15))

### Some bar plots

In [None]:
grads[:20].plot.bar(x='Major', y='ShareWomen', figsize=(15, 8))

In most of the 20 highest-paying majors, women make up less than half of the enrollees.
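Slicing `grads[:20]` gives the highest-paying majors only if the rows are already sorted by median income. Selecting them explicitly removes that dependence on row order. The sketch below uses `DataFrame.nlargest`; the helper name `top_paying` and the toy frame are hypothetical.

```python
import pandas as pd


def top_paying(df, n=20, income_col="Median"):
    """Return the n rows with the highest income, regardless of row order."""
    return df.nlargest(n, income_col)


# Hypothetical toy data in deliberately shuffled order.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Median": [30000, 60000, 45000, 50000],
})
top2 = top_paying(toy, n=2)
print(top2)
```

For the bar plots here, `top_paying(grads).plot.bar(x='Major', y='ShareWomen', figsize=(15, 8))` would be the order-independent equivalent.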

In [None]:
grads[:20].plot.bar(x='Major', y='Unemployment_rate', figsize=(15, 8))

Most of the 20 highest-paying majors have an unemployment rate below 12.5%.