# Visualizing Earnings Based On College Majors
The dataset contains information on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their [GitHub repo](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Column | Description
--- | ---
Rank | Rank by median earnings (the dataset is ordered by this column).
Major_code | Major code.
Major | Major description.
Major_category | Category of major.
Total | Total number of people with major.
Sample_size | Sample size (unweighted) of full-time.
Men | Male graduates.
Women | Female graduates.
ShareWomen | Women as share of total.
Employed | Number employed.
Median | Median salary of full-time, year-round workers.
Low_wage_jobs | Number in low-wage service jobs.
Full_time | Number employed 35 hours or more.
Part_time | Number employed less than 35 hours.

In [None]:
import pandas as pd
from numpy import arange
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
recent_grads = pd.read_csv("../input/college-earnings-by-major/recent-grads.csv", index_col = 0)
recent_grads.index.name = None
raw_data_count = recent_grads.shape[0]
raw_data_count

In [None]:
recent_grads.iloc[0,:]

In [None]:
recent_grads.head()

In [None]:
recent_grads.describe()

## Dropping rows with missing values 

In [None]:
recent_grads = recent_grads.dropna()

In [None]:
cleaned_data_count = recent_grads.shape[0]
raw_data_count-cleaned_data_count

Only one row contained missing values.
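To see where the missing values sit, rather than only how many rows were lost, the frame can be inspected before the `dropna()` call. The following is a minimal sketch on a small made-up frame; `toy` and its values are purely illustrative and are not taken from the dataset. In the notebook the same calls would be applied to `recent_grads` before rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
})

# Number of missing cells in each column.
missing_per_column = toy.isna().sum()

# Rows with at least one missing value; these are the rows dropna() removes.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```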

## Exploratory analysis

### Relation Analysis
* **Employed vs. Sample_size**

In [None]:
ax = recent_grads.plot(x='Sample_size', y='Employed', kind='scatter')
ax.set_title('Employed vs. Sample_size')
ax.set_xlim(0,4500)
ax.set_ylim(0,325000)

* **Sample_size vs. Median**

In [None]:
ax = recent_grads.plot(x='Sample_size', y='Median', kind='scatter')
ax.set_title("Sample_size vs. Median")
ax.set_xlim(0,4500)
ax.set_ylim(20000,120000)

* **Sample_size vs. Unemployment_rate**

In [None]:
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
ax.set_title("Sample_size vs. Unemployment_rate")
ax.set_xlim(0,4500)
ax.set_ylim(0,0.2)

Median earnings show no clear dependence on the number of full-time workers in a major, as the Full_time vs. Median plot below suggests.

* **Full_time vs. Median**

In [None]:
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter')
ax.set_title("Full_time vs. Median")
ax.set_xlim(0,275000)
ax.set_ylim(20000,120000)

* **ShareWomen vs. Unemployment_rate**

In [None]:
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
ax.set_title("ShareWomen vs. Unemployment_rate")

Unemployment rates show no clear relationship with the share of women among a major's graduates.
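A rough check on this kind of claim is to split majors into two groups by `ShareWomen` and compare summary statistics of `Unemployment_rate`. The sketch below uses a small made-up frame (`toy`, with illustrative values only) and a 0.5 cut-off chosen here as an assumption; the same steps could be applied to `recent_grads`.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.30, 0.60, 0.90],
    "Unemployment_rate": [0.05, 0.07, 0.06, 0.08],
})

# Label each major by whether women make up more than half of its graduates.
majority = toy["ShareWomen"].gt(0.5).map(
    {True: "majority women", False: "majority men"}
)

# Compare the unemployment rate between the two groups.
summary = toy.groupby(majority)["Unemployment_rate"].agg(["count", "mean", "median"])
print(summary)
```

Similar group means would be consistent with the reading of the scatter plot; a large gap would suggest the plot deserves a closer look.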

* **Men vs. Median**

In [None]:
ax = recent_grads.plot(x='Men', y='Median', kind='scatter')
ax.set_title("Men vs. Median")
ax.set_xlim(0,180000)
ax.set_ylim(20000,120000)

Median earnings show no clear relationship with the number of male graduates, so a major's popularity does not appear to predict its pay.
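A statement like this can be checked numerically with a correlation coefficient, which summarizes how strongly two columns move together. The sketch below defines a small helper, `pearson_r` (a hypothetical name introduced for illustration), and runs it on a made-up frame; applied to `recent_grads`, a value near zero would support the reading of the scatter plot.

```python
import pandas as pd


def pearson_r(df, x, y):
    """Return the Pearson correlation coefficient between columns x and y."""
    return df[x].corr(df[y])


# Hypothetical stand-in for recent_grads; the values are illustrative only.
toy = pd.DataFrame({
    "Men": [100, 200, 300, 400],
    "Women": [400, 100, 300, 200],
    "Median": [30000, 32000, 34000, 36000],
})

r_men = pearson_r(toy, "Men", "Median")
r_women = pearson_r(toy, "Women", "Median")
print(r_men, r_women)
```

Pearson correlation only captures linear association, so it complements the scatter plots rather than replacing them.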

* **Women vs. Median**

In [None]:
ax = recent_grads.plot(x='Women', y='Median', kind='scatter')
ax.set_title("Women vs. Median")
ax.set_xlim(0,325000)
ax.set_ylim(20000,125000)

### Frequency of occurrences
* **Sample_size**

In [None]:
recent_grads['Sample_size'].hist(bins=20, range=(0,4500))

In [None]:
recent_grads['Median'].hist(bins=20, range=(0,120000))

The most common median salary falls in roughly the $30k-$36k range (each of the 20 bins spans $6,000).
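Reading the tallest bar off a histogram is approximate, so it can help to compute the bin edges directly. With 20 bins over the range 0 to 120000, as in the `hist` call above, each bin is 6,000 wide. The sketch below uses `numpy.histogram` on a made-up series (`medians`, with illustrative values only) to find the most populated bin; in the notebook the same call could be given `recent_grads['Median']`.

```python
import numpy as np
import pandas as pd

# Hypothetical median salaries; the values are illustrative only.
medians = pd.Series([28000, 31000, 33000, 35000, 40000, 62000, 110000])

# Same binning as the histogram: 20 equal-width bins from 0 to 120000.
counts, edges = np.histogram(medians, bins=20, range=(0, 120000))

bin_width = edges[1] - edges[0]
top = counts.argmax()  # index of the most populated bin

print(f"bin width: {bin_width:.0f}")
print(f"most populated bin: {edges[top]:.0f} to {edges[top + 1]:.0f}")
```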

In [None]:
recent_grads['Employed'].hist(bins=15, range=(0, recent_grads['Employed'].max()))

In [None]:
recent_grads['Full_time'].hist(bins=20, range=(0, recent_grads['Full_time'].max()))

In [None]:
recent_grads['ShareWomen'].hist(bins=20, range=(0, recent_grads['ShareWomen'].max()))

In [None]:
recent_grads['Unemployment_rate'].hist(bins=20, range=(0, recent_grads['Unemployment_rate'].max()))

In [None]:
recent_grads['Men'].hist(bins=12, range=(0, recent_grads['Men'].max()))

In [None]:
recent_grads['Women'].hist(bins=12, range=(0, recent_grads['Women'].max()))

### Generating scatter matrices to study the relationships and distributions of variables

In [None]:
scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))

In [None]:
scatter_matrix(recent_grads[['Sample_size','Median']], figsize=(10,10))

In [None]:
scatter_matrix(recent_grads[['Sample_size', 'Median',"Unemployment_rate"]], figsize=(10,10))

### Comparing column values across majors

In [None]:
recent_grads[:15].plot.bar(x='Major', y='Women')

In [None]:
# Ten highest-ranked and ten lowest-ranked majors by median earnings
recent_grads.head(10).plot.bar(x='Major', y='ShareWomen', legend=False)
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend=False)

In [None]:
# Same comparison for the unemployment rate
recent_grads.head(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)


### Grouped barplot comparing the number of men and women across major categories

In [None]:
def autolabel(ax, rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:,.0f}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom', rotation=90, fontsize=8)

In [None]:
# Each row is a single major, so the counts are summed per Major_category
# to get groups worth comparing (an interpretation of the intended grouping).
gender_counts = recent_grads.groupby("Major_category")[["Men", "Women"]].sum()
category_majors = gender_counts.index
x = arange(len(category_majors))
width = 0.35
men_count = gender_counts["Men"]
women_count = gender_counts["Women"]
gender_counts

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
# Two bars per category, offset left and right of each tick position
rects1 = ax.bar(x - width/2, men_count, width, label='Men')
rects2 = ax.bar(x + width/2, women_count, width, label='Women')
ax.set_xticks(x)
ax.set_xticklabels(category_majors, rotation=90)
ax.set_ylabel('Number of graduates')
ax.legend()
autolabel(ax, rects1)
autolabel(ax, rects2)
plt.show()