**Wage and other data for a group of 3000 male workers in the Mid-Atlantic region, drawn from the March 2011 Supplement to the Current Population Survey. The data was manually assembled by Steve Miller of Inquidia Consulting (formerly Open BI).**

Data Description

* year: Year that wage information was recorded
* age: Age of worker
* maritl: A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status
* race: A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race
* education: A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level
* region: Region of the country (mid-atlantic only)
* jobclass: A factor with levels 1. Industrial and 2. Information indicating type of job
* health: A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker
* health_ins: A factor with levels 1. Yes and 2. No indicating whether worker has health insurance
* logwage: Log of worker's wage
* wage: Worker's raw wage
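
The factor columns described above could be represented in pandas as ordered categoricals. The following is a minimal sketch, assuming the CSV stores each level as a string such as `1. < HS Grad`. The three-row `sample` frame is made up for illustration and is not taken from `wages.csv`. The sketch also assumes `logwage` is the natural log of `wage`; the description does not state the base, so the same `np.allclose` check is worth running on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not real records from wages.csv.
sample = pd.DataFrame({
    "education": ["2. HS Grad", "4. College Grad", "1. < HS Grad"],
    "wage": [80.0, 120.0, 60.0],
})
# Assumption: logwage uses the natural logarithm.
sample["logwage"] = np.log(sample["wage"])

# Declare education as an ordered categorical so that sorting and
# comparisons follow the level order from the data description.
education_levels = [
    "1. < HS Grad", "2. HS Grad", "3. Some College",
    "4. College Grad", "5. Advanced Degree",
]
sample["education"] = pd.Categorical(
    sample["education"], categories=education_levels, ordered=True
)

# Check that exponentiating logwage recovers wage.
logwage_matches = np.allclose(np.exp(sample["logwage"]), sample["wage"])
lowest_education = sample["education"].min()
print(logwage_matches, lowest_education)
```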



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# load dataset from wages.csv
# [TODO]

df = pd.read_csv('wages.csv')

In [None]:
# examine the first 5 rows
# [TODO]

df.head()

In [None]:
# examine missing values, data types
# [TODO]

df.info()
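
`df.info()` reports non-null counts, from which missing values have to be inferred. A more direct count per column is sketched below on a small made-up frame (the `sample` data and its single missing wage are hypothetical, not taken from `wages.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing wage, for illustration only.
sample = pd.DataFrame({
    "age": [25, 40, 58],
    "wage": [70.5, np.nan, 110.2],
    "jobclass": ["1. Industrial", "2. Information", "1. Industrial"],
})

missing_per_column = sample.isna().sum()  # number of missing values in each column
column_dtypes = sample.dtypes             # pandas dtype of each column
print(missing_per_column)
print(column_dtypes)
```

On the real data the same two calls would be `df.isna().sum()` and `df.dtypes`.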

### Box Plot

In [None]:
# let's examine if the wage column is what we are interested in
# [TODO]

plt.boxplot(df['wage'])
plt.title("Distribution of Wages")
plt.xlabel("Distribution")
plt.ylabel("Wage")
plt.show()

In [None]:
# another way to do this with the pandas plotting API (showfliers=False hides the outlier points)
# [TODO]

df['wage'].plot(kind="box", showfliers=False, title="Distribution of Wages")
plt.ylabel("Wage")
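
With matplotlib's default settings, the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is drawn as a flier. The sketch below computes those fences by hand to show which points `showfliers=False` would hide. The `wages` values are hypothetical, chosen only so that one point falls outside the fences.

```python
import pandas as pd

# Hypothetical wages, for illustration only; not taken from wages.csv.
wages = pd.Series([60.0, 75.0, 80.0, 90.0, 100.0, 110.0, 300.0])

q1, q3 = wages.quantile(0.25), wages.quantile(0.75)
iqr = q3 - q1                   # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the fliers that showfliers=False hides.
fliers = wages[(wages < lower_fence) | (wages > upper_fence)]
print(fliers.tolist())
```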

### Line Plot

In [None]:
# plot the average age vs. year
# [TODO]

df.groupby('year')['age'].mean().plot(kind="line")

In [None]:
# plot the average age vs. year and change line styles
# [TODO]

df.groupby('year')['age'].mean().plot(kind="line", c="red", marker='*', ms=10)

In [None]:
# plot the average age vs. year, with a dashed line at the mean of the yearly averages
# [TODO]

yearly_age = df.groupby('year')['age'].mean()
yearly_age.plot(kind="line", c="red", marker='*', ms=10)
plt.axhline(y=yearly_age.mean(), ls="--", c="grey")
plt.title("Average Age vs. Year")
plt.ylabel("Age")
plt.xlabel("Year")
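
The dashed reference line is the mean of the yearly averages, which weights every year equally. That is not necessarily the same as the mean age over all workers when the years have different numbers of records. A small sketch with made-up numbers (the `sample` frame is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical records, for illustration only: 2003 has three workers, 2004 has one.
sample = pd.DataFrame({
    "year": [2003, 2003, 2003, 2004],
    "age": [30, 40, 50, 60],
})

yearly_mean_age = sample.groupby("year")["age"].mean()  # 2003 -> 40.0, 2004 -> 60.0
mean_of_yearly_means = yearly_mean_age.mean()           # each year weighted equally
overall_mean_age = sample["age"].mean()                 # each worker weighted equally
print(mean_of_yearly_means, overall_mean_age)
```

Which of the two belongs on the plot depends on the question being asked; `df['age'].mean()` would give the per-worker version on the real data.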

### Bar Plot

In [None]:
# plot the number of workers at each education level
# [TODO]

df.groupby('education').count()['year'].plot(kind="bar", color=["red", "green", "yellow", "brown", "pink"])
plt.title("Counts of Education")
plt.xlabel("Education Level")
plt.ylabel("Counts")

In [None]:
# what's wrong with the plot above? How can we make it better? (here: sort the bars by count)
# [TODO]

df.groupby('education').count()['year'].sort_values().plot(kind="bar", color=["red", "green", "yellow", "brown", "pink"])
plt.title("Counts of Education")
plt.xlabel("Education Level")
plt.ylabel("Counts")
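
Sorting by count makes the bars easy to rank, but it gives up the natural order of the education levels. Because the labels carry a numeric prefix, sorting by index restores that order. The sketch below contrasts the two orderings on a made-up series (the `education` values are hypothetical, not taken from `wages.csv`); which one is better depends on whether the plot is meant to rank groups or to follow the education scale.

```python
import pandas as pd

# Hypothetical education labels, for illustration only.
education = pd.Series([
    "2. HS Grad", "2. HS Grad", "2. HS Grad",
    "4. College Grad", "4. College Grad",
    "1. < HS Grad",
])

counts = education.value_counts()        # sorted by count, largest first
counts_by_level = counts.sort_index()    # sorted by label, i.e. by education level
counts_ascending = counts.sort_values()  # smallest count first, as in the cell above
print(counts_by_level)
```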

### Scatter Plot

In [None]:
# plot a scatter plot between age and wage

df.plot(kind="scatter", x='age', y='wage')