# Census Analysis

### Contents
1. Data Wrangling -- collect, organize, define, clean
2. Exploratory Data Analysis
3. Feature Engineering, Pre-processing, Training
4. Modeling
5. Summary and Documentation

The goal of this project is to predict income levels from data collected in the US census. Income is binned into two levels, below 50K and above 50K, which makes this a binary classification problem.
The full dataset will be split into two parts: 2/3 of the rows for training and the remaining 1/3 for testing.

**Initial Data Collection and Cleaning**

In [1]:
# Importing necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.graphics.api import abline_plot
from sklearn.model_selection import train_test_split

In [None]:
# Loading in data as csv file
census = pd.read_csv('data/census-data.csv', header=None)

In [None]:
census.head()

In [None]:
census.shape

In [None]:
census.columns

**Data Definition**

In [None]:
# Creating list of column names to apply to dataframe
column_names = ['age', 'worker_class', 'industry_recode', 'occupation_recode', 'education',
                'wage_per_hour', 'edu_enroll_last_week', 'marital_stat', 'industry_code',
                'occupation_code', 'race', 'hispanic_origin', 'sex', 'member_of_union', 'reason_unemployment',
                'employment_stat', 'capital_gains', 'capital_losses', 'stock_divs', 'tax_filler_stat',
                'region_prev_residence', 'state_prev_residence', 'household_family_stat', 'household_summary_in',
                'instance_weight', 'migration_code_change_msa', 'migration_code_change_reg',
                'migration_codemove_reg', 'in_house_one_yearago', 'migration_prev_res_sunbelt', 'persons_worked_for_employer',
                'family_members_under_18', 'country_of_birth_father', 'country_of_birth_mother', 'country_of_birth', 
                'citizenship', 'own_business_or_self_employed', 'veterans_questionnaire', 'veterans_benefits', 
                'weeks_worked_year', 'year', 'income']

In [None]:
# Setting column names for the data set
census.columns = column_names

**Overview**

There are a total of 42 columns in this data set. The goal here is to determine the viability of each column as it pertains to the project analysis. Some of these columns might not be usable in this analysis, while others will need some wrangling to ensure that they are used appropriately.
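One way to start judging column viability is a per-column summary of type, number of distinct values, and share of missing values. The sketch below is only an illustration: `summarize_columns` is a hypothetical helper, and the toy frame stands in for `census` with made-up values.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct values, share of missing values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(),          # NaN is not counted as a value
        'missing_share': df.isna().mean(),
    })

# Toy frame standing in for `census`; the values are made up for illustration
toy = pd.DataFrame({
    'age': [25, 40, 40, None],
    'sex': [' Male', ' Female', ' Female', ' Male'],
    'year': [94, 94, 94, 94],
})

summary = summarize_columns(toy)
print(summary)
```

A column with a single distinct value (like `year` in the toy frame) carries no information for a classifier, and a column with a high missing share may need imputation or removal.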

What variables could affect the income level of an individual?
- age 
- race
- industry
- occupation
- education
- hours worked
- marital status
- region
- dependants


There are likely more variables that play into this, but this is a good starting point for now.

In [None]:
# checking unique values
census.income.unique()

In [None]:
# assigning 1 or 0 based on income category
census.income = pd.Series(np.where(census.income.values == ' 50000+.', 1, 0),
                       census.index)

# sanity check
print(census.income.value_counts())

In [None]:
# dropping irrelevant columns
census.drop(columns=['industry_recode', 'occupation_recode', 'weeks_worked_year'], inplace=True)

The majority of the population is under 40 years of age.
Persons under 16 will most likely not be earning an income and are therefore not applicable to this particular problem. The filter below is slightly stricter and keeps only respondents aged 18 and over.

In [None]:
# keeping only entries aged 18 and over; .copy() avoids SettingWithCopyWarning in later assignments
census = census[census.age >= 18].copy()

In [None]:
census.shape

Here the education column is reorganized and cleaned for analysis. The first step removes all spaces from its values.

In [None]:
census['education'] = census.loc[:, ('education')].str.replace(' ', '')
census.education.unique()

Cleaning and organizing the education column. The raw values are grouped into the following categories:

- NoHighSchool = 10thgrade, Lessthan1stgrade, 7thand8thgrade, 12thgradenodiploma, 5thor6thgrade, 11thgrade, 9thgrade, 1st2nd3rdor4thgrade
- HighSchool = Highschoolgraduate, Somecollegebutnodegree
- AssociatesDegree = Associatesdegree-academicprogram, Associatesdegree-occup/vocational
- BachelorsDegree = Bachelorsdegree(BAABBS)
- MastersDegree = Mastersdegree(MAMSMEngMEdMSWMBA)
- MedSchool = Profschooldegree(MDDDSDVMLLBJD)
- DoctorateDegree = Doctoratedegree(PhDEdD)

In [None]:
# creating variables that hold the old values to be replaced in the education column
noHighSchool = ['10thgrade', 'Lessthan1stgrade', '7thand8thgrade', 
                '12thgradenodiploma', '5thor6thgrade', '11thgrade', '9thgrade', '1st2nd3rdor4thgrade']

highschool = ['Highschoolgraduate', 'Somecollegebutnodegree']

associates = ['Associatesdegree-academicprogram', 'Associatesdegree-occup/vocational']
BachelorsDegree = 'Bachelorsdegree(BAABBS)'
MastersDegree = 'Mastersdegree(MAMSMEngMEdMSWMBA)'
MedSchool = 'Profschooldegree(MDDDSDVMLLBJD)'
DoctorateDegree = 'Doctoratedegree(PhDEdD)'

In [None]:
# Replacing old values with new
for i in noHighSchool:
    census.loc[census.education == i, 'education'] = 'NoHighSchool'
    
for i in highschool:
    census.loc[census.education == i, 'education'] = 'HighSchool'

for i in associates:
    census.loc[census.education == i, 'education'] = 'AssociatesDegree'

census.loc[census.education == 'Bachelorsdegree(BAABBS)', 'education'] = 'BachelorsDegree'
census.loc[census.education == 'Mastersdegree(MAMSMEngMEdMSWMBA)', 'education'] = 'MastersDegree'
census.loc[census.education == 'Profschooldegree(MDDDSDVMLLBJD)', 'education'] = 'MedSchool'
census.loc[census.education == 'Doctoratedegree(PhDEdD)', 'education'] = 'DoctorateDegree'
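The same grouping could also be written as a single lookup table passed to `Series.replace`. This is a minimal sketch of that alternative, not the approach used in this notebook: `education_map` is a name introduced here, and the toy series (including the unmapped `'Otherlabel'` value) is made up for illustration.

```python
import pandas as pd

# Lookup from raw (space-stripped) labels to the grouped categories listed above
education_map = {
    **dict.fromkeys(
        ['10thgrade', 'Lessthan1stgrade', '7thand8thgrade', '12thgradenodiploma',
         '5thor6thgrade', '11thgrade', '9thgrade', '1st2nd3rdor4thgrade'],
        'NoHighSchool'),
    **dict.fromkeys(['Highschoolgraduate', 'Somecollegebutnodegree'], 'HighSchool'),
    **dict.fromkeys(
        ['Associatesdegree-academicprogram', 'Associatesdegree-occup/vocational'],
        'AssociatesDegree'),
    'Bachelorsdegree(BAABBS)': 'BachelorsDegree',
    'Mastersdegree(MAMSMEngMEdMSWMBA)': 'MastersDegree',
    'Profschooldegree(MDDDSDVMLLBJD)': 'MedSchool',
    'Doctoratedegree(PhDEdD)': 'DoctorateDegree',
}

# Toy series standing in for census['education']; 'Otherlabel' is a made-up unmapped value
education = pd.Series(['9thgrade', 'Highschoolgraduate', 'Doctoratedegree(PhDEdD)', 'Otherlabel'])

# Labels missing from the lookup are left unchanged
grouped = education.replace(education_map)
print(grouped.tolist())
```

Keeping the mapping in one dictionary makes it easier to audit which raw labels end up in which group.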

In [None]:
census.education.unique()

# Success!

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
order = ['NoHighSchool', 'HighSchool', 'AssociatesDegree', 'BachelorsDegree', 'MastersDegree', 'DoctorateDegree', 'MedSchool']

sns.countplot(x='education', data=census, order=order)
ax.set(xlabel='Education', ylabel='Count')

In [None]:
census['year'].value_counts()

Is the census year relevant?

In [None]:
incomes_94 = census.loc[census.year == 94, 'income'].value_counts()
incomes_95 = census.loc[census.year == 95, 'income'].value_counts()

print('1995 census income category counts\n')
print(incomes_95)
print('\n1994 income category counts\n')
print(incomes_94)
print('\nThe difference between years. Remember the value of 1 denotes more than 50,000.\n')
print(incomes_95 - incomes_94)
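Raw count differences between years depend on how many respondents each year has. Since `income` is coded as 0/1, its mean within a year is the share of respondents above 50K, which may be easier to compare. The sketch below uses a made-up toy frame in place of `census`.

```python
import pandas as pd

# Toy frame standing in for `census`; the values are made up for illustration
toy = pd.DataFrame({
    'year':   [94, 94, 94, 94, 95, 95, 95, 95],
    'income': [0, 0, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the proportion of ones within each group
share_by_year = toy.groupby('year')['income'].mean()
print(share_by_year)
```

If the two shares are close on the real data, the year column is unlikely to add much predictive value.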

In [None]:
census.age.describe()

In [None]:
fig, ax = plt.subplots()
sns.boxplot(y='age', data=census)
sns.displot(x='age', data=census)

In [None]:
# removing extra whitespace
census['sex'] = census.loc[:, ('sex')].str.replace(' ', '')
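The same kind of whitespace clean-up could be applied to every text column in one pass. This is a sketch under assumptions: `strip_text_columns` is a hypothetical helper, the toy values are made up, and it only trims leading and trailing whitespace, whereas the `str.replace(' ', '')` calls in this notebook remove every space.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def strip_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with leading/trailing whitespace removed from text columns."""
    out = df.copy()
    for col in out.columns:
        if is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out

# Toy frame standing in for `census`; the values are made up for illustration
toy = pd.DataFrame({
    'sex': [' Male', ' Female'],
    'marital_stat': [' Never married', ' Divorced'],
    'age': [30, 52],
})

cleaned = strip_text_columns(toy)
print(cleaned['marital_stat'].tolist())
```

Trimming only the ends keeps multi-word labels readable while still removing the leading space seen in values such as `' 50000+.'`.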

In [None]:
census.sex.value_counts()

In [None]:
census.loc[census['sex'] == 'Female', 'income'].value_counts()


In [None]:
2663 / (76547 + 2663)

76547 Females in this survey made less than 50k\
2663 Females made more than 50k\
\
The percentage of female earners above 50k is 3.36%

In [None]:
census.loc[census['sex'] == 'Male', 'income'].value_counts()


In [None]:
9719 / (60246 + 9719)

60246 Males made less than 50k\
9719 Males made more than 50k\
\
The percentage of male earners above 50k is 13.89%
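The share of earners above 50K within each sex is the count above 50K divided by all respondents of that sex (below plus above). The sketch below recomputes both shares from the counts quoted above; the `counts` table is built by hand here rather than taken from `census`.

```python
import pandas as pd

# Counts quoted in the text above (rows: sex, columns: income flag 0/1)
counts = pd.DataFrame(
    {0: [76547, 60246], 1: [2663, 9719]},
    index=['Female', 'Male'],
)

# Share above 50K = count above 50K / total respondents of that sex
share_above_50k = counts[1] / counts.sum(axis=1)
print((share_above_50k * 100).round(2))

# On the full frame, the same table of row-wise shares could be obtained with:
# pd.crosstab(census['sex'], census['income'], normalize='index')
```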

In [None]:
census.employment_stat.unique()

In [None]:
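# str.replace returns a new Series, so the result is assigned back; strip() drops surrounding whitespace first
census['employment_stat'] = census['employment_stat'].str.strip().str.replace(' ', '_')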

In [None]:
census.employment_stat.value_counts()

In [None]:
# Let's remove the veterans_questionnaire column, as it does not appear to add any value to this analysis.

census = census.drop(columns='veterans_questionnaire')

In [None]:
# Here I will check the age of all those that fall into the Armed Forces or Children category just to be sure it is as it should be.
dim = (20, 10)
sns.set(font_scale = 2)
fig, ax = plt.subplots(figsize=dim)
ax.tick_params(axis='x', rotation=90)
sns.boxplot(x='employment_stat', y='age', data=census)

In [None]:
dummies = pd.get_dummies(census)

In [2]:
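# Separate features from the target, then split 2/3 for training and 1/3 for testing
X, y = dummies.drop(columns='income'), dummies['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=2/3, random_state=42)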
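The split described in the introduction (2/3 training, 1/3 testing) can be sketched on a toy frame as follows. Because earners above 50K are a small minority, this sketch also passes `stratify`, which is an optional choice and not something the notebook specifies. The toy data and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are made up for illustration. Note that `train_test_split` returns the feature splits first, then the target splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the encoded census frame: 10% of rows are above 50K
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(18, 90, size=300),
    'income': np.repeat([0, 1], [270, 30]),
})

X_toy = toy.drop(columns='income')   # features
y_toy = toy['income']                # binary target

# Return order is X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=1/3, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Stratifying keeps the proportion of the positive class roughly equal in both parts, which can make test metrics less sensitive to the particular random split.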