# **Adult Census Income Prediction Competition**
---

>### Contents
> 
> 1. Introduction
> 2. Import Packages
> 3. EDA
> 4. Build Various Models

# **1. Introduction**
---

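### **Competition Description**
Like Korea, the United States periodically conducts censuses and surveys of its adult population.  
This competition is based on data collected from US adults in 1994.  

Your task is to predict each person's income from this data.  

You make the prediction from 14 features in total, such as age, marital status, and occupation.  
The value to predict is simple.  

* 1 if annual income exceeds $50,000  

* 0 if annual income does not exceed $50,000  

Money was worth a different amount back then, but let's use our own insights to build a model that predicts as accurately as possible.  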
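
The labeling rule above can be sketched in pandas. This is a minimal illustration on a made-up three-row frame: the `income` column and its `'>50K'` / `'<=50K'` string values match the training data used later in this notebook, while `toy_df` and `encode_income` are hypothetical names introduced only for this example.

```python
import pandas as pd


def encode_income(income: pd.Series) -> pd.Series:
    """Map '>50K' to 1 and every other value to 0."""
    return (income == '>50K').astype(int)


# Toy rows for illustration only; not taken from the competition files.
toy_df = pd.DataFrame({'income': ['>50K', '<=50K', '<=50K']})
toy_target = encode_income(toy_df['income'])
print(toy_target.tolist())
```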

### **Timeline**

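* 20/10/16 12:00 - Competition starts
* 20/10/31 - Seminar: First Kaggle Challenge (안수빈)
* 20/11/07 - Seminar: Feature Engineering (김태진)
* 20/11/14 - Seminar: Models, Validation & Ensembling (강천성)
* **20/11/28 12:00 - Competition ends**
* 20/12/05 - Conversation with the Kaggle Korea organizers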

# **2. Import Packages**
---

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [1]:
import tensorflow as tf
import keras

import sklearn
from sklearn import model_selection  # KFold

import lightgbm
from lightgbm import LGBMClassifier

import category_encoders
from category_encoders import ordinal  # OrdinalEncoder

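# NOTE: this rebinds `tf` to the TF1 compatibility API and switches off TF2 behavior.
# No TensorFlow model is built in this section; `tf` is only used to print its version.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()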

In [1]:
print(tf.__version__)
print(sklearn.__version__)

# **3. EDA**
---

## 3.1. Load Data

In [1]:
# Check the current working directory
print('Current directory: ', os.getcwd())
print('')

# List the available data files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print('Data file: ', os.path.join(dirname, filename))

In [1]:
train_df = pd.read_csv('/kaggle/input/kakr-4th-competition/train.csv')
test_df  = pd.read_csv('/kaggle/input/kakr-4th-competition/test.csv')

sample_submission = pd.read_csv('/kaggle/input/kakr-4th-competition/sample_submission.csv')

In [1]:
train_df.sample(10)

In [1]:
test_df.sample(10)

## Features

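* age : age
* workclass : type of employment
* fnlwgt : weight indicating how representative the respondent is of the population (short for "final weight")
* education : education level
* education_num : education level as a number
* marital_status: marital status
* occupation : occupation
* relationship : family relationship
* race : race
* sex : sex
* capital_gain : capital gains
* capital_loss : capital losses
* hours_per_week : hours worked per week
* native_country : native country
* income : income (the value to predict)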
  * `>50K : 1`
  * `<=50K : 0`

* final weight looks out of place among these features...
* Let's check how final weight relates to income
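
One quick way to check this before plotting is to compare summary statistics of `fnlwgt` per income class and to compute its correlation with the binary target. The sketch below runs on synthetic numbers, not on the competition data, and `fnlwgt_by_income` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def fnlwgt_by_income(df: pd.DataFrame) -> pd.DataFrame:
    """Return the mean and median of fnlwgt for each income class."""
    return df.groupby('income')['fnlwgt'].agg(['mean', 'median'])


# Synthetic rows for illustration only.
toy_df = pd.DataFrame({
    'fnlwgt': [100_000, 200_000, 150_000, 250_000],
    'income': ['<=50K', '<=50K', '>50K', '>50K'],
})

summary = fnlwgt_by_income(toy_df)
print(summary)

# Pearson correlation between fnlwgt and the 0/1 target
# (for a binary target this equals the point-biserial correlation).
target = (toy_df['income'] == '>50K').astype(int)
corr = np.corrcoef(toy_df['fnlwgt'], target)[0, 1]
print(round(corr, 3))
```

A correlation close to zero on the real data would suggest that `fnlwgt` carries little linear signal about income, although a tree model can still pick up non-linear patterns.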

## Check Missing Data

In [1]:
train_df.info()

In [1]:
test_df.info()

* Not a single null? Don't be fooled
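
`info()` only counts real nulls, so a placeholder string such as `'?'` (which shows up among the `workclass` values later in this notebook) goes unnoticed. Here is a minimal sketch for counting such placeholders per column, on a toy frame. `count_placeholders` is a hypothetical helper. Whether the real files pad the marker with whitespace is not verified here, so values are stripped first as a precaution.

```python
import pandas as pd


def count_placeholders(df: pd.DataFrame, marker: str = '?') -> pd.Series:
    """Count the cells equal to `marker` in every column."""
    # Convert everything to text and strip whitespace so ' ?' and '?' both match.
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return (as_text == marker).sum()


# Toy data for illustration only.
toy_df = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', '?', ' ?'],
    'occupation': ['Sales', '?', 'Tech-support'],
})

placeholder_counts = count_placeholders(toy_df)
print(placeholder_counts)
```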

## Data Distribution

In [1]:
train_df.describe()

In [1]:
train_df.describe(include='O')

In [1]:
test_df.describe()

In [1]:
test_df.describe(include='O')

## EDA

In [1]:
numerical_order = ['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
categorical_order = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']

In [1]:
def get_min_max_avg(df, feature):
    print('Feature: ', feature)
    print('--------------------------------------')
    print('The max value is:',df[feature].max())
    print('The min value is:',df[feature].min())
    print('The average value is:',df[feature].mean())
    print('The median value is:',df[feature].median())

In [1]:
def get_unique_values(df, feature):
    # Print the number of distinct categories in `feature`, then one category per line
    all_categories = df[feature].unique()
    print(f'Column "{feature}" has {len(all_categories)} unique categories')
    print('------------------------------------------')
    print('\n'.join(all_categories))

### (1) age

In [1]:
get_min_max_avg(train_df, 'age')

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(25, 5))
ax.set_title('Age Distribution')
ax.hist(train_df['age'], bins=73)
plt.show()

In [1]:
from ggplot import ggplot, aes, geom_density, ggtitle  # explicit names instead of a wildcard import

In [1]:
fig = plt.figure()

ggplot(train_df, aes(x='age', fill='income')) + geom_density(alpha=0.7) + ggtitle("The age distribution by income")

### (2) workclass

In [1]:
get_unique_values(train_df, 'workclass')
workclass_order = train_df['workclass'].unique()  # ['Private' 'State-gov' '?' 'Self-emp-not-inc' 'Local-gov' 'Federal-gov' 'Self-emp-inc' 'Without-pay' 'Never-worked']

In [1]:
value_counts = train_df['workclass'].value_counts()[workclass_order]
over = train_df[train_df['income'] == '>50K']['workclass'].value_counts()
under = train_df[train_df['income'] == '<=50K']['workclass'].value_counts()[workclass_order]
percentages = (over / value_counts)[workclass_order].reset_index()

print('-------------------- All Data --------------------')
print(value_counts)
print('')

print('-------------------- income > 50K -------------------------')
print(over)
print('')

print('-------------------- income <= 50K -------------------------')
print(under)
print('')

print('-------------------- Ratio -------------------------')
print(percentages)

# reset_index already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
workclass_df = train_df['workclass'].value_counts()[workclass_order].reset_index(name='counts')
workclass_df
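
The per-category counts and the share of `>50K` earners computed above can also be obtained in one step with `pd.crosstab`. This is a minimal sketch on toy data. `income_share` is a hypothetical helper name.

```python
import pandas as pd


def income_share(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Row-normalized income distribution for each category of `feature`."""
    return pd.crosstab(df[feature], df['income'], normalize='index')


# Toy data for illustration only.
toy_df = pd.DataFrame({
    'workclass': ['Private', 'Private', 'Private', 'State-gov', 'State-gov'],
    'income': ['>50K', '<=50K', '<=50K', '>50K', '>50K'],
})

shares = income_share(toy_df, 'workclass')
print(shares)
```

With this approach a category that has no row in one income class gets a share of 0 instead of NaN.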

In [1]:
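fig, ax = plt.subplots(1, 2, figsize=(35, 10))

# Left: number of rows per workclass, with the count written above each bar
sns.countplot(x='workclass', data=train_df, palette="Set2", edgecolor='black', order=workclass_order, ax=ax[0])

counts = train_df['workclass'].value_counts()[workclass_order]
for i, count in enumerate(counts):
    ax[0].text(i - 0.1, count + 150, count)

# Right: counts split by income, annotated with the share of >50K rows per workclass
sns.countplot(x='workclass', data=train_df, hue='income', palette="Set2", edgecolor='black', order=workclass_order, ax=ax[1])

# Iterating the Series directly avoids relying on reset_index() column names, which differ across pandas versions
ratios = train_df['income'].eq('>50K').groupby(train_df['workclass']).mean()[workclass_order]
for i, (under_count, ratio) in enumerate(zip(under, ratios)):
    ax[1].text(i - 0.3, under_count + 150, f'{ratio:.3f}', rotation=45)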

### **(3) fnlwgt : final weight**

In [1]:
get_min_max_avg(train_df, 'fnlwgt')

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(35, 10))
ax.hist(train_df['fnlwgt'], bins=100)
plt.show()

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(100, 10))
sns.scatterplot(data=train_df, x='id', y='fnlwgt', ax=ax)

In [1]:
fig = plt.figure()
ggplot(train_df, aes(x='fnlwgt', fill='income')) + geom_density(alpha=0.7) + ggtitle('The final weight distribution by income')

### **(4) education**

In [1]:
get_unique_values(train_df, 'education')

In [1]:
col = 'education'

fig, ax = plt.subplots(1, 2, figsize=(35, 7))

value_counts = train_df[col].value_counts()

#
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.1, count + 150, count)

#
sns.countplot(x=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

### **(5) education_num**

In [1]:
get_min_max_avg(train_df, 'education_num')

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(25, 5))
ax.hist(train_df['education_num'], bins=15)
plt.show()

In [1]:
fig = plt.figure()
ggplot(train_df, aes(x='education_num', fill='income')) + geom_density(alpha=0.7) + ggtitle('The education_num distribution by income')

### **(6) marital_status**

In [1]:
get_unique_values(train_df, 'marital_status')

In [1]:
fig, ax = plt.subplots(1, 2, figsize=(35, 7))
col = 'marital_status'
value_counts = train_df[col].value_counts()
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
sns.countplot(x=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.1, count + 150, count)

### **(7) occupation**

In [1]:
get_unique_values(train_df, 'occupation')

In [1]:
col = 'occupation'

fig, ax = plt.subplots(1, 2, figsize=(35, 7))

value_counts = train_df[col].value_counts()

#
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
sns.countplot(y=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

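# Rotate the category labels of the left plot; plt.xticks() would act on the last axes (ax[1]) instead
ax[0].tick_params(axis='x', rotation=45)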

# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.12, count + 50, count)

### **(8) relationship**

In [1]:
get_unique_values(train_df, 'relationship')

In [1]:
col = 'relationship'

fig, ax = plt.subplots(1, 2, figsize=(35, 7))

#
value_counts = train_df[col].value_counts()

#
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
sns.countplot(x=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.1, count + 150, count)

### **(9) race**

In [1]:
get_unique_values(train_df, 'race')

In [1]:
col = 'race'
fig, ax = plt.subplots(1, 2, figsize=(35, 7))

#
value_counts = train_df[col].value_counts()

#
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
sns.countplot(x=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.1, count + 150, count)

### **(10) sex**

In [1]:
get_unique_values(train_df, 'sex')

In [1]:
col = 'sex'

fig, ax = plt.subplots(1, 2, figsize=(35, 6))

#
value_counts = train_df[col].value_counts()

#
sns.countplot(x=col, data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[0])
sns.countplot(x=col, hue='income', data=train_df, palette="Set2", edgecolor='black', order = value_counts.index, ax=ax[1]);

# Write the count above each bar (iterating the Series directly works across pandas versions)
for i, count in enumerate(value_counts):
    ax[0].text(i - 0.05, count + 150, count)

### **(11) capital_gain: capital gains**

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(25, 5))
ax.hist(train_df['capital_gain'], bins=100)
plt.show()

In [1]:
fig = plt.figure()
ggplot(train_df.loc[train_df['capital_gain'] > 0], aes(x='capital_gain', fill='income')) + geom_density(alpha=0.7) + ggtitle('The capital gain distribution by income')

In [1]:
sns.boxplot(x='income', y='capital_gain', data=train_df.loc[train_df['capital_gain'] > 0], palette="Set2", linewidth=2);

### **(12) capital_loss**

In [1]:
get_min_max_avg(train_df, 'capital_loss')

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(25, 5))
ax.hist(train_df['capital_loss'], bins=100)
plt.show()

In [1]:
fig = plt.figure();
ggplot(train_df.loc[train_df['capital_loss'] > 0], aes(x='capital_loss', fill='income')) + geom_density(alpha=0.7) + ggtitle('The capital loss distribution by income')

In [1]:
sns.boxplot(x='income', y='capital_loss', data=train_df.loc[train_df['capital_loss'] > 0], palette="Set2", linewidth=2);

### **(13) hours_per_week: hours worked per week**

In [1]:
get_min_max_avg(train_df, 'hours_per_week')

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(25, 5))
ax.hist(train_df['hours_per_week'], bins=98)
plt.show()

In [1]:
fig = plt.figure()
ggplot(train_df, aes(x='hours_per_week', fill='income')) + geom_density(alpha=0.7) + ggtitle('The hours per week distribution by income')

In [1]:
sns.boxplot(x='income', y='hours_per_week', data=train_df, palette="Set2", linewidth=2);

### **(14) native_country**

In [1]:
train_df['native_country'].value_counts()

# **4. Build Various Models**
---

## Defaults

In [1]:
train_df.drop(['id'], axis=1, inplace=True)
test_df.drop(['id'], axis=1, inplace=True)

In [1]:
train_y = train_df['income'] != '<=50K'
train_x = train_df.drop(['income'], axis=1)

In [1]:
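# Pass the columns by keyword: the first positional parameter of OrdinalEncoder is `verbose`, not `cols`.
# Only the categorical (string) columns are encoded; numeric columns are left unchanged.
label_encoder = ordinal.OrdinalEncoder(cols=categorical_order)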

In [1]:
train_x = label_encoder.fit_transform(train_x, train_y)
test_x  = label_encoder.transform(test_df)
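
To make the encoding step concrete, here is a minimal pandas-only sketch of what ordinal encoding does: learn a category-to-integer mapping on the training column, then apply the same mapping to the test column. This is not the `category_encoders` implementation. `fit_ordinal` and `apply_ordinal` are hypothetical helpers, and sending unseen categories to -1 is an assumption made for this sketch.

```python
import pandas as pd


def fit_ordinal(train_col: pd.Series) -> dict:
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    return {cat: code for code, cat in enumerate(pd.unique(train_col), start=1)}


def apply_ordinal(col: pd.Series, mapping: dict) -> pd.Series:
    """Encode `col` with `mapping`; categories not seen in training become -1."""
    return col.map(mapping).fillna(-1).astype(int)


# Toy columns for illustration only.
toy_train = pd.Series(['Private', 'State-gov', 'Private', '?'])
toy_test = pd.Series(['State-gov', 'Never-worked'])

mapping = fit_ordinal(toy_train)
encoded_train = apply_ordinal(toy_train, mapping)
encoded_test = apply_ordinal(toy_test, mapping)
print(encoded_train.tolist(), encoded_test.tolist())
```

The integers are arbitrary labels rather than magnitudes, which is why fitting on the training data and reusing the same mapping on the test data matters more than the particular codes chosen.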

In [1]:
NFOLDS = 5
folds = model_selection.KFold(n_splits=NFOLDS)

columns = train_x.columns
splits  = folds.split(train_x, train_y)
y_preds = np.zeros(test_x.shape[0])

feature_importances = pd.DataFrame()
feature_importances['feature'] = columns
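
`KFold` without `shuffle=True` cuts the rows into consecutive blocks, so each validation fold is a contiguous slice of the training data. Below is a minimal NumPy sketch of that index logic, for illustration only. `contiguous_folds` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np


def contiguous_folds(n_samples: int, n_splits: int):
    """Yield (train_index, valid_index) pairs over consecutive blocks of rows."""
    indices = np.arange(n_samples)
    # array_split gives the first n_samples % n_splits blocks one extra row.
    for valid_index in np.array_split(indices, n_splits):
        train_index = np.setdiff1d(indices, valid_index)
        yield train_index, valid_index


toy_splits = list(contiguous_folds(n_samples=7, n_splits=3))
for train_index, valid_index in toy_splits:
    print(train_index, valid_index)
```

If the rows are stored in a meaningful order, or the classes are imbalanced, `KFold(shuffle=True, random_state=...)` or `StratifiedKFold` may give more representative folds.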

In [1]:
model = LGBMClassifier(objective='binary', verbose=400, random_state=91)


for fold_n, (train_index, valid_index) in enumerate(splits):
    print('Fold: ', fold_n+1)
    train_X, validation_X = train_x.iloc[train_index], train_x.iloc[valid_index]
    train_Y, validation_Y = train_y.iloc[train_index], train_y.iloc[valid_index]

    evals = [(train_X, train_Y), (validation_X, validation_Y)]
    model.fit(train_X, train_Y, eval_metric='f1', eval_set=evals, verbose=True)
    
    feature_importances[f'fold_{fold_n + 1}'] = model.feature_importances_
        
    y_preds += model.predict(test_x).astype(int) / NFOLDS
    
    del train_X, validation_X, train_Y, validation_Y
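
As far as I know, `'f1'` is not among LightGBM's built-in metric names, so the `eval_metric='f1'` string above may not be evaluated as intended. LightGBM's scikit-learn API also accepts a callable that returns `(name, value, is_higher_better)`. The sketch below shows such a callable. It assumes the callable receives the true labels and the predicted probability of the positive class, and the 0.5 threshold is a choice made for this sketch. `lgbm_f1` is a hypothetical name.

```python
import numpy as np


def lgbm_f1(y_true, y_prob):
    """F1 score returned in the (name, value, is_higher_better) form."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) > 0.5).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return 'f1', float(f1), True


# Made-up labels and probabilities for illustration only.
name, value, higher_better = lgbm_f1([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
print(name, round(value, 3), higher_better)
```

Under these assumptions it would be passed as `model.fit(train_X, train_Y, eval_metric=lgbm_f1, eval_set=evals)`.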

In [1]:
feature_importances
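
To compare features across folds, the per-fold columns can be averaged and sorted. The sketch below uses made-up importance values. The `fold_1`, `fold_2` column naming follows the loop above, and `rank_features` is a hypothetical helper.

```python
import pandas as pd


def rank_features(importances: pd.DataFrame) -> pd.DataFrame:
    """Add the mean importance over all fold_* columns and sort by it, descending."""
    fold_cols = [c for c in importances.columns if c.startswith('fold_')]
    ranked = importances.copy()
    ranked['mean_importance'] = ranked[fold_cols].mean(axis=1)
    return ranked.sort_values('mean_importance', ascending=False).reset_index(drop=True)


# Made-up importance values for illustration only.
toy_importances = pd.DataFrame({
    'feature': ['age', 'fnlwgt', 'sex'],
    'fold_1': [300, 500, 40],
    'fold_2': [320, 460, 60],
})

ranked = rank_features(toy_importances)
print(ranked)
```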

In [1]:
# Majority vote: predict 1 when more than half of the folds predicted 1
sample_submission['prediction'] = (y_preds > 0.5).astype(int)
sample_submission.to_csv('submission.csv', index=False)
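
Because each fold adds its hard 0/1 prediction divided by `NFOLDS`, `y_preds` holds the fraction of folds that voted 1, and thresholding it at 0.5 amounts to a majority vote. A minimal NumPy sketch with made-up votes (`fold_votes`, `vote_share`, and `majority` are hypothetical names):

```python
import numpy as np

NFOLDS = 5

# Made-up hard predictions: one row per fold, one column per test sample.
fold_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
])

# Accumulate the votes the same way the fold loop does.
vote_share = np.zeros(fold_votes.shape[1])
for votes in fold_votes:
    vote_share += votes.astype(int) / NFOLDS

# A sample is labeled 1 only when more than half of the folds voted 1.
majority = (vote_share > 0.5).astype(int)
print(vote_share, majority)
```

Averaging `predict_proba` outputs instead of hard votes is a common alternative that keeps more information from each fold.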

### Drop Features

In [1]:
train_df.drop(['fnlwgt'], axis=1, inplace=True)
test_df.drop(['fnlwgt'], axis=1, inplace=True)

In [1]:
train_y = train_df['income'] != '<=50K'
train_x = train_df.drop(['income'], axis=1)

In [1]:
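# Pass the columns by keyword: the first positional parameter of OrdinalEncoder is `verbose`, not `cols`.
# Only the categorical (string) columns are encoded; numeric columns are left unchanged.
label_encoder = ordinal.OrdinalEncoder(cols=categorical_order)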

In [1]:
train_x = label_encoder.fit_transform(train_x, train_y)
test_x  = label_encoder.transform(test_df)

In [1]:
NFOLDS = 5
folds = model_selection.KFold(n_splits=NFOLDS)

columns = train_x.columns
splits  = folds.split(train_x, train_y)
y_preds = np.zeros(test_x.shape[0])

feature_importances = pd.DataFrame()
feature_importances['feature'] = columns

In [1]:
model = LGBMClassifier(objective='binary', verbose=400, random_state=91)


for fold_n, (train_index, valid_index) in enumerate(splits):
    print('Fold: ', fold_n+1)
    train_X, validation_X = train_x.iloc[train_index], train_x.iloc[valid_index]
    train_Y, validation_Y = train_y.iloc[train_index], train_y.iloc[valid_index]

    evals = [(train_X, train_Y), (validation_X, validation_Y)]
    model.fit(train_X, train_Y, eval_metric='f1', eval_set=evals, verbose=True)
    
    feature_importances[f'fold_{fold_n + 1}'] = model.feature_importances_
        
    y_preds += model.predict(test_x).astype(int) / NFOLDS
    
    del train_X, validation_X, train_Y, validation_Y

In [1]:
feature_importances

In [1]:
# Majority vote: predict 1 when more than half of the folds predicted 1
sample_submission['prediction'] = (y_preds > 0.5).astype(int)
sample_submission.to_csv('submission.csv', index=False)