**Introduction**
We were provided data from the Korean government covering the years 2005 to 2018. It includes detailed information on household statistics and income. The dataset contains the following fields:

id

year: year the study was conducted

wave: from the 1st wave in 2005 to the 14th wave in 2018

region: 1) Seoul 2) Kyeong-gi 3) Kyoung-nam 4) Kyoung-buk 5) Chung-nam 6) Gang-won & Chung-buk 7) Jeolla & Jeju

income: yearly income in M KRW (million Korean won; 1100 KRW = 1 USD)

family_member: no. of family members

gender: 1) male 2) female

year_born

education_level: 1) no education(under 7 yrs-old) 2) no education(7 & over 7 yrs-old) 3) elementary 4) middle school 5) high school 6) college 7) university degree 8) MA 9) doctoral degree

marriage: marital status. 1) not applicable (under 18) 2) married 3) separated by death 4) separated 5) not married yet 6) others

religion: 1) have religion 2) do not have

occupation: job code, defined in a separate code book

company_size

reason_none_worker: 1) not capable 2) in military service 3) studying in school 4) preparing for school 5) preparing to apply for a job 6) house worker 7) caring for kids at home 8) nursing 9) giving up economic activities 10) no intention to work 11) others

Throughout my analysis, I will look at how the data can be presented visually and whether any interesting findings can be drawn from it.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from matplotlib import pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Load the overall dataset and the occupation definitions into dataframes. Although the job titles will not be used in the model, they will be added to our dataset to make the data easier to visualize.

In [None]:
wf = pd.read_csv('/kaggle/input/korea-income-and-welfare/Korea Income and Welfare.csv')

The analysis only requires the job code and title. In the interest of keeping the data clean, the first two columns can be dropped.

In [None]:
job_check = pd.read_excel('/kaggle/input/korea-income-and-welfare/job_code_translated.xlsx')
job_check = job_check[['job_code', 'job_title']]
# Rename job_code so it matches the occupation column in the main dataset
job_check = job_check.rename(columns={"job_code": "occupation"})

**Preprocessing**

First, I wanted to see if there were any missing values in the dataset. The output below shows that there are no null values in the data.

In [None]:
wf.isna().sum()
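
One caveat worth illustrating: `isna()` only counts true missing values, so a placeholder string such as the `' '` found later in occupation and reason_none_worker is not counted. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real data, and shows one possible way to count blank cells.

```python
import pandas as pd

# Toy frame standing in for wf; all values are made up for illustration.
toy = pd.DataFrame({
    "income": [1200, 3400, 560],
    "occupation": ["942", " ", "873"],
})

# isna() reports nothing missing because ' ' is an ordinary string.
na_counts = toy.isna().sum()

# Counting cells that are empty after stripping whitespace reveals the placeholder.
blank_count = toy["occupation"].str.strip().eq("").sum()

print(na_counts)
print("blank occupation cells:", blank_count)
```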

Next, using describe we can look at the data in a much more aggregated view. A few things jump out right away.

1. Income - Ranges from -232,174 to 468,209, with the majority of incomes below 5,000.

2. Marriage - There are values of 0 and 9, which do not have descriptions associated with them.

3. Religion - This should be binary (1 or 2), but a 9 currently appears in the column.

4. Occupation, Company_size, Reason_none_worker - These did not show up in the aggregated view, which suggests the columns contain some non-numeric values.

In [None]:
wf.describe()

In [None]:
wf.nunique()

In [None]:
wf.marriage.unique()

In [None]:
# Drop the undocumented marriage codes 0 and 9
wf = wf[wf['marriage'].between(1, 8)]

In [None]:
wf.marriage.unique()

Next, the reason_none_worker column needs to be cleaned up. Only the values 1 through 11 have descriptions. In addition, cells containing ' ' represent individuals who currently have jobs.

In [None]:
wf.reason_none_worker.unique()

A large number of rows have ' ' in reason_none_worker. To confirm that all of these rows actually belong to individuals with jobs, the dataset was filtered to rows where occupation is ' ', which signifies no job. The value counts show rows where reason_none_worker is also ' ' (1493 instances). These rows are contradictory, listing neither a job nor a reason for not working, so they need to be removed to clean the data.

In [None]:
wf[(wf['occupation'] == ' ')].value_counts('reason_none_worker')

Remove all rows that have 0 or 99 in the reason_none_worker column, since neither code has a description.

In [None]:
wf = wf[wf['reason_none_worker'] != '99']
wf = wf[wf['reason_none_worker'] != '0']

In order to remove the rows that had a ' ' in both occupation and reason_none_worker, two check columns were added to the dataset. Once the indexes of the rows where both checks were True were found, the rows at those indexes were dropped. The newly made columns were then dropped as well because they were no longer useful.

In [None]:
# True where occupation is blank (no job recorded)
wf['check'] = wf['occupation'] == ' '

In [None]:
# True where reason_none_worker is blank (no reason recorded)
wf['check2'] = wf['reason_none_worker'] == ' '

In [None]:
indexNames = wf[wf['check'] & wf['check2']].index

In [None]:
print(indexNames)

Drop the flagged rows, then remove the two helper columns.

In [None]:
wf.drop(indexNames , inplace=True)
wf = wf.drop(['check'], axis=1)
wf = wf.drop(['check2'], axis=1)
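
The same cleanup can also be expressed without helper columns by building one boolean mask and filtering on it. The following is only a sketch on a made-up frame (`toy` and its values are invented for illustration), not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy frame standing in for wf; all values are made up for illustration.
toy = pd.DataFrame({
    "occupation": ["942", " ", " ", "873"],
    "reason_none_worker": [" ", " ", "3", " "],
})

# A row is contradictory when it lists neither a job nor a reason for not working.
both_blank = (toy["occupation"] == " ") & (toy["reason_none_worker"] == " ")

# Keep every row that is not flagged by the mask.
cleaned = toy[~both_blank].reset_index(drop=True)
print(cleaned)
```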

To continue with the analysis, the coded values were replaced with their descriptions. These columns will eventually need to be broken into dummy variables so the machine learning models can run.

In [None]:
wf.loc[wf['marriage'] == 1, 'marriage'] = 'NA(Under_18)'
wf.loc[wf['marriage'] == 2, 'marriage'] = 'married'
wf.loc[wf['marriage'] == 3, 'marriage'] = 'separated_by_death'
wf.loc[wf['marriage'] == 4, 'marriage'] = 'separated'
wf.loc[wf['marriage'] == 5, 'marriage'] = 'not_married_yet'
wf.loc[wf['marriage'] == 6, 'marriage'] = 'others'

In [None]:
wf.loc[wf['education_level'] == 1, 'education_level'] = 'no_education(under_7)'
wf.loc[wf['education_level'] == 2, 'education_level'] = 'no_education'
wf.loc[wf['education_level'] == 3, 'education_level'] = 'elementary'
wf.loc[wf['education_level'] == 4, 'education_level'] = 'middle_school'
wf.loc[wf['education_level'] == 5, 'education_level'] = 'high_school'
wf.loc[wf['education_level'] == 6, 'education_level'] = 'college'
wf.loc[wf['education_level'] == 7, 'education_level'] = 'university_degree'
wf.loc[wf['education_level'] == 8, 'education_level'] = 'MA'
wf.loc[wf['education_level'] == 9, 'education_level'] = 'doctoral_degree'

In [None]:
wf.loc[wf['region'] == 1, 'region'] = 'Seoul'
wf.loc[wf['region'] == 2, 'region'] = 'Kyeong-gi'
wf.loc[wf['region'] == 3, 'region'] = 'Kyoung-nam'
wf.loc[wf['region'] == 4, 'region'] = 'Kyoung-buk'
wf.loc[wf['region'] == 5, 'region'] = 'Chung-nam'
wf.loc[wf['region'] == 6, 'region'] = 'Gang-won & Chung-buk'
wf.loc[wf['region'] == 7, 'region'] = 'Jeolla & Jeju'

In [None]:
wf.loc[wf['reason_none_worker'] == '1', 'reason_none_worker'] = 'not_capable'
wf.loc[wf['reason_none_worker'] == '2', 'reason_none_worker'] = 'in_military_service'
wf.loc[wf['reason_none_worker'] == '3', 'reason_none_worker'] = 'studying_in_school'
wf.loc[wf['reason_none_worker'] == '4', 'reason_none_worker'] = 'prepare_for_school'
wf.loc[wf['reason_none_worker'] == '5', 'reason_none_worker'] = 'prepare_to_apply_job'
wf.loc[wf['reason_none_worker'] == '6', 'reason_none_worker'] = 'house_worker'
wf.loc[wf['reason_none_worker'] == '7', 'reason_none_worker'] = 'caring_for_kids_at_home'
wf.loc[wf['reason_none_worker'] == '8', 'reason_none_worker'] = 'nursing'
wf.loc[wf['reason_none_worker'] == '9', 'reason_none_worker'] = 'giving_up_economic_activities'
wf.loc[wf['reason_none_worker'] == '10', 'reason_none_worker'] = 'no_intention_to_work'
wf.loc[wf['reason_none_worker'] == '11', 'reason_none_worker'] = 'other'
wf.loc[wf['reason_none_worker'] == ' ', 'reason_none_worker'] = 'employed'

In [None]:
wf.loc[wf['gender'] == 1, 'gender'] = 'male'
wf.loc[wf['gender'] == 2, 'gender'] = 'female'
wf.loc[wf['religion'] == 1, 'religion'] = 'religious'
wf.loc[wf['religion'] == 2, 'religion'] = 'non-religious'
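
The code-to-label replacements above can also be written as dictionary lookups with `Series.map`. This is a minimal sketch on a made-up frame (`toy` is invented for illustration). Note one behavioral difference: `map` turns any code missing from the dictionary, such as the stray religion value 9, into NaN, whereas the `.loc` assignments leave it unchanged.

```python
import pandas as pd

# Toy frame standing in for wf; all values are made up for illustration.
toy = pd.DataFrame({"gender": [1, 2, 2, 1], "religion": [1, 2, 9, 1]})

# Code-to-label dictionaries taken from the field descriptions.
gender_labels = {1: "male", 2: "female"}
religion_labels = {1: "religious", 2: "non-religious"}

toy["gender"] = toy["gender"].map(gender_labels)
# Codes missing from the dictionary (here the 9) become NaN and are easy to spot.
toy["religion"] = toy["religion"].map(religion_labels)

print(toy)
```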

In [None]:
# 20000 is a placeholder code for rows with no occupation, so the column can be cast to int
wf.loc[wf['occupation'] == ' ', 'occupation'] = 20000

wf = wf.astype({'occupation': 'int64'})

In [None]:
wf = wf.merge(job_check, on='occupation', how='left')
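
After a left merge like this, it can be useful to check which occupation codes found no matching title. The sketch below uses made-up frames (`people`, `jobs`, and the codes 942 and 873 are hypothetical) together with the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for wf and job_check; codes and titles are illustrative only.
people = pd.DataFrame({"occupation": [942, 20000, 873]})
jobs = pd.DataFrame({
    "occupation": [942, 873],
    "job_title": ["Cleaners", "Car drivers"],
})

# indicator=True adds a _merge column; validate raises if a code appears twice in jobs.
merged = people.merge(
    jobs, on="occupation", how="left", indicator=True, validate="many_to_one"
)

# Codes present in people but missing from the code book.
unmatched = merged.loc[merged["_merge"] == "left_only", "occupation"].tolist()
print(unmatched)
```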

**Data Visualization**

The first relationship we will look at is income by education level. The majority of the data is concentrated around 0, with the largest dispersion in income among people with a high school, college, or university education. Following the boxplot (shown below), the quantiles show that 99.6 percent of incomes fall between -47 and 24,484. Without additional information, I assumed that income cannot be negative and removed all rows with income under 0 or over 25,000.

In [None]:
plt.figure()
sns.boxplot(data=wf, x='education_level', y='income')
plt.title('Education to Income Comparison')
plt.xticks(rotation=45)
plt.show()

In [None]:
wf.income.quantile([.002, .998])

In [None]:
wf = wf[(wf['income'] <= 25000) & (wf['income'] >= 0)]
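
To make the 99.6 percent figure concrete: the 0.2% and 99.8% quantiles bracket all but 0.4% of the values. The sketch below checks this on synthetic numbers (the normal distribution and its parameters are assumptions chosen only for illustration, not a model of the real incomes).

```python
import numpy as np
import pandas as pd

# Synthetic incomes; the distribution is an arbitrary choice for illustration.
rng = np.random.default_rng(0)
income = pd.Series(rng.normal(3000, 2000, size=10_000))

# Same quantiles as used in the notebook.
lower, upper = income.quantile([0.002, 0.998])

# Share of values that fall inside the quantile range (inclusive on both ends).
share = income.between(lower, upper).mean()
print(round(share, 4))
```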

The dataset contains a much larger proportion of males than females. In addition, there is a large increase in the number of individuals included in the dataset in 2011.

In [None]:

plt.figure(figsize = (15,8))
sns.countplot(data = wf, x='year', hue='gender')
plt.title('Total Individuals Per Year by Gender')
plt.xticks(rotation=45)
plt.show()

Next, we wanted to see if a breakdown by marital status would reveal a specific group that caused the increase in the welfare survey in 2011. In the visualization, we can see that most of the increase came from individuals under 18. A major increase in birth rates could explain this; however, birth rates in Korea have been decreasing, so I find this explanation unlikely.

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(data = wf, x='year', hue='marriage')
plt.title('Marriage by Year')
plt.xticks(rotation=45)
plt.show()

In the following visualization, we look at the number of individuals in each job category (top 10). Based on the survey, cleaners and car drivers are the categories in which most people work. The second visualization, showing average income for these categories, indicates that cleaners are by far the lowest paid of the top job categories.

In [None]:
plt.figure(figsize = (8,6))
sns.countplot(data = wf, y='job_title', order=wf.job_title.value_counts().iloc[1:11].index)
plt.xticks(rotation=90)
plt.title('Workers by Job Category (Top 10)')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(data=wf, x='job_title', y='income', order=wf.job_title.value_counts().iloc[1:11].index)
plt.title('Job Category to Income Comparison (Top 10)')
plt.xticks(rotation=90)
plt.show()

Instead of using year born, I thought it would be easier to look at the age of an individual. Below is the code to add the column and remove the year_born.

In [None]:
# Age at the time of the survey
wf['age'] = wf['year'] - wf['year_born']

In [None]:
wf=wf.drop('year_born', axis=1)

Below is a visualization of the age distribution by gender. Men are spread more evenly across the full age range, whereas women in the dataset are much more likely to be between 60 and 80.

In [None]:
target_0 = wf.loc[wf['gender'] == 'male']
target_1 = wf.loc[wf['gender'] == 'female']

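# distplot is deprecated in recent seaborn releases; histplot is its replacement for histograms
sns.histplot(target_0["age"], bins=20, label='male')
sns.histplot(target_1["age"], bins=20, label='female')
plt.legend()
plt.show()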

Similar to the job category visualizations, I wanted to look at total count and average income by region in Korea. Kyeong-gi had the largest number of individuals in the dataset, but Seoul had the highest average income. I then chose to see whether the data can help us predict the region of an individual.

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(data = wf, y='region', hue='gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure()
sns.barplot(data=wf, x='region', y='income')
plt.xticks(rotation=90)
plt.show()

In [None]:
from sklearn import preprocessing

x = np.array(wf['age']) #returns a numpy array
x = np.reshape(x,(-1,1))
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
wf['age'] = x_scaled

In [None]:
x = np.array(wf['income']) #returns a numpy array
x = np.reshape(x,(-1,1))
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
wf['income'] = x_scaled

In [None]:
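# Blank strings mark a missing company size; coerce the column to numeric so they become NaN
wf['company_size'] = pd.to_numeric(wf['company_size'], errors='coerce')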

In [None]:
x = np.array(wf['company_size']) 
x = np.reshape(x,(-1,1))
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
wf['company_size'] = x_scaled

In [None]:
x = np.array(wf['family_member']) 
x = np.reshape(x,(-1,1))
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
wf['family_member'] = x_scaled

In [None]:
x = np.array(wf['year']) 
x = np.reshape(x,(-1,1))
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
wf['year'] = x_scaled
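
The cells above rescale each numeric column to the range 0 to 1, one column at a time. The same idea can be sketched in a single call by passing several columns to `MinMaxScaler` at once. The frame below (`toy`) is made up for illustration. As a general caution, fitting the scaler on all rows before the train/test split lets the test rows influence the scaling; fitting on the training rows only is a common alternative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for wf; all values are made up for illustration.
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "income": [0, 12500, 25000],
    "family_member": [1, 2, 5],
})
numeric_cols = ["age", "income", "family_member"]

# Each column is rescaled independently: (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)
print(scaled)
```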

For our model, I chose to drop id and wave: id is only an identifier, and wave provides the same information as year. In addition, I chose to drop occupation and company_size, because occupation would add too many columns to the dataset once broken into dummy variables, and company size is tied to the occupation variable.

In [None]:
df=wf[['region', 'income', 'age', 'year','family_member', 'gender', 'education_level', 'marriage', 'religion', 'reason_none_worker']]

In order to include variables that are in a string form, we need to break out the variable by using dummy variables.

In [None]:
for col in df.columns[5:]:
    df = pd.get_dummies(df, columns=[col], prefix = [col])
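
To show what the dummy encoding produces, here is a minimal sketch on a made-up frame (`toy` is invented for illustration). Each string column is replaced by one indicator column per category, named with the original column as a prefix.

```python
import pandas as pd

# Toy frame standing in for df; all values are made up for illustration.
toy = pd.DataFrame({
    "income": [0.1, 0.4, 0.9],
    "gender": ["male", "female", "male"],
    "religion": ["religious", "non-religious", "religious"],
})

# One call can encode several string columns at once.
encoded = pd.get_dummies(toy, columns=["gender", "religion"])
print(encoded.columns.tolist())
```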

Next, I split the dataset into training and test sets. For simplicity, I used an 80/20 split.

In [None]:
from sklearn.model_selection import train_test_split
target = df.iloc[:, 0:1].values.ravel()
data = df.iloc[:,1:len(df.columns)]

x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.2, random_state=5)
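
One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that each region keeps the same share in the training and test sets. The sketch below uses made-up labels with an 80/20 class imbalance to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 80 rows of one region, 20 of another.
X = np.arange(100).reshape(-1, 1)
y = np.array(["Seoul"] * 80 + ["Jeolla & Jeju"] * 20)

# stratify=y keeps the 80/20 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)
print((y_te == "Seoul").sum(), (y_te == "Jeolla & Jeju").sum())
```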

The classification models that I'm using are: Logistic Regression, Decision Tree Classifier, and K Neighbors Classifier.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# multi_class is deprecated in recent scikit-learn; the default lbfgs solver already fits a multinomial model
model = LogisticRegression(max_iter=500)
model.fit(x_train, y_train)
pre = model.predict(x_test)

In [None]:
model = DecisionTreeClassifier()
model_dec = model.fit(x_train, y_train)
pre_dec = model.predict(x_test)

In [None]:
neigh = KNeighborsClassifier(n_neighbors = 7).fit(x_train,y_train)
pre_kn = neigh.predict(x_test)


Below is a classification report for each model to determine which performed best. K Neighbors has the highest accuracy and F1 score, though only slightly higher than the other models.

In [None]:
from sklearn.metrics import classification_report
print('Logistic Regression')
print(classification_report(y_test, pre))
print('Decision Tree')
print(classification_report(y_test, pre_dec))
print('K Neighbors')
print(classification_report(y_test, pre_kn))

The original model used 7 neighbors. Next, we will see whether changing the number of neighbors improves the accuracy score. With 19 neighbors accuracy does increase, but only by about 1.5%.

In [None]:
from sklearn import metrics
acc = {}
for i in range(1,20):
    neigh = KNeighborsClassifier(n_neighbors = i).fit(x_train,y_train)
    pre_kn = neigh.predict(x_test)
    acc[i] = metrics.accuracy_score(y_test, pre_kn)

In [None]:
print(acc)

In [None]:
# Number of neighbors with the highest accuracy
max(acc, key=acc.get)
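
The loop above picks the number of neighbors by scoring on the test set, which can make the reported accuracy slightly optimistic. A possible alternative, sketched here on synthetic data from `make_classification` (the dataset and its parameters are assumptions for illustration), is to compare values of k with cross-validation on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=5
)

# Mean 5-fold cross-validated accuracy for each candidate k.
cv_acc = {}
for k in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_acc[k] = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()

best_k = max(cv_acc, key=cv_acc.get)
print(best_k, round(cv_acc[best_k], 3))
```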

The confusion matrix below compares the predicted values to the actual values and shows that the regions were poorly predicted.

In [None]:
from sklearn.metrics import confusion_matrix 

final_cm = confusion_matrix(y_test, pre_kn)
knn_labels = neigh.classes_

plt.figure(figsize=(10,7))

ax= plt.subplot()
sns.heatmap(final_cm, annot=True, ax = ax, fmt="d");

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix');
ax.yaxis.set_tick_params(rotation=360)
ax.xaxis.set_tick_params(rotation=90)

ax.xaxis.set_ticklabels(knn_labels); 
ax.yaxis.set_ticklabels(knn_labels);
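
Because the regions differ in size, raw counts can be hard to compare across rows. As a sketch, `confusion_matrix` accepts `normalize="true"` to show, for each true region, the share of its rows assigned to each predicted region. The labels below are a made-up example, not model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration.
y_true = ["Seoul", "Seoul", "Kyeong-gi", "Kyeong-gi", "Kyeong-gi", "Seoul"]
y_pred = ["Seoul", "Kyeong-gi", "Kyeong-gi", "Kyeong-gi", "Seoul", "Seoul"]
labels = ["Kyeong-gi", "Seoul"]

# normalize="true" divides each row by the number of true samples in that class.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm, 2))
```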

**Conclusion**

Based on the above models, the information that I chose did not provide a strong predictor of where a person is from. The models might have improved if I had not dropped occupation and company size, but I believe this would not have helped much because the data is relatively evenly distributed across regions. In addition, there are many other variables that we could look at in the future for classification purposes.