<a id="section-one"></a>
# Introduction
### The dataset contains information about people and their income.

### The data columns:

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



### Our goal 
**Our goal is to classify people into two income groups (more than 50K per year, or 50K and less) based on the features above.**
This analysis is divided into the following steps:

1. Studying and preparing data;
2. Exploratory Data Analysis;
3. Classification;
4. The overall conclusion.

___________
P.S:

Hello everyone! I'm a beginner on Kaggle and I'm trying to become a more skilled data scientist.

I hope you will find my work informative. Please upvote the notebook if it is useful to you, thanks!

And now let's go ➡

# Table of Contents
* [Introduction](#section-one)
* [Step 1: Studying and preparing data](#section-two)
    - [Basic information](#sub-21)
    - [Data preprocessing](#sub-22)

* [Step 2: EDA](#section-three)
    - [Investigation of outliers in 'capital-gain','capital-loss' columns](#sub-31)
    - [Quantitative columns](#sub-32)
    - [Categorical columns](#sub-33)
    - [The correlation matrix](#sub-34)

* [Step 3: Feature Selection and Scaling](#section-four)
    - [Feature Selection](#sub-41)
    - [Feature Scaling](#sub-42)

* [Step 4: Building model](#section-five)
    - Building models
    - Model Evaluation

* [Step 5: Overall conclusion](#section-six)


In [None]:
#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

#removing warnings
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report 

# to depict tree_prediction
! pip install pydotplus
from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz
from IPython.display import Image

# listing the available input files
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="section-two"></a>
# Step 1: Studying and preparing data

In [None]:
# reading data: try the local file first, then the Kaggle input path
try:
    df = pd.read_csv('income_evaluation.csv')
except FileNotFoundError:
    df = pd.read_csv('/kaggle/input/income-classification/income_evaluation.csv')
    
df.head()

<a id="sub-21"></a>
### Basic info
#### Studying data

In [None]:
def data_descr(data, data_name=''):
    """Print a short overview: dtypes, summary statistics, duplicates and missing values."""
    print(f'The dataset is: {data_name}', end='\n\n')
    data.info()  # info() prints its report directly and returns None

    print('Statistical information', end='\n\n')
    display(data.describe(include='all'))

    duplicates = data.duplicated().sum()
    if duplicates > 0:
        print(f'The number of duplicates is {duplicates}.')
    else:
        print('There are no duplicates in the data')

    print('The share of missing values per column, in %')
    report = (data.isna().mean() * 100).to_frame(name='missing values in % of total')
    display(report)

data_descr(df, data_name='Income groups')

#### Checking outliers

In [None]:
plt.figure(figsize=(15,6))
plt.title('Box plots for the age and hours-per-week columns')
sns.boxplot( data=df[['age', ' hours-per-week']], orient='h')
plt.show()

In [None]:
plt.figure(figsize=(15,6))
plt.title('The histogram for the capital-gain and capital-loss columns')
sns.histplot( data=df[[' capital-gain',' capital-loss']])
plt.ylabel('The number of entries')
plt.xlabel('Capital in $')
plt.show()

<a id="sub-22"></a>
### Preprocessing stage
- There are no missing values, but we have to check for duplicates and remove them;
- It would be better to rename the columns to have a more convenient df for further research;
- It would also be useful to add new columns and to check and change dtypes.

In [None]:
#studying duplicates
display(df.loc[df.duplicated()].head())

# the duplicated rows show no meaningful pattern, so we remove them
df.drop_duplicates(inplace=True)
print('The number of duplicates in the data is ', df.duplicated().sum())

In [None]:
# cleaning the column names: remove spaces and replace hyphens with underscores
df.columns = df.columns.str.replace(' ', '')
df.columns = df.columns.str.replace('-', '_')
df.columns

In [None]:
# adding a categorical age-group column and a numeric (0/1) copy of the target
bins = [16, 24, 64, 90]
labels = ["young", "adult", "old"]
df['age_types'] = pd.cut(df['age'], bins=bins, labels=labels)
# the raw income values carry a leading space, hence ' >50K'
df['income_num'] = np.where(df['income'] == ' >50K', 1, 0).astype('int16')

# making two lists of column names
numeric_columns = ['age', 'capital_gain', 'capital_loss', 'hours_per_week', 'income_num']
categorical = ['workclass', 'education', 'age_types', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']

In [None]:
# changing categorical dtypes
for column in categorical:
    df[column] = df[column].astype("category")

df.info()  # info() prints its report directly and returns None

### Conclusion:

Now we have the following information about our dataset:

General:
- There are 32561 entries and 15 columns;
- There seem to be no errors in the data (for example, too high or too low age values);
- There are no missing values in the data;
- There are 25 duplicates, which were removed from the data;
- The 'capital-gain' and 'capital-loss' columns have some outliers, which have to be examined properly at the EDA stage;
- Two new columns (`age_types` and `income_num`) were added to the data;
- We have eleven categorical columns, including the target column (income), and six quantitative columns.

The main:

- The min age is 17, the max age is 90;
- The top workclass is 'Private';
- The most frequent occupation is "Prof-specialty";
- White race and male sex are top in the corresponding columns;
- The mean and the median (50%) of hours-per-week are 40.44 and 40.00 hours (they nearly coincide, which resembles a normal distribution);
- The most common income group is less than or equal to 50k.


<a id="section-three"></a>
# Step 2: EDA 

<a id="sub-31"></a>
### Investigation of outliers in 'capital-gain','capital-loss' columns
Questions:
- Who has >99000 capital-gain?
- How do these columns affect the target column?

In [None]:
plt.figure(figsize=(15,6))
plt.title('Box plots for the capital_gain and capital_loss columns')
sns.boxplot(data=df[['capital_gain','capital_loss']], orient='h')
# with orient='h' the y axis lists the columns and the x axis shows the values
plt.ylabel('Column')
plt.xlabel('Capital in $')
plt.show()

In [None]:
# Who has more than 99000 capital-gain?
q75_gain = df['capital_gain'].quantile(0.75)
capital_g_high = df.query('capital_gain > @q75_gain')
capital_g_max = df.query('capital_gain > 99000')

# for capital loss the 90th percentile is used as the threshold
q90_loss = df['capital_loss'].quantile(0.9)
capital_l_high = df.query('capital_loss > @q90_loss')
capital_l_max = df.query('capital_loss >= 4356')

display(capital_g_max.describe(include='all'))

Observation: all entries with the highest capital gain have >50k income. That's why we should not delete these values.

In [None]:
# How do these columns affect the target column?

subsets = {'Original DataFrame': df,
           'Capital loss max (4356)': capital_l_max,
           'High capital loss': capital_l_high,
           'High capital gain': capital_g_high,
           'Capital gain max (99 999)': capital_g_max}

fig, axs = plt.subplots(1, 5, sharey=True, figsize=(16, 4))
fig.suptitle('Relation between the target and capital gain and loss')

# one pie chart of the income split per subset
for ax, (name, subset) in zip(axs, subsets.items()):
    subset.groupby('income')['income'].count().plot.pie(autopct="%.1f%%", ax=ax)
    ax.set_title(name)

Conclusion: we see a strong relationship between the considered features and the target: the higher the capital gain, the higher the income, and vice versa.
- All features need to be scaled before building a model,
- It would be better to draw the corresponding conclusion after checking the correlation and distributions (the next step),
- We can't delete the outliers here, as that could distort the result, but it could be useful to check the difference.

<a id="sub-32"></a>
### Quantitative columns

Now let's take a look at how the quantitative variables are distributed

In [None]:
# histograms
param_graphs = df.hist(column=numeric_columns, figsize=(16, 10), bins=20)
for axis in param_graphs.flatten():
    axis.set_ylabel('frequency')
plt.show()

Most people are 20 to 50 years old according to the `age` histogram, and the distribution is right-skewed;

`capital_gain` looks similarly distributed to `capital_loss`, but with bigger outliers (see above);

For most people `hours_per_week` is no more than 50, and the most frequent value is about 40 (8 hours per day).

In [None]:
sns.pairplot(data=df, hue="income")
# suptitle labels the whole grid; plt.title would only label the last subplot
plt.suptitle('Distributions for each variable', y=1.02)
plt.show()

People younger than about 40 are less likely to be in the >50K group; the higher-income group is older on average.

All features have different ranges, so they need to be scaled later.

<a id="sub-33"></a>
### Categorical columns

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(20, 20))
axs = axs.flatten()
fig.suptitle('Relation between the categorical features and income')

categorical2 = ['workclass', 'marital_status', 'occupation', 'relationship', 'race', 'sex']
for ax, column in zip(axs, categorical2):
    sns.countplot(x='income', alpha=0.7, hue=column, data=df, ax=ax)
    # move the legend after plotting, on the axes it belongs to
    sns.move_legend(ax, 'upper left', bbox_to_anchor=(1.0, 1.0))

In [None]:
# check the mean and median ages for the marital_status and income groups
display(df.groupby('marital_status').agg({'age' : ['mean', 'median']}))
display(df.groupby('income').agg({'age' : ['mean', 'median']}))

- People in the Federal and Local government work classes are more likely to get the higher salary;

- People with the `Priv-house-serv` occupation are less likely to reach >50k income;

- `Never-married` people have low chances of getting the higher salary compared with the size of their <=50k group;

- Husbands have high chances of getting >50k income (also check the `sex` count plot);

- Most people with high income are or were in a family; this is related to the fact that their mean age is higher, which makes it less likely to be `Not-in-family`.


<a id="sub-34"></a>
### The correlation matrix

The correlation matrix shows the correlation coefficients between all pairs of numeric features in the data.


We use the Pearson correlation coefficient, which is a measure of the linear association between two variables. It has a value between -1 and 1 where:

- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables
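
To make the three cases above concrete, here is a minimal sketch that computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with `np.corrcoef`. The helper name `pearson_r` and the toy values are assumptions made for illustration only; they are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: centered cross-product over the product of centered norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Illustrative values only (hypothetical, not from the income data)
hours = [20, 35, 40, 45, 60]
up = [2 * h + 1 for h in hours]        # exact increasing linear function of hours
down = [100 - 3 * h for h in hours]    # exact decreasing linear function of hours

r_up = pearson_r(hours, up)            # perfectly positive linear relation
r_down = pearson_r(hours, down)        # perfectly negative linear relation
r_numpy = np.corrcoef(hours, up)[0, 1] # NumPy's built-in value for the same pair

print(round(r_up, 6), round(r_down, 6), round(r_numpy, 6))
```

Note that a coefficient near 0 only rules out a linear relation; two variables can still be related in a non-linear way.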


In [None]:
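# calculating the correlation matrix for the numeric columns only
corr = df.select_dtypes(include='number').corr()
# mask the upper triangle (including the diagonal) to show each pair once
matrix = np.triu(corr)
sns.heatmap(corr, vmax=1.0, vmin=-1.0,
            fmt='.1g', annot=True, mask=matrix)

plt.title('Correlation matrix')
plt.show()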

### Conclusion: 

- outliers in the `capital_loss` and `capital_gain` columns were investigated; there is no need to drop them, as they carry meaningful information;
- husbands and men appear more likely to have the higher income (see the categorical investigation stage);
- there is a weak positive correlation between each of hours_per_week, capital_loss, capital_gain, age and the income;
- most features have no linear correlation with each other.

<a id="section-four"></a>
# Step 3: Feature Selection and Scaling

<a id="sub-41"></a>
### Feature Selection

>     `Feature selection is primarily focused on removing non-informative or redundant predictors from the model.`

        — Page 488, Applied Predictive Modeling, 2013.

Also the performance of some models can degrade when input variables that are not relevant to the target variable are included. We have a small dataset, but it is nevertheless useful to master this skill. There are two main ways to select variables:

- Unsupervised;
- Supervised.


We choose the supervised approach, which is divided into three groups of methods:

- Intrinsic (Trees);
- Wrapper methods (RFE);
- Filter methods (Stats & Feature Importance).

We use RandomForest to select features based on feature importance, which in this method is the average of the feature importances of all decision trees in the forest.
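
The averaging mentioned above can be checked directly. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the income dataset): it averages `feature_importances_` over the fitted trees in `estimators_` and compares the result with the forest's own `feature_importances_`. The names `X_demo`, `y_demo` and `forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Importance vector of every individual tree, one row per tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees and renormalize so the importances sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print(np.round(averaged, 3))
print(np.round(forest.feature_importances_, 3))
```

`SelectFromModel` then keeps the features whose importance passes its threshold, up to `max_features`.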

In [None]:
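# dividing data: the target and the numeric features
Y = df['income'].copy()
# income_num is an encoded copy of the target, so it is dropped to avoid leakage
X = df.drop(columns=categorical + ['income_num']).copy()

# defining feature selection (random_state fixed for reproducibility)
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=80, random_state=0), max_features=5)
rf_selector.fit(X, Y)

rf_support = rf_selector.get_support()
rf_features = X.loc[:, rf_support].columns.tolist()
print(len(rf_features), 'selected features:', *rf_features)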

Okay, this step probably requires additional research.

<a id="sub-42"></a>
### Feature Scaling

**Why should we use feature scaling?**

Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant to it.

**There are two ways: normalization and standardization**

- Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
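
The two definitions above can be sketched in a few lines of NumPy. This is only an illustration: the helper names `min_max_scale` and `standardize` and the toy `ages` array are assumptions, and the formulas follow the usual conventions (population standard deviation, as `StandardScaler` also uses).

```python
import numpy as np

def min_max_scale(values):
    """Normalization: shift and rescale the values into the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative values only (hypothetical, not taken from the dataset)
ages = np.array([17, 25, 40, 64, 90])

normalized = min_max_scale(ages)
standardized = standardize(ages)

print(np.round(normalized, 3))
print(np.round(standardized, 3))
```

Both transformations are linear, so they change the range of a feature but not the shape of its distribution.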


Our features do not follow a Gaussian distribution, so we are going to normalize them (min-max scaling).

In [None]:
# scaling features
numeric_columns = ['age', 'capital_gain', 'capital_loss', 'fnlwgt', 'hours_per_week']
# min-max normalization: fit the scaler and rescale every numeric column to [0, 1]
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df[numeric_columns])
# keep the original index so the categorical columns align row by row
df_scalled = pd.DataFrame(scaled_values, columns=numeric_columns, index=df.index)
df_scalled = df_scalled.join(df[categorical])
df_scalled

<a id="conclusion-3"></a>
### Conclusion:

- the features do not follow a Gaussian distribution, so we chose normalization to scale the data;

<a id="section-five"></a>
# Step 4: Building model

Models to build:
- Linear;
- Decision Tree;
- Support Vector Machine;
- Random Forest.
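
As a rough illustration of how the four listed model families could be trained and compared, here is a minimal sketch. It rests on several assumptions: the data is synthetic (`make_classification`), `LogisticRegression` stands in for the "Linear" model, accuracy is the only metric, and the hyperparameters are mostly defaults. It is not a result for the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split 70/30 like the real data later on
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# One candidate per model family from the list above
models = {
    'Linear (logistic regression)': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=1),
    'Support Vector Machine': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators=80, random_state=0),
}

# Fit every model on the training part and score it on the held-out part
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f'{name}: accuracy = {scores[name]:.3f}')
```

Because the income classes are imbalanced, `classification_report` (precision and recall per class) is likely a better basis for comparison than accuracy alone.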

### Building models

In [None]:
X, Y = df_scalled.drop('income', axis=1), df_scalled['income'].copy()
categorical = ['workclass', 'education', 'age_types', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']

# encoding the target and keeping the class names for the reports
le = LabelEncoder()
Y = le.fit_transform(Y)
Y_names = le.classes_
print(f'Income classes are {Y_names}')

# encoding every categorical feature with its own encoder
for column in categorical:
    X[column] = LabelEncoder().fit_transform(np.asarray(X[column]))

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

In [None]:
# Decision tree
tree = DecisionTreeClassifier(criterion='gini', 
                              max_depth=4, 
                              random_state=1)
tree.fit(X_train, y_train)

y_pred_tree = tree.predict(X_test)
print(classification_report(y_test, y_pred_tree))


dot_data = export_graphviz(tree,
                           filled=True, 
                           rounded=True,
                           class_names=Y_names,
                           feature_names=list(X.columns),
                           out_file=None) 
graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')
Image(filename = 'tree.png')

In [None]:
# SVM

In [None]:
#RF

### Model Evaluation

<a id="conclusion-4"></a>
### Conclusion:

<a id="section-six"></a>
# Step 5: Overall conclusion

1. In the first step we studied the data in detail;
2. At the second step 

### Useful links:

- Thanks to [this source](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/) for the information about feature importance;