### US Adult Income Prediction with _KNN and Decision Trees_
This is a mini-project assigned by **Mr. El Younoussi Yassine** 
as part of the Business Intelligence module.

The goal is to _build a model_ that predicts a _US adult's annual income_ from the other features.

Presented by:
**Chraibi Mohammed Yassine** and **Chaoui El Mehdi**

### Step 1: Data understanding

In [None]:
# Importing pandas
import pandas as pd
columns = ['Age','Workclass','fnlgwt','Education','Education Num','Marital Status','Occupation',
           'Relationship','Race','Sex','Capital Gain','Capital Loss','Hours/Week','Native Country','Target']
# Loading training set; '?' is read as a missing value and the row labelled 0 is dropped
train_data = pd.read_csv("../input/trainingandtest/train.csv", 
                        header=None, na_values='?', sep=', ', engine='python', names=columns).drop([0])
# Showing the first 10 rows of the training set
train_data.head(10)

In [None]:
# Loading test set
test_data = pd.read_csv("../input/trainingandtest/test.csv", 
                        header = None, skiprows=1, na_values='?', sep=', ', engine='python', names=columns)
# Showing the first 10 rows of the test set
test_data.head(10)

In [None]:
# Showing set volume and dimensions
print(train_data.shape)
print(test_data.shape)
# Describing data sets
display(train_data.describe().T)
display(test_data.describe().T)

##### Note: the training set has 32561 records and the test set has half as many (16281). Both sets have 15 columns (features), including the Target class

In [None]:
# Showing data types for each set
# info() prints its summary and returns None, so it is called directly
train_data.info()
test_data.info()

#### Conclusions
We notice that some features have different types in the two data sets, so these values must be preprocessed.

Features to change from object to int64:
+ Age

Features to change from float64 to int64:
+ Education Num
+ fnlgwt
+ Capital Gain
+ Capital Loss
+ Hours/Week

##### Note: We can already observe that there are missing values in both data sets
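
The type differences listed above can also be found programmatically instead of by reading the two `info()` outputs. The following is a minimal sketch on two toy frames (`train_toy` and `test_toy` are hypothetical stand-ins with made-up values, not the real data sets):

```python
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical values)
train_toy = pd.DataFrame({"Age": ["39", "50"], "Education Num": [13, 9], "Hours/Week": [40, 13]})
test_toy = pd.DataFrame({"Age": [25, 38], "Education Num": [7, 9], "Hours/Week": [40.0, 50.0]})

# Put the dtype names of both sets side by side, one row per column
dtype_comparison = pd.DataFrame({
    "train": train_toy.dtypes.astype(str),
    "test": test_toy.dtypes.astype(str),
})

# Keep only the columns whose dtypes differ between the two sets
mismatched = dtype_comparison[dtype_comparison["train"] != dtype_comparison["test"]]
print(mismatched)
```

Applied to the real sets, the same comparison would use `train_data.dtypes` and `test_data.dtypes`.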

### Step 2: Data visualisation

In [None]:
# Importing matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np
# Visualizing feature counts
# Histograms for numerical values and bar charts for categorical values
for column in columns:
    # np.object was removed in recent NumPy versions; the builtin object is equivalent here
    if train_data[column].dtype == object:
        train_data[column].value_counts().plot(kind="bar", title=column)
    else:
        train_data[column].hist()
        plt.title(column)
    plt.show()

#### We notice that the target class is imbalanced
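
The imbalance seen in the bar chart can be quantified with class proportions. This is a minimal sketch on a toy target column (the values are invented for illustration; on the real data the call would be made on `train_data['Target']`):

```python
import pandas as pd

# Toy target column (hypothetical values, not the real data)
target = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="Target")

# Share of each class, as fractions that sum to 1
class_shares = target.value_counts(normalize=True)
print(class_shares)
```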

In [None]:
# Changing features to int values
numerical_features = [
    "Age", "Education Num", "fnlgwt", "Capital Gain", "Capital Loss", "Hours/Week"
]
for feature in numerical_features:
    train_data[feature] = train_data[feature].astype(int)
train_data.info()

#### Feature types were successfully changed to int

In [None]:
# Importing Seaborn for data visualisation
import seaborn as sns
# Correlations with numerical values in training and test sets
for feature in numerical_features:
    sns.FacetGrid(train_data, col='Target').map(plt.hist, feature, bins=20)
    sns.FacetGrid(test_data, col='Target').map(plt.hist, feature, bins=20)

#### Notes
+ Note 1: The Age distribution differs clearly between the two Target classes, so Age appears strongly related to the Target class
+ Note 2: People aged between 30 and 50 make up most of the higher-income class
+ Note 3: The fnlgwt, Capital Gain and Capital Loss features show no clear relationship with the Target class in these histograms, so we choose to **drop these columns**
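
The notes above are read off the histograms. One way to back them with numbers is to encode the target as 0/1 and compute its Pearson correlation with each numerical feature. The sketch below uses a small toy frame with invented values, so the resulting figures say nothing about the real data:

```python
import pandas as pd

# Toy frame (hypothetical values chosen only to illustrate the computation)
toy = pd.DataFrame({
    "Age": [22, 25, 38, 45, 52, 30],
    "Hours/Week": [20, 40, 45, 50, 40, 35],
    "Target": ["<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

# Encode the target as 1 for the higher-income class and 0 otherwise
is_high_income = (toy["Target"] == ">50K").astype(int)

# Pearson correlation of each numerical feature with the binary target
correlations = toy[["Age", "Hours/Week"]].corrwith(is_high_income)
print(correlations)
```

A value close to 0 would support dropping a feature, while a value far from 0 would argue for keeping it. Note that Pearson correlation only captures linear relationships.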

In [None]:
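# Visualizing relationships between categorical features and the target class
# Correlation matrix of the numerical features (corr() cannot handle the text columns)
corr = train_data[numerical_features].corr()
# Target is the grouping variable, so it is not listed among the features to plot
categorical_values = [
    "Workclass",
    "Education",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Native Country",
]
# Counts of each category per target class in the training set
for feature in categorical_values:
    counts = pd.crosstab(train_data[feature], train_data["Target"])
    ax = counts.plot(kind='bar', title=feature, figsize=(15,10), legend=True, fontsize=12)
    ax.set_xlabel(feature, fontsize=12)
    ax.set_ylabel("Count", fontsize=12)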

###### Note: Target variable is imbalanced

In [None]:
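import matplotlib.pyplot as plt
import missingno as msno
# Assumption: `data` refers to the training set loaded above;
# it is not defined anywhere else in this notebook
data = train_data
# Bar chart of non-missing value counts per column
msno.bar(data)
plt.show()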

###### Very few values are missing in the data. Since the three columns with missing values are all categorical, and given their distribution plots, we can use mode imputation.

In [None]:
modes = data.mode().iloc[0]
data.fillna(modes, inplace=True)

# Verifying that no missing values remain
msno.bar(data)
plt.show()

In [None]:
# Collecting the object (text) columns; np.object was removed in recent NumPy versions
cols_to_encode = [col for col in data.columns if data[col].dtype == object]
cols_to_encode

In [None]:
data.groupby('Education').nunique()['Education Num']

###### Each Education level maps to exactly one Education Num value, so the two columns carry the same information; Education Num is the already-encoded version of Education
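
The check above only confirms one direction (one code per label). A stricter test also confirms one label per code. This is a minimal sketch on a toy frame with invented rows; the name `is_one_to_one` is introduced here for illustration:

```python
import pandas as pd

# Toy frame (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "Education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "Education Num": [13, 9, 13, 14],
})

# Number of distinct codes per label, and of distinct labels per code
codes_per_label = toy.groupby("Education")["Education Num"].nunique()
labels_per_code = toy.groupby("Education Num")["Education"].nunique()

# The mapping is one-to-one only if both directions yield exactly one value
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If both directions hold on the real data, dropping Education loses no information.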

In [None]:
data.drop('Education', axis = 1, inplace = True)
cols_to_encode.remove('Education')
data.head()