### Deadline: 12:00 08.11 (Tuesday) 
**Group size: 2-3 persons.** <br>
**Complete the assignments in this IPython notebook, save it, and send it to vasilev@uni-koblenz.de, florian.lemmerich@gesis.org, philipp.singer@gesis.org with the subject [ML-Assignment].**<br>
**Form groups on your own; list the names of all group members in the email.** 

### The following cell prepares the data. Put the 'adult.data' file in the same folder as this notebook and run the cell first.

In [1]:
import pandas as pd
import numpy as np
names = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation',
         'relationship','race','sex','capital-gain','capital-loss','hours-per-week',
         'native-country','income']
df = pd.read_csv('adult.data',names=names,index_col=False)
df = df[['age','workclass','sex','hours-per-week','education','capital-gain','capital-loss','income']]
df.replace(' ?',np.nan,inplace=True)

### The dataset consists of 7 features of a person and an income class that they belong to: '>50K' or '<=50K'.

In [2]:
df.shape

(32561, 8)

In [3]:
[x-1 for x in range(0,9)]

[-1, 0, 1, 2, 3, 4, 5, 6, 7]

In [4]:
import random
df.iloc[random.sample(range(0,32561),26048)].head()

Unnamed: 0,age,workclass,sex,hours-per-week,education,capital-gain,capital-loss,income
19188,27,Private,Male,40,11th,0,0,<=50K
28257,47,Local-gov,Female,40,Doctorate,0,0,<=50K
18162,41,Self-emp-not-inc,Male,25,HS-grad,0,0,<=50K
17818,39,Private,Female,35,Assoc-acdm,0,0,<=50K
32424,35,Private,Male,45,Bachelors,0,0,<=50K


## Some data analysis

### Which workclass is paid better?

In [5]:
df[df.income == " <=50K"].workclass.value_counts(normalize=True)

 Private             0.768494
 Self-emp-not-inc    0.078743
 Local-gov           0.063965
 State-gov           0.040953
 Federal-gov         0.025525
 Self-emp-inc        0.021408
 Without-pay         0.000607
 Never-worked        0.000303
Name: workclass, dtype: float64

In [6]:
df[df.income == " >50K"].workclass.value_counts(normalize=True)

 Private             0.648758
 Self-emp-not-inc    0.094641
 Self-emp-inc        0.081307
 Local-gov           0.080654
 Federal-gov         0.048497
 State-gov           0.046144
Name: workclass, dtype: float64

### Are men paid better?

In [7]:
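male = df[df.sex == " Male"].income.value_counts()
print(male)
# Positional access via .iloc: the index holds income labels, not integers
total = male.iloc[0] + male.iloc[1]
print(male.iloc[0] / total)
print(male.iloc[1] / total)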

 <=50K    15128
 >50K      6662
Name: income, dtype: int64
0.694263423589
0.305736576411


In [8]:
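fem = df[df.sex == " Female"].income.value_counts()
print(fem)
# Positional access via .iloc: the index holds income labels, not integers
total = fem.iloc[0] + fem.iloc[1]
print(fem.iloc[0] / total)
print(fem.iloc[1] / total)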

 <=50K    9592
 >50K     1179
Name: income, dtype: int64
0.890539411382
0.109460588618
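The two cells above compute the income shares for each sex by hand. As a minimal sketch, the same comparison can be produced in one step with `pd.crosstab` and row normalisation. The `toy` frame below is made-up illustration data, not the Adult dataset; the labels keep the leading space that the notebook's values carry.

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration).
toy = pd.DataFrame({
    "sex": [" Male", " Male", " Male", " Female", " Female", " Male", " Female", " Female"],
    "income": [" >50K", " <=50K", " <=50K", " <=50K", " <=50K", " >50K", " >50K", " <=50K"],
})

# Row-normalised contingency table: each row (one sex) sums to 1.
income_share = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(income_share)
```

On the real data, passing `df["sex"]` and `df["income"]` should give the same fractions as the manual division above.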


In [9]:
df.mean(numeric_only=True)  # average only the numeric columns

age                 38.581647
hours-per-week      40.437456
capital-gain      1077.648844
capital-loss        87.303830
dtype: float64

## 1) Apply the k-nearest neighbors algorithm with two values of k of your choice to the given dataset.

### 1.1) Preprocessing: the dataset contains missing values and categorical variables. You need to handle them before applying an algorithm to the data.

In [10]:
# Number of null values in the workclass column
df.workclass.isnull().sum()

1836

In [11]:
# Replace all null values with the most common category: Private
df.workclass = df.workclass.fillna(" Private")
# df.dropna(axis=0, how="any", inplace=True)  # Dropping nulls gave worse results!

In [12]:
# Check whether any null values are left
df.isnull().sum()

age               0
workclass         0
sex               0
hours-per-week    0
education         0
capital-gain      0
capital-loss      0
income            0
dtype: int64
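The imputation above hard-codes `" Private"` as the fill value. A hedged alternative is to look the most frequent category up with `Series.mode()`, so the fill value follows the data. The `toy` frame is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries (made up for illustration).
toy = pd.DataFrame({"workclass": [" Private", np.nan, " State-gov", " Private", np.nan]})

# mode() ignores NaN by default and returns a Series (ties are possible),
# so take the first entry as the fill value.
most_common = toy["workclass"].mode().iloc[0]
filled = toy["workclass"].fillna(most_common)

print(repr(most_common), filled.isnull().sum())
```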

In [13]:
# Handling categorical values with pandas get_dummies
pd.get_dummies(df.workclass).head()

Unnamed: 0,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Without-pay
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [14]:
df.head()

Unnamed: 0,age,workclass,sex,hours-per-week,education,capital-gain,capital-loss,income
0,39,State-gov,Male,40,Bachelors,2174,0,<=50K
1,50,Self-emp-not-inc,Male,13,Bachelors,0,0,<=50K
2,38,Private,Male,40,HS-grad,0,0,<=50K
3,53,Private,Male,40,11th,0,0,<=50K
4,28,Private,Female,40,Bachelors,0,0,<=50K


In [15]:
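from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
cat_columns = ['workclass', 'sex', 'education']
# Integer-coded copy of the categorical columns; not used below,
# the one-hot encoding in the next cell is used instead
encoded_df = df[cat_columns].apply(le.fit_transform)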

In [16]:
cat_columns = ['workclass', 'sex', 'education']
for cat_col in cat_columns:
    df = df.join(pd.get_dummies(df[cat_col]))

In [17]:
df = df.drop(labels=cat_columns, axis=1)
df.head()

Unnamed: 0,age,hours-per-week,capital-gain,capital-loss,income,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,...,9th,Assoc-acdm,Assoc-voc,Bachelors,Doctorate,HS-grad,Masters,Preschool,Prof-school,Some-college
0,39,40,2174,0,<=50K,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,50,13,0,0,<=50K,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,38,40,0,0,<=50K,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,53,40,0,0,<=50K,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,40,0,0,<=50K,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
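The join-then-drop loop above can also be written as a single call: `pd.get_dummies` accepts a `columns` argument, encodes those columns, and removes the originals. This is a sketch on made-up data; note that in this form the new column names are prefixed with the source column name, which differs from the unprefixed names produced above.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (made up for illustration).
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " Self-emp-not-inc", " Private"],
    "sex": [" Male", " Male", " Female"],
})

# One-hot encode the listed columns; the original columns are dropped automatically.
encoded = pd.get_dummies(toy, columns=["workclass", "sex"])
print(encoded.columns.tolist())
```

The prefix avoids name clashes if two categorical columns ever share a category label.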


### 1.2) Divide the dataset into two parts: training and test. The training subset should contain 80% of the whole dataset, and the target classes should be balanced in both subsets.

In [27]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
# X - features, y - labels, 1 - train, 2 - test
X1, X2, y1, y2 = train_test_split(df.drop("income", axis=1), df["income"], test_size=0.2, random_state=1, stratify=df['income'])
print(X1.shape)
print(y1.shape)

(26048, 30)
(26048,)
<class 'pandas.core.frame.DataFrame'>


In [19]:
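# Experiment: oversample the minority class by duplicating its training rows
x1_y1 = X1.join(y1)
x1_y1 = pd.concat([x1_y1, x1_y1[x1_y1.income == " >50K"]], ignore_index=True)
x1_y1.shape
# Left disabled, so the models below are trained on the original split:
# X1 = x1_y1.drop('income', axis=1)
# y1 = x1_y1['income']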

(32321, 31)

In [20]:
# x1_y1.shape

### Checking the distribution of the classes before and after the split

In [21]:
df["income"].value_counts(normalize=True)

 <=50K    0.75919
 >50K     0.24081
Name: income, dtype: float64

In [22]:
y1.value_counts(normalize=True)

 <=50K    0.759175
 >50K     0.240825
Name: income, dtype: float64

In [23]:
y2.value_counts(normalize=True)

 <=50K    0.759251
 >50K     0.240749
Name: income, dtype: float64
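The three distributions above match because the split was stratified on the label. A minimal sketch of that effect on synthetic labels, assuming a current scikit-learn where `train_test_split` is imported from `sklearn.model_selection`; the arrays `X` and `y` are made up and only mimic the roughly 76/24 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 76/24 class ratio (made up for illustration).
y = np.array([0] * 760 + [1] * 240)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks for the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Share of the minority class in each subset.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, the two shares would still be close on average but could drift apart from split to split.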

### 1.3) Apply the k-nearest neighbors algorithm with two different k parameters of your choice:

In [24]:
from sklearn.neighbors import KNeighborsClassifier
knn1 = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn2 = KNeighborsClassifier(n_neighbors=29, weights='distance')

In [25]:
knn1.fit(X1, y1)
knn2.fit(X1, y1)
res1 = knn1.predict(X2)
res2 = knn2.predict(X2)
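Both classifiers above use `weights='distance'`, which weights each neighbor's vote by the inverse of its distance instead of counting neighbors equally. The sketch below illustrates that voting rule in plain NumPy for a single query point. The function `weighted_knn_predict` and the toy data are hypothetical illustrations, not scikit-learn's implementation, and the sketch assumes the query does not coincide with a training point.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k):
    """Predict one label by an inverse-distance-weighted vote among the k nearest points."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Assumption: no zero distances, so 1/d is well defined.
    weights = 1.0 / distances[nearest]
    votes = {}
    for label, weight in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Toy one-dimensional data (made up for illustration).
X_train = np.array([[0.0], [1.0], [4.0], [5.0], [6.0]])
y_train = np.array(["low", "low", "high", "high", "high"])
query = np.array([2.0])

print(weighted_knn_predict(X_train, y_train, query, k=5))
```

With k=5 an unweighted majority vote would pick "high" (three neighbors against two), but the two "low" points are closer, so their combined weight (1/2 + 1) exceeds that of the "high" points (1/2 + 1/3 + 1/4) and the weighted vote returns "low".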

## 2) Evaluate and compare the performance of the two models:

### 2.1) Print performance metrics of your models:

In [26]:
from sklearn.metrics import classification_report, accuracy_score
rep1 = classification_report(y2,res1)
rep2 = classification_report(y2,res2)
acc1 = accuracy_score(y2,res1)
acc2 = accuracy_score(y2,res2)
print(rep1)
print(rep2)
print(acc1)
print(acc2)

             precision    recall  f1-score   support

      <=50K       0.86      0.92      0.88      4945
       >50K       0.66      0.51      0.58      1568

avg / total       0.81      0.82      0.81      6513

             precision    recall  f1-score   support

      <=50K       0.85      0.93      0.89      4945
       >50K       0.70      0.48      0.57      1568

avg / total       0.82      0.83      0.81      6513

0.81882389068
0.826193766314
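The per-class precision and recall in the reports above follow from counts of true positives, false positives and false negatives. As a small worked sketch, the helper `precision_recall` below (a hypothetical name, written for illustration) computes both for one class from two label lists. The labels are made up and are not the model's predictions; the helper assumes the class is both present and predicted at least once.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # precision: share of predicted positives that are correct
    # recall: share of actual positives that were found
    return tp / (tp + fp), tp / (tp + fn)

# Made-up labels: 4 actual ">50K" cases and 6 actual "<=50K" cases.
y_true = [">50K"] * 4 + ["<=50K"] * 6
y_pred = [">50K", ">50K", "<=50K", "<=50K",
          "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"]

print(precision_recall(y_true, y_pred, ">50K"))
print(precision_recall(y_true, y_pred, "<=50K"))
```

In this toy case the minority class has the lower recall (2 of 4 found) while the majority class has the higher one (5 of 6 found), the same kind of gap that the support column helps to interpret in the reports above.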


### 2.2) In a few sentences, argue which k performed better, based on the performance metrics from the previous task.

### 2.3) Which class was classified better? Why do you think this is the case?