# K nearest neighbors

KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h: X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

In this module we will explore the inner workings of KNN, choose the optimal value of K, and use the KNN implementation from scikit-learn.
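
To make the inner workings concrete, here is a minimal from-scratch sketch of KNN classification, assuming Euclidean distance and an unweighted majority vote. The helper name `knn_predict` and the toy data are hypothetical and only serve as illustration; the rest of the notebook uses scikit-learn instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data: two small clusters (values are made up for illustration)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_high = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_low, pred_high)
```

There is no training step beyond storing the data: all the work happens at prediction time, which is why KNN predictions get slower as the training set grows.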

## Overview

1. Read the problem statement.

2. Get the dataset.

3. Explore the dataset.

4. Pre-processing of dataset.

5. Visualization

6. Transform the dataset for building machine learning model.

7. Split data into train, test set.

8. Build Model.

9. Apply the model.

10. Evaluate the model.

11. Finding Optimal K value

12. Repeat 7, 8, 9 steps.

### Dataset

adult dataset https://www.kaggle.com/wenruliu/adult-income-dataset

## Load data

Import the data and print a random sample of 3 rows.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
adult_df=pd.read_csv('../input/adult.csv')
adult_df.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
43854,50,Local-gov,164127,HS-grad,9,Never-married,Other-service,Not-in-family,Black,Female,0,0,40,United-States,<=50K
10167,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
47304,23,Private,234302,HS-grad,9,Never-married,Handlers-cleaners,Not-in-family,Black,Male,0,0,40,United-States,<=50K


## Data Pre-processing

### Question 2 - Estimating missing values

It's not good to always remove the records that have missing values, because we may end up losing data points. Instead, we will replace the missing values with estimated values (the median for numeric columns, the most frequent value for categorical columns).

Calculate the number of missing values per column
- don't use loops

In [3]:
adult_df.isna().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
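
`isna()` reports zero missing values here because this dataset marks missing entries with the placeholder `'?'`, as the `workclass` counts further down show. A minimal sketch of counting that placeholder per column without a loop, using a small hypothetical frame `toy` in place of `adult_df`:

```python
import pandas as pd

# toy frame standing in for adult_df (values are made up for illustration)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
})

# isna() does not see the '?' placeholder
na_counts = toy.isna().sum()

# vectorised count of '?' per column, no explicit loop
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Another option, if the file can be re-read, is to pass `na_values='?'` to `pd.read_csv` so that the placeholder becomes a real `NaN` from the start.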

In [4]:
adult_df.shape

(48842, 15)

In [5]:
adult_df['workclass'].value_counts()

Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

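Fill the missing values with a representative value of that particular column. The affected columns here are categorical, so the most frequent value is used instead of the median.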

In [6]:
adult_df['workclass'].value_counts()

Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

In [7]:
# replace the missing values '?' with the top value of the column present
adult_df['workclass']=adult_df['workclass'].replace('?','Private')
adult_df['native-country']=adult_df['native-country'].replace('?','United-States')
adult_df['workclass'].value_counts()

Private             36705
Self-emp-not-inc     3862
Local-gov            3136
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

In [8]:
# the occupation column also has missing values ('?'); look at its values for the Private workclass
adult_df[adult_df['workclass']=='Private']['occupation'].value_counts()

Craft-repair         4748
Sales                4439
Adm-clerical         4208
Other-service        4057
Exec-managerial      3995
Prof-specialty       3409
Machine-op-inspct    2882
?                    2799
Handlers-cleaners    1923
Transport-moving     1880
Tech-support         1154
Farming-fishing       670
Protective-serv       299
Priv-house-serv       242
Name: occupation, dtype: int64

The top 3 occupations have counts in the same range. If we replaced every missing value with the single most frequent occupation, that one category would gain more than 2.8k rows and become over-represented. The preferred approach is to spread the missing rows across the top 3 occupations so that the split stays even.

In [9]:
import random

# get the top 3 occupations (within the Private workclass) as an Index
occupation_top3_df=adult_df[adult_df['workclass']=='Private']['occupation'].value_counts().head(3).index
# replace all the '?' with NaN for easier processing
adult_df['occupation']=adult_df['occupation'].replace({'?':np.nan})
# boolean mask of the rows whose occupation is missing
nans = adult_df['occupation'].isna()
# key logic starts here:
# draw one of the top 3 occupations for every missing row, with equal weights
replacement=random.choices(occupation_top3_df,weights=[.333, .333, .333], k=adult_df['occupation'].isnull().sum())
# write the random draws back into the missing rows
adult_df.loc[nans,'occupation'] = replacement

In [10]:
#After replacements
adult_df['occupation'].value_counts()

Craft-repair         7020
Adm-clerical         6563
Sales                6453
Prof-specialty       6172
Exec-managerial      6086
Other-service        4923
Machine-op-inspct    3022
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64
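
The random fill above is not seeded, so the exact counts change from run to run. A minimal sketch of the same idea with a seeded NumPy generator, on a small hypothetical Series `occ` rather than the real column (the seed value is an arbitrary assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility

# toy column standing in for adult_df['occupation'] (values are made up)
occ = pd.Series(["Sales", "?", "Craft-repair", "?", "Adm-clerical",
                 "Sales", "?", "Craft-repair"])

# top 3 categories among the known values
top3 = occ[occ != "?"].value_counts().head(3).index.to_numpy()

# boolean mask of the rows to fill
missing = occ == "?"

# draw one of the top 3 categories uniformly at random for each missing row
occ_filled = occ.copy()
occ_filled[missing] = rng.choice(top3, size=int(missing.sum()))
print(occ_filled.value_counts())
```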

### Question 3 - Dealing with categorical data

In [11]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          48842 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         48842 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     48842 non-null object
income             48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [12]:
#for all columns with minimal different values mark them as category
# adult_df['income']=adult_df['income'].astype('category')
# adult_df['gender']=adult_df['gender'].astype('category')
# adult_df['race']=adult_df['race'].astype('category')
# adult_df['relationship']=adult_df['relationship'].astype('category')
# adult_df['marital-status']=adult_df['marital-status'].astype('category')
# UNABLE TO FIND THE CORRELATION SO WE WILL USE PD_DUMMIES ONLY

In [13]:
#make label encoding for income
# Import label encoder 
from sklearn import preprocessing 
  
label_encoder = preprocessing.LabelEncoder() 
  
adult_df['income']= label_encoder.fit_transform(adult_df['income']) 
  
adult_df['income'].value_counts() 

0    37155
1    11687
Name: income, dtype: int64

In [14]:
y = pd.DataFrame(adult_df['income'],columns=['income'])
X = adult_df.drop(['income'],axis=1,inplace=False)
X_enc = pd.get_dummies(X)
# X_enc.head(3)
adult_enc = pd.concat([X_enc,y],axis=1)
adult_enc.head(3)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,income
0,25,226802,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
1,38,89814,9,0,0,50,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,28,336951,12,0,0,40,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1


### Question 4

Observe the association of each independent variable with target variable and drop variables from feature set having correlation in range -0.1 to 0.1 with target variable.

Hint: use **corr()**

In [15]:
corr_df=(adult_enc.corr()['income'] < .1) & (adult_enc.corr()['income'] > -.1)

corr_df
for x in list(corr_df.index):
    if corr_df[x]==True:
        adult_enc.drop(x,axis=1,inplace=True)
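
The same filter can be written without a loop and with a single call to `corr()`. A minimal sketch on a small hypothetical frame `toy`, assuming the target column is named `income` as above:

```python
import pandas as pd

# toy encoded frame (values are made up); 'income' is the target
toy = pd.DataFrame({
    "strong": [0, 0, 0, 1, 1, 1, 1, 1],
    "weak":   [0, 1, 0, 1, 0, 1, 0, 1],
    "income": [0, 0, 0, 0, 1, 1, 1, 1],
})

# correlation of every feature with the target, computed once
corr_with_target = toy.corr()["income"].drop("income")

# features whose correlation lies strictly between -0.1 and 0.1
weak_cols = corr_with_target[corr_with_target.abs() < 0.1].index

reduced = toy.drop(columns=weak_cols)
print(list(reduced.columns))
```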

In [16]:
adult_enc.head(5)

Unnamed: 0,age,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Private,workclass_Self-emp-inc,education_Bachelors,education_Doctorate,education_HS-grad,...,occupation_Other-service,occupation_Prof-specialty,relationship_Husband,relationship_Not-in-family,relationship_Own-child,relationship_Unmarried,relationship_Wife,gender_Female,gender_Male,income
0,25,7,0,0,40,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,38,9,0,0,50,1,0,0,0,1,...,0,0,1,0,0,0,0,0,1,0
2,28,12,0,0,40,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,1
3,44,10,7688,0,40,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1,1
4,18,10,0,0,30,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


The weakly correlated columns (for example `fnlwgt`) were dropped by the loop above, so no remaining column has a correlation with `income` in the range -0.1 to 0.1.

### Question 5

Observe the variance of each independent variable and drop the variables that have no variance or almost zero variance (variance < 0.1). They will have almost no influence on the classification.

Hint: use **var()**

In [17]:
var_df=adult_enc.var()<0.1
# iris_df.drop(x,axis=1,inplace=True)
# for x in list(var_df.index):
#     if var_df[x]==True:
#         iris_df.drop(x,axis=1,inplace=True)
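
A minimal sketch of the variance filter the question asks for, on a small hypothetical frame `toy`. For a 0/1 dummy column with share p of ones, the variance is roughly p(1 - p), so the 0.1 threshold flags any category held by fewer than about 11% of the rows (or more than about 89%).

```python
import pandas as pd

# toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "age":       [25, 38, 28, 44, 18, 51, 33, 47, 29, 60, 36, 41],
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "constant":  [1] * 12,
})

# sample variance of every column
variances = toy.var()

# columns with no variance or almost zero variance
low_var_cols = variances[variances < 0.1].index

reduced = toy.drop(columns=low_var_cols)
print(list(reduced.columns))
```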

In [18]:
var_df

age                                  False
educational-num                      False
capital-gain                         False
capital-loss                         False
hours-per-week                       False
workclass_Private                    False
workclass_Self-emp-inc                True
education_Bachelors                  False
education_Doctorate                   True
education_HS-grad                    False
education_Masters                     True
education_Prof-school                 True
marital-status_Divorced              False
marital-status_Married-civ-spouse    False
marital-status_Never-married         False
occupation_Adm-clerical              False
occupation_Exec-managerial           False
occupation_Other-service              True
occupation_Prof-specialty            False
relationship_Husband                 False
relationship_Not-in-family           False
relationship_Own-child               False
relationship_Unmarried                True
relationshi

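Several one-hot columns (for example `workclass_Self-emp-inc` and `education_Doctorate`) do have a variance below 0.1, but the drop code above is commented out, so these columns are kept in the rest of this notebook.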

### Question 6
This step takes a long time to run, so it is skipped here.

Plot the scatter matrix for all the variables.

Hint: use **pandas.plotting.scatter_matrix()**

you can also use pairplot()

In [19]:
import seaborn as sns

In [20]:
# sns.pairplot(adult_enc)

## Scaling the variables

In [21]:
from scipy.stats import zscore
adult_enc_df_z = adult_enc.apply(zscore)  # convert all attributes to Z scale 

adult_enc_df_z.describe()

Unnamed: 0,age,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Private,workclass_Self-emp-inc,education_Bachelors,education_Doctorate,education_HS-grad,...,occupation_Other-service,occupation_Prof-specialty,relationship_Husband,relationship_Not-in-family,relationship_Own-child,relationship_Unmarried,relationship_Wife,gender_Female,gender_Male,income
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,...,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,1.584958e-16,1.594573e-17,2.294458e-16,7.617582e-17,9.071110000000001e-17,-3.241123e-15,-1.850714e-15,-1.028896e-15,1.585385e-15,-3.324714e-16,...,-7.220473e-17,1.055032e-15,-7.351221e-16,-6.152075e-16,-4.202286e-16,-6.455851e-16,-7.388273e-16,2.906829e-16,-2.757714e-16,-1.234516e-16
std,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,...,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001,1.00001
min,-1.578629,-3.53103,-0.1448035,-0.2171271,-3.181452,-1.739029,-0.1896085,-0.4434064,-0.1109567,-0.6909876,...,-0.3348025,-0.3803222,-0.8227521,-0.5890934,-0.4286407,-0.3423905,-0.2238687,-0.7042205,-1.42001,-0.560845
25%,-0.7763164,-0.4193353,-0.1448035,-0.2171271,-0.03408696,0.5750334,-0.1896085,-0.4434064,-0.1109567,-0.6909876,...,-0.3348025,-0.3803222,-0.8227521,-0.5890934,-0.4286407,-0.3423905,-0.2238687,-0.7042205,-1.42001,-0.560845
50%,-0.119879,-0.03037346,-0.1448035,-0.2171271,-0.03408696,0.5750334,-0.1896085,-0.4434064,-0.1109567,-0.6909876,...,-0.3348025,-0.3803222,-0.8227521,-0.5890934,-0.4286407,-0.3423905,-0.2238687,-0.7042205,0.7042205,-0.560845
75%,0.6824334,0.7475502,-0.1448035,-0.2171271,0.3694214,0.5750334,-0.1896085,-0.4434064,-0.1109567,1.447204,...,-0.3348025,-0.3803222,1.215433,1.697524,-0.4286407,-0.3423905,-0.2238687,1.42001,0.7042205,-0.560845
max,3.745808,2.303397,13.27438,10.59179,4.727312,0.5750334,5.274025,2.255267,9.012524,1.447204,...,2.986835,2.62935,1.215433,1.697524,2.332956,2.920641,4.466905,1.42001,0.7042205,1.783024
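
The `std` row above shows 1.00001 rather than exactly 1 because `zscore` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The ratio is sqrt(n / (n - 1)), which for n = 48842 is about 1.00001. A minimal sketch of this on a small hypothetical Series, computing the z-score by hand:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 38, 28, 44, 18], dtype=float)  # made-up values

# z-score using the population standard deviation (ddof=0),
# which is also the default of scipy.stats.zscore
z = (ages - ages.mean()) / ages.std(ddof=0)

pop_std = z.std(ddof=0)      # 1 up to floating-point rounding
sample_std = z.std(ddof=1)   # sqrt(n / (n - 1)), slightly above 1
print(pop_std, sample_std)
```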


## Split the dataset into training and test sets


### Question 7

Split the dataset into training and test sets with 80-20 ratio

Hint: use **train_test_split()**

In [22]:
from sklearn.model_selection import train_test_split
# split data into X and y
X=adult_enc_df_z.drop(columns='income')
# y=adult_enc_df_z["income"]
y = pd.DataFrame(adult_df['income'],columns=['income'])

In [23]:
# Break the data into training and test sets
# (note: test_size=0.3 gives a 70-30 split; the results below were produced with it)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
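
For the 80-20 ratio that the question asks for, `test_size` would be 0.2. A minimal sketch on hypothetical toy arrays; the `stratify` argument is an optional addition, not part of the original cell, that keeps the class proportions equal in both parts, which can help with an imbalanced target such as `income`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 rows, 25% positive class (values are made up)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# 80-20 split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=27, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```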

## Build Model

### Question 8

Build the model and train and test on training and test sets respectively using **scikit-learn**.

Print the accuracy of the model for **k = 5**.

Hint: For accuracy you can check **accuracy_score()** in scikit-learn

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Implementation one

In [25]:
#initialization
NNH = KNeighborsClassifier(n_neighbors= 5, weights = 'uniform', metric='euclidean')

In [26]:
# fit the model (y is passed as a 1-D array, which avoids a DataConversionWarning)
NNH = KNeighborsClassifier(n_neighbors= 5, weights = 'uniform', metric='euclidean')
NNH.fit(X_train, y_train.values.ravel())
y_train_pred = NNH.predict(X_train)
y_test_pred = NNH.predict(X_test)
accuracy_score_train = accuracy_score(y_train, y_train_pred)
accuracy_score_test = accuracy_score(y_test, y_test_pred)
print("train accuracy for k =5 is ", accuracy_score_train)
print("test accuracy for k =5 is ", accuracy_score_test)



train accuracy for k =5 is  0.878411184884027
test accuracy for k =5 is  0.8306148911485702


In [27]:
# get predictions for accuracy testing
y_train_pred = NNH.predict(X_train)
y_test_pred = NNH.predict(X_test)

In [28]:
# get accuracy scores
accuracy_score_train = accuracy_score(y_train, y_train_pred)
accuracy_score_test = accuracy_score(y_test, y_test_pred)

In [29]:
print("train accuracy for k =5 is ", accuracy_score_train)
print("test accuracy for k =5 is ", accuracy_score_test)

train accuracy for k =5 is  0.878411184884027
test accuracy for k =5 is  0.8306148911485702


## Find optimal value of K

### Question 9 - Finding Optimal value of k

- Run KNN with the number of neighbours set to 1, 3, 5 ... 19
- Find the **optimal number of neighbours** from the above list

In [30]:
k_range = range(1,30,2)  # odd numbers as asked; this run extends the range up to 29
scores={}
# scores=[]
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k, weights = 'uniform',n_jobs=5)
    knn.fit(X_train, y_train.values.ravel())  # 1-D target avoids a DataConversionWarning
    y_pred = knn.predict(X_test)
    scores[k]=(accuracy_score(y_test, y_pred))
#     scores.append(accuracy_score(y_test, y_pred))
scores


{1: 0.8024977820241589,
 3: 0.8219477240155599,
 5: 0.8306148911485702,
 7: 0.831433836074524,
 9: 0.8337541800313929,
 11: 0.8343001433153621,
 13: 0.835733296935781,
 15: 0.8364839964512386,
 17: 0.8344366341363544,
 19: 0.8359380331672696,
 21: 0.8362110148092541,
 23: 0.8354603152937965,
 25: 0.8365522418617348,
 27: 0.8368934689142156,
 29: 0.8375759230191769}
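
A short sketch of reading the best k off this dictionary; the values are copied from the output above. Because these are test-set accuracies, a k chosen this way is somewhat optimistic; picking k by cross-validation on the training set would be a more careful alternative, though that is not what this notebook does.

```python
# test accuracies per k, copied from the output above
scores = {
    1: 0.8024977820241589, 3: 0.8219477240155599, 5: 0.8306148911485702,
    7: 0.831433836074524, 9: 0.8337541800313929, 11: 0.8343001433153621,
    13: 0.835733296935781, 15: 0.8364839964512386, 17: 0.8344366341363544,
    19: 0.8359380331672696, 21: 0.8362110148092541, 23: 0.8354603152937965,
    25: 0.8365522418617348, 27: 0.8368934689142156, 29: 0.8375759230191769,
}

# k with the highest test accuracy
best_k = max(scores, key=scores.get)

# best k when restricted to the range the question asks for (1 to 19)
best_k_upto_19 = max((k for k in scores if k <= 19), key=scores.get)
print(best_k, best_k_upto_19)
```

The best value sits at the edge of the searched range, and the gain over k = 15 is below 0.2 percentage points, so the accuracy curve is nearly flat beyond k = 13.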

## Improve score

In [31]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          48842 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         48842 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     48842 non-null object
income             48842 non-null int32
dtypes: int32(1), int64(6), object(8)
memory usage: 5.4+ MB


In [32]:
# select only the needed columns
adult_features=adult_df[['age','workclass','educational-num','marital-status','occupation','race','gender','hours-per-week','native-country']]

In [33]:
adult_enc_df=pd.get_dummies(adult_features)
adult_fea_z_df=adult_enc_df.apply(zscore)
adult_fea_z_df.head(6)

Unnamed: 0,age,educational-num,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,-0.995129,-1.197259,-0.034087,-0.173795,-0.26194,-0.01431,0.575033,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705
1,-0.046942,-0.419335,0.77293,-0.173795,-0.26194,-0.01431,0.575033,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705
2,-0.776316,0.74755,-0.034087,-0.173795,3.817672,-0.01431,-1.739029,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705
3,0.390683,-0.030373,-0.034087,-0.173795,-0.26194,-0.01431,0.575033,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705
4,-1.505691,-0.030373,-0.841104,-0.173795,-0.26194,-0.01431,0.575033,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705
5,-0.338691,-1.586221,-0.841104,-0.173795,-0.26194,-0.01431,0.575033,-0.189609,-0.293019,-0.205606,...,-0.037063,-0.061494,-0.02074,-0.048581,-0.036505,-0.024791,-0.023518,0.304846,-0.041999,-0.021705


In [34]:
X= adult_fea_z_df
y= pd.DataFrame(adult_df['income'],columns=['income'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
y.head(5)

Unnamed: 0,income
0,0
1,0
2,1
3,1
4,0


In [35]:
# fit the model (y is passed as a 1-D array, which avoids a DataConversionWarning)
NNH = KNeighborsClassifier(n_neighbors= 5, weights = 'uniform', metric='euclidean',n_jobs=10)
NNH.fit(X_train, y_train.values.ravel())
y_train_pred = NNH.predict(X_train)
y_test_pred = NNH.predict(X_test)
accuracy_score_train = accuracy_score(y_train, y_train_pred)
accuracy_score_test = accuracy_score(y_test, y_test_pred)
print("train accuracy for k =5 is ", accuracy_score_train)
print("test accuracy for k =5 is ", accuracy_score_test)



train accuracy for k =5 is  0.8687882067331598
test accuracy for k =5 is  0.8121886303146113


## Using only part of the dataset, as the data covers too many kinds of people

In [36]:
df_exec=adult_df[adult_df['occupation']=='Exec-managerial']

In [37]:
df_exec.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,6086,,,,42.1985,12.032,17.0,33.0,41.0,50.0,90.0
workclass,6086,7.0,Private,3995.0,,,,,,,
fnlwgt,6086,,,,186125.0,104649.0,13769.0,115412.0,175247.0,231480.0,1490400.0
education,6086,16.0,Bachelors,2025.0,,,,,,,
educational-num,6086,,,,11.4497,2.16714,1.0,10.0,12.0,13.0,16.0
marital-status,6086,7.0,Married-civ-spouse,3600.0,,,,,,,
occupation,6086,1.0,Exec-managerial,6086.0,,,,,,,
relationship,6086,6.0,Husband,3231.0,,,,,,,
race,6086,5.0,White,5474.0,,,,,,,
gender,6086,2.0,Male,4338.0,,,,,,,


In [38]:
# Data overview
print ("Rows     : " ,df_exec.shape[0])
print ("Columns  : " ,df_exec.shape[1])
print ("\nFeatures : \n" ,df_exec.columns.tolist())
print ("\nMissing values :  ", df_exec.isnull().sum().values.sum())
print ("\nUnique values :  \n",df_exec.nunique())

Rows     :  6086
Columns  :  15

Features : 
 ['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

Missing values :   0

Unique values :  
 age                  71
workclass             7
fnlwgt             5180
education            16
educational-num      16
marital-status        7
occupation            1
relationship          6
race                  5
gender                2
capital-gain         76
capital-loss         58
hours-per-week       78
native-country       38
income                2
dtype: int64


In [39]:
# dropping fnlwgt and education
# 
# df_exec.drop(columns=['fnlwgt','education'],inplace=True)
df_exec

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
15,43,Private,346189,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,1
30,46,State-gov,106444,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,7688,0,38,United-States,1
34,26,Private,43311,HS-grad,9,Divorced,Exec-managerial,Unmarried,White,Female,0,0,40,United-States,0
49,56,Self-emp-inc,131916,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,0,50,United-States,0
54,38,Private,219446,9th,5,Married-spouse-absent,Exec-managerial,Not-in-family,White,Male,0,0,54,Mexico,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48800,46,Private,364548,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,48,United-States,1
48814,54,Private,337992,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,0,0,50,Japan,1
48817,34,Private,160216,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,55,United-States,1
48835,53,Private,321865,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,1


In [40]:
# income was label-encoded above (0 = '<=50K', 1 = '>50K'), so map the integer codes
df_exec = df_exec.copy()  # work on a copy instead of a slice of adult_df
df_exec['income'] = df_exec['income'].map({1:'Yes',0:'No'})

In [None]:

#Separating the incomes
bigger     = df_exec[df_exec["income"] == "Yes"]
not_bigger = df_exec[df_exec["income"] == "No"]
bigger

In [None]:
target_col = ["income"]
cat_cols   = df_exec.nunique()[df_exec.nunique() < 6].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col]
num_cols   = [x for x in df_exec.columns if x not in cat_cols + target_col ]

In [None]:
import matplotlib.pyplot as plt#visualization
from PIL import  Image
%matplotlib inline
import pandas as pd
import seaborn as sns#visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
import plotly.offline as py#visualization
py.init_notebook_mode(connected=True)#visualization
import plotly.graph_objs as go#visualization
import plotly.tools as tls#visualization
import plotly.figure_factory as ff#visualization

In [None]:
import plotly.graph_objs as go#visualization
py.init_notebook_mode(connected=True)#visualization
#different income classes in the dataset
#labels
lab = df_exec["income"].value_counts().keys().tolist()
#values
val = df_exec["income"].value_counts().values.tolist()

trace = go.Pie(labels = lab ,
               values = val ,
               marker = dict(colors =  [ 'royalblue' ,'lime'],
                             line = dict(color = "white",
                                         width =  1.3)
                            ),
               rotation = 90,
               hoverinfo = "label+value+text",
               hole = .5
              )
layout = go.Layout(dict(title = "Income classes in the data",
                        plot_bgcolor  = "rgb(243,243,243)",
                        paper_bgcolor = "rgb(243,243,243)",
                       )
                  )

data = [trace]
fig = go.Figure(data = data,layout = layout)
py.iplot(fig)

In [None]:
def plot_pie(column) :
    
    trace1 = go.Pie(values  = bigger[column].value_counts().values.tolist(),
                    labels  = bigger[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    domain  = dict(x = [0,.48]),
                    name    = "Income >50K",
                    marker  = dict(line = dict(width = 2,
                                               color = "rgb(243,243,243)")
                                  ),
                    hole    = .6
                   )
    trace2 = go.Pie(values  = not_bigger[column].value_counts().values.tolist(),
                    labels  = not_bigger[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    marker  = dict(line = dict(width = 2,
                                               color = "rgb(243,243,243)")
                                  ),
                    domain  = dict(x = [.52,1]),
                    hole    = .6,
                    name    = "Income <=50K" 
                   )


    layout = go.Layout(dict(title = column + " distribution by income class",
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            annotations = [dict(text = "Income >50K",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Income <=50K",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88,y = .5
                                               )
                                          ]
                           )
                      )
    data = [trace1,trace2]
    fig  = go.Figure(data = data,layout = layout)
    py.iplot(fig)


#function  for histogram by income class
def histogram(column) :
    trace1 = go.Histogram(x  = bigger[column],
                          histnorm= "percent",
                          name = "Income >50K",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    )
                                        ),
                         opacity = .9 
                         ) 
    
    trace2 = go.Histogram(x  = not_bigger[column],
                          histnorm = "percent",
                          name = "Income <=50K",
                          marker = dict(line = dict(width = .5,
                                              color = "black"
                                             )
                                 ),
                          opacity = .9
                         )
    
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " distribution by income class",
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = column,
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = "percent",
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                           )
                      )
    fig  = go.Figure(data=data,layout=layout)
    
    py.iplot(fig)
    
#function  for scatter plot matrix  for numerical columns in data
def scatter_matrix(df)  :
    
    df  = df.sort_values(by = "income" ,ascending = True)
    classes = df["income"].unique().tolist()
    
    # map every income class to an integer colour code
    class_code  = {cl : k for k, cl in enumerate(classes)}

    color_vals = [class_code[cl] for cl in df["income"]]

    pl_colorscale = "Portland"

    # hover text: the income label of each row
    text = df["income"].tolist()

    # numeric columns of the adult dataset
    trace = go.Splom(dimensions = [dict(label  = "age",
                                       values = df["age"]),
                                  dict(label  = 'educational-num',
                                       values = df['educational-num']),
                                  dict(label  = 'hours-per-week',
                                       values = df['hours-per-week'])],
                     text = text,
                     marker = dict(color = color_vals,
                                   colorscale = pl_colorscale,
                                   size = 3,
                                   showscale = False,
                                   line = dict(width = .1,
                                               color='rgb(230,230,230)'
                                              )
                                  )
                    )
    axis = dict(showline  = True,
                zeroline  = False,
                gridcolor = "#fff",
                ticklen   = 4
               )
    
    layout = go.Layout(dict(title  = 
                            "Scatter plot matrix for numerical columns by income class",
                            autosize = False,
                            height = 800,
                            width  = 800,
                            dragmode = "select",
                            hovermode = "closest",
                            plot_bgcolor  = 'rgba(240,240,240, 0.95)',
                            xaxis1 = dict(axis),
                            yaxis1 = dict(axis),
                            xaxis2 = dict(axis),
                            yaxis2 = dict(axis),
                            xaxis3 = dict(axis),
                            yaxis3 = dict(axis),
                           )
                      )
    data   = [trace]
    fig = go.Figure(data = data,layout = layout )
    py.iplot(fig)

#for all categorical columns plot pie
for i in cat_cols :
    plot_pie(i)

#for all numerical columns plot histogram
for i in num_cols :
    histogram(i)

#scatter plot matrix
scatter_matrix(df_exec)

In [None]:
# Capital loss, capital gain and country do not provide any vital information, so they can be dropped
df_exec.info()

In [None]:
tel_df = df_exec.copy()
#Drop hours-per-week column

#df_exec = df_exec.drop(columns = "hours-per-week_group",axis = 1)

trace1 = go.Scatter3d(x = bigger["age"],
                      y = bigger["educational-num"],
                      z = bigger["hours-per-week"],
                      mode = "markers",
                      name = "Income >50K",
#                       text = ,
                      marker = dict(size = 1,color = "red")
                     )
trace2 = go.Scatter3d(x = not_bigger["age"],
                      y = not_bigger["educational-num"],
                      z = not_bigger["hours-per-week"],
                      name = "Income <=50K",
#                       text = ,
                      mode = "markers",
                      marker = dict(size = 1,color= "green")
                     )



layout = go.Layout(dict(title = "Age, educational-num & hours-per-week by income class",
                        scene = dict(camera = dict(up=dict(x= 0 , y=0, z=0),
                                                   center=dict(x=0, y=0, z=0),
                                                   eye=dict(x=1.25, y=1.25, z=1.25)),
                                     xaxis  = dict(title = "age",
                                                   gridcolor='rgb(255, 255, 255)',
                                                   zerolinecolor='rgb(255, 255, 255)',
                                                   showbackground=True,
                                                   backgroundcolor='rgb(230, 230,230)'),
                                     yaxis  = dict(title = "educational-num",
                                                   gridcolor='rgb(255, 255, 255)',
                                                   zerolinecolor='rgb(255, 255, 255)',
                                                   showbackground=True,
                                                   backgroundcolor='rgb(230, 230,230)'
                                                  ),
                                     zaxis  = dict(title = "hours-per-week",
                                                   gridcolor='rgb(255, 255, 255)',
                                                   zerolinecolor='rgb(255, 255, 255)',
                                                   showbackground=True,
                                                   backgroundcolor='rgb(230, 230,230)'
                                                  )
                                    ),
                        height = 700,
                       )
                  )
                  

data = [trace1,trace2]
fig  = go.Figure(data = data,layout = layout)
py.iplot(fig)