# Classification using TensorFlow

We have some California census data, and we will use various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the individual has (government, military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census takers believe that observation represents (sample weight). This variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the individual.</td>
</tr>
<tr>
<td>income</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K": whether the person makes more than \$50,000 annually. Stored as the income_bracket column in census_data.csv.</td>
</tr>
</tbody>
</table>

### The Data

**Read in the census_data.csv data with pandas.**

In [147]:
import pandas as pd

In [148]:
census_data = pd.read_csv('census_data.csv')

In [149]:
census_data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**TensorFlow won't be able to understand strings as labels, so you'll need to use the pandas .apply() method to apply a custom function that converts them to 0s and 1s.**

**Convert the label column to 0s and 1s instead of strings.**

In [150]:
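def convertor(income):
    # Labels are matched with their leading space (' <=50K');
    # every other value, including ' >50K', maps to 1.
    if income == ' <=50K':
        return 0
    else:
        return 1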

In [151]:
census_data['income_bracket'] = census_data['income_bracket'].apply(convertor)

In [152]:
census_data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
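
The conversion above sends every value other than `' <=50K'` to 1, including typos or missing values. A minimal sketch of a stricter mapping, assuming the only valid labels are `<=50K` and `>50K` once surrounding whitespace is stripped (the helper name `to_binary_label` is hypothetical):

```python
def to_binary_label(label):
    """Map an income label to 0 (<=50K) or 1 (>50K), ignoring surrounding whitespace."""
    mapping = {'<=50K': 0, '>50K': 1}
    cleaned = str(label).strip()
    if cleaned not in mapping:
        # Fail loudly instead of silently treating unknown values as class 1.
        raise ValueError(f'Unexpected income label: {label!r}')
    return mapping[cleaned]


labels = [' <=50K', ' >50K', '<=50K']
print([to_binary_label(x) for x in labels])
```

With pandas, this could be applied in the same way as before: `census_data['income_bracket'].apply(to_binary_label)`.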


### Perform a Train Test Split on the Data

In [153]:
from sklearn.model_selection import train_test_split

In [154]:
X = census_data.drop('income_bracket',axis=1)
y = census_data['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
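
To make the 70/30 split concrete, here is a minimal pure-Python sketch of a seeded, shuffled holdout split. It illustrates the idea only: it is not scikit-learn's implementation and will not reproduce the exact rows that `train_test_split` selects. The helper name `holdout_split` is hypothetical.

```python
import math
import random


def holdout_split(rows, test_size=0.30, seed=101):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle on every run
    n_test = math.ceil(test_size * len(rows))  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train, test = holdout_split(rows)
print(len(train), len(test))
```

Fixing the seed, as `random_state=101` does above, is what makes the split reproducible between runs.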

### Create the Feature Columns for tf.estimator

**Take note of categorical vs continuous values!**

In [155]:
X_train.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

**Import TensorFlow**

In [156]:
import tensorflow as tf

**Create the tf.feature_column objects for the categorical values. Use vocabulary lists or hash buckets.**

In [157]:
X_train.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country
20895,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,28,United-States
3384,47,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Wife,Black,Female,15024,0,40,United-States
1832,46,Local-gov,Some-college,10,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,24,United-States
18919,46,State-gov,Some-college,10,Divorced,Adm-clerical,Unmarried,White,Female,0,0,48,United-States
31685,60,Private,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States


In [158]:
workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass',10)
education = tf.feature_column.categorical_column_with_hash_bucket('education',20)
marital_status =  tf.feature_column.categorical_column_with_hash_bucket('marital_status',7)
occupation =  tf.feature_column.categorical_column_with_hash_bucket('occupation',20)
relationship =  tf.feature_column.categorical_column_with_hash_bucket('relationship',10)
race =  tf.feature_column.categorical_column_with_hash_bucket('race',5)
gender =  tf.feature_column.categorical_column_with_hash_bucket('gender',2)
native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country',50)
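
Hash buckets assign each string to one of N buckets, so distinct categories can collide whenever N is small relative to the number of categories. The sketch below illustrates this with MD5 from the standard library. TensorFlow uses its own hash function, so the actual bucket ids will differ; the helper name `bucket_of` is hypothetical.

```python
import hashlib
from collections import defaultdict


def bucket_of(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets) with a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


# The six relationship values listed in the data description.
relationships = ['Wife', 'Own-child', 'Husband',
                 'Not-in-family', 'Other-relative', 'Unmarried']

buckets = defaultdict(list)
for name in relationships:
    buckets[bucket_of(name, 5)].append(name)

# Six values in five buckets: at least one bucket must hold two categories.
collisions = {b: names for b, names in buckets.items() if len(names) > 1}
print(collisions)
```

When the category values are known up front, as with gender or relationship here, a vocabulary list avoids collisions entirely. With hash buckets, choosing noticeably more buckets than categories makes collisions less likely.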

In [159]:
workclass = tf.feature_column.embedding_column(workclass,dimension = 10)
education = tf.feature_column.embedding_column(education,dimension = 20)
marital_status = tf.feature_column.embedding_column(marital_status,dimension = 7)
occupation = tf.feature_column.embedding_column(occupation,dimension = 20)
relationship = tf.feature_column.embedding_column(relationship,dimension = 10)
race = tf.feature_column.embedding_column(race,dimension = 5)
gender = tf.feature_column.embedding_column(gender,dimension = 2)
native_country = tf.feature_column.embedding_column(native_country,dimension = 50)
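
An embedding column turns each bucket id into a dense vector of length `dimension` whose values are learned during training. Conceptually it is a table lookup, sketched below in plain Python with random initial values. The real column is backed by a trainable TensorFlow variable, and the function names here are hypothetical.

```python
import random


def make_embedding_table(num_buckets, dimension, seed=0):
    """Create a num_buckets x dimension table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(num_buckets)]


def embed(table, bucket_id):
    """Look up the dense vector for one bucket id."""
    return table[bucket_id]


# Mirrors race above: 5 hash buckets, embedding dimension 5.
table = make_embedding_table(num_buckets=5, dimension=5)
vector = embed(table, bucket_id=3)
print(len(table), len(vector))  # 5 5
```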

**Create the continuous feature_columns for the continuous values using numeric_column.**

In [160]:

age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

**Put all these variables into a single list with the variable name feat_cols.**

In [161]:
feat_cols = [age, workclass, education, education_num, marital_status,
       occupation, relationship, race, gender, capital_gain,
       capital_loss, hours_per_week, native_country]

### Create Input Function

**batch_size is up to you. Set shuffle=True.**

In [162]:
input_function = tf.estimator.inputs.pandas_input_fn(X_train,y_train,batch_size = 10,num_epochs = 1000,shuffle=True)

#### Create your model with tf.estimator

**Create a DNNClassifier. For this, the string-valued categorical features must be wrapped in embedding columns, as done above.**

In [163]:
model = tf.estimator.DNNClassifier([20,20,20],feat_cols,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Surbhi\\AppData\\Local\\Temp\\tmp0stj2tkz', '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': 1, '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_save_checkpoints_steps': None}


**Train your model on the data for at least 5000 steps.**

In [164]:
model.train(input_fn=input_function,steps=25000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\Surbhi\AppData\Local\Temp\tmp0stj2tkz\model.ckpt.
INFO:tensorflow:loss = 34.7322, step = 1
INFO:tensorflow:global_step/sec: 181.955
INFO:tensorflow:loss = 1.35862, step = 101 (0.550 sec)
INFO:tensorflow:global_step/sec: 235.685
INFO:tensorflow:loss = 3.51996, step = 201 (0.424 sec)
INFO:tensorflow:global_step/sec: 238.483
INFO:tensorflow:loss = 5.70743, step = 301 (0.419 sec)
INFO:tensorflow:global_step/sec: 225.877
INFO:tensorflow:loss = 3.61647, step = 401 (0.451 sec)
INFO:tensorflow:global_step/sec: 231.332
INFO:tensorflow:loss = 2.49765, step = 501 (0.441 sec)
INFO:tensorflow:global_step/sec: 230.303
INFO:tensorflow:loss = 3.96214, step = 601 (0.427 sec)
INFO:tensorflow:global_step/sec: 232.533
INFO:tensorflow:loss = 3.34283, step = 701 (0.421 sec)
INFO:tensorflow:global_step/sec: 233.244
INFO:tensorflow:loss = 1.3932, step = 801 (0.446 sec)
INFO:tensorflow:global_step/sec: 230.224
...

INFO:tensorflow:global_step/sec: 282.045
INFO:tensorflow:loss = 1.78724, step = 8401 (0.364 sec)
INFO:tensorflow:global_step/sec: 251.722
INFO:tensorflow:loss = 2.45861, step = 8501 (0.382 sec)
INFO:tensorflow:global_step/sec: 251.677
INFO:tensorflow:loss = 2.57327, step = 8601 (0.406 sec)
INFO:tensorflow:global_step/sec: 244.663
INFO:tensorflow:loss = 3.97022, step = 8701 (0.400 sec)
INFO:tensorflow:global_step/sec: 241.881
INFO:tensorflow:loss = 2.0077, step = 8801 (0.410 sec)
INFO:tensorflow:global_step/sec: 248.108
INFO:tensorflow:loss = 3.53512, step = 8901 (0.409 sec)
INFO:tensorflow:global_step/sec: 267.908
INFO:tensorflow:loss = 2.38481, step = 9001 (0.378 sec)
INFO:tensorflow:global_step/sec: 250.585
INFO:tensorflow:loss = 4.19483, step = 9101 (0.388 sec)
INFO:tensorflow:global_step/sec: 260.919
INFO:tensorflow:loss = 1.82461, step = 9201 (0.387 sec)
INFO:tensorflow:global_step/sec: 232.252
INFO:tensorflow:loss = 2.88831, step = 9301 (0.437 sec)
...

INFO:tensorflow:global_step/sec: 292.048
INFO:tensorflow:loss = 4.06383, step = 16801 (0.344 sec)
INFO:tensorflow:global_step/sec: 285.771
INFO:tensorflow:loss = 4.35222, step = 16901 (0.346 sec)
INFO:tensorflow:global_step/sec: 317.625
INFO:tensorflow:loss = 9.133, step = 17001 (0.322 sec)
INFO:tensorflow:global_step/sec: 210.41
INFO:tensorflow:loss = 2.18911, step = 17101 (0.475 sec)
INFO:tensorflow:global_step/sec: 176.259
INFO:tensorflow:loss = 3.25862, step = 17201 (0.579 sec)
INFO:tensorflow:global_step/sec: 175.301
INFO:tensorflow:loss = 4.80247, step = 17301 (0.556 sec)
INFO:tensorflow:global_step/sec: 244.824
INFO:tensorflow:loss = 4.71873, step = 17401 (0.410 sec)
INFO:tensorflow:global_step/sec: 278.977
INFO:tensorflow:loss = 1.12797, step = 17501 (0.354 sec)
INFO:tensorflow:global_step/sec: 285.77
INFO:tensorflow:loss = 0.904111, step = 17601 (0.353 sec)
INFO:tensorflow:global_step/sec: 154.699
INFO:tensorflow:loss = 6.96764, step = 17701 (0.642 sec)
...

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x2aac8a97ba8>
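
How many passes over the training data do 25,000 steps amount to? Each step consumes one batch, so the answer depends on `batch_size` and the number of training rows. The sketch below assumes 22,792 training rows. That figure is an assumption, since the notebook never prints the size of `X_train`; substitute `len(X_train)` for an exact answer. Both helper names are hypothetical.

```python
def epochs_covered(steps, batch_size, n_train_rows):
    """Number of passes over the training set made by `steps` batches."""
    return steps * batch_size / n_train_rows


def max_steps(num_epochs, batch_size, n_train_rows):
    """Approximate number of steps the input function can supply before it runs out."""
    return num_epochs * n_train_rows // batch_size


N_TRAIN = 22792  # assumed training-set size, not shown in the notebook

print(round(epochs_covered(25000, 10, N_TRAIN), 2))
print(max_steps(1000, 10, N_TRAIN))
```

Under that assumption, training covers roughly 11 epochs, and `num_epochs=1000` is far from being the limiting factor: the `steps` argument is what stops training.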

### Evaluation

**Create a prediction input function. Pass only the X_test data and keep shuffle=False.**

In [165]:
test_input_func = tf.estimator.inputs.pandas_input_fn(X_test,batch_size =10,num_epochs =1,shuffle = False)

**Use model.predict() and pass in your input function. This produces a generator of predictions, which you can turn into a list with list().**

In [166]:
pred = list(model.predict(test_input_func))

INFO:tensorflow:Restoring parameters from C:\Users\Surbhi\AppData\Local\Temp\tmp0stj2tkz\model.ckpt-25000


**Each item in your list will look like this:**

In [167]:
pred[0]

{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.24739254], dtype=float32),
 'logits': array([-1.11256742], dtype=float32),
 'probabilities': array([ 0.75260746,  0.24739255], dtype=float32)}
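
For a binary classifier the fields in this dictionary are tied together: `logistic` is the sigmoid of `logits`, `probabilities` is `[1 - logistic, logistic]`, and `class_ids` is the index of the larger probability. A quick check in plain Python, using the values printed above:

```python
import math


def sigmoid(x):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


logit = -1.11256742          # 'logits' value from pred[0]
p_class_1 = sigmoid(logit)   # corresponds to 'logistic'
p_class_0 = 1.0 - p_class_1

print(round(p_class_1, 4), round(p_class_0, 4))  # 0.2474 0.7526

# The predicted class is 1 only when its probability exceeds 0.5.
class_id = int(p_class_1 > 0.5)
print(class_id)  # 0
```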

**Create a list of only the class_ids values from the list of prediction dictionaries. These are the predictions to compare against the real y_test values.**

In [168]:
prediction = []
for each in pred:
    prediction.append(each['class_ids'][0])

**Import classification_report and confusion_matrix from sklearn.metrics.**

In [172]:
from sklearn.metrics import classification_report,confusion_matrix

In [173]:
print(classification_report(y_test,prediction))

             precision    recall  f1-score   support

          0       0.88      0.94      0.91      7436
          1       0.75      0.59      0.66      2333

avg / total       0.85      0.85      0.85      9769



In [174]:
print(confusion_matrix(y_test,prediction))

[[6970  466]
 [ 965 1368]]
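
The figures in the classification report can be recomputed directly from the confusion matrix, which makes it clear where each number comes from. A minimal sketch for class 1 (>50K), with the matrix values copied from the output above:

```python
# Confusion matrix from above: rows are true classes, columns are predicted classes.
cm = [[6970, 466],
      [965, 1368]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of the rows predicted >50K, how many are right
recall_1 = tp / (tp + fn)      # of the rows that are truly >50K, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.85 0.75 0.59 0.66
```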


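# Conclusion
The model reaches about 85% accuracy on the test set (8,338 of 9,769 predictions correct), compared with roughly 76% for always predicting <=50K. Recall for the >50K class is only 0.59, so many high earners are still misclassified. Scaling the continuous features may improve performance further.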
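
As a starting point for feature scaling, the sketch below standardizes one continuous column to zero mean and unit standard deviation using only the standard library. It is a sketch under stated assumptions, not part of the notebook: the helper name `standardize` is hypothetical, and in a real pipeline the mean and standard deviation would be computed on the training set only and then reused for the test set.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [39, 50, 38, 53, 28]  # the five ages shown in census_data.head()
scaled = standardize(ages)
print([round(v, 2) for v in scaled])
```

Scaling matters most for columns such as capital_gain, whose values run into the thousands while education_num stays in the low double digits.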