# Classification Exercise

We'll be working with some California Census Data and using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

### THE DATA

**Read in the census_data.csv data with pandas.**

In [177]:
import pandas as pd

In [178]:
census = pd.read_csv("census_data.csv")

In [179]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**TensorFlow won't be able to understand strings as labels, so you'll need to use the pandas .apply() method with a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.**

**Convert the Label column to 0s and 1s instead of strings.**

In [180]:
census['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [181]:
def label_fix(label):
    if label==' <=50K':
        return 0
    else:
        return 1

In [182]:
census['income_bracket']=census['income_bracket'].apply(label_fix)
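As an alternative to the custom function above, the same conversion can be sketched with vectorized string methods. The `income` Series below is a toy stand-in for `census['income_bracket']` (an assumption for illustration), using the two values reported by `unique()`, leading space included.

```python
import pandas as pd

# Toy stand-in for census['income_bracket']; the values carry a leading
# space, exactly as in the unique() output shown earlier.
income = pd.Series([' <=50K', ' >50K', ' <=50K'])

# Strip the surrounding whitespace, then map each string to an integer label.
labels = income.str.strip().map({'<=50K': 0, '>50K': 1})

print(labels.tolist())
```

Stripping first makes the mapping independent of stray whitespace, whereas `label_fix` relies on matching `' <=50K'` exactly.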

In [183]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [184]:
x_data = census.drop('income_bracket',axis=1)
y_labels = census['income_bracket']

In [185]:
from sklearn.model_selection import train_test_split

### Perform a Train Test Split on the Data

In [186]:
X_train, X_test, y_train, y_test = train_test_split(x_data,y_labels,test_size=0.3,random_state=101)
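As a quick sanity check on the split sizes, the arithmetic can be sketched by hand. This assumes scikit-learn rounds a fractional test size up to a whole number of rows, and it takes the total of 32561 rows from the `census.info()` output shown later in this notebook.

```python
import math

n_samples = 32561   # total rows, per the census.info() output
test_size = 0.3     # same fraction passed to train_test_split

# Assumption: the test set gets ceil(test_size * n_samples) rows,
# and the training set gets whatever remains.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The resulting test-set size agrees with the total support of 9769 reported in the classification report at the end of the evaluation section.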

### Create the Feature Columns for tf.estimator

**Take note of categorical vs continuous values!**

In [187]:
census.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')
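One way to separate the categorical columns from the continuous ones is to look at their dtypes. The sketch below uses a tiny hypothetical frame (`toy`) built from a few values in the `head()` output, not the real `census` DataFrame.

```python
import pandas as pd

# Hypothetical two-row frame with a mix of numeric and string columns.
toy = pd.DataFrame({
    'age': [39, 50],
    'workclass': ['State-gov', 'Self-emp-not-inc'],
    'education_num': [13, 13],
    'gender': ['Male', 'Male'],
})

# Numeric dtypes are treated as continuous, everything else as categorical.
continuous_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

On the real data, `income_bracket` would also show up as numeric after the label conversion, so it has to be left out of the feature list by hand.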

**Import TensorFlow**

In [188]:
import tensorflow as tf

**Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.**

In [189]:
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)
race = tf.feature_column.categorical_column_with_hash_bucket("race", hash_bucket_size=1000)
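To build some intuition for what a hash bucket does, here is a conceptual sketch in plain Python. It is not TensorFlow's actual hashing scheme: the helper `bucket_for` and the use of MD5 are assumptions chosen only to show that each string is deterministically assigned to one of `num_buckets` slots, with no vocabulary list required.

```python
import hashlib

def bucket_for(value: str, num_buckets: int) -> int:
    """Map a string to a bucket index in [0, num_buckets) via a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

# A few occupation values taken from the head() output above.
occupations = ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Prof-specialty']
buckets = {name: bucket_for(name, 1000) for name in occupations}

for name, index in buckets.items():
    print(name, index)
```

With far more buckets than distinct categories (1000 buckets for a handful of values), collisions between two categories are unlikely, though never impossible.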

**Create the continuous feature_columns for the continuous values using numeric_column.**

In [190]:
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

**Put all these variables into a single list with the variable name feat_cols.**

In [191]:
feat_cols = [gender,occupation,marital_status,relationship,education,workclass,native_country,
            age,education_num,capital_gain,capital_loss,hours_per_week]

### Create Input Function

**The batch_size is up to you, but do make sure to shuffle!**

In [192]:
input_func=tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,batch_size=100,num_epochs=None,shuffle=True)

#### Create your model with tf.estimator

**Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings; check out the previous lecture on this for more info.)**

In [193]:
# Linear classifier
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)
# Alternative: a DNNClassifier (needs embedding columns for the categorical features)
# model = tf.estimator.DNNClassifier(hidden_units=[14,20,20,20,14],feature_columns=feat_cols)

W1224 18:18:42.617615  1848 estimator.py:1811] Using temporary folder as model directory: C:\Users\caiyi\AppData\Local\Temp\tmp7c_nghuz


**Train your model on the data for at least 5000 steps.**

In [194]:
model.train(input_fn=input_func,steps=5000)

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x1ecb4b4a588>

### Evaluation

**Create a prediction input function. Remember to supply only the X_test data and keep shuffle=False.**

In [195]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,batch_size=len(X_test),shuffle=False)

**Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().**

In [196]:
predictions = list(model.predict(input_fn=pred_fn))

**Each item in your list will look like this:**

In [197]:
predictions[0]

{'logits': array([-0.872437], dtype=float32),
 'logistic': array([0.29474747], dtype=float32),
 'probabilities': array([0.7052525 , 0.29474744], dtype=float32),
 'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'all_class_ids': array([0, 1]),
 'all_classes': array([b'0', b'1'], dtype=object)}
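The fields in this dictionary are related to one another: for a binary classifier, the `logistic` value is the sigmoid of the `logits` value, and `probabilities` holds the pair (1 - p, p). A small sketch that reproduces the numbers above from the logit alone, with 0.5 assumed as the decision threshold:

```python
import math

logit = -0.872437  # 'logits' value from the example prediction above

# Sigmoid of the logit gives the probability of class 1 ('logistic').
prob_positive = 1.0 / (1.0 + math.exp(-logit))
prob_negative = 1.0 - prob_positive

# The predicted class is 1 only when the positive probability exceeds 0.5.
class_id = int(prob_positive > 0.5)

print(prob_negative, prob_positive, class_id)
```

This is why a negative logit always corresponds to a predicted class of 0.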

**Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.**

In [198]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [199]:
final_preds[:10]

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
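The same extraction can also be written as a list comprehension. The `predictions` list below is a hypothetical stand-in containing only the `class_ids` key, so the snippet runs without a trained model.

```python
# Hypothetical stand-in for the list of prediction dictionaries.
predictions = [
    {'class_ids': [0]},
    {'class_ids': [1]},
    {'class_ids': [0]},
]

# Pull the single class id out of each prediction dictionary.
final_preds = [pred['class_ids'][0] for pred in predictions]

print(final_preds)
```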

**Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.**

In [200]:
from sklearn.metrics import classification_report

In [201]:
print(classification_report(y_test,final_preds))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90      7436
           1       0.70      0.64      0.67      2333

    accuracy                           0.85      9769
   macro avg       0.80      0.78      0.79      9769
weighted avg       0.85      0.85      0.85      9769
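To see how the columns of the report relate, the F1 scores and the overall accuracy can be recomputed from the precision, recall, and support values printed above. Because those inputs are already rounded to two decimals, this is only an approximate consistency check; the helper name `f1_from` is made up for this sketch.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall per class, copied from the report.
f1_class_0 = f1_from(0.89, 0.92)
f1_class_1 = f1_from(0.70, 0.64)

# Accuracy equals the support-weighted average of the per-class recalls.
accuracy = (0.92 * 7436 + 0.64 * 2333) / 9769

print(round(f1_class_0, 2), round(f1_class_1, 2), round(accuracy, 2))
```

The gap between the two classes is worth noting: the model recovers most of the `<=50K` rows but misses roughly a third of the `>50K` ones, which the single accuracy figure hides.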



# dnn_model 

In [78]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[13,20,20,13],feature_columns=feat_cols,n_classes=2)

W1224 08:54:48.498434  1848 estimator.py:1811] Using temporary folder as model directory: C:\Users\caiyi\AppData\Local\Temp\tmpfhbrmnso


In [157]:
#census['gender'].nunique()

#census['occupation'].nunique()
# census['marital_status'].nunique()
# census['relationship'].nunique()
# census['education'].nunique()
# census['workclass'].nunique()
# census['native_country'].nunique()
census['race'].nunique()

5

In [None]:
census['marital_status'].nunique()
census['relationship'].nunique()
census['education'].nunique()
census['workclass'].nunique()
census['native_country'].nunique()


In [134]:
census.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

In [202]:
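# NOTE: embedding_column expects a categorical column object (e.g. the `gender` column
# defined earlier), not a column-name string; passing strings is what triggers the
# AttributeError raised when the DNN model is trained.
embedded_gender = tf.feature_column.embedding_column("gender", dimension=2)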
embedded_occupation = tf.feature_column.embedding_column("occupation",dimension=15)
embedded_marital_status = tf.feature_column.embedding_column("marital_status",dimension=7)
embedded_relationship = tf.feature_column.embedding_column("relationship",dimension=6)
embedded_education = tf.feature_column.embedding_column("education",dimension=16)
embedded_workclass = tf.feature_column.embedding_column("workclass",dimension=9)
embedded_native_country =  tf.feature_column.embedding_column("native_country",dimension=42)
embedded_race = tf.feature_column.embedding_column("race", dimension=5)

In [207]:
census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
age               32561 non-null int64
workclass         32561 non-null object
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
gender            32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
income_bracket    32561 non-null int64
dtypes: int64(6), object(8)
memory usage: 3.5+ MB


In [203]:
feat_cols = [age,embedded_workclass,embedded_education,education_num,embedded_marital_status,embedded_occupation,embedded_relationship,embedded_race,embedded_gender,
                         capital_gain,capital_loss,hours_per_week,embedded_native_country]

In [204]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,batch_size=100,num_epochs=1000,shuffle=True)

In [205]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[13,20,20,13],feature_columns=feat_cols,n_classes=2)

W1224 18:19:46.339753  1848 estimator.py:1811] Using temporary folder as model directory: C:\Users\caiyi\AppData\Local\Temp\tmpgqt_va2z


In [206]:
dnn_model.train(input_fn=input_func,steps=1000)

AttributeError: in converted code:
    relative to C:\Users\caiyi\Anaconda3\lib\site-packages:

    tensorflow_estimator\python\estimator\canned\dnn.py:250 call *
        net = self._input_layer(features)
    tensorflow\python\feature_column\feature_column.py:337 __call__
        from_template=True)
    tensorflow\python\ops\template.py:392 __call__
        return self._call_func(args, kwargs)
    tensorflow\python\ops\template.py:354 _call_func
        result = self._func(*args, **kwargs)
    tensorflow\python\feature_column\feature_column.py:181 _internal_input_layer
        feature_columns = _normalize_feature_columns(feature_columns)
    tensorflow\python\feature_column\feature_column.py:2262 _normalize_feature_columns
        if column.name in name_to_column:
    tensorflow\python\feature_column\feature_column_v2.py:3003 name
        return '{}_embedding'.format(self.categorical_column.name)

    AttributeError: 'str' object has no attribute 'name'
    
    originally defined at:
      File "C:\Users\caiyi\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\dnn.py", line 106, in dnn_logit_fn
        name='dnn')
      File "C:\Users\caiyi\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\dnn.py", line 189, in __init__
        create_scope_now=False)
      File "C:\Users\caiyi\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py", line 327, in __init__
        self._name, _internal_input_layer, create_scope_now_=create_scope_now)
      File "C:\Users\caiyi\Anaconda3\lib\site-packages\tensorflow\python\ops\template.py", line 160, in make_template
        **kwargs)
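The `AttributeError: 'str' object has no attribute 'name'` in the traceback above points at the embedding columns: `embedding_column` looks up `.name` on its first argument, so it needs a categorical column object rather than a plain string. A minimal sketch of the likely fix, assuming the same `tf.feature_column` API used throughout this notebook and showing only two of the eight columns:

```python
import tensorflow as tf

# Build the categorical columns first, as in the linear model section.
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

# Pass the column objects, not the column-name strings, to embedding_column.
embedded_gender = tf.feature_column.embedding_column(gender, dimension=2)
embedded_occupation = tf.feature_column.embedding_column(occupation, dimension=15)

print(embedded_gender.name)
print(embedded_occupation.name)
```

The remaining categorical features would be wrapped the same way before rebuilding `feat_cols` and the `DNNClassifier`. Whether training then succeeds with the chosen dimensions has not been verified here.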
    


# Great Job!