## Some examples of classification
In this notebook, you will walk through a few classification examples using [UCI's *Census Income* dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). This dataset contains 48,842 samples, each with 14 attributes. Full descriptions of these features are available on the UCI website, but for the most part, these descriptions aren't important for the discussion in this checkpoint.

This is a real-world dataset containing data that has been prelabeled with two categories: people making more than $50,000 per year and people making $50,000 or less. The 14 attributes include some categorical data stored as character strings, so some feature engineering is necessary before the dataset can be used in the models.

There are 32,561 samples in the training data and 16,281 in the test data.

You'll start by downloading and cleaning up the data. Then you'll create several models, train them, and measure their accuracy.

## The dataset description

Briefly review the target and the dataset's 14 attributes, along with their possible values:

 - **target:** >50K, <=50K
 - **age:** continuous
 - **workclass:** Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
 - **fnlwgt:** continuous
 - **education:** Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
 - **education-num:** continuous
 - **marital-status:** Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
 - **occupation:** Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
 - **relationship:** Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
 - **race:** White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
 - **sex:** Female, Male
 - **capital-gain:** continuous
 - **capital-loss:** continuous
 - **hours-per-week:** continuous
 - **native-country:** United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands



In [0]:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

First, download and clean up the training data.

In [0]:

# Create a list of column names
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 
        'capital-loss', 'hours-per-week', 'native-country', 'target']

# Read the raw data from the source into a DataFrame
df_raw = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, names = cols)

# Encode character string categorical data into numeric data using one-hot encoding
df_encoded = pd.concat([df_raw, pd.get_dummies(df_raw["workclass"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["education"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["marital-status"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["occupation"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["relationship"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["race"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["sex"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_raw["native-country"], drop_first=True)], axis=1)

# Drop the character columns
df_encoded.drop(['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'], axis = 1, inplace = True)

# Encode the target column
df_encoded.loc[df_encoded['target'] == ' <=50K', 'target'] = 0
df_encoded.loc[df_encoded['target'] == ' >50K', 'target'] = 1

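# Separate features and target; cast the target to int so that
# scikit-learn receives numeric class labels rather than an object column
X_train = df_encoded.drop(['target'], axis = 1)
y_train = df_encoded['target'].astype(int)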

print(X_train.shape)
print(y_train.shape)

(32561, 100)
(32561,)
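
To see what the one-hot encoding step does, here is a minimal sketch on a tiny made-up DataFrame. The rows and the names `toy` and `toy_encoded` are illustrative assumptions, not part of the dataset. The sketch encodes several string columns in a single `pd.get_dummies` call, which is one possible alternative to the column-by-column concatenation used above. Note that this form prefixes each new column with the original column name, so the resulting names differ from those in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the values only mimic the style of the census columns
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', 'Private', 'Private'],
    'sex': ['Male', 'Male', 'Female'],
})

# Encode both string columns at once; drop_first=True drops the first
# (alphabetically sorted) category of each column, since it is implied
# when all the remaining indicator columns are 0
toy_encoded = pd.get_dummies(toy, columns=['workclass', 'sex'], drop_first=True)

print(toy_encoded.columns.tolist())
```

With two categories per column, each string column collapses into a single indicator column.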


Now, load up the test dataset and process it in a similar manner.

In [0]:
df_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', header=None, names = cols, skiprows=[0])

# Encode character string categorical data into numeric data using one-hot encoding
df_encoded = pd.concat([df_test, pd.get_dummies(df_test["workclass"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["education"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["marital-status"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["occupation"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["relationship"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["race"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["sex"], drop_first=True)], axis=1)
df_encoded = pd.concat([df_encoded, pd.get_dummies(df_test["native-country"], drop_first=True)], axis=1)

# Drop the character columns
df_encoded.drop(['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'], axis = 1, inplace = True)

# Encode the target column
df_encoded.loc[df_encoded['target'] == ' <=50K.', 'target'] = 0
df_encoded.loc[df_encoded['target'] == ' >50K.', 'target'] = 1

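# The training data contains this country and the test data does not,
# so add the missing column filled with zeros
df_encoded[' Holand-Netherlands'] = 0

# Separate features and target; reorder the feature columns to match X_train,
# because the column added above was appended at the end
X_test = df_encoded.drop(['target'], axis = 1)[X_train.columns]
y_test = df_encoded['target'].astype(int)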

print(X_test.shape)
print(y_test.shape)


(16281, 100)
(16281,)
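
Because the training and test sets are encoded separately, their indicator columns can differ both in which columns exist and in their order. The following is a minimal sketch, on made-up frames, of one way to align a test frame to the training layout with `DataFrame.reindex`. The names `train_features`, `test_features`, and `test_aligned` are hypothetical.

```python
import pandas as pd

# Hypothetical encoded frames: the test frame lacks column 'B'
# and lists its columns in a different order
train_features = pd.DataFrame({'A': [1, 0], 'B': [0, 1], 'C': [0, 0]})
test_features = pd.DataFrame({'C': [1, 0], 'A': [0, 1]})

# Reorder to the training layout; columns absent from the test frame
# are created and filled with 0
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

A model trained on `train_features` then sees each feature in the same position at prediction time.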


## A quick exploration of the data
Before you run any algorithms, take a quick look at the data.

In [0]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 100 columns):
age                            32561 non-null int64
fnlwgt                         32561 non-null int64
education-num                  32561 non-null int64
capital-gain                   32561 non-null int64
capital-loss                   32561 non-null int64
hours-per-week                 32561 non-null int64
 Federal-gov                   32561 non-null uint8
 Local-gov                     32561 non-null uint8
 Never-worked                  32561 non-null uint8
 Private                       32561 non-null uint8
 Self-emp-inc                  32561 non-null uint8
 Self-emp-not-inc              32561 non-null uint8
 State-gov                     32561 non-null uint8
 Without-pay                   32561 non-null uint8
 11th                          32561 non-null uint8
 12th                          32561 non-null uint8
 1st-4th                       32561 non-null uint8
 5th-6

As you can see, you have 100 numeric columns with no missing values.
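
One caveat is worth a quick check. The raw UCI files mark unknown values with a `?` placeholder string rather than an empty cell, so pandas does not count them as missing. The sketch below uses a small made-up frame (`toy_raw` is a hypothetical name) to show the difference between the two checks.

```python
import pandas as pd

# Hypothetical rows in the style of the raw data, where ' ?' marks an unknown value
toy_raw = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': [' State-gov', ' ?', ' Private'],
})

# pandas does not treat ' ?' as missing, so isna() finds nothing
n_missing = int(toy_raw.isna().sum().sum())

# Counting the placeholder explicitly reveals the unknown entry
n_placeholder = int((toy_raw['workclass'] == ' ?').sum())

print(n_missing, n_placeholder)
```

In this notebook, the ` ?` category sorts first among the `workclass` values, so `drop_first=True` removes its indicator column during encoding.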

In [0]:
X_train.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Without-pay,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,Bachelors,Doctorate,HS-grad,Masters,Preschool,Prof-school,Some-college,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,...,Canada,China,Columbia,Cuba,Dominican-Republic,Ecuador,El-Salvador,England,France,Germany,Greece,Guatemala,Haiti,Holand-Netherlands,Honduras,Hong,Hungary,India,Iran,Ireland,Italy,Jamaica,Japan,Laos,Mexico,Nicaragua,Outlying-US(Guam-USVI-etc),Peru,Philippines,Poland,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Train a model and calculate accuracy
To train a model, you'll use a built-in model from scikit-learn. For this checkpoint, you will mostly use the default parameters; later, you will learn how to fine-tune the model with parameters. Also, you will use an accuracy score as a quick evaluation of how well the model is working. But, as you will see later, the accuracy score isn't the only measure of performance and doesn't always give a full picture.

Notice that you're using the training data to train the model in each of the test cases below. Then you're using the test data to evaluate the model's performance.

As you proceed, keep track of the predictions given by each model. This way, you'll be able to compare them after.
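
To make the accuracy score concrete, here is a small worked sketch using made-up labels and predictions (the lists are assumptions for illustration only). It computes accuracy by hand and compares it with a baseline that always predicts the most common class, which is one reason accuracy alone may not give a full picture.

```python
from collections import Counter

# Hypothetical labels and predictions (1 = ">50K", 0 = "<=50K")
actual    = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
predicted = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Accuracy: the fraction of predictions that match the actual label
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Baseline: always predict the most common class
majority_class, majority_count = Counter(actual).most_common(1)[0]
baseline_accuracy = majority_count / len(actual)

print('Model accuracy:    {:.3f}'.format(accuracy))
print('Baseline accuracy: {:.3f}'.format(baseline_accuracy))
```

Here the hypothetical model is right 7 times out of 10, but so is the baseline that never predicts the positive class.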

### Some linear classifiers
#### Logistic regression

In [0]:
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(X_train, y_train)

lr_score = lr.score(X_test, y_test)
lr_predictions = lr.predict(X_test)

print('Accuracy of Logistic Regression: {:.3f}'.format(lr_score))

Accuracy of Logistic Regression: 0.798


#### Support-vector machines
Support-vector machines (SVMs) are another type of classifier; the variant used here, `LinearSVC`, is a linear classifier. An SVM attempts to construct a *hyperplane* (in a 2D space, a straight line) that separates the objects into classes, similarly to logistic regression. An SVM, however, looks for the hyperplane that has the greatest distance to the nearest training data points. This distance is called the *margin*.

SVMs tend to perform well if the categories are linearly separable and there is clear separation between the classes. Overlap between the classes (or noise) tends to degrade performance. Although the algorithm is memory efficient, training tends to be slow, especially on large datasets.
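
The idea of "greatest distance to the nearest training data point" can be sketched with a few made-up 2D points. For a line $w_1 x + w_2 y + b = 0$, the distance from a point $(x, y)$ is $|w_1 x + w_2 y + b| / \lVert w \rVert$. The helper `margin` and the data below are hypothetical and only compare two hand-picked candidate lines; this is not how `LinearSVC` searches for its solution.

```python
import math

# Hypothetical 2D points with labels +1 / -1
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 4.0), 1)]

def margin(w, b):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0.

    Returns None if the line puts any point on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    distances = []
    for (x, y), label in points:
        signed = (w[0] * x + w[1] * y + b) / norm
        if signed * label <= 0:
            return None
        distances.append(abs(signed))
    return min(distances)

# Two candidate lines; both separate the two classes
vertical_margin = margin((1.0, 0.0), -3.0)   # the line x = 3
diagonal_margin = margin((1.0, 1.0), -5.0)   # the line x + y = 5

print(vertical_margin, diagonal_margin)
```

Both lines classify every point correctly, but the diagonal line leaves more room to the nearest points, so it is the one a maximum-margin classifier would prefer of the two.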

In [0]:
svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)
svm_score = svm.score(X_test, y_test)
svm_predictions = svm.predict(X_test)

print('Accuracy of SVM: {:.3f}'.format(svm_score))

Accuracy of SVM: 0.798




### Nonlinear classifiers
#### KNN
This classifier works by comparing each data point to its $k$ nearest neighbors, where $k$ is some arbitrary integer. For example, if `k = 5`, an unknown data point will be compared to the five closest known points. The data point will be classified as belonging to the most populous group among the five neighbors. It's the "if it quacks like a duck" algorithm of machine learning.
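
The voting procedure described above can be sketched in a few lines of plain Python. The function `knn_predict` and the toy points are hypothetical and written for illustration; scikit-learn's `KNeighborsClassifier` is the implementation actually used in this notebook.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    # The most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ['duck', 'duck', 'duck', 'goose', 'goose', 'goose']

print(knn_predict(train_points, train_labels, query=(2, 2), k=3))
```

The query point sits next to the three `duck` points, so all three of its nearest neighbors vote for that class.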

Technically, this algorithm doesn't "learn" the way that the others do. Rather, it stores the training data and looks for similarities between objects at prediction time. KNN makes no assumptions about the distribution of the data, and because each prediction is a vote among several neighbors, a single outlier has limited influence. It does depend on distances, however, so features with large numeric ranges (such as `fnlwgt` in this dataset) can dominate the comparison unless the data is scaled. It is also computationally expensive at prediction time and must hold all the training data in memory, so memory requirements grow with the dataset.

Below, two separate models are trained, one with `k = 7` and one with `k = 5`. Typically, you need to try several values for $k$ to find the optimal value for your dataset. So before moving on, try some other values, such as `3` or `9`, and compare them to the accuracy of the models created below.

In [0]:

knn = KNeighborsClassifier(n_neighbors=7)

# Then fit the model
knn.fit(X_train, y_train)

# How well did it do?
knn_7_score = knn.score(X_test, y_test)
knn_7_predictions = knn.predict(X_test)

print('Accuracy of KNN (k = 7): {:.3f}'.format(knn_7_score))

Accuracy of KNN (k = 7): 0.785


In [0]:
knn = KNeighborsClassifier(n_neighbors=5)

# Then fit the model
knn.fit(X_train, y_train)

# How well did it do?
knn_5_score = knn.score(X_test, y_test)
knn_5_predictions = knn.predict(X_test)

print('Accuracy of KNN (k = 5): {:.3f}'.format(knn_5_score))

Accuracy of KNN (k = 5): 0.777
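
Trying several values of $k$ can be automated with a loop. The sketch below is a minimal illustration on synthetic stand-in data generated with `make_classification`, so it runs on its own; the names `X_tr`, `X_te`, `knn_scores`, and `best_k` are assumptions, and the scores it prints say nothing about the census data. In practice, you would usually choose $k$ with a validation set or cross-validation rather than with the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the sketch is self-contained
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit one model per candidate k and record its accuracy on the held-out split
knn_scores = {}
for k in [3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    knn_scores[k] = model.score(X_te, y_te)

# The k with the highest held-out accuracy
best_k = max(knn_scores, key=knn_scores.get)

for k, score in knn_scores.items():
    print('Accuracy of KNN (k = {}): {:.3f}'.format(k, score))
print('Best k:', best_k)
```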


#### Decision tree
Decision trees are fairly simple. They make decisions by splitting the data into two or more sets based on some differentiator in the data. This process repeats until each data sample is in a leaf of the tree. This split works kind of like the game of twenty questions. When you ask, "Is it an animal?" you are splitting all the objects in the world into two sets: the set of things that are animals and the set of things that are not. Think of these two sets as two branches of the tree. Say that you go down the animal branch and ask another question: "Does it live on land?" You again divide the set of all animals into two. If you continue this process, you can narrow down to some very specific object.

Deciding on the split is the hardest part. In twenty questions, you want to find criteria that divide the set of all objects into roughly equal parts. If you ask very specific questions too early, it wastes time. For example, if your first question is "Is it a type of cake?" you end up with two very uneven groups of objects: a small group of objects that are cake and a very large group of objects that are not cake. A decision tree algorithm examines your data in a similar spirit: at each step, it chooses the split that best separates the classes, as measured by a criterion such as Gini impurity (the scikit-learn default).
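
One common way to score a candidate split is Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The sketch below is a simplified, hand-rolled illustration with made-up labels; the helper names `gini` and `split_impurity` are hypothetical. A lower weighted impurity means the split separates the classes more cleanly.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels (1 = ">50K", 0 = "<=50K") for the same eight samples,
# divided by two different candidate questions
good_split = split_impurity([0, 0, 0, 0], [1, 1, 1, 0])
poor_split = split_impurity([0, 1, 0, 1], [0, 1, 0, 0])

print('Good split: {:.4f}'.format(good_split))
print('Poor split: {:.4f}'.format(poor_split))
```

The first question produces one pure group and one mostly pure group, so its weighted impurity is lower and it would be preferred.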

Decision trees don't require any particular assumption about the data, and they require less data cleaning than other algorithms. However, they are prone to overfitting and are not guaranteed to give the optimal result.


In [0]:

dt = DecisionTreeClassifier()

dt.fit(X_train,y_train)

dt_score = dt.score(X_test, y_test)
dt_predictions = dt.predict(X_test)
print('Accuracy of Decision Tree: {:.3f} '.format(dt_score))


Accuracy of Decision Tree: 0.798 


#### Random forest
The random forest classifier is built on a collection of decision trees. Because any single tree may overfit the data or give suboptimal results, random forest trains many trees, each on a random sample of the data and features, and combines their predictions to make a final decision. This is usually a better choice than a single decision tree.
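
The "many trees, one decision" idea can be sketched as a simple majority vote. The per-tree predictions below are made up for illustration. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this is a simplification.

```python
from collections import Counter

# Hypothetical predictions from five trees for three samples
# (one row per tree, one column per sample)
tree_predictions = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]

# For each sample, take the class predicted by the most trees
forest_predictions = [
    Counter(votes).most_common(1)[0][0]
    for votes in zip(*tree_predictions)
]

print(forest_predictions)
```

Individual trees disagree on every sample here, yet the combined vote still yields a single answer per sample.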

In [0]:
rf = RandomForestClassifier(n_estimators = 22, random_state = 40)

rf.fit(X_train,y_train)

rf_score = rf.score(X_test, y_test)
rf_predictions = rf.predict(X_test)

print('Accuracy of Random Forest: {:.3f}'.format(rf_score))


Accuracy of Random Forest: 0.856


## Compare predictions
Now, compare the predictions made by each of the classifiers above. For easy comparison, you can put them together by constructing a DataFrame with a column per classifier. Include a column for the actual label. Then you can look at each row and see which classifier correctly classified that object.



In [0]:
predictions_dictionary = {'Logistic Regression' : lr_predictions, 'KNN_7' : knn_7_predictions, 
                          'KNN_5': knn_5_predictions, 'SVM' : svm_predictions, 'Decision Tree' : dt_predictions, 
                          'Random Forest' : rf_predictions, 'Actual': y_test}

predictions_df = pd.DataFrame(predictions_dictionary)
predictions_df

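| | Logistic Regression | KNN_7 | KNN_5 | SVM | Decision Tree | Random Forest | Actual |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |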
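
Rather than scanning rows by eye, you can let pandas do the comparison. The sketch below re-enters the ten rows displayed above by hand so that it runs on its own; the names `shown` and `correct` are assumptions. The same two expressions should work on the full `predictions_df`.

```python
import pandas as pd

# The ten displayed rows, re-entered by hand so the sketch is self-contained
shown = pd.DataFrame({
    'Logistic Regression': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    'KNN_7':               [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    'KNN_5':               [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    'SVM':                 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Decision Tree':       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'Random Forest':       [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    'Actual':              [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})

# Compare every classifier column with the actual label, row by row
correct = shown.drop(columns='Actual').eq(shown['Actual'], axis=0)

# Share of these ten rows that each classifier got right
print(correct.mean())

# Rows where at least one classifier was wrong
disputed = shown[~correct.all(axis=1)]
print(disputed)
```

Ten rows are far too few to rank the models, but the disputed rows show where the classifiers disagree with the actual label.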
