<h1 style="text-align:center;">Classification</h1>

# Introduction

In classification, we tackle the problem of identifying which of a set of categories (sub-populations) an observation (or set of observations) belongs to. For example, a child asked to separate a collection of items on a beach into stones, sea shells, and others is performing a classification task.

With this ability, we can use machine learning algorithms that learn to assign class labels to our emails, sorting them into two categories: "spam" and "not spam".

There are many different types of classification tasks that you may encounter in machine learning, and specialized modeling approaches may be used for each.

## Data and data wrangling

In this section, we will work with a different dataset and move through it swiftly. We will use the Census Income dataset (https://archive.ics.uci.edu/dataset/20/census+income) to predict personal income. It can be downloaded directly from the UCI Machine Learning Repository as follows:

In [1]:
import pandas as pd

In [2]:
# URL of the Census (adult) dataset in the UCI Machine Learning Repository
url_census = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

Let's use method chaining to get a first look at the dataset.

In [3]:
(pd
 .read_csv(url_census)
 .head()
)

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


We can see from the output above that the first row of data has been used as the column headings. We can reload the data with the `header=None` parameter:

In [4]:
# Load Census dataset with no header & display first 5 rows

(pd
 .read_csv(url_census, header=None)
 .head()
)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
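
The effect of `header=None` can be reproduced without downloading anything. The sketch below uses two made-up, header-less rows held in memory (the `raw` string is an illustrative stand-in, not the real file) and compares the shapes that `read_csv` returns with and without the parameter.

```python
import io

import pandas as pd

# Two rows of header-less CSV text (illustrative stand-in for the real file)
raw = "39,State-gov,77516\n50,Private,83311\n"

# Default behaviour: the first data row is consumed as the column names
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)           # one data row survives
print(no_header.shape)             # both rows are kept
print(no_header.columns.tolist())  # positional column labels
```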


Our output now has no meaningful headers, only positional numbers. A closer look at the Attribute Information section of the dataset's website gives us the column names, which we will now use. We also introduce the `.pipe` method, chosen because it fits well with method chaining: it takes the DataFrame fed to it, applies a function to it (in our case, one that adds the headers), and then outputs the transformed DataFrame.

In [5]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
             'marital-status', 'occupation', 'relationship', 'race', 'sex',
             'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
             'income']

(pd
 .read_csv(url_census, header=None)
 .pipe(lambda x: x.rename(columns={i: name for i, name in enumerate(col_names)}))
 .head()
)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
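
To see `.pipe` in isolation, here is a small sketch on a toy DataFrame. The helper `add_headers` and the frame `toy` are hypothetical names introduced only for this illustration; the point is that `df.pipe(f, arg)` behaves like `f(df, arg)` while keeping the chain readable.

```python
import pandas as pd


def add_headers(df, names):
    """Return a copy of df with its positional columns renamed to `names`."""
    return df.rename(columns=dict(enumerate(names)))


# Toy frame with positional column labels 0 and 1
toy = pd.DataFrame([[39, "State-gov"], [50, "Private"]])

# Equivalent to add_headers(toy, ["age", "workclass"])
named = toy.pipe(add_headers, ["age", "workclass"])

print(named.columns.tolist())
```

As a side note, `read_csv` also accepts a `names=` argument, so `pd.read_csv(url, header=None, names=col_names)` should be another way to reach the same result.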


**Null values**

Let's use the `.info` method to look for null values. But first, we will create a function that we will keep building on as we go.

In [6]:
def prep_census(url):
    col_names = ['age', 'workclass', 'fnlwgt',
                 'education', 'education-num',
                 'marital-status', 'occupation',
                 'relationship', 'race', 'sex',
                 'capital-gain', 'capital-loss',
                 'hours-per-week', 'native-country',
                 'income']
    return (pd
            .read_csv(url, header=None)
            .pipe(lambda x: x.rename(columns={i: name for i, name in enumerate(col_names)}))
    )

df_census = prep_census(url_census)
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


It seems we have no null values in this dataset: every column reports the same non-null count (32561), which matches the number of entries.

We can check this another way. Earlier, we created a function called `total_nulls` and stored it in our data-wrangling module, `wrangle_bike_rentals`.

In [7]:
from wrangle_bike_rentals import *

In [8]:
total_nulls(df_census)

0

The output confirms our assumption.
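
The body of `total_nulls` is not shown in this section. A minimal sketch of what such a helper might look like, assuming it simply counts every missing cell in the DataFrame, is given below; the name `total_nulls_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def total_nulls_sketch(df):
    """Hypothetical stand-in for total_nulls: count all missing cells in df."""
    # First sum gives per-column counts, second sum adds them together
    return int(df.isnull().sum().sum())


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

print(total_nulls_sketch(toy))
```

A check like this only counts values that pandas recognises as missing. Placeholder strings, such as the `?` that shows up in the `workclass_ ?` column later on, are not counted.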

**Non-numerical columns**

We have columns whose `dtype` is `object`. We need to convert all of them into numerical columns. We can use the pandas [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function for that. It takes the unique non-numerical values of every column and converts each into its own column, with 1 (or `True`) indicating presence and 0 (or `False`) indicating absence.

An alternative is the [Category Encoders](https://contrib.scikit-learn.org/category_encoders/) library, which allows us to integrate a one-hot encoder as a transformer in a scikit-learn Pipeline. Here we stay with `pd.get_dummies`.

Because this will create many new columns, it is worthwhile checking whether any columns can be eliminated before applying the function to the entire dataset. A quick review of the `df_census` data reveals an `'education'` column and an `'education-num'` column, the latter being the numerical conversion of the former. We will just drop the `'education'` column.

In [9]:
def prep_census(url):
    col_names = ['age', 'workclass', 'fnlwgt',
                 'education', 'education-num',
                 'marital-status', 'occupation',
                 'relationship', 'race', 'sex',
                 'capital-gain', 'capital-loss',
                 'hours-per-week', 'native-country',
                 'income']
    return (pd
            .read_csv(url, header=None)
            .pipe(lambda x: x.rename(columns={i: name for i, name in enumerate(col_names)}))
            .drop(['education'], axis=1)
    )

df_census = prep_census(url_census)
df_census.head()

Unnamed: 0,age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
df_census = pd.get_dummies(df_census)

df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 94 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   age                                         32561 non-null  int64
 1   fnlwgt                                      32561 non-null  int64
 2   education-num                               32561 non-null  int64
 3   capital-gain                                32561 non-null  int64
 4   capital-loss                                32561 non-null  int64
 5   hours-per-week                              32561 non-null  int64
 6   workclass_ ?                                32561 non-null  bool 
 7   workclass_ Federal-gov                      32561 non-null  bool 
 8   workclass_ Local-gov                        32561 non-null  bool 
 9   workclass_ Never-worked                     32561 non-null  bool 
 10  workclass_ Private                

From the above output, `pd.get_dummies` has expanded the dataset from 14 to 94 columns, and it seems to have increased the memory usage as well (reported in the last line of the `.info()` output).
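
To make the effect of `pd.get_dummies` concrete, the sketch below applies it to a tiny made-up frame (the `toy` data is purely illustrative). The numerical column passes through untouched, while the text column is replaced by one indicator column per unique value. For a column with only two values, the two indicator columns are complements of each other.

```python
import pandas as pd

# Tiny illustrative frame: one numerical and one text column
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
})

encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())  # ['age', 'sex_Female', 'sex_Male']
print(encoded)
```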

## Target and predictor columns
All columns are now numerical with no null values; it is time to split the data into a target vector and a feature matrix.

The target is whether or not someone makes more than $\$50,000$. Two columns, `'income_ <=50K'` and `'income_ >50K'`, encode this, and each is the complement of the other. Since either column will work, we delete `'income_ <=50K'`.

In [11]:
df_census = df_census.drop('income_ <=50K', axis=1)
df_census.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ >50K
0,39,77516,13,2174,0,40,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
1,50,83311,13,0,0,13,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,38,215646,9,0,0,40,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,53,234721,7,0,0,40,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,28,338409,13,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Since we tend to create our X and y in the same way every time, we will write a function to help with this process.

In [12]:
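def splitX_y(df, trgt):
    # Compare names with != rather than `not in`: with a string target,
    # `not in` would test for substrings instead of equality
    features = [col for col in df.columns if col != trgt]
    return (df[features], df[trgt])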

In [13]:
X, y = splitX_y(df_census, "income_ >50K")

print(f"shape of target vector, y: {y.shape}")
print(f"shape of feature matrix, X: {X.shape}")

shape of target vector, y: (32561,)
shape of feature matrix, X: (32561, 92)


**Logistic regression**

Logistic regression is one of the most fundamental classification algorithms. Mathematically, it works in a manner similar to linear regression: for each column, it finds an appropriate weight, or coefficient, that best fits the training data. The primary difference is that instead of using the weighted sum of the terms directly as the prediction, as in linear regression, logistic regression passes that sum through the sigmoid function, which maps it to a value between 0 and 1. All values greater than a certain threshold (say 0.5) are then matched to 1 (or true, meaning the person earns an income greater than $\$50,000$) and all values less than the threshold are matched to 0.
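
The prediction step described above can be sketched in a few lines of NumPy. The weights, bias, and feature values below are made up for illustration and are not fitted to the census data; the sketch only shows the weighted sum, the sigmoid, and the 0.5 threshold.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Made-up features, weights and bias, for illustration only
X_toy = np.array([[1.0, 2.0], [-1.0, -3.0], [0.0, 0.0]])
weights = np.array([0.5, 0.25])
bias = 0.0

z = X_toy @ weights + bias         # weighted sum, as in linear regression
proba = sigmoid(z)                 # squashed into a value between 0 and 1
pred = (proba > 0.5).astype(int)   # values above the threshold become 1

print(np.round(proba, 3))
print(pred)
```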

The main differences between implementing logistic regression and linear regression with scikit-learn are:

1. The target vector, y, should fit into categories; and,
2. The score should be in terms of accuracy, which is already the default for scikit-learn classifiers.

In [14]:
from sklearn.linear_model import LogisticRegression

### The cross-validation function

Instead of copying and pasting, let's build a cross-validation function that takes a machine learning algorithm as input and prints the accuracy scores obtained with `cross_val_score`.

In [15]:
# Import cross_val_score
from sklearn.model_selection import cross_val_score
import numpy as np

# Define cross_valid function with classifier and num_splits as input
def cross_valid(classifier, num_splits=10):
    
    # Initialize classifier
    model = classifier

    # Obtain scores of cross-validation
    scores = cross_val_score(model, X, y, cv=num_splits)

    # Display accuracy
    print(f'Accuracy: {np.round(scores, 2)}')

    # Display mean accuracy
    print(f'Accuracy mean: {scores.mean():.2f}')

In [16]:
# Use cross_val function to score LogisticRegression
cross_valid(LogisticRegression())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: [0.8  0.8  0.79 0.8  0.79 0.81 0.79 0.79 0.8  0.8 ]
Accuracy mean: 0.80


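An accuracy of 80% looks good right out of the gate, although the convergence warning above suggests that scaling the data or raising `max_iter` could improve the fit.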
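
The warning printed above recommends scaling the data or increasing `max_iter`. One way to do both is sketched below, assuming a scikit-learn pipeline with `StandardScaler` in front of `LogisticRegression`. To keep the example self-contained, it runs on synthetic data from `make_classification` rather than on the census data, so the score it prints says nothing about the census results; `X_demo`, `y_demo`, and `scaled_logreg` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Inflate one feature so its scale dwarfs the others, as a large-valued column would
X_demo[:, 0] *= 100_000

# The scaler is refitted inside each cross-validation fold, then the classifier is trained
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5)
print(f"Accuracy mean: {scores.mean():.2f}")
```

Because the pipeline is itself a scikit-learn estimator, it could be passed to `cross_valid` in place of `LogisticRegression()`.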

Let's see what XGBoost has to offer.

**XGBoost**

In [17]:
# Import XGBoost Classifier
from xgboost import XGBClassifier

In [18]:
# Use cross_val function to score XGBoost
cross_valid(XGBClassifier(n_estimators=5))

Accuracy: [0.85 0.86 0.87 0.85 0.86 0.86 0.86 0.87 0.86 0.86]
Accuracy mean: 0.86


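This looks great already: with only five estimators, XGBoost reaches a mean accuracy of 0.86, compared with 0.80 for logistic regression.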