# Working with Numerical Data

In the previous notebook, we simplified the procedure by loading and analyzing a dataset that contained exclusively numerical data. In addition, we used datasets that were already split into train and test sets.

In this notebook, we use just one heterogeneous dataset for testing and training.

In [2]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Disable jedi autocompleter
%config Completer.use_jedi = False

In [12]:
df = pd.read_csv("data/adult-census.csv")
# "education-num" duplicates the information in "education"
df = df.drop(columns=["education-num"])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


The next step separates the target from the data. We performed the same procedure in the previous notebook.

In [5]:
data, target = df.drop(columns='class'), df['class']
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [6]:
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

## Identify Numerical Data

Predictive models are natively designed to work with numerical data. Moreover, numerical data usually requires very little work before getting started with training.

The first task is to identify the numerical data in our dataset.

In [8]:
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

It seems that we have only two data types. Let's check:

In [9]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)

Indeed, the only two types in the dataset are integer and object.

In [10]:
data.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
22170,39,Private,146091,Bachelors,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,20,United-States
29310,30,Private,381153,HS-grad,Never-married,Craft-repair,Not-in-family,White,Male,0,0,40,United-States
24687,65,Local-gov,125768,Some-college,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,28,United-States


In [15]:
numeric_cols = data.select_dtypes('number').columns
data[numeric_cols].head()

Unnamed: 0,age,fnlwgt,capital-gain,capital-loss,hours-per-week
0,25,226802,0,0,40
1,38,89814,0,0,50
2,28,336951,0,0,40
3,44,160323,7688,0,40
4,18,103497,0,0,30


In [17]:
data_numeric = data[numeric_cols]
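
As a minimal sketch of how `select_dtypes` behaves, the toy frame below (hypothetical, not the census data) mixes integer, float, and string columns. Passing `'number'` keeps both the integer and the float columns, while `exclude='number'` returns the remaining ones. The names `toy`, `numeric_toy_cols`, and `other_toy_cols` are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the selection
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "hourly-rate": [12.5, 20.0, 17.25],
    "workclass": ["Private", "Private", "Local-gov"],
})

# Keep every numeric column (integers and floats alike)
numeric_toy_cols = toy.select_dtypes("number").columns.tolist()

# Keep everything that is not numeric
other_toy_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_toy_cols)
print(other_toy_cols)
```

In the census data all numerical columns happen to be `int64`, so selecting on `'number'` rather than on `'int64'` is simply the more general choice.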

## Train-Test Split the Dataset

Scikit-learn provides a helper function `sklearn.model_selection.train_test_split` which is used to automatically split the dataset into two subsets.

In [18]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=23, test_size=.25)           

We specified that 25% of the samples should go to the testing set, while the remaining samples make up the training set.

In [19]:
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data_numeric.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in testing: 12211 => 25.0% of the original set
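
The sketch below, on a hypothetical 10-sample array rather than the census data, illustrates two properties of `train_test_split`: when `test_size` is a fraction, the number of test samples is rounded up, and reusing the same `random_state` reproduces the same split. The rounding is consistent with the 12211 test samples reported above, since 48842 × 0.25 = 12210.5.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=23, test_size=0.25)

# Splitting again with the same random_state gives the same partition
X_toy_train2, X_toy_test2, _, _ = train_test_split(
    X_toy, y_toy, random_state=23, test_size=0.25)

n_toy_train, n_toy_test = X_toy_train.shape[0], X_toy_test.shape[0]
same_split = np.array_equal(X_toy_test, X_toy_test2)

print(n_toy_train, n_toy_test)
print(same_split)
```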


This time we are going to use a logistic regression model.

In [20]:
# to better display the model diagram
from sklearn import set_config
set_config(display='diagram')

# import the logistic regression model
from sklearn.linear_model import LogisticRegression

In [21]:
# Create the model
model = LogisticRegression()

Now that the model has been created, we can use it in exactly the same way as we used the k-nearest neighbors model in the previous notebook. In particular, we can use the `fit` method to train the model using the training data and labels:

In [22]:
model.fit(data_train, target_train)

We can also use the `score` method to check the model's generalization performance on the test set.

In [23]:
accuracy = model.score(data_test, target_test)

In [24]:
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.802
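
For a classifier such as `LogisticRegression`, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below checks this on synthetic data generated with `make_classification`; the data and the names `X_syn`, `y_syn`, and `clf` are assumptions for illustration and stand in for the census features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the census data)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, random_state=0, test_size=0.25)

clf = LogisticRegression().fit(X_syn_train, y_syn_train)

# Accuracy as reported by the estimator
score_value = clf.score(X_syn_test, y_syn_test)

# The same quantity computed by hand from the predictions
manual_accuracy = np.mean(clf.predict(X_syn_test) == y_syn_test)

print(score_value, manual_accuracy)
```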


Now the real question is: does this generalization performance indicate a good predictive model?
Answer in `Exercise_103.ipynb`.