# Project 2 Model Classification
### Serena Shah | ss94574
## Part 1

We first import the necessary libraries and load the data into a pandas DataFrame.

In [134]:
import pandas as pd
import numpy as np

# read in data
bc = pd.read_csv('data/project2.data')

We now identify the shape and size of the raw data.

In [135]:
# data shape
bc.shape

(286, 10)

In [136]:
# data size
bc.size

2860

Next we look at information about the data types of the data columns.

In [137]:
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class        286 non-null    object
 1   age          286 non-null    object
 2   menopause    286 non-null    object
 3   tumor-size   286 non-null    object
 4   inv-nodes    286 non-null    object
 5   node-caps    286 non-null    object
 6   deg-malig    286 non-null    int64 
 7   breast       286 non-null    object
 8   breast-quad  286 non-null    object
 9   irradiat     286 non-null    object
dtypes: int64(1), object(9)
memory usage: 22.5+ KB


The non-null counts for all variables align with the length of the dataframe (286), so at first glance there are no NaN or missing values to account for in preprocessing. Further investigation, however, shows questionable values in the `node-caps` and `breast-quad` columns.

In [138]:
bc['node-caps'].unique()

array(['no', 'yes', '?'], dtype=object)

In [139]:
bc['breast-quad'].unique()

array(['left_low', 'right_up', 'left_up', 'right_low', 'central', '?'],
      dtype=object)
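The column-by-column checks above could also be run over the whole dataframe at once. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset) that counts `?` placeholders per column.

```python
import pandas as pd

# Toy frame standing in for `bc`; the values are illustrative only.
toy = pd.DataFrame({
    "node-caps": ["no", "yes", "?", "no"],
    "breast-quad": ["left_low", "?", "central", "left_up"],
    "deg-malig": [3, 2, 2, 1],
})

# Element-wise comparison against the placeholder, then a per-column count.
# Numeric columns never equal the string '?', so they report 0.
placeholder_counts = toy.eq("?").sum()

# Keep only the columns that actually contain a placeholder.
print(placeholder_counts[placeholder_counts > 0])
```

Applied to `bc`, the same two lines would list every column that still needs cleaning, rather than relying on inspecting `unique()` for each column by hand.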

There are `?` values in both the `node-caps` and `breast-quad` columns. We replace them with each column's most frequent value (its mode).

In [140]:
# replace ? with the mode for breast-quad and node-caps
bc['breast-quad'] = bc['breast-quad'].replace('?', bc['breast-quad'].mode()[0])
bc['node-caps'] = bc['node-caps'].replace('?', bc['node-caps'].mode()[0])

The checks below confirm that the `?` values no longer appear in either column.

In [141]:
bc['breast-quad'].unique()

array(['left_low', 'right_up', 'left_up', 'right_low', 'central'],
      dtype=object)

In [142]:
bc['node-caps'].unique()

array(['no', 'yes'], dtype=object)
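One caveat with mode imputation is that the mode is computed while `?` is still present, so if `?` were ever the most frequent value the replacement would change nothing. A possible variant, sketched here on a made-up series (`toy` is an illustrative assumption), first masks the placeholder as a true missing value and only then takes the mode.

```python
import pandas as pd

# Toy series standing in for bc['node-caps']; values are illustrative only.
toy = pd.Series(["no", "yes", "?", "no"], name="node-caps")

# Turn the '?' placeholder into NaN so it cannot be picked as the mode.
as_missing = toy.mask(toy == "?")

# Series.mode() ignores NaN by default; [0] takes the first mode on ties.
most_frequent = as_missing.mode()[0]

# Fill the missing entries with the most frequent real value.
imputed = as_missing.fillna(most_frequent)
print(imputed.tolist())  # ['no', 'yes', 'no', 'no']
```

For the columns in this notebook the result should match the `replace` approach used above, since `?` is not the most frequent value in either column.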

Now we'll look into variable datatypes.

In [143]:
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class        286 non-null    object
 1   age          286 non-null    object
 2   menopause    286 non-null    object
 3   tumor-size   286 non-null    object
 4   inv-nodes    286 non-null    object
 5   node-caps    286 non-null    object
 6   deg-malig    286 non-null    int64 
 7   breast       286 non-null    object
 8   breast-quad  286 non-null    object
 9   irradiat     286 non-null    object
dtypes: int64(1), object(9)
memory usage: 22.5+ KB


Other than the `deg-malig` variable, which is of type `int64`, all variables in the breast cancer dataset are of type `object`. Let's get a better idea of the reported value formats for each variable.

In [144]:
bc.head()

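|   | class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|-------|-----|-----------|------------|-----------|-----------|-----------|--------|-------------|----------|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |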


Because the variables with numeric values (`age`, `tumor-size`, and `inv-nodes`) are reported as ranges, and are therefore ordinal, there is no need to convert them to `float`, `int`, or `category` types. However, the `class`, `menopause`, `node-caps`, `breast`, `breast-quad`, and `irradiat` variables are non-ordinal categorical variables, so we cast them to the `category` type and then one-hot encode them.
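The range-valued columns are left as `object` in this notebook. If a later step ever needed their ordering made explicit, one possible approach (an assumption, not part of the analysis here) is an ordered categorical whose levels are sorted by the lower bound of each range. The series `toy_age` below is made up for illustration.

```python
import pandas as pd

# Toy series standing in for bc['age']; values are illustrative only.
toy_age = pd.Series(["30-39", "40-49", "60-69", "40-49"], name="age")

# Sort the distinct range labels by their numeric lower bound,
# so that e.g. '5-9' would come before '10-14'.
ordered_labels = sorted(toy_age.unique(), key=lambda r: int(r.split("-")[0]))

# Build an ordered categorical; its integer codes follow the range order.
age_ordered = pd.Categorical(toy_age, categories=ordered_labels, ordered=True)
print(age_ordered.codes.tolist())  # [0, 1, 2, 1]
```

Sorting by the parsed lower bound rather than alphabetically matters for columns such as `tumor-size`, where a plain string sort would place `10-14` before `5-9`.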

In [145]:
# cast the non-ordinal categorical columns to type category
cat_cols = ['class', 'menopause', 'node-caps', 'breast', 'breast-quad', 'irradiat']
bc = bc.astype({col: "category" for col in cat_cols})

bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   class        286 non-null    category
 1   age          286 non-null    object  
 2   menopause    286 non-null    category
 3   tumor-size   286 non-null    object  
 4   inv-nodes    286 non-null    object  
 5   node-caps    286 non-null    category
 6   deg-malig    286 non-null    int64   
 7   breast       286 non-null    category
 8   breast-quad  286 non-null    category
 9   irradiat     286 non-null    category
dtypes: category(6), int64(1), object(3)
memory usage: 11.6+ KB


We can see that all the columns to be one-hot encoded have been converted to type `category`.

In [146]:
# categorical to bit conversion
bc = pd.get_dummies(bc, columns=["class", "menopause", "node-caps", "breast", "breast-quad", "irradiat"], drop_first=True)
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   age                      286 non-null    object
 1   tumor-size               286 non-null    object
 2   inv-nodes                286 non-null    object
 3   deg-malig                286 non-null    int64 
 4   class_recurrence-events  286 non-null    bool  
 5   menopause_lt40           286 non-null    bool  
 6   menopause_premeno        286 non-null    bool  
 7   node-caps_yes            286 non-null    bool  
 8   breast_right             286 non-null    bool  
 9   breast-quad_left_low     286 non-null    bool  
 10  breast-quad_left_up      286 non-null    bool  
 11  breast-quad_right_low    286 non-null    bool  
 12  breast-quad_right_up     286 non-null    bool  
 13  irradiat_yes             286 non-null    bool  
dtypes: bool(10), int64(1), object(3)
memory us

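We can see that each variable with *n* categories has been split into *n-1* indicator columns. This is the effect of `drop_first=True`: the first category of each variable gets no column of its own and is represented by all of that variable's indicators being `False`.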
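The effect of `drop_first` is easiest to see on a tiny example. The sketch below uses a made-up one-column frame (`toy` is an illustrative assumption) with the three `menopause` levels and compares the encoded columns with and without the option.

```python
import pandas as pd

# Toy frame with a single three-level categorical column.
toy = pd.DataFrame({"menopause": ["premeno", "ge40", "lt40", "premeno"]})

# Full one-hot encoding: one indicator column per category (n columns).
full = pd.get_dummies(toy, columns=["menopause"])

# With drop_first=True the first category in sorted order ('ge40') is dropped,
# leaving n-1 columns.
reduced = pd.get_dummies(toy, columns=["menopause"], drop_first=True)

print(list(full.columns))     # ['menopause_ge40', 'menopause_lt40', 'menopause_premeno']
print(list(reduced.columns))  # ['menopause_lt40', 'menopause_premeno']
```

The reduced column names match the `menopause_lt40` and `menopause_premeno` columns in the `bc.info()` output above. Dropping one level avoids perfectly redundant indicator columns, since the dropped category can always be inferred from the others.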