This dataset contains 961 instances of masses detected in mammograms, with the following attributes:

BI-RADS assessment: 1 to 5 (ordinal)
Age: patient's age in years (integer)
Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
Severity: benign=0 or malignant=1 (binary)

BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. Age, shape, margin, and density are the features we will build our model with, and "severity" is the class we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't handle well, they are close enough to ordinal that we shouldn't just discard them. "Shape", for example, is coded in increasing order from round to irregular.
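
To make the nominal-versus-ordinal distinction concrete, here is a minimal sketch. The shape codes below are made up for illustration, and the one-hot alternative via `pd.get_dummies` is not what this notebook does; it is shown only for contrast with keeping the integer codes.

```python
import pandas as pd

# Code-to-label mapping taken from the attribute list above.
shape_labels = {1: "round", 2: "oval", 3: "lobular", 4: "irregular"}

# A few made-up shape codes, for illustration only (not rows from the dataset).
shape_codes = pd.Series([3, 1, 4, 1, 2], name="Shape")

# Ordinal view: keep the integer codes, which implies
# round < oval < lobular < irregular. This is the approach taken here.
ordinal_view = shape_codes

# Nominal view: one indicator column per category, with no implied order.
nominal_view = pd.get_dummies(shape_codes.map(shape_labels))

print(ordinal_view.tolist())
print(nominal_view.columns.tolist())
```

Keeping the integer codes gives the model a single "shape" coefficient, while the one-hot view would give it one coefficient per shape category.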

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [2]:
data_path = 'mammographic+mass'
col_names = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']
df_data = pd.read_csv(os.path.join(data_path, 'mammographic_masses.data'), names=col_names, na_values='?')
df_data.head()

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [3]:
df_data.shape

(961, 6)

In [4]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   BI-RADS   959 non-null    float64
 1   Age       956 non-null    float64
 2   Shape     930 non-null    float64
 3   Margin    913 non-null    float64
 4   Density   885 non-null    float64
 5   Severity  961 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 45.2 KB


In [5]:
df_data.describe()

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


In [6]:
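# Drop every row that has at least one missing value (961 -> 830 rows).
df_data = df_data.dropna()
# Min-max scale each column to the range [0, 1]; Severity keeps the values 0 and 1.
df_data = (df_data - df_data.min()) / (df_data.max() - df_data.min())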

In [7]:
df_data.describe()

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,0.07989,0.484384,0.593976,0.453313,0.638554,0.485542
std,0.034334,0.1881,0.41412,0.391794,0.116979,0.500092
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.072727,0.358974,0.333333,0.0,0.666667,0.0
50%,0.072727,0.5,0.666667,0.5,0.666667,0.0
75%,0.090909,0.615385,1.0,0.75,0.666667,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0
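
The rescaling in cell [6] is min-max normalization: each column is mapped to [0, 1] via (x - min) / (max - min), which is why every column above has min 0 and max 1. As a rough sketch, the same result can be obtained with sklearn's `MinMaxScaler`; the `toy` frame below is a made-up stand-in, not the mammography data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frame standing in for two of the feature columns.
toy = pd.DataFrame({"Age": [28.0, 43.0, 58.0, 74.0],
                    "Margin": [1.0, 1.0, 5.0, 5.0]})

# Manual min-max scaling, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation using sklearn's scaler.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual.to_numpy(), scaled.to_numpy()))
```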


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [9]:
X = df_data[['Age', 'Shape', 'Margin', 'Density']]
y = df_data['Severity']

In [10]:
C = 1.0  # inverse regularization strength (sklearn's default)
reg = LogisticRegression(C=C)
reg.fit(X, y)

In [11]:
scores = cross_val_score(reg, X, y, cv=10)
print(scores)
print(scores.mean())

[0.74698795 0.79518072 0.85542169 0.81927711 0.84337349 0.73493976
 0.79518072 0.81927711 0.86746988 0.75903614]
0.8036144578313253
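
One possible variation, sketched under assumptions rather than taken from this notebook: placing the scaler in a pipeline means each fold computes its min and max from its own training rows only, instead of from the full table as in cell [6]. The data below comes from `make_classification` as a synthetic stand-in, and the names `X_demo`, `y_demo`, `pipe`, and `demo_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: four features and a binary target (not the mammography data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

# The scaler is refit on the training part of every fold.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(C=1.0))

# Explicit stratified 10-fold split, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print(demo_scores)
print(demo_scores.mean())
```

With an integer `cv` and a classifier, as in cell [11], `cross_val_score` already uses stratified folds; the explicit `StratifiedKFold` here only adds shuffling.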
