## Load Dependencies

In [13]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import plotly.express as px

## Load Raw Data

In [3]:
data = pd.read_csv('Resources/original-data.csv')

## Data Exploration

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

The data contains 569 instances with 30 features, one target column (diagnosis), and an id column. The id column carries no predictive information, so we will drop it.

We will also encode the target column, diagnosis, as 1s and 0s (1 = malignant, 0 = benign).

A first observation is that there are no null values, which simplifies preprocessing. However, there are only 569 observations, so we will likely need to upsample in order to use more advanced ML models.

Also, each feature appears to be a numeric value computed from a radiology image. Their individual meanings are unclear to us at this point, so we will leave them as they are and not pursue feature engineering.



In [5]:
data = pd.get_dummies(data=data, dtype=int, columns=['diagnosis'])
data.drop(['id', 'diagnosis_B'], axis=1, inplace=True)
data.rename(columns={'diagnosis_M':'malignant'}, inplace=True)
data.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1
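
The encoding step above can be checked on a toy frame. The sketch below uses made-up rows (the values and the names toy, encoded and mapped are illustrative assumptions, not part of the data set) and compares the get_dummies route from the notebook with a direct label mapping, which should give the same 0/1 column.

```python
import pandas as pd

# Toy frame with hypothetical values; only the 'diagnosis' labels matter here.
toy = pd.DataFrame({'id': [1, 2, 3, 4],
                    'diagnosis': ['M', 'B', 'B', 'M'],
                    'radius_mean': [17.99, 11.42, 12.30, 20.57]})

# Route used in the notebook: one-hot encode, then keep only the malignant indicator.
encoded = pd.get_dummies(data=toy, dtype=int, columns=['diagnosis'])
encoded = encoded.drop(['id', 'diagnosis_B'], axis=1)
encoded = encoded.rename(columns={'diagnosis_M': 'malignant'})

# Alternative: map the two labels directly to integers.
mapped = toy['diagnosis'].map({'M': 1, 'B': 0})

print(encoded['malignant'].tolist())
print(mapped.tolist())
```

Both routes mark M as 1 and B as 0. The mapping variant is shorter, while get_dummies generalizes to targets with more than two labels.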


## Data Upsampling

First, check whether the two classes are balanced.

In [6]:
data.malignant.value_counts()

malignant
0    357
1    212
Name: count, dtype: int64

The data is mildly imbalanced, with 212 malignant instances and 357 benign. We will upsample both classes to 500 instances each, giving a balanced data set of 1,000 instances.
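
To make the imbalance and the upsampling target concrete, the following sketch works from the two counts printed above. The variable names are illustrative assumptions; only the counts 357 and 212 and the target of 500 per class come from the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 357, 1: 212}, name='count')

proportions = counts / counts.sum()            # share of each class
imbalance_ratio = counts.max() / counts.min()  # majority-to-minority ratio
synthetic_needed = 500 - counts                # rows to synthesize per class

print(proportions.round(3).to_dict())
print(round(imbalance_ratio, 2))
print(synthetic_needed.to_dict())
```

Reaching 500 per class means 143 synthetic benign rows and 288 synthetic malignant rows, so a little over 40% of the resampled data set is synthetic.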

In [7]:
# Split the data into target vector and feature matrix
y = data.pop('malignant')
X = data.copy()

# Apply SMOTE to balance the classes and upsample to N = 1,000
sm = SMOTE(sampling_strategy={0:500, 1:500}, k_neighbors=2, random_state=42)
X_res, y_res = sm.fit_resample(X, y)

print(f'Shape of y_res: {y_res.shape}') 
print(f'Shape of X_res: {X_res.shape}')


Shape of y_res: (1000,)
Shape of X_res: (1000, 30)
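
For intuition about where the extra rows come from: SMOTE creates each synthetic sample by picking a point, choosing one of its k nearest same-class neighbors, and interpolating between the two. The sketch below illustrates that idea with NumPy on made-up 2-D points. It is a simplified illustration, not imblearn's implementation, and the function synthesize is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical same-class points in a 2-feature space.
points = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.9], [2.0, 2.1]])
k_neighbors = 2  # same neighbor count as the SMOTE call above


def synthesize(points, k, rng):
    """Create one synthetic point by interpolating toward a random near neighbor."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every point; position 0 after sorting is itself.
    dists = np.linalg.norm(points - base, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    neighbor = points[rng.choice(neighbor_idx)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbor - base)


new_point = synthesize(points, k_neighbors, rng)
print(new_point)
```

Because every synthetic row lies on a segment between two existing rows of the same class, the resampled data adds density inside each class but no information beyond the original 569 observations.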


## Split into Training and Test Sets

We consider 1,000 instances just enough to train a random forest with confidence, so we will use a 75/25 split of training to test data.

In [10]:
# test_size=0.25 is the default; it is stated explicitly to match the 75/25 split
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.25, random_state=42)
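
The 75/25 ratio comes from train_test_split's default test size when no size argument is passed. The sketch below confirms the resulting row counts on placeholder arrays with the same number of rows as the resampled data, and shows the optional stratify argument as a variant. The notebook does not use stratify; it is included here only as an assumption about what one might try.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same row count as X_res / y_res (values are placeholders).
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0] * 500 + [1] * 500)

# Default split: test_size is 0.25 when neither size argument is given.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Optional variant: stratify keeps the class ratio identical in both subsets.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

print(len(X_tr), len(X_te))
print(y_tr_s.mean(), y_te_s.mean())
```

Without stratify the class ratio in each subset can drift slightly from 50/50 by chance; with it, both subsets keep exactly the ratio of the full data.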

## Principal Component Analysis

This section explores whether PCA is useful for this data set.

We fit PCA with 1 to 30 components and train a linear regression on each projection to see whether a particular number of components improves performance. The score reported is the R² returned by LinearRegression.score on the test set.

In [14]:
n_components = []
lr_score = []

for i in range(1, 31):
    pca_model = PCA(n_components = i)
    pca_model.fit(X_train)
    X_train_pca = pd.DataFrame(pca_model.transform(X_train))
    X_test_pca = pd.DataFrame(pca_model.transform(X_test))
    
    model = LinearRegression()
    model.fit(X_train_pca, y_train)
    
    n_components.append(i)
    lr_score.append(model.score(X_test_pca, y_test))
    
pca_results = pd.DataFrame({'n_components':n_components, 'lr_score':lr_score})

fig = px.line(pca_results, x='n_components', y='lr_score', title='PCA Components vs. Linear Regression Score')
fig.show()

There appears to be a plateau at 17 components. Next, we compare that result to a linear regression on the raw data.
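
As a complementary way to judge how many components are worth keeping, one can look at the cumulative explained variance of a fitted PCA. The sketch below runs on synthetic placeholder data rather than the notebook's features, and the 95% threshold is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder data: 200 rows, 6 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

pca = PCA()  # keep all components
pca.fit(X_demo)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print(n_for_95)
```

PCA is sensitive to feature scale, so standardizing the features first (for example with StandardScaler) can change which components dominate and where a plateau appears.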

In [15]:
# Linear Regression of Raw data
model = LinearRegression()
model.fit(X_train, y_train)
print(f'Linear Regression Score of Raw Data: {model.score(X_test, y_test)}')

# Linear Regression of PCA data
pca_model = PCA(n_components = 17)
pca_model.fit(X_train)
X_train_pca = pd.DataFrame(pca_model.transform(X_train))
X_test_pca = pd.DataFrame(pca_model.transform(X_test))

model = LinearRegression()
model.fit(X_train_pca, y_train)
print(f'Linear Regression Score of PCA Data: {model.score(X_test_pca, y_test)}')

Linear Regression Score of Raw Data: 0.7730529547920932
Linear Regression Score of PCA Data: 0.7750694136729325
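
Note that LinearRegression.score returns the coefficient of determination (R²), not a classification accuracy. If an accuracy figure is wanted from the same model, one option is to threshold its continuous predictions. The sketch below does this on synthetic data; the 0.5 cutoff and the variable names are assumptions for illustration, and it is not a step taken in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the notebook's features and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)                       # coefficient of determination
labels = (model.predict(X_te) >= 0.5).astype(int)  # threshold the continuous output
accuracy = accuracy_score(y_te, labels)            # fraction of correct labels

print(f'R^2: {r2:.3f}')
print(f'Accuracy at 0.5 threshold: {accuracy:.3f}')
```

The two numbers measure different things, so an R² near 0.77 should not be read as 77% of cases classified correctly.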


### Conclusion

A principal component analysis with 17 components yields roughly the same test R² (0.775) as the same model fit on the raw data (0.773).

Because the raw data has only 13 more features than the 17-component projection, we choose to proceed with the raw data: it is computationally feasible and we do not wish to exclude any predictive power.

## Finalize Data

In [16]:
training_data = pd.concat([X_train, y_train], axis=1)
testing_data = pd.concat([X_test, y_test], axis=1)

training_data.to_csv('Resources/training_data.csv', index=False)
testing_data.to_csv('Resources/testing_data.csv', index=False)
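
A downstream notebook can recover the features and target from these files by popping the malignant column. The sketch below shows the round trip on small placeholder frames written to a temporary directory; the values and the temporary path are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Small placeholder frames; the index is shuffled as it would be after train_test_split.
X_part = pd.DataFrame({'radius_mean': [11.4, 20.3, 17.9]}, index=[3, 4, 0])
y_part = pd.Series([1, 1, 1], index=[3, 4, 0], name='malignant')

# concat with axis=1 aligns rows on the index, so features and target stay paired.
combined = pd.concat([X_part, y_part], axis=1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'training_data.csv')
    combined.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Recover features and target the same way the notebook split them.
y_back = restored.pop('malignant')
X_back = restored
print(X_back.shape, y_back.shape)
```

Since index=False drops the shuffled index, the restored frame gets a fresh 0-based index while the row order is preserved.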