## Data Source

The Wisconsin Breast Cancer dataset is an important resource derived from digitized images of fine-needle aspirates (FNA) of breast masses. The dataset contains breast cancer cases collected at the Wisconsin Hospitals, Madison, by Wolberg (2018), and describes in detail the cell nuclei within each breast mass.

In [10]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd

def load_dataset() -> pd.DataFrame:
    # Fetch the dataset from the UCI repository (id=15)
    breast_cancer_wisconsin_original = fetch_ucirepo(id=15)
    data = breast_cancer_wisconsin_original.data
    # Combine features and target into a single data frame
    df = data.features.join(data.targets)
    return df

df = load_dataset()
display(df.head())

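| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |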


### Expected Data Schema

The Breast Cancer Wisconsin dataset is expected to have a simple schema of 11 attributes, listed below:
- **Sample code number**: A unique identifier that distinguishes one sample from another. This attribute is not meaningful for our analysis, so it is excluded from the data frame.
- **Clump thickness**: Measures how thick the clump of cells is, on a scale from 1 (smallest) to 10 (largest). The thickness of the clump can be suggestive of the seriousness of the cancer.
- **Uniformity of cell size**: Assesses how uniform the cancer cell sizes are, scored from 1 to 10. Higher scores could indicate a more aggressive cancer.
- **Uniformity of cell shape**: Assesses how uniform the cancer cell shapes are, scored from 1 to 10. Higher scores could indicate a more aggressive cancer.
- **Marginal adhesion**: Measures how strongly cancer cells adhere to one another and to their own tissue (i.e. stromal invasion capacity), scored from 1 (low adhesion) to 10 (high adhesion). A highly invasive malignancy may show high neoplastic adhesion.
- **Single epithelial cell size**: Measures the size of individual epithelial cells, scored from 1 (smallest) to 10 (largest). Larger cells could suggest a more aggressive cancer.
- **Bare nuclei**: Scores the number of bare nuclei (nuclei without surrounding cytoplasm) from 1 (few) to 10 (many). Many bare nuclei may indicate a highly malignant tumour.
- **Bland chromatin**: Describes the texture of the genetic material in the nucleus, scored from 1 (smooth) to 10 (coarse). Higher scores may indicate a more aggressive cancer.
- **Normal nucleoli**: Rates the number of normal (non-cancerous) nucleoli in a cell from 1 (few) to 10 (many). A small number of normal nucleoli may suggest a more highly malignant tumour.
- **Mitoses**: Scores mitotic activity, the number of cells currently undergoing cell division, from 1 (no considerable mitotic activity) to 10 (many mitoses) [24]. The higher the score, the more likely the cancer is to be aggressive.
- **Class**: The target variable, which labels each sample as benign (2) or malignant (4).


In [11]:
# df.info() prints its summary directly and returns None, so it is not wrapped in display()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump_thickness              699 non-null    int64  
 1   Uniformity_of_cell_size      699 non-null    int64  
 2   Uniformity_of_cell_shape     699 non-null    int64  
 3   Marginal_adhesion            699 non-null    int64  
 4   Single_epithelial_cell_size  699 non-null    int64  
 5   Bare_nuclei                  683 non-null    float64
 6   Bland_chromatin              699 non-null    int64  
 7   Normal_nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB

## Data Profiling

Data profiling is a technique for observing the distribution of each feature. It gives an overview of the data by reporting minimum and maximum values, the mean, the count, and any null or infinite values. This helps us identify outliers and other problems that can then be handled through data cleansing. Data profiling is therefore an important starting point for understanding what cleaning the dataset needs before moving on to tasks such as building machine learning models.
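
As a rough illustration of the overview described above, the following sketch computes the count, minimum, maximum, mean, and the number of null and infinite values per column using plain pandas. The helper name `profile_summary` and the toy data are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def profile_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column count, min, max, mean, null count and infinite count."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "count": numeric.count(),              # non-null observations
        "min": numeric.min(),
        "max": numeric.max(),
        "mean": numeric.mean(),
        "nulls": numeric.isnull().sum(),       # missing values
        "infinite": np.isinf(numeric).sum(),   # +inf / -inf values
    })

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.inf, 30.0, 40.0],
})
summary = profile_summary(toy)
print(summary)
```

A dedicated profiling framework reports far more than this, but the table above is the core of what a profile contains.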

Tukey’s (1977) seminal work "Exploratory Data Analysis" (EDA) is a ground-breaking book in the field of data analysis. It introduces and explains the principles of exploratory data analysis, which involves analyzing datasets to summarize their main characteristics using statistical graphics and other visualization methods.

YData Profiling API (n.d.), one of the frameworks implementing Tukey’s (1977) EDA methodology, is adopted to generate the EDA report and provide quick insights into the features and labels of the dataset before any further ML tasks are undertaken.


### EDA Framework

An EDA framework helps describe a set of data features, expose their underlying structure, extract important variables, identify anomalies and outliers, and test underlying assumptions. An EDA report may reveal the following problems:
- **Missing Values** - EDA reveals which columns have missing values. Depending on the proportion missing, these values are either imputed or removed.
- **Outliers** - EDA helps detect observations that differ greatly from the rest. Such extreme values may be genuine or erroneous.
- **Distribution of Data** - EDA shows how each feature is distributed. Skewed features may cause some machine learning algorithms to perform worse than expected.
- **Correlation** - EDA shows whether features are highly correlated with one another, which leads to multicollinearity in linear regression models.
- **Constant Features** - EDA identifies features that take a single value. They carry no useful information and should be removed.
- **Categorical Variables** - EDA shows how many categorical variables exist and what their distinct categories are. Categories with very few observations may require special treatment.
- **Feature Magnitude** - EDA shows whether features are measured on different scales, which matters for machine learning algorithms that need a uniform scale across features.

## Data Cleansing

The data cleansing stage handles missing values, outliers, feature engineering, and so on. This may involve techniques such as imputation or removal of instances with missing values, depending on the proportion of missing data and the specific requirements of the analysis or model.

To keep the data processing pipeline efficient and coherent, we merge Exploratory Data Analysis (EDA) with the data cleansing step. Missing values, outliers and other red flags can then be spotted during clean-up. This ensures that the data is not only clean but also well understood, which leads to more precise and dependable subsequent analyses and models, and that all relevant information about the dataset has been gathered before any decision on further pre-processing is made.


### Constant Features

In Exploratory Data Analysis (EDA), the first step is to identify constant features so that they can be removed, since they contribute no information and cannot be used in predictive modeling. A “constant feature” is a feature that has the same value for every record in the dataset.

In [12]:
# Find constant features
def detect_constant_features(df: pd.DataFrame) -> list:
    return [col for col in df.columns if df[col].nunique() == 1]
  
constant_features = detect_constant_features(df)

# Print constant features
print('Constant features:', constant_features)


Constant features: []


The dataset contains no constant features.

### Categorical Variables

It is important to identify and analyze categorical variables as they can greatly impact one’s analysis and predictive models. Some categorical variables may need special treatment like encoding, grouping or even excluding them altogether from the analysis if they contain too many categories (high cardinality) or very few observations (low frequency).

The Breast Cancer Wisconsin Original dataset consists primarily of integer features, with a categorical target variable (2 for benign, 4 for malignant). The target should be properly encoded (e.g., 0 for benign and 1 for malignant) for the later data analysis models.


In [13]:
df['Class'] = df['Class'].map({2: 0, 4: 1})

### Missing Values

EDA reveals which columns have missing values. Depending on the proportion missing, these values are either imputed or removed.

In this dataset, 16 missing values are identified in the attribute “Bare nuclei”, a missing ratio of 2.3%.
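
The missing ratio quoted above is simply the share of null entries in a column. The sketch below reproduces the arithmetic on a hypothetical stand-in column with 16 nulls out of 699 rows; the helper name `missing_ratio` is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

def missing_ratio(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column."""
    return frame.isnull().mean() * 100

# Hypothetical stand-in: 16 missing values out of 699 rows
toy = pd.DataFrame({"Bare_nuclei": [np.nan] * 16 + [1.0] * 683})
ratio = missing_ratio(toy)
print(ratio.round(1))
```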

In [14]:
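import pandas as pd

# Count the missing values in each column
def count_missing_values(df: pd.DataFrame) -> pd.Series:
    return df.isnull().sum()

# The result uses a different name from the function so the cell can be re-run
missing_values = count_missing_values(df)
print("Missing values in each column:\n", missing_values)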

Missing values in each column:
 Clump_thickness                 0
Uniformity_of_cell_size         0
Uniformity_of_cell_shape        0
Marginal_adhesion               0
Single_epithelial_cell_size     0
Bare_nuclei                    16
Bland_chromatin                 0
Normal_nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64


Clinical datasets often have a high proportion of missing values, which may jeopardize modeling if not well managed. One way to overcome this is to impute the missing values. Traditional approaches rely on simple statistics, such as mean imputation, or on complete-case analysis, and both have limitations that can degrade learning performance.

Imputation is preferable to discarding the affected rows, because it retains important data and can improve the accuracy of learning algorithms. Commonly used methods include k-nearest neighbours (k-NN), random forests (RF) and support vector machines (SVM), since they are robust and perform well with missing data (Wu et al., 2020).

The k-NN imputation method was used to fill the missing values of the attribute “Bare nuclei”. Huang and Cheng (2020) carried out a case study on the Breast Cancer Wisconsin dataset to show the effectiveness of k-NN imputation. Models trained on k-NN-imputed data had high sensitivity (0.8810) and specificity (0.9859), meaning that the imputation did not undermine the model’s capacity to correctly identify positive and negative cases. The resulting balanced accuracy (0.9334) supports relying on k-NN imputation for this dataset.
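
For readers checking the figures, balanced accuracy is the mean of sensitivity and specificity. The short calculation below uses only the two values quoted above and assumes nothing else about the cited study.

```python
sensitivity = 0.8810  # value quoted from the cited study
specificity = 0.9859  # value quoted from the cited study

# Balanced accuracy is the average of the two rates
balanced_accuracy = (sensitivity + specificity) / 2
print(f"{balanced_accuracy:.5f}")
```

The result is about 0.93345, which agrees with the reported 0.9334 up to rounding.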

In [15]:
from sklearn.impute import KNNImputer

def fix_missing_values_using_knn(df: pd.DataFrame, neighbors: int) -> pd.DataFrame:
    # Using k-NN Imputation to fix missing value
    imputer = KNNImputer(n_neighbors=neighbors, weights='uniform', metric='nan_euclidean')

    # Perform imputation
    df_imputed = imputer.fit_transform(df)
    df = pd.DataFrame(df_imputed, columns=df.columns)
    return df
  

df = fix_missing_values_using_knn(df, neighbors=5)

# Find missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 Clump_thickness                0
Uniformity_of_cell_size        0
Uniformity_of_cell_shape       0
Marginal_adhesion              0
Single_epithelial_cell_size    0
Bare_nuclei                    0
Bland_chromatin                0
Normal_nucleoli                0
Mitoses                        0
Class                          0
dtype: int64


### Feature Magnitude

Feature magnitude issues can happen when some features in a dataset have different scales. This may impact the efficiency of data analysis algorithms that are scale sensitive. 


A boxplot is a graphical representation of statistical data that can be used to detect outliers. Outliers are usually defined as values that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range (Q3 - Q1).

A boxplot consists of a box representing the IQR, a line inside the box marking the median, and two whiskers extending from the box to cover the values within 1.5 × IQR of the quartiles. Points lying beyond the whiskers are outliers.
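
The 1.5 × IQR rule described above can be sketched in a few lines of pandas. The function name `iqr_outliers` and the toy series are hypothetical; the quartiles use pandas' default linear interpolation, which may differ slightly from the method a plotting library uses.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical toy data with one obvious extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
outliers = iqr_outliers(toy)
print(outliers.tolist())
```

Here Q1 = 3.25 and Q3 = 7.75, so the upper fence is 14.5 and only the value 100 is flagged.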

In [18]:
import altair as alt

def show_boxplot(df: pd.DataFrame) -> alt.Chart:
    # Boxplot for visualizing feature magnitudes
    boxplot = alt.Chart(df).transform_fold(
        df.columns.tolist(),
        as_=['Feature', 'Value']
    ).mark_boxplot().encode(
        x='Feature:N',
        y='Value:Q'
    ).properties(
        width=600,
        height=400,
        title='Feature Magnitude Boxplot'
    )
    return boxplot

show_boxplot(df.drop(columns="Class")).show()

Based on the above boxplot, the following feature magnitude problems may be encountered:

- **Scale differences**: Without standardization or normalization, a machine learning algorithm may perform poorly because some features take much larger or smaller values than others.
- **Outliers**: Extreme values can skew the results of an analysis or of model training.
- **Variability**: Differences in spread between features may affect their weight in several statistical models and machine learning algorithms.

Feature scaling techniques such as normalization and standardization, together with outlier detection and removal methods, may be applied to handle such cases before the data is used for further analysis or model training.
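
As a rough illustration of the two scaling techniques named above, the sketch below applies standardization (z-scores) and min-max normalization to a small hypothetical data frame. The column names and values are invented for the example; the population standard deviation (`ddof=0`) is used because that is what scikit-learn's `StandardScaler` uses.

```python
import pandas as pd

def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to zero mean and unit (population) variance."""
    return (frame - frame.mean()) / frame.std(ddof=0)

def min_max_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to the [0, 1] range."""
    return (frame - frame.min()) / (frame.max() - frame.min())

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0, 5.0],
    "large": [100.0, 200.0, 300.0, 400.0, 500.0],
})
standardized = standardize(toy)
normalized = min_max_normalize(toy)
print(standardized.round(3))
print(normalized)
```

After either transformation the two columns become identical, which shows that only their scale differed.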

In [19]:
from sklearn.preprocessing import StandardScaler
import altair as alt

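def fix_feature_magnitude(df: pd.DataFrame) -> pd.DataFrame:
    X = df.drop(columns='Class')
    y = df['Class']

    # Standardize the features (zero mean, unit variance)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Take column names and index from X rather than assuming 'Class' is the last column
    df_standardscaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

    # Shift each feature so that its minimum is zero (all values become non-negative)
    df_standardscaled = df_standardscaled - df_standardscaled.min()

    # Re-attach the target, which is left unscaled
    df_standardscaled['Class'] = y
    return df_standardscaled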
  
def visualize_outliers(df: pd.DataFrame):
    # Boxplot for visualizing feature magnitudes
    boxplot = alt.Chart(df).transform_fold(
        df.columns.tolist(),
        as_=['Feature', 'Value']
    ).mark_boxplot().encode(
        x='Feature:N',
        y='Value:Q'
    ).properties(
        width=600,
        height=400,
        title='Feature Magnitude Boxplot'
    )
    boxplot.show()

df_scaled = fix_feature_magnitude(df)
display(df_scaled.head())
visualize_outliers(df_scaled.drop(columns="Class"))

| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.421603 | 0.0 | 0.0 | 0.0 | 0.451933 | 0.0 | 0.820809 | 0.0 | 0.0 | 0.0 |
| 1 | 1.421603 | 0.98384 | 1.010174 | 1.401868 | 2.7116 | 2.482765 | 0.820809 | 0.327713 | 0.0 | 0.0 |
| 2 | 0.710801 | 0.0 | 0.0 | 0.0 | 0.451933 | 0.275863 | 0.820809 | 0.0 | 0.0 | 0.0 |
| 3 | 1.777003 | 2.295627 | 2.357072 | 0.0 | 0.903867 | 0.827588 | 0.820809 | 1.966279 | 0.0 | 0.0 |
| 4 | 1.066202 | 0.0 | 0.0 | 0.700934 | 0.451933 | 0.0 | 0.820809 | 0.0 | 0.0 | 0.0 |


After scaling, the feature magnitude issue has improved, although outliers remain.

### Outliers

Outliers are observations that deviate significantly from other data points. They can be detected in several ways during Exploratory Data Analysis (EDA). The previous boxplot marks as outliers the points lying beyond the whiskers, i.e. more than 1.5 × IQR from the quartiles.


Agarwal & Gupta (2021) compared different outlier detection techniques to help data scientists select an algorithm for building a better model. They concluded that the Angle-based Outlier Detection (ABOD) and One-class SVM (OCSVM) techniques improved data analysis and machine learning model performance the most across classifiers. In addition, each classifier had specific outlier detection techniques that performed best.

Given the characteristics of the Breast Cancer Wisconsin Original dataset, OCSVM appears to be a more suitable outlier detection technique than ABOD. OCSVM's ability to handle imbalanced datasets and its computational efficiency make it a better fit for this clinical dataset.

The following code applies OCSVM for further outlier detection.

In [20]:
from sklearn.svm import OneClassSVM

def detect_outliers_using_ocsvm(df: pd.DataFrame) -> pd.DataFrame:
    X = df.drop(columns='Class')
    y = df['Class']

    # Fit One-Class SVM on the features only
    ocsvm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.05)
    ocsvm.fit(X)

    # predict() returns -1 for outliers and 1 for inliers
    y_pred = ocsvm.predict(X)
    df_ocsvm = X.copy()  # copy so the caller's data frame is not modified
    df_ocsvm['Outlier'] = y_pred == -1
    df_ocsvm['Class'] = y
    return df_ocsvm
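
To give a feel for the `nu` parameter used above, the sketch below fits a One-Class SVM to hypothetical synthetic data and reports the share of points it flags. In scikit-learn, `nu` is documented as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so the flagged share should land near `nu`. The data, the seed and the choice of `gamma="scale"` are assumptions made for this illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical synthetic data: 300 points from a 2-D standard normal
rng = np.random.default_rng(0)
points = rng.normal(size=(300, 2))

# nu=0.05 asks the model to treat roughly 5% of the points as outliers
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(points)

# predict() returns -1 for outliers and 1 for inliers
flags = model.predict(points) == -1
flagged_fraction = flags.mean()
print(f"flagged fraction: {flagged_fraction:.3f}")
```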

Visualize outliers

In [21]:
import altair as alt

# Run the detector once and reuse the result for every chart
df_outliers = detect_outliers_using_ocsvm(df_scaled)

charts = []
feature_columns = df_scaled.drop(columns=["Class"]).columns
for col in feature_columns:
    chart = alt.Chart(df_outliers).mark_circle(size=60).encode(
        x=alt.X('Class:N', title='Class'),
        y=alt.Y(col, title=col),
        color=alt.Color('Outlier:N',
                        scale=alt.Scale(domain=[True, False], range=['blue', 'yellow']),
                        legend=alt.Legend(title='Outlier')),
        tooltip=[col, 'Class']
    ).properties(
        width=200,
        height=200,
        title=f'{col} vs Class'
    )
    charts.append(chart)

# Arrange the charts in a grid, three per row
combined_charts = alt.vconcat(*[alt.hconcat(*charts[i:i+3]) for i in range(0, len(charts), 3)])
combined_charts.show()

Through the above scatter charts, we can easily identify four attributes with outliers: Clump Thickness, Single Epithelial Cell Size, Bare Nuclei, and Mitoses.

In medical datasets such as the Breast Cancer Wisconsin Original dataset, outliers may carry clinically meaningful information, for example by pointing to new ways of treating patients, so they should not simply be discarded. The Winsorizing method is therefore adopted to reduce the impact of outliers on further data analysis, instead of removing them.
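
To make the effect of Winsorizing concrete, the sketch below applies `scipy.stats.mstats.winsorize` to a hypothetical array of 20 values with one extreme value at each end. With limits of 5% on both sides, the single lowest and single highest values are replaced by their nearest remaining neighbours, and no rows are dropped.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical toy data: 20 values, with -50 and 100 as the extremes
toy = np.array([-50] + list(range(2, 20)) + [100])

# Replace the lowest 5% and highest 5% (one value each) with the nearest kept value
clipped = winsorize(toy, limits=[0.05, 0.05])
print(clipped.min(), clipped.max(), len(clipped))
```

The array keeps its length of 20, while its minimum becomes 2 and its maximum becomes 19.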


In [22]:
from scipy.stats.mstats import winsorize

def fix_outliers_using_winsorizing(df: pd.DataFrame, outlier_features, limits) -> pd.DataFrame:
    df_winsorized = df.copy()
    for feature in outlier_features:
        # Use 5th and 95th percentile as lower and upper limits
        df_winsorized[feature] = winsorize(df_winsorized[feature], limits=limits)
    return df_winsorized
    
df_winsorized = fix_outliers_using_winsorizing(df_scaled, outlier_features=['Clump_thickness', 'Single_epithelial_cell_size', 'Bare_nuclei', 'Mitoses'], 
                                               limits=[0.05, 0.05])
display(df_winsorized.head())
visualize_outliers(df_winsorized.drop(columns="Class"))

| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.421603 | 0.0 | 0.0 | 0.0 | 0.451933 | 0.0 | 0.820809 | 0.0 | 0.0 | 0.0 |
| 1 | 1.421603 | 0.98384 | 1.010174 | 1.401868 | 2.7116 | 2.482765 | 0.820809 | 0.327713 | 0.0 | 0.0 |
| 2 | 0.710801 | 0.0 | 0.0 | 0.0 | 0.451933 | 0.275863 | 0.820809 | 0.0 | 0.0 | 0.0 |
| 3 | 1.777003 | 2.295627 | 2.357072 | 0.0 | 0.903867 | 0.827588 | 0.820809 | 1.966279 | 0.0 | 0.0 |
| 4 | 1.066202 | 0.0 | 0.0 | 0.700934 | 0.451933 | 0.0 | 0.820809 | 0.0 | 0.0 | 0.0 |


### Distribution of Data

Skewed data can affect how statistical analyses and machine learning models perform. To address this, skewed features can be transformed so that their distribution becomes more symmetric and closer to normal. This section discusses approaches to dealing with skewed data using Python.

In [23]:
X = df_winsorized.drop(columns='Class')
y = df_winsorized['Class']

skewness = X.skew(axis=0, skipna=True)
print(skewness)

Clump_thickness                0.592859
Uniformity_of_cell_size        1.233137
Uniformity_of_cell_shape       1.161859
Marginal_adhesion              1.524468
Single_epithelial_cell_size    1.380546
Bare_nuclei                    0.998480
Bland_chromatin                1.099969
Normal_nucleoli                1.422261
Mitoses                        2.540104
dtype: float64


Using the common rule of thumb that an absolute skewness above 1 is high and one between 0.5 and 1 is moderate, Clump thickness (0.592859) and Bare nuclei (0.998480) are moderately skewed, while all other features are highly skewed. A logarithmic transformation is one technique for reducing skewness; however, it is only possible with positive data. The Box-Cox transformation finds the power transformation that brings the data as close to a normal distribution as possible, thereby reducing its skewness.
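
For reference, the one-parameter Box-Cox transformation of a strictly positive value $x$ is usually written as below, where $\lambda$ is the power parameter. When no value is supplied, `scipy.stats.boxcox` estimates $\lambda$ by maximum likelihood.

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
\qquad x > 0
$$

The $\lambda = 0$ case is the logarithmic transformation, which is why both transformations require positive input.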

In [24]:
# Use box-cox transformation
from scipy.stats import boxcox

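def fix_skewness_using_boxcox(df: pd.DataFrame) -> pd.DataFrame:
    df_transformed = df.copy()  # leave the caller's data frame untouched
    for feature in df_transformed.columns:
        # Box-Cox requires strictly positive input, so shift the zero-based values by 1
        df_transformed[feature], _ = boxcox(df_transformed[feature] + 1)
    return df_transformed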

df_boxcoxed = fix_skewness_using_boxcox(df_winsorized.drop(columns='Class'))
skewness = df_boxcoxed.skew(axis=0, skipna=True)
print(skewness)

df_boxcoxed["Class"] = df_winsorized["Class"]

Clump_thickness               -0.017209
Uniformity_of_cell_size        0.473817
Uniformity_of_cell_shape       0.380186
Marginal_adhesion              0.583534
Single_epithelial_cell_size    0.088808
Bare_nuclei                    0.576200
Bland_chromatin                0.087685
Normal_nucleoli                0.725528
Mitoses                        1.747431
dtype: float64


The Box-Cox transformation is a method that helps stabilize variance and bring data closer to a normal distribution (Box & Cox 1964). Nonetheless, it does not guarantee a perfect result: the dataset is still skewed after the transformation, most visibly for Mitoses (1.747431).

### Correlation Coefficient

The Breast Cancer Wisconsin Original Dataset is a popular dataset in medical research and machine learning for breast cancer diagnosis. The purpose of this report is to determine which correlation coefficient (Pearson’s, Spearman’s or Kendall’s Tau) is most useful when analyzing this dataset.

As seen in the previous sections, outliers exist in this dataset. Spearman’s rank correlation is therefore used: it measures the monotonic relationship between two ranked variables, is applicable to both continuous and ordinal data, and is less affected by outliers than Pearson’s correlation.
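
The difference in outlier sensitivity can be seen on a small hypothetical example. In the sketch below, `y` equals `x` except for one extreme final value: the ranking is unchanged, so the rank-based coefficient stays at 1, while Pearson's coefficient drops. The data are invented for illustration.

```python
import pandas as pd

# Hypothetical data: y follows x exactly, except for one extreme last value
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"].astype(float)
toy.loc[9, "y"] = 1000.0

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```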


In [88]:
# Render Spearman correlation coefficient matrix

import altair as alt

def render_correlation_matrix(df: pd.DataFrame) -> alt.Chart:
    corr = df.corr(method='spearman').reset_index().melt(id_vars='index')
    corr.columns = ['x', 'y', 'value']
    heatmap = alt.Chart(corr).mark_rect().encode(
        x='x:O',
        y='y:O',
        color='value:Q'
    ).properties(
        width=400,
        height=400
    )
    return heatmap
  
render_correlation_matrix(df_boxcoxed.drop(columns="Class")).show()

In [81]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA

# Calculate the variance inflation factor (VIF) for each feature
def calculate_vif(df: pd.DataFrame) -> pd.DataFrame:
    vif_data = pd.DataFrame()
    vif_data["feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

# Compute VIF on the transformed features only (target excluded)
vif_df = calculate_vif(df_boxcoxed.drop(columns='Class'))

display(vif_df)



| | feature | VIF |
|---|---|---|
| 0 | Clump_thickness | 5.322459 |
| 1 | Uniformity_of_cell_size | 12.007036 |
| 2 | Uniformity_of_cell_shape | 10.510065 |
| 3 | Marginal_adhesion | 4.197831 |
| 4 | Single_epithelial_cell_size | 6.801485 |
| 5 | Bare_nuclei | 5.11277 |
| 6 | Bland_chromatin | 6.350243 |
| 7 | Normal_nucleoli | 4.276653 |
| 8 | Mitoses | 1.732085 |
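
For context, the variance inflation factor of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. The sketch below computes it with NumPy on hypothetical synthetic data. Note that this sketch adds an intercept column to each regression, which the cell above does not, so its numbers are not directly comparable with the table; the function name `vif_with_intercept` is an assumption for illustration.

```python
import numpy as np

def vif_with_intercept(X: np.ndarray) -> np.ndarray:
    """VIF per column: regress each column on the others plus an intercept."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Hypothetical data: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The two nearly collinear columns receive a large VIF, while the independent column stays close to 1.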


In [92]:

# def fix_multicollinearity_using_pca(df: pd.DataFrame, vif: pd.DataFrame, threshold) -> pd.DataFrame:
#     df_reduced = df.copy()
#     variables_to_remove = vif[vif['VIF'] > threshold]['feature']
#     df_reduced = df.drop(columns=variables_to_remove)
#     return df_reduced

# df_reduced = fix_multicollinearity_using_pca(df_boxcoxed.drop(columns='Class'), vif_df, threshold=9)
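# Drop Uniformity_of_cell_size (the highest VIF above) together with Mitoses
df_reduced = df_boxcoxed.drop(columns=['Uniformity_of_cell_size', 'Mitoses'])

# Recompute VIF on the remaining features and redraw the correlation matrix
vif_df = calculate_vif(df_reduced.drop(columns='Class'))
display(vif_df)
render_correlation_matrix(df_reduced).show()

# 'Class' is already carried over from df_boxcoxed, so it does not need re-attaching
display(df_reduced)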



| | feature | VIF |
|---|---|---|
| 0 | Clump_thickness | 5.278802 |
| 1 | Uniformity_of_cell_shape | 7.029705 |
| 2 | Marginal_adhesion | 4.013416 |
| 3 | Single_epithelial_cell_size | 6.798505 |
| 4 | Bare_nuclei | 4.869203 |
| 5 | Bland_chromatin | 6.289892 |
| 6 | Normal_nucleoli | 3.887196 |

| | Clump_thickness | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.927529 | 0.000000 | 0.000000 | 0.305635 | 0.000000 | 0.511227 | 0.000000 | 0.0 |
| 1 | 0.927529 | 0.459686 | 0.425713 | 0.692622 | 0.511572 | 0.511227 | 0.210846 | 0.0 |
| 2 | 0.552651 | 0.000000 | 0.000000 | 0.305635 | 0.198826 | 0.511227 | 0.000000 | 0.0 |
| 3 | 1.079137 | 0.611532 | 0.000000 | 0.460754 | 0.374500 | 0.511227 | 0.412662 | 0.0 |
| 4 | 0.754565 | 0.000000 | 0.333926 | 0.305635 | 0.000000 | 0.511227 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 0.552651 | 0.000000 | 0.000000 | 0.460754 | 0.198826 | 0.000000 | 0.000000 | 0.0 |
| 695 | 0.309087 | 0.000000 | 0.000000 | 0.305635 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 696 | 0.927529 | 0.645497 | 0.333926 | 0.692622 | 0.307815 | 0.957497 | 0.432031 | 1.0 |
| 697 | 0.754565 | 0.557351 | 0.391119 | 0.460754 | 0.374500 | 1.044726 | 0.400480 | 1.0 |


## Data Analysis