# Table of Contents 

- The BREAST-CANCER dataset:
    - Load the dataset
    - Explore the dataset: Descriptive statistics
    - Explore the dataset: Visualization
    


In [1]:
import os
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## The BREAST-CANCER dataset

The dataset is available from the [UCI database](https://archive.ics.uci.edu/ml/datasets/breast+cancer).


This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. (See also lymphography and primary-tumor.)
 
This data set includes 286 instances (201 of one class, 85 of another class). The instances are described by 9 attributes, some of which are ordinal and some nominal.
 
Attribute information

| column | values |
| --- | --- |
| Class | no-recurrence-events, recurrence-events |
| age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99|
| menopause | lt40, ge40, premeno|
| tumor-size | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59|
| inv-nodes | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39|
| node-caps | yes, no|
| deg-malig | 1, 2, 3|
| breast | left, right|
| breast-quad | left-up, left-low, right-up, right-low, central|
| irradiat | yes, no|
 
The dataset contains 9 missing attribute values, denoted by "?".


## Load the Dataset

In [2]:
df = pd.read_csv(os.path.join('dataset','breast-cancer.csv'))

In [3]:
df

Unnamed: 0,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
0,'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
3,'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
4,'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...,...,...,...,...,...,...,...,...,...,...
281,'50-59','ge40','30-34','6-8','yes','2','left','left_low','no','no-recurrence-events'
282,'50-59','premeno','25-29','3-5','yes','2','left','left_low','yes','no-recurrence-events'
283,'30-39','premeno','30-34','6-8','yes','2','right','right_up','no','no-recurrence-events'
284,'50-59','premeno','15-19','0-2','no','2','right','left_low','no','no-recurrence-events'


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          286 non-null    object
 1   menopause    286 non-null    object
 2   tumor-size   286 non-null    object
 3   inv-nodes    286 non-null    object
 4   node-caps    286 non-null    object
 5   deg-malig    286 non-null    object
 6   breast       286 non-null    object
 7   breast-quad  286 non-null    object
 8   irradiat     286 non-null    object
 9   Class        286 non-null    object
dtypes: object(10)
memory usage: 22.5+ KB


**Issue #1**: values are *quoted* with the single-quote character (`'`), which pandas does not treat as a quote character by default.

In [None]:
df = pd.read_csv(os.path.join('dataset','breast-cancer.csv'),quotechar = "'")
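A minimal sketch of what `quotechar` changes, using a small in-memory sample instead of the real file. The two data rows are made up for illustration and only mimic the single-quoted style of `breast-cancer.csv`.

```python
import io
import pandas as pd

# Made-up sample in the same single-quoted style as the dataset file.
raw = "age,node-caps,deg-malig\n'40-49','yes','3'\n'50-59','no','1'\n"

# Default settings: only the double quote is a quote character,
# so the single quotes stay inside every value.
default = pd.read_csv(io.StringIO(raw))

# Declaring the single quote as quote character strips it while parsing.
unquoted = pd.read_csv(io.StringIO(raw), quotechar="'")

print(repr(default.loc[0, "age"]), default["deg-malig"].dtype)
print(repr(unquoted.loc[0, "age"]), unquoted["deg-malig"].dtype)
```

With the quotes stripped, a numeric-looking column such as `deg-malig` can also be parsed as a number instead of text.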

In [None]:
df

In [None]:
df.head(3).T

**Issue #2**: are missing values actually being treated as missing?

In [None]:
df["node-caps"].value_counts()

In [None]:
df = pd.read_csv(os.path.join('dataset','breast-cancer.csv'),quotechar = "'", na_values = '?')


In [None]:
df["node-caps"].value_counts()

In [None]:
df["node-caps"].value_counts(dropna=False)
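The effect of `na_values` can be checked in isolation on a small in-memory sample. This is only a sketch: the rows below are invented, and only the `"?"` marker is taken from the dataset description.

```python
import io
import pandas as pd

# Invented sample; "?" marks a missing value as in the dataset description.
raw = "node-caps,irradiat\nyes,no\n?,no\nno,yes\n?,yes\n"

as_text = pd.read_csv(io.StringIO(raw))                    # "?" is kept as an ordinary string
as_missing = pd.read_csv(io.StringIO(raw), na_values="?")  # "?" becomes NaN

print(as_text["node-caps"].isna().sum())
print(as_missing["node-caps"].isna().sum())
# dropna=False keeps the NaN row in the frequency table.
print(as_missing["node-caps"].value_counts(dropna=False))
```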

## Explore the dataset: Descriptive statistics


In [None]:
df

In [None]:
df.head(10)

In [None]:
df.head(2).T

In [None]:
df.shape

Check whether any values are missing.

In [None]:
df.isna().sum()

In [None]:
df[df.isna().any(axis=1)]

In [None]:
df.age.unique()

In [None]:
df.info()

Most columns are read as the generic `object` dtype (text, or mixed numeric and non-numeric values).
From the documentation, however, we know that these are categorical variables.

`Pandas` has a **categorical** data type, which may be useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

`Categoricals` are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. 'strongly agree' vs 'agree' or 'first observation' vs. 'second observation'), but numerical operations (additions, divisions, …) are not possible.
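The first two use cases can be illustrated with a toy Series. This is a sketch on made-up data (the "one", "two", "three" labels from the list above), not on the breast-cancer columns.

```python
import pandas as pd

# Made-up column with only three distinct string values, repeated many times.
numbers = pd.Series(["three", "one", "two", "one"] * 1000)

# Use case 1: storing integer codes plus a small lookup table saves memory.
as_category = numbers.astype("category")
print(numbers.memory_usage(deep=True), as_category.memory_usage(deep=True))

# Use case 2: an explicit order replaces the lexical one for sorting and min/max.
logical = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = numbers.astype(logical)
print(numbers.max())   # lexical maximum of the plain strings
print(ordered.max())   # logical maximum according to the declared order
```

Arithmetic such as `ordered + 1` is still not defined, which matches the remark that an order does not make the values numeric.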



In [None]:
df.describe()

By default, `describe()` summarizes only the numeric columns of a mixed-type DataFrame.

In [None]:
df.describe(include = 'all') # mixed data type: description still supported
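The difference between the two calls is easier to see on a tiny mixed-type frame. This is a sketch with invented rows: the column names are borrowed from the dataset, the values are not.

```python
import pandas as pd

# Invented rows: one numeric column and one text column.
toy = pd.DataFrame({
    "deg-malig": [3, 1, 2, 3],
    "breast": ["right", "right", "left", "right"],
})

numeric_only = toy.describe()            # default: numeric columns only
everything = toy.describe(include="all") # numeric and non-numeric columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
# Statistics that do not apply to a column (e.g. the mean of "breast") are NaN.
print(everything.loc[["count", "unique", "top", "freq", "mean"]])
```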

Since the documentation tells us that the columns are categorical, we can cast the whole DataFrame to the `category` dtype.

In [None]:
df = df.astype('category')

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
pd.set_option('display.max_rows', None)
df.sort_values(by = 'inv-nodes').tail(80)

**Another issue**: the sort does not follow the proper order, because the lexical order of the labels differs from their logical order: for example, logically 0-2 < 3-5 < 12-14, but as strings "0-2" < "12-14" < "3-5".

In [None]:
from pandas.api.types import CategoricalDtype 
# the following categories are available in the dataset description ("attribute information") at the beginning of this notebook
categories = CategoricalDtype(["0-2", "3-5", "6-8", 
                               "9-11", "12-14", "15-17", 
                               "18-20", "21-23", "24-26", 
                               "27-29", "30-32", "33-35",
                               "36-39"], ordered = True)
# Inv-nodes: the number (range 0 - 39) of axillary lymph nodes
# that contain metastatic breast cancer visible on histological examination.
df['inv-nodes'] = df['inv-nodes'].astype(categories)


In [None]:
df.info()

In [None]:
df.sort_values(by = 'inv-nodes').tail(80)
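The same idea should carry over to the other range-valued attributes (`age`, `tumor-size`). As a sketch, the hypothetical helper `range_dtype` below (not part of pandas or of this notebook) orders "a-b" labels by their lower bound. It builds the categories from the labels it is given, so passing the documented list, as done above for `inv-nodes`, remains the safer choice when some categories never occur in the data.

```python
import pandas as pd

def range_dtype(labels):
    """Hypothetical helper: ordered dtype for 'a-b' labels, sorted by lower bound."""
    ordered_labels = sorted(labels, key=lambda label: int(label.split("-")[0]))
    return pd.CategoricalDtype(ordered_labels, ordered=True)

# Made-up sample of tumor-size labels.
tumor_size = pd.Series(["15-19", "0-4", "35-39", "5-9", "10-14"])

print(sorted(tumor_size))  # lexical order of the plain strings

# dropna() guards against missing values, which cannot be split on "-".
typed = tumor_size.astype(range_dtype(tumor_size.dropna().unique()))
print(typed.sort_values().tolist())  # logical order
```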

## Explore the dataset: Visualization

*Barplot* and *pie chart* in pandas (shown for a single attribute, but the same code applies to all of them).

In [None]:
value_count = df['breast-quad'].value_counts(dropna = False) 
value_count.plot(kind='bar')
plt.title('breast-quad')
plt.show()
value_count.plot(kind = "pie")
plt.show()
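Both charts are drawn from the Series returned by `value_counts()`, so the numbers behind them can be inspected directly. A small sketch on invented values, showing absolute counts next to the relative frequencies that the pie chart represents:

```python
import pandas as pd

# Invented sample with one missing value.
quad = pd.Series(["left_low", "left_up", "left_low", "central", None, "left_low"])

counts = quad.value_counts(dropna=False)                  # heights of the bars
shares = quad.value_counts(normalize=True, dropna=False)  # slices of the pie

print(counts)
print(shares.round(2))
```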

*Barplot* in seaborn.

`seaborn` axes-level functions for [plotting categorical data](https://seaborn.pydata.org/tutorial/categorical.html):
- categorical scatter plots
    - `stripplot()`
    - `swarmplot()`
- distribution plots
    - `boxplot()`
    - `violinplot()`
    - `boxenplot()`
- estimate plots
    - `pointplot()`
    - `barplot()`
    - `countplot()`

In [None]:
sns.countplot(x="age", data=df, palette = "pastel")
plt.show()

In [None]:
sns.countplot(x="age", data=df, hue = "Class", palette = "pastel")
plt.show()
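The counts that a `countplot` with `hue` draws can also be tabulated with `pd.crosstab`, which is handy when exact numbers are needed. A sketch on a made-up frame (column names as in the dataset, rows invented):

```python
import pandas as pd

# Invented rows, only to show the shape of the result.
toy = pd.DataFrame({
    "age": ["40-49", "40-49", "50-59", "50-59", "50-59"],
    "Class": ["recurrence-events", "no-recurrence-events",
              "no-recurrence-events", "no-recurrence-events", "recurrence-events"],
})

# One row per age group, one column per class, cells hold the number of instances.
table = pd.crosstab(toy["age"], toy["Class"])
print(table)
```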

`seaborn` also provides a figure-level interface, `catplot()`, that gives unified higher-level access to the axes-level functions.

In [None]:
sns.catplot(x="age", hue="Class", kind="count", # x = 'age' --> vertical
            palette="pastel", edgecolor=".5",
            data=df)

In [None]:
sns.catplot(y="node-caps", hue="Class", kind="count", # y = 'node-caps' --> horizontal
            palette="pastel", edgecolor=".5",
            data=df)

In [None]:
sns.catplot(y="inv-nodes", hue="Class", kind="count",
            palette="pastel", edgecolor=".5",
            data=df)

To see the effect of customizing the order and domain of the categorical variable:
- restore the original dataset
- plot it with the same `catplot` statement

In [None]:
df = pd.read_csv(os.path.join('dataset','breast-cancer.csv'),quotechar = "'", na_values = '?')

In [None]:
sns.catplot(y="inv-nodes", hue="Class", kind="count",
            palette="pastel", edgecolor=".5", 
            data=df)
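If the column is kept as plain strings, the logical order can still be imposed at plotting time, since seaborn's `countplot()` and `catplot()` accept an `order` argument. The sketch below builds such a list with pandas only. The sample values are invented, and unlike the ordered dtype this approach keeps only the labels that actually occur.

```python
import pandas as pd

# Category labels as listed in the attribute information table.
documented = ["0-2", "3-5", "6-8", "9-11", "12-14", "15-17", "18-20",
              "21-23", "24-26", "27-29", "30-32", "33-35", "36-39"]

# Invented sample of plain-string labels.
inv_nodes = pd.Series(["3-5", "0-2", "12-14", "0-2", "6-8"])

# Keep the documented order, restricted to the labels present in the data.
present = set(inv_nodes.dropna())
order = [label for label in documented if label in present]
print(order)

# The list could then be passed to the plot, for example:
# sns.catplot(y="inv-nodes", hue="Class", kind="count", order=order, data=df)
```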