# Analysis of Breast Cancer Wisconsin (Diagnostic) Data Set

### Data Visualization and EDA

# INTRODUCTION
* In this portfolio project, we analyze the Breast Cancer Wisconsin (Diagnostic) dataset and explore which features distinguish malignant from benign cell samples.
    * We downloaded the dataset from Kaggle.
        * https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
        * The download is about 49 KB and contains a single data file.
        * We use the data.csv file and show the most important results.

## Content:
1. [Importing Packages and Loading Dataset](#1)
1. [Describing Dataset](#2)
1. [Explaining Features and Determine Target Value](#3)
1. [Missing Analysis and Drop Missing Values](#4)
1. [Correlation in Data](#5)
1. [Standardization and Categorizing Data Features](#6)
1. [Visualization and Analysis](#7)
1. [Conclusion](#8)

<a id="1"></a> 
## 1. Importing Packages and Loading Dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.express as px
import plotly as py

from plotly.offline import init_notebook_mode, iplot, plot
from wordcloud import WordCloud

init_notebook_mode(connected=True)

# suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

In [None]:
data.head()

<a id="2"></a> 
## 2. Describing Dataset

In [None]:
data.shape

In [None]:
data.info()

Our data has 33 columns and 569 rows. It has one integer column (id), one object column (diagnosis) and 31 float columns.
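
The shape and column types can also be summarized programmatically instead of being read off `data.info()`. The sketch below uses a small made-up frame (`toy`, with invented values) that only mimics the kinds of columns in data.csv; it is an illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of columns as data.csv (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 12.4, 11.2],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # an all-NaN column is stored as float
})

n_rows, n_cols = toy.shape
# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

print(n_rows, n_cols)
print(dtype_counts)
```

Running the same two lines on `data` should reproduce the column counts quoted above.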

In [None]:
data.describe(include= "all")

<a id="3"></a> 
## 3. Explaining Features and Determine Target Value

In [None]:
data.columns

#### The dataset has 33 columns, most of which describe the cell nuclei. We will explain the features one by one.
* 1) ID number
* 2) Diagnosis (M = malignant, B = benign)
* 3) Ten real-valued features are computed for each cell nucleus:

   * a) radius (mean of distances from center to points on the perimeter)
   * b) texture (standard deviation of gray-scale values)
   * c) perimeter
   * d) area
   * e) smoothness (local variation in radius lengths)
   * f) compactness (perimeter^2 / area - 1.0)
   * g) concavity (severity of concave portions of the contour)
   * h) concave points (number of concave portions of the contour)
   * i) symmetry
   * j) fractal dimension ("coastline approximation" - 1)
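
To make the compactness definition above concrete, here is a small worked example. The helper `compactness` is a hypothetical name introduced for illustration; it applies the formula from the list (perimeter^2 / area - 1.0) to two simple shapes rather than to real nuclei.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Circle of radius 2: perimeter = 2*pi*r, area = pi*r^2, so the ratio is 4*pi.
r = 2.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# Square of side 3: perimeter = 12, area = 9, so the ratio is 16.
square = compactness(4 * 3.0, 3.0 ** 2)

print(round(circle, 3), round(square, 3))
```

A circle gives the smallest possible value (4π - 1), so larger values indicate a less compact, more irregular outline.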

 Diagnosis is our target variable and has two classes. In this analysis, we explore which features affect the diagnosis.
   * M = Malignant is bad news for the patient because a malignant tumor is cancerous and can be life-threatening.
   * B = Benign is good news for the patient because a benign tumor is not cancerous and is generally not dangerous.

In [None]:
data.diagnosis.value_counts()

In [None]:
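colors = ["red", "green"]
# take labels and counts from the same Series so that they stay aligned
counts = data.diagnosis.value_counts().reindex(["M", "B"])
sns.barplot(x = counts.index, y = counts.values, palette= colors)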
plt.xlabel("type of cancer cell")
plt.title("Counts of M and B Cancer Cell")
plt.show()

#### We have 357 "B" (benign) and 212 "M" (malignant) values. Most analyses encode class labels as 1 and 0, so we will convert our target values:  
             M = 1 is cancer cell  and B = 0 is noncancer cell

#### The next cell replaces M and B with 1 and 0.
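
As an alternative to a list comprehension, the same encoding can be sketched with `Series.map`. The frame `toy` and the dictionary `label_map` below are illustrative names, not part of the notebook. One practical difference is that `map` turns any unexpected label into NaN, which makes data problems visible instead of silently coding them as 0.

```python
import pandas as pd

# Toy target column (made-up values).
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

label_map = {"M": 1, "B": 0}  # malignant -> 1, benign -> 0
encoded = toy["diagnosis"].map(label_map)

# Labels missing from label_map would show up here as NaN.
n_unmapped = int(encoded.isna().sum())

print(encoded.tolist(), n_unmapped)
```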

In [None]:
data.diagnosis = [1 if each == "M" else 0 for each in data.diagnosis]
print(data.diagnosis.values)

In [None]:
labels = data.diagnosis.value_counts().index
colors = ['red','green']
explode = [0,0]
sizes = data.diagnosis.value_counts().values

# visual
plt.figure(figsize = (7,7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title('Percentage of Diagnosis Cancers',color = 'blue',fontsize = 15)
plt.show()

#### This pie chart shows the share of each diagnosis: 62.7% of the samples are benign (0) and 37.3% are malignant (1). So nearly 2/3 of the samples are benign.
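
The percentages in the pie chart can be checked numerically with `value_counts(normalize=True)`. The sketch below assumes a label vector with the two class sizes of this dataset (357 and 212); `class_a` and `class_b` are placeholder labels for the two diagnosis classes.

```python
import pandas as pd

# Placeholder labels with the two class sizes of the dataset.
labels = pd.Series(["class_a"] * 357 + ["class_b"] * 212, name="diagnosis")

# Relative frequencies, expressed as percentages with one decimal.
share = labels.value_counts(normalize=True).mul(100).round(1)

print(share)
```

On the notebook's data the equivalent call would be `data.diagnosis.value_counts(normalize=True)`.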

<a id="4"></a> 
## 4. Missing Analysis and Drop Missing Values

In [None]:
data.isnull().sum().values

In [None]:
import missingno as msno
msno.bar(data)
plt.show()

#### The Unnamed: 32 column has 569 null values, so it is completely empty and we do not need it. The id column is only an identifier, so it carries no information about the target. We will drop the id and Unnamed: 32 columns and continue the analysis without them.
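
Instead of hard-coding the name of the empty column, fully-null columns can be detected first and then dropped. This is a minimal sketch on a made-up frame (`toy`); the column names mirror the ones discussed above, but the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.9, np.nan, 11.2],        # partly missing: kept
    "Unnamed: 32": [np.nan, np.nan, np.nan],    # completely missing: dropped
})

# Columns in which every value is null.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Drop the empty columns together with the identifier column.
cleaned = toy.drop(columns=empty_cols + ["id"])

print(empty_cols, cleaned.columns.tolist())
```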

In [None]:
data.drop(["id","Unnamed: 32"],axis=1, inplace = True)

<a id="5"></a> 
## 5. Correlation in Data

In [None]:
corr = data.corr()["diagnosis"].drop("diagnosis") # correlation of each feature with the target
corr.sort_values() # ascending order

#### The table above shows the correlation between our target (diagnosis) and each of the other features.
#### As we can see, the features most strongly correlated with the target are compactness, concavity, area, radius, concave points and perimeter (_mean and _worst).
#### Let's look at the correlations between all pairs of features.
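
Ranking features by their correlation with the target can also be sketched with `DataFrame.corrwith`, which avoids computing the full correlation matrix. The data below is synthetic (random numbers with a fixed seed), and the column names `strong`, `weak` and `noise` are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target and three features with different ties to it.
target = pd.Series(rng.integers(0, 2, size=n), name="diagnosis")
toy = pd.DataFrame({
    "strong": target * 2.0 + rng.normal(0, 0.5, n),
    "weak": target * 0.3 + rng.normal(0, 1.0, n),
    "noise": rng.normal(0, 1.0, n),
})

# Pearson correlation of every column with the target, strongest first.
corr_with_target = toy.corrwith(target).sort_values(ascending=False)

print(corr_with_target)
```

Correlation only measures linear association, so a low value does not by itself mean a feature is useless.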

In [None]:
f,ax = plt.subplots(figsize=(20, 20))
sns.heatmap(data.corr(), annot= True, linewidths= 0.3, linecolor= "red", fmt= ".0%", ax= ax, cmap = 'coolwarm')
plt.show()

<a id="6"></a> 
## 6. Standardization and Categorizing Data Features

In [None]:
y = data.diagnosis # y is our target / dependent variable
x = data.drop(columns= "diagnosis") # x holds the independent variables

for col in x.columns:  # distribution and skewness of each feature
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    ax = sns.histplot(x[col], kde=True, stat="density", color="b", label="Skewness : %.2f"%(x[col].skew()))
    ax.legend(loc="best")
    plt.show()


#### As we can see, some of the variables are slightly skewed. We treat skewness below 1.5 as acceptable and leave those variables untransformed.
#### The other clear conclusion is that the features are not on a common scale, so we will standardize them.
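
The skewness check can be done as a table rather than plot by plot. The sketch below uses synthetic data and reuses the 1.5 threshold mentioned above; applying `np.log1p` to the flagged columns is one common option, shown here as an assumption rather than a step this notebook performs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic features: one roughly symmetric, one strongly right-skewed.
toy = pd.DataFrame({
    "symmetric": rng.normal(10, 2, 1000),
    "right_skewed": rng.lognormal(0, 1, 1000),
})

skew = toy.skew()

# Columns whose absolute skewness exceeds the threshold used in the text.
flagged = skew[skew.abs() > 1.5].index.tolist()

# A log transform (log1p handles zeros) reduces right skew.
log_skew = np.log1p(toy[flagged]).skew()

print(skew.round(2))
print(flagged, log_skew.round(2).to_dict())
```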

In [None]:
x_ = (x - x.mean()) / x.std()  #standardization

In [None]:
x_.describe(include= "all")

#### After standardization, each feature has a mean near zero and a standard deviation of one.
#### Our dataset is ready; after grouping the features we can begin visualization and analysis.
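
The effect of the standardization formula can be verified on a small example. The numbers in `toy` are made up. Note that `DataFrame.std` uses the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` uses the population version (`ddof=0`), so the two give slightly different z-scores on small samples.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [17.9, 12.4, 11.2, 20.5, 13.0],
    "area_mean": [1001.0, 476.0, 386.0, 1326.0, 520.0],
})

# Same formula as in the notebook: subtract the mean, divide by the sample std.
z = (toy - toy.mean()) / toy.std()

# Population-std variant (ddof=0), for comparison.
z0 = (toy - toy.mean()) / toy.std(ddof=0)

print(z.mean().round(6).to_dict(), z.std().round(6).to_dict())
```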

In [None]:
plt.subplots(figsize=(8,8))
wordcloud = WordCloud(
                          background_color='white',
                          width=512,
                          height=384
                         ).generate(" ".join(data.columns))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('graph.png')

plt.show()

#### As we can see, the feature names fall into 3 groups: mean, se and worst. So we will divide the features into these 3 categories.
#### The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
   * #### mean_data = radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean
   * #### se_data = radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se
   * #### worst_data = radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst

#### After categorising, we will add our target values (diagnosis / y) to each of the 3 groups.
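
The grouping can also be done by column-name suffix, which does not depend on the column order. This is a sketch on a made-up frame with only two features per group; `groups` and `long_mean` are illustrative names.

```python
import pandas as pd

columns = ["radius_mean", "texture_mean", "radius_se",
           "texture_se", "radius_worst", "texture_worst"]
toy = pd.DataFrame([[1.0] * 6, [2.0] * 6], columns=columns)

# Collect the column names of each group by their suffix.
groups = {
    suffix: [c for c in toy.columns if c.endswith("_" + suffix)]
    for suffix in ("mean", "se", "worst")
}

# Add the target and reshape one group to long format for plotting.
long_mean = (
    toy[groups["mean"]]
    .assign(diagnosis=[1, 0])
    .melt(id_vars="diagnosis", var_name="features", value_name="value")
)

print(groups)
print(long_mean)
```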

In [None]:
mean_data = pd.concat([y,x_.iloc[:,0:10]],axis=1)
mean_data = pd.melt(mean_data,id_vars="diagnosis", var_name="features", value_name='value')

se_data = pd.concat([y,x_.iloc[:,10:20]],axis=1)
se_data = pd.melt(se_data,id_vars="diagnosis", var_name="features", value_name='value')

worst_data = pd.concat([y,x_.iloc[:,20:30]],axis=1)
worst_data = pd.melt(worst_data,id_vars="diagnosis", var_name="features", value_name='value')

<a id="7"></a> 
## 7. Visualization and Analysis

In [None]:
plt.figure(figsize=(10,10))
sns.violinplot(x ="features", y ="value", hue ="diagnosis",palette = colors, data = mean_data, split = True, inner = "quart")
plt.xticks(rotation=90)

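#### In the mean features, red marks benign (0) and green marks malignant (1), because the palette colors are assigned to the hue levels in order. For most features, the benign samples concentrate in the [-2, 0] range, while the malignant samples sit higher.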

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x ="features", y ="value", hue ="diagnosis", data = se_data, palette = colors)
plt.xticks(rotation=90)

#### In the se features, the two classes overlap more. Where they differ, the malignant samples (green) tend to lie in the [0, 2] range.

In [None]:
plt.figure(figsize=(10,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data= worst_data, palette = colors)
plt.xticks(rotation=90)

#### In the worst features, the benign samples (red) mostly lie in the [-2, 0] range, while the malignant samples (green) mostly lie above 0.

In [None]:
mean_data = pd.concat([y,x_.iloc[:,0:10]],axis=1)
sns.pairplot(mean_data, hue = "diagnosis", palette = colors)
plt.show()

In [None]:
se_data = pd.concat([y,x_.iloc[:,10:20]],axis=1)
sns.pairplot(se_data, hue = "diagnosis", palette = colors)
plt.show()

In [None]:
worst_data = pd.concat([y,x_.iloc[:,20:30]],axis=1) # columns 20-29 are the worst features
sns.pairplot(worst_data, hue = "diagnosis", palette = colors)
plt.show()

### The pair plots show the pairwise relationships within each feature group. Some plots contradict our previous analysis and estimates, but in general they give us an idea of how the features relate to the target variable.
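
Visual impressions from the plots can be cross-checked with a per-class summary table. The sketch below uses synthetic standardized-looking data; `feature_a` and `feature_b` are invented names, and `gap` is simply the absolute difference between the class means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100

# Synthetic data: feature_a separates the classes, feature_b does not.
diagnosis = np.repeat([0, 1], n)
toy = pd.DataFrame({
    "diagnosis": diagnosis,
    "feature_a": np.where(diagnosis == 1, 1.0, -1.0) + rng.normal(0, 0.5, 2 * n),
    "feature_b": rng.normal(0, 1.0, 2 * n),
})

# Mean and median of every feature within each class.
summary = toy.groupby("diagnosis").agg(["mean", "median"])

# Absolute difference between the class means, largest first.
class_means = toy.groupby("diagnosis").mean()
gap = class_means.diff().iloc[-1].abs().sort_values(ascending=False)

print(summary.round(2))
print(gap.round(2))
```

Features with a large gap are the ones where the two colors separate most clearly in the plots.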

<a id="8"></a> 
## 8. Conclusion

#### In conclusion, we now have a general picture of the breast cancer dataset, in terms of the standardized feature values:
   #### values in the [-2, 0] range mostly belong to benign samples (0), which are not cancerous
   #### values in the [0, 2] range mostly belong to malignant samples (1), which are cancerous