In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from pandas_profiling import ProfileReport
from plotly.offline import iplot
!pip install joypy
import joypy

plt.rcParams['figure.figsize'] = 8, 5
plt.style.use("fivethirtyeight")

data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

Cancer occurs when changes called mutations take place in genes that regulate cell growth. The mutations let the cells divide and multiply in an uncontrolled way.

Breast cancer is cancer that develops in breast cells. Typically, the cancer forms in either the lobules or the ducts of the breast. Lobules are the glands that produce milk, and ducts are the pathways that bring the milk from the glands to the nipple. Cancer can also occur in the fatty tissue or the fibrous connective tissue within your breast.

The uncontrolled cancer cells often invade other healthy breast tissue and can travel to the lymph nodes under the arms. The lymph nodes are a primary pathway that helps the cancer cells move to other parts of the body. [Source](https://www.healthline.com/health/breast-cancer).

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTv4VRY344bf6NKTdhY1DU-eBaWS-WQt7mmIQ&usqp=CAU" height="300px" width="500px">

<h1>Types of Breast Cancer</h1>

There are several types of breast cancer, and they are broken into two main categories: “invasive” and “noninvasive,” or in situ. While invasive cancer has spread from the breast ducts or glands to other parts of the breast, noninvasive cancer has not spread from the original tissue.

These two categories are used to describe the most common types of breast cancer, which include:

1. Ductal carcinoma in situ. Ductal carcinoma in situ (DCIS) is a noninvasive condition. With DCIS, the cancer cells are confined to the ducts in your breast and haven’t invaded the surrounding breast tissue.
2. Lobular carcinoma in situ. Lobular carcinoma in situ (LCIS) is cancer that grows in the milk-producing glands of your breast. Like DCIS, the cancer cells haven’t invaded the surrounding tissue.
3. Invasive ductal carcinoma. Invasive ductal carcinoma (IDC) is the most common type of breast cancer. This type of breast cancer begins in your breast’s milk ducts and then invades nearby tissue in the breast. Once the breast cancer has spread to the tissue outside your milk ducts, it can begin to spread to other nearby organs and tissue.
4. Invasive lobular carcinoma. Invasive lobular carcinoma (ILC) first develops in your breast’s lobules and then invades nearby tissue.

# Description of Data

In [None]:
#describing data

data.describe()

In [None]:
#covariance in data (numeric columns only; 'diagnosis' holds strings)

data.select_dtypes(include='number').cov()

In [None]:
#correlation in data (numeric columns only; 'diagnosis' holds strings)

data.select_dtypes(include='number').corr()

In [None]:
#heatmap of the correlation matrix (numeric columns only)
sns.heatmap(data.select_dtypes(include='number').corr())
plt.show()
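With 30 numeric columns the heatmap is hard to read cell by cell, so it can help to list the strongest pairs as numbers. The sketch below does that on a small synthetic frame; the values are generated for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data`: perimeter is built from radius, texture is independent.
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "texture_mean": rng.normal(19, 4, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) -> |correlation| and sort from strongest to weakest.
top_pairs = upper.stack().dropna().sort_values(ascending=False)
print(top_pairs.head())
```

On the notebook's frame, replacing `toy` with `data.select_dtypes(include='number')` should give the same kind of ranking.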

# Checking for missing and duplicate values

In [None]:
print('Missing Values Plot')
plt.figure(figsize=(8,8))
sns.barplot(data=data.isnull().sum().reset_index(), y='index', x=0)
plt.ylabel('Variables')
plt.xlabel('Missing value Count')
plt.show()
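The plot above covers missing values only. A duplicate check could look like the following sketch, shown on a made-up toy frame rather than on the real data.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "radius_mean": [14.2, 20.1, 20.1, 11.8],
    "diagnosis": ["B", "M", "M", "B"],
})

n_duplicates = int(toy.duplicated().sum())           # rows identical in every column
n_duplicate_ids = int(toy["id"].duplicated().sum())  # repeated identifiers
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, n_duplicate_ids, len(deduplicated))
```

In the notebook the same calls would be `data.duplicated().sum()` and `data['id'].duplicated().sum()`, run before the `id` column is dropped.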

# Dropping unnecessary variables

In [None]:
drop_var = ['Unnamed: 32', 'id']

data.drop(drop_var, axis=1, inplace=True)

# Analyzing the features

There are 10 features in total. For each of them the mean, the standard error (`_se`) and the worst value (`_worst`) are given.

The three statistics of a feature are closely related, so dropping one set of them would be reasonable for the analysis. The code below nevertheless keeps all 30 columns.


The features are:
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave points
9. symmetry
10. fractal_dimension
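As a quick sketch of how these 10 names map onto the 30 column names used in the cells below (the `_mean`, `_se` and `_worst` suffixes are the ones that appear in the plotting code):

```python
features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
            'compactness', 'concavity', 'concave points', 'symmetry',
            'fractal_dimension']
suffixes = ['mean', 'se', 'worst']

# One column per (statistic, feature) combination, e.g. 'radius_mean'.
columns = ['{}_{}'.format(feature, suffix) for suffix in suffixes for feature in features]

print(len(columns))
print(columns[:3])
```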

# Distribution of the features

In [None]:
features = ['radius','texture','perimeter','area','smoothness','compactness','concavity','concave points','symmetry','fractal_dimension']

for feature in features:
    print("{} distribution".format(feature))
    sns.boxplot(data=data[['{}_mean'.format(feature), '{}_se'.format(feature), '{}_worst'.format(feature)]])
    plt.title('Distribution of {}'.format(feature))
    plt.show()

# Distribution based on diagnosis

In [None]:
for feature in features:
    print("{} distribution based on diagnosis".format(feature))
    sns.violinplot(data=data, x="diagnosis", y="{}_mean".format(feature))
    plt.show()

The benign group of the concavity feature shows the most outliers.
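The observation above is made by eye. One way to put a number on it is the common 1.5 × IQR rule. That rule is an assumption of this sketch, as are the helper name `count_iqr_outliers` and the toy values.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# Toy values for illustration only; one benign point is deliberately extreme.
toy = pd.DataFrame({
    "diagnosis": ["B"] * 8 + ["M"] * 8,
    "concavity_mean": [0.02, 0.03, 0.03, 0.04, 0.04, 0.05, 0.05, 0.40,
                       0.10, 0.12, 0.14, 0.15, 0.16, 0.18, 0.20, 0.22],
})

outlier_counts = toy.groupby("diagnosis")["concavity_mean"].apply(count_iqr_outliers)
print(outlier_counts)
```

Applying the same `groupby` to `data` for each `{feature}_mean` column would show whether concavity really has the most flagged points.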

# Correlation of the variables

In [None]:
print('Pairplot')
sns.pairplot(data=data[['diagnosis','area_mean','texture_mean','smoothness_mean','concavity_mean','symmetry_mean']], hue="diagnosis", height=3, diag_kind="hist")
plt.show()

# Classification Model

In [None]:
#separating features and labels

X = data.drop('diagnosis',axis=1)
y = data['diagnosis']

In [None]:
#scaling the data

from sklearn import preprocessing

X_scaled = preprocessing.scale(X)

In [None]:
#splitting the data

from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=13)

In [None]:
#creating model and fitting the data

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
#checking the cross val score

scores = cross_val_score(model, X_scaled, y, cv=5)
print(np.mean(scores))
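The cells above scale the full dataset before splitting and cross-validating, so the scaler also sees the held-out rows. A possible alternative, sketched below, wraps the scaler and the model in one pipeline so that scaling is refit inside each fold. This sketch uses scikit-learn's bundled `load_breast_cancer` copy of the Wisconsin data as a stand-in for `data.csv`. `max_iter=1000` is also an assumption, not a setting from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 30 numeric features, labels encoded as 0/1 instead of 'M'/'B'.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is fit on the training part of each fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print(cv_scores.mean())
```

In the notebook the same pipeline could be passed the unscaled `X` and `y` in place of `X_scaled`.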

In [None]:
#prediction

pred = model.predict(X_test)

In [None]:
#checking the classification report

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, pred))

In [None]:
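#confusion matrix

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay (scikit-learn >= 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)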
plt.show()

In [None]:
#checking roc curve

from sklearn.metrics import roc_curve

# the model fitted above is reused; model.classes_ is ['B', 'M'],
# so column 1 of predict_proba is the probability of 'M'
y_pred_prob = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob, pos_label='M')
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend()
plt.show()
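The curve can also be summarised as a single number, the area under it (AUC). The sketch below uses four made-up predictions to show the call. The labels and probabilities are illustrative, not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of 'M', for illustration only.
labels = np.array(["B", "B", "M", "M"])
prob_malignant = np.array([0.10, 0.40, 0.35, 0.80])

# Encode 'M' as the positive class explicitly to avoid any ambiguity.
is_malignant = (labels == "M").astype(int)
auc = roc_auc_score(is_malignant, prob_malignant)

print(auc)
```

In the notebook this would correspond to `roc_auc_score((y_test == 'M').astype(int), y_pred_prob)`.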

## Accuracy: 98%

At first glance the accuracy looks suspiciously high. In this case, however, the target labels are fairly balanced, so the score is not simply the result of predicting a dominant class.

So overfitting should not be a problem.
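Both points can be checked directly: the class shares, and whether training accuracy is much higher than test accuracy. The sketch below is a rough check under stated assumptions. It uses scikit-learn's bundled `load_breast_cancer` copy of the data (labels 0 = malignant, 1 = benign) instead of `data.csv`, and the pipeline, `stratify` and `max_iter=1000` are choices made here rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Share of each class; index 0 is malignant, index 1 is benign in this copy.
class_share = np.bincount(y_demo) / len(y_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=13, stratify=y_demo
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

# A training score far above the test score would hint at overfitting.
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

print(class_share, train_acc, test_acc)
```

A small gap between the two scores supports the conclusion above more directly than class balance alone.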

More data would have helped in understanding the problem. I will check whether I can find a similar dataset.

### For now this is the end of the analysis. Do leave an upvote if you like it and let me know in the comments if there is anything wrong or how to improve it.