# **DIAGNOSTIC DEPARTMENT**

<img src="https://media.giphy.com/media/ZZiLDJ98R2GOY/giphy.gif">

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

In [None]:
print(df.shape)
df.head()

In [None]:
df.info()

Great news! Apart from the empty `Unnamed: 32` column, the data has no null values. Also, all feature columns are floats; only *diagnosis* (object) and *id* (integer) are not.
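A quick way to double-check the null counts per column, rather than reading them off `df.info()`, is sketched below. The toy frame and its values are hypothetical stand-ins for the real data; it only mimics the layout, including an empty `Unnamed: 32` column like the one dropped in the next section.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count missing values in every column.
null_counts = toy.isnull().sum()

# Columns that are entirely empty are candidates for dropping.
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()

print(null_counts)
print("Completely empty columns:", all_null_cols)
```

On the real data, the same two lines with `df` in place of `toy` should list every column that contains only missing values.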

# PREPROCESS

We will not need the `Unnamed: 32` and `id` columns, so we drop them.

In [None]:
df = df.drop(['Unnamed: 32', 'id'], axis=1)

In [None]:
df['diagnosis'].value_counts()

There are two diagnosis results, and they are stored as objects (strings). I will replace M with 1 and B with 0, so a *diagnosis* of 1 means the tumor is cancerous.
* M, malignant: 1, Cancerous
* B, benign: 0, Not Cancerous

In [None]:
# Map the string labels to integers: malignant -> 1, benign -> 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

In [None]:
df.head()

* There are 30 feature columns that may relate to *diagnosis*. Since that is a lot of columns and I am not an expert in this field, I will use the correlation matrix.
* The correlation tells me how strongly each column is linearly related to *diagnosis*, which I use as a rough measure of its importance to the diagnosis.
* I will pick a correlation threshold, and every column below it will be dropped.
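As a toy illustration of the thresholding idea above, the sketch below builds a small synthetic frame (the column names `strong_feature` and `noise_feature` and all values are made up for the example) and keeps only the columns whose absolute correlation with the target exceeds a threshold. It is a minimal sketch of the technique, not the exact selection run on the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target plus one informative and one uninformative feature.
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "diagnosis": target,
    "strong_feature": target + rng.normal(0, 0.3, size=n),
    "noise_feature": rng.normal(0, 1, size=n),
})

threshold = 0.59

# Pearson correlation of every feature with the target (target itself excluded).
corr_with_target = toy.corr()["diagnosis"].drop("diagnosis")

# Keep features whose absolute correlation is above the threshold.
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target)
print("Selected features:", selected)
```

Note that Pearson correlation only captures linear relationships, so a feature dropped this way could still be useful to a non-linear model.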

In [None]:
corr = df.corr()

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(corr, cmap='coolwarm', annot=True)
plt.show()

In [None]:
corr[abs(corr['diagnosis']) > 0.59].index

In [None]:
df = df[['diagnosis', 'radius_mean', 'perimeter_mean', 'area_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'radius_worst', 'perimeter_worst', 'area_worst', 'compactness_worst',
       'concavity_worst', 'concave points_worst']]

In [None]:
print(df.shape)
df.head()

We are ready to go.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import metrics

In [None]:
x = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

In [None]:
x.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression
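# max_iter raised from the default 100 in case the solver warns about convergence on these unscaled features
logistic = LogisticRegression(max_iter=10000)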
logistic.fit(x_train, y_train)
prediction_lr = logistic.predict(x_test)
print(classification_report(y_test,prediction_lr))
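# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
metrics.RocCurveDisplay.from_estimator(logistic, x_test, y_test)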

# DECISION TREE

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
prediction_dt = tree.predict(x_test)
print(classification_report(y_test, prediction_dt))
metrics.RocCurveDisplay.from_estimator(tree, x_test, y_test)

# RANDOM FOREST

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
prediction_rf = forest.predict(x_test)
print(classification_report(y_test, prediction_rf))
metrics.RocCurveDisplay.from_estimator(forest, x_test, y_test)

# XGBOOST

In [None]:
import xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(x_train,y_train)
prediction_xgb = xgb.predict(x_test)
print(classification_report(y_test, prediction_xgb))
metrics.RocCurveDisplay.from_estimator(xgb, x_test, y_test)
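To compare several models side by side instead of reading separate reports, one option is to collect accuracy and ROC AUC in a single loop. The sketch below is a minimal version under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `x` and `y`, the names `X_demo`, `y_demo`, `models`, and `scores` are made up for the example, and XGBoost is left out only to keep the block dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's features and labels (assumption:
# 12 numeric features and a binary target, like the selected columns).
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Probability of the positive class, needed for ROC AUC.
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    }

for name, s in scores.items():
    print(f"{name:10s} accuracy={s['accuracy']:.3f} roc_auc={s['roc_auc']:.3f}")
```

With the real data, swapping in `x_train`, `x_test`, `y_train`, and `y_test` (and adding the XGBoost classifier to the dictionary) should give one comparable row per model.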

That is it for this notebook. We implemented simple machine learning models and got successful results. I hope you like it. If you do, upvotes are appreciated. Take care.

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">