## Identify the problem

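Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The goal of this notebook is to predict the diagnosis (malignant or benign) from these features.

Data source: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)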

Attribute Information:

1) ID number 

2) Diagnosis (M = malignant, B = benign) 

3-32) Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter) 
- texture (standard deviation of gray-scale values) 
- perimeter 
- area 
- smoothness (local variation in radius lengths) 
- compactness (perimeter^2 / area - 1.0) 
- concavity (severity of concave portions of the contour) 
- concave points (number of concave portions of the contour) 
- symmetry 
- fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius. All feature values are recoded with four significant digits. Missing attribute values: none. Class distribution: 357 benign, 212 malignant.
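
The 10 x 3 layout described above can be sketched in a few lines. The suffixes `mean`, `se` and `worst` follow the column names used later in this notebook; the ordering (all means first, then standard errors, then worst values) is inferred from the field numbers given above and is an assumption about the file layout.

```python
# The ten base measurements listed in the attribute information.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes as they appear in the CSV column names used later in this notebook.
suffixes = ["mean", "se", "worst"]

# Group by statistic first, then by measurement (assumed field order).
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))
# Fields 3, 13 and 23 are the 1st, 11th and 21st feature columns.
print(feature_columns[0], feature_columns[10], feature_columns[20])
```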

In [None]:
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))
import numpy as np # linear algebra
print("NumPy version: {}". format(np.__version__))
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
print("pandas version: {}". format(pd.__version__))
import matplotlib # collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings # ignore warnings
warnings.filterwarnings('ignore')
# machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Exploratory Data Analysis

In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
df.shape

In [None]:
df.head(3)

In [None]:
df.drop('Unnamed: 32', axis =1, inplace = True)

In [None]:
df['diagnosis'].value_counts()

In [None]:
sns.countplot(x='diagnosis', data=df)  # name the column explicitly instead of passing it positionally
plt.title("Diagnosis: benign or malignant?", color='black', fontsize=15)

In [None]:
benign = len(df[df['diagnosis'] == 'B'])
malign = len(df[df['diagnosis'] == 'M'])
labels = ['B', 'M']
sizes = [benign, malign]
fig1, ax1 = plt.subplots()
# Pie chart: slices are drawn counter-clockwise, starting at 90 degrees.
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Percentage', size=16)
plt.show()

Conclusion: Benign cases are more common than malignant ones (357 vs. 212, roughly 63% vs. 37%).
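
The class balance can also be checked directly from the counts given in the attribute information (357 benign, 212 malignant). The sketch below uses only those two numbers; the "majority baseline" is an extra reference point added here, not something computed elsewhere in the notebook.

```python
# Class counts as stated in the dataset description.
benign, malignant = 357, 212
total = benign + malignant

benign_share = benign / total
malignant_share = malignant / total
print(f"total samples: {total}")
print(f"benign: {benign_share:.1%}, malignant: {malignant_share:.1%}")

# A classifier that always predicts the larger class would reach this accuracy,
# so any useful model should score clearly above it.
majority_baseline = max(benign_share, malignant_share)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```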

In [None]:
sns.scatterplot(x= 'area_mean', y= 'perimeter_mean', hue= 'diagnosis', data=df) # smoothness_mean 

Conclusion: Malignant samples tend to have larger area_mean and perimeter_mean values than benign ones.

## Feature Engineering

In [None]:
# Encode the target as an integer: 0 = benign, 1 = malignant.
df['cancer'] = df['diagnosis'].map({'B': 0, 'M': 1})
df.drop('diagnosis', axis=1, inplace=True)

## Feature Importances

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 15))
ax.set_title("Correlation Matrix", fontsize=12)
# Exclude 'id': it is an identifier, not a measurement.
corrmat = df.drop(columns='id').corr()
sns.heatmap(corrmat, vmin=-1, vmax=1, cmap='coolwarm', annot=True, ax=ax)

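Conclusion: The strongest correlations in the matrix (about 0.99) are among the size features: perimeter_mean is almost perfectly correlated with radius_mean. Note that this is a correlation between two features, not between a feature and the cancer target.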
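
Reading the strongest pair off a large annotated heatmap is error-prone, so it can help to extract it programmatically. The following is a minimal sketch on a small toy frame (the columns `a`, `b`, `c` and their values are hypothetical, not the FNA measurements); the same steps should apply to the notebook's correlation matrix.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): b is exactly 2*a + 1, c is only weakly related to a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [2.0, -1.0, 4.0, 0.0, 1.0],
})

# Absolute correlations, so strong negative relationships count as well.
corr = toy.corr().abs()

# Keep only the strict upper triangle: this hides the diagonal (always 1.0)
# and the mirrored duplicates below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# idxmax skips missing values and returns the (row, column) label pair.
strongest_pair = pairs.idxmax()
print(strongest_pair, round(pairs.max(), 2))
```

Masking the diagonal matters: without it, every feature's correlation with itself would win.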

## Distribution

In [None]:
ax = sns.histplot(df['perimeter_mean'], kde=True)  # histogram with a density estimate; distplot is deprecated

In [None]:
data_mean = df[['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
sns.heatmap(data_mean.corr(), vmin=-1, vmax=1, cmap='coolwarm', annot=True)

In [None]:
X = df[['id','radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst']]  # 'id' is kept only to label predictions; it is dropped before fitting
y = df['cancer']  # target: 0 = benign, 1 = malignant

## Model Selection

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

In [None]:
ids = X_test.id

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
# Drop the identifier once so it is never used as a predictor.
features_train = X_train.drop('id', axis=1)
features_test = X_test.drop('id', axis=1)
model = LogisticRegression(max_iter=1000)
model.fit(features_train, y_train)
print('Logistic Regression score (train):', model.score(features_train, y_train))
print('Logistic Regression score (test):', model.score(features_test, y_test))

Y_pred = model.predict(features_test)
output = pd.DataFrame({'id': ids, 'cancer': Y_pred})
print(output.head())
# Share of test samples predicted as malignant (a fraction between 0 and 1, not a percentage).
rate_people = (output.cancer == 1).mean()
print("Logistic Regression fraction of test samples predicted malignant:", rate_people)
from sklearn.metrics import classification_report
print(classification_report(y_test, Y_pred))

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not shadowed
# (otherwise re-running this cell fails).
cm = confusion_matrix(y_test, Y_pred)
class_names = [0, 1]
fig, ax = plt.subplots()
sns.heatmap(pd.DataFrame(cm, index=class_names, columns=class_names),
            annot=True, cmap='Greens', fmt='g', ax=ax)
ax.xaxis.set_label_position('top')
plt.title('Confusion matrix for logistic regression')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()
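
To make the heatmap easier to interpret, here is a small sketch of how accuracy, precision and recall follow from the four cells of a binary confusion matrix. The helper `summarize` and the counts in `example_cm` are hypothetical and chosen only for illustration; they are not this notebook's results. The layout `[[TN, FP], [FN, TP]]` is the one scikit-learn uses for labels `[0, 1]` (rows are actual labels, columns are predicted labels).

```python
def summarize(cm):
    """Derive basic metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # of predicted malignant, how many truly are
        "recall": tp / (tp + fn),        # of truly malignant, how many were found
    }

# Hypothetical counts, for illustration only.
example_cm = [[8, 2], [1, 9]]
metrics = summarize(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a cancer screen, recall on the malignant class is usually the number to watch, since a false negative (bottom-left cell) is a missed malignant case.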

In [None]:
# ROC Curve
from sklearn.metrics import roc_auc_score, roc_curve
y_probabilities = model.predict_proba(X_test.drop('id', axis=1))[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probabilities)
plt.figure(figsize=(10, 6))
plt.title('ROC for logistic regression')
plt.plot(false_positive_rate, true_positive_rate, linewidth=5, color='green')
plt.plot([0, 1], [0, 1], ls='--', linewidth=5)  # chance-level diagonal
plt.plot([0, 0], [1, 0], c='.5')  # left edge (x = 0)
plt.plot([0, 1], [1, 1], c='.5')  # top edge (y = 1)
plt.text(0.2, 0.6, 'AUC: {:.2f}'.format(roc_auc_score(y_test, y_probabilities)), size=16)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
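
The AUC printed on the plot has a simple interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes that directly by comparing all positive/negative pairs. The function `pairwise_auc` and the four toy scores are hypothetical and unrelated to the model above.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical scores): one positive (0.35) is ranked below one negative (0.4).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly, so the toy AUC is 0.75. The double loop is quadratic and meant only to show the definition; `roc_auc_score` is the practical choice.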

In [None]:
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

In [None]:
output.shape

In [None]:
output.head()

In [None]:
output.tail()

## Conclusion

Logistic regression reaches about 95% accuracy on the held-out test set.