# Breast Cancer Wisconsin (Original) Data Set

Attribute Information:

1. Sample code number: id number 
2. Clump Thickness: 1 - 10 
3. Uniformity of Cell Size: 1 - 10 
4. Uniformity of Cell Shape: 1 - 10 
5. Marginal Adhesion: 1 - 10 
6. Single Epithelial Cell Size: 1 - 10 
7. Bare Nuclei: 1 - 10 
8. Bland Chromatin: 1 - 10 
9. Normal Nucleoli: 1 - 10 
10. Mitoses: 1 - 10 
11. Class: (2 for benign, 4 for malignant)

For more details about the dataset, see the UCI Machine Learning Repository page:

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

In [9]:
# Let's start with the data import and some exploratory data analysis.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 

%matplotlib inline 



In [7]:
# Load the data with the pandas library (the raw file has no header row)


columns = ['Sample code number', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']

data = pd.read_csv('breast-cancer-wisconsin.data',names=columns)
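
The raw file marks missing values with `?`, which makes pandas read the affected column as strings. As a hedged alternative to the call above, `read_csv` can be told to treat `?` as missing through its `na_values` parameter. The sketch below uses a small in-memory sample instead of the real file: the first three rows are ones shown in this notebook, and the fourth row (id 9999999) is made up for illustration.

```python
import io
import pandas as pd

columns = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

# Three rows from the dataset plus one hypothetical row with a '?' entry.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
    "9999999,8,4,5,1,2,?,7,3,1,4\n"
)

# na_values='?' turns the placeholder into NaN, so the column stays numeric.
df = pd.read_csv(sample, names=columns, na_values='?')

print(df['Bare Nuclei'].dtype)         # numeric dtype instead of object
print(df['Bare Nuclei'].isna().sum())  # number of missing entries
```

With the column kept numeric, the missing entries can be counted, dropped, or imputed explicitly instead of being hidden behind a string placeholder.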



In [8]:
# Check the first few rows of the data
data.head()


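|   | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |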


Let's do some exploratory data analysis

In [30]:
X = data.loc[:,:'Mitoses']
y = data['Class']

data


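|   | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |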
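
A quick look at the class balance and at repeated sample ids is a natural first exploratory step. The sketch below is illustrative only: it rebuilds the ten rows displayed above by hand (ids and classes only) rather than using the full `data` frame, and the label names `benign` and `malignant` follow the attribute list at the top.

```python
import pandas as pd

# Ids and classes copied from the ten rows displayed above.
preview = pd.DataFrame({
    'Sample code number': [1000025, 1002945, 1015425, 1016277, 1017023,
                           1017122, 1018099, 1018561, 1033078, 1033078],
    'Class': [2, 2, 2, 2, 2, 4, 2, 2, 2, 2],
})

# Map the numeric codes to readable labels before counting.
labels = preview['Class'].map({2: 'benign', 4: 'malignant'})
class_counts = labels.value_counts()

# Count rows whose id already appeared earlier in the frame.
repeated_ids = preview['Sample code number'].duplicated().sum()

print(class_counts)
print('Repeated sample ids:', repeated_ids)
```

The same two calls, `value_counts()` and `duplicated()`, can be run on the full `data` frame to see how imbalanced the classes are and whether some samples appear more than once.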


In [65]:
# Missing values in 'Bare Nuclei' are encoded as '?'; replace them with 0
data = data.replace(to_replace='?', value=0)
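
Replacing `?` with 0 leaves the column as a mix of strings and integers and introduces a value outside the documented 1 - 10 range. One possible alternative, sketched here under the assumption that median imputation is acceptable for this column, is to convert the column to numbers first and then fill the gaps. The series below is a made-up example, not data from the file.

```python
import pandas as pd

# Hypothetical 'Bare Nuclei' values as they would be read from the raw file.
bare_nuclei = pd.Series(['1', '10', '2', '4', '?', '1'], name='Bare Nuclei')

# errors='coerce' turns anything non-numeric (here '?') into NaN.
numeric = pd.to_numeric(bare_nuclei, errors='coerce')

# Fill the gaps with the median of the observed values, then restore integers.
median_value = numeric.median()
filled = numeric.fillna(median_value).astype(int)

print(filled.tolist())
```

Whether the median, the mode, or dropping the affected rows is the better choice depends on the analysis; this only shows the mechanics.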


In [90]:
# Check data.info(); if any column has object dtype, one-hot encode it
data = pd.get_dummies(data, drop_first=True)

y = data['Class']
X = data.drop(['Class'], axis=1)



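|   | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Normal Nucleoli | Mitoses | Bare Nuclei_1 | Bare Nuclei_10 | Bare Nuclei_2 | Bare Nuclei_3 | Bare Nuclei_4 | Bare Nuclei_5 | Bare Nuclei_6 | Bare Nuclei_7 | Bare Nuclei_8 | Bare Nuclei_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 3 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 3 | 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |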
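
Because `Bare Nuclei` is read as text, `get_dummies` treats each distinct string as a category and orders the new columns lexicographically, which is why `Bare Nuclei_10` appears before `Bare Nuclei_2` in the output above. A minimal sketch on a made-up four-row frame follows; the placeholder is written as the string `'0'` here, an assumption made so that all categories share one type.

```python
import pandas as pd

# Hypothetical mini-frame: one numeric column and one string column.
demo = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 8],
    'Bare Nuclei': ['1', '10', '2', '0'],
})

# Only the object-dtype column is expanded; drop_first removes the first
# category in sorted order ('0'), so it becomes the implicit baseline.
encoded = pd.get_dummies(demo, drop_first=True)

print(encoded.columns.tolist())
```

One consequence of this encoding is that the ordinal 1 - 10 scale of the attribute is lost: the model sees ten unrelated indicator columns instead of one graded score.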


In [82]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 20 columns):
Sample code number             699 non-null int64
Clump Thickness                699 non-null int64
Uniformity of Cell Size        699 non-null int64
Uniformity of Cell Shape       699 non-null int64
Marginal Adhesion              699 non-null int64
Single Epithelial Cell Size    699 non-null int64
Bland Chromatin                699 non-null int64
Normal Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
Bare Nuclei_1                  699 non-null int64
Bare Nuclei_10                 699 non-null int64
Bare Nuclei_2                  699 non-null int64
Bare Nuclei_3                  699 non-null int64
Bare Nuclei_4                  699 non-null int64
Bare Nuclei_5                  699 non-null int64
Bare Nuclei_6                  699 non-null int64
Bare Nuclei_7                  699 non-null i

In [83]:
data.columns

Index(['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
       'Uniformity of Cell Shape', 'Marginal Adhesion',
       'Single Epithelial Cell Size', 'Bland Chromatin', 'Normal Nucleoli',
       'Mitoses', 'Class', 'Bare Nuclei_1', 'Bare Nuclei_10', 'Bare Nuclei_2',
       'Bare Nuclei_3', 'Bare Nuclei_4', 'Bare Nuclei_5', 'Bare Nuclei_6',
       'Bare Nuclei_7', 'Bare Nuclei_8', 'Bare Nuclei_9'],
      dtype='object')

In [84]:
# Create the model and check several scores
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
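
When the classes are imbalanced, a plain random split can leave the train and test sets with different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions equal. The sketch below uses toy arrays (`X_demo`, `y_demo`, made up for illustration with a 70/30 class mix) rather than the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 70 samples labelled 2 (benign) and 30 labelled 4 (malignant).
y_demo = np.array([2] * 70 + [4] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_share = (y_tr == 4).mean()
test_share = (y_te == 4).mean()
print('Malignant share in train:', train_share)
print('Malignant share in test: ', test_share)
```

Applied to the notebook's data, the call would be the same as the one above with `stratify=y` added.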

In [86]:
model = LogisticRegression()

In [87]:
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
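
The fit above uses every column in `X`, including the `Sample code number` identifier, and applies no feature scaling. An id in the millions sits on a very different scale from the 1 - 10 features and carries no diagnostic information, so one possible variation is to drop it and standardise the remaining features in a pipeline. The sketch below runs on synthetic data generated on the spot (the frame `demo` and its values are made up and only mimic the column names), so it says nothing about the scores the real dataset would give.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic labels and features: malignant rows get higher scores by design.
y_demo = rng.choice([2, 4], size=n, p=[0.65, 0.35])
is_malignant = y_demo == 4
demo = pd.DataFrame({
    'Sample code number': rng.integers(1_000_000, 1_300_000, size=n),
    'Clump Thickness': np.where(is_malignant,
                                rng.integers(5, 11, size=n),
                                rng.integers(1, 5, size=n)),
    'Mitoses': np.where(is_malignant,
                        rng.integers(3, 11, size=n),
                        rng.integers(1, 3, size=n)),
})

# Drop the identifier, then scale and fit in a single pipeline.
X_demo = demo.drop(columns=['Sample code number'])
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_demo, y_demo)

train_accuracy = pipeline.score(X_demo, y_demo)
print('Training accuracy on synthetic data:', train_accuracy)
```

Keeping the scaler inside the pipeline means it is fitted on the training data only when the pipeline is later used with a train/test split.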

In [88]:
y_pred = model.predict(X_test)

In [89]:
# Print the different scores and check the results.
# Note: R2 is a regression metric and is not meaningful for class labels;
# accuracy and the confusion matrix are the relevant measures here.

print('R2 score : ' , r2_score(y_test, y_pred))
print('Accuracy score ', accuracy_score(y_test, y_pred))
print('Confusion matrix', confusion_matrix(y_test, y_pred))


R2 score :  -0.4685314685314683
Accuracy score  0.680952380952381
Confusion matrix [[143   0]
 [ 67   0]]
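
The confusion matrix explains both numbers. Rows are the true classes (2, then 4) and columns are the predicted classes, so the second column being all zeros means the model predicted benign for every test sample. The sketch below recomputes the scores from the matrix alone, treating malignant (4) as the positive class; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Confusion matrix reported above: rows = true (2, 4), columns = predicted.
cm = np.array([[143, 0],
               [67, 0]])
tn, fp, fn, tp = cm.ravel()

# Accuracy equals the benign share of the test set when everything is
# predicted benign; recall on the malignant class is zero.
accuracy = (tn + tp) / cm.sum()
malignant_recall = tp / (tp + fn)

# Rebuild label vectors consistent with the matrix to reproduce the R2 value,
# which only arises because the class codes 2 and 4 are treated as numbers.
y_true = np.array([2] * 143 + [4] * 67)
y_all_benign = np.full(210, 2)
r2 = r2_score(y_true, y_all_benign)

print('Accuracy:', accuracy)
print('Malignant recall:', malignant_recall)
print('R2:', r2)
```

An accuracy of about 0.68 therefore reflects the class balance of the test set (143 of 210 samples are benign), not any ability to detect malignant cases.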
