# **Using Predictive Analysis To Predict Diagnosis of a Breast Tumor**
-By Mohd.Shoaib Nadaf

## 1. Identify the problem
Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer.

## 1.1 Expected outcome
Given the results of a breast fine needle aspiration (FNA) test, a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore or swelling) with a fine needle similar to a blood sample needle, we build a model that classifies a breast tumor into one of two classes:

1= Malignant (Cancerous) - Present
0= Benign (Not Cancerous) -Absent

## 1.2 Objective
Since the labels in the data are discrete, the prediction falls into one of two categories (i.e. malignant or benign). In machine learning this is a classification problem.

Thus, the goal is to classify whether the breast cancer is benign or malignant and predict the recurrence and non-recurrence of malignant cases after a certain period. To achieve this we have used machine learning classification methods to fit a function that can predict the discrete class of new input.
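To make "a function that predicts a discrete class" concrete, here is a minimal sketch. The helper name `predict_class`, the logistic squashing and the 0.5 threshold are assumptions chosen for illustration; the models trained later in this notebook learn their own decision rules.

```python
import numpy as np

def predict_class(scores, threshold=0.5):
    """Map raw model scores to labels: 1 = malignant, 0 = benign (hypothetical helper)."""
    # Squash the scores into (0, 1) with the logistic function
    probabilities = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    # Anything at or above the threshold is assigned to class 1
    return (probabilities >= threshold).astype(int)

labels = predict_class([-2.0, 0.1, 3.0])
print(labels)
```

The point is only that the output is one of two labels, not a continuous value.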

## 1.3 Identify data sources
The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
The columns 3-32 contain 30 real-value features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
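As a quick sanity check of the shape described above, scikit-learn ships a copy of the same Wisconsin diagnostic data. This is a sketch under the assumption that scikit-learn is installed. Note that this copy has no ID column and codes the target the other way round (0 = malignant, 1 = benign), so it is not a drop-in replacement for the CSV used in this notebook.

```python
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Wisconsin diagnostic dataset (no id column)
wdbc = load_breast_cancer()
features, target = wdbc.data, wdbc.target

print(features.shape)        # samples x features
print(wdbc.target_names)     # order of the class codes 0 and 1

# In this copy, 0 = malignant and 1 = benign
n_malignant = int((target == 0).sum())
n_benign = int((target == 1).sum())
print(n_malignant, n_benign)
```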





**Getting Started: Load libraries and set options**


In [None]:
#Breast cancer Detection
#importing library
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


**Load Dataset**


In [None]:
#Load the data
#from google.colab import files 
#uploaded = files.upload()
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.head(7);

**Inspecting the data**

In [None]:
print(df.head())


**Data Cleaning:**

In [None]:
#Count the number of rows and columns 
print("(rows,cols)",df.shape,"rows means no of patients" )


In [None]:
#Count the number of empty values in each column (NAN,NaN,na)
df.isna().sum()

We found that the Unnamed column has 569 NA values (one for every row), so we will drop that column.


In [None]:
#drop col
df = df.dropna(axis=1)

In [None]:
#count the No of rows and cols
df.shape

In [None]:
#count the number of Malignant - M and Benign - B 
df['diagnosis'].value_counts()

In [None]:
#visualize the count (keyword arguments work across seaborn versions)
sns.countplot(x='diagnosis', data=df)

In [None]:
#data type of df
df.dtypes

From the results above, diagnosis is a categorical variable, because it takes a fixed number of possible values (i.e. Malignant or Benign). Machine learning algorithms expect numbers, not strings, as their inputs, so we need to encode these values as numbers.
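A minimal sketch of what the encoding does, on a toy array rather than the real column. `LabelEncoder` sorts the class names, so 'B' becomes 0 and 'M' becomes 1, which matches the 0 = benign, 1 = malignant convention from section 1.1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy diagnosis column (illustrative values, not taken from the dataset)
diagnosis = np.array(['M', 'B', 'B', 'M'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

print(encoder.classes_)   # classes in sorted order; position = numeric code
print(encoded)

# The mapping can be reversed when reporting results
decoded = encoder.inverse_transform(encoded)
print(decoded)
```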

In [None]:
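#Encoding the categorical data values (B -> 0, M -> 1)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
#assign the result back, otherwise the column keeps its string values
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])
df['diagnosis']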

## **Exploratory Data Analysis**

## Objectives of Data Exploration
**Exploratory data analysis (EDA)** is a very important step which takes place after feature engineering and acquiring data and it should be done before any modeling. This is because it is very important for a data scientist to be able to understand the nature of the data without making assumptions. The results of data exploration can be extremely useful in grasping the structure of the data, the distribution of the values, and the presence of extreme values and interrelationships within the data set.

### The purpose of EDA is:

- to use summary statistics and visualizations to better understand the data;
- to find clues about the tendencies of the data and its quality, and to formulate the assumptions and hypotheses of our analysis.

For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.

The next step is to explore the data. Two approaches are used to examine it:

**Descriptive statistics** is the process of condensing key characteristics of the data set into simple numeric metrics. Some of the common metrics used are mean, standard deviation, and correlation.

**Visualization** is the process of projecting the data, or parts of it, into Cartesian space or into abstract images. In the data mining process, data exploration is leveraged in many different steps including preprocessing, modeling, and interpretation of results.

In [None]:
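#basic descriptive statistics (print both, otherwise only the last result is shown)
print(df.describe())
#skewness of each numeric column; numeric_only skips any text columns
print(df.skew(numeric_only=True))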

The skew result shows a positive (right) or negative (left) skew. Values closer to zero show less skew. From the graphs, we can see that radius_mean, perimeter_mean, area_mean, concavity_mean and concave_points_mean are useful in predicting cancer type, due to the distinct grouping between malignant and benign cancer types in these features. We can also see that area_worst and perimeter_worst are quite useful.
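To read the sign of the skew values, a small sketch on made-up numbers (not from the dataset) may help: a symmetric sample has zero skew, a long right tail gives a positive value and a long left tail a negative one.

```python
import pandas as pd

# Illustrative samples only
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])      # long tail to the right
left_skewed = pd.Series([-10, -2, -1, -1, -1])  # mirror image: long tail to the left

print(symmetric.skew())
print(right_skewed.skew())
print(left_skewed.skew())
```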

## **Univariate Data Visualizations**
One of the main goals of visualizing the data here is to observe which features are most helpful in predicting malignant or benign cancer. The other is to see general trends that may aid us in model selection and hyper parameter selection.

Apply 3 techniques that you can use to understand each attribute of your dataset independently.

- Histograms.
- Density Plots.
- Box and Whisker Plots.
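Of the three techniques listed, the box and whisker plot is the one that reduces directly to a handful of numbers. The sketch below computes them with NumPy on made-up values; the 1.5 x IQR whisker rule is the usual matplotlib default, assumed here rather than taken from this notebook.

```python
import numpy as np

# Illustrative values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```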

In [None]:
sns.set_style("white")
sns.set_context({"figure.figsize": (10, 8)})

## Visualise distribution of data via histograms
Histograms are commonly used to visualize numerical variables. A histogram is similar to a bar graph after the values of the variable are grouped (binned) into a finite number of intervals (bins).

Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.
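The binning step described above can be sketched without any plotting. `np.histogram` returns the per-bin counts and the bin edges that a histogram plot would draw; the values below are made up for illustration.

```python
import numpy as np

# Illustrative values only
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Four equal-width bins between the minimum and maximum
counts, edges = np.histogram(values, bins=4)

print(edges)    # bin boundaries
print(counts)   # number of observations in each bin
```

Every bin is half-open except the last one, which also includes its right edge, so the value 5 lands in the final bin.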

**Separate columns into smaller dataframes to perform visualization**

In [None]:
data_id_diag=df.loc[:,["id","diagnosis"]]
data_diag=df.loc[:,["diagnosis"]]

#Slice the feature columns by suffix (column 0 is id, column 1 is diagnosis)
data_mean=df.iloc[:,2:12]
data_se=df.iloc[:,12:22]
data_worst=df.iloc[:,22:]

## Histogram the "_mean" suffix designation

In [None]:
hist_mean=data_mean.hist(bins=10, figsize=(15, 10),grid=False,)

## Histogram the "_se" suffix designation

In [None]:
hist_se=data_se.hist(bins=10, figsize=(15, 10),grid=False,)

## Histogram the "_worst" suffix designation

In [None]:
hist_worst=data_worst.hist(bins=10, figsize=(15, 10),grid=False,)

**Observation**

We can see that perhaps the attributes concavity and concave points may have an exponential distribution. We can also see that perhaps the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

## Visualize distribution of data via density plots
## Density plots "_mean" suffix designation

In [None]:
#Density Plots
#keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes = data_mean.plot(kind= 'density', subplots=True, layout=(4,3), sharex=False, 
                     sharey=False,fontsize=12, figsize=(15,10))

## Density plots "_se" suffix designation

In [None]:
#Density Plots
#keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes = data_se.plot(kind= 'density', subplots=True, layout=(4,3), sharex=False, 
                     sharey=False,fontsize=12, figsize=(15,10))

## Density plots "_worst" suffix designation

In [None]:
#Density Plots
#keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes = data_worst.plot(kind= 'density', subplots=True, layout=(4,3), sharex=False, 
                    sharey=False,fontsize=12, figsize=(15,10))



**Observation**

We can see that perhaps the attributes perimeter, radius, area, concavity and compactness may have an exponential distribution. We can also see that perhaps the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

## **Visualise distribution of data via box plots**
## Box plot "_mean" suffix designation

In [None]:
# box and whisker plots
# keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes=data_mean.plot(kind= 'box' , subplots=True, layout=(4,4), sharex=False, sharey=False,fontsize=12)

## Box plot "_se" suffix designation

In [None]:
# box and whisker plots
# keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes=data_se.plot(kind= 'box' , subplots=True, layout=(4,4), sharex=False, sharey=False,fontsize=12)

## Box plot "_worst" suffix designation

In [None]:
# box and whisker plots
# keep the returned axes under a new name so the pyplot alias plt is not overwritten
axes=data_worst.plot(kind= 'box' , subplots=True, layout=(4,4), sharex=False, sharey=False,fontsize=12)

**Observation**

We can see that perhaps the attributes perimeter, radius, area, concavity and compactness may have an exponential distribution. We can also see that perhaps the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

## **Multivariate Data Visualizations**

**Correlation Matrix**

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
sns.set_style("white")

# Compute the correlation matrix (numeric_only skips any text columns)
corr = data_mean.corr(numeric_only=True)

# Generate a mask for the upper triangle (np.bool was removed from NumPy, use bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure (named fig so the dataframe df is not overwritten)
fig, ax = plt.subplots(figsize=(8, 8))
plt.title('Breast Cancer Feature Correlation')

# Generate a custom diverging colormap
cmap = sns.diverging_palette(260, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, vmax=1.2, square=True, cmap=cmap, mask=mask, 
            ax=ax,annot=True, fmt='.2g',linewidths=2)

**Observation:**

We can see that a strong positive relationship (r between 0.75 and 1) exists among several of the mean-value parameters:

- The mean area of the tissue nucleus has a strong positive correlation with the mean values of radius and perimeter.

- Some parameters are moderately positively correlated (r between 0.5 and 0.75), for example concavity and area, or concavity and perimeter.

- Likewise, we see some strong negative correlation between fractal_dimension and the mean values of radius, texture and perimeter.
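Part of the radius, perimeter and area correlation is simply geometry. The sketch below assumes idealized circular nuclei, which real nuclei are not, and synthetic radii rather than values from the dataset. It shows that perimeter is perfectly correlated with radius and area is nearly so.

```python
import numpy as np
import pandas as pd

# Synthetic radii for idealized circles (an assumption for illustration)
radius = np.linspace(5.0, 25.0, 50)
circles = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,   # linear in the radius
    "area": np.pi * radius ** 2,       # quadratic in the radius
})

# Pearson correlation matrix, as used for the heatmap
circle_corr = circles.corr()
print(circle_corr.round(3))
```

Such strongly related features carry largely the same information, which is worth keeping in mind when choosing inputs for a model.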

**Scatter Plot Matrix**

In [None]:
#reload and clean the data
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df = df.dropna(axis=1)
#Encoding the categorical data values (B -> 0, M -> 1)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
#assign the result back, otherwise the column keeps its string values
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])

#create pair plot 
#columns 1 to 7: diagnosis plus the first six mean features
import seaborn as sns
sns.pairplot(df.iloc[:,1:8],hue='diagnosis')

**Summary**

- Mean values of cell radius, perimeter, area, compactness, concavity and concave points can be used in classification of the cancer. Larger values of these parameters tend to show a correlation with malignant tumors.

- Mean values of texture, smoothness, symmetry and fractal dimension do not show a particular preference for one diagnosis over the other.

- None of the histograms shows noticeably large outliers that would warrant further cleanup.

In [None]:
#correlation between the feature columns
df.iloc[:,2:].corr()

In [None]:
#split the dataset into independent x and dependent y 

#columns 2 onward hold all 30 features (2:31 would drop the last one)
X = df.iloc[:,2:].values
Y = df.iloc[:,1].values

In [None]:
#split dataset 75% into training and 25% into testing
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test= train_test_split(X,Y,test_size= 0.25 , random_state =0) 

In [None]:
# feature scaling / data normaization 

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
#reuse the statistics learned from the training set; do not refit on the test set
X_test = sc.transform(X_test)

X_train
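A small sketch of the scaling step on toy numbers (the `_toy` arrays are made up for illustration): the scaler learns the mean and standard deviation from the training data only, and the test data is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, not from the dataset
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [4.0]])

scaler = StandardScaler()
# fit_transform: learn mean and std from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train_toy)
# transform: apply the training mean and std to the test data
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, scaler.scale_)
print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the same raw measurement could end up with different scaled values in the two sets.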

In [None]:
# create a function that trains three classifiers and returns them
def models(X_train , Y_train):
  #Logistic Regression 

  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state = 0)
  log.fit(X_train,Y_train)

  #Decision Tree

  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy' ,random_state = 0 )
  tree.fit(X_train,Y_train)

  #Random Forest
  from sklearn.ensemble import RandomForestClassifier
  forest =  RandomForestClassifier(n_estimators = 10 , criterion = 'entropy', random_state = 0)
  forest.fit(X_train,Y_train)
  #Print the accuracy of each model on the training data
  print('[0] Logistic Regression training accuracy: ', log.score(X_train,Y_train))
  print('[1] Decision Tree training accuracy: ', tree.score(X_train,Y_train))
  print('[2] Random Forest training accuracy: ', forest.score(X_train,Y_train))
  return log,tree,forest

In [None]:
#getting all models 
model = models(X_train,Y_train)

In [None]:
#test model accuracy on test data using the confusion matrix 

from sklearn.metrics import confusion_matrix
for i in range(len(model)):
  print('model:',i)
  #rows are actual classes, columns are predicted classes, in sorted label order (benign first)
  cm = confusion_matrix(Y_test,model[i].predict(X_test))

  TN = cm[0][0]
  TP = cm[1][1]
  FN = cm[1][0]
  FP = cm[0][1]
  print(cm)
  print('Testing accuracy : ', (TP+TN)/(TP+TN+FN+FP))
  print()

With malignant as the positive class, each printed matrix is laid out as:

[[True negative False positive]

 [False negative True positive]] 
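To check how the four cells are read, here is a sketch on hand-made labels (0 = benign, 1 = malignant, toy values only). scikit-learn puts the actual classes on the rows and the predicted classes on the columns, in sorted label order, so `ravel()` yields TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = benign, 1 = malignant
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()
recall = tp / (tp + fn)       # share of malignant cases that were caught
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(cm_toy)
print(accuracy, recall, precision)
```

For a diagnosis task, recall on the malignant class is usually watched closely, since a false negative means a missed cancer.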

In [None]:
#show another way to get metrics of the model 

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

for i in range(len(model)):
  print('model :',i)
  print(classification_report(Y_test,model[i].predict(X_test)))
  print(accuracy_score(Y_test,model[i].predict(X_test)))



In [None]:
#print the prediction of Random forest classifier model 
pred = model[2].predict(X_test)
print(pred)
print()
print(Y_test)

## Saving the Output into CSV file 
- column1 = Actual
- column2 = predicted 

In [None]:
#column 1 = actual diagnosis, column 2 = predicted diagnosis
op = pd.DataFrame({'Actual': Y_test, 'Predicted': pred})

#save the op dataframe into csv
op.to_csv('pred_op.csv', index=False)

op