# Problem
The data set contains the breast cancer diagnosis (malignant or benign) and diagnostic measurements for 569 patients.  
It has the following columns: 
1. Patient ID
2. Diagnosis (M = Malignant, B = Benign)
3. Columns 3-32 contain 10 real-valued features, listed below, each reported as Mean (cols 3-12), Standard Error (cols 13-22) and Worst (cols 23-32)

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" - 1)

I want to develop a basic machine learning model that predicts the diagnosis from the 10 variables listed above, and then visualize the known diagnoses alongside the predicted ones to check the accuracy of the model on the test data.  

Steps I followed:
1. Read, clean, sort and inspect the data.
2. Analyze the correlation between variables.
3. Apply ML methods
    3.1 Split the data into train and test sets
    3.2 Train a logistic regression classifier on the training data
    3.3 Visualize the predictions to compare the accuracy of each trial
4. Predict on the test data set
5. Open question: the data has 30 features in all (the mean, standard error and worst value of each of the 10 parameters listed above). How can one leverage all of this information to develop a better model? Is dimension reduction needed, or are the insights derived from the correlation among features good enough?  

In [None]:
#Import required python libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
data = pd.read_csv("../input/data.csv", index_col=0)

In [None]:
data.head()

In [None]:
data.info()

Let's remove the empty column "Unnamed: 32".

In [None]:
data = data.drop(['Unnamed: 32'], axis =1)

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.isna().any() # Check every column for missing data (no .head(), which would show only the first five)

In [None]:
sns.countplot(x = 'diagnosis', data = data).set_title('Number of cases for each diagnosis')

In [None]:
data.describe()

In [None]:
sns.pairplot(data, hue = 'diagnosis',palette='coolwarm')

In [None]:
dataM=data[data['diagnosis'] == "M"]
dataB=data[data['diagnosis'] == "B"]

Visualize the distribution of the mean value of each diagnostic image parameter for ``M`` and ``B`` cases.  

In [None]:
sns.kdeplot(dataM.texture_mean, fill=True, label= "M");
sns.kdeplot(dataB.texture_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.radius_mean, fill=True, label= "M");
sns.kdeplot(dataB.radius_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.area_mean, fill=True, label= "M");
sns.kdeplot(dataB.area_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.perimeter_mean, fill=True, label= "M");
sns.kdeplot(dataB.perimeter_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.smoothness_mean, fill=True, label= "M");
sns.kdeplot(dataB.smoothness_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.compactness_mean, fill=True, label= "M");
sns.kdeplot(dataB.compactness_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM.concavity_mean, fill=True, label= "M");
sns.kdeplot(dataB.concavity_mean, fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM['concave points_mean'], fill=True, label= "M");
sns.kdeplot(dataB['concave points_mean'], fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM['symmetry_mean'], fill=True, label= "M");
sns.kdeplot(dataB['symmetry_mean'], fill=True, label= "B");
plt.legend();

In [None]:
sns.kdeplot(dataM['fractal_dimension_mean'], fill=True, label= "M");
sns.kdeplot(dataB['fractal_dimension_mean'], fill=True, label= "B");
plt.legend();

From the correlation among the mean parameters (see the pair plot above) and from the distributions of their values for ``M`` and ``B`` cases, we can make the following observations:

1. radius_mean, perimeter_mean and area_mean are strongly correlated and similarly distributed, which suggests that they carry largely the same information about the diagnosis. 
2. The same holds for concavity_mean and concave points_mean. 
3. The distributions of fractal_dimension_mean for ``M`` and ``B`` cases almost overlap, which suggests that the diagnosis hardly depends on this particular parameter.
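
The correlations noted above can also be checked numerically rather than by eye. The sketch below is one hedged way to do it: `correlated_pairs` is a hypothetical helper (not part of the original notebook), the 0.9 threshold is an arbitrary choice, and the small frame is synthetic stand-in data so that the cell runs on its own.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in data (NOT the real measurements): perimeter and area
# are derived from radius, texture is independent noise.
rng = np.random.default_rng(0)
radius = rng.uniform(7, 25, size=200)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 1, size=200),
    "area_mean": np.pi * radius ** 2,
    "texture_mean": rng.normal(19, 4, size=200),
})
high_corr = correlated_pairs(demo)
print(high_corr)
```

On the notebook's data, the call would be `correlated_pairs(data.filter(like='_mean'))`, which selects the 10 mean columns and leaves out the diagnosis.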

# Analysis and modelling (Trial 1)

For the first trial, I am using all 10 mean values as input variables for predictive modelling. 
LogisticRegression is used to fit and predict the data.

There are 569 entries in this data set; I am using the first 400 for training and the remaining 169 for testing the prediction, 
in a similar way to what we did for the cryptocurrency example in the last class. 

This problem has been done by hundreds of Kaggle users, but my approach here is very basic. I am using all 10 mean parameter values even though some of them are highly correlated among themselves (e.g. radius, perimeter and area, and likewise concavity and concave points).

In [None]:
train_data = data[0:400]
train_data.shape

In [None]:
test_data = data[400:]
test_data.shape
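
Slicing by row position only gives a representative test set if the row order is unrelated to the diagnosis. A hedged alternative, not used in the trials here, is a shuffled split that keeps the M/B proportions the same in both parts. `split_stratified` and the toy frame below are illustrative assumptions; `train_test_split` is scikit-learn's own function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified(frame, label_col="diagnosis", test_size=169, seed=0):
    """Shuffle the rows and split them, preserving the class proportions."""
    return train_test_split(
        frame,
        test_size=test_size,          # an int means an absolute number of rows
        stratify=frame[label_col],    # keep the M/B ratio equal in both parts
        random_state=seed,            # fixed seed for a reproducible split
    )

# Toy frame (synthetic, for illustration only): 12 benign and 8 malignant rows
toy = pd.DataFrame({
    "diagnosis": ["B"] * 12 + ["M"] * 8,
    "radius_mean": range(20),
})
toy_train, toy_test = split_stratified(toy, test_size=5)
print(toy_train["diagnosis"].value_counts())
print(toy_test["diagnosis"].value_counts())
```

With the notebook's data, `train_data, test_data = split_stratified(data)` would keep the same 400/169 sizes, although the reported accuracies would then differ from the ones obtained with the positional split.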

In [None]:
from scipy import stats
from sklearn import linear_model
logreg = linear_model.LogisticRegression(solver='liblinear')

In [None]:
data.columns

In [None]:
logreg.fit(train_data[['radius_mean','texture_mean', 'perimeter_mean','area_mean', 'smoothness_mean', 'compactness_mean', 
                         'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']],
         train_data['diagnosis']);
slopes_list = logreg.coef_
u = logreg.intercept_

In [None]:
print(slopes_list,u)
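
The printed slopes and intercept define the fitted model completely: for a feature vector x, the predicted probability of the second class in `logreg.classes_` (here ``M``, because the labels sort as B, M) is sigmoid(w·x + u). The sketch below checks that relation on synthetic stand-in data; the arrays and variable names are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two features, labels "B"/"M" (not the real data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.where(X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0, "M", "B")

model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)
w = model.coef_[0]        # one slope per feature
u = model.intercept_[0]   # intercept

# Probability of classes_[1] computed by hand from slopes and intercept
z = X_demo @ w + u
p_manual = 1.0 / (1.0 + np.exp(-z))

# Same quantity as reported by scikit-learn
p_sklearn = model.predict_proba(X_demo)[:, 1]
print(model.classes_, np.allclose(p_manual, p_sklearn))
```

A positive slope therefore pushes the prediction towards ``M`` as that feature grows, but the magnitudes are only comparable across features if the features are on similar scales.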

In [None]:
predicted_diag = logreg.predict(test_data[['radius_mean','texture_mean', 'perimeter_mean', 'area_mean','smoothness_mean', 'compactness_mean', 
                         'concavity_mean', 'concave points_mean','symmetry_mean', 'fractal_dimension_mean']]);

In [None]:
data_predicted = test_data.copy()
data_predicted ["Predicted_diagnosis"] = predicted_diag.tolist()
data_predicted[['diagnosis','Predicted_diagnosis']].head()

# Result (Trial 1) 
Accuracy of prediction for Trial 1 is ~89.9%

In [None]:
fig,ax =plt.subplots(1,2)
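# Pass the column as x= and fix the category order so both panels are directly comparable
sns.countplot(x=data_predicted['Predicted_diagnosis'], order=['B', 'M'], ax=ax[0]).set_title('Predicted (10 mean parameters)')
sns.countplot(x=data_predicted['diagnosis'], order=['B', 'M'], ax=ax[1]).set_title('Actual')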

In [None]:
# Percentage of test rows whose prediction matches the known diagnosis (no hard-coded row count)
test_prediction_accuracy = (data_predicted["Predicted_diagnosis"] == data_predicted['diagnosis']).mean()*100
test_prediction_accuracy
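
Accuracy alone does not show which kind of mistake the model makes, and for this problem a malignant case predicted as benign is arguably the more costly error. A confusion matrix separates the two. The labels below are a made-up toy example, not this notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration only
actual    = ["M", "M", "B", "B", "B", "M"]
predicted = ["M", "B", "B", "B", "M", "M"]

acc = accuracy_score(actual, predicted)
# Rows are the actual class, columns the predicted class, in the order B, M
cm = confusion_matrix(actual, predicted, labels=["B", "M"])
print(f"accuracy = {acc:.3f}")
print(cm)
```

For Trial 1 the call would be `confusion_matrix(data_predicted['diagnosis'], data_predicted['Predicted_diagnosis'], labels=['B', 'M'])`; the bottom-left entry counts malignant cases predicted as benign.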

# Analysis and modelling (Trial 2)

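For the second trial, I am using only the mean values of texture, perimeter, smoothness, compactness, concavity and symmetry. Compared with Trial 1, this leaves out radius and area (strongly correlated with perimeter), concave points (strongly correlated with concavity) and fractal dimension (nearly overlapping distributions for ``M`` and ``B``), in line with the observations above. Again, LogisticRegression is used to fit and predict the data.

As in Trial 1, the training and test sets have 400 and 169 entries respectively. 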

In [None]:
data1 = data.copy()
logreg.fit(train_data[['texture_mean', 'perimeter_mean', 'smoothness_mean', 'compactness_mean', 
                         'concavity_mean', 'symmetry_mean']],
         train_data['diagnosis']);
slopes_list1 = logreg.coef_
u1 = logreg.intercept_
print(slopes_list1,u1)


In [None]:
prediction2= logreg.predict(test_data[['texture_mean', 'perimeter_mean', 'smoothness_mean', 'compactness_mean', 
                                       'concavity_mean', 'symmetry_mean']]);
data_predicted ["Predicted_diagnosis2"] = prediction2.tolist()
data_predicted[['diagnosis','Predicted_diagnosis','Predicted_diagnosis2']].head()

# Result (Trial 2) 
Accuracy of prediction for Trial 2 is ~91.1%

In [None]:
# Percentage of test rows whose Trial 2 prediction matches the known diagnosis
test_prediction_accuracy2 = (data_predicted["Predicted_diagnosis2"] == data_predicted['diagnosis']).mean()*100
test_prediction_accuracy2
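
On 169 test rows, the difference between ~89.9% and ~91.1% amounts to two rows (152 versus 154 correct), so a single split cannot say much about which feature set is better. Cross-validation averages over several splits and is one hedged way to compare them. The sketch uses synthetic data from `make_classification` as a stand-in; with the notebook's data, `X` would be the chosen feature columns and `y` the diagnosis column.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature table (569 rows, 10 features)
X, y = make_classification(n_samples=569, n_features=10, n_informative=5,
                           random_state=0)

model = LogisticRegression(solver="liblinear")
# 5-fold cross-validation: each row is used for testing exactly once
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running the same call once per feature set and comparing the means and spreads gives a steadier picture than one 400/169 split.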

In [None]:
fig,ax =plt.subplots(1,2)
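# Pass the column as x= and fix the category order so both panels are directly comparable
sns.countplot(x=data_predicted['diagnosis'], order=['B', 'M'], ax=ax[0]).set_title('Actual')
sns.countplot(x=data_predicted['Predicted_diagnosis2'], order=['B', 'M'], ax=ax[1]).set_title('Predicted (Trial 2)')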
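
For the open question in step 5 (whether all 30 features call for dimension reduction), one hedged way to experiment is a pipeline that standardizes the features, projects them onto a few principal components, and fits the same logistic regression. This is only a sketch of the idea: `build_pca_model`, the choice of 5 components and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_pca_model(n_components=5):
    """Scale features, reduce them with PCA, then classify."""
    return make_pipeline(
        StandardScaler(),                  # PCA is sensitive to feature scale
        PCA(n_components=n_components),    # keep the leading components
        LogisticRegression(solver="liblinear"),
    )

# Synthetic stand-in with 30 partly redundant features (not the real data)
X, y = make_classification(n_samples=569, n_features=30, n_informative=8,
                           n_redundant=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=169, stratify=y, random_state=0)

pca_model = build_pca_model().fit(X_train, y_train)
explained = pca_model.named_steps["pca"].explained_variance_ratio_.sum()
print(f"variance kept by 5 components: {explained:.2f}")
print(f"test accuracy: {pca_model.score(X_test, y_test):.3f}")
```

Comparing this against a model fitted on hand-picked columns, for several values of `n_components`, would show whether dimension reduction adds anything beyond the correlation-based selection.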
