# Sample Code for Breast Cancer Diagnosis

## First, I import all the libraries I need:
1. __Pandas__, for data manipulation;
2. __train_test_split__, for dividing the data into a training set and a testing set;
3. __metrics__, to measure how well my model's predictions match the actual labels; and
4. __RandomForestClassifier__, as it is my model of choice for this application.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier


## After that, I prepare the dataset. Please disregard the directory name.
1. I load the contents of the .csv file with _pd.read_csv_.
2. I replace the M and B labels with 1 and 0 respectively through _.replace_, so the model receives numeric targets.
3. I shuffle the dataset through _.sample_.
4. I drop the "id" and "Unnamed: 32" columns, as they won't contribute to the model's learning.
5. I then check whether the data contains any NaN values; thankfully, there are none.

In [3]:
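# Load the dataset (adjust the path to wherever data.csv lives)
breast_set = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\data.csv")

# Encode the target: malignant (M) -> 1, benign (B) -> 0
breast_set["diagnosis"] = breast_set["diagnosis"].replace(["M","B"], [1,0])

# Shuffle the rows, then drop the columns that carry no predictive information
breast_set = breast_set.sample(frac=1)
breast_set = breast_set.drop(["id","Unnamed: 32"], axis=1)

# Count the missing values per column, then summarize the numeric columns
breast_set_null = breast_set.isnull().sum()
breast_set.describe()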

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 0.372583 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 0.483918 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 0.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 0.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 0.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 1.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 1.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |
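
The preparation steps can be tried out on a tiny frame before running them on the real file. The snippet below is only a sketch: the four rows are made-up values, the "Unnamed: 32" column is assumed to be empty, and `toy` and `n_missing` are illustrative names. It uses an explicit `.map` dictionary as one alternative way to encode the labels, and a fixed `random_state` so the shuffle is repeatable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data.csv: the values are made up purely for illustration
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 11.42, 12.45, 20.57],
    "Unnamed: 32": [np.nan] * 4,  # assumed to be an empty column for this sketch
})

# Encode the labels with an explicit mapping: M -> 1, B -> 0
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# Shuffle with a fixed seed so the row order is reproducible
toy = toy.sample(frac=1, random_state=0)

# Drop the identifier and the empty column
toy = toy.drop(columns=["id", "Unnamed: 32"])

# Total number of missing values left in the frame
n_missing = int(toy.isnull().sum().sum())
print(toy)
print("missing values:", n_missing)
```

Running the NaN check after the drop matters in this sketch, because the empty column would otherwise be counted as missing data.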


## Next, I define the X and y values and split the data.
1. I set y to be the first column ("diagnosis"), as it holds the outputs.
2. I set X to be the remaining columns, as they are the features the model learns from.
3. I then split the data with train_test_split, keeping 20% of the rows for testing.

In [4]:
X = breast_set.iloc[:,1:]
y = breast_set.iloc[:,0]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.20, random_state=5)
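
To see what this split produces, here is a small sketch on placeholder arrays with the same number of rows as the dataset (569). The class counts of 357 benign and 212 malignant are inferred from the mean of the diagnosis column in the summary above (0.372583 × 569 ≈ 212). The `stratify` argument is an optional extra that the code above does not use, and `X_demo`, `y_demo`, and the other names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the same row count as the dataset
X_demo = np.zeros((569, 30))
y_demo = np.array([0] * 357 + [1] * 212)

# Same settings as the split above: 20% of the rows go to the test set
tr_X, te_X, tr_y, te_y = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=5
)
print(len(tr_X), len(te_X))

# Optional variant: stratify keeps the class ratio similar in both parts
_, _, _, te_y_strat = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=5, stratify=y_demo
)
print("malignant share in stratified test set:", te_y_strat.mean())
```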

## Model Construction
1. I declare a Random Forest Classifier as my model. I chose the maximum depth and the random state by manually testing several values and keeping the ones that gave the best outputs.
2. I fit the model with the training data.
3. I then generate predictions for the test data through *.predict*.
4. I then compare the predictions with the actual labels to see how my model performed.

In [31]:
model = RandomForestClassifier(max_depth=7, random_state=2)
model.fit(train_X,train_y)
predictions = model.predict(test_X)
classification_report(test_y, predictions)
confusion_matrix(test_y,predictions)

array([[71,  0],
       [ 2, 41]], dtype=int64)
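
The depth above was picked by manual testing. One way such a search could be organized is sketched below, with the caveat that it is not the procedure used here: it runs on synthetic data from `make_classification`, the candidate depths are arbitrary, and it scores each depth with 5-fold cross-validation instead of the held-out test set. `depth_scores` and `best_depth` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate depths are arbitrary illustration values
depth_scores = {}
for depth in (3, 5, 7, 9):
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=2)
    # Mean accuracy over 5 cross-validation folds
    depth_scores[depth] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

# Depth with the highest mean cross-validated accuracy
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print("best depth:", best_depth)
```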

In [32]:
accuracy_score(test_y,predictions)

0.9824561403508771
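
The accuracy can be cross-checked by hand from the confusion matrix printed earlier. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, so with 0 = benign and 1 = malignant the matrix reads as 71 true negatives, 0 false positives, 2 false negatives, and 41 true positives. The sketch below redoes the arithmetic; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true labels, columns = predictions
cm = np.array([[71, 0],
               [2, 41]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 112 / 114
precision = tp / (tp + fp)        # 41 / 41
recall = tp / (tp + fn)           # 41 / 43
print(accuracy, precision, recall)
```

The two misclassified samples are both malignant cases predicted as benign, which is why recall for the malignant class is lower than precision.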

## Model testing with real data
1. I load the test data; please disregard the directory.
2. I use my model to predict the diagnosis for each row of the test data.
3. I use a list comprehension to convert the numeric predictions back into the M and B labels.
4. I then write the results to the output .csv file.

In [14]:
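# Load the unlabeled data; its columns must match the training features
test_data = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\test.csv")
prediction_test = model.predict(test_data)
# Map the numeric predictions back to labels (1 -> M, 0 -> B)
pred_conv = ['M' if i > 0.5 else 'B' for i in prediction_test]
# to_csv returns None when given a path, so keep the DataFrame and write it separately
prediction_test_out = pd.DataFrame(pred_conv, columns=['diagnosis'])
prediction_test_out.to_csv("C:\\Users\\Dingus-Elite\\Downloads\\output.csv")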
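
The prediction step expects the new file to have the same feature columns as the training data. The sketch below shows one way to guard against extra or reordered columns and to write the labels without the row index. Everything in it is hypothetical: the column names `f1` and `f2`, the tiny made-up frames, and the in-memory buffer that stands in for the output file.

```python
import io
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set; the column names are hypothetical
train_demo = pd.DataFrame({"f1": [0.1, 0.9, 0.2, 0.8], "f2": [1.0, 3.0, 1.2, 2.9]})
labels_demo = pd.Series([0, 1, 0, 1])
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(train_demo, labels_demo)

# New data arrives with an extra column and a different column order
new_demo = pd.DataFrame({"id": [1, 2], "f2": [1.1, 3.1], "f1": [0.15, 0.85]})

# Keep only the columns the model was trained on, in the same order
aligned = new_demo[train_demo.columns]

# Convert the numeric predictions back to the M/B labels
labels_out = pd.Series(clf.predict(aligned)).map({1: "M", 0: "B"})

# index=False leaves the row index out of the written file
buffer = io.StringIO()
pd.DataFrame({"diagnosis": labels_out}).to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a file path instead of `buffer` would write the same content to disk.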