# Breast Cancer Dataset Exploration

## Introduction

Breast cancer (BC) is one of the most common cancers among women worldwide and accounts for a substantial share of new cancer cases and cancer-related deaths, making it a significant public health problem in today’s society.
In this notebook we are going to predict whether a breast tumor is malignant or benign.

In [3]:
from IPython.display import Image
Image(filename='Image/Breast Cancer.jpeg')

<IPython.core.display.Image object>

## Steps Involved:
 - Importing the Library
 - Loading the Dataset
 - Structure of the Dataset
 - Univariate Exploration
 - Bivariate Exploration
 - Multivariate Exploration
 - Using Inbuilt KNN Classifier
 - Implementation of KNN Classifier
 - Conclusion

## Importing the Library

In [1]:
#Import all the required packages such as numpy, pandas etc.
#Also import the libraries used for the plots

## Loading the Dataset

In [8]:
'''Load the dataset of Breast Cancer''';
#Use the inbuilt dataset feature of sklearn for importing the Data;
#Put that object in some variable like cancer_data or anything else;
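As a minimal sketch of this step, assuming scikit-learn is installed, the bundled copy of the dataset can be loaded with `load_breast_cancer`. The variable name `cancer_data` follows the suggestion in the cell above.

```python
from sklearn.datasets import load_breast_cancer

# load_breast_cancer returns a Bunch: a dict-like object with attribute access
cancer_data = load_breast_cancer()

# Fields available on the object (data, target, feature_names, ...)
print(cancer_data.keys())

# Number of samples and number of numeric features
print(cancer_data.data.shape)
```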

Let's take a look at the data

In [9]:
#Print the data in cancer_data with inbuilt .data in sklearn;

The dataset also provides the target values

In [10]:
#Print the Target value in cancer_data with .target in sklearn;

**The target takes two values, since this dataset is meant for a classification model**

0 means Malignant

1 means Benign
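The label encoding can be checked directly instead of being memorised. This sketch assumes scikit-learn's bundled copy of the dataset, where `target_names` is indexed by label value (position 0 names label 0, which is `malignant` in that copy).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Print each label value, its class name, and how many samples carry it
for label, name in enumerate(cancer_data.target_names):
    count = int(np.sum(cancer_data.target == label))
    print(label, name, count)
```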

In [12]:
#Convert cancer_data into a DataFrame for ease of analysis, as pandas provides several inbuilt functions
#Use numpy for this; cancer_data.feature_names holds the list of all the feature names
#cancer_data holds the data and the target; combine both with numpy and use pd.DataFrame to turn cancer_data
#into a DataFrame, then save this object in another variable such as df.;
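One possible way to do the conversion is sketched below. It assumes the names `cancer_data` and `df` from the comments above, and it uses `np.c_` to place the target next to the feature columns. The column label `"target"` is a choice made here, not something fixed by the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# np.c_ stacks the 30 feature columns and the target column side by side
values = np.c_[cancer_data.data, cancer_data.target]

# Column labels: the 30 feature names followed by our own "target" label
columns = np.append(cancer_data.feature_names, "target")

df = pd.DataFrame(values, columns=columns)
print(df.shape)
```

Because the target is stacked into the same float array as the features, the `target` column ends up as floating point (0.0 and 1.0) in this version.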

## What is the structure of the Dataset?

In [13]:
#Print the shape of the Dataframe using .shape function

Here you'll find that the DataFrame contains 569 records and 31 columns (30 features plus the target). You can get a glimpse of the dataset below.

In [14]:
#print the first five records of the dataset. Use .head function

### Missing or Null Data points

In [15]:
#Check whether any null value is present. Use the isnull function for this.

In [16]:
#Check whether any NaN value is present. Use the isnan function for this.
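The two checks above could look roughly like this. The sketch assumes a DataFrame `df` built from the scikit-learn dataset. `isnull` is the pandas method and `np.isnan` is the NumPy function.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df["target"] = cancer_data.target

# pandas view: number of null entries per column, then the grand total
null_per_column = df.isnull().sum()
print(null_per_column.sum())

# NumPy view: True if any entry of the underlying float array is NaN
has_nan = bool(np.isnan(df.to_numpy(dtype=float)).any())
print(has_nan)
```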

## Univariate Exploration

Before you proceed further into the univariate analysis, let's take a look at the statistical summary of the data and at how complete it is

In [17]:
#For statistical summary of the Dataframe use .describe function provided by the pandas library

#### Get some insight from the above output

In [18]:
#Count how many non-null values are present in the DataFrame. Use the isnull and .count functions.

From the above you'll find how many non-null values each feature has. Try to get some more information about
the feature values present in the DataFrame

Let's take a look at it graphically, for a better understanding.

In [19]:
#Plot a bar graph of the above counts, with the feature names on the x-axis and the number of non-null values on the y-axis
'''Use .count, isnull(),.sum(),.plot(),.show() function.
   For giving the name of the axes you can use .set function
   You are free to choose any other method also.''';
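A possible version of this plot is sketched below, assuming matplotlib is available and `df` holds the 30 feature columns. `DataFrame.count` already skips null entries, so it gives the non-null count per feature directly.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

# count() ignores null entries, so this is the non-null count per feature
non_null_counts = df.count()

# Bar chart with one bar per feature
ax = non_null_counts.plot(kind="bar", figsize=(12, 4))
ax.set(xlabel="feature", ylabel="non-null values")
plt.tight_layout()
plt.show()
```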

Here you'll find out whether there are rows with null values that need handling.
If yes, either remove (drop) those rows or use a statistical method to fill in the missing values.
If no, you can move ahead.

###  radius

mean of distances from center to points on the perimeter

In [20]:
#Plot the histogram of the mean radius feature in the DataFrame to check the distribution of its values
'''Use .plot, .hist, .show function for this. You are allowed to take help of other graphs too.''';

In [21]:
#Use describe for the feature that we are dealing now, i.e mean radius
'''use .to_frame() function with .describe function''';

In [22]:
#Check for the range of number where feature value is more concentrated
'''Use .plot,.box,.show function for this.
   Try to put some colors for better visualization by using color parameter in .box function''';

From the above you'll see that most of the data is concentrated in the range of _ to _, with a maximum value of _ and a mean of _.
Here "_" stands for a value that you are going to find by evaluating the graph
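The blanks can also be read off numerically. In a box plot the box spans the first to the third quartile, so the sketch below takes that interquartile range as "the range where the data is concentrated". That reading is an assumption made here, and other definitions are equally reasonable.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

radius = df["mean radius"]

# The box of a box plot covers the middle 50% of the values
q1, q3 = radius.quantile([0.25, 0.75])

print(f"middle 50% of values: {q1:.2f} to {q3:.2f}")
print(f"max: {radius.max():.2f}, mean: {radius.mean():.2f}")
```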

In [23]:
#Use that range to plot the values as a histogram and check whether any skewness is present or not.
'''Use .query function for this''';
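A sketch of the filtering step follows. The 5th and 95th percentiles used as bounds are placeholders chosen for illustration, so substitute the range you read from the box plot. Backtick-quoted column names in `query` need pandas 0.25 or newer.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

# Hypothetical bounds: replace with the range found from the box plot
low, high = df["mean radius"].quantile([0.05, 0.95])

# Backticks quote the column name because it contains a space;
# @low and @high refer to the Python variables defined above
subset = df.query("`mean radius` >= @low and `mean radius` <= @high")

# Skewness near 0 suggests a roughly symmetric distribution;
# a positive value indicates a longer right tail
print(len(subset), subset["mean radius"].skew())
```

Calling `subset["mean radius"].plot.hist()` on the result would then give the histogram the cell asks for.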

From the above you'll see whether the feature is normally distributed or whether any skewness is present.

### Note: Similarly perform this analysis for all the features for better understanding of the data

**Is data filtering or preprocessing required?**

After gaining insight into all the values and features, check whether any filtering is required.
If yes, use a suitable method for that.

**Of the features you investigated, were there any unusual distributions?**

Using the graphical analysis, check whether any skewness is present or not.
If yes, apply a suitable correction for it.

## Bivariate Exploration

Let's take a look at this very interesting section. Here you can actually check the correlation.

In [24]:
#For checking the correlation, you'll use heatmap for that.
#use Seaborn library for that.
'''Use .set,.figure,.heatmap,.show function for this.
   You can play with the heat map by using several parameters present in that, like linecolor etc.''';
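Before drawing anything, it can help to look at the matrix the heat map is built from. The sketch below only computes the pairwise Pearson correlations with pandas and assumes `df` holds the 30 feature columns.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

# Pairwise Pearson correlation between all feature columns
corr = df.corr()
print(corr.shape)

# Features most strongly correlated (in absolute value) with mean radius
print(corr["mean radius"].abs().sort_values(ascending=False).head())
```

Passing `corr` to seaborn's `heatmap` function (for example with `annot` or `linecolor` set) would produce the plot described in the cell above.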

A heat map uses a warm-to-cool color spectrum to show the values of a matrix; here each cell shows how strongly a pair of features is correlated.

In [26]:
'''Let's divide the features according to their category''' ;
#Take out the features in set of 10, like put the first ten features in one list, next ten in other and rest in other.
#Print the features to check whether your code is working fine or not.
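The split into groups of ten can be done with plain list slicing, as in the sketch below. The list names are made up for illustration. In scikit-learn's copy of the dataset the three groups correspond to the mean, the standard error, and the worst value of each measurement.

```python
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
features = list(cancer_data.feature_names)

# Three consecutive groups of ten feature names (hypothetical list names)
mean_features = features[:10]
error_features = features[10:20]
worst_features = features[20:]

print(mean_features)
print(error_features)
print(worst_features)
```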

In [27]:
'''You can plot the heat map for different sets of the list and check for the correlation in more deeply. '''
'''You can skip this part, if you got the idea from the above heat map''';

## Note:
    Try plotting different feature1 vs feature2 graphs to understand the various relations more precisely.
    You can use the matplotlib or seaborn library for this.

**Is there any relations among the features?**

After computing the correlation and plotting several features against each other, you'll come to know some relations between them.
Check whether each relation is useful or unremarkable. This helps you drop any features that are of no use, or you
can go ahead with all the features.

## Multivariate Exploration

In this section let's discuss the dataset and how to proceed with the implementation of the algorithm

In [28]:
#Print the list of the all the features present in the dataset
'''Use .columns function with the dataframe''';

###  Split of the dataframe

In [29]:
#Before moving to algorithm section, first split the data into training and testing part.
'''Use train_test_split function for that''';
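A minimal sketch of the split is shown below. The 25% test share, the fixed `random_state`, and the stratified sampling are choices made here for reproducibility, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_data = load_breast_cancer()

# Hold out a quarter of the rows for testing; stratify keeps the
# malignant/benign proportions similar in both parts
x_train, x_test, y_train, y_test = train_test_split(
    cancer_data.data,
    cancer_data.target,
    test_size=0.25,
    random_state=0,
    stratify=cancer_data.target,
)

print(x_train.shape, x_test.shape)
```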

### Inbuilt KNN algorithm

In [30]:
#Use sklearn.neighbors to import the KNN classifier
'''Call the classifier and save the object in some variable.
   Here you can pass the k value as a parameter of the classifier,
   so that the accuracy of the inbuilt classifier can be compared with our own implementation''';

In [31]:
#Use .fit function of the classifier on the train set, like x_train and y_train

In [None]:
#Use the .predict function of the classifier on the testing set,like x_test and save the output in some variable.
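The three cells above (create, fit, predict) could be filled in roughly as follows. The value `n_neighbors=7` and the split settings are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer_data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target, test_size=0.25, random_state=0
)

# k is passed as n_neighbors; 7 is an arbitrary choice here
clf = KNeighborsClassifier(n_neighbors=7)

# For KNN, fit mainly stores (and indexes) the training data
clf.fit(x_train, y_train)

# Predicted labels for the held-out rows
y_pred = clf.predict(x_test)

# Fraction of test rows classified correctly
print(clf.score(x_test, y_test))
```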

### Implementation of KNN algorithm

**Training**

K-Nearest Neighbour is used here for classification. It has no real training phase: you only store the training points and do no other work

In [31]:
def train(x_train,y_train):
    ##just return the values
    pass

**Prediction**

Here you are going to assign each test point to a class by a vote among its k nearest training points

In [32]:
def predict_one(x_train,y_train,x_test,k):
    '''Predict the class of a single x_test data point.'''
    #The distances list will hold the distance from the testing data point
    #to every training data point; sort the distances
    #and return the most common label among the k nearest points
    pass

In [33]:
def predict(x_train,y_train,x_test_data,k):
    '''Take x_train and y_train and predict the classes for x_test_data.
       k tells how many nearest neighbors we want to consider.'''
    #Call the predict_one function for each test data point
    pass
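One possible way to fill in the three stubs is sketched below. It assumes Euclidean distance and a simple majority vote, with NumPy arrays as inputs. Neither choice is dictated by the exercise, and ties in the vote are resolved by whichever label `Counter` encounters first, so an odd `k` is the safer pick for two classes.

```python
from collections import Counter

import numpy as np


def train(x_train, y_train):
    # KNN is a lazy learner: "training" only stores the data
    return x_train, y_train


def predict_one(x_train, y_train, x_test, k):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((x_train - x_test) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


def predict(x_train, y_train, x_test_data, k):
    # Classify the test points one by one
    return np.array([predict_one(x_train, y_train, x, k) for x in x_test_data])


# Tiny toy check: two well-separated clusters labelled 0 and 1
x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

x_stored, y_stored = train(x, y)
toy_pred = predict(x_stored, y_stored, np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(toy_pred)
```

The same `predict` call works on `x_train`, `y_train` and `x_test` from the split above, since those are NumPy arrays as well.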

After prediction, check the accuracy of the model.

**Then compare the accuracy of the inbuilt algorithm with that of the implemented algorithm**

For this you can use the classification report for both the inbuilt and the implemented classifier.
Since KNN depends on the value of K, the score changes as the value of K changes.
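To see how the score moves with K, a loop like the one below can be used. It is a sketch that uses only the inbuilt classifier; the list of K values and the split settings are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer_data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target, test_size=0.25, random_state=0
)

# Test accuracy for a handful of K values
scores = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    scores[k] = accuracy_score(y_test, clf.predict(x_test))
print(scores)

# Per-class precision, recall and F1 for the last classifier in the loop
report = classification_report(
    y_test, clf.predict(x_test), target_names=cancer_data.target_names
)
print(report)
```

Calling `accuracy_score` and `classification_report` with the predictions of the hand-written `predict` function in place of `clf.predict(x_test)` gives the matching numbers for the implemented classifier.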