<a href="https://colab.research.google.com/github/val93s/Machine_learning/blob/main/Copy_of_11_4_1_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#11.4.1 Activity

#**Breast Cancer Detection**

According to the [American Cancer Society (2022)](https://www.cancer.org/cancer/breast-cancer/about/how-common-is-breast-cancer.html), breast cancer is the most common cancer in American women, except for skin cancers. The average risk of a woman in the United States developing breast cancer sometime in her life is about 13%. This means there is a 1 in 8 chance she will develop breast cancer.

Mammograms are used to detect breast cancer, ideally at an early stage. However, many masses that appear on a mammogram are not cancerous. A machine learning model that predicts whether a tumor is benign or malignant would help physicians as they guide and treat patients.

In this activity, we will explore standardizing and normalizing the values in the dataset.



##Step 1: Download and save the `cancer.csv` dataset from the class materials  

Make a note of where you saved the file on your computer.

##Step 2: Upload the `cancer.csv` dataset by running the following code block 

When prompted, navigate to and select the `cancer.csv` dataset where you saved it on your computer.

In [None]:
#Step 2

from google.colab import files
cancer = files.upload()

Saving cancer.csv to cancer.csv


##Step 3: Import necessary packages

```
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler


```

In [None]:
#Step 3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler




## Step 4: Create a Pandas DataFrame from the CSV file
* Name the DataFrame `cancer`.
* Print the first five observations of `cancer`.  Note the kinds of data it contains.

In [None]:
#Step 4
cancer = pd.read_csv('cancer.csv')
cancer.head()





| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |


##Step 5: Convert the variable `diagnosis` into a numeric data type  
* You have done this many times by now, but it doesn't hurt to practice!
* There are many ways to accomplish this, but you may choose to work with the example shown below.  

```
cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0

```
* Name the result `cancer_present` and code malignant tumors with a `1` and benign tumors with a `0`.






In [None]:
#Step 5
cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0
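
The same encoding can be sketched with `Series.map`. The small DataFrame `toy` below and its values are made up for illustration and are not taken from `cancer.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the `diagnosis` column of `cancer`
toy = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'M']})

# Map malignant to 1 and benign to 0; any other label would become NaN
toy['cancer_present'] = toy['diagnosis'].map({'M': 1, 'B': 0})

print(toy)
```

One difference worth knowing: the `.loc` version typically yields a float column (`1.0`/`0.0`) because the new column starts out as `NaN`, while `map` keeps integers when every label is matched.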



##Step 6: Split the data into the target variable and the feature of interest
* You have done this many times by now, but it doesn't hurt to practice!
* You want to predict whether a tumor is benign or malignant (`cancer_present`) using the mean tumor perimeter measure (`perimeter_mean`).
* Select all of the features of the `cancer` DataFrame **except** `id`, `diagnosis`, and `cancer_present`, and name the resulting DataFrame `X`.
* Select the column `cancer_present` from the `cancer` DataFrame and name it `y`. Make sure `y` is also a DataFrame and not a Series.

In [None]:
#Step 6
y = cancer[['cancer_present']]

X = cancer[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean','compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
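
Instead of listing every feature by name, the unwanted columns can be dropped. This is a minimal sketch on a hypothetical three-row DataFrame (`toy`, `X_toy`, and `y_toy` are illustrative names, and the values are made up).

```python
import pandas as pd

# Hypothetical miniature version of the `cancer` DataFrame
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.99, 11.42, 12.30],
    'texture_mean': [10.38, 20.38, 15.10],
    'cancer_present': [1, 0, 0],
})

# Drop the identifier, the original label, and the target; the rest are features
X_toy = toy.drop(columns=['id', 'diagnosis', 'cancer_present'])

# Double brackets keep the target as a one-column DataFrame rather than a Series
y_toy = toy[['cancer_present']]

print(list(X_toy.columns))
print(type(y_toy).__name__)
```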


##Step 7: Split the data into a training/validation dataset and a test dataset
* You have done this many times by now, but it doesn't hurt to practice!
* Use `train_test_split` from `sklearn.model_selection`.
* Name the *X* training/validation set `X_train_val` and the *y* training/validation set `y_train_val`.
* Name the *X* test set `X_test` and the *y* test set `y_test`.
* Set the `test_size = 0.25` and `random_state = 42`.






In [None]:
#Step 7
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)



##Step 8: Split the training/validation dataset into a training set and validation set
* You have done this many times by now, but it doesn't hurt to practice!
* Use `train_test_split` from `sklearn.model_selection` to split `X_train_val` and `y_train_val` into `X_train`, `X_val`, `y_train`, and `y_val`.
* Set the `test_size = 0.333` (this will be the size of the validation set) and `random_state = 42`.





In [None]:
#Step 8
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.333, random_state=42)
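
Because the second split takes 33.3% of the remaining 75%, the two steps together give roughly 50% training, 25% validation, and 25% test data. The sketch below checks this on hypothetical data with 1,000 rows (the arrays `X_demo` and `y_demo` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with a single feature
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# First split: hold out 25% as the test set
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Second split: hold out 33.3% of the remainder as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.333, random_state=42)

# Report the size of each set and its share of the original data
for name, part in [('train', X_tr), ('validation', X_va), ('test', X_te)]:
    print(name, len(part), len(part) / len(X_demo))
```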



##Step 9: Look at the summary measures for each feature in the training data
* Compute the summary measures for each feature using `.describe()`.
* Which feature has the largest mean? What is it?
* Which feature has the smallest mean? What is it?




In [None]:
#Step 9
X_train.describe()


| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 284.0 | 284.0 | 284.0 | 284.0 | 282.0 | 284.0 | 284.0 | 284.0 | 283.0 | 282.0 |
| mean | 14.326711 | 19.35838 | 93.248556 | 674.802817 | 0.095419 | 0.102521 | 0.089788 | 0.049496 | 0.180857 | 0.06224 |
| std | 3.603878 | 4.370346 | 24.885064 | 361.376372 | 0.01373 | 0.053271 | 0.082404 | 0.03923 | 0.026913 | 0.007075 |
| min | 7.691 | 9.71 | 47.92 | 170.4 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.1167 | 0.04996 |
| 25% | 11.89 | 16.465 | 76.495 | 433.525 | 0.085902 | 0.061257 | 0.028445 | 0.020745 | 0.16145 | 0.057347 |
| 50% | 13.415 | 18.895 | 86.47 | 556.95 | 0.09415 | 0.09128 | 0.06446 | 0.034605 | 0.1788 | 0.06102 |
| 75% | 16.3875 | 21.685 | 108.5 | 838.675 | 0.103575 | 0.130525 | 0.128425 | 0.077387 | 0.1958 | 0.065715 |
| max | 27.42 | 39.28 | 186.9 | 2501.0 | 0.1634 | 0.3114 | 0.4268 | 0.2012 | 0.2655 | 0.09744 |


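###Answer:

`area_mean` has the largest mean (about 674.80), and `concave points_mean` has the smallest (about 0.049).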
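
These questions can also be answered programmatically rather than by eye. The sketch below copies the feature means from the `.describe()` output for Step 9 into a Series named `means` (an illustrative name); in the notebook itself, `X_train.mean()` would presumably produce the same Series.

```python
import pandas as pd

# Feature means copied from the `.describe()` output in Step 9
means = pd.Series({
    'radius_mean': 14.326711,
    'texture_mean': 19.35838,
    'perimeter_mean': 93.248556,
    'area_mean': 674.802817,
    'smoothness_mean': 0.095419,
    'compactness_mean': 0.102521,
    'concavity_mean': 0.089788,
    'concave points_mean': 0.049496,
    'symmetry_mean': 0.180857,
    'fractal_dimension_mean': 0.06224,
})

# idxmax/idxmin return the label of the largest/smallest value
print(means.idxmax(), means.max())
print(means.idxmin(), means.min())
```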



##Step 10: Impute missing data
* Run the code `imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')` to instantiate a Simple Imputer where missing values are identified as `NaN` and the strategy used for imputing data is the mean value.
* Use `imp_mean.fit_transform(X_train)` to fit the imputer and transform the data.  Name the results `X_train_`.
* Cast `X_train_` as a DataFrame named `X_train`.
* Check for missing values in the new `X_train`.

In [None]:
#Step 10

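imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_ = imp_mean.fit_transform(X_train)
# Cast the imputed array (X_train_), not the original X_train, as a DataFrame
X_train = pd.DataFrame(X_train_)
# Check for remaining missing values in the new X_train
X_train.isna().sum()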
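
To make the mechanics of mean imputation concrete, here is a minimal sketch on two tiny hypothetical DataFrames (`train` and `val`, with made-up values). Reusing the fitted imputer on a validation frame goes beyond what this activity asks, and is included only to show that `transform` applies the means learned from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})
val = pd.DataFrame({'a': [np.nan, 5.0], 'b': [40.0, np.nan]})

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform learns each column's mean from the training data and fills its gaps
train_filled = pd.DataFrame(imp.fit_transform(train), columns=train.columns)

# transform reuses the training means, so nothing is learned from the validation data
val_filled = pd.DataFrame(imp.transform(val), columns=val.columns)

print(imp.statistics_)                  # the learned column means
print(train_filled.isna().sum().sum())  # number of missing values left in train
print(val_filled)
```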






##Step 11: Standardize the data
* Run the code `std = StandardScaler()` to instantiate a Standard Scaler. 
* Use `std.fit_transform(X_train)` to fit the scaler and transform the data.  Name the results `X_train_std_`.
* Cast `X_train_std_` as a DataFrame named `X_train_std`.
* Use `print(round(X_train_std.describe(),2))` to examine the summary statistics for the standardized data.
* What are the mean and SD for each variable?




In [None]:
#Step 11
std = StandardScaler()
X_train_std_ = std.fit_transform(X_train)
X_train_std = pd.DataFrame(X_train_std_)
print(round(X_train_std.describe(), 2))



            0       1       2       3       4       5       6       7       8  \
count  284.00  284.00  284.00  284.00  282.00  284.00  284.00  284.00  283.00   
mean     0.00    0.00    0.00    0.00   -0.00   -0.00    0.00    0.00   -0.00   
std      1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00   
min     -1.84   -2.21   -1.82   -1.40   -3.12   -1.56   -1.09   -1.26   -2.39   
25%     -0.68   -0.66   -0.67   -0.67   -0.69   -0.78   -0.75   -0.73   -0.72   
50%     -0.25   -0.11   -0.27   -0.33   -0.09   -0.21   -0.31   -0.38   -0.08   
75%      0.57    0.53    0.61    0.45    0.60    0.53    0.47    0.71    0.56   
max      3.64    4.57    3.77    5.06    4.96    3.93    4.10    3.87    3.15   

            9  
count  282.00  
mean    -0.00  
std      1.00  
min     -1.74  
25%     -0.69  
50%     -0.17  
75%      0.49  
max      4.98  


###Answer:

After standardization, every variable has a mean of approximately 0 and an SD of approximately 1.
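
Standardization rescales each column as `z = (x - mean) / std`. The sketch below, on a small made-up array `x`, checks that `StandardScaler` matches this formula when the standard deviation is the population version (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with mean 5 and population SD 2
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual z-scores; np.std uses ddof=0 (population SD) by default
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std(), scaled.std(ddof=1))
```

The last value printed is the sample SD (`ddof=1`), which is what pandas `.describe()` reports. It is slightly above 1 for standardized data, which is why the summary shows 1.00 only after rounding.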

##Step 12: Normalize the data
* Run the code `norm = MinMaxScaler().fit(X_train)` to instantiate a MinMax Scaler and fit it to the training data.
* Use `norm.transform(X_train)` to transform the data with the fitted scaler.  Name the results `X_train_norm_`.
* Cast `X_train_norm_` as a DataFrame named `X_train_norm`.
* Use `print(round(X_train_norm.describe(),2))` to examine the summary statistics for the normalized data.
* What are the *min* and *max* for each variable?




In [None]:
#Step 12
# Fit the scaler on the training data
norm = MinMaxScaler().fit(X_train)

# Transform the training data
X_train_norm_ = norm.transform(X_train)

# Cast the transformed array as a DataFrame
X_train_norm = pd.DataFrame(X_train_norm_)

print(round(X_train_norm.describe(), 2))



            0       1       2       3       4       5       6       7       8  \
count  284.00  284.00  284.00  284.00  282.00  284.00  284.00  284.00  283.00   
mean     0.34    0.33    0.33    0.22    0.39    0.28    0.21    0.25    0.43   
std      0.18    0.15    0.18    0.16    0.12    0.18    0.19    0.19    0.18   
min      0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   
25%      0.21    0.23    0.21    0.11    0.30    0.14    0.07    0.10    0.30   
50%      0.29    0.31    0.28    0.17    0.37    0.25    0.15    0.17    0.42   
75%      0.44    0.40    0.44    0.29    0.46    0.38    0.30    0.38    0.53   
max      1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00   

            9  
count  282.00  
mean     0.26  
std      0.15  
min      0.00  
25%      0.16  
50%      0.23  
75%      0.33  
max      1.00  


###Answer:

After normalization, every variable has a minimum of 0 and a maximum of 1.
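
Min-max normalization rescales each column as `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1. The sketch below compares this formula with `MinMaxScaler` on a small made-up array `x`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
x = np.array([[10.0], [15.0], [20.0], [30.0]])

# Manual min-max scaling, column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation with scikit-learn (fit, then transform)
scaled = MinMaxScaler().fit(x).transform(x)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```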