
#11.4.2 Activity

#**Breast Cancer Detection**

According to the [American Cancer Society (2022)](https://www.cancer.org/cancer/breast-cancer/about/how-common-is-breast-cancer.html), breast cancer is the most common cancer in American women, except for skin cancers. The average risk of a woman in the United States developing breast cancer sometime in her life is about 13%. This means there is about a 1 in 8 chance she will develop breast cancer.

Mammograms are used to detect breast cancer, ideally at an early stage. However, many masses that appear on a mammogram are not actually cancer. A machine learning model that predicts whether a tumor is benign or malignant could help physicians as they guide and treat patients.

In this activity, we will create pipelines for two models, one that standardizes the data and one that normalizes it, and see which one performs better.
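
To make the two terms concrete, here is a minimal sketch on a made-up four-value column (the numbers are hypothetical and not taken from `cancer.csv`). Standardizing rescales a feature to mean 0 and standard deviation 1, while normalizing, as done by `MinMaxScaler`, rescales it to the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical single-feature column, not taken from cancer.csv
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardize: (x - mean) / standard deviation
standardized = StandardScaler().fit_transform(values)

# Normalize: (x - min) / (max - min)
normalized = MinMaxScaler().fit_transform(values)

print('standardized:', standardized.ravel())
print('normalized:  ', normalized.ravel())
```

Both scalers keep the ordering of the values; only the scale that the logistic regression sees changes.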



##Step 1: Download and save the `cancer.csv` dataset from the class materials  

Make a note of where you saved the file on your computer.

##Step 2: Upload the `cancer.csv` dataset by running the following code block 

When prompted, navigate to and select the `cancer.csv` dataset where you saved it on your computer.

In [None]:
#Step 2

from google.colab import files
cancer = files.upload()

Saving cancer.csv to cancer (1).csv


##Step 3: Import necessary packages

```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import numpy as np

```

In [None]:
#Step 3
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import numpy as np





## Step 4: Create a Pandas DataFrame from the CSV file
* Name the DataFrame `cancer`.
* Print the first five observations of `cancer`.  Note the kinds of data it contains.

In [None]:
#Step 4
cancer = pd.read_csv('cancer.csv')
print(cancer.head())  # print explicitly: a cell only auto-displays its last expression
cancer.describe()



| | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 565.0 | 569.0 | 569.0 | 569.0 | 564.0 | 563.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096336 | 0.104341 | 0.088799 | 0.048919 | 0.181233 | 0.062792 |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014029 | 0.052813 | 0.07972 | 0.038803 | 0.027453 | 0.00705 |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.161975 | 0.05775 |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09586 | 0.09263 | 0.06154 | 0.0335 | 0.17925 | 0.06155 |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 |
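
In the `count` row, `smoothness_mean` (565), `symmetry_mean` (564), and `fractal_dimension_mean` (563) fall short of the 569 rows, which points to missing values in those columns. One way to count them directly is sketched below on a hypothetical miniature DataFrame named `toy`; in the notebook the same call would be made on `cancer`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `cancer`
toy = pd.DataFrame({
    'radius_mean': [14.1, 20.6, 11.4, 12.5],
    'smoothness_mean': [0.10, np.nan, 0.09, 0.11],
    'symmetry_mean': [np.nan, 0.18, np.nan, 0.20],
})

# isna() flags each missing cell; sum() counts the flags per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

These gaps are the reason the pipelines later in the activity start with an imputation step.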


##Step 5: Convert the variable `diagnosis` into a numeric data type  
* There are many ways to accomplish this, but you may choose to work with the example shown below.

```
cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0

```
* Name the result `cancer_present` and code malignant tumors with a `1` and benign tumors with a `0`.






In [None]:
#Step 5

cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0
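
As one of the other ways to do this conversion, the sketch below uses `Series.map` with a dictionary. The DataFrame `toy` is a hypothetical stand-in for `cancer`, and this is an alternative to the `.loc` approach, not a replacement the activity requires.

```python
import pandas as pd

# Hypothetical stand-in for the `diagnosis` column of `cancer`
toy = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'M']})

# Map each label to its numeric code in a single step
toy['cancer_present'] = toy['diagnosis'].map({'M': 1, 'B': 0})
print(toy)
```

A label that is missing from the dictionary would become `NaN`, so it is worth checking the distinct values of `diagnosis` first.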




##Step 6: Split the data into the target variable and the features of interest
* You want to predict whether a tumor is benign or malignant (`cancer_present`) using the mean tumor measurements, such as `perimeter_mean`.
* Select all of the features of the `cancer` DataFrame **except** `id`, `diagnosis`, and `cancer_present`, and name the resulting DataFrame `X`.
* Select the column `cancer_present` from the `cancer` DataFrame and name it `y`. Make sure `y` is also a DataFrame and not a Series.

In [None]:
#Step 6
y = cancer[['cancer_present']]

X = cancer[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean','compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
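
Instead of listing every feature by name, the excluded columns can be dropped. The sketch below shows this on a hypothetical miniature DataFrame `toy`; it assumes the only columns to exclude are `id`, `diagnosis`, and `cancer_present`, so any other non-feature column in the real file would have to be added to the list.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `cancer`
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.9, 12.4, 11.2],
    'texture_mean': [10.4, 15.7, 18.1],
    'cancer_present': [1, 0, 0],
})

# Keep everything except the identifier, the raw label, and the target
X_toy = toy.drop(columns=['id', 'diagnosis', 'cancer_present'])

# Double brackets return a one-column DataFrame rather than a Series
y_toy = toy[['cancer_present']]

print(X_toy.columns.tolist())
print(type(y_toy).__name__)
```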

##Step 7: Split the data into a training/validation dataset and a test dataset
* Use `train_test_split` from `sklearn.model_selection`.
* Name the *X* training/validation set `X_train_val` and the *y* training/validation set `y_train_val`.
* Name the *X* test set `X_test` and the *y* test set `y_test`.
* Set the `test_size = 0.25` and `random_state = 42`. 






In [None]:
#Step 7
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


##Step 8: Split the training/validation dataset into a training set and validation set
* Use `train_test_split` from `sklearn.model_selection` to split `X_train_val` and `y_train_val` into `X_train`, `X_val`, `y_train`, and `y_val`.
* Set the `test_size = 0.333` (this will be the size of the validation set) and `random_state = 42`.





In [None]:
#Step 8
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.333, random_state=42)
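
The two splits together give roughly 50% training, 25% validation, and 25% test data, because the second split takes 0.333 of the remaining 75%. The sketch below checks the resulting sizes using an array of 569 row numbers as a stand-in for the 569 observations reported by `describe()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 569 observations in the dataset
rows = np.arange(569)

# First split: hold out 25% as the test set
train_val, test = train_test_split(rows, test_size=0.25, random_state=42)

# Second split: hold out 33.3% of the remainder as the validation set
train, val = train_test_split(train_val, test_size=0.333, random_state=42)

print(len(train), len(val), len(test))  # 284 142 143
```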


##Step 9: Build a pipeline that will impute and standardize the data and fit a logistic regression model
* The first step is `SimpleImputer(missing_values=np.nan, strategy='mean')`.
* The second step is `StandardScaler()`.
* And the third step is `LogisticRegression(random_state=0)`.
* Name the pipeline `pipe_std`.





In [None]:
#Step 9
pipe_std = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                    ('scaler', StandardScaler()),
                    ('log_reg', LogisticRegression(random_state=0))])
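
To show what the first pipeline step does on its own, here is a minimal sketch of mean imputation on a hypothetical 3-by-2 array. The missing entry is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in the second column
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [5.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)

print(filled)  # the NaN becomes (4.0 + 6.0) / 2 = 5.0
```

Inside a pipeline, the column means are learned when the pipeline is fit, and those same means are reused when the pipeline later scores other data.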



##Step 10: Fit the pipeline to the training data and calculate the model accuracy  
* Fit the pipeline to the training data using `pipe_std.fit()`.
* Calculate the accuracy of the model on the training data using `pipe_std.score()` and print the results.





In [None]:
#Step 10
pipe_std.fit(X_train, y_train)
train_accuracy_std = pipe_std.score(X_train, y_train)
print('The standardized data training accuracy is', train_accuracy_std)




The standardized data training accuracy is 0.954225352112676


  y = column_or_1d(y, warn=True)
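
The `column_or_1d` line above is the tail of a scikit-learn `DataConversionWarning`, raised because `y_train` is a one-column DataFrame where a 1-D array is expected. The fit still succeeds. One possible way to avoid the warning, sketched here on hypothetical toy data, is to flatten the target with `.values.ravel()` at fit time while keeping `y` as a DataFrame elsewhere, as Step 6 asks.

```python
import warnings

import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; the target is a one-column DataFrame as in Step 6
X_toy = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
y_toy = pd.DataFrame({'cancer_present': [0, 0, 0, 1, 1, 1]})

y_flat = y_toy.values.ravel()  # shape (6, 1) becomes shape (6,)

model = LogisticRegression(random_state=0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    model.fit(X_toy, y_flat)

# Collect only the warning type discussed above
conversion_warnings = [w for w in caught
                       if issubclass(w.category, DataConversionWarning)]
print(y_toy.shape, y_flat.shape, len(conversion_warnings))
```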


##Step 11: Calculate the accuracy of the model on the validation data
* Calculate the accuracy of the model on the *validation* data using `pipe_std.score()` and print the results.





In [None]:
#Step 11
val_accuracy_std = pipe_std.score(X_val, y_val)
print('The standardized data validation accuracy is', val_accuracy_std)



The standardized data validation accuracy is 0.9225352112676056


##Step 12: Build a pipeline that will impute and normalize the data and fit a logistic regression model
* The first step is `SimpleImputer(missing_values=np.nan, strategy='mean')`.
* The second step is `MinMaxScaler()`.
* And the third step is `LogisticRegression(random_state=0)`.
* Name the pipeline `pipe_norm`.
* Fit the pipeline to the training data using `pipe_norm.fit()`.
* Calculate the accuracy of the model on the training data using `pipe_norm.score()` and print the results.
* Calculate the accuracy of the model on the validation data using `pipe_norm.score()` and print the results.
* Which model has the higher accuracy: the model with the standardized data or the model with the normalized data?





In [None]:
#Step 12
pipe_norm = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                      ('norm', MinMaxScaler()),
                      ('log_reg', LogisticRegression(random_state=0))])
pipe_norm.fit(X_train, y_train)
train_accuracy_norm = pipe_norm.score(X_train, y_train)
val_accuracy_norm = pipe_norm.score(X_val, y_val)

print('The normalized data training accuracy is', train_accuracy_norm)
print('The normalized data validation accuracy is', val_accuracy_norm)




The normalized data training accuracy is 0.9330985915492958
The normalized data validation accuracy is 0.9014084507042254


  y = column_or_1d(y, warn=True)
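
Since the two pipelines differ only in the scaler, the comparison can also be written as a loop. The sketch below is an illustration on synthetic data from `make_classification`, not on `cancer.csv`, so its scores say nothing about the tumor dataset; the names `X_syn`, `y_syn`, and `scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (not the cancer dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.333, random_state=42)

scores = {}
for name, scaler in [('standardized', StandardScaler()),
                     ('normalized', MinMaxScaler())]:
    # Same three steps as in the activity; only the scaler changes
    pipe = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', scaler),
                     ('log_reg', LogisticRegression(random_state=0))])
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_va, y_va)  # validation accuracy

print(scores)
```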


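###Answer:

The model built on the standardized data has the higher accuracy on both the training set (about 0.954 versus 0.933) and the validation set (about 0.923 versus 0.901).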


##Step 13: Evaluate the model built using standardized data on the test set
Calculate and print the test set accuracy.




In [None]:
#Step 13
test_accuracy_std = pipe_std.score(X_test, y_test)
print('The testing accuracy is', test_accuracy_std)





The testing accuracy is 0.9440559440559441
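
Step 3 also imports `confusion_matrix`, `precision_score`, and `recall_score`, which the activity does not use. As a rough sketch of how they could complement accuracy, the example below applies them to hypothetical labels and predictions; in the notebook the predictions would instead come from something like `pipe_std.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = malignant, 0 = benign)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred)  # share of predicted malignant that are malignant
recall = recall_score(y_true, y_pred)        # share of malignant tumors that were caught

print(cm)
print(precision, recall)  # 0.75 0.75
```

For a screening task, recall is often watched closely because a false negative means a malignant tumor was missed.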
