# Module 2 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. In the menu bar, select Kernel, then Restart & Run All.
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----


# Predicting Breast Cancer

In this assignment, we will work with a breast cancer data set to create classification models to predict breast cancer.

-----


# Problem 1: Load and Pre-process Data

To complete the task, do the following:
1. Import needed modules.
2. Load the dataset from `breast-cancer-wisconsin.csv` to DataFrame `df`. 
3. Display the first 5 rows in `df`.
4. Display basic information about the DataFrame `df`. Verify that there are no missing values and that all values are numeric.
5. Choose column 'class' as the label and assign it to variable **label**. Note: because `class` is a reserved keyword in Python, you can't refer to the 'class' column as `df.class`. Use `df['class']` instead.
6. Choose the following columns as training data and assign them to variable **data**:  
'clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', and 'mitoses'.   
__data__ should be a DataFrame.
7. Split `data` and `label` into training and testing sets.
    - Assign the training and testing data to variables `d_train` and `d_test`.
    - Assign the training and testing labels to variables `l_train` and `l_test`.
    - The `test_size` argument in `train_test_split` should be set to 0.3.
    - Don't set `random_state` argument in `train_test_split`.

After this problem, six new variables are defined: **data, label, d_train, d_test, l_train** and __l_test__.
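
Steps 5 and 6 can be sketched on a tiny stand-in DataFrame. The frame `toy_df` and its values are invented for illustration and are not taken from the dataset. The sketch shows that single-bracket selection returns a Series, while selecting with a list of column names returns a DataFrame:

```python
import keyword

import pandas as pd

# Hypothetical toy frame; column names mirror the assignment, values are invented.
toy_df = pd.DataFrame({
    'clump thickness': [5, 3, 8],
    'mitoses': [1, 1, 2],
    'class': [2, 2, 4],
})

# `class` is a reserved word in Python, so `toy_df.class` would be a syntax error.
is_reserved = keyword.iskeyword('class')

toy_label = toy_df['class']                        # single brackets -> Series
toy_data = toy_df[['clump thickness', 'mitoses']]  # list of names -> DataFrame

print(is_reserved, type(toy_label).__name__, type(toy_data).__name__)
```

The same pattern applies to `df`, with the full list of nine feature columns in place of the two used here.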

-----

In [1]:
# Import modules, load dataset, display first 5 rows.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.csv')
df.head()

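| | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |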


In [2]:
# Display basic information of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   id                     683 non-null    int64
 1   clump thickness        683 non-null    int64
 2   uniformity cell size   683 non-null    int64
 3   uniformity cell shape  683 non-null    int64
 4   marginal adhesion      683 non-null    int64
 5   epithelial cell size   683 non-null    int64
 6   bare nuclei            683 non-null    int64
 7   bland chromatin        683 non-null    int64
 8   normal nucleoli        683 non-null    int64
 9   mitoses                683 non-null    int64
 10  class                  683 non-null    int64
dtypes: int64(11)
memory usage: 58.8 KB
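
In the `info()` output above, every column reports 683 non-null values out of 683 entries, and every dtype is `int64`, which is what step 4 asks you to verify by eye. The same check can also be done programmatically. The following is a minimal sketch on a hypothetical stand-in frame `check_df` with invented values:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for df; values are invented for illustration.
check_df = pd.DataFrame({
    'bare nuclei': [1, 10, 2],
    'bland chromatin': [3, 3, 3],
    'class': [2, 2, 4],
})

# Total number of missing cells across the whole frame.
n_missing = int(check_df.isna().sum().sum())

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in check_df.dtypes)

print(n_missing, all_numeric)
```

Running the same two expressions on `df` should give zero missing cells and `True` if the frame matches the `info()` summary.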


In [3]:
# Define and split label and data

label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.3)
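
Because `random_state` is not set, the rows that land in each split differ from run to run, but the split sizes do not. The sketch below uses hypothetical placeholder data (`demo_data`, `demo_label`) with the same row count as `df` to show the sizes that `test_size=0.3` produces and that data and labels stay aligned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with the same number of rows as df (683).
n_rows = 683
demo_data = pd.DataFrame({'feature': np.arange(n_rows)})
demo_label = pd.Series(np.zeros(n_rows, dtype=int), name='class')

demo_d_train, demo_d_test, demo_l_train, demo_l_test = train_test_split(
    demo_data, demo_label, test_size=0.3
)

# Sizes depend only on n_rows and test_size, not on the random shuffle.
print(len(demo_d_train), len(demo_d_test))

# Data and label rows keep matching indices after the shuffle.
print(demo_d_train.index.equals(demo_l_train.index))
```

With 683 rows, the test share is rounded up to 205 rows, which leaves 478 rows for training.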


---

# Problem 2: Create and Train a Logistic Regression Model

Your task for this problem is to build and use scikit-learn's `LogisticRegression` estimator to make predictions on the breast cancer dataset.  

To complete the task, do the following:
1. Import `LogisticRegression` from `sklearn.linear_model`.
2. Create a `LogisticRegression` estimator **lr_model**. Accept the default values for all arguments.
3. Fit the `LogisticRegression` estimator using `d_train` and `l_train` created in Problem 1.

After this problem, there will be a trained logistic regression model **lr_model**.

-----

In [4]:
# Your answer
from sklearn.linear_model import LogisticRegression

# Create and fit our logistic regression model to training data
lr_model = LogisticRegression()
lr_model.fit(d_train, l_train)


LogisticRegression()
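
Once fitted, the estimator exposes what it learned. The following sketch uses synthetic data from `make_classification` rather than the breast cancer dataset, so the names `X_demo`, `y_demo`, and `demo_lr` are illustrative only. It shows the learned weights and the class probabilities behind `predict`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 9 features (same feature count as `data`).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

demo_lr = LogisticRegression()
demo_lr.fit(X_demo, y_demo)

# For a binary problem: one weight per feature, plus one intercept.
print(demo_lr.coef_.shape, demo_lr.intercept_.shape)

# predict_proba returns one column per class; each row sums to 1.
proba = demo_lr.predict_proba(X_demo[:3])
print(demo_lr.classes_, proba.sum(axis=1))
```

On `lr_model`, the same attributes would hold nine weights, one for each column of `data`.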

---

# Problem 3: Calculate Accuracy Score

For this problem, you will compute the accuracy score of the `lr_model` defined in Problem 2.  

To complete the task, do the following:

1. Import needed modules.
2. Call the `predict` method of `lr_model` on `d_test` to get the predicted labels, and assign them to variable **predicted**.
3. Compute the accuracy score using the `accuracy_score` function in the `metrics` module with the true labels **l_test** and the predicted labels __predicted__.
4. Print out the accuracy score.

-----

In [5]:
# Your answer
from sklearn import metrics

predicted = lr_model.predict(d_test)
# Fraction of test samples whose predicted label matches the true label
accuracy = metrics.accuracy_score(l_test, predicted)
print(accuracy)

0.9707317073170731
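
The accuracy score is the fraction of predictions that match the true labels. As a small worked example, the sketch below uses invented label arrays (`true_labels`, `predicted_labels`), not the model's actual output, and compares a manual computation with `accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# Invented labels for illustration only; 4 of the 5 predictions are correct.
true_labels = np.array([2, 4, 2, 2, 4])
predicted_labels = np.array([2, 4, 4, 2, 4])

# Fraction of positions where prediction and truth agree.
manual_accuracy = float(np.mean(true_labels == predicted_labels))
sklearn_accuracy = metrics.accuracy_score(true_labels, predicted_labels)

print(manual_accuracy, sklearn_accuracy)
```

By the same arithmetic, and assuming the 205-row test set that `test_size=0.3` gives for 683 rows, the score of 0.9707 shown above corresponds to 199 correct predictions out of 205.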

---

# Problem 4: Create and Train a Decision Tree Classifier

Your task for this problem is to build and use scikit-learn's `DecisionTreeClassifier` estimator to make predictions on the breast cancer dataset.

To complete the task, do the following:

1. Import needed modules.
2. Create a `DecisionTreeClassifier` estimator **dtc_model**. Accept default values for all hyperparameters.
3. Fit the `DecisionTreeClassifier` estimator using `d_train` and `l_train` created in Problem 1.

After this problem, there will be a trained decision tree classification model **dtc_model**.

-----

In [6]:
# Your answer

from sklearn.tree import DecisionTreeClassifier

# Create and fit our decision tree model to training data
dtc_model = DecisionTreeClassifier()
dtc_model.fit(d_train, l_train)


DecisionTreeClassifier()
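
A fitted tree can be inspected as well. The sketch below again uses synthetic data from `make_classification`, so `X_tree`, `y_tree`, and `demo_dtc` are illustrative names, not part of the assignment. It prints the tree's size, its impurity-based feature importances, and its accuracy on the data it was trained on:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 9 features (same feature count as `data`).
X_tree, y_tree = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

demo_dtc = DecisionTreeClassifier()
demo_dtc.fit(X_tree, y_tree)

# Size of the learned tree.
print(demo_dtc.get_depth(), demo_dtc.get_n_leaves())

# Impurity-based importances: one value per feature, summing to 1.
print(demo_dtc.feature_importances_.round(3))

# With default hyperparameters the tree grows until its leaves are pure,
# so accuracy on the training data itself is typically perfect.
train_accuracy = demo_dtc.score(X_tree, y_tree)
print(train_accuracy)
```

A near-perfect training score says little about unseen data, which is why the next problem evaluates the model on the held-out test set.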

---

# Problem 5: Calculate Accuracy Score

For this problem, you will compute the accuracy score of the `dtc_model` defined in Problem 4.  

To complete the task, do the following:

1. Use the decision tree model's `score()` method to compute the accuracy score on `d_test` and `l_test`.
2. Print out the accuracy score.

-----

In [7]:
# Your answer
# score() predicts on d_test and returns the accuracy against l_test
score = dtc_model.score(d_test, l_test)
print(score)

0.9414634146341463
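
For classifiers, `score()` is a shortcut for predicting and then calling `accuracy_score`, so Problems 3 and 5 measure the same quantity. The sketch below checks this on synthetic data. All names are illustrative, and `random_state` is fixed here only to make the sketch reproducible; the assignment itself asks you not to set it.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split the same way as in Problem 1.
X_eq, y_eq = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)
X_eq_train, X_eq_test, y_eq_train, y_eq_test = train_test_split(
    X_eq, y_eq, test_size=0.3, random_state=0
)

eq_tree = DecisionTreeClassifier(random_state=0).fit(X_eq_train, y_eq_train)

# Two routes to the same accuracy value.
via_score = eq_tree.score(X_eq_test, y_eq_test)
via_metrics = metrics.accuracy_score(y_eq_test, eq_tree.predict(X_eq_test))

print(via_score == via_metrics)
```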