# CEBD 1260 - FINAL EXAM

## QUESTION 1
This database contains data on cancer patients with tumors, characteristics of those tumors, and biopsy results indicating whether the tumor is Malignant or Benign.

<img src="files/diagnosis.jpg">

In cancer_data.csv you will find the following variables:
   - radius (mean of distances from center to points on the perimeter)
   - texture (standard deviation of gray-scale values)
   - perimeter
   - area
   - smoothness (local variation in radius lengths)
   - compactness (perimeter^2 / area - 1.0)
   - concavity (severity of concave portions of the contour)
   - concave_points (number of concave portions of the contour)
   - symmetry 
   - fractal_dimension ("coastline approximation" - 1)
   - cancer (0 = Benign, 1 = Malignant)  *target*

### Machine learning Algorithm Used
K-Nearest-Neighbors (KNN) was used since this is a labeled dataset with a binary target (0, 1), which makes it a supervised classification problem.
The KNN workflow consists of 2 main blocks of code:
1. Training - Assign 80% of the original dataset records to training the algorithm.
2. Testing - Assign the remaining 20% of the records to testing the algorithm's accuracy.

The KNN machine learning algorithm is implemented below in Python using scikit-learn.
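
To make the idea behind KNN concrete before handing it to scikit-learn, here is a minimal pure-Python sketch of the voting step. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are made up for illustration and are not part of the exam data.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Keep the k closest points and count how often each label occurs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two features per point, labels 0 and 1
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # 1
```

scikit-learn's `KNeighborsClassifier` follows the same principle but adds efficient neighbor search and further options.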

In [10]:
# Import libraries
# Import KNN algorithm and train_test_split to split original dataset into training and testing datasets
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Load cancer dataset and assign it to a data frame.

In [11]:
df = pd.read_csv('cancer_data.csv')
df = df[["radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension", "cancer"]]

Below is a sample of the first 10 records in the dataframe.
The first 10 columns are the features (variables) and the last column is the target (cancer). The binary target values are 0 = Benign and 1 = Malignant.

In [14]:
df[:10]

Unnamed: 0,radius,texture,perimeter,area,smoothness,compactness,concavity,concave_points,symmetry,fractal_dimension,cancer
0,18.0,10.4,123.0,1000.0,0.12,0.28,0.3,0.15,0.24,0.08,0.0
1,20.6,17.8,133.0,1330.0,0.08,0.08,0.09,0.07,0.18,0.06,0.0
2,19.7,21.3,130.0,1200.0,0.11,0.16,0.2,0.13,0.21,0.06,0.0
3,11.4,20.4,77.6,386.0,0.14,0.28,0.24,0.11,0.26,0.1,0.0
4,20.3,14.3,135.0,1300.0,0.1,0.13,0.2,0.1,0.18,0.06,0.0
5,12.4,15.7,82.6,477.0,0.13,0.17,0.16,0.08,0.21,0.08,0.0
6,18.3,20.0,120.0,1040.0,0.09,0.11,0.11,0.07,0.18,0.06,0.0
7,13.7,20.8,90.2,578.0,0.12,0.17,0.09,0.06,0.22,0.07,0.0
8,13.0,21.8,87.5,520.0,0.13,0.19,0.19,0.09,0.24,0.07,0.0
9,12.5,24.0,84.0,476.0,0.12,0.24,0.23,0.09,0.2,0.08,0.0


Next, define the target (dependent variable) as y

In [27]:
y = df.cancer

Use the train_test_split function to split the data into a training and a testing dataset. We will assign 80% of the data to training and 20% to testing.
Set a random seed to make sure we obtain the same split every time we run this script.

In [53]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size =0.2, random_state = 42)

In [54]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(455, 11) (455,)
(114, 11) (114,)
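
The shapes above can be reproduced with a little arithmetic. The sketch below assumes the dataset has 569 rows (455 + 114) and that the test share is rounded up, which matches the printed output. It also shows why there are 11 columns: `df` was passed to the split as a whole, so the `cancer` target is still among the features. Selecting only the feature columns before splitting (for example `df[feature_columns]`, a hypothetical name) would be one way to leave it out.

```python
import math

n_rows = 569       # 455 + 114, taken from the shapes printed above
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)  # 455 114

columns = ["radius", "texture", "perimeter", "area", "smoothness",
           "compactness", "concavity", "concave_points", "symmetry",
           "fractal_dimension", "cancer"]

# Feature columns only: everything except the target
feature_columns = [c for c in columns if c != "cancer"]
print(len(columns), len(feature_columns))  # 11 10
```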


Define variable k which represents the number of nearest neighbors.

In [55]:
k = 5
knn = KNeighborsClassifier(n_neighbors = k)

Fit the KNN model on the training data

In [56]:
model = knn.fit(X_train,y_train)

Use the fitted KNN model to predict labels for the test set and view these predictions

In [57]:
predictions = knn.predict(X_test)
predictions

array([ 1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,
        0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,
        1.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,
        1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.])

Display the accuracy of the model

In [58]:
print("Score:", model.score(X_test, y_test))

Score: 0.929824561404
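
For a classifier, `score` reports mean accuracy: the share of test records whose predicted label matches the true label. A minimal sketch of that calculation follows; the helper name `accuracy` and the four-element lists are illustrative only. The printed score is consistent with 106 correct predictions out of the 114 test records.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (hypothetical): 3 of 4 predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75

# 106 correct out of 114 reproduces the score shown above
print(round(106 / 114, 12))  # 0.929824561404
```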


Here is a log of the classifier accuracy for different numbers of neighbors (k):
k=1   Score: 0.877192982456  
k=2   Score: 0.877192982456  
k=3   Score: 0.90350877193  
k=4   Score: 0.921052631579    
k=5   Score: 0.929824561404  
k=6   Score: 0.912280701754  
k=7   Score: 0.929824561404  
k=8   Score: 0.912280701754  
k=9   Score: 0.929824561404  
k=10  Score: 0.929824561404
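
The log can also be read programmatically. The sketch below copies the scores from the log into a dictionary (the variable names are illustrative) and lists every k that reaches the highest accuracy. Several values tie, so k = 5 is simply the smallest of them.

```python
# Scores copied from the log above, keyed by k
scores = {
    1: 0.877192982456,
    2: 0.877192982456,
    3: 0.90350877193,
    4: 0.921052631579,
    5: 0.929824561404,
    6: 0.912280701754,
    7: 0.929824561404,
    8: 0.912280701754,
    9: 0.929824561404,
    10: 0.929824561404,
}

best_score = max(scores.values())
# All values of k that tie for the best score
best_ks = [k for k, s in scores.items() if s == best_score]

print(best_score)  # 0.929824561404
print(best_ks)     # [5, 7, 9, 10]
```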

<img src="files/classifierAccuracyGraph.jpg">

#### Predicting the Category for a New Patient
Now we will use our model to predict the category of a new patient whose tumor has the following features:
   - radius: 14
   - texture: 14
   - perimeter: 88
   - area: 566
   - smoothness: 1
   - compactness: 0.08
   - concavity: 0.06
   - concave points: 0.04
   - symmetry: 0.18
   - fractal dimension: 0.05

Pass this new observation to the KNN prediction model. The result will be either 0 = Benign or 1 = Malignant.

In [59]:
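# predict expects a 2D array with one row per sample, hence the double brackets.
# The trailing 0 fills the cancer column, since the model was fitted on all 11 columns of df.
newPatient = knn.predict([[14, 14, 88, 566, 1, 0.08, 0.06, 0.04, 0.18, 0.05, 0]])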



In [60]:
# Print the predicted class for the new patient
print(newPatient)

[ 1.]


#### New Patient Prediction Result
Our model predicted that the new patient has a malignant tumor.

## QUESTION 2

Errors in code:
error ---> fix

1. pipeline  --->  Pipeline
2. sklearn_linear_model  --->  sklearn.linear_model
3. twenty_train = fetch_20newsgroups(subset='test'  --->  subset='train'
4. twenty_test = fetch_20newsgroups(subset='train'  --->  subset='test'
5. random_state=42)),])  --->  remove ","

In [61]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline                     #from sklearn.pipeline import pipeline
from sklearn.linear_model import SGDClassifier            #from sklearn_linear_model import SGDClassifier

In [62]:
categories = [ 'rec.sport.baseball','rec.sport.hockey']
twenty_train = fetch_20newsgroups(subset='train', categories=categories)       #subset = "test"
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)      #subset = "train"

In [63]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))])  #remove ","
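
The 'tfidf' step in the pipeline reweights the raw word counts so that words appearing in many documents count for less. The sketch below illustrates the idea in pure Python. It assumes a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is my understanding of the scikit-learn default. The function name `tfidf` and the small count matrix are hypothetical.

```python
import math

def tfidf(count_matrix):
    """Reweight a document-term count matrix with smoothed IDF and L2-normalise rows."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Assumed smoothed IDF formula
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in count_matrix:
        scaled = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        weighted.append([v / norm if norm else 0.0 for v in scaled])
    return weighted

# Toy counts (hypothetical): 3 documents, 3 terms; term 0 occurs in every document
counts = [[2, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]

weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the second document, both terms occur once, but the rarer term ends up with the larger weight.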

In [64]:
text_clf.fit(twenty_train.data, twenty_train.target) 

Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...     penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False))])

In [65]:
predicted = text_clf.predict(twenty_test.data)

### QUESTIONS
#### 2-1 - How many observations are in the training dataset?
1197 rows

In [66]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(1197, 18571)

#### 2-2 - How many features are in the training dataset?
18571 features
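
The shape (1197, 18571) means one row per document and one column per distinct token. A small pure-Python sketch of the same bag-of-words idea is shown here. It assumes lowercasing and tokens of at least two word characters, which is the default behaviour of `CountVectorizer` as far as I know. The three sentences are made up.

```python
import re
from collections import Counter

# Tokens of two or more word characters (assumed default token pattern)
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

# Toy documents (hypothetical)
docs = [
    "The pitcher threw a fastball.",
    "The goalie stopped the puck.",
    "Pitcher and goalie both wear gloves.",
]

# Vocabulary: every distinct token across all documents, sorted
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per token
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(len(matrix), len(vocabulary))  # 3 11
```

Here the result has 3 observations and 11 features, in the same way that the newsgroup data has 1197 observations and 18571 features.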

#### 2-3 - How well did your model perform?
The model performed well: precision, recall, and F1-score are all about 0.97 for both classes (baseball and hockey) on the 796 test documents.

In [67]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target,
                              predicted,
                              target_names=twenty_test.target_names))

                    precision    recall  f1-score   support

rec.sport.baseball       0.97      0.96      0.97       397
  rec.sport.hockey       0.97      0.97      0.97       399

       avg / total       0.97      0.97      0.97       796
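
The columns of the report can be computed from counts of true positives, false positives, and false negatives. The sketch below shows the standard definitions on a made-up set of eight labels. The function name `precision_recall_f1` is illustrative, not a scikit-learn API.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels (hypothetical): 0 = baseball, 1 = hockey
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print(precision_recall_f1(y_true, y_pred, positive=1))  # (0.75, 0.75, 0.75)
```

Support in the report is simply the number of true test documents in each class.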



#### BONUS  
Baseball = 0, hockey = 1

In [68]:
list(twenty_train.target_names)

['rec.sport.baseball', 'rec.sport.hockey']
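
The numeric targets are positions in this list, so the mapping can be built with `enumerate`. This is a small sketch; `example_predictions` holds made-up values and not the model's actual output.

```python
target_names = ['rec.sport.baseball', 'rec.sport.hockey']

# Map each numeric target to its category name
label_of = dict(enumerate(target_names))
print(label_of)  # {0: 'rec.sport.baseball', 1: 'rec.sport.hockey'}

# Hypothetical predictions translated back to category names
example_predictions = [1, 0, 0, 1]
print([label_of[i] for i in example_predictions])
```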