# Final Assignment Machine Learning

## Introduction

Below 2 cases are shortly presented. 

The first case has a small amount of data and is fairly easy, and the second case is of intermediate level.

For this final assignment you should work out both cases. Every case can be considered as a typical classification problem. The data of both cases is available on the UCI website. Both cases have labels.

For each case the following should be done:
+ Formulate the question are you trying the answer?
+ Clearly describe the problem that you want to solve.
+ What are the features and labels to start with, motivate your choices (e.g. based on literature).
+ Make a description of the dataset.
+ Find out which are the most important features, should you add and remove features?
+ Show how far can you go with K-means clustering?
+ Apply different classification algorithms, vary the values of the most important parameters, play with the number of features and keep records of algo scores. 
+ Motivate your choices, and of course, support your research journey with appealing and informative graphs and diagrams.

## Case 1 - Wine Quality

**Data Set Information**

The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. 

See: [UCI Wine](http://archive.ics.uci.edu/ml/datasets/Wine)

## Case 2 - Heart Disease

**Data Set Information**

A number of attributes are listed that possibly influence heart diseases. The presence of heart disease in the patient is an integer valued from 0 (no presence) to 4. 

The names and social security numbers of the patients were recently removed from the database, and replaced with dummy values. 

One file has been "processed", i.e. the Cleveland database (use this one!). 

See: [UCI Heart Disease](http://archive.ics.uci.edu/ml/datasets/Heart+Disease)

### Goodluck

# CASE 1



In [2]:
import pandas as pd
import numpy as np

names=["Class","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols",
       "Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline"]

df = pd.read_csv("data/wine.data", header=None,names=names)
df.head()

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [3]:
from sklearn.cluster import KMeans
from sklearn import datasets, cluster
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt


X = df[["Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols",
       "Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline"]]
y= df["Class"]

In [4]:
est = cluster.KMeans(3)  # 4 clusters
est.fit(X)

y_est = est.predict(X)

labels = est.labels_

In [5]:
plt.imshow(confusion_matrix(y, labels), cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');
confusion_matrix(y, labels)

array([[ 0,  0,  0,  0],
       [13, 46,  0,  0],
       [20,  1, 50,  0],
       [29,  0, 19,  0]])

In [6]:
from sklearn.metrics import recall_score, precision_score, f1_score
precision = precision_score(y, labels, average='micro')
recall = recall_score(y, labels,average='micro')
f1 = f1_score(y, labels, average='micro')
print ("precision: "+str(precision))
print ("recal l: "+ str(recall))
print ("f1 score: "+ str(f1))

precision: 0.539325842697
recal l: 0.539325842697
f1 score: 0.539325842697


In [7]:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y, test_size=0.25, random_state=0)

In [8]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [9]:
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [10]:
clf=svc.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [11]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_pred, y_test)
print("Accuracy: "+str(score))
#overfitted hehe

Accuracy: 1.0


In [12]:
from sklearn.feature_selection import SelectKBest, chi2, f_regression

b= SelectKBest(f_regression,k=6)
b.fit(X_train,y_train)
##accuracy 
#k1 = 0.777777777778
#k2 = 0.733333333333
#k3 = 0.444444444444
#k4 = 0.533333333333
#k5 = 0.933333333333
#k6 = 0.955555555556 <<-- amount of accuracy is the highest with 6 features and doesnt increase when adding more.
#k7 = 0.955555555556
#k8 = 0.955555555556

SelectKBest(k=6, score_func=<function f_regression at 0x0000000008B846A8>)

In [13]:
X_train_k = b.fit_transform(X_train, y_train)
X_test_k = b.fit_transform(X_test,y_test)

In [14]:
clf1=svc.fit(X_train_k,y_train)
y_pred1 = clf1.predict(X_test_k)

In [15]:
score = accuracy_score(y_pred1, y_test)
print("Accuracy: "+str(score))

Accuracy: 0.955555555556


In [16]:
params = b.get_params()
params

{'k': 6,
 'score_func': <function sklearn.feature_selection.univariate_selection.f_regression>}

In [17]:
b.get_support()
# >> Therefore, best features are 6(Total phenols),7(Flavanoids)
# 9(Proanthocyanins), 11(Hue), 12(OD280/OD315 of diluted wines), 13(Proline)

array([False, False, False, False, False,  True,  True, False,  True,
       False,  True,  True,  True], dtype=bool)

# Case 2

In [18]:
names1= ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']      

dfc = pd.read_csv("data/processed.cleveland.data", header=None,names=names1)
dfc.head()
# dfc.dtypes


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


In [19]:
meanslope =round (dfc['slope'].mean(),0)


dfc['slope']=dfc['slope'].replace("?", meanslope)
dfc['ca']=dfc['ca'].replace("?", 0)
dfc=dfc.replace("?",0)


In [20]:
Xc = dfc[["age",'sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']]
yc = dfc['num']

In [21]:
Xc_train, Xc_test, yc_train, yc_test = cross_validation.train_test_split(Xc,yc, test_size=0.25, random_state=0)


In [22]:
estc = KMeans(n_clusters=4)
estc.fit(Xc)
yc_est = estc.predict(Xc)

labels = estc.labels_

In [23]:
confusion_matrix(yc, labels)

array([[55, 35,  3, 71,  0],
       [13, 19,  0, 23,  0],
       [ 9, 14,  1, 12,  0],
       [12, 13,  0, 10,  0],
       [ 4,  4,  1,  4,  0]])

In [32]:
svc1 = LinearSVC()
clfc=svc1.fit(Xc_train, yc_train)
yc_pred_svc = clfc.predict(Xc_test)

In [33]:
score = accuracy_score(yc_pred_svc, yc_test)
print("Accuracy: "+str(score))

Accuracy: 0.355263157895


In [66]:
k= SelectKBest(f_regression,k=10)
k.fit(Xc_train,yc_train)
X_train_c = k.fit_transform(Xc_train, yc_train)
X_test_c = k.fit_transform(Xc_test,yc_test)

clfk=svc1.fit(X_train_c,yc_train)
y_predk = clfk.predict(X_test_c)

In [67]:
score = accuracy_score(y_predk, yc_test)
print("Accuracy: "+str(score))
# Accuracy k
# k1 = 0.605263157895 <<<----
# k2 = 0.605263157895
# k3 = 0.513157894737
# k4 = 0.144736842105
# k5 = 0.184210526316
# k6 = 0.0789473684211
# k7 = 0.552631578947
# k8 = 0.0789473684211
# k9 = 0.197368421053
# k10 = 0.144736842105

Accuracy: 0.144736842105


In [26]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc = rfc.fit(Xc_train,yc_train)
yc_pred = rfc.predict(Xc_test)

In [27]:
from sklearn import metrics
print("Accuracy:{0:.3f}".format(metrics.accuracy_score(yc_pred, yc_test)),"\n")

Accuracy:0.868 



In [28]:
print(max(rfc.feature_importances_))
print(rfc.feature_importances_)

0.397639512354
[ 0.07178817  0.01609739  0.02137915  0.05585446  0.06769376  0.01890613
  0.0163169   0.08547199  0.04505878  0.10422264  0.01407712  0.03183869
  0.05365532  0.39763951]
