## Dataset: Parkinson's Disease Data Set
## Domain: Medical

### We are going to build classification models and ensembles of models on the above dataset to classify patients into their respective labels using attributes derived from their voice recordings

## 1. Load the dataset

In [None]:
import warnings
import pandas as pd  # data loading and processing

warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output
df = pd.read_csv("../input/parkinson-disease-detection/Parkinsson disease.csv")

## 2. Eye-ball raw data to get a feel of the data in terms of number of records, structure of the file, number of attributes, types of attributes and a general idea of likely challenges in the dataset.

In [None]:
# Displaying the head of the dataset
df.head(10)

### 2.1 Check shape of the data

In [None]:
# Displaying the shape and datatype for each attribute

print('Shape of the dataset: ',df.shape,'\n\n')

df.info()

### There are 196 records and 24 columns

### Of the 24 attributes, 'status' is the dependent (target) attribute. Apart from 'name' (text) and 'status' (integer), all attributes are of 'float' datatype, and there are no null values in the dataset

### Encoding categorical values as numbers is not required for this dataset, because all the attributes used for modelling are floating-point or integer. The 'name' column is categorical, but we are not going to use it for prediction because it is only an identifier and carries no predictive value.

In [None]:
# Displaying the descriptive statistics of each attribute

df.describe().T

### For almost all columns the mean is greater than the median (50%)

### A mean above the median suggests that most of the columns are skewed to the right.
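
The mean-versus-median comparison above can be backed up with the sample skewness of each column. The sketch below uses synthetic data as a stand-in for the voice measures (the column names `right_skewed` and `symmetric` are made up for illustration); on the real data the same three statistics could be computed on `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, not the Parkinson's dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),    # long right tail
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),   # no systematic skew
})

# One row per column: mean, median and sample skewness
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

A clearly positive `skew` together with `mean > median` points to a right-skewed column; values near zero point to a roughly symmetric one.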




In [None]:
# Checking Null or Empty Values

df.isna().sum()

### There are no null values in the dataset, so we can safely go ahead

In [None]:
df = df.drop(columns='name')  # as noted earlier, dropping the 'name' column as it is not significant for model building

## 3. Using univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, relationships between variables etc.

### Univariate analysis

In [None]:
# Plotting a histogram of every column to study the data distribution

import seaborn as sns  # importing seaborn for plotting
import matplotlib.pyplot as plt   # importing matplotlib

plt.figure(figsize=(20,30))

# iterate over all the columns in the dataframe and plot one histogram per subplot
for k, col in enumerate(df.columns, start=1):
    plt.subplot(6,4,k)
    plt.hist(df[col],color='red', edgecolor = 'black', alpha = 0.5)
    plt.title(col)

#### -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Observations


## * The measures of vocal fundamental frequency are shown in the first 3 histograms

### The minimum vocal fundamental frequency (MDVP:Flo(Hz)) is positively skewed, with most values between 75Hz and 125Hz.

### The average vocal fundamental frequency (MDVP:Fo(Hz)) is close to normally distributed, with the tallest bar between 115Hz and 125Hz.

### The maximum vocal fundamental frequency (MDVP:Fhi(Hz)) has most of its values on the left, with a small group of values far out in the right tail.

#### -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## * The measures of the ratio of noise to tonal components (NHR, HNR) are shown above

### NHR is right skewed: most observations have very small values, with the majority between 0 and 0.04.

### HNR looks roughly normally distributed, with a slight negative skew.



#### -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## * MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA

### The distributions of all of the above columns, which measure variation in amplitude, are positively skewed.


#### -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### The spread1 and spread2 columns look normally distributed; next we look at how they relate to the target attribute

### Bivariate analysis

In [None]:
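# Using seaborn histograms (with a KDE curve) to plot spread1 for each status value
# (histplot is used because distplot is deprecated in recent seaborn versions)

sns.histplot(df[df.status == 0]['spread1'], color='red', kde=True, stat='density');   # spread1 for healthy people
sns.histplot(df[df.status == 1]['spread1'], color='blue', kde=True, stat='density');  # spread1 for people who have PD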

### From the above distribution we can observe the following:

### * spread1 is approximately normally distributed both for healthy people and for people who have PD

### * more people with spread1 between -8.5 and -7.5 have PD

### * more people with spread1 between -6.5 and -5 are healthy

In [None]:
fig, ax = plt.subplots(1,2,figsize=(15,6))

# Bivariate boxplots to compare NHR and HNR across the two status values
sns.boxplot(x=df['status'],y=df['NHR'],ax=ax[0]);   # boxplot of status vs NHR
sns.boxplot(x=df['status'],y=df['HNR'],ax=ax[1]);   # boxplot of status vs HNR

### * NHR, HNR - two measures of the ratio of noise to tonal components in the voice

### * Lower NHR and higher HNR indicate better voice quality.

### * People who have PD (status=1) have higher NHR than healthy people. There are also many outliers with a high level of NHR.

### * Looking at HNR, people who have PD (status=1) have lower levels


## The target column distribution.

In [None]:
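# sort the counts by class label so they line up with the hard-coded labels (0 first, then 1)
status_counts = df.status.value_counts().sort_index()
plt.figure(figsize=(8,8))
plt.pie(status_counts,colors=['lightblue','yellow'],explode=[0,0.02],autopct='%1.0f%%',labels=['0(healthy)',"1(parkinson's)"]);

### The classes are imbalanced at roughly 75:25. value_counts() orders by frequency, so the counts are sorted by label before plotting; otherwise the larger slice would always be labelled '0(healthy)'. In the UCI version of this dataset the majority class is status = 1 (Parkinson's), not the healthy group.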
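
With a 75:25 class imbalance, a model that always predicts the majority class already reaches about 75% accuracy, and a plain random split can shift the class ratio in a small test set. The sketch below, on synthetic labels rather than the real `status` column, shows how the `stratify` argument of `train_test_split` keeps the ratio the same in both splits and what the majority-class baseline looks like.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 75:25 imbalance (stand-in for the 'status' column)
y_demo = np.array([1] * 150 + [0] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# stratify=y_demo keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

train_ratio = y_tr.mean()   # share of class 1 in the training split
test_ratio = y_te.mean()    # share of class 1 in the test split

# Accuracy of a model that always predicts the majority class
majority_baseline = max(test_ratio, 1 - test_ratio)
print(train_ratio, test_ratio, majority_baseline)
```

Any accuracy reported later is easier to judge against this baseline than against 50%.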


In [None]:
# checking the correlation of dataset 
fig, ax = plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df.corr(),cmap="YlGnBu",square=True,annot = True,linewidth=0.2)

### Many columns are highly positively correlated with each other, and almost all columns are negatively correlated with the HNR column

### MDVP:Jitter(%) has a very high correlation with MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ and Jitter:DDP

### MDVP:Shimmer has a very high correlation with MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ and Shimmer:DDA, probably because they all measure variation in amplitude.

### The target variable status has only a weak correlation with most variables in the dataset; the sign is mostly positive, but negative for HNR (people with PD have lower HNR, as the boxplot shows)
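
Reading highly correlated pairs off a 23 x 23 heatmap is error-prone, so it can help to list them programmatically. The helper below is a hypothetical sketch (`highly_correlated_pairs` and the 0.9 threshold are assumptions, not part of the original notebook), demonstrated on synthetic columns.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic demo: 'a_scaled' is almost a multiple of 'a', 'noise' is unrelated
rng = np.random.default_rng(1)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "a_scaled": 3 * base + rng.normal(scale=0.05, size=200),
    "noise": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

On the real data the call would be `highly_correlated_pairs(df.drop(columns='status'))`, which would be expected to surface the jitter and shimmer groups mentioned above.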

In [None]:
# correlation coefficient values in each attributes.

correlation_values=df.corr()['status']
pd.DataFrame(correlation_values.sort_values(ascending=False))

### Above are the correlation values of each attribute with the target, in descending order
### The table shows which columns in the dataframe have a lower correlation with the target attribute

## 4. Split the dataset into training and test set in the ratio of 70:30 (Training:Test)

In [None]:
from sklearn.model_selection import train_test_split

x = df.drop(columns='status')  # predictors
y = df.status                  # target attribute

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state = 42)  # making 70:30 split

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

## 5. Prepare the data for training - Scale the data if necessary, get rid of missing values (if any) etc
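
When scaling is needed, a common precaution is to learn the scaling parameters from the training split only and reuse them on the test split, so that no information from the test set leaks into training. The sketch below shows this on synthetic data, once by hand and once with a scikit-learn `Pipeline`; the data and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (not the Parkinson's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# By hand: fit on the training split, transform both splits
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # uses the training min/max

# Same idea with a pipeline: the scaler is refit on whatever data fit() receives
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
pipe_accuracy = pipe.score(X_te, y_te)
print(pipe_accuracy)
```

The pipeline form is also convenient with `cross_val_score`, because the scaler is then refit inside every fold.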

In [None]:
# as we checked above, there are no null values in the dataset

In [None]:
# as almost all columns in the dataset are skewed, we are going to use the MinMax scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

rc = MinMaxScaler() # instantiating the object for minmaxscaler

columns = list(x_train.columns)  # storing the columns

# fit the scaler on the training set only, then apply it to both splits
x_train_scaled = pd.DataFrame(rc.fit_transform(x_train), columns=columns)

# reuse the training min/max on the test set; refitting here would leak test information
x_test_scaled = pd.DataFrame(rc.transform(x_test), columns=columns)

## 6. Train at least 3 standard classification algorithms - Logistic Regression, Naive Bayes, SVM, K-NN etc.

#### 6.1 Logistic Regression

In [None]:
# Logistic Regression is a classification algorithm.
# It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.

from sklearn.linear_model import LogisticRegression

# create an instance for LogisticRegression
Logistic = LogisticRegression(solver="liblinear")

# fit the model
Logistic.fit(x_train_scaled, y_train)

# predict on created model
logistic_predict = Logistic.predict(x_test_scaled)

In [None]:
# checking the score of the testset
acc_logistic_test = Logistic.score(x_test_scaled, y_test)*100

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
result_df = pd.DataFrame({'Model': ['Logistic Regression'], 'Accuracy' : [acc_logistic_test]}).drop_duplicates()
result_df

#### 6.2 Naive Bayes classifier

In [None]:
# Naive Bayes assumes that the predictors (input features) are independent of each other given the class

from sklearn.naive_bayes import GaussianNB # using Gaussian Naive Bayes as all the columns are numerical

# create an instance for GaussianNB
naive_model = GaussianNB()

# fit the model
naive_model.fit(x_train_scaled, y_train)

# prediction using created model
naive_predict = naive_model.predict(x_test_scaled)

In [None]:
# checking the score of the test set
acc_naive_test = naive_model.score(x_test_scaled, y_test)*100

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
tempResult_df = pd.DataFrame({'Model': ['Naive Bayes'], 'Accuracy' : [acc_naive_test]})
result_df = pd.concat([result_df,tempResult_df]).drop_duplicates()
result_df

#### 6.3 K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# create instance for KNeighborsClassifier and using k value = 5
knn_model = KNeighborsClassifier(n_neighbors=5)

# fit the model
knn_model.fit(x_train_scaled, y_train)

# prediction using created model
knn_predict = knn_model.predict(x_test_scaled)

In [None]:
# checking the score of the test set
acc_knn_test = knn_model.score(x_test_scaled, y_test)*100 

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
tempResult_df = pd.DataFrame({'Model': ['KNN Scaled'], 'Accuracy' : [acc_knn_test]})
result_df = pd.concat([result_df,tempResult_df]).drop_duplicates()
result_df

### Building KNN model without scaling the data

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# create instance for KNeighborsClassifier and using k value = 5
knn_model2 = KNeighborsClassifier(n_neighbors=5)

# fit the model
knn_model2.fit(x_train, y_train)  # fitting the model on the unscaled data

# prediction using created model
knn_predict2 = knn_model2.predict(x_test)
acc_knn_test2 = knn_model2.score(x_test, y_test)*100 

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
tempResult_df = pd.DataFrame({'Model': ['KNN Not Scaled'], 'Accuracy' : [acc_knn_test2]})
result_df = pd.concat([result_df,tempResult_df]).drop_duplicates()
result_df

#### K-NN is a supervised, non-parametric and lazy (instance-based) algorithm that makes no assumption about dependency between the variables. To classify a point, KNN looks at the k closest training examples in the feature space and assigns the majority class among them. If k = 1, the object is simply assigned to the class of its single nearest neighbor.

#### On this classification problem K-NN works well on the scaled data, reaching about 91% accuracy on the test set
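
The nearest-neighbour vote described above can be sketched in a few lines of NumPy. This is a minimal illustration with Euclidean distance on a toy dataset, not the scikit-learn implementation; `knn_predict_one` is a hypothetical helper name.

```python
from collections import Counter

import numpy as np

def knn_predict_one(X_train, y_train, query, k=5):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: class 0 clusters near (0, 0), class 1 near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict_one(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_near_one = knn_predict_one(X_toy, y_toy, np.array([1.0, 0.95]), k=1)
print(pred_near_origin, pred_near_one)
```

Because the prediction depends directly on distances, features with large numeric ranges dominate the vote unless the data is scaled first.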

#### 6.4 Decision Tree classifier

In [None]:
# Decision tree algorithm falls under the category of supervised learning.
# A decision tree solves the problem with a tree representation: each internal node tests an attribute and each leaf node corresponds to a class label

from sklearn.tree import DecisionTreeClassifier

# using Gini impurity as the split criterion
decision_tree = DecisionTreeClassifier(criterion = 'gini', max_depth = 6, random_state = 100)

# fitting the model
decision_tree.fit(x_train_scaled, y_train)

# predicting on the test set
decision_pred = decision_tree.predict(x_test_scaled)

In [None]:
# checking the score of the testset
acc_DT_test = decision_tree.score(x_test_scaled, y_test)*100

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
tempResult_df = pd.DataFrame({'Model': ['Decision Tree'], 'Accuracy' : [acc_DT_test]})
result_df = pd.concat([result_df,tempResult_df])
result_df

## 7. Train a meta-classifier and note the accuracy on test data

In [None]:
from mlxtend.classifier import StackingClassifier  # importing stacking classifier package

In [None]:
from sklearn.svm import SVC  # importing SVM classifier

# creating four individual classification models
model1 = DecisionTreeClassifier(criterion = 'entropy',max_depth = 6)
model2 = KNeighborsClassifier(n_neighbors=5)
model3 = GaussianNB()
model4 = SVC(C = 10,gamma=0.01)

# giving logistic regression as meta classifier/model
meta_model = LogisticRegression()

In [None]:
# calling stacking classifier with all the base models and meta model
stcl = StackingClassifier(classifiers = [model1,model2,model3,model4], meta_classifier = meta_model)

In [None]:
from sklearn.model_selection import cross_val_score

# loop through all the base models and the stacking classifier
# (note: cross-validation here runs on the unscaled predictors x)
for model, label in zip([model1,model2,model3,model4, stcl], ['DecisionTreeClassifier','KNN','NaiveBayes','SVM','StackingClassifier']):

    scores = cross_val_score(model, x, y, cv=10, scoring='accuracy')
    print(scores,label)
    print("Mean accuracy:",scores.mean(),label)

#### Stacking is an ensemble learning technique in which the predictions of multiple classifiers are used as new features to train a meta-classifier. The meta-classifier can be any classifier of choice.

#### The predictions of the individual base learners are stacked and passed as features to the meta-classifier, which makes the final prediction

#### Here we run 10-fold cross validation with cross_val_score and look at the accuracy of each individual model and of the stacked model across the 10 splits.

#### The individual models already give good accuracy, as shown above. Combining them with stacking gives only a slight improvement.
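
The stacking idea described above can also be sketched with scikit-learn's own `StackingClassifier`, swapped in here for the mlxtend class used in this notebook. One difference worth knowing: the scikit-learn version trains the meta-classifier on out-of-fold predictions of the base learners (controlled by `cv`). The data below is synthetic and the base-learner settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the voice features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Base learners: their predictions become the meta-classifier's input features
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=6, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the meta-classifier
    cv=5,                                  # out-of-fold predictions for the meta-classifier
)

# Outer 5-fold cross-validation to estimate the accuracy of the whole stack
scores = cross_val_score(stack, X_demo, y_demo, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Comparing `mean_accuracy` with the cross-validated accuracy of each base learner on the same folds shows how much, if anything, the stack adds.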

In [None]:
# storing the stacking classifier's mean cross-validated accuracy for final comparison
# (scores still holds the last loop iteration, i.e. the stacking classifier)
tempResult_df = pd.DataFrame({'Model': ['Stacking Classifier'], 'Accuracy' : [scores.mean()*100]})
result_df = pd.concat([result_df,tempResult_df])
result_df

## 8. Train at least one standard Ensemble model - Random forest, Bagging, Boosting etc, and note the accuracy

#### 8.1 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier  # importing random forest classifier

rfcl = RandomForestClassifier() # random forest with default settings (100 trees in scikit-learn >= 0.22)
rfcl = rfcl.fit(x_train_scaled, y_train)  # fitting the model

In [None]:
rfcl.score(x_test_scaled, y_test)  # accuracy on the test set

In [None]:
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
rf_pred = rfcl.predict(x_test_scaled)

In [None]:
# Let's check the report of our default model
print(classification_report(y_test,rf_pred))

In [None]:
# Printing the accuracy score of actual values and predictions
acc_rf = accuracy_score(y_test,rf_pred)*100
print('Accuracy score of Random Forest Classifier: ',acc_rf,'%','\n')

# Printing confusion matrix
cm = confusion_matrix(y_test,rf_pred)

df1 = pd.DataFrame(cm,columns=['No','Yes'], index = ['No','Yes'])
print('\t\tConfusion matrix')
sns.heatmap(df1,annot=True,cbar=False);

In [None]:
df1

### The recall for class 1 is 100%, so none of the actual 1's were misclassified

In [None]:
# storing accuracy results of each model in the dataframe for final comparison
tempResult_df = pd.DataFrame({'Model': ['Random Forest'], 'Accuracy' : acc_rf})
result_df = pd.concat([result_df,tempResult_df]).drop_duplicates()
result_df

## Grid Search to Find Optimal Hyperparameters

In [None]:
# Creating the parameter grid for the grid search
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2,4,8,10],
    'n_estimators': [50,100,200, 300], 
    'max_features': [5, 10, 15]
    }

# Create a base model
rf = RandomForestClassifier(random_state=100)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1,verbose = 1,scoring='accuracy')

In [None]:
grid_search.fit(x_train_scaled, y_train);

In [None]:
# printing the optimal accuracy score and hyperparameters
print('We can get accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# refit a forest with the hyperparameters reported by the grid search above
rf_tuned = RandomForestClassifier(max_depth= 8, max_features= 5, n_estimators= 50)

In [None]:
rf_tuned.fit(x_train_scaled,y_train)

In [None]:
rf_tuned.score(x_test_scaled,y_test)  

In [None]:
result_df

In [None]:
### Random forest performing better

## 9. Comparing all the models and pick the best one among them

## Accuracies of all the Models implemented so far

In [None]:
result_df

### From the above data frame we can observe that the 'KNN Scaled' and 'Random Forest' models have the highest accuracy, i.e. 91.52% (KNN) and 89.83% (RF), compared to all other models.

### All other models have an accuracy above 80%

### Scaling the data increases the accuracy of KNN; without scaling, its accuracy is lower.

### As per my observation, KNN with scaled data performs better than any other model here. It makes no assumption about dependency between the variables and classifies each point from its k closest training examples in the feature space; if k = 1, the object is simply assigned to the class of its single nearest neighbor. We could also try other models, such as boosting, to look for better accuracy.

### If we don't perform scaling, 'Random Forest' is the best model, and it gives higher accuracy after tuning the parameters. It consists of a large number of decision trees that operate as an ensemble: each individual tree outputs a class prediction and the class with the most votes becomes the model's prediction.
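
The voting step described above can be illustrated with a small hard-vote sketch. This is only an illustration of the idea: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, and `majority_vote` is a hypothetical helper.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one prediction per sample.

    tree_predictions has shape (n_trees, n_samples) and holds integer class labels.
    Ties go to the lowest class label.
    """
    tree_predictions = np.asarray(tree_predictions)
    n_classes = tree_predictions.max() + 1
    # counts[c, s] = number of trees that voted class c for sample s
    counts = np.stack(
        [(tree_predictions == c).sum(axis=0) for c in range(n_classes)]
    )
    return counts.argmax(axis=0)

# Three trees voting on three samples
votes = [
    [1, 0, 0],   # tree 1
    [1, 1, 0],   # tree 2
    [0, 1, 0],   # tree 3
]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Individual trees disagree on the first two samples, but the ensemble settles on the class that most trees chose.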

### Both models, i.e. KNN and Random Forest, appear to generalize well, neither overfitting nor underfitting; the train accuracies below help to check this.


In [None]:
print('RandomForest train accuracy',rf_tuned.score(x_train_scaled,y_train)*100)
print('KNN train accuracy',knn_model.score(x_train_scaled,y_train)*100)

In [None]:
from sklearn import metrics
print('KNN')
pd.DataFrame(metrics.confusion_matrix(y_test,knn_predict))

In [None]:
print('Random Forest')
pd.DataFrame(metrics.confusion_matrix(y_test,rf_pred))

### Looking at the confusion matrices of both models, there are very few misclassifications


## Ultimately, model performance depends on the scaling operation. The confusion matrices show one more misclassification for the random forest than for KNN, but further tuning of the RF parameters may close that gap.

### In the medical domain, predicting that a person does not have the disease when he/she actually has it is more dangerous than predicting that a person has the disease when he/she actually doesn't. Hence it is most important to keep false negatives low, i.e. to have a high recall (true positive rate) for class 1.
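
The cost asymmetry described above is what recall measures: of all people who actually have the disease, the share the model catches. The sketch below computes recall and precision from a confusion matrix on made-up labels (not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up ground truth and predictions; 1 = has the disease
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were correct
print(tn, fp, fn, tp, recall, precision)
```

Here one sick person is missed (`fn = 1`), so recall is 5/6; in a screening setting that missed case matters more than the single false alarm (`fp = 1`).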

### Since both models have zero misclassifications on the 1's (a true positive rate of 100%), both of them look suitable for production use.

### As per my observation, 'Random Forest' is the best model and performs well on this dataset.

In [None]:
# Bar plot to show the accuracy of each model
fig = plt.figure(figsize=(12,5))
fig.suptitle('Comparison of all the models')
sns.barplot(x=result_df['Model'], y=result_df['Accuracy']);