# >> Detecting Parkinson's Disease << 

## What is Parkinson's Disease?

###  1) Parkinson's disease is a progressive nervous system disorder that affects movement. Symptoms start gradually, sometimes starting with a barely noticeable tremor in just one hand. Tremors are common, but the disorder also commonly causes stiffness or slowing of movement.

### 2) In the early stages of Parkinson's disease, your face may show little or no expression. Your arms may not swing when you walk. Your speech may become soft or slurred. Parkinson's disease symptoms worsen as your condition progresses over time.

### 3) Although Parkinson's disease can't be cured, medications might significantly improve your symptoms. Occasionally, your doctor may suggest surgery to regulate certain regions of your brain and improve your symptoms.

## >>Regression vs Classification<<
## Which one should be used for this problem statement?
## Regression algorithms predict continuous values such as price, salary, or age.
## Classification algorithms predict discrete values such as Male or Female, True or False, or Spam or Not Spam.
## The target here, the `status` column, takes only the values 0 and 1, so this is a classification problem.

In [1]:
# Install XGBoost, which is used later in this notebook
!pip install xgboost



# What is XGBoost?

#### 1) XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements Machine Learning algorithms under the Gradient Boosting framework. It provides a parallel tree boosting to solve many data science problems in a fast and accurate way.

#### 2) XGBoost is a scalable and accurate implementation of gradient boosting machines and it has proven to push the limits of computing power for boosted trees algorithms as it was built and developed for the sole purpose of model performance and computational speed.

#### 3) It can be used for both Classification and Regression models

# When to Use XGBoost?
#### 1 When you have a large number of observations in the training data.
#### 2 When the number of features is smaller than the number of observations in the training data.
#### 3 When the data has a mixture of numerical and categorical features, or only numeric features.
#### 4 When predictive performance, as measured by the model's evaluation metrics, is the main concern.

In [2]:
#make necessary imports
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#### NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays.
#### Pandas is a Python package for data analysis and manipulation. It offers data structures and operations for working with numerical tables and time series. Pandas is an open-source library built on top of NumPy, and it is popular because it makes importing and analyzing data much easier.
#### The OS module in Python provides functions for interacting with the operating system. OS comes under Python’s standard utility modules. This module provides a portable way of using operating system dependent functionality. The *os* and *os.path* modules include many functions to interact with the file system.
#### The sys module in Python provides various functions and variables that are used to manipulate different parts of the Python runtime environment. It allows operating on the interpreter as it provides access to the variables and functions that interact strongly with the interpreter.
#### MinMaxScaler transforms features by scaling each feature to a given range. This estimator scales and translates each feature individually so that it falls in the given range on the training set. The default range is [0, 1]; a different range, such as [-1, 1], can be requested through the `feature_range` argument. Because the scaling depends on each feature's minimum and maximum, it is sensitive to outliers: a few extreme values can compress the remaining values into a narrow part of the range.
#### The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and with any supervised learning algorithm. The procedure involves taking a dataset and dividing it into two subsets: a training set used to fit the model and a test set used to evaluate it.
#### Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition: Accuracy = (Number of correct predictions) / (Total number of predictions).
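
#### The accuracy definition above can be sketched in a few lines of plain Python. The `accuracy` helper and the demo labels are hypothetical and only illustrate the ratio; the notebook itself relies on `accuracy_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, not taken from the Parkinson's dataset
y_true_demo = [1, 1, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)  # 4 correct out of 5
print(demo_accuracy)
```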

In [3]:
# Read the data from a CSV file using pandas
df=pd.read_csv('https://raw.githubusercontent.com/chaitanyabaranwal/ParkinsonAnalysis/master/parkinsons.csv')
df.head()

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | RPDE | DFA | spread1 | spread2 | D2 | PPE | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 | 1 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 | 1 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 | 1 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 | 1 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 | 1 |


## Get the features (the inputs used to predict the output) and the labels (the output you are trying to predict)

### Method-1 ( Using loc )

In [4]:
features=df.loc[:,df.columns!='status'].values[:,1:] 
labels=df.loc[:,'status'].values

##### The 'name' column doesn't help in predicting the output, so we leave it out of the features. `.values[:,1:]` keeps every column from index 1 onward and skips column 0 (i.e., name).

### Method-2 ( Using iloc )

In [5]:
features=df.iloc[:,:-1].values[:,1:]
labels=df.iloc[:,-1].values
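
#### Both methods above convert the table to NumPy while the string column 'name' is still present, which is why the printed feature array has dtype=object. A possible alternative, sketched here on a made-up toy table (`toy_df` and its values are hypothetical, only the column layout mirrors the dataset), is to drop the unwanted columns by name first so the result stays numeric.

```python
import pandas as pd

# Toy stand-in for the real table: values are made up, only the layout matters
toy_df = pd.DataFrame({
    "name": ["rec_1", "rec_2", "rec_3"],
    "MDVP:Fo(Hz)": [119.9, 122.4, 116.7],
    "PPE": [0.28, 0.37, 0.33],
    "status": [1, 1, 0],
})

# Drop the identifier and the label by name before converting to NumPy
toy_features = toy_df.drop(columns=["name", "status"]).values
toy_labels = toy_df["status"].values

print(toy_features.dtype)  # numeric, because no string column was mixed in
print(toy_features.shape, toy_labels.shape)
```

#### Selecting by name also keeps working if the column order of the CSV file changes, which positional slicing does not guarantee.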

In [6]:
features,labels

(array([[119.992, 157.302, 74.997, ..., 0.266482, 2.301442, 0.284654],
        [122.4, 148.65, 113.819, ..., 0.33559, 2.486855, 0.368674],
        [116.682, 131.111, 111.555, ..., 0.311173, 2.342259, 0.332634],
        ...,
        [174.688, 240.005, 74.287, ..., 0.158453, 2.679772, 0.131728],
        [198.764, 396.961, 74.904, ..., 0.207454, 2.138608, 0.123306],
        [214.289, 260.277, 77.973, ..., 0.190667, 2.555477, 0.148569]],
       dtype=object),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1,

In [7]:
# Get the count of each label (i.e., 0 and 1) in labels
print(f"{(labels == 1).sum()} 1's & {(labels == 0).sum()} 0's")

147 1's & 48 0's
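
#### The counts above show that the classes are imbalanced, which matters when reading an accuracy score later on. As a rough sketch, the array below is rebuilt from the printed counts (it is not the real label order), and `majority_baseline` is a hypothetical name for the accuracy of a model that always predicts the larger class.

```python
import numpy as np

# Rebuild a label array with the counts printed above (147 ones, 48 zeros)
labels_demo = np.array([1] * 147 + [0] * 48)

values, counts = np.unique(labels_demo, return_counts=True)
class_counts = dict(zip(values.tolist(), counts.tolist()))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = counts.max() / counts.sum()

print(class_counts)
print(round(majority_baseline * 100, 1))
```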


In [8]:
# Scale the features to the range [-1, 1]
scaler=MinMaxScaler((-1,1)) # feature_range=(-1, 1)
x=scaler.fit_transform(features) 
y=labels
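
#### To make the scaling in the cell above less of a black box, here is a minimal NumPy sketch of the min-max formula. `min_max_scale` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the demo values are made up.

```python
import numpy as np

def min_max_scale(data, low=-1.0, high=1.0):
    """Rescale each column linearly so its min maps to `low` and its max to `high`."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # constant columns: avoid division by zero (they map to `low`)
    return (data - col_min) / span * (high - low) + low

# Made-up values: two columns on very different scales
demo = np.array([[100.0, 0.001],
                 [150.0, 0.003],
                 [200.0, 0.005]])
scaled_demo = min_max_scale(demo)
print(scaled_demo)
```

#### After scaling, both columns cover the same range even though their raw values differ by several orders of magnitude.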

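#### Usually fit_transform is used on the training data so that we can scale the training data and also learn the scaling parameters. With MinMaxScaler, the learned parameters are the minimum and maximum of each feature in the training set. These learned parameters are then used to scale the test data. Note that the cell above calls fit_transform on the full dataset before it is split, so the test rows also influence the learned minimum and maximum; fitting the scaler on the training split only avoids this.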

### >> fit() vs transform() vs fit_transform() <<
#### The fit() method learns the parameters of a transformation from a training data set: for example, the mean and standard deviation for standardization, or the minimum and maximum for scaling features to a given range.
#### The transform() method applies the parameters learned by fit(). It is used to transform both the training data and the test data (a.k.a. unseen data).
#### The fit_transform() method first fits, then transforms the data set in a single call. It is a convenient and often more efficient equivalent of calling fit() followed by transform(). As a “best practice”, fit_transform() is used only on the training data set.
#### Since fit_transform() has already computed and stored the parameters while transforming the training data, only the transformation of the test data is left. That is why transform() is used on the test data instead of fit_transform().
#### >> The fit method calculates the scaling parameters of each feature in our data (the minimum and maximum for MinMaxScaler, or the mean and variance for a standardizing scaler). The transform method transforms all the features using those parameters. We want the same scaling applied to our test data, but we do not want the test data to bias our model. We want our test data to be a completely new, surprise set for our model. The transform method helps us in this case.
#### >> If we use the fit method on our test data too, we compute new parameters, that is, a new scale for each feature, and let our model learn about our test data. What we want to keep as a surprise is then no longer unknown to our model, and we will not get a good estimate of how it performs on the test (unseen) data, which is the ultimate goal of building a model with a machine learning algorithm.
####  >> This is the standard procedure for scaling data while building a machine learning model, so that the model is not biased towards a particular feature of the dataset and, at the same time, does not learn the features/values/trends of the test data.
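
#### One way to follow the procedure described above is to split first and fit the scaler on the training rows only. This is a sketch on synthetic data (the arrays and the `*_demo` names are made up), not a rerun of the notebook's cells, so the accuracy reported later does not come from this ordering.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; shapes and values are made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.integers(0, 2, size=50)

# Split first, so the test rows stay unseen while the scaler is fitted
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

scaler_demo = MinMaxScaler(feature_range=(-1, 1))
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learns per-column min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuses those training min/max values

# Training columns span exactly [-1, 1]; test values can fall slightly outside
print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(), X_te_scaled.max())
```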

In [9]:
#Split the dataset using train-test-split
x_train,x_test,y_train,y_test=train_test_split(x, y, test_size=0.2, random_state=7)

# 20% test data and 80% training data 

### >> Random state <<
#### If you don't specify random_state, a new random seed is used every time you run the code, so the train and test datasets contain different rows each time.
#### However, if a fixed value is assigned, like random_state = 0 or 1 or 7 or any other integer, then no matter how many times you execute your code the result is the same, i.e., the same rows end up in the train and test datasets.
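
#### The effect of a fixed random_state can be checked directly. The sketch below uses small made-up arrays (`data_demo`, `target_demo`) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data_demo = np.arange(20).reshape(10, 2)   # made-up feature rows
target_demo = np.arange(10)                # made-up targets, one per row

# Same seed twice: the two splits select identical rows
_, test_a, _, _ = train_test_split(data_demo, target_demo, test_size=0.2, random_state=7)
_, test_b, _, _ = train_test_split(data_demo, target_demo, test_size=0.2, random_state=7)
same_split = np.array_equal(test_a, test_b)

print(same_split)
```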

In [10]:
#Train the model
model=XGBClassifier() # Create an instance of the XGBoost classifier
model.fit(x_train,y_train) # .fit() trains the model on the training data





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

#### >> The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning. Note that the model is fitted using X (features) and y (target), but the object holds no reference to X and y.

In [11]:
#Calculating the accuracy of the XGBoost model
y_pred=model.predict(x_test) # y_pred gives the predicted values of the test data using XGBoost classifier
print(accuracy_score(y_test, y_pred)*100) #Using accuracy score we get the model's accuracy


94.87179487179486


#### >> The fit() method fits the model to the training instances, while predict() makes predictions on the test instances based on the parameters learned during fit.

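## The model was trained with very few lines of code and reaches an accuracy of about 94.9% on the test set. The test set holds only 39 samples (20% of 195), and always predicting the majority class would already score about 75% (147 of 195), so the figure should be read with that in mind.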