# LSVT Voice Rehabilitation
**Abstract: 126 samples from 14 participants, 309 features. Aim: assess whether voice rehabilitation treatment leads to phonations considered 'acceptable' or 'unacceptable' (a binary classification problem).**<br>
**Original dataset: https://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation**<br>

### Let's start by importing the necessary libraries
**These include:**<br>
    *1.) TensorFlow*<br>
    *2.) Pandas*<br>
    *3.) NumPy*<br>
    *4.) Matplotlib*<br>
    *5.) Seaborn*<br>
    *6.) SVM classifier*<br>
    *7.) Decision tree classifier*<br>
    *8.) Random forest classifier*<br>
    *9.) Pandas profiling*<br>
    etc.

In [None]:
import tensorflow as tf
import pandas_profiling 
import datetime as dt
import pandas as pd
import numpy as np
import graphviz
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from tensorflow import keras
from sklearn import tree
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
%matplotlib inline

### Let's import our dataset, data.csv
**Link:** <a>https://www.openml.org/d/1484 </a><br>
**I have used the dataset from OpenML because it has a classification column, so we can train and test our model. The same approach should also work on the original raw LSVT dataset from the UCI Machine Learning Repository.**<br>

In [None]:
df = pd.read_csv('../input/lsvt-voice-rehabilation-dataset/data.csv')

#### Let's look at our target feature

In [None]:
df['Class']

In [None]:
df.head()

#### Create a correlation matrix to see which features our target feature depends on

In [None]:
#plt.figure(figsize=(7,5))
cor = df.corr()
cor
#heat_map = sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
#plt.show()

In [None]:
cor_target = abs(cor["Class"])

### The features most relevant to our target column are:
**V80**<br>
**V82**<br>
**V84**<br>
**V85**<br>
**V86**<br>
**V153**<br>
***Note: `Class` also appears in the output because the correlation of a column with itself is always 1.***<br>
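
The threshold filter used in the next cell can be sketched as a small reusable function. This is only an illustration on synthetic data: the column names `strong`, `weak` and `noise` and the helper `select_by_correlation` are made up for the example and are not part of the LSVT dataset or of pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: not the LSVT features).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=200),  # closely follows the target
    "weak": target + rng.normal(0, 3.0, size=200),    # mostly noise
    "noise": rng.normal(0, 1.0, size=200),            # unrelated to the target
    "Class": target,
})

def select_by_correlation(frame, target_col, threshold):
    """Return the columns whose |Pearson r| with target_col exceeds threshold."""
    # Drop the target itself, since its correlation with itself is always 1.
    corr = frame.corr()[target_col].abs().drop(target_col)
    return corr[corr > threshold].sort_values(ascending=False).index.tolist()

selected = select_by_correlation(demo, "Class", threshold=0.48)
print(selected)
```

Dropping the target column inside the helper avoids having to remember that `Class` always passes the threshold.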

In [None]:
relevant_features = cor_target[cor_target>0.48]
relevant_features

### Create a new dataframe from the selected features

In [None]:
data = [df['V80'],df['V82'],df['V84'],df['V85'],df['V86'],df['V153'],df['Class']]

In [None]:
head = ['V80','V82','V84','V85','V86','V153','Class']

In [None]:
new_df = pd.concat(data,axis=1,keys=head)

In [None]:
new_df.head()

In [None]:
new_df.describe()  # call the method to get the summary statistics

**There are no missing values in our dataset.**

In [None]:
new_df.isna().sum() #no missing values

In [None]:
new_df.info()

### Let's draw a box plot to see how the values in our dataset are distributed<br>
**As we can see, the features are on very different scales: some, like V153, take large negative values, whereas others lie between 0 and 1. So we need to bring them onto a common scale so that our models can train efficiently.**

In [None]:
plt.figure(figsize=(18,9))
new_df[['V80','V82','V84','V85','V86','V153','Class']].boxplot()
plt.title("Numerical variables in Our Dataset", fontsize=20)
plt.show()

In [None]:
new_df.head()

**Here I perform min-max normalization.**<br>
**For more info: <a>https://www.geeksforgeeks.org/ml-feature-scaling-part-2/</a>**
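
Min-max normalization rescales each value as `(x - min) / (max - min)`, so every column ends up in the range [0, 1]. The cells that follow apply this formula one column at a time. A minimal sketch of the same idea as a helper is shown here; `min_max_scale` and the toy frame are hypothetical names introduced only for this example.

```python
import pandas as pd

def min_max_scale(frame, columns):
    """Return a copy of frame with each listed column rescaled to [0, 1]."""
    scaled = frame.copy()
    for col in columns:
        col_min, col_max = scaled[col].min(), scaled[col].max()
        span = col_max - col_min
        # A constant column has zero range; map it to 0.0 instead of dividing by zero.
        scaled[col] = 0.0 if span == 0 else (scaled[col] - col_min) / span
    return scaled

# Toy data (assumption: not taken from the LSVT dataset).
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [-10.0, 0.0, 10.0], "c": [5.0, 5.0, 5.0]})
scaled_toy = min_max_scale(toy, ["a", "b", "c"])
print(scaled_toy)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the box plots before and after scaling.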

In [None]:
new_df['V86'] = (new_df['V86']- new_df['V86'].min())/(new_df['V86'].max() - new_df['V86'].min())

In [None]:
new_df['V85'] = (new_df['V85']- new_df['V85'].min())/(new_df['V85'].max() - new_df['V85'].min())

In [None]:
new_df['V84'] = (new_df['V84']- new_df['V84'].min())/(new_df['V84'].max() - new_df['V84'].min())

In [None]:
new_df['V82'] = (new_df['V82']- new_df['V82'].min())/(new_df['V82'].max() - new_df['V82'].min())

In [None]:
new_df['V80'] = (new_df['V80']- new_df['V80'].min())/(new_df['V80'].max() - new_df['V80'].min())

In [None]:
new_df['V153'] = (new_df['V153']- new_df['V153'].min())/(new_df['V153'].max() - new_df['V153'].min())

**Our target class needs to be in binary form, i.e. encoded as 0 and 1. Note that `factorize()` assigns 0 to the first class label it encounters and 1 to the other.**
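
As a small illustration of how `factorize()` assigns its codes, the sketch below uses a made-up label series. The explicit mapping `{1: 0, 2: 1}` is a hypothetical choice for the example, not something taken from the dataset documentation.

```python
import pandas as pd

# Made-up labels (assumption: the real column may start with a different value).
labels = pd.Series([2, 1, 1, 2, 1])

# factorize() numbers the labels in order of first appearance.
codes, uniques = labels.factorize()
print(codes)    # integer code for each row
print(uniques)  # original label for each code

# An explicit mapping makes the encoding independent of row order.
explicit = labels.map({1: 0, 2: 1})
print(explicit.tolist())
```

If the meaning of 0 and 1 matters when reading the confusion matrix later, an explicit mapping may be the safer option.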

In [None]:
new_df['Class'] = new_df['Class'].factorize()[0]

In [None]:
new_df

## Let's see the box plot after normalization:
**Now the values are between 0 and 1.**

In [None]:
plt.figure(figsize=(18,9))
new_df[['V80','V82','V84','V85','V86','V153','Class']].boxplot()
plt.title("Numerical variables in Our Dataset", fontsize=20)
plt.show()

## Using pandas profiling:
**Pandas profiling can be used for exploratory data analysis. It plots the correlation matrix and gives valuable summary information about each feature.**<br>
**For more info, read this blog: https://towardsdatascience.com/exploratory-data-analysis-with-pandas-profiling-de3aae2ddff3**

In [None]:
start_time = dt.datetime.now()
print("Started at ", start_time)
report = pandas_profiling.ProfileReport(new_df)
report

**Now let's extract our features, which are 'V80', 'V82', 'V84', 'V85', 'V86' and 'V153'.**

In [None]:
X= new_df.drop(['Class'],axis=1)

**Our label is the target column, i.e. 'Class'.**

In [None]:
y = new_df['Class']

## Hyperparameter Tuning
**Hyperparameter tuning (HPT) is done to find the optimal parameters so that our model works efficiently. scikit-learn estimators have many parameters, which makes it difficult to find the optimal ones by hand. The code below can also be used as a template for HPT.**

In [None]:
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,5,20],
            'kernel': ['rbf','linear'],
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'max_depth':[1,5,9],
            'n_estimators': [1,5,20,100],
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,2,5,10,15]
        }
    }
}


## The Optimal Parameters
**I have used `GridSearchCV` because the dataset has only 126 instances. If the dataset is large, one can use `RandomizedSearchCV` instead.**
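
For a larger dataset, the randomized alternative could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and `n_iter=5` and the `random_state` values are arbitrary choices for the example. The notebook itself uses the grid search in the next cell.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for a larger real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": [1, 5, 9], "n_estimators": [1, 5, 20, 100]},
    n_iter=5,        # try only 5 of the 12 possible combinations
    cv=5,
    random_state=0,  # make the sampled combinations reproducible
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Because only `n_iter` combinations are sampled, the cost stays fixed as the parameter grid grows, at the price of possibly missing the best combination.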

In [None]:
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X, y)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_,
    })
    
# Use a separate name so the original dataset in `df` is not overwritten.
results_df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
results_df

# Model Training (as it is a classification problem):
### 1.) Using Random Forest classifier
### 2.) Using Logistic Regression
### 3.) Using SVM

### Random Forest model performance using cross_val_score :-

In [None]:
scores = cross_val_score(RandomForestClassifier(max_depth=9,n_estimators=100),X, y,cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

### Logistic Regression model performance using cross_val_score :-

In [None]:
scores = cross_val_score(LogisticRegression(solver='liblinear',multi_class='auto',C=1),X, y,cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

### SVM performance using cross_val_score

In [None]:
scores = cross_val_score(svm.SVC(gamma='auto',kernel='linear',C=20),X, y,cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

**Comparing the cross-validation scores above, Random Forest is the best classifier for our prediction.**

In [None]:
x_train,x_test,y_train,y_test=train_test_split( X, y, test_size=0.3, random_state=42)

In [None]:
model = RandomForestClassifier(max_depth=9,n_estimators=100)

In [None]:
model.fit(x_train,y_train)

In [None]:
y_pred = model.predict(x_test)

In [None]:
accuracy_score(y_test, y_pred)*100

In [None]:
# Rows of the confusion matrix are true labels, columns are predicted labels.
cm = tf.math.confusion_matrix(labels=y_test,predictions=y_pred)
plt.figure(figsize =(5,4))
sns.heatmap(cm,annot=True,fmt = 'd')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.show()

# Conclusion:
**Random Forest is the best-fitting classifier for our prediction. Its cross_val_score is higher than that of SVM and logistic regression.<br>After training and testing on our dataset, our model gave an accuracy greater than 90%.**<br>
#### Key learning:
**Learnt about the confusion matrix, classifiers, feature extraction techniques, how to process raw data, feature scaling, and how to build a model with better accuracy using data preprocessing and pandas profiling.**