
# **• DOMAIN: Healthcare**

# **• CONTEXT: Medical research university X is undergoing a deep research on patients with certain conditions. The university has an internal AI team. Due to confidentiality, the patients' details and conditions are masked by the client, who provides different datasets to the AI team for developing an AIML model that can predict a patient's condition from the received test results.**

**• DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current
conditions. Each patient is represented in the data set by six biomechanics attributes derived from the shape and
orientation of the condition to their body part.**
> **1. P_incidence**

> **2. P_tilt**

> **3. L_angle**

> **4. S_slope**

> **5. P_radius**

> **6. S_degree**

> **7. Class**

1. **pelvic_incidence-P_incidence**
2. **pelvic_tilt numeric-P_tilt**
3. **lumbar_lordosis_angle-L_angle**
4. **sacral_slope-S_slope**
5. **pelvic_radius-P_radius**
6. **degree_spondylolisthesis-S_degree**
7. **class**

# Importing Necessary Packages

In [None]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from scipy import stats
%matplotlib inline
sns.set_style('darkgrid')
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")

# 1. Import and warehouse data:

**Dataset 1**

In [None]:
df1=pd.read_csv('../input/patients-condition/Part1 - Normal.csv')

**Checking First 5 Rows**

In [None]:
df1.head()

**Shape of the dataset**

In [None]:
df1.shape

**We have 7 columns and 100 rows**

**Dataset-2**

In [None]:
df2=pd.read_csv('../input/patients-condition/Part1 - Type_H.csv')

**Checking First 5 Rows**

In [None]:
df2.head()

**Shape of the dataset**

In [None]:
df2.shape

**We have 7 columns and 60 rows**

**Dataset-3**

In [None]:
df3=pd.read_csv('../input/patients-condition/Part1 - Type_S.csv')

**Checking First 5 Rows**

In [None]:
df3.head()

**Shape of the dataset**

In [None]:
df3.shape

**We have 7 columns and 150 rows**

**Final Dataframe**

In [None]:
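# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three parts row-wise
df=pd.concat([df1,df2,df3],ignore_index=True)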

**Shape of the dataset**

In [None]:
df.shape

**The final dataset has 7 columns and 310 rows**
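
As a small illustration of the combining step, the sketch below stacks three tiny made-up frames with `pd.concat`. The frame names and values are placeholders, not the patient data, and `ignore_index=True` is an assumption chosen here so that the combined frame gets a fresh, unique row index.

```python
import pandas as pd

# Tiny stand-in frames; the values are made up for illustration only.
part_a = pd.DataFrame({"P_incidence": [38.5, 54.9], "Class": ["Normal", "Normal"]})
part_b = pd.DataFrame({"P_incidence": [63.0], "Class": ["Type_H"]})
part_c = pd.DataFrame({"P_incidence": [74.4, 89.7, 44.5], "Class": ["Type_S"] * 3})

# Stack the parts row-wise and rebuild the index so row labels are unique.
combined = pd.concat([part_a, part_b, part_c], ignore_index=True)

print(combined.shape)            # row counts add up, columns stay the same
print(combined.index.is_unique)  # no repeated row labels after concatenation
```

Checking the shape right after combining is a cheap way to confirm that no rows were lost or duplicated.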

# 2. Data cleansing:

**Information about the data**

In [None]:
df.info()

**Checking Datatypes**

In [None]:
df.dtypes

**There are no junk values in the dataset**

**Class has dtype object; we need to change the datatype of this column**

**Missing Value Check**

In [None]:
df.isnull().sum()

**There are no missing values in the dataset**

**Target Variable:**

In [None]:
df['Class'].value_counts()

**Here tp_s and Type_S, Nrmal and Normal, and type_h and Type_H each represent the same class.**

In [None]:
df.loc[df['Class']=='tp_s','Class']='Type_S'
df.loc[df['Class']=='Nrmal','Class']='Normal'
df.loc[df['Class']=='type_h','Class']='Type_H'

In [None]:
df['Class'].value_counts()

In [None]:
df['Class']=df['Class'].astype('category') #changing to category datatype

In [None]:
df['Class'].nunique()

**We have three different classes in our dataset**
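
The label clean-up can also be sketched with an explicit mapping plus a sanity check. The series below is a hypothetical stand-in that only reuses the label variants listed above; the `canonical` dictionary and the `expected` set are names introduced for this illustration.

```python
import pandas as pd

# Hypothetical raw labels showing the variants mentioned above.
raw = pd.Series(["Normal", "Nrmal", "Type_S", "tp_s", "Type_H", "type_h"])

# Explicit mapping from each known variant to its canonical label.
canonical = {"Nrmal": "Normal", "tp_s": "Type_S", "type_h": "Type_H"}
clean = raw.replace(canonical)

# Fail loudly if a label outside the expected set slips through.
expected = {"Normal", "Type_H", "Type_S"}
unexpected = set(clean.unique()) - expected
assert not unexpected, f"Unmapped labels: {unexpected}"

print(clean.value_counts().to_dict())
```

Keeping the mapping in one dictionary makes it easy to extend if a new misspelling turns up in a later data delivery.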

# 3. Data Analysis & Visualization

**5 Point Summary**

In [None]:
df.describe()

**P_incidence:**

>  **Mean and median are nearly equal.**

> **Distribution might be normal. 75% of values are less than 72, but the maximum value is 129**

**P_tilt:**

> **Mean and median are nearly equal.**

> **Distribution might be normal.**

> **It contains negative values**

> **75% of values are less than 22 but the maximum value is 49, so there might be slight right skewness**

**L_angle:**

> **Mean and median are nearly equal. There is little deviation between them.**

> **Distribution might be normal**

> **There might be a few outliers, given the maximum value**

**S_slope:**

> **Mean and Median are nearly equal.**

> **There is a little deviation towards the upper end: 75% of values are less than 52, but the maximum value is 121.**

**P_radius:**

> **Distribution might be normal.**

> **There is not much deviation.**

**S_Degree:**

> **Mean is greater than median, so there might be right skewness in the data.**

> **75% of values are less than 41 but the maximum value is 418, so there are obvious outliers in the data.**
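
The mean-versus-median reading above can be backed by a direct skewness figure. The sketch below uses two synthetic columns (not the patient data) and the pandas `skew()` method; a value near zero suggests symmetry, while a clearly positive value indicates a longer right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic columns: one symmetric, one with a long right tail.
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=1000),
    "right_tailed": rng.exponential(scale=20, size=1000),
})

summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),  # sample skewness; > 0 indicates a longer right tail
})
print(summary.round(2))
```

On the notebook's data the same summary could be built from the numeric feature columns, which would put a number on the "might be right skewed" observations.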



# **Univariate Analysis**

**Distribution and outlier analysis of numerical variables**

**P_incidence**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'P_incidence', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['P_incidence'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['P_incidence'],25),np.percentile(df['P_incidence'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['P_incidence'] if i < lower or i > upper]
print('{} Total Number of outliers in P_incidence: {}'.format('\033[1m',len(Outliers)))

> **The distribution is approximately normal, with very few extreme values**

> **Three outliers exist in this column**

**P_tilt**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'P_tilt', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['P_tilt'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['P_tilt'],25),np.percentile(df['P_tilt'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['P_tilt'] if i < lower or i > upper]
print('{} Total Number of outliers in P_tilt: {}'.format('\033[1m',len(Outliers)))

> **The data is approximately normally distributed, with a single peak in the center**

> **It has slight skewness towards the right side**

> **We can see one outlier at the negative end and a few outliers at the positive end.**

**L_angle**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'L_angle', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['L_angle'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['L_angle'],25),np.percentile(df['L_angle'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['L_angle'] if i < lower or i > upper]
print('{} Total Number of outliers in L_angle: {}'.format('\033[1m',len(Outliers)))

> **It is approximately normally distributed**

> **Slight right skewness because of one outlier**

**S_slope**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'S_slope', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['S_slope'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['S_slope'],25),np.percentile(df['S_slope'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['S_slope'] if i < lower or i > upper]
print('{} Total Number of outliers in S_slope: {}'.format('\033[1m',len(Outliers)))

> **There is right skewness due to one outlier**

**P_radius**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'P_radius', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['P_radius'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['P_radius'],25),np.percentile(df['P_radius'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['P_radius'] if i < lower or i > upper]
print('{} Total Number of outliers in P_radius: {}'.format('\033[1m',len(Outliers)))

> **Data is normally distributed**

> **We can see outliers at both the ends.**

**S_Degree**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(17,7))
sns.boxplot(x = 'S_Degree', data=df,  orient='h' , ax=axes[1],color='Green')
sns.histplot(df['S_Degree'], kde=True, stat='density', ax=axes[0],color='Green')
axes[0].set_title('Distribution plot')
axes[1].set_title('Box plot')
plt.show()
#checking count of outliers.
q25,q75=np.percentile(df['S_Degree'],25),np.percentile(df['S_Degree'],75)
IQR=q75-q25
Threshold=IQR*1.5
lower,upper=q25-Threshold,q75+Threshold
Outliers=[i for i in df['S_Degree'] if i < lower or i > upper]
print('{} Total Number of outliers in S_Degree: {}'.format('\033[1m',len(Outliers)))

> **There is Positive Skewness in the data**

> **Hugely affected by Outliers**
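
The outlier count used for each column follows the same 1.5 × IQR rule, so it can be wrapped in a small helper. This is a minimal sketch; `iqr_bounds` and `count_iqr_outliers` are names introduced here, and the sample values are made up.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    spread = k * (q75 - q25)
    return q25 - spread, q75 + spread

def count_iqr_outliers(values, k=1.5):
    """Count values falling outside the fences."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, k)
    return int(((values < lower) | (values > upper)).sum())

# Made-up sample: nine ordinary values and one extreme one.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 100]
print(iqr_bounds(sample))
print(count_iqr_outliers(sample))
```

With a helper like this, the per-column check reduces to a single call such as `count_iqr_outliers(df['P_tilt'])`, which avoids repeating the quartile arithmetic in every cell.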

**Distribution of Target Variable**

In [None]:
f,axes=plt.subplots(1,2,figsize=(17,7))
df['Class'].value_counts().plot.pie(autopct='%1.1f%%',ax=axes[0])
sns.countplot(x='Class',data=df,ax=axes[1])
axes[0].set_title('Response Variable Pie Chart')
axes[1].set_title('Response Variable Bar Graph')
plt.show()

**The Type_S class accounts for 48.4% of all records, followed by Normal and Type_H**

# **Bivariate Analysis**

**Class vs P_incidence**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='P_incidence', data= df)
plt.show()

> **P_Incidence Value is larger for Type_S Class. We can see some extreme values as well**

> **Normal Value is slightly higher than Type_H**

**Class vs P_tilt**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='P_tilt', data= df)
plt.show()

> **The median of Type_S is slightly higher than that of the other two classes**

> **A few Normal and Type_H cases also have very high values**

**Class vs L_angle**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='L_angle', data= df)
plt.show()

> **L_Angle has higher value for Type_S Class**

>**We can see Normal class has higher values compared to type_H class**

> **Each class contains one outlier**

**Class vs S_slope**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='S_slope', data= df)
plt.show()

> **S_slope has huge values for Type_S class**

>**Normal class has high s_slope compared to Type_H**

**Class vs P_radius**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='P_radius', data= df)
plt.show()

> **We can see P_radius value is more for Normal Class**

> **There are some extreme values for the Type_S class**

> **All classes have outliers at both the high and the low end**

**Class vs S_Degree**

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x='Class', y='S_Degree', data= df)
plt.show()

> **S_Degree has extreme values for type_S Class**

>**Few Normal class also has huge values for S_Degree**

# **Multivariate Analysis**

In [None]:
df.columns

**Pair Plot of independent Variables**

In [None]:
sns.pairplot(df)
plt.show()

> **Along the diagonal we can see the distribution of individual variable**

> **P_incidence has a positive relationship with all variables except P_radius. The relationship is stronger for S_slope and L_angle**

> **P_tilt has a stronger relationship with P_incidence and L_angle. There is no relationship with S_slope and P_radius**

> **L_angle has a positive relationship with P_tilt, S_slope and S_Degree. It has no relationship with P_radius**

> **s_slope has positive Relationship with L_angle and s_degree**

> **p_radius has no Relationship with s_degree,p_tilt,l_angle.**

> **S_degree has no strong positive Relationship with any of the variables.**

In [None]:
sns.pairplot(df,hue='Class')

> **Along the diagonal we can see that the distribution of each variable is not the same for the three classes. We can verify that statistically as well**

> **It is evident that the Type_S class is larger than the other two**

> **Normal class has higher values compared to Type_H**

In [None]:
class_summary=df.groupby('Class') #getting mean values of each class for all independent variables
class_summary.mean().reset_index()

**It is clear that s_Degree of Type_S contains larger values.**


# **Hypothesis Testing**

# Is the distribution of independent variables across normal,type_H and type_s, the same?

**We use one-way ANOVA as the statistical test.**

In [None]:
col=['P_incidence','P_tilt','L_angle','S_slope','P_radius','S_Degree']
for i in col:
    print('{} Ho: Class types does not affect the {}'.format('\033[1m',i))
    print('\n')
    print('{} H1: Class types affect the {}'.format('\033[1m',i))
    print('\n')
    df_normal=df[df.Class=='Normal'][i]
    df_typeH=df[df.Class=='Type_H'][i]
    df_typeS=df[df.Class=='Type_S'][i]
    f_stats,p_value=stats.f_oneway(df_normal,df_typeH,df_typeS)
    print('{} F_stats: {}'.format('\033[1m',f_stats))
    print('{} p_value: {}'.format('\033[1m',p_value))
    if p_value < 0.05:  # Setting our significance level at 5%
        print('{} Rejecting Null Hypothesis. Class type has an effect on {}'.format('\033[1m',i))
    else:
        print('{} Fail to Reject Null Hypothesis. Class type has no effect on {}'.format('\033[1m',i))
    print('\n')

**The class type has a statistically significant effect on every independent variable**
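
One-way ANOVA assumes roughly normal groups with similar variances, and several features here are skewed or contain outliers. As a hedged cross-check (an addition for illustration, not part of the original analysis), the rank-based Kruskal-Wallis test from `scipy.stats` can be run next to it. The three groups below are synthetic stand-ins for one feature split by class.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic groups standing in for one feature split by class.
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=48, scale=10, size=60)
group_c = rng.normal(loc=70, scale=15, size=150)

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.3g}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kruskal:.3g}")
```

If both tests reject the null hypothesis for a feature, the conclusion does not hinge on the normality assumption.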

In [None]:
plt.figure(dpi = 120,figsize= (5,4))
corr = df.corr(numeric_only=True)  # Class is categorical, so restrict to numeric columns
mask = np.triu(np.ones_like(corr))
sns.heatmap(corr,mask = mask, fmt = ".2f",annot=True,lw=1,cmap = 'plasma')
plt.yticks(rotation = 0)
plt.xticks(rotation = 90)
plt.title('Correlation Heatmap')
plt.show()

**S_Degree and P_incidence are highly correlated.**

**S_Degree and P_radius are negatively correlated**

# 4. Data Pre-processing

# **Outlier Analysis**

**As we have seen in our EDA, there are very few outliers, and they need to be handled**

**We are imputing outliers with the mean**

In [None]:
for c in col:
    #getting upper lower quartile values
    q25,q75=np.percentile(df[c],25),np.percentile(df[c],75)
    IQR=q75-q25
    Threshold=IQR*1.5
    lower,upper=q25-Threshold,q75+Threshold
    Outliers=[i for i in df[c] if i < lower or i > upper]
    print('{} Total Number of outliers in {} Before Imputing : {}'.format('\033[1m',c,len(Outliers)))
    print('\n')
    #taking mean of a column without considering outliers
    df_include = df.loc[(df[c] >= lower) & (df[c] <= upper)]
    mean=int(df_include[c].mean())
    print('{} Mean of {} is {}'.format('\033[1m',c,mean))
    print('\n')
    #imputing outliers with mean
    df[c]=np.where(df[c]>upper,mean,df[c])
    df[c]=np.where(df[c]<lower,mean,df[c])
    Outliers=[i for i in df[c] if i < lower or i > upper]
    print('{} Total Number of outliers in {} After Imputing : {}'.format('\033[1m',c,len(Outliers)))
    print('\n')

> **We have imputed all outliers with mean value**
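
The imputation step can be sketched as a reusable function. This is an illustrative variant under stated assumptions: `impute_outliers_with_mean` is a name introduced here, the sample column is made up, and unlike the cell above it keeps the inlier mean as a float instead of truncating it to an integer.

```python
import numpy as np
import pandas as pd

def impute_outliers_with_mean(series, k=1.5):
    """Replace values outside the k * IQR fences with the mean of the inliers."""
    q25, q75 = np.percentile(series, [25, 75])
    spread = k * (q75 - q25)
    lower, upper = q25 - spread, q75 + spread
    inlier_mask = series.between(lower, upper)
    inlier_mean = series[inlier_mask].mean()
    # Keep inliers as they are; overwrite everything else with the inlier mean.
    return series.where(inlier_mask, inlier_mean)

# Made-up column with one extreme value.
demo = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 16, 100], dtype=float)
cleaned = impute_outliers_with_mean(demo)
print(cleaned.tolist())
```

Returning a new series rather than modifying the frame in place makes it easy to compare the column before and after imputation.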

# **Encoding Target Variable**

In [None]:
le=LabelEncoder()
df['Class']=le.fit_transform(df['Class'])
df['Class'].value_counts()

**Normal: 0**

**Type_H: 1**

**Type_S: 2**
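
The code-to-label mapping can be read from the encoder itself instead of being written down by hand. A minimal sketch on a short made-up label list: `LabelEncoder` stores the sorted labels in `classes_`, and each label's position is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Type_S", "Normal", "Type_H", "Type_S", "Normal"]
encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoder.inverse_transform([0, 1, 2]))
```

`inverse_transform` is handy later for turning predicted codes back into readable class names.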

In [None]:
df['Class']=df['Class'].astype('category') #changing datatype to category.

# **Checking on Target Imbalance**

In [None]:
f,axes=plt.subplots(1,2,figsize=(17,7))
df['Class'].value_counts().plot.pie(autopct='%1.1f%%',ax=axes[0])
sns.countplot(x='Class',data=df,ax=axes[1],order=[2,0,1])
axes[0].set_title('Response Variable Pie Chart')
axes[1].set_title('Response Variable Bar Graph')
plt.show()

**We have an imbalanced target variable**

**The classes are not equally distributed.**

**48% of the data belongs to Type_S**

**With an imbalanced dataset, the model sees few examples of the under-represented classes and may not learn them well. This can lead to
poor performance on unseen data**
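
One common way to keep the class proportions stable across the train and test sets is a stratified split. The sketch below is an illustration on synthetic data, not the split used in this notebook: the class counts (100 / 60 / 150) follow the dataset sizes reported earlier, the feature values are random, and `stratify` is the `train_test_split` option that preserves class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Synthetic stand-in with class counts 100 / 60 / 150 and six random features.
y_demo = np.repeat([0, 1, 2], [100, 60, 150])
X_demo = rng.normal(size=(y_demo.size, 6))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)

# Class proportions in each split should be close to the overall proportions.
print(np.bincount(y_demo) / y_demo.size)
print(np.bincount(y_tr) / y_tr.size)
print(np.bincount(y_te) / y_te.size)
```

Stratification does not remove the imbalance, but it prevents a random split from making the smallest class even rarer in one of the sets.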

# Train - Test Split

In [None]:
# Arrange data into independent variables and dependent variables
X=df.drop(columns='Class')
y=df['Class'] #target

In [None]:
X.describe()

# **Scaling Independent Variables**

In [None]:
X_Scaled=X.apply(zscore)

In [None]:
X_Scaled.describe().T

> **We have scaled the independent variables to their corresponding z-scores.**

> **The mean becomes close to zero and the standard deviation close to 1**
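
To make the scaling step concrete, the sketch below checks on synthetic columns that `apply(zscore)` matches the manual formula (value minus column mean, divided by the population standard deviation). The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "a": rng.normal(loc=60, scale=17, size=200),
    "b": rng.normal(loc=118, scale=13, size=200),
})

scaled_scipy = demo.apply(zscore)
# Same thing by hand: subtract the column mean, divide by the population std (ddof=0).
scaled_manual = (demo - demo.mean()) / demo.std(ddof=0)

print(np.allclose(scaled_scipy, scaled_manual))
# describe() reports the sample std (ddof=1), so it lands slightly above 1.
print(scaled_scipy.std(ddof=1).round(4).tolist())
```

This also explains why `describe()` can show a standard deviation marginally larger than 1 after z-scoring: it uses the sample formula, while `zscore` defaults to the population one.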

In [None]:
# Split X and y into training and test set in 70:30 ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)

# 5. Model training, testing and tuning

# **KNN Classifier**

**Basic Model**

In [None]:
KNN = KNeighborsClassifier(n_neighbors= 5 , metric = 'euclidean' ) #Building knn with 5 neighbors

In [None]:
KNN.fit(X_train, y_train)
predicted_labels = KNN.predict(X_test)

# Classification Accuracy

In [None]:
print('Accuracy on Training data:',KNN.score(X_train, y_train) )
print('Accuracy on Test data:',KNN.score(X_test, y_test) )

> **Training accuracy is 0.89 and testing accuracy is 0.77. Performance is lower on the test data.**

> **This gap suggests the model is overfitting the training data**

In [None]:
cm = confusion_matrix(y_test, predicted_labels, labels=[0, 1,2])

df_cm = pd.DataFrame(cm, index = [i for i in ["Normal","Type_H","Type_S"]],
                  columns = [i for i in ["Normal","Type_H","Type_S"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
plt.show()

> **Our model predicts Type_S correctly most of the time, with only two misclassifications for this class**

> **Misclassifications are more frequent when predicting the Normal class**

# Classification Report

In [None]:
print("Classification report:\n",classification_report(y_test,predicted_labels))

> **Precision for the Normal class: of all samples predicted as Normal, the fraction that are actually Normal**

> **Recall (sensitivity or TPR) for the Normal class: of all actual Normal samples, the fraction we identified correctly**

> **Class 0 is predicted correctly 68% of the time; similarly, class 1 is 48% and class 2 is 98%**

> **By the F1 score, precision and recall balance out at 60% for class 0 and 56% for class 1**

> **We have the maximum F1 score for class 2.**
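
The definitions above can be worked through by hand from a confusion matrix. The matrix below is made up for illustration (it is not the notebook's result); rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, columns = predicted).
cm = np.array([
    [15,  6,  1],
    [ 5, 12,  0],
    [ 2,  1, 45],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct / all actually in that class
f1 = 2 * precision * recall / (precision + recall)

for label, p, r, f in zip(["Normal", "Type_H", "Type_S"], precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision divides by column totals and recall by row totals, which is why the two can differ noticeably for the same class.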

# **Finding best K value**

In [None]:
train_score=[]
test_score=[]
for k in range(1,51):
    KNN = KNeighborsClassifier(n_neighbors= k , metric = 'euclidean' )
    KNN.fit(X_train, y_train)
    train_score.append(KNN.score(X_train, y_train))
    test_score.append(KNN.score(X_test, y_test))

In [None]:
plt.plot(range(1,51),train_score)
plt.show()

> **Training accuracy decreases as the k value increases**

In [None]:
plt.plot(range(1,51),test_score)
plt.show()

> **The maximum accuracy occurs when k is less than 20.**

> **We will restrict the k value to less than 20.**

In [None]:
k=[1,3,5,7,9,11,13,15,17,19]
for i in k:
    KNN = KNeighborsClassifier(n_neighbors=i, metric = 'euclidean' ) #Building knn with i neighbors
    KNN.fit(X_train, y_train)
    predicted_labels = KNN.predict(X_test)
    print('Accuracy on Training data for k {} is {}:'.format(i,KNN.score(X_train, y_train)))
    print('Accuracy on Test data for k {} is {}:'.format(i,KNN.score(X_test, y_test)))
    print("Classification report:\n",classification_report(y_test,predicted_labels))

> **For k = 13 we have balanced train and test error**

> **We can use k = 13 because, when k is increased to this value, the precision for class 2 becomes 100%**
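
An alternative way to pick k, sketched here as an assumption rather than the notebook's method, is to cross-validate on the training split only, so the test set is kept for a final check. The data comes from `make_classification` and merely mimics the shape of the patient table (310 rows, six features, three classes).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with six features, standing in for the patient table.
X_demo, y_demo = make_classification(
    n_samples=310, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)

# Search odd k values using cross-validation on the training split only.
search = GridSearchCV(
    KNeighborsClassifier(metric="euclidean"),
    param_grid={"n_neighbors": list(range(1, 20, 2))},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=7),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["n_neighbors"])
print("cv accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Choosing k from test-set accuracy tends to give an optimistic estimate, because the test data has then influenced the model selection.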

# K-Fold CV for finding best model

In [None]:
LR_model=LogisticRegression()
KNN_model=KNeighborsClassifier(n_neighbors=13)
GN_model=GaussianNB()
svc_model_linear = SVC(kernel='linear',C=1,gamma=.6)
svc_model_rbf = SVC(kernel='rbf',degree=2,C=.009)
svc_model_poly  = SVC(kernel='poly',degree=2,gamma=0.1,C=.01)

In [None]:
seed = 7
# prepare models
models = []
models.append(('LR', LR_model))
models.append(('KNN', KNN_model))
models.append(('NB', GN_model))
models.append(('SVM-linear', svc_model_linear))
models.append(('SVM-poly', svc_model_poly))
models.append(('SVM-rbf', svc_model_rbf))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, random_state=seed,shuffle=True)
	cv_results = model_selection.cross_val_score(model,  X,y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

> **Accuracy is higher for KNN, LR and SVM-linear. However, the standard deviation is lowest for the SVM-linear model.**

> **SVM-linear appears to be the better algorithm for this dataset because of its high accuracy and low standard deviation**

**We will check with scaled values to see whether there is improvement in model**

In [None]:
seed = 7
# prepare models
models = []
models.append(('LR', LR_model))
models.append(('KNN', KNN_model))
models.append(('NB', GN_model))
models.append(('SVM-linear', svc_model_linear))
models.append(('SVM-poly', svc_model_poly))
models.append(('SVM-rbf', svc_model_rbf))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, random_state=seed,shuffle=True)
	cv_results = model_selection.cross_val_score(model,X_Scaled,y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

**When the scaled values are used instead of the raw values, Logistic Regression performs well.**

**Logistic Regression gives 81% accuracy with a small standard deviation.**
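
When scaling is combined with cross-validation, one option is to put the scaler inside a pipeline so that it is refit on the training folds of every split. The sketch below is an illustration on synthetic data; `make_pipeline`, `StandardScaler` and `StratifiedKFold` are swapped in here and are not what the cells above use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with six features, standing in for the patient table.
X_demo, y_demo = make_classification(
    n_samples=310, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# The scaler is refit on the training folds of every split, so the
# validation fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} ({scores.std():.3f})")
```

Scaling the full dataset before cross-validation lets information from the validation folds leak into the scaling statistics; on a dataset of this size the effect is usually small, but the pipeline removes it.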

# **6. Conclusion and improvements:**

> **All the variables have a significant effect on the target class**

> **The Type_S class has a higher mean value for almost all variables**

> **The Normal class has lower values than Type_S for most variables, P_radius being the exception**

> **For almost all variables the distribution is approximately normal**

> **For KNN with k = 13 we get balanced train and test error**

> **We can use KNN as the final model because of its balanced train and test error; its recall and precision values are also good**

> **A clearer description of each variable would help in understanding the problem statement, given the medical domain**