![](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/gettyimages-923681846-1533217603.jpg?crop=1xw:0.9988221436984688xh;center,top&resize=480:*)

Lower Back Pain Symptoms Dataset: classify a person's back as abnormal or normal using collected physical spine measurements.

310 Observations, 13 Attributes (12 Numeric Predictors, 1 Binary Class Attribute - No Demographics).

Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Typical sources of low back pain include:
-	The large nerve roots in the low back that go to the legs may be irritated
-	The smaller nerves that supply the low back may be irritated
-	The large paired lower back muscles (erector spinae) may be strained
-	The bones, ligaments or joints may be damaged
-	An intervertebral disc may be degenerating.

An irritation or problem with any of these structures can cause lower back pain and/or pain that radiates or is referred to other parts of the body. Many lower back problems also cause back muscle spasms, which don't sound like much but can cause severe pain and disability.
While lower back pain is extremely common, the symptoms and severity of lower back pain vary greatly. A simple lower back muscle strain might be excruciating enough to necessitate an emergency room visit, while a degenerating disc might cause only mild, intermittent discomfort.



- Attribute 1 = pelvic_incidence (numeric)
- Attribute 2 = pelvic_tilt (numeric)
- Attribute 3 = lumbar_lordosis_angle (numeric)
- Attribute 4 = sacral_slope (numeric)
- Attribute 5 = pelvic_radius (numeric)
- Attribute 6 = degree_spondylolisthesis (numeric)
- Attribute 7 = pelvic_slope (numeric)
- Attribute 8 = Direct_tilt (numeric)
- Attribute 9 = thoracic_slope (numeric)
- Attribute 10 = cervical_tilt (numeric)
- Attribute 11 = sacrum_angle (numeric)
- Attribute 12 = scoliosis_slope (numeric)
- Class attribute = {Abnormal, Normal}

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### EDA (EXPLORATORY DATA ANALYSIS)

In [None]:
data=pd.read_csv('../input/Dataset_spine.csv')

In [None]:
data.info()

In [None]:
data.head(2).T

In [None]:
data=data.drop('Unnamed: 13',axis=1)

In [None]:
# Rename the generic columns to the attribute names listed above.
# Plain names (no "(numeric)" suffix) match the column names used in the later plots.
data=data.rename({'Class_att':'Dependent variable',
                   'Col1':'pelvic_incidence',
                   'Col2':'pelvic_tilt',
                   'Col3':'lumbar_lordosis_angle',
                   'Col4':'sacral_slope',
                   'Col5':'pelvic_radius',
                   'Col6':'degree_spondylolisthesis',
                   'Col7':'pelvic_slope',
                   'Col8':'Direct_tilt',
                   'Col9':'thoracic_slope',
                   'Col10':'cervical_tilt',
                   'Col11':'sacrum_angle',
                   'Col12':'scoliosis_slope'},axis=1)

In [None]:
plt.title('Lower Backpain symptoms target class distribution')
plt.pie(data['Dependent variable'].value_counts(),
        labels=data['Dependent variable'].value_counts().index,
        autopct='%.2f',
       explode=[0,0.05],startangle=90)
plt.show()

- The above chart shows a class imbalance of roughly 68:32 between the two target classes.
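The same split can be read off numerically instead of from the pie chart. This is a minimal sketch on a hypothetical stand-in series: the counts of 210 and 100 are an assumption chosen to be consistent with 310 observations and a 68:32 split, not values read from the file.

```python
import pandas as pd

# Hypothetical stand-in for data['Dependent variable'] (assumed counts: 210 / 100).
target = pd.Series(['Abnormal'] * 210 + ['Normal'] * 100, name='Dependent variable')

class_counts = target.value_counts()
# normalize=True returns fractions; scale to percentages for readability.
class_share = (target.value_counts(normalize=True) * 100).round(2)
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share)
print('Majority-to-minority ratio:', round(imbalance_ratio, 2))
```

On the real notebook, replacing `target` with `data['Dependent variable']` should give the corresponding figures.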

### DATA DISTRIBUTION

In [None]:
data.hist(figsize=(15,10),grid=False)
plt.show()

- From the above histograms, we can conclude that most of the independent attributes do not follow a normal distribution.
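A quick numeric companion to the histograms is the sample skewness of each column. The sketch below uses two hypothetical columns (one drawn from a normal distribution, one from an exponential) rather than the spine data, and the 0.5 cut-off is only a rule of thumb assumed here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns standing in for the spine attributes.
demo = pd.DataFrame({
    'symmetric_feature': rng.normal(loc=50, scale=10, size=310),
    'right_skewed_feature': rng.exponential(scale=20, size=310),
})

# Sample skewness per column: values near 0 indicate a roughly symmetric shape.
skewness = demo.skew()

# Assumed rule of thumb: |skew| > 0.5 marks a visibly asymmetric column.
skewed_columns = skewness[skewness.abs() > 0.5].index.tolist()

print(skewness.round(2))
print('Noticeably skewed:', skewed_columns)
```

Skewness only captures asymmetry, so a column can pass this check and still deviate from normality in other ways.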

### SUMMARY

In [None]:
data.describe().T

- Tilt attributes in descending order of their values in the summary table: **Direct_tilt > pelvic_tilt > cervical_tilt**

### OUTLIERS

In [None]:
data.plot(kind='box',subplots=True,layout=(4,4),figsize=(15,8))
plt.show()

- Most of the outliers are found in the pelvic-related attributes (pelvic_incidence, pelvic_radius, pelvic_tilt).
- Some outliers are also present in degree_spondylolisthesis.
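The box plots mark points beyond 1.5 times the interquartile range from the quartiles, which is matplotlib's default whisker rule. The same rule can be applied in code to count outliers per column. The helper name `count_iqr_outliers` and the toy frame below are hypothetical, used only to illustrate the idea.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return outside.sum()

# Toy data: the second column contains a single extreme value.
demo = pd.DataFrame({
    'no_outliers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'one_outlier': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```

In the notebook, this could be called on the numeric columns of `data` before the target is encoded.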

### CORRELATION

In [None]:
data=pd.get_dummies(data,drop_first=True)

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(data.corr()[data.corr().abs()>0.3],annot=True)

In [None]:
data.info()

In [None]:
sns.lmplot( x="pelvic_incidence", y="pelvic_radius", hue="Dependent variable_Normal", data=data)

- From this plot we can infer that most **normal** cases have a **pelvic radius between 110 and 140**.

In [None]:
sns.lmplot( x="pelvic_incidence", y="degree_spondylolisthesis", hue="Dependent variable_Normal", data=data)

- Most **normal** cases have a degree of spondylolisthesis close to zero (**low severity**).

In [None]:
sns.lmplot( x="pelvic_incidence", y="lumbar_lordosis_angle", hue="Dependent variable_Normal", data=data)

- Most **normal** cases have an **LLA (Lumbar Lordosis Angle) between 20 and 60**.

In [None]:
data.corr()['Dependent variable_Normal'].sort_values(ascending=False)[1:].plot.barh(title='Correlation b/w dependent & Independent Variable')

### Reduce variables using PCA

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore

In [None]:
PCA_x=data.drop('Dependent variable_Normal',axis=1) # dropping target column

In [None]:
# Standardisation
sc = StandardScaler()
X_std =  sc.fit_transform(PCA_x)          
cov_matrix = np.cov(X_std.T)

In [None]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
# Pass the arrays as separate print arguments; a bare '%s' is not substituted by print().
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)

In [None]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)

In [None]:
pd.DataFrame(cum_var_exp).T.plot.bar(title='Cumulative variance explained by principal components')

- From this, about 80% of the variance is explained by the first 7 principal components.
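The number of components needed for a given share of variance can also be computed rather than read off the bar chart. The sketch below runs on synthetic data from `make_classification` (an assumption, not the spine dataset), so the component count it prints will not necessarily be 7. It also cross-checks the manual count against scikit-learn's option of passing a fraction as `n_components` with `svd_solver='full'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 310 rows and 12 numeric predictors.
X_demo, _ = make_classification(n_samples=310, n_features=12, n_informative=6,
                                n_redundant=3, random_state=0)
X_demo_std = StandardScaler().fit_transform(X_demo)

# Fit all components and accumulate the explained-variance ratios.
pca_full = PCA(svd_solver='full').fit(X_demo_std)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first cumulative value reaching 0.80, converted to a count.
n_components_80 = int(np.searchsorted(cum_ratio, 0.80) + 1)

# A float n_components asks scikit-learn to choose the count itself.
pca_80 = PCA(n_components=0.80, svd_solver='full').fit(X_demo_std)

print('Cumulative ratio:', cum_ratio.round(3))
print('Components for 80% (manual):', n_components_80)
print('Components for 80% (sklearn):', pca_80.n_components_)
```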

In [None]:
PCA_x_std = sc.fit_transform(PCA_x)
bpc_reduced = PCA(n_components=7).fit_transform(PCA_x_std)

In [None]:
BP_reduced=pd.DataFrame(bpc_reduced)
BP_reduced.head()

- The above data frame shows the columns after reduction with PCA (Principal Component Analysis).

In [None]:
sns.heatmap(BP_reduced.corr()[BP_reduced.corr().abs()>0.3],annot=True)

- No correlation exists between the PCA-reduced columns.
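This is expected by construction: principal component scores are projections onto orthogonal directions of the centered data, so their pairwise correlations are zero up to floating-point error. A small check on synthetic data (an assumption, not the spine dataset) illustrates this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with deliberately redundant (correlated) predictors.
X_demo, _ = make_classification(n_samples=310, n_features=12, n_informative=6,
                                n_redundant=3, random_state=0)
scores = PCA(n_components=7, svd_solver='full').fit_transform(
    StandardScaler().fit_transform(X_demo))

# Correlation matrix of the component scores (columns are variables).
score_corr = np.corrcoef(scores, rowvar=False)

# Largest absolute off-diagonal entry; it should be numerically zero.
max_off_diagonal = np.abs(score_corr - np.eye(score_corr.shape[0])).max()
print('Largest off-diagonal correlation:', max_off_diagonal)
```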

### MODEL CONSTRUCTION
### 1) KNN

- This model uses the PCA-reduced data prepared above.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
X =  BP_reduced
y =  data["Dependent variable_Normal"]  # 1-D target, the shape scikit-learn expects
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [None]:
modelKNN=KNeighborsClassifier(n_neighbors=4)
modelKNN=modelKNN.fit(X_train,Y_train)

In [None]:
print('train score',modelKNN.score(X_train,Y_train))
print('test score',modelKNN.score(X_test,Y_test))

- This model overfits the training data and does not perform well on the test data.
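One common way to reduce this kind of overfitting is to choose `n_neighbors` by cross-validation instead of fixing it at 4. The sketch below is not the notebook's method. It uses synthetic data with 7 features and a 68:32 class weighting as a hypothetical stand-in for `BP_reduced` and the target.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 7 features (like the 7 retained components), imbalanced classes.
X_demo, y_demo = make_classification(n_samples=310, n_features=7, n_informative=4,
                                     n_redundant=0, weights=[0.68, 0.32],
                                     random_state=1)

# Mean 5-fold cross-validated accuracy for a range of odd k values.
cv_scores = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

for k, score in cv_scores.items():
    print(f'k={k}: mean CV accuracy={score:.3f}')
print('Best k:', best_k)
```

Odd values of k avoid tied votes in a two-class problem. Larger k generally smooths the decision boundary, which tends to narrow the gap between train and test scores.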

### 2) Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
data.corr()[data.corr().abs()>0.2]['Dependent variable_Normal'].sort_values(ascending=False)[1:7]

- Selecting the features whose absolute correlation with the target is greater than 0.2.
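The same selection can be written without relying on sort order and slicing, by filtering the target's correlation column on its absolute value. The toy frame and its column names below are hypothetical, and its values are chosen so that the correlations are known exactly.

```python
import pandas as pd

# Toy frame: two features strongly related to the target, one unrelated.
demo = pd.DataFrame({
    'target':           [0, 0, 1, 1] * 25,
    'positive_feature': [0, 1, 2, 3] * 25,
    'negative_feature': [3, 2, 1, 0] * 25,
    'unrelated':        [0, 1, 0, 1] * 25,
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = demo.corr()['target'].drop('target')

# Keep features whose absolute correlation exceeds the 0.2 threshold.
selected = target_corr[target_corr.abs() > 0.2].index.tolist()

print(target_corr.round(3))
print('Selected features:', selected)
```

Filtering on `.abs()` keeps negatively correlated features as well, which matters here because a feature can be informative in either direction.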

In [None]:
x_features=data.corr()[data.corr().abs()>0.2]['Dependent variable_Normal'].sort_values(ascending=False)[1:7].index

In [None]:
X =  data[x_features]
y =  data["Dependent variable_Normal"]  # 1-D target, the shape scikit-learn expects
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [None]:
dt_model = DecisionTreeClassifier()
dt_model=dt_model.fit(X_train,Y_train)
dt_model.score(X_train,Y_train)

In [None]:
# Regularisation: limit the tree depth to reduce overfitting
dt_model = DecisionTreeClassifier(max_depth=2)
dt_model=dt_model.fit(X_train,Y_train)
print('Train accuracy',dt_model.score(X_train,Y_train))

In [None]:
print('Test accuracy',dt_model.score(X_test,Y_test))

- This model performs reasonably well: the train and test accuracies are almost equal.
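The effect of `max_depth` on the train/test gap can be seen by sweeping a few depths. The sketch below uses synthetic data as a hypothetical stand-in for the six selected features, so the accuracies it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 6 features, split 70/30 like the notebook.
X_demo, y_demo = make_classification(n_samples=310, n_features=6, n_informative=4,
                                     n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=1)

# Record (depth, train accuracy, test accuracy); None means an unrestricted tree.
depth_results = []
for depth in (1, 2, 3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    depth_results.append((depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

for depth, train_acc, test_acc in depth_results:
    print(f'max_depth={depth}: train={train_acc:.3f}, '
          f'test={test_acc:.3f}, gap={train_acc - test_acc:.3f}')
```

An unrestricted tree typically memorises the training set, so a shallow depth with a small gap is usually the more trustworthy choice.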

### 3) Logistic regression

- I am using the same features that are used in decision tree.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr_model=LogisticRegression()
lr_model=lr_model.fit(X_train,Y_train)

In [None]:
print('Train Accuracy',lr_model.score(X_train,Y_train))
print('Test Accuracy',lr_model.score(X_test,Y_test))

- The test accuracy is higher than the train accuracy, which suggests this model is underfitting.

In [None]:
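# The L1 penalty needs a solver that supports it; the default 'lbfgs' solver does not.
lr_model=LogisticRegression(penalty='l1',solver='liblinear')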
lr_model=lr_model.fit(X_train,Y_train)
print('Train Accuracy',lr_model.score(X_train,Y_train))
print('Test Accuracy',lr_model.score(X_test,Y_test))

- After tuning some parameters, the train accuracy improved.
- Compared with the other models, logistic regression performs well.
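Because the classes are imbalanced, accuracy alone can hide how each class is handled, so a confusion matrix and per-class precision and recall are a useful complement. The sketch below is an illustration on synthetic data, not a re-evaluation of the notebook's models. The mapping of class 0 to "Abnormal" and class 1 to "Normal", and the use of a stratified split, are assumptions made here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly a 68:32 class split.
X_demo, y_demo = make_classification(n_samples=310, n_features=6, n_informative=4,
                                     n_redundant=0, weights=[0.68, 0.32],
                                     random_state=1)

# stratify keeps the class proportions similar in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=1, stratify=y_demo)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred, target_names=['Abnormal', 'Normal']))
```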