**COMPARISON OF REGRESSION MODELS FROM BIOMECHANICAL FEATURES OF ORTHOPEDIC PATIENTS DATASET**

In this kernel I compare linear regression, polynomial regression, decision tree regression and random forest regression models fitted to orthopedic patients' data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/column_2C_weka.csv')

**ANALYSING THE DATA**

First of all let's look at our dataset:

In [None]:
data.head()

In [None]:
data.info()

We can see that the dataset holds orthopedic measurements for 310 patients, and that each patient's overall status is classified as normal or abnormal.

In [None]:
data['class'].value_counts()

Let's see the summary statistics of each feature:

In [None]:
d_stat = data.describe()
d_stat

In order to see the correlation between the features we can use: 

In [None]:
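# Drop the string-valued 'class' column first; recent pandas versions raise an error on non-numeric columns.
data.drop(columns='class').corr()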

Values closer to 1 (or -1) indicate a stronger linear relationship between two features.
From the correlation analysis one can say that pelvic incidence and sacral slope are strongly related, because of their high correlation value (0.81).
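
As a small illustration of what the correlation table reports, the sketch below builds a toy DataFrame and checks one entry of `corr()` against Pearson's formula computed by hand. The data is synthetic and the column names `a`, `b`, `c` are made up for the example; it does not use the orthopedic dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the orthopedic dataset):
# b depends on a, while c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(60, 15, size=200)
b = 0.6 * a + rng.normal(0, 8, size=200)
c = rng.normal(0, 1, size=200)
toy = pd.DataFrame({'a': a, 'b': b, 'c': c})

# DataFrame.corr() uses the Pearson correlation by default.
corr = toy.corr()

# Pearson r for (a, b) by hand: covariance divided by the product of standard deviations.
r_manual = np.cov(a, b, ddof=1)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))

print(corr.round(2))
print('manual r(a, b):', round(r_manual, 2))
```

The related pair `(a, b)` should show a clearly higher value than the unrelated pair `(a, c)`, which is the same pattern used above to pick pelvic incidence and sacral slope.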

Now let's visualize the mean value of each feature across our 310 patients:

In [None]:
x_names = list(d_stat.columns) # Feature names come from the columns of the describe() table.
y_values = list(d_stat.loc['mean']) # Use the 'mean' row; d_stat.mean() would average count, std, min, etc. together.
plt.figure(figsize=(10,5))
sns.barplot(x=x_names, y=y_values)
plt.xticks(rotation=45)
plt.ylabel('Mean values of the 310 patients')
plt.show()

**REGRESSION ANALYSIS**

I will use only two features, 'sacral slope' and 'pelvic incidence', which are highly correlated, and keep only the abnormal patients.

In [None]:
# Filtering only abnormal patients:
d_new = data[data['class'] == 'Abnormal']

# Our graphics x will be pelvic incidence and y will be sacral slope values:
x = np.array(d_new.loc[:,'pelvic_incidence']).reshape(-1,1)
y = np.array(d_new.loc[:, 'sacral_slope']).reshape(-1,1)

In [None]:
# Now visualize the values:
plt.figure(figsize=(9,9))
plt.scatter(x=x, y=y, color='purple', alpha=0.3)
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.show()

From the graph above one can see that when the pelvic incidence value is high, the patient's sacral slope value tends to be high as well.
Now let's make a linear regression model for our data.

**LINEAR REGRESSION**

Let's import the LinearRegression class from sklearn and fit a linear regression model on our x and y features, which are the pelvic incidence and sacral slope values.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x,y)

In [None]:
# Predicting new y values from our data's x values:
y_head = lr.predict(x)

plt.figure(figsize=(9,9))
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
plt.plot(x, y_head, color='blue', label='linear')
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()
plt.show()

**Question:** What is the sacral slope value of a patient with a pelvic incidence value of 110?

As you can see from the graph above, we don't have a patient with a pelvic incidence value of 110.
So we ask the model to estimate the result.

In [None]:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr.predict([[110]]))
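
To make the prediction step less of a black box, here is a minimal sketch showing that a fitted `LinearRegression` predicts with `intercept + slope * x`. It runs on synthetic numbers (an assumption for the sake of a self-contained example), so the values it prints are not the ones from the patient data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in values (not the patient data).
rng = np.random.default_rng(1)
x_toy = rng.uniform(30, 100, size=(50, 1))
y_toy = 0.5 * x_toy + 10 + rng.normal(0, 3, size=(50, 1))

model = LinearRegression().fit(x_toy, y_toy)

# y was passed as a 2-D column, so coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = model.coef_[0, 0]
intercept = model.intercept_[0]

# The prediction for x = 110 is just the line evaluated at 110.
by_hand = intercept + slope * 110
by_model = model.predict([[110]])[0, 0]
print(by_hand, by_model)
```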

So we created a linear regression model and we can predict sacral slope values from given pelvic incidence values.
But how can we know that our model is good enough?
To see how well our model fits the data, we can evaluate its R-squared score.

In [None]:
from sklearn.metrics import r2_score
print("Our model's R-squared score is: ", r2_score(y, y_head))
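
For reference, the R-squared score is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below reproduces `r2_score` by hand on a few made-up numbers (illustrative only, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the model: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean: 20.0
r2_manual = 1 - ss_res / ss_tot                 # 1 - 1.5 / 20, about 0.925

print(r2_manual, r2_score(y_true, y_pred))
```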

The closer the R-squared score is to 1, the better the model fits the data.
Our model's R-squared score is 0.64, which is not that great.
Now we can try another model, such as polynomial linear regression.

**POLYNOMIAL LINEAR REGRESSION**

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)

x_poly = poly.fit_transform(x)

lr2 = LinearRegression()
lr2.fit(x_poly, y)
y_head2 = lr2.predict(x_poly)

In [None]:
plt.figure(figsize=(9,9))
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
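order = x.ravel().argsort() # Sort by x so each fitted curve is drawn left to right instead of zigzagging.
plt.plot(x[order], y_head[order], color='blue', alpha=0.6, label='linear')
plt.plot(x[order], y_head2[order], color='green', alpha=0.6, label='poly')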
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()
plt.show()

Now let's ask the computer the same question and compare the results from the linear and polynomial models.

In [None]:
# Polynomial Linear Regression Prediction (the row [1, 110, 110**2] holds the degree-2 features of 110):
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr2.predict([[1,110,12100]]))

# Linear Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr.predict([[110]]))
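
The hand-built row `[1, 110, 12100]` passed to the polynomial model is exactly what `PolynomialFeatures(degree=2)` generates for a single input of 110: a bias term, the value itself, and its square. A minimal sketch (the name `poly_demo` is only for this example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A separate transformer just for this demonstration.
poly_demo = PolynomialFeatures(degree=2)
poly_demo.fit(np.array([[0.0]]))  # fit only records the number of input features

row = poly_demo.transform(np.array([[110.0]]))
print(row)  # columns are [1, x, x**2]
```

In practice, calling the fitted transformer's `transform` on new inputs avoids typing the powers by hand, which matters once the degree changes.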

If we compare the R-squared scores of our linear regression and polynomial regression models:

In [None]:
from sklearn.metrics import r2_score
print("Our linear model's R-squared score is: ", r2_score(y, y_head))
print("Our polynomial model's R-squared score is: ", r2_score(y, y_head2))

The polynomial linear regression R-squared score is slightly greater (closer to 1), which means a slightly better fit.

**DECISION TREE REGRESSION**

Now let's create our decision tree regression model using the dataset.
Then we can plot the fitted curve.
And finally we can predict the same value and compare the result with our previous models.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()

tree.fit(x,y)

x_tree = np.arange(x.min(), x.max(), 0.1).reshape(-1,1) # x.min()/x.max() return scalars, which np.arange expects
y_tree = tree.predict(x_tree)
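
One thing worth keeping in mind when comparing this model with the previous ones: a `DecisionTreeRegressor` with default settings keeps splitting until each leaf is pure, so it can reproduce its training targets exactly. The sketch below shows this on synthetic data (an assumption for a self-contained example) and contrasts it with a tree limited by `max_depth`, which can output at most a handful of distinct values.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with unique x values (not the patient data).
rng = np.random.default_rng(2)
x_toy = np.sort(rng.uniform(30, 100, size=80)).reshape(-1, 1)
y_toy = 0.5 * x_toy.ravel() + 10 + rng.normal(0, 5, size=80)

deep = DecisionTreeRegressor(random_state=0).fit(x_toy, y_toy)
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x_toy, y_toy)

# Scores on the SAME data the trees were trained on.
r2_deep = r2_score(y_toy, deep.predict(x_toy))
r2_shallow = r2_score(y_toy, shallow.predict(x_toy))

# A depth-2 tree has at most 4 leaves, hence at most 4 distinct predictions.
n_levels_shallow = len(np.unique(shallow.predict(x_toy)))

print('unrestricted tree, training R-squared:', r2_deep)
print('max_depth=2 tree, training R-squared:', r2_shallow)
print('distinct predictions of the shallow tree:', n_levels_shallow)
```

A training R-squared near 1 for the unrestricted tree therefore says little about how it behaves on patients it has not seen.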

In [None]:
plt.figure(figsize=(9,9))
plt.scatter(x,y, color='purple', alpha=0.4, label='patients')
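order = x.ravel().argsort() # Sort by x so the fitted curves are drawn left to right
plt.plot(x[order], y_head[order], color='blue', alpha=0.6, label='linear') # Linear regression model fitted line
plt.plot(x[order], y_head2[order], color='green', alpha=0.6, label='poly') # Polynomial linear regression model fitted line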
plt.plot(x_tree, y_tree, color='orange', alpha=0.5, label='decision tree') # Decision tree regression model fitted line
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()
plt.show()

And now compare the predictions of the three models for the same patient:

In [None]:
# Linear Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr.predict([[110]]))

# Polynomial Linear Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr2.predict([[1,110,12100]]))

# Decision Tree Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', tree.predict([[110]]))

**RANDOM FOREST REGRESSION**

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=42)

forest.fit(x, y.ravel()) # The forest expects a 1-D target; a column vector triggers a DataConversionWarning.

x_for = np.arange(x.min(), x.max(), 0.1).reshape(-1,1)
y_for = forest.predict(x_for)
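
As a rough picture of what the forest does at prediction time, the sketch below checks that a `RandomForestRegressor` prediction equals the average of the predictions of its individual trees (available as `estimators_`). The data is synthetic and the smaller `n_estimators=20` is chosen only to keep the example quick; neither comes from the notebook above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the patient data); x lies between 30 and 100.
rng = np.random.default_rng(3)
x_toy = rng.uniform(30, 100, size=(80, 1))
y_toy = 0.5 * x_toy.ravel() + 10 + rng.normal(0, 5, size=80)

rf = RandomForestRegressor(n_estimators=20, random_state=42).fit(x_toy, y_toy)

# One query inside the training range and one outside it.
query = np.array([[65.0], [110.0]])

# Each row holds one tree's predictions for the two queries.
per_tree = np.array([t.predict(query) for t in rf.estimators_])
avg = per_tree.mean(axis=0)

print('average of trees:', avg)
print('forest.predict  :', rf.predict(query))
```

Because every tree predicts an average of training targets, the forest's answer for a query outside the training range (110 here) cannot exceed the largest target it has seen, unlike the linear and polynomial models, which extrapolate along their fitted curve.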

In [None]:
plt.figure(figsize=(15,15))
plt.subplot(2,2,1)
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
plt.plot(x, y_head, color='blue', alpha=0.6, label='linear')
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()

plt.subplot(2,2,2)
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
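order = x.ravel().argsort() # Sort by x so the polynomial curve is drawn left to right
plt.plot(x[order], y_head2[order], color='green', alpha=0.6, label='poly')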
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()

plt.subplot(2,2,3)
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
plt.plot(x_tree, y_tree, color='orange', alpha=0.6, label='decision tree')
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()

plt.subplot(2,2,4)
plt.scatter(x,y, color='purple', alpha=0.3, label='patients')
plt.plot(x_for, y_for, color='red', alpha=0.6, label='random forest')
plt.xlabel('Pelvic Incidence Values')
plt.ylabel('Sacral Slope Values')
plt.legend()
plt.show()

Finally, compare the predictions of the four models:

In [None]:
# Linear Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr.predict([[110]]))

# Polynomial Linear Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', lr2.predict([[1,110,12100]]))

# Decision Tree Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', tree.predict([[110]]))

# Random Forest Regression Prediction:
print('Sacral Slope Value of the Patient with 110 Pelvic Incidence Value: ', forest.predict([[110]]))
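
Comparing single predictions does not say which model generalizes best. One possible way to compare the four models more fairly, sketched here on synthetic data rather than the patient data, is to hold out part of the data with `train_test_split` and report the R-squared score on both the training and the held-out part. The split size and random seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the patient data).
rng = np.random.default_rng(4)
x_toy = rng.uniform(30, 100, size=(200, 1))
y_toy = 0.5 * x_toy.ravel() + 10 + rng.normal(0, 5, size=200)

# Keep 30% of the rows aside; the models never see them during fitting.
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.3, random_state=0)

models = {
    'linear': LinearRegression(),
    'poly (degree 2)': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    'decision tree': DecisionTreeRegressor(random_state=0),
    'random forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(x_tr))
    test_r2 = r2_score(y_te, model.predict(x_te))
    scores[name] = (train_r2, test_r2)
    print(f'{name:16s} train R-squared = {train_r2:.3f}   test R-squared = {test_r2:.3f}')
```

A large gap between the training and test scores, which the unrestricted decision tree typically shows, is a sign of overfitting.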