# Regression

In this section, we will implement models that aim to predict the melanoma tumor size of a patient based on other attributes.

We will be doing a multiple linear regression.  
Our features / independent variables (X1, X2, etc.) will be the columns of the dataset other than tumor size, which is Y.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
style.use("seaborn-v0_8-darkgrid")  # this style was named "seaborn-darkgrid" before matplotlib 3.6

In [None]:
data = pd.read_csv(r"../input/melanoma-tumor-size-prediction-machinehack/Train.csv") #insert file path into the "read_csv" function
data.head()

In [None]:
list(data.columns)

All of the variables shown above (other than tumor_size) are the features that we will use for prediction.  
tumor_size is the variable we will try to predict.

Checking Missing Values:

In [None]:
data.isna().sum()

(Number of patients, number of variables):

In [None]:
data.shape

In [None]:
data.info()

X: the variables/features that we will use to make predictions  
Y: the variable that we will try to predict correctly

In [None]:
x = data.drop("tumor_size", axis=1)
y = data["tumor_size"]

In [None]:
x.head()

In [None]:
y.head()

We have to split our data into train and test sets.  
This means that we will not show all patients/examples to our model.  
We will let the model learn from a certain number of patients/examples, then we will test on the ones it didn't see.  
This is to check if the model makes good predictions for patients that it hasn't seen before.  
We need to give the model many examples to learn from, because the more examples you give it, the better it learns.  
So we show it most of our data (usually, about 70% to 90%) and keep a small number of examples/patients for testing.
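As a small illustration of the split described above, here is a sketch on a made-up table of 100 rows (not the melanoma data). The names `toy_x`, `toy_y`, etc. are hypothetical and only used for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 "patients" with 2 features and 1 target.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
toy_y = pd.Series(rng.normal(size=100), name="target")

# test_size=0.1 keeps 10% of the rows for testing.
# random_state makes the (random) split repeatable.
toy_xtrain, toy_xtest, toy_ytrain, toy_ytest = train_test_split(
    toy_x, toy_y, test_size=0.1, random_state=11
)

print(toy_xtrain.shape, toy_xtest.shape)
```

The model would only ever be fitted on the 90 training rows, and the 10 testing rows stay hidden until evaluation.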

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.1, random_state=11)

## Linear Regression

![](http://miro.medium.com/max/1838/1*uLHXR8LKGDucpwUYHx3VaQ.png)

![](https://www.researchgate.net/profile/Hieu-Tran-17/publication/333457161/figure/fig3/AS:763959762247682@1559153609649/Linear-Regression-model-sample-illustration.ppm)

![](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/MultipleLinearRegression-Plane.png)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

If we represent our model the following way: Y = b1.X1 + b2.X2 + ... + bn.Xn + C,  
then the fit function allows the model to find the best coefficient bi for each variable Xi, and the best constant term C.  
In other words, it gives them the values that make the error on the training data as small as possible.
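To make this concrete, here is a minimal sketch on made-up data (not the melanoma data) where we know the true equation in advance. The data and the names `toy_x`, `toy_y`, `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from Y = 2*X1 - 3*X2 + 5 (no noise).
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 2))
toy_y = 2 * toy_x[:, 0] - 3 * toy_x[:, 1] + 5

toy_model = LinearRegression().fit(toy_x, toy_y)

print(toy_model.coef_)       # the learned coefficients b1, b2
print(toy_model.intercept_)  # the learned constant term C
```

Since this toy data follows the equation exactly, the learned values should come out very close to 2, -3 and 5. Real data is noisy, so the fit is only the best approximation the model can find.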

In [None]:
model.fit(xtrain, ytrain)

Now, we need to test our model.  
We will evaluate the predictions it makes for the patients it hasn't seen before.  

The predict function tells the model to make predictions.

In [None]:
ypred = model.predict(xtest)

Now we need to compare the values that the model predicted with the real values (which the model doesn't know).  
To do this, we can calculate the "mean absolute error" between the real values and the predicted values.  
The Mean Absolute Error simply means "On average, how much error does the model make when predicting the tumor size of the patient?"
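Here is a small sketch of that calculation done by hand on four made-up values, compared with what scikit-learn returns. The numbers are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical real values and predictions for 4 patients.
toy_true = np.array([1.0, 4.0, 6.0, 9.0])
toy_pred = np.array([2.0, 4.0, 3.0, 10.0])

# Absolute error of each prediction, then the average of those errors.
absolute_errors = np.abs(toy_true - toy_pred)
manual_mae = absolute_errors.mean()

sklearn_mae = mean_absolute_error(toy_true, toy_pred)
print(absolute_errors, manual_mae, sklearn_mae)
```

The individual errors are 1, 0, 3 and 1, so both ways of computing it give an average of 1.25.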

In [None]:
from sklearn.metrics import mean_absolute_error


In [None]:
mean_absolute_error(ytest, ypred)

The MAE value is 3.99 or basically 4.  
Is that a small error or a big one?

Let's take a look at the real values of tumor sizes:

In [None]:
ytest

If we take patient number 6700, we can see that the size of the tumor is 0.88.  
Our model makes an average error of about 4, so its prediction for this patient could be around 0.88 + 4 = 4.88.  
4.88 is very different from 0.88, and thus our model is making a BIG error for this patient.  
If you look at the other patients, you can see that an error of about 4 makes the predictions really bad and far from the truth.  
##### What does this mean?  
It means that a linear regression model cannot make good predictions on this dataset.  
We can also say that the equation Y = b1.X1 + b2.X2 + ... + bn.Xn + C is not a good approximation of reality in this case.  
##### What do we do now?  
Well, we'll try another model: the regression tree.

## Regression Tree

A decision tree predicts by asking a series of questions and then making a judgement based on the answers.  
Training (fitting) consists of learning which questions should be asked, and which judgement to make for each possible answer.  
The picture below illustrates this. It shows that the tree consists of nodes, each node being a question. Based on the answer, an appropriate y-value is decided. It also shows the number of samples/rows that correspond to that answer.

![](https://www.statology.org/wp-content/uploads/2020/11/tree3.png)

*But how does the tree learn which questions to ask?*  
Well, questions separate points, which represent patients in our case.  
Points that correspond to the "yes" answer will be in one group, while the ones that correspond to the "no" answer will be in a different group.  
If points in the same group have very different y values, then the question isn't useful to ask.  
However, if the question separates points into groups such that points in the same group have very similar y values, then the question is good to ask.  
**But how does the tree find the good questions?**  
It (kind of) asks many possible questions then picks the best ones.  
**But what kind of questions can be asked?**  
The questions are usually inequalities on the values of the features (the columns).  
For example, one question could be "x1 > 1.2 ?" and another could be "x5 < 70 ?".  
Each question corresponds to a split in the feature space (which basically represents the x values).  
For example if we ask "x2 < 23.1 ?" then some points will be on the side of the space where x2 < 23.1 and others will be on the side where x2 >= 23.1.  
The following picture illustrates that.

![](https://media.springernature.com/original/springer-static/image/prt%3A978-1-4899-7687-1%2F18/MediaObjects/978-1-4899-7687-1_18_Part_Fig1-717_HTML.gif)

A regression tree basically groups points according to their y value.  
In other words, points that have similar y values (such as 1.2, 1.4 and 1.1) would be in the same group.  
When predicting the y value of a sample/point/observation/example/row, the tree puts it in a group of points that have similar x values, then assigns to it the mean y value of that group.  
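As a minimal sketch of this idea, the example below fits a tree that is allowed only one question on six made-up points. The data and the name `stump` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, two clearly separated groups of points.
toy_x = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
toy_y = np.array([1.2, 1.4, 1.1, 5.0, 5.2, 4.8])

# max_depth=1 allows a single question, so the tree forms exactly two groups.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(toy_x, toy_y)

# A new point with x=2.5 lands in the first group, and x=10.5 in the second.
toy_predictions = stump.predict([[2.5], [10.5]])
print(toy_predictions)

# The mean y value of each group, for comparison.
print(toy_y[:3].mean(), toy_y[3:].mean())
```

Each prediction should match the mean y value of the group that the new point falls into.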

**But how does the tree measure the "goodness" of a question/split?**  
Well, the question divides points into two groups.  
The tree algorithm calculates the average error in each group.  
Meaning it calculates the error that corresponds to every point then calculates the average.  
The error is the distance between the y value of that point and the average y value of the group.  
Sometimes we use the square of that distance as the error, sometimes the absolute distance. Both are valid.  
After calculating the average error for each group, the error of the question is calculated as the weighted average of the two group errors.  
This is done by multiplying each group error by the number of points in that group, adding the results, and dividing by the total number of points.  
The following image shows examples of errors that we can use. The mean absolute error is the average distance between points and their average.

![](https://1.bp.blogspot.com/-kL42RjXdOEc/XMELxXVMe3I/AAAAAAAABRw/mx2RoIheodwWj0CPAqg9chwXJmpOyPyJQCLcBGAs/s1600/Loss_Functions.PNG)
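The calculation described above can be sketched in a few lines. This is a simplified illustration using the squared distance as the error, not scikit-learn's actual implementation, and the function `split_error` and the toy data are hypothetical.

```python
import numpy as np

def split_error(x, y, threshold):
    """Weighted average squared error of the question 'x < threshold ?'."""
    left, right = y[x < threshold], y[x >= threshold]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # the question puts every point in one group: not a real split
    # Error of each group: mean squared distance to the group's own average.
    left_error = np.mean((left - left.mean()) ** 2)
    right_error = np.mean((right - right.mean()) ** 2)
    # Weight each group's error by its number of points.
    return (len(left) * left_error + len(right) * right_error) / len(y)

toy_x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
toy_y = np.array([1.2, 1.4, 1.1, 5.0, 5.2, 4.8])

# Candidate questions: midpoints between consecutive sorted x values.
sorted_x = np.sort(toy_x)
candidates = (sorted_x[:-1] + sorted_x[1:]) / 2
errors = [split_error(toy_x, toy_y, t) for t in candidates]

best_threshold = candidates[int(np.argmin(errors))]
print(best_threshold)
```

On this toy data the lowest error is obtained by the question "x < 6.5 ?", which is the one that separates the small y values from the large ones.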

I'm sure that explaining this in paragraphs isn't the most useful thing, but I just couldn't make a tutorial with nothing but code in it.  
There are plenty of video courses/tutorials/explanations that explain the details of decision trees really well so make sure you check them out.

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
regression_tree = DecisionTreeRegressor()

In [None]:
regression_tree.fit(xtrain, ytrain)

In [None]:
ypred_train = regression_tree.predict(xtrain)
mean_absolute_error(ytrain, ypred_train)

In [None]:
ypred_test = regression_tree.predict(xtest)
mean_absolute_error(ytest, ypred_test)

#### Notes:
* The tree achieved better results than Linear Regression. This is often the case when the relationship between the features and the target is not linear.  
* The tree has practically zero error on the training set. This is because it kept splitting until almost every point was in a group by itself.
* The gap between training and testing scores is larger for the tree than it is for linear regression.
* This is because the tree focused so much on the training data that it basically memorized it (this is called overfitting). Thus, when we tested it on data that it hasn't seen/memorized, it failed to produce equally good results.
* This is like memorizing 10 math exercises and hoping you get something similar in the test, versus actually understanding the exercises.

We can improve the testing performance by telling the tree not to grow too long/deep.  
Growing too long/deep means the tree is making more cuts in the feature space (the space defined by the features x1, x2, etc..). By making more cuts the tree adjusts better to the data that it's training on.  
We don't want the tree to adjust too well to the training data because then it wouldn't perform that well on data that isn't perfectly similar (the testing data).  
The "max_depth" parameter sets the maximum number of successive nodes / the maximum branch length / the maximum number of questions to be asked.
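To see the effect of "max_depth" in isolation, here is a sketch on made-up noisy data (not the melanoma data). The data, the list of depths, and all `toy_` names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: y follows a sine curve of x, plus random noise.
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=(500, 1))
toy_y = np.sin(toy_x[:, 0]) + rng.normal(scale=0.5, size=500)

toy_xtrain, toy_xtest, toy_ytrain, toy_ytest = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0
)

results = {}
for depth in [2, 4, 8, None]:  # None means the tree can grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(toy_xtrain, toy_ytrain)
    train_mae = mean_absolute_error(toy_ytrain, tree.predict(toy_xtrain))
    test_mae = mean_absolute_error(toy_ytest, tree.predict(toy_xtest))
    results[depth] = (train_mae, test_mae)
    print(depth, round(train_mae, 3), round(test_mae, 3))
```

The unlimited tree reaches a training error of zero because it also memorizes the noise. A moderate depth typically gives a higher training error but a lower testing error, although the best depth depends on the dataset and has to be found by trying several values.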

In [None]:
another_regression_tree = DecisionTreeRegressor(max_depth=15)

In [None]:
another_regression_tree.fit(xtrain, ytrain)

In [None]:
ypred_train = another_regression_tree.predict(xtrain)
mean_absolute_error(ytrain, ypred_train)

In [None]:
ypred_test = another_regression_tree.predict(xtest)
mean_absolute_error(ytest, ypred_test)

We can see that the training error increased because the tree isn't perfectly fit to the training data now, but the testing results are better.

# Classification

In this section we will build models that aim to predict whether a client is satisfied or neutral/dissatisfied.  
The following is preprocessing and exploratory analysis. It is only lightly commented since it isn't the focus of this tutorial.

In [None]:
train=pd.read_csv("../input/airline-passenger-satisfaction/train.csv")
test=pd.read_csv("../input/airline-passenger-satisfaction/test.csv")

In [None]:
list(train.columns)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.drop(["Unnamed: 0", "id"], axis=1, inplace=True)
test.drop(["Unnamed: 0", "id"], axis=1, inplace=True)

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

In [None]:
train.dropna(inplace=True)
test.dropna(inplace=True)

In [None]:
train.info()

In [None]:
categoricals = ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']

In [None]:
for col in categoricals:
    print(train[col].value_counts())
    print("\n")

In [None]:
sns.countplot(x=train["satisfaction"])
plt.show()

In [None]:
plt.figure(figsize=(20,7))
sns.heatmap(pd.crosstab(train["Class"], train["satisfaction"], normalize="index"),
            annot = True, cmap="Blues", fmt=".2f", annot_kws={"fontsize":20})
plt.show()

I've decided to encode "Class" labels/values based on their relationship with satisfaction:

In [None]:
train["Class"] = train["Class"].map({"Eco":0, "Eco Plus":1, "Business":2})
test["Class"] = test["Class"].map({"Eco":0, "Eco Plus":1, "Business":2})

In [None]:
train = pd.get_dummies(train, drop_first=True)
test = pd.get_dummies(test, drop_first=True)

In [None]:
train.head()

In [None]:
target = "satisfaction_satisfied"
xtrain, ytrain = train.drop(target, axis=1), train[target]
xtest, ytest = test.drop(target, axis=1), test[target]

In [None]:
xtrain.head()

In [None]:
ytrain.head()

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

xtrain_array = ss.fit_transform(xtrain)
xtrain = pd.DataFrame(xtrain_array, columns=xtrain.columns)

xtest_array = ss.transform(xtest)
xtest = pd.DataFrame(xtest_array, columns=xtrain.columns)

In [None]:
xtrain.head()

## Logistic Regression

Logistic Regression is not a regression model but rather a classification model.  
It constructs an S-shaped curve (in case of one variable x) that represents the probability of y=1.

![](https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png)

The logistic curve corresponds to the logistic function, which corresponds to the formula in the following image.

![](https://www.saedsayad.com/images/LogReg_1.png)
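As a minimal sketch, assuming the formula in the image is the standard logistic function p = 1 / (1 + e^-(b0 + b1.x)), we can compute it directly. The coefficients `b0` and `b1` below are made-up values, not ones learned from our data.

```python
import numpy as np

def logistic(z):
    """Standard logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients of a one-variable model: z = b0 + b1 * x
b0, b1 = -3.0, 1.5

x = np.array([0.0, 2.0, 4.0])
probabilities = logistic(b0 + b1 * x)
print(probabilities)
```

With these values, x = 2 gives b0 + b1.x = 0 and therefore a probability of exactly 0.5. Smaller x values give probabilities close to 0 and larger ones give probabilities close to 1, which produces the S shape.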

The x value at which the curve reaches 0.5 (a predicted probability of 0.5) is called the "decision boundary".  
It separates points based on their x value into two groups.  
One group corresponds to y=1 and the other corresponds to y=0.

![](https://media-exp1.licdn.com/dms/image/C4E22AQEUYF4-CUIZkw/feedshare-shrink_800/0/1636572090596?e=1639612800&v=beta&t=V0JxLlmQL5ej6acmoJ1y3Yx9bwXLjXMoY17wvgaBAxE)
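The sketch below illustrates the decision boundary on made-up data with a single feature. The data and the `toy_` names are hypothetical, and the boundary formula assumes the one-variable model b0 + b1.x.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: class 1 becomes more likely as x grows.
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=(300, 1))
toy_y = (toy_x[:, 0] + rng.normal(scale=1.5, size=300) > 5).astype(int)

toy_logreg = LogisticRegression(solver="liblinear").fit(toy_x, toy_y)

# Column 1 of predict_proba is the estimated probability of y=1.
proba_of_one = toy_logreg.predict_proba(toy_x)[:, 1]
manual_labels = (proba_of_one > 0.5).astype(int)

# The boundary is the x where b0 + b1*x = 0, i.e. where the probability is 0.5.
boundary = -toy_logreg.intercept_[0] / toy_logreg.coef_[0, 0]

print(boundary)
print((manual_labels == toy_logreg.predict(toy_x)).all())
```

Thresholding the probability at 0.5 gives the same labels as `predict`: points to the right of the boundary are predicted as y=1 and points to the left as y=0.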

In the case of 2 variables, the S-shaped curve becomes an S-shaped surface (but kind of flat..) which we can't draw here since that would require 3 dimensions.  
The decision boundary becomes a straight line, though, which we can draw.  
This line tries to separate points into 2 classes.

![](https://www.researchgate.net/publication/335786324/figure/fig1/AS:802479209971712@1568337361258/Logistic-regression-and-linear-regression.jpg)

Of course, sometimes you can't separate the two classes (y=1 and y=0) perfectly with a straight line.  
In the following image, the plot on the right is more realistic than the one on the left.

![](https://camo.githubusercontent.com/fa25e1f53a14c7839d4659edf09c1e9b7a8fcad93727a5eea63b2b4b65454164/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f313831382f302a61585578764e7556695f2d716335566b2e706e67)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
logreg = LogisticRegression(solver='liblinear')

In [None]:
logreg.fit(xtrain, ytrain)

In [None]:
y_predicted_testing = logreg.predict(xtest)
accuracy_score(ytest, y_predicted_testing)

In [None]:
y_predicted_training = logreg.predict(xtrain)
accuracy_score(ytrain, y_predicted_training)

## Classification Tree

A classification tree works the same way as a regression tree, but it predicts classes instead of numerical values.

![](https://miro.medium.com/max/569/0*Yclq0kqMAwCQcIV_.jpg)

![](https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/2.png)

The classification tree is more flexible than logistic regression.  
This is because in reality the data points (in this case clients) aren't always distributed in such a way that you can separate them with a straight line.  
The following image is an example where a tree is more suitable than logistic regression.

![](https://miro.medium.com/max/1248/1*Ixw2RgVQ4syGyCD6ArLcZw.png)
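Here is a hedged sketch of such a situation on made-up data where one class forms a ring around the other, generated with scikit-learn's `make_circles`. This is not the airline data, and the `toy_` names are hypothetical.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the two classes are concentric rings,
# so no straight line can separate them.
toy_x, toy_y = make_circles(n_samples=600, noise=0.05, factor=0.5, random_state=0)
toy_xtrain, toy_xtest, toy_ytrain, toy_ytest = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0
)

toy_logreg = LogisticRegression(solver="liblinear").fit(toy_xtrain, toy_ytrain)
toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_xtrain, toy_ytrain)

logreg_accuracy = accuracy_score(toy_ytest, toy_logreg.predict(toy_xtest))
tree_accuracy = accuracy_score(toy_ytest, toy_tree.predict(toy_xtest))
print(logreg_accuracy, tree_accuracy)
```

On data shaped like this, logistic regression can do little better than guessing, while the tree can enclose the inner ring with several axis-aligned cuts.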

The learning algorithm of the classification tree (which is learning what questions to ask / what splits to make) is almost the same as the regression tree algorithm.  
Instead of evaluating a split/question by comparing y values of different groups (which is basically what the regression tree does), a split/question is evaluated by how much it sets the 2 classes apart.

![](https://i.stack.imgur.com/FgdfC.jpg)

![](https://www.researchgate.net/publication/313816842/figure/fig2/AS:962701719257088@1606537384464/A-decision-tree-with-its-decision-boundary-Each-node-of-the-decision-tree-represents-a.gif)

To evaluate how good a split/question is, we calculate the 'Gini Impurity' that corresponds to it.  
The Gini Impurity is calculated using the following formula:

![](https://static.wixstatic.com/media/02b811_5df05513ffd4487d843bb401dfa5e0cb~mv2.png/v1/fit/w_309%2Ch_118%2Cal_c/file.png)
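As a minimal sketch, assuming the formula in the image is the usual definition (1 minus the sum of the squared class proportions), the Gini Impurity of a group and of a split can be computed like this. The function names `gini_impurity` and `split_gini` are hypothetical, and weighting each group by its size mirrors what we did for the regression tree.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of one group: 1 - sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return 1.0 - np.sum(proportions ** 2)

def split_gini(left_labels, right_labels):
    """Gini impurity of a split: group impurities weighted by group size."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left * gini_impurity(left_labels)
            + n_right * gini_impurity(right_labels)) / total

pure_group = gini_impurity([1, 1, 1, 1])          # only one class in the group
mixed_group = gini_impurity([0, 0, 1, 1])         # two classes, evenly mixed
perfect_split = split_gini([0, 0, 0], [1, 1, 1])  # the question separates the classes
weak_split = split_gini([0, 1, 0], [1, 0, 1])     # both groups are still mixed
print(pure_group, mixed_group, perfect_split, weak_split)
```

A lower value means the groups are purer, so the tree prefers the question with the lowest Gini Impurity.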

Understanding why that formula is good for evaluating splits/questions is out of the scope of this tutorial, since it isn't dedicated to the mathematical details of algorithms but rather to the main concepts.  
However, there are plenty of resources on the internet that explain these details really well.

The following is an example of a tree that classifies samples / data points that belong to three classes instead of 2, i.e. [0, 1, 2] instead of [0, 1].

![](https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/52003/versions/3/screenshot.jpg)

Example of a tree that is too short and doesn't separate classes very well (can be improved by allowing the tree to be longer/deeper) and a tree that is much deeper.

![](https://www.learnbymarketing.com/wp-content/uploads/2016/03/linear-sep-decision-tree.png)

In [None]:
from sklearn.tree import DecisionTreeClassifier
mytree = DecisionTreeClassifier()

In [None]:
mytree.fit(xtrain, ytrain)

In [None]:
y_predicted_training = mytree.predict(xtrain)
accuracy_score(ytrain, y_predicted_training)

In [None]:
y_predicted_testing = mytree.predict(xtest)
accuracy_score(ytest, y_predicted_testing)

As before, the tree fits the training data very closely but fits the testing data less well.  
However, it still gave better results than logistic regression.  
If we tell the tree not to grow too deep, then it should get better testing results.

In [None]:
my_other_tree = DecisionTreeClassifier(max_depth=19)

In [None]:
my_other_tree.fit(xtrain, ytrain)

In [None]:
y_predicted_training = my_other_tree.predict(xtrain)
accuracy_score(ytrain, y_predicted_training)

In [None]:
y_predicted_testing = my_other_tree.predict(xtest)
accuracy_score(ytest, y_predicted_testing)

### Praise be to God, by whose grace good deeds are completed