In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# In addition, import...
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import os, shutil

# 1. Introduction 
After reading the task set by the competition organizers and looking at the data, I asked myself a question: *are all the patients in the data actually sick?*

Let me explain why this question matters. A healthy person will most likely have an FVC that is constant in time, meaning there will be no significant differences between FVC values measured in different weeks. A patient with a lung defect, on the other hand, will have either a decreasing FVC (if the disease is progressing) or an increasing FVC (in case of remission). The state of a patient's health with respect to pulmonary fibrosis is called the **health status** from here on.

If data from different health status groups are mixed and used together for prediction, the prediction becomes blurred: the model predicts for some averaged health status that carries no information, so it is off for every individual group. In a perfect world one should either predict separately for each health status or include the health status as an additional variable that influences the prediction.

# 2. Look into the Data

Let's (once more ;) ) have a look at the data:

In [None]:
train = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv")
train.head()

The data above is tabular. 
Looking at the variables and reading their description [here](https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/data), I would expect the first hint about the health status to come from the **Percent** variable. A value of 100% corresponds to the typical FVC of a person with similar characteristics, so clusters with Percent well above and well below 100% should correspond to different health statuses. Let us have a look at **Percent**.
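To make the meaning of **Percent** concrete: since it expresses FVC as a percentage of the typical FVC for a comparable person, the implied typical FVC can be recovered from the two columns. The following is only a small sketch with made-up numbers; the function name `implied_typical_fvc` is my own and not part of the data set.

```python
def implied_typical_fvc(fvc, percent):
    """Typical FVC implied by the relation Percent = 100 * FVC / typical FVC."""
    return 100.0 * fvc / percent

# Made-up example values, not taken from train.csv
print(implied_typical_fvc(2400.0, 60.0))   # 4000.0
print(implied_typical_fvc(2400.0, 120.0))  # 2000.0
```

The same measured FVC can therefore mean a strongly reduced or an above-typical lung capacity, depending on the patient's characteristics.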

# 3. Percent clusters
Below, let's look at the distribution of the variable **Percent**:

In [None]:
# Calculate mean percent for each patient
clientsPercent = train.groupby('Patient')["Percent"].mean()

# Plot the distribution of mean percent
plt.hist(clientsPercent, bins = 60)
plt.show()

One can distinguish two to three clusters:
* One around 60%
* One around 80%
* Potentially one above 100%, but there are too few patients there to judge

In case you wonder whether this is an effect of the binning: it isn't. I tried different binnings, and the clusters remained distinguishable even with a small number of bins.
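The binning check mentioned above can be sketched as follows. This is only an illustration on synthetic numbers: the helper name `histograms_for_bin_counts`, the bin counts and the demo values are my assumptions, not the exact check that was run.

```python
import numpy as np

def histograms_for_bin_counts(values, bin_counts=(15, 30, 60)):
    """Return {n_bins: (counts, edges)} for the same data at several binnings."""
    values = np.asarray(values, dtype=float)
    return {n: np.histogram(values, bins=n) for n in bin_counts}

# Synthetic stand-in for the per-patient mean Percent (two bumps near 60 and 80)
rng = np.random.default_rng(0)
demo = np.concatenate([rng.normal(60, 4, 80), rng.normal(80, 4, 80)])

result = histograms_for_bin_counts(demo)
for n, (counts, edges) in result.items():
    # Every binning must account for all values; only the resolution changes
    print(n, "bins, total entries:", counts.sum())
```

On the real data one would pass `clientsPercent` instead of `demo` and compare the histograms by eye, for example by calling `plt.hist(clientsPercent, bins=n)` for each `n`.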

According to [this](https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/discussion/172022) discussion topic, FVC percent values are considered normal at 80% and above, and abnormal to varying degrees below that. So the first conclusion I draw from this **Percent** distribution is that there are at least two groups by health status. 

With this knowledge, one could think about splitting the patients into these (at least two) groups for further training.
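Such a split could look roughly like the sketch below. It assumes the 80% cut-off quoted from the discussion topic as the only threshold; the function name `label_by_percent`, the label names and the demo values are hypothetical.

```python
import pandas as pd

NORMAL_PERCENT_THRESHOLD = 80.0  # assumption: cut-off quoted in the discussion topic

def label_by_percent(mean_percent, threshold=NORMAL_PERCENT_THRESHOLD):
    """Map per-patient mean Percent to 'normal' (>= threshold) or 'reduced'."""
    return mean_percent.apply(lambda p: "normal" if p >= threshold else "reduced")

# Made-up patients, not taken from train.csv
demo = pd.Series({"A": 58.3, "B": 81.0, "C": 104.2, "D": 79.9}, name="Percent")
print(label_by_percent(demo).to_dict())
# {'A': 'reduced', 'B': 'normal', 'C': 'normal', 'D': 'reduced'}
```

On the real data, `train['Patient'].map(label_by_percent(clientsPercent))` would attach the label to every row of `train`.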

# 4. Evolution of FVC and Percent for Different Health Groups

A very naive expectation is that people whose illness is progressing will show **FVC** and **Percent** decreasing with time. The opposite evolution is expected for ill people who are recovering: for them I would expect **FVC** and **Percent** to grow with time. Healthy or close-to-healthy people should then have constant **FVC** and **Percent**.

## 4.1 Rough Estimate: Linear Fit

As a first rough estimate one can assume the simplest evolution of **FVC** and **Percent** with time: a linear one. This requires a linear fit of **FVC** and **Percent** versus time and a look at the slope of the fitted line. A progressing disease should show up as a negative slope, recovery as a positive slope, and a stable or healthy state as (almost) no slope, within the range of fluctuations.
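The idea of reading the trend off the slope can be sketched in a few lines. This illustration uses `np.polyfit` instead of scikit-learn, and both the `tolerance` value and the measurements are made up; what counts as "within fluctuation" would have to be chosen from the data.

```python
import numpy as np

def slope_trend(weeks, values, tolerance):
    """Fit values = slope * weeks + intercept and label the trend by the slope."""
    slope, intercept = np.polyfit(weeks, values, deg=1)
    if slope < -tolerance:
        trend = "declining"
    elif slope > tolerance:
        trend = "improving"
    else:
        trend = "stable"
    return slope, trend

# Made-up measurements for one patient
weeks = np.array([0, 4, 8, 12, 16])
fvc = np.array([3000, 2960, 2925, 2880, 2840])

slope, trend = slope_trend(weeks, fvc, tolerance=2.0)
print(round(float(slope), 2), trend)
```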

First, the function for linear fitting:

In [None]:
# Define a function lfit, which performs a linear fit and writes a plot of the fit to a file
# The fit is done per patient
def lfit(x, y, patient):
    # Detect which variable is fitted from the name of the Series.
    # (A check like y.max() <= 100 would misclassify patients whose Percent exceeds 100.)
    titleAddition = "Percent_" if getattr(y, "name", None) == "Percent" else ""

    # Linear fitting with scikit-learn
    x = np.array(x).reshape(-1, 1)
    y = np.array(y).reshape(-1, 1)
    lr = LinearRegression()
    lr.fit(x, y)

    # Save the fitted plot to a file
    plt.scatter(x, y, marker='o')
    xForPlot = np.arange(x.min(), x.max() + 1).reshape(-1, 1)
    plt.plot(xForPlot, lr.predict(xForPlot), 'r--')
    plt.savefig(f'/kaggle/working/{titleAddition}{patient.iloc[0]}_linearFit.png')
    plt.clf()

    # coef_ has shape (1, 1) here; return the slope as a plain float
    return float(lr.coef_[0, 0])

Now let's perform the fit for each patient by grouping the data frame by patient. The slopes are written to separate columns of the data, called **slopePercent** and **slopeFVC**. 

In [None]:
# Clean the folder to which the plots are written before running a round of fitting for all patients
folder = '/kaggle/working/'
for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    try:
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
    except Exception as e:
        print('Failed to delete %s. Reason: %s' % (file_path, e))

# Fit for every patient and broadcast the per-patient slope back to every row of that patient
train["slopePercent"] = train.groupby('Patient').apply(lambda x : lfit(x['Weeks'], x['Percent'], x['Patient'])).reindex(train.Patient).values
train["slopeFVC"] = train.groupby('Patient').apply(lambda x : lfit(x['Weeks'], x['FVC'], x['Patient'])).reindex(train.Patient).values

## 4.2 Visualization of the Percent and FVC slope
Let us draw the distribution of the slopes of the linear fits of the **FVC** (with time):

In [None]:
# One slope per patient (the value is identical on all rows of a patient)
plt.hist(train.groupby('Patient')["slopeFVC"].first(), bins = 60);

The same for the **Percent**:

In [None]:
# One slope per patient (the value is identical on all rows of a patient)
plt.hist(train.groupby('Patient')["slopePercent"].first(), bins = 60);

One can roughly distinguish three clusters of slopes: two of them negative but of different magnitude, and a third one around zero, i.e. "no slope", close to a horizontal line. This might indicate that, as seen before, there are different dynamics of disease progression. My hypothesis is that these are most likely more severely ill patients and less severely ill or stable patients.
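The clusters above were read off the histogram by eye. As one possible way to automate this (a swapped-in technique, not what was done here), a plain 1-D k-means could assign each patient's slope to a cluster. The sketch below runs on synthetic slopes; the function name `kmeans_1d`, the choice `k=3` and all demo values are assumptions.

```python
import numpy as np

def kmeans_1d(values, k=3, n_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels)."""
    values = np.asarray(values, dtype=float)
    # Deterministic start: centers at evenly spaced inner quantiles
    centers = np.quantile(values, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(n_iter):
        # Assign every value to its nearest center
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its members (keep it if it has none)
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment with the final centers
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return centers, labels

# Synthetic slopes: two negative groups and one around zero (made-up numbers)
rng = np.random.default_rng(42)
demo_slopes = np.concatenate([
    rng.normal(-12.0, 1.0, 30),
    rng.normal(-5.0, 1.0, 30),
    rng.normal(0.0, 1.0, 30),
])

centers, labels = kmeans_1d(demo_slopes, k=3)
print(np.round(centers, 1))
```

On the real data one would pass the per-patient slopes, e.g. `train.groupby('Patient')['slopeFVC'].first()`, and use `labels` as a candidate health status.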

# 5. How This Can Be Used for the Model!
Here is how the information derived above can be used in practice in this competition:
* **Enrich Training Data With Health Status**: based on the fits above one can group the data into *health status clusters*
* **Train the Model separately for each cluster**: as the dynamics of the disease appear to differ strongly between patients of different clusters, it makes sense to train separate models for these clusters
* **Let the Image predict the health status!**: but what to do on test data? There we cannot perform a fit, as no FVC dynamics are available. The only hint we have is an image. An image model should be trained on the train data to predict the health status. Once the health status is predicted for a particular patient in the test data, the corresponding model is chosen and applied to him/her.
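The three steps above can be sketched as follows. This is a toy illustration only: the per-cluster "model" is a straight line fitted with `np.polyfit`, the health status of the test patient is simply given (in the real workflow it would come from an image model, which is not shown), and all names and numbers are made up.

```python
import numpy as np

def fit_models_per_status(weeks, fvc, status):
    """Fit one straight line FVC ~ weeks per health-status label."""
    weeks, fvc, status = map(np.asarray, (weeks, fvc, status))
    return {
        str(s): np.polyfit(weeks[status == s], fvc[status == s], deg=1)
        for s in np.unique(status)
    }

def predict_fvc(models, predicted_status, week):
    """Apply the model that belongs to the (externally predicted) status."""
    slope, intercept = models[predicted_status]
    return slope * week + intercept

# Made-up training measurements for two health-status groups
weeks = [0, 10, 20, 0, 10, 20]
fvc = [3000, 2900, 2800, 3500, 3500, 3500]
status = ["declining"] * 3 + ["stable"] * 3

models = fit_models_per_status(weeks, fvc, status)
print(round(float(predict_fvc(models, "declining", 30))))  # 2700
print(round(float(predict_fvc(models, "stable", 30))))     # 3500
```

The point of the design is only the dispatch: the predicted health status selects which model is applied, so a wrong status prediction leads directly to the wrong FVC trend.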

Of course, for more precision one can try fitting with another function or, better, include the measurement uncertainty of the FVC metric: 70 for FVC and correspondingly 70 * sqrt(Percent/FVC) for the Percent (the latter was derived from a simplified calculation of [uncertainty propagation](https://en.wikipedia.org/wiki/Propagation_of_uncertainty)). This will most likely produce more cleanly separated clusters by health status.
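Including the FVC uncertainty in the fit could look like the sketch below. It assumes the same uncertainty of 70 for every FVC measurement and uses `np.polyfit` with weights `1/sigma` rather than the scikit-learn fit from above; the function name `weighted_slope` and the measurements are made up.

```python
import numpy as np

FVC_SIGMA = 70.0  # measurement uncertainty of FVC as quoted in the text

def weighted_slope(weeks, fvc, sigma=FVC_SIGMA):
    """Weighted straight-line fit; returns (slope, uncertainty of the slope)."""
    weeks = np.asarray(weeks, dtype=float)
    fvc = np.asarray(fvc, dtype=float)
    weights = np.full_like(fvc, 1.0 / sigma)  # np.polyfit expects w = 1/sigma
    # cov="unscaled" keeps the covariance tied to the given sigma
    coeffs, cov = np.polyfit(weeks, fvc, deg=1, w=weights, cov="unscaled")
    return coeffs[0], np.sqrt(cov[0, 0])

# Made-up measurements for one patient
weeks = [0, 5, 10, 20, 30, 40]
fvc = [3000, 2980, 2930, 2870, 2790, 2720]

slope, slope_err = weighted_slope(weeks, fvc)
print(f"slope = {slope:.1f} +/- {slope_err:.1f} per week")
```

The slope uncertainty gives a natural scale for "no slope": a slope that is small compared to its uncertainty cannot be told apart from a horizontal line, which may help when separating stable patients from slowly declining ones.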