# Multiple linear regression

The aim of this exercise is to replicate, and then improve on, the results reported in https://pubs.acs.org/doi/abs/10.1021/ci9901338.

You've covered performing linear regression using a single feature in a previous notebook.
Here, several descriptors are used together, each as one variable in a linear regression, to predict Log S.
1. The paper uses a particular set of descriptors and reports an $R^2$ of 0.88. Can you replicate this? If you get different results, why?
2. Can you beat it using different/additional descriptors?
3. At what point are you at risk of [overfitting](https://en.wikipedia.org/wiki/Overfitting)?
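
One way to get a feel for question 3 is to compare the training $R^2$ with a cross-validated $R^2$ as features are added. The sketch below is only an illustration on synthetic data, not the solubility set: one informative feature plus a growing number of pure-noise features (all names and sizes here are made up for the example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration, not the solubility data)
rng = np.random.default_rng(0)
n_samples = 40
signal = rng.normal(size=(n_samples, 1))   # one informative feature
noise = rng.normal(size=(n_samples, 30))   # 30 irrelevant features
y = 2.0 * signal[:, 0] + rng.normal(scale=0.5, size=n_samples)

results = {}
for n_noise in (0, 10, 30):
    # informative feature plus the first n_noise irrelevant ones
    x = np.hstack([signal, noise[:, :n_noise]])
    train_r2 = LinearRegression().fit(x, y).score(x, y)  # R^2 on the data used for fitting
    cv_r2 = cross_val_score(LinearRegression(), x, y, cv=5, scoring="r2").mean()
    results[n_noise] = (train_r2, cv_r2)
    print(f"{n_noise:2d} noise features: train R^2 = {train_r2:.2f}, CV R^2 = {cv_r2:.2f}")
```

The training $R^2$ can only go up as features are added, so a widening gap between it and the cross-validated score is the warning sign to look for.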
      
<b>Notes:</b>
* A reminder for selecting multiple columns in a pandas dataframe:

        x = data_desc[["a","b","c"]]
        y = data["y"]
        model.fit(x,y)
       
* You can use whichever validation technique you prefer, or if you wish, can match that used in the paper.
    * The authors used leave-one-out cross validation (each sample is held out in turn, rather than a group of samples), and then test1 to evaluate model performance. I have given the code for LeaveOneOut below as it is a bit tricky.
* You may prefer to use an alternative cross validation approach, like the one you've seen in the previous notebook.
* You may not be able to use the exact same descriptors, so find the closest match - some may not even be available to you.
* It is worth using both MSE and $R^2$ to look at your model performance.
* It can be helpful to look at scatter plots, but remember these are 2D: with more than one feature, your data has more dimensions than a single plot can show.
* Feel free to refer back to previous notebooks.
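
As a minimal sketch of the column-selection and metric notes above, the example below fits a linear regression on three columns and reports both MSE and $R^2$. The data is synthetic and the column names `a`, `b`, `c` are placeholders, as in the reminder; swap in your own descriptors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for a descriptor table; "a", "b", "c" are placeholder names
rng = np.random.default_rng(0)
data_desc = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
# Target built from a known linear relationship plus a little noise
y = (1.5 * data_desc["a"] - 2.0 * data_desc["b"] + 0.5 * data_desc["c"]
     + rng.normal(scale=0.1, size=100))

x = data_desc[["a", "b", "c"]]  # double brackets select several columns at once
model = LinearRegression()
model.fit(x, y)

pred = model.predict(x)
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")
print(dict(zip(x.columns, model.coef_.round(2))))  # one coefficient per feature
```

These scores are computed on the same data the model was fitted to, so they are optimistic; use cross validation or test1 for a fairer estimate.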

<b>Steps to include:</b>
1. Load in python modules
2. Load in data - I have demonstrated how to load 'train'.
3. Select descriptors
4. Train model and evaluate performance using cross validation, and then test using test1.
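
The steps above could be sketched roughly as follows. This is not the paper's protocol: it uses 5-fold cross validation instead of leave-one-out, and synthetic tables in place of the real CSV files. The helper `make_fake_set` and the column names `d1`, `d2`, `d3` are hypothetical and exist only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

def make_fake_set(n, seed):
    """Hypothetical helper: builds a synthetic table shaped like train_desc."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["d1", "d2", "d3"])
    df["Y"] = 2.0 * df["d1"] - 1.0 * df["d2"] + rng.normal(scale=0.3, size=n)
    return df

# Steps 1-2: modules are imported above; these stand in for the loaded data
train_desc = make_fake_set(200, seed=1)
test1_desc = make_fake_set(50, seed=2)

# Step 3: select descriptors
features = ["d1", "d2"]

# Step 4a: cross validation on the training set only
model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_pred = cross_val_predict(model, train_desc[features], train_desc["Y"], cv=cv)
cv_r2 = r2_score(train_desc["Y"], cv_pred)
cv_mse = mean_squared_error(train_desc["Y"], cv_pred)

# Step 4b: refit on all training data, then evaluate once on the held-out test set
model.fit(train_desc[features], train_desc["Y"])
test_pred = model.predict(test1_desc[features])
test_r2 = r2_score(test1_desc["Y"], test_pred)
test_mse = mean_squared_error(test1_desc["Y"], test_pred)

print(f"CV:    R^2 = {cv_r2:.3f}, MSE = {cv_mse:.3f}")
print(f"test1: R^2 = {test_r2:.3f}, MSE = {test_mse:.3f}")
```

Keeping test1 out of the descriptor selection and cross validation is what makes its score a fair check at the end.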


<b>Optional extras:</b>
1. Use a [decision tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html), [for more info](http://scikit-learn.org/stable/modules/tree.html#tree)
2. Use a [random forest model](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) with all descriptors, [for more info](http://scikit-learn.org/stable/modules/ensemble.html#forest)
3. Draw molecules for outliers (the molecules with a large difference between the predicted Log S and the actual Log S). Do they have anything in common?
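
A possible starting point for extras 2 and 3 is sketched below: a random forest is scored with cross-validated predictions, and the rows with the largest absolute error are pulled out as candidate outliers. The data is synthetic and the names `desc`, `d1` to `d4` and `abs_error` are made up for the example; drawing the molecules themselves would need the structures from your own data, which this sketch does not include.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for a descriptor table with a non-linear target
rng = np.random.default_rng(0)
desc = pd.DataFrame(rng.normal(size=(150, 4)), columns=["d1", "d2", "d3", "d4"])
desc["Y"] = desc["d1"] ** 2 + desc["d2"] + rng.normal(scale=0.2, size=150)

features = ["d1", "d2", "d3", "d4"]
forest = RandomForestRegressor(n_estimators=100, random_state=0)

# Each prediction comes from a model that never saw that row during fitting
cv_pred = cross_val_predict(forest, desc[features], desc["Y"], cv=5)

# Rank rows by how far the prediction is from the actual value
desc["abs_error"] = np.abs(desc["Y"] - cv_pred)
outliers = desc.sort_values("abs_error", ascending=False).head(5)
print(outliers[["Y", "abs_error"]])
```

The index of `outliers` tells you which rows to look up in the original table when you come to draw the molecules.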

In [8]:
import pandas as pd

In [17]:
#training set:
train_desc = pd.read_csv("../data/train_desc.csv",index_col=0)
train = pd.read_csv("../data/train_nonull.csv", index_col=0)
train_desc["Y"] = train["Log S"].values

#test1 set:
test1_desc = pd.read_csv("../data/test1_desc.csv",index_col=0) #replace with correct code
test1 = pd.read_csv("../data/test1_nonull.csv", index_col=0) #replace with correct code
test1_desc["Y"] = test1["Log S"].values #replace with correct code

In [18]:
train_desc.head()

Unnamed: 0,BalabanJ,BertzCT,Chi0,Chi0n,Chi0v,Chi1,Chi1n,Chi1v,Chi2n,Chi2v,...,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea,qed,Y
0,2.627215,25.119415,4.991564,4.438777,4.438777,2.770056,2.450904,2.450904,1.637697,1.637697,...,0,0,0,0,0,0,0,0,0.536832,-0.29
1,2.034853,654.343876,13.405413,10.459482,11.275979,9.147867,6.002897,6.819394,4.501396,5.585173,...,0,0,0,0,0,0,0,0,0.906552,-3.94
2,2.719442,322.635895,9.844935,7.080832,7.836761,6.092224,3.7011,4.079064,2.592233,3.028668,...,0,0,0,0,0,0,0,0,0.813116,-2.23
3,2.797251,160.698213,7.397341,6.723615,6.723615,4.863703,3.942688,3.942688,2.523603,2.523603,...,0,0,0,0,0,0,0,0,0.58312,-3.75
4,2.010495,476.454663,13.248559,8.440751,13.732254,8.203248,5.317533,8.021755,5.187197,8.966801,...,0,0,0,0,0,0,0,0,0.44055,-6.29


In [None]:
#code for leave-one-out cross validation - this code will not work unless you have defined a model and features

from sklearn.model_selection import LeaveOneOut, cross_validate
from sklearn.metrics import r2_score, mean_squared_error

loo = LeaveOneOut()
predictions = [] #creates an empty list so we can save the predictions

for sub_train, sub_test in loo.split(train_desc): #loo.split is a generator: in each iteration sub_test holds the position of one sample,
    #sub_train holds the positions of all the rest
    x = train_desc.iloc[sub_train][features] #training features; iloc is used because loo.split returns row positions, not index labels
    y = train_desc.iloc[sub_train]["Y"] #training targets
    model.fit(x,y) #fit the model
    test_x = train_desc.iloc[sub_test][features] #features of the single held-out sample
    predictions.append(model.predict(test_x)[0]) #predict the held-out sample and append it to the list; [0] takes the first (and only) item in the returned array
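
Once the loop has filled `predictions`, the list can be scored against the true values. Because each fold holds only one sample, $R^2$ cannot be computed per fold and is instead computed once over the pooled predictions. The sketch below shows this on synthetic data (the columns `d1`, `d2` and the model choice are assumptions), using `cross_val_predict` as a shorter equivalent of the explicit loop.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for train_desc; "d1" and "d2" are placeholder descriptors
rng = np.random.default_rng(0)
train_desc = pd.DataFrame(rng.normal(size=(60, 2)), columns=["d1", "d2"])
train_desc["Y"] = 3.0 * train_desc["d1"] + rng.normal(scale=0.5, size=60)
features = ["d1", "d2"]

# One prediction per sample, each from a model fitted on all the other samples
predictions = cross_val_predict(
    LinearRegression(), train_desc[features], train_desc["Y"], cv=LeaveOneOut()
)

# Score the pooled leave-one-out predictions against the true values
loo_r2 = r2_score(train_desc["Y"], predictions)
loo_mse = mean_squared_error(train_desc["Y"], predictions)
print(f"LOO R^2 = {loo_r2:.3f}, LOO MSE = {loo_mse:.3f}")
```

The same two scoring lines work on the `predictions` list built by the loop, as long as the true values are passed in the same row order.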

In [None]:
#work here